### PIB CoPilot
A public information book (PIB) is used to build a pitchbook by assessing a company's strategy, competitive positioning, financial statements, industry dynamics, and industry trends. Typical sources include:

- News releases: Analysts look into news articles that may affect a company's stock price or growth prospects, particularly within a 6-12 month time horizon.
- SEC filings: Public companies are required to file Form 10-K and Form 10-Q with the SEC on an ongoing basis. Form 10-K is a financial overview and commentary for the last year, usually also found on the company's website. Form 10-Q is similar to Form 10-K, but it covers the last quarter instead of the previous year.
- Equity research reports: Look into key forecasts for metrics like Revenue, EBITDA, and EPS for the company or competing firms to form a consensus estimate. 
- Investor presentations: Companies provide historical information that serves as the foundation for forecasts and guides key forecasting drivers.
- Press releases: Found in the investor relations section of most companies' websites, they contain the financial statements that are used in Forms 10-K and 10-Q.
- Conference calls: The same day a company issues its quarterly press release, it will also hold a conference call. On the call, analysts often learn details about management guidance. These conference calls are transcribed by several service providers and can be accessed by subscribers of large financial data providers.

In [1]:
import os  
import json  
import openai
from Utilities.envVars import *

# Set the Search index name from environment variables
indexName = SearchIndex

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
# Check the service name before building the endpoint; the formatted URL itself is never empty
assert OpenAiService, "ERROR: Azure OpenAI Endpoint is missing"
openAiEndPoint = f"https://{OpenAiService}.openai.azure.com"
assert "openai.azure.com" in openAiEndPoint.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = openAiEndPoint
davincimodel = OpenAiDavinci


In [2]:
import logging  # used by logging.info() in the LLM setup below
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms.openai import AzureOpenAI, OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from IPython.display import display, HTML
from langchain.chains.summarize import load_summarize_chain
from langchain.utilities import BingSearchAPIWrapper
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd
from datetime import datetime
from pytz import timezone
from dateutil.relativedelta import relativedelta
from datetime import timedelta
from Utilities.pibCopilot import indexDocs, createPressReleaseIndex, createStockNewsIndex, mergeDocs
from Utilities.pibCopilot import indexEarningCallSections, createEarningCallVectorIndex, createEarningCallIndex, performCogSearch, createSecFilingIndex, findSecFiling
import typing
from Utilities.fmp import *
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI

# Flexibility to change the call to OpenAI or Azure OpenAI
embeddingModelType = "azureopenai"
temperature = 0
tokenLength = 1000

if (embeddingModelType == 'azureopenai'):
    openai.api_type = "azure"
    openai.api_key = OpenAiKey
    openai.api_version = OpenAiVersion
    openai.api_base = OpenAiBase

    llm = AzureOpenAI(deployment_name=OpenAiDavinci,
            temperature=temperature,
            openai_api_key=OpenAiKey,
            max_tokens=tokenLength,
            batch_size=10, 
            max_retries=12)
    
    llmChat = AzureChatOpenAI(
                openai_api_base=openai.api_base,
                openai_api_version=OpenAiVersion,
                deployment_name=OpenAiChat,
                temperature=temperature,
                openai_api_key=OpenAiKey,
                openai_api_type="azure",
                max_tokens=tokenLength)
    
    logging.info("LLM Setup done")
    embeddings = OpenAIEmbeddings(deployment=OpenAiEmbedding, chunk_size=1, openai_api_key=OpenAiKey)
elif embeddingModelType == "openai":
    openai.api_type = "open_ai"
    openai.api_base = "https://api.openai.com/v1"
    openai.api_version = '2020-11-07' 
    openai.api_key = OpenAiApiKey
    llm = OpenAI(temperature=temperature,
            openai_api_key=OpenAiApiKey,
            max_tokens=tokenLength)
    embeddings = OpenAIEmbeddings(openai_api_key=OpenAiApiKey)

    llmChat = ChatOpenAI(temperature=temperature,
        openai_api_key=OpenAiApiKey,
        model_name="gpt-3.5-turbo",
        max_tokens=tokenLength)
    

In [3]:
apikey = FmpKey
symbol: str = "AAPL"
cik = "320193"
#symbols: typing.List[str] = ["AAPL", "CSCO", "QQQQ"]
#exchange: str = "NYSE"
#exchanges: typing.List[str] = ["NYSE", "NASDAQ"]
#query: str = "AA"
#limit: int = 3
#period: str = "quarter"
#download: bool = True

In [4]:
central = timezone('US/Central')
today = datetime.now(central)
currentYear = today.year
historicalDate = today - relativedelta(years=3)
historicalYear = historicalDate.year
historicalDate = historicalDate.strftime("%Y-%m-%d")
totalYears = currentYear - historicalYear
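
To make the look-back window concrete, the sketch below enumerates the (year, quarter) pairs that a three-year window produces. It uses only the standard library and a fixed date instead of the current time. `year_quarter_pairs` is a hypothetical helper written for illustration and is not part of the notebook's utilities.

```python
from datetime import date

def year_quarter_pairs(today: date, years_back: int = 3):
    """Enumerate (year, quarter) pairs from `years_back` years ago through the current year."""
    historical_year = today.year - years_back
    total_years = today.year - historical_year
    quarters = ["Q1", "Q2", "Q3", "Q4"]
    return [
        (historical_year + offset, quarter)
        for offset in range(total_years + 1)  # +1 so the current year is included
        for quarter in quarters
    ]

# Fixed example date (an assumption, chosen so the result is reproducible)
pairs = year_quarter_pairs(date(2023, 6, 15))
print(pairs[0], pairs[-1], len(pairs))
```

For a date in 2023 this yields 16 pairs, from 2020-Q1 through 2023-Q4, including quarters that have not been reported yet, so a lookup over these pairs can return fewer documents than pairs.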

#### Paid Data - Company Profile and Key Executives

In [50]:
profile = companyProfile(apikey=apikey, symbol=symbol)
df = pd.DataFrame.from_dict(pd.json_normalize(profile))
print(df[['symbol', 'mktCap', 'companyName', 'currency', 'cik', 'isin', 'exchange', 'industry', 'sector', 'address', 'city', 'state', 'zip', 'website', 'description']])

  symbol         mktCap companyName currency         cik          isin  \
0   AAPL  2966590185462  Apple Inc.      USD  0000320193  US0378331005   

               exchange              industry      sector             address  \
0  NASDAQ Global Select  Consumer Electronics  Technology  One Apple Park Way   

        city state    zip                website  \
0  Cupertino    CA  95014  https://www.apple.com   

                                         description  
0  Apple Inc. designs, manufactures, and markets ...  


In [51]:
executives = keyExecutives(apikey=apikey, symbol=symbol)
df = pd.DataFrame.from_dict(pd.json_normalize(executives),orient='columns')
print(df[['title', 'name']])

                                               title                     name
0           Senior Vice President of People & Retail     Ms. Deirdre  O'Brien
1                 Chief Executive Officer & Director      Mr. Timothy D. Cook
2                          Chief Information Officer          Ms. Mary  Demby
3          Senior Director of Corporation Accounting         Mr. Chris  Kondo
4                           Chief Technology Officer        Mr. James  Wilson
5                            Chief Operating Officer  Mr. Jeffrey E. Williams
6    Chief Financial Officer & Senior Vice President        Mr. Luca  Maestri
7                    Senior Vice President of Retail     Ms. Deirdre  O'Brien
8         Senior Vice President, Gen. Counsel & Sec.   Ms. Katherine L. Adams
9       Senior Vice President of Worldwide Marketing        Mr. Greg  Joswiak
10  Senior Director of Investor Relations & Treasury        Ms. Nancy  Paxton


#### With the company profile and key executives, we can ask Bing Search for the biography of each key executive and ask OpenAI to summarize it - Public Data

In [52]:
os.environ['BING_SUBSCRIPTION_KEY'] = BingKey
os.environ['BING_SEARCH_URL'] = BingUrl
tools = []
topK = 1

# Create the search wrapper and the summarization chain once and reuse them for every executive
bingSearch = BingSearchAPIWrapper(k=topK)
chain = load_summarize_chain(llm, chain_type="map_reduce")

for executive in executives:
    name = executive['name']
    title = executive['title']
    query = f"Give me brief biography of {name} who is {title} at {symbol} as it relates to {symbol}"
    results = bingSearch.run(query=query)
    docs = [Document(page_content=results)]
    summary = chain.run(docs)
    print("Summary for ", name, "is : ", summary)

Summary for  Ms. Deirdre  O'Brien is :   Deirdre O'Brien (born c. 1966) is an American businesswoman and Senior Vice President of Retail and People at Apple Inc. She is responsible for talent development, recruiting, employee relations, business partnerships, benefits, compensation, and inclusion and diversity initiatives.
Summary for  Mr. Timothy D. Cook is :   Tim Cook is the CEO of Apple Inc. since 2011 and was born on November 1, 1960 in Robertsdale, Alabama.
Summary for  Ms. Mary  Demby is :   Two senior executives at Apple, Anna Wojcicki and Matt Fischer, have announced their departure from the company, according to Bloomberg.
Summary for  Mr. Chris  Kondo is :   Chris Kondo is a Senior Director of Corporate Accounting at Apple in Santa Clara, California, with 483 followers on LinkedIn.
Summary for  Mr. James  Wilson is :   Apple Inc. is a leading American technology company that specializes in consumer electronics, computer software, and online services. It is one of the Big Fiv

#### Paid Data - Get the Earnings Call Transcript for each quarter of the last 3 years

In [5]:
# Call the paid data (FMP) API
# Get the earning call transcripts for the last 3 years and merge documents into the index.
i = 0
earningsData = []
earningIndexName = 'earningcalls'
symbol = 'AMZN'
# Create the index if it does not exist
createEarningCallIndex(SearchService, SearchKey, earningIndexName)
for i in range(totalYears + 1):
    print(f"Processing ticker : {symbol}")
    processYear = historicalYear + i
    Quarters = ['Q1', 'Q2', 'Q3', 'Q4']
    for quarter in Quarters:
        print(f"Processing year and Quarter : {processYear}-{quarter}")
        earningTranscript = earningCallTranscript(apikey=apikey, symbol=symbol, year=str(processYear), quarter=quarter)
        for transcript in earningTranscript:
            symbol = transcript['symbol']
            quarter = transcript['quarter']
            year = transcript['year']
            callDate = transcript['date']
            content = transcript['content']
            todayYmd = today.strftime("%Y-%m-%d")
            id = f"{symbol}-{year}-{quarter}"
            earningsData.append({
                "id": id,
                "symbol": symbol,
                "quarter": str(quarter),
                "year": str(year),
                "callDate": callDate,
                "content": content,
                #"inserteddate": datetime.now(central).strftime("%Y-%m-%d"),
            })
# Index the documents in the earning calls index
mergeDocs(SearchService, SearchKey, earningIndexName, earningsData)

Search index earningcalls already exists
Processing ticker : AMZN
Processing year and Quarter : 2020-Q1
Processing year and Quarter : 2020-Q2
Processing year and Quarter : 2020-Q3
Processing year and Quarter : 2020-Q4
Processing ticker : AMZN
Processing year and Quarter : 2021-Q1
Processing year and Quarter : 2021-Q2
Processing year and Quarter : 2021-Q3
Processing year and Quarter : 2021-Q4
Processing ticker : AMZN
Processing year and Quarter : 2022-Q1
Processing year and Quarter : 2022-Q2
Processing year and Quarter : 2022-Q3
Processing year and Quarter : 2022-Q4
Processing ticker : AMZN
Processing year and Quarter : 2023-Q1
Processing year and Quarter : 2023-Q2
Processing year and Quarter : 2023-Q3
Processing year and Quarter : 2023-Q4
Total docs: 13
	Indexed 13 sections, 13 succeeded


#### Split the transcripts as per Split Method, Chunk Size and Overlap

In [6]:
# Use only the latest earnings call transcript to create the documents for the generative AI tasks
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)

print("Last earning call transcripts was on :", earningsData[-1]['callDate'])
rawDocs = splitter.create_documents([earningsData[-1]['content']])
docs = splitter.split_documents(rawDocs)
print("Number of documents chunks generated from Call transcript : ", len(docs))


Last earning call transcripts was on : 2023-04-27 19:55:02
Number of documents chunks generated from Call transcript :  38
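
As a rough illustration of what chunk size and overlap mean, the sketch below splits a string into fixed-size character windows in which each chunk repeats the last characters of the previous one. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will not match these exactly. `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 50):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Synthetic 4000-character input (an assumption, used only to show the window arithmetic)
sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

A larger overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary, at the cost of more chunks to embed and index.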


#### Create the vector store embedding data for chunked sections

In [7]:
# Store the chunks of the latest earnings call transcript in the vector index
earningVectorIndexName = 'latestearningcalls'
createEarningCallVectorIndex(SearchService, SearchKey, earningVectorIndexName)

indexEarningCallSections(OpenAiService, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey,
                         embeddingModelType, OpenAiEmbedding, earningVectorIndexName, docs,
                         earningsData[-1]['callDate'], earningsData[-1]['symbol'], earningsData[-1]['year'],
                         earningsData[-1]['quarter'])

Search index latestearningcalls already exists
Total docs: 38
Found 38 sections for AMZN 2023 Q1
Already indexed 38 sections for AMZN 2023 Q1
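
Conceptually, a vector index stores one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with cosine similarity over toy three-dimensional vectors. The similarity metric, the vectors, and the chunk ids are all assumptions made for illustration; the notebook does not state which metric the search service uses, and real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed_chunks, k=3):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in indexed_chunks.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy index: chunk id -> made-up embedding
toy_index = {
    "chunk-revenue": [0.9, 0.1, 0.0],
    "chunk-risk": [0.1, 0.9, 0.1],
    "chunk-aws": [0.7, 0.2, 0.6],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```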


In [36]:
# Helper function to find the answer to a question
def findAnswer(chainType, topK, question, indexName):
    # Since we already indexed our documents, we can search on the query to retrieve the "TopK" documents
    r = performCogSearch(OpenAiService, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey, embeddingModelType, OpenAiEmbedding, question, 
                         indexName, topK, returnFields=['id', 'symbol', 'quarter', 'year', 'callDate', 'content'])

    if r is None:
        docs = [Document(page_content="No results found")]
    else:
        docs = [
            Document(page_content=doc['content'], metadata={"id": doc['id'], "source": ''})
            for doc in r
            ]

    if chainType == "map_reduce":
        # Prompt for MapReduce
        qaTemplate = """Use the following portion of a long document to see if any of the text is relevant to answer the question.
                Return any relevant text.
                {context}
                Question: {question}
                Relevant text, if any :"""

        qaPrompt = PromptTemplate(
            template=qaTemplate, input_variables=["context", "question"]
        )

        combinePromptTemplate = """Given the following extracted parts of a long document and a question, create a final answer.
        If you don't know the answer, just say that you don't know. Don't try to make up an answer.
        If the answer is not contained within the text below, say \"I don't know\".

        QUESTION: {question}
        =========
        {summaries}
        =========
        """
        combinePrompt = PromptTemplate(
            template=combinePromptTemplate, input_variables=["summaries", "question"]
        )

        qaChain = load_qa_with_sources_chain(llm, chain_type=chainType, question_prompt=qaPrompt, 
                                            combine_prompt=combinePrompt, 
                                            return_intermediate_steps=True)
        answer = qaChain({"input_documents": docs, "question": question})
        outputAnswer = answer['output_text']

    elif chainType == "stuff":
        # Prompt for ChainType = Stuff
        template = """
                Given the following extracted parts of a long document and a question, create a final answer. 
                If you don't know the answer, just say that you don't know. Don't try to make up an answer. 
                If the answer is not contained within the text below, say \"I don't know\".

                QUESTION: {question}
                =========
                {summaries}
                =========
                """
        qaPrompt = PromptTemplate(template=template, input_variables=["summaries", "question"])
        qaChain = load_qa_with_sources_chain(llm, chain_type=chainType, prompt=qaPrompt)
        answer = qaChain({"input_documents": docs, "question": question}, return_only_outputs=True)
        outputAnswer = answer['output_text']
    elif chainType == "default":
        # Default Prompt
        qaChain = load_qa_with_sources_chain(llm, chain_type="stuff")
        answer = qaChain({"input_documents": docs, "question": question}, return_only_outputs=True)
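        outputAnswer = answer['output_text']
    else:
        # Fail early instead of hitting an unbound outputAnswer below
        raise ValueError(f"Unsupported chainType: {chainType}")

    return outputAnswer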

#### 10 best questions to ask during an earnings call - let's see if we can find the answers to these questions in the transcript
- What are some of the current and looming threats to the business?
- What is the debt level or debt ratio of the company right now?
- How do you feel about the upcoming product launches or new products?
- How are you managing or investing in your human capital?
- How do you track the trends in your industry?
- Are there major slowdowns in the production of goods?
- How will you maintain or surpass this performance in the next few quarters?
- What will your market look like in five years as a result of using your product or service?
- How are you going to address the risks that will affect the long-term growth of the company?
- How is the performance this quarter going to affect the long-term goals of the company?

In [33]:
answer1 = findAnswer('map_reduce', 3, "What are some of the current and looming threats to the business?", earningVectorIndexName)
answer2 = findAnswer('map_reduce', 3, "What is the debt level or debt ratio of the company right now?", earningVectorIndexName)
answer3 = findAnswer('map_reduce', 3, "How do you feel about the upcoming product launches or new products?", earningVectorIndexName)
answer4 = findAnswer('map_reduce', 3, "How are you managing or investing in your human capital?", earningVectorIndexName)
answer5 = findAnswer('map_reduce', 3, "How do you track the trends in your industry?", earningVectorIndexName)
answer6 = findAnswer('map_reduce', 3, "Are there major slowdowns in the production of goods?", earningVectorIndexName)
answer7 = findAnswer('map_reduce', 3, "How will you maintain or surpass this performance in the next few quarters?", earningVectorIndexName)
answer8 = findAnswer('map_reduce', 3, "What will your market look like in five years as a result of using your product or service?", earningVectorIndexName)
answer9 = findAnswer('map_reduce', 3, "How are you going to address the risks that will affect the long-term growth of the company?", earningVectorIndexName)
answer10 = findAnswer('map_reduce', 3, "How is the performance this quarter going to affect the long-term goals of the company?", earningVectorIndexName)
print("Answer 1 : ", answer1)
print("Answer 2 : ", answer2)
print("Answer 3 : ", answer3)
print("Answer 4 : ", answer4)
print("Answer 5 : ", answer5)
print("Answer 6 : ", answer6)
print("Answer 7 : ", answer7)
print("Answer 8 : ", answer8)
print("Answer 9 : ", answer9)
print("Answer 10 : ", answer10)

Answer 1 :  
Some of the current and looming threats to the business include changes in global economic and geopolitical conditions, recessionary fears, inflation, interest rates, regional labor market constraints, world events, the rate of growth of the internet, online commerce and cloud services, and the potential for difficult decisions such as eliminating corporate roles.
Answer 2 :  
I don't know.
Answer 3 :  
I don't know.
Answer 4 :  
I don't know.
Answer 5 :  
I don't know.
Answer 6 :  
I don't know.
Answer 7 :  
I don't know.
Answer 8 :  
I don't know.
Answer 9 :  
I don't know.
Answer 10 :  
I don't know.
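
The ten near-identical calls could also be written as a loop over a list of questions. The sketch below shows one way to do that. `ask_all` is a hypothetical helper, and `fake_find_answer` is a stand-in with the same argument order as `findAnswer` so that the example runs without a search service or model; in the notebook, `findAnswer` would be passed instead.

```python
def ask_all(questions, find_answer, chain_type="map_reduce", top_k=3,
            index_name="latestearningcalls"):
    """Run every question through find_answer and collect (number, question, answer)."""
    answers = []
    for number, question in enumerate(questions, start=1):
        answer = find_answer(chain_type, top_k, question, index_name)
        answers.append((number, question, answer))
    return answers

def fake_find_answer(chain_type, top_k, question, index_name):
    # Stand-in for findAnswer: echoes its inputs instead of querying an index
    return f"[{chain_type}/{top_k}/{index_name}] {question}"

questions = [
    "What are some of the current and looming threats to the business?",
    "What is the debt level or debt ratio of the company right now?",
]
results = ask_all(questions, fake_find_answer)
for number, _, answer in results:
    print(f"Answer {number} : ", answer)
```

Keeping the questions in a list also makes it easy to rerun the same set with a different `top_k` or chain type when most answers come back as "I don't know".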


#### More specific questions to ask
- Revenue: Provide key information about revenue for the quarter
- Profitability: Provide key information about profits and losses (P&L) for the quarter
- Industry Trends: Provide key information about industry trends for the quarter
- Trend: Provide key information about business trends discussed on the call
- Risk: Provide key information about risk discussed on the call
- AI: Provide key information about AI discussed on the call
- M&A: Provide any information about mergers and acquisitions (M&A) discussed on the call.
- Guidance: Provide key information about guidance discussed on the call

In [38]:
answer1 = findAnswer('map_reduce', 3, "Provide key information about revenue for the quarter", earningVectorIndexName)
answer2 = findAnswer('map_reduce', 3, "Provide key information about profits and losses (P&L) for the quarter", earningVectorIndexName)
answer3 = findAnswer('map_reduce', 3, "Provide key information about industry trends for the quarter", earningVectorIndexName)
answer4 = findAnswer('map_reduce', 3, "Provide key information about business trends discussed on the call", earningVectorIndexName)
answer5 = findAnswer('map_reduce', 3, "Provide key information about risk discussed on the call", earningVectorIndexName)
answer6 = findAnswer('map_reduce', 3, "Provide key information about AI discussed on the call", earningVectorIndexName)
answer7 = findAnswer('map_reduce', 3, "Provide any information about mergers and acquisitions (M&A) discussed on the call.", earningVectorIndexName)
answer8 = findAnswer('map_reduce', 3, "Provide key information about guidance discussed on the call", earningVectorIndexName)
print("Answer 1 : ", answer1)
print("Answer 2 : ", answer2)
print("Answer 3 : ", answer3)
print("Answer 4 : ", answer4)
print("Answer 5 : ", answer5)
print("Answer 6 : ", answer6)
print("Answer 7 : ", answer7)
print("Answer 8 : ", answer8)


Answer 1 :  
The answer to the question is: For the first quarter, our worldwide net sales were $127.4 billion, up 9% year-over-year, or 11% excluding approximately 210 basis points of unfavorable impact from changes in foreign exchange rates. We reported $4.8 billion in operating income, above the top end of our guidance range.
Answer 2 :  
The key information about profits and losses (P&L) for the quarter is that the company reported $4.8 billion in operating income, which was negatively impacted by an estimated employee severance charge of approximately $470 million.
Answer 3 :  
The key information about industry trends for the quarter is that sellers comprised 59% of overall unit sales in Q1, up from 55% one year ago. Revenue for advertising services was up 23% year-over-year, excluding the impact from changes in foreign exchange rates. In AWS, net sales were $21.4 billion in the first quarter, up 16% year-over-year and representing an annualized sales run rate of more than $85 bi

#### Since we have the latest transcript in document format, let's summarize it using the following specific summary sections

In [8]:
# With the data indexed, let's summarize the information
# While we are using the standard prompt by langchain, you can modify the prompt to suit your needs
        # 8. Risk Increase: Please provide a summary of the risks that have increased.
        # 9. Risk Decrease: Please provide a summary of the risks that have decreased.
        # 10. Opportunity Increase: Please provide a summary of the opportunities that have increased.
        # 11. Opportunity Decrease: Please provide a summary of the opportunities that have decreased.
promptTemplate = """You are an AI assistant tasked with summarizing earning call transcript. 
        Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
        Please generate a concise and comprehensive summary that includes following bulleted numbered format. 
        1. Financial Results Summary: Please provide a summary of the financial results.
        2. Business Highlights: Please provide a summary of the business highlights.
        3. Future Outlook: Please provide a summary of the future outlook.
        4. Business Risks: Please provide a summary of the business risks.
        5. Management Positive Sentiment: Please provide a summary of the what management is confident about.
        6. Management Negative Sentiment: Please provide a summary of the what management is concerned about.
        7. Future Growth Strategies : Please generate a concise and comprehensive strategies summary that includes the information in  bulleted format. 
        Please remember to use clear language and maintain the integrity of the original information without missing any important details:
        {text}
        """
customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
                                    map_prompt=customPrompt, combine_prompt=customPrompt)
summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)

 Financial Results Summary: Amazon reported a net sales increase of 37% to $108.5 billion in the fourth quarter of 2020, compared to $80.0 billion in fourth quarter of 2019. Operating income increased to $4.4 billion in the fourth quarter of 2020, compared to $3.9 billion in fourth quarter of 2019.

Business Highlights: Amazon is investing in brand protection efforts, including industry-leading technology, to provide a great selling experience free from bad actors. They are also transitioning their U.S. fulfillment network to a regionalized model, which they believe will improve both delivery speed and their cost to serve customers over time.

Future Outlook: Amazon is focused on helping customers optimize their AWS spend, reducing linehaul shipping rates, and continuing to build customer relationships and a business that will outlast all of us.

Business Risks: Amazon is aware of the risks associated with their business, including the potential for increased competition, the potential
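
The `map_reduce` chain type works in two stages: each chunk is summarized on its own, and the partial summaries are then combined and summarized once more. The sketch below mirrors that flow with a trivial stand-in for the model call. `map_reduce_summarize` and `first_sentence` are hypothetical names, and the result keys simply echo the ones read from the chain output in the cell above.

```python
def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk, then summarize the joined partial summaries."""
    # Map step: every chunk is summarized independently
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: the partial summaries are combined into one final summary
    final_summary = summarize("\n".join(partial_summaries))
    return {"intermediate_steps": partial_summaries, "output_text": final_summary}

def first_sentence(text):
    # Stand-in for an LLM call: keeps only the first sentence
    return text.split(".")[0].strip() + "."

chunks = [
    "Net sales grew. Details followed about each segment.",
    "Management discussed cost reductions. Analysts asked about timing.",
]
result = map_reduce_summarize(chunks, first_sentence)
print(result["intermediate_steps"])
print(result["output_text"])
```

Because the cell above passes the same prompt as both `map_prompt` and `combine_prompt`, the seven-section template is applied to every chunk and then again to the combined partial summaries.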

#### To see the intermediate (per-chunk) summaries, run the code below

In [5]:
# # For the chaintype of MapReduce and Refine, we can also get insight into intermediate steps of the pipeline.
# # This way you can inspect the results from map_reduce chain type, each top similar chunk summary
# intermediateSteps = summary['intermediate_steps']
# for step in intermediateSteps:
#         display(HTML("<b>Chunk Summary:</b> " + step))

In [5]:
# For now we call the API every time. Otherwise, check whether the data is already persisted in the
# index before calling again and, if it is, delete it first
counter = 0
pressReleasesList = []
pressReleaseIndexName = 'pressreleases'
# Create the index if it does not exist
createPressReleaseIndex(SearchService, SearchKey, pressReleaseIndexName)
print(f"Processing ticker : {symbol}")
pr = pressReleases(apikey=apikey, symbol=symbol, limit=200)
for pressRelease in pr:
    symbol = pressRelease['symbol']
    releaseDate = pressRelease['date']
    title = pressRelease['title']
    content = pressRelease['text']
    todayYmd = today.strftime("%Y-%m-%d")
    id = f"{symbol}-{counter}"
    pressReleasesList.append({
        "id": id,
        "symbol": symbol,
        "releaseDate": releaseDate,
        "title": title,
        "content": content,
    })
    counter = counter + 1

mergeDocs(SearchService, SearchKey, pressReleaseIndexName, pressReleasesList)

Search index pressreleases already exists
Processing ticker : AAPL
Total docs: 164
	Indexed 164 sections, 164 succeeded


In [6]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)
rawPressReleasesDoc = [Document(page_content=t['content']) for t in pressReleasesList[:25]]
pressReleasesDocs = splitter.split_documents(rawPressReleasesDoc)
print("Number of documents chunks generated from Press releases : ", len(pressReleasesDocs))

Number of documents chunks generated from Press releases :  25


In [39]:
# With the data indexed, let's summarize the information
promptTemplate = """You are an AI assistant tasked with summarizing press releases and performing sentiments on those. 
        Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
        Please generate a concise and comprehensive summary and sentiment with score with range of 0 to 10. Your response should be in JSON format with following keys..
        summary: 
        sentiment:
        sentiment score: 
        Please remember to use clear language and maintain the integrity of the original information without missing any important details
        {text}
        """
customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
                                    map_prompt=customPrompt, combine_prompt=customPrompt)
summary = summaryChain({"input_documents": pressReleasesDocs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)
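
Since the prompt asks for JSON, the model output can be parsed before it is used downstream. The sketch below is one defensive way to do that, assuming the model returns a single JSON object with the keys named in the prompt, possibly surrounded by extra text. `parse_sentiment` is a hypothetical helper and the example string is made up.

```python
import json

def parse_sentiment(raw_output: str):
    """Parse the text between the first '{' and the last '}' as JSON; return None on failure."""
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    # The prompt asks for a score between 0 and 10; reject numeric values outside that range
    score = parsed.get("sentiment score")
    if isinstance(score, (int, float)) and not 0 <= score <= 10:
        return None
    return parsed

example = ('Here is the result: {"summary": "Quarterly results announced.", '
           '"sentiment": "positive", "sentiment score": 8}')
print(parse_sentiment(example))
print(parse_sentiment("no json here"))
```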




In [40]:
# For the chaintype of MapReduce and Refine, we can also get insight into intermediate steps of the pipeline.
# This way you can inspect the results from map_reduce chain type, each top similar chunk summary
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
        display(HTML("<b>Chunk Summary:</b> " + step))

#### Get Stock News - limit to 5000 articles, which will most likely cover the current year

In [14]:
# For now we call the API every time. Otherwise, check whether the data is already persisted in the
# index before calling again and, if it is, delete it first
counter = 0
stockNewsList = []
stockNewsIndexName = 'stocknews'
# Create the index if it does not exist
createStockNewsIndex(SearchService, SearchKey, stockNewsIndexName)
print(f"Processing ticker : {symbol}")
sn = stockNews(apikey=apikey, tickers=symbol, limit=5000)
for news in sn:
    symbol = news['symbol']
    publishedDate = news['publishedDate']
    title = news['title']
    image = news['image']
    site = news['site']
    content = news['text']
    url = news['url']
    todayYmd = today.strftime("%Y-%m-%d")
    id = f"{symbol}-{todayYmd}-{counter}"
    stockNewsList.append({
        "id": id,
        "symbol": symbol,
        "publishedDate": publishedDate,
        "title": title,
        "image": image,
        "site": site,
        "content": content,
        "url": url,
    })
    counter = counter + 1
mergeDocs(SearchService, SearchKey, stockNewsIndexName, stockNewsList)

Search index stocknews already exists
Processing ticker : AAPL
Total docs: 5000
	Indexed 1000 sections, 1000 succeeded
	Indexed 1000 sections, 1000 succeeded
	Indexed 1000 sections, 1000 succeeded
	Indexed 1000 sections, 1000 succeeded
	Indexed 1000 sections, 1000 succeeded


In [37]:
stocksDf = pd.DataFrame.from_dict(pd.json_normalize(stockNewsList))
stocksDf['publishedDate'] = pd.to_datetime(stocksDf['publishedDate']).dt.date
stocksNewsDailyDf = stocksDf.sort_values('publishedDate').groupby('publishedDate')['content'].apply('\n'.join).reset_index()
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)
rawNewsDocs = [Document(page_content=row['content']) for index, row in stocksNewsDailyDf.tail(10).iterrows()]
newsDocs = splitter.split_documents(rawNewsDocs)
print("Number of documents chunks generated from Stock news : ", len(newsDocs))

# With the data indexed, let's summarize the information
promptTemplate = """You are an AI assistant tasked with summarizing news related to company and performing sentiments on those. 
        Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
        Please generate a concise and comprehensive summary and sentiment with score with range of 0 to 10. Your response should be in JSON format with following keys.
        summary: 
        sentiment:
        sentiment score:
        Please remember to use clear language and maintain the integrity of the original information without missing any important details.
        {text}
        """
customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
                                    map_prompt=customPrompt, combine_prompt=customPrompt)
summary = summaryChain({"input_documents": newsDocs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)
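
The prompt asks for a JSON reply with the keys `summary`, `sentiment`, and `sentiment score`, but a model may wrap the JSON in extra text or return something that does not parse. A minimal sketch of a defensive parser follows. The function name `parseSentimentJson` and the sample reply are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
import json
import re
from typing import Any, Dict, Optional

def parseSentimentJson(rawText: str) -> Optional[Dict[str, Any]]:
    """Return the JSON object embedded in rawText, or None if none parses."""
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", rawText, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Hypothetical model reply, used only to exercise the parser.
sampleAnswer = (
    "Here is the result:\n"
    '{"summary": "Shares rose after earnings.", "sentiment": "positive", "sentiment score": 8}'
)
parsedAnswer = parseSentimentJson(sampleAnswer)
print(parsedAnswer["sentiment score"] if parsedAnswer else "No JSON found")
```

In the notebook, the same function could be applied to `outputAnswer` before storing or displaying the sentiment score.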




In [38]:
# For the map_reduce and refine chain types, we can also inspect the intermediate steps of the pipeline.
# With map_reduce, each intermediate step is the summary of one document chunk.
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
    display(HTML("<b>Chunk Summary:</b> " + step))
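
Because the chunk summaries are model-generated text concatenated into HTML, characters such as `<` or `&` could be rendered as markup. One possible precaution, shown here as a small sketch, is to escape each summary first. The helper name `formatChunkSummary` is hypothetical.

```python
import html

def formatChunkSummary(step: str) -> str:
    """Build the HTML snippet for one chunk summary, escaping the model text."""
    return "<b>Chunk Summary:</b> " + html.escape(step)

# Hypothetical summary text containing characters that HTML treats specially.
snippet = formatChunkSummary("Margins < 40% & falling")
print(snippet)
```

The returned string can be passed to `display(HTML(...))` exactly as in the cell above.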

In [6]:
filingType = "10-K"
secFilingsList = secFilings(apikey=apikey, symbol=symbol, filing_type=filingType)

In [18]:
latestFilingDateTime = datetime.strptime(secFilingsList[0]['fillingDate'], '%Y-%m-%d %H:%M:%S')
latestFilingDate = latestFilingDateTime.strftime("%Y-%m-%d")
secFilingIndexName = 'secdata'
secFilingList = []
emptyBody = {
        "values": [
            {
                "recordId": 0,
                "data": {
                    "text": ""
                }
            }
        ]
}

secExtractBody = {
    "values": [
        {
            "recordId": 0,
            "data": {
                "text": {
                    "edgar_crawler": {
                        "start_year": int(historicalYear),
                        "end_year": int(currentYear),
                        "quarters": [1,2,3,4],
                        "filing_types": [
                            "10-K"
                        ],
                        "cik_tickers": [cik],
                        "user_agent": "Your name (your email)",
                        "raw_filings_folder": "RAW_FILINGS",
                        "indices_folder": "INDICES",
                        "filings_metadata_file": "FILINGS_METADATA.csv",
                        "skip_present_indices": True
                    },
                    "extract_items": {
                        "raw_filings_folder": "RAW_FILINGS",
                        "extracted_filings_folder": "EXTRACTED_FILINGS",
                        "filings_metadata_file": "FILINGS_METADATA.csv",
                        "items_to_extract": ["1","1A","1B","2","3","4","5","6","7","7A","8","9","9A","9B","10","11","12","13","14","15"],
                        "remove_tables": True,
                        "skip_extracted_filings": True
                    }
                }
            }
        }
    ]
}

# Fields to retrieve for a filing; shared by both lookups below
secReturnFields = ['id', 'cik', 'company', 'filingType', 'filingDate',
                   'periodOfReport', 'sic', 'stateOfInc', 'fiscalYearEnd',
                   'filingHtmlIndex', 'htmFilingLink', 'completeTextFilingLink',
                   'item1', 'item1A', 'item1B', 'item2', 'item3', 'item4', 'item5',
                   'item6', 'item7', 'item7A', 'item8', 'item9', 'item9A', 'item9B',
                   'item10', 'item11', 'item12', 'item13', 'item14', 'item15',
                   'sourcefile']

# Check whether we have already processed the latest filing; if so, skip extraction
createSecFilingIndex(SearchService, SearchKey, secFilingIndexName)
r = findSecFiling(SearchService, SearchKey, secFilingIndexName, cik, filingType, latestFilingDate, returnFields=secReturnFields)
if r.get_count() == 0:
    # Call the Azure Function that scrapes the filing and stores the JSON in our blob storage
    secExtract = requests.post(SecExtractionUrl, json=secExtractBody)
    # Once the JSON is created, call the function that processes it and stores the data in our index
    docPersistUrl = SecDocPersistUrl + "&indexType=cogsearchvs&indexName=" + secFilingIndexName + "&embeddingModelType=" + embeddingModelType
    secPersist = requests.post(docPersistUrl, json=emptyBody)
    # Look the filing up again now that it has been indexed
    r = findSecFiling(SearchService, SearchKey, secFilingIndexName, cik, filingType, latestFilingDate, returnFields=secReturnFields)

# Retrieve the latest filing from our index
for filing in r:
    secFilingList.append({
        "id": filing['id'],
        "cik": filing['cik'],
        "company": filing['company'],
        "filingType": filing['filingType'],
        "filingDate": filing['filingDate'],
        "periodOfReport": filing['periodOfReport'],
        "sic": filing['sic'],
        "stateOfInc": filing['stateOfInc'],
        "fiscalYearEnd": filing['fiscalYearEnd'],
        "filingHtmlIndex": filing['filingHtmlIndex'],
        "completeTextFilingLink": filing['completeTextFilingLink'],
        "item1": filing['item1'],
        "item1A": filing['item1A'],
        "item1B": filing['item1B'],
        "item2": filing['item2'],
        "item3": filing['item3'],
        "item4": filing['item4'],
        "item5": filing['item5'],
        "item6": filing['item6'],
        "item7": filing['item7'],
        "item7A": filing['item7A'],
        "item8": filing['item8'],
        "item9": filing['item9'],
        "item9A": filing['item9A'],
        "item9B": filing['item9B'],
        "item10": filing['item10'],
        "item11": filing['item11'],
        "item12": filing['item12'],
        "item13": filing['item13'],
        "item14": filing['item14'],
        "item15": filing['item15'],
        "sourcefile": filing['sourcefile']
    })

In [29]:
def generateSummaries(docs):
    # With the data indexed, let's summarize the information
    promptTemplate = """You are an AI assistant tasked with summarizing a financial report related to a company. 
            Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
            Please generate a concise and comprehensive summary of the following document.
            Please remember to use clear language and maintain the integrity of the original information without missing any important details.
            Summarize it in about 10 lines.
            {text}
            """
    customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
    chainType = "map_reduce"
    summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
                                        map_prompt=customPrompt, combine_prompt=customPrompt)
    summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
    return summary

In [31]:
# For each extracted section of the filing, run summarization and generate answers to common questions
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)

# Item 1 - Describes the business of the company
rawItemDocs = [Document(page_content=secFilingList[0]['item1'])]
itemDocs = splitter.split_documents(rawItemDocs)
print("Number of documents chunks generated from Item1 : ", len(itemDocs))
summary = generateSummaries(itemDocs)
outputAnswer = summary['output_text']
print("Business Description : " + outputAnswer)

# Item 1A - Risk Factors
rawItemDocs = [Document(page_content=secFilingList[0]['item1A'])]
itemDocs = splitter.split_documents(rawItemDocs)
print("Number of documents chunks generated from Item1A : ", len(itemDocs))
summary = generateSummaries(itemDocs)
outputAnswer = summary['output_text']
print("Risk Factors : " + outputAnswer)

# Item 6 - Selected Financial Data
rawItemDocs = [Document(page_content=secFilingList[0]['item6'])]
itemDocs = splitter.split_documents(rawItemDocs)
print("Number of documents chunks generated from Item6 : ", len(itemDocs))
summary = generateSummaries(itemDocs)
outputAnswer = summary['output_text']
print("Financial Data : " + outputAnswer)

# Item 7 - Management's Discussion and Analysis of Financial Condition and Results of Operations
rawItemDocs = [Document(page_content=secFilingList[0]['item7'])]
itemDocs = splitter.split_documents(rawItemDocs)
print("Number of documents chunks generated from Item7 : ", len(itemDocs))
summary = generateSummaries(itemDocs)
outputAnswer = summary['output_text']
print("Management Discussion : " + outputAnswer)

# Item 7A - Quantitative and qualitative disclosures about market risk
rawItemDocs = [Document(page_content=secFilingList[0]['item7A'])]
itemDocs = splitter.split_documents(rawItemDocs)
print("Number of documents chunks generated from Item7A : ", len(itemDocs))
summary = generateSummaries(itemDocs)
outputAnswer = summary['output_text']
print("Risk Disclosures : " + outputAnswer)

# Item 9 - Disagreements with accountants and changes in accounting
rawItemDocs = [Document(page_content=secFilingList[0]['item9'])]
itemDocs = splitter.split_documents(rawItemDocs)
print("Number of documents chunks generated from Item9 : ", len(itemDocs))
summary = generateSummaries(itemDocs)
outputAnswer = summary['output_text']
print("Accounting Disclosures : " + outputAnswer)

Number of documents chunks generated from Item1 :  12
Business Description : 
Apple Inc. is a technology company that designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories. It offers a range of services, including advertising, AppleCare®, cloud services, digital content, and payment services. The Company sells its products and resells third-party products in most of its major markets directly to consumers, small and mid-sized businesses, and education, enterprise and government customers through its retail and online stores and its direct sales force. It also employs a variety of indirect distribution channels. The main competitive factors for Apple include price, product and service features, relative price and performance, product and service quality and reliability, design innovation, a strong third-party software and accessories ecosystem, marketing and distribution capability, service and support, and corporate reputation. The Comp
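
The cell above repeats the same split, summarize, and print steps for every item. One way to express that pattern once is to loop over a mapping of item keys to labels. The following is a minimal sketch rather than the notebook's own implementation: `ITEM_LABELS`, `summarizeItems`, `fakeSplit`, and `fakeSummarize` are hypothetical names, and the two fake helpers stand in for `splitter.split_documents` and `generateSummaries` so that the example runs without an LLM or a search index.

```python
from typing import Callable, Dict, List

# Item keys as stored in secFilingList, mapped to the labels printed above.
ITEM_LABELS: Dict[str, str] = {
    "item1": "Business Description",
    "item1A": "Risk Factors",
    "item6": "Financial Data",
    "item7": "Management Discussion",
    "item7A": "Risk Disclosures",
    "item9": "Accounting Disclosures",
}

def summarizeItems(filing: Dict[str, str],
                   splitText: Callable[[str], List[str]],
                   summarize: Callable[[List[str]], str]) -> Dict[str, str]:
    """Split and summarize each configured item, skipping missing or empty ones."""
    results: Dict[str, str] = {}
    for key, label in ITEM_LABELS.items():
        text = filing.get(key) or ""
        if not text.strip():
            continue
        results[label] = summarize(splitText(text))
    return results

# Hypothetical stand-ins for the text splitter and the LLM summarizer.
def fakeSplit(text: str) -> List[str]:
    return [text[i:i + 20] for i in range(0, len(text), 20)]

def fakeSummarize(chunks: List[str]) -> str:
    return f"{len(chunks)} chunk(s)"

sampleFiling = {"item1": "Apple designs and sells devices.", "item1A": "", "item7": "Revenue grew."}
summaries = summarizeItems(sampleFiling, fakeSplit, fakeSummarize)
for label, text in summaries.items():
    print(f"{label} : {text}")
```

In the notebook, `splitText` could wrap `splitter.split_documents([Document(page_content=text)])` and `summarize` could return `generateSummaries(docs)['output_text']`. Skipping empty items also avoids sending blank sections to the model when a filing leaves an item out.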

In [None]:
## Do we need Financial Reports (Balance Sheet, Income Statement and Cash Flow) for the last 3 years?