### PIB CoPilot
PIBs are also used to create a pitchbook by assessing a company's strategy, competitive positioning, review of financial statements, industry dynamics, and trends within the industry. 
1. Company Overview and Executive Bio - A brief description of the company and its key executives with biographies.
2. Conference calls: The same day a company issues its quarterly press release, it will also hold a conference call. On the call, analysts often learn details about management guidance. These conference calls are transcribed by several service providers and can be accessed by subscribers of large financial data providers.
3. Press Release: Can be found in the investor relations section of most companies' websites and contains the financial statements which are used in forms 10-K and 10-Q. 
4. News: News articles that may affect a company's stock price or growth prospect would be something that analysts look into, particularly within a 6-12 month time horizon.
5. SEC filings: These regulatory documents require a company to file Form 10-K and Form 10-Q with the SEC on an ongoing basis. Form 10-K is a financial overview and commentary for the last year, usually found on the company's website. Form 10-Q is similar to form 10-K, but it is a report for the last quarter instead of the previous year.
6. Equity research reports: Look into key forecasts for metrics like Revenue, EBITDA, and EPS for the company or competing firms to form a consensus estimate. 
7. Investor Presentations: Companies provide historical information as an important foundation from which forecasts are made to guide key forecasting drivers. 

#### 0 -  Pre-requsite and imports

In [1]:
import os  
import json  
import openai
from Utilities.envVars import *
import uuid
# Set Search Service endpoint, index name, and API key from environment variables
indexName = SearchIndex

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
openAiEndPoint = f"{OpenAiEndPoint}"
assert openAiEndPoint, "ERROR: Azure OpenAI Endpoint is missing"
openai.api_base = openAiEndPoint

In [2]:
# Parameters
embeddingModelType = "azureopenai"
temperature = 0
tokenLength = 1000
symbol = 'AAPL'
apikey = FmpKey
os.environ['BING_SUBSCRIPTION_KEY'] = BingKey
os.environ['BING_SEARCH_URL'] = BingUrl
pibIndexName = 'pitchbook'

In [3]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms.openai import AzureOpenAI, OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from IPython.display import display, HTML
from langchain.utilities import BingSearchAPIWrapper
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd
from langchain.prompts import PromptTemplate
from datetime import datetime
from pytz import timezone
from dateutil.relativedelta import relativedelta
from datetime import timedelta
from Utilities.pitchBook import indexDocs, createPressReleaseIndex, createStockNewsIndex, mergeDocs, createPibIndex, findPibData, findEarningCalls, deletePibData, performEarningCallCogSearch
from Utilities.pitchBook import indexEarningCallSections, createEarningCallVectorIndex, createEarningCallIndex, performCogSearch, createSecFilingIndex, findSecFiling
from Utilities.pitchBook import findLatestSecFilings, createSecFilingsVectorIndex, indexSecFilingsSections
from Utilities.pitchBook import deletePibData, findEarningCallsBySymbol
import typing
from Utilities.fmp import *
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
from langchain.chains import LLMChain
import yfinance as yf

In [4]:
# Flexibility to change the call to OpenAI or Azure OpenAI
if (embeddingModelType == 'azureopenai'):
    openai.api_type = "azure"
    openai.api_key = OpenAiKey
    openai.api_version = OpenAiVersion
    openai.api_base = OpenAiEndPoint
    
    llm = AzureChatOpenAI(
                openai_api_base=openai.api_base,
                openai_api_version=OpenAiVersion,
                deployment_name=OpenAiChat16k,
                temperature=temperature,
                openai_api_key=OpenAiKey,
                openai_api_type="azure",
                max_tokens=tokenLength)
    
    logging.info("LLM Setup done")
    embeddings = OpenAIEmbeddings(deployment=OpenAiEmbedding, openai_api_key=OpenAiKey, openai_api_type="azure")
elif embeddingModelType == "openai":
    openai.api_type = "open_ai"
    openai.api_base = "https://api.openai.com/v1"
    openai.api_version = '2020-11-07' 
    openai.api_key = OpenAiApiKey
    llm = OpenAI(temperature=temperature,
            openai_api_key=OpenAiApiKey,
            model_name="gpt-3.5-turbo",
            max_tokens=tokenLength)
    embeddings = OpenAIEmbeddings(openai_api_key=OpenAiApiKey)    


In [5]:
central = timezone('US/Central')
today = datetime.now(central)
currentYear = today.year
historicalDate = today - relativedelta(years=3)
historicalYear = historicalDate.year
historicalDate = historicalDate.strftime("%Y-%m-%d")
totalYears = currentYear - historicalYear

In [6]:
#find CIK based on Symbol
cik = str(int(searchCik(apikey=apikey, ticker=symbol)[0]["companyCik"]))
createPibIndex(SearchService, SearchKey, pibIndexName)

In [7]:
#deletePibData(SearchService, SearchKey, pibIndexName, cik, "1", returnFields=['id', 'symbol', 'cik', 'step', 'description', 'insertedDate',
#                                                                       'pibData'])

#### 1 -  Company Overview and Executive Bio
    Snapshot of the company - Overview
    Board and Management - Executive Bio
    Shareholders - Ownership
    M&A Overview
    Competitors - Peer Group
    Industry Overview

##### Data Source - Factset, FMP, Custom configuration, Yahoo Finance

In [8]:
#datasource = 'fmp'

#### 1. Paid Data - Company Profile and Key Executives

In [9]:
def getYahooProfile(yInfo, pibIndexName, cik, step, symbol, temperature, llm, today, dataSource):
    df = pd.DataFrame.from_dict(pd.json_normalize(yInfo.info))
    df = df.rename(columns={'longName': 'companyName', 'longBusinessSummary': 'description', 'address1': 'address'})
    df['cik']=cik
    df['isin']=yInfo.isin
    sData = {
            'id' : str(uuid.uuid4()),
            'symbol': symbol,
            'cik': cik,
            'step': step,
            'dataSource': dataSource,
            'description': 'Company Profile',
            'insertedDate': today.strftime("%Y-%m-%d"),
            'pibData' : str(df[['symbol', 'marketCap', 'companyName', 'currency', 'cik', 'isin', 'exchange', 'industry', 'sector', 'address', 'city', 'state', 'zip', 'website', 'description']].to_dict('records'))
    }
    return sData

def getFmpProfile(pibIndexName, cik, step, symbol, temperature, llm, today, dataSource):
    profile = companyProfile(apikey=FmpKey, symbol=symbol)
    df = pd.DataFrame.from_dict(pd.json_normalize(profile))
    df.fillna("",inplace=True)
    sData = {
            'id' : str(uuid.uuid4()),
            'symbol': symbol,
            'cik': cik,
            'step': step,
            'dataSource': dataSource,
            'description': 'Company Profile',
            'insertedDate': today.strftime("%Y-%m-%d"),
            'pibData' : str(df[['symbol', 'mktCap', 'companyName', 'currency', 'cik', 'isin', 'exchange', 'industry', 'sector', 'address', 'city', 'state', 'zip', 'website', 'description']].to_dict('records'))
    }
    return sData

def getYahooExecProfile(yInfo, pibIndexName, cik, step, symbol, temperature, llm, today, dataSource):
    df = pd.DataFrame.from_dict(pd.json_normalize(yInfo.info['companyOfficers']),orient='columns')
    df = df.drop_duplicates(subset='name', keep="first")
    step1Executives = []
    for index, row in df.iterrows():
        name = row['name']
        title = row['title']
        query = f"Give me brief biography of {name} who is {title} at {symbol}. Biography should be restricted to {symbol} and summarize it as 4 paragraphs."
        qaPromptTemplate = """
            Rephrase the following question asked by user to perform intelligent internet search
            {query}
            """
        optimizedPrompt = qaPromptTemplate.format(query=query)
        qaPrompt = PromptTemplate(input_variables=["query"],template=qaPromptTemplate)
        chain = LLMChain(llm=llm, prompt=qaPrompt)
        q = chain.run(query=query)
        bingSearch = BingSearchAPIWrapper(k=20)
        results = bingSearch.run(query=q)
        logging.info(f"Generate Summary for {q}")
        chain = load_summarize_chain(llm, chain_type="stuff")
        docs = [Document(page_content=results)]
        summary = chain.run(docs)
        step1Executives.append({
            "name": name,
            "title": title,
            "biography": summary
        })

    sData = {
            'id' : str(uuid.uuid4()),
            'symbol': symbol,
            'cik': cik,
            'step': step,
            'dataSource': dataSource,
            'description': 'Biography of Key Executives',
            'insertedDate': today.strftime("%Y-%m-%d"),
            'pibData' : str(step1Executives)
    }
    return sData
        
def getFmpExecProfile(pibIndexName, cik, step, symbol, temperature, llm, today, dataSource):
    # Get the list of all executives and generate biography for each of them
    executives = keyExecutives(apikey=FmpKey, symbol=symbol)
    df = pd.DataFrame.from_dict(pd.json_normalize(executives),orient='columns')
    df = df.drop_duplicates(subset='name', keep="first")

    step1Executives = []
    #### With the company profile and key executives, we can ask Bing Search to get the biography of the all Key executives and 
    # ask OpenAI to summarize it - Public Data
    for executive in executives:
        name = executive['name']
        title = executive['title']
        query = f"Give me brief biography of {name} who is {title} at {symbol}. Biography should be restricted to {symbol} and summarize it as 4 paragraphs."
        qaPromptTemplate = """
            Rephrase the following question asked by user to perform intelligent internet search
            {query}
            """
        
        qaPrompt = PromptTemplate(input_variables=["query"],template=qaPromptTemplate)
        chain = LLMChain(llm=llm, prompt=qaPrompt)
        q = chain.run(query=query)
        bingSearch = BingSearchAPIWrapper(k=20)
        results = bingSearch.run(query=q)
        logging.info(f"Generate Summary for {q}")
        chain = load_summarize_chain(llm, chain_type="stuff")
        docs = [Document(page_content=results)]
        summary = chain.run(docs)
        step1Executives.append({
            "name": name,
            "title": title,
            "biography": summary
        })

    sData = {
            'id' : str(uuid.uuid4()),
            'symbol': symbol,
            'cik': cik,
            'step': step,
            'dataSource': dataSource,
            'description': 'Biography of Key Executives',
            'insertedDate': today.strftime("%Y-%m-%d"),
            'pibData' : str(step1Executives)
    }
    return sData

In [10]:
def processStep1(pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource):
    if profileDataSource == 'yahoo':
        # Data source is Yahoo Finance
        yInfo = yf.Ticker(symbol)

    s1Data = []
    r = findPibData(SearchService, SearchKey, pibIndexName, cik, step, returnFields=['id', 'symbol', 'cik', 'step', 'dataSource', 'description', 'insertedDate',
                                                                    'pibData'])
    
    logging.info(f"Found {r.get_count()} records for {symbol} in {pibIndexName}")
    if r.get_count() == 0:
        step1Profile = []
        step1Biography = []
        if profileDataSource == 'yahoo':
            sData = getYahooProfile(yInfo, pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
        elif profileDataSource == 'fmp':
            sData = getFmpProfile(pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
            
        step1Profile.append(sData)
        s1Data.append(sData)
        # Insert data into pibIndex
        mergeDocs(SearchService, SearchKey, pibIndexName, step1Profile)

        if profileDataSource == 'yahoo':
            sData = getYahooExecProfile(yInfo, pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
        elif profileDataSource == 'fmp':
            sData = getFmpExecProfile(pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
            
        step1Biography.append(sData)
        s1Data.append(sData)
        mergeDocs(SearchService, SearchKey, pibIndexName, step1Biography)
    elif r.get_count() == 1:
        for s in r:
            logging.info(f"Found Company Profile for {symbol}")
            if s['description'] == 'Company Profile':
                s1Data.append(
                    {
                        'id' : s['id'],
                        'symbol': s['symbol'],
                        'cik': s['cik'],
                        'step': s['step'],
                        'dataSource': s['dataSource'],
                        'description': s['description'],
                        'insertedDate': s['insertedDate'],
                        'pibData' : s['pibData']
                    })
                
                step1Biography = []
                
                if profileDataSource == 'yahoo':
                    sData = getYahooExecProfile(yInfo, pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
                elif profileDataSource == 'fmp':
                    sData = getFmpExecProfile(pibIndexName, cik, step, symbol, temperature, llm, today)
                
                step1Biography.append(sData)
                s1Data.append(sData)
                mergeDocs(SearchService, SearchKey, pibIndexName, step1Biography)
            elif s['description'] == 'Biography of Key Executives':
                logging.info(f"Found Biography of Key Executives for {symbol}")
                s1Data.append(
                    {
                        'id' : s['id'],
                        'symbol': s['symbol'],
                        'cik': s['cik'],
                        'step': s['step'],
                        'dataSource': s['dataSource'],
                        'description': s['description'],
                        'insertedDate': s['insertedDate'],
                        'pibData' : s['pibData']
                    })
                
                step1Profile = []
                
                if profileDataSource == 'yahoo':
                    sData = getYahooProfile(yInfo, pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
                elif profileDataSource == 'fmp':
                    sData = getFmpProfile(pibIndexName, cik, step, symbol, temperature, llm, today, profileDataSource)
                
                step1Profile.append(sData)
                s1Data.append(sData)
                # Insert data into pibIndex
                mergeDocs(SearchService, SearchKey, pibIndexName, step1Profile)
    else:
        for s in r:
            s1Data.append(
                {
                    'id' : s['id'],
                    'symbol': s['symbol'],
                    'cik': s['cik'],
                    'step': s['step'],
                    'dataSource': s['dataSource'],
                    'description': s['description'],
                    'insertedDate': s['insertedDate'],
                    'pibData' : s['pibData']
                })
    
    return s1Data

In [11]:
profileDataSource = 'yahoo'
s1Data = processStep1(pibIndexName, cik, "1", symbol, temperature, llm, today, profileDataSource)
print(s1Data)

[{'id': 'a3ad0220-6410-4163-a126-c17806dbd300', 'symbol': 'AAPL', 'cik': '320193', 'step': '1', 'dataSource': 'yahoo', 'description': 'Company Profile', 'insertedDate': '2023-10-09', 'pibData': "[{'symbol': 'AAPL', 'marketCap': 2798052704256, 'companyName': 'Apple Inc.', 'currency': 'USD', 'cik': '320193', 'isin': 'US0378331005', 'exchange': 'NMS', 'industry': 'Consumer Electronics', 'sector': 'Technology', 'address': 'One Apple Park Way', 'city': 'Cupertino', 'state': 'CA', 'zip': '95014', 'website': 'https://www.apple.com', 'description': 'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, and HomePod. It also provides AppleCare support and cloud services; and operates various platforms, includ

#### 2. Paid Data -  Get the Earnings Call Transcript for each quarter for last 3 years

In [12]:
def getFmpEarningCalls(earningIndexName, symbol, earningTranscriptDataSource, earningQuarters):
    # Call the paid data (FMP) API
    # Get the earning call transcripts for the last 3 years and merge documents into the index.
    i = 0
    earningsData = []
    try:
        # Get the list of all earning calls available
        earningCallDates = earningCallsAvailableDates(apikey=FmpKey, symbol=symbol)
        if len(earningCallDates) > 0:
            for i in range(earningQuarters):
                quarter = earningCallDates[i][0]
                year = earningCallDates[i][1]
                r = findEarningCalls(SearchService, SearchKey, earningIndexName, symbol, str(quarter), str(year), returnFields=['id', 'symbol', 
                                    'quarter', 'year', 'callDate', 'content'])
                if r.get_count() == 0:
                    insertEarningCall = []
                    earningTranscript = earningCallTranscript(apikey=FmpKey, symbol=symbol, year=str(year), quarter=quarter)
                    for transcript in earningTranscript:
                        symbol = transcript['symbol']
                        quarter = transcript['quarter']
                        year = transcript['year']
                        callDate = transcript['date']
                        content = transcript['content']
                        id = f"{symbol}-{year}-{quarter}-{earningTranscriptDataSource}"
                        earningRecord = {
                            "id": id,
                            "symbol": symbol,
                            "quarter": str(quarter),
                            "year": str(year),
                            "callDate": callDate,
                            "content": content,
                            #"inserteddate": datetime.now(central).strftime("%Y-%m-%d"),
                        }
                        earningsData.append(earningRecord)
                        insertEarningCall.append(earningRecord)
                        mergeDocs(SearchService, SearchKey, earningIndexName, insertEarningCall)
                else:
                    logging.info(f"Found {r.get_count()} records for {symbol} for {quarter} {str(year)}")
                    for s in r:
                        record = {
                                'id' : s['id'],
                                'symbol': s['symbol'],
                                'quarter': s['quarter'],
                                'year': s['year'],
                                'callDate': s['callDate'],
                                'content': s['content']
                            }
                        earningsData.append(record)
        else:
            logging.info(f"No earning calls found for {symbol}")
            return earningsData
                
        logging.info(f"Total records found for {symbol} : {len(earningsData)}")

        return earningsData
    except Exception as e:
        logging.error(f"Error occured while processing {symbol} : {e}")

In [15]:
def processStep2(pibIndexName, cik, step, symbol, llm, today, embeddingModelType, earningTranscriptDataSource, earningQuarters):
    r = findPibData(SearchService, SearchKey, pibIndexName, cik, step, returnFields=['id', 'symbol', 'cik', 'step', 'dataSource', 'description', 'insertedDate',
                                                                   'pibData'])
    content = ''
    latestCallDate = ''
    s2Data = []
    earningIndexName = 'pitchbookec'
    earningIndexVsName = 'pitchbookecvector'
    # Create the index if it does not exist
    createEarningCallIndex(SearchService, SearchKey, earningIndexName)
    createEarningCallVectorIndex(SearchService, SearchKey, earningIndexVsName)
    if r.get_count() == 0:

        #Let's just use the latest earnings call transcript to create the documents that we want to use it 
        #for generative AI tasks
        try:
            if earningTranscriptDataSource == 'fmp':
                earningsCallData = getFmpEarningCalls(earningIndexName, symbol, earningTranscriptDataSource, earningQuarters)
            elif earningTranscriptDataSource == 'yahoo':
                earningsCallData = []
            for earningCall in earningsCallData:
                content = earningCall['content']
                callDate = earningCall['callDate']
                year = earningCall['year']
                quarter = earningCall['quarter']
                splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=1000)
                rawDocs = splitter.create_documents([content])
                docs = splitter.split_documents(rawDocs)
                logging.info("Number of documents chunks generated from Call transcript : " + str(len(docs)))
                # Check if we already have the data store, if not then create it
                indexEarningCallSections(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey,
                                        embeddingModelType, OpenAiEmbedding, earningIndexVsName, docs,
                                        callDate, symbol, year,
                                        quarter, earningTranscriptDataSource)
                
                logging.info("Completed earning call transcript indexing")
                earningCallQa = []
                
                promptTemplate = """You are an AI assistant tasked with summarizing financial information from earning call transcript. 
                    Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
                    Please generate a concise and comprehensive summary between 5-7 paragraphs on each of the following numbered topics.  Your response should include the topic as part of the summary.
                    1. Financial Results: Please provide a summary of the financial results.
                    2. Business Highlights: Please provide a summary of the business highlights.
                    3. Future Outlook: Please provide a summary of the future outlook.
                    4. Business Risks: Please provide a summary of the business risks.
                    5. Management Positive Sentiment: Please provide a summary of the what management is confident about.
                    6. Management Negative Sentiment: Please provide a summary of the what management is concerned about.
                    Please remember to use clear language and maintain the integrity of the original information without missing any important details:
                    {text}
                    """
                customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
                chainType = "map_reduce"
                summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=False, 
                                            combine_prompt=customPrompt)
                summaryOutput = summaryChain({"input_documents": docs}, return_only_outputs=True)
                output = summaryOutput['output_text']
                logging.info("Completed latest earning call transcript summarization")

                formattedOutput = output.splitlines()
                while("" in formattedOutput):
                    formattedOutput.remove("")
                for summary in formattedOutput:
                    splitSummary = summary.split(":")
                    try:
                        question = splitSummary[0]
                        answer = splitSummary[1]
                        earningCallQa.append({"question": question, "answer": answer})
                    except:
                        continue

                s2Data.append({
                            'id' : str(uuid.uuid4()),
                            'symbol': symbol,
                            'cik': cik,
                            'step': step,
                            'dataSource': earningTranscriptDataSource,
                            'description': 'Earning Call Q&A',
                            'insertedDate': today.strftime("%Y-%m-%d"),
                            'pibData' : str(earningCallQa)
                })

                promptTemplate = """You are an AI assistant tasked with summarizing financial information from earning call transcript. 
                Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
                Please generate a concise and comprehensive summary between 5-7 paragraphs and maintain the continuity.  
                Ensure your summary includes the key information from the transcript like future outlook, business risk, 
                management concerns.
                {text}
                    """
                customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
                logging.info("Starting latest earning call transcript summarization - Stuff or MapReduce")
                try:
                    chainType = "stuff"
                    summaryChain = load_summarize_chain(llm, chain_type=chainType, prompt=customPrompt)
                    summaryOutput = summaryChain({"input_documents": docs}, return_only_outputs=True)
                    output = summaryOutput['output_text']
                    logging.info("Completed latest earning call transcript summarization - Stuff")
                except:
                    chainType = "map_reduce"
                    summaryChain = load_summarize_chain(llm, chain_type=chainType, combine_prompt=customPrompt)
                    summaryOutput = summaryChain({"input_documents": docs}, return_only_outputs=True)
                    output = summaryOutput['output_text']
                    logging.info("Completed latest earning call transcript summarization - MapReduce")
                
                s2Data.append({
                            'id' : str(uuid.uuid4()),
                            'symbol': symbol,
                            'cik': cik,
                            'step': step,
                            'dataSource': earningTranscriptDataSource,
                            'description': 'Earning Call Summary',
                            'insertedDate': today.strftime("%Y-%m-%d"),
                            'pibData' : str([{"summary": output}])
                })

                mergeDocs(SearchService, SearchKey, pibIndexName, s2Data)
        except Exception as e:
            logging.info("Error in splitting the earning call transcript : ", e)
            return s2Data, content, latestCallDate
    else:
        logging.info('Found existing data')
        for s in r:
            s2Data.append(
                {
                    'id' : s['id'],
                    'symbol': s['symbol'],
                    'cik': s['cik'],
                    'step': s['step'],
                    'dataSource': s['dataSource'],
                    'description': s['description'],
                    'insertedDate': s['insertedDate'],
                    'pibData' : s['pibData']
                })
    return s2Data

In [16]:
earningTranscriptDataSource = "fmp"
earningQuarters = 3
s2Data = processStep2(pibIndexName, cik, "2", symbol, llm, today, embeddingModelType, earningTranscriptDataSource, earningQuarters)
s2Data

[{'id': '09ecb017-d26c-4838-b7dd-49ad426f369c',
  'symbol': 'AAPL',
  'cik': '320193',
  'step': '2',
  'dataSource': None,
  'description': 'Earning Call Summary',
  'insertedDate': '2023-10-09',
  'pibData': "[{'summary': 'Apple reported revenue of $117.2 billion for the December quarter, setting all-time revenue records in several markets. However, revenue was down 5% year-over-year due to foreign exchange headwinds, COVID-19-related challenges, and a challenging macroeconomic environment. The supply of iPhone 14 Pro and iPhone 14 Pro Max was significantly impacted, causing ship times to extend beyond expectations. Despite these challenges, iPhone revenue was roughly flat on a constant currency basis. Mac revenue was in line with expectations, while iPad revenue grew 30% due to a favorable compare and the launch of new models. Wearables, Home, and Accessories revenue was down 8% due to foreign exchange headwinds and the macroeconomic environment. Services revenue set an all-time rec

#### 3. Paid Data - Press Releases - Get the Press Releases for last year

In [19]:
def summarizePressReleases(llm, docs):
    promptTemplate = """You are an AI assistant tasked with summarizing company's press releases and performing sentiments on those. 
                Your summary should accurately capture the key information in the press-releases while avoiding the omission of any domain-specific words. 
                Please generate a concise and comprehensive summary and sentiment with score with range of 0 to 10. 
                Your response should be in JSON object with following keys.  All JSON properties are required.
                summary: 
                sentiment:
                sentiment score: 
                {text}
                """
    customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
    chainType = "stuff"
    summaryChain = load_summarize_chain(llm, chain_type=chainType, prompt=customPrompt)
    summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
    outputAnswer = summary['output_text']
    return outputAnswer

In [20]:
# For now we are calling API to get data, but otherwise we need to ensure the data is not persisted in our 
# index repository before calling again, if it is persisted then we need to delete it first
step = "3"
s3Data = []
r = findPibData(SearchService, SearchKey, pibIndexName, cik, step, returnFields=['id', 'symbol', 'cik', 'step', 'description', 'insertedDate',
                                                                   'pibData'])
if r.get_count() == 0:
    counter = 0
    pressReleasesList = []
    pressReleaseIndexName = 'pressreleases'
    # Create the index if it does not exist
    createPressReleaseIndex(SearchService, SearchKey, pressReleaseIndexName)
    print(f"Processing ticker : {symbol}")
    pr = pressReleases(apikey=apikey, symbol=symbol, limit=25)
    for pressRelease in pr:
        symbol = pressRelease['symbol']
        releaseDate = pressRelease['date']
        title = pressRelease['title']
        content = pressRelease['text']
        todayYmd = today.strftime("%Y-%m-%d")
        id = f"{symbol}-{counter}"
        pressReleasesList.append({
            "id": id,
            "symbol": symbol,
            "releaseDate": releaseDate,
            "title": title,
            "content": content,
        })
        counter = counter + 1

    mergeDocs(SearchService, SearchKey, pressReleaseIndexName, pressReleasesList)

    splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)
    rawPressReleasesDoc = [Document(page_content=t['content']) for t in pressReleasesList[:25]]
    pressReleasesDocs = splitter.split_documents(rawPressReleasesDoc)
    print("Number of documents chunks generated from Press releases : ", len(pressReleasesDocs))

    pressReleasesPib = []
    last25PressReleases = pressReleasesList[:25]
    i = 0
    for pDocs in pressReleasesDocs:
        try:
            outputAnswer = summarizePressReleases(llm, [pDocs])
            jsonStep = json.loads(outputAnswer)
            pressReleasesPib.append({
                    "releaseDate": last25PressReleases[i]['releaseDate'],
                    "title": last25PressReleases[i]['title'],
                    "summary": jsonStep['summary'],
                    "sentiment": jsonStep['sentiment'],
                    "sentimentScore": jsonStep['sentiment score']
            })
            i = i + 1
        except Exception as e:
            i = i + 1
            continue
    
    s3Data.append({
                'id' : str(uuid.uuid4()),
                'symbol': symbol,
                'cik': cik,
                'step': step,
                'description': 'Press Releases',
                'insertedDate': today.strftime("%Y-%m-%d"),
                'pibData' : str(pressReleasesPib)
        })
    mergeDocs(SearchService, SearchKey, pibIndexName, s3Data)
else:
    print('Found existing data')
    for s in r:
        s3Data.append(
            {
                'id' : s['id'],
                'symbol': s['symbol'],
                'cik': s['cik'],
                'step': s['step'],
                'description': s['description'],
                'insertedDate': s['insertedDate'],
                'pibData' : s['pibData']
            })

print(s3Data)

Found existing data
[{'id': '0c6ade48-d697-4362-8ea9-3cd4543b96f4', 'symbol': 'SMFG', 'cik': '1022837', 'step': '3', 'description': 'Press Releases', 'insertedDate': '2023-08-06', 'pibData': '[{\'releaseDate\': \'2023-04-27 06:30:00\', \'title\': "JEFFERIES AND SMBC EXPAND AND STRENGTHEN STRATEGIC ALLIANCE, BROADENING JOINT BUSINESS EFFORTS AND INCREASING SMBC\'S EQUITY OWNERSHIP IN JEFFERIES", \'summary\': \'Jefferies Financial Group and Sumitomo Mitsui Financial Group have expanded their strategic alliance to collaborate on future corporate and investment banking business opportunities, as well as in equity sales and trading.\', \'sentiment\': \'Positive\', \'sentimentScore\': 8.5}, {\'releaseDate\': \'2022-12-22 09:00:00\', \'title\': \'SMBC NIKKO SECURITIES AMERICA, INC. EXPANDS EQUITY EXECUTION SERVICES GROUP WITH SEVERAL NEW HIRES\', \'summary\': \'SMBC Nikko Securities America, Inc., a member of SMBC Group, has made several significant hires in its equity execution business to e

### 4. Paid Data - Get Stock News - Limit it to cover for current year

In [21]:
# # For now we are calling API to get data, but otherwise we need to ensure the data is not persisted in our 
# # index repository before calling again, if it is persisted then we need to delete it first
# counter = 0
# stockNewsList = []
# stockNewsIndexName = 'stocknews'
# # Create the index if it does not exist
# createStockNewsIndex(SearchService, SearchKey, stockNewsIndexName)
# print(f"Processing ticker : {symbol}")
# sn = stockNews(apikey=apikey, tickers=symbol, limit=5000)
# for news in sn:
#     symbol = news['symbol']
#     publishedDate = news['publishedDate']
#     title = news['title']
#     image = news['image']
#     site = news['site']
#     content = news['text']
#     url = news['url']
#     todayYmd = today.strftime("%Y-%m-%d")
#     id = f"{symbol}-{todayYmd}-{counter}"
#     stockNewsList.append({
#         "id": id,
#         "symbol": symbol,
#         "publishedDate": publishedDate,
#         "title": title,
#         "image": image,
#         "site": site,
#         "content": content,
#         "url": url,
#     })
#     counter = counter + 1
# mergeDocs(SearchService, SearchKey, stockNewsIndexName, stockNewsList)

In [22]:
# # Group our news by Date and summarize the content and sentimet per day
# stocksDf = pd.DataFrame.from_dict(pd.json_normalize(stockNewsList))
# stocksDf['publishedDate'] = pd.to_datetime(stocksDf['publishedDate']).dt.date
# stocksNewsDailyDf = stocksDf.sort_values('publishedDate').groupby('publishedDate')['content'].apply('\n'.join).reset_index()
# splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50)
# rawNewsDocs = [Document(page_content=row['content']) for index, row in stocksNewsDailyDf.tail(10).iterrows()]
# newsDocs = splitter.split_documents(rawNewsDocs)
# print("Number of documents chunks generated from Press releases : ", len(newsDocs))

# # With the data indexed, let's summarize the information
# promptTemplate = """You are an AI assistant tasked with summarizing news related to company and performing sentiments on those. 
#         Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
#         Please generate a concise and comprehensive summary and sentiment with score with range of 0 to 10. Your response should be in JSON format with following keys.
#         summary: 
#         sentiment:
#         sentiment score:
#         Please remember to use clear language and maintain the integrity of the original information without missing any important details.
#         {text}
#         """
# customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
# chainType = "map_reduce"
# summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
#                                     map_prompt=customPrompt, combine_prompt=customPrompt)
# summary = summaryChain({"input_documents": newsDocs}, return_only_outputs=True)
# outputAnswer = summary['output_text']
# print(outputAnswer)

In [23]:
# # For the chaintype of MapReduce and Refine, we can also get insight into intermediate steps of the pipeline.
# # This way you can inspect the results from map_reduce chain type, each top similar chunk summary
# intermediateSteps = summary['intermediate_steps']
# for step in intermediateSteps:
#         display(HTML("<b>Chunk Summary:</b> " + step))

#### 5. Public Data - Get the SEC Filings - Limit it to cover for last 3 year

In [24]:
filingType = "10-K"
secFilingsList = secFilings(apikey=apikey, symbol=symbol, filing_type=filingType)

In [269]:
latestFilingDateTime = datetime.strptime(secFilingsList[0]['fillingDate'], '%Y-%m-%d %H:%M:%S')
latestFilingDate = latestFilingDateTime.strftime("%Y-%m-%d")
filingYear = latestFilingDateTime.strftime("%Y")
filingMonth = int(latestFilingDateTime.strftime("%m"))

if filingMonth > 0 & filingMonth <= 3:
    filingQuarter = 1
elif filingMonth > 3 & filingMonth <= 6:
    filingQuarter = 2
elif filingMonth > 6 & filingMonth <= 9:
    filingQuarter = 3
else:
    filingQuarter = 4


secFilingIndexName = 'secdata'
secFilingList = []
dt = pd.to_datetime(datetime.now(), format='%Y/%m/%d')
dt1 = pd.to_datetime(latestFilingDate, format='%Y/%m/%d')
totalDays = (dt-dt1).days
if totalDays < 31:
    skipIndicies = False
else:
    skipIndicies = True
emptyBody = {
        "values": [
            {
                "recordId": 0,
                "data": {
                    "text": ""
                }
            }
        ]
}

secExtractBody = {
    "values": [
        {
            "recordId": 0,
            "data": {
                "text": {
                    "edgar_crawler": {
                        "start_year": int(filingYear),
                        "end_year": int(filingYear),
                        "quarters": [filingQuarter],
                        "filing_types": [
                            "10-K"
                        ],
                        "cik_tickers": [cik],
                        "user_agent": "Your name (your email)",
                        "raw_filings_folder": "RAW_FILINGS",
                        "indices_folder": "INDICES",
                        "filings_metadata_file": "FILINGS_METADATA.csv",
                        "skip_present_indices": skipIndicies,
                    },
                    "extract_items": {
                        "raw_filings_folder": "RAW_FILINGS",
                        "extracted_filings_folder": "EXTRACTED_FILINGS",
                        "filings_metadata_file": "FILINGS_METADATA.csv",
                        "items_to_extract": ["1","1A","1B","2","3","4","5","6","7","7A","8","9","9A","9B","10","11","12","13","14","15"],
                        "remove_tables": False,
                        "skip_extracted_filings": False
                    }
                }
            }
        }
    ]
}

# Check if we have already processed the latest filing, if yes then skip
createSecFilingIndex(SearchService, SearchKey, secFilingIndexName)
r = findSecFiling(SearchService, SearchKey, secFilingIndexName, cik, filingType, latestFilingDate, returnFields=['id', 'cik', 'company', 'filingType', 'filingDate',
                                                                                                                 'periodOfReport', 'sic', 'stateOfInc', 'fiscalYearEnd',
                                                                                                                 'filingHtmlIndex', 'htmFilingLink', 'completeTextFilingLink',
                                                                                                                 'item1', 'item1A', 'item1B', 'item2', 'item3', 'item4', 'item5',
                                                                                                                 'item6', 'item7', 'item7A', 'item8', 'item9', 'item9A', 'item9B',
                                                                                                                 'item10', 'item11', 'item12', 'item13', 'item14', 'item15',
                                                                                                                 'sourcefile'])
if r.get_count() == 0:
    # Call Azure Function to perform Web-scraping and store the JSON in our blob
    secExtract = requests.post(SecExtractionUrl, json = secExtractBody)
    # Once the JSON is created, call the function to process the JSON and store the data in our index
    docPersistUrl = SecDocPersistUrl + "&indexType=cogsearchvs&indexName=" + secFilingIndexName + "&embeddingModelType=" + embeddingModelType
    secPersist = requests.post(docPersistUrl, json = emptyBody)
    r = findSecFiling(SearchService, SearchKey, secFilingIndexName, cik, filingType, latestFilingDate, returnFields=['id', 'cik', 'company', 'filingType', 'filingDate',
                                                                                                                 'periodOfReport', 'sic', 'stateOfInc', 'fiscalYearEnd',
                                                                                                                 'filingHtmlIndex', 'htmFilingLink', 'completeTextFilingLink',
                                                                                                                 'item1', 'item1A', 'item1B', 'item2', 'item3', 'item4', 'item5',
                                                                                                                 'item6', 'item7', 'item7A', 'item8', 'item9', 'item9A', 'item9B',
                                                                                                                 'item10', 'item11', 'item12', 'item13', 'item14', 'item15',
                                                                                                                 'sourcefile'])

lastSecData = ''
# Retrieve the latest filing from our index
for filing in r:
    lastSecData = filing['item1'] + '\n' + filing['item1A'] + '\n' + filing['item1B'] + '\n' + filing['item2'] + '\n' + filing['item3'] + '\n' + filing['item4'] + '\n' + \
                filing['item5'] + '\n' + filing['item6'] + '\n' + filing['item7'] + '\n' + filing['item7A'] + '\n' + filing['item8'] + '\n' + \
                filing['item9'] + '\n' + filing['item9A'] + '\n' + filing['item9B'] + '\n' + filing['item10'] + '\n' + filing['item11'] + '\n' + filing['item12'] + '\n' + \
                filing['item13'] + '\n' + filing['item14'] + '\n' + filing['item15']

    secFilingList.append({
        "id": filing['id'],
        "cik": filing['cik'],
        "company": filing['company'],
        "filingType": filing['filingType'],
        "filingDate": filing['filingDate'],
        "periodOfReport": filing['periodOfReport'],
        "sic": filing['sic'],
        "stateOfInc": filing['stateOfInc'],
        "fiscalYearEnd": filing['fiscalYearEnd'],
        "filingHtmlIndex": filing['filingHtmlIndex'],
        "completeTextFilingLink": filing['completeTextFilingLink'],
        "item1": filing['item1'],
        "item1A": filing['item1A'],
        "item1B": filing['item1B'],
        "item2": filing['item2'],
        "item3": filing['item3'],
        "item4": filing['item4'],
        "item5": filing['item5'],
        "item6": filing['item6'],
        "item7": filing['item7'],
        "item7A": filing['item7A'],
        "item8": filing['item8'],
        "item9": filing['item9'],
        "item9A": filing['item9A'],
        "item9B": filing['item9B'],
        "item10": filing['item10'],
        "item11": filing['item11'],
        "item12": filing['item12'],
        "item13": filing['item13'],
        "item14": filing['item14'],
        "item15": filing['item15'],
        "sourcefile": filing['sourcefile']
    })

# Check if we have already processed the latest filing, if yes then skip
secFilingsVectorIndexName = 'latestsecfilings'
createSecFilingsVectorIndex(SearchService, SearchKey, secFilingsVectorIndexName)
r = findLatestSecFilings(SearchService, SearchKey, secFilingsVectorIndexName, cik, symbol, latestFilingDate, filingType, returnFields=['id', 'cik', 'symbol', 'latestFilingDate', 'filingType',
                                                                                                                 'content'])
if r.get_count() == 0:
    print("Processing latest SEC Filings for CIK : ", cik, " and Symbol : ", symbol)
    splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=1000)
    rawDocs = splitter.create_documents([lastSecData])
    docs = splitter.split_documents(rawDocs)
    print("Number of documents chunks generated from Last SEC Filings : ", len(docs))

    # Store the last index of the earning call transcript in vector Index
    indexSecFilingsSections(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey,
                         embeddingModelType, OpenAiEmbedding, secFilingsVectorIndexName, docs, cik,
                         symbol, latestFilingDate, filingType)
else:
    print("Latest SEC Filings for CIK : ", cik, " and Symbol : ", symbol, " already processed")


Search index latestsecfilings already exists
Latest SEC Filings for CIK :  73124  and Symbol :  NTRS  already processed


In [None]:
item8 = secFilingList[0]['item8']
splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=0)
rawItemDocs8 = [Document(page_content=item8, metadata={'source': ''})]
itemDocs8 = splitter.split_documents(rawItemDocs8)

In [283]:
def getItem8Answer(llm, docs, question, chainType):
    template = """
            You are an AI assistant tasked with answering questions from financial statements like income statement, cashflow and balance sheets. 
            The data that you are presented with is in table.
            Your answer should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
            Please generate a concise and comprehensive information that includes details such as reporting year and amount in millions.
            Ensure that it is easy to understand for business professionals and provides an accurate representation of the financial statement history. 
            
            Please remember to use clear language and maintain the integrity of the original information without missing any important details

            QUESTION: {question}
            =========
            {summaries}
            =========
            """
    qaPrompt = PromptTemplate(template=template, input_variables=["summaries", "question"])
    qaChain = load_qa_with_sources_chain(llm, chain_type=chainType, combine_prompt=qaPrompt)
    answer = qaChain({"input_documents": docs, "question": question})
    outputAnswer = answer['output_text']
    return outputAnswer

In [284]:
print(getItem8Answer(llm, itemDocs8, "What was the reported revenue for 2021?", "map_reduce"))

The reported revenue for 2021 is not provided in the given portion of the document.
SOURCES:


In [29]:
def generateSummaries(docs):
    chainType = "map_reduce"
    summaryChain = load_summarize_chain(llm, chain_type=chainType)
    summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
    return summary

In [30]:
step = "4"
s4Data = []

r = findPibData(SearchService, SearchKey, pibIndexName, cik, step, returnFields=['id', 'symbol', 'cik', 'step', 'description', 'insertedDate',
                                                                   'pibData'])

if r.get_count() == 0:
        secFilingsPib = []

        # For different section of extracted data, process summarization and generate common answers to questions
        splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=0)

        # Item 1 - Describes the business of the company
        rawItemDocs1 = [Document(page_content=secFilingList[0]['item1'])]
        itemDocs1 = splitter.split_documents(rawItemDocs1)
        print("Number of documents chunks generated from Item1 : ", len(itemDocs1))
        summary1 = generateSummaries(itemDocs1)
        outputAnswer1 = summary1['output_text']
        secFilingsPib.append({
                        "section": "item1",
                        "summaryType": "Business Description",
                        "summary": outputAnswer1
                })

        # Item 1A - Risk Factors
        rawItemDocs2 = [Document(page_content=secFilingList[0]['item1A'])]
        itemDocs2 = splitter.split_documents(rawItemDocs2)
        print("Number of documents chunks generated from Item1A : ", len(itemDocs2))
        summary2 = generateSummaries(itemDocs2)
        outputAnswer2 = summary2['output_text']
        secFilingsPib.append({
                        "section": "item1A",
                        "summaryType": "Risk Factors",
                        "summary": outputAnswer2
                })

        rawItemDocs2 = [Document(page_content=secFilingList[0]['item3'])]
        itemDocs2 = splitter.split_documents(rawItemDocs2)
        print("Number of documents chunks generated from Item3 : ", len(itemDocs2))
        summary2 = generateSummaries(itemDocs2)
        outputAnswer2 = summary2['output_text']
        secFilingsPib.append({
                        "section": "item3",
                        "summaryType": "Legal Proceedings",
                        "summary": outputAnswer2
                })

        # Item 6 - Consolidated Financial Data
        rawItemDocs3 = [Document(page_content=secFilingList[0]['item5'])]
        itemDocs3 = splitter.split_documents(rawItemDocs3)
        print("Number of documents chunks generated from Item5 : ", len(itemDocs3))
        summary3 = generateSummaries(itemDocs3)
        outputAnswer3 = summary3['output_text']
        secFilingsPib.append({
                        "section": "item5",
                        "summaryType": "Market",
                        "summary": outputAnswer3
                })

        # Item 7 - Management's Discussion and Analysis of Financial Condition and Results of Operations
        rawItemDocs4 = [Document(page_content=secFilingList[0]['item7'])]
        itemDocs4 = splitter.split_documents(rawItemDocs4)
        print("Number of documents chunks generated from Item7 : ", len(itemDocs4))
        summary4 = generateSummaries(itemDocs4)
        outputAnswer4 = summary4['output_text']
        secFilingsPib.append({
                        "section": "item7",
                        "summaryType": "Management Discussion",
                        "summary": outputAnswer4
                })

        # Item 7a - Market risk disclosures
        rawItemDocs5 = [Document(page_content=secFilingList[0]['item7A'])]
        itemDocs5= splitter.split_documents(rawItemDocs5)
        print("Number of documents chunks generated from Item7A : ", len(itemDocs5))
        summary5 = generateSummaries(itemDocs5)
        outputAnswer5 = summary5['output_text']
        secFilingsPib.append({
                        "section": "item7A",
                        "summaryType": "Risk Disclosures",
                        "summary": outputAnswer5
                })

        # Item 9 - Disagreements with accountants and changes in accounting
        section9 = secFilingList[0]['item9'] + "\n " + secFilingList[0]['item9A'] + "\n " + secFilingList[0]['item9B']
        rawItemDocs6 = [Document(page_content=section9)]
        itemDocs6 = splitter.split_documents(rawItemDocs6)
        print("Number of documents chunks generated from Item9 : ", len(itemDocs6))
        summary6 = generateSummaries(itemDocs6)
        outputAnswer6 = summary6['output_text']
        secFilingsPib.append({
                        "section": "item9",
                        "summaryType": "Accounting Disclosures",
                        "summary": outputAnswer6
                })
        
        s4Data.append({
                'id' : str(uuid.uuid4()),
                'symbol': symbol,
                'cik': cik,
                'step': step,
                'description': 'SEC Filings',
                'insertedDate': today.strftime("%Y-%m-%d"),
                'pibData' : str(secFilingsPib)
        })
        mergeDocs(SearchService, SearchKey, pibIndexName, s4Data)
else:
        print("Step 4 data already exists in the index")
        for item in r:
                s4Data.append({
                        'id' : item['id'],
                        'symbol': item['symbol'],
                        'cik': item['cik'],
                        'step': item['step'],
                        'description': item['description'],
                        'insertedDate': item['insertedDate'],
                        'pibData' : item['pibData']
                })

Step 4 data already exists in the index


#### 6. Private Data - Equity Research Reports

In [31]:
# from azure.search.documents import SearchClient
# from azure.core.credentials import AzureKeyCredential
# step = "5"
# searchClient = SearchClient(endpoint=f"https://{SearchService}.search.windows.net",
#         index_name=pibIndexName,
#         credential=AzureKeyCredential(SearchKey))
# r = searchClient.search(  
#     search_text="",
#     filter="cik eq '" + cik + "' and step eq '" + step + "'",
#     select=["id"],
#     semantic_configuration_name="semanticConfig",
#     include_total_count=True
# )
# if r.get_count() > 0:
#     for doc in r:
#         searchClient.delete_documents(doc)

In [32]:
step = "5"
s5Data = []
r = findPibData(SearchService, SearchKey, pibIndexName, cik, step, returnFields=['id', 'symbol', 'cik', 'step', 'description', 'insertedDate',
                                                                   'pibData'])

if r.get_count() == 0:
    companyRating = rating(apikey=apikey, symbol=symbol)
    fScore = financialScore(apikey=apikey, symbol=symbol)
    esgScores = esgScore(apikey=apikey, symbol=symbol)
    esgRating = esgRatings(apikey=apikey, symbol=symbol)
    ugConsensus = upgradeDowngrades(apikey=apikey, symbol=symbol)
    priceConsensus = priceTarget(apikey=apikey, symbol=symbol)
    #ratingsDf = pd.DataFrame.from_dict(pd.json_normalize(companyRating))
    researchReport = []

    try:
        researchReport.append({
            "key": "Overall Recommendation",
            "value": companyRating[0]['ratingRecommendation']
        })
        researchReport.append({
            "key": "DCF Recommendation",
            "value": companyRating[0]['ratingDetailsDCFRecommendation']
        })
        researchReport.append({
            "key": "ROE Recommendation",
            "value": companyRating[0]['ratingDetailsROERecommendation']
        })
        researchReport.append({
            "key": "ROA Recommendation",
            "value": companyRating[0]['ratingDetailsROARecommendation']
        })
        researchReport.append({
            "key": "PB Recommendation",
            "value": companyRating[0]['ratingDetailsPBRecommendation']
        })
        researchReport.append({
            "key": "PE Recommendation",
            "value": companyRating[0]['ratingDetailsPERecommendation']
        })
    except:
        logging.info('No data found for companyRating')
        pass

    try:
        researchReport.append({
            "key": "Altman ZScore",
            "value": fScore[0]['altmanZScore']
        })
        researchReport.append({
            "key": "Piotroski Score",
            "value": fScore[0]['piotroskiScore']
        })
    except:
        logging.info('No data found for fScore')
        pass

    try:
        researchReport.append({
            "key": "Environmental Score",
            "value": esgScores[0]['environmentalScore']
        })
        researchReport.append({
            "key": "Social Score",
            "value": esgScores[0]['socialScore']
        })
        researchReport.append({
            "key": "Governance Score",
            "value": esgScores[0]['governanceScore']
        })
        researchReport.append({
            "key": "ESG Score",
            "value": esgScores[0]['ESGScore']
        })
    except:
        logging.info('No data found for esgScores')
        pass

    try:
        researchReport.append({
            "key": "ESG RIsk Rating",
            "value": esgRating[0]['ESGRiskRating']
        })
    except:
        logging.info('No data found for esgRating')
        pass

    try:
        researchReport.append({
            "key": "Analyst Consensus Buy",
            "value": ugConsensus[0]['buy']
        })
        researchReport.append({
            "key": "Analyst Consensus Sell",
            "value": ugConsensus[0]['sell']
        })
        researchReport.append({
            "key": "Analyst Consensus Strong Buy",
            "value": ugConsensus[0]['strongBuy']
        })
        researchReport.append({
            "key": "Analyst Consensus Strong Sell",
            "value": ugConsensus[0]['strongSell']
        })
        researchReport.append({
            "key": "Analyst Consensus Hold",
            "value": ugConsensus[0]['hold']
        })
        researchReport.append({
            "key": "Analyst Consensus",
            "value": ugConsensus[0]['consensus']
        })
    except:
        logging.info('No data found for ugConsensus')
        pass
    # researchReport.append({
    #     "key": "Price Target Consensus",
    #     "value": priceConsensus[0]['targetConsensus']
    # })
    # researchReport.append({
    #     "key": "Price Target Median",
    #     "value": priceConsensus[0]['targetMedian']
    # })
  
    s5Data.append({
                'id' : str(uuid.uuid4()),
                'symbol': symbol,
                'cik': cik,
                'step': step,
                'description': 'Research Report',
                'insertedDate': today.strftime("%Y-%m-%d"),
                'pibData' : str(researchReport)
        })
    mergeDocs(SearchService, SearchKey, pibIndexName, s5Data)
else:
    for s in r:
        s5Data.append(
            {
                'id' : s['id'],
                'symbol': s['symbol'],
                'cik': s['cik'],
                'step': s['step'],
                'description': s['description'],
                'insertedDate': s['insertedDate'],
                'pibData' : s['pibData']
            })

#### 7. Paid Data - Investor Presentations - Financial Reports (Balance Sheet, Income Statement and Cash Flow) for last 3 years?

In [285]:
incomeStatement = incomeStatement(apikey=apikey, symbol=symbol, limit=10)
print(incomeStatement)

[{'date': '2022-12-31', 'symbol': 'NTRS', 'reportedCurrency': 'USD', 'cik': '0000073124', 'fillingDate': '2023-02-28', 'acceptedDate': '2023-02-28 16:20:50', 'calendarYear': '2022', 'period': 'FY', 'revenue': 6761200000, 'costOfRevenue': 0, 'grossProfit': 6761200000, 'grossProfitRatio': 1, 'researchAndDevelopmentExpenses': 0, 'generalAndAdministrativeExpenses': 2685400000, 'sellingAndMarketingExpenses': 76700000, 'sellingGeneralAndAdministrativeExpenses': 2685400000, 'otherExpenses': 0, 'operatingExpenses': 0, 'costAndExpenses': 0, 'interestIncome': 2877700000, 'interestExpense': 990500000, 'depreciationAndAmortization': 553600000, 'ebitda': 2816000000, 'ebitdaratio': 0.4164941135, 'operatingIncome': 2262400000, 'operatingIncomeRatio': 0.3346151571, 'totalOtherIncomeExpensesNet': -496100000, 'incomeBeforeTax': 1766300000, 'incomeBeforeTaxRatio': 0.2612406082, 'incomeTaxExpense': 430300000, 'netIncome': 1336000000, 'netIncomeRatio': 0.1975980595, 'eps': 6.16, 'epsdiluted': 6.14, 'weight

In [293]:
incomeStmtDf = pd.read_json(json.dumps(incomeStatement))
incomeStmtDf = incomeStmtDf[['date', 'symbol', 'fillingDate',
       'acceptedDate', 'calendarYear', 'period', 'revenue', 'costOfRevenue',
       'grossProfit', 'grossProfitRatio', 'researchAndDevelopmentExpenses',
       'generalAndAdministrativeExpenses', 'sellingAndMarketingExpenses',
       'sellingGeneralAndAdministrativeExpenses', 'otherExpenses',
       'operatingExpenses', 'costAndExpenses', 'interestIncome',
       'interestExpense', 'depreciationAndAmortization', 'ebitda',
       'ebitdaratio', 'operatingIncome', 'operatingIncomeRatio',
       'totalOtherIncomeExpensesNet', 'incomeBeforeTax',
       'incomeBeforeTaxRatio', 'incomeTaxExpense', 'netIncome',
       'netIncomeRatio', 'eps', 'epsdiluted', 'weightedAverageShsOut',
       'weightedAverageShsOutDil']]
incomeStmtDf.columns = [['date', 'symbol', 'Filing Date',
       'Accepted Date', 'Calendar Year', 'Period', 'Revenue', 'Cost of Revenue',
       'Gross Profit', 'Gross Profit Ratio', 'R&D Expenses',
       'General Adminstrative Expenses', 'Selling and Marketing Expenses',
       'Selling General Administrative Expenses', 'Other Expenses',
       'Operating Expenses', 'Cost and Expenses', 'Interest Income',
       'Interest Expense', 'Depreciation & Amortization', 'EBITDA',
       'EBITDA Ratio', 'Operating Income', 'Operating Income Ratio',
       'Total Other Net Expenses', 'Income Before Tax',
       'Income Before Tax Ratio', 'Income Tax Expenses', 'Net Income',
       'Net Income Ratio', 'EPS', 'Diluted EPS', 'Outstanding Shares',
       'Diluted Outstanding Shares']]
incomeStmtDf

Unnamed: 0,date,symbol,Filing Date,Accepted Date,Calendar Year,Period,Revenue,Cost of Revenue,Gross Profit,Gross Profit Ratio,...,Total Other Net Expenses,Income Before Tax,Income Before Tax Ratio,Income Tax Expenses,Net Income,Net Income Ratio,EPS,Diluted EPS,Outstanding Shares,Diluted Outstanding Shares
0,2022-12-31,NTRS,2023-02-28,2023-02-28 16:20:50,2022,FY,6761200000,0,6761200000,1,...,-496100000,1766300000,0.261241,430300000,1336000000,0.197598,6.16,6.14,208309331,208867264
1,2021-12-31,NTRS,2022-02-28,2022-02-28 16:35:35,2021,FY,6464500000,0,6464500000,1,...,0,2010100000,0.310944,464800000,1545300000,0.239044,7.16,7.14,208076000,208899000
2,2020-12-31,NTRS,2021-02-23,2021-02-23 16:51:30,2020,FY,6100800000,0,6100800000,1,...,0,1627600000,0.266785,418300000,1209300000,0.19822,5.48,5.46,208319412,209007986
3,2019-12-31,NTRS,2020-02-25,2020-02-25 17:06:15,2019,FY,6073100000,0,6073100000,1,...,0,1944100000,0.320117,451900000,1492200000,0.245706,6.66,6.63,214525547,215601149
4,2018-12-31,NTRS,2019-02-26,2019-02-26 16:37:47,2018,FY,5220800000,0,5220800000,1,...,0,1957800000,0.375,401400000,1556400000,0.298115,6.68,6.64,223148000,224488000
5,2017-12-31,NTRS,2018-02-27,2018-02-27 17:02:44,2017,FY,4706900000,0,4706900000,1,...,0,1633900000,0.347129,434900000,1199000000,0.254732,4.95,4.92,228258000,229654000
6,2016-12-31,NTRS,2017-02-28,2017-02-27 21:25:26,2016,FY,4334700000,0,4334700000,1,...,0,1517100000,0.34999,484600000,1032500000,0.238194,4.35,4.32,227581000,229151000
7,2015-12-31,NTRS,2016-02-29,2016-02-29 16:57:00,2015,FY,4106900000,0,4106900000,1,...,0,1465000000,0.356717,491200000,973800000,0.237113,4.03,3.99,232280000,234222000
8,2014-12-31,NTRS,2015-02-26,2015-02-26 17:04:56,2014,FY,3756600000,0,3756600000,1,...,0,1190200000,0.316829,378400000,811800000,0.2161,3.34,3.32,235830000,237720000
9,2013-12-31,NTRS,2014-02-26,2014-02-26 16:43:54,2013,FY,3525200000,0,3525200000,1,...,0,1075500000,0.305089,344200000,731300000,0.207449,3.01,2.99,239265000,240555000


In [295]:
# deletePibData(SearchService, SearchKey, pibIndexName, "51143", "1", returnFields=['id', 'symbol', 'cik', 'step', 'description', 'insertedDate',
#                                                                       'pibData'])

In [296]:
CSV_PROMPT_PREFIX = """
First set the pandas display options to show all the columns, get the column names, then answer the question.
"""

CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the Final Answer, try another method. Then reflect on the answers of the two methods you did and ask yourself if it answers correctly the original question. If you are not sure, try another method.
- If the methods tried do not give the same result, reflect and try again until you have two methods that have the same result. 
- If you still cannot arrive to a consistent result, say that you are not sure of the answer.
- If you are sure of the correct answer, create a beautiful and thorough response using Markdown.
- **DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE, ONLY USE THE RESULTS OF THE CALCULATIONS YOU HAVE DONE**. 
- **ALWAYS**, as part of your "Final Answer", explain how you got to the answer on a section that starts with: "\n\nExplanation:\n". In the explanation, mention the column names that you used to get to the final answer. 
"""


In [326]:
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain.tools.python.tool import PythonREPLTool
from langchain.python import PythonREPL

agent = create_pandas_dataframe_agent(llm,df=incomeStmtDf, verbose=True)

Question = 'What is the total revenue for the year 2022?'
#Question = 'How much did revenue increased in 2022 in comparision to 2021?'
response = agent.run(Question) 

print(response)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mThought: I need to filter the dataframe to only include rows where the 'Calendar Year' column is 2022, and then sum the 'Revenue' column.

Action: python_repl_ast
Action Input: df[df[('Calendar Year',)] == 2022]['Revenue'].sum()[0m
Observation: [36;1m[1;3mRevenue    6761200000
dtype: int64[0m
Thought:[32;1m[1;3mThe total revenue for the year 2022 is 6,761,200,000.

Final Answer: 6,761,200,000[0m

[1m> Finished chain.[0m
6,761,200,000
