# Step 1: Initial Setup

First, we install the libraries the pipeline depends on: `openai`, `pinecone-client`, `langchain`, `tiktoken`, and `pypdf`. Together they provide language model integration, vector storage, tokenization, and PDF processing for our RAG pipeline.

In [1]:
!pip install openai
!pip install pinecone-client
!pip install langchain
!pip install tiktoken
!pip install pypdf
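
After installing, it can help to confirm that each package is importable before moving on. The following is a minimal sketch using only the standard library; the list of import names is an assumption based on the packages above (note that `pinecone-client` is imported as `pinecone`), and `missing_modules` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above.
REQUIRED_MODULES = ["openai", "pinecone", "langchain", "tiktoken", "pypdf"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-run the corresponding `pip install` line before continuing.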


# Step 2: Importing Text Processing Modules

In this step, we import modules from `langchain` for handling text and documents:

- `RecursiveCharacterTextSplitter`: This class splits text into manageable chunks. We initialize it with a `chunk_size` of 1000 and a `chunk_overlap` of 0. It prefers to split on natural boundaries (paragraphs, then lines, then words), so each chunk tends to keep related text together. With an overlap of 0, adjacent chunks share no text.

- `PyPDFLoader`: This class loads PDF documents so that we can extract their text for further processing.

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
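
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that cuts text into fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class splits on separators first); `sliding_window_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap=0):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
print(sliding_window_chunks(sample, chunk_size=4, chunk_overlap=0))
print(sliding_window_chunks(sample, chunk_size=4, chunk_overlap=2))
```

With no overlap, a sentence that straddles a chunk boundary is cut in two. A small overlap repeats some text at the boundary, at the cost of storing more chunks.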

# Step 3: Loading and Processing Financial Reports

## AXSM and Johnson & Johnson Financial Reports
We begin by loading the 10-Q filings for Axsome Therapeutics (AXSM) and Johnson & Johnson (JNJ). Given the size of these documents, this process may take a minute or two.


In [3]:
# Load $AXSM's financial report. This may take 1-2 minutes since the PDF is large
axsm_10Q = "https://app.quotemedia.com/data/downloadFiling?webmasterId=90423&ref=317845829&type=PDF&symbol=AXSM&cdn=758892bcc180a6f1fb1b85181dfa0d06&companyName=Axsome+Therapeutics+Inc.&formType=10-Q&formDescription=General+form+for+quarterly+reports+under+Section+13+or+15%28d%29&dateFiled=2023-11-06"

# Create your PDF loader
loader = PyPDFLoader(axsm_10Q)

# Load the PDF document
axsm_documents = loader.load()

# Chunk the financial report
docs = text_splitter.split_documents(axsm_documents)
axsm_texts = [d.page_content for d in docs]

In [5]:
# Load $JNJ's financial report. This may take 1-2 minutes since the PDF is large
jnj_10Q = "https://d18rn0p25nwr6d.cloudfront.net/CIK-0000200406/a0c68a93-e699-45d6-b3e7-311e8f9c43bb.pdf"

# Create your PDF loader
loader = PyPDFLoader(jnj_10Q)

# Load the PDF document
jnj_documents = loader.load()

# Chunk the financial report
docs = text_splitter.split_documents(jnj_documents)
jnj_texts = [d.page_content for d in docs]

# Step 4: Setting Up the Vector Store

In this step, we set up our vector store using Pinecone and OpenAI embeddings. Pinecone is a managed vector database, and OpenAI embeddings turn each text chunk into a numeric vector.

Storing those vectors in Pinecone lets the pipeline look up the chunks most similar to a question. The Q&A chain later passes those chunks to the language model as context.
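
As a rough illustration of what similarity-based retrieval means, the sketch below ranks a few texts against a query vector using cosine similarity. The three-dimensional vectors are made up by hand, and cosine similarity is assumed as the metric; real embeddings have many more dimensions, and the metric used by Pinecone depends on how the index was created.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, items, k=2):
    """Return the k texts whose vectors are most similar to query_vector.

    items is a list of (text, vector) pairs.
    """
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-made toy vectors, not real embeddings.
toy_items = [
    ("Revenue grew this quarter.", [0.9, 0.1, 0.0]),
    ("The company leases office space.", [0.1, 0.9, 0.0]),
    ("Net product sales increased.", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], toy_items, k=2))
```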


In [4]:
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone



In [None]:
# The environment should be the one specified next to the API key
# in your Pinecone console
# Go to this link to create Pinecone index: https://app.pinecone.io/organizations/-Nlee1f4ZjpOS7ERdM6k/projects/gcp-starter:itrgkgp/indexes
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENVIRONMENT")
index = pinecone.Index("YOUR_PINECONE_INDEX")
openai_api_key = 'YOUR_OPENAI_API_KEY'
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectorstore = Pinecone(index, embeddings, "text")
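
Keys pasted directly into a notebook are easy to leak when the notebook is shared. One possible alternative is to read them from environment variables. This is a minimal sketch; `require_env` is a hypothetical helper, and the variable names `PINECONE_API_KEY` and `OPENAI_API_KEY` are assumptions that should match however your environment is configured.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (assumed variable names), left commented out so the cell
# does not fail when the variables are not defined:
# pinecone_api_key = require_env("PINECONE_API_KEY")
# openai_api_key = require_env("OPENAI_API_KEY")
```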

# Step 5: Adding SEC Filings to the Vector Store

Now that our vector store is set up, the next step is to add our SEC filings to it. This populates the store with the text chunks from the AXSM and JNJ financial reports.

We add each company's chunks under its own namespace. A query can then be restricted to one company's filing, which keeps answers about AXSM from drawing on JNJ's report and vice versa.


In [None]:
# Store each company's chunks under its own namespace
vectorstore.add_texts(axsm_texts, namespace="AXSM")
vectorstore.add_texts(jnj_texts, namespace="JNJ")
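
To make the effect of namespaces concrete, here is a toy in-memory stand-in that keeps texts in separate buckets and only searches the bucket it is asked about. `NamespacedStore` is a hypothetical class for illustration; it uses keyword matching rather than vector similarity and is not how Pinecone is implemented.

```python
from collections import defaultdict


class NamespacedStore:
    """Toy store that partitions texts by namespace."""

    def __init__(self):
        self._texts = defaultdict(list)

    def add_texts(self, texts, namespace):
        """Append texts to the bucket for the given namespace."""
        self._texts[namespace].extend(texts)

    def search(self, keyword, namespace):
        """Return texts in one namespace that contain keyword (case-insensitive)."""
        bucket = self._texts.get(namespace, [])
        return [text for text in bucket if keyword.lower() in text.lower()]


store = NamespacedStore()
store.add_texts(["AXSM revenue text", "AXSM risk factors"], namespace="AXSM")
store.add_texts(["JNJ revenue text"], namespace="JNJ")
print(store.search("revenue", namespace="AXSM"))
```

The same keyword matches text in both buckets, but the search only sees the namespace it was given.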

# Step 6: Create Document Q&A Chain With LLM Framework

In this phase, we establish a basic question-answering chain leveraging LangChain's capabilities. The setup involves importing modules for chat models, embeddings, prompt templates, output parsing, and runnables. We initialize the `ChatOpenAI` model with an API key to access OpenAI's language model. A `ChatPromptTemplate` is crafted to define the structure for the question-answering interaction.

The `vectorstore` acts as a retriever, fetching relevant content from our document collection. We enhance the retriever's functionality by configuring a `search_kwargs` field, enabling tailored search parameters. Finally, the Q&A chain is formed by sequentially linking the retriever, prompt template, model, and a string output parser. This configuration allows for efficient processing and response generation based on the context extracted from the vector store.


In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import (
    ConfigurableField,
    RunnableBinding,
    RunnableLambda,
    RunnablePassthrough,
)

In [None]:
# This is a basic question-answering chain setup.
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(openai_api_key=openai_api_key)

retriever = vectorstore.as_retriever()

In [None]:
# Mark the retriever's search_kwargs as a configurable field. All vector store
# retrievers have search_kwargs: a dictionary of vector-store-specific options.
configurable_retriever = retriever.configurable_fields(
    search_kwargs=ConfigurableField(
        id="search_kwargs",
        name="Search Kwargs",
        description="The search kwargs to use",
    )
)

In [None]:
# Create the chain
chain = (
    {"context": configurable_retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
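
The data flow through this chain can be sketched with plain functions: retrieve context, fill the prompt, call the model, and extract the text. Everything here is a stand-in written for illustration (`fake_retriever`, `fake_model`, and the other names are hypothetical); no LangChain or OpenAI calls are made.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}
Question: {question}
"""


def fake_retriever(question):
    """Stand-in for the vector store retriever: returns fixed context chunks."""
    return ["Chunk one of the filing.", "Chunk two of the filing."]


def format_prompt(inputs):
    """Fill the template with the retrieved context and the question."""
    context = "\n".join(inputs["context"])
    return TEMPLATE.format(context=context, question=inputs["question"])


def fake_model(prompt_text):
    """Stand-in for the chat model: echoes the last line of the prompt."""
    return {"content": "echo: " + prompt_text.splitlines()[-1]}


def parse_output(message):
    """Stand-in for the string output parser: extract the text content."""
    return message["content"]


def run_chain(question):
    """Apply the four stages in the same order as the piped chain."""
    inputs = {"context": fake_retriever(question), "question": question}
    return parse_output(fake_model(format_prompt(inputs)))


print(run_chain("What was revenue?"))
```

Each stage's output is the next stage's input, which is the same relationship the `|` operator expresses in the chain definition.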

# Step 7: Document Q&A Queries By Company

Finally, we utilize our Q&A chain to perform specific queries on the financial reports of companies like AXSM and JNJ. By invoking the `chain` with targeted questions and configuring the `search_kwargs`, we can retrieve information related to specific companies' financial data.

- For Axsome Therapeutics Inc. (AXSM), we ask about their revenue in July 2023. This is done by setting the `namespace` in `search_kwargs` to "AXSM".

- Similarly, for Johnson & Johnson (JNJ), we inquire about their revenue in September 2023, with the `namespace` set to "JNJ".

These queries demonstrate the pipeline's ability to extract precise financial information from the SEC filings of different companies.


In [None]:
chain.invoke(
    "What was revenue in July 2023?",
    config={"configurable": {"search_kwargs": {"namespace": "AXSM"}}},
)

In [None]:
chain.invoke(
    "What was revenue in September 2023?",
    config={"configurable": {"search_kwargs": {"namespace": "JNJ"}}},
)
)
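
Because the two queries differ only in the namespace, the nested `config` dictionary can be built by a small helper. This is a sketch; `namespace_config` is a hypothetical function, and the optional `k` (the number of chunks to retrieve) is included on the assumption that you may want to tune it.

```python
def namespace_config(namespace, k=None):
    """Build the config dict that restricts retrieval to one namespace.

    If k is given, it is passed along as the number of chunks to retrieve.
    """
    search_kwargs = {"namespace": namespace}
    if k is not None:
        search_kwargs["k"] = k
    return {"configurable": {"search_kwargs": search_kwargs}}


print(namespace_config("AXSM"))

# Example usage with the chain defined earlier (commented out because it
# requires live Pinecone and OpenAI credentials):
# chain.invoke("What was revenue in July 2023?", config=namespace_config("AXSM"))
```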