# Data Ingestion

In this tutorial, we show how to create a data ingestion pipeline that adds data to a vector database.

We will use `Pinecone` as the vector database, but other vector databases are available too, for example `Chroma`, `Weaviate`, and `Faiss`.

We will cover the following in this session:
- How to load documents.
- How to add metadata to each document.
- How to use a text splitter to split documents.
- How to generate embeddings for each text chunk.
- How to insert the embeddings into a vector database.


## Pinecone

You will need a [Pinecone](https://www.pinecone.io/) API key. You can [sign up](https://app.pinecone.io/?sessionType=signup) for free to get a starter account, and then get the API key after signing up.

In [1]:
# Import the os module to interact with the operating system environment variables
import os
from pinecone import Pinecone, ServerlessSpec
# Set the OPENAI_API_KEY environment variable to your OpenAI API key
os.environ["OPENAI_API_KEY"] = ""

# Set the PINECONE_API_KEY environment variable to your Pinecone API key
os.environ["PINECONE_API_KEY"] = ""


## OpenAI

You will need an [OpenAI](https://openai.com/) API key for this session. Log in to your [platform.openai.com](https://platform.openai.com/) account, click your profile picture in the upper-right corner, and choose 'API Keys' from the menu. Create an API key and save it.

## Setup a Pinecone Index

In [2]:
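import time

# Create the Pinecone client using the API key set in the environment above
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Initialize a ServerlessSpec object for AWS with the specified region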
spec = ServerlessSpec(
    cloud="aws",
    region="us-east-1"
)
INDEX_NAME = "earning-calls"
# Check whether the index already exists in this Pinecone project
if INDEX_NAME in pc.list_indexes().names():
    # If the index exists, print a message indicating its existence
    print(f"Index `{INDEX_NAME}` already exists")
    
    # Retrieve the existing index object
#     index = <YOUR CODE HERE>
    
    # Print detailed statistics about the existing index
#     print(index.describe_index_stats())
    
# If the index does not exist, proceed to create a new one
else:
    # Create a new index with specific parameters
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,
        metric="cosine",
        spec=spec
    )
    
    # Wait for the index to be initialized before proceeding
    while not pc.describe_index(INDEX_NAME).status['ready']:
        # Sleep for 1 second to avoid overloading the system with requests
        time.sleep(1)
    
    # Once the index is ready, print a confirmation message
    print(f"Index with name `{INDEX_NAME}` is created")
    
    # Retrieve the newly created index object
    index = pc.Index(INDEX_NAME)
    
    # Print detailed statistics about the newly created index
    print(index.describe_index_stats())

Index `earning-calls` already exists
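
The index above is created with `metric="cosine"` and `dimension=1536`, which is the vector size produced by `text-embedding-ada-002`, the embedding model used later in this session. As a rough illustration of what the cosine metric measures, here is a minimal pure-Python sketch. The function name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for illustration; Pinecone computes the similarity internally on the full-size vectors.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical); real embeddings here would have 1536 dimensions
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # points the same way, different length
orthogonal = [0.0, 1.0, 0.0]       # shares no direction with the query

print(cosine_similarity(query, same_direction))  # close to 1: most similar
print(cosine_similarity(query, orthogonal))      # 0: unrelated
```

Because the metric depends only on direction and not on vector length, two chunks with similar meaning score highly even when their embedding magnitudes differ.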


`Note:` To delete an existing index, use `pc.delete_index(INDEX_NAME)`.
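
A small guard around the delete call avoids an error when the index is not there. The sketch below is one possible way to write it; the helper name `delete_index_if_exists` is an assumption, and it relies only on the `list_indexes().names()` and `delete_index()` calls already used in this session.

```python
def delete_index_if_exists(pc, index_name):
    """Delete `index_name` through the Pinecone client `pc`.

    Returns True if the index existed and was deleted, False otherwise.
    """
    if index_name in pc.list_indexes().names():
        pc.delete_index(index_name)
        return True
    return False


# Usage with the client created earlier (this permanently removes the index):
# delete_index_if_exists(pc, INDEX_NAME)
```

Deleting an index removes all vectors stored in it, so the ingestion steps would need to be run again afterwards.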

## Building an Ingestion Pipeline

### Importing the required packages

In [4]:
# To split the text into smaller chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

# To create embeddings
from langchain_openai import OpenAIEmbeddings

# To connect with the Vectorstore
from langchain_pinecone import PineconeVectorStore

# To load files in a directory
from langchain_community.document_loaders import DirectoryLoader 

# To parse the PDFs
from langchain_community.document_loaders import PyPDFLoader 

In [7]:
# Define the directory path where the data files are stored
DATA_DIR_PATH = "Data/"

# Set the maximum chunk size, measured in characters by default
CHUNK_SIZE = 1024

# Define the overlap (in characters) between consecutive chunks so that context carries across chunk boundaries
CHUNK_OVERLAP = 204

# Specify the name of the index to be used for storing or retrieving data
INDEX_NAME = 'earning-calls'

`Note:` Make sure to maintain the directory structure shown below, since we will use the Year and Quarter directory names in the metadata later.

<!-- ![Data Dir Tree](images/data_dir_tree.png) -->

<img src="images/data_dir_tree.png"/>

### Loading Files

Initialize a `DirectoryLoader` object and pass it the path to the data, a glob pattern for the type of files to load from the directory, and the `loader_cls`, which in our case is `PyPDFLoader` since we are working with PDF files.

In [8]:
# Map each file extension to the loader class that parses it
loaders = {
    '.pdf': PyPDFLoader,
}
file_type = '.pdf'

# Match every PDF under DATA_DIR_PATH, including subdirectories, and parse it with PyPDFLoader
loader = DirectoryLoader(path=DATA_DIR_PATH, glob=f"**/*{file_type}", loader_cls=loaders[file_type])

# Load documents using the loader object
docs = loader.load()

# Print the total number of documents loaded
print(f"Total Documents loaded: {len(docs)}")

Total Documents loaded: 29
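
To see which files a pattern such as `**/*.pdf` picks up, the sketch below applies the same pattern with `pathlib` to a throwaway directory tree. It assumes `pathlib`-style glob semantics; the helper name `match_files` and the file names in the temporary tree are hypothetical and only mimic the Year/Quarter layout.

```python
import tempfile
from pathlib import Path


def match_files(root, file_type=".pdf"):
    """Return files under `root` matching **/*<file_type>, as sorted relative POSIX paths."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.glob(f"**/*{file_type}"))


# Build a temporary tree with hypothetical files, then apply the pattern
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["FY23/Q4/HCLTech.pdf", "FY23/Q3/HCLTech.pdf", "FY23/Q3/notes.txt"]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    print(match_files(tmp))  # ['FY23/Q3/HCLTech.pdf', 'FY23/Q4/HCLTech.pdf']
```

The `**` part makes the match descend into nested directories, which is why files inside the Year and Quarter folders are found while the `.txt` file is skipped.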


In [9]:
# looking into the first document
docs[0]

Document(page_content=' \n Page 1 of 18  \n \n“HCL  Tech nologies Limited ’s Q4FY23 & Annual FY23 \nEarnings Conference Call”  \n \nApril 20 , 2023 \n \n \n \n \n \n \n \n \n \n \n  \n \n \nMANAGEMENT : MR. C. VIJAYAKUMAR – CHIEF EXECUTIVE OFFICER & \nMANAGING DIRECTOR , HCL  TECH NOLOGIES LIMITED  \nMR. PRATEEK  AGGARWAL  – CHIEF  FINANCIAL \nOFFICER , HCL  TECH NOLOGIES LIMITED  \nMR. SRINIVASAN SESHADRI – GLOBAL HEAD, \nFINANCIAL SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. VIJAY GUNTUR  – PRESIDEN T, ENGINEERING AND \nR&D  SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. MANAN  BATRA – SENIOR MANAGER , INVESTOR \nRELATIONS , HCL  TECH NOLOGIES LIMITED  \n  \n', metadata={'source': 'Data\\FY23\\Q4\\HCLTech.pdf', 'page': 0})

In [10]:
# we can convert the Document object to a python dict using the .dict() method.
print(f"keys associated with a Document: {docs[0].dict().keys()}")

keys associated with a Document: dict_keys(['page_content', 'metadata', 'type'])


In [11]:
print(f"{'-'*15}\nFirst 100 characters of the page content: {docs[0].page_content[:100]}\n{'-'*15}")
print(f"Metadata associated with the document: {docs[0].metadata}\n{'-'*15}")
print(f"Datatype of the document: {docs[0].type}\n{'-'*15}")

---------------
First 100 characters of the page content:  
 Page 1 of 18  
 
“HCL  Tech nologies Limited ’s Q4FY23 & Annual FY23 
Earnings Conference Call”  
---------------
Metadata associated with the document: {'source': 'Data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}
---------------
Datatype of the document: Document
---------------


In [12]:
# Loop through each document and add extra metadata: filename, quarter, and year.
# Metadata is very important for narrowing down the search. Without that narrowing there can be
# a lot of hallucinations: the model might respond in a generic way instead of about the specific company.
# The success of RAG depends on the retrieval.
# TODO: look into GraphRAG.
for doc in docs:
    # The source path looks like Data\<year>\<quarter>\<filename>; this split assumes Windows-style separators
    source = doc.metadata['source']
    parts = source.split("\\")
    filename, quarter, year = parts[-1], parts[-2], parts[-3]
    doc.metadata = {"filename": filename, "quarter": quarter, "year": year, "source": source, "page": doc.metadata['page']}

In [13]:
# To verify that the metadata was indeed added to the documents
print(f"Metadata associated with the document: {docs[0].metadata}\n{'-'*15}")
print(f"Metadata associated with the document: {docs[1].metadata}\n{'-'*15}")

Metadata associated with the document: {'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': 'Data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}
---------------
Metadata associated with the document: {'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': 'Data\\FY23\\Q4\\HCLTech.pdf', 'page': 1}
---------------
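
The `source` paths shown above use Windows-style backslashes. On Linux or macOS the same loader would likely report forward slashes, so a separator-agnostic helper can be useful. The sketch below is one possible approach using `PureWindowsPath`, which accepts both separators; the helper name `path_metadata` is an assumption.

```python
from pathlib import PureWindowsPath


def path_metadata(source):
    """Extract filename, quarter, and year from a path like Data/<year>/<quarter>/<filename>."""
    # PureWindowsPath splits on both "\\" and "/", so either style of path works
    parts = PureWindowsPath(source).parts
    if len(parts) < 3:
        raise ValueError(f"expected <year>/<quarter>/<filename> in path, got: {source!r}")
    return {"filename": parts[-1], "quarter": parts[-2], "year": parts[-3]}


print(path_metadata("Data\\FY23\\Q4\\HCLTech.pdf"))
# {'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23'}
print(path_metadata("Data/FY23/Q4/HCLTech.pdf"))
# {'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23'}
```

The returned dictionary could be merged into each document's existing metadata, for example with `doc.metadata.update(path_metadata(doc.metadata["source"]))`.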


### Chunking Text

As the name suggests, chunking is the process of dividing a large amount of data into several smaller parts for more effective and meaningful storage.

There are various ways to perform chunking, including:
 - Character Chunking
 - Recursive Character Chunking
 - Document Specific Chunking

For this session we will use `Recursive Character Chunking`, and LangChain has an implementation that we can use directly. To read more about it, refer to the [docs](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/).

`Additional Resource:` If you want to explore the different chunking strategies, you can refer to the following docs from LangChain - [Link to Docs](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/)
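
To build intuition for chunk size and chunk overlap, here is a minimal sketch of plain fixed-size character chunking. It is deliberately simpler than the recursive splitter used in this session, which additionally tries to break on separators such as paragraph and line breaks. The function name `chunk_text` and the sample string are assumptions for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same idea as `CHUNK_OVERLAP`: text near a boundary appears in both neighbouring chunks, so a sentence cut at the boundary is still retrievable with some of its context.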

In [14]:
# Split text into chunks 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CHUNK_SIZE,
    chunk_overlap = CHUNK_OVERLAP
)
documents = text_splitter.split_documents(docs)

In [16]:
len(docs), len(documents)

(29, 112)

In [17]:
documents[0:4]

[Document(page_content='Page 1 of 18  \n \n“HCL  Tech nologies Limited ’s Q4FY23 & Annual FY23 \nEarnings Conference Call”  \n \nApril 20 , 2023 \n \n \n \n \n \n \n \n \n \n \n  \n \n \nMANAGEMENT : MR. C. VIJAYAKUMAR – CHIEF EXECUTIVE OFFICER & \nMANAGING DIRECTOR , HCL  TECH NOLOGIES LIMITED  \nMR. PRATEEK  AGGARWAL  – CHIEF  FINANCIAL \nOFFICER , HCL  TECH NOLOGIES LIMITED  \nMR. SRINIVASAN SESHADRI – GLOBAL HEAD, \nFINANCIAL SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. VIJAY GUNTUR  – PRESIDEN T, ENGINEERING AND \nR&D  SERVICES , HCL  TECH NOLOGIES LIMITED  \nMR. MANAN  BATRA – SENIOR MANAGER , INVESTOR \nRELATIONS , HCL  TECH NOLOGIES LIMITED', metadata={'filename': 'HCLTech.pdf', 'quarter': 'Q4', 'year': 'FY23', 'source': 'Data\\FY23\\Q4\\HCLTech.pdf', 'page': 0}),
 Document(page_content="HCL Tech nologies Limited  \nApril 20 , 202 3 \n \n Page 2 of 18 \nModerator:  Ladies and gentlemen, g ood day and welcome to the HCL Technologies  Limited Q4 FY'23 & \nAnnual FY'23 Earnings Co

In [5]:
# Initialize the embedding model
embeddings = OpenAIEmbeddings(model = "text-embedding-ada-002") 

In [6]:
# Prompt the user to confirm if the vectors are already added to the Pinecone database
docs_already_in_pinecone = input("Are the vectors already added in DB: (Type Y/N)")

# Check if the user has confirmed that the vectors are already in the database
if docs_already_in_pinecone == "Y" or docs_already_in_pinecone == "y":
    
    # Initialize a PineconeVectorStore object with the existing index and embeddings
    docsearch = PineconeVectorStore(
        index_name = INDEX_NAME,
        embedding = embeddings
    )
    
    print("Existing Vectorstore is loaded")
    
# If the user confirms that the vectors are not in the database, create a new PineconeVectorStore from the documents and embeddings
elif docs_already_in_pinecone == "N" or docs_already_in_pinecone == "n":
    
    # Create a PineconeVectorStore object from the documents and embeddings, specifying the index name
    docsearch = PineconeVectorStore.from_documents(
        documents,
        index_name = INDEX_NAME,
        embedding = embeddings        
    )
    
    print("New vectorstore is created and loaded")
    
# If the user input is neither 'Y' nor 'N', prompt them to enter a valid response
else:
    print("Please type Y - for yes and N - for no")

Are the vectors already added in DB: (Type Y/N) Y


Existing Vectorstore is loaded


In [7]:
# Define how to use the loaded vector store as a retriever
QUARTER, FILENAME, TOP_K = "Q4", "HCLTech.pdf", 5  # example values (assumptions); adjust to your data
retriver = docsearch.as_retriever(search_kwargs={"filter": {"quarter": QUARTER, "filename": FILENAME}, "k": TOP_K})

In [None]:
retriver.invoke("what is the income?")

#### Using metadata with retriever

In [None]:
# Create a retriever object from the `docsearch` vector store, configured with specific search parameters
retriver = <YOUR CODE HERE>

In [None]:
retriver.invoke("what is the income?")
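
To make the effect of a metadata `filter` concrete, the sketch below mimics equality-based filtering in plain Python. It is only an illustration: Pinecone applies the filter on the server side and supports more operators than simple equality. The helper name `matches_filter` and the sample chunks are hypothetical.

```python
def matches_filter(metadata, metadata_filter):
    """Return True when every key in `metadata_filter` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


# Hypothetical chunks carrying the metadata fields added earlier in this session
chunks = [
    {"text": "chunk A", "metadata": {"filename": "HCLTech.pdf", "quarter": "Q4", "year": "FY23"}},
    {"text": "chunk B", "metadata": {"filename": "HCLTech.pdf", "quarter": "Q3", "year": "FY23"}},
    {"text": "chunk C", "metadata": {"filename": "Other.pdf", "quarter": "Q4", "year": "FY23"}},
]

metadata_filter = {"quarter": "Q4", "filename": "HCLTech.pdf"}
candidates = [c["text"] for c in chunks if matches_filter(c["metadata"], metadata_filter)]
print(candidates)  # ['chunk A']
```

Only the chunks that pass the filter are considered for similarity ranking, which is how the filter keeps answers focused on one company and one quarter.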