# Project: Text Analytics (Knowledge base Design and Implementation)
# Team - TalkTech Tinkerers

### Members:
### Rodgers Okeyo Ochieng
### Jihee Wang
### Divyanshi Singh
### Chandan Patel

![plan.png](plan.png)

### In this notebook we perform the following tasks:

**Data Collection:**
Load all the collected files.

**Data Preprocessing**
Preprocess the data as required by the schema.

**Vectorization**
Convert the preprocessed text data into vectors using the selected vectorization techniques.

**Database Implementation**
Use Pinecone to create a vector database and store the vectors along with associated metadata.

**Database Testing**
Implement a few test prompts to verify the efficacy of the database in improving query responses.

### Import Necessary Keys and Libraries

In [20]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [21]:
from langchain.chains import RetrievalQA  # question answering over retrieved documents
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader  # CSV document loader
from langchain.document_loaders import PyPDFLoader  # PDF document loader
from IPython.display import display, Markdown
from langchain.prompts import PromptTemplate
import pinecone # pip3 install "pinecone-client[grpc]"
from langchain.vectorstores import Pinecone

from pprint import pprint # pretty print

## Loading the Documents Collected from Various Sources

![Documents.PNG](Documents.PNG)

In [22]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # One loader per collected PDF
    PyPDFLoader("./data1/doc1_ overall admission process of MS BAIS_with specific 8 steps.pdf"),
    PyPDFLoader("./data1/doc2_how to apply for MS BAIS_apply in 6 easy steps (application, fee, transcript, test score, additional infromation, clearance questions).pdf"),
    PyPDFLoader("./data1/doc3_MS BAIS Admission requirements (application, four_year bachelor degree, gmat or gre,relevant work experience, recommendations,  toefl, sop).pdf"),
    PyPDFLoader("./data1/doc4_cost of MS BAIS course (tuition fee, online tuition caculator,  our_of_state tuition).pdf"),
    PyPDFLoader("./data1/doc5_contact admissions.pdf"),
    PyPDFLoader("./data1/doc6_MS BAIS application deadlines (spring application deadline, fall appilcation deadline).pdf"),
    PyPDFLoader("./data1/doc7_explore graduate programs (types , graduate catalog, graduate research, concurrent, accelerated bachelors to masters degree, online programs).pdf"),
    PyPDFLoader("./data1/doc8_what is I20 (full name, issued by, purpose, usage, important features, validity, importance).pdf"),
    PyPDFLoader("./data1/doc9_most frequently asked questions (admission, application, program, finances, international students).pdf"),
    PyPDFLoader("./data1/doc10_additional information (cost of attendance cost, assistantships, ways to save, office of financial aid, scholarships, fellowship, onoff campus jobs).pdf"),
    PyPDFLoader("./data1/doc11_graduate assistantships.pdf"),
    PyPDFLoader("./data1/doc12_MS BAIS Curriculum (required time, coursework, necessary prerequisite, credit hours, core courses, capstone, concentrations).pdf"),
    PyPDFLoader("./data1/doc13_MS BAIS curriculum requirements (minimum, incoming students, prerequisites, executive weekend, core, capstone, concentration, electives).pdf"),
    PyPDFLoader("./data1/doc14_MS BAIS prerequisites courses for incoming students.pdf"),
    PyPDFLoader("./data1/doc15_global executive program (full time, 100 per online).pdf"),
    PyPDFLoader("./data1/doc16_getting started with MS BAIS_new student checklist.pdf"),
    PyPDFLoader("./data1/doc17_Conversation Records between Students and College Management.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [23]:
## check docs
docs[0].page_content[0:500]

'1. Overall Admission Process  of MS BAIS  \nWe get thousands of applications, which are reviewed first by our Office of Graduate \nAdmissions and then by our MS -BAIS program on a first come, first served basis. The entire \nprocess may take us up to two months, and admissions will close whenever we reach our \nintake capacity for the academic year. Please contact us if you don’t hear back from us in \ntwo months. Our admissions process is as follows:  \n \nStep1. Submit your application, complete with'

#### Metadata of the loaded documents

In [24]:
# Iterate through the list of documents and collect metadata
all_metadata = []

for doc in docs:
    metadata = doc.metadata
    all_metadata.append(metadata)

# Print all metadata at the end
for metadata in all_metadata:
    print(metadata)

{'source': './data1/doc1_ overall admission process of MS BAIS_with specific 8 steps.pdf', 'page': 0}
{'source': './data1/doc2_how to apply for MS BAIS_apply in 6 easy steps (application, fee, transcript, test score, additional infromation, clearance questions).pdf', 'page': 0}
{'source': './data1/doc2_how to apply for MS BAIS_apply in 6 easy steps (application, fee, transcript, test score, additional infromation, clearance questions).pdf', 'page': 1}
{'source': './data1/doc2_how to apply for MS BAIS_apply in 6 easy steps (application, fee, transcript, test score, additional infromation, clearance questions).pdf', 'page': 2}
{'source': './data1/doc3_MS BAIS Admission requirements (application, four_year bachelor degree, gmat or gre,relevant work experience, recommendations,  toefl, sop).pdf', 'page': 0}
{'source': './data1/doc3_MS BAIS Admission requirements (application, four_year bachelor degree, gmat or gre,relevant work experience, recommendations,  toefl, sop).pdf', 'page': 1}
{'s
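Because `PyPDFLoader` yields one document per page, the metadata list is long and repetitive. A compact way to inspect it is to count pages per source file. The sketch below is illustrative only: the `all_metadata` entries are shortened, hypothetical stand-ins with the same `source` and `page` keys as the real output.

```python
from collections import Counter

# Toy stand-ins for doc.metadata; the file names are hypothetical placeholders.
all_metadata = [
    {"source": "./data1/doc_a.pdf", "page": 0},
    {"source": "./data1/doc_b.pdf", "page": 0},
    {"source": "./data1/doc_b.pdf", "page": 1},
    {"source": "./data1/doc_b.pdf", "page": 2},
]

# Count how many pages were loaded from each source file.
pages_per_source = Counter(m["source"] for m in all_metadata)

for source, n_pages in pages_per_source.items():
    print(f"{n_pages:>3} page(s)  {source}")
```

Running the same `Counter` expression on the notebook's real `all_metadata` list should give one line per PDF instead of one line per page.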

## Data Preprocessing

Here we split the documents into chunks in preparation for embedding.

### Document Splitting into chunks

Documents are split into chunks so that passages of a predefined size can be retrieved.

In [25]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 20
)
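
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification and not the library's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this version ignores. The function name `split_with_overlap` and the sample string are our own, chosen for illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> List[str]:
    """Cut text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one. With the notebook's settings, the overlap is 20 characters, which helps keep a sentence that straddles a boundary partly visible in both chunks.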

In [27]:
texts = text_splitter.split_documents(docs)

In [28]:
## check text length
len(texts)

10852

In [29]:
## check the second chunk (index 1)
texts[1]

Document(page_content='some additional supplemental information via an online form. Note: If you did not receive this e -mail, your \napplication is likely still pending at our G raduate Admissions Office. Contact them at 813 -974-3350 \nor GradAdmissions@usf.edu  (and not the MS -BAIS program director) for status on your application.  \nStep4. The MS -BAIS prog ram will review your complete application and notify you the final decision on your \napplication. Expect this process to take another month.  \nStep5. If you are an international student, you will then have to provide proof of funds for our Office of \nInternati onal Students to issue you an I -20. This may require a bank statement or a bank loan. Domestic students \nmay skip to Step 7.  \nStep6. Upon receiving your I -20, schedule a visa appointment at your closest US consulate or embassy. \nDepending on applicant volume,  it may take months to get a visa appointment. If you don’t get your visa', metadata={'source': './data1/

![split.png](split.png)

The figure above illustrates the splitting and embedding steps.

## Vectorization

## Embeddings

Let's take our splits and embed them using OpenAI's text-embedding-ada-002 model.

In [30]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002")

## Vectorstores and Storage using Pinecone
Applications that involve large language models, generative AI, and semantic search rely on vector embeddings, a type of data that represents semantic information. This information allows AI applications to gain understanding and maintain a long-term memory that they can draw upon when executing complex tasks.

## Integrate a Knowledge Base using Pinecone and LangChain

We will use Pinecone to create a vector database and LangChain to build a knowledge base on top of it. The knowledge base will be used to answer questions.

In [31]:
# Initialize the Pinecone client (the pinecone module was imported above)
pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),  
    environment=os.getenv('PINECONE_ENV')  
)

## Pinecone Database Implementation

In [32]:
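# Create an index (one-time setup).
# Note: the client was already initialized in the previous cell, so this call is redundant but harmless.

pinecone.init(api_key=os.getenv('PINECONE_API_KEY'), environment=os.getenv('PINECONE_ENV'))

# Uncomment to delete an existing index before recreating it:
# pinecone.delete_index("langchain-demo-index")

# Uncomment on the first run to create the index:
# pinecone.create_index("langchain-demo-index", dimension=1536) # 1536 is the OpenAI ada embedding dimension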

In [33]:
# Upload vectors to Pinecone
# You first need to create an index in Pinecone (see www.pinecone.io) with 1536 dimensions

index_name = "langchain-demo-index"
vectordb = Pinecone.from_documents(texts, embeddings, index_name=index_name) # load all text chunks into Pinecone with the associated embeddings

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for text-embedding-ada-002 in organization org-9E0KtZHRQjJeAMkp7vbAlMhP on tokens per min. Limit: 1000000 / min. Current: 842688 / min. Contact us through our help center at help.openai.com if you continue to have issues..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for text-embedding-ada-002 in organization org-9E0KtZHRQjJeAMkp7vbAlMhP on tokens per min. Limit: 1000000 / min. Current: 769844 / min. Contact us through our help center at help.openai.com if you continue to have issues..


## Database Testing
The database can be tested by retrieving chunks through similarity search. Other retrieval methods can also be used to fetch the content related to a search text. The examples below test the database in two ways: similarity search retrieval and LangChain RetrievalQA.

### Retrieval of Docs

Retrieve the documents from the vector database that most closely match the query.
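
Conceptually, a similarity search ranks the stored vectors by their closeness to the query vector and returns the top `k` entries. The sketch below illustrates this with an exhaustive scan over a tiny in-memory list. Everything in it is a toy assumption: the three-dimensional vectors are hand-made rather than real 1536-dimensional embeddings, `toy_store` and `top_k` are hypothetical names, and a production vector database such as Pinecone uses an index rather than scanning every vector.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[Tuple[float, str]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made toy vectors; real embeddings come from the embedding model.
toy_store = [
    ("admission steps", [0.9, 0.1, 0.0]),
    ("tuition cost", [0.1, 0.9, 0.0]),
    ("application deadlines", [0.7, 0.2, 0.1]),
    ("curriculum", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of an admission-related query

results = top_k(query_vec, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two entries whose vectors point most nearly in the query's direction ("admission steps", then "application deadlines") come back first, which is the behavior we expect from `vectordb.similarity_search`.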

In [34]:
# Do a simple vector similarity search

query = "Tell me admission process in BAIS"
result = vectordb.similarity_search(query)

pprint(result)

[Document(page_content='1. Overall Admission Process  of MS BAIS  \nWe get thousands of applications, which are reviewed first by our Office of Graduate \nAdmissions and then by our MS -BAIS program on a first come, first served basis. The entire \nprocess may take us up to two months, and admissions will close whenever we reach our \nintake capacity for the academic year. Please contact us if you don’t hear back from us in \ntwo months. Our admissions process is as follows:  \n \nStep1. Submit your application, complete with GRE/GMAT scores, college transcripts, and other necessary details \nto USF Office of Graduate Admissions.  \nStep2. The graduate admissions offi ce will notify you the receipt of your application, assign you a U number, \nreview your application, and if complete, will refer your application to the MS -BAIS program. This process takes \none month.  \nStep3. The MS -BAIS program will notify you by e -mail our receipt of your application and ask you to provide us', m

The similarity search above returned 4 relevant chunks. Next, we explore another retrieval approach in which the retrieved chunks are sent to an LLM, which uses them to produce a natural language response.

In [35]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.5)

In [36]:
## Retrieval using langChain (from langchain.chains import RetrievalQA)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [37]:
question = "Tell me admission process in BAIS?"

In [38]:
result = qa_chain({"query": question})

In [39]:
display(Markdown(result["result"]))

The admission process for the MS BAIS program at USF is as follows:

1. Step 1: Submit your application, including GRE/GMAT scores, college transcripts, and other necessary details, to the USF Office of Graduate Admissions.
2. Step 2: The graduate admissions office will notify you of the receipt of your application, assign you a U number, review your application, and if complete, refer your application to the MS-BAIS program. This process takes approximately one month.
3. Step 3: The MS-BAIS program will notify you by email of the receipt of your application and ask you to provide additional information if needed.
4. Step 4: Once your application is reviewed, you will receive an official acceptance from the Graduate School. At that time, you will be able to log into your account and access your e-admit letter.

Please note that the entire admission process may take up to two months, and admissions will close once the intake capacity for the academic year is reached. If you don't hear back within two months, it is recommended to contact the program for an update on your application status.

### Database Testing: Retrieval and Output Using LangChain RetrievalQA with a Prompt Template

In [40]:
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
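
To show what the LLM actually receives, the sketch below fills the same template by hand with plain `str.format`. This is only an approximation of what the chain does internally: we assume the retrieved chunks are joined with blank lines before being substituted for `{context}`, and the two chunk strings are invented placeholders rather than real retrieval results.

```python
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

# Placeholder chunks standing in for what the retriever would return.
retrieved_chunks = [
    "Toy chunk 1: text about prerequisite courses.",
    "Toy chunk 2: text about curriculum requirements.",
]
question = "What are the prerequisite courses in BAIS"

# Assumption: chunks are concatenated with blank lines to form the context.
context = "\n\n".join(retrieved_chunks)
prompt_text = template.format(context=context, question=question)
print(prompt_text)
```

Printing the filled prompt is a quick way to check that the instructions, the context, and the question appear in the intended order before any API call is made.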

In [41]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [42]:
question = "What are the prerequisite courses in BAIS"

In [43]:
result = qa_chain({"query": question})

In [44]:
display(Markdown(result["result"]))

The prerequisite courses for the BAIS program are:
1. A course in high-level, object-oriented programming language (e.g., C#, C++, Java, and Python) or substantial programming experience.
2. A course in Information Systems Analysis and Design or equivalent experience.
3. A course in Database Systems or equivalent experience.
4. A course in Statistics or equivalent professional qualification or experience.
5. A course in economics or equivalent professional qualification or experience.
6. A course in financial accounting. 

Thanks for asking!

## Overall Workflow: the Schema Below Summarizes the Steps Above

![schema.png](schema.png)

The diagram above shows the two-step process of building a vector database and then querying the documents through it. Steps 1 and 2, as carried out in this project, are summarized below. This report focuses mainly on Step 1.

**Step 1: Create the Database (this is the main part of the project)**

- **Split Type:**
  - Various types of files, such as PDFs, TXTs, and URLs, are ingested.
  - A preprocessing (splitting) phase extracts the relevant sections from these sources as chunks.

- **Embedding Type:**
  - The chunks extracted from the source files are then embedded, that is, transformed into numerical vectors.
  - In this project the embeddings come from OpenAI's text-embedding-ada-002 model, which produces 1536-dimensional vectors.

- **Model & VectorStore:**
  - The resulting vectors are stored in a vector store, a database designed specifically to handle vectors. This project uses Pinecone.
  - At search time, the distance between the query vector and the stored vectors is computed. A smaller distance (or higher similarity) indicates more closely related content, which is how relevant information is retrieved.
  - In the diagram, the vector store is drawn as a container holding the vectorized data.
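
The notion of "distance" between vectors can be measured in more than one way. Pinecone indexes can be configured with cosine, Euclidean, or dot-product metrics; which one this project's index uses is not stated here, so the sketch below simply compares the three on toy two-dimensional vectors. The vectors and function names are our own illustrative choices.

```python
import math


def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity: depends on direction only, not on length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # orthogonal to a

for name, other in [("b", b), ("c", c)]:
    print(name,
          "cosine:", cosine(a, other),
          "euclidean:", round(euclidean(a, other), 3),
          "dot:", dot(a, other))
```

Note that `a` and `b` have cosine similarity 1.0 even though their Euclidean distance is not zero, which is why the choice of metric matters when embeddings differ in length.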

**Step 2: Ask the Document (here we use retrieval and an LLM via OpenAI API calls)**

- **Question:**
  - A user or system poses a question, for example "What is …". The retriever runs a semantic search on this query against the vector store and returns the most relevant chunks.

- **LLM (OpenAI “gpt-3.5-turbo”):**
  - The question and the retrieved chunks are passed together to an LLM.
  - The LLM processes the question in the context of the chunks passed to it, together with the knowledge encoded in its own parameters.

- **Relevance and Answers:**
  - The LLM draws on the most relevant of the retrieved "splits" to compose the best possible answer to the question.
  - It then returns the answer in natural language, e.g. "It is ...".
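
The flow of Step 2 can be sketched as a small function that wires a retriever and an LLM together. To keep the example runnable without API keys, both are replaced by stubs: `fake_retrieve` ranks a toy corpus by word overlap instead of vector similarity, and `fake_llm` just echoes the first context line. All names, the prompt wording, and the corpus text are hypothetical; in the notebook these roles are played by `vectordb.as_retriever()` and `ChatOpenAI`.

```python
import re
from typing import Callable, List

# Toy corpus standing in for the chunks stored in the vector database.
TOY_CORPUS = [
    "admission process: submit the application first",
    "tuition: see the online calculator",
    "curriculum: core courses and capstone",
]


def tokens(text: str) -> set:
    """Lowercase word set, used by the stub retriever."""
    return set(re.findall(r"[a-z]+", text.lower()))


def fake_retrieve(question: str, k: int) -> List[str]:
    """Stub retriever: rank chunks by word overlap with the question."""
    q = tokens(question)
    ranked = sorted(TOY_CORPUS, key=lambda chunk: len(q & tokens(chunk)), reverse=True)
    return ranked[:k]


def fake_llm(prompt: str) -> str:
    """Stub LLM: echoes the first context line instead of generating text."""
    first_context_line = prompt.splitlines()[1]
    return "Stub answer, based on: " + first_context_line


def answer_question(question: str,
                    retrieve: Callable[[str, int], List[str]],
                    llm: Callable[[str], str],
                    k: int = 4) -> str:
    """Retrieve k chunks, place them in a prompt, and ask the LLM."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)


answer = answer_question("What is the admission process?", fake_retrieve, fake_llm, k=2)
print(answer)
```

Passing the retriever and the LLM in as callables keeps the two stages independent, so either one can be swapped out or tested on its own.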

In conclusion, combining a vector database with text embeddings and an LLM lets users query a document collection in natural language and receive answers grounded in the retrieved passages. The system outlined here streamlines the integration of documents from multiple sources and returns contextually relevant content for each query. As the demands of information processing continue to grow, systems like this one need to stay both efficient and intuitive. This work lays the groundwork for further development of the knowledge base.

