In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

In [2]:
#Read the PDF from folder

loader = PyPDFDirectoryLoader("./us_census")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200)
final_documents = text_splitter.split_documents(documents)
final_documents[0]

Document(page_content='Health Insurance Coverage Status and Type \nby Geography: 2021 and 2022\nAmerican Community Survey Briefs\nACSBR-015Issued September 2023Douglas Conway and Breauna Branch\nINTRODUCTION\nDemographic shifts as well as economic and govern-\nment policy changes can affect people’s access to health coverage. For example, between 2021 and 2022, the labor market continued to improve, which may have affected private coverage in the United States \nduring that time.\n1 Public policy changes included \nthe renewal of the Public Health Emergency, which \nallowed Medicaid enrollees to remain covered under the Continuous Enrollment Provision.\n2 The American \nRescue Plan (ARP) enhanced Marketplace premium subsidies for those with incomes above 400 percent of the poverty level as well as for unemployed people.\n3', metadata={'source': 'us_census\\acsbr-015.pdf', 'page': 0})

In [3]:
len(final_documents)

316
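
The `chunk_size` and `chunk_overlap` settings used in cell 2 can be illustrated with a simplified sliding-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, which the hypothetical `split_with_overlap` helper below does not attempt.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration only; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2500-character string standing in for a page of text
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)

print([len(c) for c in chunks])
# The last 200 characters of one chunk repeat at the start of the next
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.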

In [4]:
## Embedding Using Huggingface
huggingface_embeddings=HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",      #sentence-transformers/all-MiniLM-l6-v2
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings':True}
)



In [5]:
import numpy as np

np.array(huggingface_embeddings.embed_query(final_documents[0].page_content))

array([-8.46568495e-02, -1.19099310e-02, -3.37892435e-02,  2.94559207e-02,
        5.19159995e-02,  5.73839732e-02, -4.10017520e-02,  2.74268091e-02,
       -1.05128177e-01, -1.58056207e-02,  7.94858709e-02,  5.64318746e-02,
       -1.31765325e-02, -3.41543853e-02,  5.81604335e-03,  4.72547896e-02,
       -1.30746868e-02,  3.12990951e-03, -3.44225764e-02,  3.08406297e-02,
       -4.09085974e-02,  3.52737829e-02, -2.43761409e-02, -4.35831882e-02,
        2.41503194e-02,  1.31986579e-02, -4.84452443e-03,  1.92347690e-02,
       -5.43913022e-02, -1.42735049e-01,  5.15529327e-03,  2.93115675e-02,
       -5.60810640e-02, -8.53534322e-03,  3.14141139e-02,  2.76736636e-02,
       -2.06188019e-02,  8.24231580e-02,  4.15425301e-02,  5.79655319e-02,
       -3.71587239e-02,  6.26160018e-03, -2.41390113e-02, -5.61795151e-03,
       -2.51715351e-02,  5.04970131e-03, -2.52801143e-02, -2.91944109e-03,
       -8.24046135e-03, -5.69604374e-02,  2.30822731e-02, -5.54221636e-03,
        5.11555634e-02,  

In [6]:
np.array(huggingface_embeddings.embed_query(final_documents[0].page_content)).shape

(384,)

In [7]:
# VectorStore Creation
vectorstore = FAISS.from_documents(final_documents[:120], huggingface_embeddings)

In [8]:
## Query using Similarity Search
query="WHAT IS HEALTH INSURANCE COVERAGE?"
relevant_documents = vectorstore.similarity_search(query)

print(relevant_documents[0].page_content)

2 U.S. Census Bureau
WHAT IS HEALTH INSURANCE COVERAGE?
This brief presents state-level estimates of health insurance coverage 
using data from the American Community Survey (ACS). The  
U.S. Census Bureau conducts the ACS throughout the year; the 
survey asks respondents to report their coverage at the time of 
interview. The resulting measure of health insurance coverage, 
therefore, reflects an annual average of current comprehensive 
health insurance coverage status.* This uninsured rate measures a 
different concept than the measure based on the Current Population 
Survey Annual Social and Economic Supplement (CPS ASEC). 
For reporting purposes, the ACS broadly classifies health insurance 
coverage as private insurance or public insurance. The ACS defines 
private health insurance as a plan provided through an employer 
or a union, coverage purchased directly by an individual from an 
insurance company or through an exchange (such as healthcare.


In [9]:
retriever=vectorstore.as_retriever(search_type="similarity",search_kwargs={"k":3})
print(retriever)

tags=['FAISS', 'HuggingFaceBgeEmbeddings'] vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x0000015DCA586210> search_kwargs={'k': 3}


This code uses LangChain's FAISS vector store (the `vectorstore` object created in cell 7) to build a retriever that searches the embedded document chunks.

Here's a breakdown of what the code does:

1. `vectorstore.as_retriever`: This method wraps the vector store in a retriever, an object that takes a query string and returns the most relevant documents.
2. `search_type="similarity"`: This specifies the type of search to perform. With "similarity", the retriever embeds the query and returns the chunks whose embeddings are closest to it.
3. `search_kwargs={"k":3}`: This is a dictionary of keyword arguments that is passed to the search function. Here it sets `k` to 3, so the retriever returns the 3 most similar documents for each query.

The `print(retriever)` statement simply prints the retriever object to the console.

The retriever can then be used to fetch the most similar chunks for a query. It takes the query string itself and embeds it internally, so there is no need to compute a query vector by hand:
```python
results = retriever.invoke(query)  # list of up to 3 Document objects
for doc in results:
    print(doc.page_content)
```
Note that this code assumes a vector store has already been created and populated, as in cell 7. The FAISS wrapper also provides methods for persisting and reloading an index, such as `save_local` and `FAISS.load_local`.
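
What a similarity search with `k=3` does can be sketched in plain Python. The toy 3-dimensional vectors and the helper names below are hypothetical; the real index holds 384-dimensional BGE embeddings and FAISS performs the search. Because the embeddings were created with `normalize_embeddings=True`, every vector has unit length, so the dot product used here equals cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [
        (i, sum(q * d for q, d in zip(query_vec, vec)))
        for i, vec in enumerate(doc_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy "embeddings" standing in for the chunk vectors (made-up values)
doc_vecs = [normalize(v) for v in [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]]
query_vec = normalize([0.8, 0.2, 0.0])

for index, score in top_k(query_vec, doc_vecs, k=3):
    print(index, round(score, 3))
```

For unit-length vectors, ranking by highest dot product and ranking by smallest Euclidean distance give the same order, since the squared distance equals 2 minus twice the dot product.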

In [10]:
import os
from getpass import getpass

# Prompt for the token rather than hard-coding a secret in the notebook
os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass("Hugging Face API token: ")


The Hugging Face Hub is an online platform with over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together.

In [11]:
from langchain_community.llms import HuggingFaceHub

hf = HuggingFaceHub(
    repo_id = "mistralai/Mistral-7B-v0.1",
    model_kwargs = {"temperature":0.1, "max_length":500}
)

query="What is the health insurance coverage?"
hf.invoke(query)



This code is using the LangChain library, specifically the `HuggingFaceHub` class, to interact with a pre-trained language model on the Hugging Face Hub.

Here's a breakdown of what the code does:

1. `from langchain_community.llms import HuggingFaceHub`: This line imports the `HuggingFaceHub` class from the `langchain_community.llms` module.
2. `hf = HuggingFaceHub(...)`: This line creates an instance of the `HuggingFaceHub` class, which is used to interact with a pre-trained language model on the Hugging Face Hub.
3. `repo_id = "mistralai/Mistral-7B-v0.1"`: This specifies the repository ID of the pre-trained language model to use. In this case, it's the "Mistral-7B-v0.1" model from the "mistralai" organization on the Hugging Face Hub.
4. `model_kwargs = {"temperature":0.1, "max_length":500}`: This dictionary specifies additional keyword arguments to pass to the language model when it's invoked. In this case, it sets the `temperature` parameter to 0.1 and the `max_length` parameter to 500.
5. `query="What is the health insurance coverage?"`: This specifies the query string to pass to the language model.
6. `hf.invoke(query)`: This line invokes the language model with the specified query string, using the `invoke` method of the `HuggingFaceHub` instance. The `invoke` method returns the output of the language model, which can be used for further processing or analysis.

The output of `invoke` depends on the model and the query string. `Mistral-7B-v0.1` is a base model rather than an instruction-tuned one, so the output is a text continuation of the query and not necessarily a direct answer.

Note that the `HuggingFaceHub` class sends the request to a model hosted on the Hugging Face Hub, so there is no need to download and load the model locally.
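
The effect of the `temperature` setting can be illustrated with a small softmax sketch. The logits below are made-up numbers and `softmax_with_temperature` is a hypothetical helper, not the model's actual sampling code; the point is only that a low temperature such as 0.1 concentrates probability on the highest-scoring token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At temperature 1.0 the three candidates keep a noticeable share of the probability; at 0.1 almost all of it goes to the first one, which makes the generated text close to deterministic.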

### Hugging Face models can be run locally through the HuggingFacePipeline class.
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="mistralai/Mistral-7B-v0.1",
    task="text-generation",
    pipeline_kwargs={"temperature": 0, "max_new_tokens": 300}
)

llm = hf 
llm.invoke(query)


In [None]:
prompt_template="""
Use the following piece of context to answer the question asked.
Please try to provide the answer only based on the context

{context}
Question:{question}

Helpful Answers:
 """

In [None]:
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [None]:
retrievalQA=RetrievalQA.from_chain_type(
    llm=hf,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt":prompt}
)

This code creates a `RetrievalQA` chain, which combines a retriever and a language model (LLM) to answer questions from retrieved documents.

Here's a breakdown of what the code does:

1. `RetrievalQA.from_chain_type(...)`: This line creates an instance of the `RetrievalQA` class using the `from_chain_type` method.
2. `llm=hf`: This specifies the language model (LLM) to use. In this case, it's the `hf` object created earlier, which is an instance of the `HuggingFaceHub` class.
3. `chain_type="stuff"`: This specifies how the retrieved documents are passed to the LLM. `"stuff"` is a built-in LangChain chain type that inserts ("stuffs") all retrieved documents into a single prompt.
4. `retriever=retriever`: This specifies the retriever to use. In this case, it's the `retriever` object created earlier by calling `vectorstore.as_retriever(...)`.
5. `return_source_documents=True`: This specifies whether to return the source documents (i.e., the original text) along with the generated response. In this case, it's set to `True`, which means the `RetrievalQA` instance will return both the generated response and the source documents.
6. `chain_type_kwargs={"prompt":prompt}`: This specifies additional keyword arguments to pass to the chain type. In this case, it sets the `prompt` parameter to the value of the `prompt` variable.

At query time, the chain first uses the retriever to fetch the most relevant chunks from the vector store, inserts them into the `{context}` slot of the prompt, and then calls the LLM once to generate the answer. The LLM therefore sees only the retrieved chunks, not the whole corpus.

Because `return_source_documents=True`, the result contains both the generated answer (under `result`) and the retrieved chunks (under `source_documents`), which is useful for debugging and for checking what the answer was based on.
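
The "stuff" strategy can be sketched without LangChain. The helper `stuff_prompt` and the two shortened sample chunks are hypothetical, and joining chunks with a blank line is an assumption about the separator; the sketch only shows how the retrieved text and the question end up in one prompt string.

```python
PROMPT_TEMPLATE = """
Use the following piece of context to answer the question asked.
Please try to provide the answer only based on the context

{context}
Question:{question}

Helpful Answers:
 """


def stuff_prompt(chunks, question, template=PROMPT_TEMPLATE):
    """Join all retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# Shortened stand-ins for the chunks a retriever might return
retrieved_chunks = [
    "The ACS broadly classifies health insurance coverage as private or public.",
    "The survey asks respondents to report their coverage at the time of interview.",
]

final_prompt = stuff_prompt(retrieved_chunks, "What is health insurance coverage?")
print(final_prompt)
```

Since every retrieved chunk goes into the same prompt, the combined length of `k` chunks plus the question has to fit within the model's context window.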

In [None]:
#call the QA chain with our query

result = retrievalQA.invoke({"query": query})
print(result['result'])


ConnectionError: (ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 440a5778-289f-44aa-8bcb-c3f98a2b444d)')
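
The `ConnectionError` above means the remote endpoint closed the connection before responding. If the failure is transient (an assumption; it may equally be caused by an unavailable model or an invalid token), one option is to retry the call with exponential backoff. The helper `invoke_with_retry` below is hypothetical and accepts any zero-argument callable, so it is demonstrated with a stand-in function instead of the real chain.

```python
import time


def invoke_with_retry(call, max_attempts=3, base_delay=1.0,
                      retry_on=(ConnectionError,)):
    """Run call() and retry on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for the chain call: fails twice, then succeeds
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("Connection aborted.")
    return "ok"


print(invoke_with_retry(flaky_call, base_delay=0.01))
```

With the real chain, the callable would be something like `lambda: retrievalQA.invoke({"query": query})`, and `retry_on` would need to name the exception class that is actually raised. For the `requests` library that is `requests.exceptions.ConnectionError`, which is distinct from Python's built-in `ConnectionError`.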