## Step 1: Install the required packages

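```
# pypdf, sentence-transformers and llama-cpp-python are needed by PyPDFLoader, HuggingFaceEmbeddings and LlamaCpp below
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install pypdf
!pip install faiss-cpu
!pip install tiktoken
!pip install sentence-transformers
!pip install llama-cpp-python
```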

## Step 2: Import the required libraries

In [1]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader


## Method 1: Extracting Text from PDF Document using PDF reader - For ChatGPT

In [2]:
data = r"C:/Users/Lukas/Desktop/My_Projects/To_Upload/Llama2/llama2_projects/llama2_pdf_chatbot_faiss_windows/data/Hotline_Wiki.pdf"
reader = PdfReader(data)

# loader = PyPDFLoader(data)
# pages = loader.load_and_split()
# pages

incorrect startxref pointer(1)


## Read Data from the PDF File into the Variable raw_text

In [None]:
# Visit each page, extract its text, and append it to raw_text, so that raw_text ends up holding the text of the whole document
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

In [None]:
raw_text

In [None]:
raw_text[:100]
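
The loop above relies only on each page object exposing `extract_text()`. Here is a minimal sketch of the same accumulation as a reusable function, assuming a page may return `None` or an empty string when no text can be extracted. `FakePage` and `collect_text` are hypothetical names, used so the sketch runs without a PDF file:

```python
class FakePage:
    """Hypothetical stand-in for a PDF page; only extract_text() is mimicked."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def collect_text(pages):
    """Concatenate the text of every page, skipping pages with no text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # skips both None and ""
            parts.append(text)
    return "".join(parts)


sample_pages = [FakePage("page one text. "), FakePage(None), FakePage("page three text.")]
sample_text = collect_text(sample_pages)
print(sample_text[:100])
```

Collecting the parts in a list and joining once avoids repeated string concatenation, which can matter for long documents.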

## Split Text into Smaller Chunks

In [None]:
# Split the extracted text into smaller chunks so that the context retrieved later fits within the model's context window.
# Llama 2 has a context window of 4096 tokens, which is roughly 16,000 characters of English text (about 4 characters per token).

textSplitter = CharacterTextSplitter(
    separator="\n",
    # chunk_size and chunk_overlap are measured with length_function,
    # which is len here, so both are character counts, not tokens.
    # Each chunk is about 500 characters long at most, and consecutive
    # chunks share up to 50 characters of overlap.
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
)
textSplitter

# textSplitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=200
# )

In [None]:
# Split the raw text into chunks with the text splitter
texts = textSplitter.split_text(raw_text)
texts

In [None]:
len(texts)

In [None]:
texts[0]

In [None]:
texts[1]
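
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not LangChain's algorithm (`CharacterTextSplitter` splits on the separator first and then merges the pieces). `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters:

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(demo, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

In this demo, each chunk begins with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk as long as it is shorter than the overlap.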

## Method 2: Extracting Text from PDF Document using PDF reader - For ChatGPT

## Read Data from the PDF File into the Variable text

In [None]:
reader = PdfReader(data)
text = ""
for page in reader.pages:
    text += page.extract_text() or ""  # guard against pages with no extractable text
text

## Split Text into Smaller Chunks

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, length_function=len)

In [None]:
chunks = text_splitter.split_text(text)
chunks
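
The idea behind recursive splitting is to try a coarse separator first and fall back to finer ones only for pieces that are still too long. The sketch below is a simplified illustration of that idea, not LangChain's implementation: `recursive_split` is a hypothetical helper, it omits overlap, and it drops empty pieces, so whitespace is not preserved exactly.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Coarse separators are tried first; finer ones are used only for
    pieces that are still too long. Overlap is omitted for brevity.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge neighbouring pieces back together while they still fit.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= chunk_size:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the first one and must be split further."
)
pieces = recursive_split(sample, chunk_size=40)
for piece in pieces:
    print(len(piece), repr(piece))
```

The short first paragraph survives as its own chunk, while the long second paragraph is broken at word boundaries rather than in the middle of a word.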

## Method 3: Extracting Text from PDF Document - For Llama 2

In [3]:
loader = PyPDFLoader(data)
data = loader.load()  # note: this rebinds data from the file path to a list of Document objects (one per page)

incorrect startxref pointer(1)


In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, length_function=len)

In [5]:
chunks = text_splitter.split_documents(data)
chunks

[Document(page_content="Question: internal my X-FAB (SiFo) account\nAnswer: Follow the normal setup procedure (see\nhttp://sp.erf.xfab.de/RandD/DesSup/Hotline/AX_ACM-workflow-2020Mar.docx - or newer version if it\nexists), but: (i) The username in AX should be preceded by 'FO_' to clearly identify them as X-FAB\nusers (ii) include the following text in your answer: Thank you for your request. To access the SiFo\nsystem, you need to have your own my X-FAB account, which I have set up for you. You will soon", metadata={'source': 'C:/Users/Lukas/Desktop/My_Projects/To_Upload/Llama2/llama2_projects/llama2_pdf_chatbot_faiss_windows/data/Hotline_Wiki.pdf', 'page': 0}),
 Document(page_content='receive an automatically generated email with your account username (a 5-digit number) and a link to\nestablish your password (the link is valid for 7 days). That link will work fine outside the X-FAB network,\nbut internally you’ll need to modify the URL for it to work. You would need to add -fo to the

In [6]:
len(chunks)

192

## Download the Embedding Model

In [7]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})

In [8]:
# Compute embeddings for the document chunks. Many vector stores are available;
# in this case, we'll use FAISS.
# FAISS.from_documents embeds each chunk and stores the vectors in an index that we can search.
docsearch = FAISS.from_documents(chunks, embeddings)
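
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below illustrates that idea with a brute-force search over hand-made 3-dimensional vectors. The vectors, the `ToyIndex` class, and the choice of Euclidean distance are assumptions for illustration only; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and FAISS uses optimized index structures.

```python
import math


class ToyIndex:
    """Hypothetical brute-force index: stores (vector, text) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=4):
        """Return the k texts whose vectors are nearest to query_vector."""
        scored = sorted(
            self._items,
            key=lambda item: math.dist(item[0], query_vector),
        )
        return [text for _, text in scored[:k]]


index = ToyIndex()
index.add((0.9, 0.1, 0.0), "How to reset a password")
index.add((0.1, 0.8, 0.1), "Running an LVS check")
index.add((0.0, 0.2, 0.9), "Ordering a design kit")

# A made-up query vector that lies close to the second entry.
nearest = index.search((0.2, 0.7, 0.1), k=1)
print(nearest)
```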

## Similarity Search

In [1]:
query = "where i can find fatal violations list ?"
docs = docsearch.similarity_search(query, k=1)
docs



In [10]:
docs = docsearch.similarity_search(query)
docs

[Document(page_content="setting parameters in a header file. Using the PVS LVS GUI may be easier for most users to extract a\nlist of the primitive devices used in a given layout. 1. Open layout (read-only is OK) 2. PVS -> Run LVS\n3. Input -> exclude comparison setup 4. Output -> LVS Report -> Click 'SET' button. 5. Apply/Submit to\ninitiate the run. Results: Files/Output Files: Extraction report -> scroll to bottom of report. Or in the", metadata={'source': 'C:/Users/Lukas/Desktop/My_Projects/To_Upload/Llama2/llama2_projects/llama2_pdf_chatbot_faiss_windows/data/Hotline_Wiki.pdf', 'page': 20}),
 Document(page_content="for information only. There have been cases in some of our technologies where a primitive device may\nbe superseded by an improved version, or conceivably some issue under certain operating conditions\nof a device may have been identified -- in such cases, our Customer Projects department would use the\nPDL to check if such devices are used in a customer's design and ma

## Query the Documents and Get an Answer from Llama 2

In [11]:
MODEL_PATH = r"D:/llama2_quantized_models/7B_chat/llama-2-7b-chat.ggmlv3.q5_K_M.bin"

In [12]:
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.callbacks.manager import CallbackManager
from langchain.llms import LlamaCpp
from langchain.chains.question_answering import load_qa_chain

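# Offload layers to a CUDA GPU (this only takes effect if llama-cpp-python was built with GPU support)
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])  # stream tokens to stdout as they are generated
llm = LlamaCpp(
    model_path=MODEL_PATH,
    max_tokens=256,      # maximum number of tokens to generate per answer
    n_gpu_layers=35,     # number of model layers to offload to the GPU
    n_batch=512,         # number of prompt tokens processed in parallel
    callback_manager=callback_manager,
    n_ctx=1024,          # context window in tokens (prompt plus generated answer)
    verbose=False,
    temperature=0.2,     # low temperature for more deterministic answers
)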
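
Because `n_ctx` caps the prompt and the generated answer together, it helps to check that the retrieved chunks fit. The following back-of-the-envelope sketch assumes about 4 characters per token (a rough heuristic, not a measurement of the Llama 2 tokenizer), 4 retrieved chunks (the default `k` of `similarity_search`), and a hypothetical allowance of 100 tokens for the instructions and the question:

```python
def estimate_prompt_tokens(num_chunks, chunk_chars, overhead_tokens, chars_per_token=4):
    """Rough token estimate for a prompt that embeds num_chunks retrieved chunks."""
    context_tokens = (num_chunks * chunk_chars) // chars_per_token
    return context_tokens + overhead_tokens


N_CTX = 1024       # context window configured above
MAX_TOKENS = 256   # generation budget configured above

prompt_tokens = estimate_prompt_tokens(num_chunks=4, chunk_chars=500, overhead_tokens=100)
fits = prompt_tokens + MAX_TOKENS <= N_CTX
print(prompt_tokens, fits)
```

Under the same assumptions, 1000-character chunks would push the estimate past 1024 tokens, which is one reason to keep chunks small or to raise `n_ctx`.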

In [13]:
chain = load_qa_chain(llm, chain_type="stuff")
chain.run(input_documents=docs, question=query)

 To generate a PDL (primitive device list) file, you need to run LVS with specific command line options. The most common way is to use the PVS LVS GUI and follow these steps:
1. Open layout (read-only is OK).
2. PVS -> Run LVS.
3. Input -> Exclude comparison setup.
4. Output -> LVS Report -> Click 'SET' button.
5. Apply/Submit to initiate the run.
The resulting PDL file will contain a list of all primitive devices (including parasitic devices) used in your layout, along with their names and types.
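
The `stuff` chain type places all retrieved documents into a single prompt and sends it to the model in one call. A minimal sketch of that idea in plain Python follows. The template wording and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's actual prompt:

```python
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(documents, question):
    """Join all document texts into one context block and fill the template."""
    context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved = [
    "Using the PVS LVS GUI may be easier for most users",
    "1. Open layout (read-only is OK) 2. PVS -> Run LVS",
]
prompt = build_stuff_prompt(retrieved, "where i can find fatal violations list ?")
print(prompt)
```

Since every retrieved chunk ends up in the same prompt, this approach is simple but limited by the model's context window.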