Using spaCy:
```
conda install -c conda-forge spacy
conda install -c conda-forge cupy
python -m spacy download en_core_web_trf

pip install langchain pinecone-client PyPDF2
# maybe: conda install -c conda-forge -y ipykernel=6
# maybe: conda install -c anaconda -y notebook
```

Note: 
* Flan T5 XL max length is 512
* Flan T5 XXL max length is 1024
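
These limits determine how many retrieved chunks fit into one prompt. A minimal sketch of that arithmetic, using the limits from the note above and the 450-token chunk size used later in this notebook. The 40-token prompt allowance and the helper name `max_chunks` are assumptions for illustration only.

```python
# Limits as stated in the note above.
MAX_LENGTH = {"flan-t5-xl": 512, "flan-t5-xxl": 1024}


def max_chunks(model, chunk_tokens, prompt_tokens):
    """Number of chunks of `chunk_tokens` tokens that fit next to the prompt."""
    budget = MAX_LENGTH[model] - prompt_tokens
    return max(budget // chunk_tokens, 0)


# prompt_tokens=40 is a made-up allowance for the question and instructions.
print(max_chunks("flan-t5-xl", chunk_tokens=450, prompt_tokens=40))   # 1
print(max_chunks("flan-t5-xxl", chunk_tokens=450, prompt_tokens=40))  # 2
```

With these numbers only one 450-token chunk fits Flan T5 XL, so retrieving `k=3` chunks would need either smaller chunks or the larger model.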

# 0. Parse textbook (retain page numbers)

In [1]:
# parse textbook. 
# pip install PyPDF2
from PyPDF2 import PdfReader
 
# reader = PdfReader('../raw_data/notes/Student_Notes.pdf')
reader = PdfReader('../raw_data/patel_textbook/Yale Patt - Introduction to Computing Systems_ From Bits & Gates to C & Beyond.pdf')
print("Total pages: ", len(reader.pages))
 
# extracting text from page
textbook = []
for i, page in enumerate(reader.pages):
    text = page.extract_text().replace("\n", " ")
    # skip empty pages
    if text:
        textbook.append(dict(
                            text=text,
                            page_number=i, 
                            textbook_name='Yale-Patt_Sanjay-Patel--Intro_to_Computing_Systems'))

Total pages:  801


## Clean up useless pages

In [2]:
# manually drop the trailing entries (all appendix material)
textbook = textbook[:765]  # list index, not page number: empty pages were skipped during parsing
textbook[-1]  # should show page_number 779

{'text': 'D.9 Some Standard Library Functions 751 to byptr. The value passed to free must be a pointer to a previously allocated region of memory, otherwise errors could occur. D.9.4.3 randandsrand The C standard utility functions contain a function to generate a sequence of random numbers. The function is called rand . It does not generate a truly random sequence, however. Instead, it generates the same sequence of varying values based on an initial seed value. When the seed is changed, a diﬀerent sequence is generated. For example, when seeded with the value 10, the generator will always generate the same sequence of numbers. However, this sequence will be diﬀerent than the sequence generated by another seed value. The function rand has the following declaration: int rand(void) It returns a pseudo-random integer in the range 0 to RAND_MAX , which is at least 32,767. To seed the pseudo-random number generator, use the function srand . This function has the following declaration: void 

In [3]:
# manually drop the front matter: the first 25 non-empty entries, i.e. everything before page_number 29
textbook = textbook[25:]
textbook[0]  # should show page_number 29

{'text': '1CHAPTER Welcome Aboard 1.1 What We Will Try to Do Welcome to From Bits and Gates to C and Beyond . Our intent is to introduce you over the next xxx pages to the world of computing. As we do so, we have one objective above all others: to show you very clearly that there is no magic to computing. The computer is a deterministic system—every time we hit it over the head in the same way and in the same place (provided, of course, it was in the same starting condition), we get the same response. The computer is not an electronic genius; on the contrary, if anything, it is an electronic idiot, doing exactly what we tell it to do. It has no mind of its own. What appears to be a very complex organism is really just a very large, sys- tematically interconnected collection of very simple parts. Our job throughout this book is to introduce you to those very simple parts and, step-by-step, build the interconnected structure that you know by the name computer . Like a house, we will star
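
Because empty pages were skipped while parsing, list indices no longer match page numbers, which is why the slice bounds above had to be found by hand. A sketch of an alternative that filters on the stored `page_number` instead. The helper name `keep_page_range` is hypothetical, and the sample data is a stand-in for the parsed textbook.

```python
def keep_page_range(pages, first_page, last_page):
    """Keep entries whose page_number lies in [first_page, last_page]."""
    return [p for p in pages if first_page <= p["page_number"] <= last_page]


# Tiny stand-in for the parsed textbook; page 2 is missing, as a skipped empty page would be.
sample = [{"text": f"page {n}", "page_number": n} for n in (0, 1, 3, 4, 5)]
body = keep_page_range(sample, first_page=1, last_page=4)
print([p["page_number"] for p in body])  # [1, 3, 4]
```

For this textbook the equivalent call would presumably be `keep_page_range(textbook, 29, 779)`, going by the page numbers in the comments above.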

In [4]:
# save the cleaned textbook pages to a JSON file
import json
with open('./patel_textbook_cleaned.json', 'w') as f:
    json.dump(textbook, f)

print(len(textbook)) # 740 pages remaining in patel textbook

740
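
A sketch of reading the file back, so later sessions can skip the PDF parsing step. The helper names are hypothetical. `ensure_ascii=False` is an optional choice that keeps characters such as the `ﬁ` ligature readable in the file.

```python
import json
import os
import tempfile


def save_pages(pages, path):
    """Write the list of page dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pages, f, ensure_ascii=False)


def load_pages(path):
    """Read the list of page dicts back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Round trip on a one-page stand-in, written to a temporary directory.
sample = [{"text": "deﬁnition of the LC-3", "page_number": 29, "textbook_name": "demo"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pages.json")
    save_pages(sample, path)
    restored = load_pages(path)

print(restored == sample)  # True
```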


In [5]:
metadatas = [dict(page_number=page['page_number'], textbook_name=page['textbook_name']) for page in textbook]
textbook_texts = [page['text'] for page in textbook]
assert len(textbook_texts) == len(metadatas), 'must be equal sizes'

## 1. Split the textbook text into context-sized chunks

In [6]:
from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter, RecursiveCharacterTextSplitter

# good examples here: https://langchain.readthedocs.io/en/latest/modules/utils/combine_docs_examples/textsplitter.html
# chunk length is measured in Flan-T5 tokens so that each chunk fits the model's max length
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xl')
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=450,
    chunk_overlap=50,
    separators=[".", " "],  # a list of separators, tried in order
)
# texts = text_splitter.split_text(textbook)
texts = text_splitter.create_documents(texts=textbook_texts, metadatas=metadatas)
print(len(texts))

# chunk_size 250 -> 2723 chunks
# chunk_size 350 -> 1692 chunks
# chunk_size 450 -> 1397 chunks

1397
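
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not LangChain's algorithm (which splits on separators and then merges the pieces), and it counts whitespace-separated words rather than Flan-T5 tokens. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words; neighbours share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(demo, chunk_size=4, chunk_overlap=1)
for c in chunks:
    print(c)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

A smaller `chunk_size` produces more chunks, which matches the counts recorded above (2723 chunks at 250 versus 1397 at 450).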


In [8]:
texts[200:210]

[Document(page_content='In Chapter 5, we will begin the complete deﬁnition of the LC-3 computer. We will see that the memory address space of t h eL C - 3i s216, and the addressability is 16 bits. Recall from Chapter 3 that we access memory by providing the address from which we wish to read, or to which we wish to write. To read the contents of a', lookup_str='', metadata={'page_number': 150, 'textbook_name': 'Yale-Patt_Sanjay-Patel--Intro_to_Computing_Systems'}, lookup_index=0),
 Document(page_content='4.1Basic Components 123 00000110 00000100000 001 010 011 100 101 110 111 Figure 4.2 Location 6 contains the value 4; location 4 contains the value 6. memory location, we ﬁrst place the address of that location in the memory’s address register ( MAR ) and then interrogate the computer’s memory. The information stored in the location having that address will be placed in the memory’s data register ( MDR ). To write (or store) a value in a memory location, we ﬁrst write the address of the

## 2. Embed each context and save it to a vector database (hosted by Pinecone)

In [9]:
from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings #OpenAIEmbeddings, 
import pinecone
# from sentence_transformers import SentenceTransformer
# See the docs here, (search for pinecone): https://langchain.readthedocs.io/en/latest/reference/modules/vectorstore.html

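import os

# read the API key from the environment instead of hardcoding it (the variable name is an assumption)
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-west1-gcp")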

model_name = "intfloat/e5-large"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

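# embed the chunks produced in step 1 (`texts`), keeping each chunk's page metadata
pinecone_index = Pinecone.from_texts(
    texts=[doc.page_content for doc in texts],
    metadatas=[doc.metadata for doc in texts],
    embedding=embeddings,
    index_name="uiuc-chatbot"
)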

# see the Pinecone index here (requires auth): https://app.pinecone.io/organizations/-NF9ryDGePT7APP6wrFM/projects/us-west1-gcp:32dcf9c/indexes/uiuc-chatbot-2

No sentence-transformers model found with name /home/kastanday/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
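
The warning above names its own fix. A minimal sketch, assuming it runs in the first cell, before any tokenizer is used:

```python
import os

# Set this before the tokenizer is first used, so forked worker processes
# do not trigger the huggingface/tokenizers parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```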


# ✅ Done with critical steps, the rest is for demonstration only

## 3. Easily run similarity search

In [13]:
# Full code to run Pinecone search during inference.

# embeddings = HuggingFaceEmbeddings(model_name=model_name)
# pinecone.init(api_key="***", environment="us-west1-gcp")
# pinecone_index = pinecone.Index("uiuc-chatbot")
# vectorstore = Pinecone(index=pinecone_index, embedding_function=embeddings.embed_query, text_key="text")
# question = "What is a finite state machine in electrical engineering?"
# relevant_context_list = vectorstore.similarity_search(question, k=3)

# for d in relevant_context_list:
#     print(d.page_content)
#     print(d.metadata['page_number'], d.metadata['textbook_name'])

In [10]:
# Easily run similarity search on the Pinecone index
question = "What is a finite state machine in electrical engineering?"
relevant_context_list = pinecone_index.similarity_search(question, k=3)

for d in relevant_context_list:
    print(d.page_content)
    print(d.metadata['page_number'], d.metadata['textbook_name'])

page_content='82  3.1.3 Finite State Machines Aﬁnite state machine (orFSM) is a model for understanding the behavior of a system by describin g the system as occupying one of a ﬁnite set of states, moving betwe en these states in response to external inputs, and producing external outputs. In any given state, a pa rticular input may cause the FSM to move to another state; this combination is called a transition rule . An FSM comprises ﬁve parts: a ﬁnite set of states, a set of possible inputs, a set of possible outputs, a set of transition rules, and methods for calculating outputs. When an FSM is implemented as a digital system, all states must be rep resented as patterns using a ﬁxed number of bits, all inputs must be translated into bits, and all outpu ts must be translated into bits. For a digital FSM, transition rules must be complete ; in other words, given any state of the FSM, and any pattern of input bits, a transition must be deﬁned from that state to another state (transitio
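
One detail that may affect retrieval quality: the model card for the E5 family asks that every input start with `query: ` or `passage: `, and the `HuggingFaceEmbeddings` wrapper used above does not appear to add these prefixes itself. A sketch of two small helpers (hypothetical names) that could be applied to the texts before indexing and to the question before searching:

```python
def as_passage(text):
    """Prefix a document chunk the way E5 models expect indexed text."""
    return "passage: " + text.strip()


def as_query(text):
    """Prefix a user question the way E5 models expect search queries."""
    return "query: " + text.strip()


question = "What is a finite state machine in electrical engineering?"
print(as_query(question))
print(as_passage("  A finite state machine (or FSM) is a model ...  "))
```

If the prefixes are used, they have to be used on both sides: the index would need to be rebuilt with `as_passage` applied to each text, and every search would go through `as_query`.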

In [32]:
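import torch
from transformers import pipeline

# NOTE: 'roberta-large' is the base checkpoint; its question-answering head is newly
# initialized (see the warning below), so the answers and scores are not meaningful.
reader = pipeline(
    tokenizer='roberta-large',
    model='roberta-large',
    task='question-answering',
    device='cuda:0' if torch.cuda.is_available() else 'cpu'
)

# run extractive QA over each retrieved context
question = "What is a programmable logic array (PLA)?"
for doc in relevant_context_list:
    answer = reader(question=question, context=doc.page_content)
    print(answer)
    print(doc.page_content)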

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForQuestionAnswering: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-large and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to us

{'score': 3.074671985814348e-05, 'start': 1981, 'end': 2020, 'answer': 'wishes to secure a bicycle with a lock,'}
3.6 Sequential Logic Circuits 79 Combinational logic ci rcuit Sto rage elementsOutput Input Figure 3.22 Sequential logic circuit block diagram. In this section, we discuss digital logic structures that can both process infor- mation (i.e., make decisions) andstore information. That is, these structures base their decisions not only on the input values now present, but also (and this is very important) on what has happened before. These structures are usually called sequential logic circuits . They are distinguishable from combinational logic cir- cuits because, unlike combinational logic circuits, they contain storage elements that allow them to keep track of prior history information. Figure 3.22 shows a block diagram of a sequential logic circuit. Note the storage elements. Note also that the output can be dependent on both the inputs now and the values stored in the stor
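
A score of about 3e-05, as in the output above, means the reader found no real answer in that context. A sketch of collecting the per-context answers and keeping only the best one above a threshold. The helper name `best_answer` and the 0.1 threshold are assumptions, and the second answer in the sample data is made up for illustration.

```python
def best_answer(answers, min_score=0.1):
    """Return the highest-scoring answer dict, or None if none reaches min_score."""
    confident = [a for a in answers if a["score"] >= min_score]
    if not confident:
        return None
    return max(confident, key=lambda a: a["score"])


answers = [
    # Actual reader output from the cell above.
    {"score": 3.074671985814348e-05, "start": 1981, "end": 2020,
     "answer": "wishes to secure a bicycle with a lock,"},
    # Made-up entry showing what a confident answer would look like.
    {"score": 0.62, "start": 10, "end": 42, "answer": "hypothetical high-confidence span"},
]

print(best_answer(answers[:1]))          # None
print(best_answer(answers)["answer"])    # hypothetical high-confidence span
```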