Using spacy <br>
```
conda install -c conda-forge spacy
conda install -c conda-forge cupy
python -m spacy download en_core_web_trf

pip install langchain pinecone-client PyPDF2
# maybe: conda install -c conda-forge -y ipykernel=6
# maybe: conda install -c anaconda -y notebook
```

Note: 
* Flan T5 XL max length is 512
* Flan T5 XXL max length is 1024

# 0. Parse textbook (retain page numbers)

In [1]:
# pip install PyPDF2
from PyPDF2 import PdfReader
 
# git clone `non-public-datasets` repo
reader = PdfReader('../../non-public-datasets/cleaned_data/patel_textbook/patel_short_and_clean.pdf')
print("Total pages: ", len(reader.pages))
 
# extracting text from page
textbook = []
for i, page in enumerate(reader.pages):
    text = page.extract_text() #.replace("\n", " ")
    # skip empty pages
    if text:
        textbook.append(dict(
                            text=text,
                            page_number=i, 
                            textbook_name='Yale-Patt_Sanjay-Patel--Intro_to_Computing_Systems'))

Total pages:  253


In [2]:
textbook[9]['text']

'come Aboard/one.tnum./one.tnumWhat We Will Try to DoWelcome toFrom Bits and Gates to C and Beyond.O u ri n t e n ti st oi n t r o d u c eyou over the next xxx pages to the world of computing. As we do so, we haveone objective above all others: to show you very clearly that there is no magic tocomputing. The computer is a deterministic system—every time we hit it over thehead in the same way and in the same place (provided, of course, it was in the samestarting condition), we get the same response. The computer is not an electronicgenius; on the contrary, if anything, it is an electronic idiot, doing exactly whatwe tell it to do. It has no mind of its own.What appears to be a very complex organism is really just a very large, sys-tematically interconnected collection of very simple parts. Our job throughoutthis book is to introduce you to those very simple parts and, step-by-step, build theinterconnected structure that you know by the namecomputer.L i k eah o u s e ,w ewill start at th

In [3]:
textbook_cleaned = []
for i, page in enumerate(textbook):
    # print(page['page_number'], page['textbook_name'])
    text = page['text']
    for _ in range(3):  
      text = text.replace("/zero.tnum", "") \
                        .replace("/zero.tnum.", "") \
                        .replace("/one.tnum", "") \
                        .replace("/one.tnum.", "") \
                        .replace("/two.tnum", "") \
                        .replace("/two.tnum.", "") \
                        .replace("/three.tnum", "") \
                        .replace("/three.tnum.", "") \
                        .replace("/four.tnum", "") \
                        .replace("/four.tnum.", "") \
                        .replace("/five.tnum", "") \
                        .replace("/five.tnum.", "") \
                        .replace("/six.tnum", "") \
                        .replace("/six.tnum.", "") \
                        .replace("/seven.tnum", "") \
                        .replace("/seven.tnum.", "") \
                        .replace("/eight.tnum", "") \
                        .replace("/eight.tnum.", "") \
                        .replace("/nine.tnum", "") \
                        .replace("/nine.tnum.", "") \
                        .replace("FiuslochthisknboagthrewTmbo", "") 
      # print()
    if text:
      print(i, text[:100])
      textbook_cleaned.append(dict(
                            text=text,
                            page_number=i, 
                            textbook_name='Yale-Patt_&_Sanjay-Patel--Intro_to_Computing_Systems'))

0 Why the Book HappenedThis textbook evolved from EECS 100, the ﬁrst computing course for computerscie
1 doing it, had its shortcomings. We decided that the reason students were not get-ting it was that th
2 The Addition of C++We’ve had an ongoing debate about how to extend our approach and textbookto C++. 
3 and see how these structures are actually organized in memory. We moved our dis-cussion of subroutin
4 The LC-3 is a 16-bit architecture that includes physical I/O via keyboardand monitor, TRAPs to the o
5 under program control. Both are supported by our LC-3 simulator so the studentcan write interrupt dr
6 Chapter 17 teaches recursion, using the student’s newly gained knowledge offunctions, stack frames, 
7 education, but we feel they are better suited to a later course in computerarchitecture and design. 
8 In both cases, students appreciated starting with the LC-3, and their subsequentintroduction to ARM 
9 come Aboard.What We Will Try to DoWelcome toFrom Bits and Gates to C an

In [4]:
for i, page in enumerate(textbook_cleaned):
# for i, page in enumerate(textbook):
    # print(page['page_number'], page['textbook_name'])
    print(page['text'])
    print()

Why the Book HappenedThis textbook evolved from EECS 100, the ﬁrst computing course for computerscience, computer engineering, and electrical engineering majors at the Univer-sity of Michigan, Ann Arbor, that Kevin Compton and the ﬁrst author introducedfor the ﬁrst time in the fall term, 1995.EECS 100 happened at Michigan because Computer Science and Engi-neering faculty had been dissatisﬁed for many years with the lack of studentcomprehension of some very basic concepts. For example, students had a lotof trouble with pointer variables. Recursion seemed to be “magic,” beyondunderstanding.We decided in 1993 that the conventional wisdom of starting with a high-level programming language, which was the way we (and most universities) were

doing it, had its shortcomings. We decided that the reason students were not get-ting it was that they were forced to memorize technical details when they did notunderstand the basic underpinnings.Our result was the bottom-up approach taken in this book,

## Save textbook to json

In [5]:
# save all_text json to file

import json
with open('../../non-public-datasets/cleaned_data/patel_textbook/patel_only_ECE120_content_keep_diagrams.json', 'w') as f:
    json.dump(textbook_cleaned, f)

print(len(textbook_cleaned)) # 740 pages remaining in patel textbook

253


In [6]:
metadatas = [dict(page_number=page['page_number'], textbook_name=page['textbook_name']) for page in textbook]
textbook_texts = [page['text'] for page in textbook_cleaned]
assert len(textbook_texts) == len(metadatas), 'must be equal sizes'

## 1. Split the textbook string into context-size contexts.

In [7]:
from langchain import text_splitter
from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter, RecursiveCharacterTextSplitter

# good examples here: https://langchain.readthedocs.io/en/latest/modules/utils/combine_docs_examples/textsplitter.html
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xxl')
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=350, chunk_overlap=80, separators = ". ")
# texts = text_splitter.split_text(textbook)
texts = text_splitter.create_documents(texts=textbook_texts, metadatas=metadatas)
print(len(texts))

# full textbook 
# 250 --> 2723
# 350 --> 1692
# 450 chunks -> 1397

647


In [8]:
texts[200:250]

[Document(page_content='It does not matter which state since wenever use that information untilafterwe have set it to 1 or 0...The Gated D LatchTo be useful, it is necessary to control when a latch is set and when it is cleared.As i m p l ew a yt oa c c o m p l i s ht h i si sw i t ht h eg a t e dl a t c h .Figure 3.19 shows a logic circuit that implements a gatedDlatch. It consistsof the R-S latch of Figure 3.18, plus two additional NAND gates that allow the', lookup_str='', metadata={'page_number': 68, 'textbook_name': 'Yale-Patt_Sanjay-Patel--Intro_to_Computing_Systems'}, lookup_index=0),
 Document(page_content='RFigure.Ag a t e dDl a t c h .latch to be set to the value ofD,b u tonlywhen WE is asserted (i.e., when WEequals 1). WE stands forwrite enable. When WE is not asserted (i.e., when WEequals 0), the outputsSandRare both equal to 1. SinceSandRare inputs to theR-S latch, if they are kept at 1, the value stored in the latch remains unchanged,as we explained in Section 3.4.1. When

## 2. Embed each context, and save it to a vector database, this one is hosted by Pinecone.

In [9]:
from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings #OpenAIEmbeddings, 
import pinecone
# from sentence_transformers import SentenceTransformer
# See the docs here, (search for pinecone): https://langchain.readthedocs.io/en/latest/reference/modules/vectorstore.html


pinecone.init(api_key='87823627-c1f4-48fe-9c36-3d19d3dd29bb', environment="us-west1-gcp")

model_name = "intfloat/e5-large"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

pinecone_index = Pinecone.from_texts(
    texts=textbook_texts,
    metadatas=metadatas,
    embedding=embeddings,
    index_name="uiuc-chatbot" 
)

# see the Pinecone index here (requires auth): https://app.pinecone.io/organizations/-NF9ryDGePT7APP6wrFM/projects/us-west1-gcp:32dcf9c/indexes/uiuc-chatbot-2

No sentence-transformers model found with name /home/kastanday/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

# ✅ Done with critical steps, the rest is for demonstration only

## 3. Easily run simliarity search

In [10]:
# Full code to run Pinecone search during inference.

# embeddings = HuggingFaceEmbeddings(model_name=model_name)
# pinecone.init(api_key="***", environment="us-west1-gcp")
# pincecone_index = pinecone.Index("uiuc-chatbot")
# vectorstore = Pinecone(index=pincecone_index, embedding_function=embeddings, text_key="text")
# question = "What is a finite state machine in electrical engineering?"
# relevant_context_list = pinecone_index.similarity_search(question, k=3)

# for d in relevant_context_list:
#     print(d.page_content)
#     print(d.metadata['page_number'], d.metadata['textbook_name'])

In [11]:
# Easily run similarity search on the Pinecone index
question = "What is a finite state machine in electrical engineering?"
relevant_context_list = pinecone_index.similarity_search(question, k=5)

for d in relevant_context_list:
    print(d.page_content)
    # print(d.metadata['page_number'], d.metadata['textbook_name'])
    print()

82  3.1.3 Finite State Machines Aﬁnite state machine (orFSM) is a model for understanding the behavior of a system by describin g the system as occupying one of a ﬁnite set of states, moving betwe en these states in response to external inputs, and producing external outputs. In any given state, a pa rticular input may cause the FSM to move to another state; this combination is called a transition rule . An FSM comprises ﬁve parts: a ﬁnite set of states, a set of possible inputs, a set of possible outputs, a set of transition rules, and methods for calculating outputs. When an FSM is implemented as a digital system, all states must be rep resented as patterns using a ﬁxed number of bits, all inputs must be translated into bits, and all outpu ts must be translated into bits. For a digital FSM, transition rules must be complete ; in other words, given any state of the FSM, and any pattern of input bits, a transition must be deﬁned from that state to another state (transitions from a stat

In [13]:
# Easily run similarity search on the Pinecone index
question = "What is a LC-3?"
relevant_context_list = pinecone_index.similarity_search(question, k=3)

for d in relevant_context_list:
    print(d.page_content)
    print(d.metadata['page_number'], d.metadata['textbook_name'])

The ISA speciﬁes the memory organization, register set, and instruction set,including the opcodes, data types, and addressing modes of the instructions inthe instruction set...Memory OrganizationThe LC-3 memory has an address space of 216(i.e., 65,536) locations, and anaddressability of 16 bits. Not all 65,536 addresses are actually used for memorylocations, but we will leave that discussion for Chapter 9. Since the normal unitof data that is processed in the LC-3 is 16 bits, we refer to 16 bits as oneword,and we say the LC-3 isword-addressable...RegistersSince it usually takes far more than one clock cycle to obtain data from mem-ory, the LC-3 provides (like almost all computers) additional temporary storagelocations that can be accessed in a single clock cycle.The most common type of temporary storage locations, and the one used inthe LC-3, is a set of registers. Each register in the set is called ageneral purposeregister(GPR). Like memory locations, registers store information that 

In [None]:
import torch
from transformers import pipeline

reader = pipeline(
  tokenizer='roberta-large',
  model='roberta-large',
  task='question-answering',
  device='cuda:0' if torch.cuda.is_available() else 'cpu'
)

question="What is a programmable logic array (PLA)?"
for doc in relevant_context_list:
  answer = reader(question=question, context=doc.page_content)
  print(answer)
  print(doc.page_content)