Using spaCy
```
conda install -c conda-forge spacy
conda install -c conda-forge cupy
python -m spacy download en_core_web_trf

pip install langchain pinecone-client PyPDF2
# maybe: conda install -c conda-forge -y ipykernel=6
```

Note: 
* Flan T5 XL max length is 512 tokens
* Flan T5 XXL max length is 1024 tokens

# 0. Parse textbook (retain page numbers)

In [1]:
# parse textbook. 
# pip install PyPDF2
from PyPDF2 import PdfReader
 
reader = PdfReader('../../non-public-datasets/raw_data/ece120-notes/Student_Notes.pdf')
# reader = PdfReader('../raw_data/patel_textbook/Yale Patt - Introduction to Computing Systems_ From Bits & Gates to C & Beyond.pdf')
print("Total pages: ", len(reader.pages))
 
# extracting text from page
textbook = []
for i, page in enumerate(reader.pages):
    text = page.extract_text().replace("\n", " ").replace("c/⌋ir⌋l⌉⌋opyrt2000-2017 Steven S. Lumetta. All rights reserved.", "")
    # skip empty pages
    if text:
        textbook.append(dict(
                            text=text,
                            page_number=i, 
                            textbook_name='ECE-120-student-notes'))

Total pages:  164
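
The per-page cleaning above can be isolated into a small function, which makes it easy to check on toy input before running it over the whole PDF. This is a minimal sketch, not the notebook's code: `build_page_records` and the `"(c) Example footer"` string are hypothetical, and two behaviors differ from the cell above by assumption (page numbers are 1-based here, and the cleaned text is stripped of surrounding whitespace).

```python
def build_page_records(page_texts, boilerplate, textbook_name):
    """Turn raw per-page strings into cleaned records, skipping empty pages."""
    records = []
    for i, raw in enumerate(page_texts):
        # Flatten newlines, drop the repeated footer, trim leftover whitespace.
        text = (raw or "").replace("\n", " ").replace(boilerplate, "").strip()
        if text:  # skip pages that are empty after cleaning
            records.append(
                {
                    "text": text,
                    "page_number": i + 1,  # assumption: 1-based, unlike the cell above
                    "textbook_name": textbook_name,
                }
            )
    return records


# Toy input standing in for page.extract_text() results (hypothetical data).
pages = ["Title page\n(c) Example footer", "", "(c) Example footer", "Chapter 1\nBits"]
records = build_page_records(pages, "(c) Example footer", "demo-notes")
print(records)
```

Because empty pages are skipped, the list index of a record and its `page_number` can drift apart, which is why the page number is stored explicitly.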


## Clean up useless pages

In [2]:
# delete the first 6 pages (164 -> 158), that's all the cleaning we're doing.
textbook = textbook[6:]
print(textbook[0])

# save cleaned version to file
import json
with open('./student_notes_cleaned.json', 'w') as f:
    json.dump(textbook, f)

{'text': ' 1 ECE120: Introduction to Computer Engineering Notes Set 1.1 The Halting Problem For some of the topics in this course, we plan to cover the material m ore deeply than does the textbook. We will provide notes in this format to supplement the textbook for this purpose. In order to make these notes more useful as a reference, deﬁnitions are highlighted with boldfac e, and italicization emphasizes pitfalls or other important points. Sections marked with an asterisk are provided solely for you r interest, but you probably need to learn this material in later classes. These notes are broken up into four parts, corresponding to the three midterm exams and the ﬁnal exam. Each part is covered by one examination in our class. The last section of each of the four parts gives you a summary of material that you are expected to know for the cor responding exam. Feel free to read it in advance. As discussed in the textbook and in class, a universal computational device (orcomputing machin

In [3]:
len(textbook) # 158 pages remaining in Student Notes. 

158

In [4]:
metadatas = [dict(page_number=page['page_number'], textbook_name=page['textbook_name']) for page in textbook]
textbook_texts = [page['text'] for page in textbook]
assert len(textbook_texts) == len(metadatas), 'must be equal sizes'

## 1. Split the textbook text into context-sized chunks.

In [11]:
from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter, RecursiveCharacterTextSplitter

# good examples here: https://langchain.readthedocs.io/en/latest/modules/utils/combine_docs_examples/textsplitter.html
tokenizer = AutoTokenizer.from_pretrained('OpenAssistant/oasst-sft-1-pythia-12b')
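# `separators` expects a list; a bare string is iterated character by character.
# Note: chunk counts may differ slightly from the outputs recorded below.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=682, chunk_overlap=100, separators=[". "])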
# texts = text_splitter.split_text(textbook)
texts = text_splitter.create_documents(texts=textbook_texts, metadatas=metadatas)
print("Num chunks", len(texts))

# Observed passage counts for different chunk_size values (in tokens):
# chunk_size 250 --> 2723 passages
# chunk_size 450 --> 452 passages
# chunk_size 682 --> 286 passages

# must be bigger than 50 chars
clean_meta = [text.metadata     for text in texts if len(text.page_content) > 50]
clean_text = [text.page_content for text in texts if len(text.page_content) > 50]
print("Num chunks after filtering short ones",len(clean_text))


Num chunks 286
Num chunks after filtering short ones 283
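
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of sentence packing with overlap. It is not LangChain's implementation: `split_with_overlap` is a hypothetical helper, and it measures length in characters with `len` by default, whereas the cell above measures length in tokenizer tokens.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator=". ", length_fn=len):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size, carrying up to chunk_overlap of trailing text into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return length_fn(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa. bbbb. cccc. dddd", chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

In this sketch a single piece longer than `chunk_size` is still emitted as one oversized chunk, so a later length check remains useful.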


## 2. Embed each context and save it to a vector database (hosted by Pinecone).

In [12]:
from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings #OpenAIEmbeddings, 
import pinecone
from dotenv import load_dotenv
import os

# load API keys from globally-available .env file
load_dotenv(dotenv_path='/mnt/project/chatbotai/huggingface_cache/internal_api_keys.env', override=True)

pinecone.init(api_key=os.environ['PINECONE_API_KEY_NEW_ACCT'], environment="us-east4-gcp")

model_name = "intfloat/e5-large"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

pinecone_index = Pinecone.from_texts(
    texts=clean_text,
    metadatas=clean_meta,
    embedding=embeddings,
    index_name="uiuc-chatbot-deduped" 
)

No sentence-transformers model found with name /home/kastanday/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
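
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch follows, assuming it runs at the top of the notebook before any tokenizer is used; choosing `"false"` trades tokenizer speed for quiet, fork-safe behavior.

```python
import os

# Set this before importing transformers / sentence-transformers so the
# tokenizers library sees the value before any process fork happens.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```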

# Save CSV

In [13]:
# make a dataframe from clean_text and clean_meta (embeddings are stored in Pinecone, not here)
import pandas as pd
df = pd.DataFrame({'text': clean_text, 'metadata': clean_meta})
df.to_parquet('../finalized_datasets/Lumetta_student_notes-chunk_size_682-chunk_overlap_100.parquet', index=False)
df.to_csv('../finalized_datasets/Lumetta_student_notes-chunk_size_682-chunk_overlap_100.csv', index=False)
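
One caveat with the CSV export: a column of dicts is written as each dict's string representation, which has to be parsed back on load. A possible alternative is to flatten the metadata into its own columns first. The sketch below uses only the standard library; `flatten_rows` and the demo rows are hypothetical, and the resulting list of dicts could equally be passed to `pd.DataFrame(rows)`.

```python
import csv
import io


def flatten_rows(texts, metadatas):
    """Merge each chunk's text with its metadata keys into one flat row."""
    return [{"text": t, **m} for t, m in zip(texts, metadatas)]


# Toy stand-ins for clean_text / clean_meta (hypothetical values).
demo_text = ["chunk one", "chunk two"]
demo_meta = [
    {"page_number": 6, "textbook_name": "ECE-120-student-notes"},
    {"page_number": 7, "textbook_name": "ECE-120-student-notes"},
]
rows = flatten_rows(demo_text, demo_meta)

# Write to an in-memory buffer instead of a file for the demo.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```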

# ✅ Done with critical steps, the rest is for demonstration only

## 3. Easily run similarity search

In [10]:
# Easily run similarity search on the Pinecone index
question = "What is a finite state machine in electrical engineering?"
relevant_context_list = pinecone_index.similarity_search(question, k=3)

for d in relevant_context_list:
    print(d)
    print()

page_content='82  3.1.3 Finite State Machines Aﬁnite state machine (orFSM) is a model for understanding the behavior of a system by describin g the system as occupying one of a ﬁnite set of states, moving betwe en these states in response to external inputs, and producing external outputs. In any given state, a pa rticular input may cause the FSM to move to another state; this combination is called a transition rule . An FSM comprises ﬁve parts: a ﬁnite set of states, a set of possible inputs, a set of possible outputs, a set of transition rules, and methods for calculating outputs. When an FSM is implemented as a digital system, all states must be rep resented as patterns using a ﬁxed number of bits, all inputs must be translated into bits, and all outpu ts must be translated into bits. For a digital FSM, transition rules must be complete ; in other words, given any state of the FSM, and any pattern of input bits, a transition must be deﬁned from that state to another state (transitio
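
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that ranking step on toy 2-D vectors. It assumes cosine similarity; the metric actually used by the Pinecone index is chosen when the index is created and is not shown in this notebook. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings (hypothetical); real e5-large vectors have 1024 dimensions.
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [0.9, 0.1]
print(top_k(query_vec, doc_vecs, k=2))
```

A hosted index performs the same ranking with approximate nearest-neighbor search rather than this exhaustive scan.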

In [17]:
# Easily run similarity search on the Pinecone index
question = "What is LC-3?"
relevant_context_list = pinecone_index.similarity_search(question, k=3)

for d in relevant_context_list:
    print(d.page_content)
    print(d.metadata)
    print()

4.3 Instruction Set Architecture* 147 ECE120: Introduction to Computer Engineering Notes Set 4.3 Instruction Set Architecture* This set of notes discusses tradeoﬀs and design elements of instru ction set architectures (ISAs). The material is beyond the scope of our class, and is provided purely for yo ur interest. Those who ﬁnd these topics interesting may also want to read the ECE391 notes, which describe similar material with a focus on the x86 ISA. As you know, the ISA deﬁnes the interface between software and hardware, abstracting the capabilities of a computer’s datapath and standardizing the format of instructio ns to utilize those capabilities. Successful ISAs are rarely discarded, as success implies the existence of large amounts of software built to use the ISA. Rather, they are extended, and their original forms must be supp orted for decades (consider, for example, the IBM 360 and the Intel x86). Employing sound design principles is t hus imperative in an ISA. 4.3.1 Formats 

In [32]:
import torch
from transformers import pipeline

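# Note: 'roberta-large' is the base checkpoint, not one fine-tuned for question
# answering (see the warning in the output), so the scores are not reliable.
reader = pipeline(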
  tokenizer='roberta-large',
  model='roberta-large',
  task='question-answering',
  device='cuda:0' if torch.cuda.is_available() else 'cpu'
)

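question = "What is a programmable logic array (PLA)?"
# retrieve contexts for this question rather than reusing the previous query's results
relevant_context_list = pinecone_index.similarity_search(question, k=3)
for doc in relevant_context_list:
  answer = reader(question=question, context=doc.page_content)
  print(answer)
  print(doc.page_content)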

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForQuestionAnswering: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-large and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to us

{'score': 3.074671985814348e-05, 'start': 1981, 'end': 2020, 'answer': 'wishes to secure a bicycle with a lock,'}
3.6 Sequential Logic Circuits 79 Combinational logic ci rcuit Sto rage elementsOutput Input Figure 3.22 Sequential logic circuit block diagram. In this section, we discuss digital logic structures that can both process infor- mation (i.e., make decisions) andstore information. That is, these structures base their decisions not only on the input values now present, but also (and this is very important) on what has happened before. These structures are usually called sequential logic circuits . They are distinguishable from combinational logic cir- cuits because, unlike combinational logic circuits, they contain storage elements that allow them to keep track of prior history information. Figure 3.22 shows a block diagram of a sequential logic circuit. Note the storage elements. Note also that the output can be dependent on both the inputs now and the values stored in the stor
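
A score of about 3e-05, as in the output above, suggests the extracted span should not be trusted. One simple way to handle this is to keep only the best-scoring answer across contexts and reject it below a threshold. This is a sketch under assumptions: `best_answer` is a hypothetical helper, the `0.1` threshold is arbitrary, and the second candidate is invented purely for illustration.

```python
def best_answer(answers, min_score=0.1):
    """Return the highest-scoring answer dict, or None if none clears min_score."""
    if not answers:
        return None
    top = max(answers, key=lambda a: a["score"])
    return top if top["score"] >= min_score else None


candidates = [
    # Shape matches the question-answering pipeline output printed above.
    {"score": 3.074671985814348e-05, "start": 1981, "end": 2020,
     "answer": "wishes to secure a bicycle with a lock,"},
    # Hypothetical second candidate, not a real model output.
    {"score": 0.42, "start": 10, "end": 40, "answer": "hypothetical better span"},
]

print(best_answer(candidates))
print(best_answer(candidates[:1]))  # only the low-score answer: rejected
```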

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    separators=['. '],  # RecursiveCharacterTextSplitter takes a list named `separators`
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)

In [53]:
print(len(texts))

1


In [33]:
# TODO: check that every chunk is shorter than chunk_size (measured in tokens)

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xl')
tokens = tokenizer.encode(texts[22], return_tensors='pt')
print(tokens)
print(tokens.shape)

Token indices sequence length is longer than the specified maximum sequence length for this model (3793 > 512). Running this sequence through the model will result in indexing errors


tensor([[  101,  1597, 30587,  ...,  2530,     5,     1]])
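
The warning above (3793 > 512) is what the TODO is meant to catch. A minimal sketch of such a check follows. `find_oversized` is a hypothetical helper, and its default token counter is a whitespace split used only so the example runs without downloading a model; with a real tokenizer one would pass `count_tokens=lambda s: len(tokenizer.encode(s))`.

```python
def find_oversized(chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Return (index, token_count) for every chunk longer than max_tokens."""
    oversized = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized


# Toy chunks (hypothetical); whitespace words stand in for tokens here.
demo_texts = ["a short chunk", "one two three four five six"]
print(find_oversized(demo_texts, max_tokens=5))
```

An empty result means every chunk fits the model's limit under the chosen counter.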


In [None]:

flan_xl_text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(AutoTokenizer.from_pretrained('google/flan-t5-xl'), chunk_size=300, chunk_overlap=0)
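# `textbook` is a list of dicts, so join the page strings instead.
# Newlines were replaced with spaces during parsing, so the default "\n\n" separator
# finds nothing to split on and the result may be a single chunk.
texts = flan_xl_text_splitter.split_text(" ".join(textbook_texts))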
# print(texts[0])
# print(len(texts))