
## Note on context windows
* Flan T5 XL max length is 512
* Flan T5 XXL max length is 1024
* OpenAssistant-pythia-12b max length is 2048
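
These limits bound how many retrieved chunks fit in a single prompt. Below is a rough sketch of that arithmetic; the function name and the reserved-token budget are illustrative assumptions, not part of this notebook. Three chunks of 682 tokens total 2046 tokens, just under the 2048-token window of OpenAssistant-pythia-12b, which may be why 682 is used as the chunk size in step 1 (that reading is an inference).

```python
def chunks_that_fit(context_window: int, chunk_size: int, reserved_tokens: int = 0) -> int:
    """Return how many chunk_size-token chunks fit in the window after
    reserving tokens for the question and the generated answer."""
    available = context_window - reserved_tokens
    if available <= 0 or chunk_size <= 0:
        return 0
    return available // chunk_size


print(chunks_that_fit(2048, 682))       # 3
print(chunks_that_fit(512, 682))        # 0
print(chunks_that_fit(2048, 682, 256))  # 2
```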

## 0. Parse the textbook (retain page numbers)

The old way used `PyPDF2`, and the text it extracted came out strange, so we dropped it.

The new way uses `fitz` (the import name of PyMuPDF).

If we need even more detail (unlikely for this particular textbook), we can use `pdfplumber` and `pdfminer.six` to parse the PDF.

In [None]:
import json

try:
  # Reuse the cached, already-parsed textbook if it exists.
  with open('./patel_textbook_ascii.json', 'r') as f:
    textbook_cleaned = json.load(f)

except FileNotFoundError:
  # No cache found: parse the PDF page by page with PyMuPDF (imported as `fitz`).
  import fitz
  pdf_path = "../../non-public-datasets/cleaned_data/patel_textbook/patel_short_and_clean.pdf"

  textbook_cleaned = []
  for i, page in enumerate(fitz.open(pdf_path)):
    # Extract plain text (UTF-8), then drop every non-ASCII character.
    text = page.get_text().encode("utf8").decode('ascii', errors='ignore')
    textbook_cleaned.append(dict(text=text,
                                 page_number=i,  # 0-indexed position of the page in the PDF
                                 textbook_name='Yale-Patt_Sanjay-Patel--Intro_to_Computing_Systems'))
len(textbook_cleaned)
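
One side effect of `.decode('ascii', errors='ignore')` is that typographic ligatures such as `ﬁ` and `ﬂ` are deleted rather than expanded. That is consistent with words like "ve" (five) and "reects" (reflects) in the search results further down. Here is a minimal sketch of one possible mitigation, assuming Unicode compatibility normalization (NFKD) is acceptable for this text; `to_ascii` is a hypothetical helper, not something this notebook defines.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Expand ligatures and strip accents via NFKD, then drop whatever is still non-ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("The ISA speciﬁes the memory organization"))
print(to_ascii("This description reﬂects the number of operands"))
```

Curly quotes and other characters with no ASCII decomposition are still dropped by this approach.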

In [40]:
# print preview
# for i, page in enumerate(textbook_cleaned):
#     print(page['text']) 

In [30]:
metadatas = [dict(page_number=page['page_number'], textbook_name=page['textbook_name']) for page in textbook_cleaned]
textbook_texts = [page['text'] for page in textbook_cleaned]
assert len(metadatas) == len(textbook_texts), 'must be equal sizes'

## 1. Split the textbook text into chunks sized for the context window

In [31]:
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# good examples here: https://langchain.readthedocs.io/en/latest/modules/utils/combine_docs_examples/textsplitter.html
tokenizer = AutoTokenizer.from_pretrained('OpenAssistant/oasst-sft-1-pythia-12b')
# chunk_size and chunk_overlap are measured in tokens of this tokenizer.
# `separators` must be a list of strings: split on "." first, then fall back to " ".
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=682, chunk_overlap=100, separators=[".", " "])
texts = text_splitter.create_documents(texts=textbook_texts, metadatas=metadatas)
print(len(texts))

# chunk_size=682 -> 307 chunks

# keep only chunks longer than 50 characters
clean_meta = [text.metadata     for text in texts if len(text.page_content) > 50]
clean_text = [text.page_content for text in texts if len(text.page_content) > 50]
print("Num chunks after filtering short ones",len(clean_text))

307
Num chunks after filtering short ones 307
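
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size windows over a token list. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators and merges the pieces up to the size limit); `sliding_window_chunks` is a hypothetical helper that only illustrates how overlap makes consecutive chunks share tokens.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Cut `tokens` into windows of at most chunk_size items,
    where each window repeats the last chunk_overlap items of the previous one."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


print(sliding_window_chunks(list(range(10)), chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```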


## 2. Embed each context and save it to a vector database (hosted by Pinecone)

In [32]:
from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings #OpenAIEmbeddings, 
import pinecone
from dotenv import load_dotenv
import os

# load API keys from globally-available .env file
load_dotenv(dotenv_path='/mnt/project/chatbotai/huggingface_cache/internal_api_keys.env', override=True)

pinecone.init(api_key=os.environ['PINECONE_API_KEY_NEW_ACCT'], environment="us-east4-gcp")
# pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment="us-west1-gcp")

model_name = "intfloat/e5-large"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

pinecone_index = Pinecone.from_texts(
    texts=clean_text,
    metadatas=clean_meta,
    embedding=embeddings,
    index_name="uiuc-chatbot-deduped" 
)

No sentence-transformers model found with name /home/kastanday/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
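
The warning above says the model is wrapped with mean pooling, meaning the sentence embedding is the average of the token embeddings with padding positions excluded. Here is a minimal pure-Python sketch of that idea using made-up toy vectors; the real computation happens on tensors inside `sentence-transformers`, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1 (i.e. not padding)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # the last vector stands in for padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```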


# Save Parquet and CSV

In [33]:
# make a dataframe from clean_text and clean_meta (the embeddings themselves are not stored here)
import pandas as pd
df = pd.DataFrame({'text': clean_text, 'metadata': clean_meta})
df.to_parquet('../finalized_datasets/v2-Patel_textbook-chunk_size_682-chunk_overlap_100.parquet', index=False)
df.to_csv('../finalized_datasets/v2-Patel_textbook-chunk_size_682-chunk_overlap_100.csv', index=False)
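
One thing to keep in mind with the CSV copy: CSV has no nested types, so each `metadata` dict is written as its string representation and has to be parsed when the file is read back. Below is a small sketch of that round trip using only the standard library. The row contents are made-up examples, and `ast.literal_eval` is one safe way to parse the string.

```python
import ast
import csv
import io

rows = [{"text": "The LC-3 is a three-address ISA.",
         "metadata": {"page_number": 227, "textbook_name": "example-textbook"}}]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "metadata"])
writer.writeheader()
writer.writerows(rows)  # the dict is stored as str(dict)

buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["metadata"]).__name__)  # str

# Parse the string back into a dict.
restored = ast.literal_eval(loaded[0]["metadata"])
print(restored["page_number"])  # 227
```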

# ✅ Done with critical steps, the rest is for demonstration only

## 3. Easily run similarity search

In [10]:
# Full code to run Pinecone search during inference.

# embeddings = HuggingFaceEmbeddings(model_name=model_name)
# pinecone.init(api_key="***", environment="us-west1-gcp")
# pinecone_index = pinecone.Index("uiuc-chatbot")
# vectorstore = Pinecone(index=pinecone_index, embedding_function=embeddings, text_key="text")
# question = "What is a finite state machine in electrical engineering?"
# relevant_context_list = vectorstore.similarity_search(question, k=3)

# for d in relevant_context_list:
#     print(d.page_content)
#     print(d.metadata['page_number'], d.metadata['textbook_name'])

In [34]:
# Easily run similarity search on the Pinecone index
question = "What is a finite state machine in electrical engineering?"
relevant_context_list = pinecone_index.similarity_search(question, k=5)

for d in relevant_context_list:
    print(d.page_content)
    # print(d.metadata['page_number'], d.metadata['textbook_name'])
    print()

The abstract variant shown below outlines desired behavior at a high leve l, and is often ambiguous, incomplete, and even inconsistent. For example, what happens if a user pushes two buttons? What happens if they push unlock while the alarm is sounding? These questions should event ually be considered. However, we can already start to see the intended use of the design: starting f rom a locked car, a user can push “unlock” once to gain entry to the driver’s seat, or push “unlock” twice to op en the car fully for passengers. To lock the car, a user can push the “lock” button at any time. And, if a use r needs help, pressing the “panic” button sets oﬀ an alarm. state action/input next state LOCKED push “unlock” DRIVER DRIVER push “unlock” UNLOCKED (any) push “lock” LOCKED (any) push “panic” ALARM

And so there's a couple of packets, one of the problems and one of solutions. And so I'd highly recommend people go through those. There aren't online tools for finite state machines yet. So lo
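
Conceptually, `similarity_search` embeds the question and returns the stored chunks whose vectors are closest to it. Here is a toy sketch of that ranking, assuming cosine similarity as the metric (the metric actually configured on this Pinecone index is not shown in this notebook) and using made-up two-dimensional vectors; `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print([i for i, _ in top_k([1.0, 0.0], docs, k=2)])  # [0, 2]
```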

In [38]:
# Easily run similarity search on the Pinecone index
# question = "What is a finite state machine in electrical engineering?"
question = "What is a LC-3 code?"
relevant_context_list = pinecone_index.similarity_search(question, k=5)

for d in relevant_context_list:
    print(d.page_content)
    print(d.metadata['page_number'], d.metadata['textbook_name'])
    print()

Table B.2 lists some of the data movement
opcodes in the x86 instruction set.
Control
The LC-3 has ve control opcodes: BR, JSR/JSRR, JMP, RTI, and
TRAP. x86 has all these and more. Table B.3 lists some of the control opcodes in
the x86 instruction set.
B.1.1.3 Two Address vs. Three Address
The LC-3 is a three-address ISA. This description reects the number of operands
explicitly specied by the ADD instruction. An add operation requires two
source operands (the numbers to be added) and one destination operand to store
227.0 Yale-Patt_Sanjay-Patel--Intro_to_Computing_Systems

The ISA species the memory organization, register set, and instruction set,
including the opcodes, data types, and addressing modes of the instructions in
the instruction set.
5.1.1 Memory Organization
The LC-3 memory has an address space of 216 (i.e., 65,536) locations, and an
addressability of 16 bits. Not all 65,536 addresses are actually used for memory
locations, but we will leave that discussion for Chapter 9.

In [13]:
# Easily run similarity search on the Pinecone index
question = "What is a LC-3?"
relevant_context_list = pinecone_index.similarity_search(question, k=3)

for d in relevant_context_list:
    print(d.page_content)
    print(d.metadata['page_number'], d.metadata['textbook_name'])

The ISA speciﬁes the memory organization, register set, and instruction set,including the opcodes, data types, and addressing modes of the instructions inthe instruction set...Memory OrganizationThe LC-3 memory has an address space of 216(i.e., 65,536) locations, and anaddressability of 16 bits. Not all 65,536 addresses are actually used for memorylocations, but we will leave that discussion for Chapter 9. Since the normal unitof data that is processed in the LC-3 is 16 bits, we refer to 16 bits as oneword,and we say the LC-3 isword-addressable...RegistersSince it usually takes far more than one clock cycle to obtain data from mem-ory, the LC-3 provides (like almost all computers) additional temporary storagelocations that can be accessed in a single clock cycle.The most common type of temporary storage locations, and the one used inthe LC-3, is a set of registers. Each register in the set is called ageneral purposeregister(GPR). Like memory locations, registers store information that 

In [None]:
import torch
from transformers import pipeline

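# Base `roberta-large` has no fine-tuned question-answering head (it would be
# randomly initialized), so use a checkpoint fine-tuned on SQuAD 2.0.
reader = pipeline(
  task='question-answering',
  model='deepset/roberta-large-squad2',
  tokenizer='deepset/roberta-large-squad2',
  device='cuda:0' if torch.cuda.is_available() else 'cpu'
)

question="What is a programmable logic array (PLA)?"
# retrieve contexts for this question instead of reusing the previous query's results
relevant_context_list = pinecone_index.similarity_search(question, k=3)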
for doc in relevant_context_list:
  answer = reader(question=question, context=doc.page_content)
  print(answer)
  print(doc.page_content)
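
Each call to a question-answering pipeline returns a dict with `score`, `start`, `end`, and `answer` keys, so the per-document answers can be ranked by score. Here is a minimal sketch, using made-up result dicts in place of real pipeline output; `best_answer` is a hypothetical helper.

```python
def best_answer(results):
    """Return the result dict with the highest score, or None if there are no results."""
    if not results:
        return None
    return max(results, key=lambda r: r["score"])


# Made-up results standing in for reader(question=..., context=...) outputs.
fake_results = [
    {"score": 0.12, "start": 0, "end": 10, "answer": "a register"},
    {"score": 0.81, "start": 5, "end": 30, "answer": "a building block for logic functions"},
    {"score": 0.40, "start": 2, "end": 12, "answer": "an AND array"},
]
print(best_answer(fake_results)["answer"])  # a building block for logic functions
```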