Using spaCy
```
conda install -c conda-forge spacy
conda install -c conda-forge cupy
python -m spacy download en_core_web_trf

pip install langchain pinecone-client PyPDF2
# maybe: conda install -c conda-forge -y ipykernel=6
```

Note: 
* Flan T5 XL max length is 512
* Flan T5 XXL max length is 1024
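
These limits bound how much retrieved context fits in a single prompt. A rough budgeting sketch, assuming the figures above are input limits counted in tokens; `contexts_that_fit` is a hypothetical helper, and the chunk and question sizes below are illustrative numbers, not measurements:

```python
def contexts_that_fit(max_len: int, chunk_tokens: int,
                      question_tokens: int = 0, reserved_tokens: int = 0) -> int:
    """Return how many full chunks of `chunk_tokens` fit in a prompt of `max_len` tokens."""
    available = max_len - question_tokens - reserved_tokens
    if available <= 0 or chunk_tokens <= 0:
        return 0
    return available // chunk_tokens

# Hypothetical numbers: 250-token chunks and a 20-token question.
for name, max_len in [("Flan T5 XL", 512), ("Flan T5 XXL", 1024)]:
    print(name, contexts_that_fit(max_len, chunk_tokens=250, question_tokens=20))
```

Under these assumed sizes, only one chunk fits in the smaller limit and four in the larger one, so the chunk size and the number of retrieved contexts need to be chosen together.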

In [46]:
from typing import List

def load_from_txt_file(filename: str) -> List[str]:
    """Read a UTF-8 text file and return its contents as a list of lines."""
    # Open the file in read mode
    with open(filename, "r", encoding='utf-8') as file:
        # Read the contents of the file
        text = file.read()
    # Split the contents of the file into a list of lines (not sentences)
    lines = text.splitlines()
    return lines

textbook = load_from_txt_file('cleaned_yale_patt_pages.txt')

# Join the lines back into one long string for the splitters below
textbook = ' '.join(textbook)

In [48]:
len(textbook)

1526415

## 1. Split the textbook string into context-sized chunks.

In [70]:
from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter, RecursiveCharacterTextSplitter

# good examples here: https://langchain.readthedocs.io/en/latest/modules/utils/combine_docs_examples/textsplitter.html
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xl')
# Measure chunk length in Flan-T5 tokens and split only on sentence boundaries (". ")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=250, chunk_overlap=50, separator=". "
)
texts = text_splitter.split_text(textbook)
print(len(texts))

Token indices sequence length is longer than the specified maximum sequence length for this model (2471 > 512). Running this sequence through the model will result in indexing errors
Created a chunk of size 2471, which is longer than the specified 250
Created a chunk of size 911, which is longer than the specified 250
Created a chunk of size 1311, which is longer than the specified 250
Created a chunk of size 279, which is longer than the specified 250
Created a chunk of size 373, which is longer than the specified 250
Created a chunk of size 261, which is longer than the specified 250
Created a chunk of size 270, which is longer than the specified 250
Created a chunk of size 257, which is longer than the specified 250
Created a chunk of size 278, which is longer than the specified 250
Created a chunk of size 266, which is longer than the specified 250
Created a chunk of size 261, which is longer than the specified 250
Created a chunk of size 698, which is longer than the specified 250

2066
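
The "Created a chunk of size ..." warnings above most likely appear because `CharacterTextSplitter` splits only on the given separator, so a stretch of text containing no `". "` stays in one piece however long it is. A minimal sketch of one workaround is to re-split oversized pieces on whitespace. It is pure Python, measures length in whitespace-separated words as a stand-in for tokenizer tokens, ignores chunk overlap, and `split_with_fallback` is a hypothetical helper, not a LangChain API:

```python
from typing import List

def split_with_fallback(text: str, chunk_size: int, separator: str = ". ") -> List[str]:
    """Split on `separator`, merging pieces up to `chunk_size` words and
    hard-splitting any single piece that is longer than `chunk_size` words."""
    chunks: List[str] = []
    current: List[str] = []  # pieces collected for the chunk being built

    def flush() -> None:
        if current:
            chunks.append(separator.join(current))
            current.clear()

    for piece in text.split(separator):
        if not piece.strip():
            continue
        words = piece.split()
        if len(words) > chunk_size:
            # Fallback: the piece alone is too long, so cut it every chunk_size words.
            flush()
            for i in range(0, len(words), chunk_size):
                chunks.append(" ".join(words[i:i + chunk_size]))
            continue
        merged_len = len(separator.join(current + [piece]).split())
        if current and merged_len > chunk_size:
            flush()
        current.append(piece)
    flush()
    return chunks

sample = "one two three. " + " ".join(["w"] * 25) + ". four five"
for chunk in split_with_fallback(sample, chunk_size=10):
    print(len(chunk.split()), chunk)
```

Cutting mid-sentence loses some coherence, so this fallback only applies to pieces that could not be kept within the limit any other way.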


## 2. Embed each context and save it to a vector database (this one is hosted by Pinecone).

In [76]:
import os

from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings  # alternative: OpenAIEmbeddings
import pinecone
# See the docs here, (search for pinecone): https://langchain.readthedocs.io/en/latest/reference/modules/vectorstore.html

# Read the API key from the environment instead of hard-coding it in the notebook.
# PINECONE_API_KEY is an assumed variable name; set it before running this cell.
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-west1-gcp")
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
embeddings = HuggingFaceEmbeddings(model_name=model_name, device="cuda:1")
# Note: this rebinds the name `pinecone` from the client module to the vector store.
pinecone = Pinecone.from_texts(
    texts,
    embeddings,
    index_name="uiuc-chatbot",
)

# see the Pinecone index here (requires auth): https://app.pinecone.io/organizations/-NF9ryDGePT7APP6wrFM/projects/us-west1-gcp:32dcf9c/indexes/uiuc-chatbot-2

## 3. Easily run similarity search

In [91]:
# Easily run similarity search on the Pinecone index
question = "What is a finite state machine electrical engineering?"
relevant_context_list = pinecone.similarity_search(question, k=5)

for d in relevant_context_list:
    print(d.page_content)
    print()

One ﬁnal example: a very old soft drink machine, when drinks sold for 15 cents, and the machine would only take nickels (5 cents) and dimes (10 cents) and not be able to give change. The state of the machine can be described as the amount of money inserted, and whether the machine is open (so one can remove a bottle). There are only three possible states: A. The lock is open, so a bottle can be (or has been!) removed. B. The lock is not open, but 5 cents has been inserted. C. The lock is not open, but 10 cents has been inserted. 3.6.3 The Finite State Machine and Its State Diagram We have seen that a state is a snapshot of all relevant parts of a system at a particular point in time. At other times, that system can be in other states

A more precise term for this hardware is a central processing unit (CPU), or simply a processor ormicroprocessor . This textbook is primarily about the processor and the programs that are executed by the processor. 1.4.1 A (Very) Little History for a (Lot
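
Conceptually, the similarity search embeds the question and returns the stored chunks whose vectors lie closest to it. A toy sketch of that idea using cosine similarity, which the `cos-v1` suffix of the embedding model name suggests it is tuned for. The three-dimensional vectors and chunk labels are made up for illustration, and a hosted index would not normally scan every vector like this brute-force loop does:

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: Sequence[float],
          items: List[Tuple[str, Sequence[float]]], k: int) -> List[str]:
    """Return the texts of the k items most similar to the query vector."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up embeddings standing in for real sentence-transformer vectors.
store = [
    ("finite state machines", [0.9, 0.1, 0.0]),
    ("ASCII codes", [0.0, 0.2, 0.9]),
    ("state diagrams", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```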

In [90]:
import torch
from transformers import pipeline

reader = pipeline(
  tokenizer='deepset/electra-base-squad2',
  model='deepset/electra-base-squad2',
  task='question-answering',
  device='cuda:1' if torch.cuda.is_available() else 'cpu'
)

for doc in relevant_context_list:
  answer = reader(question=question, context=doc.page_content)
  print(answer)
  print()




# model_name = "deepset/electra-base-squad2"
# load the reader model into a question-answering pipeline
# reader = pipeline(tokenizer=model_name, model=model_name, task="question-answering", device=device)


{'score': 1.243356139617527e-13, 'start': 6, 'end': 17, 'answer': 'ASCII Codes'}

{'score': 3.520023383885018e-09, 'start': 611, 'end': 615, 'answer': 'LC-3'}

{'score': 0.0005357519839890301, 'start': 63, 'end': 71, 'answer': 'voltages'}

{'score': 8.161070041978746e-09, 'start': 41, 'end': 56, 'answer': 'condition codes'}

{'score': 7.408106788599211e-13, 'start': 45, 'end': 50, 'answer': 'ASCII'}
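
The scores above are all very low, which suggests the reader found no confident answer in any of the retrieved contexts. One simple way to handle that, sketched below, is to keep the highest-scoring answer only if it clears a threshold. `best_answer` and the `min_score` default of 0.1 are assumptions chosen for illustration; the sample dicts are copied from the output above.

```python
from typing import Dict, List, Optional

def best_answer(answers: List[Dict], min_score: float = 0.1) -> Optional[Dict]:
    """Return the highest-scoring answer, or None if none reaches `min_score`."""
    if not answers:
        return None
    best = max(answers, key=lambda a: a["score"])
    return best if best["score"] >= min_score else None

answers = [
    {'score': 1.243356139617527e-13, 'start': 6, 'end': 17, 'answer': 'ASCII Codes'},
    {'score': 3.520023383885018e-09, 'start': 611, 'end': 615, 'answer': 'LC-3'},
    {'score': 0.0005357519839890301, 'start': 63, 'end': 71, 'answer': 'voltages'},
    {'score': 8.161070041978746e-09, 'start': 41, 'end': 56, 'answer': 'condition codes'},
    {'score': 7.408106788599211e-13, 'start': 45, 'end': 50, 'answer': 'ASCII'},
]

print(best_answer(answers))                          # nothing clears the 0.1 threshold
print(best_answer(answers, min_score=0.0)["answer"])  # the top-ranked answer regardless
```

With the threshold in place the first call returns `None`, which is a more honest result here than reporting 'voltages' as the definition of a finite state machine.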



In [72]:
texts[50:100]

['We have continually been blessed with enthusiastic, knowledgeable, and car- ing TAs who regularly challenge us and provide useful insights into helping us explain things better. Again, the list is too long – more than 100 at this point. Almost all were very good; still, we want to mention a few who were particularly helpful',
 'Stephen Pruett, Siavash Zangeneh, Meiling Tang, Ali Fakhrzadehgan, Sabee Grewal, William Hoenig, Matthew Normyle, Ben Lin, Ameya Chaudhari,Preface xxvii Nikhil Garg, Lauren Guckert, Jack Koenig, Saijel Mokashi, Sruti Nuthalapati, Faruk Guvenilir, Milad Hashemi, Aater Suleman, Chang Joo Lee, Bhargavi Narayanasetty, RJ Harden, Veynu Narasiman, Eiman Ebrahimi, Khubaib, Allison Korczynski, Pratyusha Nidamaluri, Christopher Wiley, Cameron Davison, Lisa de la Fuente, Phillip Duran, Jose Joao, Rustam Miftakhutdinov, Nady Obeid, Linda Bigelow, Jeremy Carillo, Aamir Hasan, Basit Sheik, Erik Johnson, Tsung- Wei Huang, Matthew Potok, Chun-Xun Lin, Jianxiong Gao, Danny Ki

In [None]:
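text_splitter = RecursiveCharacterTextSplitter(
    # RecursiveCharacterTextSplitter takes a list of `separators`, not a single `separator`.
    # Try sentence boundaries first, then fall back to spaces and single characters.
    separators=['. ', ' ', ''],
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)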

In [53]:
print(len(texts))

1


In [33]:
# todo: check to see if chunks are less than chunk_size

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xl')
tokens = tokenizer.encode(texts[22], return_tensors='pt')
print(tokens)
print(tokens.shape)

Token indices sequence length is longer than the specified maximum sequence length for this model (3793 > 512). Running this sequence through the model will result in indexing errors


tensor([[  101,  1597, 30587,  ...,  2530,     5,     1]])
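
The TODO above can be sketched as a small helper that reports which chunks exceed a limit. `find_oversized` is a hypothetical name, and the length function is passed in so the example runs without downloading a model; with the notebook's tokenizer one could pass something like `lambda t: len(tokenizer.encode(t))`.

```python
from typing import Callable, List, Tuple

def find_oversized(chunks: List[str], limit: int,
                   length_fn: Callable[[str], int] = len) -> List[Tuple[int, int]]:
    """Return (index, length) for every chunk whose length exceeds `limit`."""
    oversized = []
    for index, chunk in enumerate(chunks):
        length = length_fn(chunk)
        if length > limit:
            oversized.append((index, length))
    return oversized

# Word counts stand in for token counts in this toy example.
demo_chunks = ["a b c", "a b c d e f", "a"]
print(find_oversized(demo_chunks, limit=4, length_fn=lambda t: len(t.split())))
```

An empty result means every chunk is within the limit under the chosen length function.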


In [None]:

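flan_xl_text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    AutoTokenizer.from_pretrained('google/flan-t5-xl'), chunk_size=300, chunk_overlap=0
)
# `textbook` is already a single string, so pass it directly;
# " ".join(textbook) would insert a space between every character.
texts = flan_xl_text_splitter.split_text(textbook)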
# print(texts[0])
# print(len(texts))