# This works: just run all cells (as of the first push)

Using spaCy:
```
conda install -c conda-forge spacy
conda install -c conda-forge cupy
python -m spacy download en_core_web_trf

pip install langchain pinecone-client PyPDF2
# maybe: conda install -c conda-forge -y ipykernel=6
# maybe: conda install -c anaconda -y notebook
```

Note: 
* Flan T5 XL max length is 512
* Flan T5 XXL max length is 1024

## 0. Load JSON

In [1]:
# load JSON from file: input_data/audio_transcripts/vlad_lectures/lecture_transcripts_large_model.json
import json
with open('../input_data/audio_transcripts/vlad_lectures/lecture_transcripts_large_model.json') as f:
    lectures = json.load(f)
print(len(lectures))
lectures

43


[{'ECE120-2016-10-28-LEC-27-slides.mp4': " stuff and a little bit of philosophical prattle to start off. One clarification too that's worthwhile from last lecture. Then we'll dive into von Neumann model and then look at LC3 as a von Neumann machine. And then I think we will manage to get through most, if not all of this instruction format stuff. After that on Monday, we'll start talking about instruction processing. I suppose there's some small chance we may get to that today. So before I do the clarification, do this. So if you're feeling now after having gotten your midterm back, you know, my TA, if I hadn't had this TA, I'd have gotten half the score. Then nominate them. Even if you don't feel that way, nominate them. Because you should be proud to be in 120. Otherwise the 210 TAs will win or the 2310 or some people like that. They just don't deserve it. Our TAs should win. I mean, they're good, but they're not as good as ours. So seriously, I mean, unless you feel strongly and you 

In [2]:
metadatas = [dict(textbook_name = list(lecture_dict.keys())[0], page_number = '') for lecture_dict in lectures]
# metadatas = [dict(page_number='', textbook_name=page['textbook_name']) for page in textbook]
textbook_texts = [list(lecture_dict.values())[0] for lecture_dict in lectures]
assert len(textbook_texts) == len(metadatas), 'must be equal sizes'
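
The cell above assumes every entry in `lectures` is a single-key dict mapping a video filename to its transcript. A minimal sketch of that unpacking on toy data, with an explicit check for the single-key assumption (the filenames and transcripts below are made up for illustration):

```python
# Toy stand-in for the loaded JSON: one {filename: transcript} dict per lecture.
toy_lectures = [
    {"lecture-01.mp4": " first transcript "},
    {"lecture-02.mp4": " second transcript "},
]

toy_metadatas = []
toy_texts = []
for lecture_dict in toy_lectures:
    # Fail loudly if an entry does not hold exactly one transcript,
    # instead of silently ignoring the extra keys.
    if len(lecture_dict) != 1:
        raise ValueError(f"expected one filename per entry, got {len(lecture_dict)}")
    (filename, transcript), = lecture_dict.items()
    toy_metadatas.append({"textbook_name": filename, "page_number": ""})
    toy_texts.append(transcript.strip(" "))

print(toy_metadatas[0])
print(toy_texts)
```

The two lists stay index-aligned, which is what `create_documents(texts=..., metadatas=...)` relies on later.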

In [3]:
textbook_texts = [textbook_text.strip(' ') for textbook_text in textbook_texts]
print(len(textbook_texts))
textbook_texts

43


["stuff and a little bit of philosophical prattle to start off. One clarification too that's worthwhile from last lecture. Then we'll dive into von Neumann model and then look at LC3 as a von Neumann machine. And then I think we will manage to get through most, if not all of this instruction format stuff. After that on Monday, we'll start talking about instruction processing. I suppose there's some small chance we may get to that today. So before I do the clarification, do this. So if you're feeling now after having gotten your midterm back, you know, my TA, if I hadn't had this TA, I'd have gotten half the score. Then nominate them. Even if you don't feel that way, nominate them. Because you should be proud to be in 120. Otherwise the 210 TAs will win or the 2310 or some people like that. They just don't deserve it. Our TAs should win. I mean, they're good, but they're not as good as ours. So seriously, I mean, unless you feel strongly and you don't want to do it, at least think about

## 1. Split the textbook string into context-size contexts.

In [4]:
from transformers import AutoTokenizer
from langchain.text_splitter import CharacterTextSplitter, NLTKTextSplitter, SpacyTextSplitter, RecursiveCharacterTextSplitter

# good examples here: https://langchain.readthedocs.io/en/latest/modules/utils/combine_docs_examples/textsplitter.html
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xxl')
# NOTE: `separators` is meant to be a list of strings, e.g. [". ", " ", ""].
# As far as I can tell, a bare string is iterated character by character ('.' then ' ').
# The chunk counts below were recorded with the string form, so changing it will change them.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=682, chunk_overlap=100, separators=". ",
)
# texts = text_splitter.split_text(textbook)
texts = text_splitter.create_documents(texts=textbook_texts, metadatas=metadatas)
print(len(texts))

# chunk_size -> number of chunks
# 250 -> 2723
# 350 -> 1692
# 450 -> 1397
# 682 -> 893

clean_meta = [text.metadata for text in texts]
clean_text = [text.page_content for text in texts]
# clean_meta

893


In [5]:
print(len(texts))
print(len(clean_text))
print(len(clean_meta))

893
893
893
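
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of greedy splitting with overlap. It is not LangChain's implementation: word counts stand in for tokenizer counts, only one separator is used, and a piece longer than `chunk_size` is kept whole rather than split further. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator=". "):
    """Greedily pack separator-delimited pieces into chunks of at most
    `chunk_size` words, carrying up to `chunk_overlap` trailing words
    into the next chunk. Word counts stand in for token counts."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def n_words(pieces):
        return sum(len(p.split()) for p in pieces)

    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and n_words(current) + len(piece.split()) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the incoming piece.
            while current and (
                n_words(current) > chunk_overlap
                or n_words(current) + len(piece.split()) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


for chunk in split_with_overlap("a b c. d e f. g h i. j k l", chunk_size=6, chunk_overlap=3):
    print(chunk)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.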


## 2. Embed each context and save it to a vector database (hosted by Pinecone)

In [6]:
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path='/mnt/project/chatbotai/huggingface_cache/internal_api_keys.env', override=True)
# check that the key loaded without echoing the secret into the notebook output
assert 'PINECONE_API_KEY_NEW_ACCT' in os.environ, 'Pinecone API key not found in .env file'

In [10]:
from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings  # alternative: OpenAIEmbeddings
import pinecone
from dotenv import load_dotenv
import os

# load API keys from the globally available .env file
load_dotenv(dotenv_path='/mnt/project/chatbotai/huggingface_cache/internal_api_keys.env', override=True)

# See the docs here, (search for pinecone): https://langchain.readthedocs.io/en/latest/reference/modules/vectorstore.html
pinecone.init(api_key=os.environ['PINECONE_API_KEY_NEW_ACCT'], environment="us-east4-gcp")
# pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment="us-west1-gcp")

model_name = "intfloat/e5-large"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

pinecone_index = Pinecone.from_texts(
    texts=clean_text,
    metadatas=clean_meta,
    embedding=embeddings,
    index_name="uiuc-chatbot-deduped" 
)

No sentence-transformers model found with name /home/kastanday/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.


# Save as CSV and Parquet files

In [20]:
# make a dataframe from clean_text and clean_meta (the embeddings themselves are not stored here)
import pandas as pd
df = pd.DataFrame({'text': clean_text, 'metadata': clean_meta})
df.to_parquet('./Vlad_lectures-chunk_size_682-chunk_overlap_100.parquet', index=False)
df.to_csv('./Vlad_lectures-chunk_size_682-chunk_overlap_100.csv', index=False)
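
One thing to watch when reading the CSV back: the `metadata` column holds dicts, and CSV can only store them as text, so they come back as strings. A minimal sketch of recovering the dicts, using only the standard library and two made-up rows in the same two-column shape (this assumes the dicts were written in their Python `str()` form):

```python
import ast
import csv
import io

# Made-up rows in the same two-column shape as the CSV written above.
rows = [
    {"text": "first chunk", "metadata": {"textbook_name": "lecture-01.mp4", "page_number": ""}},
    {"text": "second chunk", "metadata": {"textbook_name": "lecture-02.mp4", "page_number": ""}},
]

# Write to an in-memory buffer; dict values are stored in their str() form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "metadata"])
writer.writeheader()
writer.writerows(rows)

# Read back: every field is a plain string until it is parsed again.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row["metadata"] = ast.literal_eval(row["metadata"])  # literal parsing only, not eval()
    restored.append(row)

print(restored[0]["metadata"]["textbook_name"])
```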

# ✅ Done with critical steps, the rest is for demonstration only

## 3. Easily run similarity search

In [None]:
# Full code to run Pinecone search during inference.

# embeddings = HuggingFaceEmbeddings(model_name=model_name)
# pinecone.init(api_key="***", environment="us-west1-gcp")
# pinecone_index = pinecone.Index("uiuc-chatbot")
# vectorstore = Pinecone(index=pinecone_index, embedding_function=embeddings, text_key="text")
# question = "What is a finite state machine in electrical engineering?"
# relevant_context_list = vectorstore.similarity_search(question, k=3)

# for d in relevant_context_list:
#     print(d.page_content)
#     print(d.metadata['page_number'], d.metadata['textbook_name'])

In [14]:
# Easily run similarity search on the Pinecone index
question = "What is a finite state machine in electrical engineering?"
relevant_context_list = pinecone_index.similarity_search(question, k=3)

for d in relevant_context_list:
    print(d.page_content)
    print(d.metadata['page_number'], d.metadata['textbook_name'])

So if I just add a flip-flop, now there's a Moore machine. I mean, this is the state bit, so I can affect the output with a state bit. Wait a minute, that's sort of delaying things. In fact, if you think back to our serialized designs, we always delay things. The output of these machines is never reflecting all of the inputs until the next clock cycle. So let's take a look at a timing diagram, but I claim it's no different from the things you've already seen, which is a factor with using Moore machines. So let's take a look at it. So what this is going to do, by adding this flip-flop here, we're actually logically splitting this one state into a 1, 1 state and a 0, 1 state. Now, the 1 and the 0 are different, because they call this one S1. So that's the high bit. So we've got three states now. And now we can put our inputs or outputs into our state. So we've got the 0, 0 state, which is 0 here and 0 here. In that case, remember, output is just S1. So S1 is 0, so output is 0. Here's a 1
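
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors are closest to it. A toy sketch of that ranking step, assuming cosine similarity (the metric of the actual index is not shown in this notebook) and made-up 3-dimensional vectors in place of the 1024-dimensional e5-large embeddings. The helper names are hypothetical, and a hosted index uses approximate search rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Made-up vectors standing in for embedded chunks.
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.8, 0.6, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.1, 0.0], toy_vectors, k=2))
```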

In [15]:
# print the number of words in the second retrieved context
print(len(relevant_context_list[1].page_content.split()))

247


In [None]:
import torch
from transformers import pipeline

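# NOTE: 'roberta-large' is the base checkpoint, not one fine-tuned for question answering,
# so its QA head is randomly initialized and the answers are not meaningful.
# A SQuAD-fine-tuned checkpoint (e.g. 'deepset/roberta-large-squad2') is probably what is wanted here.
reader = pipeline(
  tokenizer='roberta-large',
  model='roberta-large',
  task='question-answering',
  device='cuda:0' if torch.cuda.is_available() else 'cpu'
)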

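# NOTE: relevant_context_list was retrieved for the finite state machine question above;
# re-run similarity_search with this question to get matching contexts.
question = "What is a programmable logic array (PLA)?"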
for doc in relevant_context_list:
  answer = reader(question=question, context=doc.page_content)
  print(answer)
  print(doc.page_content)
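
The loop above prints one answer per retrieved context. Each answer is a dict with `score`, `start`, `end` and `answer` keys, so one simple way to pick a single reply is to keep the highest-scoring one. A minimal sketch on invented answers; `best_answer` and its `min_score` cutoff are hypothetical, not part of `transformers`:

```python
def best_answer(answers, min_score=0.0):
    """Return the highest-scoring answer dict, or None if none reaches min_score."""
    candidates = [a for a in answers if a["score"] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a["score"])


# Invented reader outputs, one per retrieved context.
toy_answers = [
    {"score": 0.08, "start": 4, "end": 19, "answer": "a state machine"},
    {"score": 0.61, "start": 30, "end": 58, "answer": "an array of AND and OR gates"},
    {"score": 0.27, "start": 0, "end": 11, "answer": "a flip-flop"},
]
print(best_answer(toy_answers)["answer"])
print(best_answer(toy_answers, min_score=0.9))
```

Returning `None` below the cutoff lets the caller fall back to "no answer found" instead of showing a low-confidence span.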

# 🚨 DON'T RUN BELOW HERE 🚨

# Delete all vectors with '.mp4' in metadata

In [None]:
# unique filenames across all chunks
list_of_filenames = list({metadata['textbook_name'] for metadata in clean_meta})
list_of_filenames

In [None]:

# read the key from the environment instead of hard-coding it in the notebook
pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment="us-west1-gcp")
pi = pinecone.Index('uiuc-chatbot')

# query for metadata matching exact filename
res = pi.query(
    vector=[0.0] * 1024,
    filter={
        "textbook_name": {"$in": list_of_filenames},
    },
    top_k=50,
    include_metadata=True
)
res['matches']

In [None]:
# get ids, then delete
# NOTE: the query above uses top_k=50, so at most 50 vectors are deleted per run
print(len(res['matches']))
ids = [match['id'] for match in res['matches']]

pi.delete(ids=ids)
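
Because the query is capped at `top_k=50`, a single query-then-delete pass can leave matching vectors behind. A sketch of repeating the pass until nothing matches, shown against an in-memory stand-in rather than a real index. `delete_all_matching`, `query_ids` and `delete_ids` are hypothetical names; with Pinecone they would wrap `pi.query(...)` and `pi.delete(ids=ids)` from the cells above. This is destructive, so the warning above still applies.

```python
def delete_all_matching(query_ids, delete_ids, max_rounds=1000):
    """Repeatedly fetch a page of matching ids and delete it until none remain.

    Returns the number of ids deleted. Raises if matches keep coming back,
    for example when deletes have not become visible to queries yet.
    """
    deleted = 0
    for _ in range(max_rounds):
        ids = query_ids()
        if not ids:
            return deleted
        delete_ids(ids)
        deleted += len(ids)
    raise RuntimeError("gave up: matches still returned after max_rounds")


# In-memory stand-in for the index: vector id -> source filename.
store = {f"vec-{i}": "lecture.mp4" for i in range(120)}
store["keep-me"] = "textbook.pdf"


def query_ids(page_size=50):
    # Mimics a filtered query capped at page_size results.
    return [vid for vid, name in store.items() if name.endswith(".mp4")][:page_size]


def delete_ids(ids):
    for vid in ids:
        store.pop(vid, None)


print(delete_all_matching(query_ids, delete_ids))
print(sorted(store))
```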
