<a href="https://colab.research.google.com/github/aicrashcoursewinter24/jakes_labs/blob/adding_cookiecutter_base_install/notebooks/lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Today, we'll play with turning text into numeric vectors (the process of "vectorization"), which first requires splitting a long string into something closer to a list of words (or characters).

This latter process is called "tokenization": each word, sub-word, or character (the atomic unit of text) is called a "token".
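As a rough illustration of the idea (this is not how BERT's tokenizer works internally), here is a minimal sketch of word-level and character-level tokenization in plain Python. The function names `word_tokenize` and `char_tokenize` are made up for this example.

```python
import re

def word_tokenize(text):
    """Split text into word tokens, keeping each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split text into single-character tokens (spaces included)."""
    return list(text)

sample = "Do not meddle, wizards!"
print(word_tokenize(sample))      # ['Do', 'not', 'meddle', ',', 'wizards', '!']
print(char_tokenize(sample)[:6])  # ['D', 'o', ' ', 'n', 'o', 't']
```

Real tokenizers usually sit somewhere in between, splitting rare words into smaller sub-word pieces.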

Start by installing the "datasets" Python package, which gives you access to helpful utilities for downloading public datasets from HuggingFace and elsewhere.

In [None]:
! pip install datasets

There are pre-built tokenizer models, which bundle both code and a mapping between tokens and token *ids*: integers which will be feature columns for the text.
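To make the token-to-id mapping concrete, here is a toy sketch with a hand-built vocabulary. The vocabulary, the `[UNK]` fallback entry, and the helper names (`build_vocab`, `toy_encode`, `toy_decode`) are assumptions for illustration only; a pre-built tokenizer ships with a much larger, fixed vocabulary.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id, reserving 0 for unknown tokens."""
    vocab = {"[UNK]": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def toy_encode(tokens, vocab):
    """Map tokens to ids; tokens missing from the vocabulary map to the [UNK] id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def toy_decode(ids, vocab):
    """Map ids back to tokens by inverting the vocabulary."""
    id_to_token = {i: t for t, i in vocab.items()}
    return [id_to_token[i] for i in ids]

toy_vocab = build_vocab("do not meddle in the affairs of wizards".split())
toy_ids = toy_encode("do not meddle with dragons".split(), toy_vocab)
print(toy_ids)                         # [1, 2, 3, 0, 0]
print(toy_decode(toy_ids, toy_vocab))  # ['do', 'not', 'meddle', '[UNK]', '[UNK]']
```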

We will first use the tokenizer for the BERT model (a transformer encoder built on the architecture introduced in the "[Attention is All You Need](https://arxiv.org/abs/1706.03762)" paper), in a form which distinguishes between lowercase and uppercase characters (some tokenizers lowercase everything first). It's called "bert-base-cased".

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In the output above, you should see a comment about the "HF_TOKEN" secret, along with a link to HuggingFace, where you can generate your HF token (see the note below about the word "token"). On the left side of the Colab screen, there is a "key" icon: you can store your HF_TOKEN as a secret there. Name it HF_TOKEN and give it "notebook access" via the toggle.


--

Note on "token": there are now two completely unrelated uses of the word "token" in this lab:

* "token": a unit of text like a word or character (or even multi-word phrase) used in text preprocessing
* "HF_TOKEN": a password-like thing for getting access to HuggingFace

In [3]:
encoded = tokenizer.encode("Do not meddle in the affairs of wizards")

In [None]:
# prompt: write python code to print the textual tokens in sequential order from a string, using the above tokenizer

print(tokenizer.convert_ids_to_tokens(encoded))


In [None]:
print(encoded)

In [None]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input.keys())
print(encoded_input['input_ids'])

In [None]:
! pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

At this point, go ahead and explore the vector representation (the "embedding") of any sentence (or string of text, more generally): look at the tokenized form and the list of token_id integers, or compute cosine similarities between the embeddings:
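For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths: cos(a, b) = (a · b) / (‖a‖ ‖b‖). It is 1 for vectors pointing the same way, 0 for perpendicular vectors, and -1 for opposite vectors. A minimal plain-Python sketch follows; the helper name `cosine_sim` is made up here, and in practice you would more likely call a library routine.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # perpendicular: 0.0
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

The same helper should also accept the embeddings computed earlier, e.g. `cosine_sim(embeddings[0], embeddings[1])`.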

In [None]:
words = ["quick", "fast", "red", "blue", "ferari"]
single_word_embeddings = model.encode(words)

for word, embed in zip(words, single_word_embeddings):
  print("word: ", word)
  print("embed: ", embed[0:10])
  print("")



In [24]:
# prompt: python code to compute the matrix of cosines between all of the pairs of words in the list above.

from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity between all pairs of words
word_embeddings = model.encode(words)
word_similarities = cosine_similarity(word_embeddings)
# Print the word similarities
print(word_similarities)


[[1.0000001  0.6515874  0.3388258  0.33914232 0.28320336]
 [0.6515874  1.         0.32009655 0.30601805 0.26345903]
 [0.3388258  0.32009655 1.         0.72944736 0.26313198]
 [0.33914232 0.30601805 0.72944736 1.         0.22827557]
 [0.28320336 0.26345903 0.26313198 0.22827557 0.99999976]]
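To read the matrix above: row i, column j holds the similarity between `words[i]` and `words[j]`, so the diagonal is (up to rounding) 1 and the matrix is symmetric. The sketch below copies the printed values by hand and ranks the distinct pairs; the names `printed_sims` and `ranked_pairs` are made up for this example.

```python
words = ["quick", "fast", "red", "blue", "ferari"]

# Values copied from the printed output above
printed_sims = [
    [1.0000001, 0.6515874, 0.3388258, 0.33914232, 0.28320336],
    [0.6515874, 1.0, 0.32009655, 0.30601805, 0.26345903],
    [0.3388258, 0.32009655, 1.0, 0.72944736, 0.26313198],
    [0.33914232, 0.30601805, 0.72944736, 1.0, 0.22827557],
    [0.28320336, 0.26345903, 0.26313198, 0.22827557, 0.99999976],
]

# Keep only the upper triangle (i < j) so each pair appears once
pairs = [
    (printed_sims[i][j], words[i], words[j])
    for i in range(len(words))
    for j in range(i + 1, len(words))
]
ranked_pairs = sorted(pairs, reverse=True)

for score, first, second in ranked_pairs[:3]:
    print(f"{first} ~ {second}: {score:.3f}")
```

The two highest-scoring pairs are the two colors and the two speed words, which is what you would hope for from a sensible embedding.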


In [None]:
# prompt: python code for computing cosine similarity between sentence vector embeddings from the above tokenizer and model

from scipy.spatial.distance import cosine

for sentence in sentences:
    print("Sentence:", sentence)
print("")
# scipy's cosine() returns the cosine *distance* (1 - similarity),
# so subtract it from 1 to get the similarity
print("Cosine similarity between the first two sentences:", 1 - cosine(embeddings[0], embeddings[1]))
print("Cosine similarity between the second and third sentences:", 1 - cosine(embeddings[1], embeddings[2]))
print("Cosine similarity between the first and third sentences:", 1 - cosine(embeddings[0], embeddings[2]))
