<a href="https://colab.research.google.com/github/aicrashcoursewinter24/ai_crashcourselabsLukeA/blob/LAB-2/Official_lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Today, we'll play with turning text into numeric vectors (the process of "vectorization"), which first requires splitting a long string into something closer to a list of words (or characters).

This latter process is "tokenization": each word, sub-word, or character (the atomic unit of text) is called a "token".
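
To make the choice of "atomic unit" concrete, here is a minimal sketch that uses only the Python standard library (no pretrained tokenizer) to split the same string at word level and at character level. The regular expression in the last step is just one simple way to separate punctuation, not what any particular library does.

```python
import re

text = "Do not meddle in the affairs of wizards"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every character, including spaces, is a token.
char_tokens = list(text)

# A slightly finer word-level split that also separates punctuation marks.
punct_tokens = re.findall(r"\w+|[^\w\s]", "Do not meddle, wizards!")

print(word_tokens)
print(char_tokens[:6])
print(punct_tokens)
```

Word-level splitting gives short sequences but a huge vocabulary; character-level splitting gives a tiny vocabulary but long sequences. Sub-word tokenizers sit between the two.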

Start by installing the "datasets" Python package, which gives you access to some helpful utilities for downloading public datasets from HuggingFace and elsewhere.

In [None]:
! pip install datasets



There are pre-built tokenizer models, which have both code and mappings between tokens and token *ids*: integers which will be feature columns for the text.
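
As a rough illustration of what "a mapping between tokens and token ids" means, the sketch below builds a toy vocabulary from two made-up sentences. The corpus, the `encode`/`decode` helpers, and the resulting ids are all hypothetical; a pretrained tokenizer ships with a fixed vocabulary that was learned in advance.

```python
corpus = ["do not meddle", "do not anger wizards"]

# Build a vocabulary: each distinct token gets the next free integer id.
vocab = {}
for sentence in corpus:
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab)

# Inverse mapping, for turning ids back into text.
id_to_token = {idx: token for token, idx in vocab.items()}

def encode(text):
    """Map a whitespace-tokenized string to a list of token ids."""
    return [vocab[token] for token in text.split()]

def decode(ids):
    """Map a list of token ids back to a string."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("wizards do not meddle")
print(vocab)
print(ids)
print(decode(ids))
```

Note that this toy `encode` raises a `KeyError` on any word it has not seen; real tokenizers avoid that by falling back to sub-word pieces or a special "unknown" token.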

We will first use the tokenizer from the BERT model (built on the "transformer" architecture introduced in the "[Attention is All You Need](https://arxiv.org/abs/1706.03762)" paper), in a form which knows how to differentiate between lowercase and uppercase characters (some tokenizers lowercase everything first).  It's called "bert-base-cased".

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In the output above, you should see a comment about the "HF_TOKEN" secret.  There is also a link to HuggingFace, where you can generate your HF Token (see the note below about the word "token"). On the left side of the Colab screen, there is a "key" icon: you can store your HF_TOKEN as a secret there.  Name it HF_TOKEN and give it "notebook access" via the toggle.
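
If you want to check that the secret is visible to the notebook, something like the sketch below should work. It assumes the `google.colab.userdata` module that Colab provides for reading secrets; the `get_hf_token` helper and the environment-variable fallback for running outside Colab are illustrative assumptions, not part of the lab.

```python
import os

def get_hf_token():
    """Return the HuggingFace access token, or None if it is not configured."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab, the token lives in an environment variable.
        return os.environ.get("HF_TOKEN")
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # The secret is missing, or notebook access is switched off.
        return None

token = get_hf_token()
# Report only whether a token was found; never print the token itself.
print("HF_TOKEN configured:", token is not None)
```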


--

A note on "token": there are now two completely unrelated uses of the word "token" in this lab:

* "token": a unit of text like a word or character (or even multi-word phrase) used in text preprocessing
* "HF_TOKEN": a password-like thing for getting access to HuggingFace

In [None]:
encoded = tokenizer.encode("Do not meddle in the affairs of wizards")

In [None]:
# prompt: write python code to print the textual tokens in sequential order from a string, using the above tokenizer

print(tokenizer.convert_ids_to_tokens(encoded))


['[CLS]', 'Do', 'not', 'me', '##ddle', 'in', 'the', 'affairs', 'of', 'wizard', '##s', '[SEP]']
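
In the output above, "meddle" and "wizards" are each split into two pieces, and the `##` prefix marks a piece that continues the previous one instead of starting a new word. BERT tokenizers use the WordPiece scheme, which handles a word that is not in the vocabulary by repeatedly taking the longest vocabulary entry that matches the start of what is left. The following is a minimal sketch of that greedy matching, using a hypothetical six-entry vocabulary (`toy_vocab`) in place of the real one, which has tens of thousands of entries and also handles punctuation and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched at all: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen to reproduce the splits seen above.
toy_vocab = {"Do", "not", "me", "##ddle", "wizard", "##s"}

for word in ["Do", "meddle", "wizards", "dragon"]:
    print(word, "->", wordpiece(word, toy_vocab))
```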


In [None]:
print(encoded)

[101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 102]


In [None]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input.keys())
print(encoded_input['input_ids'])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
[101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 117, 1111, 1152, 1132, 11515, 1105, 3613, 1106, 4470, 119, 102]
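
Of the three keys, `input_ids` holds the token ids, `token_type_ids` tells the two segments apart when a pair of sentences is encoded together, and `attention_mask` marks which positions are real tokens (1) and which are padding (0). Padding only shows up when sequences of different lengths are batched together. Here is a minimal sketch of that idea in plain Python; `pad_batch` is a hypothetical helper and `PAD_ID = 0` is an assumption about the padding token's id, so in practice let the tokenizer do the padding for you.

```python
PAD_ID = 0  # assumption: id of the padding token in the vocabulary

def pad_batch(batch, pad_id=PAD_ID):
    """Pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids taken from the outputs above: the full sentence, and "[CLS] Do not [SEP]".
long_ids = [101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 102]
short_ids = [101, 2091, 1136, 102]

padded = pad_batch([long_ids, short_ids])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```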


In [None]:
! pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transf

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173552e-02 -4.28515449e-02 -1.56286024e-02  1.40537303e-02
  3.95537727e-02  1.21796280e-01  2.94334106e-02 -3.17524187e-02
  3.54959629e-02 -7.93139935e-02  1.75878741e-02 -4.04369719e-02
  4.97259349e-02  2.54912246e-02 -7.18700588e-02  8.14968869e-02
  1.47069141e-03  4.79626991e-02 -4.50336412e-02 -9.92174670e-02
 -2.81769745e-02  6.45046085e-02  4.44670543e-02 -4.76217009e-02
 -3.52952331e-02  4.38671783e-02 -5.28566055e-02  4.33063833e-04
  1.01921506e-01  1.64072234e-02  3.26996595e-02 -3.45986746e-02
  1.21339476e-02  7.94870779e-02  4.58345609e-03  1.57777797e-02
 -9.68206208e-03  2.87625659e-02 -5.05805984e-02 -1.55793717e-02
 -2.87906546e-02 -9.62280575e-03  3.15556750e-02  2.27349028e-02
  8.71449187e-02 -3.85027491e-02 -8.84718448e-02 -8.75498448e-03
 -2.12343335e-02  2.08923239e-02 -9.02077407e-02 -5.25732562e-02
 -1.05638904e-02  2.88310610e-02 -1.61455162e-02  6.17837207e-03
 -1.23234

At this point, go ahead and explore the vector representation (the "embedding") of any sentence (or string of text, more generally): look at the tokenized form and the list of token_id integers, or compute cosine similarities between the embeddings:

In [None]:
words = ["quick", "fast", "red", "blue", "ferari"]
single_word_embeddings = model.encode(words)

for word, embed in zip(words, single_word_embeddings):
  print("word: ", word)
  print("embed: ", embed[0:10])
  print("")



word:  quick
embed:  [-0.01363505  0.02511056 -0.03966426 -0.00121545  0.03869091 -0.04272134
  0.03643535  0.00567384  0.00246003 -0.04250647]

word:  fast
embed:  [-0.01659232  0.06137965 -0.01092987  0.02365591 -0.0138125  -0.01203378
 -0.00972914 -0.05885596 -0.01261965 -0.0577055 ]

word:  red
embed:  [-0.02509159  0.00884627 -0.10083688  0.01320896  0.01490394  0.02841406
  0.15962426  0.01331032  0.03514304 -0.04301136]

word:  blue
embed:  [-0.06580827  0.0203764  -0.05504949 -0.00301157  0.01343209  0.02449333
  0.20061415 -0.00983796  0.04382765 -0.01033155]

word:  ferari
embed:  [ 0.00451991  0.08644823 -0.12962112  0.05201175  0.01814397 -0.07465909
  0.07894592  0.05689763 -0.00582556 -0.09301732]
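
Cosine similarity, used in the cells that follow, is the dot product of two vectors divided by the product of their lengths: it is 1 when they point the same way, 0 when they are perpendicular, and -1 when they point in opposite directions. A minimal pure-Python sketch on tiny hand-made vectors (the helper name `cosine_sim` is made up here, to avoid clashing with the library functions):

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction, different length
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # perpendicular
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the lengths are divided out, only the direction of an embedding matters, not its magnitude.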



In [None]:
# prompt: python code to compute the matrix of cosines between all of the pairs of words in the list above.

from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity between all pairs of words
word_embeddings = model.encode(words)
word_similarities = cosine_similarity(word_embeddings)
# Print the word similarities
print(word_similarities)


[[1.0000001  0.6515874  0.3388258  0.33914232 0.28320336]
 [0.6515874  1.         0.32009655 0.30601805 0.26345903]
 [0.3388258  0.32009655 1.         0.72944736 0.26313198]
 [0.33914232 0.30601805 0.72944736 1.         0.22827557]
 [0.28320336 0.26345903 0.26313198 0.22827557 0.99999976]]
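
Row i, column j of this matrix is the similarity between word i and word j, so the diagonal is 1 (up to floating-point rounding) and the matrix is symmetric. One way to read it is to ask, for each word, which other word scores highest. The sketch below does that with the numbers copied from the output above; `nearest_neighbor` is a hypothetical helper, not a library function.

```python
words = ["quick", "fast", "red", "blue", "ferari"]

# Similarity matrix copied from the output above.
similarities = [
    [1.0000001, 0.6515874, 0.3388258, 0.33914232, 0.28320336],
    [0.6515874, 1.0, 0.32009655, 0.30601805, 0.26345903],
    [0.3388258, 0.32009655, 1.0, 0.72944736, 0.26313198],
    [0.33914232, 0.30601805, 0.72944736, 1.0, 0.22827557],
    [0.28320336, 0.26345903, 0.26313198, 0.22827557, 0.99999976],
]

def nearest_neighbor(i, sims):
    """Index and score of the most similar item to item i, excluding i itself."""
    candidates = [(score, j) for j, score in enumerate(sims[i]) if j != i]
    best_score, best_j = max(candidates)
    return best_j, best_score

for i, word in enumerate(words):
    j, score = nearest_neighbor(i, similarities)
    print(f"{word:>6} -> {words[j]:<6} ({score:.3f})")
```

The pairs "quick"/"fast" and "red"/"blue" pick each other, while the misspelled "ferari" is not strongly similar to anything in the list.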


In [None]:
# prompt: python code for computing cosine similarity between sentence vector embeddings from the above tokenizer and model

from scipy.spatial.distance import cosine
for sentence in sentences:
    print("Sentence:", sentence)
print("")
# Note: scipy's cosine() returns the cosine *distance*, i.e. 1 - cosine similarity,
# so smaller values mean more similar sentences.
print("Cosine distance between the first two sentences:", cosine(embeddings[0], embeddings[1]))
print("Cosine distance between the second and third sentences:", cosine(embeddings[1], embeddings[2]))
print("Cosine distance between the first and third sentences:", cosine(embeddings[0], embeddings[2]))


Sentence: This framework generates embeddings for each input sentence
Sentence: Sentences are passed as a list of string.
Sentence: The quick brown fox jumps over the lazy dog.

Cosine distance between the first two sentences: 0.4619206190109253
Cosine distance between the second and third sentences: 0.8964101523160934
Cosine distance between the first and third sentences: 0.8819436207413673
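
A caution when reading these three numbers: `scipy.spatial.distance.cosine` returns the cosine distance, which is 1 minus the cosine similarity, so a smaller value means the sentences are closer. The short sketch below converts the printed values back to similarities; the dictionary keys are just labels made up for this example.

```python
# Values printed above by scipy.spatial.distance.cosine (cosine distances).
distances = {
    ("first", "second"): 0.4619206190109253,
    ("second", "third"): 0.8964101523160934,
    ("first", "third"): 0.8819436207413673,
}

# Cosine similarity = 1 - cosine distance.
similarities = {pair: 1.0 - d for pair, d in distances.items()}

for (a, b), sim in similarities.items():
    print(f"{a} vs {b}: distance={distances[(a, b)]:.3f}, similarity={sim:.3f}")

most_similar = max(similarities, key=similarities.get)
print("Most similar pair:", most_similar)
```

Read this way, the two sentences about embeddings and lists of strings are the closest pair, and the fox sentence is far from both, which matches intuition.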


In [15]:
pip install transformers



In [18]:
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input sentence with the word "flies" used as a verb and as a plural noun
sentence = "The bird flies in the sky, and the flies in the garden are annoying."

# Tokenize the input sentence
tokens = tokenizer(sentence, return_tensors='pt')

# Get the output from the BERT model (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(**tokens)

# Extract the contextual embeddings of each token
embeddings = outputs.last_hidden_state

# Convert the ids to token strings once, then find every position of "flies"
token_strings = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0].tolist())
flies_indices = [i for i, token in enumerate(token_strings) if token == "flies"]

# Print the contextual embeddings of the word "flies" in both contexts
for index in flies_indices:
    print(f"Context: {token_strings[index]}, Embedding: {embeddings[0][index].detach().numpy()}")


# Extract the embeddings for each use of "flies"
flies_embeddings = [embeddings[0][index].detach().numpy() for index in flies_indices]

# Calculate the cosine similarity between the different uses of "flies"
similarity_matrix = cosine_similarity(flies_embeddings)

# Print the cosine similarity matrix
print("Cosine Similarity Matrix:")
print(similarity_matrix)






Context: flies, Embedding: [ 6.91121817e-01  7.08067596e-01  3.21222454e-01  2.33792424e-01
  2.42526740e-01 -4.00673896e-01 -1.75818533e-01  1.07339299e+00
 -4.72747125e-02 -5.23784518e-01  6.03983164e-01 -4.21403438e-01
 -3.38130206e-01  6.07586950e-02 -2.83635199e-01  3.54021102e-01
 -1.02077536e-01 -1.31319925e-01 -5.14633119e-01  1.98111430e-01
 -2.02998564e-01  1.41545817e-01 -1.01255760e-01 -2.08421439e-01
  5.68717182e-01 -5.42165279e-01 -1.18277267e-01  4.57483411e-01
 -3.43659490e-01 -8.73050034e-01  1.97320104e-01  5.59366643e-01
  3.18830721e-02  7.43251026e-01  3.58462363e-01 -6.09825552e-01
 -1.29318714e-01  6.58027455e-02 -5.67198582e-02  1.76252946e-01
 -3.12179297e-01 -1.01587892e+00 -2.06433475e-01  5.45573980e-02
  1.55587988e-02  1.70413837e-01 -2.01799437e-01  7.17774332e-02
  8.74774158e-01 -7.31991351e-01 -4.99139369e-01  3.26864332e-01
  7.37306029e-02  4.05852050e-01 -4.68484238e-02  3.20013791e-01
  7.08170831e-01 -1.22925565e-01 -1.24113813e-01 -6.73834860e-0