<a href="https://colab.research.google.com/github/aicrashcoursewinter24/aicrashcourseEthanB/blob/Lab3/aicrashcourseEthanB/lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Today, we'll play with turning text into numeric vectors (the process of "vectorization"), which first requires splitting a long string into something closer to a list of words (or characters).

This latter process is "tokenization": each word, sub-word, or character (the atomic unit of text) is called a "token".
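
To make the idea concrete, here is a minimal sketch of word-level and character-level tokenization in plain Python. The helper name `simple_tokenize` and its regular expression are assumptions made for illustration; real tokenizers (like the one used below) are more sophisticated.

```python
import re

text = "Do not meddle in the affairs of wizards"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every character (spaces included) is its own token.
char_tokens = list(text)

def simple_tokenize(s):
    """Hypothetical helper: split into words and separate punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s)

print(word_tokens)
print(char_tokens[:6])
print(simple_tokenize("Hello, world!"))
```

The choice of unit is a trade-off: word-level tokens give short sequences but a huge vocabulary, while character-level tokens give a tiny vocabulary but long sequences.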

Start by installing the "datasets" Python package, which gives you access to helpful utilities for downloading public datasets from HuggingFace and elsewhere.

In [None]:
! pip install datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


There are pre-built tokenizer models, which bundle both code and a mapping between tokens and token *ids*: integers that will serve as feature columns for the text.
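
As a rough picture of what such a token-to-id mapping looks like, here is a toy sketch. The vocabulary, the `[UNK]` fallback id, and the function names `toy_encode` / `toy_decode` are all made up for illustration; a real tokenizer ships a vocabulary with tens of thousands of entries.

```python
# Hypothetical toy vocabulary: token -> integer id.
toy_vocab = {"[UNK]": 0, "do": 1, "not": 2, "meddle": 3, "wizards": 4}
# Reverse mapping: integer id -> token.
id_to_token = {i: t for t, i in toy_vocab.items()}

def toy_encode(text):
    """Lowercase, split on whitespace, and look up each token's id."""
    return [toy_vocab.get(w, toy_vocab["[UNK]"]) for w in text.lower().split()]

def toy_decode(ids):
    """Map ids back to their tokens."""
    return [id_to_token[i] for i in ids]

ids = toy_encode("Do not meddle with wizards")
print(ids)
print(toy_decode(ids))
```

Note that "with" is not in the toy vocabulary, so it falls back to the unknown-token id; sub-word tokenizers reduce how often this happens by breaking rare words into smaller known pieces.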

We will first use the tokenizer from BERT, a model built on the encoder of the transformer architecture introduced in the "[Attention is All You Need](https://arxiv.org/abs/1706.03762)" paper, in a form which knows how to differentiate between lowercase and uppercase characters (some tokenizers lowercase everything first).  It's called "bert-base-cased".

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In the output above, you should see a message about the "HF_TOKEN" secret, along with a link to HuggingFace where you can generate your HF token (see the note below about the word "token"). On the left side of the Colab screen, there is a "key" icon: you can store your token as a secret there.  Name it HF_TOKEN and give it "notebook access" via the toggle.
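
If you want to check from code whether the secret is visible to your notebook, a minimal sketch could look like the following. The helper name `get_hf_token` is hypothetical, and the fallback to an `HF_TOKEN` environment variable is an assumption about how you might run this outside Colab.

```python
import os

def get_hf_token():
    """Hypothetical helper: return the HF_TOKEN secret if available, else None."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except Exception:
        # Not in Colab, secret missing, or notebook access not granted:
        # fall back to an environment variable (an assumption for local runs).
        return os.environ.get("HF_TOKEN")

token = get_hf_token()
print("HF token found:", token is not None)
```

Avoid printing the token itself: it is a credential, and notebook outputs tend to get saved and shared.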


--

Note on "token": there are now two completely unrelated uses of the word "token" in this lab:

* "token": a unit of text like a word or character (or even multi-word phrase) used in text preprocessing
* "HF_TOKEN": a password-like thing for getting access to HuggingFace

In [None]:
encoded = tokenizer.encode("Do not meddle in the affairs of wizards")

In [None]:
# prompt: write python code to print the textual tokens in sequential order from a string, using the above tokenizer

print(tokenizer.convert_ids_to_tokens(encoded))


['[CLS]', 'Do', 'not', 'me', '##ddle', 'in', 'the', 'affairs', 'of', 'wizard', '##s', '[SEP]']


In [None]:
print(encoded)

[101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 102]
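
Notice the pieces that start with `##` in the token list above: in BERT's WordPiece scheme, that prefix marks a sub-word that continues the previous token. Here is a small sketch that glues the pieces back into whole words; the function name `merge_wordpieces` is made up for this example, and it only handles the `[CLS]` / `[SEP]` special tokens seen above.

```python
def merge_wordpieces(tokens):
    """Hypothetical helper: rebuild words from WordPiece tokens."""
    words = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            # Skip the special start/end markers.
            continue
        if tok.startswith("##") and words:
            # A continuation piece: append it to the previous word.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

tokens = ['[CLS]', 'Do', 'not', 'me', '##ddle', 'in', 'the',
          'affairs', 'of', 'wizard', '##s', '[SEP]']
print(merge_wordpieces(tokens))
```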


In [None]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input.keys())
print(encoded_input['input_ids'])

In [None]:
! pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

At this point, go ahead and explore the vector representation (the "embedding") of any sentence (or string of text, more generally): look at the tokenized form and the list of token_id integers, or compute cosine similarities between the embeddings:
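
As a reminder of what cosine similarity measures, here is a minimal pure-Python sketch: the dot product of two vectors divided by the product of their lengths. The function name `cosine_similarity_1d` is made up for this example; in practice you would use a library routine such as the scikit-learn one in the cells below.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, perpendicular, and opposite direction.
print(cosine_similarity_1d([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity_1d([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity_1d([1.0, 2.0], [-1.0, -2.0]))
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions. The length of the vectors does not matter, only their direction.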

In [None]:
words = ["quick", "fast", "red", "blue", "ferari"]
single_word_embeddings = model.encode(words)

for word, embed in zip(words, single_word_embeddings):
  print("word: ", word)
  print("embed: ", embed[0:10])
  print("")



word:  quick
embed:  [-0.01363505  0.02511056 -0.03966426 -0.00121545  0.03869091 -0.04272134
  0.03643535  0.00567384  0.00246003 -0.04250647]

word:  fast
embed:  [-0.01659232  0.06137965 -0.01092987  0.02365591 -0.0138125  -0.01203378
 -0.00972914 -0.05885596 -0.01261965 -0.0577055 ]

word:  red
embed:  [-0.02509159  0.00884627 -0.10083688  0.01320896  0.01490394  0.02841406
  0.15962426  0.01331032  0.03514304 -0.04301136]

word:  blue
embed:  [-0.06580827  0.0203764  -0.05504949 -0.00301157  0.01343209  0.02449333
  0.20061415 -0.00983796  0.04382765 -0.01033155]

word:  ferari
embed:  [ 0.00451991  0.08644823 -0.12962112  0.05201175  0.01814397 -0.07465909
  0.07894592  0.05689763 -0.00582556 -0.09301732]



In [None]:
# prompt: python code to compute the matrix of cosines between all of the pairs of words in the list above.

from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity between all pairs of words
word_embeddings = model.encode(words)
word_similarities = cosine_similarity(word_embeddings)
# Print the word similarities
print(word_similarities)


[[1.0000001  0.6515874  0.3388258  0.33914232 0.28320336]
 [0.6515874  1.         0.32009655 0.30601805 0.26345903]
 [0.3388258  0.32009655 1.         0.72944736 0.26313198]
 [0.33914232 0.30601805 0.72944736 1.         0.22827557]
 [0.28320336 0.26345903 0.26313198 0.22827557 0.99999976]]
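
A matrix like this is easier to read if you pull out each word's closest neighbor. The sketch below copies the values printed above into a plain list and ignores the diagonal (every word is trivially most similar to itself); the function name `most_similar` is made up for this example.

```python
labels = ["quick", "fast", "red", "blue", "ferari"]
# Values copied from the similarity matrix printed above.
sim = [
    [1.0000001, 0.6515874, 0.3388258, 0.33914232, 0.28320336],
    [0.6515874, 1.0, 0.32009655, 0.30601805, 0.26345903],
    [0.3388258, 0.32009655, 1.0, 0.72944736, 0.26313198],
    [0.33914232, 0.30601805, 0.72944736, 1.0, 0.22827557],
    [0.28320336, 0.26345903, 0.26313198, 0.22827557, 0.99999976],
]

def most_similar(labels, sim):
    """Hypothetical helper: for each label, find the other label with the highest similarity."""
    result = {}
    for i, row in enumerate(sim):
        # Skip index i so a word is never matched with itself.
        best_j = max((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        result[labels[i]] = (labels[best_j], row[best_j])
    return result

for word, (neighbor, score) in most_similar(labels, sim).items():
    print(f"{word} -> {neighbor} ({score:.3f})")
```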


In [None]:
# prompt: python code for computing cosine similarity between sentence vector embeddings from the above tokenizer and model

from scipy.spatial.distance import cosine
for sentence in sentences:
    print("Sentence:", sentence)
print("")
# scipy's cosine() returns the cosine *distance* (1 - similarity), so convert it back to a similarity
print("Cosine similarity between the first two sentences:", 1 - cosine(embeddings[0], embeddings[1]))
print("Cosine similarity between the second and third sentences:", 1 - cosine(embeddings[1], embeddings[2]))
print("Cosine similarity between the first and third sentences:", 1 - cosine(embeddings[0], embeddings[2]))


In [None]:
# prompt: python code to compute the contextual BERT embeddings for all the words in the sentence "time flies like an arrow, fruit flies like a banana"

model = SentenceTransformer('all-MiniLM-L6-v2') # this is literally the only line I kept from the AI-generated code, and it is redundant because the model was already loaded earlier in the notebook.
encoded_input = tokenizer("time flies like an arrow, fruit flies like a banana")
print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))
print(encoded_input['input_ids'])
# Both occurrences of 'flies' get the same token id when tokenized like this (same for 'like').


['[CLS]', 'time', 'flies', 'like', 'an', 'arrow', ',', 'fruit', 'flies', 'like', 'a', 'banana', '[SEP]']
[101, 1159, 10498, 1176, 1126, 11473, 117, 5735, 10498, 1176, 170, 21806, 102]


In [None]:
# prompt: Give me a sentence with 2 words, spelt the same, but with different meanings, but make it a unique such sentence that has nothing to do with flies or time
#Yes I am very creative. I spent like 30 mins just trying to get to the point where I can type and now gotta figure out how to save from colabs to github. If you are reading this I succeeded.
print("The data file is on the disk, but the disk drive is not working.")
encoded_input = tokenizer("The data file is on the disk, but the disk drive is not working.")
print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))
print(encoded_input['input_ids'])

The data file is on the disk, but the disk drive is not working.
['[CLS]', 'The', 'data', 'file', 'is', 'on', 'the', 'disk', ',', 'but', 'the', 'disk', 'drive', 'is', 'not', 'working', '.', '[SEP]']
[101, 1109, 2233, 4956, 1110, 1113, 1103, 10437, 117, 1133, 1103, 10437, 2797, 1110, 1136, 1684, 119, 102]


In [None]:
#using the above code to answer a question
words = ['flies', 'insect', 'bird', 'like', 'enjoy', 'similar']
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity between all pairs of words
word_embeddings = model.encode(words)
word_similarities = cosine_similarity(word_embeddings)
# Print the word similarities
print(word_similarities)

# I believe this shows that the model associates 'flies' more strongly with the insect, and 'like' more strongly with the comparison usage ('similar') than with 'enjoy'.


[[1.         0.61485314 0.4749211  0.296232   0.17646007 0.23970282]
 [0.61485314 1.0000005  0.5042044  0.15512988 0.11409886 0.21923347]
 [0.4749211  0.5042044  0.9999999  0.19844429 0.19370267 0.2452532 ]
 [0.296232   0.15512988 0.19844429 1.         0.15361592 0.52412146]
 [0.17646007 0.11409886 0.19370267 0.15361592 1.0000002  0.19263732]
 [0.23970282 0.21923347 0.2452532  0.52412146 0.19263732 1.0000002 ]]


In [2]:
# prompt: write python code to get the contextual embeddings of words in a sentence to demonstrate how the bert model sees the word 'flies' differently depending on the context
#colab AI is dumb so I used chatgpt
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Define a sentence with different contexts for the word 'flies'
sentences = [
    "The bird flies in the sky.",
    "He flies a kite in the park.",
    "Time flies when you're having fun."
]

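# Locate the position(s) to use as the embedding for the word 'flies'.
# Caveat: tokenizer.encode() adds [CLS] and [SEP], so word_tokens is expected to be
# ['[CLS]', 'flies', '[SEP]'] and word_tokens[0] is '[CLS]' rather than 'flies'.
# The indices are also computed from sentences[0] only and then reused for every sentence.
word = 'flies'
word_tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(word)))
word_indices = [i for i, token in enumerate(tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentences[0])))) if token == word_tokens[0]]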

embeddings_list = []

for sentence in sentences:
    input_ids = tokenizer.encode(sentence, return_tensors='pt')

    with torch.no_grad():
        outputs = model(input_ids)
        embeddings = outputs.last_hidden_state.squeeze(0)

    # Get the contextual embedding for the word 'flies' by averaging embeddings of its subwords
    flies_embedding = torch.mean(embeddings[word_indices], dim=0)
    embeddings_list.append(flies_embedding)

    # Print the context and the contextual embedding for 'flies'
    #print(f"Context: {sentence}")
    #print(f"Contextual embedding for 'flies': {flies_embedding}")
    #print("\n" + "=" * 50 + "\n")

# I think this is what is required for the contextual embeddings; it is not nearly as clean as the tokens generated in the previous sections.
# Here is what it generated for the cosine similarities between the three uses of 'flies'.

# Calculate cosine similarity between the embeddings
similarity_matrix = cosine_similarity(torch.stack(embeddings_list, dim=0))

# Print the similarity matrix
print("Cosine Similarity Matrix:")
print(similarity_matrix)


Cosine Similarity Matrix:
[[1.         0.81232595 0.7941936 ]
 [0.81232595 0.9999998  0.7914548 ]
 [0.7941936  0.7914548  1.        ]]
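
When pulling a single word's contextual vector out of a model's output, the fiddly part is finding which positions in the token sequence belong to that word, separately for each sentence. Here is a minimal pure-Python sketch of that lookup. The function name `find_piece_spans` is made up for this example, and the token lists are copied from outputs earlier in this notebook.

```python
def find_piece_spans(sentence_tokens, word_pieces):
    """Hypothetical helper: return one list of positions per occurrence of
    the consecutive word_pieces inside sentence_tokens."""
    n = len(word_pieces)
    return [
        list(range(i, i + n))
        for i in range(len(sentence_tokens) - n + 1)
        if sentence_tokens[i:i + n] == word_pieces
    ]

# Token lists copied from earlier outputs in this notebook.
tokens_a = ['[CLS]', 'time', 'flies', 'like', 'an', 'arrow', ',',
            'fruit', 'flies', 'like', 'a', 'banana', '[SEP]']
tokens_b = ['[CLS]', 'Do', 'not', 'me', '##ddle', 'in', 'the',
            'affairs', 'of', 'wizard', '##s', '[SEP]']

print(find_piece_spans(tokens_a, ['flies']))         # two occurrences
print(find_piece_spans(tokens_b, ['me', '##ddle']))  # one two-piece word
```

In a real pipeline, `sentence_tokens` could come from `tokenizer.convert_ids_to_tokens(...)` applied to the encoded sentence, and `word_pieces` from `tokenizer.tokenize(word)`, which does not add `[CLS]` / `[SEP]`. Averaging the rows of `last_hidden_state` at one returned span would then give the contextual vector for that occurrence of the word.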


In [17]:
words = ['there', 'their', 'they\'re']

# Example sentences
sentences = [
    "It is over there",
    "It is over their",
    "It is over they're"
]

# Get contextual embeddings for each word in the list
embeddings_list = []

for word in words:
    word_tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(word)))

    for sentence in sentences:
        word_indices = [i for i, token in enumerate(tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentence)))) if token in word_tokens]

        input_ids = tokenizer.encode(sentence, return_tensors='pt')

        with torch.no_grad():
            outputs = model(input_ids)
            embeddings = outputs.last_hidden_state.squeeze(0)

        # Get the contextual embedding for the current word by averaging embeddings of its subwords
        word_embedding = torch.mean(embeddings[word_indices], dim=0)  # Fix the reshaping here
        embeddings_list.append(word_embedding)

# Reshape the embeddings list for comparison
embeddings_tensor = torch.stack(embeddings_list).reshape(len(words), len(sentences), -1)

# Calculate cosine similarity between the embeddings
similarity_matrix = cosine_similarity(embeddings_tensor.reshape(len(words), -1), embeddings_tensor.reshape(len(words), -1))

# Print the similarity matrix
print("Cosine Similarity Matrix:")
print("'there', 'their', they're")
print(similarity_matrix)

# This shows that the three words ('there', 'their', "they're") are encoded as very similar. I am going to repeat this experiment with other sentences below.

Cosine Similarity Matrix:
'there', 'their', they're
[[1.0000002  0.9122655  0.88896024]
 [0.9122655  1.         0.89280814]
 [0.88896024 0.89280814 1.        ]]


In [16]:
words = ['there', 'their', 'they\'re']

# Example sentences
sentences = [
    "It is over there",
    "It is their house",
    "Golly, they're super nice"
]

# Get contextual embeddings for each word in the list
embeddings_list = []

for word in words:
    word_tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(word)))

    for sentence in sentences:
        word_indices = [i for i, token in enumerate(tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentence)))) if token in word_tokens]

        input_ids = tokenizer.encode(sentence, return_tensors='pt')

        with torch.no_grad():
            outputs = model(input_ids)
            embeddings = outputs.last_hidden_state.squeeze(0)

        # Get the contextual embedding for the current word by averaging embeddings of its subwords
        word_embedding = torch.mean(embeddings[word_indices], dim=0)  # Fix the reshaping here
        embeddings_list.append(word_embedding)

# Reshape the embeddings list for comparison
embeddings_tensor = torch.stack(embeddings_list).reshape(len(words), len(sentences), -1)

# Calculate cosine similarity between the embeddings
similarity_matrix = cosine_similarity(embeddings_tensor.reshape(len(words), -1), embeddings_tensor.reshape(len(words), -1))

# Print the similarity matrix
print("Cosine Similarity Matrix:")
print("'there', 'their', they're")
print(similarity_matrix)
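# That is really neat! The similarities are slightly lower here, where each word is used correctly, than in the previous cell, where the words were misused in the same sentence frame.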

Cosine Similarity Matrix:
'there', 'their', they're
[[1.        0.9063654 0.8875115]
 [0.9063654 1.0000001 0.8878656]
 [0.8875115 0.8878656 1.0000002]]


In [20]:
words = ['your', 'you\'re', 'yore']

# Example sentences
sentences = [
    "Please bring your book to the meeting",
    " If you're not sure, please ask for clarification",
    "In the days of yore, people relied on handwritten letters for communication."
]

# Get contextual embeddings for each word in the list
embeddings_list = []

for word in words:
    word_tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(word)))

    for sentence in sentences:
        word_indices = [i for i, token in enumerate(tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentence)))) if token in word_tokens]

        input_ids = tokenizer.encode(sentence, return_tensors='pt')

        with torch.no_grad():
            outputs = model(input_ids)
            embeddings = outputs.last_hidden_state.squeeze(0)

        # Get the contextual embedding for the current word by averaging embeddings of its subwords
        word_embedding = torch.mean(embeddings[word_indices], dim=0)  # Fix the reshaping here
        embeddings_list.append(word_embedding)

# Reshape the embeddings list for comparison
embeddings_tensor = torch.stack(embeddings_list).reshape(len(words), len(sentences), -1)

# Calculate cosine similarity between the embeddings
similarity_matrix = cosine_similarity(embeddings_tensor.reshape(len(words), -1), embeddings_tensor.reshape(len(words), -1))

# Print the similarity matrix
print("Cosine Similarity Matrix:")
print("'your', 'you're', 'yore'")
print(similarity_matrix)


sentences = [
    "Please bring your book to the meeting",
    "Please bring you're book to the meeting",
    "Please bring yore book to the meeting"
]
# Get contextual embeddings for each word in the list
embeddings_list = []

for word in words:
    word_tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(word)))

    for sentence in sentences:
        word_indices = [i for i, token in enumerate(tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentence)))) if token in word_tokens]

        input_ids = tokenizer.encode(sentence, return_tensors='pt')

        with torch.no_grad():
            outputs = model(input_ids)
            embeddings = outputs.last_hidden_state.squeeze(0)

        # Get the contextual embedding for the current word by averaging embeddings of its subwords
        word_embedding = torch.mean(embeddings[word_indices], dim=0)  # Fix the reshaping here
        embeddings_list.append(word_embedding)

# Reshape the embeddings list for comparison
embeddings_tensor = torch.stack(embeddings_list).reshape(len(words), len(sentences), -1)

# Calculate cosine similarity between the embeddings
similarity_matrix = cosine_similarity(embeddings_tensor.reshape(len(words), -1), embeddings_tensor.reshape(len(words), -1))

# Print the similarity matrix
print("Cosine Similarity Matrix:")
print("'your', 'you're', 'yore'")
print(similarity_matrix)

#once again a very similar situation

Cosine Similarity Matrix:
'your', 'you're', 'yore'
[[0.9999997 0.8873764 0.8885362]
 [0.8873764 0.9999999 0.8735747]
 [0.8885362 0.8735747 1.0000005]]
Cosine Similarity Matrix:
'your', 'you're', 'yore'
[[1.0000001  0.86687475 0.81430036]
 [0.86687475 1.0000001  0.8054873 ]
 [0.81430036 0.8054873  0.9999997 ]]
