# M3-L3-Screencasts

## M3-L3-SC1: How Tokenization Works: Words, Subwords, and Transformers

### Step 1: Setting Up the Environment
Install Hugging Face Transformers, import the libraries, download the NLTK tokenizer data, and load the BERT tokenizer.

In [None]:
!pip install transformers
from transformers import BertTokenizer
import nltk

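# Download the Punkt tokenizer data used by word_tokenize.
# Recent NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt_tab')
nltk.download('punkt')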

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



### Step 2: Understanding Word Tokenization
Use NLTK's `word_tokenize` to split a sentence into word tokens. Punctuation marks become separate tokens.

In [None]:
from nltk.tokenize import word_tokenize

# Sample sentence
sentence = "Natural Language Processing unravels the complexity of language."

# Word tokenization
words = word_tokenize(sentence)
print("Word Tokens:", words)


Word Tokens: ['Natural', 'Language', 'Processing', 'unravels', 'the', 'complexity', 'of', 'language', '.']
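
To make the idea of word tokenization concrete, here is a minimal sketch that uses only a regular expression. The function name `simple_word_tokenize` is hypothetical, and this is not how NLTK works internally: `word_tokenize` applies more elaborate rules (for example, for contractions), so the two will disagree on harder inputs. On this sample sentence the sketch happens to produce the same tokens.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Natural Language Processing unravels the complexity of language."
tokens = simple_word_tokenize(sentence)
print("Word Tokens:", tokens)
```

A rule this simple treats every punctuation character as its own token, which is usually too coarse for text with abbreviations or contractions.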


### Step 3: Introducing Subword Tokenization
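Demonstrate subword tokenization with the BERT tokenizer. BERT uses WordPiece: a word that is missing from the vocabulary is split into smaller pieces, and a `##` prefix marks a piece that continues the preceding one. The tokenizer also adds the special tokens `[CLS]` and `[SEP]` at the start and end of the input, and `bert-base-uncased` lowercases the text first.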

In [None]:
# Subword tokenization using BERT
sentence = "Unprecedented thunderstorms affected the megacity extensively."

# Tokenize using BERT tokenizer
encoded_input = tokenizer(sentence)
print("Subword Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))


Subword Tokens: ['[CLS]', 'unprecedented', 'thunder', '##storm', '##s', 'affected', 'the', 'mega', '##city', 'extensively', '.', '[SEP]']
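
The splits in the output above can be reproduced in spirit with a greedy longest-match-first search, which is the matching strategy WordPiece tokenizers are generally described as using at inference time. The sketch below is an illustration under stated assumptions, not the Hugging Face implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary contains only a handful of entries picked to mirror the example sentence.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary chosen for illustration; the real BERT vocabulary has about 30,000 entries.
toy_vocab = {"thunder", "##storm", "##s", "mega", "##city", "the"}

for word in ["thunderstorms", "megacity", "the", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Because the search always takes the longest piece that fits, `thunderstorms` becomes `thunder`, `##storm`, `##s` rather than a longer chain of shorter pieces.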


## M3-L3-SC2: Getting Word Vectors and Token Similarity with spaCy

### Step 1: Preparing Your Environment
Install spaCy, download the `en_core_web_md` model (which includes word vectors), and load it.

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_md
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_md")



### Step 2: Tokenizing and Vectorizing Text
Tokenize a sample sentence and print the first five components of each token's word vector.

In [None]:
# Sample text
text = "Understanding language requires both context and experience."

# Tokenize the text
doc = nlp(text)

# Extract vectors for each token
for token in doc:
    print(f"Token: {token.text}, Vector: {token.vector[:5]}")  # Display partial vector for brevity


Token: Understanding, Vector: [-0.63676  0.12778 -0.45423  0.29087 -0.51181]
Token: language, Vector: [-0.62498 -0.8816  -0.60641  0.33662  0.23677]
Token: requires, Vector: [-0.65263  0.79872 -0.3831  -0.22159 -0.51216]
Token: both, Vector: [-0.60053   0.18838  -0.40993   0.3225    0.070322]
Token: context, Vector: [-0.69987   -0.19314   -0.0069517 -0.098401  -0.32545  ]
Token: and, Vector: [-1.1728   0.24514 -0.38037 -0.0536  -0.6409 ]
Token: experience, Vector: [-0.67224  0.45838 -0.18926 -0.54811 -0.20009]
Token: ., Vector: [-0.73351   0.41392  -0.4425   -0.29127  -0.096179]


### Step 3: Calculating Token Similarity
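Compute a similarity score between two tokens. By default, `Token.similarity` compares the tokens' word vectors using cosine similarity, so a score closer to 1 means the vectors point in more similar directions.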

In [None]:
# Calculate token similarity
token1, token2 = doc[0], doc[1]  # Use the first two tokens as an example
similarity = token1.similarity(token2)
print(f"Similarity between '{token1.text}' and '{token2.text}': {similarity:.4f}")

Similarity between 'Understanding' and 'language': 0.4575
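
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below implements it in plain Python on made-up three-dimensional vectors so the arithmetic is easy to follow. The function name and the choice to return 0.0 for a zero vector are assumptions made for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention assumed here for zero vectors
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]  # orthogonal to a (dot product is 0)

print(round(cosine_similarity(a, b), 4))
print(round(cosine_similarity(a, c), 4))
```

Vectors that point the same way score 1 regardless of length, and orthogonal vectors score 0, which is why the measure suits word vectors whose magnitudes vary.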


## M3-L3-SC3: Creating Sentence Embeddings with Hugging Face Transformers

### Step 1: Preparing Your Workspace
Install Transformers and import the libraries used in this screencast.

In [1]:
!pip install transformers
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np



### Step 2: Selecting and Loading a Transformer Model
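Load a pre-trained transformer model and its tokenizer from Hugging Face. This example uses `sentence-transformers/all-MiniLM-L6-v2`, a compact MiniLM model tuned for sentence embeddings; general-purpose models such as DistilBERT or BERT can be loaded the same way.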

In [2]:
# Load pre-trained model and tokenizer
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


### Step 3: Tokenizing and Preparing Input Sentences
Tokenize sample sentences to prepare them for embedding extraction.

In [3]:
# Sample sentences
sentence_1 = "Exploring the world of AI opens new horizons."
sentence_2 = "The vast potentials of machine learning are intriguing."

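# Tokenize each sentence separately and return PyTorch tensors.
# With a single sentence per call, padding=True has nothing to pad.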
encoded_input_1 = tokenizer(sentence_1, return_tensors='pt', padding=True, truncation=True)
encoded_input_2 = tokenizer(sentence_2, return_tensors='pt', padding=True, truncation=True)

print(f'Encoded Input 1: {encoded_input_1}')
print(f'Encoded Input 2: {encoded_input_2}')

Encoded Input 1: {'input_ids': tensor([[  101, 11131,  1996,  2088,  1997,  9932,  7480,  2047, 24484,  1012,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Encoded Input 2: {'input_ids': tensor([[  101,  1996,  6565,  4022,  2015,  1997,  3698,  4083,  2024, 23824,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
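
In the output above every `attention_mask` entry is 1, because each sentence was encoded in its own call and needed no padding. When sentences of different lengths share a batch, the shorter ones are padded and the mask marks which positions hold real tokens. The sketch below imitates that step in plain Python, reusing the `input_ids` printed above. The helper name `pad_batch` is hypothetical, and the padding ID of 0 is an assumption that matches `[PAD]` in BERT-style vocabularies.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Token IDs copied from the encoded inputs printed above.
ids_1 = [101, 11131, 1996, 2088, 1997, 9932, 7480, 2047, 24484, 1012, 102]
ids_2 = [101, 1996, 6565, 4022, 2015, 1997, 3698, 4083, 2024, 23824, 1012, 102]

input_ids, attention_mask = pad_batch([ids_1, ids_2])
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```

The first sentence has 11 tokens and the second has 12, so only the first receives one padding position with a mask value of 0.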


### Step 4: Generating Sentence Embeddings
Run each encoded sentence through the model, then average the token embeddings in `last_hidden_state` to obtain one vector per sentence (mean pooling).
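
Averaging token vectors is straightforward when an input has no padding, as is the case in this step. With padded batches, the average should usually skip padding positions, which is what the attention mask is for. The following plain-Python sketch shows that masked average on made-up two-dimensional vectors. The function name `masked_mean_pool` is hypothetical, and real code would typically do the same arithmetic with tensor operations.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]

# Made-up 2-dimensional token vectors; the last row stands in for a padding position.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(masked_mean_pool(token_vectors, attention_mask))
```

Without the mask, the padding row would pull the average far away from the two real tokens.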

In [7]:
# Extract embeddings
with torch.no_grad():
    output_1 = model(**encoded_input_1)
    output_2 = model(**encoded_input_2)

# Mean pooling: average the token embeddings over the sequence dimension (dim=1).
# A plain mean is safe here because each input is a single, unpadded sentence.
sentence_embedding_1 = output_1.last_hidden_state.mean(dim=1)
sentence_embedding_2 = output_2.last_hidden_state.mean(dim=1)

print(f'Sentence Embedding 1:\n{sentence_embedding_1}')
print(f'Sentence Embedding 2:\n{sentence_embedding_2}')

Sentence Embedding 1:
tensor([[ 1.3045e-01, -2.2520e-01,  1.1124e-01, -1.4334e-01,  7.7613e-01,
         -2.7930e-02, -1.2690e-01, -8.1524e-02, -6.6611e-03,  1.5771e-01,
         -1.9649e-01, -3.5989e-02, -2.0353e-01,  4.5270e-02, -3.1694e-02,
          1.6170e-01, -1.9161e-01, -1.5843e-01, -2.0201e-01, -6.2210e-01,
         -1.9935e-01,  2.0731e-01,  2.5230e-01, -1.8501e-01, -3.4915e-01,
          3.5247e-01,  2.9153e-01, -3.0263e-01, -6.6631e-02, -1.4376e-01,
          3.6614e-01,  4.0660e-02,  3.2357e-01,  6.3286e-02, -1.0256e-01,
          5.3813e-01,  2.0616e-01, -1.2313e-01,  2.5706e-01, -2.4559e-01,
          5.4632e-02, -3.7992e-01,  2.4372e-01, -9.5749e-02,  4.9022e-01,
          2.7312e-01, -2.5626e-01,  1.4637e-02,  5.3570e-01,  4.8580e-02,
         -6.2182e-01, -4.2035e-01, -2.8489e-02, -1.9343e-01,  2.2537e-01,
          4.4338e-01,  1.6404e-01, -1.0960e-01, -3.0428e-01,  1.3846e-02,
         -4.1353e-02, -2.5904e-01,  1.9212e-01,  3.7485e-01,  4.3974e-01,
         -5.9549