# Module 3 Screencasts

## From Text to Bag-of-Words – Your First Text Vectorizer

### Step 1: Setting Up Your Environment
Let's begin by importing `CountVectorizer` and defining some sample text data to vectorize.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "Time flies like an arrow.",
    "Fruit flies like a banana.",
    "The quick brown fox jumps over the lazy dog."
]


### Step 2: Initializing the CountVectorizer
Next, let's initialize `CountVectorizer` and fit-transform the data.

In [None]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

### Step 3: Exploring the Resulting Vectors
Now that the data is ready, let's look at the output of the vectorization process.

In [None]:
# Show the learned vocabulary and the document-term count matrix (bag-of-words)
print("Feature Names:", vectorizer.get_feature_names_out())
print("Vectorized Representation:\n", X.toarray())


Feature Names: ['an' 'arrow' 'banana' 'brown' 'dog' 'flies' 'fox' 'fruit' 'jumps' 'lazy'
 'like' 'over' 'quick' 'the' 'time']
Vectorized Representation:
 [[1 1 0 0 0 1 0 0 0 0 1 0 0 0 1]
 [0 0 1 0 0 1 0 1 0 0 1 0 0 0 0]
 [0 0 0 1 1 0 1 0 1 1 0 1 1 2 0]]
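
To see where these counts come from, the matrix above can be rebuilt by hand. This is a minimal sketch, assuming `CountVectorizer`'s default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters, which is why the one-letter word "a" does not appear among the features). The names `tokenize`, `token_counts`, and `matrix` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

documents = [
    "Time flies like an arrow.",
    "Fruit flies like a banana.",
    "The quick brown fox jumps over the lazy dog."
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (assumed to mirror CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of token frequencies per document
token_counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the alphabetically sorted set of all tokens seen
vocabulary = sorted(set().union(*token_counts))

# One row per document, one column per vocabulary entry
matrix = [[counts[term] for term in vocabulary] for counts in token_counts]

print("Vocabulary:", vocabulary)
for row in matrix:
    print(row)
```

Each row lines up with one row of `X.toarray()`; the only 2 is in the third row, because "the" occurs twice in that sentence.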


## M2-L2-SC2: TF-IDF Vectorization

### Step 1: Setting Up Your Working Environment
Let's begin by importing the necessary library and preparing sample text data.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text documents
documents = [
    "Time flies like an arrow.",
    "Fruit flies like a banana.",
    "The quick brown fox jumps over the lazy dog."
]


### Step 2: Initializing the TF-IDF Vectorizer
Next, initialize `TfidfVectorizer` and apply it to the text data.

In [None]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)


### Step 3: Understanding the Output
Now, examine the resulting vectors and their corresponding feature names.

In [None]:
# Display the feature names and TF-IDF results
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_array = tfidf_matrix.toarray()

print("TF-IDF Feature Names:", feature_names)
print("TF-IDF Matrix:\n", tfidf_array)


TF-IDF Feature Names: ['an' 'arrow' 'banana' 'brown' 'dog' 'flies' 'fox' 'fruit' 'jumps' 'lazy'
 'like' 'over' 'quick' 'the' 'time']
TF-IDF Matrix:
 [[0.49047908 0.49047908 0.         0.         0.         0.37302199
  0.         0.         0.         0.         0.37302199 0.
  0.         0.         0.49047908]
 [0.         0.         0.5628291  0.         0.         0.42804604
  0.         0.5628291  0.         0.         0.42804604 0.
  0.         0.         0.        ]
 [0.         0.         0.         0.30151134 0.30151134 0.
  0.30151134 0.         0.30151134 0.30151134 0.         0.30151134
  0.30151134 0.60302269 0.        ]]
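
The weights above can be reproduced by hand, which shows what `TfidfVectorizer` does with the raw counts. This is a minimal sketch, assuming the vectorizer's default settings: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (`tokenize`, `doc_freq`, `idf`, `tfidf`) are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "Time flies like an arrow.",
    "Fruit flies like a banana.",
    "The quick brown fox jumps over the lazy dog."
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror the vectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in documents]
vocabulary = sorted(set().union(*counts))
n_docs = len(documents)

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in c for c in counts) for term in vocabulary}

# Smoothed IDF: terms found in fewer documents get a larger weight
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in vocabulary}

tfidf = []
for c in counts:
    row = [c[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in row))  # L2 norm of the row
    tfidf.append([value / norm for value in row])

for row in tfidf:
    print([round(value, 8) for value in row])
```

In the first two rows, "flies" and "like" score lower than the other words because they occur in two of the three documents, so their IDF is smaller.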


## M2-L2-SC3: Extracting Token Embeddings with Hugging Face Transformers

### Step 1: Setting Up Your Environment
Import necessary libraries and set up the Hugging Face Transformers environment.

In [None]:
!pip install transformers
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')




### Step 2: Tokenizing the Text
In the next step, we will tokenize a sample sentence using the BERT tokenizer.

In [None]:
# Sample sentence
sentence = "Transformers are great for tackling sequence tasks."

# Tokenize the sentence; input_ids holds vocabulary IDs, not token strings
inputs = tokenizer(sentence, return_tensors='pt')
print("Tokens:", inputs['input_ids'])


Tokens: tensor([[  101, 19081,  2024,  2307,  2005, 26997,  2989,  5537,  8518,  1012,
           102]])
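
The output contains 11 IDs for a seven-word sentence because the tokenizer adds special start and end tokens, treats punctuation as its own token, and can split a word into several subword pieces. BERT's tokenizer uses WordPiece, which splits each word greedily by longest match against its vocabulary. The sketch below illustrates that idea on a toy scale. The vocabulary `toy_vocab` and the helpers `wordpiece` and `encode` are invented for illustration and are not the real BERT vocabulary or the Transformers API.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab):
    # Wrap the pieces in the special start and end tokens
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Hypothetical toy vocabulary, not BERT's actual vocabulary
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "play", "##ing", "##ed", "the", "game"}

tokens = encode("Playing the game", toy_vocab)
print(tokens)
```

In the real notebook, `tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])` turns the IDs back into token strings, so the actual split can be inspected.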


### Step 3: Extracting Token Embeddings

Next, let's use the BERT model to extract embeddings from the tokenized input.

In [None]:
# Extract embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Get token embeddings
token_embeddings = outputs.last_hidden_state
print("Token Embeddings Shape:", token_embeddings.shape)


Token Embeddings Shape: torch.Size([1, 11, 768])


## M2-L2-SC4: Sentence-Level Embeddings and Similarity Scoring

### Step 1: Setting Up Your Environment
Import essential libraries and prepare the Hugging Face Transformers environment.

In [None]:
!pip install transformers
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


### Step 2: Preparing Sentences and Tokenization
Tokenize two sample sentences using the BERT tokenizer.

In [None]:
# Sample sentences
sentence_1 = "The painting evokes a sense of nostalgia and wonder."
sentence_2 = "This artwork reminds viewers of cherished memories."

# Tokenize sentences
inputs_1 = tokenizer(sentence_1, return_tensors='pt', padding=True)
inputs_2 = tokenizer(sentence_2, return_tensors='pt', padding=True)


### Step 3: Extracting Sentence-Level Embeddings
Use the BERT model to extract token embeddings, then average them over the token dimension to obtain one vector per sentence.

In [None]:
# Extract token-level embeddings for each sentence
with torch.no_grad():
    outputs_1 = model(**inputs_1)
    outputs_2 = model(**inputs_2)

# Mean-pool over the token dimension (dim=1) to get one vector per sentence
sent_embedding_1 = outputs_1.last_hidden_state.mean(dim=1).squeeze()
sent_embedding_2 = outputs_2.last_hidden_state.mean(dim=1).squeeze()
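
A plain `.mean(dim=1)` is fine here because each sentence is encoded on its own, so no padding tokens are present. If several sentences of different lengths were batched together with `padding=True`, the padded positions would be averaged in as well. A common remedy is to weight the average by the attention mask. The sketch below shows that idea on plain Python lists. The function `masked_mean_pool` and the tiny 2-dimensional vectors are illustrative assumptions, not part of the notebook's code.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]

# Two real tokens followed by one padding position (toy 2-d vectors)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)

# Naive mean over all positions, padding included
naive = [sum(column) / len(token_vectors) for column in zip(*token_vectors)]

print("Masked mean:", pooled)
print("Naive mean: ", naive)
```

The masked mean uses only the two real tokens, while the naive mean is pulled towards zero by the padding vector.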


### Step 4: Calculating Similarity Between Sentences
Compute cosine similarity between the two sentence embeddings.

In [None]:
# Compute cosine similarity
cosine_similarity = torch.nn.functional.cosine_similarity(sent_embedding_1.unsqueeze(0), sent_embedding_2.unsqueeze(0))
print(f"Cosine Similarity: {cosine_similarity.item():.4f}")


Cosine Similarity: 0.8065
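
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitudes. The sketch below computes it by hand on small toy vectors. The function name `cosine_sim` and the example vectors are illustrative, not taken from the notebook.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])                 # opposite directions

print(f"Same direction: {same_direction:.4f}")
print(f"Orthogonal:     {orthogonal:.4f}")
print(f"Opposite:       {opposite:.4f}")
```

Scores range from -1 to 1. The notebook's value of 0.8065 means the two sentence vectors point in broadly similar directions.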


### Step 5: Exploring Further Enhancements
Let's discuss possible future enhancements.

**Future Enhancements:**
- Experiment with different models for richer embeddings.
- Integrate similarity scores into applications like sentiment analysis.