<h2>Bag-of-Words (BoW) Model</h2>
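The Bag-of-Words model represents text as an unordered collection (a "bag") of words, ignoring grammar and word order. Each document becomes a vector with one dimension per unique word in the vocabulary, and the value in each dimension is the number of times that word occurs in the document. Note that CountVectorizer lowercases the text and drops single-character tokens by default, which is why "I" and "a" do not appear in the vocabulary below.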

In [1]:
from sklearn.feature_extraction.text import CountVectorizer


corpus = [
    "I booked a flight to Paris.",
    "Paris is known for its beautiful architecture.",
    "I visited the Eiffel Tower in Paris.",
    "Exploring the Louvre Museum was an amazing experience.",
]


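# Build the vocabulary and count word occurrences in each document
vectorizer = CountVectorizer()
bow_representation = vectorizer.fit_transform(corpus)

# Vocabulary terms, in the same order as the matrix columns
vocab = vectorizer.get_feature_names_out()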


print("Bag-of-Words representation:")
print(bow_representation.toarray())

# Print the vocabulary
print('****************************')
print("Vocabulary:")
print(vocab)


Bag-of-Words representation:
[[0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0]
 [0 0 1 1 0 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0]
 [1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1]]
****************************
Vocabulary:
['amazing' 'an' 'architecture' 'beautiful' 'booked' 'eiffel' 'experience'
 'exploring' 'flight' 'for' 'in' 'is' 'its' 'known' 'louvre' 'museum'
 'paris' 'the' 'to' 'tower' 'visited' 'was']
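
To make the counting step concrete, here is a minimal standard-library sketch of the same idea. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper names `tokenize` and `to_vector` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "I booked a flight to Paris.",
    "Paris is known for its beautiful architecture.",
    "I visited the Eiffel Tower in Paris.",
    "Exploring the Louvre Museum was an amazing experience.",
]

# Assumption: mimic CountVectorizer's default token pattern,
# which keeps only tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocab = sorted({token for tokens in tokenized for token in tokens})


def to_vector(tokens):
    """Count each vocabulary word in one document (0 if absent)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


bow = [to_vector(tokens) for tokens in tokenized]

# Show only the non-zero entries of each row, labelled by word.
for doc, row in zip(corpus, bow):
    present = {word: count for word, count in zip(vocab, row) if count}
    print(doc, "->", present)
```

Each row of `bow` lines up with the corresponding row of the matrix printed above, so the column positions can be read off against `vocab`.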


<h2>TF-IDF (Term Frequency-Inverse Document Frequency)</h2>
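TF-IDF measures how important a term is to a document relative to a collection of documents. It is the product of term frequency (TF) and inverse document frequency (IDF): TF measures how often a term occurs in a document, while IDF down-weights terms that occur in many documents across the corpus, since such terms do little to distinguish one document from another. In the output below, for example, "paris" appears in three of the four documents and so receives a lower weight than words that appear in only one.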





In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer


corpus = [
    "I booked a flight to Paris.",
    "Paris is known for its beautiful architecture.",
    "I visited the Eiffel Tower in Paris.",
    "Exploring the Louvre Museum was an amazing experience.",
]


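# Learn the vocabulary and IDF weights, then compute the TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_representation = vectorizer.fit_transform(corpus)

# Vocabulary terms, in the same order as the matrix columns
vocab = vectorizer.get_feature_names_out()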


print("TF-IDF representation:")
print(tfidf_representation.toarray())

print("********************")
print("Vocabulary:")
print(vocab)


TF-IDF representation:
[[0.         0.         0.         0.         0.5417361  0.
  0.         0.         0.5417361  0.         0.         0.
  0.         0.         0.         0.         0.34578314 0.
  0.5417361  0.         0.         0.        ]
 [0.         0.         0.39505606 0.39505606 0.         0.
  0.         0.         0.         0.39505606 0.         0.39505606
  0.39505606 0.39505606 0.         0.         0.25215917 0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.44592216
  0.         0.         0.         0.         0.44592216 0.
  0.         0.         0.         0.         0.28462634 0.35157015
  0.         0.44592216 0.44592216 0.        ]
 [0.36222393 0.36222393 0.         0.         0.         0.
  0.36222393 0.36222393 0.         0.         0.         0.
  0.         0.         0.36222393 0.36222393 0.         0.2855815
  0.         0.         0.         0.36222393]]
********************
Vocabulary:
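['amazing' 'an' 'architecture' 'beautiful' 'booked' 'eiffel' 'experience'
 'exploring' 'flight' 'for' 'in' 'is' 'its' 'known' 'louvre' 'museum'
 'paris' 'the' 'to' 'tower' 'visited' 'was']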
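
The weights above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts for TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names `tokenize`, `idf`, and `tfidf` are illustrative only.

```python
import math
import re
from collections import Counter

corpus = [
    "I booked a flight to Paris.",
    "Paris is known for its beautiful architecture.",
    "I visited the Eiffel Tower in Paris.",
    "Exploring the Louvre Museum was an amazing experience.",
]


def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


# Term counts per document (the TF part).
docs = [Counter(tokenize(doc)) for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for counts in docs for term in counts)


def idf(term):
    """Smoothed inverse document frequency (assumed default formula)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(counts):
    """TF * IDF for one document, scaled to unit Euclidean length."""
    weights = {term: tf * idf(term) for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first_doc = tfidf(docs[0])
for term in sorted(first_doc):
    print(f"{term:8s} {first_doc[term]:.7f}")
```

The values printed for "booked", "flight", "paris", and "to" should match the non-zero entries in the first row of the matrix above. "paris" scores lowest because its document frequency is 3, while the other three words each occur in a single document.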

<h2>Word Embeddings</h2>
Word embeddings represent words as dense vectors in a continuous vector space, where semantically similar words are mapped to nearby points. These embeddings capture the semantic relationships between words and can be used to derive meaning from the text.
Examples: Word2Vec, GloVe.

In [5]:
#!pip install gensim

In [10]:
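from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, run
# nltk.download("punkt") once (newer NLTK releases name the resource "punkt_tab").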

corpus = [
    "Cricket is a popular sport played with a bat and ball.",
    "The Ashes series is one of the most famous rivalries in cricket.",
    "Sachin Tendulkar is considered one of the greatest cricketers of all time.",
    "The World Cup is the pinnacle of international cricket tournaments.",
    "Playing a test match can last up to five days, with each team having two innings.",
    "In Twenty20 cricket, each team has a maximum of 20 overs to score as many runs as possible.",
    "The term 'hat-trick' refers to a bowler taking three wickets in three consecutive deliveries.",
    "Australia has won the ICC Cricket World Cup multiple times.",
    "The DRS system (Decision Review System) is used to review umpiring decisions in international cricket matches.",
    "Fielding positions in cricket include roles such as slip, gully, and cover.",
]


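# Lowercase and tokenize each sentence
tokenized_corpus = [word_tokenize(doc.lower()) for doc in corpus]

# Train a small Word2Vec model; min_count=1 keeps every word, even those seen only once
model = Word2Vec(tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Map each vocabulary word to its learned 100-dimensional vector
word_embeddings = {word: model.wv[word] for word in model.wv.index_to_key}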


print("Word Embeddings:")
#print(word_embeddings["cricket"])
#print(word_embeddings["drs"])  # uncomment these lines to see the output


Word Embeddings:
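
"Nearby" in an embedding space is commonly measured with cosine similarity. The sketch below illustrates the measure on made-up 3-dimensional vectors; the numbers in `toy_embeddings` are invented for illustration and are not output from the Word2Vec model above.

```python
import math

# Hypothetical toy vectors, chosen by hand so that "bat" and "ball"
# point in a similar direction and "review" points elsewhere.
toy_embeddings = {
    "bat": [0.9, 0.8, 0.1],
    "ball": [0.8, 0.9, 0.2],
    "review": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_near = cosine_similarity(toy_embeddings["bat"], toy_embeddings["ball"])
sim_far = cosine_similarity(toy_embeddings["bat"], toy_embeddings["review"])

print(f"bat vs ball:   {sim_near:.3f}")
print(f"bat vs review: {sim_far:.3f}")
```

With the trained gensim model, `model.wv.similarity("bat", "ball")` computes the same measure on the learned vectors, and `model.wv.most_similar("cricket")` lists the nearest words. On a ten-sentence corpus these similarities are likely to be noisy, so treat them as a demonstration of the API, not as meaningful semantics.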


<h2>Contextual Embeddings</h2>
Contextual embeddings capture the meaning of a word based on its context in a sentence or document. Unlike traditional word embeddings, contextual embeddings generate a unique representation for each occurrence of a word, taking into account its surrounding context.
Examples: BERT, GPT.

In [2]:
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentence = "I watched the thrilling cricket match between India and Australia."

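# Tokenize the input and return PyTorch tensors
inputs = tokenizer(sentence, return_tensors="pt")

# Run the model without tracking gradients, since this is inference only
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: shape (batch_size, sequence_length, hidden_size)
contextual_embeddings = outputs.last_hidden_state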

# Print contextual embeddings for some tokens (index 0 is the special [CLS] token)
print("Contextual Embeddings:")
#print(contextual_embeddings[0][4])  # embedding for the token 'thrilling'
#print(contextual_embeddings[0][5])  # embedding for the token 'cricket'
# Uncomment the print lines above to see the output

Contextual Embeddings:
