# Bag of Words

Bag of Words is a technique in which the count (frequency) of each word in a document (here, a sentence) is used as a feature. The feature matrix is built from the vocabulary: one column per unique word and one row per document.

In [1]:
import pandas as pd

df = pd.DataFrame({
    "text": ["people watch campusx", "campusx watch campusx", "people write comment", "campusx write comment"]
})

In [2]:
df

Unnamed: 0,text
0,people watch campusx
1,campusx watch campusx
2,people write comment
3,campusx write comment


Now let's find the corpus and the vocabulary of this dataset.

In [3]:
corpus = " ".join([ sentence for sentence in df.text])
corpus

'people watch campusx campusx watch campusx people write comment campusx write comment'

In [4]:
vocabulary = list(set(corpus.split()))
vocabulary

['campusx', 'watch', 'comment', 'people', 'write']

Bag of Words creates a feature matrix based on this vocabulary. Each document gets a vector with one entry per vocabulary word, holding that word's frequency in the document. This vector is the document's numerical representation.

In [5]:
feature_matrix: list[list[int]] = []
for sentence in df.text:
  tokens = sentence.split()  # count whole tokens, not substrings
  vector = []
  for word in vocabulary:
    vector.append(tokens.count(word))
  print(sentence, " -> ", vector)
  feature_matrix.append(vector)

people watch campusx  ->  [1, 1, 0, 1, 0]
campusx watch campusx  ->  [2, 1, 0, 0, 0]
people write comment  ->  [0, 0, 1, 1, 1]
campusx write comment  ->  [1, 0, 1, 0, 1]


In [6]:
feature_matrix

[[1, 1, 0, 1, 0], [2, 1, 0, 0, 0], [0, 0, 1, 1, 1], [1, 0, 1, 0, 1]]
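
The same matrix can also be built with `collections.Counter`, which counts whole tokens. This is a minimal sketch rather than part of the original notebook: the names `docs`, `vocab`, and `bow_matrix` are illustrative, and the vocabulary is sorted so that the column order is reproducible (a plain `set` has no guaranteed order).

```python
from collections import Counter

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Sorted vocabulary gives a stable column order across runs.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())  # token -> frequency in this document
    bow_matrix.append([counts[word] for word in vocab])

print(vocab)
for doc, row in zip(docs, bow_matrix):
    print(doc, "->", row)
```

Because the columns follow alphabetical order here, the rows will generally not match the printed vectors above, whose order came from a `set`.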

# Sklearn-Based BoW (CountVectorizer)

Since every text vectorization technique converts textual data into numerical features, these techniques are collectively called feature extraction. Accordingly, scikit-learn provides them in the **sklearn.feature_extraction.text** module.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

feature_matrix = vectorizer.fit_transform(df.text)  # fit learns the vocabulary; transform returns the count matrix in the same call
feature_matrix

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 11 stored elements and shape (4, 5)>

In [8]:
# It is a sparse matrix; convert it to a dense array with .toarray()
feature_matrix = feature_matrix.toarray()
feature_matrix

array([[1, 0, 1, 1, 0],
       [2, 0, 0, 1, 0],
       [0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1]])

At first glance this matrix looks different from ours. Let's examine the vocabulary, which the fitted vectorizer stores in the `vocabulary_` attribute.

In [23]:
vocabulary2 = vectorizer.vocabulary_
vocabulary2

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

Let's convert it to a readable order (each value in the dict is the column index of that word in the vector).

In [24]:
vocabulary2 = sorted(vocabulary2.items(), key=lambda x:x[1]) # Sort based on values of dict

vocabulary2

[('campusx', 0), ('comment', 1), ('people', 2), ('watch', 3), ('write', 4)]

In [25]:
vocabulary2 = [x[0] for x in vocabulary2]
vocabulary2

['campusx', 'comment', 'people', 'watch', 'write']

In [10]:
# Compare with our prev vocabulary
vocabulary

['campusx', 'watch', 'comment', 'people', 'write']

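The order is different, not reversed: CountVectorizer indexes its vocabulary alphabetically, while our `set`-based vocabulary has an arbitrary order. That is fine, because BoW does not depend on **the order of words in the vocabulary**, as long as every document is vectorized with the same order.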
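
To confirm that the two matrices hold the same counts and differ only in column order, the manual matrix can be re-indexed to alphabetical order. The sketch below hard-codes the vocabulary order and matrix printed earlier; `column_order` and `reordered` are illustrative names.

```python
# Vocabulary order and matrix from the manual implementation above.
manual_vocab = ["campusx", "watch", "comment", "people", "write"]
manual_matrix = [
    [1, 1, 0, 1, 0],
    [2, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
]

# Alphabetical order, as reported by CountVectorizer.
sorted_vocab = sorted(manual_vocab)

# For each target column, look up where that word sits in the manual order.
column_order = [manual_vocab.index(word) for word in sorted_vocab]
reordered = [[row[i] for i in column_order] for row in manual_matrix]

print(sorted_vocab)
print(reordered)
```

The reordered rows should match the array returned by `.toarray()` above.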

In [11]:
# We can find vectors of new sentences also.

vectorizer.transform(["campusx watch campusx write comment"]).toarray() # Cross verify it using mapping dictionary of vocabulary_

array([[2, 1, 0, 1, 1]])

In [26]:
vocabulary2

['campusx', 'comment', 'people', 'watch', 'write']

In [29]:
# What if a sentence includes new "out-of-vocabulary" words?
vectorizer.transform(["campusx not comment and people still watch "]).toarray()

array([[1, 1, 1, 1, 0]])

As we can see, the frequencies of words inside the vocabulary are correctly recorded. But out of vocabulary words are completely ignored.
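
The behaviour can be imitated in a few lines of plain Python. This is only a sketch of the idea, assuming whitespace tokenization (CountVectorizer's own tokenizer is more involved); `bow_transform` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

# Fixed vocabulary learned earlier, in column order.
vocab = ["campusx", "comment", "people", "watch", "write"]

def bow_transform(sentence, vocab):
    """Count in-vocabulary tokens; return (vector, ignored tokens)."""
    counts = Counter(sentence.split())
    vector = [counts[word] for word in vocab]
    ignored = [token for token in counts if token not in vocab]
    return vector, ignored

vector, ignored = bow_transform("campusx not comment and people still watch", vocab)
print(vector)   # counts for the known words only
print(ignored)  # tokens that have no column in the matrix
```

The ignored tokens simply have no column to land in, so they leave no trace in the vector.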

## Advantages and Disadvantages of Bag of Words (BoW)

### Advantages:

*   **Simplicity:** BoW is a straightforward and easy-to-understand model for text representation.
*   **Fixed-size vectors:** It transforms variable-length text into fixed-size numerical vectors, making it suitable for machine learning algorithms.
*   **Computational efficiency:** The process of creating BoW vectors is relatively fast.

### Disadvantages:

*   **Lack of semantic meaning:** BoW ignores the order of words and their context, losing semantic relationships and meaning (e.g., "good" and "not good" might have similar vector representations if the word "good" appears).
*   **Out of vocabulary words:** It cannot handle words that were not present in the vocabulary during training, ignoring them completely.
*   **Sparsity:** For large vocabularies, the feature matrix can be very sparse (mostly zeros), which can be inefficient for storage and computation.
*   **Loss of word order:** Because word order is discarded, sentences with different meanings but the same words get identical representations.
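
The word-order drawback is easy to demonstrate. In the sketch below, `bow_vector` is an illustrative helper using whitespace tokenization, and the second sentence is made up for the demonstration.

```python
from collections import Counter

def bow_vector(sentence, vocab):
    """Return the word counts of a sentence in vocabulary order."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

vocab = ["campusx", "people", "watch"]

# Same words, different order (and different meaning).
a = bow_vector("people watch campusx", vocab)
b = bow_vector("campusx watch people", vocab)

print(a, b, a == b)
```

Both sentences map to the same vector, so a model trained on these features cannot tell them apart.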

# N-Grams


N-grams are contiguous sequences of n items from a given sample of text or speech. Unlike Bag of Words, which treats each word independently, N-grams consider the sequence of words. This helps to capture some of the local context and word order, addressing one of the major drawbacks of BoW.

For example, a bigram (n=2) considers pairs of adjacent words, while a trigram (n=3) considers sequences of three adjacent words.

Let's look at an example using bigrams with the sentence "people watch campusx".

In [33]:
df["text"]

Unnamed: 0,text
0,people watch campusx
1,campusx watch campusx
2,people write comment
3,campusx write comment


Note that each bigram must be formed within a single document. Do not start at the end of one document and finish with the first word of the next.
Example: 1st bigram: people watch.
2nd bigram: watch campusx.
Do not take "campusx campusx" as the 3rd bigram, because it would span documents 1 and 2.
Instead, move to the 2nd document. 3rd bigram: campusx watch.
The next bigram, watch campusx, is already in the vocabulary. Like words in a BoW vocabulary, n-grams are unique, so skip it. Again, do not form "campusx people" across documents; move to the 3rd document.
4th bigram: people write, and so on.


When forming N-grams across multiple documents, each document is processed independently. N-grams are generated from contiguous sequences within a *single* document. We do not form N-grams by combining words from the end of one document and the beginning of the next. The vocabulary of N-grams is built from all unique N-grams found across all documents in the corpus.
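
The procedure can be sketched in plain Python as follows. The helper `ngrams` and the list `bigram_vocab` are illustrative names, and whitespace tokenization is assumed.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of one tokenized document."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Process each document on its own, keeping only unseen bigrams.
bigram_vocab = []
for doc in docs:
    for gram in ngrams(doc.split(), 2):
        if gram not in bigram_vocab:
            bigram_vocab.append(gram)

print(bigram_vocab)
```

Because every document is split separately, a cross-document pair such as "campusx campusx" never enters the vocabulary.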

**Implementation of N-grams**

Just pass the `ngram_range` parameter to CountVectorizer. For bigrams use ngram_range=(2, 2); for trigrams use ngram_range=(3, 3), and so on.

We can combine unigrams, bigrams, trigrams, etc. into a single, richer vocabulary. In ngram_range=(a, b), a is the smallest n to include and b is the largest. For example, with ngram_range=(1, 2) the vocabulary contains both unigrams and bigrams; with (2, 3) it contains bigrams and trigrams; with (1, 3) it contains everything from unigrams up to trigrams.

Hence unigrams alone, ngram_range=(1, 1), are nothing but the Bag of Words we have already seen.
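
To make the meaning of (a, b) concrete, here is a rough pure-Python imitation. `ngram_vocabulary` is a hypothetical helper, not scikit-learn code; it assumes whitespace tokenization and assigns indices in alphabetical order.

```python
def ngram_vocabulary(docs, ngram_range):
    """Build a sorted n-gram -> index mapping for every n in [low, high]."""
    low, high = ngram_range
    grams = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(low, high + 1):
            for i in range(len(tokens) - n + 1):
                grams.add(" ".join(tokens[i:i + n]))
    return {gram: index for index, gram in enumerate(sorted(grams))}

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Vocabulary size for a few ranges.
sizes = {r: len(ngram_vocabulary(docs, r)) for r in [(1, 1), (2, 2), (1, 2), (1, 3)]}
print(sizes)

vocab_1_3 = ngram_vocabulary(docs, (1, 3))
print(vocab_1_3)
```

Widening the range grows the vocabulary quickly, which is the price paid for capturing more local context.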

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

# Using bigrams (n=2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

bigram_matrix = bigram_vectorizer.fit_transform(df['text'])

print("Bigram vocabulary:\n", bigram_vectorizer.vocabulary_)
print("Bigram matrix:\n", bigram_matrix.toarray())

Bigram vocabulary:
 {'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}
Bigram matrix:
 [[0 0 1 0 1 0]
 [1 0 0 0 1 0]
 [0 0 0 1 0 1]
 [0 1 0 0 0 1]]


In [37]:
# Let's see mixed n-grams

vec = CountVectorizer(ngram_range=(1,3))

vec.fit_transform(df.text)
vec.vocabulary_

{'people': 6,
 'watch': 11,
 'campusx': 0,
 'people watch': 7,
 'watch campusx': 12,
 'people watch campusx': 8,
 'campusx watch': 1,
 'campusx watch campusx': 2,
 'write': 13,
 'comment': 5,
 'people write': 9,
 'write comment': 14,
 'people write comment': 10,
 'campusx write': 3,
 'campusx write comment': 4}

# TF-IDF Vectorization


TF-IDF stands for **Term Frequency-Inverse Document Frequency**. It is a numerical statistic that reflects how important a word is to a document in a collection or corpus.

The main idea behind TF-IDF is to give higher scores to words that are frequent in a specific document but are rare across all documents in the entire corpus. This helps to filter out common words (like "the", "a", "is") that appear frequently in many documents and are therefore less informative, while highlighting words that are more specific and relevant to a particular document.

The purpose of using TF-IDF for text vectorization is to transform text data into a numerical representation that better captures the importance and relevance of words within a document relative to the entire corpus. This numerical representation can then be used as features for various machine learning tasks such as text classification, clustering, and information retrieval.



### Term Frequency (TF)

Term Frequency (TF) measures how frequently a term appears in a document. A higher Term Frequency for a word in a document indicates that the word is more relevant to that document.

The formula for Term Frequency of a term ($t$) in a document ($d$) is:

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

Let's take the first document from our DataFrame: "people watch campusx".

To calculate the TF for the word "people" in this document:

Number of times "people" appears in "people watch campusx" = 1
Total number of terms in "people watch campusx" = 3 (people, watch, campusx)

$$
\text{TF}(\text{"people"}, \text{"people watch campusx"}) = \frac{1}{3} \approx 0.333
$$

To calculate the TF for the word "watch" in this document:

Number of times "watch" appears in "people watch campusx" = 1
Total number of terms in "people watch campusx" = 3

$$
\text{TF}(\text{"watch"}, \text{"people watch campusx"}) = \frac{1}{3} \approx 0.333
$$

To calculate the TF for the word "campusx" in this document:

Number of times "campusx" appears in "people watch campusx" = 1
Total number of terms in "people watch campusx" = 3

$$
\text{TF}(\text{"campusx"}, \text{"people watch campusx"}) = \frac{1}{3} \approx 0.333
$$
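
The formula translates directly into code. This is a small sketch assuming whitespace tokenization; `term_frequency` is an illustrative name.

```python
def term_frequency(term, document):
    """TF = occurrences of term / total number of terms in the document."""
    tokens = document.split()
    return tokens.count(term) / len(tokens)

doc = "people watch campusx"
for term in doc.split():
    print(term, round(term_frequency(term, doc), 3))

# A repeated word gets a higher TF: "campusx" is 2 of the 3 tokens here.
tf_repeated = term_frequency("campusx", "campusx watch campusx")
print(round(tf_repeated, 3))
```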


### Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) measures how important a term is across the whole corpus. It helps to downweight terms that appear frequently in many documents and are thus less informative, while giving more weight to terms that are rare and potentially more significant.

The formula for Inverse Document Frequency of a term ($t$) in a corpus ($D$), using the natural logarithm, is:

$$
\text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in the corpus}}{\text{Number of documents containing term } t}\right)
$$

Let's calculate the IDF for a few terms using our `df` DataFrame, which has 4 documents:

Total number of documents in the corpus = 4

**Term: "people"**

Number of documents containing "people": Document 0 ("people watch campusx") and Document 2 ("people write comment"). So, 2 documents contain "people".

$$
\text{IDF}(\text{"people"}, D) = \log\left(\frac{4}{2}\right) = \log(2) \approx 0.693
$$

**Term: "campusx"**

Number of documents containing "campusx": Document 0 ("people watch campusx"), Document 1 ("campusx watch campusx"), and Document 3 ("campusx write comment"). So, 3 documents contain "campusx".

$$
\text{IDF}(\text{"campusx"}, D) = \log\left(\frac{4}{3}\right) = \log(1.333...) \approx 0.288
$$

**Term: "comment"**

Number of documents containing "comment": Document 2 ("people write comment") and Document 3 ("campusx write comment"). So, 2 documents contain "comment".

$$
\text{IDF}(\text{"comment"}, D) = \log\left(\frac{4}{2}\right) = \log(2) \approx 0.693
$$

Notice that "campusx", which appears in more documents (3 out of 4), has a lower IDF value compared to "people" and "comment" (both appearing in 2 out of 4 documents). This demonstrates how IDF downweights terms that are common across the corpus.
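
These values can be checked with a short sketch. It assumes whitespace tokenization and the natural logarithm; `inverse_document_frequency` is an illustrative name, and the function would fail with a division by zero for a term that appears in no document.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

def inverse_document_frequency(term, docs):
    """IDF = ln(number of documents / number of documents containing term)."""
    containing = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / containing)

for term in ["people", "campusx", "comment"]:
    print(term, round(inverse_document_frequency(term, docs), 3))
```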


### TF-IDF Calculation

TF-IDF is calculated by multiplying the Term Frequency (TF) and the Inverse Document Frequency (IDF) for a given term in a specific document.

The formula for TF-IDF of a term ($t$) in a document ($d$) within a corpus ($D$) is:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Let's calculate the TF-IDF for the term "people" in the first document ("people watch campusx") using the TF and IDF values we calculated previously:

From the TF explanation:
$$
\text{TF}(\text{"people"}, \text{"people watch campusx"}) = \frac{1}{3} \approx 0.333
$$

From the IDF explanation:
$$
\text{IDF}(\text{"people"}, D) = \log\left(\frac{4}{2}\right) = \log(2) \approx 0.693
$$

Now, we multiply these two values to get the TF-IDF score:

$$
\text{TF-IDF}(\text{"people"}, \text{"people watch campusx"}, D) = \text{TF}(\text{"people"}, \text{"people watch campusx"}) \times \text{IDF}(\text{"people"}, D)
$$

$$
\text{TF-IDF}(\text{"people"}, \text{"people watch campusx"}, D) \approx 0.333 \times 0.693 \approx 0.231
$$

This TF-IDF score of approximately 0.231 for the term "people" in the first document indicates its importance relative to the other terms in that document and across the entire corpus. A higher TF-IDF score suggests that the term is more unique and relevant to that specific document within the context of the whole collection of documents.
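
The same multiplication can be applied to every term and document. The sketch below uses the textbook definitions given here (TF as a proportion, unsmoothed natural-log IDF); the helper names `tf` and `idf` and the alphabetical column order are assumptions made for illustration.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
vocab = sorted({word for doc in docs for word in doc.split()})

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    tokens = doc.split()
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Natural log of (documents in corpus / documents containing term)."""
    containing = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / containing)

# One row per document, one TF-IDF score per vocabulary word.
tfidf = [[round(tf(t, d) * idf(t, docs), 3) for t in vocab] for d in docs]

print(vocab)
for doc, row in zip(docs, tfidf):
    print(doc, "->", row)
```

In the first document, "campusx" ends up with a lower score than "people" and "watch" even though all three have the same TF, because it appears in three of the four documents.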


**Implementation of TF-IDF**

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

print("TF-IDF matrix:\n", tfidf_matrix.toarray())

TF-IDF matrix:
 [[0.49681612 0.         0.61366674 0.61366674 0.        ]
 [0.8508161  0.         0.         0.52546357 0.        ]
 [0.         0.57735027 0.57735027 0.         0.57735027]
 [0.49681612 0.61366674 0.         0.         0.61366674]]


In [46]:
tfidf_vectorizer.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

In [47]:
tfidf_vectorizer.idf_

array([1.22314355, 1.51082562, 1.51082562, 1.51082562, 1.51082562])
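
These numbers differ from the hand calculation (0.231 for "people" in the first document). The likely reason is that TfidfVectorizer, with its documented defaults, uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reimplements those three steps in plain Python under that assumption; `smooth_idf` and `tfidf_row` are illustrative names, not scikit-learn functions.

```python
import math

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n_docs) / (1 + df)) + 1

idf = [smooth_idf(term) for term in vocab]

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    tokens = doc.split()
    weights = [tokens.count(term) * w for term, w in zip(vocab, idf)]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

matrix = [tfidf_row(doc) for doc in docs]

print([round(x, 8) for x in idf])
for row in matrix:
    print([round(x, 8) for x in row])
```

If the assumption about the defaults holds, the printed values should agree with `idf_` and the TF-IDF matrix shown above.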

We can get the vocabulary in index order (starting from index 0), unlike the unordered dicts we have seen before, using get_feature_names_out().

In [49]:
tfidf_vectorizer.get_feature_names_out()

array(['campusx', 'comment', 'people', 'watch', 'write'], dtype=object)