
In [1]:
import pandas as pd

In [11]:
dic = [["people watch campusx", 1],
       ["campusx watch campusx", 1],
       ["people write comment", 0], 
       ["campusx write comment", 0]]

In [12]:
df = pd.DataFrame(dic, columns=["text", "output"])

In [13]:
df

,text,output
0,people watch campusx,1
1,campusx watch campusx,1
2,people write comment,0
3,campusx write comment,0


# Bag of Words

## Simple unigram

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [17]:
bow = cv.fit_transform(df.text)

In [18]:
# learned vocabulary: token -> column index
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [19]:
print(bow[0].toarray())
print(bow[1].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]
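To see where these rows come from, here is a minimal sketch that rebuilds the same counts by hand. It assumes plain whitespace splitting is enough for this toy corpus (the default `CountVectorizer` tokenizer also lowercases and drops single-character tokens). The names `texts`, `vocab` and `count_vector` are illustrative helpers, not part of scikit-learn.

```python
from collections import Counter

texts = ["people watch campusx",
         "campusx watch campusx",
         "people write comment",
         "campusx write comment"]

# Vocabulary: sorted unique tokens, so column i corresponds to vocab[i]
vocab = sorted({tok for t in texts for tok in t.split()})

def count_vector(text):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

print(vocab)
print(count_vector(texts[0]))
print(count_vector(texts[1]))
```

The two printed vectors should agree with `bow[0]` and `bow[1]` above: `campusx` appears twice in the second sentence, so its column holds 2.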


## Bigram

In [20]:
cv2 = CountVectorizer(ngram_range=(2, 2))

In [22]:
bow2 = cv2.fit_transform(df.text)

In [23]:
cv2.vocabulary_

{'campusx watch': 0,
 'campusx write': 1,
 'people watch': 2,
 'people write': 3,
 'watch campusx': 4,
 'write comment': 5}

In [37]:
print(bow2[0].toarray())
print(bow2[1].toarray())

[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
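A bigram is just a pair of adjacent words, so the vocabulary above can be reproduced with a sliding window. This is a rough sketch assuming whitespace tokenization; `ngrams`, `bigram_vocab` and `bigram_vector` are hypothetical helper names used only for illustration.

```python
texts = ["people watch campusx",
         "campusx watch campusx",
         "people write comment",
         "campusx write comment"]

def ngrams(text, n):
    """Return all runs of n adjacent words in a sentence."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Sorted unique bigrams across the corpus, one column per bigram
bigram_vocab = sorted({g for t in texts for g in ngrams(t, 2)})

def bigram_vector(text):
    """Count each vocabulary bigram in one sentence."""
    grams = ngrams(text, 2)
    return [grams.count(g) for g in bigram_vocab]

print(bigram_vocab)
print(bigram_vector(texts[0]))
print(bigram_vector(texts[1]))
```

Each three-word sentence yields exactly two bigrams, which is why every row of `bow2` has two non-zero entries.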


## Unigrams and bigrams

In [25]:
cv3 = CountVectorizer(ngram_range=(1, 2))

In [26]:
bow3 = cv3.fit_transform(df.text)

In [35]:
bow3.toarray()

array([[1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
       [2, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
       [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1]])

In [27]:
cv3.vocabulary_

{'campusx': 0,
 'campusx watch': 1,
 'campusx write': 2,
 'comment': 3,
 'people': 4,
 'people watch': 5,
 'people write': 6,
 'watch': 7,
 'watch campusx': 8,
 'write': 9,
 'write comment': 10}

In [36]:
print(bow3[0].toarray())
print(bow3[1].toarray())

[[1 0 0 0 1 1 0 1 1 0 0]]
[[2 1 0 0 0 0 0 1 1 0 0]]


## Quadgram

In [29]:
cv4 = CountVectorizer(ngram_range=(4, 4))

In [30]:
cv4.fit_transform(df.text)

ValueError: ignored

This raises a `ValueError` because the longest sentence has only 3 words, so no 4-grams can be formed and the vocabulary ends up empty.
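The failure can be checked without scikit-learn by counting how many distinct n-grams the corpus contains for each n. The sketch below assumes whitespace tokenization; `ngrams` and `vocab_sizes` are illustrative names, not library functions.

```python
texts = ["people watch campusx",
         "campusx watch campusx",
         "people write comment",
         "campusx write comment"]

def ngrams(text, n):
    """Return all runs of n adjacent words; empty if the sentence is shorter than n."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Number of distinct n-grams in the whole corpus for n = 1..4
vocab_sizes = {n: len({g for t in texts for g in ngrams(t, n)}) for n in range(1, 5)}
print(vocab_sizes)
```

A sentence with k words has at most k - n + 1 n-grams, so with k = 3 and n = 4 there is nothing left to count, and the vectorizer has no features to build.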

# Tf-Idf

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [32]:
TFIDF = tfidf.fit_transform(df.text).toarray()

In [33]:
TFIDF

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [34]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']


The formula used to compute the **tf-idf** for a term t of a document d in a document set is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$

and, if `smooth_idf=False`, the idf is computed as

$$\text{idf}(t) = \log\frac{n}{\text{df}(t)} + 1,$$

where **n** is the total number of documents in the document set and **df(t)** is the document frequency of **t**; the document frequency is the number of documents in the document set that contain the term t. **The effect of adding “1”** to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. Note that the idf formula above differs from the standard textbook notation, which defines the idf as

$$\text{idf}(t) = \log\frac{n}{\text{df}(t) + 1}.$$
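A quick numeric comparison of the two idf definitions, as a sketch only. It assumes the natural logarithm and n = 4 documents as in the toy corpus; `idf_plus_one` and `idf_textbook` are hypothetical names, and the df = 4 case is a made-up example, since no word in this corpus appears in every sentence.

```python
import math

n = 4  # number of documents in the toy corpus

def idf_plus_one(df_t):
    """idf(t) = log(n / df(t)) + 1, the smooth_idf=False form."""
    return math.log(n / df_t) + 1

def idf_textbook(df_t):
    """idf(t) = log(n / (df(t) + 1)), the textbook form quoted above."""
    return math.log(n / (df_t + 1))

# df = 4 is hypothetical: a term that occurs in every document
for df_t in (2, 3, 4):
    print(df_t, round(idf_plus_one(df_t), 4), round(idf_textbook(df_t), 4))
```

For a term present in all documents the "+1" form still gives a weight of 1, while the textbook form drops below zero.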

If `smooth_idf=True` (the default), the constant “1” is added to the numerator and denominator of the idf, as if an extra document had been seen containing every term in the collection exactly once, which prevents zero divisions:

$$\text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)} + 1.$$
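The numbers printed by `tfidf.idf_` and the tf-idf array can be reproduced by hand from the smoothed formula. The sketch below assumes whitespace tokenization, raw counts as tf, the natural logarithm, and the default `norm='l2'` of `TfidfVectorizer`, which scales every row to unit length after multiplying tf by idf. The helper names (`df_counts`, `tfidf_row`) are illustrative.

```python
import math

texts = ["people watch campusx",
         "campusx watch campusx",
         "people write comment",
         "campusx write comment"]

vocab = sorted({tok for t in texts for tok in t.split()})
n = len(texts)

# df(t): number of documents that contain the term t
df_counts = {w: sum(w in t.split() for t in texts) for w in vocab}

# Smoothed idf: log((1 + n) / (1 + df(t))) + 1
idf = [math.log((1 + n) / (1 + df_counts[w])) + 1 for w in vocab]

def tfidf_row(text):
    """tf * idf for one sentence, then L2-normalised (assumed default norm)."""
    tokens = text.split()
    raw = [tokens.count(w) * w_idf for w, w_idf in zip(vocab, idf)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

print([round(x, 8) for x in idf])
print([round(x, 8) for x in tfidf_row(texts[0])])
```

`campusx` occurs in 3 of the 4 sentences, so it gets the lowest idf; every other word occurs in 2 sentences and shares the same higher value.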
