## NLP Tutorial - Text Representation: TF-IDF
What is TF-IDF?
- TF stands for Term Frequency: the number of times a particular word appears in a document divided by the total number of words in that document.

   Term Frequency(TF) = [number of times the word appears in a document / total number of words in the document]

- Term Frequency values range between 0 and 1. The more often a word occurs in a document, the closer its value gets to 1.

- IDF stands for Inverse Document Frequency: the log of the ratio of the total number of documents (data points) in the whole dataset to the number of documents that contain the particular word.

   Inverse Document Frequency(IDF) = [log(total number of documents / number of documents that contain the word)]

- If a word occurs in most documents, i.e. it is common across the whole corpus, the ratio approaches 1 and its IDF approaches 0. Rare words get a higher IDF.

Finally:

   TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)
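
The three formulas above can be sketched in a few lines of plain Python. This is only an illustration of the textbook definitions, not of what any library does internally; the toy corpus `docs` and the helper names are assumptions made for this example.

```python
import math

# Hypothetical toy corpus, already split into words (not the tutorial's data).
docs = [
    ["apple", "is", "announcing", "new", "iphone"],
    ["tesla", "is", "announcing", "new", "model"],
    ["thor", "is", "eating", "pizza", "pizza"],
]

def term_frequency(word, doc):
    # Occurrences of the word divided by the total number of words in the document.
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # log(total documents / documents containing the word).
    # Assumes the word occurs in at least one document.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

pizza_score = tf_idf("pizza", docs[2], docs)  # frequent in one document, rare in the corpus
is_score = tf_idf("is", docs[2], docs)        # present in every document

print("pizza:", pizza_score)
print("is:", is_score)
```

Because "is" appears in every document, its IDF is log(1) = 0, so its TF-IDF score is 0 no matter how often it occurs.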

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus=[
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    # NOTE: there is no comma after the next string, so Python concatenates it with
    # the one after it; the corpus has 7 documents and a 'grapessomething' token.
    "I am eating biryani and you are eating grapes"
    "something is amzing"
]

In [6]:
v=TfidfVectorizer()
transformed_output=v.fit_transform(corpus)
print(v.vocabulary_)

{'thor': 26, 'eating': 11, 'pizza': 23, 'loki': 18, 'is': 17, 'ironman': 16, 'ate': 8, 'already': 0, 'apple': 6, 'announcing': 5, 'new': 21, 'iphone': 15, 'tomorrow': 27, 'tesla': 25, 'model': 20, 'google': 13, 'pixel': 22, 'microsoft': 19, 'surface': 24, 'amazon': 2, 'eco': 12, 'dot': 10, 'am': 1, 'biryani': 9, 'and': 4, 'you': 28, 'are': 7, 'grapessomething': 14, 'amzing': 3}


In [7]:
all_feature_names=v.get_feature_names_out()

for word in all_feature_names:
    index=v.vocabulary_.get(word)
    print(f"{word} {v.idf_[index]}")

already 2.386294361119891
am 2.386294361119891
amazon 2.386294361119891
amzing 2.386294361119891
and 2.386294361119891
announcing 1.2876820724517808
apple 2.386294361119891
are 2.386294361119891
ate 2.386294361119891
biryani 2.386294361119891
dot 2.386294361119891
eating 1.9808292530117262
eco 2.386294361119891
google 2.386294361119891
grapessomething 2.386294361119891
iphone 2.386294361119891
ironman 2.386294361119891
is 1.0
loki 2.386294361119891
microsoft 2.386294361119891
model 2.386294361119891
new 1.2876820724517808
pixel 2.386294361119891
pizza 2.386294361119891
surface 2.386294361119891
tesla 2.386294361119891
thor 2.386294361119891
tomorrow 1.2876820724517808
you 2.386294361119891
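
These values do not follow the plain log(N / df) formula: `is` occurs in every document, yet its IDF is 1.0 rather than 0. With its default setting `smooth_idf=True`, scikit-learn's `TfidfVectorizer` computes idf = ln((1 + n) / (1 + df)) + 1. The sketch below reproduces a few of the printed values under that formula. It assumes the corpus counts as 7 documents, since the last two strings in the list are not separated by a comma and Python joins them; the document frequencies are counted by hand.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as computed by TfidfVectorizer with its default smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 7  # the 8 strings collapse to 7 documents because of the missing comma

# Number of documents containing each word, counted by hand from the corpus.
doc_freq = {"is": 7, "announcing": 5, "eating": 2, "pizza": 1}

idf = {word: smoothed_idf(n_docs, df) for word, df in doc_freq.items()}
for word, value in idf.items():
    print(f"{word} {value}")
```

The added 1 keeps words that occur in every document from being zeroed out entirely, which is why `is` still gets a small non-zero weight in the matrix.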


In [8]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [9]:
transformed_output.toarray()[:2]

array([[0.24302373, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.24302373, 0.        ,
        0.        , 0.40346113, 0.        , 0.        , 0.        ,
        0.        , 0.24302373, 0.10184147, 0.24302373, 0.        ,
        0.        , 0.        , 0.        , 0.72907118, 0.        ,
        0.        , 0.24302373, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.30902531, 0.57267658, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57267658, 0.        , 0.23998572, 0.        , 0.        ,
        0.        , 0.30902531, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.30902531, 0.        ]])
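
With default settings, each row of this matrix is the raw word count times the IDF, scaled so the row has unit Euclidean length (`norm='l2'`). Note that the term frequency here is the raw count, not the ratio from the definition at the top. A minimal sketch that rebuilds the non-zero entries of the first row by hand, assuming the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and 7 documents; the counts are tallied manually from the first sentence.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Default TfidfVectorizer IDF (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 7

# word: (count in the first document, number of documents containing the word)
first_doc = {
    "thor": (1, 1), "eating": (2, 2), "pizza": (3, 1), "loki": (1, 1),
    "is": (1, 7), "ironman": (1, 1), "ate": (1, 1), "already": (1, 1),
}

# Unnormalized weights: raw count times IDF.
raw = {w: count * smoothed_idf(n_docs, df) for w, (count, df) in first_doc.items()}

# L2 normalization: divide every weight by the Euclidean length of the row.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf_row = {w: v / norm for w, v in raw.items()}

for word in ("pizza", "eating", "thor", "is"):
    print(f"{word} {tfidf_row[word]:.8f}")
```

The results line up with columns 23 (`pizza`), 11 (`eating`), 26 (`thor`) and 17 (`is`) of the first row above.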

In [10]:
import pandas as pd

df=pd.read_csv("Ecommerce_data.csv")
print(df.shape)
df.sample(5)

(24000, 2)


Unnamed: 0,Text,label
6403,Charms Rudraksh American Diamond Gold Meena Om...,Household
19045,Digisol DG-KU1004 Mini USB KVM Switch with Aud...,Electronics
21216,NF&E Comfort Memory Foam Keyboard Wrist Rest S...,Electronics
4670,SD Enterprises Sonic Plastic Analogue Wall Clo...,Household
16846,Zacharias Unisex Wool Balaclava/Monkey Cap (Mu...,Clothing & Accessories


In [11]:
df.label.value_counts()

Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: label, dtype: int64

In [13]:
df['label_num']=df.label.map({
    "Household":0,
    "Books":1,
    "Electronics":2,
    "Clothing & Accessories":3
})
df.head()

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,3
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,3


In [15]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(df.Text,df.label_num,test_size=0.20,random_state=2022,
                                             stratify=df.label_num)

In [16]:
print("Shape of X_train: ",X_train.shape)
print("Shape of X_test: ",X_test.shape)

Shape of X_train:  (19200,)
Shape of X_test:  (4800,)


In [17]:
y_train.value_counts()

0    4800
2    4800
3    4800
1    4800
Name: label_num, dtype: int64

In [18]:
y_test.value_counts()

0    1200
2    1200
3    1200
1    1200
Name: label_num, dtype: int64

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

pipe=Pipeline([
    ("vectorizer",TfidfVectorizer()),
    ("KNN",KNeighborsClassifier())
])

pipe.fit(X_train,y_train)

y_pred=pipe.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1200
           1       0.97      0.95      0.96      1200
           2       0.97      0.97      0.97      1200
           3       0.97      0.98      0.97      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [27]:
X_test[:5]

20706    Lal Haveli Designer Handmade Patchwork Decorat...
19166    GOTOTOP Classical Retro Cotton & PU Leather Ne...
15209    FabSeasons Camouflage Polyester Multi Function...
2462     Indian Superfoods: Change the Way You Eat Revi...
6621     Milton Marvel Insulated Steel Casseroles, Juni...
Name: Text, dtype: object

In [24]:
y_test[:5]

20706    0
19166    2
15209    3
2462     1
6621     3
Name: label_num, dtype: int64

In [25]:
y_pred[:5]

array([0, 2, 3, 1, 0], dtype=int64)
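
Among these five samples only the last one is misclassified (true label 3, predicted 0). As an illustration of how the columns of `classification_report` are derived, here is a small sketch that computes accuracy, precision, recall and F1 for just these five samples. Five samples are far too few for meaningful scores, and the helper `precision_recall_f1` is a name made up for this example.

```python
y_true = [0, 2, 3, 1, 3]  # y_test[:5] from the output above
y_pred = [0, 2, 3, 1, 0]  # y_pred[:5] from the output above

def precision_recall_f1(y_true, y_pred, label):
    # True positives: samples of this class that were predicted as this class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # all predictions of this class
    actual = sum(1 for t in y_true if t == label)     # all true members of this class
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)

for label in sorted(set(y_true)):
    print(label, precision_recall_f1(y_true, y_pred, label))
```

Class 0 has perfect recall but a precision of 0.5, because one of its two predictions is wrong; class 3 is the mirror image.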

In [28]:
from sklearn.naive_bayes import MultinomialNB

pipe=Pipeline([
    ("vectorizer",TfidfVectorizer()),
    ("Multi NB",MultinomialNB())
])

pipe.fit(X_train,y_train)

y_pred=pipe.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1200
           1       0.98      0.92      0.95      1200
           2       0.97      0.97      0.97      1200
           3       0.97      0.99      0.98      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [29]:
from sklearn.ensemble import RandomForestClassifier

pipe=Pipeline([
    ("vectorizer",TfidfVectorizer()),
    ("R.F.C.",RandomForestClassifier())
])

pipe.fit(X_train,y_train)

y_pred=pipe.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.97      0.98      1200
           3       0.98      0.99      0.99      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800



In [30]:
import spacy

nlp=spacy.load("en_core_web_sm")

def preprocess(text):
    doc=nlp(text)
    filtered_tokens=[]
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
        
    return " ".join(filtered_tokens)

In [31]:
df['prepro_tst']=df['Text'].apply(preprocess)

In [32]:
df.head()

Unnamed: 0,Text,label,label_num,prepro_tst
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0,Urban Ladder Eisner low Study Office Computer ...
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0,contrast live Wooden Decorative Box Painted Bo...
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2,IO Crest SY PCI40010 PCI raid Host Controller ...
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,3,ISAKAA Baby Socks bear 8 Years- Pack 4 6 8 12 ...
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,3,Indira Designer Women Art Mysore Silk Saree Bl...


In [33]:
df.Text[0]

'Urban Ladder Eisner Low Back Study-Office Computer Chair(Black) A study in simple. The Eisner study chair has a firm foam cushion, which makes long hours at your desk comfortable. The flexible meshed back is designed for air-circulation and support when you lean back. The curved arms provide ergonomic forearm support. Adjust the height using the gas lift to find that comfortable position and the nylon castors make it easy to move around your space. Chrome legs refer to the images for dimension details any assembly required will be done by the UL team at the time of delivery indoor use only.'

In [34]:
df.prepro_tst[0]

'Urban Ladder Eisner low Study Office Computer Chair(Black study simple Eisner study chair firm foam cushion make long hour desk comfortable flexible mesh design air circulation support lean curved arm provide ergonomic forearm support adjust height gas lift find comfortable position nylon castor easy space chrome leg refer image dimension detail assembly require UL team time delivery indoor use'

In [35]:
X_train,X_test,y_train,y_test=train_test_split(df.prepro_tst,df.label_num,test_size=0.20,random_state=2022,
                                             stratify=df.label_num)

In [36]:
pipe=Pipeline([
    ("vectorizer",TfidfVectorizer()),
    ("R.F.C.",RandomForestClassifier())
])

pipe.fit(X_train,y_train)

y_pred=pipe.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.97      0.98      1200
           3       0.98      0.99      0.98      1200

    accuracy                           0.98      4800
   macro avg       0.98      0.98      0.98      4800
weighted avg       0.98      0.98      0.98      4800



## TF-IDF: Exercises
- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

- On social media platforms like Twitter and Instagram, many people express their views on a particular event or scenario through comments, and these comments may convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- For a given comment/text, we are going to use classical NLP techniques to classify which emotion that comment belongs to.

- We are going to use techniques like Bag of n-grams, TF-IDF, etc. for text representation and apply different classification algorithms.

### About Data: Emotion Detection
Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

- This data consists of two columns:
  - Comment
  - Emotion

- Comment holds statements or messages about a particular event/situation.

- Emotion tells whether the given comment expresses fear 😨, anger 😡, or joy 😂.

- As there are 3 classes, this is a multi-class classification problem.


In [46]:
#import pandas library
import pandas as pd

#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
df=pd.read_csv("Emotion_classify_Data.csv")

#print the shape of dataframe
print(df.shape)

#print top 5 rows
df.head()

(5937, 2)


Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [47]:
#check the distribution of Emotion
df.Emotion.value_counts()

anger    2000
joy      2000
fear     1937
Name: Emotion, dtype: int64

In [50]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2
df['Emotion_num']=df['Emotion'].map({
    "joy":0,"fear":1,"anger":2
})

#checking the results by printing top 5 rows
df.head()

Unnamed: 0,Comment,Emotion,Emotion_num
0,i seriously hate one subject to death but now ...,fear,1
1,im so full of life i feel appalled,anger,2
2,i sit here to write i start to dig out my feel...,fear,1
3,ive been really angry with r and i feel like a...,joy,0
4,i feel suspicious if there is no one outside l...,fear,1


### Modelling without Pre-processing Text data


In [51]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the train-test split with a test size of 20%
#Note: use random_state=2022 and stratified sampling
X_train,X_test,y_train,y_test=train_test_split(df.Comment,df.Emotion_num,test_size=0.20,random_state=2022
                                               ,stratify=df.Emotion_num)

In [52]:
#print the shapes of X_train and X_test
print("Shape of X_train:",X_train.shape)
print("Shape of X_test:",X_test.shape)

Shape of X_train: (4749,)
Shape of X_test: (1188,)


#### Attempt 1 :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.
##### Note:

- use CountVectorizer with only trigrams.
- use RandomForest as the classifier.
- print the classification report.


In [55]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object
pipe=Pipeline([
    ("Countvectorizer",CountVectorizer(ngram_range=(3,3))),
    ("R.F",RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred

y_pred=pipe.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.57      0.27      0.36       400
           1       0.37      0.79      0.50       388
           2       0.53      0.22      0.31       400

    accuracy                           0.42      1188
   macro avg       0.49      0.42      0.39      1188
weighted avg       0.49      0.42      0.39      1188
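
One plausible reason for these weak scores is sparsity: with `ngram_range=(3,3)` the only features are three-word sequences, and many of those seen in training probably never recur in the test comments. The sketch below shows what the two n-gram settings used in these exercises extract from one comment of the dataset. It uses simple whitespace tokenization, which is an assumption; `CountVectorizer`'s default tokenizer differs (for example, it drops single-character tokens), and `word_ngrams` is a helper written for this example.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max (whitespace tokenization)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

comment = "im so full of life i feel appalled"  # row 1 of the dataset

trigrams = word_ngrams(comment, 3, 3)     # like ngram_range=(3,3)
uni_bigrams = word_ngrams(comment, 1, 2)  # like ngram_range=(1,2)

print(len(trigrams), trigrams)
print(len(uni_bigrams), uni_bigrams)
```

With trigrams only, an informative single word such as "appalled" is available to the classifier only as part of a longer phrase like "i feel appalled", whereas the unigram and bigram setting keeps it as a feature of its own.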



#### Attempt 2 :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.
##### Note:

- use CountVectorizer with both unigrams and bigrams.
- use Multinomial Naive Bayes as the classifier.
- print the classification report.

In [62]:
#import MultinomialNB from sklearn

from sklearn.naive_bayes import MultinomialNB

#1. create a pipeline object

pipe=Pipeline([
    ("Countvectorizer",CountVectorizer(ngram_range=(1,2))),
    ("NB",MultinomialNB())
])

#2. fit with X_train and y_train

pipe.fit(X_train,y_train)

#3. get the predictions for X_test and store it in y_pred

y_pred=pipe.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87       400
           1       0.87      0.83      0.85       388
           2       0.83      0.88      0.85       400

    accuracy                           0.86      1188
   macro avg       0.86      0.86      0.86      1188
weighted avg       0.86      0.86      0.86      1188



#### Attempt 3 :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.
##### Note:

- use CountVectorizer with both unigrams and bigrams.
- use RandomForest as the classifier.
- print the classification report.

In [63]:
#1. create a pipeline object

pipe=Pipeline([
    ("Countvectorizer",CountVectorizer(ngram_range=(1,2))),
    ("R.F",RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred

y_pred=pipe.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.84      0.97      0.90       400
           1       0.96      0.88      0.92       388
           2       0.93      0.86      0.90       400

    accuracy                           0.90      1188
   macro avg       0.91      0.90      0.91      1188
weighted avg       0.91      0.90      0.91      1188



#### Attempt 4 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
##### Note:

- use the TF-IDF vectorizer to represent the text.
- use RandomForest as the classifier.
- print the classification report.

In [61]:
#import TfidfVectorizer from sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

#1. create a pipeline object

pipe=Pipeline([
    ("Tfidfvectorizer",TfidfVectorizer()),
    ("R.F",RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred

y_pred=pipe.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))


              precision    recall  f1-score   support

           0       0.89      0.95      0.92       400
           1       0.92      0.90      0.91       388
           2       0.94      0.89      0.91       400

    accuracy                           0.91      1188
   macro avg       0.92      0.91      0.91      1188
weighted avg       0.92      0.91      0.91      1188



### Use text pre-processing to remove stop words and punctuation and to apply lemmatization

In [64]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [65]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient
df['preprocessed_comment']=df.Comment.apply(preprocess)

In [66]:
df.head()

Unnamed: 0,Comment,Emotion,Emotion_num,preprocessed_comment
0,i seriously hate one subject to death but now ...,fear,1,seriously hate subject death feel reluctant drop
1,im so full of life i feel appalled,anger,2,m life feel appalled
2,i sit here to write i start to dig out my feel...,fear,1,sit write start dig feeling think afraid accep...
3,ive been really angry with r and i feel like a...,joy,0,ve angry r feel like idiot trust place
4,i feel suspicious if there is no one outside l...,fear,1,feel suspicious outside like rapture happen


### Build a model with pre processed text

In [70]:
#Do the train-test split with a test size of 20%, random_state=2022 and stratified sampling
#Note: use the preprocessed_comment column
X_train,X_test,y_train,y_test=train_test_split(df.preprocessed_comment,df.Emotion_num,test_size=0.20,random_state=2022
                                               ,stratify=df.Emotion_num)

#### Let's check the scores with our best model so far: Random Forest

#### Attempt 1 :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.
##### Note:

- use CountVectorizer with both unigrams and bigrams.
- use RandomForest as the classifier.
- print the classification report.

In [71]:
#1. create a pipeline object
pipe=Pipeline([
    ("Countvectorizer",CountVectorizer(ngram_range=(1,2))),
    ("R.F",RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred

y_pred=pipe.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.94      0.95      0.94       400
           1       0.94      0.90      0.92       388
           2       0.91      0.94      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



#### Attempt 2 :

- Using the sklearn pipeline module, create a classification pipeline to classify the data.
##### Note:

- use the TF-IDF vectorizer to represent the text.
- use RandomForest as the classifier.
- print the classification report.

In [72]:
#1. create a pipeline object

pipe=Pipeline([
    ("Tfidfvectorizer",TfidfVectorizer()),
    ("R.F",RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred

y_pred=pipe.predict(X_test)

#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94       400
           1       0.93      0.92      0.93       388
           2       0.94      0.91      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188

