# Text Representation - Bag of Words vs. TF-IDF on Twitter Data
* Notebook by Adam Lang
* Date: 7/26/2024

# Overview
* This notebook walks through two classical text representation methods in NLP, Bag of Words and TF-IDF, applied to a Twitter dataset.

## Get the data

In [1]:
## data path
data_path = "/content/drive/MyDrive/Colab Notebooks/Classical NLP/tweets-.csv"

In [2]:
## imports
import numpy as np
import pandas as pd

In [3]:
# read dataset
df = pd.read_csv(data_path)

In [4]:
# df head
df.head()

Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,RT @rssurjewala: Critical question: Was PayTM ...,False,0.0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331.0,True,False
1,RT @Hemant_80: Did you vote on #Demonetization...,False,0.0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66.0,True,False
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0.0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12.0,True,False
3,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0.0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338.0,True,False
4,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0.0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120.0,True,False


## Pre-processing
* Drop every column except `text`.

In [5]:
# df columns
df.columns

Index(['text', 'favorited', 'favoriteCount', 'replyToSN', 'created',
       'truncated', 'replyToSID', 'id', 'replyToUID', 'statusSource',
       'screenName', 'retweetCount', 'isRetweet', 'retweeted'],
      dtype='object')

In [6]:
# only keep the text column
df.drop(df.columns[1:], axis=1, inplace=True)

In [7]:
# check if col dropped
df.head()

Unnamed: 0,text
0,RT @rssurjewala: Critical question: Was PayTM ...
1,RT @Hemant_80: Did you vote on #Demonetization...
2,"RT @roshankar: Former FinSec, RBI Dy Governor,..."
3,RT @ANI_news: Gurugram (Haryana): Post office ...
4,RT @satishacharya: Reddy Wedding! @mail_today ...


## Create Text Representations

## 1. Bag of Words
* scikit-learn's `CountVectorizer` builds the Bag of Words matrix: one row per document, one column per vocabulary term, and each cell holds a term count.

In [8]:
## import from sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
## create object for CountVectorizer
word_bow = CountVectorizer()

In [12]:
# fit on training dataset
word_bow.fit(df['text'].values)

In [13]:
## transform training data
word_vectors_bow = word_bow.transform(df['text'].values)

In [18]:
## get the BOW features
word_bow.get_feature_names_out()

array(['00', '000', '00716', ..., 'zzh5moxrtq', 'zzthdwqbfy',
       'zzyjzzuhlu'], dtype=object)

In [19]:
## visualize feature matrix
word_vectors_bow

<5157x13541 sparse matrix of type '<class 'numpy.int64'>'
	with 86437 stored elements in Compressed Sparse Row format>

Summary:
* The result is a sparse matrix: only the non-zero counts are stored.
* It has 5,157 rows (documents) and 13,541 columns (unique tokens).
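
To put a number on that sparsity, the density of a SciPy sparse matrix is its count of stored entries (`nnz`) divided by rows times columns. The sketch below is illustrative only: `toy_docs` is a made-up corpus, not the tweet data, and the last lines apply the same arithmetic to the shape and stored-element count printed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the tweets
toy_docs = ["cash is king", "no cash at the bank", "long queue at the bank"]

toy_matrix = CountVectorizer().fit_transform(toy_docs)
n_docs, n_terms = toy_matrix.shape

# density = stored (non-zero) entries / total number of cells
toy_density = toy_matrix.nnz / (n_docs * n_terms)
print(f"toy density: {toy_density:.2%}")

# Same arithmetic with the figures reported for the tweet matrix
tweet_density = 86437 / (5157 * 13541)
print(f"tweet density: {tweet_density:.4%}")
```

For the tweet matrix this works out to roughly 0.12%, so almost every cell is zero.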

In [21]:
## convert matrix --> df --> get document representations
vocab = word_bow.get_feature_names_out()

new_df = pd.DataFrame(word_vectors_bow.toarray(), columns=vocab)
new_df.head()

Unnamed: 0,00,000,00716,0080,0081,0082,0083,0084,0085,0086,...,zxiusza2s7,zxuecwobqp,zyitjkbklc,zylu2al27f,zymrlzofxm,zyuakjdi4h,zz0mflmpfd,zzh5moxrtq,zzthdwqbfy,zzyjzzuhlu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Summary:
* Almost every entry shown above is zero, so the matrix is very sparse.
* Let's shrink the vocabulary so the matrix becomes smaller and less sparse.

### Reducing Sparsity of Matrix

#### 1. Preprocess document text

In [22]:
## import processing libraries
import spacy
import re

# load spacy english language model
nlp = spacy.load('en_core_web_sm')

In [23]:
# create preprocess function
def clean(text):

  # replace every run of non-alphabetic characters with a single space
  text = ' '.join(re.compile(r'[^a-zA-Z]+').split(text))

  # spacy object creation
  doc = nlp(text)

  # list to store clean text
  filtered_text = []

  # iterate and save word lemmas (root words)
  for token in doc:
    filtered_text.append(token.lemma_)

  return " ".join(filtered_text)

In [24]:
## apply clean text function
df['text_clean'] = df['text'].apply(clean)

In [25]:
# print dataset
df.head(10)

Unnamed: 0,text,text_clean
0,RT @rssurjewala: Critical question: Was PayTM ...,RT rssurjewala critical question be PayTM info...
1,RT @Hemant_80: Did you vote on #Demonetization...,RT Hemant do you vote on Demonetization on Mod...
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",RT roshankar former finsec RBI Dy Governor CBD...
3,RT @ANI_news: Gurugram (Haryana): Post office ...,RT ANI news Gurugram Haryana Post office emplo...
4,RT @satishacharya: Reddy Wedding! @mail_today ...,RT satishacharya Reddy Wedding mail today cart...
5,@DerekScissors1: Indias #demonetization: #Bla...,DerekScissors India s demonetization Blackmo...
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,RT gauravcsawant Rs lakh loot from a bank in K...
7,RT @Joydeep_911: Calling all Nationalists to j...,RT Joydeep call all Nationalists to join Walk ...
8,RT @sumitbhati2002: Many opposition leaders ar...,RT sumitbhati many opposition leader be with n...
9,National reform now destroyed even the essence...,national reform now destroy even the essence o...
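
The regex step inside `clean` can be looked at in isolation, without spaCy. The sketch below assumes the same pattern, `[^a-zA-Z]+`, and uses a hypothetical helper name, `strip_non_alpha`; the sample string is the visible prefix of row 1 above. Digits, `@`, `#`, underscores and punctuation all collapse into single spaces, which is why `@Hemant_80` ends up as `Hemant`.

```python
import re

# Same pattern as in clean(): one or more characters that are not letters
non_alpha = re.compile(r"[^a-zA-Z]+")

def strip_non_alpha(text):
    """Replace each run of non-letters with one space (hypothetical helper)."""
    return non_alpha.sub(" ", text).strip()

sample = "RT @Hemant_80: Did you vote on #Demonetization"
cleaned = strip_non_alpha(sample)
print(cleaned)
```

Lemmatization then runs on this letters-only text, which is how `Did` becomes `do` in the `text_clean` column.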


#### Generate `CountVectorizer` again

In [26]:
## arguments: default values
word_bow = CountVectorizer(binary=False, # count occurrences of terms
                           lowercase=True, #lowercase terms
                           )

In [27]:
## fit and transform training data
word_vectors_bow = word_bow.fit_transform(df['text_clean'].values)

In [28]:
## now look at shape of matrix
word_vectors_bow

<5157x12813 sparse matrix of type '<class 'numpy.int64'>'
	with 86053 stored elements in Compressed Sparse Row format>

Summary:
* Cleaning the text reduced the vocabulary from 13,541 to 12,813 unique terms.

In [29]:
## document representation
vocab = word_bow.get_feature_names_out()

# create df to store
new_df2 = pd.DataFrame(word_vectors_bow.toarray(), columns=vocab)

new_df2.head()

Unnamed: 0,aa,aaadhar,aaanupriyaaa,aadhaar,aadhar,aadhe,aadityagautom,aadmi,aagr,aaj,...,zymrlzofxm,zynql,zyuakjdi,zz,zzdxhds,zzh,zzl,zzthdwqbfy,zzygw,zzyjzzuhlu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 2. Keep only top frequent terms
* Let's cap the vocabulary at 5,000 terms with `max_features`.

In [30]:
## change arguments
word_bow = CountVectorizer(binary=False, # count occurrences of terms
                           lowercase=True, #lowercase
                           max_features=5000, #max features
                           )

In [31]:
# fit and transform training dataset
word_vectors_bow = word_bow.fit_transform(df['text_clean'].values)

In [32]:
# shape of new matrix
word_vectors_bow

<5157x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 78074 stored elements in Compressed Sparse Row format>

Summary:
* The vocabulary is now limited to the 5,000 terms with the highest total count across the corpus.

In [33]:
# Document representation
vocab = word_bow.get_feature_names_out()

new_df3 = pd.DataFrame(word_vectors_bow.toarray(), columns=vocab)

In [34]:
# look at new_df3
new_df3.head()

Unnamed: 0,aa,aadhaar,aadhar,aadhe,aadmi,aajtak,aam,aamaadmi,aamaadmiparty,aamir,...,zt,zu,zv,zvup,zvvbjg,zwcmfzca,zxhhmuwceq,zyitjkbklc,zylu,zz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


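#### 3. Thresholding occurrence of terms
* Build the vocabulary only from terms whose document frequency falls in a chosen range: `min_df` drops terms that appear in too few documents and `max_df` drops terms that appear in too many.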

In [35]:
## change arguments for bow
word_bow = CountVectorizer(binary=False, # count occurrences of terms
                           lowercase=True, #lowercase terms
                           max_df=500, # drop terms found in more than 500 documents
                           min_df=10, # drop terms found in fewer than 10 documents
                           )

In [36]:
# fit and transform training data
word_vectors_bow = word_bow.fit_transform(df['text_clean'].values)

In [37]:
# new matrix shape
word_vectors_bow

<5157x967 sparse matrix of type '<class 'numpy.int64'>'
	with 41103 stored elements in Compressed Sparse Row format>

Summary:
* Now the vocabulary is only 967 unique terms.

In [38]:
# document representation
vocab = word_bow.get_feature_names_out()

new_df4 = pd.DataFrame(word_vectors_bow.toarray(), columns=vocab)

In [39]:
# new_df4
new_df4.head()

Unnamed: 0,aa,aadhaar,aadmi,aam,aamaadmiparty,aap,able,about,abt,accept,...,yet,yogi,you,young,your,youtube,youtubers,yrdeshmukh,yt,zone
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 4. N-gram BoW
* Now let's use bigrams instead of unigrams to build our vocabulary.

In [40]:
## argument update
word_bow = CountVectorizer(binary=False, #count occurrences of terms
                           lowercase=True, #lowercase
                           ngram_range=(2,2) #bi-gram creation
                           )

In [41]:
# fit and transform training data
word_vectors_bow = word_bow.fit_transform(df['text_clean'].values)

In [42]:
# shape of new matrix
word_vectors_bow

<5157x42679 sparse matrix of type '<class 'numpy.int64'>'
	with 84906 stored elements in Compressed Sparse Row format>

Summary:
* Switching from unigrams to bigrams more than triples the vocabulary, from 12,813 to 42,679 features.
* The advantage of bigrams is that they preserve local word order, which unigrams discard.

In [43]:
## get features
word_bow.get_feature_names_out()

array(['aa gaye', 'aa https', 'aa lfy', ..., 'zzh moxrtq', 'zzl offa',
       'zzygw em'], dtype=object)

In [44]:
## document representation
vocab = word_bow.get_feature_names_out()

new_df5 = pd.DataFrame(word_vectors_bow.toarray(), columns=vocab)

In [45]:
# new_df5
new_df5.head()

Unnamed: 0,aa gaye,aa https,aa lfy,aa lvsy,aa mazc,aa pje,aa rahe,aa to,aa yogi,aaadhar expansion,...,zwpql frwn,zwsoa google,zyitjkbklc via,zylu al,zymrlzofxm https,zz al,zz mflmpfd,zzh moxrtq,zzl offa,zzygw em
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Summary:
* The resulting matrix is even sparser: most bigrams appear in only one or two tweets.
* We could refine this by experimenting with other `ngram_range` settings, for example combining unigrams and bigrams.

## TF-IDF
* We will use the `TfidfVectorizer` from sklearn.

In [46]:
## import tfidf vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
## create object for TfidfVectorizer
word_tfidf = TfidfVectorizer()

In [48]:
# fit and transform training data
word_vectors_tfidf = word_tfidf.fit_transform(df['text_clean'].values)

In [49]:
## shape of matrix
word_vectors_tfidf

<5157x12813 sparse matrix of type '<class 'numpy.float64'>'
	with 86053 stored elements in Compressed Sparse Row format>

Summary:
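* The matrix has the same shape (5,157 x 12,813) and the same number of stored elements as the BoW matrix built on the cleaned text. `TfidfVectorizer` tokenizes the same way, so the vocabulary is unchanged; only the values differ.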

In [50]:
## document representation
vocab = word_tfidf.get_feature_names_out()

tfidf_df = pd.DataFrame(word_vectors_tfidf.toarray(), columns=vocab)

In [51]:
## print new df
tfidf_df.head()

Unnamed: 0,aa,aaadhar,aaanupriyaaa,aadhaar,aadhar,aadhe,aadityagautom,aadmi,aagr,aaj,...,zymrlzofxm,zynql,zyuakjdi,zz,zzdxhds,zzh,zzl,zzthdwqbfy,zzygw,zzyjzzuhlu
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.244035,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Summary:
* The feature values are now floating-point TF-IDF weights instead of integer counts.
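
A minimal sketch of where those floats come from, assuming `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): each raw count is multiplied by `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`, and each row is then scaled to unit L2 length. The corpus `toy_docs` is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["bank cash bank", "cash note", "bank queue"]

# Step 1: raw term counts (the Bag of Words matrix)
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)

# Step 2: smoothed inverse document frequency per term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts and L2-normalize each row
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

# Compare with scikit-learn's result
sk = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sk))
```

Terms that occur in many documents get a smaller `idf`, so they contribute less to a row than rarer terms with the same count.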

#### Modify argument values


In [52]:
# update args
word_tfidf = TfidfVectorizer(ngram_range=(2,2)) #bi-grams

In [53]:
# fit and transform training data
word_vectors_tfidf = word_tfidf.fit_transform(df['text_clean'].values)

In [54]:
# shape of matrix
word_vectors_tfidf

<5157x42679 sparse matrix of type '<class 'numpy.float64'>'
	with 84906 stored elements in Compressed Sparse Row format>

Summary:
* The bigram TF-IDF matrix has the same number of features (42,679) as the bigram BoW matrix.

In [55]:
## get matrix
vocab = word_tfidf.get_feature_names_out()

tfidf_df2 = pd.DataFrame(word_vectors_tfidf.toarray(), columns=vocab)

In [56]:
## new df
tfidf_df2.head()

Unnamed: 0,aa gaye,aa https,aa lfy,aa lvsy,aa mazc,aa pje,aa rahe,aa to,aa yogi,aaadhar expansion,...,zwpql frwn,zwsoa google,zyitjkbklc via,zylu al,zymrlzofxm https,zz al,zz mflmpfd,zzh moxrtq,zzl offa,zzygw em
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Summary
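* We compared two classical NLP text representations, Bag of Words (BoW) and TF-IDF, on the same tweets.
* With the same tokenization settings, both produce the same vocabulary and matrix shape. BoW stores raw term counts, while TF-IDF stores float weights that down-weight terms appearing in many documents.
* Text cleaning, `max_features`, `min_df`/`max_df` and `ngram_range` all change the vocabulary size, here from 967 terms with document-frequency thresholds up to 42,679 with bigrams.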