# Exercise 3. Text Representation
### Text, Web and Social Media Analytics

In this exercise, we will derive document representations of the preprocessed (stemmed) newsgroups dataset. We will use both sklearn and gensim to build the bag-of-words document representation. We will calculate the following representations (n-grams with sklearn only):

- Absolute frequencies
- Relative frequencies
- TF-IDF frequencies
- N-grams

We first import all the libraries we will be using.

In [None]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Binarizer
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

We now load the preprocessed dataset from the previous exercise by using the pickle package. We then print the first row to make sure it was loaded correctly. 

In [None]:
with open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/stemmed_data.p', 'rb') as f:
  stemmed_data = pickle.load(f)

print(stemmed_data.iloc[0])

content         car wonder enlighten car saw dai door sport ca...
target                                                          7
target_names                                            rec.autos
Name: 0, dtype: object


## Bag-of-Words in Scikit-Learn

Here, we define a CountVectorizer object, which calculates the absolute term frequency; this means that the value of each word is the number of times it appears in the document. We also set two parameters, 'max_df' and 'min_df', to 0.95 and 0.05 respectively, which leaves out any word whose document frequency is higher than 95% or lower than 5%.

After this, we fit and transform our data to generate a matrix representation of our text. We then print the features in the matrix, as well as the shape of the matrix and the first few values to get an idea of how the matrix looks.

In [None]:
count_vectorizer = CountVectorizer(max_df=0.95, min_df=0.05)
count_vectorizer_matrix = count_vectorizer.fit_transform(stemmed_data['content'])

print('Features:\n{}\n'.format(count_vectorizer.get_feature_names_out().tolist()))
print('Matrix Shape:\n{}\n'.format(count_vectorizer_matrix.shape))
print('First Values:\n{}'.format(count_vectorizer_matrix[:5, :5].todense()))

Features:
['abl', 'accept', 'actual', 'address', 'advanc', 'ago', 'agre', 'allow', 'american', 'answer', 'anybodi', 'appreci', 'apr', 'area', 'articl', 'ask', 'assum', 'avail', 'awai', 'bad', 'base', 'believ', 'best', 'better', 'big', 'bit', 'book', 'bui', 'call', 'car', 'card', 'care', 'case', 'caus', 'chang', 'check', 'chip', 'christian', 'claim', 'close', 'com', 'come', 'complet', 'consid', 'control', 'cost', 'cours', 'current', 'dai', 'data', 'david', 'deal', 'design', 'differ', 'discuss', 'drive', 'edu', 'effect', 'email', 'end', 'engin', 'exampl', 'exist', 'expect', 'experi', 'fact', 'far', 'fax', 'feel', 'file', 'final', 'follow', 'forc', 'free', 'game', 'gener', 'get', 'given', 'go', 'god', 'good', 'got', 'govern', 'great', 'group', 'guess', 'gui', 'hand', 'happen', 'hard', 'have', 'heard', 'help', 'high', 'home', 'hope', 'human', 'idea', 'import', 'includ', 'info', 'inform', 'interest', 'internet', 'isn', 'issu', 'john', 'kei', 'kill', 'kind', 'know', 'larg', 'law', 'left', 'l
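To make the effect of 'min_df' and 'max_df' more concrete, the following is a minimal sketch on a hypothetical four-document toy corpus (the documents and the 0.5 threshold are assumptions chosen for illustration, not part of the exercise data). It compares the vocabulary with and without document-frequency filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: four short, already-preprocessed documents.
toy_docs = [
    "car engin car",
    "car door",
    "car game",
    "car game team",
]

# No filtering: every term that appears at least once is kept.
unfiltered = CountVectorizer()
unfiltered.fit(toy_docs)

# Keep only terms that appear in at least 50% and at most 95% of the documents.
filtered = CountVectorizer(max_df=0.95, min_df=0.5)
filtered.fit(toy_docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))
```

In this sketch, 'car' is dropped because it appears in every document, and the words that appear in only one document are dropped because they fall below the 50% threshold.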

Now, we define a TfidfVectorizer object, which will calculate the relative term frequency of each word; this means the number of times that a word appears in the document is divided by the length of the document (counting only the words kept in the vocabulary). To do this, we set two parameters: 'use_idf' to False, so the inverse-document-frequency reweighting is not applied, and 'norm' to 'l1', so the values of all the words in a document sum to one. We also set the parameters 'max_df' and 'min_df' as we did before.

We then print the feature names, the matrix shape and the first few values to get an idea of how the matrix looks.

In [None]:
tfidf_vectorizer_l1 = TfidfVectorizer(max_df=0.95, min_df=0.05, use_idf=False, norm='l1')
tfidf_vectorizer_l1_matrix = tfidf_vectorizer_l1.fit_transform(stemmed_data['content'])

print('Features:\n{}\n'.format(tfidf_vectorizer_l1.get_feature_names_out().tolist()))
print('Matrix Shape:\n{}\n'.format(tfidf_vectorizer_l1_matrix.shape))
print('First Values:\n{}'.format(tfidf_vectorizer_l1_matrix[:5, :5].todense()))

Features:
['abl', 'accept', 'actual', 'address', 'advanc', 'ago', 'agre', 'allow', 'american', 'answer', 'anybodi', 'appreci', 'apr', 'area', 'articl', 'ask', 'assum', 'avail', 'awai', 'bad', 'base', 'believ', 'best', 'better', 'big', 'bit', 'book', 'bui', 'call', 'car', 'card', 'care', 'case', 'caus', 'chang', 'check', 'chip', 'christian', 'claim', 'close', 'com', 'come', 'complet', 'consid', 'control', 'cost', 'cours', 'current', 'dai', 'data', 'david', 'deal', 'design', 'differ', 'discuss', 'drive', 'edu', 'effect', 'email', 'end', 'engin', 'exampl', 'exist', 'expect', 'experi', 'fact', 'far', 'fax', 'feel', 'file', 'final', 'follow', 'forc', 'free', 'game', 'gener', 'get', 'given', 'go', 'god', 'good', 'got', 'govern', 'great', 'group', 'guess', 'gui', 'hand', 'happen', 'hard', 'have', 'heard', 'help', 'high', 'home', 'hope', 'human', 'idea', 'import', 'includ', 'info', 'inform', 'interest', 'internet', 'isn', 'issu', 'john', 'kei', 'kill', 'kind', 'know', 'larg', 'law', 'left', 'l
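As a sanity check of the relative-frequency interpretation, the sketch below (on a hypothetical toy corpus) divides raw counts by the row totals and compares the result with a TfidfVectorizer configured with 'use_idf=False' and 'norm="l1"'.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = ["car engin car", "car door", "game team game game"]

# Absolute counts, then divide each row by the number of counted words in it.
counts = CountVectorizer().fit_transform(toy_docs).toarray()
manual_rel = counts / counts.sum(axis=1, keepdims=True)

# The same quantity via TfidfVectorizer without IDF and with L1 normalization.
rel = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(toy_docs).toarray()

print(np.allclose(rel, manual_rel))
print(rel.sum(axis=1))  # each row should sum to one
```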

We now define another TfidfVectorizer object, which will calculate the actual TF-IDF weights; this means that the frequency of each word is multiplied by its inverse document frequency, which is based on the logarithm of the total number of documents divided by the number of documents that contain that word (sklearn adds 1 to this logarithm). We set the 'max_df' and 'min_df' parameters as before, and also set 'smooth_idf' to False, so the document counts are not smoothed as if an extra document containing every word had been seen. With the default 'norm' of 'l2', each document vector is then scaled to unit Euclidean length.

Then we also print the feature names, the matrix shape and the first few values.

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.05, smooth_idf=False)
tfidf_vectorizer_matrix = tfidf_vectorizer.fit_transform(stemmed_data['content'])

print('Features:\n{}\n'.format(tfidf_vectorizer.get_feature_names_out().tolist()))
print('Matrix Shape:\n{}\n'.format(tfidf_vectorizer_matrix.shape))
print('First Values:\n{}'.format(tfidf_vectorizer_matrix[:5, :5].todense()))

Features:
['abl', 'accept', 'actual', 'address', 'advanc', 'ago', 'agre', 'allow', 'american', 'answer', 'anybodi', 'appreci', 'apr', 'area', 'articl', 'ask', 'assum', 'avail', 'awai', 'bad', 'base', 'believ', 'best', 'better', 'big', 'bit', 'book', 'bui', 'call', 'car', 'card', 'care', 'case', 'caus', 'chang', 'check', 'chip', 'christian', 'claim', 'close', 'com', 'come', 'complet', 'consid', 'control', 'cost', 'cours', 'current', 'dai', 'data', 'david', 'deal', 'design', 'differ', 'discuss', 'drive', 'edu', 'effect', 'email', 'end', 'engin', 'exampl', 'exist', 'expect', 'experi', 'fact', 'far', 'fax', 'feel', 'file', 'final', 'follow', 'forc', 'free', 'game', 'gener', 'get', 'given', 'go', 'god', 'good', 'got', 'govern', 'great', 'group', 'guess', 'gui', 'hand', 'happen', 'hard', 'have', 'heard', 'help', 'high', 'home', 'hope', 'human', 'idea', 'import', 'includ', 'info', 'inform', 'interest', 'internet', 'isn', 'issu', 'john', 'kei', 'kill', 'kind', 'know', 'larg', 'law', 'left', 'l
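The TF-IDF computation described above can be reproduced by hand. The following sketch, on a hypothetical toy corpus, applies the unsmoothed IDF (natural logarithm plus one) to raw counts, L2-normalizes each row, and compares the result with TfidfVectorizer(smooth_idf=False). It is meant as an illustration of the formula, not as a description of sklearn's internal code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = ["car engin car", "car door", "game team game game", "car game"]

counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Unsmoothed IDF: natural logarithm of (total docs / docs with word), plus one.
idf = np.log(n_docs / doc_freq) + 1.0

# Weight the counts and scale every document vector to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk_tfidf = TfidfVectorizer(smooth_idf=False).fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sk_tfidf))
```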

Here, we define a Binarizer object, which performs a binary (presence/absence) encoding, labelled One-Hot below, on the matrix that we got from the CountVectorizer. This means that the value for each word is one if the word is present in the document and zero otherwise.

We then print the shape and first few values to confirm that the encoding is working properly. We can see that all nonzero values are now replaced by ones.

In [None]:
binarizer = Binarizer()
binarizer_matrix = binarizer.fit_transform(count_vectorizer_matrix)

print('Matrix Shape:\n{}\n'.format(binarizer_matrix.shape))
print('First Values:\n{}'.format(binarizer_matrix[:5, :5].todense()))

Matrix Shape:
(11314, 236)

First Values:
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 1 0 1]
 [0 0 0 1 0]
 [0 0 0 0 0]]
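As a small illustration, the sketch below applies a Binarizer to the counts of a hypothetical toy corpus and checks that the result matches a CountVectorizer created with 'binary=True', which is an alternative way to obtain the same presence/absence matrix in a single step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Binarizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = ["car engin car", "car door", "game team game game"]

# Route 1: count first, then map every value above the default threshold (0) to 1.
counts = CountVectorizer().fit_transform(toy_docs)
binarized = Binarizer().fit_transform(counts)

# Route 2: let CountVectorizer record presence/absence directly.
binary_counts = CountVectorizer(binary=True).fit_transform(toy_docs)

print(binarized.toarray())
print((binarized.toarray() == binary_counts.toarray()).all())
```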


We now want to create a quick comparison of the four methods shown above for the first document. We append all the rows into a single dataframe and remove the columns for words that do not appear in the document. We can now see how the values for each word change from method to method.

In [None]:
first_doc_cv = pd.DataFrame(count_vectorizer_matrix[0].todense(), columns=count_vectorizer.get_feature_names_out()).rename(index={0: "Abs. Freq."})
first_doc_tfid_l1 = pd.DataFrame(tfidf_vectorizer_l1_matrix[0].todense(), columns=tfidf_vectorizer_l1.get_feature_names_out()).rename(index={0: "Rel. Freq."})
first_doc_tfid = pd.DataFrame(tfidf_vectorizer_matrix[0].todense(), columns=tfidf_vectorizer.get_feature_names_out()).rename(index={0: "TF-IDF"})
first_doc_bin = pd.DataFrame(binarizer_matrix[0].todense(), columns=count_vectorizer.get_feature_names_out()).rename(index={0: "One-Hot"})
first_doc_df = pd.concat([first_doc_cv, first_doc_tfid_l1, first_doc_tfid, first_doc_bin])
first_doc_df = first_doc_df.loc[:, (first_doc_df != 0).all()]
first_doc_df

Unnamed: 0,call,car,dai,engin,info,know,look,mail,small,thank,wonder,year
Abs. Freq.,1.0,5.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0
Rel. Freq.,0.058824,0.294118,0.058824,0.058824,0.058824,0.058824,0.117647,0.058824,0.058824,0.058824,0.058824,0.058824
TF-IDF,0.148914,0.850723,0.140981,0.171929,0.17316,0.097393,0.236555,0.139354,0.17722,0.121796,0.169629,0.121642
One-Hot,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Bag-of-Words using Gensim

First, we tokenize each document by splitting it on whitespace, which gives a list of words per document and, for the whole dataset, a list of lists of words.

In [None]:
corpus_gen = [doc.split() for doc in stemmed_data['content']]

print(corpus_gen[0])

['car', 'wonder', 'enlighten', 'car', 'saw', 'dai', 'door', 'sport', 'car', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'car', 'histori', 'info', 'funki', 'look', 'car', 'mail', 'thank']


Now, we create a Dictionary object from the list of lists that we generated in the previous step. We then call 'filter_extremes' with two parameters: 'no_below', the minimum number of documents a word must appear in, and 'no_above', the maximum fraction of documents a word may appear in, for the word to be kept in the dictionary. The value 566 corresponds to 5% of the 11,314 documents (565.7) rounded up, so the filter matches the 'min_df' and 'max_df' settings used with sklearn. We then print the result to see what it looks like.

In [None]:
id2word = Dictionary(corpus_gen)
id2word.filter_extremes(no_below=566, no_above=0.95)

print(id2word)

Dictionary(236 unique tokens: ['call', 'car', 'dai', 'engin', 'info']...)
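The absolute threshold passed as 'no_below' can be derived from a fractional 'min_df' and the number of documents. The helper below is a hypothetical convenience function (it is not part of gensim or sklearn) and only sketches the arithmetic.

```python
import math


def min_df_to_no_below(min_df: float, n_docs: int) -> int:
    """Smallest whole number of documents that satisfies a fractional min_df.

    Hypothetical helper for illustration; floating-point rounding could matter
    when min_df * n_docs is exactly an integer.
    """
    return math.ceil(min_df * n_docs)


n_docs = 11314  # number of rows in the matrices printed earlier
print(min_df_to_no_below(0.05, n_docs))
```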


To better understand how the Dictionary object works, we print some of its attributes. We first print the actual dictionary of key-value pairs, which maps each word to its id, then all the words in the dictionary and, finally, the number of documents each word appears in, keyed by word id.

In [None]:
print(id2word.token2id)
print(id2word.token2id.keys())
print(id2word.dfs)

{'call': 0, 'car': 1, 'dai': 2, 'engin': 3, 'info': 4, 'know': 5, 'look': 6, 'mail': 7, 'small': 8, 'thank': 9, 'wonder': 10, 'year': 11, 'answer': 12, 'base': 13, 'card': 14, 'edu': 15, 'experi': 16, 'final': 17, 'gui': 18, 'messag': 19, 'number': 20, 'report': 21, 'send': 22, 'actual': 23, 'advanc': 24, 'anybodi': 25, 'better': 26, 'bit': 27, 'expect': 28, 'feel': 29, 'good': 30, 'got': 31, 'great': 32, 'heard': 33, 'help': 34, 'life': 35, 'like': 36, 'line': 37, 'machin': 38, 'mayb': 39, 'new': 40, 'opinion': 41, 'peopl': 42, 'plai': 43, 'price': 44, 'probabl': 45, 'question': 46, 'read': 47, 'real': 48, 'recent': 49, 'start': 50, 'take': 51, 'time': 52, 'us': 53, 'wai': 54, 'address': 55, 'articl': 56, 'chip': 57, 'com': 58, 'far': 59, 'inform': 60, 'person': 61, 'phone': 62, 'point': 63, 'pretti': 64, 'requir': 65, 'stuff': 66, 'system': 67, 'thing': 68, 'write': 69, 'wrote': 70, 'mean': 71, 'possibl': 72, 'right': 73, 'set': 74, 'tell': 75, 'understand': 76, 'world': 77, 'ye': 78

Here, we create the bag of words using the absolute frequency of each word in each document; it tells us how many times a certain word appears in a certain document. Each document becomes a list of (word id, count) pairs, and words that are not in the dictionary are ignored.

In [None]:
corpus1 = [id2word.doc2bow(doc) for doc in corpus_gen]

print(corpus1[0])

[(0, 1), (1, 5), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]
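To illustrate what the bag-of-words conversion does, here is a plain-Python sketch that counts dictionary words in a token list. The function name 'doc2bow_sketch' is hypothetical and this is not gensim's implementation; the token list and the word ids are copied from the outputs shown earlier (only the twelve ids relevant to the first document are included).

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Count tokens that are in the dictionary and return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Excerpt of the word-to-id mapping printed above.
token2id = {'call': 0, 'car': 1, 'dai': 2, 'engin': 3, 'info': 4, 'know': 5,
            'look': 6, 'mail': 7, 'small': 8, 'thank': 9, 'wonder': 10, 'year': 11}

# Tokens of the first document, as printed above.
first_doc = ['car', 'wonder', 'enlighten', 'car', 'saw', 'dai', 'door', 'sport',
             'car', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small',
             'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model',
             'engin', 'spec', 'year', 'product', 'car', 'histori', 'info', 'funki',
             'look', 'car', 'mail', 'thank']

first_bow = doc2bow_sketch(first_doc, token2id)
print(first_bow)
```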


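Here, we take the bag of words we created previously and calculate the relative frequency of each word. To do this, we take the absolute frequency from the previous bag of words and divide it by the total number of words in the document. Since the bag of words only contains dictionary words, this total counts only the words that survived the filtering (17 for the first document).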

In [None]:
corpus2 = [[(token[0], (token[1] / sum(n for _, n in doc))) for token in doc] for doc in corpus1]

print(corpus2[0])

[(0, 0.058823529411764705), (1, 0.29411764705882354), (2, 0.058823529411764705), (3, 0.058823529411764705), (4, 0.058823529411764705), (5, 0.058823529411764705), (6, 0.11764705882352941), (7, 0.058823529411764705), (8, 0.058823529411764705), (9, 0.058823529411764705), (10, 0.058823529411764705), (11, 0.058823529411764705)]
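The same calculation can be written so that the document total is computed once per document. The function 'relative_bow' below is a hypothetical helper shown as a sketch; the input is the first document's bag of words from the output above.

```python
def relative_bow(bow):
    """Divide each count by the total count of the document's bag of words."""
    total = sum(count for _, count in bow)
    if total == 0:
        return []  # a document with no dictionary words has no frequencies
    return [(token_id, count / total) for token_id, count in bow]


# Absolute frequencies of the first document, as printed earlier.
first_bow = [(0, 1), (1, 5), (2, 1), (3, 1), (4, 1), (5, 1),
             (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]

first_rel = relative_bow(first_bow)
print(first_rel[1])                    # relative frequency of word id 1 ('car')
print(sum(v for _, v in first_rel))    # should be (numerically) one
```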


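Here, we perform a one-hot encoding, where we write a one if the word appears in the document. Because the bag of words is a sparse list of pairs, words that do not appear are simply not listed instead of being stored as zeros.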

In [None]:
corpus3 = [[(token[0], 1) for token in doc] for doc in corpus1]

print(corpus3[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]


Now, we actually calculate the TF-IDF, which takes into account the number of documents in which each word appears. With 'normalize=True', each document vector is scaled to unit length.

In [None]:
tfidf = TfidfModel(dictionary=id2word, normalize=True)
corpus4 = [tfidf[id2word.doc2bow(doc)] for doc in corpus_gen]

print(corpus4[0])

[(0, 0.1438541968916984), (1, 0.8659993628972853), (2, 0.13288860724363213), (3, 0.17566681903543382), (4, 0.1773677410716569), (5, 0.07263746099242874), (6, 0.20301211255383858), (7, 0.1306393577394443), (8, 0.18297928466403593), (9, 0.10636899048674578), (10, 0.1724873903769856), (11, 0.10615603475135126)]
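The following plain-Python sketch mirrors the weighting that TfidfModel applies with its default settings: raw counts multiplied by a base-2 logarithmic IDF, followed by scaling to unit Euclidean length. The function name 'tfidf_log2_sketch' and the toy numbers are assumptions for illustration; this is not gensim's code.

```python
import math


def tfidf_log2_sketch(bow, dfs, n_docs):
    """Weight counts by log2(n_docs / df) and L2-normalize the result."""
    weighted = []
    for token_id, count in bow:
        idf = math.log2(n_docs / dfs[token_id])
        if idf > 0.0:  # words that occur in every document get zero weight
            weighted.append((token_id, count * idf))
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(token_id, w / norm) for token_id, w in weighted]


# Hypothetical document frequencies for three word ids in a 4-document corpus.
toy_dfs = {0: 1, 1: 2, 2: 4}
toy_bow = [(0, 1), (1, 2), (2, 3)]

toy_tfidf = tfidf_log2_sketch(toy_bow, toy_dfs, n_docs=4)
print(toy_tfidf)
```

In this toy case, word id 2 occurs in all four documents, so its IDF is zero and it drops out of the result.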


We want to create a dataframe with the previously calculated frequencies to compare it with the frequencies we got from sklearn. In order to do this, we first take the first document and collect the words that appear in it to use as columns.

In [None]:
# Look up the word for each id in the first document's bag of words
features = [id2word[token_id] for token_id, _ in corpus1[0]]

print(features)

['call', 'car', 'dai', 'engin', 'info', 'know', 'look', 'mail', 'small', 'thank', 'wonder', 'year']


Now we can append all the results for the first document into a single dataframe.

In [None]:
doc1 = pd.DataFrame([[pair[1] for pair in corpus1[0]]], columns=features, index=['Abs. Freq.'])
doc2 = pd.DataFrame([[pair[1] for pair in corpus2[0]]], columns=features, index=['Rel. Freq'])
doc3 = pd.DataFrame([[pair[1] for pair in corpus3[0]]], columns=features, index=['One-Hot'])
doc4 = pd.DataFrame([[pair[1] for pair in corpus4[0]]], columns=features, index=['TF-IDF'])

corpus_df = pd.concat([doc1, doc2, doc4, doc3])
corpus_df

Unnamed: 0,call,car,dai,engin,info,know,look,mail,small,thank,wonder,year
Abs. Freq.,1.0,5.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0
Rel. Freq,0.058824,0.294118,0.058824,0.058824,0.058824,0.058824,0.117647,0.058824,0.058824,0.058824,0.058824,0.058824
TF-IDF,0.143854,0.865999,0.132889,0.175667,0.177368,0.072637,0.203012,0.130639,0.182979,0.106369,0.172487,0.106156
One-Hot,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


We print the dataframe that we generated from the sklearn frequencies for comparison. We can see that all values are the same, except for the TF-IDF weights. This is due to a difference in the IDF formulas used by sklearn (with 'smooth_idf' set to False) and gensim (with its default settings) to calculate the weights, which are the following:

Scikit-Learn: $idf(word) = \ln\left(\frac{docs_{total}}{docs_{word}}\right) + 1$

Gensim: $idf(word) = \log_2\left(\frac{docs_{total}}{docs_{word}}\right)$

In [None]:
first_doc_df

Unnamed: 0,call,car,dai,engin,info,know,look,mail,small,thank,wonder,year
Abs. Freq.,1.0,5.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0
Rel. Freq.,0.058824,0.294118,0.058824,0.058824,0.058824,0.058824,0.117647,0.058824,0.058824,0.058824,0.058824,0.058824
TF-IDF,0.148914,0.850723,0.140981,0.171929,0.17316,0.097393,0.236555,0.139354,0.17722,0.121796,0.169629,0.121642
One-Hot,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
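To see why the normalization step does not cancel out the difference between the two IDF formulas, the sketch below evaluates both for a few document frequencies. The document frequencies are hypothetical values chosen for illustration; only the total of 11,314 documents comes from the dataset.

```python
import math


def idf_pair(n_docs, doc_freq):
    """Return (sklearn-style unsmoothed IDF, gensim-style base-2 IDF)."""
    ratio = n_docs / doc_freq
    return math.log(ratio) + 1.0, math.log2(ratio)


n_docs = 11314
quotients = []
for doc_freq in (566, 2000, 8000):  # hypothetical document frequencies
    idf_sklearn, idf_gensim = idf_pair(n_docs, doc_freq)
    quotients.append(idf_sklearn / idf_gensim)
    print('df={}: sklearn={:.3f}, gensim={:.3f}'.format(doc_freq, idf_sklearn, idf_gensim))
```

If the two formulas differed only by a constant factor, the quotient would be the same for every word and would vanish after normalization. Since the quotient changes with the document frequency, the normalized TF-IDF weights differ between the two libraries.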


## N-grams

We now create n-grams by using the CountVectorizer class from Scikit-Learn. We only have to pass the parameter 'ngram_range' to define the sizes of the word combinations we want. In our case, we are using (2, 2), which means we only want bigrams, that is, combinations of exactly two consecutive words. We also set the parameters 'max_df' and 'min_df' as before, but the effect is much stronger in this case: only three features (bigrams) appear in at least 5% of the documents. If we remove the parameter 'min_df', we get over 800,000 features.

In [None]:
n_gram = CountVectorizer(ngram_range=(2,2), max_df=0.95, min_df=0.05)
n_gram_matrix = n_gram.fit_transform(stemmed_data['content'])

print(n_gram.get_feature_names_out().tolist())

['articl apr', 'edu write', 'write articl']


We transform the matrix we got from the previous step into a dataframe and print the head. We can see that the weights are the absolute frequency of each pair of words. 

In [None]:
n_gram_df = pd.DataFrame(n_gram_matrix.todense(), columns=n_gram.get_feature_names_out())
n_gram_df.head()

Unnamed: 0,articl apr,edu write,write articl
0,0,0,0
1,0,0,0
2,0,0,0
3,0,1,1
4,0,0,0


When we print the maximum for each of the features, we can see that 'articl apr' and 'write articl' each appear at most four times in a single document, while 'edu write' appears at most two times.

In [None]:
for col in n_gram_df.columns:
  print('{}: {}'.format(col, n_gram_df[col].max()))

articl apr: 4
edu write: 2
write articl: 4
