<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Natural Language Processing: Vectorization
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 4: Topic 38</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [17]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import string
import re

# Notice that these vectorizers are from `sklearn` and not `nltk`!
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer



Load in original satire data and normalized corpus

In [18]:
satire_df = pd.read_csv(
    'data/satire_nosatire.csv')
satire_df.head()

Unnamed: 0,body,target
0,Noting that the resignation of James Mattis as...,1
1,Desperate to unwind after months of nonstop wo...,1
2,"Nearly halfway through his presidential term, ...",1
3,Attempting to make amends for gross abuses of ...,1
4,Decrying the Senate’s resolution blaming the c...,1


In [19]:
corpus = pd.read_csv(
    'data/satire_norm.csv').drop(
    columns = ['Unnamed: 0'])
corpus

Unnamed: 0,body
0,note resignation james mattis secretary defens...
1,desperate unwind month nonstop work investigat...
2,nearly halfway presidential term donald trump ...
3,attempt make amends gross abuse power time int...
4,decry senate resolution blame crown prince bru...
...,...
995,britain opposition leader jeremy corbyn push a...
996,turkey take fight islamic state militant syria...
997,malaysia seek reparation goldman sachs group i...
998,israeli court sentence palestinian year impris...


#### Feature Extraction for NLP

- Learn a vector representation of the tokenized data.
- Represent text in a form an ML model can use:
    - encode semantic information in numeric form.
- A simple (yet surprisingly effective) method for many tasks: **Bag-of-words (BoW)**.

"Bag" of words: **information about the order of words in the document is discarded**. 

- Intuition behind BoW: documents are similar if they have similar token frequency distributions. 



<img src = "Images/bag_of_words.png" >

Represented as **document-term matrix**:
- columns are tokens
- rows are documents
- values are token counts for given document.
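
As a minimal sketch of this idea, a document-term matrix can be built by hand with `collections.Counter`. The toy documents below are invented for illustration and are not taken from the satire corpus. The first two contain the same words in a different order, so they end up with identical rows.

```python
from collections import Counter

# Toy, already-normalized documents (hypothetical examples).
toy_docs = [
    "senate vote tax vote",
    "vote tax vote senate",   # same words as the first document, different order
    "court sentence appeal",
]

# One Counter of token counts per document.
counts = [Counter(doc.split()) for doc in toy_docs]

# Columns: sorted vocabulary. Rows: documents.
vocab = sorted(set(token for c in counts for token in c))
doc_term = [[c[token] for token in vocab] for c in counts]

print(vocab)
for row in doc_term:
    print(row)
```

Because only counts are kept, word order has no effect on the representation.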

<img src = "Images/document_term_matrix.png" >

#### Vectorization with sklearn

Sklearn has a few classes for constructing document-term matrices:
- CountVectorizer
- TfidfVectorizer
- HashingVectorizer

#### `CountVectorizer`: simplest of the vectorizers
- Term counts for each document in corpus
- has options for cutting words that are too common or too uncommon

- CountVectorizer(min_df, max_df)

    - min_df: lower cutoff for the document frequency of a term; a float is read as a proportion of documents, an integer as an absolute document count

    - max_df: upper cutoff, same convention (removes corpus-specific stop words)

**Important hyperparameters to tune when the vectorizer is used in a pipeline**
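
A small sketch of how the two cutoffs trim the vocabulary. The four toy documents are hypothetical; `min_df=2` is an absolute document count and `max_df=0.75` is a proportion of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "senate" is in every document,
# "debate" and "nerd" each appear in only one.
toy_docs = [
    "senate vote budget",
    "senate vote tax",
    "senate debate tax",
    "senate nerd budget",
]

full = CountVectorizer().fit(toy_docs)
trimmed = CountVectorizer(min_df=2, max_df=0.75).fit(toy_docs)

print(sorted(full.vocabulary_))     # all six terms
print(sorted(trimmed.vocabulary_))  # only terms found in 2 or 3 of the 4 documents
```

Here `max_df` removes "senate" (present in 4 of 4 documents) and `min_df` removes "debate" and "nerd" (present in 1 document each).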

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
corpus.body.head()

0    note resignation james mattis secretary defens...
1    desperate unwind month nonstop work investigat...
2    nearly halfway presidential term donald trump ...
3    attempt make amends gross abuse power time int...
4    decry senate resolution blame crown prince bru...
Name: body, dtype: object

In [38]:
# Convert our preprocessed strings (normalized token sequence) to a matrix of token counts

vec = CountVectorizer(min_df = 0.06, max_df = 0.95)
X = vec.fit_transform(corpus['body'])

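# .get_feature_names_out() returns the vocabulary terms in column order
# (scikit-learn >= 1.0; the older .get_feature_names() was removed in 1.2)
countvec_df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())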
countvec_df.head()

Unnamed: 0,able,accord,account,accuse,act,action,actually,add,additional,administration,...,whole,win,woman,word,work,worker,world,write,year,yet
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0


Note that the output before converting to an array is a **sparse matrix**:

In [36]:
X

<1000x2780 sparse matrix of type '<class 'numpy.int64'>'
	with 110619 stored elements in Compressed Sparse Row format>

There are many zeros:
- Typical of a document-term matrix
- The Compressed Sparse Row (CSR) representation stores only the nonzero entries, which saves memory and speeds up computation.
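
To make the saving concrete, here is a minimal sketch with a small hand-made matrix (not the notebook's `X`). It counts the stored elements and computes the density, i.e. the fraction of entries that are nonzero.

```python
import numpy as np
from scipy import sparse

# Small toy count matrix that is mostly zeros (hypothetical values).
dense = np.zeros((4, 6), dtype=np.int64)
dense[0, 1] = 2
dense[2, 4] = 1
dense[3, 0] = 3

csr = sparse.csr_matrix(dense)

# nnz is the number of explicitly stored (nonzero) elements.
density = csr.nnz / (csr.shape[0] * csr.shape[1])
print(csr.nnz, density)

# Converting back recovers the original dense array.
print((csr.toarray() == dense).all())
```

The same `nnz / (rows * columns)` calculation can be applied to any sparse matrix returned by a vectorizer.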

In [39]:
countvec_df.shape

(1000, 482)

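Substantially smaller feature set now (482 terms after the `min_df`/`max_df` cutoffs):
- Some algorithms can handle count data with this many features for modeling purposes.
- The 2780 columns in the sparse-matrix summary above do not match this shape; judging by the cell numbers, that output likely comes from an earlier run (In [36]) with different cutoffs.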

An important thing to think about when engineering cutoffs: class imbalance

- if the imbalance is extreme, the cutoffs may remove relevant predictors for the minority document class.
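
A hedged illustration of the risk, using a made-up imbalanced corpus (8 majority documents, 2 minority documents) and an illustrative cutoff of 0.3. The term "parody" appears in every minority document, yet its overall document frequency falls below the cutoff.

```python
# Hypothetical, imbalanced toy corpus: (text, label) pairs.
docs = [("budget vote", 0)] * 8 + [("parody budget", 1)] * 2

n_docs = len(docs)
n_minority = sum(1 for _, label in docs if label == 1)

# Document frequency of "parody" overall and within the minority class.
df_overall = sum("parody" in text.split() for text, _ in docs) / n_docs
df_minority = sum(
    "parody" in text.split() for text, label in docs if label == 1
) / n_minority

min_df = 0.3  # illustrative cutoff, not the notebook's value
survives = df_overall >= min_df
print(df_overall, df_minority, survives)
```

The term is a perfect indicator of the minority class here, but a corpus-wide `min_df` of 0.3 would drop it.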

In [24]:
satire_df['target'].value_counts()

1    500
0    500
Name: target, dtype: int64

Not a problem here: the two classes are balanced (500 documents each).

Vectorization complete.

In [25]:
corpus.body

0      note resignation james mattis secretary defens...
1      desperate unwind month nonstop work investigat...
2      nearly halfway presidential term donald trump ...
3      attempt make amends gross abuse power time int...
4      decry senate resolution blame crown prince bru...
                             ...                        
995    britain opposition leader jeremy corbyn push a...
996    turkey take fight islamic state militant syria...
997    malaysia seek reparation goldman sachs group i...
998    israeli court sentence palestinian year impris...
999    least people die due landslide flood trigger t...
Name: body, Length: 1000, dtype: object

In [26]:
countvec_df

Unnamed: 0,able,accord,account,accuse,act,action,actually,add,additional,administration,...,whole,win,woman,word,work,worker,world,write,year,yet
0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
996,1,0,0,0,0,0,0,1,1,0,...,0,0,0,0,3,1,0,1,1,0
997,0,0,0,0,1,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
998,0,1,0,0,1,0,0,0,0,0,...,0,0,2,0,0,0,0,1,2,0


#### The TfidfVectorizer (Term Frequency Inverse Document Frequency)

An approach to weighting tokens by how rare or common they are in the corpus:
- we want to downweight words that are too common throughout the corpus.

- TF (Term Frequency):
    - count of the word in the document
    - divided by the total number of words in the document.

- IDF (Inverse Document Frequency):
    - how much information a word carries for telling documents apart

$$idf(w) = \log\left(\frac{\text{number of documents}}{\text{number of documents containing } w}\right)$$

**A word present in every document has $idf = \log(1) = 0$, so it is likely not useful for document differentiation**

**Putting together**: TF-IDF

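$$ w_{ij} = tf_{ij} \log\left(\frac{N}{df_{j}}\right) $$

where $tf_{ij}$ is the term frequency of term $j$ in document $i$, $N$ is the number of documents, and $df_j$ is the number of documents containing term $j$.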
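
The formula can be worked through by hand on a tiny, invented corpus. This is a sketch of the textbook definition given here (tf as a within-document proportion, idf as a plain log ratio); scikit-learn's implementation uses a smoothed variant and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ["senate", "vote", "tax"],
    ["senate", "nerd", "nerd"],
    ["senate", "tax", "budget"],
]
n_docs = len(toy_docs)

# df: number of documents containing each term.
df = Counter(term for doc in toy_docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

weights = [tfidf(doc) for doc in toy_docs]
print(weights[1])
```

"senate" appears in all three documents, so its weight is 0 everywhere, while "nerd" gets $(2/3)\log 3 \approx 0.73$ in the second document.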

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tf_vec = TfidfVectorizer()
X_tfidf = tf_vec.fit_transform(corpus['body'])

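# .get_feature_names_out() requires scikit-learn >= 1.0 (.get_feature_names() was removed in 1.2)
vec_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tf_vec.get_feature_names_out())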
vec_tfidf.head()

Unnamed: 0,aaaaaaah,aaaaargh,aaargh,aah,aap,aaron,ab,abandon,abandoning,abandonment,...,zone,zoo,zoom,zozovitch,zte,zuckerberg,zuercher,zych,zzzzzst,δημοκρατία
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
X_tfidf

<1000x18460 sparse matrix of type '<class 'numpy.float64'>'
	with 145229 stored elements in Compressed Sparse Row format>

Much larger matrix: with default settings, `TfidfVectorizer` does not throw away any features.

It just downweights terms that appear in many documents; rare terms receive a higher idf, not a lower one.
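
A short sketch on a made-up three-document corpus shows both points: every term is kept, and the learned `idf_` weights are lowest for the term found in every document. With default settings scikit-learn computes idf as $\ln\frac{1+N}{1+df} + 1$, so a term present in all documents gets an idf of exactly 1 rather than 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "senate" is in every document, "nerd" in only one.
toy_docs = ["senate vote tax", "senate nerd nerd", "senate tax budget"]

tf_toy = TfidfVectorizer()  # default settings: no min_df / max_df trimming
X_toy = tf_toy.fit_transform(toy_docs)

# Map each term to its learned idf weight.
idf = {term: tf_toy.idf_[col] for term, col in tf_toy.vocabulary_.items()}

print(X_toy.shape)
print(sorted(idf.items(), key=lambda kv: kv[1]))
```

`TfidfVectorizer` also accepts `min_df` and `max_df`, so the same cutoffs used with `CountVectorizer` can be applied if a smaller matrix is wanted.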

In [30]:
vec_tfidf.iloc[313].sort_values(ascending=False)[:10]

nerd        0.647733
power       0.374180
company     0.229251
billion     0.140334
facebook    0.128603
ruthless    0.125814
control     0.105062
evil        0.103523
people      0.094816
fine        0.093608
Name: 313, dtype: float64

Let's compare the tfidf to the count vectorizer output for one document.

In [43]:
countvec_df.iloc[313].sort_values(ascending=False)[:10]

power          18
company        10
people          7
control         5
also            4
trade           3
commission      3
information     3
use             3
data            3
Name: 313, dtype: int64

In [16]:
vec_tfidf.iloc[313].sort_values(ascending=False)[:10]

nerd        0.647733
power       0.374180
company     0.229251
billion     0.140334
facebook    0.128603
ruthless    0.125814
control     0.105062
evil        0.103523
people      0.094816
fine        0.093608
Name: 313, dtype: float64

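The tfidf downweighted common words:
- "also" drops out of the top 10; it could arguably have gone on the stopword list.
- "nerd" gets more weight than "power" (factoring in both count and idf). It is absent from the count output, presumably because the `min_df = 0.06` cutoff removed it from that vocabulary.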
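
For a side-by-side view, the two rankings can be joined into one table. The sketch below copies a few of the top values for document 313 from the outputs above rather than recomputing them, so a `NaN` only means the term was not among the entries copied here, not that its value is zero.

```python
import pandas as pd

# Values copied from the two outputs above for document 313 (top entries only).
top_counts = pd.Series(
    {"power": 18, "company": 10, "people": 7, "control": 5, "also": 4}
)
top_tfidf = pd.Series(
    {"nerd": 0.647733, "power": 0.374180, "company": 0.229251,
     "control": 0.105062, "people": 0.094816}
)

# Outer-join the two Series on their term index and rank by tf-idf weight.
side_by_side = pd.concat([top_counts, top_tfidf], axis=1, keys=["count", "tfidf"])
side_by_side = side_by_side.sort_values("tfidf", ascending=False)
print(side_by_side)
```

In the notebook itself the same pattern would apply to `countvec_df.iloc[313]` and `vec_tfidf.iloc[313]` directly.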