# Vectorization
Let's start off where we left off last time: simple classification using a bag-of-words. Today we will only use multinomial naive Bayes, to keep the focus on the vectorization.

**Task: go through the script in your study groups. Read through it and make sure you understand how it works as you go.**

If you find any of this hard to interpret or difficult, remember that I will be in class or on Zoom if you need help.

In [48]:
# add classification script to path (as well as data)
import sys
import os
path = os.path.join("..", "class_05")  # create path - will be different depending on mac vs windows

sys.path.append(path)  # add path

# create path for imdb dataset
imdbpath = os.path.join("..", "class_05", "imdb")
print(imdbpath)

../class_05/imdb


In [2]:
from classification import read_imdb

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# read data
imdb = read_imdb(imdbpath)

# train test split
X_train, X_test, y_train, y_test = train_test_split(imdb.text, imdb.tag)
text_clf = Pipeline([('vect', CountVectorizer()), 
                     ('clf', MultinomialNB())])

# train model
text_clf.fit(X_train, y_train)

# estimate performance
predictions = text_clf.predict(X_test)

acc = sum(predictions == y_test)/len(y_test)
print(f"Our model obtained a performance accuracy of {acc}")


Our model obtained a performance accuracy of 0.81


---
Now let us try to add our own tokenizer:

In [20]:
# a simple test set:
doc = ["NLP is very very fun", 
       "NLP teachers are fun",
       "a teacher is a person"]

from text_preprocessor import Text

# define a wrapper function which only returns tokens and handles list
def tokenization_wrapper(txt):
    TextObject = Text(txt)
    TextObject.tokenize(method="nltk")
    tokens = TextObject.get_tokens()
    return tokens


vectorizer = CountVectorizer(tokenizer=tokenization_wrapper, lowercase=False)
bow = vectorizer.fit_transform(doc)
bow.todense()
# TASK: Change the above code to use your lemmatization function instead of pure tokenization. What would you expect to change?

matrix([[1, 0, 0, 1, 1, 0, 0, 0, 2],
        [1, 0, 1, 1, 0, 0, 0, 1, 0],
        [0, 2, 0, 0, 1, 1, 1, 0, 0]])
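
To see what the custom tokenizer changes, it can help to compare vocabularies. The sketch below does not use the course's `text_preprocessor` module; `whitespace_tokenizer` is a hypothetical stand-in that simply splits on whitespace, and `vocabulary` is a small helper introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["NLP is very very fun",
       "NLP teachers are fun",
       "a teacher is a person"]


def whitespace_tokenizer(txt):
    """Hypothetical stand-in for tokenization_wrapper: split on whitespace."""
    return txt.split()


def vocabulary(vectorizer):
    """Return the fitted vocabulary ordered by column index."""
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


# custom tokenizer, no lowercasing (as in the cell above)
custom = CountVectorizer(tokenizer=whitespace_tokenizer, lowercase=False).fit(doc)
# default settings: lowercasing and a token pattern that needs 2+ characters
default = CountVectorizer().fit(doc)

custom_vocab = vocabulary(custom)
default_vocab = vocabulary(default)
print(custom_vocab)
print(default_vocab)
```

With the default settings the one-letter word "a" is dropped and "NLP" is lowercased, which is why the default vectorizer yields eight columns while the custom tokenizer yields nine.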

---
You can make this matrix more visually appealing (and easier to understand) using this trick:

In [21]:
import pandas as pd
pd.DataFrame(bow.todense(), columns=vectorizer.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names()
# Notice that teacher and teachers are two different tokens. Is this problematic?
# TASK: the visualization code is a bit hard to read. Make a function of the form:
# vizualize(bow, vectorizer)
# which produces the table below

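| | NLP | a | are | fun | is | person | teacher | teachers | very |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |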


---
We now want to look a bit more into what goes on in the vectorization

In [5]:
# using 1-grams and bigrams:
vectorizer = CountVectorizer(ngram_range=(1, 2))

bow = vectorizer.fit_transform(doc)
bow.todense()
# TASK change it to only use 2-grams and 3-grams

matrix([[0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 1, 1],
        [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]])
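
The 17 columns of this matrix are hard to read without their names. As a small sketch (reusing the same `doc` and reading the names from the fitted `vocabulary_` attribute), the columns can be listed like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["NLP is very very fun",
       "NLP teachers are fun",
       "a teacher is a person"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
bow = vectorizer.fit_transform(doc)

# vocabulary_ maps each n-gram to its column index; sort by that index
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for index, feature in enumerate(features):
    print(index, feature)
```

Note that the n-grams are built after tokenization. Since the default tokenizer drops the one-letter word "a", the third document produces the bigram "is person", even though the two words are not adjacent in the text.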

In [6]:
# to use stopwords
from nltk.corpus import stopwords
stopwordlist = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopwordlist)
bow = vectorizer.fit_transform(doc)
bow.todense()
# TASK: which words were removed using the stopword list? Might these be meaningful?

matrix([[1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 1, 0]])
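
One way to approach the task above is to compare the vocabulary with and without stop word removal. The sketch below avoids the NLTK download by using a short hand-written list; `custom_stopwords` is an assumption made for illustration and is not NLTK's list.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["NLP is very very fun",
       "NLP teachers are fun",
       "a teacher is a person"]

# assumption: a tiny hand-written stop word list, not the NLTK one
custom_stopwords = ["is", "are", "very"]

with_stopwords = CountVectorizer(stop_words=custom_stopwords).fit(doc)
without_stopwords = CountVectorizer().fit(doc)

# words present in the full vocabulary but missing after filtering
removed = sorted(set(without_stopwords.vocabulary_) - set(with_stopwords.vocabulary_))
kept = sorted(with_stopwords.vocabulary_)
print("removed:", removed)
print("kept:", kept)
```

The same set difference works with `stopwordlist` from the cell above once the NLTK stop word corpus is available.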

In [7]:
# another way to remove stopwords
vectorizer = CountVectorizer(max_df=0.9) # remove words which appear in more than 90% of the documents
bow = vectorizer.fit_transform(doc)
bow.todense()

matrix([[0, 1, 1, 1, 0, 0, 0, 2],
        [1, 1, 0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 1, 1, 0, 0]])

In [8]:
# To make binary features (is the word there or not)
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(doc)
bow.todense()


matrix([[0, 1, 1, 1, 0, 0, 0, 1],
        [1, 1, 0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 1, 1, 0, 0]])

In [9]:
# remove words which appear in fewer than min_df documents (here: words which appear in only one document)
vectorizer = CountVectorizer(min_df=2)
bow = vectorizer.fit_transform(doc)
bow.todense()

# min_df can also be a float between 0 and 1. If so, it is a proportion of the documents

matrix([[1, 1, 1],
        [1, 0, 1],
        [0, 1, 0]])
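
Both `min_df` and `max_df` work on the document frequency, i.e. the number of documents a word appears in. A minimal sketch of that computation, assuming the same `doc` as above (the variable names are introduced here for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

doc = ["NLP is very very fun",
       "NLP teachers are fun",
       "a teacher is a person"]

# binary=True gives 1 if the word occurs in the document, else 0
binary_vectorizer = CountVectorizer(binary=True)
presence = binary_vectorizer.fit_transform(doc)
terms = sorted(binary_vectorizer.vocabulary_, key=binary_vectorizer.vocabulary_.get)

# summing over documents gives the document frequency of each word
doc_freq = np.asarray(presence.sum(axis=0)).ravel()
df_by_term = {term: int(count) for term, count in zip(terms, doc_freq)}
n_docs = presence.shape[0]

kept_min_df = [t for t in terms if df_by_term[t] >= 2]                # min_df=2
removed_max_df = [t for t in terms if df_by_term[t] > 0.9 * n_docs]   # max_df=0.9
print(df_by_term)
print(kept_min_df, removed_max_df)
```

In this tiny corpus no word appears in all three documents, so `max_df=0.9` removes nothing, while `min_df=2` keeps only the three words that occur in two documents.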

---
# Tf-Idf

tf-idf for a specific term (word) $t$ in a given document $d$, as you probably know from the lectures, is calculated using the following formula:

$\text{tf-idf}(t, d) = tf(t, d) \cdot idf(t)$

where $tf(t, d)$ is the frequency of term $t$ in the document $d$, and $idf(t)$ is the inverse document frequency of the term $t$. In sklearn, $idf$ is calculated as follows:

$idf(t) = \log \frac{1 + n}{1 + df(t)} + 1$

where $n$ is the total number of documents and $df(t)$ is the number of documents that contain the term $t$.

It is then normalized using the L2 norm (the Euclidean norm, for all the math people out there). This simply makes sure that the length of the vector for each document is one. Each document vector $v$ is divided by its norm:

$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$

As an example, the norm of $v$ and the normalized vector would be:

$
  \begin{align}
    ||v||_2 &= ||\begin{bmatrix}
           1 \\
           1 \\
           0 \\
         \end{bmatrix}||_2 = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2} \\
    v_{norm} &= \frac{1}{\sqrt{2}} \begin{bmatrix}
           1 \\
           1 \\
           0 \\
         \end{bmatrix}
  \end{align}
  $
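
The normalization can also be checked numerically. A minimal sketch with NumPy, using the example vector from above:

```python
import numpy as np

v = np.array([1.0, 1.0, 0.0])

norm = np.linalg.norm(v)   # Euclidean (L2) norm: sqrt(1 + 1 + 0)
v_normalized = v / norm    # dividing by the norm gives a vector of length one

print(norm)
print(v_normalized)
print(np.linalg.norm(v_normalized))
```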


The tf-idf in sklearn uses the following default parameters:
```python
norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False
```

where `norm` sets which type of normalization to use and `use_idf` toggles whether tf should be multiplied by idf or not.
`smooth_idf` is the `+1` in the numerator and denominator of the fraction, i.e. setting it to `False` would give the following:

$idf(t) = \log \frac{n}{df(t)} + 1$

This is similar to the add-one smoothing from the naive Bayes assignment: it acts as if one extra document containing every term had been seen, which prevents division by zero.
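
To make the formulas concrete, the following sketch recomputes tf-idf by hand with NumPy and compares it to `TfidfVectorizer` with its default parameters. It assumes the three-document `doc` used earlier; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

doc = ["NLP is very very fun",
       "NLP teachers are fun",
       "a teacher is a person"]

counts = CountVectorizer().fit_transform(doc).toarray()  # tf(t, d)
n = counts.shape[0]                                      # number of documents
df = (counts > 0).sum(axis=0)                            # df(t)

idf = np.log((1 + n) / (1 + df)) + 1                     # smooth_idf=True
weighted = counts * idf                                  # tf(t, d) * idf(t)
# L2-normalize each row (document) to length one
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(doc).toarray()
print(np.allclose(manual, reference))

# without smoothing: idf(t) = log(n / df(t)) + 1
idf_unsmoothed = np.log(n / df) + 1
reference_unsmoothed = TfidfVectorizer(smooth_idf=False).fit(doc).idf_
print(np.allclose(idf_unsmoothed, reference_unsmoothed))
```

Note that `np.log` is the natural logarithm, which is also what sklearn uses here.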

Interestingly, tf-idf can also be derived using information theory as **the amount of information gained from seeing the word, weighted by your probability of seeing it**. It is thus reasonable that it is good for performance. Not only that, it also fits nicely into cognitive theories such as the Bayesian brain and the free energy principle.

Read a lot more on tf-idf in sklearn [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

There are two ways to do tf-idf in scikit-learn: either using the `TfidfVectorizer`, or transforming the output of the `CountVectorizer` using `TfidfTransformer`. I will just be using the first one for simplicity. You can pass `TfidfVectorizer` all the same arguments as we have used for the `CountVectorizer`.
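
As a quick check that the two routes give the same result, here is a small sketch using default parameters on the toy `doc` from earlier:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import make_pipeline

doc = ["NLP is very very fun",
       "NLP teachers are fun",
       "a teacher is a person"]

# route 1: everything in one step
one_step = TfidfVectorizer().fit_transform(doc)

# route 2: count first, then reweight the counts
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(doc)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The two-step route can be handy when you already have a count matrix and want to try tf-idf on top of it without tokenizing the text again.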


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
ti = tfidf.fit_transform(doc)
ti.todense()
pd.DataFrame(ti.todense(), columns=tfidf.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names()

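| | are | fun | is | nlp | person | teacher | teachers | very |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.31757 | 0.31757 | 0.31757 | 0.0 | 0.0 | 0.0 | 0.835133 |
| 1 | 0.562829 | 0.428046 | 0.0 | 0.428046 | 0.0 | 0.0 | 0.562829 | 0.0 |
| 2 | 0.0 | 0.0 | 0.47363 | 0.0 | 0.622766 | 0.622766 | 0.0 | 0.0 |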


In [11]:
# we can add all the previous stuff as well
tfidf = TfidfVectorizer(tokenizer=tokenization_wrapper,
                        stop_words=stopwordlist,
                        min_df=1,
                        max_df=0.8,
                        ngram_range=(1,2),
                        lowercase=False,
                        binary=False
                        )
ti = tfidf.fit_transform(doc)
ti.todense()
pd.DataFrame(ti.todense(), columns=tfidf.get_feature_names_out())  # on scikit-learn < 1.0 use get_feature_names()

| | NLP | NLP fun | NLP teachers | fun | person | teacher | teacher person | teachers | teachers fun |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.517856 | 0.680919 | 0.0 | 0.517856 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.373022 | 0.0 | 0.490479 | 0.373022 | 0.0 | 0.0 | 0.0 | 0.490479 | 0.490479 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 | 0.0 | 0.0 |


# Dimensionality reduction
As the matrix for language classification typically becomes quite large, one idea is to reduce its dimensionality. Commonly used techniques for reducing dimensionality include principal component analysis (PCA) and singular value decomposition (SVD). Note that PCA was introduced in the 2nd-semester Experimental Methods course, so you should be familiar with it (alternatively, there is also a video in the highly recommended readings for class). PCA is also used in computing the intelligence metric, the g-factor. PCA is calculated using SVD which, for the mathematically inclined, generalizes the eigendecomposition (finding eigenvalues and eigenvectors) to matrices that are not diagonalizable, including non-square ones.

Other approaches include latent semantic analysis (LSA) which, similar to PCA, is built upon SVD. LSA was later reinterpreted in a probabilistic framework (pLSA), which was in turn generalized into Latent Dirichlet Allocation (LDA), the topic model we will go into in week 44. LSA is also more efficient for tf-idf and tf matrices, since it can work directly on the sparse matrix without converting it to a dense one.
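
The link between PCA and SVD can be illustrated on a small random matrix. This is a sketch under the assumption of made-up data (`X` is random and only serves as an example): PCA amounts to an SVD of the mean-centered data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # made-up data: 20 samples, 5 features

# PCA centers the data first, then decomposes it with SVD
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# the squared singular values give the variance along each component
explained_variance = s ** 2 / (X.shape[0] - 1)

pca = PCA(n_components=5).fit(X)
print(np.allclose(explained_variance, pca.explained_variance_))
# the components agree up to the sign of each row
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))
```

`TruncatedSVD` (used for LSA) skips the centering step, which is what allows it to operate on sparse matrices directly.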



In [12]:
from sklearn.decomposition import PCA

tfidf = TfidfVectorizer(ngram_range=(1,3))
ti = tfidf.fit_transform(doc)
pca = PCA(n_components=3) # should be bigger for real data
output = pca.fit_transform(ti.toarray())  # PCA needs a dense array (toarray() gives a plain ndarray; recent scikit-learn versions reject the np.matrix from todense()), but densifying is computationally heavy (thus the LSA)

print("the explained variance in the dataset of each of the three components:")
print(pca.explained_variance_.round(3)) 

the explained variance in the dataset of each of the three components:
[0.503 0.437 0.   ]


In [18]:
from sklearn.decomposition import TruncatedSVD as LSA
lsa = LSA(n_components=3) # should be bigger for real data
output = lsa.fit_transform(ti)  # notice the lack of todense

print("the explained variance in the dataset of each of the three components:")
print(lsa.explained_variance_.round(2)) 
print(sum(lsa.explained_variance_)) 

print("\nthe reduced matrix:")
print(output)

print("\nthe previous matrix:")
print(ti.todense())
# notice how the tf-idf matrix (like the bow matrices) contains a lot of zeros, which indicates that the information can typically be reduced dramatically.


the explained variance in the dataset of each of the three components:
[0.15 0.22 0.2 ]
0.580812570292299

the reduced matrix:
[[ 8.32569347e-01  0.00000000e+00  5.53920828e-01]
 [ 8.32569347e-01 -1.14946747e-15 -5.53920828e-01]
 [ 9.57011382e-16  1.00000000e+00 -6.36713973e-16]]

the previous matrix:
[[0.51785612 0.68091856 0.         0.51785612 0.         0.
  0.         0.         0.        ]
 [0.37302199 0.         0.49047908 0.37302199 0.         0.
  0.         0.49047908 0.49047908]
 [0.         0.         0.         0.         0.57735027 0.57735027
  0.57735027 0.         0.        ]]


---
# Tasks
1) Putting it all together. Create a pipeline which does the vectorization, the dimensionality reduction and the classification using the imdb dataset. Set the original test set aside as a validation set and make a new train/test split.
Here the train set is for training on, the test set is for testing performance while you develop, and the validation set is for validating the final performance (which we will do in 7).

2) Change the preprocessing to get a better performance (don't spend too long here, just a little bit of tuning).

3) Apply the same model to the 20 newsgroups data (see next block on how to read it in). Again split the train set into a train and a test set. Leave the actual test set (loaded by setting `subset="test"`), i.e. the validation set, for 7.

4) Apply the same model to the SMS spam data from class 4 (spam.csv). Remember to split it into a validation set and a train set, and then again split the train set into a train and a test set.

5) When you fine-tune your model now, do you see that the preprocessing choices you made were only good for specific types of datasets, or did they generalize well? Can you find general trends in what works better than others? E.g. is removing stopwords a good idea? Does tf-idf outperform bag-of-words? Find 5 such 'truths' and post them on element.

6) Finally, fine-tune your models. Optimize the vectorization as much as you want to. Use cross-validation or grid search on the train/test set until you are satisfied (it does not have to be the same preprocessing steps for all three datasets anymore).

7) Finally, apply your models to the validation sets and report your models' performance on element. It should have the form:

*"We obtained a performance of XX% on validation set of (the dataset) using (your model specifications)..."*
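
As a starting point for the splitting and tuning in tasks 1, 6 and 7, here is a rough sketch on a made-up toy corpus. The texts, labels and parameter grid are assumptions for illustration only, it leaves out the dimensionality reduction step, and it uses cross-validation on the development data in place of a single train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up toy data, standing in for imdb.text and imdb.tag
pos = ["great fun movie", "fun and great acting", "great story fun cast",
       "really great film", "fun film great ending", "great fun plot",
       "a fun great watch", "great cast fun script"]
neg = ["awful boring movie", "boring and awful acting", "awful story boring cast",
       "really awful film", "boring film awful ending", "awful boring plot",
       "a boring awful watch", "awful cast boring script"]
texts = pos + neg
labels = ["pos"] * len(pos) + ["neg"] * len(neg)

# set a validation set aside first; it is only touched at the very end
X_dev, X_val, y_dev, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

pipeline = Pipeline([("vect", TfidfVectorizer()),
                     ("clf", MultinomialNB())])

# step names are prefixed with "vect__" to reach the vectorizer's parameters
param_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
              "vect__use_idf": [True, False]}

# 3-fold cross-validation on the development data replaces one train/test split
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_dev, y_dev)

val_accuracy = search.score(X_val, y_val)
print(search.best_params_)
print(f"accuracy on the validation set: {val_accuracy}")
```

If you add a `TruncatedSVD` step to the pipeline, keep in mind that its output can contain negative values, which `MultinomialNB` does not accept, so you may need a different classifier for that variant.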

In [51]:
# your code here (or in another document)


---
## Reading in the 20 newsgroups data

In [50]:
from sklearn.datasets import fetch_20newsgroups

# select which categories to load
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', # only load the training set
                                  categories=categories)

# transform the data (not necessary but nice)
df = pd.DataFrame({"text": twenty_train.data, "targets": twenty_train.target})
# target indices refer to twenty_train.target_names (sorted alphabetically), not to the order of `categories`
df["category"] = df.targets.apply(lambda x: twenty_train.target_names[x])

# examine the data (first five rows)
df.head()

# SUGGESTION: loading a dataset like this can be hard to read later on. Wrapping it all into a function, e.g. read_20_news, makes the code easy to read and easy to reuse later.

| | text | targets | category |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier)\nSubj... | 1 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\... | 1 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson)\nSu... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart)\nSubjec... | 3 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian |
