<center><b>DIGHUM101</b></center>
<center>5-4: Classification</center>

---

# Learning objectives

- Build a logistic regression text classifier
- Evaluate the classifier with k-fold cross-validation

In [None]:
# Import libraries

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np 
import os
import nltk
nltk.download("movie_reviews")
from nltk.corpus import movie_reviews # Note we're importing the dataset here!
import pandas as pd
import re # regular expressions
import seaborn as sns
# CV (multiple train/test splitting)
from sklearn.model_selection import cross_val_score, train_test_split
# Regression
from sklearn.linear_model import LogisticRegression
# Tools to create our DTMs
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
# Speed up your machine learning setup
from sklearn.pipeline import Pipeline
# Mix up our training and test sets
from sklearn.utils import shuffle

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

# Classification of Textual Data

How can we translate this simple model of binary classification to text? Let's look at a corpus from `nltk` and build our own classifier using sklearn's machine learning `Pipeline`.

In [None]:
# Extract our x (reviews) and y (judgements) variables
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
judgements = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]

In [None]:
# Save in a dataframe
movies = pd.DataFrame({"Reviews" : reviews, 
                      "Judgements" : judgements})

In [None]:
print(movies.shape)
movies.head()

In [None]:
movies.Reviews[0]

As you might expect, this is a corpus of movie reviews, originally collected from IMDb. Each review has been labeled as either "positive" or "negative". The task we have before us is to create a model that can accurately predict whether a never-before-seen review is positive or negative.


## Some basic recap of getting counts, count vectors etc.

In [None]:
# Create a list of tokens per document, and a flat list of all tokens
tokens_list = [m.split() for m in movies.Reviews]
tokens_flat = [token for sublist in tokens_list for token in sublist]

In [None]:
# How many total tokens?
len(tokens_flat)

In [None]:
# How many unique tokens (terms)?
uniq_tokens = list(set(tokens_flat))
len(uniq_tokens)

In [None]:
'world' in 'dsjksdfjkfsd world jhsdzjds'

In [None]:
int('world' in 'dsjksdfjkfsd world jhsdzjds')

In [None]:
# In how many reviews does the string 'world' appear?
# Note: `in` matches substrings, so a review containing only 'worldwide' counts too
sum([int('world' in d) for d in movies.Reviews])
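
Because `in` on a string tests for a substring rather than a whole word, the count above can be higher than the number of reviews that contain the token itself. The sketch below contrasts the two on a made-up pair of sentences (the `docs` list is hypothetical, not taken from the corpus), using the same whitespace `split()` as our tokenizer.

```python
# Hypothetical example texts, not taken from the movie_reviews corpus
docs = ["a worldwide hit", "the world of cinema"]

# Substring test: "world" is found inside "worldwide" as well
substring_hits = sum(int("world" in d) for d in docs)

# Token test: split on whitespace first, then look for the exact token
token_hits = sum(int("world" in d.split()) for d in docs)

print("substring matches:", substring_hits)
print("token matches:", token_hits)
```

Which of the two is appropriate depends on the question being asked; for word counts, the token test is usually the one we want.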

In [None]:
# Creating one row of our own doc-term matrix: the term counts for the first review
vec_doc0 = [tokens_list[0].count(term) for term in uniq_tokens]
print(vec_doc0)
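
The same idea extends to every document. As a minimal sketch on a made-up mini-corpus (the `docs` list is hypothetical), `collections.Counter` counts each document once, and sorting the vocabulary keeps the column order stable, which a plain `set` does not guarantee between sessions.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for movies.Reviews
docs = ["good film good cast", "bad film", "good plot bad cast"]

tokens_per_doc = [d.split() for d in docs]
# Sort the vocabulary so the column order is the same on every run
vocab = sorted(set(tok for toks in tokens_per_doc for tok in toks))

# One Counter per document, then one row of counts per document
dtm = []
for toks in tokens_per_doc:
    counts = Counter(toks)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```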

# Defining our x and y

We already extracted the reviews and judgements from the `movie_reviews` object and stored them in the `movies` dataframe. Let's look at one review and its judgement:

In [None]:
print("The judgement for this review is:", movies.Judgements[1])
print()
print(movies.Reviews[1][:500])

So right now we have a dataframe with the movie reviews in the `Reviews` column and their corresponding judgements in the `Judgements` column. Awesome. What does this sound like to you? Independent and dependent variables? You'd be right! `Reviews` is our x variable. `Judgements` is our y variable.

# Exploring TF-IDF scores and classes


Let's use `.sample` to randomize our movies (which are currently sorted with all negative reviews first, then all positive reviews). We also slice the result to keep only the first 50 rows (just to make the visualization a bit easier to inspect).

In [None]:
sub_movies = movies.sample(frac=1,random_state=11)[:50]

# sorting again by judgements
sub_movies = sub_movies.sort_values(by='Judgements', ascending=True)

In [None]:
sub_movies

Let's see how many positive and negative reviews we have now.

In [None]:
sub_movies.Judgements.value_counts()

Time to run the TF-IDF vectorizer on our little dataset.

In [None]:
vectorizer = TfidfVectorizer(min_df=10,stop_words = "english")
tfidf = vectorizer.fit_transform(sub_movies.Reviews)

# turning our sparse into a dense array
tfidf = tfidf.toarray()

# getting the names of our output features
feature_names = vectorizer.get_feature_names_out()

In [None]:
tfidf

The above array shows the TF-IDF values for all our documents (one row each) and all the words in our vocabulary (one column each). Using matplotlib's `imshow` we can visualize this array. The lighter dots correspond to higher tf-idf values.

In [None]:
plt.figure(figsize=(10,50))
_ = plt.imshow(tfidf);

In [None]:
plt.figure(figsize=(10,5))
_ = plt.imshow(tfidf[:,30:40]);

In [None]:
# Getting the array for ID of the relevant word
tfidf[:, 31]

In [None]:
# ...and get the associated word for this ID
feature_names[31]

In [None]:
_2 = plt.imshow(tfidf[:,70:80])

In [None]:
tfidf[:, 74]

In [None]:
feature_names[74]

# Confidence scores

The next thing we can do is build a model on our training data, then look at the confidence scores that it produces for our test set.

In [None]:
# Creating a list of 1s and 0s instead of "pos" and "neg", this will be helpful later
int_list = []
for each in movies.Judgements:
    if each == "pos":
        int_list.append(1)
    else:
        int_list.append(0)

movies['Judgements_int'] = int_list

Let's first reassign x and y for simplicity. While we're at it, we're going to set a `random_state`, which fixes the seed of the random number generator. Remember that this makes our result reproducible. We'll also `shuffle` to randomize the order of our observations, so that the split into training and testing data isn't biased by the original ordering. However, start learning about [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling) and when you should use it instead of `shuffle`!

In [None]:
x, y = shuffle(np.array(movies.Reviews), np.array(movies.Judgements_int), random_state = 1)

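# Split the data (fixing random_state keeps the split reproducible)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 1)

# Vectorize the data: fit on the training set only, so that no information
# from the test set leaks into the vocabulary or the IDF weights
vectorizer = TfidfVectorizer(min_df=5,stop_words = "english")
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)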

feature_names = vectorizer.get_feature_names_out()
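
For comparison, here is a minimal sketch of a stratified split, which keeps the proportion of positive and negative labels the same in the training and test parts. It uses the `stratify` argument of `train_test_split`; the toy arrays `x_toy` and `y_toy` are hypothetical stand-ins for our reviews and judgements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 documents, 20 negative (0) and 20 positive (1)
x_toy = np.array([f"review {i}" for i in range(40)])
y_toy = np.array([0] * 20 + [1] * 20)

# stratify=y_toy keeps the class proportions the same in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```

With balanced classes like ours the difference from a plain shuffle is small, but it matters more when one class is rare.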

In [None]:
# Run logistic regression
logit_class = LogisticRegression(random_state = 0, penalty = "l2", C = 1000, max_iter=1000)
model = logit_class.fit(x_train, y_train)

model.score(x_test, y_test)

Note that we specified an L2 penalty (regularization) because we have many more independent variables from our `dtm` (i.e., words) than we have observations. In `sklearn`, `C` is the inverse of the regularization strength, so `C = 1000` applies only a weak penalty; smaller values regularize more strongly. For more info about "solver" algorithms (that are used to minimize the cost function), see [here](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions/52388406#52388406).

Let's show the confidence scores for our test data - the first 10 rows.

In [None]:
model.predict_proba(x_test)[:10]

We can force these into 1s and 0s (essentially recreating the classification) as follows:

In [None]:
y_predict = [int(p[1] > 0.5) for p in model.predict_proba(x_test)]
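
For a binary logistic regression, the probability of the positive class is the logistic (sigmoid) function applied to a linear score, so thresholding the probability at 0.5 is the same as checking whether that score is positive. A minimal sketch with made-up numbers (the `scores` values are hypothetical, not output from our model):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (w . x + b) for four documents
scores = [-2.0, -0.1, 0.3, 4.0]

probs = [sigmoid(z) for z in scores]
labels_from_prob = [int(p > 0.5) for p in probs]
labels_from_score = [int(z > 0) for z in scores]

for z, p, label in zip(scores, probs, labels_from_prob):
    print(f"score={z:+.1f}  P(positive)={p:.3f}  predicted={label}")
```

Scores close to zero give probabilities close to 0.5, which is where the model is least confident.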

Now let's have a look at a comparison between our predicted y-values and the actual y-values of our test data.

In [None]:
y_predict[:20]

In [None]:
y_test[:20]
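
Rather than comparing the two lists by eye, we can count the agreements. The sketch below does this for two short made-up lists (`y_pred_toy` and `y_true_toy` are hypothetical); the resulting accuracy is the same kind of number that `model.score` reports for a classifier.

```python
# Hypothetical predictions and true labels (1 = positive, 0 = negative)
y_pred_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_true_toy = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

# Accuracy: share of documents whose predicted label matches the true label
accuracy = (true_pos + true_neg) / len(pairs)
print(true_pos, true_neg, false_pos, false_neg, accuracy)
```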

# Cross-validated pipeline

For the next example, we'll do the same, but this time in a "pipeline", which allows us to assemble the several steps we've been going over. Check out the [pipeline documentation here](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).


We'll also look at k-fold cross-validation: this procedure is a standard method for estimating the performance of a machine learning algorithm on a dataset.
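
To make the idea concrete, here is a minimal sketch of how k-fold splitting divides the observations: every observation lands in the test set exactly once, and in the training set the other k - 1 times. The helper `k_fold_indices` is a hypothetical illustration, not the `sklearn` implementation (which, for classifiers, additionally stratifies the folds by class).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

A model is then fit on each training part and scored on the matching test part, giving k accuracy scores instead of one.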

Let's start with shuffling the data again.

In [None]:
x, y = shuffle(np.array(movies.Reviews), np.array(movies.Judgements), random_state = 1)

With `sklearn`'s text pipelines, we can quickly build a text classifier in only a few lines of Python. 

This pipeline does three things in a row:

1. `CountVectorizer`

2. `TfidfTransformer`

3. `LogisticRegression`

Let's walk through them step by step once more.

1. `CountVectorizer` does the same as before. It lowercases and tokenizes each text, and then counts how often each word occurs in each document. The feature array for each document at this point has one entry per unique word in the corpus, holding that word's count in the document. This is the most basic way to provide features for a classifier - a document term matrix!

2. Remember that tfidf (term frequency inverse document frequency) is a weighting scheme that aims to find words that are important to specific documents. It does this by taking the term frequency (tf) for a specific term in a specific document, and multiplying it by the term's inverse document frequency (idf), which is based on the total number of documents divided by the number of documents that contain the term at least once (`sklearn` uses a smoothed logarithm of this ratio). `TfidfTransformer` transforms the count matrix produced by `CountVectorizer` into a tf-idf representation.

A tfidf value is calculated for each term for each document. The feature array for a document now holds its tfidf values.

> Remember! The tfidf matrix is similar to our document term matrix, only now the values have been weighted according to their distribution across documents.

The pipeline then sends these tfidf feature arrays to 

3. `LogisticRegression`, which we covered in notebook 4-3.
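
The tf-idf weighting from step 2 can be sketched by hand. The example below is an illustration on a made-up mini-corpus (the `docs` list is hypothetical); it assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `sklearn` documents as its defaults.

```python
import math
from collections import Counter

# Hypothetical mini-corpus
docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]
vocab = sorted(set(tok for toks in tokenized for tok in toks))
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in toks for toks in tokenized) for term in vocab}

# Smoothed idf (assumed to match sklearn's default settings)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

tfidf_rows = []
for toks in tokenized:
    counts = Counter(toks)
    row = [counts[term] * idf[term] for term in vocab]
    # L2-normalize each document vector so that its length is 1
    norm = math.sqrt(sum(v * v for v in row))
    tfidf_rows.append([v / norm for v in row])

print(vocab)
print([round(v, 3) for v in tfidf_rows[1]])
```

Note how "the", which occurs in every document, receives the lowest idf, while "dog", which occurs in only one, receives the highest.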


In [None]:
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2),min_df=5,stop_words = "english")),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(random_state = 0, penalty = "l2", C = 1000, max_iter=200))
                     ])

Remember training/test splitting? Let's do this `cv = 20` times! Read [here](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) for more info on cross-validation. For now, we will use it to measure how much our model's accuracy varies from one split to the next.

In [None]:
# Get the cross-validation scores (one accuracy score per fold)
scores = cross_val_score(text_clf, x, y, cv = 20)

# Print the scores, and the mean of the scores
print(scores, np.mean(scores))

# Print the standard deviation to see degree of variance in the results obtained by our model
print(scores.std())


As you can see, the different train/test splits result in slightly different accuracy scores. However,
the standard deviation of these scores is only about 3.5%, so the model's performance is fairly stable across splits; in other words, the accuracy we obtained on a single test set was not just a lucky draw.
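
Standard deviation and variance are easy to mix up: the standard deviation is in the same units as the accuracy scores, while the variance is its square. A minimal sketch with made-up numbers (the `fold_scores` are hypothetical, not results from this notebook):

```python
import statistics

# Hypothetical accuracy scores from five folds (not results from this notebook)
fold_scores = [0.84, 0.88, 0.86, 0.90, 0.82]

mean_score = statistics.fmean(fold_scores)
# pstdev/pvariance divide by n, like numpy's default .std() and .var()
std_score = statistics.pstdev(fold_scores)
var_score = statistics.pvariance(fold_scores)

print(f"mean={mean_score:.3f}  std={std_score:.4f}  variance={var_score:.6f}")
```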

# Important Features

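After we train the model, we can look up the words with the largest positive and negative coefficients (i.e., the ones that contribute most to predicting the outcome class) to get the most helpful features.

The cell below uses the `model` and `feature_names` from the logistic regression we fit in the "Confidence scores" section. For a fitted pipeline, the `named_steps` attribute, which is indexed with a step name like a dict, gives access to each of the steps; that is how we would get the feature names from the `CountVectorizer` we used in the pipeline. Note that `cross_val_score` fits copies of the pipeline, so `text_clf` itself would need a call to `.fit` first.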

In [None]:
top25pos = np.argsort(model.coef_[0])[-25:]
print("Top features for positive reviews:")
print(list(feature_names[j] for j in top25pos))
print()
print("Top features for negative reviews:")
top25neg = np.argsort(model.coef_[0])[:25]
print(list(feature_names[j] for j in top25neg))

# Prediction

We can also use our model to classify new reviews – all we have to do is extract the tfidf features from the raw text and send them to the model as our features (independent variables):

In [None]:
new_good_review = "This was the greatest amazing good movie ever!"

tfidf_rev = vectorizer.transform([new_good_review])

model.predict(tfidf_rev)

In [None]:
new_mixed_review = "This movie was bad, awesome and good!"

tfidf_rev = vectorizer.transform([new_mixed_review])

model.predict(tfidf_rev)

We could also have a look at the probabilities that our model is assigning to a new review.

In [None]:
new_confusing_review = "I hated most things about it but it was okay"
# add "I guess" to the end of this sentence to see what happens

tfidf_rev = vectorizer.transform([new_confusing_review])

print(model.predict_proba(tfidf_rev))
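
One thing to keep in mind with short reviews like these: `transform` can only use words that were in the vocabulary when the vectorizer was fit, and everything else is silently dropped. A minimal sketch on made-up texts (`train_texts` and `toy_vectorizer` are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training texts that define the vocabulary
train_texts = ["a great and moving film", "a dull and awful film"]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(train_texts)

# Words never seen during fit ("okay", "guess") are silently dropped
known = toy_vectorizer.transform(["great film i guess"])
unknown = toy_vectorizer.transform(["okay i guess"])

print(sorted(toy_vectorizer.vocabulary_))
print("non-zero features:", known.nnz, unknown.nnz)
```

In our notebook, `min_df` and the stop word list shrink the vocabulary further, so a very short review may leave the model with only a few words, or none at all, to base its prediction on.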