# [LEGALST-190] Lab 3/20: TF-IDF and Classification

<img src = "https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.png?v=dd7153fcc7fa" style = "width:500px; height: 275px;" />

This lab covers the term frequency-inverse document frequency (tf-idf) method and classification algorithms in machine learning.

Estimated Lab time: 30 minutes

In [None]:
# Dependencies
from datascience import *
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn import svm
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
import itertools
import seaborn as sn
%matplotlib inline

# The Data

For this lab, we'll use a dataset drawn from a Kaggle collection of questions posted on Stack Exchange (a network of websites/forums where people ask and answer questions about statistics, programming, etc.).

The data has the following features:

- "Id": The Id number for the question
- "Body": The text of the answer
- "Tag": Whether the question was tagged as dealing with python, xml, java, json, or android

Your task will be to extract features from the "Body" column, and use those features to predict class membership, denoted by "Tag."

In [None]:
stack_data = pd.read_csv('data/stackexchange.csv', encoding='latin-1')
stack_data.head(5)

# Section 1: TF-IDF Vectorizer

Term frequency-inverse document frequency (tf-idf) is a statistic that measures how important a term is to a document relative to the rest of the corpus. Term frequency is the number of times a term appears within a document. Inverse document frequency is the logarithmically scaled inverse of the fraction of documents that contain the term, so it penalizes terms that occur in many documents. Tf-idf multiplies these two measures together.
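
To make the definition concrete, here is a minimal sketch that computes tf-idf by hand. The three-document corpus is a hypothetical toy example (not the lab data), and the formula is the plain textbook version described above. Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not the lab data
docs = [
    "python list error",
    "java list error",
    "android error",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook tf-idf: raw count times log(N / document frequency)."""
    tf = tokens.count(term)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# Scores for the first document
scores = {term: tf_idf(term, tokenized[0]) for term in tokenized[0]}
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<8} {score:.3f}")
```

"error" appears in every document, so its idf (and therefore its score) is zero, while "python" appears in only one document and scores highest.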

Check out the documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

#### Question 1: Why is tf-idf a potentially more attractive vectorizer than the standard count vectorizer?

Let's get started! First, extract the "Body" column into its own NumPy array called "text_list".

In [None]:
# Extract Text Data


Next, initialize a term frequency-inverse document frequency (tf-idf) vectorizer. Check out the documentation to fill in the arguments: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [None]:
tf = TfidfVectorizer(analyzer=, 
                     ngram_range=(), 
                     min_df = , 
                     stop_words = )

Next, use the "fit_transform" method to take in the list of documents and convert them into a document-term matrix. Use "get_feature_names_out()" ("get_feature_names()" in scikit-learn versions before 1.0) and "len" to calculate how many features this generates.
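
As a rough illustration of how the feature count grows, the sketch below fits two vectorizers on a hypothetical three-document toy corpus. The argument values are chosen for illustration only and are not the intended answers for the cell above. The `getattr` fallback is an assumption made so the snippet runs on both older and newer scikit-learn versions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the lab data
toy_docs = [
    "python list error",
    "java list error",
    "android error",
]

def feature_names_of(vectorizer):
    """Return the fitted feature names on old and new scikit-learn versions."""
    get_names = (getattr(vectorizer, "get_feature_names_out", None)
                 or vectorizer.get_feature_names)
    return list(get_names())

# Single words only
unigram_tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
unigram_matrix = unigram_tf.fit_transform(toy_docs)

# Single words plus two-word phrases
bigram_tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
bigram_matrix = bigram_tf.fit_transform(toy_docs)

print(unigram_matrix.shape, len(feature_names_of(unigram_tf)))
print(bigram_matrix.shape, len(feature_names_of(bigram_tf)))
```

Each row of the matrix is a document and each column is a feature, so adding two-word phrases widens the matrix even on this tiny corpus.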

#### Question 2: The dimensionality explodes quickly. Why might this be a problem as you use more data?

Calculate the tf-idf scores for the first document in the corpus. Do the following:

1. Use ".todense()" to turn the tfidf matrix into a dense matrix (get rid of the sparsity)
2. Create an object for the document by taking the 0th row of the dense matrix and converting it to a list. Try something like: document = dense[0].tolist()[0]
3. Calculate the phrase scores by using the "zip" command to pair each index from 0 to the length of the document with its score, retaining scores greater than 0.
4. Sort the scores using the "sorted" command
5. Print the top 20 scores with their feature names
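
The steps above can be sketched on a hypothetical toy corpus as follows. The corpus and the names `toy_docs`, `toy_tf`, and `toy_matrix` are illustrative assumptions; in the lab you would apply the same pattern to your own fitted vectorizer and matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the lab data
toy_docs = [
    "python list error",
    "java list error",
    "android error",
]
toy_tf = TfidfVectorizer()
toy_matrix = toy_tf.fit_transform(toy_docs)

# Feature names, with a fallback for older scikit-learn versions
get_names = (getattr(toy_tf, "get_feature_names_out", None)
             or toy_tf.get_feature_names)
feature_names = list(get_names())

# 1. Turn the sparse matrix into a dense one
dense = toy_matrix.todense()

# 2. Take the first document's row as a plain list of scores
document = dense[0].tolist()[0]

# 3. Pair each feature index with its score, keeping only nonzero scores
phrase_scores = [pair for pair in zip(range(len(document)), document)
                 if pair[1] > 0]

# 4. Sort from highest to lowest score
sorted_scores = sorted(phrase_scores, key=lambda pair: pair[1], reverse=True)

# 5. Print the top 20 (here there are only three nonzero features)
for idx, score in sorted_scores[:20]:
    print(f"{feature_names[idx]:<10} {score:.3f}")
```

Calling `.todense()` on a large corpus can use a lot of memory, so on real data it may be safer to densify only the row you need, for example `toy_matrix[0].todense()`.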

# Section 2: Classification Algorithms

One of the main tasks in supervised machine learning is classification. In this case, we will develop algorithms that will predict a question's tag based on the text of its answer.

The first step is to split our data into training, validation, and test sets. 
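
One common way to get three sets is to call `train_test_split` twice. The sketch below uses placeholder arrays and an assumed 60/20/20 split; the proportions and the `random_state` value are illustrative choices, not requirements of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the tf-idf matrix and tags
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50) % 5

# First split: hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

`train_test_split` also accepts sparse matrices, so the same two calls should work directly on the output of `fit_transform`.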

## Naive Bayes

[Naive Bayes classifiers](http://scikit-learn.org/stable/modules/naive_bayes.html) classify observations by assuming that the features are conditionally independent of one another given the class. Do the following:

1. Initialize a Naive Bayes classifier method with "MultinomialNB()"
2. Fit the model on your training data
3. Predict on the validation data and store the predictions
4. Use "np.mean" to calculate how often the classifier was correct on average
5. Calculate the confusion matrix using "confusion_matrix," providing the true values first and the predicted values second.
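
The steps above can be sketched on a tiny hypothetical dataset as follows. The documents, the two tags, and all variable names here are made up for illustration; the lab data has five tags and your own training and validation splits.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not the lab data
train_docs = ["python list comprehension", "python dict keys",
              "java class interface", "java static method"]
train_tags = np.array(["python", "python", "java", "java"])
val_docs = ["python list keys", "java interface method"]
val_tags = np.array(["python", "java"])

# Fit the vectorizer on training text only, then reuse it for validation text
toy_vec = TfidfVectorizer()
toy_X_train = toy_vec.fit_transform(train_docs)
toy_X_val = toy_vec.transform(val_docs)

# 1-2. Initialize the classifier and fit it on the training data
toy_nb = MultinomialNB()
toy_nb.fit(toy_X_train, train_tags)

# 3. Predict on the validation data
toy_preds = toy_nb.predict(toy_X_val)

# 4. Mean accuracy: the fraction of predictions that match the true tags
toy_accuracy = np.mean(toy_preds == val_tags)

# 5. Confusion matrix: true values first, predicted values second
toy_cm = confusion_matrix(val_tags, toy_preds, labels=["python", "java"])

print(toy_accuracy)
print(toy_cm)
```

Passing `labels=` fixes the row and column order of the matrix; without it, `confusion_matrix` uses the sorted order of the labels that appear in the data.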

Let's plot the confusion matrix! Use the following code from the ["seaborn"](https://seaborn.pydata.org/generated/seaborn.heatmap.html) package to make a heatmap out of the matrix.

In [None]:
# Transform the confusion matrix into a dataframe with labeled rows and columns.
# This assumes the rows and columns of nb_cf_matrix follow this tag order;
# confusion_matrix sorts the labels unless you pass labels= explicitly.
tags = ['python', 'xml', 'java', 'json', 'android']
nb_df_cm = pd.DataFrame(nb_cf_matrix, index=tags, columns=tags)

# Plot the confusion matrix
plt.figure(figsize = (10,7))
sn.set(font_scale=1.4)  # label size
sn.heatmap(nb_df_cm, 
           annot=True,
           annot_kws={"size": 16})

plt.title("Naive Bayes Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

#### Question 3: Do you notice any patterns? Are there any patterns in misclassification that are worrisome?

## Multinomial Logistic Regression

Next, let's try [multinomial logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)! Follow the same steps as with Naive Bayes, and plot the confusion matrix.

## SVM

Now do the same for a [Support Vector Machine](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
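
Because scikit-learn classifiers share the same `fit` and `predict` interface, the same steps carry over with only the model object changing. The sketch below runs logistic regression and a support vector machine on a hypothetical three-tag toy dataset. The data, the `max_iter=1000` setting, and the linear kernel are illustrative assumptions, not the lab's required settings.

```python
import numpy as np
from sklearn import linear_model, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix

# Hypothetical toy data, not the lab data
train_docs = ["python list comprehension", "python dict keys",
              "java class interface", "java static method",
              "xml schema namespace", "xml element attribute"]
train_tags = np.array(["python", "python", "java", "java", "xml", "xml"])
val_docs = ["python dict comprehension", "java class method",
            "xml element namespace"]
val_tags = np.array(["python", "java", "xml"])

toy_vec = TfidfVectorizer()
toy_X_train = toy_vec.fit_transform(train_docs)
toy_X_val = toy_vec.transform(val_docs)

# Hyperparameters here are assumptions chosen for the sketch
models = {
    "logistic_regression": linear_model.LogisticRegression(max_iter=1000),
    "svm": svm.SVC(kernel="linear"),
}

accuracies = {}
matrices = {}
for name, model in models.items():
    model.fit(toy_X_train, train_tags)
    preds = model.predict(toy_X_val)
    accuracies[name] = float(np.mean(preds == val_tags))
    matrices[name] = confusion_matrix(val_tags, preds,
                                      labels=["python", "java", "xml"])
    print(name, accuracies[name])
    print(matrices[name])
```

Keeping the results in dictionaries keyed by model name makes it easy to compare validation accuracy across classifiers before choosing one for the test set.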

#### Question 4: How did each of the classifiers do? Which one would you prefer the most?

## Test Final Classifier

Choose your best classifier and use it to predict on the test set. Report the mean accuracy and confusion matrix. 