# Practical 9

## Text Mining
***

Import the packages we need.

In [1]:
# Import numpy and pandas to read in and work with data
import numpy as np
import pandas as pd

# Import models and evaluation functions
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics
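# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cross_validation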

# Import vectorizers to turn text into numeric
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import plotting
import matplotlib.pyplot as plt
%matplotlib inline



## Text classification
We are going to look at some Amazon reviews and classify them as positive or negative.

### Data
The file `data/books.csv` contains 2,000 Amazon book reviews. The data set has two columns: the first column (contained in quotes) is the review text. The second column is a binary label indicating whether the review is positive (1) or negative (0).

Let's take a quick look at the file.

In [2]:
!head -3 data/books.csv

review_text,positive
"THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money.I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life",0
"I like to use the Amazon reviews when purchasing books, especially alert for dissenting perceptions about higly rated items, which usually disuades me from a selection.  So I offer this review that seriously questions the popularity of this work - I found it smug, self-serving and self-indulgent, written by a person with little or no empathy, especially for the people he castigates. For example, his portrayal of the family therapist see

Let's read the data into a pandas data frame. You'll notice two arguments to `pd.read_csv()` that we haven't seen before. The first, `quotechar`, tells pandas which character is used to "encapsulate" the text fields. Since our review text is surrounded by double quotes, we pass a double quote. It is written as `"\""` because the Python string itself is delimited by double quotes, so the quote inside it has to be escaped with a backslash. The second, `escapechar`, tells pandas that a backslash acts as the escape character inside the file; it is written as `"\\"` because a literal backslash in a Python string must itself be escaped.

In [3]:
data = pd.read_csv("data/books.csv", quotechar="\"", escapechar="\\")
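
To see the two arguments in isolation, here is a minimal sketch that parses a small in-memory CSV instead of `data/books.csv`. The rows in `lines` are made up for illustration; only the `quotechar` and `escapechar` settings mirror the call above.

```python
import io

import pandas as pd

# Hypothetical two-row file: the first review contains escaped quotes,
# the second contains a comma inside the quoted field.
lines = [
    'review_text,positive',
    r'"She said \"wow\", twice",1',
    '"Plain text, with a comma",0',
]
raw = "\n".join(lines)

# Same settings as the real call: fields are wrapped in double quotes,
# and a backslash escapes a quote that is part of the text.
demo = pd.read_csv(io.StringIO(raw), quotechar="\"", escapechar="\\")

print(demo.loc[0, "review_text"])  # She said "wow", twice
print(demo.shape)                  # (2, 2)
```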

In [4]:
data.head()

| | review_text | positive |
|---|---|---|
| 0 | THis book was horrible. If it was possible to... | 0 |
| 1 | I like to use the Amazon reviews when purchasi... | 0 |
| 2 | THis book was horrible. If it was possible to... | 0 |
| 3 | I'm not sure who's writing these reviews, but ... | 0 |
| 4 | I picked up the first book in this series (The... | 0 |


### Text as a set of features
Going from text to numeric data is very easy. Let's take a look at how we can do this. We'll start by separating out our X and Y data.

In [5]:
X_text = data['review_text']
Y = data['positive']

In [15]:
X_text

0       THis book was horrible.  If it was possible to...
1       I like to use the Amazon reviews when purchasi...
2       THis book was horrible.  If it was possible to...
3       I'm not sure who's writing these reviews, but ...
4       I picked up the first book in this series (The...
5       Not only do I disagree with his opinions, but ...
6       &quot;I have received your new book against th...
7       This book was on somebody's Amazon.com LISTMAN...
8       I am not sure whatever possessed me to buy thi...
9       When Professor Polk describes the sweep of his...
10      This is a story of the lives of 5 people who a...
11      This is the sixth book in the Left Behind seri...
12      This book is essentially about how culture is ...
13      Oh, yes - this is a very fine book... for me t...
14      Hands down this is Bukowski's single worst boo...
15      Very average book!  Frei doesn't go down as on...
16      This book had a good message, and could have h...
17      Finall

Next, we will turn `X_text` into just `X` -- a numeric representation!

In [17]:
# Create a vectorizer that will track text as binary features
binary_vectorizer = CountVectorizer(binary=True)

# Let the vectorizer learn what tokens exist in the text data
binary_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = binary_vectorizer.transform(X_text)
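
As a rough illustration of what `fit` and `transform` produce, the sketch below runs the same kind of binary vectorizer on two made-up sentences. The names `toy_docs`, `toy_vectorizer`, and `toy_X` are hypothetical and chosen so they do not overwrite the notebook's `X`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up "reviews"; note that "good" appears twice in the first one
toy_docs = ["good book good plot", "bad book"]

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_docs)  # fit and transform in one step

# vocabulary_ maps each token to its column index; sort tokens by that index
tokens = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(tokens)           # ['bad', 'book', 'good', 'plot']

# Each row is a document, each column a token; 1 means "token is present"
print(toy_X.toarray())  # [[0 1 1 1]
                        #  [1 1 0 0]]
```

With `binary=True` the repeated "good" still shows up as 1; without it, that entry would be 2.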

### Modeling
We now have a ton of features (one per token). Let's use them in some different models.

In [8]:
# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.848
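
For intuition about what the cross-validation call is doing, here is a sketch that repeats it by hand on synthetic data. It assumes a current scikit-learn, where `cross_val_score` lives in `sklearn.model_selection` and an integer `cv` with a classifier means unshuffled stratified folds. The data from `make_classification` and the names `X_demo`, `y_demo`, `manual_aucs`, and `auto_aucs` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (a stand-in for the review features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Manual 5-fold loop: train on four folds, compute AUC on the held-out fold
model = LogisticRegression()
manual_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # AUC depends only on how the held-out examples are ranked
    scores = model.decision_function(X_demo[test_idx])
    manual_aucs.append(roc_auc_score(y_demo[test_idx], scores))

# The one-line version used in this notebook
auto_aucs = cross_val_score(LogisticRegression(), X_demo, y_demo,
                            scoring="roc_auc", cv=5)

print(np.round(manual_aucs, 3))
print(np.round(auto_aucs, 3))
```

Under these assumptions the two arrays should agree fold by fold.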


Let's try using full counts instead of a binary representation. I've just copied and pasted what is above and removed the `binary=True` from the vectorizer.

In [18]:
# Create a vectorizer that will track token counts
count_vectorizer = CountVectorizer()

# Let the vectorizer learn what tokens exist in the text data
count_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = count_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.849


Let's try using TF-IDF.

In [19]:
# Create a vectorizer that will compute TF-IDF weights
tfidf_vectorizer = TfidfVectorizer()

# Let the vectorizer learn what tokens exist in the text data
tfidf_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = tfidf_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.87
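
To make the TF-IDF weighting less of a black box, the sketch below rebuilds it from raw counts on three made-up sentences and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), under which the inverse document frequency of a token is `ln((1 + n_docs) / (1 + doc_freq)) + 1`. The names `toy_docs`, `manual`, and `from_sklearn` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["good book good plot", "bad book", "good read"]

# Term frequencies: plain token counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)

# Document frequency: in how many documents does each token appear?
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit L2 length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, from_sklearn))
```

Tokens that appear in many documents (here "book" and "good") get a smaller idf than tokens that appear in only one.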


In [29]:
X

<2000x22743 sparse matrix of type '<class 'numpy.float64'>'
	with 204058 stored elements in Compressed Sparse Row format>

### Group work
#### Features
TF-IDF is looking pretty good! How about adding n-grams? Removing stop words? Changing the lowercase transformation?

`CountVectorizer()` and `TfidfVectorizer()` can be configured to handle all of these things. Work in groups and try a few different combinations of these settings with any representation you want: binary indicators, raw counts, or TF-IDF weights. Here is how you would use these settings:

- `ngram_range=(1,2)`: would include unigrams and bigrams
- `stop_words="english"`: would use a standard set of English stop words
- `lowercase=False`: would turn off lowercase transformation (it is actually on by default!)

You can use some of these like this:

`tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), lowercase=False)`
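
To get a feel for what each setting changes, this sketch prints the vocabulary learned from two made-up sentences under different options. The helper `vocab` and the `toy_docs` list are hypothetical; the same keyword arguments work with `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["This book is Great", "this book is not great"]

def vocab(**options):
    """Return the sorted tokens a CountVectorizer learns with these options."""
    vectorizer = CountVectorizer(**options).fit(toy_docs)
    return sorted(vectorizer.vocabulary_)

# Defaults: lowercased unigrams
print(vocab())                        # ['book', 'great', 'is', 'not', 'this']

# Unigrams plus bigrams such as "not great"
print(vocab(ngram_range=(1, 2)))

# Common English words such as "this" and "is" are dropped
print(vocab(stop_words="english"))

# Case is preserved, so "Great" and "great" become separate tokens
print(vocab(lowercase=False))
```

Bigrams can capture negations like "not great" that unigrams lose, at the cost of many more columns.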

#### Models
You can also swap out the line that creates a logistic regression for one that creates a naive Bayes classifier. This is also one line:

`naive_bayes = BernoulliNB()`

You can then go ahead and use `naive_bayes` in place of `logistic_regression`.
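
Put together, the swap might look like the sketch below. It runs on a handful of made-up reviews so that it is self-contained, and it imports `cross_val_score` from `sklearn.model_selection`, which is where current scikit-learn versions keep it. The names `toy_text`, `toy_labels`, `toy_X`, and `toy_aucs` are hypothetical; in the notebook you would pass your own `X` and `Y` instead.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Made-up reviews: six positive followed by six negative
toy_text = [
    "great book loved it", "wonderful story great characters",
    "loved the plot", "great read", "wonderful and moving", "loved every page",
    "terrible book hated it", "boring story awful characters",
    "hated the plot", "awful read", "boring and dull", "hated every page",
]
toy_labels = np.array([1] * 6 + [0] * 6)

# BernoulliNB models presence/absence, so binary features are a natural fit
toy_X = CountVectorizer(binary=True).fit_transform(toy_text)

# The only change from the logistic regression version is this line
naive_bayes = BernoulliNB()

# 3 folds here only because the toy data set is so small
toy_aucs = cross_val_score(naive_bayes, toy_X, toy_labels, scoring="roc_auc", cv=3)
print("Area under the ROC curve for naive Bayes is " + str(round(np.mean(toy_aucs), 3)))
```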

In [47]:
# Work with your teams here!
# Try different features, models, or both!
# What is the highest AUC you can get?
from sklearn import svm
from sklearn import tree

# Create a vectorizer that will compute TF-IDF weights, ignoring English stop words
tfidf_vectorizer = TfidfVectorizer(stop_words="english")

# Let the vectorizer learn what tokens exist in the text data
tfidf_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X_tfidf = tfidf_vectorizer.transform(X_text)

# Create models
logistic_regression = LogisticRegression()
support_vector_m = svm.SVC(kernel="linear")
decision_tree = tree.DecisionTreeClassifier(criterion="entropy")

# Use these models and our data to get 4-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X_tfidf, Y, scoring="roc_auc", cv=4)
aucs_svm = cross_validation.cross_val_score(support_vector_m, X_tfidf, Y, scoring="roc_auc", cv=4)
aucs_tree = cross_validation.cross_val_score(decision_tree, X_tfidf, Y, scoring="roc_auc", cv=4)

print("TFIDF:")
# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for LogReg classifier is " + str(round(np.mean(aucs), 3)))
print("Area under the ROC curve for SVM classifier is " + str(round(np.mean(aucs_svm), 3)))
print("Area under the ROC curve for tree classifier is " + str(round(np.mean(aucs_tree), 3)))

print("\n")
print("Count vectorizer:")
# Create a vectorizer that will track token counts, ignoring English stop words
count_vectorizer = CountVectorizer(stop_words="english")

# Let the vectorizer learn what tokens exist in the text data
count_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X_count = count_vectorizer.transform(X_text)

# Get cross validation AUCs (5 folds for logistic regression, 4 for the others)
aucs = cross_validation.cross_val_score(logistic_regression, X_count, Y, scoring="roc_auc", cv=5)
aucs_svm = cross_validation.cross_val_score(support_vector_m, X_count, Y, scoring="roc_auc", cv=4)
aucs_tree = cross_validation.cross_val_score(decision_tree, X_count, Y, scoring="roc_auc", cv=4)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))
print("Area under the ROC curve for SVM classifier is " + str(round(np.mean(aucs_svm), 3)))
print("Area under the ROC curve for tree classifier is " + str(round(np.mean(aucs_tree), 3)))

TFIDF:
Area under the ROC curve for LogReg classifier is 0.857
Area under the ROC curve for SVM classifier is 0.856
Area under the ROC curve for tree classifier is 0.661


Count vectorizer:
Area under the ROC curve for our classifier is 0.84
Area under the ROC curve for SVM classifier is 0.799
Area under the ROC curve for tree classifier is 0.644


## Feature Engineering
We have examined two ways of dealing with categorical data: binarizing/dummy variables and numerical scaling. We will practice these here.

In [21]:
data = pd.read_csv("data/categorical.csv")

In [22]:
data

| | Minutes | Gender | Marital | Satisfaction | Churn |
|---|---|---|---|---|---|
| 0 | 100 | Male | Single | Low | 0 |
| 1 | 220 | Female | Married | Very Low | 0 |
| 2 | 500 | Female | Divorced | High | 1 |
| 3 | 335 | Male | Single | Neutral | 0 |
| 4 | 450 | Male | Married | Very High | 1 |


### Binarizing
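Make a list of the features you want to binarize, then go through each one and create a new 0/1 feature for each of its levels. We skip the last level of each feature because it is implied when all of that feature's other indicators are 0.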

In [23]:
features_to_binarize = ["Gender", "Marital"]

# Go through each feature
for feature in features_to_binarize:
    # Go through each level in this feature (except the last one!)
    for level in data[feature].unique()[0:-1]:
        # Create new feature for this level
        data[feature + "_" + level] = (data[feature] == level).astype(int)
    # Drop original feature (the positional axis argument was removed in pandas 2.0)
    data = data.drop(columns=[feature])

In [24]:
data

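| | Minutes | Satisfaction | Churn | Gender_Male | Marital_Single | Marital_Married |
|---|---|---|---|---|---|---|
| 0 | 100 | Low | 0 | 1 | 1 | 0 |
| 1 | 220 | Very Low | 0 | 0 | 0 | 1 |
| 2 | 500 | High | 1 | 0 | 0 | 0 |
| 3 | 335 | Neutral | 0 | 1 | 1 | 0 |
| 4 | 450 | Very High | 1 | 1 | 0 | 1 |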
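
pandas also offers a built-in shortcut for this kind of binarizing. The sketch below is an alternative to the loop, not the method used in this practical: it rebuilds a small frame from the values shown earlier and calls `pd.get_dummies`. The frame name `toy` is hypothetical. Note that `drop_first=True` drops the alphabetically first level of each feature, whereas the loop drops the last level it encounters; for this data both happen to drop `Female` and `Divorced`.

```python
import pandas as pd

# Same categorical values as the example data, rebuilt in memory
toy = pd.DataFrame({
    "Minutes": [100, 220, 500, 335, 450],
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Marital": ["Single", "Married", "Divorced", "Single", "Married"],
})

# One 0/1 column per level, minus the first level of each feature
dummies = pd.get_dummies(toy, columns=["Gender", "Marital"],
                         drop_first=True, dtype=int)

print(dummies.columns.tolist())
# ['Minutes', 'Gender_Male', 'Marital_Married', 'Marital_Single']
```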


### Numeric scaling
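We can also replace text levels with a numeric mapping that we create ourselves. This only makes sense when the levels have a natural order, as satisfaction ratings do: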

In [25]:
data['Satisfaction'] = data['Satisfaction'].replace(['Very Low', 'Low', 'Neutral', 'High', 'Very High'], 
                                                    [-2, -1, 0, 1, 2])

In [26]:
data

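| | Minutes | Satisfaction | Churn | Gender_Male | Marital_Single | Marital_Married |
|---|---|---|---|---|---|---|
| 0 | 100 | -1 | 0 | 1 | 1 | 0 |
| 1 | 220 | -2 | 0 | 0 | 0 | 1 |
| 2 | 500 | 1 | 1 | 0 | 0 | 0 |
| 3 | 335 | 0 | 0 | 1 | 1 | 0 |
| 4 | 450 | 2 | 1 | 1 | 0 | 1 |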
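
The same mapping can also be written as a dictionary passed to `Series.map`, which keeps each level next to its number. This is a sketch of an alternative, not the approach used above; `satisfaction_scale` and `satisfaction` are hypothetical names, and the values are copied from the example data.

```python
import pandas as pd

# Ordered levels mapped onto a scale centered at 0
satisfaction_scale = {
    "Very Low": -2,
    "Low": -1,
    "Neutral": 0,
    "High": 1,
    "Very High": 2,
}

satisfaction = pd.Series(["Low", "Very Low", "High", "Neutral", "Very High"])
scaled = satisfaction.map(satisfaction_scale)

print(scaled.tolist())  # [-1, -2, 1, 0, 2]
```

One difference worth knowing: with `map`, any level missing from the dictionary becomes `NaN`, while `replace` leaves it unchanged.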
