# Feature engineering, text mining, and text classification


Let us first read in some packages. Nothing new here.

In [1]:
# Import numpy and pandas to read in and work with data
import numpy as np
import pandas as pd

# Import models and evaluation functions
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics
from sklearn.model_selection import cross_val_score

# Import vectorizers to turn text into numeric
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import plotting
import matplotlib.pyplot as plt
%matplotlib inline

## Text classification
We are going to look at some Amazon reviews and classify them as positive or negative.

### Data
The file `files/books.csv` contains 2,000 Amazon book reviews. 

Go here to get the dataset: http://bit.ly/someBookReviews

The data set contains two features: the first column (contained in quotes) is the review text. 
The second column is a binary label indicating if the review is positive or negative.

Let's take a quick look at the file.

In [6]:
!head -3 books.csv

'head' is not recognized as an internal or external command,
operable program or batch file.
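
As the error above shows, `head` is not available in the Windows shell. A portable alternative is to read the first few lines with Python itself. Here is a minimal sketch; `peek` is a hypothetical helper name, and the small temporary file is a made-up stand-in so the example runs on its own.

```python
import itertools
import os
import tempfile


def peek(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file without loading the whole file."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


# Made-up stand-in file so the sketch is self-contained
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write('review_text,positive\n"Loved it",1\n"Hated it",0\n"Fine",1\n')

first_lines = peek(sample_path, n=3)
for line in first_lines:
    print(line)
```

In the notebook, `peek("books.csv")` would play the role of the `head` command.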






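Let's read the data into a pandas data frame. You'll notice two parameters in `pd.read_csv()` that we haven't seen before. The first, `quotechar`, tells pandas which character is used to "encapsulate" the text fields. Since our review text is surrounded by double quotes, we let pandas know. In the Python code we write it as `"\""`, because the double quote also delimits the Python string and so has to be escaped with a backslash. The second, `escapechar`, names the escape character used inside the file: a backslash in front of a double quote means the quote is part of the review text rather than the end of the field. We also let pandas know this.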

In [7]:
data = pd.read_csv("books.csv", quotechar="\"", escapechar="\\")
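
To see what these two parameters do, here is a small sketch on a made-up two-row file held in memory (the text in `raw` is invented for illustration and is not taken from `books.csv`). The first review contains double quotes that are escaped with a backslash, plus a comma inside the quoted field.

```python
import io

import pandas as pd

# Made-up file contents: the first review contains backslash-escaped quotes
raw = (
    'review_text,positive\n'
    '"She called it \\"a masterpiece\\", and I agree",1\n'
    '"Dull, slow, and far too long",0\n'
)

toy = pd.read_csv(io.StringIO(raw), quotechar='"', escapechar="\\")

# The escaped quotes come back as ordinary quotes inside the text
print(toy.loc[0, "review_text"])
print(toy.shape)
```

The commas inside the quoted fields do not split the rows into extra columns, which is the point of `quotechar`.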

In [8]:
data.head()

Unnamed: 0,review_text,positive
0,THis book was horrible. If it was possible to...,0
1,I like to use the Amazon reviews when purchasi...,0
2,THis book was horrible. If it was possible to...,0
3,"I'm not sure who's writing these reviews, but ...",0
4,I picked up the first book in this series (The...,0


In [9]:
data.iloc[1]['review_text']

'I like to use the Amazon reviews when purchasing books, especially alert for dissenting perceptions about higly rated items, which usually disuades me from a selection.  So I offer this review that seriously questions the popularity of this work - I found it smug, self-serving and self-indulgent, written by a person with little or no empathy, especially for the people he castigates. For example, his portrayal of the family therapist seems implausible and reaches for effect and panders to theshrink bashers of the world. This play for effect tone throughout the book was very distasteful to me'

### Text as a set of features
Going from text to numeric data is very easy. Let's take a look at how we can do this. We'll start by separating out our X and Y data.

In [10]:
X_text = data['review_text']
Y = data['positive']

Next, we will turn `X_text` into just `X` -- a numeric representation!

In [24]:
# Create a vectorizer that will track text as binary features
binary_vectorizer = CountVectorizer(binary=True)

# Let the vectorizer learn what tokens exist in the text data
binary_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = binary_vectorizer.transform(X_text)
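
To make the numeric representation concrete, here is a small sketch on two made-up documents (`docs` is invented for illustration). Each row of the matrix is a document and each column is a token. With `binary=True` a cell is 1 if the token appears at all; without it, the cell holds the number of times the token appears.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents
docs = ["good good book", "bad book"]

binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs).toarray()

count_vec = CountVectorizer()
X_count = count_vec.fit_transform(docs).toarray()

# Tokens in column order (vocabulary_ maps each token to its column index)
tokens = sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get)
print(tokens)
print(X_binary)
print(X_count)
```

The only difference between the two matrices is the cell for "good" in the first document, which is 1 in the binary version and 2 in the counted version.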

### Modeling
We have a ton of features, so let's use them in some different models.

In [25]:
# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation accuracies
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=5)

# Print out the average accuracy rounded to three decimal points
print("Accuracy of our classifier is " + str(round(np.mean(accs), 3)))



Accuracy of our classifier is 0.768
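
One thing to keep in mind: above, the vectorizer was fit on all of the reviews before cross validation, so the vocabulary in each fold includes tokens from the held-out reviews. A common way to avoid this is to put the vectorizer and the model into a scikit-learn pipeline, so that both are refit inside every fold. Here is a minimal sketch on a made-up corpus (`toy_text` and `toy_labels` are stand-ins, not the book reviews); on the real data the accuracy may differ from the number above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up stand-in corpus; in the notebook, use X_text and Y instead
toy_text = ["great book loved it"] * 10 + ["awful book hated it"] * 10
toy_labels = [1] * 10 + [0] * 10

# The vectorizer is now fit only on the training part of each fold
pipeline = make_pipeline(CountVectorizer(binary=True), LogisticRegression())
toy_accs = cross_val_score(pipeline, toy_text, toy_labels, scoring="accuracy", cv=5)

print("Accuracy of our classifier is " + str(round(np.mean(toy_accs), 3)))
```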


Let's try using full counts instead of a binary representation. I've just copied and pasted the code above and removed `binary=True` from the vectorizer.

In [26]:
# Create a vectorizer that will track text as counted features
count_vectorizer = CountVectorizer()

# Let the vectorizer learn what tokens exist in the text data
count_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = count_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation accuracies
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=5)

# Print out the average accuracy rounded to three decimal points
print("Accuracy of our classifier is " + str(round(np.mean(accs), 3)))



Accuracy of our classifier is 0.786


Let's try using TF-IDF.

In [14]:
# Create a vectorizer that will track text as TF-IDF weighted features
tfidf_vectorizer = TfidfVectorizer()

# Let the vectorizer learn what tokens exist in the text data
tfidf_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = tfidf_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation accuracies
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=5)

# Print out the average accuracy rounded to three decimal points
print("Accuracy of our classifier is " + str(round(np.mean(accs), 3)))

Accuracy of our classifier is 0.784
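
To see what TF-IDF does to the counts, here is a small sketch that rebuilds the weights by hand on two made-up documents. With its default settings (`smooth_idf=True`, `norm="l2"`), scikit-learn's `TfidfVectorizer` uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing token t. It multiplies each count by that weight and then scales every row to unit length. The names `idf_by_hand` and `tfidf_by_hand` are just for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up documents
docs = ["good good book", "bad book"]

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs).toarray()

# Rebuild the same numbers from raw counts
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)  # documents containing each token
idf_by_hand = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf_by_hand
tfidf_by_hand = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(tfidf_vec.idf_, idf_by_hand))
print(np.allclose(X_tfidf, tfidf_by_hand))
```

Tokens that appear in every document (like "book" here) get the smallest weight, so they contribute less than rarer tokens.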




## Feature Engineering

At the start of this class, we explored two ways of dealing with categorical data: binarizing and numerical scaling. I would like to show how to do these two things in Python. We will use the same simple five-record data set from class.

Go here to get the data: http://bit.ly/someCategoricalData


In [15]:
data = pd.read_csv("categorical.csv")

In [16]:
data

Unnamed: 0,Minutes,Gender,Marital,Satisfaction,Churn
0,100,Male,Single,Low,0
1,220,Female,Married,Very Low,0
2,500,Female,Divorced,High,1
3,335,Male,Single,Neutral,0
4,450,Male,Married,Very High,1


### Binarizing
Get a list of the features you want to binarize, then go through each feature and create a new feature for each of its levels.

In [17]:
features_to_binarize = ["Gender", "Marital"]

# Go through each feature
for feature in features_to_binarize:
    # Go through each level in this feature (except the last one, which is implied when all the others are 0)
    for level in data[feature].unique()[0:-1]:
        # Create new feature for this level
        data[feature + "_" + level] = (data[feature] == level).astype(int)
    # Drop original feature (newer pandas versions require the keyword form)
    data = data.drop(columns=[feature])

In [18]:
data

Unnamed: 0,Minutes,Satisfaction,Churn,Gender_Male,Marital_Single,Marital_Married
0,100,Low,0,1,1,0
1,220,Very Low,0,0,0,1
2,500,High,1,0,0,0
3,335,Neutral,0,1,1,0
4,450,Very High,1,1,0,1
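
pandas also has a built-in function for this, `pd.get_dummies()`. Here is a minimal sketch on a hand-typed copy of three of the columns above (`toy` is a stand-in for the data frame). Note one difference: `drop_first=True` drops the first level in sorted order, while the loop above drops the last level in order of appearance. For this data both happen to drop `Female` and `Divorced`, but that will not be true in general.

```python
import pandas as pd

# Hand-typed copy of three columns from the table above
toy = pd.DataFrame({
    "Minutes": [100, 220, 500, 335, 450],
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Marital": ["Single", "Married", "Divorced", "Single", "Married"],
})

# One 0/1 column per level, leaving out one level per feature
encoded = pd.get_dummies(toy, columns=["Gender", "Marital"], drop_first=True, dtype=int)
print(encoded)
```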


### Numeric scaling
We can also replace text levels with a numeric mapping that we create.

In [19]:
data['Satisfaction'] = data['Satisfaction'].replace(['Very Low', 'Low', 'Neutral', 'High', 'Very High'], 
                                                    [-2, -1, 0, 1, 2])

In [20]:
data

Unnamed: 0,Minutes,Satisfaction,Churn,Gender_Male,Marital_Single,Marital_Married
0,100,-1,0,1,1,0
1,220,-2,0,0,0,1
2,500,1,1,0,0,0
3,335,0,0,1,1,0
4,450,2,1,1,0,1
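
The same mapping can also be written as a dictionary and applied with `Series.map()`, which keeps each level next to its number. Here is a minimal sketch on a hand-typed copy of the `Satisfaction` column (`scale` is just an illustrative name). One design difference to be aware of: with `map()`, any level missing from the dictionary becomes `NaN`, which makes typos in the level names easy to spot.

```python
import pandas as pd

# Hand-typed copy of the Satisfaction column
satisfaction = pd.Series(["Low", "Very Low", "High", "Neutral", "Very High"])

# Each level sits next to the number it maps to
scale = {"Very Low": -2, "Low": -1, "Neutral": 0, "High": 1, "Very High": 2}

scaled = satisfaction.map(scale)
print(scaled.tolist())
```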


## In class

Go back to the data set from the beginning of this lecture.

The challenge is to build a predictive model, but now I expect you to do some text preprocessing on the review text. Some things you can do are the following:
- use n-grams instead of 1-grams
- remove stopwords
- lowercase the text
- use a stemmer
- remove punctuation
- anything else you might imagine

In this task, a potentially very helpful package will be NLTK. As always, it is very easy to find all of the above functions pre-coded on the internet simply by Googling (e.g. "NLTK remove stopwords", and so on).
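
Several of these steps can also be done with the vectorizer itself, without NLTK. Here is a minimal sketch on two made-up reviews. `strip_punctuation` is a hypothetical helper, and the particular settings are one possible starting point rather than a recommended recipe. Stemming is not covered here; it would need a custom tokenizer.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def strip_punctuation(text):
    """Lowercase the text and replace anything that is not a letter, digit, or whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


vectorizer = CountVectorizer(
    preprocessor=strip_punctuation,  # lowercasing and punctuation removal
    stop_words="english",            # built-in English stopword list
    ngram_range=(1, 2),              # 1-grams and 2-grams
    binary=True,
)

# Two made-up reviews
docs = ["This book was NOT good!", "A good, good book."]
X_toy = vectorizer.fit_transform(docs)

features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
```

Look closely at the printed features: stopword removal can throw away words that carry sentiment, such as negations, so it is worth checking whether each preprocessing step actually improves the cross validation accuracy.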

In [22]:
data = pd.read_csv("books.csv", quotechar="\"", escapechar="\\")

In [39]:
X_text = data['review_text']
Y = data['positive']

In [40]:
# Create a vectorizer that will track text as binary features
binary_vectorizer = CountVectorizer(binary=True)

# Let the vectorizer learn what tokens exist in the text data
binary_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = binary_vectorizer.transform(X_text)

In [41]:
# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 9-fold cross validation accuracies
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=9)

# Print out the average accuracy rounded to three decimal points
print("Accuracy of our classifier is " + str(round(np.mean(accs), 3)))



Accuracy of our classifier is 0.789
