# Text classification

In this notebook, you'll practice (almost) everything you've learnt in the workshop. You're going to read in a bunch of documents, perform preprocessing, and then train and evaluate a text classifier.

### Data

I've downloaded the "Blog Authorship" corpus from [here](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm). This is a corpus of 19,320 bloggers gathered from blogger.com in August 2004. The corpus has a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog has been tagged with the blogger's (self-identified) gender, age, industry and astrological star sign. At a later time, I'd encourage you to read [the paper](http://u.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf) that describes the corpus.

Each blog is in a separate XML file. Each file name gives the blogger's ID in the corpus, followed by their gender, age, industry and star sign. Within the XML file, there are two tags: date and post. We're going to ignore the date tag. All the data we want is in the post tag.

### Task
There are lots of things you could do with this, but we're going to try to build a classifier to predict a blogger's age bracket.
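
To make the file naming scheme and the idea of an "age bracket" concrete, here is a minimal sketch. The file name below is made up for illustration (it only follows the pattern described above), and `parse_blog_fname` and `age_bracket` are hypothetical helper names, not part of the corpus tooling.

```python
import os

# Hypothetical file name that follows the id.gender.age.industry.starsign.xml pattern.
example_fname = "../data/blogs/1234567.female.27.Student.Leo.xml"

def parse_blog_fname(fname):
    """Split a corpus file name into its five metadata fields."""
    blogger_id, gender, age, industry, starsign, _ext = os.path.basename(fname).split(".")
    return {
        "id": int(blogger_id),
        "gender": gender,
        "age": int(age),
        "industry": industry,
        "starsign": starsign,
    }

def age_bracket(age):
    """Round an age down to its decade and label it, e.g. 27 becomes '20s'."""
    return f"{age // 10 * 10}s"

props = parse_blog_fname(example_fname)
print(props["age"], "->", age_bracket(props["age"]))
```

The bracket label is what the classifier will try to predict, so a 27-year-old and a 21-year-old end up in the same class.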

In [None]:
%matplotlib inline
import os
import re
import math
import glob
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from string import punctuation
from xml.etree import ElementTree as ET
import numpy as np
import pandas as pd

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt
import seaborn as sns

RAW_DATA_DIR = '../data/blogs'
DATA_DIR = '../data'

## Read in the data

The first thing we want to do is read in all the data we'll need. We need both the text of the blog posts and the various attributes of the blogger that we're interested in.

In [None]:
def extract_properties_from_fname(fname):
    fname = os.path.basename(fname)
    return fname.split('.')

def rounddown(n):
    return math.floor(n/10) * 10

def extract_age_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    age = int(properties[2])
    rounded_age = rounddown(age)
    string_age = str(rounded_age) + 's'
    return string_age

def extract_gender_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    gender = properties[1]
    return gender

def extract_id_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    num = int(properties[0])
    return num

def extract_industry_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    industry = properties[3]
    return industry

def extract_starsign_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    starsign = properties[4]
    return starsign

def extract_all_text(fname):
    e = ET.parse(fname)
    root = e.getroot()
    posts = root.findall('post')
    # Skip empty <post> elements, whose .text is None and would break join().
    text = [post.text for post in posts if post.text]
    return ' '.join(text)

def extract_data(fname):
    age = extract_age_from_fname(fname)
    gender = extract_gender_from_fname(fname)
    starsign = extract_starsign_from_fname(fname)
    industry = extract_industry_from_fname(fname)
    try:
        text = extract_all_text(fname)
    except ET.ParseError:
        text = np.nan  # np.NaN was removed in NumPy 2.0
    return age, gender, starsign, industry, text

def load_blogs():
    prepared_fname = os.path.join(DATA_DIR, 'blogs.csv')
    if os.path.exists(prepared_fname):
        return pd.read_csv(prepared_fname)
    fname_pattern = os.path.join(RAW_DATA_DIR, '*.xml')
    fnames = glob.glob(fname_pattern)
    data = []
    for fname in fnames:
        data.append(extract_data(fname))
    df = pd.DataFrame(data, columns=['age', 'gender', 'starsign', 'industry', 'text'])
    df.dropna(how='any', inplace=True)
    df.to_csv(prepared_fname, index=False)
    return df

data = load_blogs()

In [None]:
texts = list(data['text'])
response = list(data['age'])  # the task is to predict the age bracket
print("The first response is:\n", response[0])
print("\nAnd here's the associated text:\n", texts[0][:500])

## Preprocess data

Now you have two variables called `texts` and `response`. `texts` is a list of strings, where each string is the combined text of all posts from a single blog. `response` is also a list of strings, where each string is an attribute of the blogger who wrote that blog. Before we can do anything, we'll have to clean up this data a little. The responses are OK, but the text data itself is pretty dirty.

#### Challenge
Your task now is to preprocess `texts` as much as you'd like. At the end of your preprocessing, we still want a list of strings, one per blog, called `cleaned_texts`. That is, we want:

`["this is the first blog post", "hello this is my second blog post", ..., "the final blog post"]`.

What this means is that if you decide to do any tokenization, or any steps that involve tokenization, you'll have to join the tokens back together so that each blog post is a string (not a list of strings itself).

Here are some suggestions on what preprocessing you could do:
- remove punctuation
- lower case everything
- remove extra whitespace
- replace any URLs with something like " URL "
- replace any digits with " DIGIT "
- remove any stopwords
- remove any words less than 3 characters in length
- stem/lemmatize words
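
As one example from the list above, here is a minimal sketch of dropping short words. `remove_short_words` is a hypothetical helper name, and it assumes simple whitespace tokenization, so it is best applied after punctuation has been removed.

```python
def remove_short_words(text, min_length=3):
    """Drop whitespace-separated tokens shorter than min_length characters."""
    return ' '.join(tok for tok in text.split() if len(tok) >= min_length)

print(remove_short_words("i am so bored at school today"))
```

Because it joins the surviving tokens back together, the result is still a single string, which is what `cleaned_texts` needs.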

In [None]:
def remove_punctuation(text):
    return ''.join([ch for ch in text if ch not in punctuation])

def remove_whitespace(text):
    whitespace_pattern = r'\s+'
    no_whitespace = re.sub(whitespace_pattern, ' ', text)
    return no_whitespace.strip()

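def remove_url(text):
    # Match a URL up to the next whitespace so the rest of the line is kept.
    url_pattern = r'https?://\S+'
    URL_SIGN = ' URL '
    return re.sub(url_pattern, URL_SIGN, text)

def remove_digits(text):
    # Raw string avoids an invalid-escape warning for \d.
    digit_pattern = r'\d+'
    DIGIT_SIGN = ' DIGIT '
    return re.sub(digit_pattern, DIGIT_SIGN, text)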

def tokenize(text):
    try:
        return word_tokenize(text)
    except LookupError:  # NLTK tokenizer data is missing, so fall back to whitespace splitting
        return text.split()
    
stops = set(stopwords.words('english'))  # a set makes membership tests fast

def remove_stopwords(text):
    tokenized_text = tokenize(text)
    no_stopwords = [token for token in tokenized_text if token not in stops]
    return ' '.join(no_stopwords)

stemmer = PorterStemmer()

def stem(text):
    tokenized_text = tokenize(text)
    stems = [stemmer.stem(token) for token in tokenized_text]
    return ' '.join(stems)

In [None]:
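def clean(text):
    text = text.lower()
    # Replace URLs and digits first: stripping punctuation beforehand
    # would break 'http://' and the URL pattern would never match.
    text = remove_url(text)
    text = remove_digits(text)
    text = remove_punctuation(text)
    # Collapse whitespace last, since the placeholders add extra spaces.
    text = remove_whitespace(text)
    #text = remove_stopwords(text)
    #text = stem(text)
    return text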

In [None]:
cleaned_texts = [clean(text) for text in texts]
assert isinstance(cleaned_texts, list), "cleaned_texts should be a list"
assert isinstance(cleaned_texts[0], str), "each element in cleaned_texts should be a string"

## DTM/TF-IDF

#### Challenge
Now let's take our list of strings `cleaned_texts` and turn it into a document-term matrix (DTM), with either counts or TF-IDF scores. It's up to you which one you choose. Here's the documentation for the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and here's the documentation for [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). I'd suggest limiting `max_features` to 5000 and setting `binary=True`. Feel free to play around with other options too!
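
If it's not obvious what `binary=True` changes, here is a plain-Python sketch on two toy documents. This is not how scikit-learn implements it, and it assumes whitespace tokenization (the `CountVectorizer` default tokenizer differs, for example it drops one-character tokens). `dtm_row` is a hypothetical helper name.

```python
from collections import Counter

# Toy documents for illustration only; not taken from the corpus.
docs = ["the cat sat on the mat", "the dog sat"]

# One column per distinct word, in alphabetical order.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def dtm_row(doc, binary=False):
    """Return one DTM row: word counts, or 0/1 presence flags if binary."""
    counts = Counter(doc.split())
    if binary:
        return [int(tok in counts) for tok in vocab]
    return [counts[tok] for tok in vocab]

count_dtm = [dtm_row(doc) for doc in docs]
binary_dtm = [dtm_row(doc, binary=True) for doc in docs]

print(vocab)
print(count_dtm)
print(binary_dtm)
```

The only difference between the two matrices is the entry for "the" in the first document: the count matrix records that it appears twice, while the binary matrix only records that it appears at all.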

In [None]:
vectorizer = CountVectorizer(max_features=5000, binary=True)
features = vectorizer.fit_transform(cleaned_texts)

## Classification

Now we're on to the actual classification step. The first thing we need to do here is split our data into a training and a test set. This is so we can evaluate the quality of our classifier.
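
When you evaluate on the test set, it helps to have a number to compare against. One common reference point is a majority-class baseline, which always predicts the most frequent label in the training data. The sketch below uses made-up toy labels, and `majority_baseline_accuracy` is a hypothetical helper name; once you have your own split you could pass `y_train` and `y_test` instead.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label, _count = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)

# Toy labels for illustration only.
toy_train = ['20s', '20s', '10s', '30s', '20s']
toy_test = ['20s', '10s', '20s', '30s']

label, accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(label, accuracy)
```

A classifier is only learning something useful from the text if its test accuracy is clearly above this baseline.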

In [None]:
# A fixed random_state makes the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(features, response, test_size=0.2, random_state=0)

In [None]:
def fit_logistic_regression(X_train, y_train):
    model = LogisticRegressionCV(Cs=5, penalty='l1', cv=3, solver='liblinear', refit=True)
    model.fit(X_train, y_train)
    return model

def conmat(model, X_test, y_test):
    """Wrapper for sklearn's confusion matrix."""
    labels = model.classes_
    y_pred = model.predict(X_test)
    # Pass labels explicitly so the matrix rows/columns match the tick labels.
    c = confusion_matrix(y_test, y_pred, labels=labels)
    sns.heatmap(c, annot=True, fmt='d', 
                xticklabels=labels, 
                yticklabels=labels, 
                cmap="YlGnBu", cbar=False)
    plt.ylabel('Ground truth')
    plt.xlabel('Prediction')
    
def test_model(model, X_test, y_test):
    conmat(model, X_test, y_test)
    print('Accuracy: ', model.score(X_test, y_test))
    
def interpret(vectorizer, model):
    vocab = [(v,k) for k,v in vectorizer.vocabulary_.items()]
    vocab = sorted(vocab, key=lambda x: x[0])
    vocab = [word for num,word in vocab]
    important = pd.DataFrame(model.coef_).T
    if len(model.classes_) == 2:
        # With two classes, coef_ has a single row, for the positive class classes_[1].
        important.columns = [model.classes_[1]]
    else:
        important.columns = model.classes_
    important['word'] = vocab
    return important

### Train classification model

In [None]:
model = fit_logistic_regression(X_train, y_train)

### Test classification model

In [None]:
test_model(model, X_test, y_test)

### Interpreting what our model learnt

In [None]:
important = interpret(vectorizer, model)
important.sort_values(by='10s', ascending=False).head(10)

### Using our trained model to make predictions

In [None]:
new_blog_posts = ["I hate being a teenager so much! School is so boring! I don't even do my chemisty homework haha \
                  I just copy the soltuions from my friendz yay lol",
                  "This is another post about my new job and life in this big city."]

cleaned_new_blog_posts = [clean(post) for post in new_blog_posts]
new_features = vectorizer.transform(cleaned_new_blog_posts)
predictions = model.predict(new_features)
list(zip(new_blog_posts, predictions))