# Text classification

In this notebook, you'll practice (almost) everything you've learnt in the workshop. You're going to read in a bunch of documents, perform preprocessing and EDA, and then train and evaluate a text classifier. Hopefully, you'll feel confident enough to do this largely by yourself, but feel free to refer back to previous notebooks or ask questions.

### Data

I've downloaded the "Blog Authorship" corpus from [here](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm). This is a corpus of 19,320 bloggers gathered from blogger.com in August 2004. The corpus has a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog has been tagged with the blogger's (self-identified) gender, age, industry and astrological star sign. At a later time, I'd encourage you to read [the paper](http://u.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf) that describes the corpus.

Each blog is in a separate XML file. The file name gives the blogger's id in the corpus, followed by their gender, age, industry and star sign. Within the XML file, there are two tags: `date` and `post`. We're going to ignore the `date` tag; all the data we want is in the `post` tag.
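
As a rough sketch of that layout, the snippet below parses a tiny hand-written document shaped like one of the corpus files. The root tag name `Blog`, the dates and the post text are all made up for illustration; only the `date` and `post` tags come from the description above.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature of one blog file. The root tag name and the
# contents are invented; only <date> and <post> are described above.
SAMPLE_XML = """
<Blog>
  <date>01,August,2004</date>
  <post>First post.</post>
  <date>02,August,2004</date>
  <post>Second post.</post>
</Blog>
""".strip()

sample_root = ET.fromstring(SAMPLE_XML)

# findall('post') returns the <post> children of the root; <date> is skipped.
sample_posts = [post.text.strip() for post in sample_root.findall('post')]
print(sample_posts)
```

Only the `post` elements are collected, so the `date` tags never reach the text we analyse.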

### Task
There are lots of things you could do with this, but we're going to try to build a classifier to predict a blogger's age bracket.

### Time
- Teaching: 10 minutes
- Exercises: 50 minutes

In [None]:
%matplotlib inline
import os
import re
import glob
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from string import punctuation
from xml.etree import ElementTree as ET
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

## Read in the data

The first thing we want to do is read in all the data we'll need. We need both the text of the blog posts and the age of the blogger.

In [None]:
DATA_DIR = '../data/blogs'
fname_pattern = os.path.join(DATA_DIR, '*.xml')

In [None]:
def extract_properties_from_fname(fname):
    fname = os.path.basename(fname)
    return fname.split('.')

def extract_age_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    age = int(properties[2])
    return age

def extract_gender_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    gender = properties[1]
    return gender

def extract_id_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    num = int(properties[0])
    return num

def extract_industry_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    industry = properties[3]
    return industry

def extract_starsign_from_fname(fname):
    properties = extract_properties_from_fname(fname)
    starsign = properties[4]
    return starsign
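
The helpers above each split the file name again. If you prefer a single pass, something like the sketch below might do. The function name `parse_blog_fname` and the example file name are hypothetical, and the code assumes that none of the fields contains a period.

```python
import os

def parse_blog_fname(fname):
    """Split '<id>.<gender>.<age>.<industry>.<starsign>.xml' into a dict."""
    base = os.path.basename(fname)
    # Assumes exactly six dot-separated parts (five fields plus 'xml').
    blogger_id, gender, age, industry, starsign, _ext = base.split('.')
    return {
        'id': int(blogger_id),
        'gender': gender,
        'age': int(age),
        'industry': industry,
        'starsign': starsign,
    }

# Hypothetical file name, made up to match the layout described earlier.
example_props = parse_blog_fname('../data/blogs/1234567.female.25.Student.Leo.xml')
print(example_props)
```

Unpacking into six names has a useful side effect: a file name with an unexpected number of parts raises a `ValueError` instead of silently returning the wrong field.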

In [None]:
def extract_all_text(fname):
    e = ET.parse(fname)
    root = e.getroot()
    posts = root.findall('post')
    # Skip empty <post> elements, whose .text is None and would break join().
    text = [post.text for post in posts if post.text]
    return ' '.join(text)

In [None]:
def extract_data(fname):
    response = extract_age_from_fname(fname)
    num = extract_id_from_fname(fname)
    try:
        text = extract_all_text(fname)
    except ET.ParseError:
        text = np.nan  # np.NaN was removed in NumPy 2.0
    return num, response, text
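
Blog posts are free text, so some files may not be well-formed XML. One plausible cause (an assumption, not something checked against the corpus) is an unescaped `&`. The sketch below shows how `ET.ParseError` surfaces on a made-up string and how catching it lets the loop carry on. The function name `safe_post_texts` is hypothetical.

```python
from xml.etree import ElementTree as ET

def safe_post_texts(xml_string):
    """Return the list of post texts, or None if the XML is malformed."""
    try:
        parsed_root = ET.fromstring(xml_string)
    except ET.ParseError:
        return None
    return [post.text for post in parsed_root.findall('post')]

good_xml = '<Blog><post>hello</post></Blog>'
bad_xml = '<Blog><post>hello & goodbye</post></Blog>'  # bare '&' is invalid XML

good_result = safe_post_texts(good_xml)
bad_result = safe_post_texts(bad_xml)
print(good_result)
print(bad_result)
```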

In [None]:
fnames = glob.glob(fname_pattern)
data = {}
for fname in fnames[:1000]:
    num, response, text = extract_data(fname)
    data[num] = [response, text]

In [None]:
df = pd.DataFrame.from_dict(data, orient='index')
df.columns = ['age', 'text']
df.head()

Remove blogs with parsing errors or a missing age, and keep only bloggers under 40.

In [None]:
df = df[df['text'].notnull()]
df = df[df['age'].notnull()]
df = df[df['age']<40]

## Preprocess data


In [None]:
def remove_punctuation(text):
    return ''.join([ch for ch in text if ch not in punctuation])

In [None]:
def remove_whitespace(text):
    whitespace_pattern = r'\s+'
    no_whitespace = re.sub(whitespace_pattern, ' ', text)
    return no_whitespace.strip()

In [None]:
def remove_url(text):
    # Match a URL up to the next whitespace, not to the end of the line.
    url_pattern = r'https?://\S+'
    URL_SIGN = ' URL '
    return re.sub(url_pattern, URL_SIGN, text)

In [None]:
def remove_digits(text):
    digit_pattern = r'\d+'  # raw string avoids an invalid-escape warning
    DIGIT_SIGN = ' DIGIT '
    return re.sub(digit_pattern, DIGIT_SIGN, text)

In [None]:
def tokenize(text):
    try:
        return word_tokenize(text)
    except LookupError:  # NLTK tokenizer data is not installed
        return text.split()

In [None]:
stops = set(stopwords.words('english'))  # set membership tests are faster than list ones

def remove_stopwords(text):
    tokenized_text = tokenize(text)
    no_stopwords = [token for token in tokenized_text if token not in stops]
    return ' '.join(no_stopwords)

In [None]:
stemmer = PorterStemmer()

def stem(text):
    tokenized_text = tokenize(text)
    stems = [stemmer.stem(token) for token in tokenized_text]
    return ' '.join(stems)
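
To see what these two optional steps do, here is a small sketch on a made-up sentence. It uses a tiny hand-picked stopword set (`TOY_STOPS`, hypothetical) rather than NLTK's list so that it runs without downloading any corpus data, and it splits on whitespace instead of calling `word_tokenize`.

```python
from nltk.stem import PorterStemmer

# Tiny hand-picked stopword set, used here only so the example runs
# without NLTK's stopword corpus. It is not the list used above.
TOY_STOPS = {'the', 'is', 'on', 'my'}
toy_stemmer = PorterStemmer()

def stop_and_stem(text):
    """Lowercase, drop toy stopwords, then stem what is left."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in TOY_STOPS]
    return [toy_stemmer.stem(t) for t in kept]

toy_stems = stop_and_stem('The dog is running on my posts')
print(toy_stems)
```

Stemming merges inflected forms into one feature, which shrinks the vocabulary but can also blur distinctions that matter for the task, so it is worth trying the classifier with and without it.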

In [None]:
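def clean(text):
    # Replace URLs and digits first: stripping punctuation would destroy
    # the '://' that the URL pattern looks for.
    text = remove_url(text)
    text = remove_digits(text)
    text = remove_punctuation(text)
    text = text.lower()
    text = remove_whitespace(text)
    #text = remove_stopwords(text)
    #text = stem(text)
    return text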

In [None]:
df['clean_text'] = df['text'].apply(clean)
df.head()
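
The order of the cleaning steps matters. Below is a minimal sketch using simplified stand-ins for the helpers above (the function names and the URL pattern are assumptions made for this example). It shows what happens to a URL depending on whether punctuation is stripped before or after the URL is replaced.

```python
import re
from string import punctuation

URL_PATTERN = r'https?://\S+'  # assumed: a URL runs to the next whitespace

def strip_punct(text):
    return ''.join(ch for ch in text if ch not in punctuation)

def mark_urls(text):
    return re.sub(URL_PATTERN, ' URL ', text)

def squeeze(text):
    return re.sub(r'\s+', ' ', text).strip()

raw = 'Read this: http://example.com/post today!'

# Punctuation first: '://' is gone before the URL pattern gets a chance.
punct_first = squeeze(mark_urls(strip_punct(raw)))
# URL first: the URL is replaced while it is still recognisable.
url_first = squeeze(strip_punct(mark_urls(raw)))

print(punct_first)
print(url_first)
```

In the first ordering the URL survives as one long junk token; in the second it becomes the `URL` placeholder.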

## EDA

TBA

## Classification



In [None]:
countvectorizer = CountVectorizer(max_features=5000, binary=True)
X = countvectorizer.fit_transform(df['clean_text'])
features = X.toarray()
features
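
With `binary=True`, each feature records whether a word appears in a document, not how often. A small sketch on two made-up documents (the `toy_` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['the cat sat', 'the cat sat on the cat']

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary ordered by column index, so it lines up with the matrix columns.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_matrix = toy_X.toarray().tolist()
print(toy_vocab)
print(toy_matrix)
```

Although "cat" and "the" each occur twice in the second document, their entries are still 1.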

In [None]:
bins = list(range(10, 41, 10))
# With right=False the bins are [10, 20), [20, 30) and [30, 40),
# so each label names its bin's (exclusive) upper edge.
labels = ['Under ' + str(i) for i in bins[1:]]
response = pd.cut(df['age'], bins=bins, right=False, labels=labels)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, response, test_size=0.2)

In [None]:
def fit_logistic_regression(X_train, y_train):
    model = LogisticRegressionCV(Cs=5, penalty='l1', cv=3, solver='liblinear', refit=True)
    model.fit(X_train, y_train)
    return model

def conmat(model, X_test, y_test):
    """Plot sklearn's confusion matrix as an annotated heatmap."""
    labels = model.classes_
    y_pred = model.predict(X_test)
    # Pass labels explicitly so rows and columns match the tick labels.
    c = confusion_matrix(y_test, y_pred, labels=labels)
    sns.heatmap(c, annot=True, fmt='d',
                xticklabels=labels,
                yticklabels=labels,
                cmap="YlGnBu", cbar=False)
    plt.ylabel('Ground truth')
    plt.xlabel('Prediction')

def test_model(model, X_test, y_test):
    conmat(model, X_test, y_test)
    print('Accuracy: ', model.score(X_test, y_test))

def interpret(vectorizer, model):
    # Order the vocabulary by column index so it lines up with model.coef_.
    vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    important = pd.DataFrame(model.coef_).T
    important.columns = model.classes_
    important['word'] = vocab
    return important

In [None]:
lr = fit_logistic_regression(X_train, y_train)

In [None]:
test_model(lr, X_test, y_test)

In [None]:
important = interpret(countvectorizer, lr)
important.sort_values(by='Under 20', ascending=False).head(10)

In [None]:
important.sort_values(by='Under 30', ascending=False).head(10)

In [None]:
important.sort_values(by='Under 40', ascending=False).head(10)
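
The cells above rank words by their coefficient for one class at a time. If you want the top words for every class in one go, a helper along these lines might be handy. The coefficients, class names and words below are made-up numbers for illustration, not results from the corpus, and `top_words` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up coefficients: one row per class, one column per word.
toy_classes = ['younger', 'older']
toy_words = ['homework', 'lol', 'mortgage', 'the']
toy_coef = np.array([
    [1.2, 2.0, -0.7, 0.0],
    [-0.9, -0.4, 1.5, 0.1],
])

# Same layout as the table built by interpret(): one column per class.
toy_table = pd.DataFrame(toy_coef.T, columns=toy_classes)
toy_table['word'] = toy_words

def top_words(table, cls, n=2):
    """Return the n words with the largest coefficients for one class."""
    return table.nlargest(n, cls)['word'].tolist()

toy_top = {cls: top_words(toy_table, cls) for cls in toy_classes}
print(toy_top)
```

A large positive coefficient means the word pushes predictions towards that class; with an L1 penalty many coefficients are exactly zero, so the lower ranks of such a list can be uninformative.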