# Text classification

In this notebook, you'll practice (almost) everything you've learnt in the workshop. You're going to read in a bunch of documents, perform preprocessing and EDA, and then train and evaluate a text classifier. Hopefully, you'll feel confident enough to do this largely by yourself, but feel free to refer back to previous notebooks or ask questions.

### Data

I've downloaded the "Blog Authorship" corpus from [here](http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm). This is a corpus of 19,320 bloggers gathered from blogger.com in August 2004. The corpus has a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog has been tagged with the blogger's (self-identified) gender, age, industry and astrological star sign. At a later time, I'd encourage you to read [the paper](http://u.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf) that describes the corpus.

Each blog is in a separate XML file. The file name gives the blogger's id in the corpus, followed by their gender, age, industry and star sign. Within the XML file, there are two tags: date and post. We're going to ignore the date tag. All the data we want is in the post tag.
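
To make the file layout concrete, here is a minimal sketch of how the metadata and posts might be pulled out of one blog. It assumes the fields in the file name are separated by dots and that the file is well-formed XML. The file name `12345.male.25.Student.Leo.xml`, the sample XML string, and the helper names `parse_blog_filename` and `extract_posts` are all made up for illustration, so check them against the real files before relying on them.

```python
import os
from xml.etree import ElementTree as ET


def parse_blog_filename(path):
    """Split a file name into its metadata fields.

    Assumes a dot-separated name: id.gender.age.industry.sign.xml
    (the separator is an assumption, not confirmed by the corpus docs).
    """
    fname = os.path.basename(path)
    blogger_id, gender, age, industry, star_sign, _ext = fname.split('.')
    return {
        'id': blogger_id,
        'gender': gender,
        'age': int(age),
        'industry': industry,
        'star_sign': star_sign,
    }


def extract_posts(xml_text):
    """Return the stripped text of every <post> tag, ignoring <date> tags."""
    root = ET.fromstring(xml_text)
    return [post.text.strip() for post in root.iter('post') if post.text]


# Hypothetical sample that mimics the date/post structure described above
sample_xml = """<Blog>
<date>14,May,2004</date>
<post>
    First post text.
</post>
<date>15,May,2004</date>
<post>
    Second post text.
</post>
</Blog>"""

meta = parse_blog_filename('data/blogs/12345.male.25.Student.Leo.xml')
posts = extract_posts(sample_xml)
print(meta['age'], len(posts))
```

Using `root.iter('post')` means the sketch does not depend on the name of the root tag, only on the post tag mentioned above.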

### Task
There are lots of things you could do with this, but we're going to try to build a classifier to predict a blogger's age bracket.
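
Predicting a bracket rather than an exact age means turning each numeric age into a categorical label. The sketch below shows one way this could be done. The cut-offs and the label names (`'teens'`, `'20s'`, `'30+'`) are illustrative assumptions, not necessarily the brackets used later in this notebook, and `age_bracket` is a hypothetical helper.

```python
def age_bracket(age):
    """Map a numeric age to a coarse bracket label.

    The boundaries here are assumed for illustration only.
    """
    if age < 20:
        return 'teens'
    if age < 30:
        return '20s'
    return '30+'


# Made-up ages, just to show the mapping
example_ages = [15, 24, 36, 17, 27]
example_brackets = [age_bracket(a) for a in example_ages]
print(example_brackets)
```

If the ages live in a pandas Series, the same function can be applied with `Series.map(age_bracket)`.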

### Time
- Teaching: 10 minutes
- Exercises: 50 minutes

In [1]:
%matplotlib inline
import os
import re
import glob
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from string import punctuation
from xml.etree import ElementTree as ET
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

## Read in the data

The first thing we want to do is read in all the data we'll need. We need both the text of the blog posts and the age of the blogger.

In [8]:
# Make sure to unzip the blogs.zip first
DATA_DIR = 'data/blogs'
fname_pattern = os.path.join(DATA_DIR, '*.xml')