This notebook explores text classification, introducing a majority class baseline and analyzing the effect of hyperparameter choices on accuracy.

In [None]:
import sys
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn import linear_model
import pandas as pd
import numpy as np

In [None]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=cols[1]
            # sample text data is already tokenized; if yours is not, do so here            
            X.append(text)
            Y.append(label)
    return X, Y

In [None]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/text_classification_sample_data"

In [None]:
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)

In [None]:
def majority_class(trainY, devY):
    # your code here
    pass  # placeholder so the cell runs; replace with your implementation

Baselines are a critical point of reference for understanding how well a text classification method is performing.  One of the simplest is the *majority class* baseline: for every point in the test data, predict the label that shows up most frequently **in the training data**.  Implement that baseline for your data.
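As a point of comparison for your own implementation, here is a minimal sketch of the idea. The function name `majority_class_baseline` and the toy labels are illustrative assumptions, not part of the exercise; the one important detail is that the majority label is chosen from the training labels only.

```python
from collections import Counter


def majority_class_baseline(train_labels, eval_labels):
    """Return (majority label, accuracy of always predicting it on eval_labels)."""
    # The label is picked from the training data only, never from the evaluation data.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, correct / len(eval_labels)


# Toy labels for illustration only.
label, accuracy = majority_class_baseline(["pos", "pos", "neg"], ["neg", "pos", "neg", "neg"])
print(label, accuracy)  # pos 0.25
```

In the toy call the training majority is `pos`, but most evaluation labels are `neg`, so the baseline scores only 0.25. This is why the label must come from the training data rather than from the set being evaluated.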

In [None]:
majority_class(trainY,devY)

Scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html) is a convenient utility for evaluating performance across a range of parameters.  For more control, let's write our own grid search here.  Explore the performance for different parameter settings of [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (e.g., binary features, stopword removal, lowercasing).
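One way to structure such a search is sketched below; it is a starting point under stated assumptions, not the intended solution. The helper name `vectorizer_grid_search`, the parameter grid, and the toy corpus are all made up for illustration. One caveat to be aware of: to my understanding, when `analyzer` is a callable such as `str.split`, CountVectorizer skips its built-in preprocessing, so options like `lowercase` and `stop_words` have no effect. The sketch therefore passes `tokenizer=str.split` instead, which keeps whitespace tokenization while leaving those options active.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def vectorizer_grid_search(train_texts, train_labels, dev_texts, dev_labels, param_grid):
    """Fit one model per CountVectorizer setting; return (params, dev accuracy) pairs."""
    results = []
    keys = sorted(param_grid)
    # product(...) enumerates every combination of the listed parameter values
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        # tokenizer=str.split splits on whitespace (text is assumed pre-tokenized)
        vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None, **params)
        X_train = vectorizer.fit_transform(train_texts)
        X_dev = vectorizer.transform(dev_texts)
        model = LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000)
        model.fit(X_train, train_labels)
        results.append((params, model.score(X_dev, dev_labels)))
    return results


# Tiny made-up corpus, only to show the call; substitute trainX/trainY/devX/devY.
toy_train_X = ["good movie", "great film", "bad movie", "awful film"]
toy_train_Y = ["pos", "pos", "neg", "neg"]
toy_dev_X = ["Good film", "Bad film"]
toy_dev_Y = ["pos", "neg"]

grid = {"binary": [True, False], "lowercase": [True, False]}
results = vectorizer_grid_search(toy_train_X, toy_train_Y, toy_dev_X, toy_dev_Y, grid)
for params, acc in results:
    print(params, acc)
```

Other CountVectorizer options (for example `stop_words` or `max_features`) can be added to the grid dictionary in the same way; each extra key multiplies the number of trials.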

In [None]:
scores=[]
names=[]

feat_vals=[50, 100, 500, 1000, 5000, 10000, 50000]

le = preprocessing.LabelEncoder()
le.fit(trainY)
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)

idx=0

for feat_val in feat_vals:

    # split the string on whitespace because we assume it has already been tokenized
    vectorizer = CountVectorizer(max_features=feat_val, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)

    X_train = vectorizer.fit_transform(trainX)
    X_dev = vectorizer.transform(devX)

    print("%s of %s trials" % (idx + 1, len(feat_vals)))

    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2')
    logreg.fit(X_train, Y_train)
    scores.append(logreg.score(X_dev, Y_dev))
    names.append("feat_value:%s" % (feat_val))
    idx+=1

In [None]:
# Let's plot these results (may need to execute twice to display graph)
pd_results=pd.DataFrame({"value":names, "accuracy":scores})
pd_results.plot.bar(x='value', y='accuracy', figsize=(14,6))
pd_results

Some parameters interact with each other (like the number of features and the regularization strength). Perform grid search over a combination of parameters to evaluate how their interaction affects accuracy.
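A flat bar chart with one bar per combination can make interactions hard to read. As an optional aid, the sketch below rearranges a flat list of scores into a table with one row per `max_features` value and one column per `C` value. The helper name `scores_to_grid` is an assumption for illustration, and it relies on the scores having been appended with the feature loop outside and the `C` loop inside.

```python
import numpy as np
import pandas as pd


def scores_to_grid(feat_vals, C_values, scores):
    """Arrange a flat score list into a max_features x C table."""
    # Assumes scores were appended with feat_vals as the outer loop and
    # C_values as the inner loop, so each row holds one max_features setting.
    table = np.asarray(scores, dtype=float).reshape(len(feat_vals), len(C_values))
    grid = pd.DataFrame(table, index=feat_vals, columns=C_values)
    grid.index.name = "max_features"
    grid.columns.name = "C"
    return grid
```

After the grid search loop has run, calling `scores_to_grid(feat_vals, C_values, scores)` gives a table in which reading along a row shows the effect of `C` at a fixed vocabulary size, and reading down a column shows the effect of vocabulary size at a fixed `C`.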

In [None]:
scores=[]
names=[]

feat_vals=[50, 100, 500, 1000, 5000, 10000, 50000]
C_values=[0.001, 0.1, 1, 5, 10]

le = preprocessing.LabelEncoder()
le.fit(trainY)
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)

idx=0

for feat_val in feat_vals:

    # split the string on whitespace because we assume it has already been tokenized
    vectorizer = CountVectorizer(max_features=feat_val, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)

    X_train = vectorizer.fit_transform(trainX)
    X_dev = vectorizer.transform(devX)

    for C_val in C_values:
        
        print("%s of %s trials" % (idx + 1, len(feat_vals)*len(C_values)))

        logreg = linear_model.LogisticRegression(C=C_val, solver='lbfgs', penalty='l2')
        logreg.fit(X_train, Y_train)
        scores.append(logreg.score(X_dev, Y_dev))
        names.append("feat_value:%s-C:%s" % (feat_val, C_val))
        idx+=1

In [None]:
pd_results=pd.DataFrame({"value":names, "accuracy":scores})
pd_results.plot.bar(x='value', y='accuracy', figsize=(14,6))
pd_results