This notebook explores feature engineering for text classification. Your task is to create two new feature functions (like `political_dictionary_feature` and `unigram_feature` below) and include them in the list of feature functions passed to `build_features`. A check grade will be given for generic features that apply across arbitrary text classification problems (e.g., a feature for bigrams); a check+ will be given for at least one feature that reveals your own understanding of your data. What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative! You are free to read in any external resources you like (dictionaries, document metadata, etc.).

Q0: Briefly describe your data (including the categories you're predicting)

In [None]:
import sys
from collections import Counter
from sklearn import preprocessing
from sklearn import linear_model
import pandas as pd
from scipy import sparse
import numpy as np

In [None]:
def read_data(filename):
    # Reads a tab-separated file with one document per line: label<TAB>text
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=cols[1]
            X.append(text)
            Y.append(label)
    return X, Y
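
A sketch of the file layout `read_data` expects, assuming one `label<TAB>text` pair per line. The labels, sentences, and file name below are made up for illustration, and the reader is a condensed copy of the function above so the example runs on its own.

```python
import os
import tempfile

def read_data(filename):
    # Condensed copy of the notebook's reader: column 0 is the label, column 1 the text
    X, Y = [], []
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip().split("\t")
            Y.append(cols[0])
            X.append(cols[1])
    return X, Y

# Made-up rows, purely for illustration
rows = ["pos\tgreat movie , loved it\n", "neg\tdull and far too long\n"]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.tsv")
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(rows)
    toyX, toyY = read_data(path)

print(toyX)  # list of document texts
print(toyY)  # list of labels, aligned with toyX
```

The two returned lists are parallel: `toyX[i]` is the text whose label is `toyY[i]`.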

In [None]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/text_classification_sample_data"

In [None]:
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)

In [None]:
def majority_class(trainY, devY):
    # Baseline: predict the most frequent training label for every dev document
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]
    
    correct=0.
    for label in devY:
        if label == majority:
            correct+=1
            
    print("%s\t%.3f" % (majority, correct/len(devY)))

Here we'll create two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.
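
As a minimal sketch of that contract (a list of tokens in, a dict of features out), the hypothetical `bigram_feature` below creates one binary feature per adjacent word pair. It is not part of the original notebook; the function name and the `BIGRAM_` key prefix are assumptions chosen to mirror the `UNIGRAM_` prefix used by `unigram_feature`.

```python
def bigram_feature(tokens):
    # Hypothetical example: one binary feature per adjacent token pair
    feats = {}
    for first, second in zip(tokens, tokens[1:]):
        feats["BIGRAM_%s_%s" % (first, second)] = 1
    return feats

example = bigram_feature(["cut", "taxes", "now"])
print(example)  # {'BIGRAM_cut_taxes': 1, 'BIGRAM_taxes_now': 1}
```

A document with fewer than two tokens simply yields an empty dict, which is a valid return value for any feature function.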

In [None]:
# Here's a sample dictionary we can create by inspecting the output of the Mann-Whitney test (in 2.compare/)
dem_dictionary=set(["republican","cut", "opposition"])
repub_dictionary=set(["growth","economy"])

def political_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"]=1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"]=1
    return feats

In [None]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats

Q1: Add your first new feature function here. Describe your feature and why you think it will help.

In [None]:
def new_feature_class_one(tokens):
    feats={}
    feats["_FILL_IN_FEATURES_HERE_"]=1
    return feats

Q2: Add your second new feature function here. Describe your feature and why you think it will help.

In [None]:
def new_feature_class_two(tokens):
    feats={}
    feats["_FILL_IN_FEATURES_HERE_"]=1
    return feats

This is the main function we'll use to aggregate the information from the different feature classes. Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class returns. (Make sure the keys each feature function creates are unique, so they aren't clobbered by other functions.)
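
To illustrate the clobbering problem, here is a small sketch with four hypothetical feature functions (none of them are in the notebook). Two of them share the generic key `count`, so `dict.update` keeps only the value written last; the prefixed versions keep both values. The `combine` helper is an assumed single-document version of the aggregation loop.

```python
def length_feature(tokens):
    # Hypothetical feature class: document length under a generic key
    return {"count": len(tokens)}

def the_feature(tokens):
    # Hypothetical feature class: reuses the same generic key
    return {"count": tokens.count("the")}

def length_feature_prefixed(tokens):
    # Same value as length_feature, but with a key unique to this class
    return {"LENGTH_count": len(tokens)}

def the_feature_prefixed(tokens):
    # Same value as the_feature, but with a key unique to this class
    return {"THE_count": tokens.count("the")}

def combine(tokens, feature_functions):
    # Merge the dicts from each feature function for a single document
    feats = {}
    for function in feature_functions:
        feats.update(function(tokens))
    return feats

tokens = "the economy and the budget".split(" ")
clobbered = combine(tokens, [length_feature, the_feature])
distinct = combine(tokens, [length_feature_prefixed, the_feature_prefixed])
print(clobbered)  # {'count': 2}: the length value was overwritten
print(distinct)   # {'LENGTH_count': 5, 'THE_count': 2}
```

Prefixing every key with the name of its feature class, as `unigram_feature` does with `UNIGRAM_`, is one simple way to avoid such collisions.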

In [None]:
def build_features(trainX, feature_functions):
    data=[]
    for doc in trainX:
        feats={}

        # sample text data is already tokenized; if yours is not, do so here
        tokens=doc.split(" ")
        
        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

In [None]:
# This helper function maps each feature name in a list of feature dicts (one per document) to a unique numerical id
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1
                
    return feature_vocab

In [None]:
# This helper function converts a list of feature dicts (one per document) to a sparse
# matrix that we can pass to a scikit-learn model.  This is important because almost all
# feature values will be 0 for most documents (note: why?), and we don't want to save
# them all in memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
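
The sketch below walks through the vocabulary and id-mapping steps on toy data. It is an approximation for illustration only: `create_vocab` is an equivalent copy of the helper above, while `to_sparse_rows` is a hypothetical stand-in for `features_to_ids` that stores each document as a `{column_id: value}` dict instead of a SciPy matrix row. The feature dicts are made up.

```python
def create_vocab(data):
    # Equivalent to the notebook helper: assign ids in first-seen order
    feature_vocab = {}
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat] = len(feature_vocab)
    return feature_vocab

def to_sparse_rows(data, feature_vocab):
    # Hypothetical stand-in for features_to_ids: one {column_id: value} dict
    # per document, keeping only features that appear in the vocabulary
    return [
        {feature_vocab[f]: v for f, v in doc.items() if f in feature_vocab}
        for doc in data
    ]

# Made-up feature dicts, one per document
train_feats = [{"UNIGRAM_tax": 1, "UNIGRAM_cut": 1}, {"UNIGRAM_growth": 1}]
dev_feats = [{"UNIGRAM_cut": 1, "UNIGRAM_unseen": 1}]

vocab = create_vocab(train_feats)
train_rows = to_sparse_rows(train_feats, vocab)
dev_rows = to_sparse_rows(dev_feats, vocab)

print(vocab)       # {'UNIGRAM_tax': 0, 'UNIGRAM_cut': 1, 'UNIGRAM_growth': 2}
print(train_rows)  # [{0: 1, 1: 1}, {2: 1}]
print(dev_rows)    # [{1: 1}]
```

Because the vocabulary is built from the training features alone, `UNIGRAM_unseen` has no column id and is silently dropped from the dev document, which matches the `if f in feature_vocab` check in `features_to_ids`.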

In [None]:
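# This function evaluates a list of feature functions: it builds features for the train and dev
# data, fits a logistic regression on the training data, and prints accuracy on the dev data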
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY))  

In [None]:
majority_class(trainY,devY)

Explore the impact of different feature functions by evaluating them below:

In [None]:
features=[political_dictionary_feature]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[political_dictionary_feature, unigram_feature]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_one]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)