This notebook explores feature engineering for text classification. Your task is to create two new feature functions (like `political_dictionary_feature` and `unigram_feature` below), and include them in the `build_features` function. A check grade will be given to generic features that apply across arbitrary text classification problems (e.g., a feature for bigrams); check+ will be given for at least one feature that reveals your own understanding of your data. What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative! You are free to read in any other external resources you like (dictionaries, document metadata, etc.).

Q0: Briefly describe your data (including the categories you're predicting)

In [143]:
import sys
from collections import Counter
from sklearn import preprocessing
from sklearn import linear_model
import pandas as pd
from scipy import sparse
import numpy as np
from nltk.tokenize import casual_tokenize

In [25]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=cols[1]
            X.append(text)
            Y.append(label)
    return X, Y

In [26]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/text_classification_sample_data"
yahoo_dir = "../data/text_classification/yahoo"

In [27]:
trainX, trainY=read_data("%s/train.tsv" % yahoo_dir)
devX, devY=read_data("%s/dev.tsv" % yahoo_dir)

In [162]:
data = pd.DataFrame({"qn": trainX, "label": trainY})

In [164]:
data[data.label=='1']

Unnamed: 0,qn,label
0,Did you hear about my buddy Bill Fold?,1
1,Does the word 'prego' or 'preggers' annoy you?,1
3,Fancy Dress ideas for Bestival?,1
7,What will happen when we run out of lolliepops?,1
8,What's the absolutely funniest cat video ever?...,1
15,Question about the show 'The Boondocks' on Car...,1
16,Can evolution and Christianity really coexist ...,1
17,How to succeed in the us?,1
18,What's some prons and cons of being a Lesbian ...,1
19,Now read this freestyle poem for some of yall.?,1


In [23]:
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]
    
    correct=0.
    for label in devY:
        if label == majority:
            correct+=1
            
    print("%s\t%.3f" % (majority, correct/len(devY)))

Here we'll create two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.

In [29]:
# Here's a sample dictionary we can create by inspecting the output of the Mann-Whitney test (in 2.compare/)
dem_dictionary=set(["republican","cut", "opposition"])
repub_dictionary=set(["growth","economy"])

def political_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"]=1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"]=1
    return feats

In [30]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats
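
The assignment text mentions bigrams as an example of a generic feature. A minimal sketch in the same style as `unigram_feature` might look like the following; the name `bigram_feature` and the `BIGRAM_` key prefix are illustrative choices, not part of the original notebook.

```python
def bigram_feature(tokens):
    # Hypothetical helper (not in the original notebook): a binary
    # indicator for each pair of adjacent tokens.
    feats = {}
    for first, second in zip(tokens, tokens[1:]):
        feats["BIGRAM_%s_%s" % (first, second)] = 1
    return feats


example = bigram_feature(["did", "you", "hear", "about", "you", "hear"])
print(sorted(example))
```

Because the keys start with `BIGRAM_`, they cannot collide with the `UNIGRAM_` keys when both functions are passed to `build_features`.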

Q1: Add first new feature function here.  Describe your feature and why you think it will help.

A note: I have found a new dataset that interests me more. It is a set of about 1,000 questions asked on the Yahoo Answers platform. Each question is labeled 0 if it is informational and 1 if it is conversational.

The first feature I am creating is the number of second-person pronouns in a question (for example, "you" and "your"), which are strongly indicative of conversational intent. After some experimentation, I also added third-person pronouns, which improved prediction.

In [156]:
PRONOUNS = ['you', 'your', 'he', 'him', 'his', 'she', 'her', 'hers', 'they', 'their', 'theirs']
def new_feature_class_one(tokens):
    feats={}
    num_pronouns = 0
    for word in tokens:
        if word in PRONOUNS:
            num_pronouns += 1
    
    feats["num_pronouns"]= num_pronouns 
    return feats
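
Since the description treats second-person and third-person pronouns as separate ideas, one possible variant keeps a count for each so the model can weight them independently. This is only a sketch: the function name and the two feature keys are hypothetical, and the word lists are simply the ones from `PRONOUNS` split by person.

```python
SECOND_PERSON = {"you", "your"}
THIRD_PERSON = {"he", "him", "his", "she", "her", "hers",
                "they", "their", "theirs"}


def pronoun_counts_by_person(tokens):
    # Hypothetical variant of new_feature_class_one: same words,
    # but one count per grammatical person instead of a single total.
    feats = {"num_second_person": 0, "num_third_person": 0}
    for word in tokens:
        if word in SECOND_PERSON:
            feats["num_second_person"] += 1
        elif word in THIRD_PERSON:
            feats["num_third_person"] += 1
    return feats


tokens = ["does", "the", "word", "prego", "annoy", "you", "or", "her", "?"]
print(pronoun_counts_by_person(tokens))
```

Whether the split helps on this dataset is untested here; it would need to be run through `pipeline` like the other features.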

Q2: Add second new feature function here. Describe your feature and why you think it will help.

In [180]:
## Tried word count, it was bad
## Sentiment polarity?
POLAR_THRESHOLD = 2
def read_AFINN_dictionary(filename):
    neutral=[]
    polar=[]
    with open(filename) as file:
        for line in file:
            cols=line.rstrip().split("\t")
            word=cols[0]
            value=int(cols[1])
            if value < -POLAR_THRESHOLD or value > POLAR_THRESHOLD:
                polar.append(word)
            else:
                neutral.append(word)
    
    return set(neutral), set(polar)
neutral, polar=read_AFINN_dictionary("../data/AFINN-111.txt")

def new_feature_class_two(tokens):
    feats={}
    num_polar = 0
    num_neutral = 0
    for word in tokens:
        if word in neutral:
            num_neutral +=1
        if word in polar:
            num_polar +=1
    feats["num_polar"] = num_polar
    #feats['num_neutral']  = num_neutral
    return feats
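
To make the threshold concrete: the AFINN file has one `word<TAB>score` entry per line with integer scores, so `POLAR_THRESHOLD = 2` means only words scored 3 or more in either direction count as polar, and mildly positive or negative words land in the "neutral" set. The sketch below applies the same rule to an in-memory sample; the entries are made-up placeholders, not values copied from AFINN-111.

```python
import io

POLAR_THRESHOLD = 2

# Made-up entries in the "word<TAB>score" layout; the scores are
# placeholders, not values taken from AFINN-111.
sample = io.StringIO("wordA\t-4\nwordB\t-2\nwordC\t1\nwordD\t3\n")

polar_words = set()
neutral_words = set()
for line in sample:
    word, value = line.rstrip().split("\t")
    # Same rule as read_AFINN_dictionary: strictly beyond the threshold.
    if abs(int(value)) > POLAR_THRESHOLD:
        polar_words.add(word)
    else:
        neutral_words.add(word)

print(sorted(polar_words))
print(sorted(neutral_words))
```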

This is the main function we'll use to aggregate together all of the information from different feature classes.  Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning.  (Here you want to make sure that the keys each feature function is creating are unique so they don't get clobbered by other functions).

In [145]:
## my notes: This is creating a dictionary of feature values for each row
def build_features(trainX, feature_functions):
    data=[]

    ## creating feature values row by row
    for doc in trainX:
        feats={}

        # sample text data is already tokenized; if yours is not, do so here
        
        tokens=[token.lower() for token in casual_tokenize(doc)]
        ## TODO: in the future, can preprocess more (words attached to punctuation)
        
        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data
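
The warning about unique keys can be seen with a small experiment. Since `build_features` merges each returned dict with `update`, a later function silently overwrites an earlier one when both use the same key. The three feature functions below are hypothetical and exist only to show the effect.

```python
def feature_a(tokens):
    # Hypothetical feature: number of tokens.
    return {"count": len(tokens)}


def feature_b(tokens):
    # Hypothetical feature that reuses the key "count" by mistake.
    return {"count": sum(1 for t in tokens if t == "?")}


def feature_b_fixed(tokens):
    # Same value as feature_b, stored under its own key.
    return {"num_question_marks": sum(1 for t in tokens if t == "?")}


tokens = ["how", "to", "succeed", "?"]

clobbered = {}
for function in (feature_a, feature_b):
    clobbered.update(function(tokens))

safe = {}
for function in (feature_a, feature_b_fixed):
    safe.update(function(tokens))

print(clobbered)  # the token count from feature_a is gone
print(safe)       # both values survive
```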

In [13]:
# **** my note: mapping each unseen feature to an index along the way 
# This helper function converts a dictionary of feature names to unique numerical ids
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1
                
    return feature_vocab

In [14]:
# my note: finally understood what this is doing; lil_matrix is a sparse matrix format that stores
# each row as a list of its nonzero entries ("list of lists"), so only nonzero values take up memory

# This helper function converts a dictionary of feature names to a sparse representation
# that we can fit in a scikit-learn model.  This is important because almost all feature 
# values will be 0 for most documents (note: why?), and we don't want to save them all in 
# memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
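
On the "why?" in the comment above: each document contains only a handful of the words in the whole vocabulary, so most cells in the document-by-feature matrix are zero. A rough way to see this is to count nonzero cells on a few toy feature dicts. The documents below are invented for illustration, and the vocabulary is built the same way `create_vocab` does it.

```python
# Invented toy feature dicts, one per document.
toy_docs = [
    {"UNIGRAM_how": 1, "UNIGRAM_to": 1, "UNIGRAM_succeed": 1},
    {"UNIGRAM_fancy": 1, "UNIGRAM_dress": 1, "UNIGRAM_ideas": 1},
    {"UNIGRAM_how": 1, "UNIGRAM_cat": 1, "UNIGRAM_video": 1},
]

# Assign each new feature the next free column index.
vocab = {}
for doc in toy_docs:
    for feat in doc:
        vocab.setdefault(feat, len(vocab))

total_cells = len(toy_docs) * len(vocab)
nonzero_cells = sum(len(doc) for doc in toy_docs)
density = nonzero_cells / total_cells
print("%d of %d cells are nonzero (%.3f)" % (nonzero_cells, total_cells, density))
```

With three tiny documents the matrix is already mostly zeros; with about a thousand questions and well over a thousand unigram features, the nonzero fraction becomes far smaller.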

In [86]:
# This function evaluates a list of feature functions on the training/dev data arguments
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY))  
    return logreg, feature_vocab

In [48]:
majority_class(trainY,devY)

0	0.559


Explore the impact of different feature functions by evaluating them below:

In [80]:
features=[political_dictionary_feature]
pipeline(trainX, devX, trainY, devY, features)

## This equals the baseline because the politically affiliated words probably didn't appear in the text at all

Accuracy: 0.559


In [146]:
features=[political_dictionary_feature, unigram_feature]
mod, feature_dict = pipeline(trainX, devX, trainY, devY, features) 

Accuracy: 0.725


## This is for my own investigation:

In [147]:
tb  = pd.DataFrame({"coef": mod.coef_[0], "feature": list(feature_dict.keys())})
print("Most informational unigrams:")
print(tb.sort_values(by = "coef")[:15])
print("\nMost conversational unigrams:")
print(tb.sort_values(by = "coef", ascending = False)[:15])

Most informational unigrams:
          coef           feature
92   -1.210408  UNIGRAM_pregnant
43   -0.981028      UNIGRAM_help
184  -0.973434      UNIGRAM_from
165  -0.948730      UNIGRAM_math
35   -0.906279     UNIGRAM_where
1088 -0.856271       UNIGRAM_car
283  -0.850324  UNIGRAM_computer
378  -0.688535      UNIGRAM_song
17   -0.684154       UNIGRAM_how
200  -0.679598        UNIGRAM_it
880  -0.678904  UNIGRAM_dandruff
1383 -0.672418       UNIGRAM_i'd
833  -0.668898   UNIGRAM_macbook
1621 -0.666162    UNIGRAM_iphone
1441 -0.658944    UNIGRAM_emails

Most conversational unigrams:
          coef             feature
292   1.262289       UNIGRAM_names
295   1.260988        UNIGRAM_girl
750   1.220323        UNIGRAM_year
638   1.181029       UNIGRAM_don't
1307  1.173219  UNIGRAM_girlfriend
54    1.090171          UNIGRAM_we
1     1.075641         UNIGRAM_you
397   1.057599       UNIGRAM_think
209   1.052321          UNIGRAM_he
1445  1.024545      UNIGRAM_scared
239   1.018263      UNIGRAM
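
A note on how the table above lines up: pairing `mod.coef_[0]` with `feature_dict.keys()` works because `create_vocab` hands out indices in insertion order and Python dicts (3.7+) preserve that order. If the vocabulary were ever built or reordered differently, sorting the names by their column index would keep the pairing explicit. The sketch below uses invented toy coefficients and a deliberately shuffled toy vocabulary.

```python
# Toy vocabulary whose insertion order does not match the column order.
feature_vocab = {"UNIGRAM_you": 1, "UNIGRAM_did": 0, "UNIGRAM_hear": 2}
coefs = [0.1, 0.9, -0.3]  # invented values, one per column index

# Order feature names by the column they occupy in the matrix.
names_by_column = sorted(feature_vocab, key=feature_vocab.get)

# Pair each coefficient with its feature name, lowest coefficient first.
ranked = sorted(zip(coefs, names_by_column))
for coef, name in ranked:
    print("%8.3f  %s" % (coef, name))
```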

In [154]:
features=[new_feature_class_one]
mod1, f_dict1 = pipeline(trainX, devX, trainY, devY, features)
print("coefficient for presence of second-person pronouns %f" % mod1.coef_[0][0])

Accuracy: 0.686
coefficient for presence of second-person pronouns 0.906653


In [155]:
features=[unigram_feature, new_feature_class_one]
mod1, f_dict1 = pipeline(trainX, devX, trainY, devY, features)
tb1  = pd.DataFrame({"coef": mod1.coef_[0], "feature": list(f_dict1.keys())})
print("Most informational features:")
print(tb1.sort_values(by = "coef")[:15])
print("\nMost conversational features:")
print(tb1.sort_values(by = "coef", ascending = False)[:15])

Accuracy: 0.735
Most informational features:
          coef           feature
93   -1.210675  UNIGRAM_pregnant
185  -1.008397      UNIGRAM_from
36   -0.994127     UNIGRAM_where
44   -0.970349      UNIGRAM_help
166  -0.940137      UNIGRAM_math
284  -0.840340  UNIGRAM_computer
1089 -0.827120       UNIGRAM_car
201  -0.762077        UNIGRAM_it
379  -0.727824      UNIGRAM_song
274  -0.722435         UNIGRAM_x
18   -0.691610       UNIGRAM_how
222  -0.688498     UNIGRAM_sites
881  -0.676616  UNIGRAM_dandruff
1622 -0.669261    UNIGRAM_iphone
1384 -0.666510       UNIGRAM_i'd

Most conversational features:
          coef             feature
296   1.276150        UNIGRAM_girl
751   1.270171        UNIGRAM_year
293   1.244788       UNIGRAM_names
1308  1.153987  UNIGRAM_girlfriend
639   1.152731       UNIGRAM_don't
55    1.123358          UNIGRAM_we
9     1.099396        num_pronouns
398   1.048428       UNIGRAM_think
85    0.999019          UNIGRAM_so
1446  0.983579      UNIGRAM_scared
820   0.937

In [181]:
features=[new_feature_class_two]
mod2, f_dict2 = pipeline(trainX, devX, trainY, devY, features)
mod2.coef_

Accuracy: 0.608


array([[0.68315631]])

In [182]:
features=[new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.725


(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=10000, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
           verbose=0, warm_start=False), {'num_polar': 1, 'num_pronouns': 0})

In [160]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.716


(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=10000, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
           verbose=0, warm_start=False),
 {'UNIGRAM_did': 0,
  'UNIGRAM_you': 1,
  'UNIGRAM_hear': 2,
  'UNIGRAM_about': 3,
  'UNIGRAM_my': 4,
  'UNIGRAM_buddy': 5,
  'UNIGRAM_bill': 6,
  'UNIGRAM_fold': 7,
  'UNIGRAM_?': 8,
  'num_pronouns': 9,
  'word_count': 10,
  'UNIGRAM_does': 11,
  'UNIGRAM_the': 12,
  'UNIGRAM_word': 13,
  "UNIGRAM_'": 14,
  'UNIGRAM_prego': 15,
  'UNIGRAM_or': 16,
  'UNIGRAM_preggers': 17,
  'UNIGRAM_annoy': 18,
  'UNIGRAM_how': 19,
  'UNIGRAM_to': 20,
  'UNIGRAM_draw': 21,
  'UNIGRAM_goth': 22,
  'UNIGRAM_anime': 23,
  'UNIGRAM_girls': 24,
  'UNIGRAM_fancy': 25,
  'UNIGRAM_dress': 26,
  'UNIGRAM_ideas': 27,
  'UNIGRAM_for': 28,
  'UNIGRAM_bestival': 29,
  'UNIGRAM_can': 30,
  'UNIGRAM_i': 31,
  'UNIGRAM_get': 32,
  'UNIGRAM_yahoo': 33,
