This notebook contextualizes accuracy against a majority class baseline, and analyzes the most important features for classification.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn import linear_model
import numpy as np
import pandas as pd
import re
import nltk

In [3]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            # sample text is already tokenized; if yours is not, do so here
            text=cols[1] ## Note: I didn't end up tokenizing here because later countVectorizer takes raw string
            X.append(text)
            Y.append(label)
    return X, Y

In [4]:
# Change this to the directory with your data (from the CheckData_TODO.ipynb exercise).  
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/text_classification"

In [5]:
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)

Q1: Implement the majority class baseline for your data that we went over in `Hyperparameters.ipynb`

In [6]:
def majority_class(trainY, devY):
    # your code here
    
    classes = pd.Series(trainY).value_counts().index
    major_class =  pd.Series(trainY).value_counts().index[0] if (pd.Series(trainY).value_counts())[0] > (pd.Series(trainY).value_counts())[1] else trainY.value_counts().index[1] 
    
    ## calculate accuracy
    pred_table = pd.DataFrame({"pred": major_class,
                               "label": devY})
    pred_table['correct']= pred_table.pred == pred_table.label
    
    num_correct = pred_table.correct.value_counts()[True]
    accuracy = num_correct/pred_table.shape[0]
    return accuracy

In [7]:
majority_class(trainY,devY)

0.484

Q2: After experimenting with hyperparameter choices in class, what is the best accuracy that you uncovered on your development data?  Which hyperparameter choices led to that accuracy?  Plug in the values here and execute the cell to yield the accuracy. 

In [8]:
def preprocess(text):
    text = text.replace("_NEWLINE_", " ")
    text = text.replace("_TAB_", " ")
    text = re.sub(r"(?:https?:\S+)", "", text)
    text = " ".join(nltk.word_tokenize(text))
    return text

In [9]:
trainX = list(map(preprocess, trainX))
devX = list(map(preprocess, devX))

In [16]:
le = preprocessing.LabelEncoder()
le.fit(trainY)
Y_train=le.transform(trainY)
Y_dev=le.transform(devY)

# split the string on whitespace because we assume it has already been tokenized
vectorizer = CountVectorizer(max_features=10000, analyzer=str.split, lowercase=False, strip_accents=None, binary=True)

X_train = vectorizer.fit_transform(trainX)
X_dev = vectorizer.transform(devX)
logreg = linear_model.LogisticRegression(C=5, solver='lbfgs', penalty='l2')
model=logreg.fit(X_train, Y_train)
print("Accuracy: %.3f" % logreg.score(X_dev, Y_dev))


Accuracy: 0.604


Q3: For binary classification using logistic regression, the parameters of the learned model are given in `model.coef_[0]`.  Print out the 25 features that are most associated with each class (i.e., the 25 parameters that have the largest positive values and the 25 parameters with largest negative values).  For reference, consider the `inverse_transform` function in [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.transform) to get the class labels that correspond to positive(=1) and negative(=0), and the `vocabulary_` function in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to yield the index for each vocabulary term.


In [23]:
def analyze_weights(learned_model, label_encoder, count_vectorizer):
    # your code here
    if pd.Series(Y_dev).value_counts().index[0] == 0:
        pos_class = pd.Series(le.inverse_transform(Y_dev)).value_counts().index[1]
        neg_class = pd.Series(le.inverse_transform(Y_dev)).value_counts().index[0]
    else:
        pos_class = pd.Series(le.inverse_transform(Y_dev)).value_counts().index[0]
        neg_class = pd.Series(le.inverse_transform(Y_dev)).value_counts().index[1]
        
    feature_table = pd.DataFrame({"vocab": list(count_vectorizer.vocabulary_.keys()),
                                 "coef": list(learned_model.coef_[0])})
    sorted_tbl = feature_table.sort_values("coef")
    num_example = 25
    print("***Most indicative of %s:" % neg_class)
    
    for i in range(num_example):
        print("%s : %s" % (sorted_tbl.iloc[i,].vocab, sorted_tbl.iloc[i,].coef))
    
    print("\n***Most indicative of %s:" % pos_class)
    for i in range(num_example):
        print("%s : %s" % (sorted_tbl.iloc[-(i+1),].vocab, sorted_tbl.iloc[-(i+1),].coef))
    return
    

In [24]:
analyze_weights(model, le, vectorizer)

***Most indicative of female:
cuban : -2.7776604888510605
connect : -2.6684411415710656
Luiy : -2.5223663585854856
äìî¥Ÿ_ôÖ_ : -2.457317325795894
productivity : -2.3859072586970136
TheFoursBar : -2.3301905976389934
_ôÖ‰_ôÖ‰ : -2.2156273495385093
Expect : -2.1923208167671002
Lockup : -2.1792412822869416
gradually : -2.153857759777557
Spotify : -2.1210152635398045
SaccWhack : -2.107796295753613
resolutioncomplete : -2.081467834658541
civil : -2.056021933786774
trip : -2.053072173318958
1/26 : -2.041208367309916
STORY : -1.9933183836582529
manly : -1.9920504013133007
announcements : -1.9480358133325983
charities : -1.9447137321564663
delivery : -1.9246644514904652
Hass_Dinerroo : -1.909367400886759
regs : -1.9036663939715572
Discover : -1.8495071095273072
BJJ : -1.843915708085739

***Most indicative of male:
somehow : 2.844051864237782
shoplift : 2.785507222774734
NewYearsResolution.. : 2.5506861320265535
August : 2.3705265270440536
attract : 2.317958348064095
manner : 2.2887254448366865

  if diff:
  if diff:
