This notebook explores feature engineering for text classification.  Your task is to create two new feature functions (like `dictionary_feature` and `unigram_feature` below), and include them in the `build_features` function.  A check grade will be given to generic features that apply across arbitrary text classification problems (e.g., a feature for bigrams); check+ will be given for at least one feature that reveals your own understanding of the data. What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative!  You are free to read in any other external resources you like (dictionaries, document metadata, etc.)

You are free to use any of the following datasets for this exercise, or to use your own (if you have your own labeled data, I would encourage you to use it!).  If you use your own data, just be sure to format it like the examples below; each directory has a `train.tsv`, `dev.tsv` and `test.tsv` file, where each file is tab-separated (label in the first column and text in the second column).

* [Sentiment Analysis](https://ai.stanford.edu/~amaas/data/sentiment/) (Positive/Negative): `data/lmrd`
* [Congressional Speech](https://www.cs.cornell.edu/home/llee/data/convote.html) (Democrat/Republican): `data/convote`
* Library of Congress Subject Classification ([21 categories](https://en.wikipedia.org/wiki/Library_of_Congress_Classification)): `data/loc`
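
For anyone preparing their own data, the expected file layout can be sketched as follows. This is a minimal illustration, not part of the assignment code: the two example rows and the temporary file are made up, and the text is left as a raw string so the snippet runs without NLTK.

```python
import os
import tempfile

# Two hypothetical labeled examples: label first, then a tab, then the text.
rows = [
    ("pos", "A moving film with sublime acting ."),
    ("neg", "A dull waste of two hours ."),
]

# Write them in the same layout as train.tsv / dev.tsv / test.tsv.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as f:
    for label, text in rows:
        f.write(f"{label}\t{text}\n")
    path = f.name

# Read the file back: one example per line, split once on the tab character.
labels, texts = [], []
with open(path, encoding="utf-8") as f:
    for line in f:
        label, text = line.rstrip("\n").split("\t", 1)
        labels.append(label)
        texts.append(text)

os.remove(path)  # clean up the temporary file
print(labels)
print(texts[0])
```

Splitting with a limit of 1 keeps any additional tab inside the text from being cut off, which is a small design choice of this sketch.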
 

**Q0**: Briefly describe your data (including the categories you're predicting).  If you're using your own data, tell us about it; if you're using one of the datasets above, tell us something that shows you've looked at the data.

The dataset consists of movie reviews from IMDB, each paired with a binary sentiment label, either 'positive' or 'negative'. According to the dataset description, it contains 50,000 reviews split evenly into 25k for training and 25k for testing. The dataset is also balanced: overall there are 25k positive and 25k negative reviews.
This binary sentiment is the category we are trying to predict with our model. One challenge is that many of these reviews are long and go beyond judging the quality of the movie: they often summarize or discuss the storyline and express preferences for specific actors and actresses. For example:

*``The movie Titanic makes it much more then just a  "night to remember" . It re writes a tragic history event that will always be talked about and will never been forgotten . Why so criticised ? I have no idea . Could/will they ever make a movie like Titanic that is so moving and touching every time you watch it . Could they ever replace such an epic masterpiece . It will be almost impossible . The director no doubt had the major impact on the film . A simple disaster film ( boring to watch ) converted to an unbelievable romance . Yes I 'm not the Romance type either , but that should not bother you , because you will never see a romance like this . Guaranteed ! Everything to the amazing effects , to the music , to the sublime acting . The movie creates an amazing visual and a wonderful feeling . Everything looks very real and live . The legend herself "TITANIC" is shown brilliantly in all classes , too looks , too accommodation . The acting was the real effect . Dicaprio and Winslet are simply the best at playing there roles . No one could have done better . They are partly the reason why the film is so great . I guess it 's not too much to talk about . The plot is simple , The acting is brilliant , based on a true story , Probably more then half of the consumers that watch the film will share tears , thanks to un imaginable ending which can never be forgotten . Well if you have n't seen this film your missing out on something Hesterical , and a film to idolise for Hollywood . Could it get better ? No . Not at all . The most moving film of all time , do n't listen to people , see for yourself then you will understand . A landmark . ( do n't be surprised if you cry too )"``

Therefore, as presented below, in one of my features I will try to take advantage of this review-writing style and test whether a review's sentiment can be predicted from the names of the actors it mentions.

In [1]:
import sys
from collections import Counter
import operator
from sklearn import preprocessing
from sklearn import linear_model
from nltk import word_tokenize
import pandas as pd
from scipy import sparse
import numpy as np
from math import sqrt 

In [2]:
def read_data_old(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=word_tokenize(cols[1])
            X.append(text)
            Y.append(label)
    return X, Y

In [43]:
def read_data(filename, maximum_data_points = 1000):
    X = []
    Y = []
    data_count = 0  # Counter for the number of data points read

    with open(filename, encoding="utf-8") as file:
        for line in file:
            if data_count >= maximum_data_points:
                break  # Exit the loop if the maximum number of data points is reached

            cols = line.rstrip().split("\t")
            label = cols[0]
            text = word_tokenize(cols[1])
            X.append(text)
            Y.append(label)
            data_count += 1  # Increment the data point counter

    return X, Y

In [9]:
# Change this to the directory with the data you will be using.
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/lmrd"

In [44]:
trainX, trainY=read_data("%s/train.tsv" % directory, maximum_data_points=1000)
devX, devY=read_data("%s/dev.tsv" % directory, maximum_data_points=1000)

In [45]:
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]
    
    correct=0.
    for label in devY:
        if label == majority:
            correct+=1
            
    print("%s\t%.3f" % (majority, correct/len(devY)))

Here we'll create two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.

In [6]:
# Here's a sample dictionary if we were using the convote political data
dem_dictionary=set(["republican","cut", "opposition"])
repub_dictionary=set(["growth","economy"])

def political_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"]=1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"]=1
    return feats

In [13]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats
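
To make the return value concrete, the sketch below applies `unigram_feature` to a short token list. The function is repeated so the snippet runs on its own, and the example tokens are made up for illustration. Because the value is always 1, a repeated word contributes a single binary feature rather than a count.

```python
def unigram_feature(tokens):
    # Same definition as in the cell above: one binary feature per word type.
    feats = {}
    for word in tokens:
        feats["UNIGRAM_%s" % word] = 1
    return feats

# Hypothetical tokenized text; "great" appears twice.
tokens = ["a", "great", ",", "great", "film"]
feats = unigram_feature(tokens)

# Five tokens, but only four distinct features, each with value 1.
for name in sorted(feats):
    print(name, feats[name])
```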

**Q1**: Add first new feature function here.  Describe your feature and why you think it will help.

In [14]:
# Generic bigram feature that can be applied to arbitrary text
def new_feature_class_one(tokens):
    feats = {}
    for i in range(len(tokens) - 1):
        bigram = f"BIGRAM_{tokens[i]}_{tokens[i + 1]}"
        feats[bigram] = 1
    return feats
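
The same idea extends to longer word sequences. The sketch below is one possible generalization, not part of the submitted features: the name `ngram_feature`, its `n` parameter, and the example tokens are assumptions introduced here for illustration.

```python
from functools import partial

def ngram_feature(tokens, n=2):
    """Binary features for every contiguous run of n tokens (hypothetical helper)."""
    feats = {}
    for i in range(len(tokens) - n + 1):
        feats["%dGRAM_%s" % (n, "_".join(tokens[i:i + n]))] = 1
    return feats

# Feature functions in this notebook take a single argument (the token list),
# so a fixed n can be bound with functools.partial.
trigram_feature = partial(ngram_feature, n=3)

tokens = ["not", "a", "good", "film"]
print(sorted(ngram_feature(tokens)))    # three bigram features
print(sorted(trigram_feature(tokens)))  # two trigram features
```

A document shorter than `n` tokens simply yields an empty dict, since the range in the loop is empty.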

**Q2**: Add second new feature function here. Describe your feature and why you think it will help.

In [15]:
# Looking at the IMDB reviews, I noticed that many of them include a brief summary of the movie
# or comments on the storyline, not only on the quality of the movie. So I would expect some of
# the sentiment labels to be related not only to the movie's quality but also to whether the movie
# is a happy or a sad one. I therefore propose checking each review for the names of actors and
# actresses associated with either happy or sad movies and using that as a feature.
#
# Dictionary feature based on samples from lists of the funniest actors and actresses
# (https://www.imdb.com/list/ls051583078/; https://www.imdb.com/list/ls062039274/) and samples
# from lists of actors/actresses known for sad movies
# (https://www.therichest.com/expensive-lifestyle/10-male-actors-who-always-cry-in-their-movies/;
# https://www.buzzfeed.com/noradominick/actors-amazing-performances-sad-tv-scenes)
sad_movies_actors_and_actresses = ["hanks", "dicaprio", "depp", "phoenix", "crowe", "streep", "blanchett", "kidman", "theron", "moore"]
happy_movies_actors_and_actresses = ["ferrell", "carrey", "sandler", "pratt", "johnson", "witherspoon", "aniston", "bullock", "adams", "roberts"]


def new_feature_class_two(tokens):
    feats={}
    for name in tokens:
        if name in sad_movies_actors_and_actresses:
            feats["name_in_sad_movie_dictionary"]=1
        if name in happy_movies_actors_and_actresses:
            feats["name_in_happy_movie_dictionary"]=1
    return feats
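
One thing to keep in mind about this feature: `word_tokenize` preserves capitalization (the unigram weights printed later include `UNIGRAM_My`), while the name lists are lowercase, so a mention such as "Dicaprio" in the review quoted earlier would not match. A possible case-insensitive variant is sketched below. The function name `actor_name_feature` and the shortened name sets are assumptions for illustration, and this variant was not used to produce any of the accuracies reported in this notebook.

```python
# Shortened, illustrative subsets of the two lists above, stored as sets
# for constant-time membership tests.
sad_names = {"hanks", "dicaprio", "streep"}
happy_names = {"ferrell", "carrey", "sandler"}

def actor_name_feature(tokens):
    """Case-insensitive version of the actor-name dictionary feature."""
    feats = {}
    for token in tokens:
        name = token.lower()  # e.g. "Dicaprio" becomes "dicaprio"
        if name in sad_names:
            feats["name_in_sad_movie_dictionary"] = 1
        if name in happy_names:
            feats["name_in_happy_movie_dictionary"] = 1
    return feats

tokens = ["Dicaprio", "and", "Winslet", "are", "simply", "the", "best"]
print(actor_name_feature(tokens))
```

Lowercasing also makes accidental matches more likely for surnames that double as common words or names, so the trade-off is worth checking on the data.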

This is the main function we'll use to aggregate together all of the information from different feature classes.  Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning.  (Here you want to make sure that the keys each feature function is creating are unique so they don't get clobbered by other functions).

In [16]:
def build_features(trainX, feature_functions):
    data=[]
    for tokens in trainX:
        feats={}

        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data
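
The warning about unique keys can be illustrated with a small sketch. The three feature functions below are hypothetical and exist only to show what `dict.update` does when two functions emit the same key.

```python
def feature_a(tokens):
    return {"length": len(tokens)}          # number of tokens

def feature_b(tokens):
    return {"length": len(set(tokens))}     # number of distinct tokens, same key

def feature_b_fixed(tokens):
    return {"num_types": len(set(tokens))}  # same value under a unique key

tokens = ["very", "very", "good"]

# Colliding keys: the second function silently overwrites the first.
clobbered = {}
for function in [feature_a, feature_b]:
    clobbered.update(function(tokens))
print(clobbered)

# Unique keys: both features survive.
kept = {}
for function in [feature_a, feature_b_fixed]:
    kept.update(function(tokens))
print(kept)
```

Prefixing each key with the name of its feature class, as `UNIGRAM_` and `BIGRAM_` do, is a simple way to avoid such collisions.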

In [17]:
# This helper function assigns a unique numerical id to every feature name that appears in the data
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1
                
    return feature_vocab

In [18]:
# This helper function converts a dictionary of feature names to a sparse representation
# that we can fit in a scikit-learn model.  This is important because almost all feature 
# values will be 0 for most documents (note: why?), and we don't want to save them all in 
# memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
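
As a hint for the "why?" in the comment above: each document contains only a small fraction of all the feature names seen in training. The sketch below uses three made-up feature dicts and a compact way of building the vocabulary (`dict.setdefault`, an assumption of this sketch rather than the notebook's code), then counts how many cells of the document-by-feature matrix would be nonzero.

```python
# Three hypothetical documents, already converted to feature dicts.
docs = [
    {"UNIGRAM_great": 1, "UNIGRAM_film": 1},
    {"UNIGRAM_dull": 1, "UNIGRAM_film": 1},
    {"UNIGRAM_waste": 1, "UNIGRAM_of": 1, "UNIGRAM_time": 1},
]

# Assign ids in order of first appearance, as create_vocab does.
vocab = {}
for doc in docs:
    for feat in doc:
        vocab.setdefault(feat, len(vocab))

# Compare the size of the full matrix with the number of cells actually set.
total_cells = len(docs) * len(vocab)
nonzero_cells = sum(1 for doc in docs for feat in doc if feat in vocab)

print("features:", len(vocab))
print("cells:", total_cells, "nonzero:", nonzero_cells)
print("fraction nonzero: %.2f" % (nonzero_cells / total_cells))
```

Even in this tiny example fewer than half of the cells are nonzero; with a vocabulary built from many long reviews, the fraction would be expected to be far smaller.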

In [19]:
# This function evaluates a list of feature functions on the training/dev data arguments
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY)) 
    return logreg, feature_vocab

In [20]:
def print_weights(clf, vocab, n=10):

    reverse_vocab=[None]*len(clf.coef_[0])
    for k in vocab:
        reverse_vocab[vocab[k]]=k

    if len(clf.classes_) == 2:
        
        weights=clf.coef_[0]
        for feature, weight in sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))[:n]:
            print("%.3f\t%s" % (weight, feature))

        print()

        for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
            print("%.3f\t%s" % (weight, feature))

    else:  
        for i, cat in enumerate(clf.classes_):

            weights=clf.coef_[i]

            for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
                print("%s\t%.3f\t%s" % (cat, weight, feature))
            print()

In [46]:
majority_class(trainY,devY)

pos	0.500


Explore the impact of different feature functions by evaluating them below:

In [53]:
features=[unigram_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.804


If you want to print the coefficients for any of the models you train, you can do so like this.

In [23]:
print_weights(clf, vocab)

-1.314	UNIGRAM_worst
-0.995	UNIGRAM_bad
-0.854	UNIGRAM_waste
-0.629	UNIGRAM_just
-0.617	UNIGRAM_?
-0.569	UNIGRAM_awful
-0.560	UNIGRAM_free
-0.555	UNIGRAM_looking
-0.547	UNIGRAM_dull
-0.533	UNIGRAM_disappointing

0.902	UNIGRAM_love
0.699	UNIGRAM_wonderful
0.684	UNIGRAM_everyone
0.684	UNIGRAM_very
0.646	UNIGRAM_loved
0.615	UNIGRAM_excellent
0.588	UNIGRAM_well
0.565	UNIGRAM_first
0.534	UNIGRAM_My
0.520	UNIGRAM_always


In [37]:
features=[political_dictionary_feature, unigram_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

NameError: name 'political_dictionary_feature' is not defined

In [48]:
features=[new_feature_class_one]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.771


In [49]:
features=[new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.500


In [50]:
features=[new_feature_class_two, unigram_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.806


In [51]:
features=[new_feature_class_one, new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.771


In [52]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.819
