[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp24/blob/main/5.classification/HW4_FeatureExploration_TODO.ipynb)

**N.B.** Once it's open on Colab, remember to save a copy (by e.g. clicking `Copy to Drive` above).

---

This notebook explores feature engineering for text classification.  Your task is to create two new feature functions (like `political_dictionary_feature` and `unigram_feature` below), and include them in the list of feature functions passed to `build_features`.  What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative!  You are free to read in any other external resources you like (dictionaries, document metadata, etc.).

You are free to use any of the following datasets for this exercise, or to use your own (if you have your own labeled data with at least 500 examples from at least two classes, I would encourage you to use it!).  If you use your own data, just be sure to format it like the examples below: each dataset has a `train.tsv`, `dev.tsv` and `test.tsv` file (saved by the download cells below as e.g. `loc_train.tsv`), where each file is tab-separated (label in the first column and text in the second column).

* [Sentiment Analysis](https://ai.stanford.edu/~amaas/data/sentiment/) (Positive/Negative)
* [Congressional Speech](https://www.cs.cornell.edu/home/llee/data/convote.html) (Democrat/Republican)
* Library of Congress Subject Classification ([21 categories](https://en.wikipedia.org/wiki/Library_of_Congress_Classification))

For whichever dataset you pick, download the data first using the code below.


In [None]:
# get LMRD data
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/lmrd/train.tsv -O lmrd_train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/lmrd/dev.tsv -O lmrd_dev.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/lmrd/test.tsv -O lmrd_test.tsv

In [None]:
# get Convote data
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/convote/train.tsv -O convote_train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/convote/dev.tsv -O convote_dev.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/convote/test.tsv -O convote_test.tsv

In [None]:
# get LoC data
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/loc/train.tsv -O loc_train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/loc/dev.tsv -O loc_dev.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/refs/heads/main/data/loc/test.tsv -O loc_test.tsv

**Q0**: Briefly describe your data (including the categories you're predicting).  If you're using your own data, tell us about it; if you're using one of the datasets above, tell us something that shows you've looked at the data.
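
One way to get a first look at the data is to count how often each label occurs. The sketch below assumes the two-column TSV format described above; `label_distribution` is a hypothetical helper name, and the small sample file it reads contains made-up rows that exist only so the example runs on its own.

```python
import os
import tempfile
from collections import Counter


def label_distribution(filename):
    """Count how often each label occurs in a label<TAB>text file."""
    counts = Counter()
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue  # skip lines that lack a label or a text column
            counts[cols[0]] += 1
    return counts


# Made-up rows standing in for a real file such as lmrd_train.tsv
sample_rows = ["pos\tA lovely film .", "neg\tDull and overlong .", "pos\tGreat cast ."]
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_rows) + "\n")
    sample_path = tmp.name

sample_counts = label_distribution(sample_path)
os.remove(sample_path)  # clean up the temporary sample file

for label, count in sample_counts.most_common():
    print("%s\t%d" % (label, count))
```

To run this on one of the downloaded datasets, pass its path instead, e.g. `label_distribution("lmrd_train.tsv")`.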

In [None]:
import sys
from collections import Counter
import operator
from sklearn import preprocessing
from sklearn import linear_model

from nltk import word_tokenize
import nltk
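nltk.download('punkt')
# Newer NLTK releases load the tokenizer models from 'punkt_tab' instead
nltk.download('punkt_tab')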

import pandas as pd
from scipy import sparse
import numpy as np

In [None]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=word_tokenize(cols[1])
            X.append(text)
            Y.append(label)
    return X, Y

In [None]:
# Change this to the name of the dataset you will be using ("lmrd", "convote" or "loc").
# The files <name>_train.tsv, <name>_dev.tsv and <name>_test.tsv should be in the working directory.
data="loc"

In [None]:
trainX, trainY=read_data("%s_train.tsv" % data)
devX, devY=read_data("%s_dev.tsv" % data)

In [None]:
# Baseline: always predict the most frequent label in the training data
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]

    correct=0.
    for label in devY:
        if label == majority:
            correct+=1

    print("%s\t%.3f" % (majority, correct/len(devY)))

Here we'll create two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.
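
The contract described here (a list of tokens in, a dict of feature names and values out) can be sketched with a toy function. `exclamation_feature` is a hypothetical example, not one of the notebook's feature classes; note that it returns an empty dict when the feature does not fire, rather than storing an explicit 0.

```python
def exclamation_feature(tokens):
    """Toy feature: fires when the document contains an exclamation mark."""
    feats = {}
    if "!" in tokens:
        feats["contains_exclamation"] = 1
    return feats


print(exclamation_feature(["What", "a", "film", "!"]))   # → {'contains_exclamation': 1}
print(exclamation_feature(["A", "quiet", "film", "."]))  # → {}
```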

In [None]:
# Here's a sample dictionary if we were using the convote political data
dem_dictionary=set(["republican","cut", "opposition"])
repub_dictionary=set(["growth","economy"])

def political_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"]=1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"]=1
    return feats

In [None]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats

**Q1**: Add first new feature function here.  Describe your feature and why you think it will help.

In [None]:
def new_feature_class_one(tokens):
    feats={}
    feats["_FILL_IN_FEATURES_HERE_"]=1
    return feats

**Q2**: Add second new feature function here. Describe your feature and why you think it will help.

In [None]:
def new_feature_class_two(tokens):
    feats={}
    feats["_FILL_IN_FEATURES_HERE_"]=1
    return feats

This is the main function we'll use to aggregate together all of the information from different feature classes.  Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning.  (Here you want to make sure that the keys each feature function is creating are unique so they don't get clobbered by other functions).
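
The clobbering mentioned above comes from how `dict.update` treats repeated keys: the later value silently replaces the earlier one. The following is a minimal sketch with hypothetical feature functions (`first_feature`, `second_feature` and `second_feature_prefixed` are illustrative names only) showing the problem and how a per-function key prefix avoids it.

```python
def first_feature(tokens):
    # Document length in tokens
    return {"length": len(tokens)}


def second_feature(tokens):
    # Reuses the key "length", so it overwrites first_feature's value
    return {"length": 1 if len(tokens) > 3 else 0}


def second_feature_prefixed(tokens):
    # Same information under a key no other function uses
    return {"LONGDOC_length": 1 if len(tokens) > 3 else 0}


tokens = ["a", "short", "document"]

clobbered = {}
for function in [first_feature, second_feature]:
    clobbered.update(function(tokens))

safe = {}
for function in [first_feature, second_feature_prefixed]:
    safe.update(function(tokens))

print(clobbered)  # → {'length': 0}
print(safe)       # → {'length': 3, 'LONGDOC_length': 0}
```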

In [None]:
def build_features(trainX, feature_functions):
    data=[]
    for tokens in trainX:
        feats={}

        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

In [None]:
# This helper function maps each feature name in a list of per-document feature dicts to a unique numerical id
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1

    return feature_vocab

In [None]:
# This helper function converts a list of per-document feature dicts to a sparse matrix
# that we can fit in a scikit-learn model.  This is important because almost all feature
# values will be 0 for most documents (note: why?), and we don't want to save them all in
# memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
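
The comment above asks why most feature values are 0. A small sketch on made-up feature dicts may help: each document mentions only a few of the features in the vocabulary, so each row stores only a few entries. It also shows that a feature missing from the vocabulary (which is built from training data only) is simply dropped. The names `toy_vocab` and `to_matrix` are assumptions for illustration; `to_matrix` mirrors `features_to_ids` so that the example is self-contained.

```python
from scipy import sparse

# Made-up feature dicts for two training documents and one dev document
train_feats = [{"UNIGRAM_good": 1, "UNIGRAM_film": 1},
               {"UNIGRAM_bad": 1, "UNIGRAM_film": 1}]
dev_feats = [{"UNIGRAM_good": 1, "UNIGRAM_unseen": 1}]

# Vocabulary built from the training features only
toy_vocab = {}
for doc in train_feats:
    for feat in doc:
        toy_vocab.setdefault(feat, len(toy_vocab))


def to_matrix(data, vocab):
    """Fill a sparse matrix, skipping features that are not in vocab."""
    matrix = sparse.lil_matrix((len(data), len(vocab)))
    for row, doc in enumerate(data):
        for feat, value in doc.items():
            if feat in vocab:
                matrix[row, vocab[feat]] = value
    return matrix


train_matrix = to_matrix(train_feats, toy_vocab)
dev_matrix = to_matrix(dev_feats, toy_vocab)

# Fraction of cells that hold a stored (nonzero) value
density = train_matrix.nnz / (train_matrix.shape[0] * train_matrix.shape[1])
print("shape:", train_matrix.shape, "stored values:", train_matrix.nnz)
print("density: %.2f" % density)
print(dev_matrix.toarray())  # UNIGRAM_unseen has no column, so it is ignored
```

With a real vocabulary of many thousands of unigrams the density is far lower than in this toy case, which is what makes the sparse representation worthwhile.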

In [None]:
# This function evaluates a list of feature functions on the training/dev data arguments
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)

    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY))
    return logreg, feature_vocab

In [None]:
def print_weights(clf, vocab, n=10):

    reverse_vocab=[None]*len(clf.coef_[0])
    for k in vocab:
        reverse_vocab[vocab[k]]=k

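    if len(clf.classes_) == 2:

        # Binary case: coef_ has a single row of weights; negative weights
        # favor clf.classes_[0] and positive weights favor clf.classes_[1]
        weights=clf.coef_[0]
        # n most negative weights
        for feature, weight in sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))[:n]:
            print("%.3f\t%s" % (weight, feature))

        print()

        # n most positive weights
        for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
            print("%.3f\t%s" % (weight, feature))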

    else:
        # Multiclass case: coef_ has one row of weights per class
        for i, cat in enumerate(clf.classes_):

            weights=clf.coef_[i]

            # n largest weights for this class
            for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
                print("%s\t%.3f\t%s" % (cat, weight, feature))
            print()

In [None]:
majority_class(trainY,devY)

Explore the impact of different feature functions by evaluating them below:

In [None]:
features=[unigram_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

If you want to print the coefficients for any of the models you train, you can do so like this.

In [None]:
print_weights(clf, vocab)
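
When reading the output for a two-class problem, it helps to know which class each sign points to. In scikit-learn's binary `LogisticRegression`, `coef_` has a single row, and positive weights push the prediction toward `classes_[1]` (the label that sorts second), while negative weights push it toward `classes_[0]`. The sketch below checks this on made-up data; `toy_X`, `toy_Y` and `toy_clf` are illustrative names only.

```python
from sklearn import linear_model

# Made-up data: feature 0 only occurs with "neg", feature 1 only with "pos"
toy_X = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_Y = ["neg", "neg", "pos", "pos"]

toy_clf = linear_model.LogisticRegression()
toy_clf.fit(toy_X, toy_Y)

print(toy_clf.classes_)     # labels in sorted order
print(toy_clf.coef_.shape)  # a single row of weights for the binary case
print(toy_clf.coef_[0])     # negative for feature 0, positive for feature 1
```

So in the binary output of `print_weights`, the first group (most negative weights) lists the strongest evidence for `classes_[0]` and the second group lists the strongest evidence for `classes_[1]`.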

In [None]:
features=[political_dictionary_feature, unigram_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_one]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_one, new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

---

## To submit

Congratulations on finishing this homework!
Please follow the instructions below to download the notebook file (`.ipynb`) and its printed version (`.pdf`) for submission on bCourses -- remember **all cells must be executed**.

1.  Download a copy of the notebook file: `File > Download > Download .ipynb`.

2.  Print the notebook as PDF (via your browser, or tools like [nbconvert](https://nbconvert.readthedocs.io/en/latest/)).