Here we fit linear models with lasso (L1) regularization to drive some of the weights to exactly 0.
It is reasonable to assume that some words in the input are not relevant, so any word whose weight is 0 is removed from the data.

The lasso'd data (with the zero-weight words removed) is written to the folder lasso_data/
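
As a minimal sketch of why an L1 penalty suits this job, the comparison below fits SGDClassifier once with penalty "l1" and once with penalty "l2" and counts the weights that end up exactly 0. It assumes synthetic data from make_classification rather than the review data; the data set, alpha = 0.1, and the random_state values are illustrative assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (an assumption): 50 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

def count_zero_weights(penalty, alpha=0.1):
    # Fit a linear hinge-loss model and count coefficients that are exactly 0
    model = SGDClassifier(loss="hinge", penalty=penalty, alpha=alpha, random_state=0)
    model.fit(X, y)
    return int(np.sum(model.coef_[0] == 0))

zeros_l1 = count_zero_weights("l1")
zeros_l2 = count_zero_weights("l2")
print("Zero weights with l1:", zeros_l1, "of", X.shape[1])
print("Zero weights with l2:", zeros_l2, "of", X.shape[1])
```

An L1 penalty typically leaves many more weights at exactly 0 than an L2 penalty does, which is what makes it usable as a word filter.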

In [1]:
# Imports
import numpy as np
from sklearn.linear_model import SGDClassifier
import pickle

In [2]:
# Load the data, closing each file once it has been read
def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

x_train = load_pickle("x_train.p")
y_train = load_pickle("y_train.p")
x_test = load_pickle("x_test.p")
word_labels = load_pickle("word_labels.p")

In [3]:
# Creates a linear model using l1 regularization with the given alpha and loss function
# Fits the model on x_data and y_data.
# Returns the final weights of the model
def run_lasso(x_data, y_data, alpha, loss):
    print("Running lasso with alpha = " + str(alpha) + " loss = " + str(loss))
    print("Using alpha = " + str(alpha) + " and loss = " + str(loss))
    
    model = SGDClassifier(loss = loss, alpha = alpha, penalty = "l1")
    model.fit(x_data, y_data)
    
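    # coef_ has shape (n_classes, n_features), or (1, n_features) for a binary problem,
    # which is why the callers below take element [0]
    weights = model.coef_
    return weights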

In [4]:
# Given the weights from a lasso model and the word labels, prints the words
# that were eliminated from the model (those with a weight of exactly 0).
# The share of eliminated words is printed as a fraction between 0 and 1.
def interpret_weights(weights, word_labels):
    dead_values = []
    
    for i in range(len(weights)):
        weight = weights[i]
        if weight == 0:
            dead_values.append(i)
            
            
    print("Percent of words eliminated: " + str(len(dead_values) / len(weights)))
    print("Words eliminated: ")
    for i in dead_values:
        print(word_labels[i], end = " ")
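
The same lookup can be done without an explicit index loop. The sketch below is a hypothetical variant (the name eliminated_words and the toy inputs are assumptions for illustration) that returns the fraction and the words instead of printing them, which makes the result easier to reuse.

```python
import numpy as np

def eliminated_words(weights, word_labels):
    # Indices whose weight is exactly 0, found without an explicit Python loop
    weights = np.asarray(weights)
    dead_values = np.flatnonzero(weights == 0)
    fraction = len(dead_values) / len(weights)
    words = [word_labels[i] for i in dead_values]
    return fraction, words

# Toy weights and labels, made up for illustration
fraction, words = eliminated_words([0.0, 1.5, 0.0, -2.0], ["the", "great", "of", "bad"])
print(fraction, words)
```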

In [5]:
# Shaves the data set based on the lasso weights by dropping every word (column) whose
# weight is 0 from x_train, x_test, and word_labels. Pickles the shaved arrays into
# lasso_data/, using filename as the suffix. The lasso_data/ folder must already exist.
def shave_training_set(lasso_weights, filename):
    # Column indices of the words the model eliminated
    dead_values = [i for i in range(len(lasso_weights)) if lasso_weights[i] == 0]

    # np.delete returns new arrays, so the globals x_train, x_test, and word_labels
    # are only read here and are left unchanged
    new_x_train = np.delete(x_train, dead_values, 1)
    new_x_test = np.delete(x_test, dead_values, 1)
    new_word_labels = np.delete(word_labels, dead_values)

    shaved = [("x_train", new_x_train), ("x_test", new_x_test), ("word_labels", new_word_labels)]
    for name, data in shaved:
        with open("lasso_data/" + name + "_" + filename + ".p", "wb") as f:
            pickle.dump(data, f)
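
To make the column removal concrete, here is a small sketch on toy arrays. All of the values and the *_toy names are made up for illustration. It shows that deleting along axis 1 removes the same word columns from the train and test matrices, so the two stay aligned with the shortened label list.

```python
import numpy as np

# Toy matrices (made up): 2 train rows and 1 test row over 4 word columns
x_train_toy = np.array([[1, 0, 2, 0],
                        [0, 3, 0, 1]])
x_test_toy = np.array([[4, 0, 0, 2]])
labels_toy = np.array(["the", "great", "of", "bad"])
weights_toy = np.array([0.0, 1.5, 0.0, -2.0])

# Columns whose weight is exactly 0 (here columns 0 and 2)
dead_values = np.flatnonzero(weights_toy == 0)

# axis=1 deletes columns, so the rows (documents) are untouched
shaved_train = np.delete(x_train_toy, dead_values, 1)
shaved_test = np.delete(x_test_toy, dead_values, 1)
shaved_labels = np.delete(labels_toy, dead_values)

# Equivalent selection with a boolean mask of the surviving columns
keep = weights_toy != 0
assert np.array_equal(shaved_train, x_train_toy[:, keep])

print(shaved_train.shape, shaved_test.shape, shaved_labels.tolist())
```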

In [6]:
# We try 5 different lasso models, all with hinge loss, with alpha decreasing from 1e-2 to 1e-6
weights = run_lasso(x_train, y_train, 1e-2, "hinge")[0]
interpret_weights(weights, word_labels)
shave_training_set(weights, "lasso_1")

Running lasso with alpha = 0.01 loss = hinge
Using alpha = 0.01 and loss = hinge
Percent of words eliminated: 0.96
Words eliminated: 
book one movi like veri time ha use onli work make realli stori first becaus look want think go cd film better ani way product song see could know also thing charact music mani say littl im review ever new never peopl watch doe back play bought give still find dvd need made got end old two take come interest put seem day ive thought sound lot cant everi purchas star found feel fan befor anoth qualiti start write someth version author worth show doesnt long part set last differ help must game classic novel keep listen order problem anyon expect actual page live real seri written price origin howev sinc hard plot right hope enough nice fun without though world easi amazon us big kid far person sure favorit month turn alway complet may believ pretti understand seen quit around bit cover man pictur edit point almost scene tell beauti said word video reader f

In [7]:
weights = run_lasso(x_train, y_train, 1e-3, "hinge")[0]
interpret_weights(weights, word_labels)
shave_training_set(weights, "lasso_2")

Running lasso with alpha = 0.001 loss = hinge
Using alpha = 0.001 and loss = hinge
Percent of words eliminated: 0.713
Words eliminated: 
veri get ha realli becaus look think cd film way product song see know thing music mani say im review new peopl watch play give find dvd need got end old come ive sound lot cant everi purchas star feel fan befor anoth qualiti start write someth show long set last differ game listen order problem anyon actual live real seri written origin sinc hard right hope without world amazon big kid far person sure turn may believ pretti quit around cover man pictur point almost scene said word video reader fact wait place chang whole happen reason second done yet came record inform call true fit piec pleas onc mean small band babi releas wish son save couldnt funni guy high style size line wont probabl might less full sever power wasnt case includ three someon hear kind children care run finish move collect short begin ago hour absolut effect follow light abl tog

In [8]:
weights = run_lasso(x_train, y_train, 1e-4, "hinge")[0]
interpret_weights(weights, word_labels)
shave_training_set(weights, "lasso_3")

Running lasso with alpha = 0.0001 loss = hinge
Using alpha = 0.0001 and loss = hinge
Percent of words eliminated: 0.23
Words eliminated: 
book wa one movi like time get hi use work want think go cd film ani product see music im never peopl find dvd need old sound lot found fan write show long differ game problem anyon actual written origin amazon kid turn believ pretti quit cover man pictur almost video fact wait friend chang whole reason second came call small band everyth releas wish son guy size wont three someon hear finish move saw pick young hold mind toy girl stuff home excit leav batteri clay class print along present view writer except god hit dri number creat suggest told matter king continu arriv huge black kept fantasi stand figur free car compar dark talent recent possibl forward direct text cost pop four 12 clear system within wear eye screen histor lyric forc instal certainli student card group cup larg pass emot previou fight quickli librari notic deserv teach older kne

In [9]:
weights = run_lasso(x_train, y_train, 1e-5, "hinge")[0]
interpret_weights(weights, word_labels)
shave_training_set(weights, "lasso_4")

Running lasso with alpha = 1e-05 loss = hinge
Using alpha = 1e-05 and loss = hinge
Percent of words eliminated: 0.117
Words eliminated: 
thi book one like time even becaus think go cd film better album ani way song music say new watch dvd made got old interest lot cant found fan write differ expect real written origin hard far believ pretti around pictur away wait whole came piec mean wish son three someon hear begin hour special pick girl boy daughter batteri war clay rest women class goe cours except list dri provid ye impress matter face disc figur offer anim break clear sit system within screen dog major air teach role insight messag walk format known episod memori due bottom relationship clearli fire success skin fell cultur red despit toward account univers la credit self struggl pair soni 

In [10]:
weights = run_lasso(x_train, y_train, 1e-6, "hinge")[0]
interpret_weights(weights, word_labels)
shave_training_set(weights, "lasso_5")

Running lasso with alpha = 1e-06 loss = hinge
Using alpha = 1e-06 and loss = hinge
Percent of words eliminated: 0.014
Words eliminated: 
cd doe last seen mysteri danc heavi instruct ball unlik citi select concert drop 
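
The five runs above differ only in alpha, so the sweep can also be written as a loop. The sketch below assumes synthetic data from make_classification so that it runs on its own; the helper zero_fraction and the random_state values are assumptions for illustration, and the fractions it prints will not match the outputs above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (an assumption), not the review data used above
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

def zero_fraction(alpha, loss="hinge"):
    # Fit an L1-penalized linear model and return the share of weights at exactly 0
    model = SGDClassifier(loss=loss, alpha=alpha, penalty="l1", random_state=0)
    model.fit(X, y)
    weights = model.coef_[0]
    return float(np.mean(weights == 0))

# Same alpha values as the five cells above
alphas = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
results = {alpha: zero_fraction(alpha) for alpha in alphas}
for alpha, fraction in results.items():
    print("alpha =", alpha, "fraction of weights at 0 =", fraction)
```

In the notebook itself, the loop body would instead call run_lasso, interpret_weights, and shave_training_set, with a suffix such as "lasso_" + str(i) for the i-th alpha.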