This notebook explores linear regression with L2 (ridge) and L1 (lasso) regularization, using the movie box office prediction data from [Joshi et al. 2010](http://www.cs.cmu.edu/~ark/movie$-data/).  Be sure to install Beautiful Soup (a Python library for parsing HTML and XML), which is used below to read the XML data file.

```sh
conda install beautifulsoup4=4.7.1
```


In [14]:
import nltk
import numpy as np
from sklearn import linear_model
import sklearn.metrics
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
def read_movie_data(filename):
    
    trainX=[]
    Y_train=[]
    
    testX=[]
    Y_test=[]
    
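    with open(filename) as file:
        # name the parser explicitly so bs4 does not warn or pick a different one per machine;
        # note that HTML parsers lowercase tag and attribute names
        soup=BeautifulSoup(file, "html.parser")
        movies=soup.find_all('instance')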
        for movie in movies:
            split=movie["subpop"]
            y=float(movie.find('regy')["yvalue"])

            # we'll just take the first review in the data (each movie has multiple reviews)
            review=movie.find('text')
            
            tokens=nltk.word_tokenize(review.text)
            words=' '.join(tokens)
            if split == "train":
                trainX.append(words)
                Y_train.append(y)
            elif split == "test":
                testX.append(words)
                Y_test.append(y)
   
    return trainX, Y_train, testX, Y_test
    

In [None]:
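def analyze_weights(learned_model, vocab, num_to_print, printZero=True):
    # vocab maps feature name -> column index; invert it to look up names by index
    reverse_vocab = {v: k for k, v in vocab.items()}

    # feature indices sorted by coefficient value, most negative first
    sort_index = np.argsort(learned_model.coef_)
    
    # largest coefficients, printed in descending order
    for k in reversed(sort_index[-num_to_print:]):
        if learned_model.coef_[k] != 0 or printZero:
            print("%.5f\t%s" % (learned_model.coef_[k], reverse_vocab[k]))
        
    print()

    # smallest (most negative) coefficients, printed in ascending order
    for k in sort_index[:num_to_print]:
        if learned_model.coef_[k] != 0 or printZero:
            print("%.5f\t%s" % (learned_model.coef_[k], reverse_vocab[k]))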

In [None]:
trainX, Y_train, testX, Y_test=read_movie_data("../data/7domains-train-dev.tl.xml")

In [None]:
vectorizer = CountVectorizer(max_features=10000, ngram_range=(1,2), lowercase=True, strip_accents=None, binary=True)
X_train = vectorizer.fit_transform(trainX)
X_test = vectorizer.transform(testX)

Ridge regression is linear regression with L2 regularization. How does varying the regularization strength (`alpha`) affect the mean absolute error (MAE) on the test set?  How does it affect the rank order of the most informative coefficients?  Play around with the parameters of the CountVectorizer above (e.g., varying `max_features`, or changing `ngram_range` to use unigrams only or to add trigrams; the setting above already includes bigrams).

In [None]:
# higher values of alpha = stronger regularization
ridge_regression = linear_model.Ridge(alpha=100, fit_intercept=True)
ridge_regression.fit(X_train, Y_train)
preds=ridge_regression.predict(X_test)
# scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is the same either way
mae=sklearn.metrics.mean_absolute_error(Y_test, preds)
print("MAE: %.3f" % mae)
analyze_weights(ridge_regression, vectorizer.vocabulary_, 5)
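
One way to explore the questions about regularization strength is to refit the model over a range of `alpha` values and record the MAE, the coefficient norm, and the top-ranked features each time. The sketch below is a minimal illustration on synthetic data. The data, the `alpha` grid, and names such as `results` are assumptions made so the block runs on its own; they are not part of the original notebook. With the notebook's data, `X_train`, `Y_train`, `X_test`, and `Y_test` would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the movie data (an assumption for illustration):
# 50 features, of which only the first 5 actually influence the target.
rng = np.random.RandomState(0)
true_coef = np.zeros(50)
true_coef[:5] = [4.0, -3.0, 2.0, -1.5, 1.0]
X_tr = rng.normal(size=(200, 50))
X_te = rng.normal(size=(100, 50))
y_tr = X_tr @ true_coef + rng.normal(scale=1.0, size=200)
y_te = X_te @ true_coef + rng.normal(scale=1.0, size=100)

results = []
for alpha in [0.01, 1, 100, 10000]:
    model = Ridge(alpha=alpha, fit_intercept=True)
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    # L2 norm of the weight vector: shrinks as alpha grows
    coef_norm = float(np.linalg.norm(model.coef_))
    # indices of the three largest-magnitude coefficients
    top = np.argsort(-np.abs(model.coef_))[:3].tolist()
    results.append((alpha, mae, coef_norm, top))
    print("alpha=%-8g MAE=%.3f  ||w||=%.3f  top features=%s" % (alpha, mae, coef_norm, top))
```

Comparing the rows shows how the coefficient norm falls as `alpha` rises, and whether the ranking of the top features stays stable while their magnitudes shrink.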

Lasso is linear regression with L1 regularization, which pushes coefficients not just toward zero but, for many of them, to exactly zero.  As a result, lasso performs feature selection: parameters with a value of 0 are effectively removed from the model. How does varying the regularization strength here affect the number of non-zero coefficients?  How does it affect the rank order of the most informative coefficients?
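
To see why L1 regularization yields exact zeros while L2 does not, it can help to look at a simplified case. The sketch below assumes an orthonormal design, where each coefficient can be solved for separately starting from its unregularized least-squares value. The function names and the example values are hypothetical, and the penalty scaling stated in each docstring is chosen for simplicity; it is not the scaling used by scikit-learn's `Lasso`.

```python
def ridge_shrink(w_ols, alpha):
    """Minimizer of (w - w_ols)**2 + alpha * w**2 (one coefficient)."""
    # Setting the derivative to zero gives a proportional shrink:
    # the result is zero only if w_ols itself is zero.
    return w_ols / (1.0 + alpha)


def lasso_soft_threshold(w_ols, alpha):
    """Minimizer of 0.5 * (w - w_ols)**2 + alpha * abs(w) (one coefficient)."""
    # Soft-thresholding: move w_ols toward zero by alpha,
    # and clamp to exactly zero when |w_ols| <= alpha.
    if w_ols > alpha:
        return w_ols - alpha
    if w_ols < -alpha:
        return w_ols + alpha
    return 0.0


# Hypothetical unregularized coefficients
ols_coefs = [3.0, -1.2, 0.4, -0.1]

for alpha in [0.0, 0.5, 2.0]:
    ridge = [ridge_shrink(w, alpha) for w in ols_coefs]
    lasso = [lasso_soft_threshold(w, alpha) for w in ols_coefs]
    nonzero = sum(1 for w in lasso if w != 0)
    print("alpha=%.1f" % alpha)
    print("  ridge:", ["%.3f" % w for w in ridge])
    print("  lasso:", ["%.3f" % w for w in lasso], "(nonzero: %d)" % nonzero)
```

In this simplified setting, ridge rescales every coefficient by the same factor, so small coefficients stay small but non-zero, whereas lasso subtracts a fixed amount and zeroes out anything whose magnitude is at or below the threshold.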

In [None]:
lasso = linear_model.Lasso(alpha=100, fit_intercept=True, max_iter=10000)
lasso.fit(X_train, Y_train)
preds=lasso.predict(X_test)
# scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is the same either way
mae=sklearn.metrics.mean_absolute_error(Y_test, preds)
print("MAE: %.3f" % mae)

# number of coefficients that lasso left non-zero
count=np.count_nonzero(lasso.coef_)

print("Nonzero features: %d\n" % count)
analyze_weights(lasso, vectorizer.vocabulary_, 5, printZero=False)