## Overview

This notebook set deals with IMDB review data (from [here](https://ai.stanford.edu/~amaas/data/sentiment/)).  The dataset has been adjusted slightly for the purposes of this exercise to conform to the description that follows and to limit the total number of reviews.

The goal is to explore using linear least squares to predict a movie review's rating from how often different words appear in the review.

In [2]:
## These are the only libraries you should need
import numpy as np
import matplotlib.pyplot as plt

# below line just to make figures larger
plt.rcParams["figure.figsize"] = (20,10)

### Dataset
The dataset consists of reviews with their corresponding ratings.  Each data point is a different movie review.  The label, $y$, is the review's rating (from 1 to 10, with 10 being the best).

There are 1000 features, corresponding to the 1000 most common words across all reviews.  For example, the first feature corresponds to the word "the" and the second feature corresponds to the word "and."  The value of each feature is the number of times that word appears in the review.  **Note: This is different from the dataset used for naive Bayes: the features and target are both real-valued.**
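
To make the feature encoding concrete, here is a minimal sketch of how one review could be turned into a word-count vector. The three-word vocabulary, the sample review, and the whitespace tokenization are all assumptions made for illustration; the actual preprocessing used to build the dataset is not described here.

```python
import numpy as np

# Hypothetical three-word vocabulary; the real dataset uses the 1000 most common words.
vocab = ["the", "and", "movie"]

# Made-up review text, for illustration only.
review = "the movie was long and the ending was weak"

# Split on whitespace (an assumed tokenization) and count each vocabulary word.
tokens = review.lower().split()
x = np.array([tokens.count(word) for word in vocab], dtype=float)

print(x)  # [2. 1. 1.]  -> "the" twice, "and" once, "movie" once
```

Each row of the data matrix is one such vector, with one column per vocabulary word.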

The code below loads both a training dataset (`train`) and a testing dataset (`test`).  Most importantly, each has `.X` and `.Y` fields: NumPy arrays holding the data matrix $X$ and the target vector $Y$.  You can also find the words that correspond to each feature (`.featnames`).

In [1]:
import sys; sys.path.insert(0, '../..') # path to dataset.py (adjust if you have moved this file or dataset.py)
from dataset import loaddataset

#adjust the directory below if you have moved either the dataset files or this notebook
datasetdir = '../../datasets/' 

train = loaddataset(datasetdir+'moviereview-reg-train')
test = loaddataset(datasetdir+'moviereview-reg-test')

## Question 1:

<div style="color: #000000;background-color: #FFFFEE">
Complete the training and testing functions below for linear least squares.  Your code should <b>not</b> add the extra "constant" feature.  Assume that has already been done.  These functions are very short.
</div>
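
As a reminder of where the training rule comes from, here is a short derivation sketch. It uses the shapes from the function comments ($X$ is $m \times n$, $Y$ has length $m$, $w$ has length $n$) and assumes that $X^\top X$ is invertible. Least squares chooses $w$ to minimize the sum of squared errors:

$$
L(w) = \lVert Xw - Y \rVert^2 = (Xw - Y)^\top (Xw - Y)
$$

Setting the gradient to zero gives the normal equations:

$$
\nabla_w L(w) = 2 X^\top (Xw - Y) = 0
\quad\Longrightarrow\quad
X^\top X \, w = X^\top Y
$$

Solving this $n \times n$ linear system for $w$ gives the learned weights, and predictions for new data are $Xw$.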

In [3]:
def learnlls(X,Y):
    # X is the data matrix of shape (m,n)
    # Y is the vector of target values of shape (m,)
    # function should return w of shape (n,)        
    return np.linalg.solve((X.T@X),X.T@Y)
    
def predictlls(X,w):
    # X is the (testing) data of shape (m,n)
    # w are the weights learned in linear least-squares regression of shape (n,)
    # function should return Y, the predicted values of shape (m,)
    return X@w
    
def testlls(X,Y,w):
    # X and Y are the testing data
    # w are the weights from linear least-squares regression
    # returns the *mean* squared error
    Ydelta = Y - predictlls(X,w)
    return (Ydelta*Ydelta).mean()
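
One way to sanity-check a least-squares solver is to run it on synthetic data where the true weights are known. The sketch below assumes noise-free targets and uses hypothetical names (`X_demo`, `Y_demo`, `w_true`). It also compares the normal-equations solution with `np.linalg.lstsq`, which solves the same problem without forming $X^\top X$ explicitly.

```python
import numpy as np

def learnlls(X, Y):
    # Normal-equations solution, repeated here so the snippet runs on its own.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Synthetic data (an assumption for illustration, not the movie-review dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y_demo = X_demo @ w_true  # noise-free targets, so the fit should recover w_true

w_normal = learnlls(X_demo, Y_demo)
w_lstsq, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

print(np.allclose(w_normal, w_true))   # recovered the generating weights?
print(np.allclose(w_normal, w_lstsq))  # agrees with NumPy's least-squares routine?
```

Both checks should print `True` on this well-conditioned example. On real data with nearly collinear features the two approaches can differ slightly, since forming $X^\top X$ amplifies conditioning problems.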

## Question 2:

<div style="color: #000000;background-color: #FFFFEE">
    
Using the functions above, compute the weights learned using linear least squares on the training data.  Then report the mean squared error of the resulting regressor on the training data and the testing data.

You will need to augment the features with an initial column of all 1s to allow for an offset for the learned function.
</div>    

In [8]:
def add1s(X):
    return np.hstack((np.ones((X.shape[0],1)),X))

w = learnlls(add1s(train.X),train.Y)
trainerr = testlls(add1s(train.X),train.Y,w)
testerr = testlls(add1s(test.X),test.Y,w)

print(f'training error: {trainerr}')
print(f'testing error: {testerr}')

training error: 5.699668546224271
testing error: 6.2007346197200075
