### Linear Regression Modeling Lab

This lab will walk you through the basics of building a linear regression model out of a training and test set using a variety of techniques, including:

 - estimating distributional fit
 - one-hot and target encoding
 - measuring progress with cross validation scores
 - creating a custom loss function
 - properly using inferences from the training set to transform the test set
 
**Some of these columns might have missing values.  Decide on the best approach for filling them in based on what we did in the last class.**

#### Step 1).  Load the training and test sets from the `movies` folder in the `Data` folder, then remove the `id` column

In [1]:
# your code here
import pandas as pd
import numpy as np

train = pd.read_csv('../../Data/movies/train.csv', parse_dates=['release_date'])
test  = pd.read_csv('../../Data/movies/test.csv', parse_dates=['release_date'])

In [2]:
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

#### Step 2).  Using a Custom Loss Function

To avoid some of the pitfalls of using a loss function that measures squared error, we're going to modify it a little bit.  This is also a useful skill in practice because lots of projects will require something precise that's not available out-of-the-box in a library.

`Scikit-Learn` allows for custom loss functions relatively easily.

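We're going to use the **mean squared log error** instead.  With $\hat{y}_i$ as the prediction for observation $i$ and $n$ observations in total, it has the following form:

$$ \frac{1}{n}\sum_{i=1}^{n}\left(\log_{e}(1 + y_i) - \log_{e}(1 + \hat{y}_i)\right)^2 $$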

The easiest way to do this is the following:

 - take the log of y using `np.log1p`, which computes log(1 + y) and so avoids the hassle of zero values, where a plain log is undefined
 - fit your model to that, and then calculate the resulting mean squared error
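
As a quick sanity check of the two steps above, the sketch below uses made-up revenue figures (not the movies data) to show that a plain mean squared error on `np.log1p`-transformed values matches the direct definition. It also compares against `sklearn.metrics.mean_squared_log_error`, imported under the alias `sk_msle` so it does not clash with the function this lab asks you to write.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error as sk_msle

# Hypothetical revenues and predictions on the original (dollar) scale.
y_true = np.array([0.0, 1_000.0, 250_000.0, 12_000_000.0])
y_pred = np.array([10.0, 1_500.0, 200_000.0, 9_000_000.0])

# Direct definition: mean of squared differences of log(1 + value).
msle_direct = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Two-step route: transform both sides first, then take a plain MSE.
log_true, log_pred = np.log1p(y_true), np.log1p(y_pred)
mse_on_logs = np.mean((log_true - log_pred) ** 2)

# Library version, for comparison.
msle_library = sk_msle(y_true, y_pred)

print(msle_direct, mse_on_logs, msle_library)
```

Note that `np.log1p(0.0)` is exactly `0.0`, so a film with zero revenue causes no trouble here.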
 
So your job is threefold:
 - log transform the target variable (revenue)
 - create a function called `mean_squared_log_error` according to the specifications defined here:  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html, under the heading for the `scoring` argument
 - to test that you did this correctly, run a 10-fold univariate linear regression on the training set using the `popularity` column as `X` and `revenue` as `y`.  The average value of your scores should be 60.7
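
To make the `scoring` contract concrete, here is a minimal sketch on synthetic data (the names `mse_scorer`, `X_demo` and `y_demo` are illustrative, not part of the lab). A scorer callable receives the fitted estimator plus the held-out `X` and `y` for each fold. For a regressor with an integer `cv`, the folds should correspond to an unshuffled `KFold`, so the manual loop is expected to reproduce what `cross_val_score` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def mse_scorer(estimator, X, y):
    """Scorer callable: gets the fitted estimator and the held-out fold."""
    return np.mean((estimator.predict(X) - y) ** 2)

# Synthetic stand-in data (not the movies dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                            scoring=mse_scorer, cv=5)

# The same thing by hand: fit on each training split, score on the rest.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(mse_scorer(model, X_demo[val_idx], y_demo[val_idx]))

print(cv_scores)
print(np.array(manual_scores))
```

Because this scorer returns an error, lower values are better. That is fine for comparing averages by eye, as this lab does, but tools that pick the "best" score automatically assume higher is better.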

In [3]:
# your code here
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
train['revenue'] = np.log1p(train['revenue'])

def mean_squared_log_error(model, X, y):
    # our error
    error = model.predict(X) - y
    # the average value of its square
    mse = np.mean(error**2)
    return mse

lreg = LinearRegression()
X = train[['popularity']]
y = train['revenue']

scores = cross_val_score(estimator=lreg, X=X, y=y, scoring=mean_squared_log_error, cv=10)

In [4]:
# and get their average value
np.mean(scores)

60.73027393164422

#### Step 3).  Distributional Inference of Your Continuous Variables

In [5]:
# your code here
from scipy.stats import probplot
import matplotlib.pyplot as plt

num_cols = train.select_dtypes(include=np.number).columns.tolist()
num_cols.remove('revenue')

In [6]:
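# get the r_squared value of each column's normal probability plot
# (probplot returns the correlation r as the third fit value, so we square it)
r_squared_vals = [probplot(train[col])[1][2] ** 2 for col in num_cols]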

In [7]:
# and find the column with the lowest value, i.e. the one furthest from normal
min_idx = r_squared_vals.index(min(r_squared_vals))
num_cols[min_idx]

'vote_count'

In [8]:
# and now fit a regression model to determine if transforming it helps
score1 = cross_val_score(estimator=lreg, X=train[['vote_count']], y=y, scoring=mean_squared_log_error, cv=10)
score2 = cross_val_score(estimator=lreg, X=np.log1p(train[['vote_count']]), y=y, scoring=mean_squared_log_error, cv=10)

In [9]:
# the second version is the clear winner
np.mean(score1), np.mean(score2)

(65.94013966051395, 44.4699238212142)
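
The idea behind this step can be sketched on synthetic data: a right-skewed variable fits a normal probability plot poorly, and a log transform usually straightens it out. The sample below is an invented lognormal draw, not the `vote_count` column, and the names `r_raw` and `r_log` are illustrative.

```python
import numpy as np
from scipy.stats import probplot

# A right-skewed synthetic sample standing in for a count-like column.
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# With the default fit=True, probplot returns
# ((theoretical quantiles, ordered values), (slope, intercept, r)).
_, (_, _, r_raw) = probplot(skewed)
_, (_, _, r_log) = probplot(np.log1p(skewed))

# A value of r closer to 1 means the sample looks more like a normal distribution.
print(f"r before the transform: {r_raw:.3f}")
print(f"r after np.log1p:      {r_log:.3f}")
```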

#### Step 4).  Encoding the `Director` Column

The `Director` column is a good example of some of the challenges of dealing with categorical data.  If George Lucas or Steven Spielberg directs a film, there's a good chance that has a non-random impact on the film's bottom line.  However, there are a lot of unique values, most of which are probably non-impactful.

Creating a column for everyone is probably not a good idea, but there's also no clear 'order' you could assign them just by looking at their labels.  

In this step you're going to try two different techniques to see which one works better on your dataset.

**Technique 1:**  Only include directors that have a value count of at least 10 *in your training set*, and set everything else to `Other`.

So:

 - transform the column accordingly (you can make a new column if that's easier)
 - transform the same column in your test set so that if a director's name *doesn't* appear in your new training column it gets set to `Other`
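
The two steps above can be sketched on a toy example. The frames `train_demo` and `test_demo` and the threshold of 2 are invented for illustration (the lab uses 10). The key point is that the list of frequent directors comes from the training set only.

```python
import pandas as pd

train_demo = pd.DataFrame({"director": ["A", "A", "A", "B", "B", "C"]})
test_demo = pd.DataFrame({"director": ["A", "B", "C", "D"]})

min_count = 2  # illustrative threshold; the lab uses 10

# Learn which directors are frequent enough, from the training set only.
counts = train_demo["director"].value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent directors, replace everyone else with 'Other'.
train_demo["director1"] = train_demo["director"].where(
    train_demo["director"].isin(frequent), "Other")
test_demo["director1"] = test_demo["director"].where(
    test_demo["director"].isin(frequent), "Other")

print(train_demo["director1"].tolist())
print(test_demo["director1"].tolist())
```

Director `C` appears in the training set but too rarely, and `D` never appears there at all, so both end up as `Other` in the test set.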

In [10]:
# we'll assume we don't know who the director is
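# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
train['director'] = train['director'].fillna('Unknown')
test['director'] = test['director'].fillna('Unknown')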

# how often did each director appear in the training set?
director_counts = train.groupby('director')['director'].transform('count')
# whenever this value is less than 10, set the value to Other (i.e. keep directors with at least 10 films)
train['director1'] = np.where(director_counts >= 10, train['director'], 'Other')

In [11]:
# and now transform the test set in a similar manner -- check to see if the unique values in the director column match
# what's in the training column
test['director1'] = np.where(test['director'].isin(train['director1']), test['director'], 'Other')

**Technique 2:** Use target encoding to transform the column instead, and use the results from your training set to transform your test set.  There are a lot of directors in your test set that are not in your training set, and this will result in missing values.  Fill these in with the column average.

**Bonus:** The method we're using here is a little blunt because our average value doesn't account for how often a particular value occurs.  A more nuanced approach is to take some sort of weighted share between the overall column average and the average of your particular unique value.  A good article on this is here:  https://maxhalford.github.io/blog/target-encoding-done-the-right-way/
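
One possible reading of the bonus is additive smoothing: blend each director's mean with the overall mean, weighting the director's mean by how many films they have. The sketch below assumes that form. The helper `smoothed_target_encode` and the weight `m` are hypothetical names chosen for this example, and the data is a toy frame rather than the movies set.

```python
import pandas as pd

def smoothed_target_encode(train_df, cat_col, target_col, m=5.0):
    """Blend each category's mean with the global mean.

    m acts like a number of 'pseudo-observations' at the global mean:
    rare categories get pulled towards it, frequent ones barely move.
    """
    global_mean = train_df[target_col].mean()
    stats = train_df.groupby(cat_col)[target_col].agg(["count", "mean"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed, global_mean

train_demo = pd.DataFrame({
    "director": ["A", "A", "A", "A", "B"],
    "revenue": [10.0, 12.0, 11.0, 11.0, 20.0],
})
test_demo = pd.DataFrame({"director": ["A", "B", "Z"]})

# Learn the encoding from the training set only.
encoding, global_mean = smoothed_target_encode(train_demo, "director", "revenue", m=2.0)

train_demo["director_te"] = train_demo["director"].map(encoding)
# Directors unseen in training (here 'Z') fall back to the global mean.
test_demo["director_te"] = test_demo["director"].map(encoding).fillna(global_mean)

print(encoding)
print(test_demo)
```

Director `B` has a single film with revenue 20, but the encoded value is pulled towards the overall mean of 12.8, which is the behaviour the bonus is after.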

In [12]:
# average value of each director's gross
avg_director_gross = train.groupby('director')['revenue'].mean()
# map those values to what we already have
train['director2'] = train['director'].map(avg_director_gross)

In [13]:
# do the same for the test column, and make sure to fill in missing values
test['director2'] = test['director'].map(avg_director_gross)

In [14]:
# fill in missing values with avg value of the revenue column
test['director2'] = test['director2'].fillna(train['revenue'].mean())

Use 10-fold univariate regression on both to see which one gives you a better result.

In [15]:
# your code here
dir_scores1 = cross_val_score(estimator=lreg, X=pd.get_dummies(train['director1']), y=y, scoring=mean_squared_log_error, cv=10)
dir_scores2 = cross_val_score(estimator=lreg, X=train[['director2']], y=y, scoring=mean_squared_log_error, cv=10)

In [16]:
# and the winner is target encoding: even if you ignore the one very strange fold, the one-hot errors are still higher
np.mean(dir_scores1), np.mean(dir_scores2)

(4.973936278303664e+22, 30.57244566730471)

#### Step 5).  Define your new version of `X` to be the 'winning' versions of what we've tested so far, as well as all of the original numeric columns

 - if they need to be one-hot encoded, then do so -- make sure to concatenate training and test first so both end up with the same columns
 - using the `StandardScaler` class, make sure to `fit` it on the training set only, and then use it to `transform` both the training and the test set to standardize your data
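
Both points can be sketched on toy data. The frames and the `genre` and `budget` columns below are invented for illustration. Concatenating before `pd.get_dummies` guarantees that both sets share the same dummy columns, and fitting the scaler on the training rows only means the test set is standardized with the training mean and standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train_demo = pd.DataFrame({"genre": ["a", "b", "a", "c"],
                           "budget": [1.0, 2.0, 3.0, 4.0]})
test_demo = pd.DataFrame({"genre": ["b", "a"],
                          "budget": [10.0, 20.0]})

# One-hot encode the concatenation so both sets get identical columns,
# then split back apart using the keys.
combined = pd.concat([train_demo, test_demo], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["genre"], dtype=float)
X_train_demo = encoded.loc["train"]
X_test_demo = encoded.loc["test"]

# Fit on the training set only, then apply the same transform to both.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(list(X_train_demo.columns))
print(X_train_scaled.mean(axis=0))  # approximately zero by construction
print(X_test_scaled[:, 0])          # not centred: uses the training statistics
```

Genre `c` never occurs in the toy test set, yet the test matrix still has a `genre_c` column, which is what a model fitted on the training columns needs.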

In [17]:
# import the module
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# we never actually saved this before
train['vote_count'] = np.log1p(train['vote_count'])
test['vote_count'] = np.log1p(test['vote_count'])

# and grab all numeric columns -- these are the ones that won in this case
num_cols = train.select_dtypes(include=np.number).columns.tolist()
# except for this one
num_cols.remove('revenue')

# define X
X = train[num_cols]
# reduce the test set down to the same number of columns
test = test[num_cols]

# call fit and transform on the training set
X = sc.fit_transform(X)
# and then transform the test set as well
test = sc.transform(test)


#### Step 6).  To get an estimate of your model's performance, use 10-fold cross validation on your training set

In [18]:
# your code here
scores = cross_val_score(estimator=lreg, X=X, y=y, scoring=mean_squared_log_error, cv=10)
np.mean(scores)

25.835122180347376

#### Step 7).  Now, before making your final predictions for your test set, fit the model on all of your training data

In [19]:
# your code here
lreg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

#### Step 8).  Make a prediction on your test set, and save the results as a dataframe, using two columns:

 - **id**:  the id of your test set rows, numbered from 1 to 2000
 - **predictions**: your corresponding predictions
 
Save this to a CSV file, using the option `index=False`
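
Because the model was trained on `np.log1p(revenue)`, its predictions are on the log scale and need `np.expm1` to get back to the original units. The sketch below illustrates the round trip and the requested file layout using invented numbers. `log_preds` and `submission_demo` are hypothetical names, not the lab's actual predictions.

```python
import numpy as np
import pandas as pd

# np.expm1 is the exact inverse of np.log1p.
revenue = np.array([0.0, 1_000.0, 2_500_000.0])
recovered = np.expm1(np.log1p(revenue))

# Pretend these are model outputs on the log scale.
log_preds = np.array([13.0, 15.5, 17.2])

submission_demo = pd.DataFrame({
    "id": np.arange(1, len(log_preds) + 1),  # ids start at 1
    "predictions": np.expm1(log_preds),      # back to the original scale
})

# With no path given, to_csv returns the CSV text, which is handy for a quick check.
csv_text = submission_demo.to_csv(index=False)
print(csv_text)
```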

In [20]:
# your code here
preds = lreg.predict(test)

In [21]:
# and put it into a dataframe
submission = pd.DataFrame({
    'id': np.arange(1, 2001),
    'predictions': np.expm1(preds)  # undo the log1p transform of the target
})

In [22]:
# and if we wanted, we could save this to a csv file in the following way
submission.to_csv('submission.csv', index=False)