### Linear Regression Modeling Lab

This lab will walk you through the basics of building a linear regression model from a training set and a test set using a variety of techniques, including:

 - estimating distributional fit
 - one-hot and target encoding
 - measuring progress with cross-validation scores
 - creating a custom loss function
 - properly using inferences from the training set to transform the test set
 
**Some of these columns might have missing values.  Decide on the best approach for filling them in based on what we did in the last class.**
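
As a reminder of the general pattern, here is a minimal sketch of one common option, median imputation. It is not necessarily the approach from last class, and the `budget` column and toy values are hypothetical. The point it illustrates is that the fill value is learned from the training set and then reused on the test set.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test; "budget" is a hypothetical column name.
train_demo = pd.DataFrame({"budget": [10.0, np.nan, 30.0, 50.0]})
test_demo = pd.DataFrame({"budget": [np.nan, 20.0]})

# Learn the fill value from the training set only.
train_median = train_demo["budget"].median()

# Apply the same training-set value to both frames.
train_demo["budget"] = train_demo["budget"].fillna(train_median)
test_demo["budget"] = test_demo["budget"].fillna(train_median)
```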

#### Step 1).  Load the training and test sets from the `\movies` folder inside the `\Data` folder

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from IPython.display import display
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from statsmodels.tools import add_constant
%matplotlib inline

pd.options.display.max_columns = None  # show every column when displaying dataframes
lreg = LinearRegression()  # linear regression estimator instance

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot

In [3]:
train = pd.read_csv('/Users/devonbancroft/Desktop/DAT-10-14/class material/Unit3/Data/movies/train.csv')
test = pd.read_csv('/Users/devonbancroft/Desktop/DAT-10-14/class material/Unit3/Data/movies/test.csv')

#### Step 2).  Using a Custom Loss Function

To avoid some of the pitfalls of using a loss function that measures squared error, we're going to modify it a little bit.  This is also a useful skill in practice because lots of projects will require something precise that's not available out-of-the-box in a library.

`Scikit-Learn` allows for custom loss functions relatively easily.

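We're going to use the **mean squared log error** instead.  It has the following form:

$$ \frac{\sum{\left(\log_{e}(1 + y) - \log_{e}(1 + \hat{y})\right)^2}}{n} $$

where $y$ is the actual value, $\hat{y}$ is the predicted value, and $n$ is the number of observations.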

The easiest way to do this is the following:

 - take the log of y using `np.log1p`, which computes log(1 + y), so it is defined at zero and never returns a negative value for non-negative y
 - fit your model to that, and then calculate the resulting mean squared error
 
So your job is three-fold:
 - log transform the target variable (revenue)
 - create a function called `mean_squared_log_error` according to the specifications defined here:  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html, under the heading for the `scoring` argument
 - to test that you did this correctly, run a 10-fold univariate linear regression on the training set using the `popularity` column as `X` and `revenue` as `y`.  The correct value should be 60.7.
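
The shape of such a scorer can be sketched as follows. This is an illustration under stated assumptions, not the reference solution: the data is synthetic (so it will not reproduce 60.7), and the function assumes the target passed in has already been transformed with `np.log1p`. A callable passed to `scoring` takes the signature `(estimator, X, y)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def mean_squared_log_error(estimator, X, y):
    """Scorer with the (estimator, X, y) signature.

    Assumes y is already log1p-transformed, so this is the plain
    mean squared error measured on the log scale.
    """
    preds = estimator.predict(X)
    return np.mean((np.asarray(y) - preds) ** 2)

# Synthetic stand-in for the training set (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"popularity": rng.gamma(2.0, 4.0, size=200)})
demo["revenue"] = 1e6 * demo["popularity"] * rng.lognormal(0.0, 1.0, size=200)

X = demo[["popularity"]]
y_log = np.log1p(demo["revenue"])  # log-transform the target first

scores = cross_val_score(LinearRegression(), X, y_log, cv=10,
                         scoring=mean_squared_log_error)
print(scores.mean())
```

Note that scikit-learn treats scorer outputs as "higher is better" when it selects models, so a raw error like this one is fine for reporting but would need its sign flipped for anything like a grid search.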

In [None]:
# your code here

#### Step 3).  Distributional Inference of Your Continuous Variables

This dataset is far from normal.  Use the `probplot()` function from `scipy.stats` to find the *least* normal variable among your numeric variables, judging by the r-squared value of the resulting line.

Then, see if log-transforming improves its behavior at all.  Compare the validation scores of a univariate regression on the treated and untreated versions of the variable to decide whether or not the transform made anything better.
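
The first half of this step can be sketched as below, assuming synthetic columns in place of the real ones (the column names and the helper `probplot_r2` are hypothetical). `probplot` returns the correlation coefficient `r` of the fitted line, so it is squared here to get r-squared.

```python
import numpy as np
import pandas as pd
from scipy.stats import probplot

# Synthetic numeric columns; the names are placeholders, not dataset columns.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(50, 10, size=500),
    "heavily_skewed": rng.lognormal(mean=10, sigma=2, size=500),
})

def probplot_r2(series):
    """r-squared of the normal probability plot's least-squares line."""
    _, (slope, intercept, r) = probplot(series.dropna(), dist="norm")
    return r ** 2

# One r-squared per numeric column; the smallest is the least normal.
r2 = demo.select_dtypes("number").apply(probplot_r2)
least_normal = r2.idxmin()

# Check whether a log transform brings that column closer to normal.
r2_logged = probplot_r2(np.log1p(demo[least_normal]))
print(r2, least_normal, r2_logged)
```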

In [None]:
# your code here

#### Step 4).  Encoding the `Director` Column

The `Director` column is a good example of some of the challenges of dealing with categorical data.  If George Lucas or Steven Spielberg directs a film, there's a good chance it has a non-random impact on the film's bottom line.  However, there are a lot of unique values, most of which are probably non-impactful.

Creating a column for everyone is probably not a good idea, but there's also no clear 'order' you could assign them just by looking at their labels.  

In this step you're going to try two different techniques to see which one works better on your dataset.

**Technique 1:**  Only include directors that have a value count of at least 10 *in your training set*, and set everything else to `Other`.

So:

 - transform the column accordingly (you can make a new column if that's easier)
 - transform the same column in your test set so that if a director's name *doesn't* appear in your new training column it gets set to `Other`
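
The two bullets above can be sketched on toy data like this. The frames and the `Director_grouped` column name are hypothetical; the key idea is that the list of frequent directors comes from the training set only.

```python
import pandas as pd

# Toy data: A and B appear at least 10 times in training, C and D do not.
train_demo = pd.DataFrame({"Director": ["A"] * 12 + ["B"] * 10 + ["C"] * 3 + ["D"]})
test_demo = pd.DataFrame({"Director": ["A", "C", "E", "B"]})

# Directors with at least 10 films in the training set.
counts = train_demo["Director"].value_counts()
frequent = counts[counts >= 10].index

# Keep frequent directors, replace everyone else with "Other".
train_demo["Director_grouped"] = train_demo["Director"].where(
    train_demo["Director"].isin(frequent), "Other")
test_demo["Director_grouped"] = test_demo["Director"].where(
    test_demo["Director"].isin(frequent), "Other")
```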

In [None]:
# your code here

**Technique 2:** Use target encoding to transform the column instead, and use the results from your training set to transform your test set.  There are a lot of directors in your test set that are not in your training set, and this will result in missing values.  Fill these in with the column average.
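
A minimal sketch of plain target encoding on toy data follows. The `log_revenue` and `Director_te` names are hypothetical, and "column average" is interpreted here as the mean of the encoded training column, which is an assumption.

```python
import pandas as pd

train_demo = pd.DataFrame({
    "Director": ["A", "A", "B", "B", "C"],
    "log_revenue": [10.0, 12.0, 8.0, 6.0, 9.0],
})
test_demo = pd.DataFrame({"Director": ["A", "B", "Z"]})  # Z is unseen in training

# Average target per director, learned from the training set only.
director_means = train_demo.groupby("Director")["log_revenue"].mean()

train_demo["Director_te"] = train_demo["Director"].map(director_means)
test_demo["Director_te"] = test_demo["Director"].map(director_means)

# Unseen directors map to NaN; fill with the training column's average (assumed).
fill_value = train_demo["Director_te"].mean()
test_demo["Director_te"] = test_demo["Director_te"].fillna(fill_value)
```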

**Bonus:** The method we're using here is a little blunt because our average value doesn't account for how often a particular value occurs.  A more nuanced approach is to take some sort of weighted share between the overall column average and the average of your particular unique value.  A good article on this is here:  https://maxhalford.github.io/blog/target-encoding-done-the-right-way/
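
One common form of such a weighted share is sketched below, assuming a smoothing weight `m` (a hypothetical tuning parameter) that pulls rarely seen directors toward the overall mean. The toy data and column names are made up for illustration.

```python
import pandas as pd

train_demo = pd.DataFrame({
    "Director": ["A", "A", "B", "B", "C"],
    "log_revenue": [10.0, 12.0, 8.0, 6.0, 9.0],
})

m = 2  # hypothetical smoothing weight; larger values shrink harder toward the global mean
global_mean = train_demo["log_revenue"].mean()

# Per-director count and mean of the target.
stats = train_demo.groupby("Director")["log_revenue"].agg(["count", "mean"])

# Weighted average of the director's own mean and the global mean.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train_demo["Director_te_smooth"] = train_demo["Director"].map(smoothed)
```

With `m = 0` this reduces to the plain per-director average, and as `m` grows every director converges to the overall mean.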

In [None]:
# your code here

Use 10-fold univariate regression on both to see which one gives you a better result.
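
The comparison itself might look like the sketch below, run here on synthetic data with made-up director effects. It assumes the grouped column is one-hot encoded before fitting and that both encodings already exist as columns. For brevity the target encoding is computed on the full toy set, which leaks a little information into each fold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: three director groups with made-up effects on log revenue.
rng = np.random.default_rng(2)
effect = pd.Series({"A": 2.0, "B": -1.0, "Other": 0.0})
demo = pd.DataFrame({"Director_grouped": rng.choice(["A", "B", "Other"], size=300)})
demo["log_revenue"] = (15 + demo["Director_grouped"].map(effect)
                       + rng.normal(0, 1, size=300))
demo["Director_te"] = demo["Director_grouped"].map(
    demo.groupby("Director_grouped")["log_revenue"].mean())

X_onehot = pd.get_dummies(demo["Director_grouped"], drop_first=True).astype(float)
X_target = demo[["Director_te"]]
y = demo["log_revenue"]

def cv_mse(X):
    """Mean 10-fold cross-validated MSE (sign flipped back to positive)."""
    scores = cross_val_score(LinearRegression(), X, y, cv=10,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

results = {"one-hot": cv_mse(X_onehot), "target": cv_mse(X_target)}
print(results)
```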

In [None]:
# your code here

#### Step 5).  Standardize Your Data using the `StandardScaler` class

 - make sure to `fit` it on the training set only, then use it to `transform` both the training set and the test set
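
A minimal sketch of that pattern on toy arrays (the values are made up) looks like this. The scaler's mean and standard deviation come from the training rows, and the test rows are transformed with those same numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices standing in for the real train/test features.
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test_demo)        # reuse them on test, no refit
```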

In [None]:
# your code here

#### Step 6).  To get an estimate of your model's performance, use 10-fold cross-validation on your training set
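
One way to do this is sketched below on synthetic data. Wrapping the scaler and the regression in a pipeline is a design choice of this sketch, not a requirement of the lab: it means each fold fits the scaler on its own training portion only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (made-up coefficients and noise).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=200)

# Scaling happens inside each fold because it is part of the pipeline.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X_demo, y_demo, cv=10,
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(cv_mse)
```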

In [None]:
# your code here

#### Step 7).  Now, before making your final predictions for your test set, fit the model on all of your training data

In [None]:
# your code here

#### Step 8).  Make a prediction on your test set, and save the results as a dataframe, using two columns:

 - **id**:  the id of your test set rows
 - **prediction**: your corresponding predictions
 
Save this to a CSV file, using the option `index=False`
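
Steps 7 and 8 together can be sketched as follows on synthetic data. Two things here are assumptions rather than lab requirements: the output file name and location, and converting predictions back from the log scale with `np.expm1` (skip that if your predictions are expected on the log scale).

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the prepared train and test frames.
rng = np.random.default_rng(4)
train_demo = pd.DataFrame({"id": range(1, 101),
                           "popularity": rng.gamma(2.0, 4.0, size=100)})
train_demo["revenue"] = 1e6 * train_demo["popularity"] * rng.lognormal(0, 0.3, size=100)
test_demo = pd.DataFrame({"id": range(101, 111),
                          "popularity": rng.gamma(2.0, 4.0, size=10)})

features = ["popularity"]

# Step 7: fit on all of the training data (log-transformed target).
model = LinearRegression().fit(train_demo[features], np.log1p(train_demo["revenue"]))

# Step 8: predict, undo the log transform (assumption), and build the frame.
log_preds = model.predict(test_demo[features])
submission = pd.DataFrame({"id": test_demo["id"],
                           "prediction": np.expm1(log_preds)})

# Hypothetical output path; index=False keeps the row index out of the file.
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission.to_csv(out_path, index=False)
```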

In [None]:
# your code here