## Assignment 4 -- Applying Machine Learning Algorithms

You will be given a data set, and will apply the techniques we have studied in class to predict a numeric response variable, and to evaluate alternative solutions.

The steps will be

1.  Exploratory data analysis and evaluation
  * Find and correct outliers and missing values
  * Find nonlinear relationships between the independent variables and the dependent variable, and transform the input appropriately
  * Find correlations among the independent variables and decide how, if at all, to address them
  

2. Apply learning techniques.  In each case you will train the algorithm and evaluate it using the *test* $R^2$ statistic.  You will explore different hyperparameter values to find the model you think will maximize test $R^2$ for an evaluation data set.  You will do this for
  * Linear regression exploring different variable sets using regular stepwise regression, Lasso, and Ridge Regression
  * Decision tree regression exploring different tree depths
  * Random forests and boosting exploring different parameter sets
  * Neural networks 


3. Choosing the best method.  You will choose one algorithm and parameter settings you are most confident with, and write a function that enables it to evaluate a new data set.

4. When I evaluate your solution, I will call this function on a new set of data, and score your solution (partially) on its results.
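
As a rough illustration of what "test $R^2$" means in step 2, here is a minimal sketch that holds out part of the data and scores a model on rows it never saw. The synthetic data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made up for this example; they are not part of the assignment data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): y is linear in two features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_model = LinearRegression().fit(X_train, y_train)

# Train R^2 is usually optimistic; test R^2 estimates performance on new data.
train_r2 = r2_score(y_train, demo_model.predict(X_train))
test_r2 = r2_score(y_test, demo_model.predict(X_test))
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The same pattern applies to any of the learners in step 2: only the estimator changes, while the split and the scoring call stay the same.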

Every part of this assignment has been covered in the notebooks we have looked at in class, so that should be your first source of information and inspiration. 

<b><span style="color: blue">Cells in blue indicate you should fill in your results -- either text or code.</span></b>

When you submit your code, please fill in the cells asked for, but do not add new cells or change the other cells.

I will run the cells in your submitted notebook in sequence, so make sure things are in the proper order, all the needed libraries have been imported, etc.

----------------------------
#### Note on your prediction functions

You will notice that for each technique, you are asked to provide a "prediction" function that takes an **X** matrix as input.  This **X** matrix will be in the format of the original data set you loaded.  So if in your data cleaning phase you added, deleted, or transformed columns, each of these prediction functions must make the same transformations on its input prior to calling your model.
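
One possible way to keep training-time and prediction-time transformations in sync is sketched below. It is only an illustration under assumptions: the helper `make_transformer` and the toy frames are hypothetical, and the sketch covers dummy coding only. The idea is to remember the training columns and align any new input to them, so that a new data set that happens to lack a category level still produces the same columns.

```python
import pandas as pd

def make_transformer(train_raw, categorical_cols):
    """Build a function that dummy-codes new data exactly like the training data."""
    train_coded = pd.get_dummies(train_raw, columns=categorical_cols, drop_first=True)
    train_columns = list(train_coded.columns)

    def transform(raw):
        # Code every level present in `raw`, then align to the training columns:
        # levels missing from `raw` become all-zero columns, and the baseline
        # level (dropped during training) is discarded here as well.
        coded = pd.get_dummies(raw, columns=categorical_cols)
        return coded.reindex(columns=train_columns, fill_value=0)

    return transform

# Hypothetical toy frames standing in for the raw training and evaluation data.
train_raw = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": ["b", "r", "y"]})
new_raw = pd.DataFrame({"x1": [4.0, 5.0], "x2": ["y", "y"]})

transform = make_transformer(train_raw, ["x2"])
new_coded = transform(new_raw)
print(list(new_coded.columns))
```

A prediction function would then call something like `model.predict(transform(X))` rather than passing the raw **X** straight to the model.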


---------------------------------------

### Loading and Cleaning the Data Set

The data set is in a file named **data_set.csv** -- it has 11 independent variables -- some are numeric and some categorical -- and a single numeric response variable $y$.

In the first cleaning / analysis phase you should do the following
1. Look for outlier values.  When you find outliers, you can do one of two things
  * Throw away the data row altogether.  If many variables in the row seem uncommon, it is probably best to delete the row.
  * Replace the outlier with a "reasonable" value -- probably the mean or median value for that variable.   The reasoning is that the benefit of keeping the row outweighs the error introduced by having a made-up value in one variable
2. Look for missing values.  Most algorithms will throw away data rows that have *any* missing value.  You can just delete the row (especially if many values are missing) or assign it a "reasonable" value -- probably the mean value for that variable.  If there is an attribute that has many missing values, it is probably best to delete the whole column
3. Look for nonlinear relationships.   Most important is finding nonlinear relationships between one of the $x$ variables and the $y$ variable.  For example, maybe $y$ depends on $x^2$ or $\log(x)$ rather than on $x$. In that case you need to guess at that relationship, and replace $x$ with a transformed value.  For example, if it looks like $y$ depends on $x^2$ then just add a column of $x^2$ values.  The easiest way to see these relationships is to do a pair plot between your X variables and y, including a trend line.  "Well behaved" $x$ variables tend to show no pattern except for (roughly) following the trend line.  If you are seeing other shapes, or sudden jumps in the behavior of $y$ as $x$ changes, something nonlinear is going on.
4. Look for correlations among the $x$ variables.  If you find a correlation you may want to delete one of the correlated variables, but it is not necessary -- you will have to experiment to see if it improves your predictions.  To find correlations, you can use the pair plot, or a correlation matrix, or a heatmap -- there are examples of all of these in the class notebooks.
5. Transform categorical variables to dummy (0/1 coded) variables
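
Steps 1 and 2 above mention replacing an outlier or a missing value with a "reasonable" value instead of deleting the row. A minimal sketch of that alternative follows; the three-standard-deviation cutoff, the use of the median, and the function names are assumptions chosen for illustration, not a prescribed method.

```python
import numpy as np
import pandas as pd

def replace_outliers_with_median(df, n_std=3.0):
    """Replace numeric values more than n_std standard deviations from the
    column mean with the column median; other columns are left untouched."""
    out = df.copy()
    for name in out.select_dtypes(include="number").columns:
        col = out[name]
        is_outlier = (col - col.mean()).abs() > n_std * col.std()
        out.loc[is_outlier, name] = col.median()
    return out

def fill_missing_with_median(df):
    """Fill missing numeric values with the median of their column."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out

# Hypothetical toy data: x1 has one extreme value, x3 has one missing value.
demo = pd.DataFrame({
    "x1": [1.0] * 20 + [1000.0],
    "x3": [np.nan] + [3.0, 5.0] * 10,
})
cleaned = fill_missing_with_median(replace_outliers_with_median(demo))
print(cleaned.describe())
```

Every row is kept, at the cost of a made-up value in one cell, which is the trade-off described in step 1.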

The result of this phase should be a matrix ${\bf X}$ and a vector ${\bf y}$ that comprise your training set.

Remember though, if you ever need to apply your learned function to a new ${\bf X}$ data set, you need to transform the input ${\bf X}$ matrix the same way you transformed your training set -- otherwise your learned function will give bad results.

Although it looks like this cleaning phase happens before any analysis/learning, the two processes are interleaved.  Start with a simple linear regression model and do minimal cleaning on your data set, just to the point the LR model works.  Then you can test more subtle transformations to see if they make your models perform better.
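
To make the "test a transformation, then see if the model improves" loop concrete, here is a small sketch that compares cross-validated $R^2$ for a raw feature against its square. The data is synthetic and the helper `cv_r2` is a hypothetical name; with the real data set you would substitute your own columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in (hypothetical): y depends on x squared, not on x itself.
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=300)
y_demo = 2.0 * x**2 + rng.normal(scale=0.5, size=300)

def cv_r2(features):
    """Mean 5-fold cross-validated R^2 of a plain linear regression."""
    scores = cross_val_score(LinearRegression(), features, y_demo,
                             cv=5, scoring="r2")
    return scores.mean()

r2_raw = cv_r2(x.reshape(-1, 1))            # linear in x: poor fit
r2_squared = cv_r2((x**2).reshape(-1, 1))   # linear in x^2: good fit
print(f"raw x: {r2_raw:.3f}   x squared: {r2_squared:.3f}")
```

A large jump in the cross-validated score is evidence that the transformation is worth keeping; a negligible change suggests leaving the column alone.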

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from pandas.api.types import is_numeric_dtype

# Get rid of rows with invalid values
def correctDummies(df):
    df = df[df.x4.isin(['true', 'false'])]
    df = df[df.x2.isin(['r', 'b', 'y'])]
    return df

#  Add the dummy variables and get rid of x2 and x4.   Has to be done after correctDummies!
def convertDummies(df):
    dfnew = df.copy()
    x2_dummies = pd.get_dummies(dfnew.x2, prefix='x2')
    x4_dummies = pd.get_dummies(dfnew.x4, prefix='x4')
    x2_dummies.drop(x2_dummies.columns[0], axis=1, inplace=True)
    x4_dummies.drop(x4_dummies.columns[0], axis=1, inplace=True)
    dfnew.drop(['x2', 'x4'], axis=1, inplace=True)
    dfnew = pd.concat([dfnew, x2_dummies], axis=1)
    dfnew = pd.concat([dfnew, x4_dummies], axis=1)
    return dfnew

def remove_outliers(data):
    for name in list(data.columns):
        if is_numeric_dtype(data[name]):
            data = data[np.abs(data[name]-data[name].mean()) < (3*data[name].std())]
    return data

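def remove_missing_values(data):
    # Return a new frame without rows that contain any missing value
    return data.dropna()

def update_nonlinear_values(data):
    # y appears to depend on x8 squared, so replace x8 with its square
    data = data.copy()  # avoid modifying the caller's frame
    data['x8'] = data['x8']**2
    return data

def separate_column_values(data):
    # Split x5 at the training mean (the global data_x5_mean, which is set
    # below before cleandf is called): x5_b keeps values >= mean, x5_s keeps
    # values < mean, and the other column holds 0 in that row
    data = data.copy()
    data['x5_b'] = (data['x5'] >= data_x5_mean).astype(int)*data['x5']
    data['x5_s'] = (data['x5'] < data_x5_mean).astype(int)*data['x5']
    return data.drop(['x5'], axis=1)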

def deal_with_unordered_categories(data):
    for name in list(data.columns):
        if not is_numeric_dtype(data[name]):
            temp_dummies = pd.get_dummies(data[name], prefix=name)
            temp_dummies.drop(temp_dummies.columns[0], axis=1, inplace=True)
            data.drop(name, axis=1, inplace=True)
            data = pd.concat([data, temp_dummies], axis=1)
    return data

def cleandf(df):
    df = df.dropna(inplace=False)
    df = update_nonlinear_values(df)
    df = separate_column_values(df)
    df = correctDummies(df)
    df = convertDummies(df)
    return df

file_location = 'data_set.csv'
df = pd.read_csv(file_location, dtype={'x4': str})
df = remove_outliers(df)
data_x5_mean = df.x5.mean()
df = cleandf(df)
y = df.y
X = df.drop(['y'], axis=1, inplace=False)

-----------------------------
#### <span style="color: blue">*Your Summary of EDA / Cleaning Phase*</span>

<span style="color: blue">

*In this markdown cell please write up the transformations you made to the data set, and why you decided to make those transformations.*
1. Look for outlier values: If a value differs from its column mean by three or more standard deviations of that column, the row is removed.<br><br>

2. Look for missing values and correct the values: If a row contains a missing value, the row is removed. In addition, only rows with 'r', 'b', or 'y' in the 'x2' column and 'true' or 'false' in the 'x4' column are kept. All others are removed.<br><br>

3. Look for nonlinear relationships: After checking the pair plot between the X variables and y, the only nonlinear relationship I found is that 'y' appears to depend on the square of 'x8'. So, I added "data.x8 = data.x8**2". Also, 'x5' is split at its mean value into two columns, 'x5_b' and 'x5_s', because there are two tendencies within this column.<br><br>

4. Look for correlations among the 𝑥 variables: After checking the correlations among the x variables, I found that only 'x6' and 'x7' have a very high correlation (~0.968). However, after I ran all the regressions, the results with both variables kept were better than with one of them removed. So, I kept both variables in the data set and left it to the regression models to judge.<br><br>

5. Transform categorical variables to dummy (0/1 coded) variables: Only 'x2' and 'x4' are transformed to dummy variables.

</span>

---------------------------------

In the following cells you will try various learning techniques on your data set.  For each one you will finish with a prediction function for your best model.  For example for linear regression you will define a function **linear_regression_predict(X)** which will produce the predicted $y$ values for your model.  Remember that the **X** argument is un-transformed.

-----------------------------------------
### Linear Regression

In this section you will try linear regression, and also use Lasso, Ridge Regression and Forward Stepwise Regression to find the set of variables that give you the best $R^2$ score.   You will produce a markdown summary, then implementations of your four models
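
One common way to choose the regularization strength $\alpha$ for Lasso and Ridge is a cross-validated grid search. The sketch below assumes synthetic data and an arbitrary grid of $\alpha$ values; `search_alpha` is a hypothetical helper, and the grid you actually need depends on the scale of your variables.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in (hypothetical): only the first two of five columns matter.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

alpha_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

def search_alpha(estimator):
    """Return the alpha with the best mean 5-fold R^2, and that R^2."""
    search = GridSearchCV(estimator, alpha_grid, cv=5, scoring="r2")
    search.fit(X_demo, y_demo)
    return search.best_params_["alpha"], search.best_score_

lasso_alpha, lasso_r2 = search_alpha(Lasso(max_iter=10000))
ridge_alpha, ridge_r2 = search_alpha(Ridge())
print(f"Lasso: alpha={lasso_alpha}, CV R^2={lasso_r2:.3f}")
print(f"Ridge: alpha={ridge_alpha}, CV R^2={ridge_r2:.3f}")
```

Lasso applies an L1 penalty and can shrink coefficients exactly to zero, so it also acts as a variable selector; Ridge applies an L2 penalty and shrinks coefficients without removing them.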


#### <span style="color: blue">Summary of Your Linear Regression Models</span>
<span style="color: blue">
In this markdown cell, summarize your results in building linear regression models for this data set.
For each method (full-model regression, forward stepwise regression, Lasso, and Ridge regression) report on the best model:  the variables in the model, the adjusted $R^2$, the estimated test accuracy, and in the case of Lasso, the optimal $\alpha$ value.  Can you explain the differences in the structure and performance of the alternative models?
    
1. Full-model regression: (using GridSearchCV)<br>
   Variables: 'x1', 'x3', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x5_b', 'x5_s', 'x2_r', 'x2_y', 'x4_true'<br>
   Adjusted R2: 0.961<br>
   Estimated test accuracy: 275849306.98 (MSE)<br>
2. Forward stepwise regression: <br> 
   Selected variables: 'x5_b', 'x1', 'x8', 'x3', 'x2_r', 'x2_y', 'x5_s'<br>
   Adjusted R2: 0.961<br>
   Estimated test accuracy: 275889252.69 (MSE)<br>
3. Lasso regression: (using GridSearchCV)<br>
   best 𝛼: 0.1<br>
   Adjusted R2: 0.961<br>
   Estimated test accuracy: 275849778.72 (MSE)<br>
4. Ridge regression: (using GridSearchCV)<br>
   best 𝛼: 1.0<br>
   Adjusted R2: 0.961<br>
   Estimated test accuracy: 275849310.36 (MSE)<br>
   
All the results are almost the same, so it is difficult to say that one particular regression is the best. Structurally, the major difference between the full-model and forward stepwise regressions is the set of variables used (the full model uses all 13 variables, and forward stepwise uses 7). As for Lasso and Ridge regression, judging only by the results and the implementation, the main difference between them is the function name. Both use 𝛼 as their parameter.
</span>

In [2]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

def linear_regression_predict(xdf):
    return lr.predict(xdf)

In [3]:
import statsmodels.formula.api as smf

def forward_selected(data, response):
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model, selected

fsmodel, _ = forward_selected(df, 'y')

def stepwise_regression_predict(xdf):
    return fsmodel.predict(xdf)

In [4]:
from sklearn import linear_model
best_alpha = 0.1
lasso = linear_model.Lasso(alpha=best_alpha)
lasso.fit(X, y)

def lasso_predict(xdf):
    return lasso.predict(xdf)

In [5]:
from sklearn import linear_model
best_alpha = 1.0
ridge = linear_model.Ridge(alpha=best_alpha)
ridge.fit(X, y)

def ridge_predict(xdf):
    return ridge.predict(xdf)

--------------------------------------
### Decision Tree Regressors and Ensemble Methods

Here you will build decision tree regression learners, and experiment to optimize algorithm parameters.  You will implement learners for 
* Decision trees
* Random forest
* Boosting
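
Exploring tree depths can be done with a simple loop over `max_depth`, scoring each setting by cross-validation. The following is a sketch on synthetic data; the depth range of 1 to 12 and the names `depth_scores` and `best_depth_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (hypothetical): a nonlinear target in two features.
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 5 * np.sin(X_demo[:, 0]) + X_demo[:, 1] + rng.normal(scale=0.3, size=400)

# Mean 5-fold R^2 for each candidate depth.
depth_scores = {}
for depth in range(1, 13):
    tree_model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree_model, X_demo, y_demo, cv=5, scoring="r2")
    depth_scores[depth] = scores.mean()

best_depth_demo = max(depth_scores, key=depth_scores.get)
print(f"best depth = {best_depth_demo}, CV R^2 = {depth_scores[best_depth_demo]:.3f}")
```

Very shallow trees underfit and very deep trees tend to overfit, so the cross-validated score typically peaks somewhere in between. The same loop structure can be reused for random forest or boosting parameters.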

#### <span style="color: blue">Summary of Your Decision Tree and Ensemble Method Models</span>
<span style="color: blue">
In this markdown cell, summarize your results in building tree-based models for this data set.
For each method report on the best model:  the model parameters and the estimated test accuracy.<br>
1. Decision Tree:<br>
   Model parameters: Best depth = 9<br>
   Adjusted R2: 0.969<br>
   Estimated test accuracy: 216644138.99 (MSE)<br><br> 
2. Random Forest:<br>
   Model parameters: Best n_estimators = 200, Best max_features = 13<br>
   Adjusted R2: 0.995<br>
   Estimated test accuracy: 37983485.32 (MSE)<br><br>
3. Boosting:<br>
   Model parameters: Best estimators = 200, Best learning rate = 1.0 <br>
   Adjusted R2: 0.916<br>
   Estimated test accuracy: 591237762.77 (MSE)
</span>

In [6]:
from sklearn import tree
best_depth = 9
dtree = tree.DecisionTreeRegressor(max_depth = best_depth)
dtree.fit(X, y)
def decision_tree_predict(xdf):
    return dtree.predict(xdf)

In [7]:
from sklearn.ensemble import RandomForestRegressor
best_n_estimators, best_max_features = 200, 13
forest = RandomForestRegressor(n_estimators=best_n_estimators,
                                       max_features=best_max_features)
forest.fit(X,y)
def random_forest_predict(xdf):
    return forest.predict(xdf)

In [8]:
from sklearn.ensemble import AdaBoostRegressor
best_n_estimators, best_learning_rate = 200, 1.0
booster = AdaBoostRegressor(n_estimators=best_n_estimators, 
                                    learning_rate=best_learning_rate)
booster.fit(X,y)
def adaboost_predict(xdf):
    return booster.predict(xdf)

------------------------------------
### Neural Networks

For this part you will use the *keras* library to implement a neural net regression function.  You will experiment with the structure of the network to optimize for $R^2$.  Remember that your neural net implements a *predict* method, and you can use *sklearn.metrics.r2_score* to evaluate your model.  
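
As a sketch of how any model with a *predict* method can be scored, the helper below computes $R^2$ with `r2_score` and also the adjusted $R^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$ for $n$ rows and $p$ features. To keep the example free of a keras dependency, `ColumnSumModel` is a hypothetical stand-in that mimics the `(n, 1)` output shape of a network with a single output unit; `score_model` and `adjusted_r2` are likewise illustrative names.

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

def score_model(model, X_eval, y_eval):
    """Return (R^2, adjusted R^2) for any object with a predict() method."""
    # A single-output network returns shape (n, 1); ravel() flattens it.
    preds = np.asarray(model.predict(X_eval)).ravel()
    r2 = r2_score(y_eval, preds)
    return r2, adjusted_r2(r2, X_eval.shape[0], X_eval.shape[1])

class ColumnSumModel:
    """Hypothetical stand-in model whose predict() returns shape (n, 1)."""
    def predict(self, X):
        return np.asarray(X).sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.1, size=100)

r2, adj_r2 = score_model(ColumnSumModel(), X_demo, y_demo)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

For an honest estimate, `X_eval` and `y_eval` should be rows that were held out from training.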

#### <span style="color: blue">Summary of Neural Network Solution</span>
<span style="color: blue">
In this markdown cell, summarize your results in building the neural network predictor, including the estimated test accuracy and the model parameters.<br>

Adjusted R2: 0.961<br>
Estimated test accuracy: 276246803.38 (MSE)<br><br>
Model parameters:
* One input layer with 13 inputs and 50 outputs.<br>
* Two hidden layers with 50 nodes each.<br>
* One output layer with 50 inputs and 1 output.<br>
* Training settings: optimizer='adam', loss='mean_squared_error', epochs=200, validation_split=0.1<br>
* EarlyStopping callback: monitor='loss', min_delta=0.0000, patience=9, mode='auto', restore_best_weights=True<br>
</span>

In [9]:
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(50, activation='relu', input_shape = (X.shape[1],)))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer="adam", loss='mean_squared_error')
earlystop = EarlyStopping(monitor='loss', min_delta=0.0000, patience=9, verbose=0, mode='auto', restore_best_weights=True)
callbacks_list = [earlystop]
model.fit(X, y, epochs=200, callbacks=callbacks_list, verbose=0);

def neural_net_predict(xdf):
    return model.predict(xdf)

Using TensorFlow backend.


### Scoring Your Work
In the following code cell, implement a method best_model_predict(X) where X has the same format as the original training set in the data file.  I will call this function on a new data set generated by the same function, but not part of the training set.  Use whatever method and parameter settings you think will perform best.   **Remember** the ${\bf X}$ matrix I will call your predict function with will be like the original data matrix, so if you did any transformations on the data set, you will have to do the same transformations on this matrix too.  It is guaranteed that the data set I use will not have any missing values or deliberate outliers.

In [10]:
def best_model_predict(x_matrix):
    
    return random_forest_predict(x_matrix)

In [12]:
## I will copy code into this cell which will (a) read in the evaluation data frame, 
## (b) call your predict function, and (c) compute a score for your model on my evaluation data set


{'linear_regression': 6.61, 'stepwise_regression': 6.612, 'lasso_regression': 6.618, 'ridge_regression': 6.61, 'decision_tree_regression': 5.117, 'random_forest_regression': 2.018, 'adaboost_regression': 18.21, 'neural_net_regression': 7.09}
