# Final Project Baseline
Yang Wei Neo, Emily Rapport, Hilary Yamtich

## Load Libraries and Data

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import csv
from rfpimp import *
import numpy as np
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostClassifier 
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.decomposition import PCA 
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
# note: this notebook requires pandas 0.21.0 or newer
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import testing_utils as testing
import model_training_utils as model_train
import ensemble_model_utils as ensemble
import math
from datetime import datetime as dt
import re as re
import pickle as pk
import logging

# For producing decision tree diagrams.
from IPython.display import Image, display
# sklearn.externals.six is not available in recent scikit-learn releases; io.StringIO is the Python 3 equivalent
from io import StringIO

from dateutil import parser
import datetime

In [None]:
# Load the pickle file that contains the clean training data
with open('./clean_data_pickle', 'rb') as infile:
    data = pk.load(infile)

with open('clean_test_data.pkl', 'rb') as infile:
    kaggle_test_data = pk.load(infile)
    
feature_importances = pd.read_csv('feature_importance.csv')
feature_importances.columns = ['feature', 'importance']

--------

# Cross Validation

We started with a single 15% dev set, but found that, with this little data, model scores on the dev set vary significantly depending on which rows land in the train and dev sets. Repeated random sub-sampling cross validation gives us more consistent results.

Note that we do not split out the dev data using the most recent years, which would be the proper way to create a dev set if our task were explicitly to predict future home prices. The test data appears to have rows from all the years represented in the train set, so we built dev sets that sample from across the train set. 

In [None]:
# still to do : choose one version of pandas to use so that our code all agrees
# and I don't have to read in a new dataset here 
NUM_CROSS_VALS = 10

In [None]:
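# get the list of different cross val splits (repeated random sub-sampling)
cross_val_list = []
split_idx = int(data.shape[0] * .85)
for i in range(NUM_CROSS_VALS):
    # shuffle all rows, then take the first 85% as train and the rest as dev
    shuffled = data.sample(frac=1)
    # .copy() so that later column assignments do not act on views of `data`
    train_df = shuffled[:split_idx].copy()
    dev_df = shuffled[split_idx:].copy()
    split_dict = {'train_df': train_df,
                  'dev_df': dev_df}
    cross_val_list.append(split_dict)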

## Error Metric

As our primary error metric, we use the root mean squared error of the logarithm of the prices (RMSLE), which is the metric used to build the leaderboard for this Kaggle competition. See rmsle() in shared_functions.py for our implementation, which comes from Mark Nagelberg on Kaggle: https://www.kaggle.com/marknagelberg/rmsle-function.

## Get Baseline

When we consulted our resident real estate expert, Hilary's dad, about this problem, he told us that only one of these factors matters - "location, location, location." In the spirit of that insight, we created a baseline "model" which looks at what neighborhood the house is in and takes the mean price of houses from that neighborhood in the training set. 

In [None]:
def baseline_pred(row, train_df):
    """Predict the mean training LogSalePrice of the row's neighborhood."""
    overall_mean = np.nanmean(train_df['LogSalePrice'])
    neighborhood_var = None
    for col in train_df:
        if 'Neighborhood' in col and row[col] == 1:
            neighborhood_var = col
            break
    if neighborhood_var is None:
        # no neighborhood dummy is set for this row, so fall back to the overall mean
        return overall_mean
    prices = train_df.loc[train_df[neighborhood_var] == 1, 'LogSalePrice']
    # a neighborhood with no rows in this train split would otherwise produce NaN
    return overall_mean if prices.empty else np.nanmean(prices)

def get_baseline_cross_val(cross_val_list):
    all_rmses = []
    for di in cross_val_list:
        dev_df = di['dev_df']
        preds = dev_df.apply(lambda row: baseline_pred(row, di['train_df']), axis=1)
        rmse = testing.rmsle(list(np.exp(dev_df['LogSalePrice'])), list(np.exp(preds)))
        all_rmses.append(rmse)
    return np.nanmean(all_rmses)

# baseline RMSLE
print("Baseline RMSLE: {:.3f}".format(get_baseline_cross_val(cross_val_list)))
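
For reference, the RMSLE metric used throughout can be sketched in a few lines. This is an illustration only, not the `rmsle()` helper from our utilities; the function name `rmsle_sketch` is hypothetical, and the `log(1 + x)` form is an assumption based on the common definition of the metric.

```python
import math

def rmsle_sketch(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative prices (illustrative only)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Assumption: log(1 + x) is used so that a price of zero would not break the logarithm.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical prices: a perfect prediction and one that is 10% too high everywhere.
perfect = rmsle_sketch([100000, 200000], [100000, 200000])
off_by_ten_percent = rmsle_sketch([100000, 200000], [110000, 220000])
```

Because the error is computed on logarithms, a 10% miss on a cheap house and a 10% miss on an expensive house contribute roughly the same amount.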

With this as a baseline, we began exploring how different types of models perform on the problem.

## Model Exploration 
### Linear Regression

We begin with linear regression as the standard choice for a regression problem. In ordinary least squares regression, the regression line is fit by minimizing the sum of squared residuals between the predicted line and the true data points. We can interpret the resulting coefficients on each feature as representing the additional impact of a one-unit change in that feature on the final price.

- Ordinary least squares does not work well when features are highly correlated with each other, because the individual coefficients become unstable.
- This may be why linear regression on PCA components seems to do nearly as well as linear regression on the original variables: the loss from using PCA is small for linear regression (only 0.04 in RMSLE).
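
To make the point about correlated features concrete, here is a minimal sketch on synthetic data (not our housing data). Two nearly identical features are fed to ordinary least squares and to ridge regression; the variable names and the data-generating process are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=2).fit(X, y)

# With near-duplicate features, OLS is free to split the effect between them in
# unstable ways; the ridge penalty shrinks the coefficient vector.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The combined effect of the two features (the sum of the coefficients) stays close to the true value of 3 under ridge, even though the data cannot say how that effect should be divided between them.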

In [None]:
# Set up different LR models
models_to_param_list = {LinearRegression: [{}], 
                        Lasso: [{'alpha': 0.005}], # lower value is less regularization
                        Ridge: [{'alpha': 2}]} # more effective with more regularization
                        #ElasticNet: [{'alpha': 0.1}]} # lower value is less regularization

# Outcome 
outcome_vars = ['LogSalePrice']

# Create feature sets
feature_sets = [[col for col in data.columns if col not in ['YrMoSold', 'LogSalePrice', 'SalePrice']]]

# Output model
lr_models = model_train.try_different_models(cross_val_list, 
                                        models_to_param_list,
                                        outcome_vars, 
                                        feature_sets)

lr_models.sort_values('Root MSE', ascending=True)

### Bagging Illustration

Bagging, or bootstrap aggregation, is intended to reduce variance in the test error by averaging predictions over very specialized models. While each of these models in isolation is likely to overfit, the ensemble of specialized models ends up being very effective at reducing overall test error. To stress test this assumption, we train several random forests whose individual trees are increasingly less likely to *individually* overfit. We find that the more likely each individual tree is to overfit (either through a smaller minimum leaf size or a smaller minimum split size), the lower the error of the ensemble as a whole. We suspect that this is because a large ensemble paired with high variance/low bias individual models gives the best of both worlds.

Note, however, that this pattern does not hold when we increase the proportion of features considered at each split (which makes any individual tree more likely to overfit). We are not sure why. One plausible explanation is that trees that consider more of the same features at each split become more correlated with one another, and averaging correlated trees removes less variance.

Another interesting phenomenon is that the difference between the training and test error decreases as each individual model within the ensemble gets less complex. This is a sign of increasing bias in the underlying model, which makes sense: a model with less underlying complexity is more likely to underfit.

This analysis suggests that we ought to lean towards creating more complex individual trees (low bias); the higher variance that results from this ought to be offset by the bootstrap aggregation procedure. We will use a grid search to find the optimal combination of parameters.
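
The variance-reduction argument can be sketched on a toy problem. The example below uses synthetic data (a noisy sine curve, not our housing data) and builds the bagged ensemble by hand from fully grown decision trees, so the setup is an assumption for illustration rather than the random forest configuration we used.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 6, size=(300, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.5, size=300)
X_dev = np.linspace(0, 6, 200).reshape(-1, 1)
y_dev_true = np.sin(X_dev[:, 0])  # noiseless target, for a clean comparison

# A single fully grown tree memorizes the noise in the training data.
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
single_mse = np.mean((single_tree.predict(X_dev) - y_dev_true) ** 2)

# Bagging: fit the same kind of tree on bootstrap resamples and average the predictions.
bagged_preds = []
for seed in range(50):
    idx = rng.randint(0, len(X_train), size=len(X_train))  # sample rows with replacement
    tree = DecisionTreeRegressor(random_state=seed).fit(X_train[idx], y_train[idx])
    bagged_preds.append(tree.predict(X_dev))
bagged_mse = np.mean((np.mean(bagged_preds, axis=0) - y_dev_true) ** 2)

print("Single tree MSE: {:.3f}".format(single_mse))
print("Bagged trees MSE: {:.3f}".format(bagged_mse))
```

Each tree on its own overfits, but the averaged prediction is much smoother, which is the effect the random forest experiments in this section probe.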

In [None]:
# Initialize list of tests:
param_list = []

# Create list of parameter types (10 settings for each parameter)
# min_samples_leaf must be at least 1
for min_leaf_size in range(1, 11):
    param_list.append({'min_samples_leaf': min_leaf_size, 'n_estimators': 50})

# max_features as a fraction must be greater than 0 and at most 1
for feature_prop in range(1, 11):
    param_list.append({'max_features': feature_prop/10, 'n_estimators': 50})

# min_samples_split must be at least 2
for split_size in range(2, 12):
    param_list.append({'min_samples_split': split_size, 'n_estimators': 50})


In [None]:
# Run models to show the impact of bagging
### THIS TAKES A LONG TIME TO RUN
models_to_param_list = {RandomForestRegressor: param_list}
feature_sets = [[col for col in data.columns if col not in ['YrMoSold', 'LogSalePrice', 'SalePrice']]]

# Run different random forests
df = model_train.try_different_models(cross_val_list, 
                             models_to_param_list,
                             outcome_vars, 
                             feature_sets)

In [None]:
# Plot the data
fig, ax = plt.subplots(1,3, sharey='row')
fig.tight_layout(pad = 1.5)

ax[0].plot(df.iloc[0:10]['Root MSE'])
ax[0].plot(df.iloc[0:10]['Train MSE'])
ax[0].set_xlabel('Min Leaf Size')

ax[1].plot(df.iloc[10:20]['Root MSE'])
ax[1].plot(df.iloc[10:20]['Train MSE'])
ax[1].set_xlabel('% of Features used to split')

ax[2].plot(df.iloc[20:30]['Root MSE'])
ax[2].plot(df.iloc[20:30]['Train MSE'])
ax[2].set_xlabel('Minimum Split Size')

fig.suptitle('Bagging with High Variance Models')
# !!! remove the axis tick marks

### Boosting Illustration

In contrast, boosting is a process that reduces bias by iteratively fitting a new model to the errors of the ensemble built so far. Boosting can turn weak learners into an accurate ensemble. We can see this below: a GBM built from relatively weak learners (trees of depth 3) actually appears to have better accuracy than GBMs built from deeper trees. However, GBMs do have a tendency towards high variance (as seen by how quickly the model overfits the training data relative to the test data).
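
The residual-fitting idea can be sketched by hand. The loop below is a simplified illustration of gradient boosting with squared error on synthetic data; the learning rate, tree depth, and number of rounds are assumptions, and `GradientBoostingRegressor` includes refinements that this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
train_errors = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction  # what the current ensemble still gets wrong
    weak_learner = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * weak_learner.predict(X)
    train_errors.append(np.mean((y - prediction) ** 2))

print("Train MSE before boosting: {:.3f}".format(train_errors[0]))
print("Train MSE after 50 rounds: {:.3f}".format(train_errors[-1]))
```

The training error can only go down from round to round, which is also why boosted models eventually start fitting noise if they are run for too many rounds or with trees that are too deep.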

In [None]:
# Initialize list of tests:
param_list = []
# Create list of parameter types (max_depth must be at least 1)
for depth in range(1, 21):
    param_list.append({'max_depth': depth, 'n_estimators': 50})

In [None]:
# Run Gradient Boosting Results
models_to_param_list = {GradientBoostingRegressor: param_list}
df_boosting = model_train.try_different_models(cross_val_list, 
                             models_to_param_list,
                             outcome_vars, 
                             feature_sets)

In [None]:
# Plot the DF Boosting Results.
# Assumes df_boosting has one row per entry in param_list, in the same order.
depths = [params['max_depth'] for params in param_list]
plt.plot(depths, df_boosting['Root MSE'], label='Root MSE')
plt.plot(depths, df_boosting['Train MSE'], label='Train MSE')
plt.xlabel('Depth of Tree')
plt.ylabel('Error')
plt.legend()

### Bayesian Ridge 
TODO EMILY

Bayesian ridge is a form of ridge regression, which imposes a penalty on the size of the coefficients. In the Bayesian formulation, the coefficients are given a zero-mean Gaussian prior. The precision of that prior and the precision of the noise, which together determine the strength of the regularization, are estimated from the data by maximizing the marginal likelihood rather than being set by hand.

In [None]:
## see how shape of coefficients changes as number of iterations goes up 
param_list = []
for num_iter in range(1, 40):
    param_list.append({'n_iter': num_iter})
    
models_to_param_dict = {BayesianRidge: param_list}

# stored as br_df (not df) so the random forest results in df stay available for stacking
br_df = model_train.try_different_models(cross_val_list, 
                                         models_to_param_dict,
                                         ['SalePrice'], 
                                         feature_sets)

In [None]:
br_df

In [None]:
# take the first of the models fitted for each n_iter setting
br_models = [fitted[0] for fitted in br_df['Model'].values]
coefficients = [model.coef_ for model in br_models]

# todo: make this nice with subplots 
for i in range(0, len(br_models), 2):
    plt.hist(coefficients[i])
    # param_list starts at n_iter=1, so position i corresponds to i + 1 iterations
    plt.title("Iteration {}".format(i + 1))
    plt.show()

### K-Nearest Neighbors

We experimented with a K Neighbors Regressor, which identifies the k nearest neighbors of the given example and averages their target variables in order to obtain a prediction. A challenge of the K Neighbors algorithm is that it is not able to learn relative importance of different features; as a result, it struggles as the number of features increases.  
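
The mechanics, and the weakness with uninformative features, can be sketched with a tiny hand-rolled regressor. The function `knn_predict` and the toy data are hypothetical and for illustration only; our experiments use scikit-learn's `KNeighborsRegressor`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Average the targets of the k training rows closest to x_query (Euclidean distance)."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# One informative feature: small values go with low targets, large values with high targets.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])
pred_low = knn_predict(X_train, y_train, np.array([2.0]))
pred_high = knn_predict(X_train, y_train, np.array([11.0]))

# Appending an uninformative, large-scale feature changes which rows count as "nearest".
noise = np.array([0.0, 100.0, 0.0, 100.0, 0.0, 100.0])
X_noisy = np.column_stack([X_train[:, 0], noise])
pred_noisy = knn_predict(X_noisy, y_train, np.array([2.0, 0.0]))

print(pred_low, pred_high, pred_noisy)
```

With only the informative feature, the query at 2.0 is averaged with its true neighbors. Once the noise column is added, a row from the high-priced cluster becomes one of the three nearest neighbors and drags the prediction upward, because the distance treats every feature as equally important.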

In [None]:
k_feature_sets_to_try = []

for i in range(1,75):
    set_to_try = list(feature_importances.feature.values)[:(i)]
    set_to_try = [item for item in set_to_try if item not in ['SalePrice', 'LogSalePrice']]
    k_feature_sets_to_try.append(set_to_try)

In [None]:
## to do: would also be fun to see how this changes as k changes, but maybe not that important  
models_to_param_dict = {KNeighborsRegressor: [{}]}

k_df = model_train.try_different_models(cross_val_list, 
                                        models_to_param_dict,
                                        ['SalePrice'], 
                                         k_feature_sets_to_try)
k_df.plot.scatter(x='Num Features', y='Root MSE')

The plot above shows how the root MSE changes as the number of features changes. As the first few features get added in, there's significant instability in the error, as each new feature drastically changes the estimation of which data point is closest. We see that after the error stabilizes, the general trend is an increase in error as more features are added, which makes sense, since the features that get added later on are less important, and this model has no way of handling less important features. Additionally, we see significant periods of plateau, where new features do not cause any change in error. This is likely due to the sparsity of our data set, since many of our features are dummy variables for different values of categories, and those variables are often very sparse (many 0 values across the data set). The sheer number of features, combined with the relative sparsity of our set, makes K Neighbors a poor choice for this problem.

### Other Models

### Ensembling Techniques

#### Voting & Averaging

EMILY TODO

As a first, simple pass at ensembling, we tried different variations of voting ensembles where each prediction is an average of the predictions of the individual models in the ensemble. The intuition here is that different types of models will err in different ways, so if one type of model (say, a linear model) has a bias that disproportionately causes errors on certain types of rows, then averaging those predictions with those of a model (say, a tree-based one) that better handles those rows would pull bad predictions back in the right direction.

(here, it would be good to include a little play-by-play of ensembles we tried and how our kaggle score rose with each one. maybe also compare them on our own dev sets to see if they match...? I don't think they do...). 

We used a combination of experimentation and theory to guide our selection of different voting ensembles. We watched our Kaggle leaderboard position go up as we made changes to these ensembles, and learned the following:
- **Linear + tree-based ensembles work well overall**: We know that linear models tend to have errors born out of bias in the specification, whereas tree-based models tend to have errors born out of variance. It makes sense that averaging predictions from these types of models would help counteract the errors specific to each type. 
- **Weighting the averages towards the linear models (either with a weighted average, or by including more linear models) tends to work well**: Our linear models tend to predict better than our tree-based ensembles overall, so weighting our ensembles towards linear models helped us capture the best predictions. In general, including multiple linear models works better than overly weighting the predictions of one linear model. This makes sense: when we included different linear models, we tried to choose models whose errors were not correlated with the same variables, in the hope that bad predictions from one model would be balanced out by good predictions from the other model on the same example, and vice versa. (would be good to show error correlation of a bayesian vs a linear from the same ensemble)
- **Linear models best predict the logarithmic outcome variable, whereas tree-based models best predict the "true" outcome variable:** We looked at all types of models trained on both types of outcome variables, and in general, we found that each model was best suited to a particular outcome variable in a way that made intuitive sense to us. Since the linear models are confined to linear relationships in making their predictions, they do best predicting a normally-distributed variable, since outliers are not well-handled by a linear fit. The logarithmic outcome variable was more normally distributed than the raw outcome variable, so the linear models handled it well. On the other hand, the tree-based models can handle non-linear relationships, as "outlier" type predictions can simply be handled by particular branches in the trees. The non-logarithmic outcome variable preserves more of the variation in the outcome variable (i.e. it makes examples with similar SalePrices look more different than they would look if you took their logarithm), so using the non-logarithmic variable allows the tree-based model to capture more nuanced decision rules that capture the variability in the underlying data.  (note - I'm not actually sure if this point belongs in the ensembling section or somewhere else)
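
A weighted averaging ensemble that mixes models trained on different outcome variables can be sketched as follows. The helper `average_price_predictions` and the prediction values are hypothetical; the sketch assumes log-scale predictions are converted back to prices before averaging, which may differ from what our ensemble utilities do internally.

```python
import numpy as np

def average_price_predictions(predictions, outcome_vars, weights=None):
    """Weighted average of model predictions, converting log-scale outputs to prices first."""
    on_price_scale = [
        np.exp(pred) if outcome == 'LogSalePrice' else np.asarray(pred, dtype=float)
        for pred, outcome in zip(predictions, outcome_vars)
    ]
    return np.average(np.vstack(on_price_scale), axis=0, weights=weights)

# Hypothetical predictions for two houses.
linear_log_preds = np.log(np.array([100000.0, 200000.0]))  # linear model, log target
tree_price_preds = np.array([120000.0, 180000.0])          # tree model, raw target

# Weight the linear model twice as heavily as the tree model.
blended = average_price_predictions(
    [linear_log_preds, tree_price_preds],
    ['LogSalePrice', 'SalePrice'],
    weights=[2, 1],
)
print(blended)
```

Putting every prediction on the same scale first matters: averaging a log price with a raw price directly would produce a meaningless number.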

#### Stacking
YW TODO

In [None]:
# Columns to drop before predicting on the train set and on the Kaggle test set
dropcols_train = ['YrMoSold', 'SalePrice', 'LogSalePrice']
dropcols_dev = ['YrMoSold']

def stack_ensemble(train_data, dev_data, cols_to_drop, model1, model2, model3):

    # Get boosted predictions
    best_boosted_model = model1
    boosted_pred = best_boosted_model.predict(train_data.drop(columns=dropcols_train)) # how to include the cross validation here
    boosted_test_pred = best_boosted_model.predict(dev_data.drop(columns=dropcols_dev))

    # Get LR predictions
    best_LR_model = model2
    LR_pred = best_LR_model.predict(train_data.drop(columns=dropcols_train))
    LR_test_pred = best_LR_model.predict(dev_data.drop(columns=dropcols_dev))

    # Get bagged predictions
    best_bagg_model = model3
    bagg_pred = best_bagg_model.predict(train_data.drop(columns=dropcols_train))
    bagg_test_pred = best_bagg_model.predict(dev_data.drop(columns=dropcols_dev))

    # Create stacked model
    trainpred = pd.DataFrame(np.column_stack((boosted_pred, LR_pred, bagg_pred)), columns = ['boosted', 'LR', 'bagg'])
    testpred  = pd.DataFrame(np.column_stack((boosted_test_pred, LR_test_pred, bagg_test_pred)), columns = ['boosted', 'LR', 'bagg'])

    # Fit the ensemble parameters
    ensemble_LR = LinearRegression()
    ensemble_LR.fit(trainpred, train_data['LogSalePrice'])
    finaltrainpred = ensemble_LR.predict(trainpred)

    # Fit the final predictions
    finaltestpred = ensemble_LR.predict(testpred)

    # Calculate the error (note this is the whole dataset so test/train are the same)
    error = testing.calculate_error(train_true = np.array(train_data['LogSalePrice']), 
                            train_pred = np.array(finaltrainpred),
                            test_true  = np.array(train_data['LogSalePrice']),
                            test_pred  = np.array(finaltrainpred),
                            outcome_var = 'LogSalePrice')

    print(error)
    print(finaltestpred.shape)
    
    return np.exp(finaltestpred)

kagglepred = stack_ensemble(train_data = data,
                            dev_data = kaggle_test_data,
                            cols_to_drop = dropcols_train,
                            model1 = df_boosting.sort_values('Root MSE')['Model'][4][0],
                            model2 = lr_models.sort_values('Root MSE')['Model'][2][0],
                            model3 = df.sort_values('Root MSE')['Model'][2][0])



In [None]:

testing.make_kaggle_submission(list(kagglepred),
                               kaggle_test_data,
                               'tempsubmission.csv')
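
The stacking function above fits the meta-model on predictions that the base models made for rows they were trained on, which tends to favor whichever base model overfits the most. One common way to bring cross validation into the procedure is to train the meta-model on out-of-fold predictions instead. The sketch below shows the idea on synthetic data using scikit-learn's `cross_val_predict`; the data, the choice of base models, and the number of folds are assumptions, and this is not the procedure behind our submissions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

base_models = [
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    LinearRegression(),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Each column holds one base model's predictions for rows it did not see while fitting.
oof_preds = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models]
)

# The meta-model learns how to combine the base models from the out-of-fold predictions.
meta_model = LinearRegression().fit(oof_preds, y)

# Refit the base models on all rows before predicting on new data.
X_new = rng.normal(size=(10, 5))
new_preds = np.column_stack([model.fit(X, y).predict(X_new) for model in base_models])
stacked_new_pred = meta_model.predict(new_preds)
print(meta_model.coef_)
```

The meta-model coefficients give a rough sense of how much the stack relies on each base model.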

### Error Analysis

In this section, we'll go into more detail about how we actually iterated on models and chose whichever ones we end up deciding are our best. Our primary tools will be this error correlation table, which we'll use to look at patterns of errors the model is making, and diagnostics to determine whether or not the model is overfitting. We'll compare different models to each other and explain the model or ensemble that gives us the best results.

In [None]:
# this still only works on individual models, it doesn't average the correlations over a set of models
# this tool is really more exploratory than anything - look at a couple models from the set you care about
# and see what the trends are

# use this variable to specify which model specification to use
#df_and_row_to_use = lrdf[:1]
# use this variable to specify which in the list of models trained with that specification to use
#model_to_use = df_and_row_to_use['Model'][0]
# don't change this
#features_to_use = df_and_row_to_use['Features']

lr_to_eval = ensemble.get_model_dicts_from_data_frame(lrdf[:1])
print(lr_to_eval)

# table for our LR with all the features
corrs_df = testing.create_error_correlation_table(lr_to_eval, dev_df)
corrs_df.reindex(corrs_df.avg_correlation.abs().sort_values(ascending=False).index)

#### Cycle through each model variation and plot errors

In [None]:
### Random Forest
df.sort_values('Root MSE', ascending=True).head(1)

In [None]:
### Linear Regression
lrdf.sort_values('Root MSE', ascending=True).head(1)

In [None]:
### TODO YW: what do you want to use for feature importance here? save off the PCA feature importance
### from the last notebook?

### Random Forest Errors
rf_error_spec = df.sort_values('Root MSE', ascending=True).iloc[0]
model_to_use = rf_error_spec['Model'][0]
features_to_use = rf_error_spec['Features']
plot_features = list(feature_importances['feature'][:20])  # feature names, not row positions
plot_error_against_var(model_to_use, 'LogSalePrice', features_to_use, plot_features, dev_df)

In [None]:
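### Linear Regression Errors
# use this variable to specify which model specification to use
df_and_row_to_use = lrdf.iloc[0]
# pick the linear model here; otherwise model_to_use still points at the random forest from the cell above
model_to_use = df_and_row_to_use['Model'][0]
features_to_use = df_and_row_to_use['Features']
plot_features = list(feature_importances['feature'][:20])  # feature names, not row positions
plot_error_against_var(model_to_use, 'LogSalePrice', features_to_use, plot_features, dev_df)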

# Final Model & Explanation

Different models will perform better on different numbers of features. For that reason, it seems worthwhile to try all the individual models we're using on different numbers of features. It will likely be advantageous to ensemble models trained on different numbers of features. 

We'll use feature importances, determined via random forest, in order to decide which features to use. It may also be worthwhile to try some other feature selection methods, or to try random selection. 

In [None]:
feature_sets_to_try = []

for i in range(5,24):
    set_to_try = list(feature_importances.feature.values)[:(i * 10)]
    set_to_try = [item for item in set_to_try if item not in ['SalePrice', 'LogSalePrice']]
    feature_sets_to_try.append(set_to_try)

### Narrative of our process

In [None]:
### discussion of the scikitlearn column - real working problem

In [None]:
### first we worked just in this notebook for a long time, trying different models only on cross-validated dev sets
### then, we started 

### 1 Big Table

#### Comparing model performance across different number of features. 

# Messing Around

# YW

In [None]:
# Prep Data for Kaggle
kaggle_lr = pd.DataFrame()
kaggle_lr['Features'] = [feature_sets]
kaggle_lr['Models'] = lr_models['Model']
kaggle_lr['Num features in each'] = len(*feature_sets)
kaggle_lr['Outcome_Vars'] = [['LogSalePrice', 'LogSalePrice', 'LogSalePrice']]

## 2 steps for submitting to kaggle:
## retrain models on all data and get new preds
## then make the final submission CSV
final_preds = testing.retrain_on_all_data_and_get_final_preds(kaggle_lr, data, kaggle_test_data)

testing.make_kaggle_submission(final_preds,
                               kaggle_test_data,
                               'tempsubmission.csv')

# EMILY 

In [None]:

lr_to_eval = ensemble.get_model_dicts_from_data_frame(lrdf[:1])
print(lr_to_eval)

# table for our LR with all the features
corrs_df = testing.create_error_correlation_table(lr_to_eval, dev_df)
corrs_df.reindex(corrs_df.avg_correlation.abs().sort_values(ascending=False).index)

In [None]:

boost_to_eval = ensemble.get_model_dicts_from_data_frame(boost_df[11:12])
print(boost_to_eval)

# table for one of our boosted models
corrs_df = testing.create_error_correlation_table(boost_to_eval, dev_df)
corrs_df.reindex(corrs_df.avg_correlation.abs().sort_values(ascending=False).index)

In [None]:
models_to_param_list = {GradientBoostingRegressor: [{}]}
outcome_vars = ['LogSalePrice', 'SalePrice']
# for all models, we'll try each feature set in feature_sets_to_try (top 50 through top 230 features, in steps of 10)
boost_df = model_train.try_different_models(cross_val_list, 
                                        models_to_param_list,
                                        outcome_vars, 
                                        feature_sets_to_try)
boost_df = boost_df.sort_values('Root MSE', ascending=True)

In [None]:
boost_df_no_log = boost_df[boost_df['Outcome Var'] == 'SalePrice']
boost_df_no_log

### Linear Regression

In [None]:
models_to_param_list = {LinearRegression: [{}]}
outcome_vars = ['LogSalePrice', 'SalePrice']
# for all models, we'll try each feature set in feature_sets_to_try (top 50 through top 230 features, in steps of 10)
lrdf = model_train.try_different_models(cross_val_list, 
                                        models_to_param_list,
                                        outcome_vars, 
                                        feature_sets_to_try)
lrdf = lrdf.sort_values('Root MSE', ascending=True)

In [None]:
lrdf

### Tree-based Models

In [None]:
models_to_param_list = {DecisionTreeRegressor: [{}], 
                        RandomForestRegressor: [{'n_estimators': 20},
                                                {'min_samples_leaf': 3, 'n_estimators': 20}]}

df = model_train.try_different_models(cross_val_list, 
                                     models_to_param_list,
                                     outcome_vars, 
                                     feature_sets_to_try)
df.sort_values('Root MSE', ascending=True)

### Grab bag of other models - BR and KNN

In [None]:
models_to_param_list = {BayesianRidge: [{}]}

gb_df = model_train.try_different_models(cross_val_list, 
                                         models_to_param_list,
                                         outcome_vars, 
                                         feature_sets_to_try)
gb_df.sort_values('Root MSE', ascending=True)

In [None]:
lin_models = ensemble.get_model_dicts_from_data_frame(lrdf.sort_values('Root MSE')[:10])
# tree_models is used by the ensemble definitions in the cells below
tree_models = ensemble.get_model_dicts_from_data_frame(df.sort_values('Root MSE')[:10])
bayesian_models = ensemble.get_model_dicts_from_data_frame(gb_df.sort_values('Root MSE')[:20])
boost_models = ensemble.get_model_dicts_from_data_frame(boost_df_no_log.sort_values('Root MSE')[:50])

In [None]:
len(boost_models)

In [None]:
top_of_each = [models[0] for models in [lin_models, tree_models, bayesian_models]]
lin_and_tree = [models[0] for models in [lin_models, tree_models]]
lin_and_bayesian = [models[0] for models in [lin_models, bayesian_models]]
bayesian_and_tree = [models[0] for models in [bayesian_models, tree_models]]
lin_0_tree_6 = [lin_models[0], tree_models[6]]
bayesian_0_tree_6 = [bayesian_models[0], tree_models[6]]
top_of_b_and_l_plus_tree_8 = [bayesian_models[0], lin_models[0], tree_models[6]]
same_but_boost_instead = [bayesian_models[0], lin_models[0], boost_models[28]]
same_but_top_boost = [bayesian_models[0], lin_models[0], boost_models[0]]
same_but_top_non_log_boost = [bayesian_models[0], lin_models[0], boost_models[11]]

'''
[top_of_each,
                                                               lin_and_tree,
                                                               lin_and_bayesian,
                                                               bayesian_and_tree, 
                                                               lin_models,
                                                               bayesian_models,
                                                               lin_0_tree_6,
                                                               bayesian_0_tree_6, 
                                                               top_of_b_and_l_plus_tree_8,
                                                               same_but_boost_instead,
                                                               same_but_top_boost,
                                                               same_but_top_non_log_boost]'''

In [None]:
voting_ensembles_df = ensemble.try_different_voting_ensembles(cross_val_list,
                                                              [[bayesian_models[0], bayesian_models[17],
                                                                lin_models[0], boost_models[2]]])

In [None]:
voting_ensembles_df.sort_values('RMSE for ensemble')

In [None]:
# choose model - for now, just choosing the best one
ensemble_to_use = voting_ensembles_df.sort_values('RMSE for ensemble')[:1]
ensemble_to_use

In [None]:
## 2 steps for submitting to kaggle:
## retrain models on all data and get new preds
## then make the final submission CSV
final_preds = testing.retrain_on_all_data_and_get_final_preds(ensemble_to_use,
                                                              data,
                                                              kaggle_test_data)
testing.make_kaggle_submission(final_preds,
                               kaggle_test_data,
                               'em_sunday_10.csv')