In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, train_test_split, KFold, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score, roc_curve, auc, precision_recall_curve, accuracy_score, \
recall_score, precision_score, confusion_matrix, mean_absolute_error, f1_score, cohen_kappa_score, matthews_corrcoef, classification_report
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor,GradientBoostingClassifier,BaggingRegressor,BaggingClassifier, \
AdaBoostRegressor,AdaBoostClassifier,RandomForestRegressor,RandomForestClassifier
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from pyearth import Earth
import itertools as it
import time
import xgboost as xgb
import re 

## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

### Distribution of response
*By Sylvia Sherwood*

In [None]:
#...Plot for distribution of response...#

# Mean and standard deviation of response #

### Data cleaning
*By Sankaranarayanan Balasubramanian & Fiona Fe*

In [None]:
#...Code with comments...#

# Imputing missing values #

### Data preparation
*By Ryu Kimiko*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we need to predict house price, we derived some new predictors *(from existing predictors)* that intuitively seem helpful for predicting house price.

2. We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.

In [3]:
######---------------Creating new predictors----------------#########

#Creating number of bedrooms per unit floor area

#Creating ratio of bathrooms to bedrooms

#Creating ratio of carpet area to floor area

In [None]:
######-----Standardizing the dataset for Lasso / Ridge-------#########
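
A minimal sketch of the two preparation steps described above. The column names (`bedrooms`, `bathrooms`, `floor_area`, `carpet_area`) and the toy values are hypothetical placeholders, not the actual dataset. Standardization is done here with plain pandas using the population standard deviation, which matches the default behavior of `StandardScaler`.

```python
import pandas as pd

# Hypothetical toy data; the real column names may differ
houses = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "bathrooms": [1, 2, 2],
    "floor_area": [800.0, 1200.0, 2000.0],
    "carpet_area": [600.0, 900.0, 1600.0],
})

# Derived predictors named in the comments above
houses["bedrooms_per_area"] = houses["bedrooms"] / houses["floor_area"]
houses["bath_bed_ratio"] = houses["bathrooms"] / houses["bedrooms"]
houses["carpet_floor_ratio"] = houses["carpet_area"] / houses["floor_area"]

# Standardize every column to zero mean and unit variance (ddof=0)
houses_std = (houses - houses.mean()) / houses.std(ddof=0)
```

In practice the means and standard deviations would be computed on the training split only and then reused to transform the test split.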

## Exploratory data analysis

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

## Developing the model: Hyperparameter tuning

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

Put each model in a section of its name and mention the name of the team-member tuning the model. Below is an example:

In [2]:
# reading in the data from the created csv files
X_train = pd.read_csv('X_train_stratified.csv')
y_train = pd.read_csv('y_train_stratified.csv')
X_test = pd.read_csv('X_test_stratified.csv')
y_test = pd.read_csv('y_test_stratified.csv')

### Decision Tree
*By Aarti Pappu*

First, I started with a basic model to better understand the ranges of the different hyperparameters of the decision tree so I could know where to start my tuning.

In [5]:
# Defining the object to build a classification tree
model = DecisionTreeClassifier(random_state=1)

# Fitting the unpruned classification tree to the training data
model.fit(X_train, y_train)

print("Maximum number of leaves:", model.get_n_leaves())
print("Maximum depth:", model.get_depth())
print("Maximum features:", len(X_train.columns))

Maximum number of leaves: 1037
Maximum depth: 25
Maximum features: 11


Using the hyperparameter ranges obtained, as well as my intuition about other hyperparameters that could be important in tuning the tree (`min_samples_leaf` and `min_samples_split`), I did a coarse grid search to narrow down the region containing the optimal hyperparameter values. I then plotted the results of the coarse grid search to see which values of each hyperparameter gave the highest cross-validated F1-score.

In [None]:
# coarse grid search parameter grid
param_grid = {    
    'criterion':['gini','entropy'],
    'max_depth': range(2,26,5),
    'max_leaf_nodes': range(2,1038,100),
    'max_features': range(1, 12,3),
    'min_samples_leaf': range(1,10,2),
    'min_samples_split': range(2,10,2)
}

# using 2-fold CV because of the limited number of instances of some of the classes 
skf = StratifiedKFold(n_splits=2)

# aiming to maximize F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)

# make the predictions
y_pred = grid_search.predict(X_test)

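# best_estimator_.score() returns accuracy for a classifier, so compute weighted F1 explicitly
print('Train F1-score : %.3f'%f1_score(y_train, grid_search.predict(X_train), average='weighted'))
print('Test F1-score : %.3f'%f1_score(y_test, y_pred, average='weighted'))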
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)

print('Best params for F1-score')
print(grid_search.best_params_)
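
Before launching a search like the one above, it can help to count how many model fits it implies. The sketch below counts the combinations in the coarse grid using only the standard library; the multiplier of two comes from `n_splits=2`, and the final refit on the full training set is not counted.

```python
from math import prod

# Same ranges as the coarse grid above
coarse_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": range(2, 26, 5),
    "max_leaf_nodes": range(2, 1038, 100),
    "max_features": range(1, 12, 3),
    "min_samples_leaf": range(1, 10, 2),
    "min_samples_split": range(2, 10, 2),
}

# Number of hyperparameter combinations is the product of the list lengths
n_combinations = prod(len(values) for values in coarse_grid.values())
n_fits = n_combinations * 2  # two CV folds per combination

print(n_combinations, n_fits)  # 8800 17600
```

`ParameterGrid`, already imported in this notebook, gives the same combination count via `len(ParameterGrid(param_grid))`.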

In [None]:
cv_results = pd.DataFrame(grid_search.cv_results_)
fig, axes = plt.subplots(3,2,figsize=(18,20))
plt.subplots_adjust(wspace=0.2)
axes[0,0].plot(cv_results.param_max_depth, cv_results.mean_test_f1_weighted, 'o')
axes[0,0].set_ylim([0,1])
axes[0,0].set_xlabel('max_depth')
axes[0,0].set_ylabel('K-fold F1 Score')
axes[0,1].plot(cv_results.param_max_leaf_nodes, cv_results.mean_test_f1_weighted, 'o')
axes[0,1].set_ylim([0,1])
axes[0,1].set_xlabel('max_leaf_nodes')
axes[0,1].set_ylabel('K-fold F1 Score')
axes[1,0].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1,0].set_ylim([0,1])
axes[1,0].set_xlabel('max_features')
axes[1,0].set_ylabel('K-fold F1 Score')
axes[1,1].plot(cv_results.param_min_samples_leaf, cv_results.mean_test_f1_weighted, 'o')
axes[1,1].set_ylim([0,1])
axes[1,1].set_xlabel('min_samples_leaf')
axes[1,1].set_ylabel('K-fold F1 Score')
axes[2,0].plot(cv_results.param_min_samples_split, cv_results.mean_test_f1_weighted, 'o')
axes[2,0].set_ylim([0,1])
axes[2,0].set_xlabel('min_samples_split')
axes[2,0].set_ylabel('K-fold F1 Score')
axes[2,1].plot(cv_results.param_criterion, cv_results.mean_test_f1_weighted, 'o')
axes[2,1].set_ylim([0,1])
axes[2,1].set_xlabel('criterion')
axes[2,1].set_ylabel('K-fold F1 Score');
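
The scatter plots show every combination, so it is the upper envelope of each panel that indicates where to narrow the search. A minimal sketch of reading that envelope numerically, assuming a `cv_results` frame with the same column names as above; the values below are made up for illustration and are not results from this project.

```python
import pandas as pd

# Hypothetical stand-in for pd.DataFrame(grid_search.cv_results_)
cv_results = pd.DataFrame({
    "param_max_depth": [2, 2, 12, 12, 22, 22],
    "param_max_features": [1, 4, 1, 4, 1, 4],
    "mean_test_f1_weighted": [0.30, 0.35, 0.45, 0.52, 0.44, 0.50],
})

# Best cross-validated score reached at each value of one hyperparameter
best_by_depth = cv_results.groupby("param_max_depth")["mean_test_f1_weighted"].max()
print(best_by_depth)

# Value of max_depth whose best combination scores highest
print(best_by_depth.idxmax())  # 12
```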

In [None]:
# finer grid search
param_grid = {    
    'criterion':['gini','entropy'],
    'max_depth': range(8,26,2),
    'max_leaf_nodes': range(100,1038,50),
    'max_features': range(1,12,2)
}


#Grid search to optimize parameter values
skf = StratifiedKFold(n_splits=2)#The folds are made by preserving the percentage of samples for each class.

# aiming to maximize the weighted F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)

# make the predictions
y_pred = grid_search.predict(X_test)

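# best_estimator_.score() returns accuracy for a classifier, so compute weighted F1 explicitly
print('Train F1-score : %.3f'%f1_score(y_train, grid_search.predict(X_train), average='weighted'))
print('Test F1-score : %.3f'%f1_score(y_test, y_pred, average='weighted'))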
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)

print('Best params for F1-score')
print(grid_search.best_params_)

In [None]:
cv_results = pd.DataFrame(grid_search.cv_results_)
fig, axes = plt.subplots(2,2,figsize=(18,15))
plt.subplots_adjust(wspace=0.2)
axes[0,0].plot(cv_results.param_max_depth, cv_results.mean_test_f1_weighted, 'o')
axes[0,0].set_ylim([0.4,0.6])
axes[0,0].set_xlabel('max_depth')
axes[0,0].set_ylabel('K-fold F1 Score')
axes[0,1].plot(cv_results.param_max_leaf_nodes, cv_results.mean_test_f1_weighted, 'o')
axes[0,1].set_ylim([0.4,0.6])
axes[0,1].set_xlabel('max_leaf_nodes')
axes[0,1].set_ylabel('K-fold F1 Score')
axes[1,0].plot(cv_results.param_max_features, cv_results.mean_test_f1_weighted, 'o')
axes[1,0].set_ylim([0.4,0.6])
axes[1,0].set_xlabel('max_features')
axes[1,0].set_ylabel('K-fold F1 Score')
axes[1,1].plot(cv_results.param_criterion, cv_results.mean_test_f1_weighted, 'o')
axes[1,1].set_ylim([0.4,0.6])
axes[1,1].set_xlabel('criterion')
axes[1,1].set_ylabel('K-fold F1 Score');

In [None]:
# finest grid search, narrowed around the best region of the previous search
param_grid = { 
    'max_depth': range(16,26),
    'max_leaf_nodes': range(550,1038),
    'max_features': range(1,5)
}


#Grid search to optimize parameter values
skf = StratifiedKFold(n_splits=2)#The folds are made by preserving the percentage of samples for each class.

# aiming to maximize the weighted F1-score
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, scoring=['f1_weighted','accuracy'], refit= 'f1_weighted', cv=skf, n_jobs=-1, verbose = True)
grid_search.fit(X_train, y_train)

# make the predictions
y_pred = grid_search.predict(X_test)

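# best_estimator_.score() returns accuracy for a classifier, so compute weighted F1 explicitly
print('Train F1-score : %.3f'%f1_score(y_train, grid_search.predict(X_train), average='weighted'))
print('Test F1-score : %.3f'%f1_score(y_test, y_pred, average='weighted'))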
print('Best F1-score Through Grid Search : %.3f'%grid_search.best_score_)

print('Best params for F1-score')
print(grid_search.best_params_)

In [None]:
# Create a new decision tree classifier with the best parameters found by the grid search
model = DecisionTreeClassifier(random_state=1, max_depth=17, max_features = 4, max_leaf_nodes=613)
model.fit(X_train,y_train)

feature_importance_df = pd.concat([pd.Series(model.feature_names_in_), pd.Series(model.feature_importances_)], axis = 1)
feature_importance_df.rename(columns={0: "predictors", 1: "feature_importance"}).sort_values(by='feature_importance', ascending=False)

In [None]:
# Predict the labels of the test set
y_pred = model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
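
The `f1_weighted` scorer used during tuning corresponds to the `weighted avg` F1 row of the classification report: per-class F1 scores averaged with weights proportional to each class's support. A small pure-Python sketch of that computation follows; the function name `weighted_f1` and the labels are illustrative and not taken from the dataset.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1  # weight each class by its support
    return total / len(y_true)

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.6
```

On the same toy labels the accuracy is 4/6, which illustrates why a classifier's `.score` method (accuracy) and the weighted F1-score are not interchangeable.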

### Bagged Decision Trees
*By Divya Bhardwaj*

### Random Forest
*By Diego Schummer*

### XGBoost
*By Yasmeen Nahas*

## Model Ensemble 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**


### Voting ensemble

### Stacking ensemble(s)

### Ensemble of ensembled models

### Innovative ensembling methods
*(Optional)*

## Conclusions and Recommendations to stakeholder(s)

You may or may not have code to put in this section. Delete this section if it is irrelevant.