### Importing datasets that have been transformed in Part 1
https://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification-format

In [1]:
import pandas as pd
import numpy as np
#Importing our modified datasets from RentHop_Interest_Part1:
train = pd.read_csv('data/train_transformed.csv')
test = pd.read_csv('data/test_transformed.csv')

In [2]:
train.head()

Unnamed: 0,bathrooms,bedrooms,price,interest_level,elevator_fl,hardwood_fl,cats_fl,dogs_fl,doorman_fl,dishwasher_fl,...,hour_categories_night,hour_categories_morning,hour_categories_afternoon,hour_categories_evening,borough_Bronx,borough_Brooklyn,borough_Manhattan,borough_Queens,borough_Staten_Island,borough_outside
0,0.48282,1.311964,-0.049358,medium,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
1,-0.357225,0.411224,0.106361,low,1,0,1,1,1,0,...,0,0,1,0,0,0,1,0,0,0
2,-0.357225,-0.489516,-0.058834,high,0,1,0,0,0,1,...,1,0,0,0,0,0,1,0,0,0
3,-0.357225,-0.489516,-0.031985,low,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
4,-0.357225,2.212704,-0.027248,low,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0


### Understanding the evaluation criteria

The test dataset will be evaluated using the multi-class logarithmic loss. Our submission will consist of probabilities for each class (in this case, the probability that the interest level for each listing is 'low', 'medium' or 'high'). Below is the formula for the multi-class logarithmic loss:<br>
![title](drawings/multiclass_logloss.jpg)<br>

N is the number of listings in the dataset, M is the number of classes, yij is 1 if listing i belongs to class j and is 0 if it doesn't, log is the natural log, and pij is the probability that listing i belongs to class j.<br> 
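
The formula is only available as an image here, so the block below writes it out from the definitions just given. It is a transcription that assumes the image shows the standard multi-class log loss; the symbols are the ones defined above.

```latex
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \, \log(p_{ij})
```

Because y_ij is 1 only for the true class of listing i, the inner sum keeps a single term per listing: the log of the probability assigned to its true class.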

<b>Notes</b>:<br><br>
The three probabilities for each listing don't need to sum to 1 in the submission, but they will be rescaled to sum to 1 after the submission.<br>
If the predicted probability for a listing i that actually belongs to class j is extremely low, the log of that value approaches negative infinity and causes the overall error to approach infinity, even if the other listings have good predictions. To keep a few bad predictions from taking over the entire error, each predicted probability is replaced with max(min(pij, 1-1e-15), 1e-15). This imposes a floor and a ceiling on the probability so that we never take the log of a value less than 1e-15 or more than 1-1e-15. <br>
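
The two notes above can be sketched in a few lines of numpy. This is only an illustration: the helper name `rescale_and_clip` and the toy array are made up for the example, and applying the rescaling before the clipping is an assumption about the order the scorer uses.

```python
import numpy as np

EPS = 1e-15

def rescale_and_clip(probs, eps=EPS):
    """Rescale each row to sum to 1, then clip every value into [eps, 1 - eps]."""
    probs = np.asarray(probs, dtype=float)
    rescaled = probs / probs.sum(axis=1, keepdims=True)
    return np.clip(rescaled, eps, 1 - eps)

# Toy predictions (hypothetical, not taken from the RentHop data).
raw = np.array([
    [2.0, 1.0, 1.0],  # does not sum to 1, so it is rescaled to [0.5, 0.25, 0.25]
    [0.0, 0.5, 0.5],  # contains an exact 0, which is raised to 1e-15
])

scored = rescale_and_clip(raw)
print(scored)
print(-np.log(scored[1, 0]))  # finite penalty instead of inf
```

Without the clip, the second row would contribute `-log(0) = inf` if its true class were the first column, and the average over all listings would be infinite too.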

I am going to first implement my own multi-class logarithmic loss scoring function to ensure that I know the details of how it can be implemented and don't need to rely upon the scikit-learn package. When I actually begin doing model selection and hyperparameter tuning, I will be using scikit-learn's 'neg_log_loss' scoring function out of convenience. <br>

<br>
We can gauge how well a model is doing by comparing it to a "dumb" baseline classifier. If we had just predicted a probability of 1/3 (~0.333) for each class for each listing (i.e. equal probability that each listing is of 'low', 'medium', or 'high' interest level), we would get a multiclass log loss of ~1.10. We know this b/c the multiclass log loss in this case is equal to -(1/N) * N * log(1/3), which simplifies to -log(1/3) = log(3), which is approximately 1.10. <br>

If we take into account the prevalence of each class in the train dataset, we can make an even better "dumb" baseline classifier if, for each listing, we set the probability of each class equal to its prevalence in the dataset. So in this case, listings with a 'low' interest level were ~69.468% of the listings. Listings with a 'medium' interest level were ~22.753% of the listings. Listings with a 'high' interest level were ~7.779% of the listings. We can set the prediction for the 'low' interest level to 0.69468, the prediction for the 'medium' interest level to 0.22753 and the prediction for the 'high' interest level to 0.07779 for every listing. 
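
Both baselines can be worked out by hand before touching any model. The sketch below assumes the listings being scored have the same class mix as the prevalences quoted above; the variable names are only for illustration.

```python
import numpy as np

# Uniform baseline: every class gets probability 1/3,
# so each listing contributes -log(1/3) = log(3).
uniform_loss = -np.log(1 / 3)

# Prior baseline: class j is predicted with its prevalence q_j for every listing.
# A fraction q_j of the listings belongs to class j and contributes -log(q_j),
# so the expected loss is -sum_j q_j * log(q_j).
priors = np.array([0.69468, 0.22753, 0.07779])  # low, medium, high
prior_loss = -np.sum(priors * np.log(priors))

print(round(uniform_loss, 4))
print(round(prior_loss, 4))
```

The uniform baseline comes out at roughly 1.10 and the prior baseline at roughly 0.79, so any model worth keeping should score below about 0.79.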

https://stats.stackexchange.com/questions/276067/whats-considered-a-good-log-loss
https://medium.com/@fzammito/whats-considered-a-good-log-loss-in-machine-learning-a529d400632d

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X = train.drop(['interest_level'], axis=1)
y = train['interest_level']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=2)

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
prob_target_dt = dtc.predict_proba(X_test)
print('decision tree predictions:\n', prob_target_dt)

#In this decision tree, I will try to limit overfitting (and thus increase the bias in order to reduce the variance of the model)
dtc_2 = DecisionTreeClassifier(max_depth = 10, min_samples_leaf=15)
dtc_2.fit(X_train, y_train)
prob_target_dt_2 = dtc_2.predict_proba(X_test)
print('decision tree 2 predictions:\n', prob_target_dt_2)

#Using the one-vs-all (also known as one-vs-rest (ovr)) method of multi-class classification.
#
lgr = LogisticRegression(multi_class='ovr', solver='liblinear')
lgr.fit(X_train, y_train)
predictions = lgr.predict(X_test)
prob_target_lgr = lgr.predict_proba(X_test)
print('logistic regression predictions:\n', prob_target_lgr)

decision tree predictions:
 [[0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 ...
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
decision tree 2 predictions:
 [[0.14285714 0.71428571 0.14285714]
 [0.14285714 0.71428571 0.14285714]
 [0.23123123 0.32732733 0.44144144]
 ...
 [0.23123123 0.32732733 0.44144144]
 [0.07692308 0.42307692 0.5       ]
 [0.         0.98974359 0.01025641]]
logistic regression predictions:
 [[0.13749065 0.52994628 0.33256307]
 [0.14623071 0.41490856 0.43886073]
 [0.07851478 0.60096311 0.3205221 ]
 ...
 [0.13454349 0.44413792 0.42131859]
 [0.18526336 0.44249825 0.37223838]
 [0.01256163 0.90515997 0.0822784 ]]


In [4]:
#Creating a dataframe consisting of 4 columns: 3 columns for each of the probabilities for each listing and 1 column for the actual class of each listing.
y_test_df = y_test.to_frame()
y_test_df.reset_index(inplace=True, drop=True)

prediction_probabilities_dt= pd.DataFrame(data=prob_target_dt,
                 columns= ['high', 'low', 'medium'])
prob_target_dt = pd.concat([prediction_probabilities_dt, y_test_df],axis=1)

prediction_probabilities_dt_2= pd.DataFrame(data=prob_target_dt_2,
                 columns= ['high', 'low', 'medium'])
prob_target_dt_2 = pd.concat([prediction_probabilities_dt_2, y_test_df],axis=1)

prediction_probabilities_lgr= pd.DataFrame(data=prob_target_lgr,
                 columns= ['high', 'low', 'medium'])
prob_target_lgr = pd.concat([prediction_probabilities_lgr, y_test_df],axis=1)


#Creating dataframe of predictions for dumb baseline classifier
#Creating each column starting with numpy arrays
low = np.zeros((prob_target_lgr.shape[0],1), dtype=float)
low += 0.69468
medium = np.zeros((prob_target_lgr.shape[0],1), dtype=float)
medium += 0.22753
high = np.zeros((prob_target_lgr.shape[0],1), dtype=float)
high += 0.07779

predictions_dumb = np.concatenate((high,low,medium), axis=1)
#print(predictions_dumb)

prediction_probabilities_dumb= pd.DataFrame(data=predictions_dumb,
                 columns= ['high', 'low', 'medium'])
prob_target_dumb = pd.concat([prediction_probabilities_dumb, y_test_df],axis=1)

In [5]:
#My personal implementation for evaluating the multiclass log loss on a set of predictions:
def row_multiclasslogloss(prob_target_row):
    unit_log_loss = 0
    eps = 1e-15
    if prob_target_row.loc['interest_level'] == 'low':
        a = prob_target_row['low']
        #Note that numpy's clip() function below returns eps if a is less than eps and 1-eps if a is more than 1-eps.
        #This makes sure that prob_target_row['low'] is kept between 1e-15 and 1-1e-15.
        #Otherwise, if for one datapoint prob_target_row['low'] was ~0, then the log of it would be -inf.
        #This way extreme values don't completely control the overall value.
        b = np.clip(a, eps, 1 - eps)
        unit_log_loss = 1*np.log(b)
        return unit_log_loss
    if prob_target_row.loc['interest_level'] == 'medium':
        a = prob_target_row['medium']
        b = np.clip(a, eps, 1 - eps)
        unit_log_loss = 1*np.log(b)
        return unit_log_loss
    if prob_target_row.loc['interest_level'] == 'high':
        a = prob_target_row['high']
        b = np.clip(a, eps, 1 - eps)
        unit_log_loss = 1*np.log(b)
        return unit_log_loss
    
def overall_multiclasslogloss(prob_target):
    loss_series =  prob_target.apply(row_multiclasslogloss, axis=1)
    loss_series_sum = loss_series.sum()*(-1/prob_target.shape[0])
    
    return loss_series_sum
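
The row-by-row implementation above can also be written without `apply`. The version below is a sketch under the same assumptions (one probability column per class, named after the class, plus an `interest_level` column); the function name and the toy frame are made up for the example.

```python
import numpy as np
import pandas as pd

def vectorized_multiclass_logloss(prob_target, label_col='interest_level', eps=1e-15):
    """Multi-class log loss for a frame of class-probability columns plus a label column."""
    probs = prob_target.drop(columns=[label_col])
    # Column position of each row's true class, e.g. 'low' -> position of the 'low' column.
    true_col = probs.columns.get_indexer(prob_target[label_col])
    if (true_col < 0).any():
        raise ValueError('found a label without a matching probability column')
    # Pick p_ij for the true class j of every row i (y_ij is 0 for all other classes).
    p_true = probs.to_numpy()[np.arange(len(probs)), true_col]
    return -np.mean(np.log(np.clip(p_true, eps, 1 - eps)))

# Toy frame (hypothetical values) in the same layout as prob_target_dt above.
toy = pd.DataFrame({
    'high':   [0.1, 0.2, 0.7],
    'low':    [0.7, 0.3, 0.2],
    'medium': [0.2, 0.5, 0.1],
    'interest_level': ['low', 'medium', 'high'],
})

toy_loss = vectorized_multiclass_logloss(toy)
expected = -(np.log(0.7) + np.log(0.5) + np.log(0.7)) / 3
print(toy_loss, expected)
```

On tens of thousands of rows this avoids calling a Python function once per listing, which is where most of the time in the `apply` version goes.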

In [6]:
loss_amount_dt = overall_multiclasslogloss(prob_target_dt)
print('multiclass log loss for decision tree:',loss_amount_dt)
loss_amount_dt_2 = overall_multiclasslogloss(prob_target_dt_2)
print('multiclass log loss for decision tree 2:',loss_amount_dt_2)
loss_amount_lgr =  overall_multiclasslogloss(prob_target_lgr)
print('multiclass log loss for logistic regression:', loss_amount_lgr)
loss_amount_dumb =  overall_multiclasslogloss(prob_target_dumb)
print('multiclass log loss for dumb baseline classifier:', loss_amount_dumb)

multiclass log loss for decision tree: 12.490455074603624
multiclass log loss for decision tree 2: 0.9511805831094223
multiclass log loss for logistic regression: 0.6864907218337554
multiclass log loss for dumb baseline classifier: 0.7889918473333496


<b>Observations</b>:<br>
1.) <br> 
The first decision tree has such high error b/c it is WAY overfitting (I added no parameters to limit this).<br>
This decision tree's predicted probabilities for each listing consist of 1 for the predicted class and 0s for the classes not predicted. This is another sign that the model is overfitting b/c it indicates that each leaf in the tree exclusively contains listings of one class.<br>
2.)<br> The second tree has much lower error b/c I set parameters that limit the depth of the tree (and the minimum leaf size), resulting in less overfitting. It still does not beat the dumb baseline classifier, though (0.951 vs 0.789).<br>
3.)<br>The simple logistic regression model beats the dumb baseline classifier which is a good sign.<br>
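
As a rough check on observation 1: if every prediction of the first tree really is a hard 0 or 1, a correct listing costs practically nothing and a wrong one costs `-log(1e-15)`. Under that assumption (it ignores any leaves that are not pure), the reported loss translates directly into an error rate.

```python
import numpy as np

eps = 1e-15

# Penalty for a listing whose true class was given probability 0 (clipped to eps).
penalty_per_miss = -np.log(eps)

# Loss reported above for the unconstrained decision tree.
observed_loss = 12.490455074603624

# If all predictions are hard 0/1, loss ~= error_rate * penalty_per_miss.
implied_error_rate = observed_loss / penalty_per_miss

print(round(penalty_per_miss, 2))
print(round(implied_error_rate, 3))
```

So a loss of about 12.49 corresponds to roughly 36% of the held-out listings being misclassified with full confidence, which is why hard probabilities are punished so heavily by this metric.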


## 5. Model Selection & Hyperparameter Tuning

Note that from here on out I will be using sklearn's implementation of multiclass logarithmic loss rather than my own out of convenience. 

I will be using the following algorithms:

<b>Generalized linear models</b>:<br>
1.) Simple logistic regression <br>
2.) Logistic regression w/ L1 Regularization -> did not converge and run time is WAY too long.<br>
3.) Logistic regression w/ L2 Regularization <br>
4.) Logistic regression w/ ElasticNet Regularization -> did not converge and run time is WAY too long.<br>

<b>Nearest-Neighbors model</b>:<br>
1.) KNeighborsClassifier -> ran but bad results (probably need to do some feature selection beforehand).<br>

<b>Tree-based models</b>: <br>
1.) Random Forest classification (tree-based ensemble method)<br>
2.) AdaBoost classification (tree-based ensemble method)<br>
3.) Gradient Boosting classification (tree-based ensemble method)<br>
4.) XGBoost classification (tree-based ensemble method)<br>

<b>Support Vector Machine model</b>:<br>
1.) Support Vector Classification -> run time too long

<b>Ensemble algorithms</b>:<br>
1.) Voting classifier<br>


https://scikit-learn.org/stable/supervised_learning.html

### Dumb baseline classifier
Like I did with my own implementation of multi-class log loss above (in that case it was calculated for half the train dataset b/c I did a train-test split on the train dataset), I will calculate what the log loss would be for the dumb baseline classifier on the entire train dataset so that we have a baseline to compare to when evaluating how effective each model is. Note that the log loss should be very close to what I calculated above (0.789). This is b/c the log loss is an average across the dataset, so it doesn't depend on the size of the dataset. 

In [7]:
#Do the dumb baseline classifier:
from sklearn.metrics import log_loss
#Creating dataframe of predictions for dumb baseline classifier
#Creating each column starting with numpy arrays
low = np.zeros((train.shape[0],1), dtype=float)
low += 0.69468
medium = np.zeros((train.shape[0],1), dtype=float)
medium += 0.22753
high = np.zeros((train.shape[0],1), dtype=float)
high += 0.07779

predictions_dumb = np.concatenate((high,low,medium), axis=1)
#print(predictions_dumb)

prediction_probabilities_dumb= pd.DataFrame(data=predictions_dumb,
                 columns= ['high', 'low', 'medium'])
prob_target_dumb = pd.concat([prediction_probabilities_dumb, train['interest_level']],axis=1)

dumb_baseline_logloss = log_loss(train['interest_level'], prediction_probabilities_dumb)
print('log loss for the "dumb" baseline classifier:', dumb_baseline_logloss)

log loss for the "dumb" baseline classifier: 0.7885769114648241
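
The same kind of baseline can be obtained without hard-coding the prevalences. The sketch below uses scikit-learn's `DummyClassifier` with `strategy='prior'` on toy labels; the 70/20/10 class mix and the variable names are illustrative and not taken from the RentHop data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Toy labels with a 70/20/10 class mix.
y_toy = np.array(['low'] * 7 + ['medium'] * 2 + ['high'] * 1)
X_toy = np.zeros((len(y_toy), 1))  # the features are ignored by DummyClassifier

# strategy='prior' predicts the empirical class distribution for every row.
dummy = DummyClassifier(strategy='prior')
dummy.fit(X_toy, y_toy)
proba = dummy.predict_proba(X_toy)

# Columns of proba follow dummy.classes_ (sorted labels).
print(dummy.classes_)
print(proba[0])

toy_baseline = log_loss(y_toy, proba, labels=dummy.classes_)
print(toy_baseline)
```

Because it is a regular estimator, a baseline like this could also be passed through the same cross-validation code as the real models, which keeps the comparison on equal footing.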


## Generalized linear models

### Simple logistic regression

Solvers that support each penalty in scikit-learn's LogisticRegression:

- l1: liblinear, saga
- l2: liblinear, newton-cg, sag, lbfgs, saga
- elasticnet: saga
- none: newton-cg, sag, lbfgs, saga

In [8]:
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV

KF = KFold(n_splits=5, shuffle=True, random_state=14)

#Redefine so that X_train is for the entire training dataset (above we only used half)
#This will take longer to run
X_train = train.drop(['interest_level'], axis=1)
y_train = train['interest_level']

#Had to increase max_iter from the default of 100 for any of these solvers to converge
lgr = LogisticRegression(multi_class='ovr', penalty='none', max_iter=10000)

#params_lgr = {"solver": ['newton-cg',  'sag', 'lbfgs', 'saga']}
params_lgr = {"solver": ['lbfgs','newton-cg']}
print(len(X_train))
grid_lgr = GridSearchCV(lgr, param_grid=params_lgr, scoring='neg_log_loss', cv=KF)
grid_lgr.fit(X_train, y_train)

lgr_best_estimator = grid_lgr.best_estimator_
lgr_best_params = grid_lgr.best_params_
lgr_best_score = -1*(grid_lgr.best_score_)

print(lgr_best_estimator)
print(lgr_best_params)
print('log loss for simple logistic regression:', lgr_best_score)

49352
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='ovr', n_jobs=None, penalty='none',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
{'solver': 'lbfgs'}
log loss for simple logistic regression: 0.6811067236543596
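
The `-1*` applied to `best_score_` in the cell above is needed because scikit-learn scorers follow a "higher is better" convention, so `'neg_log_loss'` reports the negated log loss. The sketch below shows this on synthetic data; the dataset, the `C` grid and the variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic 3-class problem (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

kf = KFold(n_splits=5, shuffle=True, random_state=14)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='neg_log_loss',
    cv=kf,
)
grid.fit(X_toy, y_toy)

# best_score_ is the mean cross-validated score, i.e. a negative number here.
print(grid.best_score_)

# Flip the sign to get the log loss in its usual "lower is better" form.
best_log_loss = -grid.best_score_
print(best_log_loss)
```

The grid search still picks the candidate with the smallest log loss, since maximizing the negated value is the same thing.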


### l1 penalty
#check to see if saved

<b> Run-time is WAY too long. </b><br>
Need to look into strategies to decrease the run-time.<br>
https://stackoverflow.com/questions/52670012/convergencewarning-liblinear-failed-to-converge-increase-the-number-of-iterati

In [9]:
#lgr_l1 = LogisticRegression(multi_class='ovr', penalty="l1", solver='liblinear', max_iter=10000)

#params_lgr_l1 = {"C": [50, 100, 500]}

#grid_lgr_l1 = GridSearchCV(lgr_l1, param_grid=params_lgr_l1, scoring='neg_log_loss', cv=KF)
#grid_lgr_l1.fit(X_train, y_train)

#lgr_l1_best_estimator = grid_lgr_l1.best_estimator_
#lgr_l1_best_params = grid_lgr_l1.best_params_
#lgr_l1_best_score = -1*(grid_lgr_l1.best_score_)

#print(lgr_l1_best_estimator)
#print(lgr_l1_best_params)
#print('log loss for logistic regression w/ L1 Regularization:', lgr_l1_best_score)

### l2 penalty

In [10]:
lgr_l2 = LogisticRegression(multi_class='ovr', penalty="l2")

#Spent a lot of time tuning this, 'lbfgs' seems to work better and to actually be able to converge.
params_lgr_l2 = {"C": [3500, 5000, 6000],
                'solver': ['lbfgs']}

grid_lgr_l2 = GridSearchCV(lgr_l2, param_grid=params_lgr_l2, scoring='neg_log_loss', cv=KF)
grid_lgr_l2.fit(X_train, y_train)

lgr_l2_best_estimator = grid_lgr_l2.best_estimator_
lgr_l2_best_params = grid_lgr_l2.best_params_
lgr_l2_best_score = -1*(grid_lgr_l2.best_score_)

print(lgr_l2_best_estimator)
print(lgr_l2_best_params)
print('log loss for logistic regression w/ l2 penalty:', lgr_l2_best_score)
#print(grid_lgr_l2.cv_results_)





LogisticRegression(C=6000, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
{'C': 6000, 'solver': 'lbfgs'}
log loss for logistic regression w/ l2 penalty: 0.6808655800944188




### ElasticNet
<b> Run-time is WAY too long (likely from the L1 regularization term). </b><br>

In [11]:
#lgr_en = LogisticRegression(multi_class='ovr', penalty="elasticnet", solver='saga')

#params_lgr_en = {"C": [1000, 5000, 10000],
#                'l1_ratio': [0.25, 0.5, 0.9]}

#grid_lgr_en = GridSearchCV(lgr_en, param_grid=params_lgr_en, scoring='neg_log_loss', cv=KF)
#grid_lgr_en.fit(X_train, y_train)

#lgr_en_best_estimator = grid_lgr_en.best_estimator_
#lgr_en_best_params = grid_lgr_en.best_params_
#lgr_en_best_score = -1*(grid_lgr_en.best_score_)

#print(lgr_en_best_estimator)
#print(lgr_en_best_params)
#print('log loss for logistic regression w/ ElasticNet:', lgr_en_best_score)

## Tree-based algorithms

### Random Forest Classification

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=7)

params_rfc ={
        "n_estimators": [10, 20],
        "criterion": ["gini", "entropy"],
        "max_depth": [2, 5, 10],
        "max_features": ["log2", "sqrt"],
        "min_samples_leaf": [1, 5, 8],
        "min_samples_split": [2, 3, 5]
}


grid_rfc = GridSearchCV(rfc, param_grid=params_rfc, scoring='neg_log_loss',return_train_score=False, cv=KF)
grid_rfc.fit(X_train, y_train)

rfc_best_model = grid_rfc.best_estimator_
rfc_best_params = grid_rfc.best_params_
rfc_best_score = -1*(grid_rfc.best_score_)

print(rfc_best_model)
print(rfc_best_params)
print(rfc_best_score)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=7, verbose=0,
                       warm_start=False)
{'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 20}
0.6703905305945391
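
Since run time is a recurring problem in this notebook, it helps to count how many fits a grid implies before launching it. The sketch below uses scikit-learn's `ParameterGrid` on the random forest grid from the cell above; the fold count of 5 matches the `KFold` defined earlier.

```python
from sklearn.model_selection import ParameterGrid

params_rfc = {
    "n_estimators": [10, 20],
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 5, 10],
    "max_features": ["log2", "sqrt"],
    "min_samples_leaf": [1, 5, 8],
    "min_samples_split": [2, 3, 5],
}

n_splits = 5  # number of cross-validation folds

# Every combination of values is one candidate model.
n_candidates = len(ParameterGrid(params_rfc))

# Each candidate is fit once per fold (plus one final refit on all the data).
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)
```

The count grows multiplicatively with every extra value, so trimming one list from three values to two removes a third of the work.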


### AdaBoost Classification

In [13]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(random_state=65)

params_ada ={
        #was 100,200 but VERY long
        "n_estimators": [30],
        "learning_rate": [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "algorithm": ['SAMME', 'SAMME.R']
}

grid_ada = GridSearchCV(ada, param_grid=params_ada, scoring='neg_log_loss',return_train_score=False, cv=KF)
grid_ada.fit(X_train, y_train)

ada_best_model = grid_ada.best_estimator_
ada_best_params = grid_ada.best_params_
ada_best_score = abs(grid_ada.best_score_)

print(ada_best_model)
print(ada_best_params)
print(ada_best_score)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=0.05,
                   n_estimators=30, random_state=65)
{'algorithm': 'SAMME.R', 'learning_rate': 0.05, 'n_estimators': 30}
0.8158580933459949


### Gradient Boosting Classification

In [14]:
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(random_state=76)

#tuned
params_gbm ={
        "loss": ['deviance'],
        "n_estimators": [30],
        "learning_rate": [0.05, 0.1],
        "subsample": [0.8],
        "max_depth": [10],
        "max_features": ["sqrt"],
        "min_samples_leaf": [8],
        "min_samples_split": [10]
}


grid_gbm = GridSearchCV(gbm, param_grid=params_gbm, scoring='neg_log_loss',return_train_score=False, cv=KF)
grid_gbm.fit(X_train, y_train)

gbm_best_model = grid_gbm.best_estimator_
gbm_best_params = grid_gbm.best_params_
gbm_best_score = abs(grid_gbm.best_score_)

print(gbm_best_model)
print(gbm_best_params)
print(gbm_best_score)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=10,
                           max_features='sqrt', max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=8, min_samples_split=10,
                           min_weight_fraction_leaf=0.0, n_estimators=30,
                           n_iter_no_change=None, presort='auto',
                           random_state=76, subsample=0.8, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 8, 'min_samples_split': 10, 'n_estimators': 30, 'subsample': 0.8}
0.635702147144199


### XGBoost Classification
https://medium.com/@gabrielziegler3/multiclass-multilabel-classification-with-xgboost-66195e4d9f2d
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [15]:
import warnings
#warnings.filterwarnings("ignore", category=DeprecationWarning) 
#warnings.filterwarnings("ignore", category=FutureWarning) 
from xgboost import XGBClassifier
#https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

xgb = XGBClassifier(random_state =17)

params_xgb ={
        "n_estimators": [100],
        "learning_rate": [0.05, 0.1],
        "subsample": [0.8],
        "objective": ["multi:softprob"],
        "eval_metric": ["logloss"],
        "num_class": [3]
}


grid_xgb = GridSearchCV(xgb, param_grid=params_xgb, scoring='neg_log_loss',return_train_score=False, cv=KF)
grid_xgb.fit(X_train, y_train)

xgb_best_model = grid_xgb.best_estimator_
xgb_best_params = grid_xgb.best_params_
xgb_best_score = abs(grid_xgb.best_score_)

print(xgb_best_model)
print(xgb_best_params)
print(xgb_best_score)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, num_class=3, objective='multi:softprob',
              random_state=17, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=None, silent=None, subsample=0.8, verbosity=1)
{'eval_metric': 'logloss', 'learning_rate': 0.1, 'n_estimators': 100, 'num_class': 3, 'objective': 'multi:softprob', 'subsample': 0.8}
0.6447125322983399


## Nearest-Neighbors 

<b> Takes a VERY long time to run and does not give good results. </b>

In [16]:
feature_importance_series = pd.Series(index=X_train.columns, data=(rfc_best_model.feature_importances_)).sort_values(ascending=False)
print(feature_importance_series.iloc[0:10])
top_performing_features = feature_importance_series.index[0:10]
print(top_performing_features)

X_train = train[top_performing_features]

price                    0.241493
nofee_fl                 0.073116
number_of_photos         0.071503
words_in_description     0.059147
bedrooms                 0.046701
hour_categories_night    0.044535
hardwood_fl              0.041357
num_features_listed      0.038714
bathrooms                0.032052
has_photos               0.031411
dtype: float64
Index(['price', 'nofee_fl', 'number_of_photos', 'words_in_description',
       'bedrooms', 'hour_categories_night', 'hardwood_fl',
       'num_features_listed', 'bathrooms', 'has_photos'],
      dtype='object')


In [18]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

params_knn ={
#                    "n_neighbors": [3,10,20],
#                    "weights": ["distance", "uniform"],
#                    "algorithm": ["ball_tree", "kd_tree", "brute"],
#                    "p": [1,2]
                    "n_neighbors": [10],
                    "weights": ["distance", "uniform"],
                    "algorithm": ["ball_tree"],
                    "p": [2]

}

#params_knn ={
#                    "n_neighbors": [5],
#                    "weights": ["uniform"],
#                    "algorithm": ["auto"],
#                    "p": [2]
#
#}


grid_knn = GridSearchCV(knn, param_grid=params_knn, scoring='neg_log_loss',return_train_score=False, cv=KF)
grid_knn.fit(X_train, y_train)

knn_best_model = grid_knn.best_estimator_
knn_best_params = grid_knn.best_params_
knn_best_score = abs(grid_knn.best_score_)

print(knn_best_model)
print(knn_best_params)
print(knn_best_score)

KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')
{'algorithm': 'ball_tree', 'n_neighbors': 10, 'p': 2, 'weights': 'uniform'}
1.9833393740643022


### Support Vector Machine algorithms

1.) SVC 
- May not work well for tens of thousands of datapoints
- The multiclass case is handled with one-vs-one.
- Support vector machine algorithms do NOT directly provide probability estimates. However, SVC has a 'probability' parameter that can be set to True to get probability estimates, but this makes it take longer to run. The estimates are calculated with five-fold cross validation using Platt scaling, which fits a logistic regression to the support vector classifier's scores. This is very computationally expensive. 
- <b>This algorithm was too computationally expensive and the training time was too long for this dataset </b> (tens of thousands of datapoints and required Platt scaling for probabilities). I may revisit this algorithm when I run this project with some form of cloud computing.

2.) NuSVC
- Similar to SVC but accepts slightly different parameters and has different mathematical formulation.

3.) LinearSVC
- Is the specific implementation of SVC with a linear kernel. However, it is implemented with 'liblinear' (rather than 'libsvm' for SVC). This allows for more flexibility in the penalty options, more flexibility in the loss functions, and scales better to large #s of samples. 
- The multi-class case is handled with the one-vs-rest (OVR) technique. 
- <b> Does NOT have a parameter to return probability estimates for each of the classes </b>. For this project, our model is evaluated with the multiclass log loss which requires probability estimates for each of the classes. For this reason, I did NOT use the LinearSVC algorithm.

Sources:<br>
1.) https://scikit-learn.org/stable/modules/svm.html<br>
2.) https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47<br>
3.) http://web.mit.edu/6.034/wwwbob/svm.pdf<br>
4.) http://cs229.stanford.edu/notes/cs229-notes3.pdf <br>
5.) https://scikit-learn.org/stable/modules/svm.html#scores-and-probabilities <br>
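
LinearSVC was ruled out above because it has no `predict_proba`. One possible workaround, which was not tried in this project, is to wrap it in scikit-learn's `CalibratedClassifierCV`, which fits a sigmoid (Platt-style) calibration on top of the decision scores. The sketch below runs on synthetic data, and all names in it are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Small synthetic 3-class problem (illustrative only).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# LinearSVC on its own only exposes decision_function, not predict_proba.
# The wrapper calibrates those scores with 5-fold cross-validation.
calibrated_svc = CalibratedClassifierCV(
    LinearSVC(max_iter=10000), method='sigmoid', cv=5
)
calibrated_svc.fit(X_toy, y_toy)

proba = calibrated_svc.predict_proba(X_toy)
print(proba.shape)
print(proba[:3])
```

Whether this would be fast enough, or competitive on log loss, for the full dataset is untested here; it only shows that the missing probability output is not a hard blocker.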

In [None]:
#from sklearn.svm import SVC 


#SVC = SVC(probability=True,gamma='auto', random_state=32)

#tuned
#params_SVC ={'C': [1],
#             'kernel':['rbf']
#}


#grid_SVC = GridSearchCV(SVC, param_grid=params_SVC, scoring='neg_log_loss',return_train_score=False, cv=KF)
#grid_SVC.fit(X_train, y_train)

#SVC_best_model = grid_SVC.best_estimator_
#SVC_best_params = grid_SVC.best_params_
#SVC_best_score = abs(grid_SVC.best_score_)

#print(SVC_best_model)
#print(SVC_best_params)
#print(SVC_best_score)

### Stacking w/ Voting Classifier

In [19]:
from sklearn.ensemble import VotingClassifier

params_voting_ensemble = {'weights': [[0.50,0.50], [0.25, 0.75], [0.75, 0.25], [0,1], [1,0]]}

#the estimator parameter needs to be a list of (string, estimator) tuples:
estimators=[('xgboost', xgb_best_model), ('gbm', gbm_best_model)]

#voting set to 'soft' allows for predicting class labels by weighting the probabilities, not the outcome
voting_ensemble_model = VotingClassifier(estimators, voting='soft')

grid_voting_ensemble = GridSearchCV(voting_ensemble_model, param_grid=params_voting_ensemble, scoring='neg_log_loss',return_train_score=False, cv=KF)
grid_voting_ensemble.fit(X_train, y_train)

voting_ensemble_best_model = grid_voting_ensemble.best_estimator_
voting_ensemble_best_params = grid_voting_ensemble.best_params_
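#Negating b/c 'neg_log_loss' returns the negative of the log loss (consistent with the other models above)
voting_ensemble_best_score = -1*(grid_voting_ensemble.best_score_)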

print(voting_ensemble_best_model)
print(voting_ensemble_best_params)
print(voting_ensemble_best_score)

VotingClassifier(estimators=[('xgboost',
                              XGBClassifier(base_score=0.5, booster='gbtree',
                                            colsample_bylevel=1,
                                            colsample_bynode=1,
                                            colsample_bytree=1,
                                            eval_metric='logloss', gamma=0,
                                            learning_rate=0.1, max_delta_step=0,
                                            max_depth=3, min_child_weight=1,
                                            missing=None, n_estimators=100,
                                            n_jobs=1, nthread=None, num_class=3,
                                            objective='multi:softprob',
                                            random_state=17, reg...
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=
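
With `voting='soft'`, the ensemble's probabilities are a weighted average of the member models' probabilities, which is what the `weights` grid above is searching over. The numpy sketch below reproduces that average by hand; the two probability arrays are invented for the example and are not outputs of the fitted models.

```python
import numpy as np

# Hypothetical class probabilities from two models for two listings.
p_xgb = np.array([[0.10, 0.70, 0.20],
                  [0.30, 0.40, 0.30]])
p_gbm = np.array([[0.20, 0.60, 0.20],
                  [0.10, 0.50, 0.40]])

weights = np.array([0.25, 0.75])  # one weight per model, as in the grid above

# Weighted average over the model axis; rows still sum to 1.
p_soft = np.average(np.stack([p_xgb, p_gbm]), axis=0, weights=weights)
print(p_soft)
```

The weights `[0, 1]` and `[1, 0]` in the grid reduce the ensemble to a single model, so they act as a built-in comparison against each member on its own.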

## 6. Submission

In [None]:
test_dataset_X = test.drop(['interest_level'],axis=1)

#Using tuned random forest model
model = rfc_best_model

predictions_proba = model.predict_proba(test_dataset_X)
#Column order of predict_proba follows model.classes_ (alphabetical here: 'high', 'low', 'medium')
prediction_probabilities_dt= pd.DataFrame(data=predictions_proba,
                 columns= model.classes_)

prediction_probabilities_dt_high = prediction_probabilities_dt['high']
prediction_probabilities_dt_medium = prediction_probabilities_dt['medium']
prediction_probabilities_dt_low = prediction_probabilities_dt['low']

untransformed_test_dataset = pd.read_json('data/test.json')
listing_id_series = untransformed_test_dataset['listing_id'].reset_index(drop=True)

submission_dict = {'listing_id': listing_id_series, 'high': prediction_probabilities_dt_high,
                  'medium': prediction_probabilities_dt_medium,
                  'low': prediction_probabilities_dt_low}
submission_df = pd.DataFrame(submission_dict)
submission_df.to_csv('data/first_submission.csv', index=False)
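
Before uploading, it can be worth running a few cheap checks on the submission frame. The helper below is a sketch: the function name `check_submission` and the toy frame are made up, and the column names are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

def check_submission(df, class_cols=('high', 'medium', 'low'), tol=1e-6):
    """Return a list of problems found in a submission frame (empty if none)."""
    missing = [c for c in ('listing_id', *class_cols) if c not in df.columns]
    if missing:
        return [f'missing columns: {missing}']

    problems = []
    probs = df[list(class_cols)].to_numpy(dtype=float)
    if np.isnan(probs).any():
        problems.append('NaN probabilities (often caused by index misalignment)')
    if ((probs < 0) | (probs > 1)).any():
        problems.append('probabilities outside [0, 1]')
    if not np.allclose(probs.sum(axis=1), 1.0, atol=tol):
        problems.append('rows do not sum to 1')
    if df['listing_id'].duplicated().any():
        problems.append('duplicate listing_id values')
    return problems

# Toy submission (hypothetical ids and probabilities).
toy = pd.DataFrame({
    'listing_id': [11, 12],
    'high': [0.1, 0.2],
    'medium': [0.2, 0.3],
    'low': [0.7, 0.5],
})
print(check_submission(toy))
```

The NaN check matters most here, since building a frame from Series with different indexes silently fills the gaps with NaN.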