# Learning from Data Term Project

#### Ali Kerem Bozkurt - 150190003
#### Beyza Aydeniz - 150200039
#### Elvan Teke - 150190102
#### Hasan Fatih Durkaya - 150200074
#### Ömer Yıldırım - 150190115

In [24]:
import helpers
import random as r
from numpy import mean
from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import LinearRegression, Ridge, LassoLars
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor, VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor, XGBRFRegressor

In [25]:
# Prepare features and targets
train_features = helpers.prepareFeatures("train_features.csv", normalize=False, ohe_children=False, ohe_region=True)
train_targets = helpers.prepareTargets("train_targets.csv")

test_features = helpers.prepareFeatures("test_features.csv", normalize=False, ohe_children=False, ohe_region=True)

# Normalization is skipped: the feature scales are not very different, so it is not needed.
# One-hot encoding the children column is also skipped because the models perform worse with it.
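
The preprocessing options above can be illustrated with a minimal sketch. The implementation of `helpers.prepareFeatures` is not shown in this notebook, so `prepare_features_sketch`, the toy data, and the column names below are all assumptions made for illustration. Min-max scaling is only one possible meaning of `normalize=True`.

```python
import pandas as pd

# Toy data. The column names are assumptions, not taken from the project files.
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "bmi": [27.9, 22.7, 30.1, 25.3],
    "children": [0, 1, 3, 2],
    "region": ["southwest", "northwest", "southeast", "northwest"],
})

def prepare_features_sketch(frame, normalize=False, ohe_region=True):
    """Hypothetical stand-in for helpers.prepareFeatures."""
    out = frame.copy()
    if ohe_region:
        # Replace the categorical column with one indicator column per region.
        out = pd.get_dummies(out, columns=["region"], dtype=float)
    if normalize:
        # Min-max scale each numeric column to the range [0, 1].
        for col in ["age", "bmi", "children"]:
            low, high = out[col].min(), out[col].max()
            out[col] = (out[col] - low) / (high - low)
    return out

encoded = prepare_features_sketch(df)
scaled = prepare_features_sketch(df, normalize=True)
print(encoded.columns.tolist())
```

Tree-based models are largely insensitive to feature scale, while distance-based models such as `KNeighborsRegressor` are not, so the effect of the `normalize` flag may differ between the models compared below.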

In [26]:
models = []
avg_errors = []

# Add models

models.append(LinearRegression())
models.append(Ridge())
models.append(LassoLars())
models.append(KNeighborsRegressor())
models.append(AdaBoostRegressor())
models.append(XGBRegressor())
models.append(GradientBoostingRegressor())
models.append(RandomForestRegressor())
models.append(XGBRFRegressor())

In [27]:
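# Seed Python's random module. Note that this does not seed NumPy, so estimators
# created without random_state may still give different results between runs.
r.seed(1)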

# Apply 5-fold CV on models
for model in models:
    errors = helpers.CV(features=train_features, targets=train_targets, model=model, n_splits=5)
    avg_errors.append(mean(errors))
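
For readers without access to `helpers`, the cross-validation loop can be sketched as follows. `cv_sketch` is a hypothetical stand-in, not the actual `helpers.CV`. It assumes shuffled folds, a fixed `random_state`, and mean squared error as the per-fold metric. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_sketch(features, targets, model, n_splits=5, seed=1):
    """Hypothetical stand-in for helpers.CV: one validation MSE per fold."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, val_idx in folds.split(features):
        # Fit on the training folds, score on the single held-out fold.
        model.fit(features[train_idx], targets[train_idx])
        predictions = model.predict(features[val_idx])
        errors.append(mean_squared_error(targets[val_idx], predictions))
    return errors

# Synthetic data, used only to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_errors = cv_sketch(X, y, LinearRegression())
print(len(fold_errors), np.mean(fold_errors))
```

Passing `random_state` to `KFold` keeps the fold assignment identical for every model, so the averaged errors are compared on the same splits.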

In [28]:
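# Sort by average error in ascending order.
# Keying on the error alone avoids comparing model objects if two errors tie.
ranked = sorted(zip(avg_errors, models), key=lambda pair: pair[0])
avg_errors = [error for error, _ in ranked]
models = [model for _, model in ranked]

# Print errors
print("   {0:32} Average CV Error".format('Model'))
for i, (model, error) in enumerate(zip(models, avg_errors), start=1):
    print("{0}. {1:30} : {2}".format(i, model.__class__.__name__, error))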

   Model                            Average CV Error
1. XGBRFRegressor                 : 23700801.101244736
2. GradientBoostingRegressor      : 24304636.24704759
3. RandomForestRegressor          : 26897806.321399402
4. AdaBoostRegressor              : 30490363.558834463
5. XGBRegressor                   : 32700412.265734624
6. LassoLars                      : 39851568.49743668
7. LinearRegression               : 39995269.09365535
8. Ridge                          : 40118935.89183073
9. KNeighborsRegressor            : 133118638.15185659
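
The errors above are easier to read in the units of the target. Assuming `helpers.CV` reports mean squared error (the notebook imports `mean_squared_error`, but the helper itself is not shown), taking the square root gives a root mean squared error. The sketch below copies three of the values from the output above.

```python
import math

# Average CV errors copied from the printed table; assumed to be MSE values.
avg_cv_mse = {
    "XGBRFRegressor": 23700801.101244736,
    "GradientBoostingRegressor": 24304636.24704759,
    "KNeighborsRegressor": 133118638.15185659,
}

# The square root converts squared units back to the units of the target.
avg_cv_rmse = {name: math.sqrt(value) for name, value in avg_cv_mse.items()}

for name, value in avg_cv_rmse.items():
    print("{0:30} : {1:.2f}".format(name, value))
```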


The top two regressors are XGBRFRegressor and GradientBoostingRegressor. We combine them with a VotingRegressor and tune their hyperparameters to get the best fit.
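
With default (uniform) weights, a `VotingRegressor` predicts the plain average of its members' predictions. The sketch below checks this on synthetic data. `RandomForestRegressor` is used as a stand-in for `XGBRFRegressor` so that the example depends on scikit-learn only. The data and all parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

gbr = GradientBoostingRegressor(random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0)  # stand-in for XGBRFRegressor
voter = VotingRegressor([("gbr", gbr), ("rf", rf)])
voter.fit(X, y)

# The fitted clones are stored in estimators_, in the order they were passed.
individual = np.column_stack([est.predict(X) for est in voter.estimators_])
ensemble_prediction = voter.predict(X)

# The ensemble output equals the unweighted mean of the member outputs.
matches_mean = np.allclose(ensemble_prediction, individual.mean(axis=1))
print(matches_mean)
```

Averaging tends to help most when the members make different kinds of errors. The optional `weights` argument of `VotingRegressor` can shift the balance toward the stronger member.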

In [30]:
xgbrf = XGBRFRegressor(max_depth=4, n_estimators=200, subsample=0.5, colsample_bynode=1, min_child_weight=7, learning_rate=1)
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=70, learning_rate=0.2)

final_model = VotingRegressor([('gbr', gbr), ('xgbrf', xgbrf)])

final_model.fit(train_features, train_targets)

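# Print the training MSE. It is measured on the data the model was fitted on,
# so it is an optimistic estimate of the error on unseen data.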
error = mse(train_targets, final_model.predict(train_features))
print("Final model error : {}".format(error))

Final model error : 18403159.7504674
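
The notebook does not show how the hyperparameters were chosen. One possible approach, sketched below on synthetic data, is a grid search over the ensemble that also reports a cross-validated error next to the training error. `RandomForestRegressor` again stands in for `XGBRFRegressor`, and the grid values are arbitrary assumptions, not the ones used for the final model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=150)

ensemble = VotingRegressor([
    ("gbr", GradientBoostingRegressor(random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
])

# Member parameters are addressed as "<member name>__<parameter>".
param_grid = {"gbr__max_depth": [2, 3], "gbr__n_estimators": [50, 100]}
search = GridSearchCV(ensemble, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign to get the CV error.
cv_mse = -search.best_score_
# After the search, predict() uses the best configuration refitted on all data.
train_mse = np.mean((y - search.predict(X)) ** 2)
print(search.best_params_, cv_mse, train_mse)
```

The cross-validated error is typically larger than the training error, which is why it is the more useful number to compare against the per-model CV errors listed earlier.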


In [None]:
helpers.createSubmission(model=final_model, test_features=test_features, submissionFile="submission.csv")