# Learning from Data Term Project

#### Ali Kerem Bozkurt - 150190003
#### Beyza Aydeniz - 150200039
#### Elvan Teke - 150190102
#### Hasan Fatih Durkaya - 150200074
#### Ömer Yıldırım - 150190115

In [11]:
import helpers
import random as r
from numpy import mean
from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import LinearRegression, Ridge, LassoLars
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor, VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor, XGBRFRegressor

In [19]:
# Prepare features and targets
train_features = helpers.prepareFeatures("train_features.csv", normalize=False, ohe_children=False, ohe_region=True)
train_targets = helpers.prepareTargets("train_targets.csv")

test_features = helpers.prepareFeatures("test_features.csv", normalize=False, ohe_children=False, ohe_region=True)

# Normalization is possible, but the feature scales do not differ enough to require it.
# One-hot encoding `children` is also possible, but the models perform worse with it.
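
For readers unfamiliar with the `ohe_region` option, here is a minimal sketch of what one-hot encoding a categorical column looks like. `helpers.prepareFeatures` is not shown in this notebook, so this is not its implementation. The column names and values are made up for illustration, and `pandas.get_dummies` is assumed as one common way to do it.

```python
import pandas as pd

# Made-up rows for illustration only (not the project's data)
frame = pd.DataFrame({
    "children": [0, 2, 1, 3],
    "region": ["north", "south", "east", "north"],
})

# One-hot encode only `region`; `children` stays a single numeric column,
# mirroring ohe_region=True, ohe_children=False
encoded = pd.get_dummies(frame, columns=["region"])

# Each row has exactly one active region indicator
indicators_per_row = encoded.filter(like="region_").sum(axis=1)
print(sorted(encoded.columns))
```

Keeping `children` numeric preserves its ordering (0 < 1 < 2 ...), which may be part of why the one-hot variant performed worse here.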

In [13]:
models = []
avg_errors = []

# Add models for CV

models.append(LinearRegression())
models.append(Ridge())
models.append(LassoLars())
models.append(KNeighborsRegressor())
models.append(AdaBoostRegressor())
models.append(XGBRegressor())
models.append(GradientBoostingRegressor())
models.append(RandomForestRegressor())
models.append(XGBRFRegressor())

In [14]:
# Pick seed
r.seed(1)

# Apply 5-fold CV on models
for model in models:
    errors = helpers.CV(features=train_features, targets=train_targets, model=model, n_splits=5)
    avg_errors.append(mean(errors)) # Add average CV error for each model
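
`helpers.CV` is not shown in this notebook. The sketch below is a hypothetical stand-in for what a 5-fold routine of this kind typically does, not the actual helper. It assumes that the helper returns one mean squared error per fold and that folds are shuffled with a fixed seed. The function name `cv_sketch` and the synthetic data are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def cv_sketch(features, targets, model, n_splits=5, seed=1):
    """Hypothetical stand-in for helpers.CV: returns one MSE per fold."""
    features = np.asarray(features)
    targets = np.asarray(targets)
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, val_idx in splitter.split(features):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(features[train_idx], targets[train_idx])
        predictions = fold_model.predict(features[val_idx])
        fold_errors.append(mean_squared_error(targets[val_idx], predictions))
    return fold_errors


# Synthetic data, for illustration only (not the project's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_errors = cv_sketch(X, y, LinearRegression())
print(len(fold_errors), float(np.mean(fold_errors)))
```

Cloning the model inside the loop keeps the folds independent, and passing the seed to `KFold` directly makes the split reproducible regardless of any global random state.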

In [15]:
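# Sort by average error in ascending order.
# Sorting on the error alone avoids comparing model objects when two errors tie.
ranked = sorted(zip(avg_errors, models), key=lambda pair: pair[0])
avg_errors = [error for error, _ in ranked]
models = [model for _, model in ranked]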

# Print errors
print("   {0:32} Average CV Error".format('Model'))
for i in range(len(models)):
    print("{0}. {1:30} : {2}".format(i + 1, models[i].__class__.__name__, avg_errors[i]))

   Model                            Average CV Error
1. XGBRFRegressor                 : 24057694.91484101
2. GradientBoostingRegressor      : 24262940.30005169
3. RandomForestRegressor          : 25662805.01407318
4. AdaBoostRegressor              : 29981675.663254507
5. XGBRegressor                   : 32468007.509918343
6. LinearRegression               : 39857699.46223384
7. Ridge                          : 39993464.69700228
8. LassoLars                      : 40249508.09507123
9. KNeighborsRegressor            : 144493572.30999446


The top regressors are XGBRFRegressor and GradientBoostingRegressor. We combine them with a VotingRegressor and tune their hyperparameters to get the best fit.
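
To make the combination step concrete: with default (uniform) weights, a `VotingRegressor` predicts the plain average of its fitted members' predictions. The sketch below checks this on synthetic data. The two member models and the data are stand-ins chosen to keep the example small; they are not the project's tuned models.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 2))
y = 3.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.05, size=80)

voter = VotingRegressor([
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
])
voter.fit(X, y)

# Average the fitted members' predictions by hand
member_predictions = np.array([est.predict(X) for est in voter.estimators_])
manual_average = member_predictions.mean(axis=0)

# The ensemble prediction should match the manual average
matches = bool(np.allclose(voter.predict(X), manual_average))
print(matches)
```

Averaging tends to help when the members make partly different errors, which is a plausible reason to pair a boosted model with a random-forest-style model.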

In [20]:
xgbrf = XGBRFRegressor(max_depth=4, n_estimators=200, subsample=0.442, colsample_bynode=1, min_child_weight=7)
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=68, learning_rate=0.228)

final_model = VotingRegressor([('gbr', gbr), ('xgbrf', xgbrf)])

final_cv_error = helpers.CV(features=train_features, targets=train_targets, model=final_model, n_splits=5)
print("Final model CV error for each fold : {}".format(final_cv_error))
print("Average : {}".format(mean(final_cv_error)))

Final model CV error for each fold : [15295592.045305327, 28410718.241410345, 22831759.522029623, 20755001.89100978, 28469308.505679592]
Average : 23152476.041086935


After tuning the parameters with CV, we fit the final model on the whole training set to make our final submission.

In [17]:
final_model.fit(train_features, train_targets)

# Print final training error
final_training_error = mse(train_targets, final_model.predict(train_features))
print("Final model training error : {}".format(final_training_error))

Final model training error : 18403159.7504674
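
Squared errors of this size are hard to read, so it can help to take square roots and look at the errors in the units of the target. The sketch below does that for the numbers printed above. The training error is an MSE (it is computed with `mean_squared_error`). Treating the per-fold CV errors as MSE values is an assumption, since `helpers.CV` is not shown.

```python
from math import sqrt
from statistics import mean, stdev

# Values copied from the outputs above
fold_mse = [
    15295592.045305327,
    28410718.241410345,
    22831759.522029623,
    20755001.89100978,
    28469308.505679592,
]  # assumed to be MSE per fold
training_mse = 18403159.7504674

fold_rmse = [sqrt(error) for error in fold_mse]
training_rmse = sqrt(training_mse)

print("Per-fold RMSE     : {}".format([round(e, 1) for e in fold_rmse]))
print("Mean / std of RMSE: {:.1f} / {:.1f}".format(mean(fold_rmse), stdev(fold_rmse)))
print("Training RMSE     : {:.1f}".format(training_rmse))
```

The gap between the training error and the average CV error, together with the spread across folds, gives a rough sense of how much the fit depends on the particular split.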


In [18]:
helpers.createSubmission(model=final_model, test_features=test_features, submissionFile="submission.csv")