To get started, install the project dependencies by running the cell below. The leading exclamation mark tells the notebook to run the line as a terminal command

In [None]:
! pip install -r requirements.txt

Here we import the necessary packages. Note that `utils` is a folder, not a package. This folder must contain the "machine_learning.py" file, from which we import the `MachineLearning` class. We also suppress warnings, since they clutter the output

In [1]:
import pickle, os, warnings, optuna, numpy as np
from utils.machine_learning import MachineLearning
warnings.filterwarnings('ignore')
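
A side note on why importing from a plain folder works: since Python 3.3, a directory without an `__init__.py` file can be imported as a namespace package, as long as its parent directory is on `sys.path`. Whether `utils` in this project has an `__init__.py` is not shown here, so treat the sketch below as an illustration only. The names `demo_utils`, `demo_module` and `Demo` are made up for the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create a plain folder with one module and no __init__.py
    folder = Path(tmp) / "demo_utils"
    folder.mkdir()
    (folder / "demo_module.py").write_text("class Demo:\n    name = 'demo'\n")

    # The parent directory has to be importable
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("demo_utils.demo_module")
        demo_name = module.Demo.name
    finally:
        sys.path.remove(tmp)

print(demo_name)
```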



Here we cycle through the files in the data directory and load only the training and test datasets

In [2]:
data_sets = []
directory = 'data'

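# Sort the file names so the load order is stable (it must match the unpacking below)
files = sorted(os.listdir(directory))
# Skip the files that are not train/test splits
files.remove('AB_NYC_2019.csv')
files.remove('X_cluster.pkl')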
for file in files:
    with open(f"{directory}/{file}", 'rb') as f:
        data_sets.append(pickle.load(f))
X_test, X_train, y_test, y_train = data_sets[0], data_sets[1], data_sets[2], data_sets[3]
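
The unpacking above depends on the alphabetical order of the file names. A slightly more defensive variant loads each split by name, so that an extra file in the folder cannot shift the order. This is only a sketch: the file stems `X_train`, `X_test`, `y_train` and `y_test` and the helper `load_splits` are assumptions, and the demo uses a temporary folder with stand-in data rather than the real `data` directory.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical file stems; the real names in `data` may differ
SPLIT_NAMES = ("X_train", "X_test", "y_train", "y_test")


def load_splits(directory):
    """Load each split from '<directory>/<name>.pkl' into a dict keyed by name."""
    splits = {}
    for name in SPLIT_NAMES:
        with open(Path(directory) / f"{name}.pkl", "rb") as f:
            splits[name] = pickle.load(f)
    return splits


# Demo on a temporary folder with tiny stand-in data
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(SPLIT_NAMES):
        with open(Path(tmp) / f"{name}.pkl", "wb") as f:
            pickle.dump([i], f)
    demo_splits = load_splits(tmp)

print(demo_splits["X_train"], demo_splits["y_test"])
```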

Here we set the random seed for reproducibility and create an instance of the `MachineLearning` class. As an input it accepts one of:
- [linreg](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
    - No sensible hyperparameters to tune
- [knn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
    - **Andreas:** Start with ('n_neighbors', 1, 10), ('weights', ['uniform', 'distance'])
- [tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
    - **Waqar:** Start with ('max_depth', 3, 12), ('min_samples_split', 10, 20), ('min_samples_leaf', 1, 10), ('max_features', ['auto', 'sqrt'])
- [rf](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [xgb](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)

You can click on an algorithm to see which parameters it has for tuning

In [3]:
np.random.seed(510)
ml = MachineLearning('rf')

This is the most important part of the notebook, since here we tune the hyperparameters and train the model. You should supply the number of hyperparameter tuning iterations, then the training and test datasets (in this particular order), and then the hyperparameters to tune (in any order). A few notes about hyperparameters:
- They can be either numeric or categorical
- If you want to supply a numeric (integer or float) parameter, you pass the following tuple: (hyperparameter_name, min, max)
- If you want to supply a categorical parameter, you pass the following tuple: (hyperparameter_name, [val_1, val_2, ..., val_n])
- You may supply as many hyperparameters as you want, from one up to as many as the algorithm has
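
To make the two tuple shapes concrete, here is a minimal sketch of how such tuples could be told apart. It is not the code from `utils/machine_learning.py`, which is not shown in this notebook; `classify_spec` is a hypothetical helper. If the tuning is done with Optuna (it is imported above), the three kinds would correspond to `trial.suggest_int(name, low, high)`, `trial.suggest_float(name, low, high)` and `trial.suggest_categorical(name, choices)`.

```python
from typing import Any, Dict, Tuple


def classify_spec(spec: Tuple[Any, ...]) -> Dict[str, Any]:
    """Describe one hyperparameter tuple as categorical, int or float."""
    name = spec[0]
    # (name, [val_1, ..., val_n]) -> categorical
    if len(spec) == 2 and isinstance(spec[1], (list, tuple)):
        return {"name": name, "kind": "categorical", "choices": list(spec[1])}
    # (name, min, max) -> numeric
    if len(spec) == 3:
        low, high = spec[1], spec[2]
        if low > high:
            raise ValueError(f"{name}: min must not exceed max")
        both_int = isinstance(low, int) and isinstance(high, int)
        return {"name": name, "kind": "int" if both_int else "float",
                "low": low, "high": high}
    raise ValueError(f"Unrecognised hyperparameter tuple: {spec!r}")


print(classify_spec(("max_depth", 3, 10)))
print(classify_spec(("max_features", ["auto", "sqrt"])))
```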

Training might take quite a while if you pass many hyperparameters with wide ranges of values. However, the algorithm does not try every possible combination of hyperparameters, as it discards unpromising combinations quite early, so do not be afraid to experiment!
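
To get a feeling for how small the explored share of the search space is, we can count the combinations an exhaustive grid search would need for the random forest search space used in the next cell. The sketch assumes that numeric ranges are integer ranges with both ends included; `n_values` is a hypothetical helper.

```python
from math import prod

# Search space copied from the `fit` call below
search_space = [
    ("n_estimators", [100, 250, 500]),
    ("max_depth", 3, 10),
    ("min_samples_split", 10, 20),
    ("min_samples_leaf", 1, 10),
    ("max_features", ["auto", "sqrt"]),
]


def n_values(spec):
    """Number of distinct values one hyperparameter can take."""
    if len(spec) == 2:
        return len(spec[1])      # categorical: number of choices
    _, low, high = spec
    return high - low + 1        # integer range, both ends included (assumption)


grid_size = prod(n_values(spec) for spec in search_space)
n_trials = 100
print(grid_size)                         # 3 * 8 * 11 * 10 * 2 = 5280
print(f"{n_trials / grid_size:.1%}")     # share of the grid covered by 100 trials
```

With 100 trials, at most about 2% of the 5280 possible combinations are ever evaluated.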

In [4]:
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('n_estimators', [100, 250, 500]),
    ('max_depth', 3, 10),
    ('min_samples_split', 10, 20),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['auto', 'sqrt'])
)

After the training is done, we can check out the hyperparameters the model was trained with

In [5]:
ml.model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 10,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 2,
 'min_samples_split': 11,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Finally, we can make predictions and calculate the RMSE (root mean squared error) metric. The lower it is, the better the model predicts the target variable

In [6]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.4270844628644313
Test RMSE:  0.4555017324289047
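
For reference, RMSE is the square root of the mean squared difference between the true and the predicted values. How `ml.get_rmse` is implemented is not shown in this notebook, so the function below is only a plain-Python sketch of the formula, with `rmse` as a hypothetical name.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute RMSE of empty sequences")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions give 0.0
print(rmse([0.0, 0.0], [3.0, 4.0]))            # sqrt((9 + 16) / 2)
```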


You shouldn't stop at this point, as our goal is to improve the performance as much as we can. In this case, you can see that `max_depth` is equal to 10 and `n_estimators` is equal to 500, which are the upper limits of what we supplied to the `fit` method. That means that we can try higher ranges for these parameters and see if it makes RMSE even lower. Let us create an instance of the class once again and fit the model with different params
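
Checking which tuned values sit on the upper edge of their range can also be done programmatically. Below is a small sketch, with `params_at_upper_bound` as a hypothetical helper, applied to the search space and the resulting parameters of the first run shown above. Categorical parameters are only checked when all their choices are numbers.

```python
def params_at_upper_bound(best_params, search_space):
    """Return names of hyperparameters whose tuned value equals the largest allowed value."""
    hits = []
    for spec in search_space:
        name = spec[0]
        if len(spec) == 3:
            upper = spec[2]
        else:
            choices = spec[1]
            numeric = [v for v in choices if isinstance(v, (int, float))]
            if not numeric or len(numeric) != len(choices):
                continue  # non-numeric choices have no meaningful upper bound
            upper = max(numeric)
        if best_params.get(name) == upper:
            hits.append(name)
    return hits


search_space = [
    ("n_estimators", [100, 250, 500]),
    ("max_depth", 3, 10),
    ("min_samples_split", 10, 20),
    ("min_samples_leaf", 1, 10),
    ("max_features", ["auto", "sqrt"]),
]
# Tuned values taken from the `get_params()` output of the first run
best_params = {"n_estimators": 500, "max_depth": 10, "min_samples_split": 11,
               "min_samples_leaf": 2, "max_features": "sqrt"}

at_upper = params_at_upper_bound(best_params, search_space)
print(at_upper)  # ['n_estimators', 'max_depth']
```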

In [7]:
ml = MachineLearning('rf')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('n_estimators', [500, 600, 700]),
    ('max_depth', 10, 15),
    ('min_samples_split', 10, 20),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['auto', 'sqrt'])
)

Check out the hyperparameters

In [8]:
ml.model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 15,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 700,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Check out the metrics. With deeper trees and more estimators, Train RMSE dropped from about 0.427 to 0.367, while Test RMSE barely changed (from about 0.4555 to 0.4523), so the gap between them widened, which points to more overfitting. The model once again used the upper limits of `max_depth` and `n_estimators`, but since Test RMSE hardly improved, increasing these numbers further is unlikely to help much

In [9]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.36742415342131707
Test RMSE:  0.4522771323623893
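
One simple way to compare runs is the difference between Test RMSE and Train RMSE: the larger it is, the more the model fits the training data at the expense of generalization. The sketch below uses the values printed for the first two random forest runs; `rmse_gap` is a hypothetical helper.

```python
def rmse_gap(train_rmse, test_rmse):
    """Difference between test and train RMSE; larger means more overfitting."""
    return test_rmse - train_rmse


# (Train RMSE, Test RMSE) as printed above
runs = {
    "run 1, max_depth in 3..10": (0.4270844628644313, 0.4555017324289047),
    "run 2, max_depth in 10..15": (0.36742415342131707, 0.4522771323623893),
}

gaps = {label: rmse_gap(train, test) for label, (train, test) in runs.items()}
for label, gap in gaps.items():
    print(f"{label}: gap = {gap:.4f}")
```

The gap roughly triples between the two runs, while Test RMSE itself improves by only about 0.003.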


In [10]:
ml = MachineLearning('rf')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('n_estimators', [700, 800, 900, 1000]),
    ('max_depth', 15, 20),
    ('min_samples_split', 10, 20),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['auto', 'sqrt'])
)

In [11]:
ml.model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 15,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 700,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [12]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.3677849636549292
Test RMSE:  0.4518659789411234


In [42]:
ml = MachineLearning('rf')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('n_estimators', [700]),
    ('max_depth', [15]),
    ('min_samples_split', 2, 10),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['sqrt'])
)

In [43]:
ml.model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 15,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 700,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [44]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.3278128179884352
Test RMSE:  0.4514605491246648


In [45]:
ml.save()

In [19]:
ml = MachineLearning('rf')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('n_estimators', [700, 800, 900, 1000]),
    ('max_depth', [15]),
    ('min_samples_split', 2, 10),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['sqrt'])
)

In [20]:
ml.model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 15,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 900,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [21]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.33458745014242597
Test RMSE:  0.451524173294318


Once we have settled on a model, we should save it by running a cell with `ml.save()`. Make sure you are saving the right model, since `ml` always refers to the most recently created instance! I reran the chosen configuration before calling `ml.save()` to save this particular version of the regressor
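
What `ml.save()` does internally is not shown in this notebook. As an illustration of the general idea only, here is a minimal sketch of saving and restoring an object with `pickle`, the same module used above to load the datasets. The helpers `save_model` and `load_model`, the file name and the stand-in model are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize the model object to the given file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Read a previously saved model object back from the file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for a fitted model in this demo
stand_in_model = {"algorithm": "rf", "max_depth": 15, "n_estimators": 700}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "rf_model.pkl"
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Only unpickle files that you created yourself or that come from a source you trust, since loading a pickle can execute arbitrary code.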

In [24]:
ml = MachineLearning('xgb')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('loss', ['squared_error', 'absolute_error']),
    ('learning_rate', [0.001, 0.01, 0.1]),
    ('n_estimators', [100, 350, 500]),
    ('max_depth', [3, 15]),
    ('min_samples_split', 2, 10),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['sqrt', 'log2']),
    ('n_iter_no_change', [10]),
    ('tol', [0.0001]),
    ('validation_fraction', [0.25])
)

In [25]:
ml.model.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'squared_error',
 'max_depth': 3,
 'max_features': 'log2',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 8,
 'min_samples_split': 8,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 350,
 'n_iter_no_change': 10,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.25,
 'verbose': 0,
 'warm_start': False}

In [26]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.4488394429037245
Test RMSE:  0.4575576353763742


In [37]:
ml = MachineLearning('xgb')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('loss', ['squared_error', 'absolute_error']),
    ('learning_rate', [0.001, 0.01, 0.1]),
    ('n_estimators', [500, 750, 1000]),
    ('max_depth', [6, 15]),
    ('min_samples_split', 2, 10),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['sqrt', 'log2']),
    ('n_iter_no_change', [10]),
    ('tol', [0.0001]),
    ('validation_fraction', [0.25])
)

In [38]:
ml.model.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'squared_error',
 'max_depth': 6,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 9,
 'min_samples_split': 9,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_iter_no_change': 10,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.25,
 'verbose': 0,
 'warm_start': False}

In [39]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.4321055649810131
Test RMSE:  0.45437897778949027


In [40]:
ml.save()

In [31]:
ml = MachineLearning('xgb')
ml.fit(
    100, X_train, y_train, X_test, y_test,
    ('loss', ['squared_error', 'absolute_error']),
    ('learning_rate', [0.001, 0.01]),
    ('n_estimators', [1000]),
    ('max_depth', [6, 15]),
    ('min_samples_split', 2, 10),
    ('min_samples_leaf', 1, 10),
    ('max_features', ['sqrt', 'log2']),
    ('n_iter_no_change', [10]),
    ('tol', [0.0001]),
    ('validation_fraction', [0.25])
)

In [32]:
ml.model.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.01,
 'loss': 'squared_error',
 'max_depth': 6,
 'max_features': 'log2',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 8,
 'min_samples_split': 10,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 1000,
 'n_iter_no_change': 10,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.25,
 'verbose': 0,
 'warm_start': False}

In [33]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.4423452114147251
Test RMSE:  0.45711374253112563


In [46]:
ml = MachineLearning('linreg')
ml.fit(1, X_train, y_train, X_test, y_test)

In [47]:
ml.model.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'normalize': 'deprecated',
 'positive': False}

In [48]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.47465352491425084
Test RMSE:  0.47768408774812077


In [49]:
ml.save()

If you still have some issues with how the code works, you can use the help function to read the documentation

In [13]:
help(ml.fit)

Help on method fit in module utils.machine_learning:

fit(n_trials: int, X_train: numpy.ndarray, y_train: numpy.ndarray, X_test: numpy.ndarray, y_test: numpy.ndarray, *args) -> None method of utils.machine_learning.MachineLearning instance
    Find optimal hyperparameters and fit the model
    
    ## Parameters
    `n_trials` : int
        Number of optimization iterations
    
    `X_train`, `y_train`, `X_test`, `y_test` : np.ndarray
        Training and test datasets
    
    `*args` : tuple
        Tuples containing hyperparameters and their values. If a hyperparameter is:
        - Numeric, then the tuple passed is (hyperparameter_name, min, max)
        - Categorical, then the tuple passed is (hyperparameter_name, [val_1, val_2, ..., val_n])

