To get started, install the project dependencies by running the cell below. The leading exclamation mark tells the notebook to execute the line as a terminal command

In [11]:
! pip install -r requirements.txt



Here we import the necessary packages. Note that `utils` is a local folder rather than an installed package. This folder must contain the file "machine_learning.py", from which we import the `MachineLearning` class. We also suppress warnings, since they clutter the output

In [1]:
import pickle, os, warnings, optuna, numpy as np
from utils.machine_learning import MachineLearning
warnings.filterwarnings('ignore')



Here we cycle through the files in the data folder and load only the training and test datasets

In [2]:
data_sets = []
directory = 'data'

# Sort so the load order is deterministic: X_test, X_train, y_test, y_train
files = sorted(os.listdir(directory))
# Skip the raw CSV and X_cluster.pkl so that only the pickled train/test splits remain
files.remove('AB_NYC_2019.csv')
files.remove('X_cluster.pkl')
for file in files:
    with open(os.path.join(directory, file), 'rb') as f:
        data_sets.append(pickle.load(f))
X_test, X_train, y_test, y_train = data_sets[:4]
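
The cell above depends on the alphabetical order of the file names. As a possible alternative, here is a minimal sketch that loads each pickle by name, so nothing depends on sort order. The helper `load_named_pickles` and the `<name>.pkl` naming scheme are assumptions made for illustration; they are not part of the project. The demo writes throwaway files to a temporary folder so that it runs anywhere.

```python
import pickle
import tempfile
from pathlib import Path


def load_named_pickles(directory, names):
    """Load `<name>.pkl` for every requested name and return a dict keyed by name.

    Hypothetical helper: assumes each dataset is stored as `<name>.pkl`.
    """
    loaded = {}
    for name in names:
        path = Path(directory) / f"{name}.pkl"
        with open(path, "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded


# Demo with throwaway files; the values are placeholders, not project data.
with tempfile.TemporaryDirectory() as tmp:
    for name, value in {"X_train": [[1, 2]], "y_train": [3]}.items():
        with open(Path(tmp) / f"{name}.pkl", "wb") as f:
            pickle.dump(value, f)
    demo = load_named_pickles(tmp, ["X_train", "y_train"])

print(demo)
```

If the project files follow that naming scheme, the call would look like `load_named_pickles('data', ['X_train', 'y_train', 'X_test', 'y_test'])`.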

In [133]:
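import pandas as pd  # required by read_csv below

#dfer = pd.DataFrame(data_sets)
#dfer.describe()
# The raw CSV sits in the data folder next to the pickled splits (see the cell above)
datae = pd.read_csv(os.path.join(directory, 'AB_NYC_2019.csv'))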

Here we set the random seed for reproducibility and create an instance of the `MachineLearning` class. As input it accepts one of the following algorithm names:
- [linreg](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
    - No sensible hyperparameters to tune
- [knn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
    - **Andreas:** Start with ('n_neighbors', 1, 10), ('weights', ['uniform', 'distance'])
- [tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
    - **Waqar:** Start with ('max_depth', 3, 12), ('min_samples_split', 10, 20), ('min_samples_leaf', 1, 10), ('max_features', ['auto', 'sqrt'])
- [rf](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [xgb](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)

You can click an algorithm name to see which parameters it offers for tuning

In [3]:
np.random.seed(510)

ml = MachineLearning('tree')

This is the most important part of the notebook: here we tune the hyperparameters and train the model. You supply the number of tuning iterations, the training and test datasets (in exactly this order: `X_train`, `y_train`, `X_test`, `y_test`), and the hyperparameters to tune (in any order). A few notes about hyperparameters:
- They can be either numeric or categorical
- To supply a numeric (integer or float) parameter, pass the tuple (hyperparameter_name, min, max)
- To supply a categorical parameter, pass the tuple (hyperparameter_name, [val_1, val_2, ..., val_n])
- You may supply as many hyperparameters as you want, from one up to as many as the algorithm has

Training might take quite a while if you pass many hyperparameters with wide value ranges. However, the algorithm does not try every possible combination of hyperparameters: it discards unpromising combinations early, so do not be afraid to experiment!

In [4]:
ml.fit(100, X_train, y_train, X_test, y_test, ('max_depth', 8, 12), ('min_samples_split', 20, 25), ('min_samples_leaf', 6, 10), ('max_features', ['auto', 'log2']))
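
To make the tuple convention concrete, here is a minimal sketch of how such tuples could be turned into one candidate set of hyperparameters. The internals of `MachineLearning.fit` are not shown in this notebook. Since `optuna` is imported, the real class presumably delegates the search to it, but that is an assumption. This sketch uses plain random sampling from the standard library only to illustrate how the two tuple shapes are told apart. The function `sample_params` is hypothetical.

```python
import random


def sample_params(specs, rng):
    """Draw one value per hyperparameter.

    Each spec is either (name, min, max) for a numeric parameter
    or (name, [val_1, ..., val_n]) for a categorical one.
    """
    params = {}
    for spec in specs:
        name = spec[0]
        if len(spec) == 2:
            # Categorical: pick one of the listed values
            params[name] = rng.choice(spec[1])
        else:
            _, low, high = spec
            if isinstance(low, int) and isinstance(high, int):
                # Integer range, both ends inclusive
                params[name] = rng.randint(low, high)
            else:
                # Float range
                params[name] = rng.uniform(low, high)
    return params


rng = random.Random(510)
specs = [
    ("max_depth", 8, 12),
    ("min_samples_split", 20, 25),
    ("max_features", ["auto", "log2"]),
]
trial = sample_params(specs, rng)
print(trial)
```

A real tuner repeats this draw `n_trials` times, scores each candidate on the test data, and keeps the best one. A smarter sampler also uses earlier results to decide where to look next.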

After the training is done, we can check out the hyperparameters the model was trained with

In [5]:
ml.model.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 9,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 10,
 'min_samples_split': 25,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

Finally, we can make predictions and calculate the RMSE (root mean squared error) metric. The lower it is, the better the model predicts the target variable

In [6]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.4418847804226296
Test RMSE:  0.46730216692304793
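
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The implementation of `ml.get_rmse` is not shown in this notebook, so the sketch below is the textbook definition written in plain Python, not necessarily the project's code. The function name `rmse` and the sample numbers are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One prediction is off by 2, the other two are exact: sqrt(4 / 3)
example = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
perfect = rmse([1.0, 2.0], [1.0, 2.0])  # 0.0 when every prediction is exact
print(example, perfect)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.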


In [7]:
ml.save()

You shouldn't stop at this point, as our goal is to improve the performance as much as we can. For the KNN model, for example, `n_neighbors` came out equal to 10, which is the upper limit of the range we supplied to the `fit` method. That means we can try a higher range of neighbors and see whether it lowers RMSE further. Let us create an instance of the class once again and fit the model with different params

In [8]:
ml = MachineLearning('knn')
ml.fit(100, X_train, y_train, X_test, y_test, ('n_neighbors', 10, 20), ('weights', ['uniform', 'distance']))

In [102]:
# Hyperparameter ranges for tuning
# Note: scikit-learn only accepts min_weight_fraction_leaf values in [0.0, 0.5]

parameters = {"splitter": ["best", "random"],
              "max_depth": [1, 3, 5, 7, 9, 11, 12],
              "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              "min_weight_fraction_leaf": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
              "max_features": ["auto", "log2", "sqrt", None],
              "max_leaf_nodes": [None, 10, 20, 30, 40, 50, 60, 70, 80, 90]}

In [140]:
# Set up a grid search over the decision tree hyperparameters
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

reg_decision_model = DecisionTreeRegressor()
tuning_model = GridSearchCV(reg_decision_model, param_grid=parameters,
                            scoring='neg_mean_squared_error', cv=3, verbose=3)


In [142]:
# Features: every column except the last; target: the last column
X = datae.iloc[:, :-1]
y = datae.iloc[:, -1]

In [None]:
#tuning_model.fit(X, y)  # Tunes the hyperparameters; the search itself runs (log below)

Fitting 3 folds for each of 50400 candidates, totalling 151200 fits
[CV 1/3] END max_depth=1, max_features=auto, max_leaf_nodes=None, min_samples_leaf=1, min_weight_fraction_leaf=0.1, splitter=best;, score=nan total time=   0.0s
[CV 2/3] END max_depth=1, max_features=auto, max_leaf_nodes=None, min_samples_leaf=1, min_weight_fraction_leaf=0.1, splitter=best;, score=nan total time=   0.0s
[CV 3/3] END max_depth=1, max_features=auto, max_leaf_nodes=None, min_samples_leaf=1, min_weight_fraction_leaf=0.1, splitter=best;, score=nan total time=   0.0s
[CV 1/3] END max_depth=1, max_features=auto, max_leaf_nodes=None, min_samples_leaf=1, min_weight_fraction_leaf=0.1, splitter=random;, score=nan total time=   0.0s
[CV 2/3] END max_depth=1, max_features=auto, max_leaf_nodes=None, min_samples_leaf=1, min_weight_fraction_leaf=0.1, splitter=random;, score=nan total time=   0.0s
[CV 3/3] END max_depth=1, max_features=auto, max_leaf_nodes=None, min_samples_leaf=1, min_weight_fraction_leaf=0.1, splitte
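
The candidate count in the log can be checked by hand, since a grid search tries every combination of the listed values. The sketch below recomputes it from the `parameters` grid defined earlier and flags the `min_weight_fraction_leaf` values that lie outside the range scikit-learn documents for that parameter, [0.0, 0.5]. Those out-of-range values alone do not explain the `nan` scores shown for 0.1, so the cause probably also lies elsewhere (for instance non-numeric columns in `X`, which is only a guess). Passing `error_score='raise'` to `GridSearchCV` makes the underlying exception visible.

```python
import math

# Same grid as in the cell above
parameters = {
    "splitter": ["best", "random"],
    "max_depth": [1, 3, 5, 7, 9, 11, 12],
    "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "min_weight_fraction_leaf": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "max_features": ["auto", "log2", "sqrt", None],
    "max_leaf_nodes": [None, 10, 20, 30, 40, 50, 60, 70, 80, 90],
}

# Every combination is one candidate: 2 * 7 * 10 * 9 * 4 * 10
n_candidates = math.prod(len(values) for values in parameters.values())
n_fits = n_candidates * 3  # cv=3 folds per candidate

# Values above 0.5 are outside the documented range for this parameter
out_of_range = [v for v in parameters["min_weight_fraction_leaf"] if v > 0.5]

print(n_candidates)  # 50400
print(n_fits)        # 151200
print(out_of_range)  # [0.6, 0.7, 0.8, 0.9]
```

With this many fits, a narrower grid or a sampling-based search such as the `ml.fit` approach used earlier is usually the more practical choice.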

In [149]:
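# Raises the NameError below: the search object defined above is named tuning_model, not grid_search
grid_search.best_params_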

NameError: name 'grid_search' is not defined

In [163]:
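# best_params_ is only set once tuning_model.fit(X, y) has completed.
# The fit call above is commented out, so this lookup fails as shown below.
print(tuning_model.best_params_)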


AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

Check out the hyperparameters

In [9]:
ml.model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 20,
 'p': 2,
 'weights': 'distance'}

Check out the metrics. With the higher range of neighbors the model is overfitted: Train RMSE is 0 (the model predicts every training example perfectly), while Test RMSE is barely any lower. With `weights='distance'` a zero training error is typically expected, since each training point is its own nearest neighbor at distance zero. The model once again chose the upper limit of `n_neighbors`, but since the test error hardly changes, increasing this number further does not make much sense, and we can stick to the previous version

In [10]:
pred_train = ml.predict(X_train)
pred_test = ml.predict(X_test)
print(f'Train RMSE: {ml.get_rmse(y_train, pred_train)}\nTest RMSE:  {ml.get_rmse(y_test, pred_test)}')

Train RMSE: 0.0
Test RMSE:  0.46821311923555525
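
This notebook repeatedly checks whether a tuned value landed on the upper limit of its range. That check can be automated. The following is a minimal sketch; `params_at_upper_limit` is a hypothetical helper, not part of `MachineLearning`. It takes the dictionary returned by `ml.model.get_params()` and the tuples passed to `fit`, and lists the numeric hyperparameters that hit the top of their range.

```python
def params_at_upper_limit(best_params, specs):
    """Return names of numeric hyperparameters whose tuned value equals the range maximum."""
    hits = []
    for spec in specs:
        if len(spec) != 3:
            continue  # categorical specs have no upper limit
        name, _low, high = spec
        if best_params.get(name) == high:
            hits.append(name)
    return hits


# KNN run from this notebook: n_neighbors was tuned in [10, 20] and came out as 20
knn_specs = [("n_neighbors", 10, 20), ("weights", ["uniform", "distance"])]
knn_params = {"n_neighbors": 20, "weights": "distance"}
knn_hits = params_at_upper_limit(knn_params, knn_specs)
print(knn_hits)  # ['n_neighbors']

# Decision tree run from this notebook
tree_specs = [
    ("max_depth", 8, 12),
    ("min_samples_split", 20, 25),
    ("min_samples_leaf", 6, 10),
    ("max_features", ["auto", "log2"]),
]
tree_params = {"max_depth": 9, "min_samples_split": 25, "min_samples_leaf": 10, "max_features": "auto"}
tree_hits = params_at_upper_limit(tree_params, tree_specs)
print(tree_hits)  # ['min_samples_split', 'min_samples_leaf']
```

A hit only suggests that a wider range is worth trying. As the KNN run shows, the test RMSE decides whether the wider range actually helps.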


In [11]:
ml = MachineLearning('knn')
ml.fit(100, X_train, y_train, X_test, y_test, ('n_neighbors', 1, 10), ('weights', ['uniform', 'distance']))

Once we have settled on a model, we should save it by running this cell. Make sure you are saving the right model! I reran the first model in the cell above so that this particular version of the KNN regressor is the one that gets saved

In [12]:
ml.save()

If anything about how the code works is still unclear, you can use the `help` function to read the documentation

In [13]:
help(ml.fit)

Help on method fit in module utils.machine_learning:

fit(n_trials: int, X_train: numpy.ndarray, y_train: numpy.ndarray, X_test: numpy.ndarray, y_test: numpy.ndarray, *args) -> None method of utils.machine_learning.MachineLearning instance
    Find optimal hyperparameters and fit the model
    
    ## Parameters
    `n_trials` : int
        Number of optimization iterations
    
    `X_train`, `y_train`, `X_test`, `y_test` : np.ndarray
        Training and test datasets
    
    `*args` : tuple
        Tuples containing hyperparameters and their values. If a hyperparameter is:
        - Numeric, then the tuple passed is (hyperparameter_name, min, max)
        - Categorical, then the tuple passed is (hyperparameter_name, [val_1, val_2, ..., val_n])

