### Train Model

This notebook trains several candidate models, selects the best one based on a metric (MAE), and deploys it automatically.


 - Loads the data from a data location and extracts some more features for the prediction
 - Trains
     - a Linear Regression model
     - a Support Vector Regression (SVR) model
     - a Random Forest Regression model
     
 - Running the notebook should automatically store/deploy the best model (based on MAE) at a location/model registry
 
 - The web API is plugged into the model registry and will automatically pick up the deployed model
 

### 0. Set up on Colab

In [1]:
# # uncomment and run if working on Colab
# from google.colab import drive
# drive.mount('/content/drive')


In [2]:
# # uncomment and run if working on Colab
# !rm -rf mlcore
# !cp -r /content/drive/MyDrive/data/ data/
# !mkdir logs
# !mkdir models
# !unzip /content/drive/MyDrive/data/mlcore.zip
#!cd mlcore && pip install -e . && cd .. 

### 0. Load Env Variables (uncomment if not running on Docker)

In [3]:
# # run this if running locally; not required if you used the docker script
# #!pip install python-dotenv
# from dotenv import load_dotenv
# load_dotenv(dotenv_path = '../.env')

### 1. Import required packages

In [4]:
import pandas as pd
import seaborn as sns

from datetime import datetime
from mlcore.data_helper import load_data
from mlcore.utils import set_logger


In [5]:
ts = datetime.now()
nb_run_id = 'trng_'+ ts.strftime("%m_%d_%Y_%H_%M_%S")
training_logger = set_logger(nb_run_id)

### 2. Load data from given files (schemas)

A data_dict and a helper function are used to allow data loading from multiple files/data sources.

In [6]:
data_dict = {
    'AB_NYC_2019':None,
}


for schema_name in data_dict:
    data_dict[schema_name] = load_data(schema_name, logger = training_logger)

2022-04-30 06:41:29,062:Loaded schema AB_NYC_2019 in dataframe with shape (48895, 16)
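
The `load_data` helper lives in `mlcore` and its implementation is not shown in this notebook. As a rough sketch only, a helper of this kind could map a schema name to a CSV file and return a dataframe. The function name `load_schema_sketch`, the `data_dir` parameter and the `<schema_name>.csv` file layout are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_schema_sketch(schema_name, data_dir="data", logger=None):
    """Hypothetical stand-in for mlcore's load_data: reads <data_dir>/<schema_name>.csv."""
    csv_path = Path(data_dir) / f"{schema_name}.csv"
    df = pd.read_csv(csv_path)
    if logger is not None:
        logger.info("Loaded schema %s in dataframe with shape %s", schema_name, df.shape)
    return df


# Demo with a throwaway CSV so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    toy = pd.DataFrame({"id": [1, 2], "price": [149, 225]})
    toy.to_csv(Path(tmp_dir) / "toy_schema.csv", index=False)

    demo_dict = {"toy_schema": None}
    for schema_name in demo_dict:
        demo_dict[schema_name] = load_schema_sketch(schema_name, data_dir=tmp_dir)
```

Keeping the schema names as dictionary keys means another source can be added by adding one more key, which is the pattern the cell above follows.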


In [7]:
abnb_data = data_dict['AB_NYC_2019']
print(abnb_data.shape)
abnb_data.iloc[0]

(48895, 16)


id                                                              2539
name                              Clean & quiet apt home by the park
host_id                                                         2787
host_name                                                       John
neighbourhood_group                                         Brooklyn
neighbourhood                                             Kensington
latitude                                                    40.64749
longitude                                                  -73.97237
room_type                                               Private room
price                                                            149
minimum_nights                                                     1
number_of_reviews                                                  9
last_review                                               2018-10-19
reviews_per_month                                               0.21
calculated_host_listings_count    

### 3. Train

### 3.1  Set up experiment data/features

In [8]:
def remove_outliers(df, feature_name, lower_qtl=0.01, upper_qtl=0.99):
    # Keep only the rows strictly between the lower and upper quantile cutoffs
    cutoff_low = df[feature_name].quantile(lower_qtl)
    # Use the dataframe passed in, not the global abnb_data
    cutoff_high = df[feature_name].quantile(upper_qtl)

    df = df[(df[feature_name] < cutoff_high) & (df[feature_name] > cutoff_low)]
    return df
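
To see what the quantile cutoffs do, here is a small self-contained sketch on toy data. The function `trim_by_quantile` mirrors the logic of `remove_outliers` under a different name so the example can run on its own; the toy values 1 to 100 are made up for illustration.

```python
import pandas as pd


def trim_by_quantile(df, feature_name, lower_qtl=0.01, upper_qtl=0.99):
    # Same idea as remove_outliers: keep rows strictly inside the two cutoffs
    cutoff_low = df[feature_name].quantile(lower_qtl)
    cutoff_high = df[feature_name].quantile(upper_qtl)
    return df[(df[feature_name] > cutoff_low) & (df[feature_name] < cutoff_high)]


toy_prices = pd.DataFrame({"price": range(1, 101)})  # 1, 2, ..., 100
trimmed = trim_by_quantile(toy_prices, "price")

# With pandas' default linear interpolation the cutoffs are 1.99 and 99.01,
# so only the extreme values 1 and 100 are dropped.
print(len(trimmed), trimmed["price"].min(), trimmed["price"].max())
```

Because the comparisons are strict, rows that sit exactly on a cutoff are dropped as well.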
    

In [9]:
from sklearn.model_selection import train_test_split
cat_features = ['neighbourhood_group', 'room_type','neighbourhood']
num_features =['minimum_nights', 'calculated_host_listings_count' ]
long_lat = ['latitude', 'longitude']
text_features = ['name']
review_features = ['number_of_reviews', 'reviews_per_month']
target=['price']

training_logger.info('org_shape:{} '.format(abnb_data.shape))



features = cat_features + num_features + target +long_lat + text_features
abnb_data = abnb_data[features]
abnb_data.drop_duplicates(inplace=True)
abnb_data.dropna(inplace=True)

# Train after removing outliers (commenting out the loop below pushes up the error)
for ftr in ['price']:
    abnb_data = remove_outliers(abnb_data, ftr, lower_qtl=0.01, upper_qtl=0.99)
    training_logger.info('after outlier removal for feature::{} --> shape:{} '.format(ftr, abnb_data.shape))

#split randomly
train_data,test_data = train_test_split(abnb_data,test_size=0.2,shuffle=True,random_state=0)

# fillna returns a new DataFrame, so assign the result back
train_data = train_data.fillna(0)
test_data = test_data.fillna(0)


training_logger.info('train_shape:{} '.format(train_data.shape))
training_logger.info('test_shape:{}  '.format(test_data.shape))



2022-04-30 06:41:29,126:org_shape:(48895, 16) 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  abnb_data.drop_duplicates(inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  abnb_data.dropna(inplace=True)
2022-04-30 06:41:29,243:after outlier removal for feature::price --> shape:(47728, 9) 
2022-04-30 06:41:29,270:train_shape:(38182, 9) 
2022-04-30 06:41:29,271:test_shape:(9546, 9)  
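
The "A value is trying to be set on a copy of a slice" messages in the output above come from calling `inplace=True` methods on a column subset of another dataframe. One common way to avoid them, sketched here on made-up toy data, is to take an explicit `.copy()` of the subset and reassign the results instead of modifying in place. This is a suggestion, not what the cell above does.

```python
import pandas as pd

raw = pd.DataFrame({
    "price": [100, 100, None, 250],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    "host_name": ["a", "b", "c", "d"],
})

# Explicit copy: later operations no longer act on a slice of `raw`
subset = raw[["price", "room_type"]].copy()

# Reassign instead of using inplace=True
subset = subset.drop_duplicates().dropna()
print(subset.shape)
```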


### 3.2  Feature Transformation/Scaling pipeline set up

In [10]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline,FeatureUnion

from sklearn.pipeline import make_pipeline

In [11]:
# set up pipeline for classical models
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
tfidf_preprocesor = TfidfVectorizer()

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, cat_features),
    ('standard-scaler', numerical_preprocessor, num_features),
    ('tf-idf', tfidf_preprocesor, text_features[0])
    ])

preprocessor_notext = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, cat_features),
    ('standard-scaler', numerical_preprocessor, num_features),
    ])
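
One detail worth noting in the cell above: the text column is passed as a plain string (`text_features[0]`) rather than a list. `TfidfVectorizer` expects a 1-D sequence of documents, and `ColumnTransformer` hands over a 1-D column only when the column is given as a scalar string. The sketch below shows the same layout on a made-up three-row dataframe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_listings = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "minimum_nights": [1, 3, 2],
    "name": ["quiet room by the park", "sunny loft", "quiet loft"],
})

toy_preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ("standard-scaler", StandardScaler(), ["minimum_nights"]),
    # Scalar column name -> the vectorizer receives a 1-D column of strings
    ("tf-idf", TfidfVectorizer(), "name"),
])

toy_matrix = toy_preprocessor.fit_transform(toy_listings)
# 2 one-hot columns + 1 scaled column + 7 distinct tokens = 10 columns
print(toy_matrix.shape)
```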

### 3.3 Model Training and Evaluation

#### 3.3.1 Load previous models and new models which will be trained and compared

In [12]:
from mlcore.train_eval_helper_reg import *
from mlcore.modelops import load_model, save_model, read_data


In [13]:
mldbpath = '../data/mldbreg.sqlite'
deployed_model_info=None
deployed_model_obj = None
load_prev_model = False
#from tensorflow import keras

if load_prev_model:
    try:
        deployed_model_info = read_data(mldbpath, 'deployed_model').iloc[0].to_dict()
        if deployed_model_info:
            deployed_model_name = deployed_model_info['final_model_name']
            deployed_model_obj = load_model(deployed_model_name, cur_logger=training_logger)
            if deployed_model_obj['type'] == 'DL':
                # Import keras only when a deep-learning model has to be restored
                from tensorflow import keras
                actual_obj = keras.models.load_model('../models/' + deployed_model_name + '.deep_mdl')
                # Replace 'obj' only when a keras model was actually loaded
                deployed_model_obj['obj'] = actual_obj
            training_logger.info('Loaded model {}'.format(deployed_model_info))
    except Exception:
        training_logger.info('Could not load deployed model it may not exist')

#### 3.3.2 Set up the models to be trained

In [14]:
#set up Models
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor


models = {
        'LinReg' : {'obj': LinearRegression(),
             'param_grid':{},
             "type": 'classical',
             'preprocessor' : preprocessor,
             "features":cat_features+num_features + text_features
                 
                },

    'SVR' : {'obj': SVR(),
             'param_grid':{'svr__C': [0.1,1, 10, 100], 
                           'svr__gamma': [1,0.1,0.01,0.001],
                           'svr__kernel': ['rbf', 'poly', 'sigmoid']
                          },
             "type": "classical",
             "preprocessor" : preprocessor,
             "features":cat_features+num_features+text_features,
             
            },
    
    'RF' : {'obj': RandomForestRegressor(),
         'param_grid':{'randomforestregressor__max_depth': [10, 30, 60, 90,100],
                       'randomforestregressor__min_samples_leaf': [2, 4, 6,8,20],
                       'randomforestregressor__min_samples_split': [5, 10,15],
                       'randomforestregressor__n_estimators': [200, 300,600,900,1200],
                       'randomforestregressor__bootstrap' : [True, False],
                       'randomforestregressor__max_features' : ['auto', 'sqrt']
                       
                      },
         "type": "classical",
         "preprocessor" : preprocessor_notext,
         "features":cat_features+num_features,

        }    
        }
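
The `param_grid` keys above use prefixes such as `svr__` and `randomforestregressor__`. These match the step names that scikit-learn's `make_pipeline` generates (the lowercased class name), which suggests, though the notebook does not show it, that `train_models` wraps each estimator with `make_pipeline`. A minimal sketch of the naming convention, using `StandardScaler` as a placeholder first step:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr_pipe = make_pipeline(StandardScaler(), SVR())
rf_pipe = make_pipeline(StandardScaler(), RandomForestRegressor())

# Step names are derived from the lowercased class names
svr_steps = list(svr_pipe.named_steps)
rf_steps = list(rf_pipe.named_steps)
print(svr_steps, rf_steps)

# "<step name>__<parameter>" addresses a parameter of one step
svr_pipe.set_params(svr__C=10)
```

If the pipeline were built with explicit step names instead, the grid keys would have to use those names.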



In [15]:
metrics = {'mean_squared_error':0, 'mean_absolute_error':0, 'r2_score':0}
if deployed_model_obj:
    models['deployed_model'] = {
        'obj':deployed_model_obj['obj'],
        'param_grid':deployed_model_obj['param_grid'],
        'features':deployed_model_obj['features'],
        'preprocessor':deployed_model_obj['preprocessor'],
         'type':deployed_model_obj['type']
    }
    
for metric in metrics:
    for mdl in models:
        models[mdl][metric]=0

get_df_from_dict(models)

Unnamed: 0,index,obj,param_grid,type,preprocessor,features,mean_squared_error,mean_absolute_error,r2_score
0,LinReg,LinearRegression(),{},classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",0,0,0
1,SVR,SVR(),"{'svr__C': [0.1, 1, 10, 100], 'svr__gamma': [1...",classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",0,0,0
2,RF,RandomForestRegressor(),"{'randomforestregressor__max_depth': [10, 30, ...",classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",0,0,0


#### 3.3.3 Train Models

In [16]:
#Train Models (trained model is stored back in dict)
use_dask = True

# dask is still buggy
if use_dask:
    from dask.distributed import Client, progress
    dask_client = Client(processes=False, threads_per_worker=2,
                    n_workers=2, memory_limit='3GB')
else:
    dask_client = None

comparison_result_dict = train_models(models, 
                                      train_data, 
                                      target, 
                                      training_logger, 
                                      dask_client = dask_client, 
                                      randomized_search=True,
                                      scoring_metric = 'r2_score'
                                     )

#comparison_result_dict = train_models(classifiers, X_train,y_train, training_logger)
comparison_result = get_df_from_dict(comparison_result_dict, idxname='model')
comparison_result

2022-04-30 06:41:30,991:Training started for LinReg
2022-04-30 06:41:32,612:Training ended for LinReg
2022-04-30 06:41:32,613:Training started for SVR
  y = column_or_1d(y, warn=True)
2022-04-30 06:43:05,014:Training ended for SVR
2022-04-30 06:43:05,015:Training started for RF
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
2022-04-30 06:43:13,712:Training ended for RF


Unnamed: 0,model,obj,param_grid,type,preprocessor,features,mean_squared_error,mean_absolute_error,r2_score
0,LinReg,LinearRegression(),{},classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",0,0,0
1,SVR,SVR(),"{'svr__C': [0.1, 1, 10, 100], 'svr__gamma': [1...",classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",0,0,0
2,RF,"(DecisionTreeRegressor(max_features='auto', ra...","{'randomforestregressor__max_depth': [10, 30, ...",classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",0,0,0
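
`train_models` is an `mlcore` helper whose code is not shown here. As a hedged sketch of what `randomized_search=True` might correspond to, the block below runs scikit-learn's `RandomizedSearchCV` over a small pipeline on synthetic data. The estimator (`Ridge`), the data and the grid are illustrative choices, not the notebook's. Note that scikit-learn's own scorer string is `"r2"`; how the helper interprets `scoring_metric='r2_score'` is not known from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

pipe = make_pipeline(StandardScaler(), Ridge())
search = RandomizedSearchCV(
    pipe,
    param_distributions={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,          # sample 3 of the 4 candidate settings
    scoring="r2",      # scikit-learn's name for the R^2 scorer
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
best_pipeline = search.best_estimator_  # refit on all of X, y
```

A randomized search samples a fixed number of parameter combinations instead of trying all of them, which matters for grids as large as the Random Forest one above.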


#### 3.3.4 Test/Compare Models

In [17]:
# Compute error on test data (each metric is computed and stored back in the dict)
for metric in metrics:
    comparison_result_dict = test_models(models, test_data, target, training_logger,dask_client, error_metric=metric)
    #comparison_result_dict = test_models(classifiers, X_test,y_test, target)

comparison_result = get_df_from_dict(comparison_result_dict, idxname='model')
comparison_result

2022-04-30 06:43:13,760:Testing started for LinReg
2022-04-30 06:43:13,889:Testing ended for LinReg on metric mean_squared_error
2022-04-30 06:43:13,891:Testing started for SVR
2022-04-30 06:43:34,100:Testing ended for SVR on metric mean_squared_error
2022-04-30 06:43:34,102:Testing started for RF
2022-04-30 06:43:34,182:Testing ended for RF on metric mean_squared_error
2022-04-30 06:43:34,184:Testing started for LinReg
2022-04-30 06:43:34,264:Testing ended for LinReg on metric mean_absolute_error
2022-04-30 06:43:34,267:Testing started for SVR
2022-04-30 06:43:54,267:Testing ended for SVR on metric mean_absolute_error
2022-04-30 06:43:54,270:Testing started for RF
2022-04-30 06:43:54,348:Testing ended for RF on metric mean_absolute_error
2022-04-30 06:43:54,350:Testing started for LinReg
2022-04-30 06:43:54,433:Testing ended for LinReg on metric r2_score
2022-04-30 06:43:54,436:Testing started for SVR
2022-04-30 06:44:14,681:Testing ended for SVR on metric r2_score
2022-04-30 06:44:14

Unnamed: 0,model,obj,param_grid,type,preprocessor,features,mean_squared_error,mean_absolute_error,r2_score
0,LinReg,LinearRegression(),{},classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",6312.595211,49.39859,0.413396
1,SVR,SVR(),"{'svr__C': [0.1, 1, 10, 100], 'svr__gamma': [1...",classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",7118.395927,46.375446,0.338516
2,RF,"(DecisionTreeRegressor(max_features='auto', ra...","{'randomforestregressor__max_depth': [10, 30, ...",classical,ColumnTransformer(transformers=[('one-hot-enco...,"[neighbourhood_group, room_type, neighbourhood...",6299.533762,47.898508,0.41461
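
The three columns in the table correspond to standard scikit-learn metrics. As a small worked example on made-up numbers (not the notebook's data), they can be computed like this:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy prices, for illustration only
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 190, 270]

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mse, mae, r2)
```

MSE and MAE are errors, so lower is better; R² is a goodness-of-fit score, so higher is better.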


In [18]:
#Get best Model Based on a metric
metric = 'mean_absolute_error'
best_model_row = get_best_model(comparison_result, metric,reverse=True)
best_model = best_model_row['model']
best_model_id = models[best_model]['obj']
training_logger.info("Best performing model on basic of metric {} is {}".format(metric, best_model))
#best_model_row.to_dict()

2022-04-30 06:44:14,811:Best performing model on basic of metric mean_absolute_error is SVR
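
The internals of `get_best_model` and the meaning of its `reverse` argument are not shown in this notebook. The sketch below only illustrates the selection rule implied by the log line, using the test metrics copied from the comparison table above; `pick_best` is a hypothetical helper.

```python
import pandas as pd

# Test-set metrics copied from the comparison table above
results = pd.DataFrame({
    "model": ["LinReg", "SVR", "RF"],
    "mean_squared_error": [6312.595211, 7118.395927, 6299.533762],
    "mean_absolute_error": [49.39859, 46.375446, 47.898508],
    "r2_score": [0.413396, 0.338516, 0.41461],
})


def pick_best(results, metric):
    """Hypothetical selector: lowest value wins for errors, highest for r2_score."""
    if metric == "r2_score":
        idx = results[metric].idxmax()
    else:
        idx = results[metric].idxmin()
    return results.loc[idx, "model"]


best_by_mae = pick_best(results, "mean_absolute_error")
best_by_mse = pick_best(results, "mean_squared_error")
best_by_r2 = pick_best(results, "r2_score")
print(best_by_mae, best_by_mse, best_by_r2)
```

On these numbers SVR has the lowest MAE, while RF is ahead on both MSE and R², so the choice of metric changes which model gets deployed.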


In [19]:
#Save/Deploy final_trained_model
final_trained_model = best_model_row.to_dict()
final_model_name =final_trained_model['model']+'_'+nb_run_id.replace('trng','model')

#save model
model_type = final_trained_model['type']

save_model(final_trained_model, final_model_name, model_type=model_type)

deploy_df = pd.DataFrame([[final_model_name]], 
                         columns =['final_model_name'])
deploy_df


model stored at ../models/SVR_model_04_30_2022_06_41_28.mdl


Unnamed: 0,final_model_name
0,SVR_model_04_30_2022_06_41_28


#### 3.3.5 Update Model registry/Deploy Best

In [20]:
schema_dict = {
    'deployed_model':deploy_df,
    'hist_deployed_models':deploy_df,
    #'train_report':comparison_result
}


In [21]:
from mlcore.dbhelper import store_data, overwrite_data
for dkey in schema_dict:
    data_to_be_stored = schema_dict[dkey]
    if dkey=='deployed_model':
        overwrite_data(data_to_be_stored, mldbpath, dkey)
    else:
        store_data(data_to_be_stored, mldbpath, dkey)

dep_model_name = deploy_df['final_model_name'].iloc[0]
print('Model {} deployed \n associated reports saved in respective tables with id:{}'.format(dep_model_name,nb_run_id))

Model SVR_model_04_30_2022_06_41_28 deployed 
 associated reports saved in respective tables with id:trng_04_30_2022_06_41_28
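
`store_data` and `overwrite_data` come from `mlcore.dbhelper` and are not shown here. Assuming they append to and replace a SQLite table respectively, the registry behaviour can be sketched with pandas and an in-memory database. The run names below are made up for illustration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch

for run_name in ["LinReg_model_run1", "SVR_model_run2"]:
    deploy = pd.DataFrame({"final_model_name": [run_name]})
    # 'deployed_model' always holds exactly one row: the latest deployment
    deploy.to_sql("deployed_model", conn, if_exists="replace", index=False)
    # 'hist_deployed_models' keeps one row per deployment
    deploy.to_sql("hist_deployed_models", conn, if_exists="append", index=False)

current = pd.read_sql("SELECT final_model_name FROM deployed_model", conn)
history = pd.read_sql("SELECT final_model_name FROM hist_deployed_models", conn)
conn.close()

print(current["final_model_name"].iloc[0], len(history))
```

With this layout a consumer such as the web API only needs to read the single row in `deployed_model` to find the model to serve.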


In [22]:
training_logger.info('Traning job with id {} finished'.format(nb_run_id))

2022-04-30 06:44:15,007:Traning job with id trng_04_30_2022_06_41_28 finished
