# HPE Ezmeral Container Platform ML Ops - Lab 2
## Model Development

### About the Model
This tutorial uses an XGBoost-based Python model that classifies a person's income as either at most $50,000 or more than $50,000.

### Setup correct directory paths
Make sure that every \<userID\> reference in the directory paths for the models and data subfolders below matches your own userID. For example, replace data/UCI_Income/\<userID\> with data/UCI_Income/student$$I

### Test Connection to Training Cluster

First we'll test that the training cluster is indeed functioning properly. 

<b>%attachments</b> is a line magic command specific to the HPE Ezmeral ML Ops Jupyter notebooks. When creating a notebook, we have the option to attach it to a training cluster. This line magic command outputs a table with the name(s) of the training cluster(s) available for us to use. A tenant admin may have created multiple training clusters for different projects depending on the needs of the model or the size of the data, e.g. some with GPU nodes and others with CPUs only.

In [None]:
%attachments

In [None]:
userID="student$$I" 

To use the training cluster, we need to grab the name of the training cluster we want and feed it into another custom magic command.

In this lab, we are going to use the **pythonmldl** training cluster, which is the common cluster for all training jobs.

The Jupyter notebook will then send the contents of the cell to be executed on the training cluster. Any work that you have done in this notebook will not be propagated to the training cluster. Therefore, you will need to import the libraries and rewrite any code that needs to be executed on the training cluster.

The example cell below will execute a print statement on the training cluster.

In [None]:
%%pythonmldl

print('test')

The training cluster will send back a log URL that is unique to this user and notebook. You can use this URL with another custom line magic command to track the status of the job in real time.

Copy the URL output from the previous cell and paste it into the cell below, replacing the example URL after `--url`.

In [None]:
%logs --url http://hpecp-21.cplocal:10001/history/16

### Now we can start coding

In [None]:
import numpy as np
import pandas as pd
import os
import json
import seaborn as sns
sns.set(font_scale=1.5)

%matplotlib inline 

Here we define a function that returns the path to the Project Repository. Every cluster in the ML Ops tenant has read and write privileges to the Project Repository. There are two different ways to access the project repo:
1. You can copy the direct path from the HPE Ezmeral CP UI.
2. You can use the bdvcli command, as seen in the function below, to grab the path.

bdvcli is a custom command line tool used to obtain information about the HPE Ezmeral Container Platform (HPECP). See the [bdvcli documentation](https://github.com/bluedatainc/solutions/blob/master/bdvcli_commands/bdvcli_commands.md) for a list of commands.

In [None]:
def ProjectRepo(path):
    # Ask bdvcli for the Project Repository root, then append the relative path
    repo_root = os.popen('bdvcli --get cluster.project_repo').read().rstrip()
    return repo_root + '/' + path
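
As a rough illustration of what the function returns, the sketch below separates the path-joining step from the `bdvcli` call so it can be tried on any machine. The helper name `join_repo_path`, the root `/example/project_repo`, and the userID `student1` are all hypothetical. On a real cluster the root is whatever `bdvcli --get cluster.project_repo` prints.

```python
import posixpath

def join_repo_path(repo_root, relative_path):
    """Join the Project Repository root and a relative path with exactly one '/'."""
    return posixpath.join(repo_root.rstrip('/'), relative_path.lstrip('/'))

# Hypothetical root and userID, for illustration only; not a real cluster path.
example_root = '/example/project_repo'
example_path = join_repo_path(example_root, 'data/UCI_Income/student1/adult_data.csv')
print(example_path)
```

Stripping the slashes before joining avoids a doubled `//` if the root happens to end with a slash.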

# Data preprocessing

In [None]:
strpath= "data/UCI_Income/" + userID + "/adult_data.csv"
train_file = ProjectRepo(strpath)
train_set = pd.read_csv(train_file, header=None)
train_set.head()

In [None]:
strpath= "data/UCI_Income/" + userID + "/adult_test.csv"
test_file = ProjectRepo(strpath)
test_set = pd.read_csv(test_file, skiprows=1, header=None)
test_set.head()

## Initial Findings
1. No column headers in the data (can fix using dataset description from website)
2. Some "?" in test data 
3. Target values differ in train and test set
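
Findings 2 and 3 can also be checked programmatically rather than by eye. The sketch below uses two tiny made-up DataFrames (not the real files) that mimic the raw data, including the leading space in front of each value.

```python
import pandas as pd

# Tiny made-up samples that mimic the raw files; values are illustrative only.
train_sample = pd.DataFrame({'workclass': [' Private', ' ?', ' State-gov'],
                             'wage_class': [' <=50K', ' >50K', ' <=50K']})
test_sample = pd.DataFrame({'workclass': [' Private', ' ?', ' Private'],
                            'wage_class': [' <=50K.', ' >50K.', ' <=50K.']})

# Finding 2: count the ' ?' placeholders in each column.
missing_counts = (test_sample == ' ?').sum()
print(missing_counts)

# Finding 3: target labels that appear in one set but not in the other.
label_mismatch = set(train_sample['wage_class']) ^ set(test_sample['wage_class'])
print(sorted(label_mismatch))
```

Running the same two checks on `train_set` and `test_set` should show which columns contain placeholders and confirm the trailing period on the test labels.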

#### 1. Fix column headers

In [None]:
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 
              'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
             'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels
train_set.info()
test_set.info()

#### 2. Clean up ? in data

In [None]:
train_set.replace(' ?', np.nan).dropna().shape
test_set.replace(' ?', np.nan).dropna().shape
# removing rows with "?" from our dataframes 
train_no_missing = train_set.replace(' ?', np.nan).dropna()
test_no_missing = test_set.replace(' ?', np.nan).dropna()

#### 3. Fix targets (remove the extra periods from '<=50K.' to '<=50K')

In [None]:
test_no_missing['wage_class'] = test_no_missing.wage_class.replace({' <=50K.' : ' <=50K', ' >50K.' : ' >50K'})
test_no_missing.wage_class.unique()
train_no_missing.wage_class.unique()

## Applying ordinal encoding to categoricals
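- ordinal encoding: convert string labels to integer codes. The code below uses pandas Categorical codes, which run from 0 through k-1 for a column with k unique values and are assigned in sorted order of those values: the first value in sorted order becomes 0, the second becomes 1, the third becomes 2, and so on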
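
A small sketch of how `pd.Categorical` assigns codes, using made-up labels. Note that the codes are zero-based and follow the sorted order of the labels, not the order in which they first appear.

```python
import pandas as pd

# Made-up labels for illustration, with the leading space seen in the raw data.
labels = pd.Series([' Private', ' State-gov', ' Private', ' Local-gov'])
categorical = pd.Categorical(labels)

print(list(categorical.categories))  # sorted unique labels
print(list(categorical.codes))       # position of each row's label in that list
```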


In [None]:
#combine the datasets together first
combined_set = pd.concat([train_no_missing, test_no_missing], axis=0)
combined_set.info()
#Visualizations after initial cleaning of dataset 
group = combined_set.groupby('wage_class')
group
#encode non-numerical features into numeric values using pandas Categorical codes 
#and record the label -> code mapping of each feature in a dictionary
cat_codes = {}
for feature in combined_set.columns: 
    if combined_set[feature].dtype == 'object':
        #cat_codes[feature] : { stripped label : integer code }
        categorical = pd.Categorical(combined_set[feature])
        cat_codes[feature] = {label.strip(): code 
                              for code, label in enumerate(categorical.categories)}
        combined_set[feature] = categorical.codes
combined_set.info()
# saving encoding to json file to be used for scoring script
strpath= "data/UCI_Income/" + userID + "/encoding.json"
json_file = ProjectRepo(strpath)
with open(json_file, 'w') as file:
    json.dump(cat_codes, file)
#split combined set back into test/train split 
#(copy the slices so that pop() below does not operate on a view of combined_set)
final_train = combined_set[:train_no_missing.shape[0]].copy()
final_test = combined_set[train_no_missing.shape[0]:].copy()
strpath= "data/UCI_Income/" + userID + "/adult_train_cleaned.csv"
final_train.to_csv(ProjectRepo(strpath))
strpath= "data/UCI_Income/" + userID + "/adult_test_cleaned.csv"
final_test.to_csv(ProjectRepo(strpath))
#extracting target values from our test and train sets 
y_train = final_train.pop('wage_class')
y_test = final_test.pop('wage_class')
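
To show why the encoding is saved, here is a minimal sketch of how a scoring script might apply it to a raw record. The mapping below is a made-up stand-in in the same shape as `encoding.json` (feature name, then stripped label, then code), and `encode_record` is a hypothetical helper, not part of the lab.

```python
import json

# Hypothetical mapping in the same shape as encoding.json; the codes are made up.
encoding_text = '{"sex": {"Female": 0, "Male": 1}, "workclass": {"Private": 0, "State-gov": 1}}'
cat_codes = json.loads(encoding_text)

def encode_record(record, cat_codes):
    """Replace string values with their saved integer codes; leave numbers as-is."""
    encoded = {}
    for feature, value in record.items():
        if feature in cat_codes:
            # Keys were stripped before saving, so strip the raw value too.
            encoded[feature] = cat_codes[feature][value.strip()]
        else:
            encoded[feature] = value
    return encoded

encoded = encode_record({'age': 39, 'workclass': ' State-gov', 'sex': ' Male'}, cat_codes)
print(encoded)
```

A label that was never seen during training raises a `KeyError` here, which is usually preferable to silently scoring a wrong code.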

# Model Development

### First model

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
cv_params = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed': 0, 'subsample' : 0.8, 'colsample_bytree': 0.8, 
              'objective': 'binary:logistic'}

#optimizing for accuracy, GBM = gradient boosting model
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                             cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1)
optimized_GBM.fit(final_train, y_train)
optimized_GBM.cv_results_
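
To get a feel for how much work the grid search above does, the sketch below enumerates the candidate combinations with plain Python. It only counts the fits and does not train anything.

```python
from itertools import product

cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
n_folds = 5

# Every combination of the listed values is one candidate model.
names = list(cv_params)
candidates = [dict(zip(names, values)) for values in product(*cv_params.values())]

n_cv_fits = len(candidates) * n_folds
print(len(candidates), 'candidates,', n_cv_fits, 'cross-validation fits')
```

With 1000 estimators per fit this adds up quickly, which is one reason to keep each grid small. After fitting, `optimized_GBM.best_params_` and `optimized_GBM.best_score_` report the winning combination and its mean accuracy.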

### Second model
Tuning other hyperparameters in an attempt to achieve higher mean accuracy

In [None]:
cv_params = {'learning_rate': [0.1, 0.01], 'subsample': [0.7, 0.8, 0.9]}
ind_params = {'n_estimators': 1000, 'seed': 0, 'colsample_bytree': 0.8, 'objective': 'binary:logistic', 
              'max_depth': 3, 'min_child_weight': 1}
                    
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                             cv_params, 
                             scoring = 'accuracy', cv=5, n_jobs=-1)
optimized_GBM.fit(final_train, y_train)
optimized_GBM.cv_results_

### Third model
Use XGBoost's built-in cross-validation (`xgb.cv`), which supports early stopping to prevent overfitting.

In [None]:
xgdmat = xgb.DMatrix(final_train, y_train)
our_params = {'eta': 0.1, 'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic',
              'max_depth': 3, 'min_child_weight': 1}

cv_xgb = xgb.cv(params=our_params, dtrain=xgdmat, num_boost_round=3000, metrics=['error'],
                early_stopping_rounds=100)
print('Best iteration:', len(cv_xgb))
cv_xgb.tail(5)
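
The idea behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration with made-up error values, not XGBoost's actual implementation: training stops once the error has failed to improve for `patience` consecutive rounds, and the best round seen so far is kept.

```python
def best_round(errors, patience):
    """Return the 1-based round with the lowest error, stopping once
    `patience` consecutive rounds fail to improve on the best so far."""
    best_idx = 0
    for i, err in enumerate(errors):
        if err < errors[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx + 1

# Hypothetical per-round validation errors, for illustration only.
errors = [0.20, 0.17, 0.15, 0.16, 0.15, 0.16, 0.14, 0.18]
stopped_at = best_round(errors, patience=3)
print(stopped_at)
```

With a patience of 3 the loop stops before it ever reaches the lower error in round 7, which shows the trade-off: a small patience saves time but can miss later improvements.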

### Final Model

In [None]:
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth':3, 'min_child_weight':1} 

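# num_boost_round is expected to match the 'Best iteration' value printed by the xgb.cv cell above; 
# adjust 326 if your run reports a different number
final_gb = xgb.train(our_params, xgdmat, num_boost_round = 326)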

# Plot feature importances

In [None]:
xgb.plot_importance(final_gb)
importances = final_gb.get_fscore()
importances
importance_frame = pd.DataFrame({'Importance': list(importances.values()), 'Feature': list(importances.keys())})
importance_frame.sort_values(by = 'Importance', inplace=True)
importance_frame.plot(kind='barh', x='Feature', figsize=(8,8), color='green')

# Build model remotely on a distributed Python training cluster

This training job combines the cells we've worked on previously into one large cell. At the end, we will save the model into the Project Repository. 

Make sure you fill in your <b>training cluster name</b> in the magic command at the top of the cell! 

In [None]:
%%pythonmldl
userID="student$$I"
# Importing libraries 
print("Importing libraries")
import numpy as np
import pandas as pd
import os
import pickle
import xgboost as xgb
import datetime
from sklearn.model_selection import GridSearchCV

# Start time 
print("Start time: ", datetime.datetime.now())

# Project repo path function
def ProjectRepo(path):
    # Ask bdvcli for the Project Repository root, then append the relative path
    repo_root = os.popen('bdvcli --get cluster.project_repo').read().rstrip()
    return repo_root + '/' + path

# Reading in data 
print("Reading in data")
strpath= "data/UCI_Income/" + userID + "/adult_train_cleaned.csv"
train = pd.read_csv(ProjectRepo(strpath))
print("Done reading in data")

# Extracting target values 
y_train = train.pop('wage_class')
train.pop('Unnamed: 0')

# Model development / Training
print("Training...")
xgdmat = xgb.DMatrix(train, y_train)
our_params = {'eta': 0.1, 'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'binary:logistic',
              'max_depth': 3, 'min_child_weight': 1}
cv_xgb = xgb.cv(params=our_params, dtrain=xgdmat, num_boost_round=3000, metrics=['error'],
                early_stopping_rounds=100)
optimal_rounds = len(cv_xgb)
final_gb = xgb.train(our_params, xgdmat, num_boost_round = optimal_rounds)

# Save model into project repo
print("Saving model")
strpath= "models/XGB_Income/" + userID + "/XGB.pickle.dat"
final_gb.save_model(ProjectRepo(strpath))

# Finish time
print("End time: ", datetime.datetime.now())

Copy the unique log URL and paste it into the cell below.

In [None]:
%logs --url http://hpecp-21.cplocal:10001/history/20

# Testing Models
Here we are going to test that the model prediction is as expected.

1. We're going to test the model that we've created here locally.
2. Then we will test the model that has been saved in the Project Repository.
3. Finally, we validate that the two predictions are the same.

We will take the last row of the adult_test_cleaned dataset and score it first with the local model, then with the model loaded from the Project Repository.

In [None]:
strpath= "data/UCI_Income/" + userID + "/adult_test_cleaned.csv"
cleaned = pd.read_csv(ProjectRepo(strpath))
cleaned.tail(1)
temp = cleaned.tail(1).copy()  # copy so that pop() does not act on a view of cleaned
y_test = temp.pop('wage_class')
temp.pop('Unnamed: 0')  # drop the index column that to_csv wrote out
mat = xgb.DMatrix(temp) 
y_pred = final_gb.predict(mat)
y_pred

In [None]:
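model = xgb.Booster()  # empty booster; the trained model is loaded from file below
strpath= "models/XGB_Income/" + userID + "/XGB.pickle.dat"
model.load_model(ProjectRepo(strpath))
temp = cleaned.tail(1).copy()  # copy so that pop() does not act on a view of cleaned
y_test = temp.pop('wage_class')
temp.pop('Unnamed: 0')  # drop the index column that to_csv wrote out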
mat = xgb.DMatrix(temp) 
y_pred = model.predict(mat)
y_pred

Validate that the two numbers are the same.
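
One way to make this check explicit is to keep the two predictions under separate names and compare them with a tolerance. The values below are hypothetical stand-ins. In the notebook they would be the first element of the arrays returned by `final_gb.predict(mat)` and `model.predict(mat)`.

```python
import math

# Hypothetical predictions, for illustration only.
local_pred = 0.0374
repo_pred = 0.0374

same = math.isclose(local_pred, repo_pred, rel_tol=1e-6)

# With the 'binary:logistic' objective the output is a probability,
# so a 0.5 threshold (an assumed cutoff) turns it into a class code.
predicted_class = int(local_pred > 0.5)
print(same, predicted_class)
```

The class code can then be compared with `y_test` for that row to see whether the prediction is correct.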