# LAB 2 Boosting Machine Learning Algorithm

## Introduction
Reference: Danish Haroon, "Python Machine Learning Case Studies", Chapter 5, 2017.

This workshop reviews the classification methods used in data mining to extract insight from a complicated data set, for use in later A/B test applications.

The case concerns a pediatric surgeon and clinic supervisor at Ohio Clinic who was in serious trouble: the clinic was facing losses for the third consecutive year.

The supervisor had recently been promoted to this position, but she knew for a fact that the clinic had been doing due diligence in terms of efficiency. What surprised her most was that the hospital was incurring losses despite having the finest doctors available and no lack of scheduled appointments. 

She obtained the appointment log data and found why losses were mounting even though the number of appointments was rising: patients were not showing up at their scheduled times, which disrupted the appointments of the patients who followed, led to considerable staff overtime, and raised costs.

She believed that knowing which patients were likely not to show up would enable the hospital to take countermeasures to minimize the overtime work costs.

### Python libraries:

numpy and time for general-purpose numerical and timing functions

pandas for data file or database manipulation

IPython and matplotlib for data visualization

statistics, sklearn, and scipy for statistical and mathematical formulas and functions

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
from IPython.display import Image
from matplotlib.pylab import rcParams
import pandas_profiling as pdf
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn import kernel_approximation
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.kernel_approximation import (RBFSampler,Nystroem)
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rcParams['figure.figsize'] = 15, 5

In [None]:
data = pd.read_csv('No-show-Issue-Comma-300k.csv')
data.head()

In [None]:
data.shape

In [None]:
for column in list(data.columns):
    print("{0:25} {1:}".format(column, data[column].nunique()))

### Data Wrangling 

In [None]:
data[data['Age'] < 0]['Age'].value_counts().sum()

In [None]:
data = data[data['Age'] >= 0]

In [None]:
del data['Handcap']

In [None]:
data['AwaitingTime'] = data['AwaitingTime'].apply(lambda x: abs(x))
dow_mapping = {'Monday' : 0, 'Tuesday' : 1, 'Wednesday' : 2, 'Thursday' : 3, 'Friday' : 4, 'Saturday' : 5, 'Sunday' : 6}
data['DayOfTheWeek'] = data['DayOfTheWeek'].map(dow_mapping)

In [None]:
data['Alcoholism'] = data['Alcoolism']
del data['Alcoolism']
data['HyperTension'] = data['HiperTension']
del data['HiperTension']
data['AppointmentDate'] = data['ApointmentData']
del data['ApointmentData']

In [None]:
for field in ['Gender', 'Status']:
    data[field] = pd.Categorical(data[field]).codes

In [None]:
data.head()

In [None]:
def features_plots(discrete_vars):

    plt.figure(figsize=(15,24.5))

    for i, cv in enumerate(['Age']):
        plt.subplot(7, 2, i+1)
        plt.hist(data[cv], bins=len(data[cv].unique()))
        plt.title(cv)
        plt.ylabel('Frequency')

    for i, dv in enumerate(discrete_vars):
        plt.subplot(7, 2, i+3)
        data[dv].value_counts().plot(kind='bar', title=dv)
        plt.ylabel('Frequency')

In [None]:
discrete_vars = ['Gender', 'DayOfTheWeek','Diabetes', 'Alcoholism', 'HyperTension', 'Smokes', 'AwaitingTime',
                      'Tuberculosis', 'Scholarship', 'Sms_Reminder', 'Status']

features_plots(discrete_vars)

In [None]:
plt.scatter(data['Age'], data['Status'], s=1)
plt.title('Scatter plot of Age and No-show status')
plt.xlabel('Age')
plt.ylabel('No-show')
plt.xlim(0, 120)
plt.ylim(-1, 2)

In [None]:
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)  # fully qualified option name
correlations = data[['Age', 'AwaitingTime']].corr(method='pearson')
print(correlations)

In [None]:
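# unstack('Status') turns the Status level into columns 0 and 1 so they can be stacked
data_dow_status = data.groupby(['Sms_Reminder', 'Status'])['Sms_Reminder'].count().unstack('Status').fillna(0)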
data_dow_status[[0, 1]].plot(kind='bar', stacked=True)
plt.title('Frequency of people showing up and not showing up by number of SMS reminders sent')
plt.xlabel('Number of SMS reminders')
plt.ylabel('Frequency')

In [None]:
data_dow_status = data.groupby(['DayOfTheWeek', 'Status'])['DayOfTheWeek'].count().unstack('Status').fillna(0)
data_dow_status[[0, 1]].plot(kind='bar', stacked=True)
plt.title('Frequency of people showing up and not showing up by Day of the week')
plt.xlabel('Day of the week')
plt.ylabel('Frequency')

In [None]:
data.boxplot(column=['Age'], return_type='axes', by='Status')
plt.show()

In [None]:
plt.figure(figsize=(15,3.5))

for i, status in enumerate(['no show ups', 'show ups']):  # Status code 0, then 1

    data_show = data[data['Status']==i]
    plt.subplot(1, 2, i+1)

    for gender in [0, 1]:
        data_gender = data_show[data_show['Gender']==gender]
        freq_age = data_gender['Age'].value_counts().sort_index()
        freq_age.plot()

    plt.title('Age wise frequency of patient %s for both genders'%status)
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.legend(['Female', 'Male'], loc='upper left')

In [None]:
data.boxplot(column=['AwaitingTime'], return_type='axes', by='Status')
plt.show()

#### Exercise: Extract more features from the date features (hour, min, etc.).

In [None]:
for col in ['AppointmentRegistration', 'AppointmentDate']: #'AppointmentRegistration', 'ApointmentData'
    for index, component in enumerate(['year', 'month', 'day']):
        data['%s_%s'%(col, component)] = data[col].apply(lambda x: int(x.split('T')[0].split('-')[index]))

In [None]:
for index, component in enumerate(['hour']):
    data['%s_%s'%('AppointmentRegistration', component)] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[1][:-1].split(':')[index]))
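
The string splitting above can also be expressed with pandas' datetime accessor, which reaches minutes and other components without extra index arithmetic. The following is a minimal sketch on a toy frame; the timestamp values are made up for illustration and only assume the same ISO-like layout that the `split('T')` parsing relies on.

```python
import pandas as pd

# Toy frame with timestamps in the ISO-like layout the notebook's
# split('T') parsing assumes (values are invented for illustration).
toy = pd.DataFrame({
    'AppointmentRegistration': ['2014-12-16T14:46:25Z', '2015-08-18T07:01:26Z'],
    'AppointmentDate': ['2015-01-14T00:00:00Z', '2015-08-19T00:00:00Z'],
})

for col in ['AppointmentRegistration', 'AppointmentDate']:
    parsed = pd.to_datetime(toy[col])  # parse the whole column once
    for component in ['year', 'month', 'day', 'hour', 'minute']:
        # parsed.dt.year, parsed.dt.month, ... are integer Series
        toy['%s_%s' % (col, component)] = getattr(parsed.dt, component)

print(toy.filter(like='AppointmentRegistration_'))
```

The column names follow the same `<column>_<component>` pattern as the cells above, so the result could feed `features_of_choice` in the same way.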

In [None]:
# Include a Boolean transformation of the features in your dataset like that done in Chapter 4. 
# This will increase the feature set which can become beneficial while training the model.
data.head()

In [None]:
pdf.ProfileReport(data)

In [None]:
print(data.shape, data.head())

In [None]:
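def model_performance(model_name, X_train, y_train, y_test, Y_pred):
    # Note: the train accuracy below is read from the global `clf`,
    # so `clf` must be the model that produced Y_pred.

    print('Model name: %s'%model_name)
    print('Test accuracy (Accuracy Score): %f'%metrics.accuracy_score(y_test, Y_pred))
    print('Test ROC AUC score: %f'%metrics.roc_auc_score(y_test, Y_pred))
    print('Train accuracy: %f'%clf.score(X_train, y_train))

    # precision_recall_curve returns (precision, recall, thresholds);
    # integrate precision over recall for the area under the PR curve.
    precision, recall, thresholds = metrics.precision_recall_curve(y_test, Y_pred)
    print('Area Under the Precision-Recall Curve: %f'%metrics.auc(recall, precision))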
    
    false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(y_test, Y_pred)
    roc_auc = metrics.auc(false_positive_rate, true_positive_rate)
    
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
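
`model_performance` receives hard 0/1 predictions, so its ROC curve has only one interior point. A minimal sketch of the alternative, scoring with class probabilities, is shown here on synthetic data from `make_classification`; the data and the names `demo_clf`, `auc_from_labels` and `auc_from_scores` are illustrative assumptions, not part of the lab's dataset or code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the appointment data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=1)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=1)
demo_clf.fit(X_tr, y_tr)

# Hard 0/1 labels: the ROC curve is built from a single threshold.
auc_from_labels = roc_auc_score(y_te, demo_clf.predict(X_te))
# Class-1 probabilities: the curve sweeps over every threshold.
auc_from_scores = roc_auc_score(y_te, demo_clf.predict_proba(X_te)[:, 1])

print('AUC from hard labels:   %.3f' % auc_from_labels)
print('AUC from probabilities: %.3f' % auc_from_scores)
```

The two numbers generally differ, which is worth keeping in mind when comparing the AUC values printed by the cells that follow.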

In [None]:
features_of_choice = ['Age', 'Gender', 'Diabetes', 'Alcoholism', 'HyperTension',
                        'Scholarship', 'Sms_Reminder', 
                        'AppointmentDate_year', 'AppointmentDate_month', 'AppointmentDate_day',
                     ]


x = np.array(data[features_of_choice])
y = np.array(data['Status'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
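
If the two `Status` classes turn out to be unbalanced (an assumption; the class counts are not shown in this lab), `train_test_split` can keep the class ratio equal in both splits through its `stratify` argument. A small sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with a 70/30 class ratio, standing in for data['Status'].
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

# Both splits should keep roughly the original share of class 1.
print('Train share of class 1: %.2f' % y_tr.mean())
print('Test share of class 1:  %.2f' % y_te.mean())
```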

In [None]:
print(data.columns)

In [None]:
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

In [None]:
y_pred = clf.predict(x_test)
model_performance('Decision tree classifier', x_train, y_train, y_test, y_pred)

In [None]:
rbf_feature = kernel_approximation.RBFSampler(gamma=1, random_state=1)
X_train = rbf_feature.fit_transform(x_train)

clf = SGDClassifier()
clf.fit(X_train, y_train)

In [None]:
X_test = rbf_feature.transform(x_test)  # reuse the mapping fitted on the training data
Y_pred = clf.predict(X_test)
model_performance('Kernel approximation', X_train, y_train, y_test, Y_pred)
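
The kernel-approximation cells transform the training and test arrays in separate steps. One way to keep the two steps consistent is to chain the sampler and the classifier in a scikit-learn pipeline, so the mapping is fitted once on the training data and reused at prediction time. The sketch below uses synthetic data, and the name `kernel_clf` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the appointment features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=1)

# fit() fits the sampler and the classifier on the training data;
# score()/predict() only transform the test data, they never refit.
kernel_clf = make_pipeline(
    RBFSampler(gamma=1, random_state=1),
    SGDClassifier(random_state=1),
)
kernel_clf.fit(X_tr, y_tr)

test_accuracy = kernel_clf.score(X_te, y_te)
print('Test accuracy: %.3f' % test_accuracy)
```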


In [None]:
clf = GradientBoostingClassifier(random_state=10, learning_rate=0.1,
    n_estimators=200, max_depth=5, max_features=10)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

In [None]:
model_performance('Gradient Boosting', x_train, y_train, y_test, y_pred)

In [None]:
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

In [None]:
y_pred = clf.predict(x_test)
model_performance('Random Forest', x_train, y_train, y_test, y_pred)

In [None]:
for feature, score in zip(features_of_choice, list(clf.feature_importances_)):
        print('%s\t%f'%(feature, score))
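
The importances are easier to compare when sorted. A minimal sketch, assuming a fitted forest and a list of feature names; here both come from synthetic data, and `feature_names` is a hypothetical placeholder for `features_of_choice`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for features_of_choice.
feature_names = ['feature_%d' % i for i in range(6)]
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X_demo, y_demo)

# Label each importance with its feature name, then sort descending.
importances = pd.Series(forest.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```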

In [None]:
#Exercise 1: Repeat gradient boosting classification but this time only consider the features it deemed important. Did AUC and ROC improve?
features_of_choice2 = ['Age', 'Gender', 'Sms_Reminder', 
                        'AppointmentDate_month', 'AppointmentDate_day',]

x2 = np.array(data[features_of_choice2])
# Split the reduced feature matrix x2, not the full matrix x.
x_train2, x_test2, y_train2, y_test2 = train_test_split(x2, y, test_size=0.3, random_state=1)

# max_features cannot exceed the number of columns in x2.
clf2 = GradientBoostingClassifier(random_state=10, learning_rate=0.1,
    n_estimators=200, max_depth=5, max_features=len(features_of_choice2))
clf2.fit(x_train2, y_train2)
y_pred2 = clf2.predict(x_test2)
clf = clf2  # model_performance reads the train score from the global clf
model_performance('Gradient Boosting', x_train2, y_train2, y_test2, y_pred2)


In [None]:
for feature, score in zip(features_of_choice2, list(clf2.feature_importances_)):
        print('%s\t%f'%(feature, score))

In [None]:
#Exercise 2: Recently a new type of boosting, Xgboost, has been popular among data scientists. Apply that to our dataset, optimize using grid search, and see if it performs relatively better than gradient boosting.
import xgboost as xgb

# 'verbosity': 0 replaces the deprecated 'silent': 1 flag.
params = {'max_depth':5, 'eta':0.1, 'verbosity':0, 'objective':'binary:hinge' }
num_round = 10
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)
bst = xgb.train(params, dtrain, num_round)
y_pred4 = bst.predict(dtest)  # binary:hinge yields 0/1 predictions

print(y_pred4)

# Note: the train accuracy printed by model_performance comes from the global clf
# (the last scikit-learn model fitted), not from bst.
model_performance('XGBoost', x_train, y_train, y_test, y_pred4)
xgb.plot_importance(bst)

In [None]:
#Exercise 3: Apply grid search to gradient boosting to fine-tune the parameters of learning rate, max_depth, etc.
from sklearn.model_selection import GridSearchCV

clf3 = GradientBoostingClassifier()
params = {"learning_rate": [0.001, 0.1, 0.5], "max_depth":[3,5,7,10], "n_estimators":[100,200]}
cv = GridSearchCV(estimator=clf3, param_grid=params)

y_pred3 = cv.fit(x_train, y_train).predict(x_test)

clf = cv.best_estimator_  # model_performance reads the train score from the global clf
model_performance('Gradient Boosting', x_train, y_train, y_test, y_pred3)


In [None]:
print(cv.best_params_)
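
By default `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. Since the exercises ask about AUC, the search can be told to rank by ROC AUC instead. This is a sketch on synthetic data with a deliberately small grid; the grid values and the name `search` are illustrative assumptions, not tuned settings for the appointment data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the appointment data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=1)

# Small illustrative grid: 2 x 2 = 4 candidates, 3-fold cross-validation.
small_grid = {'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]}
search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=10),
    param_grid=small_grid,
    scoring='roc_auc',  # rank candidates by cross-validated ROC AUC
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Best mean cross-validated ROC AUC: %.3f' % search.best_score_)
```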

THE END