<div style="width:image width px; font-size:75%; text-align:right;">
    <img src="img/data_ev_unsplash.jpg" width="width" height="height" style="padding-bottom:0.2em;" />
    <figcaption>Photo by ev on Unsplash</figcaption>
</div>

# Introduction to machine learning in Python with scikit-learn

**Applied Programming - Summer term 2020 - FOM Hochschule für Oekonomie und Management - Cologne**

**Lecture 06 - May 07, 2020**

## Table of contents
* [Recap and outlook](#recap)
* [Business understanding](#businessunderstanding)
* [Data understanding](#dataunderstanding)
* [Data preparation](#preparation)
* [Modeling and evaluation](#modeling)
* [Homework](#homework)
* [References](#references)

## Recap and outlook<a class="anchor" id="recap"></a>
In the last lecture we learned about important packages to extend the functionality of Python. Concerning machine learning, we looked at the scikit-learn package.

> *Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.* [[1]](#sklearn2020)

Important features of the scikit-learn library are:
* Supervised and unsupervised learning algorithms
* Clean, uniform, and streamlined API
* In-depth, easy-to-understand documentation with references to scientific papers

In this lecture we will focus mainly on implementing different machine learning algorithms, i.e. on the modeling phase of CRISP-DM. In the homework following this lecture you will practice preprocessing in more depth on an example data set.

## Business understanding<a class="anchor" id="businessunderstanding"></a>
A major issue for businesses that schedule appointments with their customers is no-shows: customers who do not turn up. This is especially relevant for medical practices, where reserved treatment slots go unused although other patients would have needed them. Reducing such cases is therefore in the interest of all parties involved: medical practices avoid losses from unused capacity, and other patients receive treatment. Support in the form of a predictive algorithm would therefore be highly desirable.

What measures could be taken if the prediction is successful? If a patient is found to have an increased risk of not showing up, the practice can deliberately overbook, i.e. assign the same treatment slot twice. Alternatively, measures can be introduced to remind the patient of the appointment. Various applications for a predictive no-show score are therefore conceivable.

The data set used here contains appointments and patient information from the Brazilian city of Vitória [[2]](#jonihoppen2017). Vitória has about 360,000 inhabitants and is located 530 km north-east of Rio de Janeiro on the Atlantic coast.

## Data understanding<a class="anchor" id="dataunderstanding"></a>
First we load the data into the notebook and examine the scope, type and properties of the data.

The documentation attached to the data includes the following descriptions of the columns:
* ``PatientId``: Identification of a patient
* ``AppointmentID``: Identification of each appointment
* ``Gender``: Male or female. Women make up the greater proportion; they take considerably more care of their health than men
* ``ScheduledDay``: The day someone called or registered the appointment; this is before the appointment, of course
* ``AppointmentDay``: The day of the actual appointment, when the patient has to visit the doctor
* ``Age``: How old the patient is
* ``Neighbourhood``: Where the appointment takes place
* ``Scholarship``: True or false; state social support programme
* ``Hipertension``: True or false
* ``Diabetes``: True or false
* ``Alcoholism``: True or false
* ``Handcap``: True or false
* ``SMS_received``: 1 or more messages sent to the patient
* ``No-show``: True or false.

For the examination of the data, keep the following questions in mind:
* *What stands out to you?*
* *Which pre-processing steps will be necessary?*
* *What is the name of the label variable for which supervised learning can take place?*
* *Is it a classification or regression problem?*

In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

In [None]:
# Load the data and take a first look at a random sample
df = pd.read_csv('dat/noshow.csv')
df.sample(10)

In [None]:
# Basic statistics of all columns
df.describe(include = 'all')

In [None]:
# Data types of all columns
df.dtypes

In [None]:
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'], format = '%Y-%m-%dT%H:%M:%SZ')
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'], format = '%Y-%m-%dT%H:%M:%SZ')
df.sample(5)

In [None]:
df.dtypes

In [None]:
# Distinct values of the categorical and binary columns
print('Values of Gender:       {}'.format(df['Gender'].unique()))
print('Values of Scholarship:  {}'.format(df['Scholarship'].unique()))
print('Values of Hipertension: {}'.format(df['Hipertension'].unique()))
print('Values of Diabetes:     {}'.format(df['Diabetes'].unique()))
print('Values of Alcoholism:   {}'.format(df['Alcoholism'].unique()))
print('Values of Handcap:      {}'.format(df['Handcap'].unique()))
print('Values of SMS_received: {}'.format(df['SMS_received'].unique()))
print('Values of No-show:      {}'.format(df['No-show'].unique()))

In [None]:
# Distinct ages in ascending order
print('Values of Age: {}'.format(np.sort(df['Age'].unique())))

In [None]:
# Share of columns containing missing values and share of missing cells overall
print(len(df.columns[df.isna().any()])/len(df.columns))
print(df.isnull().sum().sum()/df.size)

In [None]:
# Histograms of all columns except the two ID columns
cols = ['Gender', 'ScheduledDay', 'AppointmentDay',
        'Age', 'Neighbourhood', 'Scholarship',
        'Hipertension', 'Diabetes', 'Alcoholism',
        'Handcap', 'SMS_received', 'No-show']
fig, axes = plt.subplots(nrows = 4, ncols = 3, figsize = (20, 15))
for ax, col in zip(axes.ravel(), cols):
    ax.hist(df[col])
    ax.set_title(col)
plt.show()

## Data preparation<a class="anchor" id="preparation"></a>
Based on the findings and the discussion from the first two sections, we can now start the pre-processing. As first general steps we will change column names and convert binary attributes into an object type.

Ask yourself at the beginning:
* *Which working steps are to be taken on the basis of the previous findings?*
* *What must be considered?*
* *What additional information could be relevant?*

In [None]:
# Drop IDs, rename columns in order to correct spelling and get 'cleaned' names
df = df.drop(['PatientId', 'AppointmentID'], axis = 1).rename(columns = {'Hipertension': 'Hypertension',
                                                                         'Handcap': 'Handicap',
                                                                         'SMS_received': 'SMSReceived',
                                                                         'No-show': 'NoShow'})

In [None]:
# Convert binary attributes to object type columns
df['Scholarship'] = df['Scholarship'].astype('object')
df['Hypertension'] = df['Hypertension'].astype('object')
df['Diabetes'] = df['Diabetes'].astype('object')
df['Alcoholism'] = df['Alcoholism'].astype('object')
df['Handicap'] = df['Handicap'].astype('object')
df['SMSReceived'] = df['SMSReceived'].astype('object')

In [None]:
df.dtypes

### Inconsistencies

In [None]:
df[df['Age'] < 0]

In [None]:
# Correct age of -1
df.loc[df['Age'] < 0, 'Age'] = 0                    # Other possibility is to drop this row

### Feature engineering

In [None]:
# Add columns with weekday names
df['WeekdayScheduled'] = df['ScheduledDay'].dt.day_name()       # .dt.weekday_name was removed in pandas 1.0
df['WeekdayAppointment'] = df['AppointmentDay'].dt.day_name()

In [None]:
# Calculate time difference in whole days between scheduling and appointment
df['Waiting'] = (df['AppointmentDay'].dt.normalize() - df['ScheduledDay'].dt.normalize()).dt.days

In [None]:
# Extract hour when appointment was scheduled
df['ArrangementHour'] = df['ScheduledDay'].dt.hour

In [None]:
# Drop original date columns
df.drop(['ScheduledDay', 'AppointmentDay'], axis = 1, inplace = True)

In [None]:
df.sample(5)
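
The three engineered features can be checked on a tiny example. The following is a minimal sketch, assuming two made-up appointments (the timestamps and the name `toy` are hypothetical and not taken from the data set) and assuming that only the calendar days should count for the waiting time:

```python
import pandas as pd

# Two made-up appointments (hypothetical values, not from the data set)
toy = pd.DataFrame({
    'ScheduledDay': pd.to_datetime(['2016-04-29T18:38:08Z', '2016-05-02T07:15:00Z'],
                                   format = '%Y-%m-%dT%H:%M:%SZ'),
    'AppointmentDay': pd.to_datetime(['2016-04-29T00:00:00Z', '2016-05-06T00:00:00Z'],
                                     format = '%Y-%m-%dT%H:%M:%SZ'),
})

# Weekday name of the appointment
toy['WeekdayAppointment'] = toy['AppointmentDay'].dt.day_name()

# Waiting time in whole days: set both timestamps to midnight before subtracting,
# otherwise a same-day appointment would get a negative waiting time
toy['Waiting'] = (toy['AppointmentDay'].dt.normalize()
                  - toy['ScheduledDay'].dt.normalize()).dt.days

# Hour of the day at which the appointment was arranged
toy['ArrangementHour'] = toy['ScheduledDay'].dt.hour

print(toy[['WeekdayAppointment', 'Waiting', 'ArrangementHour']])
```

The first row is a same-day appointment and gets a waiting time of 0 days, the second one waits 4 days.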

### Encoding

In [None]:
# Using pandas for columns with non-binary values
df = pd.get_dummies(df, columns = ['Neighbourhood', 'WeekdayScheduled', 'WeekdayAppointment'])

In [None]:
# Using sklearn for binary columns (classes are sorted, so 'F' -> 0, 'M' -> 1 and 'No' -> 0, 'Yes' -> 1)
from sklearn.preprocessing import LabelEncoder
for c in ['Gender', 'NoShow']:
    encoder = LabelEncoder()
    df[c] = encoder.fit_transform(df[c])
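
To see what the two encodings do, here is a small sketch on a made-up frame. The name `toy` and the neighbourhood values `A` and `B` are purely illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example data (hypothetical values)
toy = pd.DataFrame({'Gender': ['F', 'M', 'F'],
                    'Neighbourhood': ['A', 'B', 'A'],
                    'NoShow': ['No', 'Yes', 'No']})

# One-hot encoding: one indicator column per distinct value
toy = pd.get_dummies(toy, columns = ['Neighbourhood'])

# Label encoding: the classes are sorted and then numbered starting at 0
encoder = LabelEncoder()
toy['NoShow'] = encoder.fit_transform(toy['NoShow'])

print(list(encoder.classes_))
print(toy)
```

One-hot encoding avoids implying an order between the neighbourhoods, while a single 0/1 column is sufficient for a column with only two values.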

## Modeling and evaluation<a class="anchor" id="modeling"></a>
Now that the data has been prepared for modeling, we can proceed with the application of different machine learning algorithms. For this purpose we will import a number of additional packages and modules. Think about this:

* *What can be done to obtain a well generalising model?*
* *How can I measure performance?*
* *What should the performance be compared with, or is a metric meaningful on its own?*

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

In [None]:
# Split features and label into a DataFrame X and a Series y
X = df.drop(['NoShow'], axis = 1)
y = df['NoShow']

In [None]:
# Split the data for training and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.35,
                                                    random_state = 42)
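
With an imbalanced label, a purely random split can shift the class proportions between training and test set. One possible variation, which is not used in this lecture, is a stratified split via the `stratify` parameter of `train_test_split`. A minimal sketch on made-up data (the names `X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced label: 80 x class 0 and 20 x class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size = 0.35,
                                          stratify = y_toy,
                                          random_state = 42)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())   # share of class 1 in training and test set
```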

### Decision tree

In [None]:
# Instantiate the classifier object and start training
clf_dt = DecisionTreeClassifier(max_depth = 4)  # limit the tree to four levels so that it can be displayed clearly
clf_dt = clf_dt.fit(X_train, y_train)

In [None]:
# Get an indication of the most important features (names and importances are sorted together)
print("Feature Importance:\n")
importances = pd.Series(clf_dt.feature_importances_, index = X.columns).sort_values(ascending = False)
for name, importance in importances.items():
    print("{}: {:.2f}".format(name, importance))
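
When ranking features, each importance value has to stay attached to its column name. The following sketch shows one way to do this with a `pandas.Series`. It uses synthetic data from `make_classification`, and the column names `f0` to `f3` are assumptions for illustration only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with four features (hypothetical column names)
X_arr, y_toy = make_classification(n_samples = 200, n_features = 4,
                                   n_informative = 2, n_redundant = 0,
                                   random_state = 42)
X_toy = pd.DataFrame(X_arr, columns = ['f0', 'f1', 'f2', 'f3'])

tree = DecisionTreeClassifier(max_depth = 4, random_state = 42).fit(X_toy, y_toy)

# The index keeps every importance paired with its feature while sorting
ranking = pd.Series(tree.feature_importances_,
                    index = X_toy.columns).sort_values(ascending = False)
print(ranking)
```

Sorting the importance array on its own and then zipping it with the column names would assign the largest value to the first column, regardless of which feature it belongs to.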

In [None]:
# Print confusion matrix and accuracy
y_pred = clf_dt.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
# Calculate null accuracy: accuracy of always predicting the majority class (here 0, i.e. 'No')
1 - y_test.mean()
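
Why the null accuracy matters as a reference value can be sketched with a made-up label vector. The 80/20 split below is a hypothetical assumption, not the actual class distribution of the data set:

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 8 patients show up (0), 2 do not (1)
y_true = np.array([0] * 8 + [1] * 2)

# Baseline "model": always predict the majority class
y_always_no = np.zeros_like(y_true)

# Null accuracy = share of the majority class
null_accuracy = max(y_true.mean(), 1 - y_true.mean())

print(null_accuracy)
print(metrics.accuracy_score(y_true, y_always_no))
print(metrics.recall_score(y_true, y_always_no))
print(metrics.confusion_matrix(y_true, y_always_no))
```

The baseline reaches the null accuracy without finding a single no-show (recall of 0), so a classifier is only useful if it clearly improves on this value or on a metric that focuses on the minority class.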

In [None]:
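# Export the fitted tree to Graphviz and render it as SVG
graph = Source(export_graphviz(clf_dt,
                               out_file = None,
                               feature_names = list(X.columns),  # a plain list works across scikit-learn versions
                               class_names = ['No', 'Yes'],
                               filled = True))
display(SVG(graph.pipe(format = 'svg')))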

### Random forest

In [None]:
# Instantiate the classifier object and start training
clf_rf = RandomForestClassifier(random_state = 42)
clf_rf.fit(X_train, y_train)

In [None]:
# Get an indication of the most important features (names and importances are sorted together)
print("Feature Importance:\n")
importances = pd.Series(clf_rf.feature_importances_, index = X.columns).sort_values(ascending = False)
for name, importance in importances.items():
    print("{}: {:.2f}".format(name, importance))

In [None]:
# Print confusion matrix and accuracy
y_pred = clf_rf.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.accuracy_score(y_test, y_pred))

### Grid search

In [None]:
# Try all combinations of number of trees and tree depth with 5-fold cross-validation
params = {'n_estimators': [10, 20, 30],
          'max_depth': [3, 4, 5]}
clf_rf2 = RandomForestClassifier(random_state = 42)
clf_grid = GridSearchCV(clf_rf2, params, cv = 5, n_jobs = -1, verbose = 1)
clf_grid.fit(X_train, y_train)
print(clf_grid.best_params_)
print(clf_grid.best_score_)
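
The score reported by the grid search is a cross-validation result on the training data. A possible follow-up step is to evaluate the refitted best model on the held-out test set. The sketch below does this on synthetic data; the names `X_toy`, `y_toy` and `grid` as well as the smaller parameter grid are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data (hypothetical, not the no-show data)
X_toy, y_toy = make_classification(n_samples = 300, n_features = 8, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size = 0.35,
                                          random_state = 42)

# 2 x 2 = 4 parameter combinations, each evaluated with 5-fold cross-validation
params = {'n_estimators': [10, 20],
          'max_depth': [3, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state = 42), params, cv = 5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(grid.best_score_)          # mean accuracy over the validation folds
print(grid.score(X_te, y_te))    # accuracy of the refitted best model on the test set
```

Because the test set was not used during the search, the last value gives a less optimistic estimate of how the selected model generalises.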

## Homework<a class="anchor" id="homework"></a>
Please download the data set about campus recruitment [[3]](#roshan2020). Follow the CRISP-DM steps through to modeling. Use a Support Vector Machine combined with a Grid Search.

## References<a class="anchor" id="references"></a>

[1]<a class="anchor" id="sklearn2020"></a> The scikit-learn developers (2020). scikit-learn. Retrieved 2020-04-02 from https://scikit-learn.org/stable/.

[2]<a class="anchor" id="jonihoppen2017"></a> JoniHoppen (2017). Medical Appointment No Shows, Why do 30% of patients miss their scheduled appointments?, Version 5. Retrieved 2020-05-04 from https://www.kaggle.com/joniarroba/noshowappointments.

[3]<a class="anchor" id="roshan2020"></a> Ben Roshan D (2020). Campus Recruitment, Academic and Employability Factors influencing placement, Version 1. Retrieved 2020-05-04 from https://www.kaggle.com/benroshan/factors-affecting-campus-placement.