# Module 6: Exercise

In this session, we fit a linear SVM on the **Medical Appointment No Shows** dataset
using the typical train/validate workflow.

In addition, you are expected to perform outlier removal and feature selection before
training the linear SVM on this dataset.

Please work through the **LinearSVM** and **Processing** labs in this module to become familiar with
the linear SVM model and to prepare for this exercise.

The **Processing** lab provides an example of how to incorporate feature selection and
outlier detection into a more complete data analysis workflow.
Please refer back to the labs in **Module 3** and **Module 4**, respectively, for more details.

Dataset: https://www.kaggle.com/joniarroba/noshowappointments

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Load dataset

This dataset can be used to predict **No-show** from the various recorded factors.

In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/AppliedML_M6/appointment_noshow.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

## Processing

<span style="background: yellow;">In this section, whenever you modify the dataset while debugging your code, you will probably need to re-run the cells starting from **Load dataset** (the cell above).</span>

List the top 5 records to preview the dataset.

In [None]:
# Add code below this comment  (Question #E6001)
# ----------------------------------








Looks like we won't be using **PatientId** and **AppointmentID** columns.

Delete those columns.

In [None]:
# Add code below this comment  (Question #E6002)
# ----------------------------------








dataset.head()

Convert **Gender** and **No-show** to binary (0s and 1s).
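
As a minimal sketch of one way to binarize string categories, the example below works on a hypothetical toy frame rather than the exercise data. The `toy` frame, its labels, and the particular 0/1 assignment are assumptions for illustration; check the actual category labels in your dataset before choosing a mapping.

```python
import pandas as pd

# Hypothetical toy data; the labels below are assumptions for illustration.
toy = pd.DataFrame({'Gender': ['F', 'M', 'F'], 'No-show': ['No', 'Yes', 'No']})

# Series.map replaces each label with the value it maps to in the dictionary.
toy['Gender'] = toy['Gender'].map({'F': 0, 'M': 1})
toy['No-show'] = toy['No-show'].map({'No': 0, 'Yes': 1})

print(toy)
```

Any label missing from the dictionary becomes NaN, so it is worth checking the result for missing values afterwards.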

In [None]:
# Add code below this comment  (Question #E6003)
# ----------------------------------








dataset.head()

Convert **ScheduledDay** and **AppointmentDay** into the np.datetime64 data type.
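
A minimal sketch of string-to-datetime conversion with `pd.to_datetime`, using a hypothetical toy frame (the column names `start` and `end` and the timestamps are made up). It also shows how a whole-day difference behaves when the end time is earlier than the start time.

```python
import pandas as pd

# Hypothetical toy data; column names and timestamps are assumptions.
toy = pd.DataFrame({
    'start': ['2016-04-29 18:38:08', '2016-05-02 10:00:00'],
    'end':   ['2016-05-03 00:00:00', '2016-05-02 00:00:00'],
})

# Parse the strings into pandas datetime columns.
toy['start'] = pd.to_datetime(toy['start'])
toy['end'] = pd.to_datetime(toy['end'])

# Subtracting two datetime columns gives timedeltas; .dt.days keeps whole days.
days = (toy['end'] - toy['start']).dt.days
print(days.tolist())  # [3, -1]
```

The second row yields -1 because a negative timedelta of a few hours is stored as minus one day plus a positive remainder.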

In [None]:
# Add code below this comment  (Question #E6004)
# ----------------------------------










dataset.head()

Add a column **AwaitingTime** filled with the time difference between **AppointmentDay** and **ScheduledDay**,
in number of days.

In [None]:
dataset['AwaitingTime'] = (dataset['AppointmentDay'] - dataset['ScheduledDay']).apply(lambda dt: dt.days)
dataset.head()

Check unique values of all columns, except **ScheduledDay** and **AppointmentDay**
because they would have too many unique values.

The goal is to understand whether there are missing or "bad" values in the dataset.

In [None]:
# Complete code below this comment  (Question #E6005)
# ----------------------------------
for column_name in set(dataset.columns)-{'ScheduledDay', 'AppointmentDay'}:
    print(column_name, sorted(np.unique(dataset[<placeholder>])))
    

    
    
    
    
    

### Outliers

Import some outlier detection utilities.

In [None]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

The first thing to notice is that an age less than or equal to 0 is not plausible, so such values probably point to a discrepancy in the dataset. We remove those rows.
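
Row filtering of this kind is usually done with a boolean mask. The sketch below uses a hypothetical toy frame (the `Visits` column is made up) and resets the index afterwards so that row labels and row positions stay aligned for later steps.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'Age': [34, 0, -1, 62], 'Visits': [1, 2, 3, 4]})

# Keep only rows whose Age is strictly positive, then renumber the index.
kept = toy[toy['Age'] > 0].reset_index(drop=True)

print(kept)
```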

In [None]:
# Add code below this comment  (Question #E6006)
# ----------------------------------








# ----------------------------------


print('Number of records', len(dataset))
print('Age', np.unique(dataset['Age']))

Remove outliers in **AwaitingTime** with **Elliptic Envelope**.
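
For orientation, here is a minimal sketch of how `EllipticEnvelope` flags outliers in a single numeric column. The synthetic values, the contamination level, and the names `values` and `is_outlier` are assumptions for illustration; they are not tuned for the exercise data.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic data: 200 well-behaved values plus one obviously extreme value.
rng = np.random.RandomState(0)
values = np.concatenate([rng.normal(10, 2, size=200), [500.0]]).reshape(-1, 1)

# contamination is the expected fraction of outliers in the data.
toy_envelope = EllipticEnvelope(contamination=0.01, random_state=0)
toy_envelope.fit(values)

# predict() returns +1 for inliers and -1 for outliers.
is_outlier = toy_envelope.predict(values) == -1

print(np.sum(is_outlier), values[is_outlier].ravel())
```

Because `predict()` returns +1 and -1 rather than booleans, comparing against -1 gives a boolean mask that can be negated with `~` or passed to `np.flatnonzero`.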

In [None]:
awaiting_time = np.array(dataset['AwaitingTime']).reshape((-1, 1))
# print('awaiting_time.shape', awaiting_time.shape)

# Complete code below this comment  (Question #E6007)
# ----------------------------------
envelope = EllipticEnvelope(contamination = 0.003)
envelope.fit(<placeholder>)
outliers = <placeholder>
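# Drop by index label so this still works if the index was not reset after earlier filtering
dataset.drop(dataset.index[np.flatnonzero(outliers)], inplace=True)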
dataset.reset_index(drop=True, inplace=True)
# ----------------------------------

print({'inliers': np.sum(~outliers), 'outliers': np.sum(outliers)})
print('Number of records', len(dataset))
print('AwaitingTime', np.unique(dataset['AwaitingTime']))

### Encoding

Dates and times are usually difficult to deal with for predictive models as input data.

Therefore, we take the day and month out of **AppointmentDay** and **create two new columns**, respectively.

Also remove **ScheduledDay**, because it can be derived from these two columns and **AwaitingTime**,
so this column would be redundant.

Remove the column **AppointmentDay** as well.

In [None]:
# Complete code below this comment  (Question #E6008)
# ----------------------------------
dataset['AppointmentDate_day'] = dataset['AppointmentDay'].apply(<placeholder>)
dataset['AppointmentDate_month'] = dataset['AppointmentDay'].apply(<placeholder>)

# Add code below this comment to delete columns ScheduledDay and AppointmentDay (Question #E6009)
# ----------------------------------




# ----------------------------------

dataset.head()

Strings are also undesirable data types here. We use **LabelBinarizer** to create a one-hot encoding for **Neighbourhood** instead.
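
A minimal sketch of what `LabelBinarizer` produces, using a short hypothetical list of area names (the names are made up). Each distinct string becomes one column, ordered as in `classes_`.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical category values for illustration only.
areas = ['CENTRO', 'JARDIM', 'CENTRO', 'PRAIA']

toy_encoder = LabelBinarizer()
onehot = toy_encoder.fit_transform(areas)

print(toy_encoder.classes_)  # ['CENTRO' 'JARDIM' 'PRAIA']
print(onehot)
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
```

Note that with exactly two distinct classes `LabelBinarizer` returns a single column instead of two.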

In [None]:
from sklearn.preprocessing import LabelBinarizer
# Complete code below this comment  (Question #E6010)
# ----------------------------------
encoder = LabelBinarizer()
Neighbourhood_onehot = encoder.fit_transform(dataset[<placeholder>])
# ----------------------------------

for j, neighborhood in enumerate(encoder.classes_):
    dataset['Neighbourhood ({})'.format(neighborhood)] = Neighbourhood_onehot[:, j]

del dataset['Neighbourhood']
dataset.info()

### Statistics

Now all columns are of integer type.

Check statistics for the rest of the columns.

In [None]:
dataset.loc[:, [column_name for column_name in dataset.columns
    if not column_name.startswith('Neighbourhood')]].describe()

Check class balance.

This dataset is used to potentially predict **No-show** from various factors recorded.
Since no-shows should usually be the minority cases, it's very likely that this dataset  
is very imbalanced.

We want to understand how balanced it is between number of positive and negative samples quantitatively.  
So we find out the ratio of no-shows among the entire dataset.

In [None]:
# Complete code below this comment  (Question #E6011)
# ----------------------------------
num_noshow = np.sum(<placeholder>) # find out total number of no-show cases
print('noshow ratio:', num_noshow, '/', len(dataset), '=', num_noshow / len(dataset))






For the sake of fairness, we will resample the no-show cases to rebalance the dataset.

First, we calculate the upsample rate that, when multiplied by the number of no-show cases,
makes positive and negative samples appear 50/50.

In [None]:
upsample_rate = (len(dataset) - num_noshow) / num_noshow
print('upsample_rate:', upsample_rate)

Verify this upsample rate by definition.

In [None]:
# round before converting, so floating-point error cannot truncate the product by one
print(int(round(num_noshow * upsample_rate)), len(dataset) - num_noshow)

Now we resample the dataset. Please upsample the no-show cases, then concatenate them with the original "show-up" cases
to create a new dataset **dataset_resampled**.
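
Upsampling a minority class generally means sampling its rows with replacement until it matches the majority class in size. The sketch below shows the idea on a hypothetical toy frame (the column names `label` and `x` are made up); it passes an explicit row count `n`, which is one option among several accepted by `DataFrame.sample`.

```python
import pandas as pd

# Hypothetical toy data: 2 positive rows and 6 negative rows.
toy = pd.DataFrame({'label': [1, 1, 0, 0, 0, 0, 0, 0], 'x': range(8)})
minority = toy[toy['label'] == 1]
majority = toy[toy['label'] == 0]

# Draw as many minority rows as there are majority rows, with replacement.
upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

balanced = pd.concat([upsampled, majority]).reset_index(drop=True)
print(balanced['label'].mean())  # 0.5
```

Sampling with replacement is required here because more rows are requested than the minority class contains.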

In [None]:
# Complete code below this comment  (Question #E6012)
# ----------------------------------
dataset_resampled = pd.concat([
    dataset[dataset['No-show'] == 1].sample(<placeholder>).reset_index(drop=True),
    dataset[dataset['No-show'] == 0]
])




Shuffle **dataset_resampled**.

In [None]:
# Add code below this comment  (Question #E6013)
# ----------------------------------








Verify no-show ratio again.

In [None]:
print('noshow ratio:', np.sum(dataset_resampled['No-show'] == 1) / len(dataset_resampled))

To avoid mixing up **dataset** and **dataset_resampled**,
we replace **dataset** and delete **dataset_resampled**.

In [None]:
dataset = dataset_resampled
del dataset_resampled

Next, you may notice that we have a lot of columns just for the neighbourhood.
The number of these columns is far higher than that of all other features combined.
Since we have already made the somewhat arbitrary decision to encode them with one-hot encoding,
why not apply PCA to these columns and see whether we can compress them?

In [None]:
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import scale, LabelEncoder, OneHotEncoder, LabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

Select all Neighbourhood columns.

In [None]:
# Complete code below this comment  (Question #E6014)
# ----------------------------------
X_neighbors = np.array(dataset.loc[:, [column_name for column_name in <placeholder>
    if column_name.startswith('Neighbourhood')]])
X_neighbors.shape



Apply PCA to **X_neighbors** with 60 principal components to create **X_neighbors_PCA**.
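
A minimal sketch of compressing one-hot columns with PCA, using synthetic data. The 8 categories, the 5 components, and the names `X_toy` and `toy_pca` are assumptions chosen for illustration and do not correspond to the exercise settings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic one-hot matrix: 500 rows, each belonging to one of 8 categories.
rng = np.random.RandomState(0)
categories = rng.randint(0, 8, size=500)
X_toy = np.eye(8)[categories]

# Project the 8 one-hot columns onto 5 principal components.
toy_pca = PCA(n_components=5)
X_toy_pca = toy_pca.fit_transform(X_toy)

print(X_toy_pca.shape)  # (500, 5)
print(np.sum(toy_pca.explained_variance_ratio_))
```

The summed `explained_variance_ratio_` indicates how much of the original variance the retained components keep.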

In [None]:
# Add code below this comment  (Question #E6015)
# ----------------------------------









Check the combined explained variance ratio to make sure it doesn't drop too significantly.

In [None]:
np.sum(pca.explained_variance_ratio_)

Delete all Neighbourhood columns because we are going to replace them all with their principal components.

In [None]:
# Add code below this comment  (Question #E6016)
# ----------------------------------










Attach the principal components of the original Neighbourhood columns to the dataset.

In [None]:
for j in range(X_neighbors_PCA.shape[1]):
    dataset['N{}'.format(j)] = X_neighbors_PCA[:, j]

Take a look at the dataset again.

In [None]:
dataset.info()

### Feature selection

Now we create the following arrays for easier access to column names and features.

Study what they are and what they represent.

In [None]:
column_names = np.array(dataset.columns)
original_features = column_names!='No-show'

Create the train/test split **X_train, X_test, y_train, y_test**.
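
For reference, a minimal sketch of `train_test_split` on synthetic arrays. The 30% test size, the fixed `random_state`, and the toy variable names are assumptions; pick the split that suits the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels for illustration only.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Features and labels are split together so that rows stay aligned.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)
```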

In [None]:
from sklearn.model_selection import train_test_split

# Complete code below this comment  (Question #E6017)
# ----------------------------------
# Pull features and labels
X = scale(np.array(dataset.loc[:, <placeholder>]))  # use what you learned from the cell above this
y = np.array(dataset['No-show'])

# Add code below this comment  (Question #E6018)
# ----------------------------------
# Create training/validation split










Fit a feature selector.
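
A minimal sketch of how `SelectKBest` with `f_classif` ranks features, on synthetic data where only one column carries information about the label. The column names and `k=1` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
y_toy = rng.randint(0, 2, size=200)

# Three pure-noise columns plus one column that closely tracks the label.
noise = rng.normal(size=(200, 3))
informative = y_toy + rng.normal(scale=0.1, size=200)
X_toy = np.column_stack([noise, informative])
names = np.array(['a', 'b', 'c', 'signal'])

# The selector scores every column against the labels and keeps the top k.
toy_selector = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)
chosen = names[toy_selector.get_support()]
print(chosen)
```

`get_support()` returns a boolean mask over the input columns, which is why it can index an array of column names directly.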

In [None]:
# Complete code below this comment  (Question #E6019)
# ----------------------------------
selector = SelectKBest(f_classif, k=20)
selector.fit(<placeholder>)
selected_features = column_names[original_features][selector.get_support()]
print(selected_features)




## Create a linear SVM classifier

In [None]:
import tensorflow as tf

# Please ignore the configurations below
tf.logging.set_verbosity(tf.logging.ERROR)
import tf_threads
estimator_config = tf.contrib.learn.RunConfig(session_config=
    tf_threads.limit(tf, 4)
)

Prepare feature columns as TensorFlow placeholders.

In [None]:
# Add code below this comment  (Question #E6020)
# ----------------------------------









Create SVM classifier

In [None]:
# Add code below this comment  (Question #E6021)
# ----------------------------------









Prepare input_fn() to supply training data.

In [None]:
# Complete code below this comment  (Question #E6022)
# ----------------------------------
def input_fn():
    X_selected = selector.transform(X_train)
    columns = {
        feature_name: tf.constant(np.expand_dims(<placeholder>, 1))
            for i,feature_name in enumerate(selected_features)
    }
    columns['example_id'] = tf.constant([str(i+1) for i in range(len(X_selected))])
    labels = tf.constant(y_train)
    return columns, labels

## Train SVM

This may take a few minutes.

In [None]:
%%time

# Add code below this comment  (Question #E6023)
# ----------------------------------











## Evaluation

First, create a predict_fn() to supply data from the test dataset.

In [None]:
def predict_fn():
    X_selected = selector.transform(X_test)
    columns = {
        feature_name: tf.constant(np.expand_dims(X_selected[:, i], 1))
            for i,feature_name in enumerate(selected_features)
    }
    columns['example_id'] = tf.constant([str(i+1) for i in range(len(X_selected))])
    return columns

Make predictions

In [None]:
y_pred = classifier.predict(input_fn=predict_fn)
y_pred = list(map(lambda i: i['classes'], y_pred))

Measure accuracy and create confusion matrix.
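
A minimal sketch of both metrics on hand-made labels (the toy label lists are made up). In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only.
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)   # 3 of 5 correct
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(acc)  # 0.6
print(cm)
# [[1 1]
#  [1 2]]
```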

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Add code below this comment  (Question #E6024)
# ----------------------------------











# Save your notebook!