# Module 6: Exercise

In this session, we fit a linear SVM on the **Medical Appointment No Shows** dataset
with the typical train/validate workflow.

In addition, you are expected to perform outlier removal and feature selection before
training the linear SVM on this dataset.

Please follow the **LinearSVM** and **Processing** labs in this module to familiarize yourself with
the linear SVM model and prepare for this exercise.

The **Processing** lab provides an example of how you could incorporate feature selection and
outlier detection into a more complete data analysis workflow.
Please refer back to the labs in **Module 3** and **Module 4** respectively for more details.

Dataset: https://www.kaggle.com/joniarroba/noshowappointments

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Load dataset

The goal is to predict **No-show** from the other recorded factors.

In [17]:
# Dataset location
DATASET = '/dsa/data/all_datasets/AppliedML_M6/appointment_noshow.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


## Processing

<span style="background: yellow;">In this section, whenever you modify the dataset while debugging your code, you will probably need to re-run the cells starting from **Load dataset** (the cell above).</span>

List the top 5 records to preview the dataset.

In [18]:
# Add code below this comment  (Question #E6001)
# ----------------------------------
dataset.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,823681800000000.0,5604716,F,2016-04-20T08:15:11Z,2016-05-02T00:00:00Z,13,CONQUISTA,0,0,0,0,0,1,No
1,713958100000000.0,5658944,F,2016-05-04T12:37:06Z,2016-05-04T00:00:00Z,15,SANTA MARTHA,0,0,0,0,0,0,No
2,9918162000000.0,5646167,M,2016-05-02T11:00:26Z,2016-05-10T00:00:00Z,70,ILHA DO PRÍNCIPE,0,0,0,0,0,1,No
3,319479300000000.0,5684241,F,2016-05-11T08:23:40Z,2016-05-11T00:00:00Z,37,TABUAZEIRO,1,0,0,0,0,0,No
4,577744700000.0,5564030,M,2016-04-11T07:08:46Z,2016-05-16T00:00:00Z,59,PARQUE MOSCOSO,0,0,0,0,0,0,No


Looks like we won't be using **PatientId** and **AppointmentID** columns.

Delete those columns.

In [19]:
# Add code below this comment  (Question #E6002)
# ----------------------------------
dataset.drop(['PatientId', 'AppointmentID'], axis=1, inplace=True)

dataset.head()

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,F,2016-04-20T08:15:11Z,2016-05-02T00:00:00Z,13,CONQUISTA,0,0,0,0,0,1,No
1,F,2016-05-04T12:37:06Z,2016-05-04T00:00:00Z,15,SANTA MARTHA,0,0,0,0,0,0,No
2,M,2016-05-02T11:00:26Z,2016-05-10T00:00:00Z,70,ILHA DO PRÍNCIPE,0,0,0,0,0,1,No
3,F,2016-05-11T08:23:40Z,2016-05-11T00:00:00Z,37,TABUAZEIRO,1,0,0,0,0,0,No
4,M,2016-04-11T07:08:46Z,2016-05-16T00:00:00Z,59,PARQUE MOSCOSO,0,0,0,0,0,0,No


Convert **Gender** and **No-show** to binary (0s and 1s).

In [20]:
# Add code below this comment  (Question #E6003)
# ----------------------------------

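# list.index gives each label's position: M -> 0, F -> 1 and Yes -> 0, No -> 1,
# so No-show == 0 marks a missed appointment.
dataset['Gender'] = list(map(['M', 'F'].index, dataset['Gender']))
dataset['No-show'] = list(map(['Yes', 'No'].index, dataset['No-show']))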
dataset.head()

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,1,2016-04-20T08:15:11Z,2016-05-02T00:00:00Z,13,CONQUISTA,0,0,0,0,0,1,1
1,1,2016-05-04T12:37:06Z,2016-05-04T00:00:00Z,15,SANTA MARTHA,0,0,0,0,0,0,1
2,0,2016-05-02T11:00:26Z,2016-05-10T00:00:00Z,70,ILHA DO PRÍNCIPE,0,0,0,0,0,1,1
3,1,2016-05-11T08:23:40Z,2016-05-11T00:00:00Z,37,TABUAZEIRO,1,0,0,0,0,0,1
4,0,2016-04-11T07:08:46Z,2016-05-16T00:00:00Z,59,PARQUE MOSCOSO,0,0,0,0,0,0,1
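The same encoding can also be written with explicit dictionaries, which makes the meaning of each code visible at a glance. The sketch below is only an illustration on a small made-up frame named `toy`; it is not part of the graded solution.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'No-show': ['No', 'Yes', 'No'],
})

# Same codes as the list.index approach: M -> 0, F -> 1 and Yes -> 0, No -> 1.
toy['Gender'] = toy['Gender'].map({'M': 0, 'F': 1})
toy['No-show'] = toy['No-show'].map({'Yes': 0, 'No': 1})

print(toy)
```

With this encoding, `No-show == 0` means the patient missed the appointment and `No-show == 1` means the patient showed up.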


Convert **ScheduledDay** and **AppointmentDay** into the np.datetime64 data type.

In [21]:
dataset['ScheduledDay'] = dataset['ScheduledDay'].apply(np.datetime64)

In [22]:
# Add code below this comment  (Question #E6004)
# ----------------------------------
dataset['AppointmentDay'] = dataset['AppointmentDay'].apply(np.datetime64)
dataset.head()

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,1,2016-04-20 08:15:11,2016-05-02,13,CONQUISTA,0,0,0,0,0,1,1
1,1,2016-05-04 12:37:06,2016-05-04,15,SANTA MARTHA,0,0,0,0,0,0,1
2,0,2016-05-02 11:00:26,2016-05-10,70,ILHA DO PRÍNCIPE,0,0,0,0,0,1,1
3,1,2016-05-11 08:23:40,2016-05-11,37,TABUAZEIRO,1,0,0,0,0,0,1
4,0,2016-04-11 07:08:46,2016-05-16,59,PARQUE MOSCOSO,0,0,0,0,0,0,1


Add a column **AwaitingTime** filled with the time difference between **AppointmentDay** and **ScheduledDay**,
in number of days.

In [23]:
dataset['AwaitingTime'] = (dataset['AppointmentDay'] - dataset['ScheduledDay']).apply(lambda dt: dt.days)
dataset.head()

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,AwaitingTime
0,1,2016-04-20 08:15:11,2016-05-02,13,CONQUISTA,0,0,0,0,0,1,1,11
1,1,2016-05-04 12:37:06,2016-05-04,15,SANTA MARTHA,0,0,0,0,0,0,1,-1
2,0,2016-05-02 11:00:26,2016-05-10,70,ILHA DO PRÍNCIPE,0,0,0,0,0,1,1,7
3,1,2016-05-11 08:23:40,2016-05-11,37,TABUAZEIRO,1,0,0,0,0,0,1,-1
4,0,2016-04-11 07:08:46,2016-05-16,59,PARQUE MOSCOSO,0,0,0,0,0,0,1,34
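Same-day appointments show up as -1 in the preview above because **ScheduledDay** carries a time of day while **AppointmentDay** is recorded at midnight, so the difference is a negative fraction of a day and its `days` component is -1. The sketch below reproduces this on two rows copied from the preview and shows one possible alternative, assuming you would rather compare calendar dates only.

```python
import pandas as pd

scheduled = pd.Series(pd.to_datetime(['2016-04-20 08:15:11', '2016-05-04 12:37:06']))
appointment = pd.Series(pd.to_datetime(['2016-05-02', '2016-05-04']))

# Raw difference: the same-day booking is -12:37:06, whose .days component is -1.
raw_days = (appointment - scheduled).dt.days

# Calendar difference: drop the time of day from the scheduling timestamp first.
calendar_days = (appointment - scheduled.dt.normalize()).dt.days

print(raw_days.tolist(), calendar_days.tolist())
```

The rest of this exercise keeps the raw difference, so the outputs below still contain -1.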


Check unique values of all columns, except **ScheduledDay** and **AppointmentDay**
because they would have too many unique values.

The goal is to understand whether there are missing or "bad" values in the dataset.

In [24]:
# Complete code below this comment  (Question #E6005)
# ----------------------------------
for column_name in set(dataset.columns)-{'ScheduledDay', 'AppointmentDay'}:
    print(column_name, sorted(np.unique(dataset[column_name])))
        

Handcap [0, 1, 2, 3, 4]
Age [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 115]
AwaitingTime [-7, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 100, 101, 102, 103, 104, 106, 107, 108, 109, 110, 111, 114, 116, 118, 121, 122, 124, 125, 126, 131, 132, 138, 141, 145, 150, 154, 161, 168, 175, 178]
Gend
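Listing unique values exposes "bad" values such as a negative age. Missing values can be counted directly as well. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one suspicious age and one missing value.
toy = pd.DataFrame({'Age': [-1, 0, 37, 115], 'Handcap': [0, 1, 4, np.nan]})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Sorted unique values per column, ignoring missing entries.
unique_per_column = {name: sorted(toy[name].dropna().unique()) for name in toy.columns}

print(missing_per_column)
print(unique_per_column)
```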

### Outliers

Import some outlier detection utilities.

In [25]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

The first thing we notice is that some ages are less than or equal to 0, which probably indicates a discrepancy in the dataset. We remove those rows.

In [26]:
dropdata = dataset[dataset['Age'] < 1]
dataset.drop(dropdata.index, inplace=True)
dataset.reset_index(drop=True, inplace=True)

Remove outliers in **AwaitingTime** with **Elliptic Envelope**.

In [27]:
# reshape((-1, 1)): -1 lets NumPy infer the number of rows, 1 is the number of columns;
# the double parentheses pass the shape as a tuple.
awaiting_time = np.array(dataset['AwaitingTime']).reshape((-1, 1))
print('awaiting_time.shape', awaiting_time.shape)

# Complete code below this comment  (Question #E6007)
# ----------------------------------
envelope = EllipticEnvelope(contamination = 0.003)
envelope.fit(awaiting_time)
outliers = envelope.predict(awaiting_time)==-1
dataset.drop(np.flatnonzero(outliers), inplace=True)
dataset.reset_index(drop=True, inplace=True)
# ----------------------------------

print({'inliers': np.sum(~outliers), 'outliers': np.sum(outliers)})
print('Number of records', len(dataset))
print('AwaitingTime', np.unique(dataset['AwaitingTime']))

awaiting_time.shape (106987, 1)
{'inliers': 106670, 'outliers': 317}
Number of records 106670
AwaitingTime [-7 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85]
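To see what the envelope does in isolation, the sketch below fits it to synthetic one-dimensional data with three obviously extreme values appended. The data, the random seed, and the variable names are assumptions for illustration; only the `contamination` value is taken from the cell above.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)

# Synthetic waiting times: 1000 typical values plus three extreme ones at the end.
typical = rng.normal(loc=10.0, scale=5.0, size=1000)
extreme = np.array([150.0, 160.0, 170.0])
values = np.concatenate([typical, extreme]).reshape(-1, 1)

# contamination is the fraction of samples the envelope is allowed to flag.
envelope = EllipticEnvelope(contamination=0.003, random_state=0)
labels = envelope.fit_predict(values)   # +1 for inliers, -1 for outliers
is_outlier = labels == -1

print('flagged:', int(is_outlier.sum()), 'of', len(values))
```

Because `contamination` fixes the share of flagged samples, the envelope always removes roughly that fraction, whether or not the data contain genuine outliers.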


### Encoding

Dates and times are usually difficult for predictive models to use directly as input data.

Therefore, we take the day and month out of **AppointmentDay** and **create two new columns** respectively.

Also remove **ScheduledDay**: it can be derived from these two columns and **AwaitingTime**,
so the column would be redundant.

Finally, remove the **AppointmentDay** column.

In [28]:
# Complete code below this comment  (Question #E6008)
# ----------------------------------
dataset['AppointmentDate_day'] = dataset['AppointmentDay'].apply(lambda d: d.day)
dataset['AppointmentDate_month'] = dataset['AppointmentDay'].apply(lambda d: d.month)

# Add code below this comment to delete columns ScheduledDay and AppointmentDay (Question #E6009)
# ----------------------------------

del dataset['ScheduledDay']
del dataset['AppointmentDay']

# ----------------------------------

dataset.head()

Unnamed: 0,Gender,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,AwaitingTime,AppointmentDate_day,AppointmentDate_month
0,1,13,CONQUISTA,0,0,0,0,0,1,1,11,2,5
1,1,15,SANTA MARTHA,0,0,0,0,0,0,1,-1,4,5
2,0,70,ILHA DO PRÍNCIPE,0,0,0,0,0,1,1,7,10,5
3,1,37,TABUAZEIRO,1,0,0,0,0,0,1,-1,11,5
4,0,59,PARQUE MOSCOSO,0,0,0,0,0,0,1,34,16,5
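For a datetime column, pandas also offers the vectorized `.dt` accessor, which avoids a Python-level `apply`. A minimal sketch on a made-up two-row frame (the frame `toy` is an assumption; the column names follow the cell above):

```python
import pandas as pd

toy = pd.DataFrame({'AppointmentDay': pd.to_datetime(['2016-05-02', '2016-06-08'])})

# Extract calendar components for the whole column at once.
toy['AppointmentDate_day'] = toy['AppointmentDay'].dt.day
toy['AppointmentDate_month'] = toy['AppointmentDay'].dt.month

# Drop the original datetime column once its parts have been extracted.
toy = toy.drop(columns=['AppointmentDay'])

print(toy)
```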


Strings are also undesirable data types here. We use **LabelBinarizer** to create a one-hot encoding for **Neighbourhood** instead.

In [29]:
from sklearn.preprocessing import LabelBinarizer
# Complete code below this comment  (Question #E6010)
# ----------------------------------
encoder = LabelBinarizer()
Neighbourhood_onehot = encoder.fit_transform(dataset['Neighbourhood'])
# ----------------------------------

for j, neighborhood in enumerate(encoder.classes_):
    dataset['Neighbourhood ({})'.format(neighborhood)] = Neighbourhood_onehot[:, j]

del dataset['Neighbourhood']
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106670 entries, 0 to 106669
Data columns (total 93 columns):
Gender                                         106670 non-null int64
Age                                            106670 non-null int64
Scholarship                                    106670 non-null int64
Hipertension                                   106670 non-null int64
Diabetes                                       106670 non-null int64
Alcoholism                                     106670 non-null int64
Handcap                                        106670 non-null int64
SMS_received                                   106670 non-null int64
No-show                                        106670 non-null int64
AwaitingTime                                   106670 non-null int64
AppointmentDate_day                            106670 non-null int64
AppointmentDate_month                          106670 non-null int64
Neighbourhood (AEROPORTO)                      106670 non-nul
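The shape of the one-hot matrix is easier to see on a tiny input. The sketch below runs **LabelBinarizer** on four made-up labels drawn from the neighbourhood names in the preview above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

neighbourhoods = ['CONQUISTA', 'SANTA MARTHA', 'TABUAZEIRO', 'CONQUISTA']

encoder = LabelBinarizer()
onehot = encoder.fit_transform(neighbourhoods)

# classes_ holds the sorted labels; column j of onehot corresponds to classes_[j].
print(encoder.classes_)
print(onehot)
```

Each row contains a single 1 in the column of its label. Note that with only two distinct labels, LabelBinarizer returns a single column instead of two.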

### Statistics

Now all columns are of integer type.

Check statistics for the non-Neighbourhood columns.

In [30]:
dataset.loc[:, [column_name for column_name in dataset.columns
    if not column_name.startswith('Neighbourhood')]].describe()

Unnamed: 0,Gender,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,AwaitingTime,AppointmentDate_day,AppointmentDate_month
count,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0,106670.0
mean,0.655161,38.248317,0.101191,0.203019,0.074088,0.031452,0.022874,0.322162,0.797553,8.867067,12.469195,5.211278
std,0.475318,22.42875,0.301582,0.402248,0.261916,0.174537,0.163863,0.467307,0.401825,14.165951,9.046689,0.473975
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-7.0,1.0,4.0
25%,0.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,5.0,5.0
50%,1.0,38.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,10.0,5.0
75%,1.0,56.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,13.0,19.0,5.0
max,1.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,85.0,31.0,6.0


Check class balance.

The target is **No-show**. Since no-shows should usually be the minority, it is very likely that this dataset
is imbalanced.

To quantify the balance between positive and negative samples,
we compute the ratio of no-shows in the entire dataset.

In [33]:
# Complete code below this comment  (Question #E6011)
# ----------------------------------
num_noshow = np.sum(dataset['No-show']==0) # find out total number of no-show cases
print('noshow ratio:', num_noshow, '/', len(dataset), '=', num_noshow / len(dataset))






noshow ratio: 21595 / 106670 = 0.202446798538


For the sake of fairness, we will resample no-show cases to rebalance the dataset.

First, we calculate the upsample rate that, when multiplied by the number of no-show cases,
would make positive and negative samples appear 50/50.

In [34]:
upsample_rate = (len(dataset) - num_noshow) / num_noshow
print('upsample_rate:', upsample_rate)

upsample_rate: 3.93956934476


Verify this upsample rate by definition.

In [35]:
print(int(num_noshow * upsample_rate), len(dataset) - num_noshow)

85075 85075


Now we resample the dataset. Please upsample the no-show cases, then concatenate them with the original "show-up" cases
to create a new dataset **dataset_resampled**.

In [37]:
# Complete code below this comment  (Question #E6012)
# ----------------------------------
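# Note: this solution downsamples the show-up cases (No-show == 1) to num_noshow rows
# instead of upsampling the no-show cases; either way the classes end up 50/50.
dataset_resampled = pd.concat([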
    dataset[dataset['No-show'] == 1].sample(num_noshow).reset_index(drop=True),
    dataset[dataset['No-show'] == 0]
])
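The prompt above asks for upsampling, whereas the solution cell keeps every no-show row and samples an equal number of show-up rows. For comparison, a minimal sketch of upsampling with replacement on a made-up frame follows; the frame `toy`, its sizes, and the seed are assumptions for illustration.

```python
import pandas as pd

# Toy frame: 8 show-ups (No-show == 1) and 2 no-shows (No-show == 0).
toy = pd.DataFrame({'No-show': [1] * 8 + [0] * 2, 'Age': range(10)})
shows = toy[toy['No-show'] == 1]
noshows = toy[toy['No-show'] == 0]

# Upsample: draw no-show rows with replacement until they match the show-up count.
noshows_upsampled = noshows.sample(n=len(shows), replace=True, random_state=0)
balanced = pd.concat([shows, noshows_upsampled]).reset_index(drop=True)

print(balanced['No-show'].value_counts())
```

Upsampling keeps all the majority rows but repeats minority rows, while downsampling discards majority rows; the outputs in the rest of this notebook come from the downsampled version.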




Shuffle **dataset_resampled**.

In [38]:
# Add code below this comment  (Question #E6013)
# ----------------------------------

dataset_resampled = dataset_resampled.sample(frac=1).reset_index(drop=True)

Verify no-show ratio again.

In [41]:
print('noshow ratio:', np.sum(dataset_resampled['No-show'] == 0) / len(dataset_resampled))

noshow ratio: 0.5


To avoid mixing up **dataset** and **dataset_resampled**,
we replace **dataset** and delete **dataset_resampled**.

In [42]:
dataset = dataset_resampled
del dataset_resampled

Next, you may realize that we have a lot of columns just for the neighbourhood.
The number of these columns is far higher than that of all other features combined.
Since the decision to one-hot encode them was arbitrary anyway,
why not apply PCA to these columns and see whether we can compress them?

In [43]:
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import scale, LabelEncoder, OneHotEncoder, LabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

Select all Neighbourhood columns.

In [44]:
# Complete code below this comment  (Question #E6014)
# ----------------------------------
X_neighbors = np.array(dataset.loc[:, [column_name for column_name in dataset.columns
    if column_name.startswith('Neighbourhood')]])
X_neighbors.shape



(43190, 81)

Apply PCA to **X_neighbors** with 60 principal components to create **X_neighbors_PCA**.

In [45]:
# Add code below this comment  (Question #E6015)
# ----------------------------------
pca = PCA(n_components = 60)
X_neighbors_PCA = pca.fit_transform(X_neighbors)


Check the combined explained variance ratio to make sure that not too much variance is lost.

In [46]:
np.sum(pca.explained_variance_ratio_)

0.96869606988266499
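The number of components can also be chosen from the data rather than fixed in advance. The sketch below, on a synthetic one-hot matrix, fits a full PCA and finds the smallest number of components whose cumulative explained variance reaches a target. The 0.95 target, the synthetic data, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic one-hot matrix: 2000 rows, 20 categories with unequal frequencies.
weights = np.arange(1, 21)
categories = rng.choice(20, size=2000, p=weights / weights.sum())
X_onehot = np.eye(20)[categories]

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_onehot)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print('components needed for 95% variance:', n_needed)
```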

Delete all Neighbourhood columns because we are going to replace them all with their principal components.

In [50]:
# Add code below this comment  (Question #E6016)
# ----------------------------------
dataset.drop(columns=[column_name for column_name in dataset.columns
                      if column_name.startswith('Neighbourhood')], inplace=True)


Attach the principal components of the original Neighbourhood columns to the dataset.

In [51]:
for j in range(X_neighbors_PCA.shape[1]):
    dataset['N{}'.format(j)] = X_neighbors_PCA[:, j]

Take a look at the dataset again.

In [52]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43190 entries, 0 to 43189
Data columns (total 72 columns):
Gender                   43190 non-null int64
Age                      43190 non-null int64
Scholarship              43190 non-null int64
Hipertension             43190 non-null int64
Diabetes                 43190 non-null int64
Alcoholism               43190 non-null int64
Handcap                  43190 non-null int64
SMS_received             43190 non-null int64
No-show                  43190 non-null int64
AwaitingTime             43190 non-null int64
AppointmentDate_day      43190 non-null int64
AppointmentDate_month    43190 non-null int64
N0                       43190 non-null float64
N1                       43190 non-null float64
N2                       43190 non-null float64
N3                       43190 non-null float64
N4                       43190 non-null float64
N5                       43190 non-null float64
N6                       43190 non-null float64
N7 

### Feature selection

Now we create the following arrays for easier access to column names and features.

Study what they are and what they represent.

In [53]:
column_names = np.array(dataset.columns)
original_features = column_names!='No-show'

Create the train/test split **X_train, X_test, y_train, y_test**.

In [54]:
from sklearn.model_selection import train_test_split

# Complete code below this comment  (Question #E6017)
# ----------------------------------
# Pull features and labels
X = scale(np.array(dataset.loc[:, original_features]))  # use what you learned from the cell above this
y = np.array(dataset['No-show'])

# Add code below this comment  (Question #E6018)
# ----------------------------------
# Create training/validation split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.25)
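The cell above scales the whole feature matrix before splitting, so the test rows contribute to the scaling statistics. A common alternative, sketched here on synthetic data, is to fit the scaler on the training split only and reuse it for the test split. The arrays and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Split first, so that the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training columns end up with zero mean and unit variance, while the test columns are only approximately standardized, which is the expected behavior.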


Fit a feature selector.

In [55]:
# Complete code below this comment  (Question #E6019)
# ----------------------------------
selector = SelectKBest(f_classif, k=20)
selector.fit(X_train,y_train)
selected_features = column_names[original_features][selector.get_support()]
print(selected_features)




['Age' 'Scholarship' 'Hipertension' 'Diabetes' 'Handcap' 'SMS_received'
 'AwaitingTime' 'AppointmentDate_month' 'N3' 'N6' 'N8' 'N10' 'N12' 'N15'
 'N22' 'N27' 'N31' 'N39' 'N41' 'N54']
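To build intuition for what the selector keeps, the sketch below applies **SelectKBest** with `f_classif` to synthetic data in which only two of five features depend on the label. The feature names `f0` to `f4`, the data, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)
X_toy = rng.normal(size=(500, 5))

# Make features 0 and 3 informative by shifting them according to the label.
X_toy[:, 0] += 2.0 * y_toy
X_toy[:, 3] -= 1.5 * y_toy

names = np.array(['f0', 'f1', 'f2', 'f3', 'f4'])
toy_selector = SelectKBest(f_classif, k=2).fit(X_toy, y_toy)

# get_support() is a boolean mask over the input columns.
kept = names[toy_selector.get_support()]
print(kept)
```

`f_classif` scores each feature independently with an ANOVA F-test, so it can miss features that are only useful in combination with others.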


## Create a linear SVM classifier

In [56]:
import tensorflow as tf

# Please ignore the configurations below
tf.logging.set_verbosity(tf.logging.ERROR)
import tf_threads
estimator_config = tf.contrib.learn.RunConfig(session_config=
    tf_threads.limit(tf, 4)
)

Declare the selected features as real-valued TensorFlow feature columns.

In [65]:
# Add code below this comment  (Question #E6020)
# -----------------------------------------------
features_columns = [tf.contrib.layers.real_valued_column(i) for i in selected_features]



Create the SVM classifier.

In [68]:
# Add code below this comment  (Question #E6021)
# ----------------------------------
classifier = tf.contrib.learn.SVM('example_id', feature_columns=features_columns, l2_regularization=1.0)


Prepare input_fn() to supply training data.

In [69]:
# Complete code below this comment  (Question #E6022)
# ----------------------------------
def input_fn():
    X_selected = selector.transform(X_train)
    columns = {
        feature_name: tf.constant(np.expand_dims(X_selected[:,i], 1))
            for i,feature_name in enumerate(selected_features)
    }
    columns['example_id'] = tf.constant([str(i+1) for i in range(len(X_selected))])
    labels = tf.constant(y_train)
    return columns, labels

## Train SVM

This may take a few minutes.

In [70]:
%%time

# Add code below this comment  (Question #E6023)
# ----------------------------------
classifier.fit(input_fn=input_fn,steps=30)


CPU times: user 57.6 s, sys: 1.07 s, total: 58.7 s
Wall time: 20.8 s


SVM(params={'feature_columns': [_RealValuedColumn(column_name='Age', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='Scholarship', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='Hipertension', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='Diabetes', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='Handcap', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='SMS_received', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='AwaitingTime', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='AppointmentDate_month', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='N3', dimension=1, default_v
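The `tf.contrib` module used above was removed in TensorFlow 2.x, so these cells only run on TensorFlow 1.x. Where that is not available, a linear SVM can be fitted with scikit-learn's `LinearSVC` instead. This is a different implementation and not the one this exercise is graded on. The sketch below uses synthetic data; the data, the hyperparameters, and the variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=400)
X_toy = rng.normal(size=(400, 4))

# Shift one feature by the label so that the classes are roughly linearly separable.
X_toy[:, 0] += 3.0 * y_toy

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# C is the inverse regularization strength of the linear SVM.
model = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, model.predict(X_te))

print('test accuracy:', test_accuracy)
```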

## Evaluation

First, create a predict_fn() to supply data from the test set.

In [71]:
def predict_fn():
    X_selected = selector.transform(X_test)
    columns = {
        feature_name: tf.constant(np.expand_dims(X_selected[:, i], 1))
            for i,feature_name in enumerate(selected_features)
    }
    columns['example_id'] = tf.constant([str(i+1) for i in range(len(X_selected))])
    return columns

Make predictions.

In [72]:
y_pred = classifier.predict(input_fn=predict_fn)
y_pred = list(map(lambda i: i['classes'], y_pred))

Measure accuracy and create a confusion matrix.

In [73]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Add code below this comment  (Question #E6024)
# ----------------------------------
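# Note: input_fn supplies the training split, so this loss and accuracy are training metrics;
# accuracy_score(y_test, y_pred) would give the accuracy on the test split.
metrics = classifier.evaluate(input_fn=input_fn, steps=1)
print("Loss", metrics['loss'], "\nAccuracy", metrics['accuracy'])
confusion_matrix(y_test, y_pred)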


Loss 0.923955 
Accuracy 0.581193


array([[3348, 2081],
       [2490, 2879]])
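The confusion matrix above is computed on the test split, with rows for the true classes and columns for the predicted classes (scikit-learn's convention). Test accuracy and per-class precision and recall follow from it directly, as sketched below; the variable names are assumptions.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = np.array([[3348, 2081],
               [2490, 2879]])

# Accuracy: correctly classified samples (the diagonal) over all samples.
test_accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal over row sums; precision per class: diagonal over column sums.
recall_per_class = np.diag(cm) / cm.sum(axis=1)
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print('accuracy:', test_accuracy)
print('recall:', recall_per_class)
print('precision:', precision_per_class)
```

The resulting test accuracy is about 0.577, slightly below the accuracy printed by `classifier.evaluate`, which was measured on the training data supplied by `input_fn`.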

# Save your notebook!