# 4.3.3 Supervised Neural Nets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Import the models.
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Import Metrics
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score

# Cardiotocography Data Set
__Abstract:__ This data set consists of measurements from fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians. 

__Source:__ [UCI Machine Learning Repository Cardiotocography Data Set](http://archive.ics.uci.edu/ml/datasets/Cardiotocography#)

__Data Set Information:__ 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P).

__Attribute Information:__
LB - FHR baseline (beats per minute) 
AC - # of accelerations per second 
FM - # of fetal movements per second 
UC - # of uterine contractions per second 
DL - # of light decelerations per second 
DS - # of severe decelerations per second 
DP - # of prolonged decelerations per second 
ASTV - percentage of time with abnormal short term variability 
MSTV - mean value of short term variability 
ALTV - percentage of time with abnormal long term variability 
MLTV - mean value of long term variability 
Width - width of FHR histogram 
Min - minimum of FHR histogram 
Max - Maximum of FHR histogram 
Nmax - # of histogram peaks 
Nzeros - # of histogram zeros 
Mode - histogram mode 
Mean - histogram mean 
Median - histogram median 
Variance - histogram variance 
Tendency - histogram tendency 
CLASS - FHR pattern class code (1 to 10) 
NSP - fetal state class code (N=normal; S=suspect; P=pathologic)

In [2]:
ctg = pd.read_excel('ctg.xls', sheet_name='Raw Data')
ctg.head()

Unnamed: 0,FileName,Date,SegFile,b,e,LBE,LB,AC,FM,UC,...,C,D,E,AD,DE,LD,FS,SUSP,CLASS,NSP
0,,NaT,,,,,,,,,...,,,,,,,,,,
1,Variab10.txt,1996-12-01,CTG0001.txt,240.0,357.0,120.0,120.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9.0,2.0
2,Fmcs_1.txt,1996-05-03,CTG0002.txt,5.0,632.0,132.0,132.0,4.0,0.0,4.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0
3,Fmcs_1.txt,1996-05-03,CTG0003.txt,177.0,779.0,133.0,133.0,2.0,0.0,5.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0
4,Fmcs_1.txt,1996-05-03,CTG0004.txt,411.0,1192.0,134.0,134.0,2.0,0.0,6.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0


In [3]:
ctg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2130 entries, 0 to 2129
Data columns (total 40 columns):
FileName    2126 non-null object
Date        2126 non-null datetime64[ns]
SegFile     2126 non-null object
b           2126 non-null float64
e           2126 non-null float64
LBE         2126 non-null float64
LB          2126 non-null float64
AC          2126 non-null float64
FM          2127 non-null float64
UC          2127 non-null float64
ASTV        2127 non-null float64
MSTV        2127 non-null float64
ALTV        2127 non-null float64
MLTV        2127 non-null float64
DL          2128 non-null float64
DS          2128 non-null float64
DP          2128 non-null float64
DR          2128 non-null float64
Width       2126 non-null float64
Min         2126 non-null float64
Max         2126 non-null float64
Nmax        2126 non-null float64
Nzeros      2126 non-null float64
Mode        2126 non-null float64
Mean        2126 non-null float64
Median      2126 non-null float64
Vari

Most of the columns are numerical; the exceptions are the file information (FileName, SegFile) and Date.  The non-null counts (2126 to 2128 out of 2130 entries) show that a few rows have missing values. Let's check the tail of the data set to see whether they are at the end.

In [4]:
ctg.tail()

Unnamed: 0,FileName,Date,SegFile,b,e,LBE,LB,AC,FM,UC,...,C,D,E,AD,DE,LD,FS,SUSP,CLASS,NSP
2125,S8001045.dsp,1998-06-06,CTG2127.txt,1576.0,3049.0,140.0,140.0,1.0,0.0,9.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0,2.0
2126,S8001045.dsp,1998-06-06,CTG2128.txt,2796.0,3415.0,142.0,142.0,1.0,1.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2127,,NaT,,,,,,,,,...,,,,,,,,,,
2128,,NaT,,,,,,,,,...,,,,,,,,,,
2129,,NaT,,,,,,,564.0,23.0,...,,,,,,,,,,


Yep, there are null rows at the end (and the one at the beginning).  We should drop all of these.
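As a small illustration of why a plain `dropna()` is enough here, the sketch below builds a toy frame (made-up values, not the CTG data) with one fully empty row and one row that holds only a stray value, much like row 2129 above. With the default `how='any'` both rows are dropped, whereas `how='all'` would keep the partly filled one.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 0 is empty, row 3 is only partly filled.
toy = pd.DataFrame({
    "LB": [np.nan, 120.0, 132.0, np.nan],
    "FM": [np.nan, 0.0, 0.0, 564.0],
    "NSP": [np.nan, 2.0, 1.0, np.nan],
})

any_dropped = toy.dropna()           # default how='any': drop rows with at least one NaN
all_dropped = toy.dropna(how="all")  # drop only rows where every value is NaN

print(len(any_dropped), len(all_dropped))
```

Since the partly filled row has no NSP label, the default behavior is the one that fits this data set.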

In [5]:
ctg = ctg.dropna()
print(len(ctg))
ctg.head()

2126


Unnamed: 0,FileName,Date,SegFile,b,e,LBE,LB,AC,FM,UC,...,C,D,E,AD,DE,LD,FS,SUSP,CLASS,NSP
1,Variab10.txt,1996-12-01,CTG0001.txt,240.0,357.0,120.0,120.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9.0,2.0
2,Fmcs_1.txt,1996-05-03,CTG0002.txt,5.0,632.0,132.0,132.0,4.0,0.0,4.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0
3,Fmcs_1.txt,1996-05-03,CTG0003.txt,177.0,779.0,133.0,133.0,2.0,0.0,5.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0
4,Fmcs_1.txt,1996-05-03,CTG0004.txt,411.0,1192.0,134.0,134.0,2.0,0.0,6.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0
5,Fmcs_1.txt,1996-05-03,CTG0005.txt,533.0,1147.0,132.0,132.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0


Great, now let's continue building the model.

## Building a Model - Default Settings

Now, let's see if we can use a multi-layer perceptron ("MLP") to classify the fetal state class code (NSP).

Before we establish the model, we first have to ensure correct typing for our data and do some other cleaning.  As noted before, the file information and date are non-numerical.  Let's drop these.

In [6]:
ctg = ctg.drop(['FileName', 'Date', 'SegFile'], axis=1)

Great. Let's identify our variables and model with the default settings.

In [7]:
# Identify variables
X = ctg.drop('NSP', axis=1)
Y = ctg.NSP

In [8]:
# Establish and fit the model with default settings: a single hidden layer of 100 units.
mlp = MLPClassifier()
mlp.fit(X, Y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

Taking a look at our initial ground truth percentages, for reference.

In [9]:
Y.value_counts()/len(Y)

1.0    0.778457
2.0    0.138758
3.0    0.082785
Name: NSP, dtype: float64
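As a quick check on the shares printed above, the following sketch adds up the two minority classes and reads off the share of the largest class. The "always predict normal" baseline is an illustrative reference point that is not part of the original notebook.

```python
# Class shares as printed above (NSP code -> fraction of rows).
shares = {1.0: 0.778457, 2.0: 0.138758, 3.0: 0.082785}

minority_share = shares[2.0] + shares[3.0]   # suspect + pathologic
majority_baseline = max(shares.values())     # accuracy of always predicting the largest class

print(f"suspect + pathologic: {minority_share:.3f}")
print(f"always-predict-normal accuracy: {majority_baseline:.3f}")
```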

The data set is skewed towards the normal state: suspect and pathologic together make up about 22% of the data.

Let's check the adjusted rand score.  This score will tell us how the prediction relates to the ground truth of the data.
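To make the scale of this score concrete, here is a minimal pure-Python sketch of the adjusted Rand index computed from pair counts. It is an illustrative re-implementation for small inputs, not the code scikit-learn runs; the notebook itself relies on the `adjusted_rand_score` scorer. Identical groupings score 1 (even if the class names are permuted), and groupings that agree no better than chance score around 0 or below.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts (illustrative sketch)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table cells
    rows = Counter(labels_true)                     # row totals
    cols = Counter(labels_pred)                     # column totals

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)     # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                       # degenerate case, e.g. a single class
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [1, 1, 2, 2, 3, 3]
print(adjusted_rand(truth, truth))                # identical labels
print(adjusted_rand(truth, [3, 3, 1, 1, 2, 2]))   # same grouping, different names
print(adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))  # grouping that disagrees with the truth
```

Because the index ignores class names, it is more commonly used for clustering than for classification, which is worth keeping in mind when reading the scores in this notebook.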

In [10]:
# 10-fold cross validation
ars = cross_val_score(mlp, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Scores: {:.5f}(+/- {:.2f})'.format(ars.mean(), ars.std()*2))

Cross Validation Scores: 0.59756(+/- 0.53)


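The mean adjusted rand score is about 0.6. For reference, a score of 1 means perfect agreement with the ground truth and a score near 0 means the labeling is no better than random, so this model is clearly better than chance but far from perfect. The large spread across folds (+/- 0.53) indicates that performance is unstable from fold to fold.  Just to check, let's take a look at the contingency table for the model.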

In [11]:
# Get predicted classes on the full (training) data set.
full_pred = mlp.predict(X)
pd.crosstab(Y, full_pred)

NSP \ col_0,1.0,2.0,3.0
1.0,1522,120,13
2.0,50,240,5
3.0,13,42,121
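One way to read the table above is per-class recall: the diagonal cell divided by its row total. The sketch below does that arithmetic on the counts exactly as printed. These are in-sample predictions, so the numbers are likely optimistic.

```python
# Contingency table from the output above: rows = true NSP, columns = predicted.
table = {
    1.0: [1522, 120, 13],
    2.0: [50, 240, 5],
    3.0: [13, 42, 121],
}

recall = {}
for i, (label, row) in enumerate(table.items()):
    recall[label] = row[i] / sum(row)   # diagonal cell / row total

for label, value in recall.items():
    print(f"NSP {label}: {value:.3f}")
```

Recall drops from the normal class to the pathologic class, which is the pattern one would expect when the rarer classes have fewer training examples.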


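This doesn't look bad: most points fall on the diagonal. Keep in mind, though, that these predictions are made on the same data the model was fit on, and that the normal state is by far the most common class, so the overall picture is dominated by class 1.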

## Model 2 - Logistic Activation
Let's try a different activation function for the hidden layer.  We'll keep all other default settings, except for the activation function.
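For reference, the two activation functions being compared can be written in a few lines. This is a plain-Python sketch of the textbook definitions, evaluated at a few arbitrary inputs; it does not touch the model.

```python
import math

def relu(z):
    """Rectified linear unit: the default hidden-layer activation used in the first model."""
    return max(0.0, z)

def logistic(z):
    """Logistic sigmoid: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.3f}  logistic={logistic(z):.3f}")
```

The logistic function is bounded, while ReLU grows without limit for positive inputs, which may matter when the input features sit on very different scales, as they do here.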

In [12]:
# Establish and fit the model, with logistic activation and otherwise default settings.
mlp2 = MLPClassifier(activation='logistic')
mlp2.fit(X, Y)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [13]:
# 10-fold cross validation
ars2 = cross_val_score(mlp2, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars2.mean(), ars2.std()*2))

Cross Validation Adjusted Rand Scores: 0.80801(+/- 0.39)


This adjusted rand score is higher, which is promising, but the spread in scores is still high, at +/- 0.39. Again, let's take a look at the contingency table.

In [14]:
# Get predicted classes on the full (training) data set.
full_pred2 = mlp2.predict(X)
pd.crosstab(Y, full_pred2)

NSP \ col_0,1.0,2.0,3.0
1.0,1653,2,0
2.0,101,194,0
3.0,25,43,108


This looks much better, with fewer mislabeled points.  Now, let's see if we can adjust some of the other hyperparameters to optimize the model.

## Model 3 - Playing with Size of Layers
Let's keep the logistic activation and increase the size of the hidden layer.

In [15]:
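# Establish and fit the model, with logistic activation and a single hidden layer of 1000 units.
# Note: (1000) is just the int 1000, not a tuple; the explicit one-layer tuple would be (1000,).
mlp3 = MLPClassifier(activation='logistic', hidden_layer_sizes=(1000))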
mlp3.fit(X, Y)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=1000, learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [16]:
# 10-fold cross validation
ars3 = cross_val_score(mlp3, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars3.mean(), ars3.std()*2))

Cross Validation Adjusted Rand Scores: 0.78487(+/- 0.30)


The adjusted rand score is slightly lower than Model 2's (0.785 vs. 0.808) but still high, and the spread across folds dropped from +/- 0.39 to +/- 0.30.

In [17]:
# Get predicted classes on the full (training) data set.
full_pred3 = mlp3.predict(X)
pd.crosstab(Y, full_pred3)

NSP \ col_0,1.0,2.0,3.0
1.0,1633,16,6
2.0,58,233,4
3.0,4,26,146


Again, the labeling is slightly better: 114 mislabeled points, compared with 171 for Model 2.

## Model 4 - Multiple Large Layers
Now, let's try logistic activation with two layers with a size of 1000 each.
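To get a feel for how much larger this network is, the sketch below counts weights and biases for the layer configurations used in this notebook. The input width of 36 is an assumption (the 40 columns reported by `ctg.info()`, minus the 3 dropped columns, minus the NSP target), and `count_parameters` is a helper written for this illustration.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has (fan_in * fan_out) weights plus fan_out biases.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

N_FEATURES = 36  # assumption: 40 columns minus 3 dropped minus the NSP target
N_CLASSES = 3    # NSP takes the values 1, 2 and 3

for layers in [(100,), (1000,), (1000, 1000)]:
    print(layers, count_parameters(N_FEATURES, layers, N_CLASSES))
```

Under these assumptions the two-layer network has roughly a million parameters for about two thousand training rows, so it has far more capacity than the data can constrain.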

In [18]:
# Establish and fit the model, with logistic activation and two hidden layers of 1000 units each.
mlp4 = MLPClassifier(activation='logistic', hidden_layer_sizes=(1000, 1000))
mlp4.fit(X, Y)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 1000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [19]:
# 10-fold cross validation
ars4 = cross_val_score(mlp4, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars4.mean(), ars4.std()*2))

Cross Validation Adjusted Rand Scores: 0.75616(+/- 0.41)


Our adjusted rand scores decreased, and the variance increased.  Let's try some other adjustments in hyperparameters.

In [20]:
# Get predicted classes on the full (training) data set.
full_pred4 = mlp4.predict(X)
pd.crosstab(Y, full_pred4)

NSP \ col_0,1.0,2.0,3.0
1.0,1614,41,0
2.0,40,254,1
3.0,4,42,130


## Model 5 - Alpha
Let's reduce alpha to see how much impact that has on the model.
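In scikit-learn's `MLPClassifier`, `alpha` sets the strength of the L2 regularization term. The sketch below assumes a penalty of the form `0.5 * alpha * sum(w^2) / n_samples` and uses made-up weights, so treat it as an illustration of the scaling rather than a reproduction of the library's internals.

```python
def l2_penalty(weights, alpha, n_samples):
    """L2 penalty in the assumed form 0.5 * alpha * sum(w^2) / n_samples."""
    return 0.5 * alpha * sum(w * w for w in weights) / n_samples

weights = [0.5, -1.2, 0.8, 2.0]   # hypothetical weights, for illustration only
n_samples = 2126                  # number of rows left after dropna()

for alpha in (1e-4, 1e-6, 1e-7):
    print(f"alpha={alpha:g}  penalty={l2_penalty(weights, alpha, n_samples):.3e}")
```

The penalty is linear in `alpha`, so going from the default 1e-4 to 1e-6 makes an already small term 100 times smaller.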

In [21]:
# Establish and fit the model, with logistic activation, two hidden layers of 1000 units each, and alpha reduced to 1e-6.
mlp5 = MLPClassifier(activation='logistic', hidden_layer_sizes=(1000, 1000), alpha=1e-6)
mlp5.fit(X, Y)

MLPClassifier(activation='logistic', alpha=1e-06, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 1000), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [22]:
# 10-fold cross validation
ars5 = cross_val_score(mlp5, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars5.mean(), ars5.std()*2))

Cross Validation Adjusted Rand Scores: 0.73593(+/- 0.33)


The model with a single layer of 1000 still has a higher score than this model. 

In [23]:
# Get predicted classes on the full (training) data set.
full_pred5 = mlp5.predict(X)
pd.crosstab(Y, full_pred5)

NSP \ col_0,1.0,2.0,3.0
1.0,1622,5,28
2.0,70,135,90
3.0,1,4,171


## Model 6 - Smaller Layers, Lower Alpha
Let's go back to the default layer size and use an even smaller value for alpha.

In [24]:
# Establish and fit the model, with logistic activation, the default layer size, and alpha reduced to 1e-7.
mlp6 = MLPClassifier(activation='logistic', alpha=1e-7)
mlp6.fit(X, Y)

MLPClassifier(activation='logistic', alpha=1e-07, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [25]:
# 10-fold cross validation
ars6 = cross_val_score(mlp6, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars6.mean(), ars6.std()*2))

Cross Validation Adjusted Rand Scores: 0.78617(+/- 0.35)


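The adjusted rand score is higher than Model 5's (0.786 vs. 0.736), great! The spread is about the same (+/- 0.35 vs. +/- 0.33).  One caveat: the spread is wide enough that individual folds can score well below the mean. Note that a score near 0, not 0.5, is what would indicate random assignments, so even the low end of this range is better than chance.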

In [26]:
# Get predicted classes on the full (training) data set.
full_pred6 = mlp6.predict(X)
pd.crosstab(Y, full_pred6)

NSP \ col_0,1.0,2.0,3.0
1.0,1649,4,2
2.0,29,263,3
3.0,3,23,150


Satisfied with this model, let's try a gradient boosted classifier model to see how well that performs in comparison.

# Gradient Boosted Classifier Model


In [30]:
#instantiating and fitting the model
gbc = GradientBoostingClassifier()
gbc.fit(X, Y)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [31]:
# 10-fold cross validation
ars7 = cross_val_score(gbc, X, Y, scoring='adjusted_rand_score', cv=10)
print('Cross Validation Adjusted Rand Scores: {:.5f}(+/- {:.2f})'.format(ars7.mean(), ars7.std()*2))

Cross Validation Adjusted Rand Scores: 0.91966(+/- 0.16)
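To put the cross-validation results reported in this notebook side by side, the sketch below ranks the seven models by mean adjusted rand score and prints the mean +/- 2 standard deviation range for each. The numbers are copied from the outputs above; the short model names are labels made up for this table.

```python
# (model, mean adjusted rand score, reported +/- value = 2 * std across folds)
reported = [
    ("MLP 1: relu, (100,)", 0.59756, 0.53),
    ("MLP 2: logistic, (100,)", 0.80801, 0.39),
    ("MLP 3: logistic, 1000", 0.78487, 0.30),
    ("MLP 4: logistic, (1000, 1000)", 0.75616, 0.41),
    ("MLP 5: logistic, (1000, 1000), alpha=1e-6", 0.73593, 0.33),
    ("MLP 6: logistic, (100,), alpha=1e-7", 0.78617, 0.35),
    ("Gradient boosting, defaults", 0.91966, 0.16),
]

ranked = sorted(reported, key=lambda row: row[1], reverse=True)
for name, mean, spread in ranked:
    low, high = mean - spread, min(mean + spread, 1.0)  # the score cannot exceed 1
    print(f"{name:<45} {mean:.3f}  [{low:.3f}, {high:.3f}]")

best_mlp = max(reported[:-1], key=lambda row: row[1])
gap = reported[-1][1] - best_mlp[1]
print(f"gap to best MLP: {gap:.3f}")
```

The MLP ranges overlap heavily with one another, so the differences between the MLP variants are small compared with the fold-to-fold spread.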


Even with the default settings, the gradient boosted classifier model has a much higher adjusted rand score, about 0.11 higher than the best-scoring MLP (Model 2) and about 0.13 higher than Model 6, and it has a much smaller spread in cross validation scores (+/- 0.16).

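Both models are supervised classifiers, so the gap is not a matter of supervised versus unsupervised learning. Is the gradient boosted model simply better suited to this kind of tabular data? Or is the MLP being held back by choices made here, such as the unscaled inputs or the default limit of 200 training iterations?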