# Fetal state classification on cardiotocography

## Step 1: Fetching data and preprocessing

In [2]:
!curl -o CTG.xls https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1703k  100 1703k    0     0  2046k      0 --:--:-- --:--:-- --:--:-- 2044k


In [3]:
import pandas as pd

# Read the "Raw Data" sheet of the workbook into a DataFrame
df = pd.read_excel('CTG.xls', sheet_name="Raw Data")

In [6]:
print(df)

          FileName       Date      SegFile       b  ...   FS  SUSP  CLASS  NSP
0              NaN        NaT          NaN     NaN  ...  NaN   NaN    NaN  NaN
1     Variab10.txt 1996-12-01  CTG0001.txt   240.0  ...  1.0   0.0    9.0  2.0
2       Fmcs_1.txt 1996-05-03  CTG0002.txt     5.0  ...  0.0   0.0    6.0  1.0
3       Fmcs_1.txt 1996-05-03  CTG0003.txt   177.0  ...  0.0   0.0    6.0  1.0
4       Fmcs_1.txt 1996-05-03  CTG0004.txt   411.0  ...  0.0   0.0    6.0  1.0
...            ...        ...          ...     ...  ...  ...   ...    ...  ...
2125  S8001045.dsp 1998-06-06  CTG2127.txt  1576.0  ...  0.0   0.0    5.0  2.0
2126  S8001045.dsp 1998-06-06  CTG2128.txt  2796.0  ...  0.0   0.0    1.0  1.0
2127           NaN        NaT          NaN     NaN  ...  NaN   NaN    NaN  NaN
2128           NaN        NaT          NaN     NaN  ...  NaN   NaN    NaN  NaN
2129           NaN        NaT          NaN     NaN  ...  NaN   NaN    NaN  NaN

[2130 rows x 40 columns]


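Take the columns from index 3 through the third-from-last as input features, and the last column (`NSP`) as the label. Row 0 and the trailing rows are empty and are skipped. Because the end of a slice is exclusive, `1:2126` keeps rows 1 through 2125, so row 2126, which still holds a recording in the printout above, is also left out.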

In [15]:
X = df.iloc[1:2126, 3:-2].values
Y = df.iloc[1:2126, -1].values
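
Hard-coded row bounds are easy to get wrong by one. As a hedged alternative, the sketch below drops unlabelled rows with `dropna` instead of slicing by position. It runs on a small toy frame (`toy`, with made-up values) rather than on the real sheet, so the column layout is only an assumption that mimics the structure printed above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the "Raw Data" sheet (values are invented):
# empty first and last rows, one metadata column, two feature columns,
# and the CLASS / NSP label columns at the end.
toy = pd.DataFrame({
    "FileName": [np.nan, "a.txt", "b.txt", "c.txt", np.nan],
    "LB": [np.nan, 120.0, 132.0, 133.0, np.nan],
    "AC": [np.nan, 0.0, 4.0, 2.0, np.nan],
    "CLASS": [np.nan, 9.0, 6.0, 6.0, np.nan],
    "NSP": [np.nan, 2.0, 1.0, 1.0, np.nan],
})

# Keep only the rows that actually carry a label.
labelled = toy.dropna(subset=["NSP"])

# Features: everything between the metadata column and the two label columns.
X_toy = labelled.iloc[:, 1:-2].values
# Label: the NSP column, selected by name.
Y_toy = labelled["NSP"].values

print(X_toy.shape, Y_toy)
```

On the real sheet the feature slice would start at column 3, as in the cell above, since there are three metadata columns.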

Print the class distribution of the labels. `NSP` encodes the fetal state: 1 = normal, 2 = suspect, 3 = pathologic.

In [18]:
from collections import Counter

print(Counter(Y))

Counter({1.0: 1654, 2.0: 295, 3.0: 176})
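
To make the imbalance easier to read, the counts can be turned into shares. This is a small sketch that reuses the numbers printed above; the names `counts` and `shares` are introduced here for illustration only.

```python
from collections import Counter

# Counts copied from the output above.
counts = Counter({1.0: 1654, 2.0: 295, 3.0: 176})
total = sum(counts.values())

# Fraction of all samples that belongs to each class.
shares = {label: n / total for label, n in counts.items()}
for label, share in sorted(shares.items()):
    print(f"class {label:.0f}: {share:.1%}")
```

Roughly 78% of the samples fall in class 1, about 14% in class 2 and about 8% in class 3.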


## Step 2: Model training and tuning

Split the data into training (80%) and test (20%) sets.

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
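
With imbalanced labels, a plain random split can shift the class shares slightly between the two sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below is an illustration on synthetic stand-ins (`X_demo`, `Y_demo`) that only reproduce the class counts printed earlier, not the real features.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: labels with the class counts printed earlier and a
# dummy single-column feature matrix (an assumption for illustration).
Y_demo = np.repeat([1.0, 2.0, 3.0], [1654, 295, 176])
X_demo = np.arange(len(Y_demo)).reshape(-1, 1)

# stratify=Y_demo keeps the class shares (almost) identical in both parts.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42, stratify=Y_demo
)

overall = {k: v / len(Y_demo) for k, v in Counter(Y_demo).items()}
test_share = {k: v / len(Y_te) for k, v in Counter(Y_te).items()}
print(overall)
print(test_share)
```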

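Use `GridSearchCV` with 5-fold cross-validation to tune `C` and `gamma` of the RBF kernel. The `class_weight='balanced'` option is deliberately left off, because the class imbalance reflects how the fetal states are distributed in reality.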

In [33]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc = SVC(kernel='rbf')
parameters = {'C': (100, 1e3, 1e4, 1e5),
              'gamma': (1e-08, 1e-7, 1e-6, 1e-5)}
grid_search = GridSearchCV(svc, parameters, n_jobs=-1, cv=5)
grid_search.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid={'C': (100, 1000.0, 10000.0, 100000.0),
                         'gamma': (1e-08, 1e-07, 1e-06, 1e-05)})
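
Besides the best combination, a fitted search keeps the cross-validation score of every `C`/`gamma` pair in `cv_results_`. Looking at the full table helps to judge whether the best values sit at the edge of the grid, in which case widening the grid may be worthwhile. The sketch below shows the idea on a small synthetic problem so that it runs quickly. The data, the grid `demo_grid` and the resulting scores are illustrative assumptions and say nothing about the CTG data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic 3-class problem (not the CTG data).
X_demo, Y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# A deliberately tiny grid: 2 x 2 = 4 combinations.
demo_grid = {"C": (1, 10), "gamma": (1e-3, 1e-2)}
demo_search = GridSearchCV(SVC(kernel="rbf"), demo_grid, cv=5)
demo_search.fit(X_demo, Y_demo)

# One row per (C, gamma) pair, sorted so the best combination comes first.
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```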

## Step 3: Model testing and test accuracy

In [34]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 100000.0, 'gamma': 1e-07}
0.9541176470588235


In [35]:
svc_best = grid_search.best_estimator_
accuracy = svc_best.score(X_test, Y_test)
print(f'The accuracy is: {accuracy*100:.1f}%')

The accuracy is: 95.5%
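
On imbalanced data, accuracy is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of a hypothetical model that always predicts the most frequent class, using the test-set class sizes from the support column of the classification report in this step.

```python
# Test-set class sizes, taken from the support column of the classification report.
support = {1.0: 324, 2.0: 65, 3.0: 36}
n_test = sum(support.values())

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / n_test
print(f"Majority-class baseline: {baseline_accuracy:.1%}")
```

The baseline comes out at about 76.2%, so the tuned SVM's 95.5% is a clear improvement over always guessing class 1.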


Print per-class precision, recall and F1 score with `classification_report`.

In [36]:
from sklearn.metrics import classification_report

prediction = svc_best.predict(X_test)
report = classification_report(Y_test, prediction)
print(report)

              precision    recall  f1-score   support

         1.0       0.96      0.98      0.97       324
         2.0       0.89      0.91      0.90        65
         3.0       1.00      0.78      0.88        36

    accuracy                           0.96       425
   macro avg       0.95      0.89      0.92       425
weighted avg       0.96      0.96      0.95       425
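
The macro averages in the report are plain unweighted means over the three classes. The sketch below recomputes them from the per-class rows above and also estimates how many class 3 samples were detected. Because the inputs are already rounded to two decimals, the results are approximations. For the same reason, support-weighted averages recomputed this way can differ from the report in the last digit.

```python
# Per-class figures copied from the report above (rounded to two decimals).
report_rows = {
    1.0: {"precision": 0.96, "recall": 0.98, "f1": 0.97, "support": 324},
    2.0: {"precision": 0.89, "recall": 0.91, "f1": 0.90, "support": 65},
    3.0: {"precision": 1.00, "recall": 0.78, "f1": 0.88, "support": 36},
}

# Macro average: unweighted mean of each metric over the classes.
macro = {
    m: sum(row[m] for row in report_rows.values()) / len(report_rows)
    for m in ("precision", "recall", "f1")
}
print({m: round(v, 2) for m, v in macro.items()})

# Recall times support approximates how many class 3 samples were found.
found = round(report_rows[3.0]["recall"] * report_rows[3.0]["support"])
print(f"class 3: about {found} of {report_rows[3.0]['support']} detected")
```

The recall of 0.78 for class 3 corresponds to roughly 28 of the 36 pathologic cases being detected, while the precision of 1.00 means no other sample was labelled pathologic by mistake.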

