# Convolutional Neural Network for Disk Hernia and Spondylolisthesis Classification

**Data Set Information:**
Biomedical data set built by Dr. Henrique da Mota during a medical residency in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The data have been organized in two different but related classification tasks. The first task consists of classifying patients as belonging to one of three categories: Normal (100 patients), Disk Hernia (60 patients) or Spondylolisthesis (150 patients). For the second task, the categories Disk Hernia and Spondylolisthesis were merged into a single category labelled 'abnormal'. Thus, the second task consists of classifying patients as belonging to one of two categories: Normal (100 patients) or Abnormal (210 patients). Files for use within the WEKA environment are also provided.

**Attribute Information:**
Each patient is represented in the data set by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (in this order): pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius and grade of spondylolisthesis. The following convention is used for the class labels: DH (Disk Hernia), SL (Spondylolisthesis), NO (Normal) and AB (Abnormal).

**Dataset Source:**
* Guilherme de Alencar Barreto (guilherme '@' deti.ufc.br) & Ajalmar Rêgo da Rocha Neto (ajalmar '@' ifce.edu.br), Department of Teleinformatics Engineering, Federal University of Ceará, Fortaleza, Ceará, Brazil.
* Henrique Antonio Fonseca da Mota Filho (hdamota '@' gmail.com), Hospital Monte Klinikum, Fortaleza, Ceará, Brazil.
* Kaggle Link - https://www.kaggle.com/caesarlupum/vertebralcolumndataset

In [None]:
import os
import glob
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#SKLearn libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

#Sagemaker libraries
import boto3
import sagemaker
from sagemaker import get_execution_role

#PyTorch libraries
from sagemaker.pytorch import PyTorch
from sagemaker.pytorch import PyTorchModel

# Reading in Data

Let's read in the CSV file and take a look at the first few entries and the class distribution.

In [None]:
#Read in the CSV file
data_file = 'data/biomechanical_data.csv'
data = pd.read_csv(data_file, header=0, delimiter=",") 

#Check out the first few entries
data.head(10)

#Check out the distribution
sns.countplot(x='class', data=data)

#Print out some stats about the data
print('Number of Patients: ', data.shape[0])
counts_per_class=data.groupby(['class']).size()
display(counts_per_class)

# Preprocessing

Here we split the dataset into a training set and a test set. We then use `MinMaxScaler()` to normalize the data, rescaling each numeric column to a common range. The scaler is fitted on the training set only and then applied to the test set. Finally, we convert the classes into a numeric index:
* Normal - 0 
* Hernia - 1
* Spondylolisthesis - 2
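
To make the scaling step concrete, here is a minimal pure-Python sketch of the min-max formula, x' = (x - min) / (max - min), computed per column. It mirrors the default 0-to-1 range of `MinMaxScaler` but is not scikit-learn's implementation; the helper names `fit_min_max` and `apply_min_max` and the toy values are made up for illustration. The minimum and maximum come from the training rows only and are reused for the test rows, so scaled test values can fall outside [0, 1].

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs computed from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, stats):
    """Scale each column with x' = (x - min) / (max - min)."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, stats):
            span = hi - lo
            # A constant column has zero span; this sketch maps it to 0.0
            # to avoid dividing by zero (an assumption of the sketch).
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy values (hypothetical, not taken from the data set): two features per row.
train_rows = [[40.0, 10.0], [60.0, 20.0], [80.0, 30.0]]
test_rows = [[50.0, 35.0]]

stats = fit_min_max(train_rows)  # statistics come from the training rows only
train_scaled = apply_min_max(train_rows, stats)
test_scaled = apply_min_max(test_rows, stats)

print(train_scaled)
print(test_scaled)
```

The second test feature (35.0) lies above the training maximum (30.0), which is why its scaled value exceeds 1.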

In [None]:
#Separate the features (all columns but the last) from the class labels (last column)
features = data[data.columns[:-1]]
labels = data[data.columns[-1]]

#Split into train and test
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, 
                                                                            test_size=0.3, 
                                                                            stratify=labels,
                                                                            random_state=69)

#Check the size of the datasets
print('Size of training set: ', len(train_features))
print('Size of test set: ', len(test_features))

In [None]:
#Normalize Data
scaler = MinMaxScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)

In [None]:
#Convert the class names to a numerical index
def class2index(labels):
    class2idx = {
        'Normal':0,
        'Hernia':1,
        'Spondylolisthesis':2
    }

    #Map each class name to its index; names missing from the dict become NaN
    return labels.map(class2idx)

train_features, train_labels = np.array(train_features), np.array(class2index(train_labels))
test_features, test_labels = np.array(test_features), np.array(class2index(test_labels))

In [None]:
#Count the number of instances of each class 
def get_class_distribution(obj):
    count_dict = {
        "Normal": 0,
        "Hernia": 0,
        "Spondylolisthesis": 0,
    }
    
    for i in obj:
        if i == 0: 
            count_dict['Normal'] += 1
        elif i == 1: 
            count_dict['Hernia'] += 1
        elif i == 2: 
            count_dict['Spondylolisthesis'] += 1          
        else:
            print("Check classes.")
            
    return count_dict

#Take a look at the distribution to make sure the training and test sets aren't skewed 
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(25,7))

# Train
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(train_labels)]).melt(),
            x = "variable", y="value", hue="variable",
            ax=axes[0]).set_title('Class Distribution in Train Set')

# Test
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(test_labels)]).melt(),
            x = "variable", y="value", hue="variable",
            ax=axes[1]).set_title('Class Distribution in Test Set')

# Create CSVs and Load Data to S3
We will create two files, `train.csv` and `test.csv`, each containing the class labels and features for the biomechanical data.

Save the train and test .csv files locally. You can then upload them to S3 with `sagemaker_session.upload_data`, pointing it at the directory where the files are saved.

In [None]:
def make_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1).to_csv(os.path.join(data_dir, filename), header=False, index=False)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

In [None]:
#Create CSV files for the training and test datasets
data_dir = 'training-data'

make_csv(train_features, train_labels, filename='train.csv', data_dir=data_dir)
make_csv(test_features, test_labels, filename='test.csv', data_dir=data_dir)

#Create Sagemaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

#Use the session's default S3 bucket (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

#Set prefix, a descriptive name for a directory  
prefix = 'biomechanical-data'

#Upload all data to S3 (note: `data` now holds the S3 URI, not the DataFrame read earlier)
data = sagemaker_session.upload_data(path = data_dir, bucket = bucket, key_prefix = prefix)

In [None]:
#Confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

# Create a Model
When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python script that is executed when the model is trained; here it is the `train.py` script specified below. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments.
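
As a rough sketch of how such an entry point typically begins: SageMaker passes each entry in `hyperparameters` to the script as a command-line argument (for example `--epochs 20`) and exposes the model output directory and the `train` input channel through the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables. The hyperparameter names below match the ones used in this notebook, but the function name `parse_args`, the default values and the local fallback paths are assumptions; the actual `train.py` is not shown here and may differ.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters: names mirror the keys passed to the estimator.
    parser.add_argument('--input_features', type=int, default=6)
    parser.add_argument('--output_dim', type=int, default=3)
    parser.add_argument('--epochs', type=int, default=20)

    # Paths set by SageMaker inside the training container; the fallbacks
    # are hypothetical values for running the script outside SageMaker.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'training-data'))

    return parser.parse_args(argv)


# Example: simulate the arguments SageMaker would pass for a 5-epoch run.
args = parse_args(['--epochs', '5'])
print(args.epochs, args.input_features, args.output_dim)
```

Passing an explicit argument list keeps the sketch runnable in a notebook cell; a real script would call `parse_args()` with no arguments so that the command line is read.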

In [136]:
EPOCHS = 20
BATCH_SIZE = 16         # not passed to the estimator below
LEARNING_RATE = 0.0007  # not passed to the estimator below
NUM_FEATURES = 6
NUM_CLASSES = 3


estimator = PyTorch(
    entry_point='train.py',
    source_dir='pytorch',
    role=role,
    framework_version='1.0',
    py_version='py3',
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    hyperparameters={
        'input_features': NUM_FEATURES,
        'output_dim': NUM_CLASSES,
        'epochs': EPOCHS
    }
)

# Train and Deploy the Model
Train the estimator on the training data stored in S3. This creates a training job that we can monitor in the SageMaker console.

In [137]:
%%time

# Train estimator on S3 training data
estimator.fit({'train':data})

2021-01-01 02:47:22 Starting - Starting the training job...
2021-01-01 02:47:46 Starting - Launching requested ML instancesProfilerReport-1609469241: InProgress
......
2021-01-01 02:48:46 Starting - Preparing the instances for training......
2021-01-01 02:49:49 Downloading - Downloading input data
2021-01-01 02:49:49 Training - Downloading the training image...
2021-01-01 02:50:09 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-01-01 02:50:10,644 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2021-01-01 02:50:10,647 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-01-01 02:50:10,659 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-01-01 02:50:13,741 sagemaker_pytorch_container.training INF

In [138]:
%%time

model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    framework_version='1.0',
    py_version='py3',
    entry_point='predict.py',
    source_dir='pytorch'
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

-----------------!
CPU times: user 419 ms, sys: 13.9 ms, total: 433 ms
Wall time: 8min 32s


# Evaluating the Model
Once the model is deployed, we can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the .csv file.

In [139]:
#Read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

#Labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

In [145]:
#Generate predictions for the test features
test_y_preds = predictor.predict(test_x)

#Test that the model returns one prediction per test sample
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [148]:
import torch

#Take the index of the highest score in each row as the predicted class
_, y_pred_tags = torch.max(torch.from_numpy(test_y_preds), dim = 1)
print('\nPredicted class labels: ')
print(y_pred_tags)
print('\nTrue class labels: ')
print(test_y.values)

#Calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, y_pred_tags)

print('\nAccuracy: ')
print(accuracy)


Predicted class labels: 
tensor([0, 0, 2, 0, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 0, 1, 2, 0, 0, 0, 2,
        2, 0, 0, 1, 2, 0, 1, 2, 1, 2, 1, 2, 0, 2, 0, 1, 2, 2, 2, 1, 2, 2, 1, 2,
        0, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 0, 1, 0, 2, 2, 2, 2, 1, 2, 1,
        1, 2, 2, 2, 0, 1, 0, 0, 2, 0, 2, 0, 1, 2, 0, 1, 0, 2, 1, 2, 2])

True class labels: 
[0 0 2 0 1 2 1 2 2 2 2 2 1 2 2 2 1 0 1 2 0 0 0 2 2 1 0 1 2 0 1 2 1 2 1 2 0
 0 0 1 2 2 2 0 2 2 1 0 0 2 2 2 0 2 2 0 2 2 1 1 2 0 0 1 0 2 2 2 2 1 2 1 0 2
 2 2 0 2 0 0 2 0 2 0 0 2 0 0 0 2 1 2 2]

Accuracy: 
0.8817204301075269
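
Overall accuracy can hide weak performance on the smaller classes, so a per-class view is often a useful complement. The sketch below builds a confusion matrix and per-class recall in plain Python. The helper names and the short label lists are hypothetical examples, not the notebook's actual predictions; the class names follow the index mapping used in preprocessing.

```python
CLASS_NAMES = {0: 'Normal', 1: 'Hernia', 2: 'Spondylolisthesis'}


def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    recalls = {}
    for idx, row in enumerate(matrix):
        total = sum(row)
        # A class with no true samples has undefined recall.
        recalls[CLASS_NAMES[idx]] = row[idx] / total if total else float('nan')
    return recalls


# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred, num_classes=3)
recalls = per_class_recall(matrix)
for name, value in recalls.items():
    print(f'{name}: {value:.2f}')
```

With the notebook's own results, one could pass `test_y.tolist()` and `y_pred_tags.tolist()` in place of the toy lists.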


# Clean up Resources
After we're done evaluating our model, delete the model endpoint with a call to `.delete_endpoint()`. Any other resources can be deleted from the AWS console.

In [135]:
predictor.delete_endpoint()

In [None]:
#Note: this removes every object in the bucket, not only the files under `prefix`
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()