# **Introduction**

In this notebook, our objective is to classify the progression of diabetes one year after baseline using the diabetes dataset. Our primary goal is not to maximize classification accuracy but to explore and highlight the differences among several classifiers. The classifiers differ in the features they use from the dataset and in how their class labels are derived, which lets us see how these choices affect performance.

# **Dataset**

In this section, we will prepare the dataset for subsequent classification tasks. The dataset in focus is the Diabetes dataset, comprising information on 442 diabetes patients. Each patient's data consists of 11 features, with column 11 ("Y") being particularly distinctive as it represents a quantitative measure of disease progression one year after baseline. The remaining features are baseline variables, including age, sex, and body metrics.

[read more about the dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)


First, we will need to import some modules to facilitate our work:

In [None]:
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Assume dataset is located in dataset_path, change as needed
dataset_path = '/content/drive/MyDrive/diabetes.csv'
ds = pd.read_csv(dataset_path, sep='\t')

Let's take a look at our data:

In [None]:
ds

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
0,59,2,32.1,101.00,157,93.2,38.0,4.00,4.8598,87,151
1,48,1,21.6,87.00,183,103.2,70.0,3.00,3.8918,69,75
2,72,2,30.5,93.00,156,93.6,41.0,4.00,4.6728,85,141
3,24,1,25.3,84.00,198,131.4,40.0,5.00,4.8903,89,206
4,50,1,23.0,101.00,192,125.4,52.0,4.00,4.2905,80,135
...,...,...,...,...,...,...,...,...,...,...,...
437,60,2,28.2,112.00,185,113.8,42.0,4.00,4.9836,93,178
438,47,2,24.9,75.00,225,166.0,42.0,5.00,4.4427,102,104
439,60,2,24.9,99.67,162,106.6,43.0,3.77,4.1271,95,132
440,36,1,30.0,95.00,201,125.2,42.0,4.79,5.1299,85,220


We proceed by splitting the data into training and testing sets so that we can evaluate the performance of our classifiers in later stages:

In [None]:
def split_data(data, split_size):
  """
  Splits a given dataset into training and testing datasets based on a specified split size.

  Args:
  - data (DataFrame): The dataset to be split.
  - split_size (float): The ratio of the dataset to be allocated for training. Value should be in the range (0, 1).

  Returns:
  - train (DataFrame): The training dataset containing a portion of the original data specified by split_size.
  - test (DataFrame): The testing dataset containing the remaining portion of the original data.

  """
  # Calculate the index for splitting the dataset
  split_index = int(len(data) * split_size)

  # Slice the DataFrame: rows before split_index go to training, the rest to testing
  train = data.iloc[:split_index]
  test = data.iloc[split_index:]
  return train, test


In [None]:
train, test = split_data(ds, 0.8)
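`split_data` takes the first 80% of the rows in file order. If the row order might carry some structure, a shuffled split is a common alternative. The following is a minimal sketch of that idea, not part of the original notebook; `shuffled_split_indices` and its `seed` parameter are hypothetical names introduced here for illustration.

```python
import random


def shuffled_split_indices(n_rows, split_size, seed=0):
    """Return (train_idx, test_idx): a seeded random partition of range(n_rows)."""
    indices = list(range(n_rows))
    # A local Random instance keeps the global random state untouched
    random.Random(seed).shuffle(indices)
    split_index = int(n_rows * split_size)
    return indices[:split_index], indices[split_index:]


# 442 is the number of patients stated in the dataset description
train_idx, test_idx = shuffled_split_indices(442, 0.8)
print(len(train_idx), len(test_idx))

# With the notebook's DataFrame, the indices could be applied as:
#   train, test = ds.iloc[train_idx], ds.iloc[test_idx]
```

Fixing the seed keeps the split reproducible across runs.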

Later on, our aim is to predict the quantitative measure of disease progression one year after baseline ("Y") for new data points.

To accomplish this task, we must first transform the continuous values of "Y" into a categorical variable by binning (discretizing) them. This involves creating a new column called "Class" based on the original values of "Y". The "Class" column has K possible values, each corresponding to a specific range of "Y". This transformation lets us use classification algorithms to predict the class of disease progression for new data points, with the "Class" column serving as the labels for supervised learning.

In [None]:
def calculate_equal_size_divisions(dataframe, column_name, num_divisions):
    """
    Calculate equal-size divisions for a given column in the DataFrame.

    Args:
        dataframe (pandas.DataFrame): Input DataFrame containing numerical data.
        column_name (str): Name of the column in the DataFrame.
        num_divisions (int): Number of divisions to be calculated.

    Returns:
        torch.Tensor: 1-D tensor holding the num_divisions - 1 quantile thresholds.
    """
    # Compute the quantile thresholds at 1/num_divisions, 2/num_divisions, ...
    # (Series.quantile does not require the data to be sorted first)
    divisions = [dataframe[column_name].quantile(q=q / num_divisions) for q in range(1, num_divisions)]

    return torch.tensor(divisions)

We can utilize the function calculate_equal_size_divisions to compute divisions for a specific column, such as deciles or percentiles. This function enables us to partition the data into equal-sized segments based on specified quantiles, facilitating the analysis of the dataset's distribution and characteristics.

In [None]:
deciles = calculate_equal_size_divisions(ds, "Y", 10) # deciles[i] is the (i+1)-th decile: (i+1)*10% of the data lies at or below it
deciles

tensor([ 60.0000,  77.0000,  94.0000, 115.0000, 140.5000, 168.0000, 196.7000,
        232.0000, 265.0000], dtype=torch.float64)

In [None]:
percentiles = calculate_equal_size_divisions(ds, "Y",100)
percentiles[9::10] # obtain deciles from percentiles

tensor([ 60.0000,  77.0000,  94.0000, 115.0000, 140.5000, 168.0000, 196.7000,
        232.0000, 265.0000], dtype=torch.float64)
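The slice `percentiles[9::10]` works because index 9 holds the 10th percentile, index 19 the 20th, and so on. The sketch below checks this on toy data using only the standard library. It assumes that `statistics.quantiles` with `method="inclusive"` is an acceptable stand-in for `Series.quantile`, since both interpolate linearly between data points; the list `y` is made-up data, not the real "Y" column.

```python
import statistics

# Toy stand-in for the Y column (the notebook itself uses ds["Y"])
y = list(range(1, 201))

# method="inclusive" interpolates linearly between data points,
# which is also the default behaviour of pandas' Series.quantile
deciles = statistics.quantiles(y, n=10, method="inclusive")       # 9 cut points
percentiles = statistics.quantiles(y, n=100, method="inclusive")  # 99 cut points

# percentiles[9] is the 10th percentile, percentiles[19] the 20th, and so on
every_tenth = percentiles[9::10]
print(deciles[:3])
print(every_tenth[:3])
```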

Once we have established the desired divisions, we can proceed to compute the classes (labels) for each data point:

In [None]:
def add_class(df, column_name, num_classes):

  """
  Add a new column to the DataFrame holding the class of each entry, based on quantile thresholds.

  Args:
      df (pandas.DataFrame): Input DataFrame containing data.
      column_name (str): Name of the column in the DataFrame.
      num_classes (int): Number of possible classes.

  Returns:
      pandas.DataFrame: A copy of the input DataFrame with an additional 'Class' column.
          Each entry in the 'Class' column is an integer in [0, num_classes - 1], determined by
          where the entry's value in the specified column falls among the quantile thresholds.
  """

  # Calculate the quantile thresholds for the specified column and number of classes
  division_thresholds = calculate_equal_size_divisions(df, column_name, num_classes)
  df_copy = df.copy()

  # Add a new column 'Class' to the copied DataFrame
  # Each entry's class is the index of the first threshold that is >= its value
  data_torch = torch.tensor(df[column_name].values)
  categories = torch.searchsorted(division_thresholds, data_torch)
  df_copy['Class'] = categories
  return df_copy
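To make the threshold lookup concrete, here is a small standard-library sketch of the same rule. With its default arguments, `torch.searchsorted` returns the index of the first threshold that is greater than or equal to each value, which is what `bisect_left` does for a sorted list. The helper `assign_classes` and the toy values are illustrative assumptions, not part of the notebook.

```python
from bisect import bisect_left


def assign_classes(values, thresholds):
    """Map each value to the index of the first threshold that is >= the value."""
    return [bisect_left(thresholds, v) for v in values]


thresholds = [60.0, 77.0, 94.0]         # three cut points give four classes (0..3)
values = [25, 60, 61, 77, 94, 95, 300]  # includes values exactly on a threshold
print(assign_classes(values, thresholds))  # [0, 0, 1, 1, 2, 3, 3]
```

A value exactly equal to a threshold falls into the lower class, and anything above the last threshold gets the highest class index. Note that `add_class` recomputes the thresholds from whichever DataFrame it receives, so the training and test sets are each labelled with their own thresholds.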


In the initial phase of this notebook, the classes will be determined based on the deciles of the Y column:

In [None]:
# Calculate labels using deciles of Y
train_with_labels_by_deciles = add_class(train, "Y", 10) # num_classes = 10 for deciles
test_with_labels_by_deciles = add_class(test, "Y", 10)
train_with_labels_by_deciles


Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y,Class
0,59,2,32.1,101.0,157,93.2,38.0,4.0,4.8598,87,151,5
1,48,1,21.6,87.0,183,103.2,70.0,3.0,3.8918,69,75,1
2,72,2,30.5,93.0,156,93.6,41.0,4.0,4.6728,85,141,4
3,24,1,25.3,84.0,198,131.4,40.0,5.0,4.8903,89,206,7
4,50,1,23.0,101.0,192,125.4,52.0,4.0,4.2905,80,135,4
...,...,...,...,...,...,...,...,...,...,...,...,...
348,57,1,24.5,93.0,186,96.6,71.0,3.0,4.5218,91,148,5
349,49,2,21.0,82.0,119,85.4,23.0,5.0,3.9703,74,88,2
350,41,2,32.0,126.0,198,104.2,49.0,4.0,5.4116,124,243,8
351,25,2,22.6,85.0,130,71.0,48.0,3.0,4.0073,81,71,1


Subsequently, the classes will be determined based on the percentiles of the Y column.

In [None]:
# Calculate labels using percentiles of Y
train_with_labels_by_percentiles = add_class(train, "Y", 100) # num_classes = 100 for percentiles
test_with_labels_by_percentiles = add_class(test, "Y", 100)
train_with_labels_by_percentiles

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y,Class
0,59,2,32.1,101.0,157,93.2,38.0,4.0,4.8598,87,151,55
1,48,1,21.6,87.0,183,103.2,70.0,3.0,3.8918,69,75,17
2,72,2,30.5,93.0,156,93.6,41.0,4.0,4.6728,85,141,49
3,24,1,25.3,84.0,198,131.4,40.0,5.0,4.8903,89,206,75
4,50,1,23.0,101.0,192,125.4,52.0,4.0,4.2905,80,135,47
...,...,...,...,...,...,...,...,...,...,...,...,...
348,57,1,24.5,93.0,186,96.6,71.0,3.0,4.5218,91,148,54
349,49,2,21.0,82.0,119,85.4,23.0,5.0,3.9703,74,88,24
350,41,2,32.0,126.0,198,104.2,49.0,4.0,5.4116,124,243,82
351,25,2,22.6,85.0,130,71.0,48.0,3.0,4.0073,81,71,14


To use our dataset with PyTorch data loaders, we will implement a new class that inherits from torch.utils.data.Dataset:

In [None]:
class MyDataset(Dataset):
    """
    A PyTorch dataset class representing a dataset.

    This class inherits from torch.utils.data.Dataset and provides methods to access the dataset.

    Args:
        data (str or pandas.DataFrame): If a string is provided, it is treated as the path to a CSV file. If a
            DataFrame is provided directly, it is used as the dataset.
        sep (str, optional): The separator to use when reading the CSV file. Default is a tab character.

    Attributes:
        data (pandas.DataFrame): The dataset stored in a DataFrame.
        n_features (int): Number of feature columns, i.e. every column except the last one, which holds the class label.

    Methods:
        __getitem__(idx): Retrieves a sample from the dataset at the specified index.
        __len__(): Returns the length of the dataset.
    """

    def __init__(self, data, sep='\t'):
        """
        Initialize the MyDataset object.

        Args:
            data (str or pandas.DataFrame): If a string is provided, it is treated as the path to a CSV file. If a
                DataFrame is provided directly, it is used as the dataset.
            sep (str, optional): The separator to use when reading the CSV file. Default is a tab character.
        """
        if isinstance(data, str):
            # Read the CSV file
            self.data = pd.read_csv(data, sep=sep)
        else:
            self.data = data

        # Calculate number of features (number of columns that are not class)
        self.n_features = self.data.shape[1] - 1


    def __getitem__(self, idx):
        """
        Retrieve a sample from the dataset at the specified index.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            tuple: A tuple containing input features and the corresponding label.
        """
        return torch.tensor(self.data.iloc[idx, :self.n_features]), torch.tensor(self.data.iloc[idx, self.n_features], dtype=torch.long)

    def __len__(self):
        """
        Return the length of the dataset.

        Returns:
            int: Length of the dataset.
        """
        return len(self.data)


The final step in data preparation involves creating data loaders:

In [None]:
batch_size = 10
train_dataloader_deciles = DataLoader(MyDataset(train_with_labels_by_deciles), batch_size)
test_dataloader_deciles = DataLoader(MyDataset(test_with_labels_by_deciles), batch_size)
train_dataloader_percentiles = DataLoader(MyDataset(train_with_labels_by_percentiles), batch_size)
test_dataloader_percentiles = DataLoader(MyDataset(test_with_labels_by_percentiles), batch_size)

In [None]:
# Get an example batch from one of the training data loaders
iterator = iter(train_dataloader_deciles)
batch = next(iterator)
print("Example batch: ", batch)

Example batch:  [tensor([[ 59.0000,   2.0000,  32.1000, 101.0000, 157.0000,  93.2000,  38.0000,
           4.0000,   4.8598,  87.0000, 151.0000],
        [ 48.0000,   1.0000,  21.6000,  87.0000, 183.0000, 103.2000,  70.0000,
           3.0000,   3.8918,  69.0000,  75.0000],
        [ 72.0000,   2.0000,  30.5000,  93.0000, 156.0000,  93.6000,  41.0000,
           4.0000,   4.6728,  85.0000, 141.0000],
        [ 24.0000,   1.0000,  25.3000,  84.0000, 198.0000, 131.4000,  40.0000,
           5.0000,   4.8903,  89.0000, 206.0000],
        [ 50.0000,   1.0000,  23.0000, 101.0000, 192.0000, 125.4000,  52.0000,
           4.0000,   4.2905,  80.0000, 135.0000],
        [ 23.0000,   1.0000,  22.6000,  89.0000, 139.0000,  64.8000,  61.0000,
           2.0000,   4.1897,  68.0000,  97.0000],
        [ 36.0000,   2.0000,  22.0000,  90.0000, 160.0000,  99.6000,  50.0000,
           3.0000,   3.9512,  82.0000, 138.0000],
        [ 66.0000,   2.0000,  26.2000, 114.0000, 255.0000, 185.0000,  56.0000,
 

In this setup, each batch is a list of two tensors: one holding the data, with shape (batch_size, num_features) = (10, 11), and one holding the labels, with shape (10,). Note that the 11th feature column is Y itself.

Each individual data point therefore corresponds to a row of the first tensor.
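The batch shapes can be checked on a toy dataset. The sketch below uses `TensorDataset` with made-up values in place of `MyDataset`, so the numbers are not real patient data. It also shows the column slicing that removes the last feature column (the position "Y" occupies in the real data).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real data: 25 rows, 11 feature columns
# (the last column plays the role of Y), plus an integer label per row
features = torch.arange(25 * 11, dtype=torch.float32).reshape(25, 11)
labels = torch.randint(0, 10, (25,))

loader = DataLoader(TensorDataset(features, labels), batch_size=10)

# Collect the (inputs, labels) shapes of every batch
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(shapes)

first_inputs, first_labels = next(iter(loader))
without_y = first_inputs[:, :10]  # keep the first 10 columns, dropping the last one
print(tuple(without_y.shape))
```

With 25 rows and a batch size of 10, the last batch holds only 5 rows, so code that consumes batches should not assume a fixed batch dimension.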

# **Classifiers Setup**

In this section, we'll need some additional imports:

In [None]:
import torch.nn as nn
import torch.optim as optim

To accomplish the classification task, we'll define a simple feedforward classifier with a single hidden layer:

In [None]:
class SimpleClassifier(nn.Module):
    """
    A simple feedforward neural network classifier.

    This class inherits from torch.nn.Module and defines a simple classifier model.

    Args:
        input_size (int): The size of the input features.
        hidden_size (int): The size of the hidden layer.
        num_classes (int): The number of output classes.

    Attributes:
        fc1 (torch.nn.Linear): The first fully connected layer.
        relu (torch.nn.ReLU): The ReLU activation function.
        fc2 (torch.nn.Linear): The second fully connected layer.

    Methods:
        forward(x): Defines the forward pass of the classifier.
    """

    def __init__(self, input_size, hidden_size, num_classes):
        """
        Initialize the SimpleClassifier object.

        Args:
            input_size (int): The size of the input features.
            hidden_size (int): The size of the hidden layer.
            num_classes (int): The number of output classes.
        """

        super(SimpleClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        """
        Define the forward pass of the classifier.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, input_size).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, num_classes).
        """
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out


The criterion used for all our classifiers will be the conventional cross-entropy loss:

In [None]:
criterion = nn.CrossEntropyLoss()
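`nn.CrossEntropyLoss` takes raw logits and integer class labels, applies log-softmax internally, and by default averages the per-sample losses over the batch. As a rough illustration of the per-sample quantity, the sketch below computes it by hand in plain Python; `cross_entropy` is a hypothetical helper written for this example. It also shows a useful reference point: a model that outputs identical logits for all K classes has a loss of ln K.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one sample."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# A model that outputs identical logits for every class is effectively guessing
for num_classes in (10, 100):
    loss = cross_entropy([0.0] * num_classes, label=0)
    print(num_classes, round(loss, 4))  # the loss equals ln(num_classes)
```

A loss near ln 10 (about 2.30) for the decile classifiers, or ln 100 (about 4.61) for the percentile classifiers, would therefore indicate performance no better than uniform guessing.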

**Classifier #1**

For our initial classifier, we'll employ all 11 features to train the model. We will train it using the dataset whose labels were determined based on decile divisions.

In [None]:
input_size = 11
hidden_size = 500
num_classes = 10  # 10 classes for deciles partition
classifier1 = SimpleClassifier(input_size, hidden_size, num_classes)


**Classifier #2**

For our second classifier, we'll utilize only 10 features (excluding the Y feature) to train the model. We will train it using the dataset whose labels were determined based on decile divisions.

In [None]:
input_size = 10
hidden_size = 500
num_classes = 10  # 10 classes for deciles partition
classifier2 = SimpleClassifier(input_size, hidden_size, num_classes)

**Classifier #3**

For our third classifier, we'll utilize all 11 features to train the model. We will train it using the dataset whose labels were determined based on percentile divisions.

In [None]:
input_size = 11
hidden_size = 500
num_classes = 100  # 100 classes for percentiles partition
classifier3 = SimpleClassifier(input_size, hidden_size, num_classes)

**Classifier #4**

For our fourth and final classifier, we'll utilize only 10 features (excluding the Y feature) to train the model. We will train it using the dataset whose labels were determined based on percentile divisions.

In [None]:
input_size = 10
hidden_size = 500
num_classes = 100  # 100 classes for percentiles partition
classifier4 = SimpleClassifier(input_size, hidden_size, num_classes)
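The four classifiers share one architecture and differ only in `input_size` and `num_classes`. To get a feel for how much this changes the model size, the sketch below counts the weights and biases of the two `nn.Linear` layers with plain arithmetic. `count_parameters` is a hypothetical helper for this illustration and does not inspect the actual model objects.

```python
def count_parameters(input_size, hidden_size, num_classes):
    """Weights plus biases of the two linear layers in SimpleClassifier."""
    fc1 = input_size * hidden_size + hidden_size
    fc2 = hidden_size * num_classes + num_classes
    return fc1 + fc2


# (input_size, hidden_size, num_classes) as configured for classifiers 1-4
configs = {
    "classifier1": (11, 500, 10),
    "classifier2": (10, 500, 10),
    "classifier3": (11, 500, 100),
    "classifier4": (10, 500, 100),
}
for name, cfg in configs.items():
    print(name, count_parameters(*cfg))
```

Dropping the Y column removes only 500 weights, whereas moving from 10 to 100 output classes adds about 45,000 parameters to the output layer.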

In [None]:
# Build a summary table of the four classifiers:
# rows 0-3 correspond to classifiers 1-4
diff = pd.DataFrame({
    'Num features': [11.0, 10.0, 11.0, 10.0],
    'labels Calculation': ['Deciles based', 'Deciles based',
                           'Percentiles based', 'Percentiles based'],
})

print("Summarize classifiers differences:")
diff

Summarize classifiers differences:


Unnamed: 0,Num features,labels Calculation
0,11.0,Deciles based
1,10.0,Deciles based
2,11.0,Percentiles based
3,10.0,Percentiles based


# **Training**

In this section, we will train the classifiers described in the previous sections using their respective datasets.

In [None]:
def train_classifier(classifier, train_loader, input_size, n_epochs):
  """
  Train the classifier model.

  Args:
      classifier: The classifier model to be trained.
      train_loader: DataLoader for the training dataset.
      input_size (int): The input size.
      n_epochs (int): Number of epochs for training.
  """

  # I took this part from previous work of mine, maybe used the web there.
  # Define the optimizer (the loss function is the global `criterion` defined above)
  optimizer = optim.Adam(classifier.parameters(), lr=0.001)

  for epoch in range(n_epochs):
      total_correct = 0
      total_samples = 0

      for batch_idx, (inputs, labels) in enumerate(train_loader):
        inputs = inputs[:, :input_size]  # Keep only the first input_size feature columns

        # Forward pass
        outputs = classifier(inputs.float())
        loss = criterion(outputs, labels)

        # Calculate accuracy
        _, predicted = torch.max(outputs, 1)
        total_samples += labels.size(0)
        total_correct += (predicted == labels).sum().item()

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

      # Every 10 epochs, report the loss of the last batch and the accuracy over the whole epoch
      if epoch % 10 == 9:
        print('Epoch [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
          .format(epoch+1, n_epochs, loss.item(), 100 * total_correct / total_samples))



In [None]:
num_epochs = 30

Next, we'll train each of the classifiers:

In [None]:
print("Train Classifier 1")
train_classifier(classifier1, train_dataloader_deciles, input_size=11,  n_epochs=num_epochs)


Train Classifier 1
Epoch [10/30], Loss: 5.0559, Accuracy: 44.19%
Epoch [20/30], Loss: 2.4509, Accuracy: 50.71%
Epoch [30/30], Loss: 1.1724, Accuracy: 52.41%


In [None]:
print("Train Classifier 2")
train_classifier(classifier2, train_dataloader_deciles, input_size=10,  n_epochs=num_epochs)

Train Classifier 2
Epoch [10/30], Loss: 0.7363, Accuracy: 23.80%
Epoch [20/30], Loss: 0.8453, Accuracy: 20.68%
Epoch [30/30], Loss: 7.2364, Accuracy: 24.93%


In [None]:
print("Train Classifier 3")
train_classifier(classifier3, train_dataloader_percentiles, input_size=11,  n_epochs=num_epochs)

Train Classifier 3
Epoch [10/30], Loss: 3.2620, Accuracy: 13.60%
Epoch [20/30], Loss: 0.3706, Accuracy: 23.23%
Epoch [30/30], Loss: 0.2088, Accuracy: 43.06%


In [None]:
print("Train Classifier 4")
train_classifier(classifier4, train_dataloader_percentiles, input_size=10,  n_epochs=num_epochs)

Train Classifier 4
Epoch [10/30], Loss: 5.5094, Accuracy: 4.82%
Epoch [20/30], Loss: 0.6800, Accuracy: 17.85%
Epoch [30/30], Loss: 0.3662, Accuracy: 32.01%


# **Test Results**

To evaluate the models more thoroughly, we'll assess their performance on unseen data:

In [None]:
def test_model(model, test_loader, input_size):
    """
    Test the trained model on unseen data.

    Args:
        model: The trained model to be evaluated.
        test_loader: DataLoader for the test dataset.
        input_size (int): The desired input size.

    Returns:
        None. The average test loss per batch and the accuracy are printed.
    """

    # I took this part from previous work of mine, maybe used the web there.
    model.eval()  # Set the model to evaluation mode
    test_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs[:, :input_size]  # Slice the inputs to fit the desired input_size
            outputs = model(inputs.float())
            test_loss += criterion(outputs, labels).item()

            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    test_loss /= len(test_loader)
    accuracy = 100 * correct / total
    print('Test Loss: {:.4f}, Accuracy: {:.2f}%'.format(test_loss, accuracy))




In [None]:
# Usage Example:
test_model(classifier1, test_dataloader_deciles, input_size=11 )
test_model(classifier2, test_dataloader_deciles, input_size=10)
test_model(classifier3, test_dataloader_percentiles, input_size=11 )
test_model(classifier4, test_dataloader_percentiles, input_size=10)


Test Loss: 2.3874, Accuracy: 52.81%
Test Loss: 5.4923, Accuracy: 13.48%
Test Loss: 8.4920, Accuracy: 6.74%
Test Loss: 9.2939, Accuracy: 1.12%


# **Discussion**


Let's review again the differences between the classifiers:

In [None]:
diff

Unnamed: 0,Num features,labels Calculation
0,11.0,Deciles based
1,10.0,Deciles based
2,11.0,Percentiles based
3,10.0,Percentiles based


Comparing classifier 1 with classifier 2, where classifier 1 was trained on all features and classifier 2 excluded feature Y, classifier 1 achieved much higher accuracy on the test data (52.81% versus 13.48% in the run shown above; exact figures vary between runs because the network weights are initialized randomly). This strongly suggests that feature Y matters for classifying the data, which makes sense because the class labels were derived directly from its values. The same holds for classifier 3 and classifier 4, where classifier 3 was trained with feature Y and classifier 4 was not: classifier 3 is more accurate than classifier 4 (6.74% versus 1.12%), although both are far less accurate than their decile counterparts. This reaffirms that feature Y provides valuable information for the classification task.

Comparing classifier 1 with classifier 3, where classifier 1's labels are derived from the deciles of the Y column and classifier 3's labels are derived from its percentiles, we observe a significant difference in performance. Classifier 1 chooses among 10 classes, while classifier 3 chooses among 100. As expected, classifier 1 achieves much better results, and the same trend is evident when comparing classifier 2 with classifier 4. We conclude that classifying among 100 classes is a much harder task than classifying among 10: each class covers a narrower range of Y and contains roughly a tenth as many training examples, so the model must discriminate much more finely with less data per class. For reference, random guessing would be right about 10% of the time with deciles and about 1% of the time with percentiles.

**answer for section k:**
The superiority of the first classifier over the second stems from its comprehensive training on all features, including Y. In contrast, the second classifier lacked the crucial Y feature, which directly influenced the determination of class labels.

**answer for section n:**
Using deciles to calculate the labels gave better classification accuracy, which might be a good reason to prefer them over percentile-based labels. However, further investigation of the results is needed to decide between deciles and percentiles, because a small error when classifying percentiles is much less harmful than a small error in deciles. For example, predicting percentile 73 instead of 75 may still be a good assessment of the patient's situation, whereas predicting decile 8 instead of 6 may not be. This suggests that accuracy alone may not give a comprehensive picture of the classifiers' performance. Additionally, deciles are coarse and may not provide sufficiently fine-grained information. In summary, the choice between deciles and percentiles requires further investigation and depends on the specific use case and needs.
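One way to make this point measurable is to score predictions by how far off they are rather than only by whether they match exactly. The sketch below is one possible metric, not something the notebook computes: the mean absolute class distance, expressed as a fraction of the full quantile range. The name `mean_quantile_error` is hypothetical.

```python
def mean_quantile_error(predicted, actual, num_classes):
    """Mean absolute class distance as a fraction of the full quantile range."""
    distances = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(distances) / (len(distances) * num_classes)


# The two mistakes from the example above
print(mean_quantile_error([73], [75], num_classes=100))  # 0.02
print(mean_quantile_error([8], [6], num_classes=10))     # 0.2
```

Under this measure, being off by two percentiles costs a tenth of what being off by two deciles costs, even though both count as plain misclassifications in the accuracy figures.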