<h1><center> 
    Enhancing Glioma Grading: Integrating Genomic and Clinical Features with Deep Learning
</center></h1>

<h2>Course: CptS 534</h2>

##### Jingjing Nie: 11742013, School of Electrical Engineering and Computer Science
##### Wooyoung Kim: 11808206, Department of Mathematics and Statistics
##### Xinyue Wu: 11809269, Department of Sociology

### 1. Download the dataset from the UCI Machine Learning Repository
###### (See details at https://archive.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset)

In [36]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
glioma_grading_clinical_and_mutation_features = fetch_ucirepo(id=759) 
  
# data (as pandas dataframes) 
X = glioma_grading_clinical_and_mutation_features.data.features 
y = glioma_grading_clinical_and_mutation_features.data.targets 
  
# metadata 
print(glioma_grading_clinical_and_mutation_features.metadata) 
  
# variable information 
print(glioma_grading_clinical_and_mutation_features.variables) 


{'uci_id': 759, 'name': 'Glioma Grading Clinical and Mutation Features', 'repository_url': 'https://archive.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/759/data.csv', 'abstract': 'Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on the histological/imaging criteria. Clinical and molecular/mutation factors are also very crucial for the grading process. Molecular tests are expensive to help accurately diagnose glioma patients.    In this dataset, the most frequently mutated 20 genes and 3 clinical features are considered from TCGA-LGG and TCGA-GBM brain glioma projects.  The prediction task is to determine whether a patient is LGG or GBM with a given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for the glioma grading 

### 2. Import Python libraries

In [8]:
import pandas as pd
import numpy as np
import torch 
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import Dataset
import torch.nn.functional as F

### 3. Split the dataset

In [37]:
# Convert to NumPy arrays and encode the race column (index 2) as integer codes
X_numpy = X.to_numpy()
y_numpy = y.to_numpy()

my_dict = {'white': 0, 'black or african american': 1, 'asian': 2, 'american indian or alaska native': 3}
# Look each race label up in my_dict; labels missing from the mapping are left unchanged
X_numpy[:, 2] = [my_dict.get(race, race) for race in X_numpy[:, 2]]
X_numpy = X_numpy.astype(float)

# Convert numpy arrays to PyTorch tensors
X_tensor = torch.tensor(X_numpy, dtype=torch.float)
y_tensor = torch.tensor(y_numpy, dtype=torch.long)
y_tensor = y_tensor.squeeze()
print(y_tensor.shape)
# Create a TensorDataset
dataset = TensorDataset(X_tensor, y_tensor)
# Hyper-parameters 
input_size = 23
hidden_size = 10
num_classes = 2
learning_rate = 0.001
num_epochs = 10
batch_size = 100
# Define the sizes for training and test datasets
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

# Split the dataset
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)
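# Use separate loop names so the DataFrames X and y are not overwritten
print("Training data batches:")
for batch_X, batch_y in train_loader:
    print(batch_X.shape, batch_y.shape)

print("\nTest data batches:")
for batch_X, batch_y in test_loader:
    print(batch_X.shape, batch_y.shape)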

torch.Size([839])
Training data batches:
torch.Size([100, 23]) torch.Size([100])
torch.Size([100, 23]) torch.Size([100])
torch.Size([100, 23]) torch.Size([100])
torch.Size([100, 23]) torch.Size([100])
torch.Size([100, 23]) torch.Size([100])
torch.Size([100, 23]) torch.Size([100])
torch.Size([71, 23]) torch.Size([71])

Test data batches:
torch.Size([100, 23]) torch.Size([100])
torch.Size([68, 23]) torch.Size([68])
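
The split above uses `random_split` without a fixed seed, so the train/test partition changes on every run. The sketch below shows one way to make the 671/168 split repeatable and to standardize the features with training-set statistics only. It runs on synthetic data, and the names `X_demo`, `y_demo`, `split_gen`, and `train_scaled` are hypothetical. It also assumes that at least one feature (for example, age at diagnosis) sits on a much larger scale than the 0/1 mutation flags, which may or may not matter for this dataset.

```python
import torch
from torch.utils.data import TensorDataset, Subset, random_split

# Synthetic stand-in for the 839 x 23 feature matrix (assumption: not the real data)
torch.manual_seed(0)
X_demo = torch.rand(839, 23)
X_demo[:, 1] = X_demo[:, 1] * 70 + 15  # mimic one column on a much larger scale
y_demo = torch.randint(0, 2, (839,))
demo_dataset = TensorDataset(X_demo, y_demo)

train_size = int(0.8 * len(demo_dataset))
test_size = len(demo_dataset) - train_size

# A seeded generator makes random_split return the same partition on every run
split_gen = torch.Generator().manual_seed(42)
train_ds, test_ds = random_split(
    demo_dataset, [train_size, test_size], generator=split_gen
)

# Compute mean and standard deviation from the training rows only,
# then apply the same transform to all rows to avoid leaking test statistics
train_X = X_demo[train_ds.indices]
mean = train_X.mean(dim=0)
std = train_X.std(dim=0).clamp_min(1e-8)  # guard against constant columns
X_scaled = (X_demo - mean) / std

# Rebuild the train/test subsets on the scaled features with the same indices
scaled_dataset = TensorDataset(X_scaled, y_demo)
train_scaled = Subset(scaled_dataset, train_ds.indices)
test_scaled = Subset(scaled_dataset, test_ds.indices)

print(len(train_scaled), len(test_scaled))
```

The scaled subsets can be passed to `DataLoader` in the same way as `train_dataset` and `test_dataset`.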


In [40]:

# Device configuration: check if there is a configured GPU available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Define a model using class NeuralNet()
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

# Define loss function and optimization algorithm (optimizer)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  
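
Before training, it can help to confirm that the network accepts a batch of 23 features and returns one logit per class. The sketch below repeats the `NeuralNet` definition so that it runs on its own; `model_check` and `dummy_batch` are hypothetical names, and the random input is not real patient data. Because `nn.CrossEntropyLoss` expects raw logits, `forward` does not apply a softmax; the softmax here is only used to read the outputs as probabilities.

```python
import torch
import torch.nn as nn

# Same architecture as above, repeated so this check is self-contained
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model_check = NeuralNet(input_size=23, hidden_size=10, num_classes=2)

# A random batch of 4 samples with 23 features each (illustrative only)
dummy_batch = torch.randn(4, 23)
with torch.no_grad():
    logits = model_check(dummy_batch)

# Convert logits to class probabilities for inspection
probs = torch.softmax(logits, dim=1)

# Trainable parameters: (23*10 + 10) for fc1 plus (10*2 + 2) for fc2
n_params = sum(p.numel() for p in model_check.parameters())

print(logits.shape, n_params)
```

With these sizes the model has 262 trainable parameters, which is small relative to the 671 training samples.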


In [41]:
# Train the model
total_step = len(train_loader)
test_acc_list, train_acc_list = [], []
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()  # Zero the parameter gradients
        outputs = model(inputs)  # Forward pass
        loss = criterion(outputs, labels)  # Compute the loss
        loss.backward()  # Backward pass
        optimizer.step()  # Optimize
        

    # Print the loss of the last training batch in this epoch (not an epoch average)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

    # Evaluation on Test Set
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    print(f"Accuracy on test set: {(100 * correct / total):.2f}%")
    test_acc_list.append(100 * correct / total)

    # Evaluation on the training set (same procedure as for the test set)
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the training inputs: {} %'.format(100 * correct / total))
    train_acc_list.append(100 * correct / total)

Epoch [1/10], Loss: 1.1227
Accuracy on test set: 60.12%
Accuracy of the network on the training inputs: 57.52608047690015 %
Epoch [2/10], Loss: 0.7509
Accuracy on test set: 60.12%
Accuracy of the network on the training inputs: 57.52608047690015 %
Epoch [3/10], Loss: 0.6732
Accuracy on test set: 57.74%
Accuracy of the network on the training inputs: 57.67511177347243 %
Epoch [4/10], Loss: 0.6794
Accuracy on test set: 42.26%
Accuracy of the network on the training inputs: 45.007451564828614 %
Epoch [5/10], Loss: 0.6380
Accuracy on test set: 42.86%
Accuracy of the network on the training inputs: 45.30551415797317 %
Epoch [6/10], Loss: 0.6773
Accuracy on test set: 49.40%
Accuracy of the network on the training inputs: 52.60804769001491 %
Epoch [7/10], Loss: 0.6721
Accuracy on test set: 70.83%
Accuracy of the network on the training inputs: 68.4053651266766 %
Epoch [8/10], Loss: 0.6716
Accuracy on test set: 71.43%
Accuracy of the network on the training inputs: 70.04470938897168 %
Epoch [9