# CS 4774 Final Project
### Instructor: Professor Yangfeng Ji
### Team members: David Da Lian, Wilson Zheng, Jianing Cai, Zhengguang Wang
### Topic Chosen: Brain Tumor Classification

## Section 1: Data Preprocessing
In this section, we first build a custom Python class, ReadImages, which implements methods to read and preprocess the data and to perform the train/validation/test split. \
We use Image.convert() and Image.resize() to convert each image to grayscale and resize it to (380,360). From there, we use PCA to reduce each flattened image to 128 components.
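
As a minimal sketch of the preprocessing steps described above, the snippet below runs the same grayscale conversion, resize, flatten, and PCA sequence on a few synthetic images. The random images, the image count of 20, the helper name `preprocess`, and `n_components=8` are assumptions chosen so the example runs quickly; they are not the project's data or settings. Note that `Image.resize` takes `(width, height)`, so each resulting NumPy array has shape `(360, 380)` before flattening.

```python
import numpy as np
from PIL import Image
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def preprocess(img):
    """Convert a PIL image to grayscale, resize it, and flatten it to 1-D."""
    gray = img.convert("L")             # single 8-bit channel
    resized = gray.resize((380, 360))   # PIL expects (width, height)
    return np.asarray(resized).reshape(-1)

# Hypothetical stand-ins for MRI scans: 20 random RGB images of varying size
demo_images = [
    Image.fromarray(rng.integers(0, 256, size=(200 + i, 300 + i, 3), dtype=np.uint8))
    for i in range(20)
]

# One row per image, 360 * 380 = 136800 pixel features per row
X_flat = np.stack([preprocess(img) for img in demo_images])

# Project the pixel features onto a handful of principal components
X_reduced = PCA(n_components=8).fit_transform(X_flat)
print(X_flat.shape, X_reduced.shape)
```

The number of components cannot exceed the number of images, which is why this toy example uses 8 rather than 128.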

In [8]:
import os

import numpy as np
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Reads the image folders, preprocesses the images, and builds the data splits

class ReadImages():
    def __init__(self,parent_folder_path):
        self.parent_folder_path = parent_folder_path
        self.images = []
        self.labels = []
        self.label_map = {"glioma_tumor": 0, "meningioma_tumor": 1, "no_tumor": 2, "pituitary_tumor": 3}
        self.training_images=None
        self.training_labels=None
        self.testing_images=None
        self.testing_labels=None
        self.validation_images=None
        self.validation_labels=None

    # Loop through each subfolder and read in the images and labels
    def Reading(self):
        for subfolder in os.listdir(self.parent_folder_path):
            if subfolder.startswith('.'):
                continue  
            if not os.path.isdir(os.path.join(self.parent_folder_path, subfolder)):
                continue
            subfolder_path = os.path.join(self.parent_folder_path, subfolder)
            for folder in os.listdir(subfolder_path):
                if folder.startswith('.'):
                    continue  
                if not os.path.isdir(os.path.join(subfolder_path, folder)):
                    continue
                folder_path = os.path.join(subfolder_path, folder)
                for file in os.listdir(folder_path):
                    if file.startswith('.'):
                        continue  # skip hidden files and folders
                    if file.endswith(".jpg") or file.endswith(".jpeg") or file.endswith(".png"):
                        img = Image.open(os.path.join(folder_path, file))
                        img = img.convert("L")
                        img=img.resize((380,360))
                        img_array = np.array(img)
                        self.images.append(img_array)  # add the image array to the list
                        label_str = folder
                        label = self.label_map[label_str]
                        self.labels.append(label)
        # Flatten each 360x380 image into one row, then project onto 128 principal components
        self.images = np.array(self.images)
        pca = PCA(n_components=128)
        self.images = pca.fit_transform(self.images.reshape(-1, 360*380))
        


    def Split(self):
        # Hold out 20% of the data, then use 10% of that held-out part for validation
        # (roughly 80% training, 18% testing, 2% validation overall)
        self.training_images, X_test_val, self.training_labels, y_test_val = train_test_split(self.images, self.labels, test_size=0.2, random_state=42)
        self.testing_images,self.validation_images,self.testing_labels,self.validation_labels=train_test_split(
            X_test_val, y_test_val, test_size=0.1, random_state=42)
        
    def get_training(self):
        return self.training_images,self.training_labels
    
    def get_testing(self):
        return self.testing_images,self.testing_labels
    
    def get_validation(self):
        return self.validation_images,self.validation_labels


## Section 2: Data Splits
In this section, we simply call the methods of our ReadImages class to obtain the training, validation, and testing sets.

In [9]:
Factory = ReadImages("/Users/zhengguang/Downloads/archive (9)")
Factory.Reading()
Factory.Split()
X_train, y_train = Factory.get_training()
X_test, y_test = Factory.get_testing()
X_val, y_val = Factory.get_validation()
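
To make the resulting proportions concrete, here is a minimal sketch that applies the same two-stage `train_test_split` calls to placeholder arrays. The helper name `split_sizes` and the sample count of 1000 are assumptions for illustration only. The first call holds out 20% of the data, and the second call assigns 10% of that held-out portion to validation, so the overall split is roughly 80% training, 18% testing, and 2% validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_sizes(n_samples):
    """Return (train, test, validation) sizes for the two-stage split."""
    X = np.arange(n_samples).reshape(-1, 1)   # placeholder features
    y = np.arange(n_samples) % 4              # placeholder labels for 4 classes
    # Stage 1: keep 80% for training, hold out 20%
    X_tr, X_hold, y_tr, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Stage 2: 90% of the held-out part becomes the test set, 10% the validation set
    X_te, X_va, y_te, y_va = train_test_split(
        X_hold, y_hold, test_size=0.1, random_state=42)
    return len(X_tr), len(X_te), len(X_va)

sizes = split_sizes(1000)
print(sizes)  # (800, 180, 20)
```

Because the validation set is so small, each validation example accounts for a noticeable step in the reported validation accuracy.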

## Section 3: Build classifiers
The two classifiers we chose are a Multilayer Perceptron classifier (MLPC) and a Support Vector Machine (SVM). \
We train both models with scikit-learn.

In [10]:
# Build First Classifier (SVM) --------------------------------------------------
print('SVM:')
clf_svm = SVC(kernel='poly', C=0.2, gamma=0.1)
clf_svm.fit(X_train, y_train)

# Evaluate the classifier on the training data
print("Training Accuracy: {}".format(clf_svm.score(X_train, y_train)))

# Evaluate the classifier on the testing data
print("Testing Accuracy: {}".format(clf_svm.score(X_test, y_test)))


# Build Second Classifier (MLPC) --------------------------------------------------
print('MLPC:')
clf_mlpc = MLPClassifier(hidden_layer_sizes=(64,), activation='relu', solver='adam', random_state=42)
clf_mlpc.fit(X_train, y_train)

# Evaluate the classifier on the training data
print("Training Accuracy: {}".format(clf_mlpc.score(X_train, y_train)))

# Evaluate the classifier on the testing data
print("Testing Accuracy: {}".format(clf_mlpc.score(X_test, y_test)))

SVM:
Training Accuracy: 1.0
Testing Accuracy: 0.8500851788756388
MLPC:
Training Accuracy: 0.9835312140942167
Testing Accuracy: 0.7666098807495741


## Section 4: Hyper-parameter tuning
Instead of using the Grid Search method in sklearn, we loop over lists of candidate values with nested for loops, score each combination on the validation set, and record the best combination of hyper-parameters for SVM and MLPC in a Python dictionary. We acknowledge that our choice of candidate values, and therefore the search space, is arbitrary.
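
Nested loops of this kind amount to iterating over the Cartesian product of the candidate values. As a minimal sketch of that idea, the snippet below uses scikit-learn's `ParameterGrid` to enumerate the combinations and scores each one on a held-out validation set. The synthetic data from `make_classification`, the helper name `manual_search`, and the small grid are assumptions for illustration, not the project's data or search space.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

def manual_search(model_cls, grid, X_tr, y_tr, X_va, y_va):
    """Return (best_params, best_accuracy) over every combination in grid."""
    best_params, best_acc = None, -1.0
    for params in ParameterGrid(grid):          # one dict per combination
        model = model_cls(**params).fit(X_tr, y_tr)
        acc = accuracy_score(y_va, model.predict(X_va))
        if acc > best_acc:                      # keep the first best on ties
            best_params, best_acc = params, acc
    return best_params, best_acc

# Hypothetical 4-class data standing in for the PCA features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

svm_grid = {"kernel": ["poly", "rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
best_params, best_acc = manual_search(SVC, svm_grid, X_tr, y_tr, X_va, y_va)
print(best_params, best_acc)
```

The same helper works for `MLPClassifier` by passing a grid whose keys match its constructor arguments, such as `activation` and `hidden_layer_sizes`.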
### SVM

In [11]:
#Define the hyperparameters to search over
c_values = [0.1, 1, 10]
gamma_values = [0.01, 0.1, 1]
kernel_values = ['poly', 'rbf', 'sigmoid']

# Track the best validation accuracy and the hyperparameters that achieved it
best_accuracy = 0
best_params = {}

for c in c_values:
    for gamma in gamma_values:
        for kernel in kernel_values:
            # Create the SVM classifier with the current hyperparameters
            clf_svm = SVC(kernel=kernel, C=c, gamma=gamma)

            # Train the classifier on the training set
            clf_svm.fit(X_train, y_train)

            # Evaluate the classifier on the validation set
            y_pred = clf_svm.predict(X_val)
            accuracy = accuracy_score(y_val, y_pred)

            # Print out the current hyperparameters and their validation accuracy
            print("Hyperparameters: kernel={}, C={}, gamma={}".format(kernel, c, gamma))
            print("Validation accuracy: {}".format(accuracy))

            # Update the best hyperparameters if the current model is better
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = {'kernel': kernel, 'C': c, 'gamma': gamma}

# Print out the best hyperparameters and their validation accuracy
print("Best hyperparameters: ", best_params)
print("Validation accuracy: ", best_accuracy)

# Create the SVM classifier with the best hyperparameters
clf_svm = SVC(kernel=best_params['kernel'], C=best_params['C'], gamma=best_params['gamma'])

# Train the classifier on the training set
clf_svm.fit(X_train, y_train)

# Evaluate the classifier on the test set
y_pred = clf_svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print out the test accuracy
print("Test accuracy: ", accuracy)

Hyperparameters: kernel=poly, C=0.1, gamma=0.01
Validation accuracy: 0.9393939393939394
Hyperparameters: kernel=rbf, C=0.1, gamma=0.01
Validation accuracy: 0.3484848484848485
Hyperparameters: kernel=sigmoid, C=0.1, gamma=0.01
Validation accuracy: 0.3484848484848485
Hyperparameters: kernel=poly, C=0.1, gamma=0.1
Validation accuracy: 0.9393939393939394
Hyperparameters: kernel=rbf, C=0.1, gamma=0.1
Validation accuracy: 0.3484848484848485
Hyperparameters: kernel=sigmoid, C=0.1, gamma=0.1
Validation accuracy: 0.3333333333333333
Hyperparameters: kernel=poly, C=0.1, gamma=1
Validation accuracy: 0.9393939393939394
Hyperparameters: kernel=rbf, C=0.1, gamma=1
Validation accuracy: 0.3484848484848485
Hyperparameters: kernel=sigmoid, C=0.1, gamma=1
Validation accuracy: 0.3333333333333333
Hyperparameters: kernel=poly, C=1, gamma=0.01
Validation accuracy: 0.9393939393939394
Hyperparameters: kernel=rbf, C=1, gamma=0.01
Validation accuracy: 0.5606060606060606
Hyperparameters: kernel=sigmoid, C=1, gamma

### MLPC

In [12]:
import warnings 
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=ConvergenceWarning)

activation = ["logistic","tanh","relu"]
hidden_layer_size=[(100,),(150,),(200,)]
learning_rate = ["constant","invscaling","adaptive"]

# Track the best validation accuracy and the hyperparameters that achieved it
best_accuracy = 0
best_params = {}

for indiv_activation in activation:
    for indiv_layer_size in hidden_layer_size:
        for rate in learning_rate:
            # Create the MLP classifier with the current hyperparameters
            clf_mlpc = MLPClassifier(activation=indiv_activation,hidden_layer_sizes=indiv_layer_size,learning_rate=rate,max_iter=200)

            # Train the classifier on the training set
            clf_mlpc.fit(X_train, y_train)

            # Evaluate the classifier on the validation set
            y_pred = clf_mlpc.predict(X_val)
            accuracy = accuracy_score(y_val, y_pred)

            # Print out the current hyperparameters and their validation accuracy
            print("Hyperparameters: indiv_activation={}, indiv_layer_size={}, rate={}".format(indiv_activation, indiv_layer_size, rate))
            print("Validation accuracy: {}".format(accuracy))

            # Update the best hyperparameters if the current model is better
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params = {"Activation":indiv_activation,'layer_size': indiv_layer_size, "learning rate":rate}

# Print out the best hyperparameters and their validation accuracy
print("Best hyperparameters: ", best_params)
print("Validation accuracy: ", best_accuracy)



clf_mlpc_best = MLPClassifier(activation=best_params["Activation"],hidden_layer_sizes=best_params["layer_size"],
                              learning_rate=best_params["learning rate"])

# Train the classifier on the training set
clf_mlpc_best.fit(X_train, y_train)

# Evaluate the classifier on the test set
y_pred = clf_mlpc_best.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print out the test accuracy
print("Test accuracy: ", accuracy)

Hyperparameters: indiv_activation=logistic, indiv_layer_size=(100,), rate=constant
Validation accuracy: 0.8636363636363636
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(100,), rate=invscaling
Validation accuracy: 0.803030303030303
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(100,), rate=adaptive
Validation accuracy: 0.8787878787878788
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(150,), rate=constant
Validation accuracy: 0.8636363636363636
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(150,), rate=invscaling
Validation accuracy: 0.8939393939393939
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(150,), rate=adaptive
Validation accuracy: 0.8484848484848485
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(200,), rate=constant
Validation accuracy: 0.9242424242424242
Hyperparameters: indiv_activation=logistic, indiv_layer_size=(200,), rate=invscaling
Validation accuracy: 0.8636363636363636
Hyperparame

## Section 5: Analysis
- Based on the results above, which classifier is better, and why?

- We compare the two classifiers by their test accuracies after hyperparameter tuning in the sections above. The test accuracy of SVM is around 0.85, while that of MLPC is around 0.785, so we conclude that SVM performs better on this task.

- For further improvement in classification accuracy, what strategies can you use, and why do you think they will be helpful?

1. Feature engineering: By selecting and processing certain features from the raw input data, we can improve the quality of the input to the classifier. When only the most relevant features are used, the classifier can better distinguish between the classes and therefore produce a more accurate model. In our case, this might mean ensuring that the model does not pay too much attention to the black space around the head, the neck portion, and the lower half of the head.

2. Data Augmentation: Generate additional training samples from the existing ones by applying transformations. This gives the classifier more training data and more exposure to variation in the inputs. Since the diagnostic details in these images are easy to distort, we need to make sure that the transformations we apply preserve the crucial information, so that the labels remain valid.

3. Regularization: Apart from the L2 regularization that we focused on during hyperparameter tuning (the constant C in SVM), there are many other regularization techniques that we can apply when training the models. For SVM, examples include L1 regularization to encourage a simpler decision boundary, and norm constraints, which can be used to limit the size of the weight vector. For MLPC, we did not tune any regularization beyond the default L2 penalty in sklearn's MLPClassifier (alpha=0.0001). Some examples of techniques we can use are L1 and L2 regularization as explained above, as well as dropout, which randomly drops out some neurons during training to prevent overfitting, and weight decay whic