####Brain tumor detection
Predict the probability of tumor presence through MRI images of the brain.

##Before starting this notebook

This notebook is designed for **experimental and learning purposes**. <br/> It aims at solving the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline**. <br/> Due to its purposes, this notebook may contain notes, drafts, comments, etc.




## Sprint Goals
- Frame the problem
- Get the data
- Data cleaning
- Simple EDA to gain insights
- Initial data preprocessing
- Train a (single) ML algorithm with all features and default hyperparameters
---

##1. Frame the Problem

###1.1 Context
Tumors are swollen masses in the body caused by the abnormal growth of cells. When a tumor is cancerous, the uncontrolled multiplication of cells has the potential to invade or spread to other parts of the body. <br/><br/> **Brain tumors** specifically are among the deadliest cancers, and can have lifelong psychological impacts on the patient. <br/><br/> One of the diagnostic methods commonly used by hospitals is **MRI** (magnetic resonance imaging), in which large machines with strong magnets, connected to computers, take detailed pictures of the brain. The professional responsible for the diagnosis, the radiologist, is, like any human, susceptible to inaccuracies in their work, which affects the accuracy of detection. <br/><br/> In an attempt to reduce human error and increase the accuracy of tumor detection, health experts have adopted modern technologies to take advantage of their benefits and generate a scalable improvement in the area. The field of **Artificial Intelligence** in particular has led to exciting solutions with high accuracy at detecting the presence of tumors in MRI images.<br/><br/> This project aims to apply artificial intelligence to the task of detecting brain tumors, as an attempt to assist and improve the accuracy of diagnosis.

###1.2 Challenge 
Brain MRIs are images with a lot of information that is not pertinent to the detection and diagnosis of tumors.<br/>
###Objective: 
**Build a machine learning solution to more accurately predict the presence of tumors in MRI images.**<br/>
###Baseline:
Currently, some similar projects are slowly being adopted to assist health care professionals in diagnosing brain tumors. However, they are still a minority; the task is mostly done manually and is susceptible to human error.<br/>
#### **Solution Planning:**
- Neural network problem
- Metrics:
    - R²
    - Root Mean Squared Error (RMSE)
- Data sources:
    - [Brain Tumor Classification (MRI)](https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri?resource=download)
- No assumptions
- Project deliverable:
    - A simple exploratory data analysis
    - **A ML system/model** launched in _production_ <br/><br/> 

##2. Get the Data

###2.1. Download the data
We previously downloaded the dataset from [Kaggle](https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri?resource=download).<br/>  The images are already split into Training and Testing folders.
Each folder has four subfolders (`glioma_tumor`, `meningioma_tumor`, `no_tumor` and `pituitary_tumor`), each containing the MRIs of the respective class.

###2.2. Importing dataset as a ZIP file

In [None]:
from zipfile import ZipFile
file_name = '/content/archive.zip'

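with ZipFile(file_name, 'r') as archive: #avoid naming the handle `zip`, which would shadow the built-in function
  archive.extractall() #extracts into the current working directory
  print('Done')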

BadZipFile: ignored
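
A `BadZipFile` error like the one recorded above usually means that the file at that path is not a readable ZIP archive, for example because the upload was interrupted. The following is a minimal sketch of a more defensive extraction step; the helper name `extract_archive` and its `target_dir` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from zipfile import ZipFile, is_zipfile


def extract_archive(archive_path, target_dir="."):
    """Extract a ZIP archive after checking that it exists and is readable.

    Hypothetical helper: returns the list of member names that were extracted.
    """
    archive_path = Path(archive_path)
    if not archive_path.exists():
        raise FileNotFoundError(f"Archive not found: {archive_path}")
    if not is_zipfile(archive_path):
        # Typical causes: an interrupted upload, or a file that is not a ZIP at all.
        raise ValueError(f"Not a valid ZIP archive: {archive_path}")
    with ZipFile(archive_path, "r") as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

In the notebook this could be called as `extract_archive('/content/archive.zip', '/content')`, which would fail early with a clear message instead of leaving the later cells without a `Training` folder.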

###2.3. Load Dependencies

In [None]:
import os
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

###2.4. Collect data

In [None]:
path = os.listdir('/content/Training') #returns a list containing the names of the entries in the directory given by path
classes = {'no_tumor':0, 'pituitary_tumor':1} #binary indicator of the presence of tumor
X = []
Y = []

for cls in classes:
    pth = '/content/Training/'+cls
    for j in os.listdir(pth):
        img = cv2.imread(pth+'/'+j, 0) #loads the image; the flag 0 (cv2.IMREAD_GRAYSCALE) reads it as a single-channel grayscale image
        img = cv2.resize(img, (200,200)) #resizes every image to 200x200 pixels so that all samples have the same shape
        X.append(img)
        Y.append(classes[cls])


FileNotFoundError: ignored

In [None]:
X = np.array(X)
Y = np.array(Y)

X_updated = X.reshape(len(X), -1)

In [None]:
np.unique(Y)

In [None]:
pd.Series(Y).value_counts() #counts the samples per class, most frequent class first (NA values are excluded by default)

In [None]:
X.shape, X_updated.shape

###2.6. Prepare data

In [None]:
X_updated = X.reshape(len(X), -1)
X_updated.shape

###2.7. Split data

*Splitting/sampling* the dataset into a *training set* and a *testing set*. The solution is trained using the *training set* and evaluated using the *testing set*. <br/> The error rate on the test set estimates the **generalization error**: how well the model is expected to perform on instances it has never seen before.

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(X_updated, Y, random_state=10, test_size=.20)

In [None]:
xtrain.shape, xtest.shape
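
When the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. As an illustration only (the original split does not do this), the sketch below passes `stratify` to `train_test_split` on synthetic data; `X_demo` and `y_demo` are made-up stand-ins for `X_updated` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened images: 100 samples with 16 "pixels" each.
rng = np.random.default_rng(10)
X_demo = rng.random((100, 16))
y_demo = np.array([0] * 30 + [1] * 70)  # imbalanced labels: 30% class 0, 70% class 1

# stratify=y_demo keeps the class ratio the same in the training and test subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=10, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print("share of class 1:", y_tr.mean(), y_te.mean())
```

With stratification, both subsets keep the 30/70 ratio of the full dataset, so the test score is not skewed by an unlucky draw.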

##3. Data Cleaning

###3.1. Feature Scaling

Scaling puts all features in the same range, so that no feature influences the model just because of its large magnitude. Here, the pixel intensities (0 to 255) are divided by 255 to map them to the range [0, 1].

In [None]:
print(xtrain.max(), xtrain.min())
print(xtest.max(), xtest.min())
xtrain = xtrain/255
xtest = xtest/255
print(xtrain.max(), xtrain.min())
print(xtest.max(), xtest.min())

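###3.2. Dimensionality Reduction: PCA

Principal component analysis (PCA) projects the data onto a smaller set of new variables (the principal components) that preserve as much of the information (variance) in the complete data as possible. `PCA(.98)` keeps the smallest number of components that explains 98% of the variance. In the cell below, the PCA transformation is commented out, so the models are trained on the scaled pixel values directly.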

In [None]:
print(xtrain.shape, xtest.shape)

pca = PCA(.98)
# pca_train = pca.fit_transform(xtrain)
# pca_test = pca.transform(xtest)
pca_train = xtrain
pca_test = xtest
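
For reference, this is a small sketch of how the commented-out PCA step would behave if it were enabled, shown on synthetic data rather than on the MRI images. The arrays `train` and `test` and the name `pca_demo` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with low intrinsic dimension: 5 latent factors mixed into 50 features.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 50))
train, test = data[:160], data[160:]

pca_demo = PCA(n_components=0.98)              # keep enough components for 98% of the variance
train_reduced = pca_demo.fit_transform(train)  # fit the projection on the training set only
test_reduced = pca_demo.transform(test)        # reuse the same projection on the test set

print(train.shape, "->", train_reduced.shape)
print("explained variance:", pca_demo.explained_variance_ratio_.sum())
```

Fitting on the training set and only transforming the test set keeps information from the test images out of the learned projection.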

##4. Explore the Data

###4.1. Visualize data

In [None]:
plt.imshow(X[0], cmap='gray') #display data as an image, i.e., on a 2D regular raster

##5. Train Model

  A Support-Vector Machine (SVM) is a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis. An SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Note that scikit-learn's `SVC`, used below, defaults to an RBF kernel, which makes the decision boundary non-linear in the input space.
(Read more at: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)

  Logistic Regression is a linear model for classification: it estimates the probability that a sample belongs to a class by applying the logistic (sigmoid) function to a weighted sum of the features. (Read more at: https://www.kdnuggets.com/2022/07/logistic-regression-work.html)

The Linear Support Vector Classifier (LinearSVC) applies a linear kernel function to perform classification and scales well to a large number of samples. Compared with the SVC model, LinearSVC has additional parameters such as `penalty` (the regularization norm, 'l1' or 'l2') and `loss` (the loss function).  (Read more at: https://www.datatechnotes.com/2020/07/classification-example-with-linearsvm-in-python.html)

In [None]:
import warnings
warnings.filterwarnings('ignore') # suppress warning messages: they report non-critical situations that do not stop the program


###5.1. Models Test

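In an effort to determine the best model, we created a function to calculate the performance of the following models: Decision Tree Regressor, Random Forest Regressor, SVR, SVC and Logistic Regression. All of them are scored with the same regression metrics, treating the 0/1 label as a numeric target.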

In [None]:
#find the best model to use
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# create a function to calculate the model performance
def model_performance(model, xtrain, xtest, ytrain, ytest):
    model.fit(xtrain, ytrain)
    ypred = model.predict(xtest)
    rmse = np.sqrt(mean_squared_error(ytest, ypred))
    mae = mean_absolute_error(ytest, ypred)
    r2 = r2_score(ytest, ypred)
    print(model)
    print(f'RMSE: {rmse:.2f}') # Root Mean Squared Error
    print(f'MAE: {mae:.2f}') # Mean absolute error
    print(f'R2: {r2:.2f}') # Coefficient of determination; the share of the target's variance explained by the predictions

#find the best model to use
models = [DecisionTreeRegressor(), RandomForestRegressor(), SVR(), SVC(), LogisticRegression()]
for model in models:
    model_performance(model, xtrain, xtest, ytrain, ytest)
    print('')
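
Since the labels here are classes (0 or 1), one option, not used in the cell above, is to score the classifiers with a classification metric such as accuracy. The sketch below does this on synthetic two-class data; the helper `classifier_performance` and the arrays `X_demo` and `y_demo` are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(10)
# Synthetic two-class data: class 1 is shifted so that the classes are separable.
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 20)), rng.normal(2.0, 1.0, (100, 20))])
y_demo = np.array([0] * 100 + [1] * 100)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)


def classifier_performance(model, x_tr, x_te, y_tr, y_te):
    """Fit a classifier and report its accuracy on the training and test sets."""
    model.fit(x_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(x_tr))
    test_acc = accuracy_score(y_te, model.predict(x_te))
    print(f"{model}: train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
    return train_acc, test_acc


results = [classifier_performance(m, x_tr, x_te, y_tr, y_te)
           for m in (SVC(), LogisticRegression(max_iter=1000))]
```

Reporting both training and test accuracy makes it easier to spot overfitting, which a single test-set number would hide.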

###Metrics


#####R²
The coefficient of determination (R²) is a summary measure of how well the model's predictions fit the data. It represents the proportion of the total variability of the target y that is explained by the model: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The best possible value is 1; a model that always predicts the mean of y scores 0, and worse models can score below 0.
#####MSE
Mean squared error regression loss. The loss is the mean, over the evaluated data, of the squared differences between true and predicted values: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
#####RMSE
The square root of the MSE, commonly used to compare regression models. It shows how far predictions fall from the true values, in the same units as the target.
#####MAE
Mean absolute error, a risk metric corresponding to the expected value of the absolute error loss: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$.
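
To make the four metrics concrete, here is a small worked sketch that computes them directly from their definitions with NumPy. The function name `regression_metrics` and the example values are invented for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, mae, r2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # mean of squared errors
    rmse = np.sqrt(mse)                              # same units as the target
    mae = np.mean(np.abs(residuals))                 # mean of absolute errors
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, mae, r2


y_true_demo = [0, 1, 1, 0, 1]
good_pred = [0.1, 0.8, 0.6, 0.3, 0.9]   # close to the true labels
bad_pred = [1, 0, 0, 1, 0]              # always the opposite label

print(regression_metrics(y_true_demo, good_pred))
print(regression_metrics(y_true_demo, bad_pred))
```

The second call yields a negative R², which shows that the score is not restricted to the range 0 to 1 when predictions are worse than always predicting the mean.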


###5.2. Trying a Neural Network model using tensorflow


####5.2.1. Importing dependencies

In [None]:
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import cv2 as cv
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image
from datetime import datetime
from sklearn.model_selection import train_test_split

####5.2.2. Rescaling images from the dataset

In [None]:
imgd=ImageDataGenerator(rescale=1/255)

####5.2.3. Exploring data

In [None]:
tumor_dataset=imgd.flow_from_directory('/content/Training')

In [None]:
print(tumor_dataset.class_indices)
classes=pd.DataFrame(tumor_dataset.classes)
classes.value_counts()

####5.2.4. Splitting data

In [None]:
path_train = '/content/Training'
path_test = '/content/Testing'

import tensorflow as tf
import tensorflow.keras.layers as tfl

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255,validation_split=0.2)
print("training")
train_data = train_datagen.flow_from_directory(
        path_train,
        subset='training',
        target_size=(200 , 200),
        batch_size=32)
print("validation")
val_data = train_datagen.flow_from_directory(
        path_train,
        subset='validation',
        target_size=(200 , 200),
        batch_size=32 )
print("testing")
test_data = train_datagen.flow_from_directory(
        path_test,
        target_size=(200 , 200),
        batch_size=32,
        shuffle=False) #keep the file order so that predictions line up with test_data.classes

In [None]:
test_data[1][1]

####5.2.5. Applying filter

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten,Activation,Dense,Dropout,Conv2D,MaxPool2D

In [None]:
model =Sequential()

#convolution and maxpoollayer
model.add(Conv2D(filters=25,kernel_size=3,
                 strides=2,padding='valid',input_shape=(200,200,3)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=2))

#flatten layer
model.add(Flatten())

#hidden layer
model.add(Dense(16))
model.add(Activation('relu'))

#output layer
model.add(Dense(4))
model.add(Activation('softmax')) #softmax gives one probability per class (summing to 1), as categorical_crossentropy expects


model.summary()
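
The shapes and parameter counts reported by `model.summary()` can be checked by hand. The sketch below reproduces the arithmetic for the layers defined above in plain Python; the helper `conv_output_size` is a hypothetical name, and the formula assumes 'valid' padding, as used in the model.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution or pooling layer with 'valid' padding."""
    return (size - kernel) // stride + 1


side = conv_output_size(200, kernel=3, stride=2)   # Conv2D(kernel_size=3, strides=2)
side = conv_output_size(side, kernel=2, stride=2)  # MaxPool2D(pool_size=2); stride defaults to the pool size
flat = side * side * 25                            # Flatten over the 25 filters

conv_params = (3 * 3 * 3 + 1) * 25   # 3x3 kernel, 3 input channels, plus 1 bias, per filter
hidden_params = (flat + 1) * 16      # Dense(16): one weight per input plus a bias, per unit
output_params = (16 + 1) * 4         # Dense(4)
total_params = conv_params + hidden_params + output_params

print("feature map side:", side)
print("flattened features:", flat)
print("trainable parameters:", total_params)
```

Almost all parameters sit in the first dense layer, because the flattened feature map is large; the `Activation` layers add no parameters.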

In [None]:
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

####5.2.6. Training and validating data

In [None]:
history=model.fit (train_data,epochs=10, validation_data=val_data)

####5.2.7. Analysing results

In [None]:
plt.plot(history.history['accuracy'],label='accuracy')
plt.plot(history.history['val_accuracy'],label='val_accuracy')
plt.legend()
plt.title('Accuracy Vs Epochs of CNN')
plt.xlabel('Epochs')
plt.ylabel('Accuracy');

In [None]:
model.evaluate(train_data)

In [None]:
model.evaluate(test_data)

In [None]:
y_predicte=model.predict(test_data)
y_predicte

The predicted class scores shown above are close to one another, which can indicate low confidence and a tendency toward inaccurate predictions. This, however, has to be analyzed more closely with tools such as a confusion matrix.
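
As a sketch of what such an analysis could look like, the example below builds a confusion matrix and per-class recall from a small, made-up set of predictions. In the notebook, `true_labels` would correspond to `test_data.classes` and `probabilities` to the output of `model.predict(test_data)`, under the assumption that the test generator does not shuffle the files; the values used here are invented.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ['glioma_tumor', 'meningioma_tumor', 'no_tumor', 'pituitary_tumor']

# Made-up stand-ins for the true labels and the predicted class probabilities.
true_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
probabilities = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],   # a glioma predicted as meningioma
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.3, 0.2, 0.4, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.2, 0.4, 0.3],   # a pituitary tumor predicted as no_tumor
])

predicted_labels = probabilities.argmax(axis=1)  # most probable class per image
cm = confusion_matrix(true_labels, predicted_labels, labels=list(range(len(class_names))))
recall = cm.diagonal() / cm.sum(axis=1)          # share of each true class that was found

print(cm)
for name, value in zip(class_names, recall):
    print(f"{name}: recall={value:.2f}")
```

Rows of the matrix are true classes and columns are predicted classes, so off-diagonal entries show which tumor types are confused with each other.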

####5.2.8. Testing model

In [None]:
path = os.listdir('/content/Testing') #returns a list containing the names of the entries in the directory given by path
classes = {'glioma_tumor': 0, 'meningioma_tumor': 1, 'no_tumor': 2, 'pituitary_tumor': 3} # indicator of the presence of tumor
X = []
Y = []

for cls in classes:
    pth = '/content/Testing/'+cls
    for j in os.listdir(pth):
        img = cv2.imread(pth+'/'+j, 0) #loads the image; the flag 0 (cv2.IMREAD_GRAYSCALE) reads it as a single-channel grayscale image
        img = cv2.resize(img, (200,200)) #resizes every image to 200x200 pixels so that all samples have the same shape
        X.append(img)
        Y.append(classes[cls])

In [None]:
tumor_classes=['glioma_tumor', 'meningioma_tumor', 'no_tumor', 'pituitary_tumor']

for i in range (len(y_predicte)):
    predicted_tumor=tumor_classes[np.argmax(y_predicte[i])]
    print(predicted_tumor)

The model can predict and give results, but a visual display has yet to be implemented.

##6. Evaluation

The metrics used to evaluate the training show that, among the models tested (Decision Tree Regressor, Random Forest Regressor, SVR, SVC and Logistic Regression), the one that provides the most accurate results for this project is the Random Forest Regressor.<br/> The Random Forest Regressor showed an accuracy of 97%; therefore, we decided to go ahead with testing using this model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg.fit(xtrain, ytrain)

As we can observe, the Random Forest Regressor showed a good balance between training and testing scores, so we concluded that it is a suitable model for this particular dataset.

##7. Prediction

In [None]:
pred = reg.predict(xtest)

In [None]:
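pred_labels = (pred >= 0.5).astype(int) #the regressor returns continuous values, so threshold them at 0.5 to get 0/1 labels
misclassified = np.where(ytest!=pred_labels)
misclassified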

In [None]:
print("Total Misclassified Samples: ",len(misclassified[0]))
print(pred[36],ytest[36])
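
Because a regressor returns continuous values rather than class labels, an accuracy figure requires a decision threshold first. The following is a minimal sketch of that conversion, using the same 0.5 cut-off as the testing cells below; the helper `threshold_accuracy` and the example scores are hypothetical.

```python
import numpy as np


def threshold_accuracy(y_true, y_score, threshold=0.5):
    """Turn continuous scores into 0/1 labels and return (labels, accuracy)."""
    labels = (np.asarray(y_score) >= threshold).astype(int)
    accuracy = float(np.mean(labels == np.asarray(y_true)))
    return labels, accuracy


y_true_demo = np.array([0, 1, 1, 0, 1])
y_score_demo = np.array([0.10, 0.80, 0.45, 0.30, 0.90])  # made-up regressor outputs

labels, acc = threshold_accuracy(y_true_demo, y_score_demo)
print(labels, acc)
```

Comparing raw scores with the labels directly would count a prediction of 0.98 for a positive sample as an error, which is why the threshold is applied before counting misclassifications.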

##8. Testing

In [None]:
dec = {0:'No Tumor', 1:'Positive Tumor'}

In [None]:
plt.figure(figsize=(12,8))
p = os.listdir('/content/Testing/')
c=1
for i in os.listdir('/content/Testing/no_tumor/')[:9]:
    plt.subplot(3,3,c)
    
    img = cv2.imread('/content/Testing/no_tumor/'+i,0)
    img1 = cv2.resize(img, (200,200))
    img1 = img1.reshape(1,-1)/255
    p = reg.predict(img1)
    assert(0.0 <= p[0] <= 1.0)
    idx = 0 if p[0] < 0.5 else 1
    #print(dec[p[0]]) ##
    plt.title(dec[idx])
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    c+=1

In [None]:
plt.figure(figsize=(12,8))
p = os.listdir('/content/Testing/')
c=1
for i in os.listdir('/content/Testing/pituitary_tumor/')[:16]:
    plt.subplot(4,4,c)
    
    img = cv2.imread('/content/Testing/pituitary_tumor/'+i,0)
    img1 = cv2.resize(img, (200,200))
    img1 = img1.reshape(1,-1)/255
    p = reg.predict(img1)
    assert(0.0 <= p[0] <= 1.0)
    idx = 0 if p[0] < 0.5 else 1
    plt.title(dec[idx])
    plt.imshow(img, cmap='gray')
    plt.axis('off')
    c+=1

A visual indicator of the location of the tumor has yet to be implemented.

##9. Conclusion

###9.1. Used models

After implementing the displayed models, we reached satisfactory tumor prediction results, with an accuracy of 97% in training. However, we were unable to fully implement the desired model, a neural network, and we have yet to analyze its results in more depth (comparing the accuracy across the different types of tumors and implementing a confusion matrix to visualize this data).

###9.2. Next steps

Moving forward, we intend to attempt again to implement and analyze a neural network model. Moreover, we plan to add a visual indicator of the detected tumors when displaying prediction results. Such improvements will hopefully provide better and more accurate assistance to medical professionals in diagnosing tumors.

###9.3. Acknowledgments

As said previously, this notebook is intended only for research and learning purposes. We thank our professor and the online community for providing the tools and explanations we needed to complete this project.

###9.4. References


Kumar, Ajitesh. "SVM Classifier using Sklearn: Code Examples". Data Analytics, May 6, 2022. URL: https://vitalflux.com/svm-classifier-scikit-learn-code-examples/. <br/>
"Metrics and scoring: quantifying the quality of predictions". scikit-learn. URL: https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix. <br/>
"machine-learning". GitHub, Oct 26, 2019. URL: https://github.com/laurelkeys/machine-learning/blob/master/assignment-2/Assignment2.ipynb. <br/>
Jadhav, Kishor. "brain tumor classification with 99% train accuracy". Kaggle, Aug 8, 2022. URL: https://www.kaggle.com/code/kishorjadhav1/brain-tumor-classification-with-99-train-accuracy/data. <br/>
"Root Mean Square Error (RMSE)". C3.ai. URL: https://c3.ai/glossary/data-science/root-mean-square-error-rmse/. <br/>
"Mean squared error". Knowledge Center. URL: https://peltarion.com/knowledge-center/modeling-view/build-an-ai-model/loss-functions/mean-squared-error. <br/>
Abhishek Anil, Aditya Raj, H Aravind Sarma, Naveen Chandran R, Deepa P L. "Brain Tumor detection from brain MRI using Deep Learning". International Journal of Innovative Research in Applied Sciences and Engineering (IJIRASE), Volume 3, Issue 2, DOI: 10.29027/IJIRASE.v3.i2.2019, 458-465, August 2019. URL: https://ijirase.com/assets/paper/issue_1/volume_3/V3-Issue-2-458-465.pdf. <br/>
"Brain Tumor Classification MRI | Brain Tumor Detection using Support Vector Machine in Python". Coding With Aman Dhillon, May 19, 2021. URL: https://www.youtube.com/watch?v=5lgrlddp-98. <br/>
"brain tumor detection using python and sklearn". GitHub, Jul 11, 2021. URL: https://github.com/akd6203/brain-tumor-detection. <br/>
"Brain_tumor_classifier_97%+_accuracy". Kaggle, Jun 9, 2022. URL: https://www.kaggle.com/code/die9origephit/brain-tumor-classifier-97-accuracy.