**By:**
Zamzam Alsarayrah

# Chest X-Ray Classification with 5 Classes

> # Contents

- [Problem Statement](#problem-statement)
- [Data Collection](#Data-Collection)
- [Observations from EDA](#observations-from-eda)
- [Modeling Methodology](#modeling-methodology)
- [Requirements](#Requirements)
- [Results](#results)
- [Conclusions](#conclusions)
- [References](#references)


> ## Problem statement:

**The global coronavirus disease 2019 (COVID-19) pandemic has increased the demand for testing, diagnosis, and treatment. Chest X-ray radiography (CXR) is a fast, effective, and affordable test that can identify possible COVID-19-related pneumonia. This project checks whether chest diseases can be diagnosed from X-ray images using machine learning. The main target is COVID-19: we test different machine learning models on their ability to distinguish COVID-19 chest X-rays from the four other classes in the dataset.**


**[Back to Top](#Contents)**

> ## Data Collection

Source: [Dataverse.harvard](https://dataverse.harvard.edu/file.xhtml?fileId=5210381&version=5.0)

The dataset is organized into 3 folders:
1. Train: 19,610 images (60%)
2. Test: 6,540 images (20%)
3. Validation: 6,534 images (20%)

and contains subfolders for each image category:

| Class Name | Count |
| ----------- | ----------- |
| Normal  | 10,192 |
| COVID-19 | 4,189 |
| Tuberculosis | 4,897 |
| Lung-Opacity | 6,012 |
| Viral Pneumonia | 7,397 |
| **Total** | 32,687 |
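
The per-class counts above can be reproduced by walking the dataset folders. The snippet below is a minimal sketch that assumes a `root/<split>/<class>/` layout with PNG or JPEG files; the function name `count_images` and the suffix list are illustrative and not taken from the project notebooks.

```python
from collections import Counter
from pathlib import Path

# Assumed image file extensions; adjust to match the actual dataset.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images(root):
    """Return {split: Counter({class_name: n_images})} for a root/split/class/ layout."""
    counts = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        per_class = Counter()
        for class_dir in sorted(split_dir.iterdir()):
            if class_dir.is_dir():
                # Count only files that look like images.
                per_class[class_dir.name] = sum(
                    1 for f in class_dir.iterdir()
                    if f.suffix.lower() in IMAGE_SUFFIXES
                )
        counts[split_dir.name] = per_class
    return counts
```

Calling `count_images("path/to/dataset")` (a hypothetical path) returns one `Counter` per split, which can be compared against the table.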

### Test 
![Test Images](./assets/test-images.png)
### Train
![Train Images](./assets/train-images.png)
### Validation
![val Images](./assets/val-images.png)

**[Back to Top](#Contents)**

![](images/image.jpg)

> ## Observations from EDA ([Demo notebook_EDA](/notebooks/project/project-capstone/X_ray/chest_x-ray/Load%20the%20Data%20and%20EDA.ipynb))

**Class Dictionary** 

| Class Name | Label |
| ----------- | ----------- |
| COVID-19 | 0 |
| Lung-Opacity | 1 |
| Normal | 2 |
| Viral Pneumonia | 3 |
| Tuberculosis | 4 |

- The dataset contains color images with three channels (RGB).
- All images have the same size (224*224), so we don't need to resize them.


### The distribution of the classes in the train, test and validation data
![class distribution](./assets/class_distribution.png)

**The classes have the same distribution across all the splits (train, test, and validation).
The class with the most images is Normal, while the class with the fewest is COVID-19.
This is expected: the other diseases have a long history, whereas COVID-19 is a new disease, so less data is available for it than for the other classes.**

**[Back to Top](#Contents)**

> # Modeling Methodology ([Demo notebook_code](/notebooks/project/project-capstone/X_ray/chest_x-ray/notebook.ipynb))

**Train, test and validation split :**

The dataset is already split into train, test, and validation sets, so we don't have to do any splitting ourselves. The train and test images are used to train and validate the model during the training process, while the validation images are used to test the accuracy of the different models after training.


- ### Base Model Architecture: 
Custom CNN model: this model has 2 convolutional blocks (Conv2D + MaxPooling2D), followed by 2 fully connected (FC) layers and a final softmax activation for classification. All layers use ReLU activation except where mentioned.
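
A base model of this shape might be sketched in Keras as follows. The filter counts, kernel sizes, dense width, optimizer, and loss are assumptions chosen for illustration, and the second fully connected layer is assumed to carry the softmax; the notebook may use different values.

```python
def build_base_cnn(input_shape=(224, 224, 3), num_classes=5):
    """Two Conv2D + MaxPooling2D blocks, then two fully connected layers."""
    # Imported here so the module can be loaded without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),   # assumed filter count
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),   # assumed filter count
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),           # assumed layer width
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer labels (0..4) pair with the sparse categorical loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```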
- ### Transfer learning via feature extraction:
We used two different techniques for feature extraction: PCA and a pre-trained CNN (VGG16). The images are passed to the feature extractor, which outputs a feature vector for each image; these features are then used as input to a classifier.
- **PCA (Principal Component Analysis):
It is a dimensionality reduction technique that performs feature extraction on the input data. In feature extraction, the existing features are combined in a particular way and some of the resulting "new" variables are dropped; the variables we keep are still combinations of the old ones.
This lets us reduce the number of features in our model while keeping the most important information from the original features. In our project we used 1000 components: each image originally has 224*224 = 50,176 features per color channel, and PCA with 1000 components per channel reduced this to 3000 features (1000 for each color).**
- ### VGG16 Model Architecture:
VGG16 is a CNN architecture by Simonyan and Zisserman that is available pre-trained. It can be used as a classifier or as a feature extractor. When using it as a feature extractor, we "chop off" the network at a specified layer, usually just before the fully connected layers. The output of the VGG16 extractor is 7 x 7 x 512 = 25,088 features, which is the shape of the final max-pooling layer of the truncated VGG16. These extracted features are then fed into another classifier to classify the image.
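
The 25,088 figure follows from VGG16's five max-pooling stages, each of which halves the height and width, so a 224 x 224 input ends at 7 x 7 with 512 channels. The sketch below checks that arithmetic and shows one way the truncated network could be used through `tf.keras.applications`; the helper names are illustrative, and ImageNet weights are an assumption, since the README does not say which weights were loaded.

```python
def vgg16_feature_count(image_size=224, num_pools=5, last_channels=512):
    """Number of features left after VGG16's convolutional base."""
    side = image_size // (2 ** num_pools)  # each max-pool halves height and width
    return side * side * last_channels


def extract_vgg16_features(images):
    """Map a batch of RGB images (n, 224, 224, 3) to flat (n, 25088) feature vectors."""
    # Imported lazily; the ImageNet weights are downloaded on first use.
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

    # include_top=False drops the fully connected layers ("chopping off" the network).
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    features = base.predict(preprocess_input(images.astype("float32")))
    return features.reshape(len(images), -1)


print(vgg16_feature_count())  # 25088
```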
- ### Classification Models:
Classification uses logistic regression and random forest models with multi-class output. We used random search with both models to tune the hyperparameters. We fed these models the features extracted by PCA and by VGG16.
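
As a rough illustration of the PCA route, the sketch below fits one PCA per color channel, concatenates the components, and tunes a logistic regression with scikit-learn's `RandomizedSearchCV`. It runs on small random arrays rather than the X-ray data, and the image size, component count, and search space for `C` are assumptions made to keep the example fast; the project itself used 1000 components per channel.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV


def pca_per_channel(train_images, other_images, n_components):
    """Fit one PCA per color channel on the training images and concatenate the outputs."""
    train_parts, other_parts = [], []
    for c in range(train_images.shape[-1]):
        # Flatten channel c of every image into one row of pixel values.
        train_flat = train_images[..., c].reshape(len(train_images), -1)
        other_flat = other_images[..., c].reshape(len(other_images), -1)
        pca = PCA(n_components=n_components, random_state=0).fit(train_flat)
        train_parts.append(pca.transform(train_flat))
        other_parts.append(pca.transform(other_flat))
    return np.hstack(train_parts), np.hstack(other_parts)


# Tiny synthetic stand-in for the X-ray data: 60 train / 20 test RGB images, 5 classes.
rng = np.random.default_rng(0)
X_train = rng.random((60, 16, 16, 3))
X_test = rng.random((20, 16, 16, 3))
y_train = np.arange(60) % 5

Z_train, Z_test = pca_per_channel(X_train, X_test, n_components=10)
print(Z_train.shape, Z_test.shape)  # (60, 30) (20, 30)

# Random search over an assumed grid of regularization strengths.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=500),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(Z_train, y_train)
predictions = search.predict(Z_test)
```

Fitting the PCA on the training images only and reusing it for the other split keeps information from the held-out images out of the feature extractor. The same search could wrap a `RandomForestClassifier` by swapping the estimator and the parameter distributions.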

**[Back to Top](#Contents)**

># Requirements

**The main requirements are listed below:**
- Python 3.6
- NumPy
- TensorFlow, Keras
- OpenCV 4.2.0
- Scikit-Learn
- Matplotlib

**Additional requirements to generate dataset:**
- Pandas
- Jupyter
- Google Colab

**[Back to Top](#Contents)**

># Results

| Model name | Train Accuracy | Test Accuracy |
| ----------- | ----------- | ----------- |
| logistic_pca | 0.74 | 0.68 |
| logistic_vgg | 1.0 | 0.91 |
| RFC_pca | 0.99 | 0.58 |
| RFC_vgg | 0.99 | 0.76 |

The table above shows the train and test accuracy of the models we applied; all of them overfit to some degree. The best performer is VGG16 with logistic regression, with a train accuracy of 1.0 and a test accuracy of 0.91, so this is the model we would pick. Using PCA for feature extraction followed by simple classification models did not work well: even though the PCA features cover more than 90% of the variance in the images, they did not perform well in classification. The base model did well compared to the other models.
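
The comparison in the results table can be made explicit by computing the gap between train and test accuracy for each model. This is a small sketch using only the numbers reported in the table; the variable names are illustrative.

```python
# Train/test accuracy copied from the results table.
scores = {
    "logistic_pca": (0.74, 0.68),
    "logistic_vgg": (1.0, 0.91),
    "RFC_pca": (0.99, 0.58),
    "RFC_vgg": (0.99, 0.76),
}

# Gap between train and test accuracy: a larger gap suggests stronger overfitting.
gaps = {name: round(train - test, 2) for name, (train, test) in scores.items()}
best_model = max(scores, key=lambda name: scores[name][1])

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:13s} gap={gap:.2f}")
print("best test accuracy:", best_model)
```

The random forest on PCA features shows the largest gap (0.41), while logistic regression on VGG16 features has the highest test accuracy.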

**[Back to Top](#Contents)**

># Conclusions

- The features extracted by PCA do not work well for classification.
- The VGG16 feature extractor did well but still overfits.
- For our problem and data, logistic regression is better than the random forest classifier.
- The base model did well, and the models using the PCA feature extractor could not outperform it.

**[Back to Top](#Contents)**

># Recommendations

- Do more error analysis, such as precision and recall, and check which classes are most often confused with COVID-19 so that further work can focus on them.
- Estimate the human-level error (the error rate of doctors classifying these X-rays) and compare it with our best model to guide further work.
- Get more test data so that the model is evaluated on more images and its performance can be improved.
- Use a neural network after the VGG16 feature extractor, with overfitting countermeasures such as dropout and regularization.
- Apply image augmentation to increase the train and test data.
- Use a more powerful computer with more storage so the models run faster, since image data is large.
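
The first recommendation could start from something like the sketch below, which computes precision and recall for the COVID-19 class and lists the classes it is mistaken for. The label mapping follows the class dictionary in the EDA section; the function name and the example labels are made up for illustration and are not results from the project.

```python
from collections import Counter

# Label mapping from the class dictionary in the EDA section.
CLASS_NAMES = {0: "COVID-19", 1: "Lung-Opacity", 2: "Normal",
               3: "Viral Pneumonia", 4: "Tuberculosis"}


def per_class_report(y_true, y_pred, target=0):
    """Precision and recall for `target`, plus which classes it gets mistaken for."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(target, target)]
    predicted = sum(n for (t, p), n in pairs.items() if p == target)
    actual = sum(n for (t, p), n in pairs.items() if t == target)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    # Missed target images, grouped by the class the model predicted instead.
    confused_with = Counter({CLASS_NAMES[p]: n for (t, p), n in pairs.items()
                             if t == target and p != target})
    return precision, recall, confused_with


# Made-up labels for illustration only, not results from the project.
y_true = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4]
y_pred = [0, 0, 1, 3, 0, 2, 2, 3, 4, 1]
precision, recall, confused_with = per_class_report(y_true, y_pred)
print(precision, recall, confused_with.most_common())
```

With real predictions, scikit-learn's `classification_report` and `confusion_matrix` give the same information for all classes at once.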

># References

[1] [kaggle_COVID-19 Radiography Database](https://www.kaggle.com/tawsifurrahman/covid19-radiography-database)

[2] [github](https://github.com/agchung/Actualmed-COVID-chestxray-datase)

[3] [github_COVID-19 Chest X-ray Dataset ](https://github.com/agchung/Figure1-COVID-chestxray-dataset)

[4] [kaggle_RSNA Pneumonia Detection Challenge](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data)

[5] [kaggle_Tuberculosis (TB) Chest X-ray Database](https://www.kaggle.com/tawsifurrahman/tuberculosis-tb-chest-xray-dataset)

[6] [kaggle_Tuberculosis Chest X-rays](https://www.kaggle.com/raddar/tuberculosis-chest-xrays-shenzhen)

[7] [openi](https://openi.nlm.nih.gov/faq#collection)

[8] [radiologymasterclass](https://www.radiologymasterclass.co.uk)

**[Back to Top](#Contents)**