# **COVID-19 Outcome Prediction**
This is the main notebook for the COVID-19 Outcome Prediction project. We train and evaluate several classifiers, then use the results to choose the best model for predicting the outcome of COVID-19 patients. The models are:

- K-Nearest Neighbors
- Logistic Regression
- Naive Bayes
- Support Vector Machine
- Decision Tree

---

*This project is the collaborative effort of the following team members:*
- [Shehab Mahmoud](https://www.github.com/dizzydroid)
- [Adham Nasreldin](https://github.com/AdhamNasreldin)
- [Kareem Mostafa](https://github.com/KareemMostafa1)

*Delivered as part of the **CSE375: Machine Learning & Pattern Recognition** course project — FOE | ASU <br />
Under supervision of [**Dr. Nesma Rezk**](https://eng.asu.edu.eg/staff/nesma.rezk), [**Dr. Hazem Abbas**](https://eng.asu.edu.eg/staff/hazem.abbas) and [**Eng. Ahmed Elgazwy**](https://eng.asu.edu.eg/staff/ahmed.elgazwy).*

---

> **_NOTE:_**
> - **Individual model notebooks** can be found in the [`notebooks`](notebooks/) directory.
> - **Pretrained exported models** are also available and can be found in the [`notebooks/models`](notebooks/models/) directory.
> - As per project requirements, all models will be *re-trained and evaluated* in this notebook.

---


#### 1. Importing Required Libraries

In [28]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os

# For model building and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    precision_score, recall_score, f1_score, roc_auc_score,
    classification_report, roc_curve, confusion_matrix,
)
from sklearn.preprocessing import StandardScaler

# For saving the model
import joblib

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

#### 2. Loading and Exploring the Dataset
First, we use pandas to load the dataset from a CSV file.

In [29]:
data = pd.read_csv('data/data.csv')
data.head()

,Unnamed: 0,location,country,gender,age,vis_wuhan,from_wuhan,symptom1,symptom2,symptom3,symptom4,symptom5,symptom6,diff_sym_hos,result
0,0,104,8,1,66.0,1,0,14,31,19,12,3,1,8,1
1,1,101,8,0,56.0,0,1,14,31,19,12,3,1,0,0
2,2,137,8,1,46.0,0,1,14,31,19,12,3,1,13,0
3,3,116,8,0,60.0,1,0,14,31,19,12,3,1,0,0
4,4,116,8,1,58.0,0,0,14,31,19,12,3,1,0,0


Next, we explore the dataset.

In [30]:
# Examining the length of the dataset
print(f"The dataset has {data.shape[0]} rows and {data.shape[1]} columns.")

# Examining the columns of the dataset
print("Features in the dataset: ", data.columns)

# Examining the labels in the dataset
print("Labels in the dataset: ", data['result'].unique())

# Examining how many patients died (result = 1)
print("Number of patients who died: ", data['result'].value_counts()[1])


The dataset has 863 rows and 15 columns.
Features in the dataset:  Index(['Unnamed: 0', 'location', 'country', 'gender', 'age', 'vis_wuhan',
       'from_wuhan', 'symptom1', 'symptom2', 'symptom3', 'symptom4',
       'symptom5', 'symptom6', 'diff_sym_hos', 'result'],
      dtype='object')
Labels in the dataset:  [1 0]
Number of patients who died:  108
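
The counts printed above imply a clear class imbalance: 108 of the 863 patients died. A minimal sketch of how that share could be computed follows. The `result` Series here is rebuilt from those two printed counts as a stand-in for `data['result']`, so it is an assumption rather than the real column.

```python
import pandas as pd

# Stand-in for data['result'], rebuilt from the counts printed above
# (108 deaths out of 863 rows); the real column comes from data/data.csv.
result = pd.Series([1] * 108 + [0] * (863 - 108), name="result")

# Absolute and relative class frequencies
class_counts = result.value_counts()
class_share = result.value_counts(normalize=True)

died_share = class_share.loc[1]
print(class_counts)
print(f"Share of patients who died: {died_share:.1%}")
```

With a positive class this small, a model that rarely predicts death can still reach high accuracy, which is one reason precision, recall, F1, and ROC AUC are imported alongside the models.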


#### 3. Cleaning up the data
The dataset contains an unused column, `Unnamed: 0`, which we will drop. Let's also check for missing values.

In [31]:
data.isna().sum() # Checking for missing values

Unnamed: 0      0
location        0
country         0
gender          0
age             0
vis_wuhan       0
from_wuhan      0
symptom1        0
symptom2        0
symptom3        0
symptom4        0
symptom5        0
symptom6        0
diff_sym_hos    0
result          0
dtype: int64

In [32]:
# Drop the 'Unnamed: 0' column if it exists
if 'Unnamed: 0' in data.columns:
    data.drop('Unnamed: 0', axis=1, inplace=True)
data.head()

,location,country,gender,age,vis_wuhan,from_wuhan,symptom1,symptom2,symptom3,symptom4,symptom5,symptom6,diff_sym_hos,result
0,104,8,1,66.0,1,0,14,31,19,12,3,1,8,1
1,101,8,0,56.0,0,1,14,31,19,12,3,1,0,0
2,137,8,1,46.0,0,1,14,31,19,12,3,1,13,0
3,116,8,0,60.0,1,0,14,31,19,12,3,1,0,0
4,116,8,1,58.0,0,0,14,31,19,12,3,1,0,0
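
The membership check before dropping can also be expressed with the `errors="ignore"` option of `DataFrame.drop`, which keeps the cell safe to re-run. A small sketch, using a hypothetical three-row frame `demo` in place of the real `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`; values copied from the first rows shown above
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "location": [104, 101, 137],
    "result": [1, 0, 0],
})

# errors="ignore" skips labels that are not present instead of raising KeyError
cleaned = demo.drop(columns="Unnamed: 0", errors="ignore")

# Running the same call again changes nothing, so the step is idempotent
cleaned_again = cleaned.drop(columns="Unnamed: 0", errors="ignore")

print(list(cleaned.columns))
```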


Next, we bin the `age` column into age groups, so that the models see a small set of age categories rather than a continuous value.

In [33]:
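# Bin edges; pd.cut uses right-closed intervals by default, e.g. (60, 70]
bins = [0, 18, 30, 40, 50, 60, 70, 80, 90, 100]
# Readable names for the bins (kept for reference; not passed to pd.cut below,
# so the new column shows interval notation)
labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']
preprocessed_data = data.copy()
categorized_age = pd.cut(preprocessed_data['age'], bins)
preprocessed_data['Categorized_age'] = categorized_age
# The binned column replaces the raw age
preprocessed_data = preprocessed_data.drop('age', axis=1)
preprocessed_data.head()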

,location,country,gender,vis_wuhan,from_wuhan,symptom1,symptom2,symptom3,symptom4,symptom5,symptom6,diff_sym_hos,result,Categorized_age
0,104,8,1,1,0,14,31,19,12,3,1,8,1,"(60, 70]"
1,101,8,0,0,1,14,31,19,12,3,1,0,0,"(50, 60]"
2,137,8,1,0,1,14,31,19,12,3,1,13,0,"(40, 50]"
3,116,8,0,1,0,14,31,19,12,3,1,0,0,"(50, 60]"
4,116,8,1,0,0,14,31,19,12,3,1,0,0,"(50, 60]"
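
The `labels` list in the cell above is defined but never handed to `pd.cut`, which is why the column shows intervals such as `(60, 70]`. As a hedged illustration of what passing it would change, the sketch below bins the five ages from the earlier preview both ways. The `ages` Series is a stand-in for `data['age']`.

```python
import pandas as pd

bins = [0, 18, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']

# Stand-in for data['age']: the five ages shown in the earlier preview
ages = pd.Series([66.0, 56.0, 46.0, 60.0, 58.0])

# Default: right-closed intervals, so an age of exactly 60 lands in (50, 60]
as_intervals = pd.cut(ages, bins)

# Same bin edges, but with human-readable category names
as_labels = pd.cut(ages, bins, labels=labels)

print(as_intervals.astype(str).tolist())
print(as_labels.astype(str).tolist())
```

Because the intervals are open on the left, an age of exactly 0 would fall outside every bin and become `NaN`. The `include_lowest=True` option of `pd.cut` closes the first interval if that case matters for the data.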


In [34]:
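# One-hot encode the age groups: one indicator column per bin
categorized_age = pd.get_dummies(preprocessed_data['Categorized_age'], prefix='Categorized_age')
# Append the indicator columns and drop the original categorical column
preprocessed_data = pd.concat([preprocessed_data, categorized_age], axis=1)
preprocessed_data = preprocessed_data.drop('Categorized_age', axis=1)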

In [35]:
preprocessed_data.head()

,location,country,gender,vis_wuhan,from_wuhan,symptom1,symptom2,symptom3,symptom4,symptom5,...,result,"Categorized_age_(0, 18]","Categorized_age_(18, 30]","Categorized_age_(30, 40]","Categorized_age_(40, 50]","Categorized_age_(50, 60]","Categorized_age_(60, 70]","Categorized_age_(70, 80]","Categorized_age_(80, 90]","Categorized_age_(90, 100]"
0,104,8,1,1,0,14,31,19,12,3,...,1,0,0,0,0,0,1,0,0,0
1,101,8,0,0,1,14,31,19,12,3,...,0,0,0,0,0,1,0,0,0,0
2,137,8,1,0,1,14,31,19,12,3,...,0,0,0,0,1,0,0,0,0,0
3,116,8,0,1,0,14,31,19,12,3,...,0,0,0,0,0,1,0,0,0,0
4,116,8,1,0,0,14,31,19,12,3,...,0,0,0,0,0,1,0,0,0,0
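
Because `pd.cut` returns a categorical column, `pd.get_dummies` emits one indicator column per bin, including bins that contain no patients. A minimal sketch on the five ages from the earlier preview (the `ages` Series is a stand-in, not the real data):

```python
import pandas as pd

bins = [0, 18, 30, 40, 50, 60, 70, 80, 90, 100]
ages = pd.Series([66.0, 56.0, 46.0, 60.0, 58.0])  # stand-in for data['age']

binned = pd.cut(ages, bins)

# dtype=int gives 0/1 columns; recent pandas versions default to booleans
dummies = pd.get_dummies(binned, prefix='Categorized_age', dtype=int)

print(dummies.shape)        # one column per bin, even the empty ones
print(dummies.sum(axis=1))  # number of active indicators in each row
```

For rows with a valid age, exactly one indicator is active, so the nine columns are perfectly collinear. If a model is sensitive to that, `drop_first=True` is the `get_dummies` option that omits one of them.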


In [36]:
# Separate the features (X) from the target label (y)
X = preprocessed_data.drop('result', axis=1)
y = preprocessed_data['result']

# Display feature names
print("Feature columns:")
print(X.columns)

Feature columns:
Index(['location', 'country', 'gender', 'vis_wuhan', 'from_wuhan', 'symptom1',
       'symptom2', 'symptom3', 'symptom4', 'symptom5', 'symptom6',
       'diff_sym_hos', 'Categorized_age_(0, 18]', 'Categorized_age_(18, 30]',
       'Categorized_age_(30, 40]', 'Categorized_age_(40, 50]',
       'Categorized_age_(50, 60]', 'Categorized_age_(60, 70]',
       'Categorized_age_(70, 80]', 'Categorized_age_(80, 90]',
       'Categorized_age_(90, 100]'],
      dtype='object')


**Feature Scaling:** <br />
We scale the features so that columns with large numeric ranges do not dominate the others during training.
To [standardize the data](https://en.wikipedia.org/wiki/Standard_score) (give each feature zero mean and unit standard deviation), we subtract the feature's mean $\mu$ and divide the result by its standard deviation $\sigma$:
$$
x' = \frac{x-\mu}{\sigma}
$$
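
The formula above can be checked by hand against `StandardScaler`. The sketch below uses an arbitrary five-value column (the ages from the earlier preview, used purely as sample numbers) and assumes nothing about the real feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Arbitrary sample column, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[66.0], [56.0], [46.0], [60.0], [58.0]])

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0)
mu = x.mean(axis=0)
sigma = x.std(axis=0, ddof=0)
manual = (x - mu) / sigma

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```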

In [37]:
# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the features
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for convenience
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Display the scaled features
X_scaled.head()

,location,country,gender,vis_wuhan,from_wuhan,symptom1,symptom2,symptom3,symptom4,symptom5,...,diff_sym_hos,"Categorized_age_(0, 18]","Categorized_age_(18, 30]","Categorized_age_(30, 40]","Categorized_age_(40, 50]","Categorized_age_(50, 60]","Categorized_age_(60, 70]","Categorized_age_(70, 80]","Categorized_age_(80, 90]","Categorized_age_(90, 100]"
0,0.698221,-1.15245,0.207592,2.12057,-0.347533,0.465755,0.401355,0.244914,0.135161,0.054668,...,2.971339,-0.137442,-0.339097,-0.388158,-0.7734,-0.453108,2.746735,-0.260901,-0.141755,-0.048196
1,0.621646,-1.15245,-1.170499,-0.471571,2.877424,0.465755,0.401355,0.244914,0.135161,0.054668,...,-0.42223,-0.137442,-0.339097,-0.388158,-0.7734,2.206977,-0.364069,-0.260901,-0.141755,-0.048196
2,1.54054,-1.15245,0.207592,-0.471571,2.877424,0.465755,0.401355,0.244914,0.135161,0.054668,...,5.092319,-0.137442,-0.339097,-0.388158,1.292991,-0.453108,-0.364069,-0.260901,-0.141755,-0.048196
3,1.004519,-1.15245,-1.170499,2.12057,-0.347533,0.465755,0.401355,0.244914,0.135161,0.054668,...,-0.42223,-0.137442,-0.339097,-0.388158,-0.7734,2.206977,-0.364069,-0.260901,-0.141755,-0.048196
4,1.004519,-1.15245,0.207592,-0.471571,-0.347533,0.465755,0.401355,0.244914,0.135161,0.054668,...,-0.42223,-0.137442,-0.339097,-0.388158,-0.7734,2.206977,-0.364069,-0.260901,-0.141755,-0.048196


#### 4. Training the models 

**4.1 Train-validation-test split**

In [38]:
# Split the data into 'Training' and 'Temp' sets, Temp set will be further split into 'Validation' and 'Test' sets
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Split the 'Temp' set into 'Validation' and 'Test' sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Display the shapes of the sets
print("Training set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])

Training set size: 604
Validation set size: 129
Test set size: 130


**4.2 Training different models on the dataset** <br />
We will train the following models:
- K-Nearest Neighbors
- Logistic Regression
- Naive Bayes
- Support Vector Machine
- Decision Tree

In [39]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
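
One point to keep in mind with the imports above: `MultinomialNB` expects non-negative features, so it cannot be fitted directly on standardized data, which contains negative values. The sketch below illustrates this on random stand-in data (`X_demo` and `y_demo` are hypothetical); `GaussianNB` has no such restriction.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 rows, 3 standardized features, binary label
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 3)))
y_demo = rng.integers(0, 2, size=100)

# GaussianNB accepts negative, standardized features
gaussian_ok = GaussianNB().fit(X_demo, y_demo)

# MultinomialNB rejects negative values with a ValueError
try:
    MultinomialNB().fit(X_demo, y_demo)
    multinomial_failed = False
except ValueError as exc:
    multinomial_failed = True
    print("MultinomialNB refused the data:", exc)
```

If a multinomial model is wanted, one option is to train it on the unscaled, non-negative columns, or to rescale with `MinMaxScaler` instead of `StandardScaler`.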