# COVID-19 Patient Survival Prediction Project

## Introduction

The COVID-19 pandemic has posed significant challenges to healthcare systems worldwide. One crucial aspect of managing the pandemic is predicting patient outcomes based on their symptoms and medical history. This project aims to predict whether COVID-19 patients will survive using various machine learning models, including Logistic Regression, Random Forest, SGD Classifier, and XGBoost. The evaluation metrics used include accuracy, ROC curves, and precision-recall curves.

## Data

The data used in this project consists of COVID-19 patients' medical history and symptoms. The dataset includes features such as age, gender, comorbidities, and various symptoms experienced by the patients.
The dataset comes from Kaggle and is available at this [link](https://www.kaggle.com/datasets/meirnizri/covid19-dataset).

## Methodology

### Data Preprocessing

- **Data Cleaning:** Handled missing values and outliers.
- **Feature Engineering:** Created new features and selected relevant ones.
- **Normalization:** Applied normalization techniques to scale the features.
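
The cleaning and normalization steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual pipeline: the median imputation strategy, the min-max scaling choice, and the `ages` column are all assumptions made for the example.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical age column with one missing entry (not taken from the dataset)
ages = [20, None, 40, 60]
ages_filled = impute_median(ages)
ages_scaled = min_max_scale(ages_filled)
print(ages_filled)
print(ages_scaled)
```

In practice the scaling parameters should be computed on the training split only and then reused on the test split, so that no information leaks from the test data.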

### Models Used

1. **Logistic Regression**
2. **Random Forest**
3. **SGD Classifier** (used in place of a Support Vector Machine, which did not finish training)
4. **XGBoost**

### Evaluation Metrics

- **Accuracy:** Measures the proportion of correctly predicted instances.
- **ROC Curve:** Plots the true positive rate against the false positive rate.
- **Precision-Recall Curve:** Plots precision against recall.
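
To make the quantities behind these metrics concrete, the sketch below computes accuracy, true positive rate, false positive rate, and precision at a single decision threshold. The labels and scores are made up for illustration, and the helper names are hypothetical rather than taken from the project code.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_score, threshold=0.5):
    """Return the metrics that underlie accuracy, ROC, and precision-recall plots."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,  # recall
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Made-up labels and predicted probabilities
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.2, 0.1]
metrics = threshold_metrics(y_true, y_score)
print(metrics)
```

ROC and precision-recall curves are obtained by repeating this calculation over many thresholds and plotting the resulting pairs.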

## Modeling

### Logistic Regression

Logistic Regression is a linear model for binary classification that predicts the probability of the target variable belonging to a particular class. Bayesian optimization was used to tune the model's hyperparameters. The results below are for the best model found with this technique.
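
As a rough illustration of how a fitted logistic regression turns features into a probability, the sketch below applies the sigmoid function to a weighted sum. The weights, bias, and feature values are hypothetical and are not the ones learned in this project; the hyperparameter search itself is not shown.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class under a linear logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two features
weights = [0.5, -0.25]
bias = -0.25

p_neutral = predict_proba(weights, bias, [1.0, 1.0])  # linear score is 0
p_high = predict_proba(weights, bias, [4.0, 1.0])     # linear score is 1.5
print(p_neutral, p_high)
```

A linear score of zero maps to a probability of 0.5, and larger scores push the probability toward 1.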

**Results:**
- **Accuracy:** [0.95]
- **ROC AUC:** [0.96]
- **Precision-Recall AUC:** [0.69]

![Logistic Regression ROC Curve](../figures/Logistic_Regression_Best_roc_curve.png)
![Logistic Regression Precision-Recall Curve](../figures/Logistic_Regression_Best_precision_recall_curve.png)

### Random Forest

Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mode of the classes for classification.
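
The "mode of the classes" step can be sketched as a simple majority vote over the per-tree predictions. The votes below are invented for the example, and `majority_vote` is a hypothetical helper, not a function from the project.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one patient
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the library's output can differ slightly from this textbook description.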

**Results:**
- **Accuracy:** [0.945]
- **ROC AUC:** [0.95]
- **Precision-Recall AUC:** [0.65]

![Random Forest ROC Curve](../figures/Random_Forest_roc_curve.png)
![Random Forest Precision-Recall Curve](../figures/Random_Forest_precision_recall_curve.png)

### SGD Classifier

The original plan was to use a Support Vector Machine (SVM), a versatile model capable of linear or nonlinear classification, regression, and outlier detection. Because the dataset is very large, SVM training did not finish, so it was replaced with an SGD Classifier, a linear model trained with stochastic gradient descent.
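
For intuition, the sketch below trains a linear classifier with per-sample gradient steps on the hinge loss, which is the idea behind a linear SVM fitted by SGD (scikit-learn's `SGDClassifier` uses the hinge loss by default). It is a toy version under stated assumptions: the data, learning rate, and regularization strength are made up, samples are visited in a fixed order for reproducibility (real implementations shuffle), and whether the project kept the default loss is not known.

```python
def train_linear_sgd(samples, labels, epochs=20, lr=0.1, reg=0.01):
    """Fit weights and bias with per-sample hinge-loss updates and L2 shrinkage.

    Labels must be +1 or -1.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Sample is inside the margin or misclassified: move toward it
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Tiny, linearly separable toy data (not from the COVID-19 dataset)
samples = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_sgd(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions)
```

Because each update touches only one sample, the cost per step does not grow with the dataset size, which is why this approach scales to data where a kernel SVM does not.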

**Results:**
- **Accuracy:** [0.073]
- **ROC AUC:** [0.5]
- **Precision-Recall AUC:** [0.07]

![SGD Classifier ROC Curve](../figures/SGDClassifier_roc_curve.png)
![SGD Classifier Precision-Recall Curve](../figures/SGDClassifier_precision_recall_curve.png)
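
One possible reading of these numbers, offered as an assumption rather than something verified against the notebook, is that the model assigns nearly every patient to the positive class. If roughly 7.3% of patients are positive (a hypothetical prevalence chosen to match the figures above), a classifier that always predicts positive would score as follows:

```python
# Hypothetical cohort: 73 positives out of 1000 patients (7.3% prevalence)
n_patients = 1000
n_positive = 73
y_true = [1] * n_positive + [0] * (n_patients - n_positive)

# A degenerate classifier that predicts the positive class for everyone
y_pred = [1] * n_patients

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / n_patients

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
precision = true_positives / sum(y_pred)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}")
```

Both values equal the prevalence, and a model that gives every patient the same score has an ROC AUC of 0.5, so the reported results are at least consistent with this kind of failure. Checking the confusion matrix in the notebook would confirm or rule it out.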

### XGBoost

XGBoost is an optimized gradient boosting algorithm designed to be highly efficient, flexible, and portable.
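
The core idea of gradient boosting, fitting each new tree to the errors left by the previous ones, can be sketched with one-split trees (stumps) and squared error. This is a deliberately simplified illustration and not XGBoost itself, which adds second-order gradient information, regularization, and many engineering optimizations; the data and parameter values are made up.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=5, lr=0.5):
    """Start from the mean and repeatedly add shrunken stumps fitted to the residuals."""
    preds = [sum(ys) / len(ys)] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        preds = [
            p + lr * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, preds)
        ]
    return preds


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy 1-D data (not from the COVID-19 dataset)
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted = boost(xs, ys)
mse_after = mse(ys, boosted)
print(mse_before, mse_after)
```

Each round reduces the remaining error by a fraction controlled by the learning rate `lr`, so more rounds with a smaller rate give a smoother fit.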

**Results:**
- **Accuracy:** [0.95]
- **ROC AUC:** [0.75]
- **Precision-Recall AUC:** [0.63]

![XGBoost ROC Curve](../figures/XGBoost_roc_curve.png)
![XGBoost Precision-Recall Curve](../figures/XGBoost_precision_recall_curve.png)

## Evaluation

### Accuracy

The accuracy of each model was calculated to determine how well they performed in predicting the survival of COVID-19 patients.

| Model                | Accuracy      |
|----------------------|---------------|
| Logistic Regression  | [0.95] |
| Random Forest        | [0.945] |
| SGDClassifier        | [0.073] |
| XGBoost              | [0.95] |

### ROC Curve

The ROC curve was plotted for each model to visualize the trade-off between the true positive rate and false positive rate.

![Logistic Regression ROC Curve](../figures/Logistic_Regression_Best_roc_curve.png)
![Random Forest ROC Curve](../figures/Random_Forest_roc_curve.png)
![SGD Classifier ROC Curve](../figures/SGDClassifier_roc_curve.png)
![XGBoost ROC Curve](../figures/XGBoost_roc_curve.png)
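
For reference, the points on an ROC curve and the area under it can be computed as in the sketch below. The labels and scores are invented, and the helper names are hypothetical; the project's plots were presumably produced with a library routine instead.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and scores
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

roc_auc = auc_trapezoid(roc_points(y_true, y_score))
constant_auc = auc_trapezoid(roc_points([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
print(roc_auc, constant_auc)
```

The second call shows that a model giving every sample the same score has an ROC AUC of 0.5, the value expected from an uninformative classifier.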

### Precision-Recall Curve

The precision-recall curve was plotted to visualize the trade-off between precision and recall for each model.

![Logistic Regression Precision-Recall Curve](../figures/Logistic_Regression_Best_precision_recall_curve.png)
![Random Forest Precision-Recall Curve](../figures/Random_Forest_precision_recall_curve.png)
![SGD Classifier Precision-Recall Curve](../figures/SGDClassifier_precision_recall_curve.png)
![XGBoost Precision-Recall Curve](../figures/XGBoost_precision_recall_curve.png)
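
The sketch below shows one common way to summarize a precision-recall curve: average precision, the sum of precision values weighted by the increase in recall at each threshold. The data are invented, and it is an assumption that this matches how the "Precision-Recall AUC" values in this report were computed (a trapezoidal area is another possibility).

```python
def precision_recall_points(y_true, y_score):
    """Return (recall, precision) pairs for every distinct score used as a threshold."""
    n_pos = sum(y_true)
    points = []
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        # tp + fp >= 1 because the threshold is one of the observed scores
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Step-wise sum of precision weighted by the gain in recall."""
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in points:
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Made-up labels and scores
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

pr_points = precision_recall_points(y_true, y_score)
ap = average_precision(pr_points)
print(ap)
```

At the lowest threshold every sample is predicted positive, so precision equals the share of positives in the data. On an imbalanced dataset this baseline is low, which is why precision-recall AUC values are typically much smaller than ROC AUC values for the same model.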

## Conclusion

In this project, we used Logistic Regression, Random Forest, SGD Classifier, and XGBoost to predict the survival of COVID-19 patients based on their medical history and symptoms. The evaluation metrics indicated that:

- **Logistic Regression** and **Random Forest** performed best, combining high accuracy (0.95 and 0.945) with the highest ROC AUC (0.96 and 0.95) and precision-recall AUC (0.69 and 0.65).
- **XGBoost** matched their accuracy (0.95) and came close on precision-recall AUC (0.63), but its ROC AUC (0.75) was clearly lower.
- **SVM** training took too long and did not finish, so it was replaced with an **SGD Classifier**, which performed very poorly (accuracy 0.073, ROC AUC 0.5).

These findings can aid healthcare professionals in making informed decisions regarding patient care and resource allocation.

## Future Work

Future work can involve exploring other advanced machine learning algorithms, incorporating additional data sources, and fine-tuning the models for improved performance.

## References

- [Link to Data Source](https://www.kaggle.com/datasets/meirnizri/covid19-dataset)
- [Link to Modeling Notebook](https://github.com/azadehansari/CapstoneProject2-COVID_Analysis/blob/master/notebooks/04.%20Moddeling.ipynb)