# Covid-19 Fatality Prediction

## Overview

This project builds a binary classification machine learning model.

The model predicts whether a Covid-19 patient will die, based on their features.<br>
A patient with a predicted probability of dying of 50% or higher is classed as high risk.<br>
A patient with a predicted probability of dying below 50% is classed as low risk.
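
The 50% rule above can be sketched as a small helper. This is only an illustration: the function name `risk_label` and the `threshold` parameter are assumptions, not part of the notebook.

```python
def risk_label(death_probability: float, threshold: float = 0.5) -> str:
    """Map a predicted probability of dying to a risk label.

    `risk_label` and `threshold` are hypothetical names used for illustration.
    """
    if not 0.0 <= death_probability <= 1.0:
        raise ValueError("death_probability must be between 0 and 1")
    # 50% or higher counts as high risk, anything below as low risk
    return "high risk" if death_probability >= threshold else "low risk"


# Example probabilities, made up for the sketch
labels = [risk_label(p) for p in (0.12, 0.50, 0.87)]
print(labels)
```

With a fitted scikit-learn classifier, the probabilities would typically come from the positive-class column of `predict_proba`.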

Problem definition: Given clinical parameters about a Covid-19 patient, can we predict whether or not they will die?

Evaluation: If we can reach 90% accuracy at predicting whether or not a patient will die during the proof of concept, we'll pursue the project.
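
A minimal sketch of how the 90% criterion could be checked once predictions exist. The labels below are made up, and `accuracy` and `TARGET_ACCURACY` are hypothetical names chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


TARGET_ACCURACY = 0.90  # the proof-of-concept goal stated above

# Made-up labels: 1 = died, 0 = survived
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

score = accuracy(y_true, y_pred)
meets_target = score >= TARGET_ACCURACY
print(score, meets_target)
```

Deaths are rare in this dataset (the raw file has a `deceased_date` for only 60 of 3253 rows), so a model that always predicts survival could already score high on accuracy. It is probably worth reading accuracy alongside the precision, recall and F1 metrics imported later in the notebook.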

STEPS
1. Explore data
2. Choose features
3. Create models 
4. Test and choose model
5. Test and improve model

## Data source

This model will use a Kaggle dataset created from reports by the Korea Centers for Disease Control and Prevention and local governments. It is maintained and updated by volunteers.

Specifically, it uses PatientInfo.csv, which contains epidemiological data on Covid-19 patients in South Korea.

Source
- https://www.kaggle.com/kimjihoo/coronavirusdataset
- https://github.com/ThisIsIsaac/Data-Science-for-COVID-19
- http://www.cdc.go.kr/

### Data dictionary

See a detailed description of the dataset here: <br>
https://www.kaggle.com/kimjihoo/ds4c-what-is-this-dataset-detailed-description


## Data exploration / analysis (EDA)

What's missing from the data, and how do you deal with it?<br>
Are there any outliers?<br>
Do you need to add, change or remove columns?

In [3]:
# Do all imports at the top

# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline 

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
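# plot_roc_curve was removed in scikit-learn 1.2;
# RocCurveDisplay.from_estimator is its replacement
from sklearn.metrics import RocCurveDisplay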

In [4]:
# LOAD DATA
# Create dataframe with pandas

df = pd.read_csv("PatientInfo.csv")

# Number of (rows, columns)
df.shape 

(3253, 18)

In [5]:
# Preview table
df.head() # first 5 rows

Unnamed: 0,patient_id,global_num,sex,birth_year,age,country,province,city,disease,infection_case,infection_order,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state
0,1000000001,2.0,male,1964.0,50s,Korea,Seoul,Gangseo-gu,,overseas inflow,1.0,,75.0,2020-01-22,2020-01-23,2020-02-05,,released
1,1000000002,5.0,male,1987.0,30s,Korea,Seoul,Jungnang-gu,,overseas inflow,1.0,,31.0,,2020-01-30,2020-03-02,,released
2,1000000003,6.0,male,1964.0,50s,Korea,Seoul,Jongno-gu,,contact with patient,2.0,2002000000.0,17.0,,2020-01-30,2020-02-19,,released
3,1000000004,7.0,male,1991.0,20s,Korea,Seoul,Mapo-gu,,overseas inflow,1.0,,9.0,2020-01-26,2020-01-30,2020-02-15,,released
4,1000000005,9.0,female,1992.0,20s,Korea,Seoul,Seongbuk-gu,,contact with patient,2.0,1000000000.0,2.0,,2020-01-31,2020-02-24,,released


In [6]:
# Look at column info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3253 entries, 0 to 3252
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          3253 non-null   int64  
 1   global_num          2082 non-null   float64
 2   sex                 3200 non-null   object 
 3   birth_year          2833 non-null   float64
 4   age                 3192 non-null   object 
 5   country             3142 non-null   object 
 6   province            3253 non-null   object 
 7   city                3177 non-null   object 
 8   disease             18 non-null     object 
 9   infection_case      2441 non-null   object 
 10  infection_order     31 non-null     float64
 11  infected_by         763 non-null    float64
 12  contact_number      597 non-null    float64
 13  symptom_onset_date  462 non-null    object 
 14  confirmed_date      3253 non-null   object 
 15  released_date       1137 non-null   object 
 16  deceas

In [10]:
# Column datatype
df.dtypes

patient_id              int64
global_num            float64
sex                    object
birth_year            float64
age                    object
country                object
province               object
city                   object
disease                object
infection_case         object
infection_order       float64
infected_by           float64
contact_number        float64
symptom_onset_date     object
confirmed_date         object
released_date          object
deceased_date          object
state                  object
dtype: object

In [7]:
# Check for missing values
df.isna().sum()

# 3253 rows in total
# Columns with lots of nulls:
# global_num / birth_year / disease / infection_case / infection_order
# infected_by / contact_number / symptom_onset_date
# released_date / deceased_date

# Most nulls: 1. disease 2. infection_order 3. deceased_date

patient_id               0
global_num            1171
sex                     53
birth_year             420
age                     61
country                111
province                 0
city                    76
disease               3235
infection_case         812
infection_order       3222
infected_by           2490
contact_number        2656
symptom_onset_date    2791
confirmed_date           0
released_date         2116
deceased_date         3193
state                    0
dtype: int64
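
The null counts above are easier to compare as a share of all rows. A minimal sketch on a toy frame, assuming a cutoff of 50% missing (`MAX_MISSING_SHARE` is a hypothetical value, not one used in the notebook):

```python
import pandas as pd

# Toy stand-in for PatientInfo.csv; the values are illustrative only
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "sex": ["male", None, "female", "male"],
    "disease": [None, None, None, True],
    "state": ["released", "released", "released", "released"],
})

# isna() gives booleans, so the column mean is the share of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)

MAX_MISSING_SHARE = 0.5  # hypothetical cutoff for "too sparse to keep"
sparse_columns = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()
print(sparse_columns)
```

On the real dataframe the same expression, `df.isna().mean()`, would rank the columns by how sparse they are.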

In [23]:
# View column
# df["released_date"]

# Drop column
df = df.drop("released_date", axis=1)
# df = df.drop(columns="column_name")  # equivalent keyword form
# df.drop("column_name", axis=1, inplace=True)  # use inplace if you don't want to reassign


In [24]:
# List of columns 
df.columns

Index(['patient_id', 'global_num', 'sex', 'birth_year', 'age', 'country',
       'province', 'city', 'disease', 'infection_case', 'infection_order',
       'infected_by', 'contact_number', 'symptom_onset_date', 'confirmed_date',
       'deceased_date', 'state'],
      dtype='object')

In [27]:
# df.head()
# df.isna().sum()

# Drop column
df = df.drop("infection_order", axis=1)

In [29]:
df = df.drop("disease", axis=1)
df = df.drop("infected_by", axis=1)
df = df.drop("contact_number", axis=1)
df = df.drop("symptom_onset_date", axis=1)
df = df.drop("deceased_date", axis=1)

In [37]:
# df.head()
# df.isna().sum()
df = df.drop("global_num", axis=1)

In [53]:
# FIND OUT HOW MANY PATIENTS THERE ARE PER COUNTRY
df['country'].value_counts()

Korea            3118
China              10
United States       6
Thailand            2
Indonesia           1
Canada              1
Mongolia            1
Switzerland         1
France              1
Spain               1
Name: country, dtype: int64
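
To see how dominant one category is, the counts above can be turned into shares. The sketch below re-enters the printed counts by hand; on the real dataframe, `df['country'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {
        "Korea": 3118, "China": 10, "United States": 6, "Thailand": 2,
        "Indonesia": 1, "Canada": 1, "Mongolia": 1, "Switzerland": 1,
        "France": 1, "Spain": 1,
    },
    name="country",
)

# Share of non-null rows per country
shares = counts / counts.sum()
print(shares.round(4))
```

Almost all non-null rows are from Korea, so this column may carry little signal for the model.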

In [54]:
df = df.drop("birth_year", axis=1)
df = df.drop("confirmed_date", axis=1)
df = df.drop("city", axis=1)
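
The drops so far could also be collected into a single call. This is a sketch, not the notebook's actual code: `COLUMNS_TO_DROP` and `drop_unused_columns` are hypothetical names, and `errors="ignore"` is added so that re-running the cell does not raise a `KeyError` for columns that are already gone.

```python
import pandas as pd

# Every column dropped in the cells above, in one place
COLUMNS_TO_DROP = [
    "released_date", "infection_order", "disease", "infected_by",
    "contact_number", "symptom_onset_date", "deceased_date",
    "global_num", "birth_year", "confirmed_date", "city",
]


def drop_unused_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without the unused columns."""
    # errors="ignore" skips names that are not present in the frame
    return frame.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# Toy frame with a subset of the real columns
toy = pd.DataFrame({
    "patient_id": [1000000001],
    "sex": ["male"],
    "birth_year": [1964.0],
    "city": ["Gangseo-gu"],
    "state": ["released"],
})
cleaned = drop_unused_columns(toy)
print(list(cleaned.columns))
```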

In [57]:
# Export dataframe with dropped columns
df.to_csv("PatientInfo_Clean1.csv", index=False) 

In [59]:
# DELETE ROWS WITH NULL VALUES
# df.isna().sum()
df.dropna(inplace=True)

In [62]:
# df.shape  # (2319, 7)
df.isna().sum()

patient_id        0
sex               0
age               0
country           0
province          0
infection_case    0
state             0
dtype: int64
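
It can help to record how many rows `dropna` removes. A minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", None, "female"],
    "age": ["50s", "30s", None],
    "state": ["released", "released", "released"],
})

rows_before = len(toy)
# dropna() without arguments removes every row that has at least one null
complete = toy.dropna()
rows_removed = rows_before - len(complete)
print(rows_before, len(complete), rows_removed)
```

In this notebook the same step takes the data from 3253 rows to 2319, so 934 rows are discarded.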

In [63]:
# EXPORT DATAFRAME WITH NO NULL VALUES
df.to_csv("PatientInfo_Clean2.csv", index=False)

In [None]:
# CONVERT STRING VALUES TO NUMBERS
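
One possible way to do this conversion is sketched below on a toy frame. Everything here is an assumption rather than the notebook's actual approach: the 0/1 mapping for `sex`, reading `age` bands like "50s" as the integer 50, one-hot encoding `province`, the `died` column name, and the presence of a `"deceased"` value in `state`.

```python
import pandas as pd

# Toy rows; "Busan" and "deceased" are illustrative values
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": ["50s", "20s", "30s"],
    "province": ["Seoul", "Seoul", "Busan"],
    "state": ["released", "released", "deceased"],
})

encoded = toy.copy()

# Binary column: map the two labels to 0 and 1
encoded["sex"] = encoded["sex"].map({"male": 0, "female": 1})

# Age bands such as "50s" become the integer 50
encoded["age"] = encoded["age"].str.rstrip("s").astype(int)

# Target: 1 if the patient died, 0 otherwise
encoded["died"] = (encoded["state"] == "deceased").astype(int)
encoded = encoded.drop(columns="state")

# Unordered categories get one 0/1 column per value
encoded = pd.get_dummies(encoded, columns=["province"], dtype=int)
print(encoded)
```

The same one-hot step could be applied to `country` and `infection_case`, which are also unordered text columns.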
