Anastasiia Leskiv

# Business Understanding

COVID-19 is the disease caused by SARS-CoV-2, a coronavirus first identified in 2019. It typically results in mild to moderate respiratory illness for most infected individuals, often resolving without specific medical intervention. However, older individuals and those with pre-existing health issues such as cardiovascular disease, diabetes, chronic respiratory conditions, or cancer are at a higher risk of developing severe illness.

Throughout the pandemic, healthcare providers have grappled with a critical challenge: a scarcity of medical resources and the lack of an effective strategy for distributing them. Anticipating the specific medical needs of individuals when they test positive, or even beforehand, is paramount. This predictive ability could significantly assist authorities in proactively sourcing and organizing the requisite resources, potentially saving lives in critical situations.

The goal of this project is to build a machine learning model that, given a patient's current symptoms, status, and medical history, predicts whether the patient is at high risk of death from COVID-19.

# Data Understanding

This data set contains 21 unique features and 1,048,576 unique patients.
In the Boolean features, 1 means "yes" and 2 means "no". Values of 97 and 99 indicate missing data.

sex: 1 for female and 2 for male.

age: of the patient.

classification: covid test findings. Values 1-3 mean that the patient was diagnosed with covid in different
degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.

patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.

pneumonia: whether the patient already has air sac inflammation (pneumonia) or not.

pregnancy: whether the patient is pregnant or not.

diabetes: whether the patient has diabetes or not.

copd: Indicates whether the patient has Chronic obstructive pulmonary disease or not.

asthma: whether the patient has asthma or not.

inmsupr: whether the patient is immunosuppressed or not.

hypertension: whether the patient has hypertension or not.

cardiovascular: whether the patient has heart or blood vessels related disease.

renal chronic: whether the patient has chronic renal disease or not.

other disease: whether the patient has other disease or not.

obesity: whether the patient is obese or not.

tobacco: whether the patient is a tobacco user.

usmr: indicates whether the patient was treated in medical units of the first, second or third level.

medical unit: type of institution of the National Health System that provided the care.

intubed: whether the patient was connected to the ventilator.

icu: Indicates whether the patient had been admitted to an Intensive Care Unit.

date died: if the patient died, the date of death; 9999-99-99 otherwise.

# Data Preparation


In [1]:
# import all necessary libraries (duplicates removed)
import warnings

import numpy as np  # for data manipulation
import pandas as pd  # for data manipulation
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # for plotting
%matplotlib inline
from IPython.display import Image
from six import StringIO  # in-memory file-like object for functions that expect a file
from tqdm import tqdm  # progress bars
import sklearn.metrics as metrics  # for model evaluation
from sklearn.ensemble import RandomForestClassifier  # for modeling
from sklearn.linear_model import LogisticRegression  # for modeling
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # for model evaluation
from sklearn.model_selection import cross_val_score, train_test_split  # for modeling
from sklearn.tree import DecisionTreeClassifier, export_graphviz  # for modeling

warnings.filterwarnings('ignore')  # suppress warnings


In [None]:
! pip install kaggle 

In [None]:
import json 
import os
from pathlib import Path

# user name and API key
username = ''
key = ''
user_folder = ''
# your api key
api_key = {
'username':username ,
'key':key}

# uses pathlib Path
kaggle_path = Path(f'/Users/{user_folder}/.kaggle')
os.makedirs(kaggle_path, exist_ok=True)

# opens file and dumps python dict to json object 
with open (kaggle_path/'kaggle.json', 'w') as handl:
    json.dump(api_key,handl)

# file modes must be octal; 0o600 keeps the API key readable and writable by the owner only
os.chmod(kaggle_path/'kaggle.json', 0o600)

In [None]:
import kaggle

In [None]:
!kaggle datasets download -d meirnizri/covid19-dataset

In [None]:
! unzip covid19-dataset.zip

In [None]:
# read the CSV file and print the first 5 rows
df = pd.read_csv('Covid Data.csv')
df.head()

In [None]:
df.shape

In [None]:
#print information about the DataFrame
df.info()

In [None]:
#Checking null values
df.isnull().sum()

A good day is when there is no "NaN" in the data :D Let's look at our data to understand more about it.

In [None]:
df.describe()

In [None]:
# checking the number of unique values in each column
for col in df.columns:
    print('{:<20} => {:>10}'.format(col, len(df[col].unique())))

I can see that some of the Boolean columns have more than 2 unique values. I will filter out the extra values and keep only 1 and 2, as I expected them to be, because values such as 97 and 99 mark missing data.

 

In [None]:
df = df[(df.PNEUMONIA == 1) | (df.PNEUMONIA == 2)]
df = df[(df.DIABETES == 1) | (df.DIABETES == 2)]
df = df[(df.COPD == 1) | (df.COPD == 2)]
df = df[(df.ASTHMA == 1) | (df.ASTHMA == 2)]
df = df[(df.INMSUPR == 1) | (df.INMSUPR == 2)]
df = df[(df.HIPERTENSION == 1) | (df.HIPERTENSION == 2)]
df = df[(df.OTHER_DISEASE == 1) | (df.OTHER_DISEASE == 2)]
df = df[(df.CARDIOVASCULAR == 1) | (df.CARDIOVASCULAR == 2)]
df = df[(df.OBESITY == 1) | (df.OBESITY == 2)]
df = df[(df.RENAL_CHRONIC == 1) | (df.RENAL_CHRONIC == 2)]
df = df[(df.TOBACCO == 1) | (df.TOBACCO == 2)]
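
The same filter can also be written once for a list of columns. The following is only a sketch on a made-up toy frame: `toy`, `bool_cols`, and the values in it are assumptions for illustration, not rows from the real data set.

```python
import pandas as pd

# Toy stand-in for the patient table (hypothetical values)
toy = pd.DataFrame({
    "PNEUMONIA": [1, 2, 99, 2],
    "DIABETES": [2, 97, 1, 1],
    "AGE": [34, 61, 47, 99],  # 99 is a valid age here, not a missing-value code
})

# Only the Boolean columns use 97/99 as missing-value codes
bool_cols = ["PNEUMONIA", "DIABETES"]

# Keep a row only if every Boolean column holds 1 ("yes") or 2 ("no")
mask = toy[bool_cols].isin([1, 2]).all(axis=1)
clean = toy[mask]
print(clean)
```

Restricting the check to the Boolean columns matters: a numeric column such as AGE can legitimately contain 97 or 99.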

In [None]:
# Plotting a countplot to compare the number of patients of each sex
sns.countplot(x="SEX", data=df, palette="pastel")
plt.show()

In [None]:
df['SEX'].value_counts()

Okay, so the two sexes are represented in roughly equal numbers.

In [None]:
# Plotting a countplot to compare people who have pneumonia or not
sns.countplot(x="PNEUMONIA", data=df, palette="pastel")
plt.show()

In [None]:
df['PNEUMONIA'].value_counts()

1 means "yes" and 2 means "no". This plot shows us that far fewer patients in our data set have pneumonia.

In [None]:
# Plotting a countplot to compare people who have diabetes or not
sns.countplot(x="DIABETES", data=df, palette="pastel")
plt.show()

In [None]:
df['DIABETES'].value_counts()

This plot shows us that far fewer patients in our data set have diabetes.

In [None]:
# Plotting a countplot to see whether the patient has chronic obstructive pulmonary disease or not
sns.countplot(x="COPD", data=df, palette="pastel")
plt.show()

In [None]:
df['COPD'].value_counts()

This plot shows us that far fewer patients in our data set have chronic obstructive pulmonary disease.

In [None]:
# Plotting a countplot to see whether the patient has asthma
sns.countplot(x="ASTHMA", data=df, palette="pastel")
plt.show()

In [None]:
df['ASTHMA'].value_counts()

This plot shows us that far fewer patients in our data set have asthma.

In [None]:
# Plotting a countplot to see whether the patient is immunosuppressed or not
sns.countplot(x="INMSUPR", data=df, palette="pastel")
plt.show()

In [None]:
df['INMSUPR'].value_counts()

This plot shows us that far fewer patients in our data set are immunosuppressed.

In [None]:
# Plotting a countplot to see whether the patient has hypertension or not
sns.countplot(x="HIPERTENSION", data=df, palette="pastel")
plt.show()

In [None]:
df['HIPERTENSION'].value_counts()

This plot shows us that far fewer patients in our data set have hypertension.

In [None]:
# Plotting a countplot to see whether the patient has any other disease
sns.countplot(x="OTHER_DISEASE", data=df, palette="pastel")
plt.show()

In [None]:
df['OTHER_DISEASE'].value_counts()

This plot shows us that far fewer patients in our data set have some other disease.

In [None]:
# Plotting a countplot to see whether the patient has cardiovascular disease
sns.countplot(x="CARDIOVASCULAR", data=df, palette="pastel")
plt.show()

In [None]:
df['CARDIOVASCULAR'].value_counts()

This plot shows us that far fewer patients in our data set have cardiovascular disease.

In [None]:
# Plotting a countplot to see whether the patient is obese
sns.countplot(x="OBESITY", data=df, palette="pastel")
plt.show()

In [None]:
df['OBESITY'].value_counts()

This plot shows us that far fewer patients in our data set are obese.

In [None]:
# Plotting a countplot to see whether the patient has chronic renal disease or not
sns.countplot(x="RENAL_CHRONIC", data=df, palette="pastel")
plt.show()

In [None]:
df['RENAL_CHRONIC'].value_counts()

This plot shows us that far fewer patients in our data set have chronic renal disease.

In [None]:
# Plotting a countplot to compare people who use tobacco or not
sns.countplot(x="TOBACCO", data=df, palette="pastel")
plt.show()

In [None]:
df['TOBACCO'].value_counts()

This plot shows us that far fewer patients in our data set are smokers.

In [None]:
# Plotting a countplot to see whether the patient had been admitted to an Intensive Care Unit
sns.countplot(x="ICU", data=df, palette="pastel")
plt.show()

In [None]:
df['ICU'].value_counts()

Okay, here we have a lot of missing values, so I will drop this column.

In [None]:
df.drop(columns=["ICU"], inplace=True)

In [None]:
# COVID test findings. Values 1-3 mean that the patient was diagnosed with COVID
# in different degrees. 4 or higher means that the patient is not a carrier of COVID
# or that the test is inconclusive.
sns.countplot(x="CLASIFFICATION_FINAL", data=df, palette="pastel")
plt.show()

In [None]:
df['CLASIFFICATION_FINAL'].value_counts()

In [None]:
# 1 means the patient died, 2 means the patient is alive
df["DEATH"] = [2 if each=="9999-99-99" else 1 for each in df.DATE_DIED]

In [None]:
# dropping DATE_DIED since we don't need this column anymore
df.drop(columns=["DATE_DIED"], inplace=True)
df

In [None]:
# create a heatmap plot
plt.figure(figsize=(16, 16))
sns.heatmap(df.corr(), cmap="seismic", annot=True, vmin=-1, vmax=1);

Each square shows the correlation between the variables on each axis. Correlation ranges from -1 to +1. Values closer to zero mean there is no linear trend between the two variables. The closer the correlation is to 1, the more positively correlated the variables are; that is, as one increases so does the other, and the closer to 1, the stronger this relationship. A correlation closer to -1 is similar, but instead of both increasing, one variable decreases as the other increases. The diagonal squares are all 1 (dark) because they correlate each variable with itself, which is a perfect correlation. For the rest, the larger the absolute value and the darker the color, the stronger the correlation between the two variables.
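
To make the three cases concrete, here is a small sketch that computes the Pearson correlation by hand. The helper `pearson` and the toy lists are assumptions for illustration; as far as I know, `df.corr()` uses the same Pearson formula by default.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the data set
age = [20, 35, 50, 65, 80]
rises_with_age = [1, 2, 3, 4, 5]   # increases whenever age increases
falls_with_age = [5, 4, 3, 2, 1]   # decreases whenever age increases
no_linear_trend = [2, 1, 3, 1, 2]  # no linear relationship with age

r_pos = pearson(age, rises_with_age)
r_neg = pearson(age, falls_with_age)
r_zero = pearson(age, no_linear_trend)
print(r_pos, r_neg, r_zero)
```

One thing to keep in mind when reading the heatmap: because the Boolean columns are coded 1 = "yes" and 2 = "no", and DEATH is coded 1 = dead and 2 = alive, a negative correlation between AGE and DEATH means that older patients are more often in the "dead" group.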

In [None]:
# Plotting a countplot of the DEATH label (1 = dead, 2 = alive)
sns.countplot(x="DEATH", data=df, palette="pastel")
plt.show()

I will use a crosstab plot to see the number of patients who died from COVID-19 by age, where 2 means the patient is alive and 1 means the patient died.

In [None]:
df['DEATH'].value_counts()

That is a large number of deaths. Next, I want to check at what age they mostly happen.

In [None]:
# plotting a crosstab plot
pd.crosstab(df.AGE,df.DEATH).plot(kind="bar",figsize=(20,6))
plt.title('COVID-19 death Ages')
plt.xlabel('Age')
plt.ylabel('Number of patients with COVID-19')
plt.show()

Here we can conclude that death very rarely happens at a young age; it mostly happens between the ages of 50 and 80.

In [None]:
# plotting a crosstab to see whether gender affects the number of deaths
pd.crosstab(df.SEX,df.DEATH).plot(kind="bar",figsize=(15,6),color=['green','red' ])
plt.title('DEATH')
plt.xlabel('Gender(1 = Female, 2 = Male)')
plt.xticks(rotation=0)
plt.legend(["dead", "alive"])
plt.ylabel('Frequency')
plt.show()

Gender does not really affect the number of deaths.

In [None]:
# plotting a crosstab to see whether tobacco use affects the number of deaths
pd.crosstab(df.TOBACCO,df.DEATH).plot(kind="bar",figsize=(15,6),color=['green','red' ])
plt.title('DEATH')
plt.xlabel('Tobacco user(1 = YES, 2 = NO)')
plt.xticks(rotation=0)
plt.legend(["dead", "alive"])
plt.ylabel('Frequency')
plt.show()

Tobacco users are more likely to die from COVID-19.

In [None]:
# plotting a crosstab to see whether diabetes affects the number of deaths
pd.crosstab(df.DIABETES,df.DEATH).plot(kind="bar",figsize=(15,6),color=['green','red' ])
plt.title('DEATH')
plt.xlabel('DIABETES(1 = YES, 2 = NO)')
plt.xticks(rotation=0)
plt.legend(["dead", "alive"])
plt.ylabel('Frequency')
plt.show()

Patients with diabetes are more likely to die from COVID-19.

In [None]:
# plotting a crosstab to see whether obesity affects the number of deaths
pd.crosstab(df.OBESITY,df.DEATH).plot(kind="bar",figsize=(15,6),color=['green','red' ])
plt.title('DEATH')
plt.xlabel('OBESITY(1 = YES, 2 = NO)')
plt.xticks(rotation=0)
plt.legend(["dead", "alive"])
plt.ylabel('Frequency')
plt.show()

Patients with obesity are more likely to die from COVID-19.

In [None]:
# plotting a crosstab to see whether asthma affects the number of deaths
pd.crosstab(df.ASTHMA,df.DEATH).plot(kind="bar",figsize=(15,6),color=['green','red' ])
plt.title('DEATH')
plt.xlabel('ASTHMA(1 = YES, 2 = NO)')
plt.xticks(rotation=0)
plt.legend(["dead", "alive"])
plt.ylabel('Frequency')
plt.show()

Asthma does not affect the number of deaths from COVID-19.

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. It shows the relationship for every pair of the n variables in a DataFrame (n choose 2 combinations) as a matrix of plots, and the diagonal plots are the univariate plots.

Next, I one-hot encode the categorical columns, scale AGE, and split the data into train and test sets to get ready for modeling.

In [None]:
df = pd.get_dummies(df,columns=["MEDICAL_UNIT","CLASIFFICATION_FINAL"],drop_first=True)

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df.AGE = scaler.fit_transform(df.AGE.values.reshape(-1,1))

In [None]:
x = df.drop(columns="DEATH")
y = df["DEATH"]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#split data into test and train 80/20
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
print("X_Train :",x_train.shape)
print("X_Test :",x_test.shape)
print("Y_Train :",y_train.shape)
print("Y_Test :",y_test.shape)

## Modeling

For modeling, I used logistic regression, random forest, decision tree, and stacking ensembling, settling on random forest as the model with the best cross-validation performance. I used the random forest feature importance ranking to guide the choice and order of variables to include as the model was refined.
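
The comparison described above could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not on the patient data, and the hyperparameters, the F1 scoring choice, and names such as `candidates` and `cv_scores` are assumptions for illustration; stacking is left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the patient data (assumption: not the real data set)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42),
}

# 5-fold cross-validation; F1 on the minority class is less flattering than accuracy
cv_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")
    cv_scores[name] = scores.mean()
    print(f"{name:<20} mean F1 = {scores.mean():.3f}")

# Rank the features by random forest importance, most important first
forest = candidates["random forest"].fit(X_demo, y_demo)
ranking = sorted(enumerate(forest.feature_importances_), key=lambda p: p[1], reverse=True)
print(ranking[:3])
```

On the real data, `X_demo` and `y_demo` would be replaced by the training features and labels, and the ranking would be read back against the column names.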

### Random Forest
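A random forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data and considers a random subset of the features at each split; the forest then combines the trees' predictions (scikit-learn averages their predicted class probabilities). Combining many decorrelated trees usually reduces the overfitting of a single deep tree.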

In [None]:
# running RandomForestClassifier
model = RandomForestClassifier(max_depth=5)
model.fit(x_train, y_train)

In [None]:
#evaluate the model
from sklearn.metrics import confusion_matrix
y_predict = model.predict(x_test)
y_pred_quant = model.predict_proba(x_test)[:, 1]
y_pred_bin = model.predict(x_test)

In [None]:
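#Assess the fit with a confusion matrix
# (stored as cm_rf so the imported confusion_matrix function is not overwritten)
cm_rf = confusion_matrix(y_test, y_pred_bin)
cm_rf

In [None]:
#checking the model using sensitivity and specificity
# rows are actual classes, columns are predicted classes; labels are sorted,
# so index 0 is class 1 (dead) and index 1 is class 2 (alive); "dead" is the positive class

# sensitivity = TP / (TP + FN): share of actual deaths that were predicted as deaths
sensitivity = cm_rf[0,0]/(cm_rf[0,0]+cm_rf[0,1])
print('Sensitivity : ', sensitivity )

# specificity = TN / (TN + FP): share of actual survivors that were predicted as survivors
specificity = cm_rf[1,1]/(cm_rf[1,1]+cm_rf[1,0])
print('Specificity : ', specificity)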

In [None]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(x_train,y_train)
y_pred=logreg.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score

# Assuming y_test and y_pred are your true labels and predicted labels, respectively
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sn
cm=confusion_matrix(y_test,y_pred)
# the labels are 1 (dead) and 2 (alive), in that order
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted: dead (1)','Predicted: alive (2)'],index=['Actual: dead (1)','Actual: alive (2)'])
plt.figure(figsize = (8,5))
sn.heatmap(conf_matrix, annot=True,fmt='d',cmap="Blues")

True Positive (we predict the patient dies and the patient actually died): 6581

True Negative (we predict the patient survives and the patient actually survived): 186125

False Positive (we predict the patient dies and the patient actually survived): 4228

False Negative (we predict the patient survives and the patient actually died): 8097

Inference:
We got good accuracy with Logistic Regression.
But accuracy can mislead us, so we have to check the other metrics.
When we look at the F1 Score, it says that we predicted the patients who survived well, but we can't say the same for the patients who died.
We see the same thing when we check the confusion matrix. This problem comes from the imbalanced data set: far more patients survived than died.
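
The gap between accuracy and the other metrics can be checked directly from the four counts listed above. This is a small worked sketch that treats "dead" as the positive class; the variable names are my own.

```python
# Counts reported above for the logistic regression model ("dead" is the positive class)
tp, tn, fp, fn = 6581, 186125, 4228, 8097

total = tp + tn + fp + fn
accuracy = (tp + tn) / total          # share of all predictions that are correct
recall_dead = tp / (tp + fn)          # share of actual deaths the model catches
precision_dead = tp / (tp + fp)       # share of predicted deaths that are real
f1_dead = 2 * precision_dead * recall_dead / (precision_dead + recall_dead)
specificity = tn / (tn + fp)          # share of actual survivors predicted correctly

print(f"accuracy       = {accuracy:.3f}")
print(f"recall (dead)  = {recall_dead:.3f}")
print(f"precision      = {precision_dead:.3f}")
print(f"F1 (dead)      = {f1_dead:.3f}")
print(f"specificity    = {specificity:.3f}")
```

Accuracy comes out at about 0.94, while recall for the patients who died is only about 0.45, so the model misses more than half of the deaths even though it looks accurate overall.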

## Evaluation

Next, I will address the class imbalance by randomly undersampling the majority class.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rand_under = RandomUnderSampler(random_state=0)
x_resampled, y_resampled = rand_under.fit_resample(x,y)

Let's split the resampled data and train again to see whether it helps to fix our problem.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_resampled,y_resampled, test_size=0.2, random_state=42)
print("train_x :",x_train.shape)
print("test_x :",x_test.shape)
print("train_y :",y_train.shape)
print("test_y :",y_test.shape)

In [None]:
print(x.shape)
print(y.shape)


In [None]:
logreg.fit(x_train, y_train)
print("Logistic Regression Accuracy :", logreg.score(x_test, y_test))

In [None]:
from sklearn.metrics import f1_score

In [None]:
print("F1 Score :",f1_score(y_test,logreg.predict(x_test),average=None))

In [None]:
from sklearn.metrics import confusion_matrix

sns.heatmap(confusion_matrix(y_test, logreg.predict(x_test)), annot=True, fmt='d',cmap="Blues")
plt.title("Confusion Matrix", fontsize=18)

We solved the problem with undersampling. We could also have used oversampling, and we would probably get better accuracy, but I think it would be too computationally expensive.
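
The idea behind random undersampling is simple enough to sketch in plain Python. The function `random_undersample` below is a hypothetical illustration of the technique, not the imblearn implementation, and the toy labels are made up.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from the larger classes until every class has the
    same number of rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n_min):  # sample without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy imbalanced labels: 3 patients who died (1) and 12 who survived (2)
labels = [1] * 3 + [2] * 12
rows = [[i] for i in range(len(labels))]

bal_rows, bal_labels = random_undersample(rows, labels)
print("before:", Counter(labels))
print("after: ", Counter(bal_labels))
```

The cost is visible in the counts: the classes end up balanced, but most of the majority-class rows are thrown away. Oversampling keeps all rows and instead grows the minority class, which is why it needs more memory and training time.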

In [None]:
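from sklearn.metrics import roc_curve
# binary labels for the ROC curve: 1 = alive (class 2), 0 = dead (class 1);
# stored in a new variable so y_test is not overwritten when the cell is re-run
y_test_bin = y_test.replace({2:1,1:0})
# Probabilities; column 1 belongs to class 2 (alive), matching the positive label above
logreg_predict = logreg.predict_proba(x_test)
fpr, tpr, thresholds = roc_curve(y_test_bin, logreg_predict[:,1])
plt.plot([0,1],[0,1],"k--")
plt.plot(fpr, tpr, label = "Logistic Regression")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve")
plt.legend()
plt.show()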

# Conclusion

## Limitation

## Recommendation

## Next Steps