<a href="https://www.kaggle.com/code/amarmoibrahim964/covid-19-machine-learning-cart?scriptVersionId=142881288" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Covid 19 Machine Learning Project CART

![/kaggle/input/corona/corona.jpg](https://images.pexels.com/photos/4031867/pexels-photo-4031867.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)

# Introduction

**Coronavirus** disease 2019 (COVID-19) is a contagious disease caused by the virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first known case was identified in Wuhan, China, in December 2019. The disease quickly spread worldwide, resulting in the COVID-19 pandemic.

Machine learning (ML) is a subfield of artificial intelligence that performs complex tasks in a way similar to how humans solve problems. ML starts with data (numbers, photos, or text, such as bank transactions, repair records, time series from sensors, or reports) and predicts the corresponding result. A machine can learn in three main ways: supervised, unsupervised, or reinforcement learning. One supervised method is the CART algorithm, which is used in this project for regression analysis. CART can predict a categorical target variable, producing a classification tree, or a continuous target variable, producing a regression tree. We use this method to explain the statistics of corona cases around the world depending on regional characteristics.


# Method

To set up our model, we rely on two main methods: **classification and regression trees (CART)** to build the model and **k-fold cross-validation** to assess its performance. Cross-validation is discussed in detail when we evaluate the model.

 

![/kaggle/input/tree-reg2/tree reg.jpg](https://www.analyticssteps.com/backend/media/thumbnail/2578400/1990226_1626945689_CART%20algorithmArtboard%201%20copy.jpg)

Decision tree regression is used in predictive modelling to predict a continuous target variable in supervised learning. The fundamental idea is to divide the data set into smaller, more manageable sections. This non-parametric approach can capture both linear and non-linear relationships. Decision trees come in two primary variants: classification trees, for a categorical target, and regression trees, for a numerical target. The explanatory variables may be numerical or categorical. A regression tree estimates a numerical label, which means the output is not restricted to a fixed set of classes (Glenn De'ath et al., 2000). We use decision tree regression on the corona dataset because its target variable is numerical.
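
The idea of dividing the data into sections can be illustrated with a minimal sketch of how a regression tree might choose a single split: it tries each threshold on one feature and keeps the one that minimizes the summed squared error of the two resulting groups. The function `best_split` and the toy arrays are hypothetical and written only for illustration; scikit-learn's actual implementation is more elaborate.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split on one feature that minimizes
    the total squared error of the two child nodes."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between identical values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each child predicts its own mean; sum the squared residuals
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data (illustrative only): two clearly separated groups
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
threshold, sse = best_split(x_toy, y_toy)
print("best threshold:", threshold, "squared error:", sse)
```

A full tree repeats this search over every feature and then recursively inside each child node.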


On the one hand, decision trees are easy to understand. They can tolerate missing data while preserving accuracy, and they do not require extensive data preparation such as normalization or standardization (Glenn De'ath et al., 2000). They can also model nonlinear input–output relationships. On the other hand, decision trees can be biased towards features with many levels, which makes those features more likely to be chosen as splits in the tree. Additionally, the algorithm is greedy: it chooses the best split at each step without taking the effects of future splits into account. Finally, trees are unstable: changing the data slightly can result in significant changes to the tree structure.
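
The claim that trees do not need normalization or standardization can be checked with a small sketch. It fits the same tree on a synthetic feature matrix and on a rescaled copy and compares the predictions. The data and the scale factors are assumptions chosen for illustration, not part of the corona dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, deterministic data (illustrative only)
feature_a = np.arange(40, dtype=float)
feature_b = (np.arange(40) * 7 % 40).astype(float)  # a permutation of 0..39
X_raw = np.column_stack([feature_a, feature_b])
y_demo = np.where(feature_a >= 20, 10.0, 0.0) + 0.1 * feature_b

# Rescale each feature by a different positive constant
X_scaled = X_raw * np.array([1000.0, 50.0])

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_raw, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y_demo)

# Splits depend only on the ordering of feature values, so the fitted
# partitions and therefore the predictions should coincide
same_predictions = np.allclose(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print("identical predictions after rescaling:", same_predictions)
```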

# Application
1. **Installing the packages and importing the dataset** For our project we chose Python, installed the required packages and libraries, and loaded the dataset. Our model requires pyreadr (to read R data files in Python), pandas (data frames), NumPy (Numerical Python library), matplotlib (Python data visualization library), seaborn (Python advanced data visualization library), and scikit-learn (Python machine learning library).


In [None]:
! pip install pyreadr



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyreadr
from sklearn.model_selection import train_test_split

In [None]:
result = pyreadr.read_r('/kaggle/input/corona-data/CoronaData (1).rdata')
df = result["df"]
df

In [None]:
df.info()

Exploring the dataset, we find a data frame with 2227 rows and 14 columns, of which 10 are numeric (float64) and 4 are strings. The columns are location, year, month, new_cases_pm, iso_code,
continent, population_density, median_age, aged_65_older, gdp_per_capita, extreme_poverty,
life_expectancy, human development index and date.

**2. Preparing and cleaning the data**


Before using the data in the model, we need to prepare it, which means removing duplicate rows and dealing with null values. First, we filtered the data to two months (January and July 2021) and then selected only the entries whose dependent variable (new_cases_pm) lies in either the lowest 40% or the highest 40% of its distribution.

In [None]:
# Get some summary statistics (count, mean, standard deviation, min, quartiles, max)
df.describe()

In [None]:
df.columns

In [None]:
df['location'].unique()

In [None]:
df['year'].unique()

In [None]:
df['month'].unique()

In [None]:
# Keep only the rows for January 2021 and July 2021

df1=df[df['year']==2021.0]
df=df1[(df1['month']==1.0)|(df1['month']==7.0 )]
df

In [None]:
#code in low and high (lowest 40%, highest 40%)
# get highest 40%
df_high=df[df['new_cases_pm'] >=df['new_cases_pm'].quantile(0.60)]


# get Lowest 40%
df_low=df[df['new_cases_pm'] <=df['new_cases_pm'].quantile(0.40)]



In [None]:
df40=df[ (df['new_cases_pm'] <=df['new_cases_pm'].quantile(0.40) ) | (df['new_cases_pm'] >=df['new_cases_pm'].quantile(0.60) )  ]
df40

**Cleaning data**: check for duplicate data and missing values in the dataset

In [None]:
##Check For Duplicate Data

dup=df40.duplicated().any()
print("Any duplicate Value?",dup)

In [None]:
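# Check missing values in the dataset

df40.isnull().sum()

# Percentage of missing values per column. Note that the denominator is len(df),
# the January/July 2021 subset before the 40% filter, not len(df40)
pre_missing=df40.isnull().sum()*100/len(df)
pre_missing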


In [None]:
sns.heatmap(df40.isnull(),cmap='viridis',cbar=True,yticklabels=False)
plt.title("Missing Data")
plt.show()

We did not find any duplicate data, but we detected some null entries in the independent variables: about 3% in each of population density, human development index, and aged_65_older, 5% in GDP per capita, and 20% in extreme poverty. To address this, we can either replace missing numerical values with the column mean or remove the affected entries. We first dropped the extreme_poverty column, because around 20% of its entries are missing. As the remaining number of missing values was not significant, we then removed the entries with missing data.
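
For comparison, the two options for handling missing values can be sketched on a tiny hypothetical data frame. The values below are invented for illustration and only the column names are borrowed from the dataset. This notebook removes rows, while mean imputation is shown as the alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries
toy = pd.DataFrame({
    "gdp_per_capita": [1000.0, np.nan, 3000.0, 5000.0],
    "median_age": [20.0, 30.0, np.nan, 40.0],
})

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna(how="any")

# Option 2: replace each missing value with the mean of its column
imputed = toy.fillna(toy.mean(numeric_only=True))

print("rows after dropping:", len(dropped))
print("rows after imputing:", len(imputed))
print(imputed)
```

Dropping keeps only fully observed rows, whereas imputation keeps every row at the cost of inserting values that were never measured.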

In [None]:
# Drop the column extreme_poverty because about 20% of its values are missing.
# Reassigning (instead of inplace=True) avoids modifying a slice of df

df40 = df40.drop(columns='extreme_poverty')

In [None]:
# Drop all rows that still contain a missing value
df40 = df40.dropna(how='any')


In [None]:
df40

In [None]:
df40['continent'].unique()

**Encoding categorical data**

Before creating our model, we needed to determine our predictors (independent variables) and the target (dependent variable), all of which must be numerical. For that, we assign numbers to some string variables. Specifically, we created a column (locat_SOrN) as a categorical value (North or South): North America, Europe, and Asia are assigned to North and the rest of the world to South. We then created a column (North_or_South) as a dummy variable that takes the value 1 for North and 0 for South. Similarly, we encoded the two months as a dummy variable (Season), which takes the value 1 for January and 0 for July. Finally, we dropped all categorical and unwanted columns and kept only numerical variables.

In [None]:
# Create a categorical North/South column from the continent
def north_or_south(continent):
    if continent in ['North America','Europe','Asia']:
        return "North"
    else:
        return "South"

df40['locat_SOrN']=df40['continent'].apply(north_or_south)
df40

In [None]:
# Create a dummy variable: 1 = North, 0 = South
# (column name without the trailing space, appended as the last column)
df40['North_or_South'] = df40['locat_SOrN'].map({'North':1,'South':0})


In [None]:
# Create a dummy variable for the month: 1 = January, 0 = July
def season(month):
    if month == 1.0:
        return  1
    else:
        return  0

df40['Season']=df40['month'].apply(season)
df40

In [None]:
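# Check the correlation between the numeric columns
# (string columns such as 'continent' are excluded, since they cannot be correlated)

#corr=df40.select_dtypes(include='number').corr()
#corr.style.background_gradient(cmap='coolwarm',axis=None)

plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(df40.select_dtypes(include='number').corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);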

In [None]:
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(df40.corr()[['new_cases_pm']].sort_values(by='new_cases_pm', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with new_cases_pm', fontdict={'fontsize':18}, pad=16);

In [None]:
# Drop the columns we do not need; only the numeric columns are kept.
# 'month' is dropped because it is already encoded in the Season dummy

df_f=df40.drop(columns=['continent','iso_code','location','date','year','locat_SOrN','month'])


**3. Creating the model**

a) Splitting variables into predictors (x) and response variable (y).
We split the data into predictors (x): Season, population_density, median_age, aged_65_older, gdp_per_capita, life_expectancy, human development index, and North_or_South; and the target (y): new_cases_pm.

In [None]:
#1. Split into predictors (x) and response variable (y)
x=df_f.drop('new_cases_pm',axis=1)
y=df_f['new_cases_pm']

In [None]:
#2.Splitting the dataset into the Training set and Test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20,random_state=42)

In [None]:
#3.Fit Model with Training Data
# import the regressor
from sklearn.tree import DecisionTreeRegressor

# create a regressor object
model = DecisionTreeRegressor()

# fit the regressor with X and Y data
model.fit(x_train,y_train)

In [None]:
# Plot the decision tree (only the top levels, limited by max_depth=3)
from sklearn import tree
fig=plt.figure(figsize=(43,14))
tree.plot_tree(model,filled=True,rounded=True,max_depth=3,fontsize=20)
plt.show()
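
The split above also produces a held-out test set. A minimal sketch of how such a set could be scored is given here on synthetic data, since the corona data frame is not reproduced. The names `X_demo`, `y_demo`, and `demo_model` are hypothetical, and the metrics (R² and RMSE) are one reasonable choice rather than the only one.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=200)

# Same 80/20 split proportions as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

demo_model = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

# Score the predictions on data the tree has never seen
test_r2 = r2_score(y_te, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_te, y_pred))
print("test R^2: %0.2f, test RMSE: %0.2f" % (test_r2, test_rmse))
```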

**4. Evaluating the model using cross-validated metrics**

A simple way to estimate the model's performance without sacrificing too much data is to validate it on a small held-out portion of the training data, since this indicates how well the model predicts unseen data.

 

K-fold cross-validation is a widely used cross-validation approach. For example, with k=5, 4 folds are used for training and 1 fold for testing, and the process repeats until every fold has served as the test set once.

![/kaggle/input/cross-validation-img12/cross-validation.png](https://static.javatpoint.com/tutorial/machine-learning/images/cross-validation.png)
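
The fold rotation described above can be made explicit with scikit-learn's `KFold`. The sketch below uses a hypothetical array of 10 samples and counts how often each sample lands in the validation fold. The shuffling and the seed are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples with two features each
X_demo = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
times_validated = np.zeros(len(X_demo), dtype=int)
fold_sizes = []

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo), start=1):
    # Four folds train the model, the remaining fold validates it
    times_validated[val_idx] += 1
    fold_sizes.append((len(train_idx), len(val_idx)))
    print("fold", fold, "train:", train_idx, "validate:", val_idx)

print("times each sample was validated:", times_validated)
```

`cross_val_score` with `cv=5` performs this kind of rotation internally and returns one score per fold.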

In [None]:
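from sklearn.model_selection import cross_val_score
np.random.seed(42)
# For a regressor, cross_val_score uses the estimator's default scorer,
# which is the coefficient of determination (R^2), not an accuracy
Score=cross_val_score(model,x_train,y_train,cv=5)

score_mean=Score.mean()

print("Mean R^2 of %0.2f with a standard deviation of %0.2f" %(score_mean, Score.std()))

Applying this method to our model, we obtained a mean score of -0.70 with a standard deviation of 0.94 across the five folds. Because the default score for a regressor is R² rather than an accuracy, a negative mean with such a large spread indicates that the unpruned tree generalizes poorly on this dataset and that its performance varies strongly from fold to fold. K-fold cross-validation nevertheless remains one of the most reliable ways to estimate the performance of a machine learning model, because it ensures that every observation is used in both training and testing.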

# Conclusion
To recapitulate, machine learning is a powerful method for explaining large datasets and creating models
based on statistics. However, data preparation and cleaning is a very important step that can affect the final model. The classification and regression tree (CART) and cross-validation are the most
prominent parts of building our model and evaluating its performance, especially when we need to detect and alleviate
overfitting.