# Covid19 Outbreak Prediction


# 1. Introduction

In this project we aim to predict the daily number of confirmed COVID-19 cases, and the number of resulting fatalities, for future dates in various locations across the world. In this notebook we use the COVID19 Global Forecasting dataset from Johns Hopkins CSSE, including the train, test and submission CSV files. First, we perform data analysis to identify the factors that affect the transmission rate of COVID-19. Next, we analyze the effect of COVID-19 across the world. Finally, we use two ensemble learning models, a Random Forest regressor and XGBoost, to predict the daily number of confirmed cases and fatalities in each location.

![](https://hbr.org/resources/images/article_assets/2020/03/Mar20_01_Wikimedia3.jpg)

## 2. Installing Required Packages

In [None]:
!pip install dataprep
!pip install plotly

## 3. Importing Libraries and Packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib
import plotly.graph_objects as go
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot

#default theme
sns.set(context='notebook', style='darkgrid', palette='Spectral', font='sans-serif', font_scale=1, rc=None)
matplotlib.rcParams['figure.figsize'] =[8,8]
matplotlib.rcParams.update({'font.size': 15})
matplotlib.rcParams['font.family'] = 'sans-serif'

# dataprep library
from dataprep.eda import *
from dataprep.datasets import load_dataset
from dataprep.eda import create_report

In [None]:
#machine learning Library
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn import ensemble
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics

## 4. Loading Data


In [None]:
train = pd.read_csv('../input/covid19-global-forecasting-week-5/train.csv')
test = pd.read_csv('../input/covid19-global-forecasting-week-5/test.csv')
sample = pd.read_csv('../input/covid19-global-forecasting-week-5/submission.csv')
sub = pd.read_csv('../input/covid19-global-forecasting-week-5/submission.csv')

In [None]:
train.head()

In [None]:
test.head()

## 5. Data Analysis

In [None]:
train.shape

In [None]:
train.info()

In [None]:
sample.head()

In [None]:
train.dtypes.value_counts().plot.pie(explode=[0.1,0.1,0.1],autopct='%1.1f%%',shadow=True)
plt.title('data type');

1. Our data has 969640 rows and 9 columns.
2. As the pie chart shows:
    * more than 55% of the columns are of **object** type
    * 33% are integer
    * 11% are float
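
The dtype shares can also be computed directly instead of being read off the pie chart. This is a minimal sketch on a small hypothetical frame `toy` (made-up values, not the competition data); the same expression should apply to `train`.

```python
import pandas as pd

# Hypothetical stand-in for `train`; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Population": [100, 200, 300],
    "Weight": [0.1, 0.2, 0.3],
    "TargetValue": [5, 0, 7],
})

# Share of columns per dtype, in percent.
dtype_share = toy.dtypes.astype(str).value_counts(normalize=True) * 100
print(dtype_share.round(1))
```

Casting the dtypes to strings first keeps the index labels plain text, which makes the result easy to look up or plot.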

In [None]:
train.describe(include='all')

What we can see from the description (`top` and `freq` count rows in the table, not cases):
* the most frequent Country_Region is US, with 895440 rows
* the most frequent Province_State is Texas, with 71400 rows
* the most frequent County is Washington, with 8680 rows
* the most frequent Date is 2020-05-20, with 6926 rows
* the most frequent Target is Fatalities, with 484820 rows
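
Because `describe(include='all')` reports how often a value appears, not how many cases it accounts for, the two rankings can differ. The sketch below shows the difference on a hypothetical four-row frame; the country names and numbers are invented for illustration only.

```python
import pandas as pd

# Hypothetical rows in the same layout as `train`; the values are made up.
toy = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "Italy"],
    "Target": ["ConfirmedCases", "Fatalities", "ConfirmedCases", "ConfirmedCases"],
    "TargetValue": [1, 0, 2, 50],
})

# What `describe` reports as top/freq: the number of rows per value.
row_counts = toy["Country_Region"].value_counts()

# Actual confirmed cases per country: sum TargetValue over the ConfirmedCases rows.
confirmed = toy[toy["Target"] == "ConfirmedCases"]
case_totals = (
    confirmed.groupby("Country_Region")["TargetValue"]
    .sum()
    .sort_values(ascending=False)
)

print(row_counts)
print(case_totals)
```

In this toy example the US has the most rows while Italy has the most confirmed cases, so row frequency alone does not say where most cases occurred.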

## 6. Missing Data

### 6.1 Training Data

In [None]:
# count and percentage of missing values per column in the training data
missing = train.isnull().sum()
missing_percent = missing / train.shape[0] * 100

dic = {
    'missing': missing,
    'missing_percent %': missing_percent
}
frame = pd.DataFrame(dic)
frame

### 6.2 Test Data

In [None]:
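# count and percentage of missing values per column in the test data
# (the percentage is relative to the number of test rows, not training rows)
missing = test.isnull().sum()
missing_percent = missing / test.shape[0] * 100

dic = {
    'missing': missing,
    'missing_percent %': missing_percent
}
frame = pd.DataFrame(dic)
frame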
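
The same summary is needed for both the training and the test data, so it can be wrapped in a small function. This is a sketch only: `missing_summary` is a hypothetical helper name, and the frame `toy` contains made-up values.

```python
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values for each column of df."""
    missing = df.isnull().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_percent": missing / len(df) * 100,
    })

# Hypothetical frame with two missing County values out of four rows.
toy = pd.DataFrame({
    "County": ["A", None, None, "D"],
    "Population": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Dividing by `len(df)` inside the function ties each percentage to the frame being summarized, which avoids mixing up the row counts of two different frames.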

## 7. Data Visualization

### Pie chart for confirmed cases and fatalities

In [None]:
fig = px.pie(train, values='TargetValue', names='Target')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

### Pie chart for countries and TargetValue

In [None]:
fig = px.pie(train, values='TargetValue', names='Country_Region')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

### Bar plot for County

In [None]:
plt.figure(figsize=(30,9))
county_plot=train.County.value_counts().head(100)
sns.barplot(x=county_plot.index, y=county_plot)  # keyword arguments are required by recent seaborn versions
plt.xticks(rotation=90)
plt.title('County count')

### Bar plot for Province State

In [None]:
plt.figure(figsize=(30,9))
Province_State_plot=train.Province_State.value_counts().head(100)
sns.barplot(x=Province_State_plot.index, y=Province_State_plot)  # keyword arguments are required by recent seaborn versions
plt.xticks(rotation=90)
plt.title('Province State count');

### Bar Plot for Country Region

In [None]:
plt.figure(figsize=(30,9))
Country_Region_plot=train.Country_Region.value_counts().head(30)
sns.barplot(x=Country_Region_plot.index, y=Country_Region_plot)  # keyword arguments are required by recent seaborn versions
plt.xticks(rotation=90)
plt.title('Country Region count');

### Countries Share of Worldwide Confirmed Cases

In [None]:
confirmed = train[train['Target']=='ConfirmedCases']
fig = px.treemap(confirmed, path=['Country_Region'], values='TargetValue',width=900, height=600)
fig.update_traces(textposition='middle center', textfont_size=15)
fig.update_layout(
    title={
        'text': 'Total Share of Worldwide COVID19 Confirmed Cases',
        'y':0.92,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

### Countries Share of Worldwide Fatalities

In [None]:
dead = train[train['Target']=='Fatalities']
fig = px.treemap(dead, path=['Country_Region'], values='TargetValue',width=900,height=600)
fig.update_traces(textposition='middle center', textfont_size=15)
fig.update_layout(
    title={
        'text': 'Total Share of Worldwide COVID19 Fatalities',
        'y':0.92,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

### Countries Share of Worldwide Active Cases

In [None]:
fig = px.treemap(train, path=['Country_Region'], values='TargetValue',
                  color='Population', hover_data=['Country_Region'],
                  color_continuous_scale='matter', title='Current share of Worldwide COVID19 Active Cases')
fig.show()

### Growth rate in top 5 Affected Countries

In [None]:
last_date = train.Date.max()
df_countries = train[train['Date']==last_date]
df_countries = df_countries.groupby('Country_Region', as_index=False)['TargetValue'].sum()
df_countries = df_countries.nlargest(5,'TargetValue')
df_trend = train.groupby(['Date','Country_Region'], as_index=False)['TargetValue'].sum()
df_trend = df_trend.merge(df_countries, on='Country_Region')
df_trend.rename(columns={'Country_Region':'Country', 'TargetValue_x':'Cases'}, inplace=True)


In [None]:
px.line(df_trend, x='Date', y='Cases', color='Country', title='COVID19 Total Cases growth for top 5 worst affected countries')

## 8. Data Preprocessing

In [None]:
train = train.drop(['County','Province_State','Country_Region','Target'],axis=1)
test = test.drop(['County','Province_State','Country_Region','Target'],axis=1)
train

In [None]:
train.isnull().sum()

The date can be used to create new calendar features such as the day, month and week of the year.

In [None]:
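def create_features(df):
    # expects df['Date'] to be a datetime column (see pd.to_datetime)
    df['day'] = df['Date'].dt.day
    df['month'] = df['Date'].dt.month
    df['dayofweek'] = df['Date'].dt.dayofweek
    df['dayofyear'] = df['Date'].dt.dayofyear
    df['quarter'] = df['Date'].dt.quarter
    # .dt.weekofyear was removed in pandas 2.0; isocalendar() gives the ISO week number
    df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)
    return df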
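
To illustrate the calendar features derived from the date, here is a small self-contained sketch. `add_date_features` is a hypothetical helper (not part of the notebook) that works on a copy and takes the week number from `Series.dt.isocalendar()`; the two sample dates are chosen arbitrarily.

```python
import pandas as pd

def add_date_features(df):
    """Return a copy of df with calendar features derived from its 'Date' column."""
    out = df.copy()
    dates = pd.to_datetime(out["Date"])
    out["day"] = dates.dt.day
    out["month"] = dates.dt.month
    out["dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["dayofyear"] = dates.dt.dayofyear
    out["quarter"] = dates.dt.quarter
    out["weekofyear"] = dates.dt.isocalendar().week.astype(int)  # ISO week number
    return out

toy = pd.DataFrame({"Date": ["2020-05-20", "2020-01-01"]})
features = add_date_features(toy)
print(features)
```

Working on a copy leaves the input frame unchanged, which can make it easier to rerun notebook cells in any order.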

In [None]:
# function to split training and development (validation) data
def train_dev_split(df, days):
    # use the last `days` days as the dev set; df['Date'] must be a datetime column
    date = df['Date'].max() - pd.Timedelta(days=days)
    return df[df['Date'] <= date], df[df['Date'] > date]
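
A time-based split like this keeps the most recent days out of training. The sketch below runs the same idea on a hypothetical ten-day frame; `split_last_days` is an illustrative name, and the data is made up.

```python
import pandas as pd

def split_last_days(df, days):
    """Split df into (older rows, rows from the last `days` days) by its 'Date' column."""
    cutoff = df["Date"].max() - pd.Timedelta(days=days)
    return df[df["Date"] <= cutoff], df[df["Date"] > cutoff]

# Hypothetical daily series covering 2020-05-01 through 2020-05-10.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-05-01", periods=10, freq="D"),
    "TargetValue": range(10),
})

train_part, dev_part = split_last_days(toy, days=3)
print(len(train_part), len(dev_part))
print(dev_part["Date"].min(), dev_part["Date"].max())
```

Unlike a random split, every row in the dev part is later than every row in the training part, which mirrors the forecasting setting of predicting future dates.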


In [None]:
test_date_min = test['Date'].min()
test_date_max = test['Date'].max()


In [None]:
def to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day


In [None]:
train['Date']=pd.to_datetime(train['Date'])
test['Date']=pd.to_datetime(test['Date'])

In [None]:
# encode dates as integers of the form YYYYMMDD in both frames
test['Date']=test['Date'].dt.strftime("%Y%m%d").astype(int)
train['Date']=train['Date'].dt.strftime("%Y%m%d").astype(int)
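
The `strftime` conversion and the arithmetic in `to_integer` produce the same YYYYMMDD integers. The sketch below checks this on two sample dates; the Series `dates` is made up for illustration.

```python
import pandas as pd

def to_integer(ts):
    """Encode a timestamp as the integer YYYYMMDD."""
    return 10000 * ts.year + 100 * ts.month + ts.day

dates = pd.to_datetime(pd.Series(["2020-01-23", "2020-05-20"]))

# Route 1: format as text, then cast to int.
via_strftime = dates.dt.strftime("%Y%m%d").astype(int)

# Route 2: arithmetic on the year, month and day components.
via_arithmetic = dates.apply(to_integer)

print(via_strftime.tolist())
print(via_arithmetic.tolist())
```

Note that consecutive days are not always one unit apart in this encoding (20200131 is followed by 20200201), so the integer is an ordered label for the date, not a count of elapsed days.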


## 9. Splitting The Data

In [None]:
predictors = train.drop(['TargetValue', 'Id'], axis=1)
target = train["TargetValue"]
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.22, random_state = 0)

## 10. RandomForestRegressor

### Model 1 (10 estimators)

In [None]:
model1 = RandomForestRegressor(n_jobs=-1)
estimators = 10
model1.set_params(n_estimators=estimators)

scores = []

pipeline1 = Pipeline([('scaler2' , StandardScaler()),
                        ('RandomForestRegressor: ', model1)])
pipeline1.fit(X_train , y_train)
prediction = pipeline1.predict(X_test)

scores.append(pipeline1.score(X_test, y_test))

In [None]:
plt.figure(figsize=(8,6))
plt.plot(y_test,y_test,color='deeppink')
plt.scatter(y_test,prediction,color='dodgerblue')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
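# compute the R2 score shown in the title instead of hard-coding it
plt.title(f'Random Forest Regressor (R2 Score = {metrics.r2_score(y_test, prediction):.2f})',fontsize=14)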
plt.show()

In [None]:
print('RMSE of model1 =', np.sqrt(metrics.mean_squared_error(y_test,prediction)))
print('R2 Score of model1 = ',metrics.r2_score(y_test,prediction))

### Model 2 (100 estimators)

In [None]:
model2 = RandomForestRegressor(n_jobs=-1)
estimators = 100
model2.set_params(n_estimators=estimators)

pipeline2 = Pipeline([('scaler2' , StandardScaler()),
                        ('RandomForestRegressor: ', model2)])
pipeline2.fit(X_train , y_train)
prediction = pipeline2.predict(X_test)


In [None]:
plt.figure(figsize=(8,6))
plt.plot(y_test,y_test,color='deeppink')
plt.scatter(y_test,prediction,color='dodgerblue')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
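# compute the R2 score shown in the title instead of hard-coding it
plt.title(f'Random Forest Regressor (R2 Score = {metrics.r2_score(y_test, prediction):.2f})',fontsize=14)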
plt.show()

In [None]:
print('RMSE of model2 =', np.sqrt(metrics.mean_squared_error(y_test,prediction)))
print('R2 Score of model2 = ',metrics.r2_score(y_test,prediction))

## 11. XGBoost Regressor

In [None]:
import xgboost as xgb

In [None]:
# n_jobs replaces the deprecated nthread alias; random_state replaces seed
xgbr = xgb.XGBRegressor(n_estimators=300, learning_rate=0.01, gamma=0, subsample=.7,
                        colsample_bytree=.7, max_depth=10,
                        min_child_weight=0,
                        objective='reg:squarederror', scale_pos_weight=1,
                        random_state=27, reg_alpha=0.00006, n_jobs=-1)

In [None]:
xgbr.fit(X_train,y_train)

In [None]:
prediction_xgbr=xgbr.predict(X_test)

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(x=y_test, y=prediction_xgbr, color='dodgerblue')
plt.plot(y_test,y_test, color='deeppink')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
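# compute the R2 score shown in the title instead of hard-coding it
plt.title(f'XGBoost Regressor (R2 Score = {metrics.r2_score(y_test, prediction_xgbr):.2f})',fontsize=14)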
plt.show()

In [None]:
print('RMSE_XGBoost Regression=', np.sqrt(metrics.mean_squared_error(y_test,prediction_xgbr)))
print('R2 Score_XGBoost Regression=',metrics.r2_score(y_test,prediction_xgbr))

## 12. Result

We found that model1 works best for this purpose, so we compute the final result on the test data using this model.
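
One way to back a choice like this is to put the held-out metrics of every candidate side by side. The sketch below does so on synthetic data from `make_regression`, not on the competition data, so its numbers say nothing about the models above; the names `candidates` and `comparison` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real predictors and target.
X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.22, random_state=0)

candidates = {
    "rf_10": RandomForestRegressor(n_estimators=10, random_state=0),
    "rf_100": RandomForestRegressor(n_estimators=100, random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "r2": float(r2_score(y_te, pred)),
    })

# Lowest RMSE first, so the best candidate is in the first row.
comparison = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
print(comparison)
```

Collecting the scores in one table makes the selection criterion explicit, and other candidates (for example an XGBoost regressor) could be added to the `candidates` dictionary in the same way.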

In [None]:
X_test.head()

In [None]:
test.head()

In [None]:
# drop the ForecastId from the test data
test.drop(['ForecastId'],axis=1,inplace=True)
test.index.name = 'Id'
test

In [None]:
predictions = pipeline1.predict(test)

pred_list = [int(x) for x in predictions]

output = pd.DataFrame({'Id': test.index, 'TargetValue': pred_list})
print(output)

# The End