## Coronavirus disease (COVID-19) Pandemic Forecasting using Random Forest

This is a simple starter kernel that fits a random forest on latitude, longitude, and date only. Feature engineering and hyperparameter tuning should improve on this baseline.

Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/covid19-global-forecasting-week-1/train.csv
/kaggle/input/covid19-global-forecasting-week-1/submission.csv
/kaggle/input/covid19-global-forecasting-week-1/test.csv


## Import Data

In [2]:
train = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/train.csv")
test = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/test.csv")
submission = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/submission.csv")
train.head()


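|   | Id | Province/State | Country/Region | Lat | Long | Date | ConfirmedCases | Fatalities |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | Afghanistan | 33.0 | 65.0 | 2020-01-22 | 0.0 | 0.0 |
| 1 | 2 | NaN | Afghanistan | 33.0 | 65.0 | 2020-01-23 | 0.0 | 0.0 |
| 2 | 3 | NaN | Afghanistan | 33.0 | 65.0 | 2020-01-24 | 0.0 | 0.0 |
| 3 | 4 | NaN | Afghanistan | 33.0 | 65.0 | 2020-01-25 | 0.0 | 0.0 |
| 4 | 5 | NaN | Afghanistan | 33.0 | 65.0 | 2020-01-26 | 0.0 | 0.0 |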


In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17892 entries, 0 to 17891
Data columns (total 8 columns):
Id                17892 non-null int64
Province/State    8190 non-null object
Country/Region    17892 non-null object
Lat               17892 non-null float64
Long              17892 non-null float64
Date              17892 non-null object
ConfirmedCases    17892 non-null float64
Fatalities        17892 non-null float64
dtypes: float64(4), int64(1), object(3)
memory usage: 1.1+ MB


In [4]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12212 entries, 0 to 12211
Data columns (total 6 columns):
ForecastId        12212 non-null int64
Province/State    5590 non-null object
Country/Region    12212 non-null object
Lat               12212 non-null float64
Long              12212 non-null float64
Date              12212 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 572.6+ KB


In [5]:
train.shape #No. of Rows and Columns in Training dataset

(17892, 8)

In [6]:
test.shape #No. of Rows and Columns in Testing dataset

(12212, 6)

In [7]:
train.describe()

|   | Id | Lat | Long | ConfirmedCases | Fatalities |
|---|---|---|---|---|---|
| count | 17892.0 | 17892.0 | 17892.0 | 17892.0 | 17892.0 |
| mean | 13191.5 | 26.287693 | 4.766191 | 325.207523 | 11.974737 |
| std | 7624.675152 | 22.935092 | 79.923261 | 3538.599684 | 174.346267 |
| min | 1.0 | -41.4545 | -157.4983 | 0.0 | 0.0 |
| 25% | 6596.25 | 13.145425 | -71.516375 | 0.0 | 0.0 |
| 50% | 13191.5 | 32.98555 | 9.775 | 0.0 | 0.0 |
| 75% | 19786.75 | 42.501575 | 64.688975 | 10.0 | 0.0 |
| max | 26382.0 | 71.7069 | 174.886 | 69176.0 | 6820.0 |


**Basic Information**

In [8]:
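# Note: len(train) counts rows (one per location per day), not reported cases
print(f'Total reported cases are {len(train)}.')
# Note: these sums run over every row, i.e. over every location and every date
print(f'Total confirmed cases are {int(train["ConfirmedCases"].sum())}.')
print(f'Total fatality cases are {int(train["Fatalities"].sum())}.')
print(f'Total countries are {train["Country/Region"].nunique()}.')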

Total reported cases are 17892.
Total confirmed cases are 5818613.
Total fatality cases are 214252.
Total countries are 163.
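
The sums above add up every daily row. If the counts are cumulative per location (an assumption here; check the competition's data description), a more meaningful total takes each location's value on its last date and sums those. A minimal sketch on a made-up toy frame that only borrows the column names from `train.csv`:

```python
import pandas as pd

# Toy data with train.csv column names; the values are invented for illustration.
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Date": ["2020-01-22", "2020-01-23", "2020-01-24"] * 2,
    "ConfirmedCases": [0.0, 2.0, 5.0, 1.0, 1.0, 4.0],
    "Fatalities": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
})

# Assumption: counts are cumulative, so the last date per location holds its running total.
latest = (
    toy.sort_values("Date")
       .groupby("Country/Region")[["ConfirmedCases", "Fatalities"]]
       .last()
)

total_cases = int(latest["ConfirmedCases"].sum())   # sum of per-location latest values
total_deaths = int(latest["Fatalities"].sum())
naive_sum = int(toy["ConfirmedCases"].sum())        # what summing every row gives

print(total_cases, total_deaths, naive_sum)
```

On the real data the grouping key would probably need `Province/State` as well, since several countries are split into provinces.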


## Data Cleaning

In [9]:
# Encode the date as an integer of the form YYYYMMDD (e.g. "2020-01-22" -> 20200122)
train["Date"] = train["Date"].apply(lambda x: x.replace("-", ""))
train["Date"] = train["Date"].astype(int)
train.head()

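|   | Id | Province/State | Country/Region | Lat | Long | Date | ConfirmedCases | Fatalities |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NaN | Afghanistan | 33.0 | 65.0 | 20200122 | 0.0 | 0.0 |
| 1 | 2 | NaN | Afghanistan | 33.0 | 65.0 | 20200123 | 0.0 | 0.0 |
| 2 | 3 | NaN | Afghanistan | 33.0 | 65.0 | 20200124 | 0.0 | 0.0 |
| 3 | 4 | NaN | Afghanistan | 33.0 | 65.0 | 20200125 | 0.0 | 0.0 |
| 4 | 5 | NaN | Afghanistan | 33.0 | 65.0 | 20200126 | 0.0 | 0.0 |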
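
One thing to keep in mind with the integer date encoding: consecutive days are not evenly spaced across month boundaries. A possible alternative (not what this kernel uses) is to count days since the first date. The sketch below compares the two encodings on three made-up dates; `as_int` and `day_index` are illustrative names.

```python
import pandas as pd

dates = pd.Series(["2020-01-30", "2020-01-31", "2020-02-01"])

# Encoding used in this kernel: strip the hyphens and cast to int.
as_int = dates.str.replace("-", "", regex=False).astype(int)

# Alternative sketch: number of days elapsed since the earliest date.
parsed = pd.to_datetime(dates)
day_index = (parsed - parsed.min()).dt.days

# Step between consecutive days under each encoding.
int_steps = as_int.diff().dropna().astype(int).tolist()
day_steps = day_index.diff().dropna().astype(int).tolist()

print(int_steps)  # jump at the month boundary
print(day_steps)  # constant step of one day
```

Tree models only split on thresholds, so the uneven spacing may matter less than it would for a linear model, but the day index is easier to reason about.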


### Drop NaNs

In [10]:
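# Province/State is missing for more than half of the rows, so drop the column,
# then drop any rows that still contain NaNs
train = train.drop(['Province/State'], axis=1)
train = train.dropna()
train.isnull().sum()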

Id                0
Country/Region    0
Lat               0
Long              0
Date              0
ConfirmedCases    0
Fatalities        0
dtype: int64
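
Dropping `Province/State` throws away the only thing that separates provinces of the same country apart from their coordinates. A possible feature-engineering step, sketched here on made-up rows, is to fill the missing provinces and build a single location key instead. The `Location` and `LocationCode` columns are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Made-up rows that mimic the two location columns of train.csv.
toy = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing", None],
    "Country/Region": ["Afghanistan", "China", "China", "Italy"],
})

# Fill missing provinces with an empty string and join both columns into one key.
province = toy["Province/State"].fillna("")
toy["Location"] = (toy["Country/Region"] + "_" + province).str.rstrip("_")

# Integer codes (assigned in sorted order of the key) that a tree model can consume.
toy["LocationCode"] = toy["Location"].astype("category").cat.codes

print(toy[["Location", "LocationCode"]])
```

The same mapping would have to be applied to the test set, for example by fitting the categories on the union of train and test locations.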

In [11]:
# Apply the same date encoding to the test data
test["Date"] = test["Date"].apply(lambda x: x.replace("-", ""))
test["Date"] = test["Date"].astype(int)
# Lat and Long have no NaNs in the test set, so no rows need to be dropped
#test = test.dropna()
test.isnull().sum()



ForecastId           0
Province/State    6622
Country/Region       0
Lat                  0
Long                 0
Date                 0
dtype: int64
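
Since train and test go through the same date encoding, one option is to wrap it in a small helper so the two can never drift apart. This is only a sketch: `prepare` and `FEATURES` are hypothetical names, and the toy frames are made up.

```python
import pandas as pd

FEATURES = ["Lat", "Long", "Date"]  # the feature columns this kernel trains on

def prepare(df):
    """Return a copy with Date encoded as a YYYYMMDD integer; fail on NaNs in the features."""
    out = df.copy()
    out["Date"] = out["Date"].str.replace("-", "", regex=False).astype(int)
    missing = out[FEATURES].isnull().sum()
    if missing.any():
        raise ValueError(f"NaNs in feature columns: {missing[missing > 0].to_dict()}")
    return out

# Made-up frames; Province/State may stay NaN because it is not a feature here.
toy_train = pd.DataFrame({
    "Province/State": [None, None],
    "Lat": [33.0, 33.0],
    "Long": [65.0, 65.0],
    "Date": ["2020-01-22", "2020-01-23"],
    "ConfirmedCases": [0.0, 0.0],
})
toy_test = pd.DataFrame({
    "Province/State": [None, None],
    "Lat": [33.0, 33.0],
    "Long": [65.0, 65.0],
    "Date": ["2020-03-12", "2020-03-13"],
})

train_ready = prepare(toy_train)
test_ready = prepare(toy_test)
print(test_ready["Date"].tolist())
```

Raising on NaNs in the feature columns, rather than silently dropping rows, keeps the test set at the row count the submission needs.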

### Prepare Training

In [12]:
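# Features: location and integer-encoded date
x = train[['Lat', 'Long', 'Date']]
# Targets as 1-D Series, which avoids scikit-learn's column-vector warning on fit
y1 = train['ConfirmedCases']
y2 = train['Fatalities']
x_test = test[['Lat', 'Long', 'Date']]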

In [13]:
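from sklearn.ensemble import RandomForestClassifier
# Note: a classifier treats each distinct count as a separate class label;
# RandomForestRegressor is the conventional choice for numeric count targets.
Tree_model = RandomForestClassifier(max_depth=200, random_state=0)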
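
The targets here are counts, so a regressor is the more usual fit than a classifier. The sketch below contrasts the two on a tiny made-up series; it is not the model used for the outputs in this kernel, and the `n_estimators` value is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up one-feature problem: day index -> cumulative count.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 1, 3, 7, 15, 31, 63, 127], dtype=float)

# Classifier: every distinct count becomes its own class label.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y.astype(int))
# Regressor: each tree predicts a mean of training targets; the forest averages the trees.
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

X_new = np.array([[3.5], [20.0]])  # one point inside the training range, one far beyond it
clf_pred = clf.predict(X_new)
reg_pred = reg.predict(X_new)

print(clf_pred)  # always values that occur in the training targets
print(reg_pred)  # can fall between training values, but never above their maximum
```

Either way, a random forest cannot extrapolate beyond the largest target it saw during training, which is a real limitation when forecasting a growing cumulative count.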

### Train Confirmed Cases Tree

In [14]:
# Fit on confirmed cases, then predict for every test row
Tree_model.fit(x, y1)
pred1 = Tree_model.predict(x_test)
pred1 = pd.DataFrame(pred1)
pred1.columns = ["ConfirmedCases_prediction"]

  


In [15]:
pred1.head()

|   | ConfirmedCases_prediction |
|---|---|
| 0 | 7.0 |
| 1 | 7.0 |
| 2 | 11.0 |
| 3 | 21.0 |
| 4 | 21.0 |


### Train Deaths Tree

In [16]:


# Refit the same estimator on fatalities (fit() discards the previous fit)
Tree_model.fit(x, y2)
pred2 = Tree_model.predict(x_test)
pred2 = pd.DataFrame(pred2)
pred2.columns = ["Death_prediction"]
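
Nothing in this kernel measures how good the predictions are before submitting. One way to get a rough number is to hold out the most recent dates and score them. The sketch below assumes the competition scores with RMSLE (check the evaluation page) and uses a regressor on made-up data; `rmsle` and `cutoff` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumed metric, see note above)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up cumulative series for two invented locations, 20 days each.
days = np.arange(20)
toy = pd.DataFrame({
    "Lat": np.repeat([10.0, 40.0], 20),
    "Long": np.repeat([20.0, -70.0], 20),
    "Date": np.tile(20200301 + days, 2),
    "ConfirmedCases": np.concatenate([days ** 2, 3 * days]).astype(float),
})

# Time-based split: fit on earlier dates, score on the later ones.
cutoff = 20200315
fit_part = toy[toy["Date"] < cutoff]
holdout = toy[toy["Date"] >= cutoff]

features = ["Lat", "Long", "Date"]
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(fit_part[features], fit_part["ConfirmedCases"])

score = rmsle(holdout["ConfirmedCases"], model.predict(holdout[features]))
print(f"hold-out RMSLE: {score:.3f}")
```

Splitting by date rather than at random matters here: a random split would let the model see days on both sides of every held-out day, which flatters the score.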



  


### Prepare for Submission

In [17]:

# Take the ForecastId column from the sample submission file
Sub = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-1/submission.csv")
sub_new = Sub[["ForecastId"]]
sub_new

|   | ForecastId |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| ... | ... |
| 12207 | 12208 |
| 12208 | 12209 |
| 12209 | 12210 |
| 12210 | 12211 |


In [18]:
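# Combine both predictions with ForecastId column-wise.
# All three frames share the default 0..n-1 index, so the rows line up.
submit = pd.concat([pred1, pred2, sub_new], axis=1)
submit.head()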


|   | ConfirmedCases_prediction | Death_prediction | ForecastId |
|---|---|---|---|
| 0 | 7.0 | 0.0 | 1 |
| 1 | 7.0 | 0.0 | 2 |
| 2 | 11.0 | 0.0 | 3 |
| 3 | 21.0 | 0.0 | 4 |
| 4 | 21.0 | 0.0 | 5 |


In [19]:
# Rename the columns to the submission format, reorder them, and cast the counts to integers
submit.columns = ['ConfirmedCases', 'Fatalities', 'ForecastId']
submit = submit[['ForecastId','ConfirmedCases', 'Fatalities']]

submit["ConfirmedCases"] = submit["ConfirmedCases"].astype(int)
submit["Fatalities"] = submit["Fatalities"].astype(int)

In [20]:

submit.describe()


|   | ForecastId | ConfirmedCases | Fatalities |
|---|---|---|---|
| count | 12212.0 | 12212.0 | 12212.0 |
| mean | 6106.5 | 1208.889125 | 53.222486 |
| std | 3525.445078 | 6234.287452 | 417.608734 |
| min | 1.0 | 0.0 | 0.0 |
| 25% | 3053.75 | 6.0 | 0.0 |
| 50% | 6106.5 | 81.0 | 0.0 |
| 75% | 9159.25 | 367.0 | 3.0 |
| max | 12212.0 | 67800.0 | 6077.0 |
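
Before writing the file it can be worth running a few cheap sanity checks on the frame. The helper below is a hypothetical sketch (`check_submission` is not part of any library), and its rules are my own assumptions about what a plausible submission looks like, not official competition requirements.

```python
import pandas as pd

EXPECTED_COLUMNS = ["ForecastId", "ConfirmedCases", "Fatalities"]

def check_submission(df, expected_rows):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return ["unexpected columns: " + ", ".join(map(str, df.columns))]
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df.isnull().any().any():
        problems.append("missing values present")
    if not df["ForecastId"].is_unique:
        problems.append("duplicate ForecastId values")
    if (df[["ConfirmedCases", "Fatalities"]] < 0).any().any():
        problems.append("negative counts")
    if (df["Fatalities"] > df["ConfirmedCases"]).any():
        problems.append("fatalities exceed confirmed cases")
    return problems

# Made-up frames: one clean, one with a duplicated id and an implausible fatality count.
good = pd.DataFrame({"ForecastId": [1, 2, 3], "ConfirmedCases": [7, 7, 11], "Fatalities": [0, 0, 0]})
bad = good.assign(ForecastId=[1, 1, 3], Fatalities=[0, 9, 0])

good_problems = check_submission(good, expected_rows=3)
bad_problems = check_submission(bad, expected_rows=3)
print(good_problems)
print(bad_problems)
```

In this kernel the call would look like `check_submission(submit, expected_rows=len(test))`.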


In [21]:
# Write the submission file (without the index column)

submit.to_csv('submission.csv', index=False)