# COVID-19 - analysis and prediction

## Checking the data
First, we need to inspect the dataset to make sure it is free of obvious problems. To do so, we print its first few entries:

In [5]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
data = pd.read_csv('data/covid19.csv', parse_dates=['Date'])
data.head()

| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | | Afghanistan | 33.0 | 65.0 | 2020-01-22 | 0 | 0 | 0 |
| 1 | | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 |
| 2 | | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 |
| 3 | | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 |
| 4 | | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 |


The first problem we can identify is that some entries don't have a value in the *Province/State* column.

We can fix this by assigning an empty string to those entries:

In [8]:
# Fill missing Province/State values with an empty string
data[['Province/State']] = data[['Province/State']].fillna('')
data.head()

| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | | Afghanistan | 33.0 | 65.0 | 2020-01-22 | 0 | 0 | 0 |
| 1 | | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 |
| 2 | | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 |
| 3 | | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 |
| 4 | | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 |
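To see which columns actually contain missing values before and after filling them, one option is to count the missing entries per column. The following is a minimal sketch on a small made-up frame; `sample` and its values are hypothetical and only mimic the layout of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the dataset's layout (not real data)
sample = pd.DataFrame({
    'Province/State': [np.nan, 'Hubei', np.nan],
    'Country/Region': ['Afghanistan', 'China', 'Albania'],
    'Confirmed': [0, 10, 0],
})

# Count missing values per column before filling
missing_before = sample.isna().sum()

# Fill the missing Province/State values with an empty string
sample['Province/State'] = sample['Province/State'].fillna('')

# Count again to confirm that nothing is missing any more
missing_after = sample.isna().sum()

print(missing_before)
print(missing_after)
```

Running the same `isna().sum()` check on the full dataset would show directly which column needs the fill.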


We can now take a look at some summary statistics about the data set:

In [22]:
data.describe()

| | Lat | Long | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|
| count | 24890.0 | 24890.0 | 24890.0 | 24890.0 | 24890.0 |
| mean | 21.433571 | 22.597991 | 2336.763198 | 141.207915 | 580.246605 |
| std | 24.740917 | 70.570914 | 22688.167259 | 1499.521732 | 5021.307422 |
| min | -51.7963 | -135.0 | -1.0 | -1.0 | 0.0 |
| 25% | 7.0 | -19.0208 | 0.0 | 0.0 | 0.0 |
| 50% | 23.65975 | 20.921188 | 5.0 | 0.0 | 0.0 |
| 75% | 41.2044 | 81.0 | 176.0 | 2.0 | 16.0 |
| max | 71.7069 | 178.065 | 938154.0 | 53755.0 | 109800.0 |


We see that the count is the same for values in the columns *Lat*, *Long*, *Confirmed*, *Deaths* and *Recovered*.

There is, however, something strange with this dataset: the minimum value for *Confirmed* and *Deaths* is -1. Because negative counts don't make sense for these two columns, we will use **mean imputation** to replace them with the average of each column.

In [23]:
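# Mean imputation: replace negative counts with the column mean
# (note: each mean is computed with the -1 entries still included)
average_confirmed_count = data['Confirmed'].mean()
data.loc[(data['Confirmed'] < 0), 'Confirmed'] = average_confirmed_count

average_death_count = data['Deaths'].mean()
data.loc[(data['Deaths'] < 0), 'Deaths'] = average_death_count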

data.describe()

| | Lat | Long | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|
| count | 24890.0 | 24890.0 | 24890.0 | 24890.0 | 24890.0 |
| mean | 21.433571 | 22.597991 | 2338.359903 | 141.29933 | 580.246605 |
| std | 24.740917 | 70.570914 | 22688.084938 | 1499.517394 | 5021.307422 |
| min | -51.7963 | -135.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | -19.0208 | 0.0 | 0.0 | 0.0 |
| 50% | 23.65975 | 20.921188 | 6.0 | 0.0 | 0.0 |
| 75% | 41.2044 | 81.0 | 178.0 | 2.0 | 16.0 |
| max | 71.7069 | 178.065 | 938154.0 | 53755.0 | 109800.0 |
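The imputation cell above computes each mean while the -1 entries are still in the column, so the replacement value is pulled slightly downward. A possible variant, sketched here on made-up numbers, first counts the negative entries and then imputes with the mean of the valid entries only. The helper `impute_negative_with_mean` and the frame `toy` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def impute_negative_with_mean(frame, column):
    """Return a copy where negative entries of `column` are replaced
    with the mean of the column's non-negative entries."""
    result = frame.copy()
    # Cast to float so a fractional mean can be stored without dtype issues
    result[column] = result[column].astype(float)
    invalid = result[column] < 0
    valid_mean = result.loc[~invalid, column].mean()
    result.loc[invalid, column] = valid_mean
    return result

# Hypothetical toy data containing one negative entry per column
toy = pd.DataFrame({'Confirmed': [10, -1, 20, 30], 'Deaths': [0, 2, -1, 4]})

# How many negative entries does each column contain?
negative_counts = (toy < 0).sum()
print(negative_counts)

cleaned = impute_negative_with_mean(toy, 'Confirmed')
cleaned = impute_negative_with_mean(cleaned, 'Deaths')
print(cleaned)
```

Whether excluding the invalid entries matters in practice depends on how many there are; with only a handful of -1 values in a large column, the two approaches give nearly the same result.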


Now those columns make more sense.

Also, to ensure data integrity, we will replace any remaining missing values in the count columns with 0s.

In [24]:
data[['Confirmed', 'Deaths', 'Recovered']] = data[['Confirmed', 'Deaths', 'Recovered']].fillna(0)

It would be a shame to lose all this tidied-up data. Let's save it.

In [25]:
data.to_csv('data/covid19_clean.csv', index=False)
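One detail worth keeping in mind when saving: by default, `pd.read_csv` interprets empty fields as missing values, so empty strings written by `to_csv` come back as `NaN` when the file is reloaded. The sketch below illustrates this with an in-memory buffer instead of a real file; the frame `toy` is made up for the illustration.

```python
import io
import pandas as pd

# Hypothetical toy frame with an empty string, as produced by fillna('')
toy = pd.DataFrame({'Province/State': ['', 'Hubei'], 'Confirmed': [0, 5]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field is read back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
preserved_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['Province/State'].isna().tolist())
print(preserved_read['Province/State'].tolist())
```

If the empty strings need to survive the round trip, passing `keep_default_na=False` when reloading is one way to do it.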

Now that we've tidied the data, we'll reload it so we can take a look at the scatterplot matrix:

In [35]:
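data_clean = pd.read_csv('data/covid19_clean.csv')
# Fill gaps in the numeric columns only; recent pandas versions raise an
# error when mean() is applied to text columns without numeric_only=True
data_clean.fillna(data_clean.mean(numeric_only=True), inplace=True)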
# sb.pairplot(data.dropna())

## Classification

In [36]:
inputs = data_clean[['Lat', 'Long', 'Confirmed', 'Deaths', 'Recovered']].values

In [37]:
labels = data_clean[['Province/State', 'Country/Region', 'Date']].values
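Before passing these arrays on, it can help to check their shapes and dtypes. The following sketch uses a small made-up frame with fewer columns than the real one (`toy`, `feature_columns` and `label_columns` are hypothetical names). It uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical toy frame with a subset of the dataset's columns
toy = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    'Date': pd.to_datetime(['2020-01-22', '2020-01-23']),
    'Lat': [1.0, 2.0],
    'Long': [3.0, 4.0],
    'Confirmed': [0.0, 5.0],
})

feature_columns = ['Lat', 'Long', 'Confirmed']
label_columns = ['Country/Region', 'Date']

# Numeric features become a 2-D float array
inputs = toy[feature_columns].to_numpy()
# Mixed text and date columns become a 2-D array of Python objects
labels = toy[label_columns].to_numpy()

print(inputs.shape, inputs.dtype)
print(labels.shape, labels.dtype)
```

Because the label columns mix text and dates, the resulting array has dtype `object`, so it would likely need some encoding before most classifiers could use it.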