### Exploratory Data Analysis for the Coronavirus

The data used for this project is `covid_19_clean_complete.csv`.

Special thanks to [Devakumar kp](https://www.kaggle.com/imdevskp/corona-virus-report) for his awesome work providing this dataset on Kaggle.

In [1]:
import numpy as np
import pandas as pd

import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_dark"
from plotly.subplots import make_subplots

from pathlib import Path
data_dir = Path('../data/raw')
import os
os.listdir(data_dir)

['covid_19_clean_complete.csv', 'submission.csv', 'test.csv', 'train.csv']

In [4]:
DIR = '../data/raw/'
data = pd.read_csv(DIR + 'covid_19_clean_complete.csv', parse_dates=['Date'])
data.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Thailand,15.0,101.0,2020-01-22,2,0,0
1,,Japan,36.0,138.0,2020-01-22,2,0,0
2,,Singapore,1.2833,103.8333,2020-01-22,0,0,0
3,,Nepal,28.1667,84.25,2020-01-22,0,0,0
4,,Malaysia,2.5,112.5,2020-01-22,0,0,0


#### First, some basic preprocessing of the data

We start by renaming the columns to more convenient names, then compute the active cases as confirmed cases minus deaths and recoveries.
Next, we replace "Mainland China" with "China" so that the two are not studied as different entities. Finally, we fill in missing values: an empty string for `state` and 0 for the case counts.

In [5]:
# Rename the columns present in this file to short, lowercase names
data.rename(columns={'Date': 'date',
                     'Province/State': 'state',
                     'Country/Region': 'country',
                     'Confirmed': 'confirmed',
                     'Deaths': 'deaths',
                     'Recovered': 'recovered'}, inplace=True)

# Active cases = confirmed - (deaths + recovered)
cases = ['confirmed', 'deaths', 'recovered', 'active']
data['active'] = data['confirmed'] - (data['deaths'] + data['recovered'])

# Treat "Mainland China" and "China" as the same country
data['country'] = data['country'].replace('Mainland China', 'China')

# Fill missing values: empty string for state, 0 for case counts
data['state'] = data['state'].fillna('')
data[cases] = data[cases].fillna(0)

In [6]:
print("External Data")
print(f"Earliest Entry: { data['date'].min() }")
print(f"Last Entry: { data['date'].max() }")
print(f"Total Days: { data['date'].max() - data['date'].min() }")

External Data
Earliest Entry: 2020-01-22 00:00:00
Last Entry: 2020-03-20 00:00:00
Total Days: 58 days 00:00:00


In [7]:
figures_dir = '../reports/figures/'

#### First Analysis: Confirmed cases across the globe

In [11]:
# Select columns with a list; 'date' is the group key, so it is not selected again
grouped = data.groupby('date')[['confirmed', 'deaths']].sum().reset_index()
fig = px.line(grouped, x="date", y="confirmed", title="Worldwide Confirmed Cases Over Time")
fig.show()
# pio.write_image(fig, figures_dir + 'Worldwide Confirmed Cases Over Time.png')

fig = px.line(grouped, x="date", y="confirmed", title="Worldwide confirmed cases (logarithmic scale) over time", log_y=True)
fig.show()
# pio.write_image(fig, figures_dir + 'Worldwide Confirmed Cases (logarithmic scale) Over Time.png')

ValueError: 
The orca executable is required to export figures as static images,
but it could not be found on the system path.

If you haven't installed orca yet, you can do so using conda as follows:

    $ conda install -c plotly plotly-orca

Alternatively, see other installation methods in the orca project README at
https://github.com/plotly/orca

After installation is complete, no further configuration should be needed.

If you have installed orca, then for some reason plotly.py was unable to
locate it. In this case, set the `plotly.io.orca.config.executable`
property to the full path of your orca executable. For example:

    >>> plotly.io.orca.config.executable = '/path/to/orca'

After updating this executable property, try the export operation again.
If it is successful then you may want to save this configuration so that it
will be applied automatically in future sessions. You can do this as follows:

    >>> plotly.io.orca.config.save()

If you're still having trouble, feel free to ask for help on the forums at
https://community.plot.ly/c/api/python


The plots above show that the virus is spreading at an alarming rate. If the current trend continues, the number of confirmed cases could reach 500k within the next week, which is very serious.
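
A projection like this can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes growth stays at a constant daily rate, the function names `average_daily_growth` and `days_to_reach` are hypothetical helpers, and the sample counts are made up rather than taken from the dataset.

```python
import math

def average_daily_growth(series):
    """Geometric-mean daily growth rate across a list of cumulative counts."""
    first, last = series[0], series[-1]
    steps = len(series) - 1
    return (last / first) ** (1 / steps) - 1

def days_to_reach(current, target, daily_growth):
    """Days until `current` reaches `target` at a constant daily growth rate.

    daily_growth is a fraction, e.g. 0.10 for 10% per day.
    """
    if current <= 0 or target <= current:
        return 0.0
    if daily_growth <= 0:
        return math.inf
    return math.log(target / current) / math.log(1 + daily_growth)

# Hypothetical cumulative counts for illustration only (not from the dataset).
recent = [100_000, 110_000, 121_000, 133_100]
rate = average_daily_growth(recent)
print(f"average daily growth: {rate:.1%}")
print(f"days to 500k: {days_to_reach(recent[-1], 500_000, rate):.1f}")
```

To apply this to the notebook's data, one could pass something like `grouped['confirmed'].tail(7).tolist()` in place of `recent`. Real growth rates change from day to day, so the result is a rough guide at best.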

Now let us look at the trends for specific countries.

In [10]:
# Daily totals per country; columns are selected with a list, and 'date'
# (the group key) is restored as a column by reset_index()
grouped_nigeria = data[data['country'] == "Nigeria"].reset_index()
grouped_nigeria_date = grouped_nigeria.groupby('date')[['confirmed', 'deaths']].sum().reset_index()

grouped_china = data[data['country'] == "China"].reset_index()
grouped_china_date = grouped_china.groupby('date')[['confirmed', 'deaths']].sum().reset_index()

grouped_italy = data[data['country'] == "Italy"].reset_index()
grouped_italy_date = grouped_italy.groupby('date')[['confirmed', 'deaths']].sum().reset_index()

grouped_us = data[data['country'] == "US"].reset_index()
grouped_us_date = grouped_us.groupby('date')[['confirmed', 'deaths']].sum().reset_index()

# group the remaining countries together
grouped_rest = data[~data['country'].isin(['China', 'Italy', 'US', 'Nigeria'])].reset_index()
grouped_rest_date = grouped_rest.groupby('date')[['confirmed', 'deaths']].sum().reset_index()
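
The filter-then-group pattern in the cell above repeats for each country, so it could be folded into a single helper. The following is a minimal sketch, not part of the original notebook: `daily_totals` is a hypothetical name, and the small `demo` frame holds made-up values purely for illustration.

```python
import pandas as pd

def daily_totals(df, countries=None, exclude=False):
    """Sum confirmed and deaths per date for the selected countries.

    countries: list of country names, or None to use all rows.
    exclude:   if True, select every country except those listed.
    """
    if countries is not None:
        mask = df['country'].isin(countries)
        df = df[~mask] if exclude else df[mask]
    return df.groupby('date')[['confirmed', 'deaths']].sum().reset_index()

# Tiny made-up frame for illustration; the values are not from the dataset.
demo = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'country': ['Italy', 'US', 'Italy', 'US'],
    'confirmed': [10, 5, 20, 8],
    'deaths': [1, 0, 2, 1],
})
italy = daily_totals(demo, ['Italy'])
rest = daily_totals(demo, ['Italy'], exclude=True)
print(italy)
print(rest)
```

With the real data, `daily_totals(data, ['Nigeria'])` would play the role of `grouped_nigeria_date`, and `daily_totals(data, ['China', 'Italy', 'US', 'Nigeria'], exclude=True)` the role of `grouped_rest_date`.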

In [13]:
plot_titles = ['Nigeria', 'China', 'Italy', 'USA', 'Rest of the World']

fig = px.line(grouped_nigeria_date, x="date", y="confirmed",
              title=f"Confirmed Cases in { plot_titles[0].upper() } Over Time",
              color_discrete_sequence=['#99FF66'],
              height=500)
fig.show()

fig = px.line(grouped_china_date, x="date", y="confirmed", 
              title=f"Confirmed Cases in {plot_titles[1].upper()} Over Time", 
              color_discrete_sequence=['#F61067'],
              height=500)
fig.show()

fig = px.line(grouped_italy_date, x="date", y="confirmed", 
              title=f"Confirmed Cases in {plot_titles[2].upper()} Over Time", 
              color_discrete_sequence=['#91C4F2'],
              height=500)
fig.show()

fig = px.line(grouped_us_date, x="date", y="confirmed", 
              title=f"Confirmed Cases in {plot_titles[3].upper()} Over Time", 
              color_discrete_sequence=['#6F2DBD'],
              height=500)
fig.show()

fig = px.line(grouped_rest_date, x="date", y="confirmed", 
              title=f"Confirmed Cases in {plot_titles[4].upper()} Over Time", 
              color_discrete_sequence=['#FFDF64'],
              height=500)
fig.show()

1. In Nigeria, the count (12) still seems comforting relative to the rest of the world. However, the trend from Mar 8 onwards suggests that the numbers are growing exponentially, which does not look good at all.

2. China showed exponential growth in confirmed cases in the initial stages, but the trend seems to have been tamed since. This looks good.

3. Unlike China, the trends for Italy, the U.S. and the rest of the world do not seem to have been tamed. This does not look good at all.
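
One rough way to test the "growing exponentially" reading is to fit a straight line to the logarithm of the counts: a near-linear fit with a positive slope is consistent with exponential growth, and the slope gives a doubling time. The sketch below assumes strictly positive counts, uses a hypothetical helper named `log_linear_fit`, and runs on a made-up series rather than the dataset.

```python
import math

def log_linear_fit(counts):
    """Least-squares fit of ln(count) = intercept + slope * day.

    Returns (slope, intercept, r_squared). Counts must be positive.
    """
    days = list(range(len(counts)))
    logs = [math.log(c) for c in counts]
    n = len(counts)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(days, logs))
    r_squared = 1 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared

# Hypothetical series that doubles every day, for illustration only.
counts = [1, 2, 4, 8, 16, 32]
slope, intercept, r2 = log_linear_fit(counts)
doubling_time = math.log(2) / slope
print(f"slope={slope:.3f}, R^2={r2:.3f}, doubling time={doubling_time:.2f} days")
```

For a country series such as `grouped_nigeria_date['confirmed']`, the zero-count days would need to be dropped first, and with only a handful of cases the fit should be read with caution.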

Now let us visualize the countries with confirmed cases

In [None]:
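# Missing states become empty strings (a no-op if already filled earlier)
data['state'] = data['state'].fillna('')
# Drop the state column; only country-level totals are needed here
temp = data[[col for col in data.columns if col != 'state']]

# Keep only the most recent date and total the counts per country
latest = temp[temp['date'] == temp['date'].max()].reset_index()
latest_grouped = latest.groupby('country')[['confirmed', 'deaths']].sum().reset_index()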