<h1 align=center><font size = 6>INDIA Vs COVID-19</font></h1>
<h2 align=center><font size = 6>Exploratory Data Analysis & Visualization with Python</font></h2>
By : <a href="https://www.blogger.com/profile/01288628031125822619" target="_blank">Neeraj Singh Rawat</a>

Welcome to **2020** … it’s crazy.

Assuming that you’re quarantined right now, you probably have some extra time on your hands.

This is a perfect time to learn a new skill or do something productive.

In fact, it’s a great time to <a href="https://www.dexlabanalytics.com/" target="_blank">Master Data Science in Python</a>.

So, let's start!

# Where does India stand currently?
__The Story of COVID-19 in India__

* The COVID-19 pandemic is the defining global health crisis of our time and the greatest global humanitarian challenge the world has faced since World War II. The virus has spread widely, and the number of cases is rising daily as governments work to slow its spread. India has moved quickly, implementing a proactive, nationwide lockdown, with the goal of flattening the curve and using the time to plan and resource responses adequately.
* The COVID-19 outbreak started a bit later in India than in other countries, but it has started to pick up pace. With limited testing and a healthcare system that is not well funded, India is surely up for a challenge. The fight is still on after 3 lockdowns (the 4th is in progress, until May 31, 2020), and the virus shows no signs of slowing down.

__A QUICK REVIEW OF THE PROCESS__

Here we’ll analyze some COVID-19 data with Python. Specifically, we’re going to use Pandas, matplotlib, seaborn, and possibly a little NumPy.

You’ll probably want to be familiar with all of them, but if not, that’s okay. You’ll still be able to run the code, and I’ll link to some other tutorials that explain individual techniques in more depth.

But if you’re really serious about this, you’ll eventually want to <a href="https://www.dexlabanalytics.com/" target="_blank">Master Data Science in Python</a>.

## PART 1:  WRANGLING

We’re going to get our data and “wrangle” it into shape. Here’s a quick table of contents of the steps I will demonstrate:
* Import packages
* Get Raw data
* Rename columns
* Reshape data
* Convert dates
* Rearrange columns and sort data
* Set index
* Check data

__IMPORT PACKAGES__

In [2]:
import pandas as pd
import datetime

__GET RAW DATA__

We’re going to get the raw CSV data.

This data comes from a GitHub <a href="https://dsreka.blogspot.com" target="_blank">repository for Covid-19 data</a>, originally created by <a href="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv" target="_blank">Johns Hopkins</a>.

We’re going to use the “raw” data, which is easier to retrieve.

To get the raw data, we’re going to use the Pandas read_csv function. The read_csv function does exactly what it sounds like: it reads CSV data into a DataFrame.

In [3]:
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
covid_data= pd.read_csv(url)
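
The call above needs network access. As a minimal offline sketch of what read_csv does with this kind of file, the snippet below reads a tiny, made-up sample in the same wide layout. The country names and numbers are invented for illustration and are not real case counts.

```python
import io
import pandas as pd

# A tiny, made-up sample in the same wide layout as the Johns Hopkins file.
# The rows are illustrative only, not real data.
sample_csv = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Exampleland,10.0,20.0,0,1
North,Samplestan,30.0,40.0,2,3
"""

# read_csv accepts a URL, a file path, or any file-like object.
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names come from the header row
print(sample)
```

Note that the empty Province/State field in the first row is read as NaN, which is what you also see in the real data for countries without subregions.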

__INSPECT__


Now, let’s just inspect a few rows.

First, we’ll take a look at the first few rows (the commented-out line shows just the column names instead):

In [5]:
#covid_data.columns
#OR
covid_data.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,4/23/20,4/24/20,4/25/20,4/26/20,4/27/20,4/28/20,4/29/20,4/30/20,5/1/20,5/2/20
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,1279,1351,1463,1531,1703,1828,1939,2171,2335,2469
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,663,678,712,726,736,750,766,773,782,789
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,3007,3127,3256,3382,3517,3649,3848,4006,4154,4295
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,723,731,738,738,743,743,743,745,745,747
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,25,25,25,26,27,27,27,27,30,35


Keep in mind that your data might look a little different. The output shown here has records up to 5/2/20. If you run this code on a later date, your data will be more up-to-date.

__RENAME COLUMNS__


Now, we’re going to rename the columns.

To do this, we’re going to use the Pandas rename method. Notice that in order to do this, we need to specify new “key/value pairs” with a Python dictionary. Each “key” in the dictionary is the old column name, and the “value” in the dictionary is the new name.

So, you can read the items in the dictionary as {"old_name":"new_name"}.

In [4]:
covid19_data = covid_data.rename(columns = {'Province/State':'subregion', # subregion = state or province
                                            'Country/Region':'country',
                                            'Lat':'lat',
                                            'Long':'long'
                                           }
                                )
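
To make the {"old_name":"new_name"} pattern concrete, here is a small sketch on a made-up one-row DataFrame (the `toy` frame and its values are hypothetical). It also shows that rename returns a new DataFrame and leaves the original untouched unless you assign the result.

```python
import pandas as pd

# Hypothetical one-row frame with the original column names.
toy = pd.DataFrame({'Country/Region': ['Exampleland'],
                    'Lat': [10.0],
                    'Long': [20.0]})

# Keys are the old names, values are the new names.
renamed = toy.rename(columns={'Country/Region': 'country',
                              'Lat': 'lat',
                              'Long': 'long'})

print(list(toy.columns))      # the original frame keeps its old names
print(list(renamed.columns))  # the returned frame has the new names
```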

__RESHAPE DATA__

Why?

One thing that you’ll notice about the data is that the dates all exist as separate columns. This is not what we need.

When we use many data analysis or visualization techniques, like groupby, agg, and Seaborn functions, we need the dates to exist as values underneath a single column.

So right now we need to reshape the data so that those dates exist under one single “date” column.

To do this, we need to use the Pandas melt function. Pandas has two main functions for reshaping data, melt and pivot. melt is the one to use when we need to transform data from wide form to long form.

So here, we’re going to use the melt function to reshape our data:

In [5]:
covid19_data = (covid19_data.melt(id_vars = ['country','subregion','lat','long'],
                                 var_name = 'date_RAW',
                               value_name = 'confirmed'
                                 )
               )

Notice that when we did this, we created two new variables: date_RAW and confirmed.

The first is a date field (which we’ll need to wrangle further), and the second is the number of confirmed cases.
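
To see the wide-to-long reshape on a smaller scale, here is a minimal sketch with a made-up two-country frame (the names and counts are invented). Each date column becomes a set of rows, with the old column name stored under date_RAW and the cell value under confirmed.

```python
import pandas as pd

# Hypothetical wide data: one row per country, one column per date.
wide = pd.DataFrame({
    'country': ['Exampleland', 'Samplestan'],
    '1/22/20': [0, 2],
    '1/23/20': [1, 3],
})

# id_vars stay as they are; every other column is "melted" into rows.
long_form = wide.melt(id_vars=['country'],
                      var_name='date_RAW',
                      value_name='confirmed')

print(long_form)
```

Two countries times two date columns gives four rows in the long form, which is why the real dataset grows from a few hundred rows to tens of thousands.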

Now that we’re finished with that step, let’s print some data to take a look:

In [6]:
print(covid19_data)

                     country subregion        lat       long date_RAW  \
0                Afghanistan       NaN  33.000000  65.000000  1/22/20   
1                    Albania       NaN  41.153300  20.168300  1/22/20   
2                    Algeria       NaN  28.033900   1.659600  1/22/20   
3                    Andorra       NaN  42.506300   1.521800  1/22/20   
4                     Angola       NaN -11.202700  17.873900  1/22/20   
...                      ...       ...        ...        ...      ...   
27127         Western Sahara       NaN  24.215500 -12.885800   5/2/20   
27128  Sao Tome and Principe       NaN   0.186360   6.613081   5/2/20   
27129                  Yemen       NaN  15.552727  48.516388   5/2/20   
27130                Comoros       NaN -11.645500  43.333300   5/2/20   
27131             Tajikistan       NaN  38.861034  71.276093   5/2/20   

       confirmed  
0              0  
1              0  
2              0  
3              0  
4              0  
...      

Take a look at the output. We now have two new variables named date_RAW and confirmed.

The confirmed variable is the number of confirmed covid-19 cases, for a particular place, on a particular date.

The date_RAW variable is a string-based date variable, which means that it’s not in the form that we need it to be in.

So next, we’ll convert the dates into proper date/time data.

__CONVERT DATES__

First, let’s just print out some dates.

Here, we’re going to do this with the Pandas filter method to select the date_RAW column.

In [7]:
#Inspect Date
(covid19_data.filter(['date_RAW']))

Unnamed: 0,date_RAW
0,1/22/20
1,1/22/20
2,1/22/20
3,1/22/20
4,1/22/20
...,...
27127,5/2/20
27128,5/2/20
27129,5/2/20
27130,5/2/20


The dates are in month/day/year form: the month and day have no leading zeros (for example, 5/2/20), and the year has two digits.

Let’s try to convert these dates to proper datetime data.
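
As a quick sketch of how such strings parse, the snippet below runs pd.to_datetime with the format string '%m/%d/%y' on a few hand-written sample dates (the three strings are just examples in the same style as the dataset).

```python
import pandas as pd

# Sample strings in the same month/day/two-digit-year style as date_RAW.
raw_dates = pd.Series(['1/22/20', '5/2/20', '12/31/20'])

# %m = month, %d = day, %y = two-digit year.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%y')

print(parsed)
print(parsed.dtype)
```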

__TEST DATE CONVERSION__

Instead of directly converting the data, I actually want to test this out first.

Here, we’re going to use several Pandas methods in a “chain” to test our date conversion and give us a preview of the results.

Keep in mind, because we’re not storing the output with the equal sign, this will not modify the original DataFrame. That’s good … it will give us the ability to test the operation first.

In [8]:
(covid19_data.assign(date = pd.to_datetime(covid19_data.date_RAW, format='%m/%d/%y'))
                              .filter(['date','date_RAW','confirmed'])
                              .groupby(['date','date_RAW'])
                              .agg('sum')
                              .sort_values('date')
)

date,date_RAW,confirmed
2020-01-22,1/22/20,555
2020-01-23,1/23/20,654
2020-01-24,1/24/20,941
2020-01-25,1/25/20,1434
2020-01-26,1/26/20,2118
...,...,...
2020-04-28,4/28/20,3097190
2020-04-29,4/29/20,3172287
2020-04-30,4/30/20,3256853
2020-05-01,5/1/20,3343777


What happened here?

The code created a new variable in the output called date.

Then we retrieved only a few columns (date, date_RAW, confirmed) using the filter method.

Then we grouped and aggregated the data. The grouping is actually the critical step. What I want to see here is all of the unique combinations of date and date_RAW, side by side. The way I chose to do that is by using groupby, then agg.

So again, why did I do this?

I did it because I want to see the old date (date_RAW) and the new date (date) side by side. I want to compare and “spot check” to make sure that the assign operation using pd.to_datetime worked properly.

When you compare the old date and new date side by side, it looks like they match. That is, it looks like our new datetime field, date, was created properly.
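
If you only care about the date pairs and not the summed counts, a lighter variant of the same spot check is to use drop_duplicates instead of groupby and agg. This is a sketch on a made-up three-row frame, not the approach used above, and it also confirms that an unassigned chain leaves the original DataFrame alone.

```python
import pandas as pd

# Hypothetical long-form data with a repeated raw date.
toy = pd.DataFrame({
    'date_RAW': ['1/22/20', '1/22/20', '5/2/20'],
    'confirmed': [0, 5, 7],
})

# Build the new date, keep only the two date columns, and list unique pairs.
pairs = (toy.assign(date=pd.to_datetime(toy['date_RAW'], format='%m/%d/%y'))
            .filter(['date', 'date_RAW'])
            .drop_duplicates()
            .sort_values('date'))

print(pairs)

# The chain was not assigned back to toy, so toy has no 'date' column.
print('date' in toy.columns)
```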

But remember: the code we just ran did NOT change the original dataset yet. It was just a “test”.

Now that we tested this date conversion, we’ll run it properly and save the output so that we have our new date.

__CREATE DATE VARIABLE__

Now, we’ll actually convert our date.

Before we do that though, we’ll create a copy of our data, just to back it up.

This might be unnecessary and will take up extra storage. But I sometimes do this at intermediate points in my scripts, just to create a backup before I make changes.

You can choose to skip the backup if you want.

In [9]:
# BACKUP
covid19_data_backup_BEFOREDATE = covid19_data.copy()

Ok. Now that we backed up our data, we’ll modify covid19_data and create our new variable, date.

In [10]:
# CONVERT DATE

covid19_data = covid19_data.assign(date = pd.to_datetime(covid19_data.date_RAW, format='%m/%d/%y'))

At this point, covid19_data contains a date variable that’s formatted as a proper datetime.
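
One way to confirm the conversion, and to see why a proper datetime is useful, is to check the column's dtype and then filter by date. The sketch below uses a made-up two-row frame; the cutoff date of April 1, 2020 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical frame with raw string dates.
toy = pd.DataFrame({'date_RAW': ['1/22/20', '5/2/20'],
                    'confirmed': [0, 7]})
toy = toy.assign(date=pd.to_datetime(toy['date_RAW'], format='%m/%d/%y'))

# The new column is a datetime, the old one is still plain text.
print(toy.dtypes)
is_datetime = pd.api.types.is_datetime64_any_dtype(toy['date'])
print(is_datetime)

# With real datetimes, date comparisons work chronologically.
recent = toy[toy['date'] >= pd.Timestamp('2020-04-01')]
print(recent)
```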

__REARRANGE COLUMNS AND SORT__

Now, we’re just going to clean up our data a little.

I’m going to rearrange the columns with filter, and sort the data with sort_values.

In [11]:
# SORT & REARRANGE DATA

covid19_data = (covid19_data.filter(['country', 'subregion', 'date', 'lat', 'long', 'confirmed'])
               .sort_values(['country','subregion','lat','long','date'])
               )

In [12]:
print(covid_data)

    Province/State         Country/Region        Lat       Long  1/22/20  \
0              NaN            Afghanistan  33.000000  65.000000        0   
1              NaN                Albania  41.153300  20.168300        0   
2              NaN                Algeria  28.033900   1.659600        0   
3              NaN                Andorra  42.506300   1.521800        0   
4              NaN                 Angola -11.202700  17.873900        0   
..             ...                    ...        ...        ...      ...   
261            NaN         Western Sahara  24.215500 -12.885800        0   
262            NaN  Sao Tome and Principe   0.186360   6.613081        0   
263            NaN                  Yemen  15.552727  48.516388        0   
264            NaN                Comoros -11.645500  43.333300        0   
265            NaN             Tajikistan  38.861034  71.276093        0   

     1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  4/23/20  4/24/20  \
0          0

__SET INDEX__

Let’s do one last thing.

Here, we’ll set the index to ‘country‘. This will probably be temporary (we’ll probably change the index in future tutorials), but for the time being, this will give us the ability to retrieve data based on country name.

To set the index for the DataFrame, we’ll use the Pandas set_index method.

In [13]:
# SET INDEX
covid19_data.set_index('country', inplace = True)           #   Notice that when we do this, we’re setting "inplace = True".
                                                            #   This will directly modify our dataset, covid19_data.
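
As a side note, set_index can also be used without inplace by assigning the result to a name. The sketch below shows that variant on a made-up frame, along with one detail of label lookups worth knowing: .loc returns a DataFrame when a label matches several rows, but a Series when it matches exactly one.

```python
import pandas as pd

# Hypothetical long-form data: two rows for one country, one for another.
toy = pd.DataFrame({
    'country': ['Exampleland', 'Exampleland', 'Samplestan'],
    'confirmed': [0, 1, 3],
})

# Without inplace=True, set_index returns a new DataFrame.
indexed = toy.set_index('country')

many = indexed.loc['Exampleland']      # label matches 2 rows -> DataFrame
one = indexed.loc['Samplestan']        # label matches 1 row  -> Series
one_df = indexed.loc[['Samplestan']]   # list of labels -> always a DataFrame

print(type(many).__name__, type(one).__name__, type(one_df).__name__)
```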

__CHECK DATA__

Finally, I’m just going to look at a few things.

I want to get a list of the country names, because we’re currently using country as our index.

To do this, I’m going to chain together several Pandas methods, including reset_index, filter, and drop_duplicates.

Essentially, we’re creating a list of the unique values of country.

In [23]:
# GET COUNTRY NAMES

pd.set_option('display.max_rows', 155)

In [25]:
(covid19_data
    .reset_index()
    .filter(['country'])
    .drop_duplicates()
    .head(n = 200)
)

Unnamed: 0,country
0,Afghanistan
102,Albania
204,Algeria
306,Andorra
408,Angola
...,...
26622,West Bank and Gaza
26724,Western Sahara
26826,Yemen
26928,Zambia
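
Since country is the index at this point, a shorter route to the same list is the index's unique method. This is an alternative sketch on a made-up frame, not the chain used above; both give the same names.

```python
import pandas as pd

# Hypothetical data indexed by country, with repeated country labels.
toy = pd.DataFrame({
    'country': ['Exampleland', 'Exampleland', 'Samplestan'],
    'confirmed': [0, 1, 3],
}).set_index('country')

# The chained approach: reset the index, keep one column, drop repeats.
names_chain = (toy.reset_index()
                  .filter(['country'])
                  .drop_duplicates()['country']
                  .tolist())

# The shortcut: ask the index for its unique labels.
names_index = toy.index.unique().tolist()

print(names_chain)
print(names_index)
```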


In [26]:
pd.reset_option('display.max_rows')

In [41]:
# PULL DATA FOR INDIA

covid19_data.loc['India']

country,subregion,date,lat,long,confirmed
India,,2020-01-22,21.0,78.0,0
India,,2020-01-23,21.0,78.0,0
India,,2020-01-24,21.0,78.0,0
India,,2020-01-25,21.0,78.0,0
India,,2020-01-26,21.0,78.0,0
...,...,...,...,...,...
India,,2020-04-28,21.0,78.0,31324
India,,2020-04-29,21.0,78.0,33062
India,,2020-04-30,21.0,78.0,34863
India,,2020-05-01,21.0,78.0,37257


Here, you can see that we have data for __INDIA__ at the country level (the subregion column is empty for India). We have multiple dates, and the latitude/longitude for the data. And ultimately, we have the number of confirmed cases.

This gives us a lot that we can potentially do.
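
As one small example of what is possible, the confirmed column is cumulative, so day-over-day differences give the number of new cases. The sketch below uses the last four India rows from the output above; the `new_cases` column name is my own choice.

```python
import pandas as pd

# The last four India rows shown in the output above.
india_tail = pd.DataFrame({
    'date': pd.to_datetime(['2020-04-28', '2020-04-29',
                            '2020-04-30', '2020-05-01']),
    'confirmed': [31324, 33062, 34863, 37257],
})

# diff() subtracts the previous row; the first row has no previous value (NaN).
india_tail = india_tail.assign(new_cases=india_tail['confirmed'].diff())

print(india_tail)
```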

__Change to Code Cell, then run__

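India1 = covid19_data.loc['India']
INDIA = India1.tail(31)          # keep the most recent 31 days

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# Line plot of cumulative confirmed cases against the date.
# The index is the country name, so pass the date column explicitly.
INDIA.plot(x='date', y='confirmed')

# Bar chart of confirmed cases, one bar per date.
plt.figure()
sns.barplot(x='date', y='confirmed', data=INDIA, color='steelblue')
plt.xticks(rotation=90)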

## PART 2: MERGE DATASETS