### Task 1 - Data Collection
Here you will obtain the data required for the analysis. As described in the project instructions, you will scrape data from the NCDC website, import data from the Johns Hopkins repository, and import the provided external data.


### A - NCDC Website Scrape
Website - https://covid19.ncdc.gov.ng/

In [17]:
# Write Your Code Below
# Import all libraries in this cell
import os
import glob
from IPython.display import display
#End of my imports

import requests
import numpy as np
import urllib.request
import pandas as pd
import csv
from bs4 import BeautifulSoup
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')  
import warnings
warnings.filterwarnings('ignore')


In [18]:
# Save the data to a DataFrame object.
url = 'https://covid19.ncdc.gov.ng/'
tables = pd.read_html(url)
dataframes ={}

ncdc = tables[0]
dataframes['ncdc'] = ncdc


### B - Johns Hopkins Data Repository
Here you will obtain data from the Johns Hopkins repository. Your task is to load the data from the GitHub repo links below into DataFrames for further analysis.
* Global Daily Confirmed Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv)
* Global Daily Recovered Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv)
* Global Daily Death Cases - Click [Here](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv)

In [19]:
#[Write Your Code Here]
gcc =pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
dataframes['gcc']=gcc

In [20]:
grc = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
dataframes['grc']=grc

In [21]:
gdc = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
dataframes['gdc'] = gdc

### C - External Data 
* Save the external data to a DataFrame
* External data includes, but is not limited to: `covid_external.csv`, `Budget data.csv`, `RealGDP.csv`

In [22]:
#[Write Your Code Here]
externals = glob.glob('*.csv')
for file in externals:
    dataframes[file[0:-4].lower().replace(" ", "_")] = pd.read_csv(file)
    
dataframes.keys()


dict_keys(['ncdc', 'gcc', 'grc', 'gdc', 'budget_data', 'covidnig', 'covid_external', 'realgdp'])

### Task 2 - View the data
Obtain basic information about the data using the `head()` and `info()` methods.

In [23]:
#[Write Your Code Here]
for dataframe in dataframes:
    display(dataframes[dataframe].head())

Unnamed: 0,States Affected,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths
0,Lagos,73906,2643,70618,645
1,FCT,20684,491,20017,176
2,Rivers,10765,1004,9624,137
3,Kaduna,9280,64,9150,66
4,Plateau,9214,76,9077,61


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,152960,152960,152960,153148,153220,153260,153306,153375,153375,153375
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,143174,144079,144847,145333,146387,147369,148222,149117,150101,150997
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,194186,194671,195162,195574,196080,196527,196915,197308,197659,198004
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,15025,15025,15025,15032,15033,15046,15052,15055,15055,15055
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,46929,47079,47168,47331,47544,47781,48004,48261,48475,48656


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,7101,7101,7101,7116,7118,7123,7127,7127,7127,7127
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,2487,2490,2492,2495,2498,2501,2505,2508,2512,2515
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,5148,5179,5209,5240,5269,5302,5339,5373,5399,5420
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,130,130,130,130,130,130,130,130,130,130
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1186,1192,1201,1210,1217,1227,1235,1248,1258,1270


Unnamed: 0,states,Initial_budget (Bn),Revised_budget (Bn)
0,Abia,136.6,102.7
1,Adamawa,183.3,139.31
2,Akwa-Ibom,597.73,366.0
3,Anambra,137.1,112.8
4,Bauchi,167.2,128.0


Unnamed: 0,States Affected,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths
0,Lagos,26708,2435,24037,236
1,FCT,9627,2840,6694,93
2,Kaduna,4504,579,3877,48
3,Plateau,4262,280,3948,34
4,Oyo,3788,368,3374,46


Unnamed: 0,states,region,Population,Overall CCVI Index,Age,Epidemiological,Fragility,Health System,Population Density,Socio-Economic,Transport Availability,Acute IHR
0,FCT,North Central,4865000,0.3,0.0,0.9,0.4,0.6,0.9,0.6,0.2,0.79
1,Plateau,North Central,4766000,0.4,0.5,0.4,0.8,0.3,0.3,0.5,0.3,0.93
2,Kwara,North Central,3524000,0.3,0.4,0.3,0.2,0.4,0.2,0.6,0.7,0.93
3,Nassarawa,North Central,2783000,0.1,0.3,0.5,0.9,0.0,0.1,0.6,0.5,0.85
4,Niger,North Central,6260000,0.6,0.0,0.6,0.3,0.7,0.1,0.8,0.8,0.84


Unnamed: 0,Year,Q1,Q2,Q3,Q4
0,2014,15438679.5,16084622.31,17479127.58,18150356.45
1,2015,16050601.38,16463341.91,17976234.59,18533752.07
2,2016,15943714.54,16218542.41,17555441.69,18213537.29
3,2017,15797965.83,16334719.27,17760228.17,18598067.07
4,2018,16096654.19,16580508.07,18081342.1,19041437.59


In [24]:
for dataframe in dataframes:
    print(dataframe.upper())
    # info() prints its summary itself and returns None, so display() also echoes "None"
    display(dataframes[dataframe].info())

NCDC
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   States Affected               37 non-null     object
 1   No. of Cases (Lab Confirmed)  37 non-null     int64 
 2   No. of Cases (on admission)   37 non-null     int64 
 3   No. Discharged                37 non-null     int64 
 4   No. of Deaths                 37 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.6+ KB


None

GCC
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Columns: 597 entries, Province/State to 9/5/21
dtypes: float64(2), int64(593), object(2)
memory usage: 1.3+ MB


None

GRC
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Columns: 597 entries, Province/State to 9/5/21
dtypes: float64(2), int64(593), object(2)
memory usage: 1.2+ MB


None

GDC
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Columns: 597 entries, Province/State to 9/5/21
dtypes: float64(2), int64(593), object(2)
memory usage: 1.3+ MB


None

BUDGET_DATA
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   states               37 non-null     object 
 1   Initial_budget (Bn)  37 non-null     float64
 2   Revised_budget (Bn)  37 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1016.0+ bytes


None

COVIDNIG
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   States Affected               37 non-null     object
 1   No. of Cases (Lab Confirmed)  37 non-null     object
 2   No. of Cases (on admission)   37 non-null     object
 3   No. Discharged                37 non-null     object
 4   No. of Deaths                 37 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 1.6+ KB


None

COVID_EXTERNAL
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   states                   37 non-null     object 
 1   region                   37 non-null     object 
 2   Population               37 non-null     int64  
 3   Overall CCVI Index       37 non-null     float64
 4   Age                      37 non-null     float64
 5   Epidemiological          37 non-null     float64
 6   Fragility                37 non-null     float64
 7   Health System            37 non-null     float64
 8   Population Density       37 non-null     float64
 9   Socio-Economic           37 non-null     float64
 10   Transport Availability  37 non-null     float64
 11  Acute IHR                37 non-null     float64
dtypes: float64(9), int64(1), object(2)
memory usage: 3.6+ KB


None

REALGDP
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    7 non-null      int64  
 1   Q1      7 non-null      float64
 2   Q2      7 non-null      float64
 3   Q3      7 non-null      float64
 4   Q4      7 non-null      float64
dtypes: float64(4), int64(1)
memory usage: 408.0 bytes


None

### Task 3 - Data Cleaning and Preparation
From the information obtained above, you will need to fix the data format. 
<br>
Examples: 
* Convert to appropriate data type.
* Rename the columns of the scraped data.
* Remove commas (,) from numerical data.
* Extract daily data for Nigeria from the Global daily cases data
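
The renaming and comma-removal steps listed above can be sketched as follows. This is a minimal illustration on toy rows, not the notebook's own solution: the comma-formatted values and the short column names in `new_names` are assumptions made for the example.

```python
import pandas as pd

# Toy rows shaped like the scraped table; the comma-formatted values are illustrative.
raw = pd.DataFrame({
    "States Affected": ["Lagos", "FCT"],
    "No. of Cases (Lab Confirmed)": ["26,708", "9,627"],
    "No. of Cases (on admission)": ["2,435", "2,840"],
    "No. Discharged": ["24,037", "6,694"],
    "No. of Deaths": [236, 93],
})

# Hypothetical short names; choose whatever suits the analysis.
new_names = {
    "States Affected": "states",
    "No. of Cases (Lab Confirmed)": "confirmed",
    "No. of Cases (on admission)": "admitted",
    "No. Discharged": "discharged",
    "No. of Deaths": "deaths",
}
clean = raw.rename(columns=new_names)

# Strip thousands separators from the text columns, then convert to integers.
for col in ["confirmed", "admitted", "discharged"]:
    clean[col] = (
        clean[col].astype(str).str.replace(",", "", regex=False).astype("int64")
    )

print(clean.dtypes)
```

Passing `regex=False` treats the comma as a literal character, which keeps the replacement unambiguous.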

TODO A - Clean the scraped data

In [25]:
#[Write Your Code Here]
ncdc = dataframes['ncdc']
ncdc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   States Affected               37 non-null     object
 1   No. of Cases (Lab Confirmed)  37 non-null     int64 
 2   No. of Cases (on admission)   37 non-null     int64 
 3   No. Discharged                37 non-null     int64 
 4   No. of Deaths                 37 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.6+ KB


In [26]:
ncdc

Unnamed: 0,States Affected,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths
0,Lagos,73906,2643,70618,645
1,FCT,20684,491,20017,176
2,Rivers,10765,1004,9624,137
3,Kaduna,9280,64,9150,66
4,Plateau,9214,76,9077,61
5,Oyo,8358,750,7440,168
6,Edo,5671,522,4949,200
7,Ogun,5300,100,5122,78
8,Ondo,4148,272,3800,76
9,Akwa Ibom,4135,738,3355,42


In [27]:
# Check the range of each numeric column before choosing smaller dtypes
ncdc.describe()

Unnamed: 0,No. of Cases (Lab Confirmed),No. of Cases (on admission),No. Discharged,No. of Deaths
count,37.0,37.0,37.0,37.0
mean,5284.081081,227.837838,4987.27027,68.972973
std,12256.182536,477.149843,11720.396226,108.794017
min,5.0,0.0,3.0,2.0
25%,1103.0,13.0,1011.0,21.0
50%,2108.0,50.0,2057.0,35.0
75%,4135.0,268.0,3800.0,76.0
max,73906.0,2643.0,70618.0,645.0


In [28]:
# Downcast to smaller integer types where the observed range allows it.
# int16 tops out at 32,767, so the two columns whose maxima exceed that
# (73,906 confirmed and 70,618 discharged) need int32 to avoid overflow.
ncdc_reduced = ncdc.astype({'No. of Cases (Lab Confirmed)':'int32',
                            'No. of Cases (on admission)':'int16',
                            'No. Discharged':'int32',
                            'No. of Deaths':'int16'
                           })
ncdc_reduced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   States Affected               37 non-null     object
 1   No. of Cases (Lab Confirmed)  37 non-null     int32 
 2   No. of Cases (on admission)   37 non-null     int16 
 3   No. Discharged                37 non-null     int32 
 4   No. of Deaths                 37 non-null     int16 
dtypes: int16(2), int32(2), object(1)
memory usage: 868.0+ bytes


In [29]:
gcc.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,152960,152960,152960,153148,153220,153260,153306,153375,153375,153375
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,143174,144079,144847,145333,146387,147369,148222,149117,150101,150997
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,194186,194671,195162,195574,196080,196527,196915,197308,197659,198004
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,15025,15025,15025,15032,15033,15046,15052,15055,15055,15055
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,46929,47079,47168,47331,47544,47781,48004,48261,48475,48656


In [40]:
#Extract Nigeria data from global Covid-19 cases
gcc = dataframes['gcc']
grc = dataframes['grc']
gdc = dataframes['gdc']
gcc_ng = gcc.loc[gcc['Country/Region']=='Nigeria']

grc_ng = grc.loc[grc['Country/Region']== 'Nigeria']
gdc_ng = gdc.loc[gdc['Country/Region']=='Nigeria']

In [46]:
gcc_ng

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
202,,Nigeria,9.082,8.6753,0,0,0,0,0,0,...,190333,190983,191345,191805,192431,193013,193644,194088,195052,195511


In [42]:
grc_ng

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
187,,Nigeria,9.082,8.6753,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
gdc_ng

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,8/27/21,8/28/21,8/29/21,8/30/21,8/31/21,9/1/21,9/2/21,9/3/21,9/4/21,9/5/21
202,,Nigeria,9.082,8.6753,0,0,0,0,0,0,...,2308,2361,2454,2455,2469,2480,2488,2495,2522,2552


In [59]:
#clean Budget Data.csv
budget_data = dataframes['budget_data']
budget_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   states               37 non-null     object 
 1   Initial_budget (Bn)  37 non-null     float64
 2   Revised_budget (Bn)  37 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1016.0+ bytes


In [60]:
budget_data.describe()

Unnamed: 0,Initial_budget (Bn),Revised_budget (Bn)
count,37.0,37.0
mean,276.22027,171.092432
std,299.3763,142.974439
min,108.0,62.96
25%,152.92,108.3
50%,183.3,128.8
75%,242.18,174.0
max,1680.0,920.5


In [61]:
budget_data = budget_data.astype({'Initial_budget (Bn)':'float16', 'Revised_budget (Bn)':'float16'})
budget_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   states               37 non-null     object 
 1   Initial_budget (Bn)  37 non-null     float16
 2   Revised_budget (Bn)  37 non-null     float16
dtypes: float16(2), object(1)
memory usage: 572.0+ bytes


In [66]:
budget_data.describe()

Unnamed: 0,Initial_budget (Bn),Revised_budget (Bn)
count,37.0,37.0
mean,276.25,171.125
std,inf,inf
min,108.0,62.96875
25%,152.875,108.3125
50%,183.25,128.75
75%,242.125,174.0
max,1680.0,920.5
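
The `inf` in the `std` row and the shifted quantiles above are consistent with the narrow range and precision of `float16`. The sketch below illustrates both effects on three budget figures copied from the tables above, and shows `float32` as one possible alternative. It is an illustration of the dtype behaviour, not a reproduction of how `describe()` computes its result.

```python
import numpy as np
import pandas as pd

# Three budget figures taken from the tables above (billions).
budget = pd.Series([136.6, 139.31, 1680.0])

as_f16 = budget.astype("float16")
as_f32 = budget.astype("float32")

# float16 carries roughly three significant decimal digits, so 139.31 is rounded.
print(float(as_f16[1]), float(as_f32[1]))

# The largest finite float16 is 65504, so squaring 1680 overflows to inf.
with np.errstate(over="ignore"):
    squared = np.float16(1680.0) * np.float16(1680.0)
print(np.finfo(np.float16).max, squared)

# float32 keeps the spread finite and still halves the float64 footprint.
print(as_f32.std())
```

For a 37-row table the memory saved by either type is negligible, so keeping `float64` is also a reasonable choice.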


TODO B - Get a Pandas DataFrame for Daily Confirmed Cases in Nigeria. Columns are Date and Cases

TODO C - Get a Pandas DataFrame for Daily Recovered Cases in Nigeria. Columns are Date and Cases

TODO D - Get a Pandas DataFrame for Daily Death Cases in Nigeria. Columns are Date and Cases
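
One way to approach TODOs B to D is to reshape the single-row wide frame into long form with `melt`. The sketch below uses a one-row stand-in for `gcc_ng` with three date columns copied from the output above; the helper name `to_daily` is hypothetical.

```python
import pandas as pd

# One-row frame shaped like `gcc_ng`; the three dates are copied from the output above.
wide = pd.DataFrame({
    "Province/State": [None],
    "Country/Region": ["Nigeria"],
    "Lat": [9.082],
    "Long": [8.6753],
    "9/3/21": [194088],
    "9/4/21": [195052],
    "9/5/21": [195511],
})

def to_daily(frame: pd.DataFrame) -> pd.DataFrame:
    """Reshape a single-country wide frame into Date/Cases rows."""
    id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
    long = frame.melt(id_vars=id_cols, var_name="Date", value_name="Cases")
    # The repository writes dates as month/day/two-digit year.
    long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
    return long[["Date", "Cases"]].reset_index(drop=True)

daily_confirmed = to_daily(wide)
print(daily_confirmed)
```

The same helper could be applied to `grc_ng` and `gdc_ng`, since all three files share the same layout.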

### Task 4 - Analysis
Here you will perform some analyses on the datasets. You are welcome to communicate findings in charts and summaries.
<br>
We have included a few TODOs to help with your analysis. However, do not let these limit your approach; feel free to include more, and be sure to support your findings with charts and summaries.

TODO A - Generate a plot that shows the Top 10 states in terms of Confirmed Covid cases by Laboratory test

TODO B - Generate a plot that shows the Top 10 states in terms of Discharged Covid cases. Hint - Sort the values

TODO D - Plot the top 10 Death cases
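
The three top-10 plots share the same pattern: select the largest rows with `nlargest`, then draw a bar chart. A minimal sketch follows, using the first six rows of the scraped table shown earlier; the helper name `plot_top` is hypothetical, and plain matplotlib is used here although seaborn would work equally well.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First rows of the scraped table shown earlier; the real frame has 37 rows.
ncdc_sample = pd.DataFrame({
    "States Affected": ["Lagos", "FCT", "Rivers", "Kaduna", "Plateau", "Oyo"],
    "No. of Cases (Lab Confirmed)": [73906, 20684, 10765, 9280, 9214, 8358],
    "No. Discharged": [70618, 20017, 9624, 9150, 9077, 7440],
    "No. of Deaths": [645, 176, 137, 66, 61, 168],
})

def plot_top(frame, column, n=10):
    """Horizontal bar chart of the n states with the largest values in `column`."""
    top = frame.nlargest(n, column)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(top["States Affected"], top[column])
    ax.invert_yaxis()  # put the largest bar at the top
    ax.set_xlabel(column)
    ax.set_title(f"Top {len(top)} states by {column}")
    return top, ax

# n=3 only because the sample is short; use the default n=10 on the full table.
top_deaths, ax = plot_top(ncdc_sample, "No. of Deaths", n=3)
```

Note that the ranking by deaths differs from the ranking by confirmed cases: in this sample Oyo ranks third for deaths but sixth for confirmed cases.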

TODO E - Generate a line plot for the total daily confirmed, recovered and death cases in Nigeria

TODO F - 
* Determine the daily infection rate. You can use the Pandas `diff` method to find the derivative of the total cases.
* Generate a line plot for the above

TODO G - 
* Calculate maximum infection rate for a day (Number of new cases)
* Find the date
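
TODOs F and G can be sketched together: `diff` turns cumulative totals into daily new cases, and `max`/`idxmax` locate the peak. The series below holds only the last seven cumulative counts for Nigeria copied from the output above, so the peak it finds applies to that window, not to the whole pandemic.

```python
import pandas as pd

# Cumulative confirmed counts for Nigeria, copied from the last columns above.
cumulative = pd.Series(
    [191805, 192431, 193013, 193644, 194088, 195052, 195511],
    index=pd.to_datetime(
        ["2021-08-30", "2021-08-31", "2021-09-01", "2021-09-02",
         "2021-09-03", "2021-09-04", "2021-09-05"]
    ),
    name="Cases",
)

# diff() subtracts the previous day's total; the first entry has no
# predecessor, so it is NaN and is dropped here.
new_cases = cumulative.diff().dropna().astype("int64")

peak_value = new_cases.max()      # largest single-day increase in the window
peak_date = new_cases.idxmax()    # index label (date) where that increase occurs
print(peak_date.date(), peak_value)
```

Calling `new_cases.plot()` on the full series would give the line plot requested in TODO F.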

TODO H - Determine the relationship between the external dataset and the NCDC COVID-19 dataset. 
Here you will generate a line plot of top 10 confirmed cases and the overall community vulnerability index on the same axis. From the graph, explain your observation.
<br>
Steps
* Combine the two datasets on a common column (states).
* Create a new DataFrame for plotting. This DataFrame will contain the top 10 states in terms of confirmed cases, i.e. sorted by confirmed cases. **Hint:** Check out the Pandas [nlargest](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html) function. This [tutorial](https://cmdlinetips.com/2019/03/how-to-select-top-n-rows-with-the-largest-values-in-a-columns-in-pandas/) can help.
* Plot both variables on the same axis. Check out this [tutorial](http://kitchingroup.cheme.cmu.edu/blog/2013/09/13/Plotting-two-datasets-with-very-different-scales/).
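
The steps above can be sketched as a merge followed by a plot with two y-axes. The frames below are small stand-ins whose values are copied from the previews earlier in the notebook; `twinx` is one common way to show two very different scales on one figure and is used here as an assumption about what "the same axis" should look like.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small stand-ins for the two tables; values are copied from the previews above.
cases = pd.DataFrame({
    "States Affected": ["Lagos", "FCT", "Plateau"],
    "No. of Cases (Lab Confirmed)": [73906, 20684, 9214],
})
external = pd.DataFrame({
    "states": ["FCT", "Plateau", "Kwara"],
    "Overall CCVI Index": [0.3, 0.4, 0.3],
})

# An inner merge keeps only the states present in both tables.
merged = cases.merge(external, left_on="States Affected", right_on="states")
top = merged.nlargest(10, "No. of Cases (Lab Confirmed)")

fig, ax_cases = plt.subplots(figsize=(8, 4))
ax_cases.plot(top["states"], top["No. of Cases (Lab Confirmed)"], marker="o")
ax_cases.set_ylabel("Confirmed cases")

# twinx() adds a second y-axis sharing the same x-axis, so the 0-1 index
# is not flattened by the much larger case counts.
ax_ccvi = ax_cases.twinx()
ax_ccvi.plot(top["states"], top["Overall CCVI Index"], color="tab:red", marker="s")
ax_ccvi.set_ylabel("Overall CCVI Index")
```

Because the merge is an inner join, differences in spelling between tables (for example `Akwa Ibom` in the NCDC table versus `Akwa-Ibom` in the budget table) would silently drop rows, so it is worth comparing the row count before and after merging.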

TODO I - Determine the relationship between the external dataset and the NCDC COVID-19 dataset. 
* Here you will generate a regression plot between two variables to visualize the linear relationships - Confirmed Cases and Population Density.
Hint: Check out Seaborn [Regression Plot](https://seaborn.pydata.org/generated/seaborn.regplot.html).
* Provide a summary of your observation
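
Alongside the Seaborn regression plot, it can help to compute the fitted line and the correlation numerically when writing the summary. The sketch below uses made-up values purely for illustration; the column name `confirmed` is hypothetical. `np.polyfit` with degree 1 gives an ordinary least-squares line, which is the kind of fit `regplot` draws by default.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; use the merged NCDC/external frame in practice.
df = pd.DataFrame({
    "Population Density": [0.1, 0.2, 0.3, 0.6, 0.9],
    "confirmed": [1200, 2500, 3100, 7000, 9800],
})

# A degree-1 polyfit returns the least-squares slope and intercept.
slope, intercept = np.polyfit(df["Population Density"], df["confirmed"], deg=1)

# The Pearson correlation summarises how tightly the points follow a line.
r = df["Population Density"].corr(df["confirmed"])
print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.3f}")
```

A correlation close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear association; with only 37 states, a few large values such as Lagos can dominate either figure.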

TODO J - 
* Provide more analyses by extending TODO G & H. Meaning, determine relationships between more features.
* Provide a detailed summary of your findings. 
* You may include as many analyses as you like.

### TODO L - 
Determine the effect of the Pandemic on the economy. To do this, you will compare the Real GDP value Pre-COVID-19 with Real GDP in 2020 (COVID-19 Period, especially Q2 2020)
<br>
Steps
* From the Real GDP data, generate a `barplot` using the GDP values for each year and quarter. For example, the x-axis will show the year (e.g. 2017) and the bars will show the values for each quarter (Q1-Q4). You are expected to show the bars for every quarter on one graph.
<br>
Hint: Use [Pandas.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) to create your plot DataFrame 
* Set your quarter legend to lower left.
* Using `axhline`, draw a horizontal line through the graph at the value of Q2 2020.
* Write out your observation
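
The steps above can be sketched with `melt`, grouped bars, and `axhline`. Only two rows from the RealGDP preview are used, so 2018 stands in for 2020 as the reference year; this is an assumption for the example, since the 2020 values are not shown in the preview. Plain matplotlib is used for the bars.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Two rows copied from the RealGDP preview above; the full file has 7 rows.
gdp = pd.DataFrame({
    "Year": [2017, 2018],
    "Q1": [15797965.83, 16096654.19],
    "Q2": [16334719.27, 16580508.07],
    "Q3": [17760228.17, 18081342.10],
    "Q4": [18598067.07, 19041437.59],
})

# melt() stacks the quarter columns into (Year, Quarter, GDP) rows.
long = gdp.melt(id_vars="Year", var_name="Quarter", value_name="GDP")

years = sorted(long["Year"].unique())
quarters = ["Q1", "Q2", "Q3", "Q4"]
width = 0.2

fig, ax = plt.subplots(figsize=(8, 4))
for i, quarter in enumerate(quarters):
    values = long[long["Quarter"] == quarter].sort_values("Year")["GDP"]
    positions = [k + i * width for k in range(len(years))]
    ax.bar(positions, values, width=width, label=quarter)

# Centre each year label under its group of four bars.
ax.set_xticks([k + 1.5 * width for k in range(len(years))])
ax.set_xticklabels(years)
ax.legend(title="Quarter", loc="lower left")

# Reference line: 2018 stands in for 2020 because the preview stops at 2018.
reference_year = 2018
q2_mask = (long["Year"] == reference_year) & (long["Quarter"] == "Q2")
q2_value = long.loc[q2_mask, "GDP"].iloc[0]
ax.axhline(q2_value, color="black", linestyle="--")
```

With Seaborn, the melted frame can be passed directly as `sns.barplot(data=long, x="Year", y="GDP", hue="Quarter")`, which produces the same grouped layout with less code.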

### Note: Do not limit your analysis to the provided TODOs. Perform more analyses, e.g.:
* Check for additional external datasets
* Ask more questions and find the right answers by exploring the data