<a href="https://colab.research.google.com/github/ayushirastogi15/covid-19-analysis/blob/master/Covid_19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis & Data Representation of COVID-19 data
This notebook analyzes confirmed, recovered & death cases from all over the world to surface insights, patterns & visualizations.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_recovered.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed_US.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv
/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv
/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_line_list.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths_US.csv
/kaggle/input/novel-corona-virus-2019-dataset/COVID19_line_list_data.csv
/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv


# Importing required libraries

In [None]:
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# Loading the data which contains daily basis data

In [None]:
covid_data = pd.read_csv("/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv")
covid_data.head()

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


# Checking non-null rows & columns.
All the columns except Province/State are free of null values. We ignore the missing Province/State entries for now, because the analysis is done on the basis of dates & countries and not on the basis of Province/State, so they do not affect the results.

In [None]:
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85638 entries, 0 to 85637
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   SNo              85638 non-null  int64  
 1   ObservationDate  85638 non-null  object 
 2   Province/State   57341 non-null  object 
 3   Country/Region   85638 non-null  object 
 4   Last Update      85638 non-null  object 
 5   Confirmed        85638 non-null  float64
 6   Deaths           85638 non-null  float64
 7   Recovered        85638 non-null  float64
dtypes: float64(3), int64(1), object(4)
memory usage: 5.2+ MB
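The `info()` output above reports 57341 non-null Province/State values out of 85638 rows, so 28297 are missing. As a minimal sketch of how the missing values can be counted directly, the snippet below uses a small hypothetical frame (only the column names come from covid_19_data.csv; the rows are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names come from covid_19_data.csv.
sample = pd.DataFrame({
    "Province/State": ["Anhui", np.nan, "Beijing", np.nan],
    "Country/Region": ["Mainland China", "India", "Mainland China", "Italy"],
    "Confirmed": [1.0, 3.0, 14.0, 2.0],
})

# isna() marks the missing cells; sum() then counts them per column.
null_counts = sample.isna().sum()
print(null_counts)
```

On the real data, the same call would be `covid_data.isna().sum()`.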


# Analysis on the basis of dates
The observations of confirmed, death & recovered cases start on 22nd Jan, 2020. The output below reports 12th Aug, 2020 as the last available date, while the country-level commentary later in this notebook was written against data up to 8th Aug, 2020.

In [None]:
print("Starting date : ", min(covid_data.ObservationDate.values))
print("Ending date : ", max(covid_data.ObservationDate.values))

Starting date :  01/22/2020
Ending date :  08/12/2020
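The cell above takes the minimum and maximum of the raw date strings. That gives the right answer only as long as every date is zero-padded and falls in the same year, because strings are compared character by character. The sketch below illustrates the difference with three hypothetical dates in the dataset's MM/DD/YYYY format; the 2019 date is an assumption added purely to show the failure case.

```python
import pandas as pd

# Hypothetical date strings in the dataset's MM/DD/YYYY format.
dates = pd.Series(["12/31/2019", "01/22/2020", "08/12/2020"])

# String comparison is lexicographic, so the month digits decide the order.
string_min = dates.min()

# Parsing first makes the comparison chronological.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
true_min = parsed.min()

print("Minimum as string:", string_min)
print("Minimum as date  :", true_min.date())
```

Parsing `ObservationDate` with `pd.to_datetime` before calling `min()` and `max()` avoids this pitfall.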


Next, we aggregate the data by date. This answers questions such as: how many confirmed cases were there on 4th July, 2020? What was the death count on 27th May? How many people had recovered by 23rd April?
Working with daily totals also shows how quickly cases are rising worldwide as well as per country: how fast the confirmed cases are increasing, how fast or slowly people are recovering, and so on.

In [None]:
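# Parse the date strings so that grouping and sorting are chronological
covid_data.ObservationDate = pd.to_datetime(covid_data.ObservationDate)
# Sum only the numeric columns; the text columns cannot be added up
tot_rates = covid_data.groupby('ObservationDate').sum(numeric_only=True)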
tot_rates.head()

Unnamed: 0_level_0,SNo,Confirmed,Deaths,Recovered
ObservationDate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-01-22,741,555.0,17.0,28.0
2020-01-23,2829,653.0,18.0,30.0
2020-01-24,4305,941.0,26.0,36.0
2020-01-25,6490,1438.0,42.0,39.0
2020-01-26,9071,2118.0,56.0,52.0
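The questions above (the totals on a given date, and how fast the counts grow) can be sketched as follows. The frame reuses the five rows of `tot_rates` printed above; the variable names `tot`, `on_jan_24` and `daily_new` are illustrative and not part of the notebook.

```python
import pandas as pd

# First five rows of tot_rates, copied from the output above.
tot = pd.DataFrame(
    {
        "Confirmed": [555.0, 653.0, 941.0, 1438.0, 2118.0],
        "Deaths": [17.0, 18.0, 26.0, 42.0, 56.0],
        "Recovered": [28.0, 30.0, 36.0, 39.0, 52.0],
    },
    index=pd.to_datetime(
        ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25", "2020-01-26"]
    ),
)

# Totals on a single date: label-based lookup on the DatetimeIndex.
on_jan_24 = tot.loc[pd.Timestamp("2020-01-24")]

# The counts are cumulative, so the daily increase is the difference
# between consecutive rows (the first row has no previous day, hence NaN).
daily_new = tot.diff()

print(on_jan_24)
print(daily_new)
```

The same `.loc` lookup on the full `tot_rates` frame would answer the 4th July, 27th May and 23rd April questions.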


# Plotting of Confirmed, Recovered & Death counts over the period of approx. 6 months

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x = tot_rates.index, y = tot_rates.Confirmed, name = 'Confirmed'))
fig.add_trace(go.Scatter(x = tot_rates.index, y = tot_rates.Recovered, name = 'Recovered'))
fig.add_trace(go.Scatter(x = tot_rates.index, y = tot_rates.Deaths, name = 'Deaths'))
fig.update_layout(title = 'COVID-19 CASES ALL OVER THE WORLD', xaxis_title='Time (Jan 2020 - Aug 2020)',
                   yaxis_title='Count of Cases')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/corona_cases.png

# Analysis on the basis of country
Here, we analyze the data by country. This lets us answer questions such as: which country is the worst hit by the coronavirus? Which are the top 10 or top 20 countries by confirmed cases, in other words, where has the virus spread the most? Which countries have a high infection rate, a high death rate or a high recovery rate?

In [None]:
# The counts are cumulative, so the last date alone gives each country's running total.

country = covid_data[covid_data.ObservationDate == covid_data.ObservationDate.max()]
# Sum only the numeric columns over each country's rows
cntry_case = country.groupby('Country/Region').sum(numeric_only=True)
cntry_case

Unnamed: 0_level_0,SNo,Confirmed,Deaths,Recovered
Country/Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Afghanistan,84897,37345.0,1354.0,26694.0
Albania,84898,6817.0,208.0,3552.0
Algeria,84899,36699.0,1333.0,25627.0
Andorra,84900,977.0,53.0,855.0
Angola,84901,1762.0,80.0,577.0
...,...,...,...,...
West Bank and Gaza,85060,15184.0,105.0,8369.0
Western Sahara,85061,10.0,1.0,8.0
Yemen,85062,1841.0,528.0,937.0
Zambia,85063,8501.0,246.0,7233.0


We want to find the countries with the largest numbers of confirmed, recovered and death cases. To do that, we sort the data by the 'Confirmed' column into the 'Confirmed_df' dataframe, by the 'Recovered' column into the 'Recovered_df' dataframe & by the 'Deaths' column into the 'Death_df' dataframe.
In total, the coronavirus has spread to 190 countries.
We also compute the recovery rate & the death rate of each country, each as a percentage of its confirmed cases.

In [None]:
cntry_case['recover_rate'] = (cntry_case.Recovered/cntry_case.Confirmed)*100
cntry_case['death_rate'] = (cntry_case.Deaths/cntry_case.Confirmed)*100

cntry_case.head()

Unnamed: 0_level_0,SNo,Confirmed,Deaths,Recovered,recover_rate,death_rate
Country/Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Afghanistan,84897,37345.0,1354.0,26694.0,71.479448,3.625653
Albania,84898,6817.0,208.0,3552.0,52.105032,3.051196
Algeria,84899,36699.0,1333.0,25627.0,69.830241,3.632252
Andorra,84900,977.0,53.0,855.0,87.512794,5.42477
Angola,84901,1762.0,80.0,577.0,32.746879,4.540295


In [None]:
Confirmed_df = cntry_case.sort_values(by=['Confirmed'], ascending=False)
Recovered_df = cntry_case.sort_values(by=['Recovered'], ascending=False)
Death_df = cntry_case.sort_values(by=['Deaths'], ascending=False)
Recover_rate_df = cntry_case.sort_values(by=['recover_rate'], ascending=False)
Death_rate_df = cntry_case.sort_values(by=['death_rate'], ascending=False)
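The charts that follow slice the first or last 15 rows of these sorted dataframes. As a hedged alternative sketch, pandas can also select the extremes directly with `nlargest` and `nsmallest`. The five rows below are copied from the `cntry_case` output above; the names `cases`, `top3` and `bottom2` are illustrative only.

```python
import pandas as pd

# Five rows copied from the cntry_case output above.
cases = pd.DataFrame(
    {
        "Confirmed": [37345.0, 6817.0, 36699.0, 977.0, 1762.0],
        "Deaths": [1354.0, 208.0, 1333.0, 53.0, 80.0],
        "Recovered": [26694.0, 3552.0, 25627.0, 855.0, 577.0],
    },
    index=pd.Index(
        ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
        name="Country/Region",
    ),
)

# Rows with the highest and the lowest confirmed counts.
top3 = cases.nlargest(3, "Confirmed")
bottom2 = cases.nsmallest(2, "Confirmed")

print(top3)
print(bottom2)
```

One difference from slicing: `nsmallest` returns rows in ascending order, whereas `Confirmed_df[-15:]` keeps the descending order of the sorted dataframe.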

# Visualization of confirmed cases of top 15 countries
The chart below shows the confirmed, recovered and death counts of the 15 countries with the most confirmed cases, as of the last date in the data. Some key points to look at:
1. It shows how the three counts (confirmed, recovered & death cases) compare with each other in each country.
2. Plotting the 'Confirmed_df' dataframe shows that **the 'USA' is the country worst hit by the coronavirus, with 50,44,864 confirmed cases. Its 16,56,864 recovered cases are very low compared to the confirmed cases, and its 1,62,938 deaths are low compared to both the confirmed & recovered counts**. This gives a **recovery rate of 32.84% and a death rate of 3.23%**.
3. **India is one of the 5 worst-hit countries and stands at 3rd position in terms of confirmed cases, with 22,15,074. Its 15,35,743 recovered cases compare well with the USA, and its 44,386 deaths are far fewer**. This gives a **recovery rate of 69.33%, which is more than double the USA's recovery rate, & a death rate of 2.004%**.
4. **China (Mainland China), widely reported as the centre of this virus, does not even appear among the 15 worst-hit countries.**
5. From the chart, it can be seen that **'Chile', 'Iran', 'Pakistan' and 'Saudi Arabia' each have approximately equal numbers of confirmed and recovered cases**, with:
    - **'Chile' having 3,73,056 confirmed and 3,45,826 recovered cases,**
    - **'Iran' having 3,26,712 confirmed and 2,84,371 recovered cases,**
    - **'Pakistan' having 2,84,121 confirmed and 2,60,248 recovered cases,**
    - **'Saudi Arabia' having 2,88,690 confirmed and 2,52,039 recovered cases.**
This suggests that recovery has kept pace with the spread. In other words, most of the confirmed cases in these countries have recovered.
6. The **'UK'**, which is at **11th position**, has a **very low recovery rate by comparison, 0.464%**. It has **3,05,572 confirmed cases but only 1,441 recovered cases**.

*This analysis covers data up to 8th Aug, 2020.
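The rates quoted for the USA and India follow directly from the counts in the list above. The short sketch below reproduces that arithmetic; the helper name `rate` is hypothetical and simply mirrors the `recover_rate` and `death_rate` formulas used earlier in the notebook.

```python
def rate(part, total):
    """Return part as a percentage of total."""
    return part / total * 100


# Counts quoted in the commentary above, written without separators.
usa_confirmed, usa_recovered, usa_deaths = 5044864, 1656864, 162938
india_confirmed, india_recovered, india_deaths = 2215074, 1535743, 44386

usa_recovery = rate(usa_recovered, usa_confirmed)
usa_death = rate(usa_deaths, usa_confirmed)
india_recovery = rate(india_recovered, india_confirmed)
india_death = rate(india_deaths, india_confirmed)

print(f"USA  : recovery {usa_recovery:.2f}%, death {usa_death:.2f}%")
print(f"India: recovery {india_recovery:.2f}%, death {india_death:.3f}%")
print("India's recovery rate is more than double the USA's:",
      india_recovery > 2 * usa_recovery)
```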

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = Confirmed_df.index[:15], y = Confirmed_df['Confirmed'][:15], name='Confirmed'))
fig.add_trace(go.Bar(x = Confirmed_df.index[:15], y = Confirmed_df['Recovered'][:15], name='Recovered'))
fig.add_trace(go.Bar(x = Confirmed_df.index[:15], y = Confirmed_df['Deaths'][:15], name='Deaths'))
fig.update_layout(title='15 Worst Corona-Virus hit countries so far', xaxis_title='Countries',
                 yaxis_title='Counts of Cases (in millions)')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/15-worst-hit-countries.png

# Visualization of last 15 countries 
The chart below shows the countries where the confirmed cases are very low so far. Some points to look at:
1. **'MS Zaandam' (a cruise ship that the dataset lists under Country/Region) has the lowest number of confirmed cases, 9, with 0 recovered cases and 2 deaths, which means a recovery rate of 0% & a death rate of 22.22%.**
2. **Many countries have 0 deaths and almost every patient has recovered, which means a death rate of 0% (which is good) & a recovery rate of either 100% or more than 95%**.
3. **'Macau', 'Dominica' and 'Holy See' are the countries which have a 100% recovery rate & a 0% death rate.**

*This chart helps us find the countries where both the spread and the death rate are very low.
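The second point above (no deaths and a recovery rate of at least 95%) can be checked with a boolean mask. This is a minimal sketch on made-up rows: only the 'MS Zaandam' counts come from the text, while 'Country A', 'Country B' and 'Country C' are hypothetical placeholders.

```python
import pandas as pd

# Only the 'MS Zaandam' row is taken from the text; the others are hypothetical.
cases = pd.DataFrame(
    {
        "Confirmed": [9.0, 12.0, 46.0, 20.0],
        "Deaths": [2.0, 0.0, 0.0, 0.0],
        "Recovered": [0.0, 12.0, 45.0, 10.0],
    },
    index=["MS Zaandam", "Country A", "Country B", "Country C"],
)
cases["recover_rate"] = cases.Recovered / cases.Confirmed * 100
cases["death_rate"] = cases.Deaths / cases.Confirmed * 100

# Keep the rows with no deaths and a recovery rate of at least 95%.
mask = (cases.Deaths == 0) & (cases.recover_rate >= 95)
low_risk = cases[mask]

print(low_risk)
```

Applied to `cntry_case`, the same mask would list every country that meets both conditions rather than only those visible in the chart.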

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = Confirmed_df.index[-15:], y = Confirmed_df['Confirmed'][-15:], name='Confirmed'))
fig.add_trace(go.Bar(x = Confirmed_df.index[-15:], y = Confirmed_df['Recovered'][-15:], name='Recovered'))
fig.add_trace(go.Bar(x = Confirmed_df.index[-15:], y = Confirmed_df['Deaths'][-15:], name='Deaths'))
fig.update_layout(title='15 Least Corona-Virus hit countries so far', xaxis_title='Countries',
                 yaxis_title='Count of Cases')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/15-less-hit-countries.png

# Visualization of top 15 countries of high recovered cases
Here, we analyze the data by recovered cases and plot the 15 countries with the most recoveries. From this chart, it can be seen that:
1. **Brazil has the highest number of recovered cases, 23,56,983 out of 30,35,422 confirmed cases, which is a recovery rate of 77.65%.**
2. **The USA, which is at the top in terms of confirmed cases, is second in terms of recovered cases.**
3. **India is at the 3rd position in recovered cases as well. We may assume that as the confirmed cases increase, the recovered cases increase at a similar rate, which leaves fewer active cases**.
4. The chart below confirms that **'Chile', 'Pakistan', 'Turkey' and 'Germany' have a recovery rate greater than 90%**.
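The third point mentions active cases, which the notebook does not compute. A common convention, assumed here rather than taken from the notebook, is active = confirmed - recovered - deaths. The sketch below applies it to the Brazil, USA and India counts quoted in this notebook's commentary.

```python
import pandas as pd

# Counts quoted in this notebook's commentary, written without separators.
top3 = pd.DataFrame(
    {
        "Confirmed": [3035422, 5044864, 2215074],
        "Recovered": [2356983, 1656864, 1535743],
        "Deaths": [101049, 162938, 44386],
    },
    index=["Brazil", "USA", "India"],
)

# Assumed definition: cases that have neither recovered nor died.
top3["Active"] = top3.Confirmed - top3.Recovered - top3.Deaths

print(top3)
```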

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = Recovered_df.index[:15], y = Recovered_df['Confirmed'][:15], name='Confirmed'))
fig.add_trace(go.Bar(x = Recovered_df.index[:15], y = Recovered_df['Recovered'][:15], name='Recovered'))
fig.add_trace(go.Bar(x = Recovered_df.index[:15], y = Recovered_df['Deaths'][:15], name='Deaths'))
fig.update_layout(title='Top 15 countries in recovered cases', xaxis_title='Countries',
                 yaxis_title='Cases (in millions)')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/15-top-recovered-countries.png

# Visualization of last 15 countries of low recovered cases
The chart below shows the 15 countries where the number of recovered cases is either 0 or very low. Some points to look at:
1. There are 3 entries, **'Sweden', 'MS Zaandam' and 'Serbia', where the recovered count is 0**.
2. However, **'Sweden' & 'Serbia'** turn out to **have comparatively high confirmed and death counts**: **'Sweden' has 82,323 confirmed cases and 5,763 deaths & 'Serbia' has 28,099 confirmed cases and 641 deaths**.
3. These two countries can be seen as outliers, because the other countries in this chart have much lower confirmed, death and recovered counts. Both have a 0% recovery rate, with death rates of 7% (Sweden) and 2.28% (Serbia).

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = Recovered_df.index[-15:], y = Recovered_df['Confirmed'][-15:], name='Confirmed'))
fig.add_trace(go.Bar(x = Recovered_df.index[-15:], y = Recovered_df['Recovered'][-15:], name='Recovered'))
fig.add_trace(go.Bar(x = Recovered_df.index[-15:], y = Recovered_df['Deaths'][-15:], name='Deaths'))
fig.update_layout(title="Last 15 countries in recovered cases", xaxis_title='Countries',
                 yaxis_title='Count of Cases')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/15-last-recovered-countries.png

The chart below gives a clearer view of the 15 countries with the fewest recovered cases, this time leaving Sweden & Serbia out of the analysis.
It shows that **'Macau', 'Timor-Leste', 'Dominica' and 'Holy See' are the ones where every COVID-positive person has recovered and no new cases have appeared so far**, which gives a 100% recovery rate.

In [None]:
last15 = Recovered_df[(Recovered_df.index != 'Sweden') & (Recovered_df.index != 'Serbia')][-15:]

fig = go.Figure()
fig.add_trace(go.Bar(x = last15.index, y = last15['Confirmed'], name='Confirmed'))
fig.add_trace(go.Bar(x = last15.index, y = last15['Recovered'], name='Recovered'))
fig.add_trace(go.Bar(x = last15.index, y = last15['Deaths'], name='Deaths'))
fig.update_layout(title="Last 15 countries in recovered cases (excluding Sweden & Serbia)", xaxis_title='Countries',
                 yaxis_title='Count of Cases')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/last-15-recovered-countries.png

# Visualization of countries of large number of death cases
The chart below shows the death counts, where:
1. The 'USA' is again at the top, with 1,62,938 deaths (more than 1.5 lakh).
2. 'Brazil', which is at the top in recovered cases and 2nd in confirmed cases, is also 2nd in terms of deaths, with 1,01,049 (just over 1 lakh).
3. 'India' is at the 5th position with 44,386 deaths (nearly 45K), which is less than half of Brazil's count and far below the USA's.

In [None]:
fig = go.Figure(data=go.Bar(x = Death_df.index[:15], y = Death_df['Deaths'][:15]))
fig.update_layout(title="Top 15 countries in death cases", xaxis_title='Countries',
                 yaxis_title='Count of Deaths')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/15-top-death-countries.png

# Visualization of last 15 countries of less number of death cases
The chart below shows the 15 countries with the lowest death counts. In 14 of the 15 the death count is 0, and the 15th, 'Liechtenstein', has 1 death. Since there are almost no deaths, we can assume that the infection has been less dangerous in these countries than in others.

In [None]:
fig = go.Figure(data=go.Bar(x = Death_df.index[-15:], y = Death_df['Deaths'][-15:]))
fig.update_layout(title="Last 15 countries in death cases", xaxis_title='Countries',
                 yaxis_title='Count of Deaths')
fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/15-last-death-countries.png

# Increment of COVID-19 cases in India

In [None]:
India_df = covid_data[covid_data['Country/Region'] == 'India']
# Sum only the numeric columns for each date
India_df = India_df.groupby('ObservationDate').sum(numeric_only=True)
    
fig = go.Figure()
fig.add_trace(go.Scatter(x = India_df.index, y = India_df.Confirmed, name = 'Confirmed'))
fig.add_trace(go.Scatter(x = India_df.index, y = India_df.Recovered, name = 'Recovered'))
fig.add_trace(go.Scatter(x = India_df.index, y = India_df.Deaths, name = 'Deaths'))
fig.update_layout(title='Increase of COVID-19 Cases of India', xaxis_title='Time',
                  yaxis_title='Count in millions')

fig.show()

The graph above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/corona_cases.png

# Distribution of Cases all over the world

In [None]:
fig = go.Figure(data = go.Pie(labels = cntry_case.index, values = cntry_case.Confirmed,
                name = 'Pie chart of countries'))
fig.update_layout(title = 'Pie chart showing distribution of corona virus cases of different countries')
fig.show()

The pie chart above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/pie-chart-distrib.png

# Graph showing increment/decrement of Recovery rate & Death Rate

In [None]:
India_df = covid_data[covid_data['Country/Region'] == 'India']
# Sum only the numeric columns for each date
India_df = India_df.groupby('ObservationDate').sum(numeric_only=True)
India_df['Recover_rate'] = (India_df.Recovered/India_df.Confirmed)*100
India_df['Death_rate'] = (India_df.Deaths/India_df.Confirmed)*100

fig = go.Figure(data = go.Scatter(x = India_df.index, y = India_df.Recover_rate, name = 'Recovery Rate'))
fig.update_layout(title = 'Graph of Recovery rate', xaxis_title = 'Time', yaxis_title = "Recovery rate in percentage(%)")
fig.show()

fig = go.Figure(data = go.Scatter(x = India_df.index, y = India_df.Death_rate, name = 'Death Rate'))
fig.update_layout(title = 'Graph of Death Rate', xaxis_title = 'Time', yaxis_title = "Death rate in percentage(%)")
fig.show()

The two graphs above can also be viewed at: https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/graph-of-recovery.png

https://github.com/ayushirastogi15/covid-19-analysis/blob/master/images/graph-of-death.png