# EXPLORATORY DATA ANALYSIS OF COVID 19 INDIA DATASET

THIS DATA SET IS COLLECTED FROM - https://www.kaggle.com/sudalairajkumar/covid19-in-india

KAGGLE LINK OF THIS PROJECT: https://www.kaggle.com/diptaraj23/eda-of-covid-india

GITHUB LINK OF THIS PROJECT: https://github.com/diptaraj23/Covid-19-India-EDA

##### NOTE :
The dataset at the link above is updated continuously. I downloaded it in .CSV format on [22 April 2021, 22:34:41].

The notebook on Kaggle is not affected, because the data it reads is updated as well. On other platforms (such as GitHub), where I have uploaded the .CSV files downloaded on [22 April 2021, 22:34:41], the analysis reflects only the data collected up to that time.

To see the analysis based on the latest data, use the KAGGLE link.

In [202]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
from datetime import datetime

## Exploratory Data Analysis of Covid_19_India Dataset

In [203]:
covid_data = pd.read_csv("covid_19_india.csv")
covid_data['Date'] =  pd.to_datetime(covid_data['Date'] ,format ='%d-%m-%Y')
covid_data.head()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,2020-01-30,6:00 PM,Kerala,1,0,0,0,1
1,2,2020-01-31,6:00 PM,Kerala,1,0,0,0,1
2,3,2020-02-01,6:00 PM,Kerala,2,0,0,0,2
3,4,2020-02-02,6:00 PM,Kerala,3,0,0,0,3
4,5,2020-02-03,6:00 PM,Kerala,3,0,0,0,3


In [204]:
covid_data.shape

(14114, 9)

In [205]:
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14114 entries, 0 to 14113
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Sno                       14114 non-null  int64         
 1   Date                      14114 non-null  datetime64[ns]
 2   Time                      14114 non-null  object        
 3   State/UnionTerritory      14114 non-null  object        
 4   ConfirmedIndianNational   14114 non-null  object        
 5   ConfirmedForeignNational  14114 non-null  object        
 6   Cured                     14114 non-null  int64         
 7   Deaths                    14114 non-null  int64         
 8   Confirmed                 14114 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object(4)
memory usage: 992.5+ KB
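
The `info()` output above lists `ConfirmedIndianNational` and `ConfirmedForeignNational` as `object` columns, so `describe()` and numeric aggregations skip them. Below is a minimal sketch of one way to convert such columns. It assumes the non-numeric entries are placeholder strings such as `'-'`; that is an assumption and should be checked against the real file first. The small frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample; the '-' placeholder is an assumption, not taken from the dataset.
sample = pd.DataFrame({
    "ConfirmedIndianNational": ["1", "2", "-"],
    "ConfirmedForeignNational": ["0", "-", "3"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = sample.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.isna().sum())  # how many entries could not be converted, per column
```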


In [206]:
covid_data.describe()

Unnamed: 0,Sno,Cured,Deaths,Confirmed
count,14114.0,14114.0,14114.0,14114.0
mean,7057.5,153438.4,2464.965212,167434.5
std,4074.505185,309245.5,6593.425888,336540.4
min,1.0,0.0,0.0,0.0
25%,3529.25,1216.0,10.0,2338.25
50%,7057.5,16764.5,316.0,20921.5
75%,10585.75,177004.8,1912.75,205206.8
max,14114.0,3268449.0,61911.0,4027827.0


## Statewise Analysis

In [207]:
state_wise = covid_data.groupby('State/UnionTerritory')[['Confirmed','Cured','Deaths']].sum().reset_index()
state_wise["Death_percentage"] = ((state_wise["Deaths"] / state_wise["Confirmed"]) * 100)
state_wise.style.background_gradient(cmap='magma')



Unnamed: 0,State/UnionTerritory,Confirmed,Cured,Deaths,Death_percentage
0,Andaman and Nicobar Islands,1156097,1089930,14729,1.274028
1,Andhra Pradesh,204630098,194086560,1676597,0.819331
2,Arunachal Pradesh,3565479,3319089,10760,0.301783
3,Assam,50864662,47588973,230274,0.452719
4,Bihar,59137203,56032256,323777,0.547501
5,Cases being reassigned to states,345565,0,0,0.0
6,Chandigarh,4460750,4067193,67213,1.506765
7,Chhattisgarh,59468765,53245886,687470,1.156019
8,Dadra and Nagar Haveli and Daman and Diu,860095,813537,578,0.067202
9,Daman & Diu,2,0,0,0.0
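
The `head()` output earlier suggests that `Confirmed`, `Cured` and `Deaths` are running totals per state (Kerala goes 1, 1, 2, 3, 3), so summing every daily row counts the same cases many times. Here is a minimal sketch of an alternative, assuming the counts really are cumulative: take each state's most recent row instead of the sum. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical cumulative counts for two states.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03",
                            "2020-03-02", "2020-03-03"]),
    "State/UnionTerritory": ["Kerala", "Kerala", "Kerala", "Delhi", "Delhi"],
    "Confirmed": [1, 3, 6, 2, 5],
    "Cured": [0, 1, 2, 0, 1],
    "Deaths": [0, 0, 1, 0, 0],
})

# Sort by date so that .last() picks the latest reported totals per state.
latest = (sample.sort_values("Date")
                .groupby("State/UnionTerritory")[["Confirmed", "Cured", "Deaths"]]
                .last()
                .reset_index())
latest["Death_percentage"] = latest["Deaths"] / latest["Confirmed"] * 100

print(latest)
```

With running totals, the latest row already is the state total, so the ranking and the death percentage are not weighted by how many days a state has been reporting.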


In [208]:
px.bar(x=state_wise.nlargest(10,"Confirmed")["State/UnionTerritory"],
       y = state_wise.nlargest(10,"Confirmed")["Confirmed"],
       color_discrete_sequence=px.colors.diverging.Picnic,
       title="Top 10 states with highest number of Confirmed cases")

In [209]:
px.bar(x=state_wise.nlargest(10,"Cured")["State/UnionTerritory"],
       y = state_wise.nlargest(10,"Cured")["Cured"],
       color_discrete_sequence=px.colors.sequential.Sunset,
       title="Top 10 states with highest number of Cured cases")

In [210]:
px.bar(x=state_wise.nlargest(10,"Deaths")["State/UnionTerritory"],
       y = state_wise.nlargest(10,"Deaths")["Deaths"],
       color_discrete_sequence=px.colors.diverging.curl,
       title="Top 10 states with highest number of Deaths")

In [211]:
px.bar(x=state_wise.nlargest(10,"Death_percentage")["State/UnionTerritory"],
       y = state_wise.nlargest(10,"Death_percentage")["Death_percentage"],
       color_discrete_sequence=px.colors.diverging.Portland,
       title="Top 10 states with highest Death percentage")
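
Each chart above calls `nlargest` twice, once for `x` and once for `y`. A small sketch of computing the top rows once and reusing them; `top_n` is a hypothetical helper name and the frame is made up for illustration.

```python
import pandas as pd

def top_n(frame, column, n=10):
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return frame.nlargest(n, column)

# Hypothetical state totals.
sample = pd.DataFrame({
    "State/UnionTerritory": ["A", "B", "C"],
    "Deaths": [5, 50, 20],
})

top = top_n(sample, "Deaths", n=2)
print(top)
```

The resulting frame can then be passed to `px.bar` as its first argument, with `x` and `y` given as column names, in the same way the monthwise charts do.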

## Monthwise Analysis

In [212]:
# Sum the count columns per calendar month ('M' groups by month end)
month_wise = (covid_data
              .groupby(pd.Grouper(key='Date', freq='M'))[['Confirmed', 'Cured', 'Deaths']]
              .sum()
              .reset_index())
month_wise.index.name = 'index'

month_wise["Death_percentage"] = ((month_wise["Deaths"] / month_wise["Confirmed"]) * 100)
month_wise.style.background_gradient(cmap='twilight_shifted')

Unnamed: 0_level_0,Date,Confirmed,Cured,Deaths,Death_percentage
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,2020-01-31 00:00:00,2,0,0,0.0
1,2020-02-29 00:00:00,86,0,0,0.0
2,2020-03-31 00:00:00,9687,808,202,2.085269
3,2020-04-30 00:00:00,422442,75443,13270,3.14126
4,2020-05-31 00:00:00,2938234,1133341,89834,3.057415
5,2020-06-30 00:00:00,10558374,5668946,319690,3.027834
6,2020-07-31 00:00:00,31726501,19980130,793511,2.501098
7,2020-08-31 00:00:00,80749620,58580895,1553468,1.923808
8,2020-09-30 00:00:00,149113758,118592934,2443374,1.638597
9,2020-10-31 00:00:00,226770312,198824412,3457615,1.524721
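
The monthly figures above add up every daily row. If the counts are running totals (an assumption, based on the `head()` output earlier), the increase within a month can instead be taken as the difference between consecutive month-end national totals. Below is a minimal sketch on made-up data. It also assumes every state reports on every date, which should be verified for the real file.

```python
import pandas as pd

# Hypothetical cumulative counts for two states on three dates.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-30", "2020-03-31", "2020-04-30"] * 2),
    "State/UnionTerritory": ["A"] * 3 + ["B"] * 3,
    "Confirmed": [8, 10, 50, 4, 5, 25],
})

# National running total on each date.
daily_total = sample.groupby("Date")["Confirmed"].sum()

# Last reported national total in each calendar month.
month_end = daily_total.groupby(daily_total.index.to_period("M")).last()

# Month-over-month increase; the first month keeps its own total.
monthly_increase = month_end.diff().fillna(month_end)

print(monthly_increase)
```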


In [213]:
fig = px.bar(month_wise, x='Date', y='Confirmed',
             hover_data=['Cured', 'Deaths'], color='Date',
             labels={'Date':'Date(monthwise)'}, height=400,
             title="Monthwise Increase in Confirmed cases")
fig.show()

In [214]:
fig = px.bar(month_wise, x='Date', y='Cured',
             hover_data=['Confirmed','Deaths'], color='Date',
             labels={'Date':'Date(monthwise)'},
             title="Monthwise Increase in Cured cases")
fig.show()

In [215]:
fig = px.bar(month_wise, x='Date', y='Deaths',
             hover_data=['Confirmed','Cured'], color='Date',
             labels={'Date':'Date(monthwise)'},
             title="Monthwise Increase in Deaths")
fig.show()

In [216]:
fig = px.bar(month_wise , 
             x='Date', 
             y='Death_percentage' ,
             hover_data=['Confirmed','Deaths'],color='Date',
             labels={'Death_percentage':'Death percentage'},
             title="Monthwise Death percentage")
fig.show()

## Exploratory Data Analysis of StatewiseTestingDetails Dataset

In [217]:
covid_testing = pd.read_csv("StatewiseTestingDetails.csv")
covid_testing['Date'] = covid_testing['Date'].astype('datetime64[ns]')
covid_testing.head()

Unnamed: 0,Date,State,TotalSamples,Negative,Positive
0,2020-04-17,Andaman and Nicobar Islands,1403.0,1210.0,12.0
1,2020-04-24,Andaman and Nicobar Islands,2679.0,,27.0
2,2020-04-27,Andaman and Nicobar Islands,2848.0,,33.0
3,2020-05-01,Andaman and Nicobar Islands,3754.0,,33.0
4,2020-04-02,Andhra Pradesh,1800.0,1175.0,132.0


In [218]:
covid_testing.shape

(926, 5)

In [219]:
covid_testing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 926 entries, 0 to 925
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          926 non-null    datetime64[ns]
 1   State         926 non-null    object        
 2   TotalSamples  926 non-null    float64       
 3   Negative      756 non-null    float64       
 4   Positive      918 non-null    float64       
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 36.3+ KB


In [220]:
covid_testing['Negative'] = covid_testing['TotalSamples'] - covid_testing['Positive']
covid_testing = covid_testing.dropna()
covid_testing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 918 entries, 0 to 925
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          918 non-null    datetime64[ns]
 1   State         918 non-null    object        
 2   TotalSamples  918 non-null    float64       
 3   Negative      918 non-null    float64       
 4   Positive      918 non-null    float64       
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 43.0+ KB
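
In the cell above, `Negative` is recomputed for every row, including rows where a value was already reported (in the first row of the data, 1210 becomes 1403 - 12 = 1391). Here is a minimal sketch of an alternative that fills only the missing values. The first three rows are copied from the `head()` output; the fourth row is hypothetical and added only to show a missing `Positive`.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "TotalSamples": [1403.0, 2679.0, 2848.0, 500.0],
    "Negative": [1210.0, np.nan, np.nan, 480.0],
    "Positive": [12.0, 27.0, 33.0, np.nan],  # last row is hypothetical
})

# Count missing values per column before changing anything.
missing_before = sample.isna().sum()
print(missing_before)

# Drop rows without a Positive count, then fill only the missing Negative values.
cleaned = sample.dropna(subset=["Positive"]).copy()
cleaned["Negative"] = cleaned["Negative"].fillna(
    cleaned["TotalSamples"] - cleaned["Positive"]
)

print(cleaned)
```

Whether to keep the reported `Negative` values or to recompute them everywhere is a design choice; this sketch only shows the first option.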


## Statewise Analysis

In [221]:
covid_testing_state = covid_testing.groupby('State')[['TotalSamples','Negative','Positive']].max().reset_index()
covid_testing_state["Positive_percentage"] = ((covid_testing_state["Positive"] / covid_testing_state["TotalSamples"]) * 100)
covid_testing_state.style.background_gradient(cmap='gist_earth_r')

Unnamed: 0,State,TotalSamples,Negative,Positive,Positive_percentage
0,Andaman and Nicobar Islands,3754.0,3721.0,33.0,0.879062
1,Andhra Pradesh,173735.0,171755.0,1980.0,1.139667
2,Arunachal Pradesh,1823.0,1821.0,2.0,0.109709
3,Assam,18002.0,17940.0,62.0,0.344406
4,Bihar,36053.0,35346.0,707.0,1.961002
5,Chandigarh,2142.0,1969.0,173.0,8.076564
6,Chhattisgarh,25282.0,25223.0,59.0,0.233368
7,Delhi,97678.0,90445.0,7233.0,7.404943
8,Goa,4848.0,4841.0,7.0,0.144389
9,Gujarat,113493.0,105298.0,8195.0,7.220710


In [222]:
px.bar(x=covid_testing_state.nlargest(10,"TotalSamples")["State"],
       y = covid_testing_state.nlargest(10,"TotalSamples")["TotalSamples"],
       labels={'y':'Total Samples','x':'State'},
       color_discrete_sequence=px.colors.sequential.haline,
       title="Top 10 states with highest number of Total Samples")

In [223]:
px.bar(x=covid_testing_state.nlargest(10,"Negative")["State"],
       y = covid_testing_state.nlargest(10,"Negative")["Negative"],
       labels={'y':'Total Negative cases','x':'State'},
       color_discrete_sequence=px.colors.sequential.turbid,
       title="Top 10 states with highest number of Negative cases")

In [224]:
px.bar(x=covid_testing_state.nlargest(10,"Positive")["State"],
       y = covid_testing_state.nlargest(10,"Positive")["Positive"],
       labels={'y':'Total Positive Cases','x':'State'},
       color_discrete_sequence=px.colors.sequential.solar,
       title="Top 10 states with highest number of Positive cases")

In [225]:
px.bar(x=covid_testing_state.nlargest(10,"Positive_percentage")["State"],
       y = covid_testing_state.nlargest(10,"Positive_percentage")["Positive_percentage"],
       labels={'y':'Positive Percentage','x':'State'},
       color_discrete_sequence=px.colors.sequential.Aggrnyl,
       title="Top 10 states with highest Positive percentage",
       height = 420)

### There is another dataset named "covid_vaccine_statewise" attached to this same project. I will upload a separate notebook analysing that data and link to it from this project.