# Data Analysis using Python and Pandas

#### Pandas is a flexible and easy-to-use data analysis library built on top of the Python programming language

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
# The dataset we are using has been downloaded from https://covid19.org
# This is time-series data representing covid19 status in India
# Read the CSV file using pandas

covid_df = pd.read_csv('./Data/covid19_time_series_India.csv')

##### Data from the file is read and stored in a 'DataFrame' object, one of the core structures in Pandas for storing and working with tabular data

In [3]:
type(covid_df)

pandas.core.frame.DataFrame

In [4]:
# Display the data
covid_df

| | Date | Date_YMD | Daily Confirmed | Total Confirmed | Daily Recovered | Total Recovered | Daily Deceased | Total Deceased |
|---|---|---|---|---|---|---|---|---|
| 0 | 30 January | 2020-01-30 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 31 January | 2020-01-31 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 01 February | 2020-02-01 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 02 February | 2020-02-02 | 1 | 2 | 0 | 0 | 0 | 0 |
| 4 | 03 February | 2020-02-03 | 1 | 3 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336 | 31 December | 2020-12-31 | 19026 | 10286234 | 21969 | 9881565 | 244 | 148427 |
| 337 | 01 January | 2021-01-01 | 20159 | 10306393 | 23838 | 9905403 | 237 | 148664 |
| 338 | 02 January | 2021-01-02 | 18144 | 10324537 | 20903 | 9926306 | 216 | 148880 |
| 339 | 03 January | 2021-01-03 | 16678 | 10341215 | 19658 | 9945964 | 215 | 149095 |


Here's what we can say about the data:
    - The file provides six daily metrics for Covid-19 in India
    - The metrics are: Daily Confirmed, Total Confirmed, Daily Recovered, Total Recovered, Daily Deceased and Total Deceased
    - Data is provided for 341 days, from 30th January 2020 to 4th January 2021

NOTE: These are officially provided numbers. Actual numbers may vary since not all cases are diagnosed / recorded
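
The column names suggest that each 'Total' column is the running sum of its 'Daily' counterpart. The sketch below checks this on the first five rows shown above, loaded from an in-memory string instead of the CSV file. The names `sample_csv`, `sample_df`, `running_total` and `totals_match` are illustrative, and the check says nothing about rows not included in the excerpt.

```python
import io

import pandas as pd

# First five rows as displayed above; a stand-in for the full CSV file.
sample_csv = """Date,Date_YMD,Daily Confirmed,Total Confirmed,Daily Recovered,Total Recovered,Daily Deceased,Total Deceased
30 January,2020-01-30,1,1,0,0,0,0
31 January,2020-01-31,0,1,0,0,0,0
01 February,2020-02-01,0,1,0,0,0,0
02 February,2020-02-02,1,2,0,0,0,0
03 February,2020-02-03,1,3,0,0,0,0
"""
sample_df = pd.read_csv(io.StringIO(sample_csv))

# A running total should equal the cumulative sum of the daily values.
running_total = sample_df['Daily Confirmed'].cumsum()
totals_match = bool((running_total == sample_df['Total Confirmed']).all())
print(totals_match)
```

The same comparison could be run on `covid_df` for each Daily / Total pair.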

In [5]:
# View basic information about the data frame
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341 entries, 0 to 340
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Date             341 non-null    object
 1   Date_YMD         341 non-null    object
 2   Daily Confirmed  341 non-null    int64 
 3   Total Confirmed  341 non-null    int64 
 4   Daily Recovered  341 non-null    int64 
 5   Total Recovered  341 non-null    int64 
 6   Daily Deceased   341 non-null    int64 
 7   Total Deceased   341 non-null    int64 
dtypes: int64(6), object(2)
memory usage: 21.4+ KB
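
The output above lists 'Date_YMD' with the generic 'object' dtype, meaning the dates were read as plain strings. One possible follow-up, sketched here on a few rows from the table shown earlier, is to convert that column with `pd.to_datetime` so that date arithmetic becomes available. The names `sample_df` and `span_days` are illustrative.

```python
import io

import pandas as pd

# A few rows copied from the displayed table; a stand-in for the full file.
sample_csv = """Date,Date_YMD,Daily Confirmed,Total Confirmed
30 January,2020-01-30,1,1
31 January,2020-01-31,0,1
01 February,2020-02-01,0,1
02 February,2020-02-02,1,2
03 February,2020-02-03,1,3
"""
sample_df = pd.read_csv(io.StringIO(sample_csv))

# Parse the ISO-formatted strings into real datetime values.
sample_df['Date_YMD'] = pd.to_datetime(sample_df['Date_YMD'], format='%Y-%m-%d')

# With datetimes, the number of days covered can be computed directly.
span_days = (sample_df['Date_YMD'].max() - sample_df['Date_YMD'].min()).days + 1
print(sample_df['Date_YMD'].dtype, span_days)
```

Applied to `covid_df`, the same span calculation would be a quick way to confirm the number of days covered.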


We see that each column contains values of a specific data type. Summary statistics for the numerical columns
(count of non-null values, mean, standard deviation, min / max values and quartiles) can be viewed using the 'describe' method

In [6]:
covid_df.describe()

| | Daily Confirmed | Total Confirmed | Daily Recovered | Total Recovered | Daily Deceased | Total Deceased |
|---|---|---|---|---|---|---|
| count | 341.0 | 341.0 | 341.0 | 341.0 | 341.0 | 341.0 |
| mean | 30373.879765 | 3300570.0 | 29252.706745 | 2905230.0 | 437.815249 | 51925.797654 |
| std | 29393.037881 | 3791066.0 | 29373.625649 | 3556720.0 | 383.847271 | 54722.137416 |
| min | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1580.0 | 24448.0 | 580.0 | 5496.0 | 53.0 | 781.0 |
| 50% | 22718.0 | 1077873.0 | 20977.0 | 677662.0 | 394.0 | 26830.0 |
| 75% | 50488.0 | 7119309.0 | 54133.0 | 6146402.0 | 710.0 | 108598.0 |
| max | 97860.0 | 10357490.0 | 102070.0 | 9975173.0 | 2004.0 | 149295.0 |
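
The 'count' row produced by `describe()` is the number of non-null values in each column. To count missing values directly, `isna().sum()` can be used alongside it. The sketch below uses a small made-up frame (`demo_df`, with one deliberately missing entry) rather than the Covid data, which has no missing values according to the `info()` output.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; one entry is deliberately missing.
demo_df = pd.DataFrame({
    'Daily Confirmed': [1.0, 0.0, np.nan, 1.0, 1.0],
    'Daily Deceased': [0, 0, 0, 0, 0],
})

# describe() counts only the non-null entries in each column.
non_null_counts = demo_df.describe().loc['count']

# isna().sum() counts the missing entries in each column.
missing_counts = demo_df.isna().sum()

print(non_null_counts)
print(missing_counts)
```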


We can view the labels of all columns in the data frame using the 'columns' attribute

In [7]:
covid_df.columns

Index(['Date', 'Date_YMD', 'Daily Confirmed', 'Total Confirmed',
       'Daily Recovered', 'Total Recovered', 'Daily Deceased',
       'Total Deceased'],
      dtype='object')
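
The column labels can also be used programmatically, for example to pick out a subset of columns by name. This is a minimal sketch on a small made-up frame (`demo_df`) that reuses three of the column names listed above. The filter on the 'Daily' prefix is just one illustrative choice.

```python
import pandas as pd

# Hypothetical two-row frame reusing column names from the dataset.
demo_df = pd.DataFrame({
    'Date_YMD': ['2020-01-30', '2020-01-31'],
    'Daily Confirmed': [1, 0],
    'Total Confirmed': [1, 1],
})

# .columns is an Index; list() turns it into a plain Python list.
column_names = list(demo_df.columns)

# Keep only the columns whose label starts with 'Daily'.
daily_cols = [name for name in demo_df.columns if name.startswith('Daily')]
daily_df = demo_df[daily_cols]

print(column_names)
print(daily_df)
```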

The number of rows and columns can be viewed using the 'shape' attribute, which returns a (rows, columns) tuple

In [8]:
covid_df.shape

(341, 8)
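
Since `shape` is an ordinary tuple, it can be unpacked into separate row and column counts. A minimal sketch on a small made-up frame (`demo_df`, `n_rows` and `n_cols` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
demo_df = pd.DataFrame({
    'Daily Confirmed': [1, 0, 0],
    'Total Confirmed': [1, 1, 1],
})

# shape is accessed without parentheses and unpacks like any tuple.
n_rows, n_cols = demo_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the row count as well.
same_row_count = len(demo_df) == n_rows
```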

Here's a summary of the methods and tools we have looked at so far:

    - pd.read_csv(): reads data from a CSV file into a pandas DataFrame object
    - .info(): gives basic information on rows, columns and data types
    - .describe(): shows statistical information for numerical columns
    - .columns: gives the column labels
    - .shape: gives the number of rows and columns
