## Analyzing a COVID-19 dataset from Italy with Pandas
The dataset provides daily COVID-19 counts for Italy: new cases, new deaths, and new tests. It contains 248 daily records, from December 2019 to September 2020.

Use `urlretrieve` to download the CSV file.

In [1]:
from urllib.request import urlretrieve

In [2]:
urlretrieve('https://gist.githubusercontent.com/aakashns/f6a004fa20c84fec53262f9a8bfee775/raw/f309558b1cf5103424cef58e2ecb8704dcd4d74c/italy-covid-daywise.csv', 'italy-covid-daywise.csv')

('italy-covid-daywise.csv', <http.client.HTTPMessage at 0x7dc1a5c08df0>)

Importing pandas

In [3]:
import pandas as pd

In [4]:
covid_df = pd.read_csv('italy-covid-daywise.csv')

In [5]:
type(covid_df)

In [6]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,


We can view some basic information about the data frame using the .info method.

In [7]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        248 non-null    object 
 1   new_cases   248 non-null    float64
 2   new_deaths  248 non-null    float64
 3   new_tests   135 non-null    float64
dtypes: float64(3), object(1)
memory usage: 7.9+ KB


The `.describe` method provides summary statistics for the numeric columns.

In [None]:
covid_df.describe()

Unnamed: 0,new_cases,new_deaths,new_tests
count,248.0,248.0,135.0
mean,1094.818548,143.133065,31699.674074
std,1554.508002,227.105538,11622.209757
min,-148.0,-31.0,7841.0
25%,123.0,3.0,25259.0
50%,342.0,17.0,29545.0
75%,1371.75,175.25,37711.0
max,6557.0,971.0,95273.0


Other attributes that provide insight into the data frame are:

*   `.columns` - get the list of column names.
*   `.shape` - get the number of rows and columns as a tuple.
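
As a small illustration of these two attributes, the sketch below builds a hypothetical three-row frame (`sample_df`, not the real dataset) with the same column names and rows copied from the output above.

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as covid_df
sample_df = pd.DataFrame({
    'date': ['2020-08-30', '2020-08-31', '2020-09-01'],
    'new_cases': [1444.0, 1365.0, 996.0],
    'new_deaths': [1.0, 4.0, 6.0],
    'new_tests': [53541.0, 42583.0, 54395.0],
})

column_names = list(sample_df.columns)  # column labels as a plain list
n_rows, n_cols = sample_df.shape        # (rows, columns) tuple

print(column_names)
print(n_rows, n_cols)
```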



## Analyzing data from data frames

Q1: What was the total number of reported cases and deaths related to COVID-19 in Italy?

In [8]:
# using the .sum() method
total_cases = covid_df.new_cases.sum()

In [9]:
total_cases

271515.0

In [10]:
total_deaths = covid_df.new_deaths.sum()

In [11]:
total_deaths

35497.0

Q2: What was the overall number of tests conducted?

In [12]:
total_tests = covid_df.new_tests.sum()

In [13]:
total_tests

4279456.0

Q3: What fraction of tests had positive results?

In [14]:
positive = total_cases / total_tests

In [15]:
positive

0.06344614829548428

In [16]:
print(f"{positive * 100:.2f}%")

6.34%
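
One caveat worth checking: `Series.sum()` skips missing values by default, and `new_tests` has only 135 non-null entries. `total_cases` therefore includes days for which no test count was reported, which may overstate the fraction. The sketch below uses a hypothetical four-row frame (`sample_df`, with rows copied from the output above) to show the effect and one possible way to restrict the ratio to days with reported tests.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the last day has no reported test count
sample_df = pd.DataFrame({
    'new_cases': [1444.0, 1365.0, 996.0, 975.0],
    'new_tests': [53541.0, 42583.0, 54395.0, np.nan],
})

# .sum() ignores NaN, so the last day's cases count but its tests do not
naive_rate = sample_df.new_cases.sum() / sample_df.new_tests.sum()

# Keep only the days where a test count was reported
reported = sample_df[sample_df.new_tests.notna()]
reported_rate = reported.new_cases.sum() / reported.new_tests.sum()

print(f"{naive_rate:.4f} vs {reported_rate:.4f}")
```

The same `notna()` filter could be applied to `covid_df` before summing, if a rate based only on days with test data is wanted.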


## Querying and sorting data


In [17]:
# days that had more than 1000 reported cases
high_new_cases = covid_df.new_cases > 1000

In [18]:
high_new_cases

Unnamed: 0,new_cases
0,False
1,False
2,False
3,False
4,False
...,...
243,True
244,True
245,False
246,False


Passing this boolean series as an index to the data frame returns only the rows where the value is True, that is, the days with more than 1000 reported cases. Rows where the value is False are skipped.

In [19]:
covid_df[high_new_cases]

Unnamed: 0,date,new_cases,new_deaths,new_tests
68,2020-03-08,1247.0,36.0,
69,2020-03-09,1492.0,133.0,
70,2020-03-10,1797.0,98.0,
72,2020-03-12,2313.0,196.0,
73,2020-03-13,2651.0,189.0,
...,...,...,...,...
241,2020-08-28,1409.0,5.0,65135.0
242,2020-08-29,1460.0,9.0,64294.0
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
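
The mechanics of boolean indexing are easier to see on a smaller scale. This sketch uses a hypothetical four-row frame (`sample_df`, with rows copied from the output above) rather than the full dataset.

```python
import pandas as pd

# Hypothetical miniature frame with two days above and two below 1000 cases
sample_df = pd.DataFrame({
    'date': ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02'],
    'new_cases': [1444.0, 1365.0, 996.0, 975.0],
})

mask = sample_df.new_cases > 1000   # boolean Series, one value per row
selected = sample_df[mask]          # keeps only the rows where mask is True

print(mask.tolist())
print(selected)
```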


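By default, pandas truncates long outputs. To show more rows, temporarily raise the `display.max_rows` option with `pd.option_context`:

In [None]:
from IPython.display import display
# the option applies only inside the with block
with pd.option_context('display.max_rows', 100):
    display(covid_df[covid_df.new_cases > 200])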

Unnamed: 0,date,new_cases,new_deaths,new_tests
59,2020-02-28,250.0,5.0,
60,2020-02-29,238.0,4.0,
61,2020-03-01,240.0,8.0,
62,2020-03-02,561.0,6.0,
63,2020-03-03,347.0,17.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,


Days on which the ratio of new cases to new tests was higher than the overall fraction `positive`:

In [None]:
high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests > positive]

In [None]:
high_ratio_df

Unnamed: 0,date,new_cases,new_deaths,new_tests
111,2020-04-20,3047.0,433.0,7841.0
112,2020-04-21,2256.0,454.0,28095.0
114,2020-04-23,3370.0,437.0,37083.0
116,2020-04-25,3021.0,420.0,38676.0
117,2020-04-26,2357.0,415.0,24113.0
118,2020-04-27,2324.0,260.0,26678.0
124,2020-05-03,1900.0,474.0,27047.0
128,2020-05-07,1444.0,369.0,13665.0


In [21]:
# setting a new column in the data frame
covid_df['positive'] = covid_df.new_cases / covid_df.new_tests

In [22]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,positive
0,2019-12-31,0.0,0.0,,
1,2020-01-01,0.0,0.0,,
2,2020-01-02,0.0,0.0,,
3,2020-01-03,0.0,0.0,,
4,2020-01-04,0.0,0.0,,
...,...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0,0.026970
244,2020-08-31,1365.0,4.0,42583.0,0.032055
245,2020-09-01,996.0,6.0,54395.0,0.018311
246,2020-09-02,975.0,8.0,,


Similarly, other new columns can be created:

covid_df['another_column'] = 10

To remove the column, use the `.drop` method, which returns a new data frame:

covid_df = covid_df.drop('another_column', axis=1)
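
A minimal sketch of the add-then-drop cycle on a hypothetical two-row frame (`sample_df`, with rows copied from the output above). It uses `drop(columns=...)`, which is equivalent to passing `axis=1`, and shows that the original frame is left unchanged unless the result is assigned back.

```python
import pandas as pd

# Hypothetical miniature frame
sample_df = pd.DataFrame({
    'new_cases': [1444.0, 1365.0],
    'new_tests': [53541.0, 42583.0],
})

# Element-wise division creates the new column row by row
sample_df['positive'] = sample_df.new_cases / sample_df.new_tests

# drop returns a new frame; sample_df still has the column
trimmed = sample_df.drop(columns=['positive'])

print(list(sample_df.columns))
print(list(trimmed.columns))
```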

The `.sample` method can be used to retrieve a random sample of rows from the data frame.

In [24]:
covid_df.sample(10)

Unnamed: 0,date,new_cases,new_deaths,new_tests,positive
180,2020-06-28,175.0,8.0,21183.0,0.008261
18,2020-01-18,0.0,0.0,,
168,2020-06-16,301.0,26.0,27762.0,0.010842
3,2020-01-03,0.0,0.0,,
176,2020-06-24,113.0,18.0,30237.0,0.003737
126,2020-05-05,1221.0,195.0,32211.0,0.037906
97,2020-04-06,4316.0,527.0,,
42,2020-02-11,0.0,0.0,,
191,2020-07-09,193.0,15.0,29947.0,0.006445
117,2020-04-26,2357.0,415.0,24113.0,0.097748


Other functions and methods that can be used to retrieve data:


*   `covid_df['new_cases']` - get a column as a Series using its name
*   `covid_df['new_cases'][243]` - retrieve a value from a Series using an index label
*   `covid_df.at[243, 'new_cases']` - get a single value
*   `covid_df.copy()` - create a deep copy of a data frame
*   `covid_df.loc[243]` - retrieve a row (or, with a slice, a range of rows) from the data frame
*   `.head()`, `.tail()` and `.sample()` - retrieve multiple rows
*   `covid_df.new_tests.first_valid_index()` - find the index of the first non-missing value in a column
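
The retrieval calls listed above can be tried together on a hypothetical three-row frame (`sample_df`, with rows copied from the outputs above and the original index labels kept). This is a sketch for experimentation, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that keeps the original index labels
sample_df = pd.DataFrame({
    'date': ['2020-01-03', '2020-08-30', '2020-08-31'],
    'new_cases': [0.0, 1444.0, 1365.0],
    'new_tests': [np.nan, 53541.0, 42583.0],
}, index=[3, 243, 244])

new_cases = sample_df['new_cases']         # column as a Series
by_label = new_cases[243]                  # value looked up by index label
single = sample_df.at[243, 'new_cases']    # fast scalar lookup
row = sample_df.loc[243]                   # one row as a Series
rows = sample_df.loc[243:244]              # label slice, both ends included
first_valid = sample_df.new_tests.first_valid_index()  # first non-NaN label

backup = sample_df.copy()                  # deep copy by default
backup.at[243, 'new_cases'] = 0.0          # does not change sample_df

print(by_label, single, first_valid)
```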







