In [5]:
from urllib.request import urlretrieve

In [8]:
urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/italy-covid-daywise.csv', 
            'italy-covid-daywise.csv')

('italy-covid-daywise.csv', <http.client.HTTPMessage at 0x7419250>)

In [9]:
import pandas as pd

In [10]:
covid_df = pd.read_csv("italy-covid-daywise.csv")

In [11]:
type(covid_df)

pandas.core.frame.DataFrame

In [12]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,


Here's what we can tell by looking at the data frame:
   * The file provides three daywise counts for Covid-19 in Italy
   * The metrics reported are new cases, new deaths and new tests
   * Data is provided for 248 days: from Dec 31, 2019 to Sep 3, 2020
   
Keep in mind that these are officially reported numbers, and the actual number of cases & deaths may be higher, as not all cases are diagnosed.

We can view some basic information about the data frame using the `.info` method.

In [13]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        248 non-null    object 
 1   new_cases   248 non-null    float64
 2   new_deaths  248 non-null    float64
 3   new_tests   135 non-null    float64
dtypes: float64(3), object(1)
memory usage: 7.9+ KB


It appears that each column contains values of a specific data type. For the numeric columns, you can view some statistical information like the mean, standard deviation, minimum/maximum values and number of non-empty values using the `.describe` method.

In [24]:
covid_df.describe()

Unnamed: 0,new_cases,new_deaths,new_tests
count,248.0,248.0,135.0
mean,1094.818548,143.133065,31699.674074
std,1554.508002,227.105538,11622.209757
min,-148.0,-31.0,7841.0
25%,123.0,3.0,25259.0
50%,342.0,17.0,29545.0
75%,1371.75,175.25,37711.0
max,6557.0,971.0,95273.0
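
To see why `count` differs between columns in the summary above, here is a minimal sketch on a hypothetical three-row data frame (the values in `toy_df` are made up for illustration and are not taken from the dataset). It suggests that `.describe` counts only non-missing values and computes the mean over them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in new_tests
toy_df = pd.DataFrame({
    "new_cases": [10.0, 20.0, 30.0],
    "new_tests": [100.0, np.nan, 300.0],
})

summary = toy_df.describe()

# count reports non-missing values only; the mean is computed over them
cases_count = summary.loc["count", "new_cases"]
tests_count = summary.loc["count", "new_tests"]
tests_mean = summary.loc["mean", "new_tests"]
print(cases_count, tests_count, tests_mean)
```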


The `columns` property contains the list of columns within the data frame.

In [22]:
covid_df.columns

Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')

You can also retrieve the number of rows and columns in the data frame using the `.shape` property.

In [20]:
covid_df.shape

(248, 4)
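
Since `shape` is a plain tuple, it can be unpacked directly. A small sketch, using a hypothetical `toy_df` rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy frame with 3 rows and 2 columns
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is accessed without parentheses and unpacks into (rows, columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# len() on a data frame gives the number of rows
row_count = len(toy_df)
```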

In [25]:
import jovian

In [26]:
jovian.commit(project="python-pandas-data-analysis")

[jovian] Attempting to save notebook..
[jovian] Please enter your API key ( from https://jovian.ml/ ):
API KEY: ········
[jovian] Creating a new project "ankit114211/python-pandas-data-analysis"
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/ankit114211/python-pandas-data-analysis


'https://jovian.ml/ankit114211/python-pandas-data-analysis'

In [27]:
covid_df['new_cases']

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
        ...  
243    1444.0
244    1365.0
245     996.0
246     975.0
247    1326.0
Name: new_cases, Length: 248, dtype: float64

Each column is represented using a data structure called `Series`, which is essentially a numpy array with some extra methods and properties.

In [29]:
type(covid_df)

pandas.core.frame.DataFrame

In [30]:
type(covid_df['new_cases'])

pandas.core.series.Series
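
As a rough illustration of the relationship between a series and a numpy array, the sketch below builds a small series by hand (the name `cases` is hypothetical; the two values and their labels mirror the last two rows shown above) and extracts the underlying array with `to_numpy`.

```python
import numpy as np
import pandas as pd

# A hand-built series: values plus an index (row labels) and a name
cases = pd.Series([975.0, 1326.0], index=[246, 247], name="new_cases")

# The underlying values are available as a plain numpy array
values = cases.to_numpy()
print(type(values), values)

# The "extras" a series adds on top of the array
print(cases.index.tolist(), cases.name)
```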

Just like arrays, you can retrieve a specific value within a series using the indexing notation `[]`.

In [31]:
covid_df['new_cases'][246]

975.0

In [32]:
covid_df['new_tests'][240]

57640.0

Pandas also provides the `.at` accessor to directly retrieve the value at a specific row & column.



In [33]:
covid_df.at[246, 'new_cases']

975.0

In [34]:
covid_df.at[240,'new_tests']

57640.0
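
The two ways of reading a single value can be compared side by side. This is a sketch on a hypothetical two-row frame whose values and row labels mirror rows 240 and 246 of the dataset:

```python
import pandas as pd

# Toy frame reusing two rows seen above, with their original row labels
toy_df = pd.DataFrame(
    {"new_cases": [1366.0, 975.0], "new_tests": [57640.0, float("nan")]},
    index=[240, 246],
)

# Column first, then row label
via_brackets = toy_df["new_tests"][240]

# Row label and column name in a single lookup
via_at = toy_df.at[240, "new_tests"]

print(via_brackets, via_at)
```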

Instead of using the indexing notation `[]`, Pandas also allows accessing columns as properties of the data frame using the `.` notation. However, this method only works for columns whose names do not contain spaces or special characters.

In [35]:
covid_df.new_cases

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
        ...  
243    1444.0
244    1365.0
245     996.0
246     975.0
247    1326.0
Name: new_cases, Length: 248, dtype: float64

In [36]:
covid_df.new_cases[246]

975.0
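
The limits of the `.` notation are easier to see with a hypothetical data frame whose column names are deliberately awkward. Besides spaces and special characters, a column whose name matches an existing data frame method (here `count`) is also unreachable via `.`, because the method takes precedence. The `[]` notation works in every case.

```python
import pandas as pd

# Hypothetical columns: one name contains a space, one clashes with a method
toy_df = pd.DataFrame({"new cases": [1, 2], "count": [3, 4]})

# toy_df.new cases would be a syntax error, so [] is the only option
spaced = toy_df["new cases"]

# toy_df.count refers to the DataFrame.count method, not the column
dot_gives_method = callable(toy_df.count)

# [] still returns the column
count_col = toy_df["count"]
print(dot_gives_method, count_col.tolist())
```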

Further, you can also pass a list of columns within the indexing notation `[]` to access a subset of the data frame with just the given columns.



In [37]:
cases_df = covid_df[['date','new_cases']]

In [38]:
cases_df

Unnamed: 0,date,new_cases
0,2019-12-31,0.0
1,2020-01-01,0.0
2,2020-01-02,0.0
3,2020-01-03,0.0
4,2020-01-04,0.0
...,...,...
243,2020-08-30,1444.0
244,2020-08-31,1365.0
245,2020-09-01,996.0
246,2020-09-02,975.0


Note, however, that you should not assume `cases_df` is a "view" of the original data frame `covid_df`, i.e. that both point to the same data in the computer's memory and that changing values inside one also changes the other. Selecting a list of columns gives you a new data frame, and, depending on the Pandas version, assigning into it may emit a `SettingWithCopyWarning` instead of updating `covid_df`. Pandas does avoid copying data internally where it safely can, which keeps many operations fast, but whether a given result shares memory with its source depends on the operation and the Pandas version.

When you need a copy that is guaranteed to be independent of the original, use the `.copy` method.

In [39]:
covid_df_copy = covid_df.copy()

The data within `covid_df_copy` is completely separate from `covid_df`, and changing values inside one of them will not affect the other.
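
A quick way to convince yourself of this is to modify the copy and check the original. The sketch below uses a hypothetical one-column frame (`original` and `duplicate` are illustrative names):

```python
import pandas as pd

original = pd.DataFrame({"new_cases": [1.0, 2.0, 3.0]})
duplicate = original.copy()

# Change a value in the copy only
duplicate.at[0, "new_cases"] = 99.0

# The original keeps its value
print(original.at[0, "new_cases"], duplicate.at[0, "new_cases"])
```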



To access a specific row of data, Pandas provides the `.loc` indexer.



In [40]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,


In [41]:
covid_df.loc[243]

date          2020-08-30
new_cases           1444
new_deaths             1
new_tests          53541
Name: 243, dtype: object

Each retrieved row is also a `Series` object.

In [42]:
type(covid_df.loc[243])

pandas.core.series.Series
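
Because a row is a series, its values can be looked up by column name. A minimal sketch, assuming a toy frame that reuses the values of rows 243 and 244 shown above:

```python
import pandas as pd

# Two rows copied from the output above, keeping their row labels
toy_df = pd.DataFrame(
    {"date": ["2020-08-30", "2020-08-31"], "new_cases": [1444.0, 1365.0]},
    index=[243, 244],
)

row = toy_df.loc[244]

# The row's index is the list of column names; its name is the row label
print(row.index.tolist(), row.name)
print(row["date"], row["new_cases"])
```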

To view the first or last few rows of data, we can use the `.head` and `.tail` methods.

In [43]:
covid_df.head(6)

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
5,2020-01-05,0.0,0.0,


In [45]:
covid_df.tail(8)

Unnamed: 0,date,new_cases,new_deaths,new_tests
240,2020-08-27,1366.0,13.0,57640.0
241,2020-08-28,1409.0,5.0,65135.0
242,2020-08-29,1460.0,9.0,64294.0
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,
247,2020-09-03,1326.0,6.0,


Notice above that while the first few values in the `new_cases` and `new_deaths` columns are 0, the corresponding values within the `new_tests` column are `NaN`. That is because the CSV file does not contain any data for the `new_tests` column for certain dates (you can verify this by looking into the file). It's possible that these values are missing or unknown.

In [47]:
covid_df.at[0,'new_tests']

nan

In [48]:
type(covid_df.at[0,'new_tests'])

numpy.float64

The distinction between `0` and `NaN` is subtle but important. In this dataset, `NaN` indicates that daily test numbers were not reported on specific dates, not that zero tests were conducted. In fact, Italy started reporting daily tests on April 19, 2020. By that time, 935310 tests had already been conducted.
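
The practical difference between `0` and `NaN` shows up when counting and aggregating. The following sketch uses a hypothetical four-value series (two missing entries followed by the first two reported test counts) to show how missing values can be detected with `isna` and how `sum` skips them by default:

```python
import numpy as np
import pandas as pd

# Two unreported days followed by two reported days
tests = pd.Series([np.nan, np.nan, 7841.0, 28095.0])

# NaN never compares equal to anything, including itself,
# so use isna() rather than == to find missing values
nan_equals_itself = (np.nan == np.nan)
n_missing = int(tests.isna().sum())

# sum() skips NaN by default; count() counts only non-missing values
total = tests.sum()
n_reported = int(tests.count())

print(nan_equals_itself, n_missing, total, n_reported)
```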

We can find the first index that doesn't contain a `NaN` value using the `first_valid_index` method of a series.

In [49]:
covid_df.new_tests.first_valid_index()

111

Let's look at a few rows before and after this index to verify that the values indeed change from `NaN` to actual numbers. We can do this by passing a range to `loc`.

In [50]:
covid_df.loc[108:113]

Unnamed: 0,date,new_cases,new_deaths,new_tests
108,2020-04-17,3786.0,525.0,
109,2020-04-18,3493.0,575.0,
110,2020-04-19,3491.0,480.0,
111,2020-04-20,3047.0,433.0,7841.0
112,2020-04-21,2256.0,454.0,28095.0
113,2020-04-22,2729.0,534.0,44248.0
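
One detail worth noting is that a range passed to `loc` includes both endpoints, unlike ordinary Python slices. A small sketch on a hypothetical series that reuses the `new_tests` values and row labels 109 to 113 from the output above:

```python
import numpy as np
import pandas as pd

tests = pd.Series(
    [np.nan, np.nan, 7841.0, 28095.0, 44248.0],
    index=[109, 110, 111, 112, 113],
)

# Label of the first non-missing value
first = tests.first_valid_index()

# One row before and one row after; both end labels are included
window = tests.loc[first - 1 : first + 1]
print(first, window.index.tolist())
```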


The `.sample` method can be used to retrieve a random sample of rows from the data frame.



In [51]:
covid_df.sample(15)

Unnamed: 0,date,new_cases,new_deaths,new_tests
20,2020-01-20,0.0,0.0,
161,2020-06-09,280.0,65.0,32200.0
181,2020-06-29,174.0,22.0,15484.0
238,2020-08-25,953.0,4.0,45798.0
139,2020-05-18,675.0,145.0,26101.0
159,2020-06-07,270.0,72.0,27894.0
102,2020-04-11,3951.0,570.0,
141,2020-05-20,813.0,162.0,38617.0
104,2020-04-13,4092.0,431.0,
229,2020-08-16,629.0,158.0,22470.0


Notice that even though we have taken a random sample, the original index of each row has been preserved. This is an important and useful property of data frames: each row of data has an index associated with it.
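
The sketch below illustrates index preservation on a hypothetical ten-row frame in which each value equals its row label. It also passes `random_state`, an optional argument of `sample`, so that the same rows are drawn on every run; the seed value 42 is arbitrary.

```python
import pandas as pd

# Each value equals its own row label, so preserved labels are easy to check
toy_df = pd.DataFrame({"new_cases": range(10)})

picked = toy_df.sample(3, random_state=42)
again = toy_df.sample(3, random_state=42)

# Row labels travel with the rows, and the same seed gives the same sample
labels_match = picked["new_cases"].tolist() == picked.index.tolist()
same_sample = picked.equals(again)
print(labels_match, same_sample)
```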

Here's a summary of the functions & methods we looked at in this section:

   - `covid_df['new_cases']` - Retrieving columns as series using a column name
   
   - `covid_df['new_cases'][243]` - Retrieving values from a series using an index
   
   - `covid_df.at[243, 'new_cases']` - Retrieving a single value from a data frame
   
   - `covid_df.copy()` - Creating a deep copy of a data frame
   
   - `covid_df.loc[243]` - Retrieving a row or range of rows of data from the data frame
   
   - `head`, `tail` and `sample` - Retrieving multiple rows of data from the data frame
   
   - `covid_df.new_tests.first_valid_index()` - Finding the first non-empty index in a series

In [None]:
jovian.commit()

[jovian] Attempting to save notebook..
[jovian] Updating notebook "ankit114211/python-pandas-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
