# Biomedical Data Bases, 2021-2022
### Pandas examples
These are notes by prof. Davide Salomoni (d.salomoni@unibo.it) for the Biomedical Data Base course at the University of Bologna, academic year 2021-2022.

## Read CSV data into a Pandas data frame
Import pandas, then use _read_csv()_ to create the data frame and print its columns.
<br>
Click on Run in Jupyter to execute the cell.

In [None]:
import pandas as pd
df = pd.read_csv('COVID-19-sample-BDB2022.csv')

In [None]:
type(df)

In [None]:
df.head()

In [None]:
print(df.columns)

In [None]:
# check how many rows and columns we have
shape = df.shape
print(shape)

In [None]:
# how many elements are there in total?
# (df.size returns the same number directly)
print(shape[0] * shape[1])
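
The relationship between `shape` and `size` can be checked on a tiny frame. The sketch below uses a made-up three-row dataframe (`toy` is illustrative data, not the course file):

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    'country': ['Italy', 'Italy', 'France'],
    'indicator': ['cases', 'deaths', 'cases'],
    'weekly_count': [10, 1, 7],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 3
print(toy.size)              # 9, i.e. rows * columns
```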

## Create a convenience function to map a date to a week number

In [None]:
import datetime
def week_string(year, month, day):
    ''' return a week number in the format yyyy-ww; for example,
    2021-45 for the 45th week of the year 2021. The year is the ISO
    year of the week, which can differ from the calendar year for
    dates close to January 1. '''
    iso = datetime.date(year, month, day).isocalendar()
    return "%d-%02d" % (iso[0], iso[1])

# example: find the week string for March 1, 2020
print(week_string(2020, 3, 1))

# example: find the week string for November 30, 2021
print(week_string(2021, 11, 30))
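
Keep in mind that the week number returned by `isocalendar()` belongs to the ISO year, which may differ from the calendar year in the first and last days of a year. A short sketch of that edge case, using only the standard library:

```python
import datetime

# January 1, 2021 was a Friday and falls in the last ISO week of 2020
iso = datetime.date(2021, 1, 1).isocalendar()
new_year = "%d-%02d" % (iso[0], iso[1])
print(new_year)    # 2020-53

# for a mid-year date the calendar year and the ISO year coincide
iso = datetime.date(2021, 11, 30).isocalendar()
late_nov = "%d-%02d" % (iso[0], iso[1])
print(late_nov)    # 2021-48
```

Pairing the week number with the ISO year (the first element of the tuple) avoids labels such as 2021-53 for a date that actually belongs to week 53 of 2020.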

## Find cases where 'country' is Italy from March 2020 to November 2021

In [None]:
# we import the time module so we can use time.time()
# to find out how long it takes to come to the result
import time

# convert the start and end date to the corresponding week numbers
start_week = week_string(2020, 3, 1)
end_week = week_string(2021, 11, 30)

### First attempt: the "brute-force" way, using iterrows()

In [None]:
start_time = time.time()

# create a dictionary with key = week string, and value = number of cases in that week
it_cases = dict()
for index,row in df.iterrows():
    country = row['country']
    if country == 'Italy':
        indicator = row['indicator']
        if indicator != 'cases':
            continue
        week = row['year_week']
        if (week >= start_week) and (week <= end_week):
            cases = row['weekly_count']
            it_cases[week] = cases

# create a new dataframe out of the it_cases dictionary. It will contain
# only the cases that occurred in Italy between March 2020 and November 2021.
df2 = pd.DataFrame(list(it_cases.items()), columns=['week', 'cases'])

end_time = time.time()

print('The brute-force method took %.2f seconds' % (end_time-start_time))

In [None]:
# the resulting dataframe only has the two columns week and cases
df2.head()

In [None]:
# plot the cases, with the week string on the x-axis
df2.plot(x='week', y='cases')

### Second attempt: the "pandas-native" way, using df.query()

In [None]:
start_time = time.time()

# create a new dataframe using df.query(). It will contain
# only the cases that occurred in Italy between March 2020 and November 2021.
df3 = df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))

end_time = time.time()

print('The Pandas-native method took %.2f seconds' % (end_time-start_time))

In [None]:
# the resulting dataframe is simply a filtered version of the original dataframe
df3.head()

In [None]:
# plot the cases (select the "weekly_count" column only)
df3.plot(y='weekly_count')
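
The same selection can also be written with boolean masks instead of a query string. The following is a minimal sketch on invented data (the `toy` frame and the two week bounds are assumptions for illustration; only the column names follow the course file):

```python
import pandas as pd

# toy data, invented for illustration; column names follow the course file
toy = pd.DataFrame({
    'country':      ['Italy', 'Italy', 'Italy', 'France', 'Italy'],
    'indicator':    ['cases', 'cases', 'deaths', 'cases', 'cases'],
    'year_week':    ['2020-08', '2020-10', '2020-10', '2020-10', '2021-50'],
    'weekly_count': [3, 50, 2, 40, 900],
})
start_week, end_week = '2020-09', '2021-48'

# build one boolean mask per condition and combine them with &
mask = (
    (toy['country'] == 'Italy')
    & (toy['indicator'] == 'cases')
    & toy['year_week'].between(start_week, end_week)  # inclusive on both ends
)
selected = toy[mask]
print(selected)

# the equivalent query string, for comparison
same = toy.query(
    'country == "Italy" and indicator == "cases" '
    'and year_week >= "%s" and year_week <= "%s"' % (start_week, end_week)
)
print(selected.equals(same))
```

Comparing week strings works here because the yyyy-ww format is zero-padded, so alphabetical order matches chronological order.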

## Reading an Excel file

### Remember that you need to have the openpyxl library installed

In [None]:
! pip install openpyxl

In [None]:
# create a dataframe from the excel file
df = pd.read_excel('COVID-19-sample-BDB2022.xlsx')
df3 = df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))
df3.plot(y='weekly_count')

In [None]:
# Note that reading from an excel file is FAR slower than reading from a CSV file:

start_time = time.time()
df_csv = pd.read_csv('COVID-19-sample-BDB2022.csv')
end_time = time.time()
print('Reading the CSV file took %.2f seconds' % (end_time-start_time))

start_time = time.time()
df_excel = pd.read_excel('COVID-19-sample-BDB2022.xlsx')
end_time = time.time()
print('Reading the excel file took %.2f seconds' % (end_time-start_time))
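
The start/stop pattern above can be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical function name, `time.perf_counter()` is swapped in for `time.time()` because it is meant for measuring short intervals, and the CSV text is invented and held in memory so that the example does not depend on the course files.

```python
import io
import time
import pandas as pd

def timed(func, *args, **kwargs):
    ''' hypothetical helper: call func and return (result, elapsed seconds) '''
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# toy CSV text held in memory, invented for illustration
csv_text = "country,weekly_count\nItaly,10\nFrance,7\n"
df_toy, elapsed = timed(pd.read_csv, io.StringIO(csv_text))
print('Reading the toy CSV took %.4f seconds' % elapsed)
```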

## Examples of a few common pandas functions

### sum()
Compute the sum of all the deaths that are recorded in the COVID-19 DataFrame for Italy:
1. Create a new dataframe containing only the records where the country is Italy and the indicator is deaths.
2. Call sum() for the weekly_count column on that dataframe.
3. Verify that you obtained the right number, checking that it is equal to cumulative_count as reported in the last row of the dataframe.

In [None]:
df_italy = df.query('country=="Italy" and indicator=="deaths"')
df_italy['weekly_count'].sum()

In [None]:
df_italy.tail(1)
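
Why the check in step 3 works can be seen on a toy series. The sketch assumes that cumulative_count is the running total of weekly_count, with rows sorted by week and no weeks missing; the numbers are invented.

```python
import pandas as pd

# toy weekly counts, invented for illustration
toy = pd.DataFrame({'weekly_count': [2, 5, 0, 3]})

# assumption: cumulative_count is the running total of weekly_count
toy['cumulative_count'] = toy['weekly_count'].cumsum()

total = toy['weekly_count'].sum()
last_cumulative = toy['cumulative_count'].iloc[-1]
print(total, last_cumulative)   # 10 10
```

If the two numbers differ on real data, the usual suspects are unsorted rows, missing weeks, or corrections applied to the cumulative column only.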

### describe()
Generate simple statistics of the df_italy dataframe.

In [None]:
df_italy.describe()

### nunique()
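How many countries are recorded in the COVID-19 dataframe? Calling nunique() on the whole dataframe returns the number of distinct values in every column; the answer is the value reported for 'country'.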

In [None]:
df.nunique()
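
To get a single number rather than one count per column, nunique() can be applied to one column only. A small sketch on invented data:

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    'country':      ['Italy', 'Italy', 'France', 'Spain'],
    'continent':    ['Europe', 'Europe', 'Europe', 'Europe'],
    'weekly_count': [10, 12, 7, 7],
})

per_column = toy.nunique()            # a Series: one count per column
n_countries = toy['country'].nunique()  # a single integer
print(per_column)
print(n_countries)                    # 3
```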

### groupby()
Group the COVID-19 dataframe by continent and compute the sum of the columns. Note that summing some of the columns makes no logical sense; think about which ones and why. Then plot weekly_count.

In [None]:
df_grouped = df.groupby('continent')
# by itself, groupby does not return a dataframe but a DataFrameGroupBy object,
# to which you should apply some function. Check the type of the returned object:
type(df_grouped)

In [None]:
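# now sum the 'grouped' dataframe; numeric_only=True restricts the sum
# to the numeric columns, since adding up text columns is meaningless
df_grouped.sum(numeric_only=True)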

In [None]:
# plot weekly_count
# use a bar plot, set the y-axis label and the title
# (note: weekly_count is summed over all indicators here, cases and deaths alike)
df_grouped['weekly_count'].sum().plot(kind='bar', ylabel='Total weekly_count', title='COVID-19 weekly counts grouped by continent')
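
Because weekly_count holds both cases and deaths, a per-continent total of cases alone needs a filter on the indicator before grouping. A minimal sketch on invented data (only the column names follow the course file):

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    'continent':    ['Europe', 'Europe', 'Europe', 'Oceania', 'Oceania'],
    'country':      ['Italy', 'Italy', 'France', 'Australia', 'Australia'],
    'indicator':    ['cases', 'deaths', 'cases', 'cases', 'deaths'],
    'weekly_count': [100, 4, 80, 30, 1],
})

# summing weekly_count over all rows mixes cases and deaths
mixed = toy.groupby('continent')['weekly_count'].sum()
print(mixed)        # Europe 184, Oceania 31

# filter on the indicator first to obtain cases only
cases_only = (
    toy[toy['indicator'] == 'cases']
    .groupby('continent')['weekly_count']
    .sum()
)
print(cases_only)   # Europe 180, Oceania 30
```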

Now look at the rows that belong to the group 'Oceania', using get_group().

In [None]:
df_grouped.get_group('Oceania')

How many unique countries are there? Use nunique() on the 'country' column to find out.

In [None]:
df_grouped.get_group('Oceania')['country'].nunique()

Which countries are in the Oceania group? Use unique() (without the 'n') on the 'country' column to find out.

In [None]:
df_grouped.get_group('Oceania')['country'].unique()

In [None]:
# of course in the array above there is exactly the number of countries reported by nunique():
print(len(df_grouped.get_group('Oceania')['country'].unique()))

In [None]:
# or, explicitly verify that the two numbers are the same
print(
    len(df_grouped.get_group('Oceania')['country'].unique()) == 
    df_grouped.get_group('Oceania')['country'].nunique()
)
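
The number of distinct countries can also be obtained for every continent in one step, by selecting the 'country' column on the grouped object before calling nunique(). A small sketch on invented data:

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    'continent': ['Oceania', 'Oceania', 'Oceania', 'Europe'],
    'country':   ['Australia', 'Fiji', 'Australia', 'Italy'],
})

# one count of distinct countries per continent
countries_per_continent = toy.groupby('continent')['country'].nunique()
print(countries_per_continent)   # Europe 1, Oceania 2

# the distinct names for a single group
oceania = toy.groupby('continent').get_group('Oceania')['country'].unique()
print(list(oceania))             # ['Australia', 'Fiji']
```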

## Exporting to Excel

In [None]:
# export all Italian data to a new Excel file using the to_excel() method
df_italy = pd.read_excel('COVID-19-sample-BDB2022.xlsx').query('country == "Italy"')
df_italy.to_excel('COVID-19-italy-only.xlsx')

## Reading a CSV file from a remote location

In [None]:
df = pd.read_csv('https://github.com/dsalomoni/bdb-2022/raw/main/covid/COVID-19-sample-BDB2022.csv')
df_deaths_italy = df.query('country == "Italy" and indicator == "deaths"')
df_deaths_italy['weekly_count'].sum()