# ii. Introduction to Python II

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gem-epidemics/practical-epidemics/blob/master/site/source/iddinf/ii-intro-to-python-2.ipynb)

**Date**: Monday Sept 9, 2024


## LEARNING OUTCOMES

* Be familiar with key Python libraries
* Be able to read data and plot epidemic curves

## Key packages

In this section we will demonstrate some key Python packages such as NumPy and pandas.

NumPy is a fundamental numerical computing library in Python, and pandas is used for data manipulation and analysis. The two work well together and make it straightforward to work with arrays and dataframes.

### NumPy

The convention is to `import numpy as np`, which creates a shorthand for accessing NumPy operations.

In [None]:
import numpy as np

Arrays are the core NumPy structure for storing and retrieving data. They can have any number of dimensions (n-dimensional).
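
As a minimal sketch of what "n-dimensional" means in practice, the snippet below builds a 1-D and a 2-D array and inspects their `ndim` and `shape` attributes. The variable names and values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration
daily_counts = np.array([2, 5, 9, 4])            # 1-D: one value per day
counts_by_region = np.array([[2, 5, 9, 4],
                             [1, 3, 6, 2]])      # 2-D: regions x days

# ndim is the number of dimensions, shape is the size along each one
print(daily_counts.ndim, daily_counts.shape)          # 1 (4,)
print(counts_by_region.ndim, counts_by_region.shape)  # 2 (2, 4)
```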

In [None]:
# We can define an array using a list:
infection_times = np.array([1.5, 3.2, 9.0, 4.0, 4.3])

# We can then access an element of the array
# remember in Python structures are zero indexed
print(infection_times[0])

1.5


In [None]:
# Let's define a 2-D array (a 3x3 matrix)
connectivity_matrix = np.array([[0, 1.5, 0.25],
                                [1.5, 0, 0.5],
                                [0.25, 0.5, 0]])

# We can access elements and rows/columns in the same way
print(connectivity_matrix[0,2])

0.25
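
The same bracket syntax extends to whole columns and to sub-blocks of the matrix. The sketch below redefines the matrix so it runs on its own; the names `last_column` and `top_left_block` are introduced here only for illustration.

```python
import numpy as np

connectivity_matrix = np.array([[0, 1.5, 0.25],
                                [1.5, 0, 0.5],
                                [0.25, 0.5, 0]])

# A colon selects everything along that axis, so this is the last column
last_column = connectivity_matrix[:, 2]
print(last_column)

# Ranges work on both axes at once: rows 0-1 and columns 0-1
top_left_block = connectivity_matrix[:2, :2]
print(top_left_block)

# This particular matrix is symmetric, so it equals its own transpose
print(np.array_equal(connectivity_matrix, connectivity_matrix.T))
```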


### Exercises

1. Print `infection_times` from the third element onwards.

2. Slice the 2-D array to print the first row only.

3. Define a float called `scale`, and multiply it with the `connectivity_matrix`.

4. Create an array of length `x` using `np.arange()` and use the `reshape` method to create matrices of different dimensions.

### Solutions


In [None]:
# 1. Print infection_times from the third element onwards.
print('solution 1: ', infection_times[2:])

# 2. Slice the 2-D array to print the first row only.
print('solution 2: ', connectivity_matrix[0, :])

# 3. Define a float called scale. Print the output of scale multiplied with the connectivity_matrix.
scale = 1.5
print('solution 3: \n', scale * connectivity_matrix)

# 4. Create an array of length x using `np.arange()` and use the reshape method to create matrices of different dimensions.
# Here x = 18, reshaped into 6 rows and 3 columns.
arr = np.arange(18).reshape(6, 3)
print('solution 4: \n', arr)

solution 1:  [9.  4.  4.3]
solution 2:  [0.   1.5  0.25]
solution 3: 
 [[0.    2.25  0.375]
 [2.25  0.    0.75 ]
 [0.375 0.75  0.   ]]
solution 4: 
 [[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]]
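
As a follow-up sketch for the reshape exercise, assuming the same 18-element array: passing `-1` lets NumPy infer one dimension, three sizes give a 3-D array, and a shape that does not account for all 18 elements raises a `ValueError`.

```python
import numpy as np

arr = np.arange(18)

# -1 asks NumPy to infer that dimension from the array's length
print(arr.reshape(3, -1).shape)     # (3, 6)

# Three sizes give a 3-D array
print(arr.reshape(2, 3, 3).shape)   # (2, 3, 3)

# A shape whose sizes do not multiply to 18 raises a ValueError
try:
    arr.reshape(4, 5)
except ValueError as err:
    print('reshape failed:', err)
```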


### Pandas

By convention, pandas is imported as `pd`, although you could choose a different alias, or none at all, if you prefer.

In [None]:
import pandas as pd

# Defining a dataframe with two columns
cases = pd.DataFrame({'total_cases': [3500, 2000, 15000, 1500, 42000],
                      'location': ['Norwich', 'Birmingham', 'Exeter', 'Leicester', 'Liverpool']})

print(cases)

   total_cases    location
0         3500     Norwich
1         2000  Birmingham
2        15000      Exeter
3         1500   Leicester
4        42000   Liverpool


In [None]:
# We can check the shape of the dataframe - how many rows and columns it has
print(cases.shape)

(5, 2)


In [None]:
# We can access the columns by name
print(cases['location'])

0       Norwich
1    Birmingham
2        Exeter
3     Leicester
4     Liverpool
Name: location, dtype: object


In [None]:
# Alternatively we can use .iloc to select by integer position (here: all rows, column 1)
print(cases.iloc[:,1])

0       Norwich
1    Birmingham
2        Exeter
3     Leicester
4     Liverpool
Name: location, dtype: object
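
To make the difference between position-based and label-based access concrete, here is a small sketch comparing `.iloc` with `.loc`. It rebuilds the `cases` dataframe so that it runs on its own.

```python
import pandas as pd

cases = pd.DataFrame({'total_cases': [3500, 2000, 15000, 1500, 42000],
                      'location': ['Norwich', 'Birmingham', 'Exeter', 'Leicester', 'Liverpool']})

# .iloc is position-based: row position 2, column position 1
print(cases.iloc[2, 1])                          # Exeter

# .loc is label-based: row label 2, column name 'location'
print(cases.loc[2, 'location'])                  # Exeter

# Slices differ: .iloc excludes the end position, .loc includes the end label
print(cases.iloc[0:2, 0].tolist())               # [3500, 2000]
print(cases.loc[0:2, 'total_cases'].tolist())    # [3500, 2000, 15000]
```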


In [None]:
# We can create a new column in our dataframe called alert. Here we set all values to a boolean False.
cases['alert'] = False
print(cases)

   total_cases    location  alert
0         3500     Norwich  False
1         2000  Birmingham  False
2        15000      Exeter  False
3         1500   Leicester  False
4        42000   Liverpool  False


In [None]:
# We can set the values of alert from a condition on the case count
# in each location. With a threshold of 50, every location is flagged.

cases['alert'] = cases['total_cases'] > 50
print(cases)

   total_cases    location  alert
0         3500     Norwich   True
1         2000  Birmingham   True
2        15000      Exeter   True
3         1500   Leicester   True
4        42000   Liverpool   True
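
A boolean column like `alert` can also be used as a mask to filter rows. The sketch below assumes a hypothetical threshold of 10000, chosen only so that some locations fall on either side of it.

```python
import pandas as pd

cases = pd.DataFrame({'total_cases': [3500, 2000, 15000, 1500, 42000],
                      'location': ['Norwich', 'Birmingham', 'Exeter', 'Leicester', 'Liverpool']})

threshold = 10000  # hypothetical alert threshold, for illustration only
cases['alert'] = cases['total_cases'] > threshold

# Indexing with a boolean column keeps only the rows where it is True
high_burden = cases[cases['alert']]
print(high_burden['location'].tolist())   # ['Exeter', 'Liverpool']

# True counts as 1, so summing a boolean column counts the flagged rows
print(cases['alert'].sum())               # 2
```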


### Exercises
1. Create a new dataframe with two columns. A location column which matches that of `cases['location']`, and a population size column.

2. Merge your new dataframe with `cases` on the column location. You could use `pd.merge`.

3. Create a new column in your combined data frame which defines proportion of cases to population size. Round this column to 3 decimal places.

### Solutions

In [None]:
# 1. Create a new dataframe with two columns. A location column which matches that of `cases['location']`, and a population size column.
pops = pd.DataFrame({'location': ['Norwich', 'Birmingham', 'Exeter', 'Leicester', 'Liverpool'],
                     'pop_size': [143900, 1.146e6, 129307, 1357394, 496784]})
print('solution 1: \n', pops)

# 2. Merge your new dataframe with `cases` on the column location.
cases_pop = pd.merge(cases, pops, on = 'location')
print('solution 2: \n', cases_pop)

# 3. Create a new column in your combined data frame which defines proportion of cases to population size, rounded to 3 decimal places.
cases_pop['prop_cases_to_pop'] = (cases_pop['total_cases']/cases_pop['pop_size']).round(3)
print('solution 3: \n', cases_pop)

solution 1: 
      location   pop_size
0     Norwich   143900.0
1  Birmingham  1146000.0
2      Exeter   129307.0
3   Leicester  1357394.0
4   Liverpool   496784.0
solution 2: 
    total_cases    location  alert   pop_size
0         3500     Norwich   True   143900.0
1         2000  Birmingham   True  1146000.0
2        15000      Exeter   True   129307.0
3         1500   Leicester   True  1357394.0
4        42000   Liverpool   True   496784.0
solution 3: 
    total_cases    location  alert   pop_size  prop_cases_to_pop
0         3500     Norwich   True   143900.0              0.024
1         2000  Birmingham   True  1146000.0              0.002
2        15000      Exeter   True   129307.0              0.116
3         1500   Leicester   True  1357394.0              0.001
4        42000   Liverpool   True   496784.0              0.085
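
In the solution above every location appears in both dataframes, so the default inner join keeps all rows. The sketch below uses a deliberately incomplete, hypothetical population table to show what happens otherwise, and how `how='left'` with `indicator=True` makes the unmatched rows visible.

```python
import pandas as pd

cases = pd.DataFrame({'location': ['Norwich', 'Exeter', 'Liverpool'],
                      'total_cases': [3500, 15000, 42000]})

# Hypothetical population table with Liverpool deliberately missing
pops = pd.DataFrame({'location': ['Norwich', 'Exeter'],
                     'pop_size': [143900, 129307]})

# The default how='inner' silently drops locations without a match
inner = pd.merge(cases, pops, on='location')
print(len(inner))   # 2

# how='left' keeps every row of cases; indicator=True adds a _merge column
left = pd.merge(cases, pops, on='location', how='left', indicator=True)
print(left[['location', 'pop_size', '_merge']])
```

Checking the `_merge` column before computing proportions is one way to catch locations whose population size is missing.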


### Reading and writing data
pandas is also commonly used for reading and writing data. More pandas functions can be found here: https://pandas.pydata.org/docs/reference/general_functions.html#
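
As a minimal sketch of a write-then-read round trip, the snippet below saves a small dataframe to CSV with `to_csv` and loads it back with `pd.read_csv`. The file name `cases.csv` and the temporary directory are assumptions made for this example; they are not the outbreak dataset used in the practical.

```python
import os
import tempfile

import pandas as pd

cases = pd.DataFrame({'total_cases': [3500, 2000, 15000, 1500, 42000],
                      'location': ['Norwich', 'Birmingham', 'Exeter', 'Leicester', 'Liverpool']})

# Write to a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'cases.csv')   # hypothetical file name
    cases.to_csv(csv_path, index=False)             # index=False omits the row labels
    cases_from_file = pd.read_csv(csv_path)

print(cases_from_file.head())
```

`pd.read_csv` also accepts a URL in place of a local path, which is one way a dataset hosted online could be loaded.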

In [None]:
# Let's import an outbreak dataset from a GitHub repo.


## Plotting

Two of the most widely used plotting libraries in Python are `Matplotlib` and `Seaborn`, the latter being built on top of Matplotlib. In this section we will focus on Matplotlib.


* Import epidemic linelist
* Plot
* Subplots
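
The three steps above can be sketched with a small synthetic linelist. The data, the column name `onset_date`, and the axis labels are all hypothetical stand-ins; a real linelist would be read from a file instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical linelist: one row per case, with a symptom onset date
linelist = pd.DataFrame({'onset_date': pd.to_datetime([
    '2024-09-01', '2024-09-01', '2024-09-02', '2024-09-03',
    '2024-09-03', '2024-09-03', '2024-09-04', '2024-09-06'])})

# Count cases per day; asfreq adds the days with no cases as zeros
daily_cases = linelist.groupby('onset_date').size().asfreq('D', fill_value=0)

# Two stacked subplots sharing the date axis
fig, (ax_daily, ax_cumulative) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))

# Top panel: epidemic curve as a bar chart of daily counts
ax_daily.bar(daily_cases.index, daily_cases.values, width=0.9)
ax_daily.set_ylabel('New cases')
ax_daily.set_title('Epidemic curve (synthetic data)')

# Bottom panel: cumulative case count
ax_cumulative.step(daily_cases.index, daily_cases.cumsum().values, where='post')
ax_cumulative.set_ylabel('Cumulative cases')
ax_cumulative.set_xlabel('Date of symptom onset')

fig.autofmt_xdate()   # tilt the date labels so they do not overlap
fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a standalone script, call `plt.show()` or `fig.savefig(...)` afterwards.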
