<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>


# Python Packages
In the last notebook we saw how to write our own functions to manipulate data. That's helpful when we have common manipulations we need to perform multiple times. But we don't want to write functions for every single data analysis task. Instead, we can leverage code written by other people in the Python community.

**Packages** (also called **libraries**) are collections of functions and other Python code, *packaged* together and distributed for the open-source community to use. Packages contain smaller sets of functionality called **modules**, and we may sometimes use the two terms interchangeably here. In this notebook, we'll look at a few of the most important packages in Python for data science:
- `numpy`
- `pandas`
- `matplotlib` and `seaborn`

## Importing packages
To use a package in Python, we first have to *import* it. The syntax for this is:

```python
import package_name
```

Without getting too technical, this loads the library into your namespace, and you can then access its contents (functions, classes, etc.) as:

```python
package_name.function_name()
```
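
As a concrete illustration, here is a small sketch using `math`, a module from Python's standard library. It was chosen only because it is always available; it is not one of the packages this notebook focuses on.

```python
import math  # standard-library module, used here only as an example

# Functions and constants are accessed through the module name
root = math.sqrt(16)
circle_area = math.pi * 2 ** 2  # area of a circle with radius 2

print(root)         # 4.0
print(circle_area)
```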

---
#### Comparison with R

This is the Python equivalent of R's `library(...)` function. A major difference between how R and Python import packages is that in R, calling `library(...)` loads all of the package's contents directly into your namespace. For example, after calling `library(tidyverse)`, which loads `dplyr` among other packages, you can directly access functions like `group_by`:

```r
# R
library(tidyverse)
df <- read_csv(...) %>% # read_csv, %>% loaded from tidyverse packages
    group_by(subject_id) %>% # group_by is from dplyr
    summarise(n = n()) # n() counts the rows in each group
```

In Python, the functions and classes are underneath the higher-level `module` object:

```python
# python
import pandas
df = pandas.read_csv(...) # read_csv needs to be accessed through `pandas.read_csv`
df = df.groupby("subject_id").agg({"dt": len}) # The other methods are accessed through the dataframe object
```

---

Instead of importing a module, we can import all of the contents of the module as:

```python
from package_name import *
```

The asterisk "\*" means: "Import everything that is defined in `package_name`." So if you did this in the example above, you could run:


```python
from pandas import *
df = read_csv(...) # Now we can just call read_csv directly
```

This is typically not recommended, but can sometimes be useful. In fact, we've already seen an example of this in our previous notebooks. Most of the notebooks today have had the following line of code:

In [1]:
from helpers import *

This line of code imports all of the contents of a local module called `helpers` that I created to implement quizzes and helper functions. You can see the code I wrote by opening up the file `helpers.py` in the `module_1` directory.
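
One reason `from ... import *` is usually discouraged is that imported names can silently replace names you already rely on. The sketch below illustrates this with the standard-library `math` module (an example chosen for illustration, not something the `helpers` module does): it lists the public names in `math` that share a name with a Python built-in, without actually performing the wildcard import.

```python
import builtins
import math

# Public names that `from math import *` would bring into the namespace
math_names = {name for name in dir(math) if not name.startswith("_")}

# Names that would replace a built-in of the same name
shadowed = math_names & set(dir(builtins))
print(shadowed)

# The built-in and the math version behave differently
print(builtins.pow(2, 3))  # 8 (an int)
print(math.pow(2, 3))      # 8.0 (always a float)
```

Importing the module itself, or only the specific names you need, keeps it clear where each function comes from.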

#### TODO
Open up the file `module_1/helpers.py`. Copy and paste the code from line 4 of this file into the quiz below.

In [2]:
FreeTextTest(answer="from quizzes.module_1_quizzes import *")

What does the line of code from `helpers.py` do?

In [3]:
MultipleChoiceQuiz(answer="Imports the contents of module_1_quizzes", options=["Imports a module called module_1_quizzes",
                                                                                "It is there for decoration."])



## Specific packages
Let's get comfortable using some popular Python packages.

## `numpy`
`numpy` is arguably the foundational Python package for data science. It implements arrays and matrices and allows us to perform mathematical operations on them. It also implements important functions for calculating summary statistics from our data.

First, we import the library. Then we can access functions in the library using `numpy.function_name`.

In [4]:
import numpy

`numpy` has a `mean` function, so we don't need to write our own function to calculate averages. Here's how to use numpy to calculate the mean of a list we saw in a previous notebook.

In [5]:
a = [4, 0, 2, 2, 0, 10, 7, 8, 5, 0]
numpy.mean(a)

3.8

Sometimes we use a library so much we don't want to type its whole name every time. So for some libraries (including numpy), it's standard practice to assign an **alias** by saying:

```python
import package_name as alias
```

Then we use `alias.function_name` in our code. 

For `numpy`, we typically refer to it as `np`:

In [6]:
import numpy as np
np.mean(a)

3.8

In [7]:
np.std(a)

3.4292856398964493

If you only need to use one particular function in a library, instead of importing the entire package you can just import the function you want (again with or without an alias):

```python
from package_name import function_name
from package_name import function_name as alias
```

#### TODO
Import the `std` function from numpy and assign it an alias of `sd` so it matches what we're used to in R.

In [8]:
from numpy import std as sd

In [9]:
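def validate_np_sd(func):
    import numpy as np  # import with the alias used below
    try:
        # The imported function should be the very same object as np.std
        assert func is np.std
        print("That is correct!")
    except AssertionError:
        print("That is incorrect. Check your sd function.")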


In [10]:
test_np_sd = FunctionTest(validation_func=validate_np_sd)
test_np_sd.test(sd)

That is correct!


In [11]:
sd(a)

3.4292856398964493

In [12]:
sd is np.std

True
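
One caveat about this alias: matching the *name* of R's `sd` does not make the numbers match. By default `np.std` divides by `n` (the population standard deviation), whereas R's `sd` divides by `n - 1` (the sample standard deviation). A minimal sketch of getting the sample version through numpy's `ddof` argument:

```python
import numpy as np

a = [4, 0, 2, 2, 0, 10, 7, 8, 5, 0]

population_sd = np.std(a)        # default ddof=0: divide by n
sample_sd = np.std(a, ddof=1)    # ddof=1: divide by n - 1, like R's sd()

print(population_sd)
print(sample_sd)  # slightly larger than population_sd
```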

### Arrays and matrices
One of the most important contributions of `numpy` is its implementation of `arrays`. Arrays are similar to lists, but with a lot of important extensions:

- They have mathematical operations defined on them that allow for easy and efficient data manipulation
- They can have multiple **axes** (i.e., rows and columns), allowing us to build matrices

In [13]:
a_arr = np.array(a)
a_arr

array([ 4,  0,  2,  2,  0, 10,  7,  8,  5,  0])

In [14]:
type(a_arr)

numpy.ndarray

We can calculate the sum, mean, and standard deviation of the array using the appropriate methods:

In [15]:
a_arr.sum()

38

In [16]:
# You can also still use python's built-in function
sum(a)

38

In [17]:
a_arr.mean()

3.8

In [18]:
a_arr.std()

3.4292856398964493

We can also perform element-wise operations on the array (that is, apply a mathematical operation to each individual element of the array and get a new array with the results):

In [19]:
# Square the elements
a_arr ** 2

array([ 16,   0,   4,   4,   0, 100,  49,  64,  25,   0])

In [20]:
# Divide the elements by 3
a_arr / 3

array([1.33333333, 0.        , 0.66666667, 0.66666667, 0.        ,
       3.33333333, 2.33333333, 2.66666667, 1.66666667, 0.        ])
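
To see what element-wise operations save us, here is a short sketch comparing the same calculation on a plain list and on an array. The variable names `squared_list`, `squared_arr`, `doubled_list`, and `doubled_arr` are just illustrative.

```python
import numpy as np

a = [4, 0, 2, 2, 0, 10, 7, 8, 5, 0]
a_arr = np.array(a)

# With a list, we need a loop or comprehension to square each element
squared_list = [x ** 2 for x in a]

# With an array, the operation applies to every element at once
squared_arr = a_arr ** 2

# Note that `*` means different things: it repeats a list
# but multiplies the elements of an array
doubled_list = a[:2] * 2      # [4, 0, 4, 0]
doubled_arr = a_arr[:2] * 2   # array([8, 0])
```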

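We can see the dimensionality of an array by using the `array.shape` attribute. This tells us that `a_arr` is a one-dimensional array with 10 elements. The trailing `,` is just how Python writes a one-element tuple: the array has a single axis, so there is no separate column dimension: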

In [21]:
a_arr.shape

(10,)
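
To make the difference between one and two axes concrete, the sketch below uses the `ndim` attribute and the `reshape` method to turn the same 10 values into an explicit column with 10 rows and 1 column. The name `column` is only for illustration.

```python
import numpy as np

a_arr = np.array([4, 0, 2, 2, 0, 10, 7, 8, 5, 0])

print(a_arr.ndim)    # 1: a single axis
print(a_arr.shape)   # (10,)

# reshape returns a two-axis array with 10 rows and 1 column
column = a_arr.reshape(10, 1)
print(column.ndim)   # 2
print(column.shape)  # (10, 1)
```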

### Multiple arrays
Performing operations on multiple arrays is easier with numpy than with lists. For example, we can add or multiply the elements together:

In [22]:
b_arr = np.array([4, 1, 4, 3, 9, 2, 0, 0, 5, 4])

In [23]:
a_arr + b_arr

array([ 8,  1,  6,  5,  9, 12,  7,  8, 10,  4])

In [24]:
a_arr * b_arr

array([16,  0,  8,  6,  0, 20,  0,  0, 25,  0])

Or compute their dot product:

In [25]:
a_arr.dot(b_arr)

75

However, this will raise an error if the shapes of the arrays are not compatible:

In [26]:
a_arr + np.array([1, 2])

ValueError: operands could not be broadcast together with shapes (10,) (2,) 
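
The word "broadcast" in that message refers to numpy's rules for combining arrays of different shapes. A minimal sketch of cases that do and do not work (the variable names are illustrative):

```python
import numpy as np

a_arr = np.array([4, 0, 2, 2, 0, 10, 7, 8, 5, 0])

# A single number is applied to every element
plus_one = a_arr + 1

# Arrays of the same shape are combined element by element
same_shape = a_arr + np.ones(10, dtype=int)

# Shapes (10,) and (2,) cannot be matched up, so numpy raises a ValueError
try:
    a_arr + np.array([1, 2])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```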

### Matrices
A [**matrix**](https://en.wikipedia.org/wiki/Matrix_(mathematics)) is an array with multiple rows or columns. If we have two arrays, we can create a matrix by calling:

`np.array([array1, array2, ...])`

The code below creates a matrix using `a_arr` and `b_arr`:

In [27]:
matrix = np.array([a_arr, b_arr])

#### TODO
What shape does our new variable `matrix` have?

In [28]:
MultipleChoiceQuiz(answer="(2, 10)", options=["(10, 2)", "(2,)"])

In [29]:
matrix.shape

(2, 10)

What data type is `matrix`?

In [30]:
MultipleChoiceQuiz(answer="numpy.ndarray", options=["numpy.matrix", "list", "tuple"])

In [31]:
type(matrix)

numpy.ndarray

We can switch the rows and the columns by calling `matrix.transpose()`:

In [32]:
matrix.transpose()

array([[ 4,  4],
       [ 0,  1],
       [ 2,  4],
       [ 2,  3],
       [ 0,  9],
       [10,  2],
       [ 7,  0],
       [ 8,  0],
       [ 5,  5],
       [ 0,  4]])
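
numpy arrays also offer `.T` as a shorthand for `.transpose()`. The sketch below rebuilds the same matrix from its values and checks a few properties of the transpose; the variable names are illustrative.

```python
import numpy as np

matrix = np.array([
    [4, 0, 2, 2, 0, 10, 7, 8, 5, 0],
    [4, 1, 4, 3, 9, 2, 0, 0, 5, 4],
])

# `.T` is an attribute that gives the same result as `.transpose()`
same_result = np.array_equal(matrix.T, matrix.transpose())

# Transposing twice gets us back to the original matrix
round_trip = np.array_equal(matrix.T.T, matrix)

# The first row of the transpose is the first column of the original
first_row_of_transpose = matrix.T[0]
print(first_row_of_transpose)
```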

#### TODO
What shape does `matrix.transpose()` have?

In [33]:
MultipleChoiceQuiz(answer="(10, 2)", options=["(2, 10)", "(10,)"])

### Indexing arrays
For single-axis arrays, we can index arrays the same way we did with lists:

In [34]:
a_arr[0]

4

In [35]:
a_arr[0:3]

array([4, 0, 2])

In [36]:
a_arr[:-1]

array([ 4,  0,  2,  2,  0, 10,  7,  8,  5])

When we have two-axis matrices, we can slice along the rows *and* columns we want. If you just need to slice along the rows, you can index just like you have been:

In [88]:
matrix[0]

array([ 4,  0,  2,  2,  0, 10,  7,  8,  5,  0])

We can slice along the columns by putting a comma in between the row and column indices.

So, for example, to get the value in the first row and first column:

In [89]:
matrix[0,0]

4

To get *all* the rows and the first column, we put a colon in the row index and then pass in the column index:

In [90]:
matrix[:,0]

array([4, 4])
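
Row and column indices can each be a single position or a slice, so the two can be mixed freely. A few more combinations, sketched on the same values as `matrix` (the result names are illustrative):

```python
import numpy as np

matrix = np.array([
    [4, 0, 2, 2, 0, 10, 7, 8, 5, 0],
    [4, 1, 4, 3, 9, 2, 0, 0, 5, 4],
])

# Second row, columns at positions 2, 3, and 4
second_row_slice = matrix[1, 2:5]

# All rows, last column
last_column = matrix[:, -1]

# All rows, columns at positions 1 and 2 (the result is still two-axis)
middle_columns = matrix[:, 1:3]
```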

#### TODO
What code would give you only the first *row* and all the columns? Select all that apply.

In [39]:
test_matrix_first_row_all_cols = SelectMultipleQuiz(answer=["matrix[0 , :]", "matrix[0, 0: ]"], options=["matrix[0][:]", "matrix[0, 0: ]", "matrix[0:]"], shuffle_answer=False)
test_matrix_first_row_all_cols

## Pandas
The second package we'll use *a lot* over the next few days is **pandas**. This library shares a lot in common with numpy but implements an important class of objects called a `DataFrame`. If you've used R before, lots of the pandas library should feel familiar.

But first, we'll need to learn about **installing packages**.

### Installing Python packages
So far, all of the packages we've used have been available by default. But there are many, many other Python packages out there. Many of them you can install with a single line of code.

Depending on the installation of Python that you have, `pandas` may not have been included in your default build. If so, you would get the following error when attempting to import pandas:
```
ModuleNotFoundError: No module named 'pandas'
```

Whenever you get that error in Python, you probably need to install the package.

You can install a package by running the following command:

```
pip install package_name
```

You can either run this in your terminal ("Terminal" on Mac, "Anaconda Prompt" on Windows) or in a cell in a Jupyter notebook with `!` in front of it.
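
If you want to check whether a package is already available before installing it, one possible approach uses the standard library's `importlib`. The helper name `is_installed` below is hypothetical, written just for this sketch.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


print(is_installed("json"))                    # True: part of the standard library
print(is_installed("not_a_real_package_xyz"))  # False, assuming no such package exists
```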

#### TODO
Run the cell below to install `pandas`.

In [44]:
!pip install pandas



Now you can import pandas (with an alias):

In [45]:
import pandas as pd

### Dataframes
A DataFrame is similar to a matrix in that it is composed of `rows` and `columns`. But DataFrames have some nice additional features:
- Rows and columns can have names
- Advanced indexing options
- Useful methods for analyzing and visualizing data

As an example of a DataFrame, the function `load_pt_roster` will return the data about our emergency room waiting list from our previous notebook. Let's call it `df` (short for "dataframe"):

In [109]:
def load_pt_roster():
    import pandas as pd
    return pd.read_csv("../data/ed_patient_poster.csv")

In [110]:
df = load_pt_roster()
df

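|   | name | arrival_time | age | severity |
|---|------|--------------|-----|----------|
| 0 | Jim | 6:00 | 40 | 40 |
| 1 | Mary | 6:30 | 31 | 10 |
| 2 | Rachel | 7:00 | 27 | 20 |
| 3 | Laura | 7:30 | 38 | 15 |
| 4 | Chloe | 8:00 | 25 | 50 |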


Indexing DataFrames is similar to how we indexed in numpy but with a few differences. The first is that indexing is typically done using the dataframe's `iloc` attribute rather than the dataframe itself:

```python
# Correct
df.iloc[start:end]

# Will sometimes work but could throw an error 
# depending on the structure of your dataframe
df[start:end]
```
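
One case where plain brackets fail is a single integer: `df[0]` looks for a *column* named `0` rather than the first row. A minimal sketch, using a small hypothetical DataFrame called `roster` built from a few of the values in this notebook:

```python
import pandas as pd

roster = pd.DataFrame({
    "name": ["Jim", "Mary", "Rachel"],
    "age": [40, 31, 27],
})

# iloc always selects by position
first_row = roster.iloc[0]
first_two = roster.iloc[0:2]

# Plain brackets with a single integer look for a column named 0,
# which does not exist here, so pandas raises a KeyError
try:
    roster[0]
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```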

#### TODO
Select the first 3 rows of `df`.

In [111]:
df.iloc[0:3]

|   | name | arrival_time | age | severity |
|---|------|--------------|-----|----------|
| 0 | Jim | 6:00 | 40 | 40 |
| 1 | Mary | 6:30 | 31 | 10 |
| 2 | Rachel | 7:00 | 27 | 20 |


#### TODO
Select the final row of `df`.

In [112]:
df.iloc[-1]

name            Chloe
arrival_time     8:00
age                25
severity           50
Name: 4, dtype: object

Did you notice any difference between the two returned values from above?

When we slice multiple rows, the resulting object is a `DataFrame`. But when we access a single row, we get a `Series`, which is analogous to a column vector or 1-axis numpy array. For example, the code below gives us a `Series` that contains only the first row of data.

In [113]:
type(df.iloc[0])

pandas.core.series.Series

In [114]:
type(df.iloc[1:])

pandas.core.frame.DataFrame
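
If you want a single row but still need a `DataFrame`, one option is to pass a list containing one position to `iloc`. A short sketch with a hypothetical two-row DataFrame named `roster`:

```python
import pandas as pd

roster = pd.DataFrame({"name": ["Jim", "Mary"], "age": [40, 31]})

as_series = roster.iloc[0]     # single position -> Series
as_frame = roster.iloc[[0]]    # list with one position -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (1, 2)
```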

#### Columns
We access columns either numerically (like in numpy) or by passing in the name(s) of the columns we want to access. 

To get multiple columns by name we pass in a list of strings:

In [115]:
df[["name", "arrival_time"]]

|   | name | arrival_time |
|---|------|--------------|
| 0 | Jim | 6:00 |
| 1 | Mary | 6:30 |
| 2 | Rachel | 7:00 |
| 3 | Laura | 7:30 |
| 4 | Chloe | 8:00 |


In [116]:
# Numeric slicing by column
df.iloc[:,:2]

|   | name | arrival_time |
|---|------|--------------|
| 0 | Jim | 6:00 |
| 1 | Mary | 6:30 |
| 2 | Rachel | 7:00 |
| 3 | Laura | 7:30 |
| 4 | Chloe | 8:00 |


Just like rows, if we access a single column it returns a `Series`. To get this, you pass in just the string rather than a list of strings:

In [117]:
df["name"]

0       Jim
1      Mary
2    Rachel
3     Laura
4     Chloe
Name: name, dtype: object

#### TODO
Pull out the column `severity` as a Series. Assign it to `severity`.

In [119]:
severity = df["severity"]
severity

0    40
1    10
2    20
3    15
4    50
Name: severity, dtype: int64

### Dataframe and Series methods

Just like how numpy had methods like `array.mean()`, DataFrames and Series have methods associated with them for calculations.

For example, we can get the min, max, mean, standard deviation, and median of a column:

In [125]:
print("Min severity:", severity.min())
print("Max severity:", severity.max())
print("Mean severity:", severity.mean())
print("Standard deviation of severity:", severity.std())
print("Median severity:", severity.median())

Min severity: 10
Max severity: 50
Mean severity: 27.0
Standard deviation of severity: 17.175564037317667
Median severity: 20.0


Or we could do that for multiple columns in a DataFrame:

In [126]:
df[["age", "severity"]].max()

age         40
severity    50
dtype: int64
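
To compute several statistics for several columns in one call, one option is the DataFrame's `agg` method with a list of statistic names. The sketch below rebuilds the roster values by hand so that it runs on its own; `summary` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jim", "Mary", "Rachel", "Laura", "Chloe"],
    "age": [40, 31, 27, 38, 25],
    "severity": [40, 10, 20, 15, 50],
})

# One row per statistic, one column per variable
summary = df[["age", "severity"]].agg(["min", "max", "mean"])
print(summary)
```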

A nice way to get a quick overview of your data is the `describe()` method, which calculates summary statistics for all of the numeric columns in your DataFrame/Series:

In [128]:
df.describe()

|   | age | severity |
|---|-----|----------|
| count | 5.0 | 5.0 |
| mean | 32.2 | 27.0 |
| std | 6.610598 | 17.175564 |
| min | 25.0 | 10.0 |
| 25% | 27.0 | 15.0 |
| 50% | 31.0 | 20.0 |
| 75% | 38.0 | 40.0 |
| max | 40.0 | 50.0 |


In [129]:
severity.describe()

count     5.000000
mean     27.000000
std      17.175564
min      10.000000
25%      15.000000
50%      20.000000
75%      40.000000
max      50.000000
Name: severity, dtype: float64

### Filtering data in `pandas`
We often wish to do analyses on a specific subset of data. To do this we need to filter the data. There are several ways to do this with `pandas`. One nice and simple way is to use the DataFrame's `query` method, where you pass in a string containing an expression that will be used to filter your data, such as:

```python
df.query("column == value")
```

As an example, the code below filters the data to only the rows where `name == 'Jim'`:

In [135]:
df.query("name == 'Jim'")

|   | name | arrival_time | age | severity |
|---|------|--------------|-----|----------|
| 0 | Jim | 6:00 | 40 | 40 |


And this code looks for patients who are 30 or younger:

In [136]:
df.query("age <= 30")

|   | name | arrival_time | age | severity |
|---|------|--------------|-----|----------|
| 2 | Rachel | 7:00 | 27 | 20 |
| 4 | Chloe | 8:00 | 25 | 50 |


#### TODO
Filter the dataset to patients who have a severity score of 30 or higher.

In [137]:
df.query("severity >= 30")

|   | name | arrival_time | age | severity |
|---|------|--------------|-----|----------|
| 0 | Jim | 6:00 | 40 | 40 |
| 4 | Chloe | 8:00 | 25 | 50 |
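
Since `query` is only one of several ways to filter, here is a sketch of two related patterns: a boolean mask that gives the same rows, and a query string that combines conditions or refers to a Python variable with `@`. The DataFrame is rebuilt by hand from the roster values so the example runs on its own; `cutoff` and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jim", "Mary", "Rachel", "Laura", "Chloe"],
    "age": [40, 31, 27, 38, 25],
    "severity": [40, 10, 20, 15, 50],
})

# Boolean mask: selects the same rows as df.query("severity >= 30")
high_severity = df[df["severity"] >= 30]

# Conditions can be combined inside a query string with `and` / `or`
young_and_severe = df.query("age <= 30 and severity >= 30")

# Prefix a Python variable with @ to use it inside the query string
cutoff = 30
same_as_mask = df.query("severity >= @cutoff")
```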


### Reading data with `pandas`
So far most of our examples have used small, mostly synthetic datasets. But eventually we'll want to work with real clinical data. To do this we need to be able to load data into Python to manipulate and analyze.

`pandas` offers lots of helpful functions for doing this. Here are a few functions for reading in data from different sources. Each takes a different type of data input and returns a DataFrame.

- `pd.read_csv`: Comma-separated files
- `pd.read_excel`: Excel workbooks
- `pd.read_sql`: SQL databases (more on this tomorrow)
- `pd.read_json`: JSON files

Today we'll use some datasets kept in `csv` files, and tomorrow we'll move on to SQL databases.
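
To try `pd.read_csv` without needing a file on disk, the sketch below reads CSV text from an in-memory buffer. The text is a hand-typed copy of two roster rows, and `csv_text` and `roster` are illustrative names.

```python
import io

import pandas as pd

# A few rows of CSV text standing in for a file on disk
csv_text = """name,arrival_time,age,severity
Jim,6:00,40,40
Mary,6:30,31,10
"""

# read_csv accepts a file path or any file-like object
roster = pd.read_csv(io.StringIO(csv_text))
print(roster.dtypes)  # age and severity are parsed as integers
```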

The helper function we used earlier actually just called `pd.read_csv` to load our patient roster:


In [138]:
pd.read_csv("../data/ed_patient_poster.csv")

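|   | name | arrival_time | age | severity |
|---|------|--------------|-----|----------|
| 0 | Jim | 6:00 | 40 | 40 |
| 1 | Mary | 6:30 | 31 | 10 |
| 2 | Rachel | 7:00 | 27 | 20 |
| 3 | Laura | 7:30 | 38 | 15 |
| 4 | Chloe | 8:00 | 25 | 50 |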


## Next Steps
Now that we know how to load data and use pandas, we're ready to do some real data analysis! In the next notebook we'll work with a dataset generated using clinical data from an ICU.