In [None]:
import warnings

# DSCI 100 - Introduction to Data Science


## Lecture 3 - Wrangling to get tidy data

*Credit to [Jenny Bryan's slides](https://www.slideshare.net/Plotly/plotcon-nyc-behind-every-great-plot-theres-a-great-deal-of-wrangling) and Garrett Grolemund's [tidying example](https://garrettgman.github.io/tidying/)*


<img src="img/intentional_arrival.png" width=800>

### Housekeeping
- Reminder about worksheets/tutorials:
    - **do not ever rename, move, or delete your worksheet**
    - **do not use "Save As..." when you're saving your work**
    - use "Save Notebook" or Ctrl-S instead!
    - at a minimum, if you move/delete/rename worksheets, our autograder won't be able to find them and will give you 0
- You can download your worksheets and tutorials to your computer if you think something has gone terribly wrong.

## Reminder  

Where are we? Where are we going?

<center><img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="800px"/></center>

*image source: [R for Data Science](https://r4ds.had.co.nz/) by Grolemund & Wickham*

## Data Wrangling!
<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/data_cowboy.png" width="700"/>

[The cartoon illustrations in this slide deck are all created by Allison Horst](https://github.com/allisonhorst/stats-illustrations/tree/main/rstats-artwork)

- In the real world, when you get data, it's usually *very messy*
  - inconsistent format (commas, tabs, semicolons, missing data, extra empty lines)
  - split into multiple files (e.g. yearly recorded data over many years)
  - corrupted files, custom formats
- when you read it successfully into Python, it will often still be *very messy*

- you need to make your data **"tidy"**

## What is Tidy Data?

<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg" width=800>

*Illustrations from the [Openscapes](https://www.openscapes.org/) blog [Tidy Data for reproducibility, efficiency, and collaboration](https://www.openscapes.org/blog/2020/10/12/tidy-data/) by Julia Lowndes and Allison Horst*



<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_2.jpg" width=800>



<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_3.jpg" width=800>

## Examples of Tidy(?) Data

...here is the same data represented in a few different ways. Let's vote on which are tidy!

### Tuberculosis data

This data is tidy. True or false?

| country     | year | rate              |
|-------------|------|-------------------|
| Afghanistan | 1999 | 745/19987071      |
| Afghanistan | 2000 | 2666/20595360     |
| Brazil      | 1999 | 37737/172006362   |
| Brazil      | 2000 | 80488/174504898   |
| China       | 1999 | 212258/1272915272 |
| China       | 2000 | 213766/1280428583 |

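False. Each cell in the `rate` column packs two values (`cases` and `population`) into one, so a cell does not hold a single value.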

### Tuberculosis data

This data is tidy. True or false?

| country     | cases (year=1999) | cases (year=2000) | population (year=1999) | population (year=2000) |
|-------------|-------------------|-------------------| -----------------------|------------------------|
| Afghanistan | 745               | 2666              | 19987071               | 20595360               |
| Brazil      | 37737             | 80488             | 172006362              | 174504898              |
| China       | 212258            | 213766            | 1272915272             | 1280428583             |

False. Here we have separate columns for the same variable measured at different years. All `population` measurements should be contained in a single column, as should all the measurements for `cases`.

### Tuberculosis data

This data is tidy. True or false?

| country     | 1999   | 2000   |
|-------------|--------|--------|
| Afghanistan | 745    | 2666   |
| Brazil      | 37737  | 80488  |
| China       | 212258 | 213766 |


False. Again we have separate columns for the years, and we don't know if the measurements are of `population` or `cases`.

### Tuberculosis data

This data is tidy. True or false?

| country     | year | key        | value      |
|-------------|------|------------|------------|
| Afghanistan | 1999 | cases      | 745        |
| Afghanistan | 1999 | population | 19987071   |
| Afghanistan | 2000 | cases      | 2666       |
| Afghanistan | 2000 | population | 20595360   |
| Brazil      | 1999 | cases      | 37737      |
| Brazil      | 1999 | population | 172006362  |
| Brazil      | 2000 | cases      | 80488      |
| Brazil      | 2000 | population | 174504898  |
| China       | 1999 | cases      | 212258     |
| China       | 1999 | population | 1272915272 |
| China       | 2000 | cases      | 213766     |
| China       | 2000 | population | 1280428583 |

False. The `value` column contains measurements of two different variables.

### Tuberculosis data

This data is tidy. True or false?

| country     | year | cases  | population |
|-------------|------|--------|------------|
| Afghanistan | 1999 | 745    | 19987071   |
| Afghanistan | 2000 | 2666   | 20595360   |
| Brazil      | 1999 | 37737  | 172006362  |
| Brazil      | 2000 | 80488  | 174504898  |
| China       | 1999 | 212258 | 1272915272 |
| China       | 2000 | 213766 | 1280428583 |

**True!**

- each row corresponds to a single observation,
- each column corresponds to a single variable, and
- each cell (row, column pair) corresponds to a single value
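
As a rough sketch of how the first table (the one with a combined `rate` column) could be brought into this tidy form, the strings can be split on `/`. The names `tb_rate`, `parts`, and `tb_tidy` are made up for this illustration, and the data is typed in by hand from the tables above.

```python
import pandas as pd

# The first tuberculosis table, with cases and population packed into `rate`
tb_rate = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"],
    "year": [1999, 2000, 1999, 2000, 1999, 2000],
    "rate": [
        "745/19987071", "2666/20595360",
        "37737/172006362", "80488/174504898",
        "212258/1272915272", "213766/1280428583",
    ],
})

# Split each "cases/population" string into two columns of strings (labelled 0 and 1)
parts = tb_rate["rate"].str.split("/", expand=True)

# Convert the pieces to integers, store them as separate columns, and drop `rate`
tb_tidy = (
    tb_rate
    .assign(
        cases=parts[0].astype("int64"),
        population=parts[1].astype("int64"),
    )
    .drop(columns="rate")
)
tb_tidy
```

After this, every column holds one variable and every cell holds one value, matching the tidy table above.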

## Tools for tidying and wrangling data

- `Pandas` library functions
    - `[]` & `loc[]`
    - `assign`
    - `groupby` & `agg`
    - `melt` & `pivot`
    - `apply`
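
`melt` and `pivot` are not part of the penguins demo, so here is a small hedged sketch of what they do, using hand-typed versions of two of the untidy tuberculosis tables from earlier. The variable names (`tb_wide`, `tb_long`, `tb_key_value`, `tb_tidy`) are illustrative only, and the key/value table is shortened to Afghanistan.

```python
import pandas as pd

# The "wide" table: one column per year (the values are case counts)
tb_wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})

# melt: gather the year columns into a `year` column and a `cases` column
# (the years come from column names, so they are strings here)
tb_long = tb_wide.melt(id_vars="country", var_name="year", value_name="cases")

# The "key/value" table: two variables stacked in a single column
tb_key_value = pd.DataFrame({
    "country": ["Afghanistan"] * 4,
    "year": [1999, 1999, 2000, 2000],
    "key": ["cases", "population", "cases", "population"],
    "value": [745, 19987071, 2666, 20595360],
})

# pivot: spread `key` out into one column per variable
tb_tidy = (
    tb_key_value
    .pivot(index=["country", "year"], columns="key", values="value")
    .reset_index()
)
tb_tidy
```

Roughly speaking, `melt` makes a dataframe longer (fewer columns, more rows) and `pivot` makes it wider (more columns, fewer rows).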

## Demo Time! 
<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" width="800"/>


In [1]:
import pandas as pd



In [2]:
penguins = pd.read_csv("https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv")
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


Data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in Antarctica.


## Selecting columns with []

The `[]` notation can be used to select a subset of columns (variables) from a dataframe.

For example, we might want to see only the `bill_length_mm` column, or both the `bill_length_mm` and `bill_depth_mm` columns.

In [3]:
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


```
# Start with selecting one column
penguins['bill_length_mm']

# Select many columns by passing them as a list
penguins[['bill_length_mm', 'bill_depth_mm']]

# We could also select a single column as a list
# This will return a dataframe instead of a series
penguins[['bill_length_mm']]

# Could also have assigned this to another variable name
penguins_bill_length = penguins[['bill_length_mm']]
penguins_bill_length
```

## Filtering rows with []

The `[]` notation can also be used to choose a subset of rows (observations) in a dataframe, based on a condition.

e.g. filter for only penguins with flippers longer than 190 mm

In [None]:
penguins

```
penguins

# One condition
penguins[penguins['flipper_length_mm'] > 190]

# Two conditions
penguins[
    (penguins['flipper_length_mm'] > 190)
    & (penguins['species'] == 'Chinstrap')
]
```

## loc[]

The `[]` operation is only used when you want to **either** filter rows **or** select columns. If you want to do **both at the same time**, you can use the `loc[]` notation instead.


In [None]:
penguins

```
# Filter rows and select columns
penguins.loc[
    penguins['flipper_length_mm'] > 190,
    ['flipper_length_mm', 'body_mass_g']
]

# `loc` can also select a range of columns by name, which `[]` can't
# A bare `:` is a range with no start or end, i.e. the full range (all rows or all columns)
penguins.loc[:, 'island':'body_mass_g']

# From island to end
penguins.loc[:, 'island':]

# Select columns by integer position with `iloc` (here, the first four columns)
penguins.iloc[:, :4]
```

(Optional) In case students ask about selecting columns more efficiently, e.g. everything that starts with "bill_":

```
penguins.loc[:, penguins.columns.str.startswith('bill_')]
```

> - str.startswith(): Starts with a prefix.
> - str.endswith(): Ends with a suffix.
> - str.contains(): Contains a substring or regular expression pattern.
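
A small sketch of the other two string methods, using a one-row stand-in dataframe (called `toy` here, a made-up name) with some of the same column names as `penguins`, so that it runs without downloading anything:

```python
import pandas as pd

# A one-row stand-in that reuses some of the `penguins` column names
toy = pd.DataFrame({
    "species": ["Adelie"],
    "bill_length_mm": [39.1],
    "bill_depth_mm": [18.7],
    "flipper_length_mm": [181.0],
    "body_mass_g": [3750.0],
})

# Each string method returns one True/False value per column name,
# which `loc` then uses to keep only the matching columns
ends_mm = toy.loc[:, toy.columns.str.endswith("_mm")]
has_length = toy.loc[:, toy.columns.str.contains("length")]

print(list(ends_mm.columns))
print(list(has_length.columns))
```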

## Assign

The `assign` function can create new columns from old ones.

e.g. convert body mass from grams to pounds 

In [None]:
penguins

```
# Recall that 1 lb is approximately 454 g
penguins.assign(body_mass_lbs = penguins['body_mass_g'] / 454 )

# The above creates a new dataframe, it does not save it to the original `penguins` df
penguins

# We need to assign it to a new variable if we want to save it
penguins_lbs = penguins.assign(body_mass_lbs = penguins['body_mass_g'] / 454 )
```

## Many operations in the same sequence

When you need to type out a long sequence of operations on data, you could either:

1. Save intermediate objects
2. Add methods together in the same chain

Let's look closer at each one of these

### 1. Save intermediate objects

```python
penguins_1 = penguins.assign(body_mass_lbs = penguins['body_mass_g'] / 454 )
penguins_2 = penguins_1[penguins_1['flipper_length_mm'] > 190]
penguins_3 = penguins_2[['flipper_length_mm', 'body_mass_lbs']]
penguins_3
```

#### Disadvantages:

- The reader may be tricked into thinking the named `penguins_1` and `penguins_2` objects are important for some reason, while they are just temporary intermediate computations. 
- Further, the reader has to look through and find where `penguins_1` and `penguins_2` are used in each subsequent line.
- Creating variables that we don't need could lead to memory issues if they are big and take up a lot of space.

### 2. Method chaining

```python
(
    penguins
    .assign(body_mass_lbs = penguins['body_mass_g'] / 454 )
    [penguins['flipper_length_mm'] > 190]
    [['flipper_length_mm', 'body_mass_lbs']]
)
```

#### Advantages:

- Code becomes more readable, particularly when you need to do a long sequence of operations on data.
- No intermediate variables are created

Emphasize/repeat what a method is.

We want to put each method/operation on its own row. We could write it in one big line, but it would be very hard to read.

In order to do this, we must use the parentheses to indicate to Python that all this code belongs together and should be executed as one.

### 2. Method chaining

```python
(
    penguins
    .assign(body_mass_lbs = penguins['body_mass_g'] / 454 )
    .loc[
        penguins['flipper_length_mm'] > 190, 
        ['bill_length_mm', 'flipper_length_mm']
    ]
)
```

Remember that we should use `loc` when doing a row filter and column selection,
rather than doing the two operations after each other.
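
One thing to watch for: the chains above filter with `penguins[...]`, which refers to the original dataframe, so they cannot filter on a column that was only just created by `assign`. A possible way around this, sketched here on a tiny hand-made dataframe (`toy` and `heavy` are made-up names, and the 8 lbs cutoff is arbitrary), is to pass a `lambda` so that each step works on the dataframe coming out of the previous step:

```python
import pandas as pd

# Tiny stand-in for `penguins` so the example runs on its own
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 195.0, 210.0],
    "body_mass_g": [3750.0, 3250.0, 4100.0],
})

heavy = (
    toy
    # `df` is the dataframe at this point in the chain
    .assign(body_mass_lbs=lambda df: df["body_mass_g"] / 454)
    # Here `df` already has the new `body_mass_lbs` column, so we can filter on it
    .loc[lambda df: df["body_mass_lbs"] > 8, ["flipper_length_mm", "body_mass_lbs"]]
)
heavy
```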

### Nursery rhyme example

> Jack and Jill went up the hill  
> To fetch a pail of water  
> Jack fell down and broke his crown  
> And Jill came tumbling after

#### Intermediate objects

```python
on_hill = jack_jill.went_up('hill')
with_water = on_hill.fetch('water')
fallen = with_water.fell_down('jack')
broken = fallen.broke('jack')
after = broken.tumble_after('jill')
```

### Nursery rhyme example

> Jack and Jill went up the hill  
> To fetch a pail of water  
> Jack fell down and broke his crown  
> And Jill came tumbling after

#### Method chaining

```python
(
    jack_jill
    .went_up('hill')
    .fetch('water')
    .fell_down('jack')
    .broke('jack')
    .tumble_after('jill')
)
```

## Grouping and summarizing via `groupby`

- Grouping is when you split data into groups based on the value of a column.
- Summarizing is when you combine data into fewer summary values.

For example computing the average particle pollution per city (source https://info201.github.io/dplyr.html):

![](https://info201.github.io/img/dplyr/group_by.png)

- Another example: splitting the `penguins` data into one group per `species`, and then summarizing the values within each group, e.g. reporting the average `body_mass_g` per species.
    - To do this we use `groupby` to split the data by species, then calculate the average body mass within each group.

In [None]:
# Recall the data structure
penguins

Count how many penguins there are for each species:

```
penguins.value_counts("species")
```

Calculate average body mass for each species:

```
penguins.groupby("species")["body_mass_g"].mean()
```

Calculate the standard deviation of body mass for each species:


```
penguins.groupby("species")["body_mass_g"].std()
```

For any of these, we can chain `.reset_index()` to turn the result into a dataframe.


Note that pandas ignores missing values (`NaN`) by default when computing aggregations such as the mean, std, etc.
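
To compute several summaries at once, `agg` (listed in the tools earlier) can take a list of function names. The sketch below uses a small hand-made dataframe (`toy`, a made-up name with made-up rows) that includes one missing value, to show the missing value being skipped:

```python
import numpy as np
import pandas as pd

# Small stand-in for `penguins`, with one missing body mass
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap"],
    "body_mass_g": [3750.0, 3800.0, np.nan, 4000.0, 3400.0],
})

summary = (
    toy
    .groupby("species")["body_mass_g"]
    .agg(["mean", "std", "count"])  # one output column per summary
    .reset_index()                  # turn the `species` index back into a column
)
summary
```

The `count` column only counts non-missing values, so it shows 2 for Adelie even though there are 3 Adelie rows.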

## Another big concept this week: iteration

- Iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)

![](https://www.ecomcrew.com/wp-content/uploads/2015/07/bar-code-scanning-grocery-store.jpg)

- important to reduce duplication of code  
- easier to see the intent of your code 
- likely have fewer bugs

## `apply` 

`apply` allows you to apply function(s) to multiple columns/rows,
e.g. to calculate the average for each numeric column in the `penguins` data



- `apply()` calls a function on each column (or each row) of a DataFrame.
- we pass the function we want to apply as an argument, for example: `apply(max)`
- the `axis` parameter defines whether you iterate over columns or rows.
    - `axis=0` is the default if no input is passed. It indicates we want to apply our function to each column
    - `axis=1` indicates we want to apply our function to each row


In [None]:
penguins

Using `.max()` to calculate the maximum of each column:

```
penguins.loc[:, "bill_length_mm":"body_mass_g"].max()
```

We can do the same thing with `apply`:
```
penguins.loc[:, "bill_length_mm":"body_mass_g"].apply(max)
```

Why is `apply` useful? It can be used with a wide range of operations,
such as our own custom functions that don't exist in pandas.
In general the rule is that if it already exists in pandas,
then it is better to **not** use `apply`.
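
As an illustration of `apply` with a custom function and with both `axis` settings, here is a sketch on a tiny hand-made dataframe. The function `value_range` and the names `toy`, `col_ranges`, and `row_ranges` are invented for this example.

```python
import pandas as pd

# Tiny stand-in with two numeric columns
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
})

def value_range(values):
    """Return the spread (largest minus smallest value) of a Series."""
    return values.max() - values.min()

# axis=0 (the default): the function receives one column at a time
col_ranges = toy.apply(value_range)

# axis=1: the function receives one row at a time
row_ranges = toy.apply(value_range, axis=1)

print(col_ranges)
print(row_ranges)
```

In this simple case the same result could also be computed directly as `toy.max() - toy.min()`, which follows the rule above; `apply` becomes more helpful when the function cannot be expressed with existing pandas methods.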


## Go forth and wrangle! 

<img align="left" src="https://media.giphy.com/media/Qgm6tIYrSQqC4/giphy.gif" width="300">

*image source: https://media.giphy.com/media/Qgm6tIYrSQqC4/giphy-downsized-large.gif*

## What did we learn?

- 
- 
- 





Friends with similar tools:

<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_4.jpg" width=800>

Easier for automation & iteration!

<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_5.jpg" width=800>

And it makes all other tidy datasets seem more welcoming!

<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_6.jpg" width=800>

So make friends with tidy data!

<img src="https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_7.jpg" width=800>