<img src="https://snipboard.io/Kx6OAi.jpg">

# Session 2. Practical Advanced Pandas: Configuration and Review
<div style="margin-top: -20px;">Author:  David Yerrington</div>

## Learning Objectives

- Aggregations Review
- Describe the role of axis in aggregations
- How to implement `.apply()` methods and their tradeoffs

### Prerequisite Knowledge
- Basic Pandas 
  - Difference between Series vs Dataframe
  - Bitmasks, query function, selecting data
  - Aggregations

## Environment Setup

Don't forget to set up your Python environment from [the setup guide](../environment.md) if you haven't done so yet.


In [70]:
import pandas as pd, numpy as np

### Load Data

In [72]:
df = pd.read_csv("../data/pokemon.csv", encoding = "utf8")
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,4,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,5,Charmander,Fire,,39,52,43,60,50,65,1,False


## 1. Aggregations

Aggregations are the bread and butter of Pandas.  Aggregations provide us the most common type of data transformation that we'll encounter in the wild.  Using these types of functions helps us understand data primarily through summarization.

### Measures of central tendency
![](https://snipboard.io/KOHx1p.jpg)

The most useful function in Pandas when first exploring data is the summary function which combines the most common measures of central tendency all-in-one.
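
The summary function described here is presumably `DataFrame.describe()`; that reading is an assumption, since the text does not name it. A minimal sketch, using only the five rows shown by `df.head()` above rather than the full CSV:

```python
import pandas as pd

# The five rows displayed by df.head() above, limited to three stat columns.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur", "Charmander"],
    "HP": [45, 60, 80, 80, 39],
    "Attack": [49, 62, 82, 100, 52],
    "Defense": [49, 63, 83, 123, 43],
})

# describe() reports count, mean, std, min, quartiles, and max
# for every numeric column in one call; "Name" is skipped by default.
summary = sample.describe()
print(summary)
```

The `50%` row is the median, so this one call covers the mean and median along with several measures of spread.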

### Common Built-in Methods
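
As a rough sketch of the three built-ins listed below, here they are applied to the five sample rows shown earlier (the `sample` and `stats` names are only for illustration):

```python
import pandas as pd

# Stat columns from the five rows displayed by df.head() above.
sample = pd.DataFrame({
    "HP": [45, 60, 80, 80, 39],
    "Attack": [49, 62, 82, 100, 52],
    "Defense": [49, 63, 83, 123, 43],
})

stats = sample[["HP", "Attack", "Defense"]]

means = stats.mean()      # arithmetic mean of each column
medians = stats.median()  # middle value of each column
totals = stats.sum()      # total of each column

print(means, medians, totals, sep="\n\n")
```

Each call returns a Series indexed by column name, with one aggregated value per column.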

### Mean

### Median

### Sum

## `groupby`

A "group by" operation works much the same way as it would in a SQL environment.  We answer questions like "Find the number of people in each city" or "How many online orders did we have each day of the month."

The way `.groupby` works in Pandas is by grouping identical values and performing an operation on the corresponding data within each group.


In [79]:
data = [
    dict(key = "A", value = 65),
    dict(key = "A", value = 55),
    dict(key = "A", value = 35),
    dict(key = "B", value = 250),
    dict(key = "B", value = 650),
    dict(key = "B", value = 323),
    dict(key = "C", value = 1500),
    dict(key = "C", value = 2345),
    dict(key = "C", value = 3333),
]
## Create dataframe here
new_df = pd.DataFrame(data)
new_df

Unnamed: 0,key,value
0,A,65
1,A,55
2,A,35
3,B,250
4,B,650
5,B,323
6,C,1500
7,C,2345
8,C,3333


## Groupby `key`

Using the column `key` as the criterion for grouping, Pandas will separate the rows into groups that share the same `key` value.  For example, all "A" records are considered a group, and the same goes for "B" and "C".  Each group has 3 records in it.

Any aggregation is then performed on each group separately, producing a single value for each group, and the results are combined.

In [81]:
new_df.groupby("key").mean()

key,value
A,51.666667
B,407.666667
C,2392.666667


We get `51.666667` for group "A" because the values in this group are `65`, `55`, and `35`, and their mean is `51.666667`.  The same calculation, `.mean()`, is performed on the other groups as well.

### Applying Multiple Aggregations

You can also use numpy to provide additional aggregations (you can also write your own!).
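
A minimal sketch of several aggregations at once on the `key`/`value` data from above. The `value_range` helper is a hypothetical custom aggregation written for this example; it uses numpy internally.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({
    "key": list("AAABBBCCC"),
    "value": [65, 55, 35, 250, 650, 323, 1500, 2345, 3333],
})

def value_range(values):
    """Custom aggregation: largest value minus smallest value in a group."""
    return np.max(values) - np.min(values)

# Built-in aggregations are named by string; custom ones are passed as functions.
summary = new_df.groupby("key")["value"].agg(["mean", "median", "sum", value_range])
print(summary)
```

The result has one row per group and one column per aggregation, with the custom column named after the function.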

## 2. Role of `axis` in Aggregations

As we approach building our own `.apply()` functions, it's helpful to understand how axis is used to control the flow of data to aggregation methods.

![](https://snipboard.io/8i3yIz.jpg)

## Another Simple Dataset

In [85]:
data = [
    dict(animal = "cat", width = 5, height = 10, depth = 22, volume = 0),
    dict(animal = "cat", width = 3, height = 8, depth = 17, volume = 0),
    dict(animal = "rat", width = 2, height = 1, depth = 3, volume = 0),
    dict(animal = "rat", width = 3, height = 1.2, depth = 3.5, volume = 0),
    dict(animal = "bat", width = 2, height = 2.12, depth = 4, volume = 0),
    dict(animal = "bat", width = 3, height = 3.12, depth = 4.1, volume = 0),
    dict(animal = "bat", width = 1.9, height = 2.8, depth = 3.8, volume = 0),
    dict(animal = "wombat", width = 7, height = 6, depth = 26, volume = 0),
    dict(animal = "wombat", width = 8, height = 7.1, depth = 28, volume = 0),
    dict(animal = "wombat", width = 5, height = 5, depth = 22, volume = 0),
]

## Create dataframe here
animals = pd.DataFrame(data)
animals

Unnamed: 0,animal,width,height,depth,volume
0,cat,5.0,10.0,22.0,0
1,cat,3.0,8.0,17.0,0
2,rat,2.0,1.0,3.0,0
3,rat,3.0,1.2,3.5,0
4,bat,2.0,2.12,4.0,0
5,bat,3.0,3.12,4.1,0
6,bat,1.9,2.8,3.8,0
7,wombat,7.0,6.0,26.0,0
8,wombat,8.0,7.1,28.0,0
9,wombat,5.0,5.0,22.0,0


### Question:  What do you think would happen if we used `axis = 1` rather than `axis = 0` in aggregations based on `df.mean(axis=n)`?

> `axis = 0` is the default with Pandas aggregate functions.
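
One way to explore the question is to run the same aggregation along both axes. This sketch rebuilds the numeric columns of the animals data and leaves out the `animal` and `volume` columns for brevity:

```python
import pandas as pd

animals = pd.DataFrame({
    "width":  [5, 3, 2, 3, 2, 3, 1.9, 7, 8, 5],
    "height": [10, 8, 1, 1.2, 2.12, 3.12, 2.8, 6, 7.1, 5],
    "depth":  [22, 17, 3, 3.5, 4, 4.1, 3.8, 26, 28, 22],
})

# axis=0 (the default): collapse the rows, giving one value per column.
column_means = animals.mean(axis=0)

# axis=1: collapse the columns, giving one value per row.
row_means = animals.mean(axis=1)

print(column_means)
print(row_means)
```

With `axis=0` the result has three entries (one per column); with `axis=1` it has ten (one per row).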

## 3. `.apply()`

Sometimes we want to write our own transformations in Pandas, either because we don't know how to do something in Pandas or because its built-in methods are not enough to produce what we want.  It's common for newcomers to Pandas who want to do something custom to start searching for how to iterate over the rows of a DataFrame as if it were a Python `list` or `dict`.

It's definitely possible to write your own methods this way, but it's not using Pandas to its full potential.  As datasets grow in size, efficiency suffers.

### Create a new column called "volume"

Let's pretend that our animals are perfectly rectangular in nature.  We want to engineer a new feature that calculates the volume of each animal by row.

## $V=LWH$

### Wrong Way:  Iterate over every row. 

In [7]:
# Lets increase the size of the data first using sampling!
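
A sketch of the row-by-row approach this section warns against. The sample size of 1000 and the `random_state` are arbitrary choices for this example, and the animals data is trimmed to one row per animal.

```python
import pandas as pd

animals = pd.DataFrame({
    "animal": ["cat", "rat", "bat", "wombat"],
    "width":  [5, 2, 2, 7],
    "height": [10, 1, 2.12, 6],
    "depth":  [22, 3, 4, 26],
})

# Enlarge the data by sampling rows with replacement (size is an assumption).
sampled_animals = animals.sample(n=1000, replace=True, random_state=42)
sampled_animals = sampled_animals.reset_index(drop=True)

# The slow pattern: a Python-level loop that builds a Series for every row.
volumes = []
for _, row in sampled_animals.iterrows():
    volumes.append(row["depth"] * row["width"] * row["height"])

sampled_animals["volume"] = volumes
print(sampled_animals.head())
```

This produces correct values, but every row passes through the Python interpreter one at a time, which is why it scales poorly.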

## The Right Way:  `.apply([function])`

Not only can we do this in one line using a lambda function, it also saves a lot of code and is easy to copy and paste.  It still calls a Python function once per row, so it is slower than operating on whole columns directly.

> **This lambda function:**
>
> ```python
> lambda row: row['depth'] * row['width'] * row['height']
> ```
>
> <br>
>
> **Is roughly equivalent to:**
>
> ```python
> def my_func(row):
>     return row['depth'] * row['width'] * row['height']
> ```
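
Put to use, the lambda is passed to `.apply()` with `axis=1` so that it receives one row at a time. A minimal sketch on the animals data:

```python
import pandas as pd

animals = pd.DataFrame({
    "animal": ["cat", "cat", "rat", "rat", "bat", "bat", "bat",
               "wombat", "wombat", "wombat"],
    "width":  [5, 3, 2, 3, 2, 3, 1.9, 7, 8, 5],
    "height": [10, 8, 1, 1.2, 2.12, 3.12, 2.8, 6, 7.1, 5],
    "depth":  [22, 17, 3, 3.5, 4, 4.1, 3.8, 26, 28, 22],
})

# axis=1 hands each row to the function as a Series keyed by column name.
animals["volume"] = animals.apply(
    lambda row: row["depth"] * row["width"] * row["height"], axis=1
)
print(animals)
```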

### An aside: For pure speed, operate directly on the column series.

In [None]:
%%timeit

sampled_animals['volume'] = sampled_animals['depth'] * sampled_animals['width'] * sampled_animals['height']

### Question:  If we `.apply` a function to our dataset by columns, instead of rows, what do you expect the data to look like as the input to the function?

## Apply by column (axis = 0)

By default, `.apply()` works on columns, so it's not necessary to specify `axis=0`.  Here, we will build a basic function that standardizes our data, and also examine exactly what the data looks like as it enters the function and after it leaves.

Let's check the dataset again before continuing.


In [115]:
## check animals DataFrame
animals

Unnamed: 0,animal,width,height,depth,volume
0,cat,5.0,10.0,22.0,0
1,cat,3.0,8.0,17.0,0
2,rat,2.0,1.0,3.0,0
3,rat,3.0,1.2,3.5,0
4,bat,2.0,2.12,4.0,0
5,bat,3.0,3.12,4.1,0
6,bat,1.9,2.8,3.8,0
7,wombat,7.0,6.0,26.0,0
8,wombat,8.0,7.1,28.0,0
9,wombat,5.0,5.0,22.0,0


## Applying a Standard Scale Function

Standard scaling a set of column values is a fairly common and easy task for any machine learning library in Python.  To illustrate the utility of `.apply([method])` we will recreate this common statistical concept:

# $z = \frac{x - \bar{x}}{\sigma}$

In [None]:
# Write the standard scale function
# Apply it to the
# Examine input data
def standard_scale(col):
    return col # update return method

animals[['width', 'height', 'depth']].apply(standard_scale)
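
One possible way to complete the exercise is sketched below. It assumes $\sigma$ is the population standard deviation (`ddof=0`); Pandas' own `.std()` defaults to the sample standard deviation (`ddof=1`), so adjust if that is what you want.

```python
import pandas as pd

animals = pd.DataFrame({
    "width":  [5, 3, 2, 3, 2, 3, 1.9, 7, 8, 5],
    "height": [10, 8, 1, 1.2, 2.12, 3.12, 2.8, 6, 7.1, 5],
    "depth":  [22, 17, 3, 3.5, 4, 4.1, 3.8, 26, 28, 22],
})

def standard_scale(col):
    # Examine the input: with the default axis=0, col is one whole column Series.
    print(f"column {col.name!r}: {type(col).__name__} with {len(col)} values")
    # z = (x - mean) / sigma, using the population standard deviation (assumption).
    return (col - col.mean()) / col.std(ddof=0)

scaled = animals[["width", "height", "depth"]].apply(standard_scale)
print(scaled)
```

After scaling, each column has a mean of 0 and a standard deviation of 1, while the shape of the DataFrame is unchanged.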

### Other Useful Transformation Methods

### `.applymap`

Applymap works just like apply except that its input context is every single value rather than the entire column or row series.
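
A small sketch of an element-wise transformation. Pandas 2.1 and later deprecate `.applymap` in favor of `DataFrame.map`, so this example picks whichever one the installed version provides; squaring each value is just an arbitrary illustration.

```python
import pandas as pd

dims = pd.DataFrame({
    "width":  [5, 3, 2],
    "height": [10, 8, 1],
    "depth":  [22, 17, 3],
})

# DataFrame.map is the newer name for applymap (pandas >= 2.1);
# fall back to applymap on older versions.
elementwise = dims.map if hasattr(dims, "map") else dims.applymap

# The function receives one scalar at a time, not a row or a column.
squared = elementwise(lambda value: value ** 2)
print(squared)
```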

### `.agg([list of methods])`

Agg, like apply, allows you to contextualize aggregations using built-in, numpy, or user defined methods.  The output of `.agg()` is the set of aggregations performed across multiple columns or rows, depending on the axis specified.

> You can also specify a dictionary with keys matching the columns, and specific methods that they match to:
> `df.agg({'col1': [np.min, np.max], 'col2': [np.sum], ...})`

**An example:**

In [7]:
animals[['height', 'width']]
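
A sketch of how this example might continue, showing both the list form and the dictionary form of `.agg()`. The particular aggregations chosen here are only for illustration.

```python
import pandas as pd

animals = pd.DataFrame({
    "width":  [5, 3, 2, 3, 2, 3, 1.9, 7, 8, 5],
    "height": [10, 8, 1, 1.2, 2.12, 3.12, 2.8, 6, 7.1, 5],
})

dims = animals[["height", "width"]]

# List form: every aggregation is applied to every column.
summary = dims.agg(["min", "max", "mean"])

# Dictionary form: each column gets its own set of aggregations.
per_column = dims.agg({"height": ["min", "max"], "width": ["sum"]})

print(summary)
print(per_column)
```

In the dictionary form, cells for aggregations that were not requested for a column come back as `NaN`.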

## Summary

We use aggregations and transformations in Pandas regularly.  Most transformation problems involve data cleaning, feature engineering, metrics, or even summarization.  As data projects demand data in different formats for machine learning, analysis, or online processing of some kind, it's important to know how to effectively work with Pandas to get the most out of it.

### 1.  Use `.apply()` rather than `for row in dataframe.values:`

Most of Pandas uses Numpy data types, which are highly optimized for array operations.  Numpy arrays are densely packed arrays of a homogeneous type.  In a Python list, by contrast, each element is a pointer to a separate object.  The dense layout is generally beneficial because of [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference).

Not only is it faster to use Pandas built-ins, calculating new features directly from column series can improve performance by a factor of 5-100x.  Almost anything you do with `.apply()` will be slower than the equivalent operation on whole columns.


> **Fun Fact**
> The smallest possible space for an element to take up in a Python list is 16 bytes.  4 bytes minimum for the value, 4 for the reference in memory, and 4 for the type, rounded up by the memory allocator, comes to 16 bytes.  These figures assume a 32-bit build of Python; on a 64-bit build each object is larger.  For Numpy, every element has the exact same type, so there's no need to store the type with each element: the container is designed to hold values of a single type.  Right off the bat, we get rid of the 4 bytes used to store type information for each element.  Storing single precision values takes only 4 bytes per element, while double precision values take 8 bytes.  At the end of the day you are storing less data about your data with Numpy than with regular Python lists.  Lists are designed to be general purpose, while Numpy arrays are designed to handle numeric data efficiently.
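
The difference can be checked roughly on your own machine. This sketch compares a list of 1000 Python floats with Numpy arrays holding the same values; the exact list figure depends on the Python build, so only the Numpy sizes are fixed.

```python
import sys
import numpy as np

values = [float(i) for i in range(1000)]

# A list stores pointers; each float is a separate object with its own overhead.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Numpy stores the raw values back to back with no per-element type information.
arr64 = np.array(values, dtype=np.float64)  # double precision
arr32 = arr64.astype(np.float32)            # single precision

print("list:   ", list_bytes, "bytes")
print("float64:", arr64.nbytes, "bytes,", arr64.itemsize, "per element")
print("float32:", arr32.nbytes, "bytes,", arr32.itemsize, "per element")
```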

> Whenever possible, don't repeat yourself!  Use a built-in function provided by Pandas or Numpy to transform your data rather than writing it from scratch!  It will almost always be faster to use built-ins, depending on the problem and size of your data.

### 2. `.apply()` works with either axis: Columns or rows.

We now understand why `.apply` is superior to operating on your data with lists and "for loops."  Apply can take any function: user defined, built-in, or provided by another library such as Numpy or scipy.

When `axis=1`, the input to your apply function is a single row series.  This is useful when you need to operate on values at the row level, such as multiplying or adding two values together, extracting the time or date from one column and creating a new column with the extracted value, or engineering new features from combined row values.

When `axis=0`, or when no axis is specified because it's the default, the input to your apply function will be the **entire** column series.  So if your DataFrame has 100 rows, the apply function's input will be a column series with a length of 100.  Apply functions operate on each column in the order that they exist in the DataFrame they are applied to.  However, it's not possible to get other columns in context.  If you need to use data from other columns to derive the value of your function, use the row axis, `axis=1`, instead.
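
To make the input sizes concrete, this sketch applies `len` along each axis of a hypothetical 100-row, 3-column DataFrame (the column names `a`, `b`, and `c` are made up for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows and 3 columns.
df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=["a", "b", "c"])

# axis=0: the function sees each column, so every input has 100 values.
by_column = df.apply(len)

# axis=1: the function sees each row, so every input has 3 values.
by_row = df.apply(len, axis=1)

print(by_column)
print(by_row.head())
```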