<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Advanced Transformations with `pandas`

_Authors: Dave Yerrington (SF)_

---

`pandas` is one of the most popular Python packages for managing datasets and is used extensively by data scientists.

### Learning Objectives

- Use map and apply to functionally transform your data.
- Subset your data using the `groupby` function.
- Collapse and summarize data groups using aggregation.

## What's the difference between a DataFrame and a Series?
Let's check some ideas and discuss them first thing.
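
As a starting point for that discussion, here is a minimal sketch of one visible difference: selecting a single column gives a one-dimensional _Series_, while selecting a list of columns gives a two-dimensional _DataFrame_. The `demo` table and variable names are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical two-row table, used only for this illustration.
demo = pd.DataFrame({"Power": [24, 55], "Name": ["Pikachu", "Bulbasaur"]})

power_series = demo["Power"]    # single brackets: one column as a Series (1-D)
power_frame = demo[["Power"]]   # double brackets: a one-column DataFrame (2-D)

print(type(power_series).__name__, power_series.shape)
print(type(power_frame).__name__, power_frame.shape)
```

The shapes printed by the two lines show the difference in dimensionality.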

In [2]:
import pandas as pd, numpy as np

%matplotlib inline

### Let's start with some Pokemon

In [3]:
data = [
    (24, 25, 100, "Pikachu"),
    (55, 55, 120, "Bulbasaur"),
    (33, 35, 100, "Charmander"),
    (22, 25, 105, "Geodude"),
    (12, 15, 90,  "JigglyPuff"),
    (55, 55, 115, "Paul"),
]

columns = ["Power", "Speed", "HP", "Name"]
df = pd.DataFrame(data, columns = columns)
df

| | Power | Speed | HP | Name |
|---|---|---|---|---|
| 0 | 24 | 25 | 100 | Pikachu |
| 1 | 55 | 55 | 120 | Bulbasaur |
| 2 | 33 | 35 | 100 | Charmander |
| 3 | 22 | 25 | 105 | Geodude |
| 4 | 12 | 15 | 90 | JigglyPuff |
| 5 | 55 | 55 | 115 | Paul |


# Series Aspects: Axis 1 and 0

Previously we talked about the ability to operate on data as either _rows_ or _columns_.  Let's review this now because it's a core assumption about how we _can_ work with our _DataFrame_.

![image.png](attachment:image.png)


###  Quick question:  Does axis matter if we're using "sum()"?

In [4]:
df.sum()
# axis=0 is the default: sums down the rows, giving one total per column

Power                                                201
Speed                                                210
HP                                                   630
Name     PikachuBulbasaurCharmanderGeodudeJigglyPuffPaul
dtype: object

In [5]:
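df.sum(axis=1, numeric_only=True)
# axis=1: sums across the columns, giving one total per row
# numeric_only=True skips the text column "Name", which cannot be added to numbers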

0    149
1    230
2    168
3    152
4    117
5    225
dtype: int64
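
To make the two directions concrete, here is a minimal sketch on a numeric-only frame. The `stats` frame and its row labels are assumptions for illustration, reusing three rows of the Pokemon data.

```python
import pandas as pd

# Three rows of the Pokemon stats, with names as the index (illustrative setup).
stats = pd.DataFrame(
    {"Power": [24, 55, 33], "Speed": [25, 55, 35], "HP": [100, 120, 100]},
    index=["Pikachu", "Bulbasaur", "Charmander"],
)

column_totals = stats.sum(axis=0)  # collapses the rows: one total per column
row_totals = stats.sum(axis=1)     # collapses the columns: one total per row

print(column_totals)
print(row_totals)
```

So the axis does matter: `axis=0` returns a value for each column, while `axis=1` returns a value for each row.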

### Apply vs Map

One way we can access our data is through its raw `numpy` representation, `.values`.  Let's say we just want to scale the value of _Power_ by 10.  We might be inclined to do this with vanilla Python like so:

In [6]:
df.values

array([[24, 25, 100, 'Pikachu'],
       [55, 55, 120, 'Bulbasaur'],
       [33, 35, 100, 'Charmander'],
       [22, 25, 105, 'Geodude'],
       [12, 15, 90, 'JigglyPuff'],
       [55, 55, 115, 'Paul']], dtype=object)

In [9]:
my_values = df.values

for row in my_values:
    # The first element of each row holds that row's 'Power' value
    row[0] = row[0] * 10

my_values

array([[240, 25, 100, 'Pikachu'],
       [550, 55, 120, 'Bulbasaur'],
       [330, 35, 100, 'Charmander'],
       [220, 25, 105, 'Geodude'],
       [120, 15, 90, 'JigglyPuff'],
       [550, 55, 115, 'Paul']], dtype=object)

While this works at some level, we've left the safe haven of our DataFrame object and duplicated the data into a separate array that we have to chauffeur back into a _DataFrame_ if we want to use it again.

### Map functions
**Map functions operate on _series_ only**.  They make it easy for us to transform data without having to break out into a full-on loop like in our previous example.  Let's do this with a map function instead.

Common use cases:

* Iterating on a _series_, functionally, rather than using a "loop".
* Feature engineering (creating a column / variable based on some transformation of existing data).
* Prototyping a function before **apply**ing it to many columns.

In [12]:
def scale_by_10(value):
    return value * 10

In [13]:
df['Power'].map(scale_by_10)

0    240
1    550
2    330
3    220
4    120
5    550
Name: Power, dtype: int64

#### We could be more succinct with a lambda function here.  Let's do it!
Generally, you might want to write a user-defined function to handle operations that require more than a few lines of code, but choose lambda functions for simple one-line operations.

In [10]:
df['Power'].map(lambda value: value*10)

0    240
1    550
2    330
3    220
4    120
5    550
Name: Power, dtype: int64

#### Have we changed any data though?

In [16]:
df

| | Power | Speed | HP | Name |
|---|---|---|---|---|
| 0 | 24 | 25 | 100 | Pikachu |
| 1 | 55 | 55 | 120 | Bulbasaur |
| 2 | 33 | 35 | 100 | Charmander |
| 3 | 22 | 25 | 105 | Geodude |
| 4 | 12 | 15 | 90 | JigglyPuff |
| 5 | 55 | 55 | 115 | Paul |


In [17]:
df['Power'] = df['Power'].map(lambda value: value * 10)
df

| | Power | Speed | HP | Name |
|---|---|---|---|---|
| 0 | 240 | 25 | 100 | Pikachu |
| 1 | 550 | 55 | 120 | Bulbasaur |
| 2 | 330 | 35 | 100 | Charmander |
| 3 | 220 | 25 | 105 | Geodude |
| 4 | 120 | 15 | 90 | JigglyPuff |
| 5 | 550 | 55 | 115 | Paul |
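
For the feature engineering use case, a common variation is to store the result of `map()` in a new column instead of overwriting the original. The sketch below also shows that `map()` accepts a dictionary as a lookup table. The `pokemon` frame, the `Power_x10` and `Type` column names, and the `type_lookup` dictionary are all hypothetical names chosen for this example.

```python
import pandas as pd

pokemon = pd.DataFrame(
    {"Name": ["Pikachu", "Bulbasaur", "Geodude"], "Power": [24, 55, 22]}
)

# Keep the original column and store the transformed values under a new name.
pokemon["Power_x10"] = pokemon["Power"].map(lambda value: value * 10)

# map() also accepts a dict; values missing from the dict become NaN.
type_lookup = {"Pikachu": "Electric", "Bulbasaur": "Grass"}  # hypothetical lookup
pokemon["Type"] = pokemon["Name"].map(type_lookup)

print(pokemon)
```

Because "Geodude" is not a key in `type_lookup`, its `Type` value is missing rather than raising an error.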


### Apply Functions

Where `map()` works on a single _series_, `apply()` works across a whole _DataFrame_, one _series_ at a time, either column by column (axis=0) or row by row (axis=1). Let's create a function that `apply()` can use to add a new column called "All".  When the function returns a _series_, the result is a _DataFrame_; when it returns a single value, the result is a _series_.

- Operates on every series along axis 0 or 1.
  - axis=1 (column axis, row-by-row data): the function receives one row at a time.
  - axis=0 (row axis, column-by-column data): the function receives one entire column series at a time (6 values in this particular dataset).
- The function used with `apply()` can return a _series_ or a single value.


#### Axis = 1 (column axis, row-by-row data)

In [18]:
def total_stats(row):
    row['All'] = row[['Power', 'Speed', 'HP']].sum()
    return row

In [19]:
df.apply(total_stats, axis=1)

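| | Power | Speed | HP | Name | All |
|---|---|---|---|---|---|
| 0 | 240 | 25 | 100 | Pikachu | 365 |
| 1 | 550 | 55 | 120 | Bulbasaur | 725 |
| 2 | 330 | 35 | 100 | Charmander | 465 |
| 3 | 220 | 25 | 105 | Geodude | 350 |
| 4 | 120 | 15 | 90 | JigglyPuff | 225 |
| 5 | 550 | 55 | 115 | Paul | 720 |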
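
A row-wise function does not have to return the whole row. As a minimal sketch, the hypothetical `stat_total` function below returns a single number per row, so `apply()` produces a _series_ that can be assigned to a new column. The two-row `pokemon` frame is an assumption for illustration.

```python
import pandas as pd

pokemon = pd.DataFrame(
    {
        "Power": [240, 550],
        "Speed": [25, 55],
        "HP": [100, 120],
        "Name": ["Pikachu", "Bulbasaur"],
    }
)

def stat_total(row):
    # row is a Series holding one Pokemon's values, indexed by column name
    return row["Power"] + row["Speed"] + row["HP"]

# One value per row comes back, so the result is a Series we can store.
pokemon["All"] = pokemon.apply(stat_total, axis=1)

# The same totals without apply(), using a vectorized row-wise sum.
vectorized = pokemon[["Power", "Speed", "HP"]].sum(axis=1)

print(pokemon)
```

For simple arithmetic like this, the vectorized form is usually shorter; `apply()` earns its keep when the per-row logic is harder to express with built-in operations.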


#### Axis = 0 (row axis, column-by-column data)

Now let's operate on columns (axis=0).  Perhaps we want the total of each series.

In [23]:
def column_total(column):
    # column is one entire column as a Series; sum() collapses it to a single value
    return column.sum()

In [24]:
df.apply(column_total, axis=0)

Power                                               2010
Speed                                                210
HP                                                   630
Name     PikachuBulbasaurCharmanderGeodudeJigglyPuffPaul
dtype: object
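
For contrast, here is a sketch of a column-wise function that returns a whole column rather than a single total. The `stats` frame and the `scale_column_by_100` helper are hypothetical names for this illustration; only numeric columns are included so the multiplication is meaningful.

```python
import pandas as pd

stats = pd.DataFrame({"Power": [24, 55], "Speed": [25, 55], "HP": [100, 120]})

def scale_column_by_100(column):
    # Series in, Series out: every value in the column is multiplied by 100
    return column * 100

scaled = stats.apply(scale_column_by_100, axis=0)           # DataFrame, same shape
totals = stats.apply(lambda column: column.sum(), axis=0)   # Series, one value per column

print(scaled)
print(totals)
```

The shape of the result follows from what the function returns: a Series per column keeps the table shape, while a single value per column collapses it.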

### Let's switch to a different dataset


![](https://snag.gy/nA84ce.jpg)

## Now let's talk about data again.

Here's a basic dataset that we will use for our examples going forward.

**"Data Dictionary"**

The _data dictionary_ describes the contents, format, and structure of a given dataset.  You might notice this term come up from time to time to refer to the definition of a dataset's contents.

----
- **Weight**: Metric weight of subject
- **Height**: Metric height of subject
- **Success**: How likely a given person is to like **bacon** or **pancakes**
- **Category**: Category of food experiment performed in observed test

In [26]:
data = [
    (15, 150, .5,  "bacon"),
    (45, 150, .99, "stack"),
    (55, 150, .55, "bacon"),
    (42, 150, .98, "stack"),
    (56, 150, .66, "bacon"),
    (33, 225, .65, "bacon"),
    (44, 234, .89, "stack"),
]

columns = ["height", "weight", "success", "category"]
df = pd.DataFrame(data, columns = columns)
df

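| | height | weight | success | category |
|---|---|---|---|---|
| 0 | 15 | 150 | 0.5 | bacon |
| 1 | 45 | 150 | 0.99 | stack |
| 2 | 55 | 150 | 0.55 | bacon |
| 3 | 42 | 150 | 0.98 | stack |
| 4 | 56 | 150 | 0.66 | bacon |
| 5 | 33 | 225 | 0.65 | bacon |
| 6 | 44 | 234 | 0.89 | stack |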


# Grouping / Subsetting and Aggregation

Looking at subsets of data is fundamental to the exploratory analysis process.  There are times when you might see:

* Statistics that greatly contrast in your subset vs your overall population.
* Characteristics that are contrary to global assumptions about your dataset.
* Imputation of missing values from a specific aspect of your data.

Groupby is a handy operation that lets you examine subsets defined by a common variable.  The most fundamental of assumptions about how grouping works is that subsets are formed when values of a given variable are the same.

Let's check out a few examples using this dataset:

![](https://snag.gy/nA84ce.jpg)

If we wanted to subset our data by "category", the only 2 possible categories would be **"bacon"** and **"stack"** (for _Sam Stack_) because those are the unique values in that column / variable / feature.

What would these subsets look like?

![](https://snag.gy/anAB15.jpg)

If we're basing our subsets on the **"category"** variable, the first subset would be all the records in the dataset with the first unique value, **"bacon"**.

If we wanted to do this in Pandas, this would be the same as **"grouping by"** the **"category"**.  In fact, we would accomplish this in Pandas using the `.groupby()` method of the DataFrame.

#### Let's assign this to a variable and inspect it further.

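Notice that a `groupby` operation on its own does not display a table of data: it returns a `DataFrameGroupBy` object that describes how the rows are split.  That's ok for now.  We can inspect it through its `.groups` attribute.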

In [29]:
category_group = df.groupby('category')
category_group.groups

{'bacon': [0, 2, 4, 5], 'stack': [1, 3, 6]}

In [30]:
category_group.groups.keys()

dict_keys(['bacon', 'stack'])

Above, you can see that the subsets (ie: groups) **"bacon"** and **"stack"** are accounted for.  Each group has its own `index` reference and `dtype`.  The interesting detail to note is that the index references the rows from the original DataFrame.

![](https://snag.gy/tieBmj.jpg)

The second subset would include only rows (ie: axis = 0) having the same value for the **"category"** variable, which would be **"stack"**.

### Accessing each group

In [31]:
category_group.get_group('bacon')

| | height | weight | success | category |
|---|---|---|---|---|
| 0 | 15 | 150 | 0.5 | bacon |
| 2 | 55 | 150 | 0.55 | bacon |
| 4 | 56 | 150 | 0.66 | bacon |
| 5 | 33 | 225 | 0.65 | bacon |


In [32]:
category_group.get_group('stack')

| | height | weight | success | category |
|---|---|---|---|---|
| 1 | 45 | 150 | 0.99 | stack |
| 3 | 42 | 150 | 0.98 | stack |
| 6 | 44 | 234 | 0.89 | stack |
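
One way to build intuition for `get_group()` is to compare it with an ordinary boolean filter. The sketch below assumes a cut-down, hypothetical `foods` frame with the same "category" values; both approaches select the same rows and keep the original index labels.

```python
import pandas as pd

foods = pd.DataFrame(
    {
        "height": [15, 45, 55, 42],
        "category": ["bacon", "stack", "bacon", "stack"],
    }
)

by_group = foods.groupby("category").get_group("bacon")  # subset via groupby
by_mask = foods[foods["category"] == "bacon"]            # subset via boolean filter

print(by_group)
print(by_mask)
```

The filter is fine for a single subset; `groupby` becomes more convenient once you want every subset at the same time.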


### We can even iterate over these groups
Passing `print` to `.apply()` calls it on each group, so every subset is printed in turn.

In [33]:
df.groupby('category').apply(print)

   height  weight  success category
0      15     150     0.50    bacon
2      55     150     0.55    bacon
4      56     150     0.66    bacon
5      33     225     0.65    bacon
   height  weight  success category
1      45     150     0.99    stack
3      42     150     0.98    stack
6      44     234     0.89    stack
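
A grouped object can also be looped over directly. In this minimal sketch, each iteration yields the group's key and a DataFrame holding that group's rows. The `foods` frame and the `group_sizes` dictionary are assumptions for illustration.

```python
import pandas as pd

foods = pd.DataFrame(
    {
        "height": [15, 45, 55, 42],
        "category": ["bacon", "stack", "bacon", "stack"],
    }
)

group_sizes = {}
for name, group in foods.groupby("category"):
    # name is the group key; group is a DataFrame of the rows in that group
    group_sizes[name] = len(group)
    print(name, group.index.tolist())
```

The explicit loop is handy when each subset needs custom handling, such as writing it to a separate file or plotting it.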


### Aggregation

The most common case in which "grouping" or "subsetting" is employed is with _aggregation_.  Generally, we are going from many values (in), to a single value (out).  Conceptually _aggregation_ looks like this:
![image.png](attachment:image.png)

Let's take our "height" data from our "bacon" subset, then plug it into our aggregation concept to explore it further.

![](https://snag.gy/pOBE1V.jpg)
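
In code, the same idea can be sketched as follows. The heights are the four "bacon" rows from the dataset above; the variable names are assumptions for illustration.

```python
import pandas as pd

# Heights of the four "bacon" rows (many values in).
bacon_heights = pd.Series([15, 55, 56, 33], name="height")

total = bacon_heights.sum()     # a single value out: 159
average = bacon_heights.mean()  # a single value out: 39.75

print(total, average)
```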

### Aggregation with groups + Pandas

Just before this cell, we used a `mean` and `sum` aggregation.  Here it is with a _grouped_ DataFrame.

In [34]:
df.groupby('category').mean()

| category | height | weight | success |
|---|---|---|---|
| bacon | 39.75 | 168.75 | 0.59 |
| stack | 43.666667 | 178.0 | 0.953333 |


In [35]:
df.groupby('category').sum()

| category | height | weight | success |
|---|---|---|---|
| bacon | 159 | 675 | 2.36 |
| stack | 131 | 534 | 2.86 |


#### Perhaps it's helpful to review this again in the context of the "stack" group.

Let's just assume we're talking about the "stack" subset for a moment.  Each column _series_ (axis = 0) went through the `.sum()` aggregation function, within the "stack" subset.
![](https://snag.gy/nruAZ1.jpg)

### It's also possible to do aggregation over the entire dataset.
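
A minimal sketch of what this could look like, assuming the bacon/stack dataset defined earlier (rebuilt here so the example runs on its own). Selecting the numeric columns first is a choice made for this sketch, so the text column "category" is left out of the calculation.

```python
import pandas as pd

data = [
    (15, 150, .5,  "bacon"),
    (45, 150, .99, "stack"),
    (55, 150, .55, "bacon"),
    (42, 150, .98, "stack"),
    (56, 150, .66, "bacon"),
    (33, 225, .65, "bacon"),
    (44, 234, .89, "stack"),
]
df = pd.DataFrame(data, columns=["height", "weight", "success", "category"])

numeric = df[["height", "weight", "success"]]

overall_mean = numeric.mean()          # one value per column, no grouping
overall = numeric.agg(["mean", "sum"])  # several aggregations at once

print(overall_mean)
print(overall)
```

Without a `groupby`, the whole dataset behaves like one big group: every row contributes to each result.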

### Multiple aggregation over every variable.
In this case we are using NumPy functions that take multiple values in, returning a single value out.
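
One way this could look is sketched below, assuming the same bacon/stack dataset. The sketch passes the string names `"mean"` and `"sum"` to `.agg()`, which pandas accepts in place of callables such as `np.mean` and `np.sum`; treat that substitution as an assumption of this example.

```python
import pandas as pd

data = [
    (15, 150, .5,  "bacon"),
    (45, 150, .99, "stack"),
    (55, 150, .55, "bacon"),
    (42, 150, .98, "stack"),
    (56, 150, .66, "bacon"),
    (33, 225, .65, "bacon"),
    (44, 234, .89, "stack"),
]
df = pd.DataFrame(data, columns=["height", "weight", "success", "category"])

# Apply two aggregations to every numeric column, within each category.
summary = df.groupby("category")[["height", "weight", "success"]].agg(["mean", "sum"])

print(summary)
```

The result has two levels of column labels: the variable name on top and the aggregation name beneath it, so a single cell is addressed with a pair such as `("height", "mean")`.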

### Aggregation over specific variables.
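
A sketch of two common ways to do this, again assuming the bacon/stack dataset. The particular pairing of columns and aggregations (mean of "height", maximum of "success") is an arbitrary choice for illustration.

```python
import pandas as pd

data = [
    (15, 150, .5,  "bacon"),
    (45, 150, .99, "stack"),
    (55, 150, .55, "bacon"),
    (42, 150, .98, "stack"),
    (56, 150, .66, "bacon"),
    (33, 225, .65, "bacon"),
    (44, 234, .89, "stack"),
]
df = pd.DataFrame(data, columns=["height", "weight", "success", "category"])

# A dict maps each column to the aggregation it should receive.
per_column = df.groupby("category").agg({"height": "mean", "success": "max"})

# Selecting one column from the grouped object aggregates just that column.
height_mean = df.groupby("category")["height"].mean()

print(per_column)
print(height_mean)
```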

# Conclusion

- When would you use map vs apply (Bonus: Can they be aggregation functions or not)?
- How can we implement aggregation with Pandas?
- What is the hardest topic we've learned today?
- Can you think of any good uses for `groupby`?