# Filtering, Transforming, Applying

### Objectives
After this lesson you should be able to...
+ Know how to use the primary groupby methods **`agg`**, **`filter`**, **`transform`** and **`apply`**
+ Know the differences between **`agg`**, **`filter`**, **`transform`** and **`apply`**
+ Know what object is implicitly passed to **`agg`**, **`filter`**, **`transform`** and **`apply`**

### Prepare for this lesson by
+ Reading the rest of the [split apply combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation) documentation from Transformation until the end

## groupby summary


There are four primary methods that you will use once you group your DataFrame columns. The following command is the **generic form**. The table below summarizes **`agg`**, **`filter`**, **`transform`** and **`apply`**.

**`df.groupby(['columns', 'to', 'group']).agg/filter/transform/apply(functions to apply)`**

<table>
    <thead>
        <tr>
            <td><b>groupby Method</b></td>
            <td><b>Description</b></td>
            <td><b>Object Passed to Function</b></td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>agg</td>
            <td>By definition an aggregation means to take many values and reduce them to a single value.<br>
                The function passed to agg must return a single value. <br>
                Can specify a different aggregation for each column.</td>
            <td>Series</td>
        </tr>
        
        <tr>
            <td>filter</td>
            <td>The function must return a boolean. Each group is either kept or discarded.<br> 
                The number of columns always remains the same.</td>
            <td>DataFrame</td>
        </tr>
        
        <tr>
            <td>transform</td>
            <td>The function is applied column by column and keeps the length of the data the same.<br></td>
            <td>Series</td>
        </tr>
        
        <tr>
            <td>apply</td>
            <td>Use only when above methods don't work.<br> 
                   When you want to work with multiple columns within each group.</td>
            <td>DataFrame</td>
        </tr>
    </tbody>
</table>
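
The rows of the table can be compared side by side with a minimal sketch. The DataFrame below is made-up toy data (the column names `state`, `ugds` and `cost` are assumptions for illustration, not part of the college dataset), and the lambdas are only one possible choice of function for each method.

```python
import pandas as pd

# Hypothetical toy data, not the college dataset used in this lesson
df = pd.DataFrame({'state': ['TX', 'TX', 'CA', 'CA', 'CA'],
                   'ugds': [100, 200, 50, 60, 70],
                   'cost': [10.0, 20.0, 30.0, 40.0, 50.0]})
grouped = df.groupby('state')

# agg: one value per group, with a different aggregation per column
agg_result = grouped.agg({'ugds': 'sum', 'cost': 'mean'})

# filter: the function gets each group's DataFrame and returns a boolean
filter_result = grouped.filter(lambda sub_df: sub_df['ugds'].sum() > 200)

# transform: the function gets one column of one group as a Series
# and returns a Series of the same length
transform_result = grouped['ugds'].transform(lambda s: s - s.mean())

# apply: the function gets each group's DataFrame and may combine columns
apply_result = grouped[['ugds', 'cost']].apply(
    lambda sub_df: (sub_df['ugds'] * sub_df['cost']).sum())

print(agg_result)
print(filter_result)
print(transform_result)
print(apply_result)
```

Note how only `agg` and `apply` (with a scalar-returning function) shrink the data to one row per group, while `filter` and `transform` keep the original row index.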

## Filtering out groups

**`groupby`** objects come with a **`filter`** method that...
1. Scans each group independently
2. Applies a function to each group and returns a boolean value
3. Keeps or drops each group based on the boolean value returned from the function
4. Returns the original DataFrame (same number of columns) with certain groups filtered out

The **`filter`** method accepts a function that returns either True or False for each group. This result is used to filter the original DataFrame.

The function that **`filter`** accepts is usually a **custom** function that is **implicitly** passed a DataFrame for each group.

Anything can happen inside the body of the function passed to **`filter`** but it must return **`True`** or **`False`**. This boolean value determines whether the group is included or is dropped from the final resulting DataFrame.

### Find states with more than 300,000 undergraduate students
To help provide some context, we will use the **`filter`** method to find states that have more than 300,000 undergraduate students. **`filter`** is passed a function that sums all the undergraduate students for each state, compares this sum against 300,000 and returns **`True`** or **`False`**. Only states that have more than 300,000 students will remain.

None of the values in the DataFrame are mutated. Only rows are dropped with **`filter`**.

In [2]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

college = pd.read_csv('data/college.csv')

In [3]:
# Define function that accepts a dataframe of the current group
# it must return a boolean

def filter_ugds(df):
    return df['UGDS'].sum() > 300000

In [4]:
# use filter
college_filtered = college.groupby('STABBR').filter(filter_ugds)

In [5]:
# see the difference in size
print(college.shape)
print(college_filtered.shape)

(7535, 27)
(4619, 27)
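
The same pattern can be tried without the college file. This is a small sketch on made-up enrollment numbers (the state codes, the values and the threshold of 300 are assumptions chosen so the result is easy to check by eye).

```python
import pandas as pd

# Made-up enrollment numbers for illustration only
toy = pd.DataFrame({'STABBR': ['AA', 'BB', 'AA', 'CC', 'BB'],
                    'UGDS': [150, 40, 200, 500, 30]})

def filter_ugds_toy(df):
    # df holds the rows of one state; return a single boolean
    return df['UGDS'].sum() > 300

toy_filtered = toy.groupby('STABBR').filter(filter_ugds_toy)

# Rows of 'BB' (total 70) are dropped; the remaining rows keep
# their original index labels, order and values
print(toy_filtered)
```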


### Pass additional parameters to the filtering function
At first glance the **`filter`** method from above appears to only allow the passed function to contain a single argument. This is not the case. You can pass any number of arguments to the function inside of **`filter`**.

Let's take a look at the **`filter`** docstrings for a moment. Since **`filter`** is a chained method the docstring intelligence tricks (shift + tab + tab, etc...) are not available to us. We can use the help function to output the docstrings into the notebook.

In [6]:
help(college.groupby('STABBR').filter)

Help on method filter in module pandas.core.groupby:

filter(func, dropna=True, *args, **kwargs) method of pandas.core.groupby.DataFrameGroupBy instance
    Return a copy of a DataFrame excluding elements from groups that
    do not satisfy the boolean criterion specified by func.
    
    Parameters
    ----------
    f : function
        Function to apply to each subframe. Should return True or False.
    dropna : Drop groups that do not pass the filter. True by default;
        if False, groups that evaluate False are filled with NaNs.
    
    Notes
    -----
    Each subframe is endowed the attribute 'name' in case you need to know
    which group you are working on.
    
    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
    ...                           'foo', 'bar'],
    ...                    'B' : [1, 2, 3, 4, 5, 6],
    ...                    'C' : [2.0, 5., 8., 1., 2., 9.]})
    >>> grouped = df.groupby('A')
 

### \*args and \*\*kwargs
Notice in the method header the two additional parameters, \*args and \*\*kwargs, that are not formally mentioned in the parameter list. They will not be explained in detail here, but there is a great [stackoverflow post](http://stackoverflow.com/questions/3394835/args-and-kwargs) where you can learn all about them.
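
As a brief orientation, the sketch below shows the general Python mechanism in isolation. Both function names (`describe_call` and `forward`) are hypothetical and only illustrate how extra arguments are collected and passed along; they are not part of pandas.

```python
def describe_call(first, *args, **kwargs):
    # args is a tuple of the extra positional arguments
    # kwargs is a dict of the extra keyword arguments
    return first, args, kwargs

def forward(func, first, *args, **kwargs):
    # Unpack the collected extras again when calling func
    return func(first, *args, **kwargs)

result = forward(describe_call, 'group', 1, 2, threshold=300000)
print(result)  # ('group', (1, 2), {'threshold': 300000})
```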

### What it means for `filter`
You can pass additional arguments to the function that **`filter`** accepts. This becomes useful when we want to customize the filter based on some parameter.

Let's rebuild the filtering function to accept a **`num_students`** argument that allows more flexibility when determining which states to filter out.

In [7]:
# define new function with additional arguments
def filter_ugds_param(df, num_students):
    return df['UGDS'].sum() > num_students

In [8]:
college_filtered2 = college.groupby('STABBR').filter(filter_ugds_param, num_students=500000)

In [9]:
print(college_filtered2.shape)
print(college_filtered.shape)
print(college.shape)

(3319, 27)
(4619, 27)
(7535, 27)
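
A self-contained version of the same idea, on made-up data, might look like the sketch below. The thresholds 50 and 400 are arbitrary. One thing worth noticing from the method header printed by `help` above: `dropna` comes before `*args`, so an extra positional argument would be read as `dropna`. Passing the extra argument by keyword, as done here, avoids that ambiguity.

```python
import pandas as pd

# Made-up enrollment numbers for illustration only
toy = pd.DataFrame({'STABBR': ['AA', 'BB', 'AA', 'CC', 'BB'],
                    'UGDS': [150, 40, 200, 500, 30]})

def filter_ugds_param(df, num_students):
    return df['UGDS'].sum() > num_students

grouped = toy.groupby('STABBR')

# The keyword argument is forwarded to filter_ugds_param for every group
low_bar = grouped.filter(filter_ugds_param, num_students=50)
high_bar = grouped.filter(filter_ugds_param, num_students=400)

print(len(low_bar), len(high_bar))
```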


### Groupby with Series

Pandas Series also have a **`groupby`** method. You might wonder how it is possible to group and aggregate a single column of data. It becomes possible when you consider the **index** and the many levels that an index can have. Multi-level indexes will be discussed in another notebook.

To make the index meaningful set it to be one of the columns of the DataFrame with the **`set_index`** method.

In [10]:
# create a new dataframe with a more interesting index
college_state_index = college.set_index('STABBR')

college_state_index.head()

STABBR,INSTNM,CITY,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
AL,Alabama A & M University,Normal,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
AL,University of Alabama at Birmingham,Birmingham,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
AL,Amridge University,Montgomery,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
AL,University of Alabama in Huntsville,Huntsville,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
AL,Alabama State University,Montgomery,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Grouping by the Index

The **`groupby`** method works with an index exactly as it does with a column, provided the index has a name. If the index has no name, use the **`level`** parameter to specify the integer position of the level.

In [11]:
# create a Series by selecting one column
s = college_state_index['UGDS']

s.head()

STABBR
AL     4206.0
AL    11383.0
AL      291.0
AL     5451.0
AL     4811.0
Name: UGDS, dtype: float64

In [12]:
# the name can be retrieved with the name attribute of the index
s.index.name

'STABBR'

In [13]:
# groupby as normal but with the level argument
# get the mean of the undergrad student population

s.groupby(level=0).mean().head(10)

STABBR
AK    2493.200000
AL    2789.865169
AR    1644.146341
AS    1276.000000
AZ    4130.468254
CA    3518.308397
CO    2324.880342
CT    1873.550562
DC    2645.277778
DE    2491.052632
Name: UGDS, dtype: float64

#### The level parameter is no longer needed as of pandas 0.20
Pass the name of the index level to the **`groupby`** method of the Series and it will be grouped by that level.

In [18]:
s.groupby('STABBR').mean().head(10)

STABBR
AK    2493.200000
AL    2789.865169
AR    1644.146341
AS    1276.000000
AZ    4130.468254
CA    3518.308397
CO    2324.880342
CT    1873.550562
DC    2645.277778
DE    2491.052632
Name: UGDS, dtype: float64

In [19]:
# should give the same result as dataframe
college.groupby('STABBR')['UGDS'].mean().head(10)

STABBR
AK    2493.200000
AL    2789.865169
AR    1644.146341
AS    1276.000000
AZ    4130.468254
CA    3518.308397
CO    2324.880342
CT    1873.550562
DC    2645.277778
DE    2491.052632
Name: UGDS, dtype: float64
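
The equivalence of these three spellings can be checked on a small made-up frame. The sketch below assumes a pandas version of at least 0.20, so that grouping a Series by the name of its index works; the data values are invented.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'STABBR': ['AA', 'BB', 'AA', 'BB'],
                    'UGDS': [100.0, 10.0, 300.0, 30.0]})
s = toy.set_index('STABBR')['UGDS']

by_level = s.groupby(level=0).mean()               # position of the index level
by_name = s.groupby('STABBR').mean()               # name of the index level
by_column = toy.groupby('STABBR')['UGDS'].mean()   # column of the DataFrame

print(by_level)
print(by_level.equals(by_name), by_level.equals(by_column))
```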

## Transforming and not aggregating groups

The most common operation to apply to a group is some kind of aggregation, since a single-number summary of a group is usually the primary concern. There are, however, cases where you want to transform the entire group and keep every row. The **`transform`** groupby method applies a (usually) custom function to each column of each group and returns data that is the same length as the group.

One of the most common operations is to standardize data - that is transform a numerical group so that its mean is 0 and standard deviation is 1. This is [done in the documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation). 
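
A minimal sketch of group-wise standardization is shown here on made-up data. The function name `standardize` and the column name `UGDS_Z` are assumptions for illustration, and the standard deviation used is pandas' default sample standard deviation.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'STABBR': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                    'UGDS': [100.0, 200.0, 300.0, 10.0, 20.0, 60.0]})

def standardize(s):
    # Subtract the group mean and divide by the group standard deviation
    return (s - s.mean()) / s.std()

toy['UGDS_Z'] = toy.groupby('STABBR')['UGDS'].transform(standardize)

# Within each state the new column should have mean 0 and std 1
check = toy.groupby('STABBR')['UGDS_Z'].agg(['mean', 'std'])
print(check.round(6))
```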

### Finding the percentage of the total for each group
The function passed to **`transform`** implicitly receives a Series for each group and must return a Series of the same length as that group. Here, our transform function takes in the Series of undergrads for each state, divides it by the state total and returns this quantity as a Series.

In [24]:
def pct_total(s):
    return s / s.sum()

ugds_pct_total = college.groupby('STABBR')['UGDS'].transform(pct_total)
ugds_pct_total.head()

0    0.016939
1    0.045844
2    0.001172
3    0.021953
4    0.019376
Name: UGDS, dtype: float64

In [26]:
# it's the same length as the DataFrame
len(ugds_pct_total), len(college)

(7535, 7535)

### Adding a new column with transform
Transform is great if you would like to append a new column to the DataFrame.

In [27]:
college['UGDS_PCT_TOTAL'] = ugds_pct_total
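
As an alternative sketch, `transform` also accepts the name of a built-in aggregation such as `'sum'`, which broadcasts each group's total back to every row. The made-up data below checks that this gives the same percentages as a custom function.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'STABBR': ['AA', 'BB', 'AA', 'BB'],
                    'UGDS': [100.0, 10.0, 300.0, 30.0]})
grouped = toy.groupby('STABBR')['UGDS']

# Custom function, as in the lesson
custom = grouped.transform(lambda s: s / s.sum())

# Built-in aggregation name: each row receives its group's total
builtin = toy['UGDS'] / grouped.transform('sum')

toy['UGDS_PCT_TOTAL'] = builtin
print(toy)
```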

Find the colleges that represent the highest percentage of their state's undergraduate population.

In [31]:
college[['INSTNM', 'STABBR', 'UGDS', 'UGDS_PCT_TOTAL']].sort_values('UGDS_PCT_TOTAL', ascending=False).head(15)

Unnamed: 0,INSTNM,STABBR,UGDS,UGDS_PCT_TOTAL
4216,University of the Virgin Islands,VI,1971.0,1.0
4141,Northern Marianas College,MP,1120.0,1.0
4138,American Samoa Community College,AS,1276.0,1.0
4214,College of Micronesia-FSM,FM,2344.0,1.0
4561,College of the Marshall Islands,MH,1078.0,1.0
4215,Palau Community College,PW,602.0,1.0
4140,University of Guam,GU,3607.0,0.634812
60,University of Alaska Anchorage,AK,12865.0,0.516004
4137,University of Wyoming,WY,9910.0,0.40141
691,University of Delaware,DE,18222.0,0.384999


# Case Study: Tracking Weight Loss per Month
There are two friends interested in tracking their weight over the course of several months. To provide motivation they decide to wager some money each month. The friend who loses the highest percentage of body weight each month wins that month. Each month is independent from the others so the weight loss percentage resets at the start of the month.

Below is the data that was collected over the course of the bet.

In [32]:
# create fake data
np.random.seed(1234)
all_weight = np.zeros(32)
all_weight[::2] = 300 * np.cumprod(np.random.rand(16) * .05 + .96)
all_weight[1::2] = 200 * np.cumprod(np.random.rand(16) * .05 + .96)
all_weight = np.round(all_weight, 0).astype(int)


df_weight = pd.DataFrame({'Name':['Bob', 'Amy'] * 16, 
                          'Month': ('Jan ' * 8 + ' Feb' * 8 + ' Mar' * 8 + ' Apr' * 8).split(),
                          'Week' : (['Week 1'] * 2 + ['Week 2'] * 2 + ['Week 3'] * 2 + ['Week 4'] * 2) * 4,
                          'Weight':all_weight},
                        columns=['Name', 'Month', 'Week', 'Weight'])

df_weight

Unnamed: 0,Name,Month,Week,Weight
0,Bob,Jan,Week 1,291
1,Amy,Jan,Week 1,197
2,Bob,Jan,Week 2,288
3,Amy,Jan,Week 2,189
4,Bob,Jan,Week 3,283
5,Amy,Jan,Week 3,189
6,Bob,Jan,Week 4,283
7,Amy,Jan,Week 4,190
8,Bob,Feb,Week 1,283
9,Amy,Feb,Week 1,186


### Transform
A new bet is begun at the start of each month and the winner for each month is declared by percentage weight loss achieved from week 1 to week 4. Percentage weight loss resets to zero at the beginning of each month.

We need to find a way to track the percentage weight loss within each month for each person. Since each week will have a weight loss percentage, no aggregation is performed and instead a new value is needed for each row. This situation calls for **`transform`**.

The default behavior of transform is to apply the same function (given as an argument) to each non-grouped column of the DataFrame. Since columns are Series, it is a Series that is passed to the transformation function.

In [33]:
# define a custom function that is passed a Series and returns
# the percentage loss relative to the first week

def find_perc_loss(s):
    return (s - s.iloc[0]) / s.iloc[0]

In [34]:
# group by name and month
# apply transformation
# and only transform weight column
df_weight['percent_month_loss'] = df_weight.groupby(['Name', 'Month'])['Weight'].transform(find_perc_loss) 

# view first two months. Notice that percent loss resets to 0
df_weight.head(16)

Unnamed: 0,Name,Month,Week,Weight,percent_month_loss
0,Bob,Jan,Week 1,291,0.0
1,Amy,Jan,Week 1,197,0.0
2,Bob,Jan,Week 2,288,-0.010309
3,Amy,Jan,Week 2,189,-0.040609
4,Bob,Jan,Week 3,283,-0.027491
5,Amy,Jan,Week 3,189,-0.040609
6,Bob,Jan,Week 4,283,-0.027491
7,Amy,Jan,Week 4,190,-0.035533
8,Bob,Feb,Week 1,283,0.0
9,Amy,Feb,Week 1,186,0.0
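
The same calculation can be sketched without a custom function by broadcasting each group's first value with `transform('first')`. The weights below are made up. One caveat of this variant: `'first'` takes the first non-missing value in each group, so it agrees with `s.iloc[0]` only when the first weigh-in is present.

```python
import pandas as pd

# Made-up weights for illustration only
toy = pd.DataFrame({'Name': ['Bob', 'Amy', 'Bob', 'Amy', 'Bob', 'Amy'],
                    'Month': ['Jan', 'Jan', 'Jan', 'Jan', 'Feb', 'Feb'],
                    'Weight': [300, 200, 294, 190, 290, 188]})
grouped = toy.groupby(['Name', 'Month'])['Weight']

def find_perc_loss(s):
    return (s - s.iloc[0]) / s.iloc[0]

# Custom function, as in the lesson
custom = grouped.transform(find_perc_loss)

# Broadcast the first weight of each (Name, Month) group to every row
first_weight = grouped.transform('first')
builtin = (toy['Weight'] - first_weight) / first_weight

print(pd.DataFrame({'custom': custom, 'builtin': builtin}))
```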


In [35]:
# it might be easier to read if sorted.
# note that Month is sorted lexicographically, not in calendar order
df_weight_final = df_weight.sort_values(['Name', 'Month', 'Week'])

df_weight_final

Unnamed: 0,Name,Month,Week,Weight,percent_month_loss
25,Amy,Apr,Week 1,166,0.0
27,Amy,Apr,Week 2,164,-0.012048
29,Amy,Apr,Week 3,164,-0.012048
31,Amy,Apr,Week 4,161,-0.03012
9,Amy,Feb,Week 1,186,0.0
11,Amy,Feb,Week 2,184,-0.010753
13,Amy,Feb,Week 3,177,-0.048387
15,Amy,Feb,Week 4,173,-0.069892
1,Amy,Jan,Week 1,197,0.0
3,Amy,Jan,Week 2,189,-0.040609
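
If calendar order is wanted instead, one option is to convert the month column to an ordered categorical before sorting. This is a sketch on made-up rows, and the list `month_order` is an assumption that must be written out by hand.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({'Name': ['Amy'] * 4,
                    'Month': ['Mar', 'Jan', 'Apr', 'Feb'],
                    'Weight': [170, 190, 161, 173]})

# An ordered categorical sorts by the order of its categories,
# not alphabetically
month_order = ['Jan', 'Feb', 'Mar', 'Apr']
toy['Month'] = pd.Categorical(toy['Month'], categories=month_order, ordered=True)

toy_sorted = toy.sort_values(['Name', 'Month'])
print(toy_sorted)
```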


### Finding a winner

It's possible to manually find the winner of each month by comparing each person's week 4. For instance, Amy lost 7.0% of her body weight in February compared to Bob's 5.3% and won that month.

Doing this by hand is tedious and unnecessary when pandas is available. We need to reshape the data so that each person's week 4 is easily comparable. There are many ways to reshape data. The **`pivot`** and **`pivot_table`** DataFrame methods convert **long** formatted data into **wide** formatted data by turning the values of one column into column names. More will be said on this later.

First the above DataFrame will be filtered for only week 4 and then pivoted to make the comparison easy.

In [36]:
df_weight_week4 = df_weight_final[df_weight_final['Week'] == 'Week 4']

df_weight_week4

Unnamed: 0,Name,Month,Week,Weight,percent_month_loss
31,Amy,Apr,Week 4,161,-0.03012
15,Amy,Feb,Week 4,173,-0.069892
7,Amy,Jan,Week 4,190,-0.035533
23,Amy,Mar,Week 4,170,-0.028571
30,Bob,Apr,Week 4,250,-0.038462
14,Bob,Feb,Week 4,268,-0.053004
6,Bob,Jan,Week 4,283,-0.027491
22,Bob,Mar,Week 4,261,-0.033333


In [37]:
# use pivot to move the Name column
df_weight_winner = df_weight_week4.pivot(index='Month', columns='Name', values='percent_month_loss')

df_weight_winner

Month,Amy,Bob
Apr,-0.03012,-0.038462
Feb,-0.069892,-0.053004
Jan,-0.035533,-0.027491
Mar,-0.028571,-0.033333
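
What `pivot` does, and how it differs from `pivot_table`, can be seen on a smaller made-up table. In this sketch the column name `loss` and all values are assumptions. `pivot` only reshapes and raises an error when an index/column pair appears more than once, whereas `pivot_table` aggregates the duplicates.

```python
import pandas as pd

# Made-up long-format data for illustration only
long_df = pd.DataFrame({'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
                        'Name': ['Amy', 'Bob', 'Amy', 'Bob'],
                        'loss': [-0.04, -0.03, -0.07, -0.05]})

# Values of 'Name' become the column names of the wide table
wide_df = long_df.pivot(index='Month', columns='Name', values='loss')
print(wide_df)

# Repeat one row so that the pair (Jan, Amy) occurs twice
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)

try:
    dup.pivot(index='Month', columns='Name', values='loss')
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the duplicate values to keep
    pivot_failed = True

# pivot_table resolves duplicates with an aggregation function
table = dup.pivot_table(index='Month', columns='Name', values='loss', aggfunc='mean')
print(pivot_failed)
print(table)
```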


### Column for winner with np.where

Now that the reshaped data makes the winner easy to see, a final step is to create a column with the winner's name using the numpy **`where`** function. **`np.where`** takes an array of boolean values and returns an array containing the second argument wherever the array is True and the third argument wherever it is False.

In [38]:
# a trivial example
# Return 'Yes' when True and 'No' when False

np.where([True, False, False, True, False, True], 'Yes', 'No')

array(['Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
      dtype='<U3')

In [39]:
# make the winner Amy when her weight loss is more than Bob's and vice versa using np.where
df_weight_winner['Winner'] = np.where(df_weight_winner['Amy'] < df_weight_winner['Bob'], 'Amy', 'Bob')

df_weight_winner

Month,Amy,Bob,Winner
Apr,-0.03012,-0.038462,Bob
Feb,-0.069892,-0.053004,Amy
Jan,-0.035533,-0.027491,Amy
Mar,-0.028571,-0.033333,Bob
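
`np.where` handles two participants well. If there were more than two, one possible alternative is `idxmin` along the columns, which returns the column label holding the smallest (most negative) value in each row. The sketch below uses made-up losses and a hypothetical third participant `Cal`, and it must run before any non-numeric column such as `Winner` is added.

```python
import pandas as pd

# Made-up percentage losses for illustration only; 'Cal' is hypothetical
losses = pd.DataFrame({'Amy': [-0.030, -0.070],
                       'Bob': [-0.038, -0.053],
                       'Cal': [-0.050, -0.010]},
                      index=['Apr', 'Feb'])

# The most negative value in each row marks the largest weight loss
winner = losses.idxmin(axis=1)
print(winner)
```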


In [40]:
# count the months won by each person
# useful when there are many months of data

df_weight_winner['Winner'].value_counts()

Bob    2
Amy    2
Name: Winner, dtype: int64

# Problem Set
We will be working with the city of Houston dataset for the questions in this notebook. Run the following command before attempting the problems.

In [43]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
employee = pd.read_csv('data/employee.csv')
employee.head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
1,364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
2,1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
3,8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
4,8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


### Problem 1
<span  style="color:green; font-size:16px">What are the 5 least common departments?</span>

In [44]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Filter out departments with fewer than 50 occurrences and save the result to **`employee_filter`**. Then test your code by outputting the frequencies of all the remaining departments. </span>

In [45]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Filter out departments from the original **`employee`** DataFrame with average salaries less than $70,000 and save the result to **`employee_filter_salary`**. Then test your code by outputting the average salaries for the remaining departments.</span>

In [46]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Filter *`for`* those departments from the original **`employee`** DataFrame with average salaries of at least 65,000 or having at least 25 unique position titles. Save result to **`employee_more`**</span>

In [47]:
# your code here

### Problem 5: Advanced
<span  style="color:green; font-size:16px">Find a way to do problem 4 without using the **`filter`** method. Make clever use of aggregate groupby and boolean logic</span>

In [48]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px"> Create a column **`is_max`** that is equal to 1 if the base salary is currently the max base salary (out of all previous rows) for that department and 0 otherwise. See sample data below.</span>

In [49]:
# your code here

In [None]:
# Return a dataframe that looks like this
''' 
DEPARTMENT    BASE_SALARY  is_max  
Library           160         1
Police            150         1
Library           170         1
Police             95         0
Police            140         0
Library            80         0
Police            189         1
'''

### Problem 7: Advanced
<span  style="color:green; font-size:16px"> Programmatically find the 10th occurrence of 0 for **`is_max`** and return a DataFrame that ends after the tenth occurrence.</span>

In [50]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px"> Write a function that accepts a single argument that will filter **`employee1`** for a specific department where **`is_max`** is 1. Test your function with departments like 'Library' and 'Public Works & Engineering-PWE'.</span>

In [51]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">A good skill to have is to ask yourself a difficult question and then answer it. Ask yourself a question that involves grouping and answer it.</span>

In [52]:
# your code here