# Advanced Filtering and Making Complex New Columns

In this section we'll make new columns that require complex calculations, filters, or both. We can then filter on those new columns or simply use their values directly.

Obviously, we need our dataframe again.

In [None]:
import pandas as pd
import random

workout_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}
used_ids = []

for x in range(0, 500):
    id = random.randint(100000000, 999999999)
    while id in used_ids:
        id = random.randint(100000000, 999999999)
    used_ids.append(id)
    device = random.choice(['Skykandal', 'B-Wolf'])
    mu = random.randint(65, 85)
    min_rate = int(random.gauss(mu, 10))
    max_rate = int(random.gauss(mu + 55, 25))
    while max_rate <= min_rate:
        max_rate = int(random.gauss(mu + 55, 25))
    avg = random.gauss((max_rate + min_rate) / 2, (max_rate - min_rate) / 5)
    duration = random.randint(10, 90)
    exercise = random.choice(['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
                              'Weight training'])
    row = [device, min_rate, max_rate, avg, duration, exercise]
    workout_dict['ID'].append(id)
    workout_dict['Measurement Device'].append(row[0])
    workout_dict['Heart Rate Min'].append(row[1])
    workout_dict['Heart Rate Max'].append(row[2])
    workout_dict['Heart Rate Avg'].append(row[3])
    workout_dict['Duration of exercise (min)'].append(row[4])
    workout_dict['Exercise Type'].append(row[5])

df = pd.DataFrame(workout_dict)
df.head(10)

### Making New Data Labels
If we were interested in people with faster heart rates it would be easy enough to write a filter, say, `df[df['Heart Rate Avg'] >= 100]` that would return only those people. However, we might not be only interested in those people, but interested in the difference between people who have those faster heart rates and those who don't. In that case we might want to make a new column where people with an average heart rate over 100 were labeled "True" and those with a slower heart rate were labeled "False". (While we aren't there yet, this sort of thing would make it easy to get summary statistics for these groups separately.)

A quick reminder: making a new column is as simple as setting that column equal to something. `df['Something'] = 3` would make a column called "Something" where every value was 3.
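
As a quick sketch of that reminder, the block below uses a small made-up dataframe (`toy` and its values are illustrative assumptions, not the workout data) to show that a single value is repeated down the whole column, while a Series of the right length gives each row its own value.

```python
import pandas as pd

# hypothetical toy dataframe, not the workout data used in this section
toy = pd.DataFrame({'Heart Rate Avg': [88.0, 104.5, 121.0]})

# assigning a single value fills every row of the new column
toy['Something'] = 3

# assigning a Series of the same length gives each row its own value
toy['Doubled'] = toy['Heart Rate Avg'] * 2

print(toy)
```

The second form is the one the rest of this section builds on: anything that produces one value per row can become a column.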

We can use this basic syntax to make a new column with True/False values by passing an expression very much like a filter. Because we're going to want to try several variations on this, the code block below begins by making a copy of df that we'll modify, so we can start fresh by going back to df.

In [None]:
df2 = df.copy()
df2['Fast Heart Rate'] = df2['Heart Rate Avg'] > 100
df2.head(10)

This works up to a point. What if we wanted a column that showed us a range? That also works, using a more complex expression. 

#### Write an expression below that makes a column that is True only if someone's average heart rate is between 95 and 105. (Note: you'll need to put each comparison in its own parentheses before combining them with `&`, so that Python evaluates each comparison first and then combines the results into a single True or False for each row.)

In [None]:
# write your code here


The point where this begins to collapse is where we need more than two labels, or really anything complex. At this point we really want to pass the row to a function that can evaluate the value at a given column and return a value. However, if we attempt to loop over a dataframe we don't get the rows.

In [None]:
for x in df:
    print(x)

Instead, pandas supplies us with an `iterrows` method that lets us iterate over the rows. It's a method, so you call it as a function, and it produces an iterable. However, the row returned is not a simple list, and it's a copy not a view, and so changing it doesn't change the dataframe. Instead, you need to look up the original dataframe row by index and change that. It's a mess, and so there's a specific pandas method that handles this all much more cleanly.

However, if you want to see what you're missing, the code to do this the hard way is below.

In [None]:
# copy the dataframe, just because we don't want to mess up our real dataframe for this example
df3 = df.copy()

# make a new column and give it a default value
df3['Heart Rate Class'] = 'Middle'

# generate the iterrows object and iterate over it
# iterrows returns an iterable with two items in it, so we unpack that all at once
for index, row in df.iterrows():
    # check the average heart rate; if it shouldn't be the default, change it
    if row['Heart Rate Avg'] >= 110:
        # use loc with a tuple to access the original cell and change it
        df3.loc[(index, 'Heart Rate Class')] = 'Fast'
    elif row['Heart Rate Avg'] < 90:
        df3.loc[(index, 'Heart Rate Class')] = 'Slow'
        
df3.head(10)
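
To see why the loop above writes through `df3.loc` instead of changing `row` directly, here is a minimal sketch on a made-up dataframe (`toy` and its values are assumptions for illustration). It assumes a dataframe with mixed column types, like ours; pandas does not promise the same behavior for every dataframe, which is one more reason to avoid editing `row`.

```python
import pandas as pd

# hypothetical toy data with mixed column types, like the workout dataframe
toy = pd.DataFrame({'Measurement Device': ['Skykandal', 'B-Wolf', 'B-Wolf'],
                    'Heart Rate Avg': [88.0, 104.5, 121.0]})

for index, row in toy.iterrows():
    # row is a separate Series built for this pass, so this edit is thrown away
    row['Heart Rate Avg'] = 0

after_row_edit = toy['Heart Rate Avg'].tolist()

for index, row in toy.iterrows():
    # writing through toy.loc with the index label and column name does stick
    toy.loc[index, 'Heart Rate Avg'] = 0

after_loc_edit = toy['Heart Rate Avg'].tolist()

print(after_row_edit)
print(after_loc_edit)
```

The first list still holds the original values, while the second holds the zeros written through `loc`.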

Hopefully at this point you're ready to see the simple way.

The simple way uses a user-defined function and `apply`. `apply` is a dataframe method that has two main arguments to pay attention to: a function that `apply` sends everything to, and an axis that determines whether it sends rows or columns (0 is columns, 1 is rows).
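
The axis argument is easier to see on a tiny example. The sketch below uses a made-up dataframe and a hypothetical helper called `spread` (neither is part of the workout data) to show what the function receives in each case.

```python
import pandas as pd

# hypothetical toy dataframe with only numeric columns
toy = pd.DataFrame({'Heart Rate Min': [60, 72, 81],
                    'Heart Rate Max': [120, 150, 171]})

def spread(values):
    # values is a Series: a whole column when axis=0, a whole row when axis=1
    return values.max() - values.min()

# axis=0 sends each column to the function, so we get one result per column
per_column = toy.apply(spread, axis=0)

# axis=1 sends each row to the function, so we get one result per row
per_row = toy.apply(spread, axis=1)

print(per_column)
print(per_row)
```

Since we want one label per person, `axis=1` is the form we'll use for the rest of this section.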

The example below just demonstrates how apply works, without making a new column. We'll define a function that just prints the average heart rate in the row and then apply that function over each row.

In [None]:
def pointless_print(row):
    print(row['Heart Rate Avg'])
    
df.apply(pointless_print, axis=1)

If we altered the function slightly, so that it returns the value instead of printing it, we would get a Series.

In [None]:
def return_avg(row):
    return row['Heart Rate Avg']

df.apply(return_avg, axis=1)

Because the Series has one value for each row of the dataframe, we can easily add it as a column. Let's modify the function to classify heart rates into categories like the ones we used earlier, and then attach the output as a new column.

In [None]:
df4 = df.copy()

def classify_avg(row):
    if row['Heart Rate Avg'] < 90:
        rate = 'Slower'
    elif row['Heart Rate Avg'] < 110:
        rate = 'Middle'
    else:
        rate = 'Faster'
    return rate

df4['Heart Rate Class'] = df4.apply(classify_avg, axis=1)
df4.head()

As you can see, this allows us to do fairly complex processing, since we can hand off an entire row to a function of whatever complexity we need.

Before we go further, practice this yourself.
#### In the block below, calculate a new column called "Midpoint" that is the average of the heart rate maximum and minimum, using apply. (This can be done without using apply, but that's not the point of this exercise.)

In [None]:
# your code goes here


We can even get multiple columns at once. If the function returns multiple values and we pass the argument `result_type='expand'` we get a small dataframe back. We could either join this dataframe to our existing one, or, more simply, declare that this dataframe is several new columns in the existing dataframe. We do this the same way we accessed multiple columns at once, using a list of new columns. E.g., getting or setting columns A and B would be `df[['A', 'B']]`.

In the example below I will run this operation on a copy of the dataframe.

In [None]:
df5 = df.copy()

# what is important about this function is that it returns two items, not one
def multiple_returns(row):
    aerobic_exercise = True
    heart_rate_class = 'Middle'
    if row['Exercise Type'] == 'Weight training':
        aerobic_exercise = False
    if row['Heart Rate Avg'] < 90:
        heart_rate_class = 'Low'
    elif row['Heart Rate Avg'] > 110:
        heart_rate_class = 'High'
    return aerobic_exercise, heart_rate_class


# attach the output of apply to the dataframe
df5[['Aerobic Exercise', 'Heart Rate Class']] = df5.apply(multiple_returns, axis=1, result_type='expand')
df5.head(10)
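
For comparison, here is a rough sketch of the other option mentioned above: keeping the small dataframe that `apply` returns and joining it to the original. The toy data, the `two_values` function, and the column names are all made up for this illustration.

```python
import pandas as pd

# hypothetical toy dataframe, not the workout data
toy = pd.DataFrame({'Exercise Type': ['Running', 'Weight training'],
                    'Heart Rate Avg': [118.0, 86.5]})

def two_values(row):
    # returns two items per row, like multiple_returns does
    return row['Exercise Type'] != 'Weight training', row['Heart Rate Avg'] > 100

# with result_type='expand' the two items become two columns, named 0 and 1
expanded = toy.apply(two_values, axis=1, result_type='expand')

# give the new columns real names, then join on the shared index
expanded.columns = ['Aerobic Exercise', 'Fast Heart Rate']
joined = toy.join(expanded)

print(joined)
```

Assigning to a list of column names, as in the cell above, gets the same result in one step, which is why it is usually the simpler choice.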

`apply` can also be used on a single column. For instance, the code below will use a modified version of the `classify_avg` function we defined earlier on the Heart Rate Avg column, without a need to look up the average heart rate column in a row. We'll use yet another copy of df for this, and give it a classifying column.

In [None]:
def classify_avg(avg):
    if avg < 90:
        rate_class = 'Low'
    elif avg < 110:
        rate_class = 'Middle'
    else:
        rate_class = 'High'
    return rate_class


df6 = df.copy()

df6['Class'] = df6['Heart Rate Avg'].apply(classify_avg)
df6.head(10)

#### In the cell below, try passing a single column to apply that should return half the exercise time.

In [None]:
# your code goes here


`apply` also allows you to define functions on the spot. There's no need to do this, but there are times when it is useful. Below, we'll use the pre-existing `lower()` method to make the Exercise Type column labels all lowercase. Since `lower` is a method, we don't pass text to it as an argument; instead, it is attached to the text using dot notation. What we can do is make use of Python's `lambda` to make an on-the-spot function.

(If `lower` is unclear, think of it this way: if `text` is a text variable then `text.lower()` gives us the lowercase version of `text`.)

In [None]:
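# start from a copy of df6, which already has the Class column we'll use in the next cell
df7 = df6.copy()

df7['Exercise Type'] = df7['Exercise Type'].apply(lambda x: x.lower())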
df7.head()

The code below does exactly the same thing (to a different column), so don't feel like you have to use `lambda`. The form below takes more lines of code, but that's fine when you're starting out, and is easier to read for some people.

In [None]:
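def lowercase(text):
    # strings have a lower() method; there is no lowercase() method
    return text.lower()

# pass the named function itself instead of a lambda
df7['Class lower'] = df7['Class'].apply(lowercase)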
df7.head()

However, if you want to use lambda the format isn't too difficult. `lambda x:` means "we're making a function which takes an argument, x", so `lambda x: x.lower()` means "we're making a function that takes an argument, x, and then returns x.lower()".
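
As a small check of that reading, the sketch below applies a named function and the matching `lambda` to the same made-up Series (the `exercise` values are illustrative) and confirms that the two results are identical.

```python
import pandas as pd

# hypothetical Series of labels, similar to the Exercise Type column
exercise = pd.Series(['Running', 'Swimming', 'Weight training'])

def lowercase(text):
    # takes one argument and returns its lowercase version
    return text.lower()

# the named function and the lambda do the same job
with_def = exercise.apply(lowercase)
with_lambda = exercise.apply(lambda x: x.lower())

print(with_def.equals(with_lambda))
```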

At this point, you have a lot of different ways to use `apply`. Using `apply` to make a column is often a precursor to filtering on that new column. 

#### So, for practice using apply, assume that we know that B-Wolf devices consistently measure 2 bpm lower than Skykandal devices. Also, Skykandal devices react poorly to water, and measure 5 bpm too high if you're swimming. Write a block of code that bumps up all B-Wolf device heart rate measures by 2, subtracts 5 from Skykandal heart rate measures for swimmers only, and then filters on the corrected rates so that we only have people with average heart rates above 100.

In [None]:
# your code goes here


Now, it's time to look at summarizing the data from all of these filtering operations and new columns.