# Masking and Filtering a Data Frame

This notebook shows how to select the rows of a data frame that satisfy a condition.
The same idea, filtering rows with a condition, carries over to spreadsheets and other tools.

This notebook introduces several new concepts:

- Booleans (a computer value for true or false)
- Boolean arrays
- Boolean arrays as an index
- Logical AND operations on Boolean arrays
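As a quick preview of these four concepts, here is a minimal sketch. The `energy` values and variable names are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the four concepts.
energy = pd.Series([10, 30, 50, 60])

is_high = 50 >= 50                      # a single Boolean
mask = energy >= 50                     # a Boolean array: one True/False per element
high_values = energy[mask]              # the Boolean array used as an index
both = (energy >= 50) & (energy < 60)   # AND: True only where both conditions hold

print(is_high)                 # True
print(mask.tolist())           # [False, False, True, True]
print(high_values.tolist())    # [50, 60]
print(both.tolist())           # [False, False, True, False]
```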

In [1]:
data = '''
household,dorm,phone_energy,laptop_energy
A,tuscany,10,50
B,sauv,30,60
C,tuscany,12,45
D,sauv,20,50
'''

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from io import StringIO
from tabulate import tabulate

df = pd.read_csv(StringIO(data))
tidy_df = pd.melt(df, id_vars='dorm', 
        value_vars=['phone_energy', 'laptop_energy'],
        var_name='energy_use',
        value_name='energy_kWh')
tidy_df

|   | dorm | energy_use | energy_kWh |
|---|---|---|---|
| 0 | tuscany | phone_energy | 10 |
| 1 | sauv | phone_energy | 30 |
| 2 | tuscany | phone_energy | 12 |
| 3 | sauv | phone_energy | 20 |
| 4 | tuscany | laptop_energy | 50 |
| 5 | sauv | laptop_energy | 60 |
| 6 | tuscany | laptop_energy | 45 |
| 7 | sauv | laptop_energy | 50 |


We can retrieve individual rows by their index.
Each row has an index label, the number shown in the leftmost column, which you can use to get that row of the data.
The result below is the first row of data, but it looks a little different because a single row is displayed as a Series, with one field per line.

In [18]:
tidy_df.loc[0]  # .loc selects by index label; the older .ix was removed in pandas 1.0

dorm               tuscany
energy_use    phone_energy
energy_kWh              10
Name: 0, dtype: object
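Selecting by label and selecting by position are different operations, even though they coincide when the index is 0, 1, 2, and so on. The sketch below uses a small made-up frame (`df_demo`, with hypothetical labels 10, 20, 30) to show the difference between `.loc` (label) and `.iloc` (position).

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"dorm": ["tuscany", "sauv", "tuscany"], "energy_kWh": [10, 30, 12]},
    index=[10, 20, 30],  # hypothetical labels, deliberately not 0, 1, 2
)

first_by_label = df_demo.loc[10]      # the row whose index label is 10
first_by_position = df_demo.iloc[0]   # the row in position 0, whatever its label

print(first_by_label.equals(first_by_position))  # True
print(first_by_label["dorm"])                    # tuscany
```

With this index, `df_demo.loc[0]` would raise a `KeyError`, because no row carries the label 0.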

Now suppose we want every observation where the energy is at or above some minimum level, say 50 kWh.
We can find the rows with 50 kWh or more by eye and then pass a list of their index labels to `.loc`, the label-based indexer that replaces the older `ix`.

In [19]:
tidy_df.loc[[4, 5, 7]]

|   | dorm | energy_use | energy_kWh |
|---|---|---|---|
| 4 | tuscany | laptop_energy | 50 |
| 5 | sauv | laptop_energy | 60 |
| 7 | sauv | laptop_energy | 50 |


This works, but it is not practical when we have lots of data.
Another approach is to let the computer answer the question "is the energy at least 50 kWh?" for each row, which gives us one true or false value per row.

In [14]:
tidy_df['energy_kWh'] >= 50

0    False
1    False
2    False
3    False
4     True
5     True
6    False
7     True
Name: energy_kWh, dtype: bool
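A Boolean Series like this can also be summarized directly, because `True` counts as 1 and `False` as 0 in arithmetic. The sketch below rebuilds only the `energy_kWh` column from the table above; the variable names are illustrative.

```python
import pandas as pd

# The energy_kWh column from the tidy data frame above.
energy_kWh = pd.Series([10, 30, 12, 20, 50, 60, 45, 50], name="energy_kWh")

mask = energy_kWh >= 50

n_matching = int(mask.sum())     # number of rows that satisfy the condition
any_matching = bool(mask.any())  # does at least one row satisfy it?

print(n_matching)    # 3
print(any_matching)  # True
```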

This may seem like magic, but we can use the true and false values from the comparison `tidy_df['energy_kWh'] >= 50` to get the rows of the data frame that we want.

In [16]:
tidy_df[tidy_df['energy_kWh'] >= 50]

|   | dorm | energy_use | energy_kWh |
|---|---|---|---|
| 4 | tuscany | laptop_energy | 50 |
| 5 | sauv | laptop_energy | 60 |
| 7 | sauv | laptop_energy | 50 |


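Here we do the same thing by writing out the true and false values by hand.
Note that all three approaches (row labels, the comparison, and the hand-written list) give you the same answer.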

In [17]:
tidy_df[[False, False, False, False, True, True, False, True]]

|   | dorm | energy_use | energy_kWh |
|---|---|---|---|
| 4 | tuscany | laptop_energy | 50 |
| 5 | sauv | laptop_energy | 60 |
| 7 | sauv | laptop_energy | 50 |
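To check that the three selections really are identical rather than comparing them by eye, something like the following could be used. It is a sketch that rebuilds the tidy table directly (under the assumed name `tidy`) so that it runs on its own.

```python
import pandas as pd

# The tidy table from above, rebuilt directly so this cell is self-contained.
tidy = pd.DataFrame({
    "dorm": ["tuscany", "sauv"] * 4,
    "energy_use": ["phone_energy"] * 4 + ["laptop_energy"] * 4,
    "energy_kWh": [10, 30, 12, 20, 50, 60, 45, 50],
})

by_label = tidy.loc[[4, 5, 7]]                  # rows picked out by hand
by_condition = tidy[tidy["energy_kWh"] >= 50]   # rows picked by the comparison
by_hand = tidy[[False, False, False, False, True, True, False, True]]

all_equal = by_label.equals(by_condition) and by_condition.equals(by_hand)
print(all_equal)  # True
```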


# Combining Conditions

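Let's say you want to find energy use of at least 50 kWh in the tuscany dorm.
You can put more than one condition inside the square brackets.
Pay special attention to the syntax and characters: each condition is wrapped in its own parentheses, and the conditions are joined with `&`.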

In [21]:
tidy_df[(tidy_df['energy_kWh']>=50) & (tidy_df['dorm']=='tuscany')]

|   | dorm | energy_use | energy_kWh |
|---|---|---|---|
| 4 | tuscany | laptop_energy | 50 |
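The same pattern extends to other logic operations. As a sketch beyond what this notebook covers, `|` means OR and `~` means NOT. Naming each mask first (the names `high` and `in_tuscany` are illustrative) avoids the parentheses that are otherwise required because `&` binds more tightly than `>=` and `==`.

```python
import pandas as pd

# The tidy table from above, rebuilt directly so this cell is self-contained.
tidy = pd.DataFrame({
    "dorm": ["tuscany", "sauv"] * 4,
    "energy_use": ["phone_energy"] * 4 + ["laptop_energy"] * 4,
    "energy_kWh": [10, 30, 12, 20, 50, 60, 45, 50],
})

high = tidy["energy_kWh"] >= 50
in_tuscany = tidy["dorm"] == "tuscany"

both = tidy[high & in_tuscany]            # AND: high energy in tuscany
either = tidy[high | in_tuscany]          # OR: high energy, or in tuscany, or both
high_elsewhere = tidy[high & ~in_tuscany] # NOT: high energy outside tuscany

print(both.index.tolist())            # [4]
print(either.index.tolist())          # [0, 2, 4, 5, 6, 7]
print(high_elsewhere.index.tolist())  # [5, 7]
```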
