# 5. More Boolean Indexing

### Objectives

+ Boolean Selection with the brackets on a Series
+ Using the `between` method instead of an `and` condition
+ Simultaneously select rows with boolean selection and columns with a list of names with `.loc`
+ Select rows with missing values with the `isna` method

## Boolean Selection on a Series
All the examples thus far have taken place on the bikes DataFrame. Boolean selection on a Series happens almost identically. Since there is only one dimension of data, the queries you ask are usually going to be simpler.

First, let’s select a single column of data as a Series such as the temperature column.

In [23]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])

In [6]:
temp = bikes['temperature']
temp.head()

0    73.9
1    69.1
2    73.0
3    72.0
4    73.0
Name: temperature, dtype: float64

Let's select temperatures greater than 90.

In [7]:
temp[temp > 90].head()

54    91.0
55    91.0
56    91.0
61    93.0
62    93.0
Name: temperature, dtype: float64

Select temperatures less than 0 or greater than 95.

In [8]:
temp[(temp < 0) | (temp > 95)].head()

395     96.1
396     96.1
397     96.1
1871    -2.0
2049    -2.0
Name: temperature, dtype: float64
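
The same pattern can be tried without `bikes.csv`. The following is a minimal sketch on a small made-up Series (`temp_demo` and its values are hypothetical, not taken from the dataset). It shows that a comparison first produces a boolean Series, which is then passed to the brackets.

```python
import pandas as pd

# Hypothetical temperatures; not taken from bikes.csv
temp_demo = pd.Series([73.9, 91.0, -2.0, 96.1, 60.0], name='temperature')

# A comparison returns a boolean Series with the same index
hot = temp_demo > 90

# Combine conditions with | (or); each condition needs its own parentheses
extreme = (temp_demo < 0) | (temp_demo > 95)

# Passing the boolean Series to the brackets keeps only the True rows
hot_values = temp_demo[hot]
extreme_values = temp_demo[extreme]
print(hot_values)
print(extreme_values)
```

The parentheses matter because `|` binds more tightly than the comparison operators in Python.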

## Re-read data with `starttime` in the index
The default index is not very helpful. Let's re-read the data with **`starttime`** in the index. While this column may not be unique, it does provide us with useful information for the index.

In [25]:
bikes = pd.read_csv('../data/bikes.csv', 
                    parse_dates=['starttime', 'stoptime'], 
                    index_col='starttime')
bikes.head()

starttime,trip_id,usertype,gender,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2013-06-28 19:01:00,7147,Subscriber,Male,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
2013-06-28 22:53:00,7524,Subscriber,Male,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2013-06-30 14:43:00,10927,Subscriber,Male,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
2013-07-01 10:05:00,12907,Subscriber,Male,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
2013-07-01 11:16:00,13168,Subscriber,Male,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


In [26]:
temp2 = bikes['temperature']
temp2.head()

starttime
2013-06-28 19:01:00    73.9
2013-06-28 22:53:00    69.1
2013-06-30 14:43:00    73.0
2013-07-01 10:05:00    72.0
2013-07-01 11:16:00    73.0
Name: temperature, dtype: float64

Let's select temperatures greater than 90. We expect to get a summer month and we do.

In [29]:
temp2[temp2 > 90].head()

starttime
2013-07-16 15:13:00    91.0
2013-07-16 15:31:00    91.0
2013-07-16 16:35:00    91.0
2013-07-17 17:08:00    93.0
2013-07-17 17:25:00    93.0
Name: temperature, dtype: float64

Select temperatures less than 0 or greater than 95. We expect the sub-zero readings to fall in winter months and the readings above 95 in summer months, and they do.

In [30]:
temp2[(temp2 < 0) | (temp2 > 95)].head()

starttime
2013-08-30 15:33:00    96.1
2013-08-30 15:37:00    96.1
2013-08-30 15:49:00    96.1
2013-12-12 05:13:00    -2.0
2014-01-23 06:15:00    -2.0
Name: temperature, dtype: float64
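
To see how a datetime index carries through boolean selection without the full file, here is a small sketch. The three-row CSV text and the names `csv_text`, `demo`, and `temp_demo` are assumptions made up for illustration; only the `read_csv` arguments mirror the call used above.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for bikes.csv
csv_text = """starttime,temperature
2013-07-16 15:13:00,91.0
2013-12-12 05:13:00,-2.0
2014-05-01 09:00:00,60.1
"""

# Parse starttime as dates and use it as the index
demo = pd.read_csv(io.StringIO(csv_text),
                   parse_dates=['starttime'],
                   index_col='starttime')
temp_demo = demo['temperature']

# The filtered Series keeps its datetime labels
extreme = temp_demo[(temp_demo < 0) | (temp_demo > 90)]

# The month of each remaining label can be read off the index
months = extreme.index.month.tolist()
print(extreme)
print(months)
```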

## The `between` method
The `between` method returns a boolean Series by testing whether each value is between two given values. For instance, if we want to select the temperatures between 50 and 60 degrees (inclusive), we do the following:

In [37]:
criteria = temp2.between(50, 60)
criteria.head()

starttime
2013-06-28 19:01:00    False
2013-06-28 22:53:00    False
2013-06-30 14:43:00    False
2013-07-01 10:05:00    False
2013-07-01 11:16:00    False
Name: temperature, dtype: bool

In [39]:
temp2[criteria].head()

starttime
2013-09-13 07:55:00    54.0
2013-09-13 08:04:00    57.9
2013-09-13 08:04:00    57.9
2013-09-13 08:06:00    57.9
2013-09-13 08:22:00    57.9
Name: temperature, dtype: float64
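
As a quick check that `between` can stand in for an `and` condition, the sketch below compares the two on a made-up Series (`temp_demo` and its values are hypothetical). With the default arguments, both endpoints are included.

```python
import pandas as pd

# Hypothetical temperatures chosen to sit on and around the endpoints
temp_demo = pd.Series([49.9, 50.0, 54.0, 60.0, 60.1])

# between includes both endpoints by default
with_between = temp_demo.between(50, 60)

# The equivalent condition written with & (and)
with_and = (temp_demo >= 50) & (temp_demo <= 60)

# Both approaches produce the same boolean Series
same_mask = with_between.equals(with_and)
selected = temp_demo[with_between]
print(same_mask)
print(selected)
```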

# Simultaneous boolean selection of rows and column labels with `.loc`
The **`.loc`** indexer was covered thoroughly in an earlier notebook. Here we use it to select rows and columns simultaneously. Earlier, it was stated that **`.loc`** makes selections only by label. This isn't strictly true, as it can also do boolean selection along with selection by label.

Remember that **`.loc`** takes both a row selection and a column selection separated by a comma. Since the row selection comes first, you can pass it the same exact inputs that you do for just the brackets and get the same results.

Let's run some of the older examples of boolean selection with **`.loc`**.

In [41]:
bikes.loc[bikes['tripduration'] > 1000].head()

starttime,trip_id,usertype,gender,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2013-06-30 14:43:00,10927,Subscriber,Male,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
2013-07-03 15:21:00,21028,Subscriber,Male,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
2013-07-04 17:17:00,24383,Subscriber,Male,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy
2013-07-04 18:13:00,24673,Subscriber,Male,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.0,mostlycloudy
2013-07-05 10:02:00,26214,Subscriber,Male,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,41.880422,-87.642746,19.0,Jefferson St & Monroe St,41.880422,-87.642746,19.0,79.0,10.0,0.0,-9999.0,partlycloudy


In [42]:
criteria = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
bikes.loc[criteria].head()

starttime,trip_id,usertype,gender,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2013-07-15 16:43:00,66336,Subscriber,Male,2013-07-15 16:55:00,727,Greenwood Ave & 47th St,41.809835,-87.599383,15.0,State St & Harrison St,41.873958,-87.627739,19.0,82.9,10.0,5.8,0.0,rain
2013-07-21 16:35:00,89180,Subscriber,Male,2013-07-21 17:06:00,1809,Michigan Ave & Pearson St,41.89766,-87.62351,23.0,Millennium Park,41.881032,-87.624084,35.0,82.4,10.0,11.5,0.0,tstorms
2013-07-21 16:47:00,89228,Subscriber,Male,2013-07-21 17:03:00,999,Carpenter St & Huron St,41.894556,-87.653449,19.0,Carpenter St & Huron St,41.894556,-87.653449,19.0,82.4,10.0,11.5,0.0,tstorms
2013-07-23 00:16:00,95044,Subscriber,Female,2013-07-23 00:26:00,563,Wabash Ave & Roosevelt Rd,41.867173,-87.625955,19.0,Daley Center Plaza,41.884337,-87.630183,47.0,78.8,10.0,17.3,0.0,tstorms
2013-07-26 19:10:00,111568,Subscriber,Male,2013-07-26 19:33:00,1395,Larrabee St & Kingsbury St,41.897764,-87.642884,27.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,66.9,8.0,12.7,0.0,rain
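
The claim that **`.loc`** with only a row selection matches the plain brackets can be checked on a toy DataFrame. This is a sketch under assumptions: `demo`, its index labels, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature version of the bikes data
demo = pd.DataFrame({'tripduration': [993, 1040, 623, 1300],
                     'events': ['cloudy', 'rain', 'snow', 'cloudy']},
                    index=['a', 'b', 'c', 'd'])

long_trip = demo['tripduration'] > 1000

# The same boolean Series works in the brackets and in .loc
via_brackets = demo[long_trip]
via_loc = demo.loc[long_trip]

same_result = via_brackets.equals(via_loc)
print(via_loc)
print(same_result)
```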


## Separate row and column selection with a comma for `.loc`
The great benefit of **`.loc`** is that it allows us to simultaneously do boolean selection along the rows and make column selections by label.

Let's select just the events rain and snow and only the columns events and trip duration.

In [43]:
row_selection = bikes['events'].isin(['rain', 'snow'])
col_selection = ['events', 'tripduration']
bikes.loc[row_selection, col_selection].head()

starttime,events,tripduration
2013-07-15 16:43:00,rain,727
2013-07-26 19:10:00,rain,1395
2013-07-30 18:53:00,rain,442
2013-08-05 17:09:00,rain,890
2013-09-07 16:09:00,rain,978


It is not necessary (though it is cleaner) to assign the row and column selections to their own variables first.

In [44]:
bikes.loc[bikes['events'].isin(['rain', 'snow']), ['events', 'tripduration']].head()

starttime,events,tripduration
2013-07-15 16:43:00,rain,727
2013-07-26 19:10:00,rain,1395
2013-07-30 18:53:00,rain,442
2013-08-05 17:09:00,rain,890
2013-09-07 16:09:00,rain,978
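
A self-contained sketch of the same row-and-column selection follows. The DataFrame `demo` and its values are hypothetical. It also shows that passing a single column name as a string, rather than a list, returns a Series instead of a DataFrame.

```python
import pandas as pd

# Hypothetical miniature version of the bikes data
demo = pd.DataFrame({'events': ['rain', 'cloudy', 'snow', 'clear'],
                     'tripduration': [727, 993, 442, 130],
                     'gender': ['Male', 'Female', 'Male', 'Female']})

row_selection = demo['events'].isin(['rain', 'snow'])
col_selection = ['events', 'tripduration']

# Boolean Series before the comma, list of column labels after it
subset = demo.loc[row_selection, col_selection]

# A single label (not a list) for the columns returns a Series
durations = demo.loc[row_selection, 'tripduration']
print(subset)
print(durations)
```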


## Column to Column Comparisons
So far, we have created conditionals by comparing each of our column values to a single scalar value. It is possible to do element-by-element comparisons by comparing two columns to one another.

For instance, if we wanted to test whether there was more capacity at the start of the ride than at the end, we would do the following:

In [51]:
criteria = bikes['dpcapacity_start'] > bikes['dpcapacity_end']

Let's use this criteria with **`.loc`** to return all the rows where the start capacity is greater than the end.

In [52]:
bikes.loc[criteria, ['dpcapacity_start', 'dpcapacity_end']].head()

starttime,dpcapacity_start,dpcapacity_end
2013-06-28 22:53:00,31.0,19.0
2013-07-02 17:47:00,31.0,19.0
2013-07-03 15:21:00,31.0,15.0
2013-07-07 00:06:00,19.0,15.0
2013-07-08 17:06:00,23.0,19.0
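
The element-by-element nature of a column-to-column comparison is easier to see on a few rows. The sketch below uses a made-up DataFrame (`demo` and its values are assumptions). It also includes one missing value to show that a comparison involving `NaN` evaluates to False, so that row is not selected.

```python
import pandas as pd

# Hypothetical capacities; the last end capacity is missing
demo = pd.DataFrame({'dpcapacity_start': [11.0, 31.0, 19.0, 15.0],
                     'dpcapacity_end': [15.0, 19.0, 19.0, float('nan')]})

# Each row's start value is compared with the same row's end value
criteria = demo['dpcapacity_start'] > demo['dpcapacity_end']

selected = demo.loc[criteria, ['dpcapacity_start', 'dpcapacity_end']]
print(criteria)
print(selected)
```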


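### Boolean selection with `.iloc` does not work with a boolean Series
The pandas developers decided not to allow a boolean Series as a mask for **`.iloc`**, which selects strictly by integer position. It does accept a boolean array that has no index, such as the NumPy array returned by `criteria.to_numpy()`, but **`.loc`** remains the usual choice for boolean selection.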

In [55]:
bikes.iloc[criteria]

ValueError: iLocation based boolean indexing cannot use an indexable as a mask
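
The error above is triggered by passing a boolean Series, which carries an index. As a hedged aside, **`.iloc`** does accept a plain boolean NumPy array of the right length. The sketch below demonstrates this on a made-up DataFrame (`demo` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical capacities with a datetime index, as in the bikes data
demo = pd.DataFrame({'dpcapacity_start': [11.0, 31.0, 19.0],
                     'dpcapacity_end': [15.0, 19.0, 19.0]},
                    index=pd.to_datetime(['2013-06-28', '2013-06-29',
                                          '2013-06-30']))

criteria = demo['dpcapacity_start'] > demo['dpcapacity_end']

# Strip the index by converting the boolean Series to a NumPy array
mask_array = criteria.to_numpy()
via_iloc = demo.iloc[mask_array]

# .loc with the original Series gives the same rows
via_loc = demo.loc[criteria]
print(via_iloc)
```

In practice **`.loc`** is the more direct tool here, since it takes the boolean Series as is.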

# Finding Missing Values with `isna`
The **`isna`** method called from either a DataFrame or a Series returns True for every value that is missing and False for any other value. 

Let's see this in action by calling **`isna`** on the start capacity column.

In [68]:
bikes['dpcapacity_start'].isna().head()

starttime
2013-06-28 19:01:00    False
2013-06-28 22:53:00    False
2013-06-30 14:43:00    False
2013-07-01 10:05:00    False
2013-07-01 11:16:00    False
Name: dpcapacity_start, dtype: bool

### Filtering for missing values

We can now use this boolean Series to select all the rows where the capacity start column is missing. Verify that the `dpcapacity_start` values in the output are indeed missing.

In [69]:
criteria = bikes['dpcapacity_start'].isna()
bikes[criteria]

starttime,trip_id,usertype,gender,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2015-09-06 07:52:00,7319012,Subscriber,Male,2015-09-06 07:55:00,207,Clark St & 9th St (AMLI),,,,Federal St & Polk St,41.872078,-87.629544,19.0,75.0,10.0,4.6,-9999.0,mostlycloudy
2015-09-07 09:52:00,7341764,Subscriber,Female,2015-09-07 09:57:00,293,Clark St & 9th St (AMLI),,,,Wabash Ave & 8th St,41.871962,-87.626106,19.0,81.0,10.0,8.1,-9999.0,mostlycloudy
2015-09-15 08:25:00,7468970,Subscriber,Male,2015-09-15 08:33:00,473,Clark St & 9th St (AMLI),,,,Franklin St & Monroe St,41.881469,-87.635177,27.0,68.0,10.0,9.2,-9999.0,mostlycloudy
2015-10-03 21:43:00,7780399,Subscriber,Female,2015-10-03 22:04:00,1268,Clark St & 9th St (AMLI),,,,Ritchie Ct & Banks St,41.906782,-87.626402,15.0,50.0,10.0,15.0,-9999.0,cloudy
2015-11-06 08:53:00,8207553,Subscriber,Female,2015-11-06 09:08:00,868,Clark St & 9th St (AMLI),,,,Fairbanks Ct & Grand Ave,41.89186,-87.62062,15.0,48.0,10.0,18.4,-9999.0,mostlycloudy
2015-11-18 08:55:00,8317161,Subscriber,Male,2015-11-18 09:01:00,359,Clark St & 9th St (AMLI),,,,Dearborn St & Adams St,41.879356,-87.629791,19.0,60.1,8.0,25.3,0.0,rain
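
The same filtering can be reproduced on a tiny made-up DataFrame (`demo` and its values are hypothetical). Because True counts as 1 when summed, calling `sum` on the boolean Series also gives the number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stations; two of the capacities are missing
demo = pd.DataFrame({'from_station_name': ['A', 'B', 'C'],
                     'dpcapacity_start': [11.0, np.nan, np.nan]})

# True wherever the value is missing
criteria = demo['dpcapacity_start'].isna()

# Keep only the rows with a missing start capacity
missing_rows = demo[criteria]

# Summing a boolean Series counts its True values
n_missing = criteria.sum()
print(missing_rows)
print(n_missing)
```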


## `isnull` is an alias for `isna`
There is an identical method named **`isnull`** that you will see in other tutorials. It is an **alias** of **`isna`**, meaning it does the exact same thing under a different name. Either one is suitable to use, but I prefer **`isna`** because of the similarity of **na** to **NaN**, the representation of missing values.
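
A short check of the alias on a made-up Series (`s` is a hypothetical name) shows that the two methods return identical results:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# isna and isnull produce the same boolean Series
same = s.isna().equals(s.isnull())
print(same)
```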

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Select the wind speed column as a Series and assign it to a variable. Are there any negative wind speeds?</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Select all wind speeds between 12 and 16.</span>

In [70]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Select the events and gender columns for all trip durations longer than 1,000 seconds.</span>

In [71]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index. We will use this DataFrame for the rest of the problems. Select all the movies such that the Facebook likes for actor 2 are greater than those for actor 1.</span>

In [72]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Select the year, content rating, and IMDB score columns for movies from the year 2016 with IMDB score less than 4.</span>

In [73]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Select all the movies that are missing values for content rating.</span>

In [74]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Select all the movies that are missing both the gross and budget. Return just those columns to verify that those values are indeed missing.</span>

In [75]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Write a function `find_missing` that has three parameters, `df`, `col1` and `col2` where `df` is a DataFrame and `col1` and `col2` are column names. This function should return all the rows of the DataFrame where `col1` and `col2` are missing. Only return the two columns as well. Answer problem 7 with this function.</span>

In [None]:
# your code here