# 5. More Boolean Indexing

## Boolean Selection on a Series

In [1]:
import pandas as pd
bikes = pd.read_csv('data/bikes.csv', parse_dates=['starttime', 'stoptime'], index_col='starttime')

In [2]:
temp = bikes['temperature']
temp.head()

starttime
2013-06-28 19:01:00    73.9
2013-06-28 22:53:00    69.1
2013-06-30 14:43:00    73.0
2013-07-01 10:05:00    72.0
2013-07-01 11:16:00    73.0
Name: temperature, dtype: float64

Select the temperatures that are less than 0 or greater than 95 by combining two boolean Series with the `|` (or) operator.

In [3]:
filt1 = temp < 0
filt2 = temp > 95
filt_all = filt1 | filt2

temp[filt_all].head()

starttime
2013-08-30 15:33:00    96.1
2013-08-30 15:37:00    96.1
2013-08-30 15:49:00    96.1
2013-12-12 05:13:00    -2.0
2014-01-23 06:15:00    -2.0
Name: temperature, dtype: float64
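
The two filters can also be written as a single expression. The sketch below uses a small made-up Series (`temp_demo` and its values are illustrative assumptions, not rows from the bikes data) to show that each comparison must be wrapped in parentheses, since `|` binds more tightly than `<` and `>`.

```python
import pandas as pd

# Hypothetical temperatures, not taken from the bikes data
temp_demo = pd.Series([73.9, -2.0, 96.1, 50.0, 95.0])

# Parentheses are required: | has higher precedence than < and >
extreme = (temp_demo < 0) | (temp_demo > 95)

# Keep only the values where the combined filter is True
result = temp_demo[extreme]
print(result)
```

Note that 95.0 itself is excluded because the comparison is strictly greater than.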

# Simultaneous boolean selection of rows and column labels with `.loc`
Earlier, it was stated that **`.loc`** made selections only by label. This wasn't strictly true as it is also able to do boolean selection along with selection by label.

Remember that **`.loc`** takes both a row selection and a column selection separated by a comma. Since the row selection comes first, you can pass it the exact same boolean Series that you would pass to plain brackets and get the same result.

Let's run some of the older examples of boolean selection with **`.loc`**.

In [4]:
filt = bikes['tripduration'] > 1000
bikes.loc[filt].head()

| starttime | trip_id | usertype | gender | stoptime | tripduration | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_name | latitude_end | longitude_end | dpcapacity_end | temperature | visibility | wind_speed | precipitation | events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-06-30 14:43:00 | 10927 | Subscriber | Male | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 41.909592 | -87.653497 | 15.0 | Dearborn St & Monroe St | 41.88132 | -87.629521 | 23.0 | 73.0 | 10.0 | 16.1 | -9999.0 | mostlycloudy |
| 2013-07-03 15:21:00 | 21028 | Subscriber | Male | 2013-07-03 15:42:00 | 1300 | Clinton St & Washington Blvd | 41.88338 | -87.64117 | 31.0 | Wood St & Division St | 41.90332 | -87.67273 | 15.0 | 71.1 | 8.0 | 0.0 | -9999.0 | cloudy |
| 2013-07-04 17:17:00 | 24383 | Subscriber | Male | 2013-07-04 17:42:00 | 1523 | Morgan St & 18th St | 41.858086 | -87.651073 | 15.0 | Damen Ave & Pierce Ave | 41.909396 | -87.677692 | 19.0 | 79.0 | 10.0 | 9.2 | -9999.0 | mostlycloudy |
| 2013-07-04 18:13:00 | 24673 | Subscriber | Male | 2013-07-04 18:42:00 | 1697 | Ashland Ave & Armitage Ave | 41.917859 | -87.668919 | 15.0 | Lincoln Ave & Armitage Ave | 41.918273 | -87.638116 | 19.0 | 79.0 | 10.0 | 10.4 | -9999.0 | mostlycloudy |
| 2013-07-05 10:02:00 | 26214 | Subscriber | Male | 2013-07-05 10:40:00 | 2263 | Jefferson St & Monroe St | 41.880422 | -87.642746 | 19.0 | Jefferson St & Monroe St | 41.880422 | -87.642746 | 19.0 | 79.0 | 10.0 | 0.0 | -9999.0 | partlycloudy |
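
To make the equivalence between plain brackets and **`.loc`** concrete, here is a minimal sketch on a tiny made-up DataFrame (`trips`, its index labels, and its values are assumptions for illustration only). It checks that both forms return the same rows.

```python
import pandas as pd

# Hypothetical trip data for illustration only
trips = pd.DataFrame(
    {'tripduration': [500, 1040, 1300, 727],
     'events': ['clear', 'cloudy', 'rain', 'rain']},
    index=['a', 'b', 'c', 'd'])

filt = trips['tripduration'] > 1000

by_brackets = trips[filt]   # boolean selection with just the brackets
by_loc = trips.loc[filt]    # the same boolean Series passed to .loc

# Both approaches select the same rows
print(by_brackets.equals(by_loc))
```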


## Separate row and column selection with a comma for `.loc`

In [5]:
filt = bikes['events'].isin(['rain', 'snow'])
cs = ['events', 'tripduration']

bikes.loc[filt, cs].head()

| starttime | events | tripduration |
|---|---|---|
| 2013-07-15 16:43:00 | rain | 727 |
| 2013-07-26 19:10:00 | rain | 1395 |
| 2013-07-30 18:53:00 | rain | 442 |
| 2013-08-05 17:09:00 | rain | 890 |
| 2013-09-07 16:09:00 | rain | 978 |
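
The column selection after the comma can be a list of labels or a single label. As a rough sketch on made-up data (the `trips` DataFrame below is hypothetical), a list returns a DataFrame while a single string returns a Series.

```python
import pandas as pd

# Hypothetical trip data for illustration only
trips = pd.DataFrame(
    {'tripduration': [500, 1040, 1300, 727],
     'events': ['clear', 'snow', 'rain', 'cloudy'],
     'gender': ['Male', 'Female', 'Male', 'Female']})

filt = trips['events'].isin(['rain', 'snow'])

# A list of column labels returns a DataFrame
as_frame = trips.loc[filt, ['events', 'tripduration']]

# A single column label returns a Series
as_series = trips.loc[filt, 'tripduration']

print(type(as_frame).__name__, type(as_series).__name__)
```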


## Column to Column Comparisons
So far, we have created conditionals by comparing each of our column values to a single scalar value. It is possible to do element-by-element comparisons by comparing two columns to one another.

For instance, to test whether the start station had more dock capacity than the end station, we would do the following:

In [6]:
filt = bikes['dpcapacity_start'] > bikes['dpcapacity_end']

Let's use this criterion with **`.loc`** to return all the rows where the start capacity is greater than the end capacity.

In [7]:
cs = ['dpcapacity_start', 'dpcapacity_end']
bikes.loc[filt, cs].head()

| starttime | dpcapacity_start | dpcapacity_end |
|---|---|---|
| 2013-06-28 22:53:00 | 31.0 | 19.0 |
| 2013-07-02 17:47:00 | 31.0 | 19.0 |
| 2013-07-03 15:21:00 | 31.0 | 15.0 |
| 2013-07-07 00:06:00 | 19.0 | 15.0 |
| 2013-07-08 17:06:00 | 23.0 | 19.0 |
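
One detail worth knowing about column-to-column comparisons is how missing values behave. The sketch below uses a small made-up DataFrame (`docks` and its values are assumptions, not the real data) to show that the comparison is done row by row and that any comparison involving a missing value evaluates to False.

```python
import numpy as np
import pandas as pd

# Hypothetical capacities; the third start value is missing
docks = pd.DataFrame({'dpcapacity_start': [31.0, 15.0, np.nan, 19.0],
                      'dpcapacity_end': [19.0, 23.0, 19.0, 19.0]})

# Element-by-element comparison of two columns
filt = docks['dpcapacity_start'] > docks['dpcapacity_end']

# The row with the missing value is False, as is the row where both are equal
print(filt.tolist())
```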


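### Boolean selection with `.iloc` does not accept a boolean Series
**`.iloc`** raises an error when given a boolean Series such as the filters created above. It does accept a plain boolean array or list, so converting the filter first (for example with `filt.to_numpy()`) works, but **`.loc`** and plain brackets remain the usual tools for boolean selection.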
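
As a hedged illustration of how **`.iloc`** treats boolean input, the sketch below uses a made-up DataFrame (`trips` is hypothetical). In the pandas versions this was written against, passing the boolean Series itself raises an error, while a NumPy boolean array of the same values is accepted. The exact exception type is treated as an assumption, so the code catches both candidates.

```python
import pandas as pd

# Hypothetical trip data for illustration only
trips = pd.DataFrame(
    {'tripduration': [500, 1040, 1300, 727],
     'events': ['clear', 'snow', 'rain', 'cloudy']},
    index=['a', 'b', 'c', 'd'])

filt = trips['tripduration'] > 1000

# Passing the boolean Series directly is rejected by .iloc
try:
    trips.iloc[filt]
    series_worked = True
except (ValueError, NotImplementedError):
    series_worked = False

# Converting the Series to a NumPy array first is accepted;
# the column selection must then be by integer position
selected = trips.iloc[filt.to_numpy(), [0]]

print(series_worked)
print(selected)
```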

# Finding Missing Values with `isna`
The **`isna`** method called from either a DataFrame or a Series returns True for every value that is missing and False for any other value. 

In [8]:
bikes['dpcapacity_start'].isna().head()

starttime
2013-06-28 19:01:00    False
2013-06-28 22:53:00    False
2013-06-30 14:43:00    False
2013-07-01 10:05:00    False
2013-07-01 11:16:00    False
Name: dpcapacity_start, dtype: bool

### Filtering for missing values 

In [9]:
filt = bikes['dpcapacity_start'].isna()
bikes[filt]

| starttime | trip_id | usertype | gender | stoptime | tripduration | from_station_name | latitude_start | longitude_start | dpcapacity_start | to_station_name | latitude_end | longitude_end | dpcapacity_end | temperature | visibility | wind_speed | precipitation | events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-09-06 07:52:00 | 7319012 | Subscriber | Male | 2015-09-06 07:55:00 | 207 | Clark St & 9th St (AMLI) | NaN | NaN | NaN | Federal St & Polk St | 41.872078 | -87.629544 | 19.0 | 75.0 | 10.0 | 4.6 | -9999.0 | mostlycloudy |
| 2015-09-07 09:52:00 | 7341764 | Subscriber | Female | 2015-09-07 09:57:00 | 293 | Clark St & 9th St (AMLI) | NaN | NaN | NaN | Wabash Ave & 8th St | 41.871962 | -87.626106 | 19.0 | 81.0 | 10.0 | 8.1 | -9999.0 | mostlycloudy |
| 2015-09-15 08:25:00 | 7468970 | Subscriber | Male | 2015-09-15 08:33:00 | 473 | Clark St & 9th St (AMLI) | NaN | NaN | NaN | Franklin St & Monroe St | 41.881469 | -87.635177 | 27.0 | 68.0 | 10.0 | 9.2 | -9999.0 | mostlycloudy |
| 2015-10-03 21:43:00 | 7780399 | Subscriber | Female | 2015-10-03 22:04:00 | 1268 | Clark St & 9th St (AMLI) | NaN | NaN | NaN | Ritchie Ct & Banks St | 41.906782 | -87.626402 | 15.0 | 50.0 | 10.0 | 15.0 | -9999.0 | cloudy |
| 2015-11-06 08:53:00 | 8207553 | Subscriber | Female | 2015-11-06 09:08:00 | 868 | Clark St & 9th St (AMLI) | NaN | NaN | NaN | Fairbanks Ct & Grand Ave | 41.89186 | -87.62062 | 15.0 | 48.0 | 10.0 | 18.4 | -9999.0 | mostlycloudy |
| 2015-11-18 08:55:00 | 8317161 | Subscriber | Male | 2015-11-18 09:01:00 | 359 | Clark St & 9th St (AMLI) | NaN | NaN | NaN | Dearborn St & Adams St | 41.879356 | -87.629791 | 19.0 | 60.1 | 8.0 | 25.3 | 0.0 | rain |
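
Because the result of **`isna`** is a boolean Series, it can be summed to count missing values or inverted with `~` to keep only the non-missing ones. A minimal sketch on a made-up Series (`capacity` and its values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical capacities with two missing values
capacity = pd.Series([15.0, np.nan, 31.0, np.nan])

missing = capacity.isna()

# True counts as 1 and False as 0, so the sum is the number of missing values
n_missing = missing.sum()

# ~ inverts the filter to select the values that are present
present = capacity[~missing]

print(n_missing)
print(present)
```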


## `isnull` is an alias for `isna`
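
A quick check on a made-up Series (the values are assumptions for illustration) suggests the two methods can be used interchangeably:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# isnull and isna produce identical boolean Series
same = s.isnull().equals(s.isna())
print(same)
```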

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Select the wind speed column as a Series and assign it to a variable. Are there any negative wind speeds?</span>

In [13]:
ws = bikes['wind_speed']
wsfilt = ws < 0

# Filter the wind speed Series itself, not the temperature Series
ws[wsfilt].head()

starttime
2016-03-19 10:08:00   -9999.0
2016-06-30 11:47:00   -9999.0
2016-07-21 21:02:29   -9999.0
2016-08-07 09:16:42   -9999.0
2016-08-07 09:29:44   -9999.0
Name: wind_speed, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Select the events and gender columns for all trip durations longer than 1,000 seconds.</span>

In [31]:
newf = bikes['tripduration'] > 1000
cs = ['events', 'gender', 'tripduration']

bikes.loc[newf, cs].head(10)


| starttime | events | gender | tripduration |
|---|---|---|---|
| 2013-06-30 14:43:00 | mostlycloudy | Male | 1040 |
| 2013-07-03 15:21:00 | cloudy | Male | 1300 |
| 2013-07-04 17:17:00 | mostlycloudy | Male | 1523 |
| 2013-07-04 18:13:00 | mostlycloudy | Male | 1697 |
| 2013-07-05 10:02:00 | partlycloudy | Male | 2263 |
| 2013-07-06 09:43:00 | partlycloudy | Male | 1365 |
| 2013-07-09 13:12:00 | cloudy | Male | 5396 |
| 2013-07-12 01:07:00 | clear | Male | 1043 |
| 2013-07-12 18:13:00 | partlycloudy | Male | 1616 |
| 2013-07-14 14:08:00 | partlycloudy | Female | 6274 |


### Problem 3
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index. We will use this DataFrame for the rest of the problems. Select all the movies such that the Facebook likes for actor 2 are greater than those for actor 1.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Select the year, content rating, and IMDB score columns for movies from the year 2016 with IMDB score less than 4.</span>

In [None]:
# your code here