
In [None]:
# Run the following lines only if you are using Google Colab
!rm -r Pandas
!git clone https://github.com/Wuebbelt/Pandas.git

rm: cannot remove 'Pandas': No such file or directory
Cloning into 'Pandas'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 77 (delta 12), reused 75 (delta 10), pack-reused 0
Unpacking objects: 100% (77/77), done.


# Boolean Selection More

In this chapter, we explore several more possible ways to use boolean selection to filter data.

## Boolean selection on a Series

All of the examples thus far have taken place on DataFrames. Boolean selection on a Series happens almost identically. Since there is only one dimension of data, the queries you ask are usually going to be simpler. First, let's select a single column of data as a Series such as the temperature column from the bikes dataset.

In [None]:
import pandas as pd
bikes = pd.read_csv('Pandas/bikes.csv', parse_dates=['starttime', 'stoptime'])
temp = bikes['temperature']
temp.head(3)

0    73.9
1    69.1
2    73.0
Name: temperature, dtype: float64

Let's select temperatures greater than 90. The procedure is the same as with DataFrames. Create a boolean Series and pass that Series to *just the brackets*.

In [None]:
filt = temp > 90
temp[filt].head(3)

54    91.0
55    91.0
56    91.0
Name: temperature, dtype: float64

Select temperatures less than 0 or greater than 95. Boolean Series built from multiple conditions also work the same way.

In [None]:
filt1 = temp < 0
filt2 = temp > 95
filt = filt1 | filt2
temp[filt].head()

395     96.1
396     96.1
397     96.1
1871    -2.0
2049    -2.0
Name: temperature, dtype: float64
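
The same kind of selection can also be written as a single expression, provided each comparison is wrapped in parentheses, because `|` binds more tightly than `<` and `>`. Here is a minimal sketch using a small made-up Series (`sample` is hypothetical and not part of the bikes data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical temperatures, invented for illustration only
sample = pd.Series([73.9, -2.0, 96.1, 50.0, 95.0])

# Parentheses are required: | has higher precedence than < and >
extreme = sample[(sample < 0) | (sample > 95)]
print(extreme)
```

Note that 95.0 is not selected, since the comparison is strictly greater than 95.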

### Set the index as `starttime`

The default index is not very helpful. Let's use the `set_index` method to make the `starttime` column the new index. While this column may not be unique, it does provide us with useful labels for each row.

In [None]:
bikes2 = bikes.set_index('starttime')
bikes2.head(3)

starttime,trip_id,usertype,gender,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2013-06-28 19:01:00,7147,Subscriber,Male,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
2013-06-28 22:53:00,7524,Subscriber,Male,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2013-06-30 14:43:00,10927,Subscriber,Male,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
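
To see what a non-unique index means in practice, here is a small sketch on made-up data. The `rides` DataFrame and its values are hypothetical, chosen so that two rows share a start time. It checks uniqueness with the `is_unique` attribute of the index and shows that selecting a duplicated label returns every matching row.

```python
import pandas as pd

# Hypothetical rides; two of them share the same start time
rides = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-09-13 07:55',
                                 '2013-09-13 08:04',
                                 '2013-09-13 08:04']),
    'temperature': [54.0, 57.9, 57.9],
})

rides2 = rides.set_index('starttime')

# The 08:04 label appears twice, so the index is not unique
print(rides2.index.is_unique)

# Selecting a duplicated label with loc returns all rows carrying it
same_time = rides2.loc[pd.Timestamp('2013-09-13 08:04')]
print(same_time)
```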


Let's get back our temperature Series with its updated index.

In [None]:
temp2 = bikes2['temperature']
temp2.head()

starttime
2013-06-28 19:01:00    73.9
2013-06-28 22:53:00    69.1
2013-06-30 14:43:00    73.0
2013-07-01 10:05:00    72.0
2013-07-01 11:16:00    73.0
Name: temperature, dtype: float64

Let's select temperatures greater than 90. We expect to get a summer month and we do.

In [None]:
filt = temp2 > 90
temp2[filt].head(5)

starttime
2013-07-16 15:13:00    91.0
2013-07-16 15:31:00    91.0
2013-07-16 16:35:00    91.0
2013-07-17 17:08:00    93.0
2013-07-17 17:25:00    93.0
Name: temperature, dtype: float64

Select temperatures less than 0 or greater than 95. We expect to get some winter months in the result, and we do.

In [None]:
filt1 = temp2 < 0
filt2 = temp2 > 95
filt = filt1 | filt2
temp2[filt].head()

starttime
2013-08-30 15:33:00    96.1
2013-08-30 15:37:00    96.1
2013-08-30 15:49:00    96.1
2013-12-12 05:13:00    -2.0
2014-01-23 06:15:00    -2.0
Name: temperature, dtype: float64

## The `between` method

The `between` method returns a boolean Series by testing whether each value is between two given values. For instance, if we want to select the temperatures between 50 and 60 degrees, we do the following:

In [None]:
filt = temp2.between(50, 60)
filt.head(3)

starttime
2013-06-28 19:01:00    False
2013-06-28 22:53:00    False
2013-06-30 14:43:00    False
Name: temperature, dtype: bool

By default, the `between` method is inclusive of the given values, so temperatures of exactly 50 or 60 would be found in the result. We pass this boolean Series to *just the brackets* to complete the selection.

In [None]:
temp2[filt].head(3)

starttime
2013-09-13 07:55:00    54.0
2013-09-13 08:04:00    57.9
2013-09-13 08:04:00    57.9
Name: temperature, dtype: float64
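
One way to convince yourself that the endpoints are included by default is to compare `between` against the equivalent pair of comparisons joined with `&`. The sketch below uses a made-up Series (`sample` is hypothetical) with values sitting exactly on and just outside the boundaries.

```python
import pandas as pd

# Hypothetical values placed on and around the boundaries 50 and 60
sample = pd.Series([49.9, 50.0, 55.0, 60.0, 60.1])

by_method = sample.between(50, 60)
by_operators = (sample >= 50) & (sample <= 60)

# Both approaches produce the same boolean Series
print(by_method.equals(by_operators))  # True

# 50.0 and 60.0 are kept; 49.9 and 60.1 are not
print(sample[by_method])
```

If you need to exclude the endpoints, writing the comparisons yourself with `>` and `<` makes the intent explicit.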

## Simultaneous boolean selection of rows and column labels with `loc`
The `loc` indexer was thoroughly covered in an earlier chapter and will now be brought up again to show how it can simultaneously select rows with boolean selection and columns by label.

Remember that `loc` takes both a row selection and a column selection separated by a comma. Since the row selection comes first, you can pass it the same exact inputs that you do for *just the brackets* and get the same results. Let's run some of the previous examples of boolean selection with `loc`. Here, we select all rides with trip duration greater than 1,000.

In [None]:
filt = bikes['tripduration'] > 1000
bikes.loc[filt].head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy


Here, we select all rides where the weather event is rain, snow, tstorms, or sleet.

In [None]:
filt = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
bikes.loc[filt].head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
45,66336,Subscriber,Male,2013-07-15 16:43:00,2013-07-15 16:55:00,727,Greenwood Ave & 47th St,41.809835,-87.599383,15.0,State St & Harrison St,41.873958,-87.627739,19.0,82.9,10.0,5.8,0.0,rain
78,89180,Subscriber,Male,2013-07-21 16:35:00,2013-07-21 17:06:00,1809,Michigan Ave & Pearson St,41.89766,-87.62351,23.0,Millennium Park,41.881032,-87.624084,35.0,82.4,10.0,11.5,0.0,tstorms
79,89228,Subscriber,Male,2013-07-21 16:47:00,2013-07-21 17:03:00,999,Carpenter St & Huron St,41.894556,-87.653449,19.0,Carpenter St & Huron St,41.894556,-87.653449,19.0,82.4,10.0,11.5,0.0,tstorms


### Separate row and column selection with a comma for `loc`
The nice benefit of `loc` is that it allows us to simultaneously select rows with boolean selection and select columns by label. Let's select rides during rain or snow and the columns `events` and `tripduration`.

In [None]:
filt = bikes['events'].isin(['rain', 'snow'])
cols = ['events', 'tripduration']
bikes.loc[filt, cols].head()

Unnamed: 0,events,tripduration
45,rain,727
112,rain,1395
124,rain,442
161,rain,890
498,rain,978


Now let's find all female riders with trip duration greater than 5000 when it was cloudy. We'll only return the columns used during the boolean selection.

In [None]:
filt1 = bikes['gender'] == 'Female'
filt2 = bikes['tripduration'] > 5000
filt3 = bikes['events'] == 'cloudy'
filt = filt1 & filt2 & filt3
cols = ['gender', 'tripduration', 'events']
bikes.loc[filt, cols]

Unnamed: 0,gender,tripduration,events
2712,Female,79988,cloudy
14455,Female,7197,cloudy
22868,Female,13205,cloudy
36441,Female,19922,cloudy
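
The columns you return do not have to be the ones used to build the filter. As a minimal sketch on made-up data (the `rides` DataFrame below is hypothetical, not the bikes dataset), we filter on three columns but return only two of them.

```python
import pandas as pd

# Hypothetical rides, invented for illustration only
rides = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Female'],
    'tripduration': [7197, 6000, 420, 13205],
    'events': ['cloudy', 'cloudy', 'cloudy', 'rain'],
})

# Each condition is wrapped in parentheses before combining with &
filt = ((rides['gender'] == 'Female')
        & (rides['tripduration'] > 5000)
        & (rides['events'] == 'cloudy'))

# Row selection first, column selection second, separated by a comma
result = rides.loc[filt, ['gender', 'tripduration']]
print(result)
```

Only the first row satisfies all three conditions, and the `events` column is left out of the result even though it was part of the filter.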


## Column to column comparisons

So far, we have created filters by comparing each of our column values to a single scalar value. It is also possible to do element-by-element comparisons by comparing two columns to one another. For instance, the total bike capacity of the station at the start and at the end of the ride is stored in the `dpcapacity_start` and `dpcapacity_end` columns. If we wanted to test whether there was more capacity at the start of the ride than at the end, we would do the following:

In [None]:
filt = bikes['dpcapacity_start'] > bikes['dpcapacity_end']

Let's use this filter with `loc` to return all the rows where the start capacity is greater than the end.

In [None]:
cols = ['dpcapacity_start', 'dpcapacity_end']
bikes.loc[filt, cols].head(3)

Unnamed: 0,dpcapacity_start,dpcapacity_end
1,31.0,19.0
6,31.0,19.0
8,31.0,15.0
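
One detail worth keeping in mind with column-to-column comparisons is that a comparison involving a missing value evaluates to `False`. The sketch below illustrates this on a made-up DataFrame (`stations` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical capacities; the third start value is missing
stations = pd.DataFrame({
    'dpcapacity_start': [31.0, 15.0, np.nan, 19.0],
    'dpcapacity_end': [19.0, 23.0, 19.0, 19.0],
})

# Values are compared row by row
filt = stations['dpcapacity_start'] > stations['dpcapacity_end']
print(filt.tolist())  # [True, False, False, False]

more_at_start = stations.loc[filt]
print(more_at_start)
```

The row with the missing start capacity and the row where both capacities are equal are both excluded.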


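### Boolean selection with `iloc` does not work with a boolean Series

The pandas developers decided not to allow a boolean Series to be passed to `iloc`, so the following raises an error. `iloc` does accept a plain boolean array, such as `filt.to_numpy()`, but `loc` and *just the brackets* remain the usual tools for boolean selection.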

In [None]:
bikes.iloc[filt]

NotImplementedError: ignored
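
The error is triggered by passing a boolean *Series*. According to the pandas documentation, `iloc` does accept a plain boolean array, so converting the filter with `to_numpy` is one possible workaround. The sketch below uses a made-up DataFrame (`rides` is hypothetical); the exact exception type for the Series case may differ between pandas versions, which is why two types are caught.

```python
import pandas as pd

# Hypothetical capacities, invented for illustration only
rides = pd.DataFrame({
    'dpcapacity_start': [31.0, 15.0, 31.0],
    'dpcapacity_end': [19.0, 23.0, 15.0],
})
filt = rides['dpcapacity_start'] > rides['dpcapacity_end']

# Passing the boolean Series itself is rejected by iloc
try:
    rides.iloc[filt]
except (NotImplementedError, ValueError) as err:
    print(type(err).__name__)

# The underlying NumPy boolean array is accepted
by_position = rides.iloc[filt.to_numpy()]
print(by_position)
```

In most cases `rides.loc[filt]` or `rides[filt]` is simpler and gives the same rows.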

## Finding Missing Values with `isna`
The `isna` method called from either a DataFrame or a Series returns `True` for every value that is missing and `False` for any other value. Let's see this in action by calling `isna` on the start capacity column.

In [None]:
bikes['dpcapacity_start'].isna().head(3)

0    False
1    False
2    False
Name: dpcapacity_start, dtype: bool

### Filtering for missing values

We can now use this boolean Series to select all the rows where the capacity start column is missing. Verify that those values are indeed missing. 

In [None]:
filt = bikes['dpcapacity_start'].isna()
bikes[filt].head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
17566,7319012,Subscriber,Male,2015-09-06 07:52:00,2015-09-06 07:55:00,207,Clark St & 9th St (AMLI),,,,Federal St & Polk St,41.872078,-87.629544,19.0,75.0,10.0,4.6,-9999.0,mostlycloudy
17605,7341764,Subscriber,Female,2015-09-07 09:52:00,2015-09-07 09:57:00,293,Clark St & 9th St (AMLI),,,,Wabash Ave & 8th St,41.871962,-87.626106,19.0,81.0,10.0,8.1,-9999.0,mostlycloudy
17990,7468970,Subscriber,Male,2015-09-15 08:25:00,2015-09-15 08:33:00,473,Clark St & 9th St (AMLI),,,,Franklin St & Monroe St,41.881469,-87.635177,27.0,68.0,10.0,9.2,-9999.0,mostlycloudy
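
Because `True` counts as 1 when summed, the same boolean Series can also tell you how many values are missing, and inverting it with `~` selects the rows that are present. Here is a minimal sketch on made-up data (the `stations` DataFrame is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical stations; two of them have no recorded capacity
stations = pd.DataFrame({
    'from_station_name': ['Clark St & 9th St (AMLI)',
                          'Millennium Park',
                          'Clark St & 9th St (AMLI)'],
    'dpcapacity_start': [np.nan, 35.0, np.nan],
})

filt = stations['dpcapacity_start'].isna()

# Summing a boolean Series counts the True values
n_missing = filt.sum()
print(n_missing)

# Rows where the capacity is missing, and rows where it is not
missing_rows = stations[filt]
present_rows = stations[~filt]
print(missing_rows)
```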


### `isnull` is an alias for `isna`

There is an identical method named `isnull` that you will see in other tutorials. It is an **alias** of `isna`, meaning it does exactly the same thing under a different name. Either one is suitable to use, but I prefer `isna` because of its similarity to **NaN**, the representation of missing values. Other methods, such as `dropna` and `fillna`, also use 'na' in their names.

## Exercises

Continue to use the bikes dataset for the first few exercises.

### Exercise 1
<span  style="color:green; font-size:16px">Select the wind speed column as a Series and assign it to a variable. Are there any negative wind speeds?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Select all wind speed values between 12 and 16.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Select the `events` and `gender` columns for all trip durations longer than 1,000 seconds.</span>

Read in the movie dataset by executing the cell below and use it for the following exercises.

In [None]:
import pandas as pd
movie = pd.read_csv('Pandas/movie.csv', index_col='title')
movie.head(3)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8


### Exercise 4
<span  style="color:green; font-size:16px">Select all the movies such that the Facebook likes for actor 2 are greater than those for actor 1.</span>

### Exercise 5
<span  style="color:green; font-size:16px">Select the year, content rating, and IMDB score columns for movies from the year 2016 with IMDB score less than 4.</span>

### Exercise 6
<span  style="color:green; font-size:16px">Select all the movies that are missing values for content rating.</span>

### Exercise 7
<span  style="color:green; font-size:16px">Select all the movies that are missing values for both the gross and budget columns. Return just those columns to verify that those values are indeed missing.</span>

### Exercise 8
<span  style="color:green; font-size:16px">Write a function `find_missing` that has three parameters, `df`, `col1`, and `col2`, where `df` is a DataFrame and `col1` and `col2` are column names. This function should return all the rows of the DataFrame where both `col1` and `col2` are missing, and only those two columns. Answer Exercise 7 with this function.</span>