# Filtering using logical operations and `.loc`

In [1]:
# import pandas
import pandas as pd
# load the gapminder dataset
gapminder = pd.read_csv('https://raw.githubusercontent.com/UofUDELPHI/2024-02-08-python/main/content/complete/data/gapminder.csv')
# take a look at the head of gapminder
gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


### Filtering with `.loc` using a boolean series


Recall that you can create a boolean series by applying a logical condition to a column of a DataFrame. For instance, below, we define a boolean series that is `True` for each row where the `country` value equals `'Australia'` and `False` for every other row:

In [2]:
# use a boolean operation to identify which rows have the value 'Australia' in the 'country' column 
# save the result as australia_index
australia_index = gapminder['country'] == 'Australia'
australia_index

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: country, Length: 1704, dtype: bool
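
As a quick sanity check on a boolean series like this one, it can help to count how many rows match before filtering. The sketch below uses a small made-up DataFrame (`toy`, with placeholder values, not the real gapminder data) so that it runs without downloading anything; the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for gapminder; the values are placeholders for illustration only
toy = pd.DataFrame({
    'country': ['Australia', 'Brazil', 'Australia', 'Chile'],
    'year': [2002, 2002, 2007, 2007],
})

# Comparing a column to a value gives one True/False entry per row
is_australia = toy['country'] == 'Australia'

# True counts as 1 and False as 0, so .sum() is the number of matching rows
n_matches = is_australia.sum()

print(is_australia.tolist())  # [True, False, True, False]
print(n_matches)              # 2
```

The boolean series has the same length and index as the DataFrame, which is what allows it to be used for row filtering.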



We can use this boolean series to subset/filter the rows of our DataFrame by providing it in the row indexing position of the `.loc` indexer. The following will filter the `gapminder` DataFrame just to the rows where the `country` value equals `'Australia'`:

In [3]:
# use .loc and australia_index to select only the rows corresponding to Australia
gapminder.loc[australia_index,:]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
60,Australia,Oceania,1952,69.12,8691212,10039.59564
61,Australia,Oceania,1957,70.33,9712569,10949.64959
62,Australia,Oceania,1962,70.93,10794968,12217.22686
63,Australia,Oceania,1967,71.1,11872264,14526.12465
64,Australia,Oceania,1972,71.93,13177000,16788.62948
65,Australia,Oceania,1977,73.49,14074100,18334.19751
66,Australia,Oceania,1982,74.74,15184200,19477.00928
67,Australia,Oceania,1987,76.32,16257249,21888.88903
68,Australia,Oceania,1992,77.56,17481977,23424.76683
69,Australia,Oceania,1997,78.83,18565243,26997.93657


Note that you can also filter the data using a boolean series via the simpler square bracket syntax without the `.loc` indexer, just as we did to extract columns in the previous section/video. The simpler `df[]` syntax can be used to filter *either* the columns or the rows, but not both at the same time. If you provide a list of column names, Pandas will subset to those *columns*, whereas if you provide a boolean series whose length equals the number of rows, Pandas will subset the *rows*. The drawback is readability: if you don't know what the object inside the brackets contains, you cannot tell from the code alone whether the syntax below is subsetting the rows or the columns.

In [4]:
# use `df[]` and australia_index to select only the rows corresponding to Australia
gapminder[australia_index]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
60,Australia,Oceania,1952,69.12,8691212,10039.59564
61,Australia,Oceania,1957,70.33,9712569,10949.64959
62,Australia,Oceania,1962,70.93,10794968,12217.22686
63,Australia,Oceania,1967,71.1,11872264,14526.12465
64,Australia,Oceania,1972,71.93,13177000,16788.62948
65,Australia,Oceania,1977,73.49,14074100,18334.19751
66,Australia,Oceania,1982,74.74,15184200,19477.00928
67,Australia,Oceania,1987,76.32,16257249,21888.88903
68,Australia,Oceania,1992,77.56,17481977,23424.76683
69,Australia,Oceania,1997,78.83,18565243,26997.93657
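
To make the ambiguity concrete, here is a minimal sketch on a made-up DataFrame (`toy` and the `selector_*` names are hypothetical, and the values are placeholders). The two indexing expressions look identical in form, yet one subsets columns and the other subsets rows.

```python
import pandas as pd

# Toy stand-in for gapminder; the values are placeholders for illustration only
toy = pd.DataFrame({
    'country': ['Australia', 'Brazil', 'Chile'],
    'year': [2007, 2007, 2007],
    'pop': [1, 2, 3],
})

selector_a = ['country', 'year']         # a list of column names
selector_b = toy['country'] == 'Brazil'  # a boolean series, one value per row

# Same df[...] syntax, different axis being subset
cols = toy[selector_a]  # keeps all 3 rows, 2 of the columns
rows = toy[selector_b]  # keeps 1 row, all 3 columns

print(cols.shape)  # (3, 2)
print(rows.shape)  # (1, 3)
```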


For this reason, it is recommended that you use the `.loc` indexing syntax, which has an explicit position for the row subsetting and the column subsetting: `df.loc[row_index,column_index]`
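
The following sketch illustrates the two positions of `.loc` on a made-up DataFrame (`toy` and its placeholder values are assumptions for illustration, not the gapminder data). It filters rows and selects columns in one step, and checks that `.loc[mask, :]` returns the same rows as `df[mask]`.

```python
import pandas as pd

# Toy stand-in for gapminder; the values are placeholders for illustration only
toy = pd.DataFrame({
    'country': ['Australia', 'Brazil', 'Chile'],
    'year': [2007, 2007, 2007],
    'pop': [1, 2, 3],
})

is_brazil = toy['country'] == 'Brazil'

# Row position: boolean series; column position: list of column names
subset = toy.loc[is_brazil, ['country', 'pop']]

# A bare ":" in the column position keeps every column
all_columns = toy.loc[is_brazil, :]

print(subset.shape)                         # (1, 2)
print(all_columns.equals(toy[is_brazil]))   # True
```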

Rather than defining a separate indexing series object for filtering the rows (like `australia_index`), it is common to just put the logical filtering condition directly in the indexing syntax:

In [5]:
# directly use logical conditioning inside .loc to select only the rows corresponding to Australia
gapminder.loc[gapminder['country'] == 'Australia',:]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
60,Australia,Oceania,1952,69.12,8691212,10039.59564
61,Australia,Oceania,1957,70.33,9712569,10949.64959
62,Australia,Oceania,1962,70.93,10794968,12217.22686
63,Australia,Oceania,1967,71.1,11872264,14526.12465
64,Australia,Oceania,1972,71.93,13177000,16788.62948
65,Australia,Oceania,1977,73.49,14074100,18334.19751
66,Australia,Oceania,1982,74.74,15184200,19477.00928
67,Australia,Oceania,1987,76.32,16257249,21888.88903
68,Australia,Oceania,1992,77.56,17481977,23424.76683
69,Australia,Oceania,1997,78.83,18565243,26997.93657


### Multiple conditions

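You can provide multiple row filtering conditions by combining them with `&` (element-wise "and"). Each condition must be wrapped in its own parentheses, because `&` has higher precedence than comparison operators such as `==` and `>`: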

In [6]:
# provide multiple logical conditions to filter the rows so that only rows corresponding to Australia after 1990 are selected
gapminder.loc[(gapminder['country'] == 'Australia') & (gapminder['year'] > 1990),:]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
68,Australia,Oceania,1992,77.56,17481977,23424.76683
69,Australia,Oceania,1997,78.83,18565243,26997.93657
70,Australia,Oceania,2002,80.37,19546792,30687.75473
71,Australia,Oceania,2007,81.235,20434176,34435.36744
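
When a combined condition gets long, one option is to name each condition first and then combine the named series. The sketch below, on a made-up DataFrame (`toy`, `is_australia`, and `after_1990` are hypothetical names with placeholder values), checks that this gives the same result as writing the parenthesized conditions inline.

```python
import pandas as pd

# Toy stand-in for gapminder; the values are placeholders for illustration only
toy = pd.DataFrame({
    'country': ['Australia', 'Australia', 'Brazil', 'Brazil'],
    'year': [1987, 1992, 1987, 1992],
})

# Name each condition separately
is_australia = toy['country'] == 'Australia'
after_1990 = toy['year'] > 1990

# Combine the named boolean series with & (True only where both are True)
both = toy.loc[is_australia & after_1990, :]

# Equivalent inline form: each condition wrapped in parentheses
inline = toy.loc[(toy['country'] == 'Australia') & (toy['year'] > 1990), :]

print(both.index.tolist())  # [1]
print(both.equals(inline))  # True
```

Naming the conditions avoids the need for parentheses around each comparison, since each comparison is evaluated on its own line before `&` is applied.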


### Exercise

Extract the subset of the data corresponding to Asian countries for which the life expectancy is at least 75.

In [7]:
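# filter to rows for Asian countries with life expectancy of at least 75,
# keeping only the 'country' and 'year' columns
gapminder.loc[(gapminder['continent'] == 'Asia') & (gapminder['lifeExp'] >= 75),['country', 'year']]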

Unnamed: 0,country,year
95,Bahrain,2007
666,"Hong Kong, China",1982
667,"Hong Kong, China",1987
668,"Hong Kong, China",1992
669,"Hong Kong, China",1997
670,"Hong Kong, China",2002
671,"Hong Kong, China",2007
763,Israel,1987
764,Israel,1992
765,Israel,1997
