### Filtering Pandas Dataframe By Values of Column

One of the biggest advantages of having the data as a Pandas Dataframe is that Pandas allows us to slice and dice the data in multiple ways.

Often, you may want to subset a pandas dataframe based on one or more values of a specific column. Essentially, we would like to select rows based on one value or multiple values present in a column.

Here are six examples of filtering or selecting rows of a Pandas dataframe based on the values of one or more columns.

Let us first load gapminder data as a dataframe into pandas.

In [10]:
# load pandas
import pandas as pd
data = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
df = pd.read_csv(data)

This data frame has 1,704 rows and 6 columns. Let's verify this with the dataframe's *shape* attribute. After that we will preview the first few rows of the dataframe with the Pandas *head()* method.

As we can see below, one of the columns is **year**. Let us look at the first three rows of the data frame.

In [12]:
df.shape

(1704, 6)

In [14]:
df.head(3)

       country  year         pop  continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0       Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0       Asia   30.332   820.85303
2  Afghanistan  1962  10267083.0       Asia   31.997   853.10071


### 1. How to Select Rows of Pandas Dataframe Based on a Single Value of a Column?

One way to filter rows in Pandas is to use a boolean expression. We first create a boolean variable by taking the column of interest and checking whether each of its values equals the specific value that we want to select/keep.

For example, let us filter the dataframe based on the value 2002 in the year column. This condition results in a boolean Series that is True where the value of year equals 2002 and False otherwise.

In [23]:
# does year equal 2002?
# is_2002 is a boolean Series with True or False for each row
is_2002 = df['year'] == 2002
# let's look at the first few values of the boolean variable with head()
print(is_2002.head())

0    False
1    False
2    False
3    False
4    False
Name: year, dtype: bool


We can then use this boolean variable to filter the dataframe. After subsetting, we can see that the new dataframe is much smaller.

Now that we have a Series filled with *True* or *False* values, it's simple: let's see how many rows are kept when we filter with the variable *is_2002*.

In [24]:
# filter rows for year 2002 using the boolean variable
df_2002 = df[is_2002]
print(df_2002.shape)

(142, 6)
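Because `True` counts as 1 and `False` as 0, summing the boolean Series gives the number of matching rows without building the filtered dataframe first. The sketch below illustrates this on a small made-up frame; the `toy` data and its values are hypothetical stand-ins, not the gapminder file.

```python
import pandas as pd

# Hypothetical toy data standing in for the gapminder dataframe
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1997, 2002, 1997, 2002],
})

mask = toy["year"] == 2002      # boolean Series, one entry per row
n_matches = int(mask.sum())     # True counts as 1, False as 0
filtered = toy[mask]            # keep only the rows where mask is True

print(n_matches)        # number of rows with year == 2002
print(filtered.shape)   # the row count here matches n_matches
```

The filtered frame keeps the original index labels, which is why the index of the gapminder subset is not consecutive.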


We have successfully filtered the Pandas dataframe based on the values of a column. Here, we kept all the rows where year equals 2002.

In [26]:
print(df_2002.head())

        country  year         pop continent  lifeExp    gdpPercap
10  Afghanistan  2002  25268405.0      Asia   42.129   726.734055
22      Albania  2002   3508512.0    Europe   75.651  4604.211737
34      Algeria  2002  31287142.0    Africa   70.994  5288.040382
46       Angola  2002  10866106.0    Africa   41.003  2773.287312
58    Argentina  2002  38331121.0  Americas   74.340  8797.640716


In the above example, we used two steps: 1) create a boolean variable satisfying the filtering condition, and 2) use that boolean variable to filter rows. However, we don’t really have to create and save a new boolean variable to do the filtering. Instead, we can pass the boolean expression directly when subsetting the dataframe by column value, as follows.

In [27]:
df_2002 = df[df['year']==2002]
print(df_2002.shape)

(142, 6)


### Filtering Rows Using Pandas Chaining

We can also use Pandas chaining to access a dataframe’s column and select rows, as in the previous example. Chaining makes it easy to combine one Pandas command with another Pandas command or with user-defined functions.

Here we call the Pandas *eq()* method on the year Series to check element-wise equality and keep the rows corresponding to the year 2002.

In [29]:
df_2002 = df[df.year.eq(2002)]
print(df_2002.shape)

(142, 6)
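To show what chaining buys us beyond a single filter, here is a minimal sketch that filters with `eq()` and then keeps going with further methods in the same expression. It assumes a small hypothetical `toy` frame, and it uses `.loc` with a callable, which is one of several ways to filter in the middle of a chain.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the gapminder frame
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1997, 2002, 1997, 2002],
    "lifeExp": [60.0, 62.5, 70.0, 71.5],
})

result = (
    toy
    .loc[lambda d: d["year"].eq(2002)]        # the callable receives the frame at this point in the chain
    .sort_values("lifeExp", ascending=False)  # chain another Pandas method
    .reset_index(drop=True)                   # renumber the rows from 0
)
print(result)
```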


In the above example, we checked for equality (year==2002) and kept the rows matching a specific value. We can use any other comparison operator, like “less than” or “greater than”, to create a boolean expression and filter the rows of a Pandas dataframe.
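As a sketch of those other comparisons, the example below filters a small hypothetical `toy` frame with `<` and `>=`, and shows that the method form `lt()` gives the same result as the `<` operator.

```python
import pandas as pd

# Hypothetical toy data with a year column
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [1952, 1977, 2002, 2007],
})

before_2002 = toy[toy["year"] < 2002]     # strictly earlier than 2002
from_2002 = toy[toy["year"] >= 2002]      # 2002 or later
also_before = toy[toy["year"].lt(2002)]   # method form of "<"

print(before_2002)
print(from_2002)
```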

### Selecting Rows of Pandas Dataframe Whose Column Value Does NOT Equal a Specific Value

Sometimes, you may want to keep the rows of a data frame whose value in a column does not equal something. Let us filter our gapminder dataframe to the rows whose year column is not equal to 2002. Basically, we want the data for all years except 2002.

In [40]:
# filter rows where year is not equal to 2002
df_not_2002 = df[df.year != 2002]
# equivalent, using bracket notation to access the column
df_not_2002 = df[df['year'] != 2002]
df_not_2002.shape

(1562, 6)

### Select Rows of Pandas Dataframe Whose Column Value is NOT NA/NAN

Often you may want to filter a Pandas dataframe so that you keep only the rows where the value of a certain column is NOT NA/NAN.

We can use the Pandas notnull() method to filter based on NA/NAN values of a column.

In [32]:
# keep rows of the dataframe whose year value is not NA/NAN
df_no_NA = df[df.year.notnull()]
df_no_NA.head()

       country  year         pop  continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0       Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0       Asia   30.332   820.85303
2  Afghanistan  1962  10267083.0       Asia   31.997   853.10071
3  Afghanistan  1967  11537966.0       Asia    34.02  836.197138
4  Afghanistan  1972  13079460.0       Asia   36.088  739.981106


In [33]:
# keep only the rows of the dataframe whose year value is NA/NAN
df_NAs = df[df.year.isnull()]
df_NAs.head()

country  year  pop  continent  lifeExp  gdpPercap

The result has no rows, because the year column in this dataset contains no missing values.
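The effect of notnull() and isnull() is easier to see on data that actually contains a missing value. The sketch below uses a small hypothetical `toy` frame with one NaN in the year column, and also shows `dropna(subset=...)`, which gives the same result as the notnull() filter here.

```python
import pandas as pd

# Hypothetical toy data with one missing year
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1952.0, float("nan"), 2007.0],
})

has_year = toy[toy["year"].notnull()]     # rows where year is present
missing_year = toy[toy["year"].isnull()]  # rows where year is NA/NAN
dropped = toy.dropna(subset=["year"])     # same rows as has_year

print(has_year)
print(missing_year)
```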


### Selecting Rows of Pandas Dataframe Based on a list

In the examples above, we selected rows based on a single value, i.e. **year == 2002**. However, often we may have to select rows using multiple values present in an iterable or a list. For example, let us say we want to select rows for the years [1952, 2007].

The Pandas **isin()** method allows us to select rows using a list or any iterable. If we use isin() on a single column, it simply returns a boolean Series with True where the value matches one of the listed values and False where it does not.

In [34]:
# select rows whose column value is in the list
years = [1952, 2007]
df.year.isin(years)

0        True
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703     True
Name: year, Length: 1704, dtype: bool

In [36]:
df_years= df[df.year.isin(years)]
df_years.shape

(284, 6)

We can make sure our new dataframe contains rows corresponding only to the two years specified in the list. Let us use the Pandas unique() function to get the unique values of the column “year”.

In [37]:
df_years.year.unique()

array([1952, 2007], dtype=int64)
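Another way to check the result is to count the rows per year with `value_counts()`. This is a minimal sketch on a hypothetical `toy` frame; it also passes a set instead of a list to show that isin() accepts other iterables.

```python
import pandas as pd

# Hypothetical toy data: two countries, three years each
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952, 2002, 2007, 1952, 2002, 2007],
})

years = {1952, 2007}                       # a set works as well as a list
subset = toy[toy["year"].isin(years)]      # keep rows whose year is in the set
counts = subset["year"].value_counts().sort_index()  # rows per remaining year

print(counts)
```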

### Selecting Rows of Pandas Dataframe Based on Values NOT in a list

We can also select rows based on values of a column that are not in a list or any iterable. We will create a boolean variable just like before, but now we will negate it by placing _~_ in front. For example, to get the rows of the **df** dataframe whose continent values are not in the continents list, we will use

In [38]:
continents = ['Asia','Africa','Americas','Europe']
df_ocean = df[~df.continent.isin(continents)]
df_ocean.shape 

(24, 6)

This results in a smaller dataframe containing data for just the _Oceania_ continent. We can verify this by using the Pandas unique() function as before; we should see only _Oceania_.

In [39]:
df_ocean.continent.unique()

array(['Oceania'], dtype=object)
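The sketch below shows that a mask and its negation split the dataframe into two non-overlapping parts. The `toy` frame is hypothetical. Note that the element-wise negation is `~`; Python's `not` keyword does not work element-wise on a Series.

```python
import pandas as pd

# Hypothetical toy data with a continent column
toy = pd.DataFrame({
    "country": ["Japan", "Kenya", "Australia", "New Zealand"],
    "continent": ["Asia", "Africa", "Oceania", "Oceania"],
})

known = ["Asia", "Africa"]
in_list = toy["continent"].isin(known)  # True where continent is in the list

kept = toy[in_list]    # rows whose continent is in the list
rest = toy[~in_list]   # rows whose continent is NOT in the list

print(rest)
```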

### Selecting Rows of Pandas Dataframe using Multiple Conditions

We can combine multiple conditions using the _&_ operator to select rows from a Pandas data frame. For example, we can combine the above two conditions to get Oceania data from the years 1952 and 2007.

The resulting subset of data will contain rows corresponding to the Oceania continent for the years 1952 and 2007.

In [45]:
oceania_1952_2007 = df[~df.continent.isin(continents) & df.year.isin(years)]
oceania_1952_2007

          country  year         pop  continent  lifeExp    gdpPercap
60      Australia  1952   8691212.0    Oceania    69.12  10039.59564
71      Australia  2007  20434176.0    Oceania   81.235  34435.36744
1092  New Zealand  1952   1994794.0    Oceania    69.39  10556.57566
1103  New Zealand  2007   4115771.0    Oceania   80.204  25185.00911
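When the conditions are written with comparison operators such as `==` instead of method calls, each condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons in Python. The sketch below, on a hypothetical `toy` frame, combines two conditions with `&` (both must hold) and with `|` (either may hold).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["Australia", "Australia", "Japan", "Japan"],
    "continent": ["Oceania", "Oceania", "Asia", "Asia"],
    "year": [1952, 2007, 1952, 2007],
})

# AND: rows that are in Oceania and from 2007
both = toy[(toy["continent"] == "Oceania") & (toy["year"] == 2007)]

# OR: rows that are in Oceania or from 2007 (or both)
either = toy[(toy["continent"] == "Oceania") | (toy["year"] == 2007)]

print(both)
print(either)
```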
