# Filtering DataFrames and Series

In [None]:
import pandas as pd

We are going to use a dataset that has Airbnb listing information in Lisbon.

In [None]:
df = pd.read_csv('data/airbnb.csv', index_col='room_id')

In [None]:
df.shape

In [None]:
df.head()

# Selecting rows

## Selecting rows by their position - iloc

We use the indexer [iloc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) to select specific rows of a dataframe by position (regardless of the index).

With `iloc` we select rows by their row number, starting at 0.

In [None]:
df.iloc[0]

In [None]:
type(df.iloc[0])

If we want the selection to be a dataframe (instead of a Series), we can use double brackets `[[]]`

In [None]:
df.iloc[[0]]

We can select multiple rows at once:

In [None]:
df.iloc[[0, 3, 5]]

Or use slices, as with lists and arrays (the end position is excluded):

In [None]:
df.iloc[2:10]
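
As a minimal sketch of the three forms of `iloc` shown above, the following uses a small made-up dataframe (the `room_id` and `price` values are invented for illustration, not taken from the Airbnb dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the Airbnb listings
toy = pd.DataFrame(
    {"price": [50, 80, 120, 65]},
    index=pd.Index([101, 205, 309, 412], name="room_id"),
)

row = toy.iloc[0]      # a single position returns a Series
frame = toy.iloc[[0]]  # a list of positions returns a DataFrame
chunk = toy.iloc[1:3]  # a slice returns positions 1 and 2 (3 is excluded)

print(type(row).__name__, type(frame).__name__, list(chunk.index))
```

Note that the positions have nothing to do with the index labels: position 1 here is the listing labelled 205.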

## Selecting rows by their index value - loc

* With [.loc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) we can select rows based on their index value.

Since we set the listing's `room_id` as the dataframe index, we can select a specific room based on its id, for example, the listing 10186098.

In [None]:
df.loc[10186098]

Selecting an index value that doesn't exist will fail with a `KeyError`, just as it would with a dictionary:

In [None]:
df.loc[[5]]
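
If we are not sure a label exists, one option is to check membership in the index first. This is only a sketch on invented toy data; the labels 101, 205 and 5 are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: two listings indexed by an invented room_id
toy = pd.DataFrame(
    {"price": [50, 80]},
    index=pd.Index([101, 205], name="room_id"),
)

# Asking .loc for a label that is not in the index raises KeyError
try:
    toy.loc[[5]]
    missing_raised = False
except KeyError:
    missing_raised = True

# Checking membership first avoids the exception
label = 5
result = toy.loc[[label]] if label in toy.index else None
```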

As with `.iloc`, we can select multiple values at once by passing a list of labels.

In [None]:
df.loc[[29872, 19188572, 4612503]]

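We can use a boolean array (a list of `True` and `False` values) to select multiple rows with `loc`. This is called a **mask**. The mask must have one value per row, so here we apply it to the first four rows only.

In [None]:
df.head(4).loc[[True, True, False, True]]

We have selected the first, second and fourth rows of the dataframe: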

In [None]:
df.head()
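
In practice a mask is rarely typed by hand; it usually comes from a comparison on a column. A small sketch with invented prices:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"price": [50, 80, 120, 65]}, index=[101, 205, 309, 412])

# A hand-written mask needs exactly one flag per row
manual_mask = [True, True, False, True]
picked = toy.loc[manual_mask]

# A comparison builds the same kind of mask automatically
computed_mask = toy["price"] < 100
cheap = toy.loc[computed_mask]
```

Both selections return the same three rows, because the comparison `price < 100` happens to produce the same flags as the manual list.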

# Selecting columns

## Selecting columns by their name

We can select columns using dot notation **(as long as the column name has no spaces or other non-alphanumeric characters, and does not clash with an existing dataframe attribute or method)**

In [None]:
df.room_type

Which is the same as doing:

In [None]:
df['room_type']

When we select one column we receive a `pd.Series`. We can use double brackets to select multiple columns; selecting with a list of columns always returns a dataframe.

In [None]:
df[["room_type", "price"]].head()

We can also select columns with `loc`, using `:` to keep all rows:

In [None]:
df.loc[:, "room_type"][:10]
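
The different ways of selecting columns can be compared side by side. The sketch below uses invented data, and the column `minimum stay` is a hypothetical name chosen only to show a case where dot notation cannot be used:

```python
import pandas as pd

# Hypothetical toy data; "minimum stay" contains a space on purpose
toy = pd.DataFrame(
    {
        "room_type": ["Private room", "Entire home/apt", "Shared room"],
        "price": [40, 95, 20],
        "minimum stay": [1, 3, 1],
    }
)

one = toy["room_type"]                          # single name: Series
two = toy[["room_type", "price"]]               # list of names: DataFrame
same = toy.room_type.equals(toy["room_type"])   # dot notation gives the same Series
stay = toy["minimum stay"]                      # names with spaces need brackets
via_loc = toy.loc[:, ["room_type", "price"]]    # all rows, two columns
```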

The index doesn't have to be unique. For example, we can set the neighborhood as the index.

In [None]:
df.head()

In [None]:
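# drop=False keeps neighborhood as a regular column too, so later cells can still filter on it
df = df.set_index("neighborhood", drop=False)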

In [None]:
df.loc["Belém"].head()
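
With a non-unique index, `loc` returns every row that carries the label. A minimal sketch on invented data (the neighborhoods and prices are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a repeated neighborhood
toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Benfica", "Belém"],
        "price": [90, 40, 110],
    }
)
by_hood = toy.set_index("neighborhood")

belem = by_hood.loc["Belém"]      # label appears twice: DataFrame with 2 rows
benfica = by_hood.loc["Benfica"]  # label appears once: Series
unique = by_hood.index.is_unique  # False, because "Belém" is repeated
```

The return type therefore depends on how many rows share the label, which is worth keeping in mind when writing code against a non-unique index.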

Next we set the index to `host_id`. The argument `drop=False` keeps `host_id` as a regular column in addition to using it as the index.

In [None]:
df = df.set_index("host_id", drop=False)
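
The effect of `drop` can be seen on a tiny example. The `host_id` values here are invented:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"host_id": [7, 8], "price": [50, 80]})

dropped = toy.set_index("host_id")           # default: the column moves into the index
kept = toy.set_index("host_id", drop=False)  # the column is used as index and also kept

print(list(dropped.columns))
print(list(kept.columns))
```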

## Mask

The method [mask](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mask.html) allows us to "hide" the parts of a dataframe that match a certain condition.

In [None]:
df.mask(df.overall_satisfaction == 5.0)

We see that the rows that match the condition appear as `NaN`, which stands for **Not a Number**, a standard way of saying *"there is no relevant data here"*. Most pandas aggregations (such as `sum` or `mean`) skip `NaN` values by default.

## Where

On the other hand, [where](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) hides those rows that don't match the condition (where is the opposite of mask).

In [None]:
df.where(df.overall_satisfaction == 5.0)
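
To make the symmetry between `mask` and `where` concrete, here is a sketch on three invented rows. Both calls receive the same condition and blank out complementary sets of rows:

```python
import pandas as pd

# Hypothetical toy data (floats, so NaN does not change the dtypes)
toy = pd.DataFrame(
    {
        "overall_satisfaction": [5.0, 4.5, 5.0],
        "price": [50.0, 80.0, 120.0],
    }
)
cond = toy["overall_satisfaction"] == 5.0

masked = toy.mask(cond)   # rows where cond is True become NaN
kept = toy.where(cond)    # rows where cond is False become NaN
```

In both results the shape is unchanged; only the contents of the hidden rows are replaced.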

# Filtering with []

We can also filter by using brackets.
The difference between filtering with brackets and using `mask`/`where` is that brackets return only the matching rows (fewer rows), while `mask`/`where` return a dataframe with the same rows and index as the original one.

For example, we can filter the dataframe to see all the listings in `Belém`:

In [None]:
df.head()

In [None]:
df.where(df.neighborhood=="Belém").shape

If we use brackets, the dataframe we get is smaller

In [None]:
df[df.neighborhood == 'Belém']

In [None]:
df[df.neighborhood == 'Belém'].shape
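
The shape difference is easy to verify on a small invented dataframe:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Benfica", "Belém"],
        "price": [90, 40, 110],
    }
)
cond = toy["neighborhood"] == "Belém"

same_shape = toy.where(cond)  # 3 rows, the non-matching one filled with NaN
fewer_rows = toy[cond]        # only the 2 matching rows

print(same_shape.shape, fewer_rows.shape)
```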

We can select the inverse of a condition if we put `~` in front of it.

For example, to select all listings that are not in Belém, we can do this:

In [None]:
df[~(df.neighborhood == "Belém")].shape

# Multiple Selection

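We can filter a dataframe based on multiple conditions.

We can select rows that match all of several conditions by combining the conditions with `&`. Each condition must be wrapped in parentheses, because `&` binds more tightly than comparison operators such as `==` and `>`.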

For example, if we want those listings in Belém with more than 3 bedrooms:

In [None]:
df[(df.neighborhood == 'Belém') & (df.bedrooms > 3)]

In the same way, we can select rows that match one condition OR the other with the pipe (`|`):

In [None]:
df[(df.neighborhood == "Belém") | (df.neighborhood == "Benfica")]
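
Giving each condition a name can make combined filters easier to read. The sketch below uses invented data and combines `&`, `|` and `~`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Benfica", "Belém", "Alvalade"],
        "bedrooms": [4, 2, 1, 5],
    }
)

in_belem = toy["neighborhood"] == "Belém"
big = toy["bedrooms"] > 3

both = toy[in_belem & big]         # in Belém AND more than 3 bedrooms
either = toy[in_belem | big]       # in Belém OR more than 3 bedrooms
neither = toy[~(in_belem | big)]   # negation of the OR condition
```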

## Isnull/Notnull

Sometimes we simply want to select those rows where there are no null (`NaN`) values.

We can select rows where a column is null by doing `column.isnull()`.

In [None]:
df[df.overall_satisfaction.isnull()]

Likewise, we can select those rows where a column is not null by using `notnull()`.

In [None]:
df[df.overall_satisfaction.notnull()].head()

An easy way to check whether any column has null values is `df.notnull().all()` (`all` returns True for a column only if every row in it matches the condition):

In [None]:
df.notnull().all()

So we see that the column `overall_satisfaction` has some null values.

We can find the rows that have any null value like this (`any` returns True if at least one value is True, and `axis=1` means we check across each row instead of down each column):

In [None]:
df[df.isnull().any(axis=1)]
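
The column-wise and row-wise checks can be illustrated on a tiny invented dataframe with a single missing rating:

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the float column
toy = pd.DataFrame(
    {
        "overall_satisfaction": [5.0, None, 4.5],
        "price": [50, 80, 120],
    }
)

complete_cols = toy.notnull().all()            # one True/False per column
rows_with_null = toy[toy.isnull().any(axis=1)] # rows with at least one null
rated = toy[toy["overall_satisfaction"].notnull()]
```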

# Isin

We can check if an element belongs to a Python list like this:

In [None]:
"potato" in ["potato", "tomato", "lettuce"]

We can use a similar approach with pandas dataframes using `.isin`. For example, if we want to select those listings where the neighborhood is in a specific list we can do it like this:

In [None]:
favorite_neighbourhoods = ["Belém", "Parque das Nações"]

listings_i_like = df[df.neighborhood.isin(favorite_neighbourhoods)]

listings_i_like.head()
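
`isin` returns a boolean mask, so it can be negated with `~` like any other condition. A sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"neighborhood": ["Belém", "Benfica", "Parque das Nações", "Alvalade"]}
)
favorites = ["Belém", "Parque das Nações"]

is_favorite = toy["neighborhood"].isin(favorites)  # boolean Series
liked = toy[is_favorite]
others = toy[~is_favorite]                         # everything not in the list
```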

# Query

The method `.query` allows us to select rows using a string expression with a Python-like syntax (similar in spirit to a SQL `WHERE` clause).

In [None]:
df.query("neighborhood=='Belém' and price>150")
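
Inside a query string, a Python variable can be referenced by prefixing it with `@`. The sketch below uses invented data and a hypothetical variable `min_price`, and checks that the query matches the equivalent bracket filter:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Belém", "Benfica"],
        "price": [200, 90, 180],
    }
)

min_price = 150  # hypothetical threshold, referenced as @min_price below
result = toy.query("neighborhood == 'Belém' and price > @min_price")

# The same filter written with brackets
equivalent = toy[(toy["neighborhood"] == "Belém") & (toy["price"] > min_price)]
```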

## Filtering based on datatypes

In [None]:
df.dtypes

We can use the method `select_dtypes` to select those columns that have specific types. 

For example, if we want to select only the columns that are floats, we can do:

In [None]:
df.select_dtypes(include=[float]).head()

We can also use the parameter `exclude` to filter excluding certain data types. 

For example, if we want to exclude those columns that are Python objects (and strings are stored as objects), we can do so like this:

In [None]:
df.select_dtypes(exclude=[object]).head()
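
Besides Python types, `select_dtypes` also accepts the string `"number"`, which matches all numeric dtypes at once. A sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data with one text, one float and one integer column
toy = pd.DataFrame(
    {
        "room_type": ["Entire home/apt", "Private room"],
        "price": [50.0, 80.0],
        "bedrooms": [1, 2],
    }
)

floats = toy.select_dtypes(include=[float])     # only the float column
numeric = toy.select_dtypes(include="number")   # floats and integers together

print(list(floats.columns), list(numeric.columns))
```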