# Filtering data

In the last section we looked at how to act on entire columns at once. For example, when we did:

```python
tips["total_bill"] * 100
```

it applied the multiplication to every row, multiplying each number by 100.

Sometimes we don't want to deal with entire columns at once; we might only want to grab a subset of the data and look at just that part. For example, with the tips data, we might think that the day of the week will affect the data, so we want to grab only the data for Saturdays.

In Pandas, asking a question like this takes two steps:

1. create a *filter* which describes the question you want to ask
2. *apply* that filter to the data to get just the bits you are interested in

You create a filter by performing some operation on your `DataFrame` or a column within it. To ask about only those rows which refer to Saturday, you grab the `day` column and compare it to `"Sat"`:

In [1]:
# write code here
# read from "tips.csv"

In [2]:
# write code here

This has created a filter object (sometimes called a *mask* or a *boolean array*) which has `True` set for the rows where the day is Saturday and `False` elsewhere.
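As a minimal sketch of what such a filter looks like, the snippet below builds one on a tiny hand-made table. `tips_demo` and its values are made up for illustration and only mimic the column names of the tips data; they are not the contents of `tips.csv`.

```python
import pandas as pd

# Hand-made stand-in for the tips data (illustrative values, not tips.csv)
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 7.25, 20.65, 3.07, 27.20],
    "tip": [1.01, 1.00, 3.35, 1.00, 4.00],
    "day": ["Sun", "Sat", "Sat", "Sat", "Thur"],
})

# Comparing a column with a value gives one boolean per row
sat_mask = tips_demo["day"] == "Sat"
print(sat_mask)

# True counts as 1, so summing the mask counts the matching rows
print(sat_mask.sum())
```

The mask has the same index as the table it was built from, which is what lets Pandas line the two up when the filter is applied.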

We could save this filter as a variable:

In [3]:
# write code here

We can use this to filter the `DataFrame` as a whole. `tips["day"] == "Sat"` has returned a `Series` containing booleans. Passing it back into `tips` as an indexing operation will use it to filter based on the `day` column, only keeping those rows which contained `True` in the filter:

In [4]:
# write code here

Notice that the table now has only 87 rows, down from 244. However, the index has been maintained. This is because the row labels are attached to the rows themselves rather than being simple row numbers.
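The index behaviour can be checked on a small example. This is a sketch on a hand-made table (`tips_demo` is a made-up stand-in, not the real tips data), so the row counts differ from the 87 and 244 quoted above.

```python
import pandas as pd

# Hand-made stand-in for the tips data (illustrative values only)
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 7.25, 20.65, 3.07, 27.20],
    "day": ["Sun", "Sat", "Sat", "Sat", "Thur"],
})

sat_filter = tips_demo["day"] == "Sat"
sat_rows = tips_demo[sat_filter]

print(len(tips_demo), "rows before,", len(sat_rows), "rows after")

# The surviving rows keep their original labels (1, 2, 3), not 0, 1, 2
print(sat_rows.index.tolist())

# Labels still identify the same row as in the unfiltered table
print(sat_rows.loc[2, "total_bill"], tips_demo.loc[2, "total_bill"])
```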

It is more common to do this in one step, rather than creating and naming a filter object. So the code becomes:

In [5]:
# write code here

This has given us back our subset of data as another `DataFrame` which can be used in exactly the same way as the previous one (further filtering, summarising etc.).
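A short sketch of the one-step form, again on a made-up stand-in table (`tips_demo` and `largest_sat_bill` are illustrative names, not part of the course data):

```python
import pandas as pd

# Hand-made stand-in for the tips data (illustrative values only)
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 7.25, 20.65, 3.07, 27.20],
    "day": ["Sun", "Sat", "Sat", "Sat", "Thur"],
})

# Build the filter and apply it in a single expression
sat_rows = tips_demo[tips_demo["day"] == "Sat"]

# The result is an ordinary DataFrame, so it can be summarised as usual
largest_sat_bill = sat_rows["total_bill"].max()
print(largest_sat_bill)
```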

### Exercise 1

- Select the data for only Thursdays.
- Calculate the mean of the `tip` column for Thursdays
- Compare this with the mean of the `tip` column for Saturdays


In [6]:
# write code here

## Other filters

As well as filtering with the `==` operator (which only checks for exact matches), you can do other types of comparisons. Any of the standard Python comparisons will work (i.e. `==`, `!=`, `<`, `<=`, `>`, `>=`).

To grab only the rows where the total bill is less than £8 we can use `<`:

In [7]:
# write code here
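For illustration, here are two of these comparisons on a made-up stand-in table (`tips_demo` is hypothetical and much smaller than the real data):

```python
import pandas as pd

# Hand-made stand-in for the tips data (illustrative values only)
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 7.25, 20.65, 3.07, 27.20],
    "day": ["Sun", "Sat", "Sat", "Sat", "Thur"],
})

# Rows where the total bill is below 8
cheap = tips_demo[tips_demo["total_bill"] < 8]
print(cheap)

# Rows for any day except Saturday
not_sat = tips_demo[tips_demo["day"] != "Sat"]
print(not_sat)
```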

### Exercise 2

Filter the data to only include parties of 5 or more people.



In [8]:
# write code here

## Combining filters

If you want to apply multiple filters, for example to select only "Saturdays with small total bills", you can do it in one of two ways. Either split the question into multiple steps, or ask it all at once.

Let's do it in multiple steps first, since we already have the tools we need for that:

In [9]:
# write code here

Or, you can combine the questions together using the `&` operator with a syntax like:

```python
df[(filter_1) & (filter_2)]
```

so in our case, filter 1 is `tips["day"] == "Sat"` and filter 2 is `tips["total_bill"] < 8`, which gives:

In [10]:
# write code here

If you want to do an "or" operation, then instead of `&` you can use `|`.
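The sketch below runs both approaches, plus an "or", on a made-up stand-in table (`tips_demo` is hypothetical). The parentheses around each filter matter: `&` and `|` bind more tightly than comparisons such as `==` and `<`, so leaving them out changes the meaning of the expression.

```python
import pandas as pd

# Hand-made stand-in for the tips data (illustrative values only)
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 7.25, 20.65, 3.07, 27.20, 5.75],
    "day": ["Sun", "Sat", "Sat", "Sat", "Thur", "Thur"],
})

# Multiple steps: filter once, then filter the result
sat = tips_demo[tips_demo["day"] == "Sat"]
sat_cheap_two_step = sat[sat["total_bill"] < 8]

# All at once: "and" keeps rows where both filters are True
sat_and_cheap = tips_demo[(tips_demo["day"] == "Sat") & (tips_demo["total_bill"] < 8)]

# "or" keeps rows where at least one filter is True
sat_or_cheap = tips_demo[(tips_demo["day"] == "Sat") | (tips_demo["total_bill"] < 8)]

print(sat_and_cheap)
print(sat_or_cheap)
```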

### Exercise 3

Filter the data to only include parties of 4 or more people which happened at lunch time.

Hint: The `size` and `time` columns are what you want to use here.


In [11]:
# write code here

## DataFrame indexing

When we use the square bracket syntax on a `DataFrame` directly there are a few different types of object that can be passed:

<dl>
<dt>A single string</dt>
<dd>This will select a single column from the <code>DataFrame</code>, returning a <code>Series</code> object.</dd>
<dt>A list of strings</dt>
<dd>This will select those columns by name, returning a <code>DataFrame</code>.</dd>
<dt>A filter (a <code>Series</code> of <code>True</code>/<code>False</code>)</dt>
<dd>This will filter the table as a whole, returning a <code>DataFrame</code> with only the rows matching <code>True</code> included.</dd>
</dl>

These are provided as shortcuts as they are the most common operations to do on a `DataFrame`. This is why some of them operate on columns and others on rows.

If you want to be explicit about which axis you are acting on, you can pass these same types of objects to the `.loc[rows, columns]` attribute, with one argument per axis. This means that

```python
tips[sat_filter]
```

is equivalent to
```python
tips.loc[sat_filter]
```

and that
```python
tips["size"]
```

is equivalent to
```python
tips.loc[:, "size"]
```

The full set of rules for [`DataFrame.loc` are in the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html).
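These equivalences can be checked directly. The sketch below does so on a made-up stand-in table (`tips_demo` is hypothetical) and also shows `.loc` selecting rows and columns in one go:

```python
import pandas as pd

# Hand-made stand-in for the tips data (illustrative values only)
tips_demo = pd.DataFrame({
    "total_bill": [16.99, 7.25, 20.65, 3.07, 27.20],
    "day": ["Sun", "Sat", "Sat", "Sat", "Thur"],
    "size": [2, 1, 3, 1, 4],
})
sat_filter = tips_demo["day"] == "Sat"

# Shortcut form and explicit .loc form select the same rows
same_rows = tips_demo[sat_filter].equals(tips_demo.loc[sat_filter])

# Shortcut form and explicit .loc form select the same column
same_column = tips_demo["size"].equals(tips_demo.loc[:, "size"])
print(same_rows, same_column)

# With .loc, a row filter and a list of columns can be combined
sat_bill_and_size = tips_demo.loc[sat_filter, ["total_bill", "size"]]
print(sat_bill_and_size)
```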

## Group by: split-apply-combine

Another operation worth learning is how to aggregate the data based on some criteria. The method we are going to use is called `groupby()`, and is actually the result of combining the following steps:

- **splitting** the data into groups based on some criteria
- **applying** a function to each group independently
- **combining** the results into a data structure
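To make the three steps concrete, the sketch below performs them by hand with the filtering tools from earlier and then compares the result with `groupby()`. The table `people_demo` and its values are made up for illustration.

```python
import pandas as pd

# Hand-made illustrative table (not a real dataset)
people_demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
})

# Split: one sub-table per category
groups = {
    sex: people_demo[people_demo["Sex"] == sex]
    for sex in people_demo["Sex"].unique()
}

# Apply: compute the mean age of each sub-table (missing ages are skipped)
means = {sex: part["Age"].mean() for sex, part in groups.items()}

# Combine: gather the per-group results into a single Series
manual = pd.Series(means).sort_index()
print(manual)

# groupby() performs the same three steps in one expression
by_groupby = people_demo.groupby("Sex")["Age"].mean()
print(by_groupby)
```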

To demonstrate this, we will use the following dataset containing information on passengers of the Titanic.

In [12]:
# write code here
# read from 'data/titanic.csv'

This dataset contains information on the passengers, such as their sex, name and age, as well as some information about ticket prices and so on.

In [13]:
# write code here

Suppose we want to compute some statistics on the age of the passengers.
One way we could do it is by selecting the `Age` column:


In [14]:
# write code here

In [15]:
# write code here

Or we could select both age and sex of the passengers, and obtain a new DataFrame:

In [16]:
# write code here

In [17]:
# write code here

Now, suppose we want to calculate the average age. We could simply select the `Age` column and calculate the mean:

In [18]:
# write code here

But if we wanted to calculate the average age depending on the sex, we cannot calculate it as:

In [19]:
# write code here

As you can see, it doesn't give us what we were expecting. This happens because `Sex` is a categorical variable that can have the values "male" or "female", and it isn't understood by the mean function. With two numerical variables, like the age and the ticket fare, we could do:

In [20]:
# write code here

And this would give us the mean of each column.
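A minimal sketch of this on a made-up stand-in table: `titanic_demo` and its values are invented for illustration, and the column name `Fare` is an assumption about how the ticket price is stored in `data/titanic.csv`.

```python
import pandas as pd

# Hand-made stand-in for the Titanic data (illustrative values only;
# the "Fare" column name is an assumption)
titanic_demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.25, 8.0, 8.0, 8.5, 52.0],
})

# Selecting only numerical columns gives one mean per column
column_means = titanic_demo[["Age", "Fare"]].mean()
print(column_means)
```

The missing age is skipped rather than counted as zero, so the `Age` mean is taken over five values and the `Fare` mean over six.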

What if we wanted to know the average age for each sex?

To do this we can use the `groupby()` method, to make a group per category:

In [21]:
# write code here

As expected, this returns the mean of each column, grouped according to 'Sex'. Let's select the 'Age':

In [22]:
# write code here
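As a sketch of these two steps on a made-up stand-in table (`titanic_demo`, its values and the `Name` and `Fare` column names are assumptions for illustration). Passing `numeric_only=True` is an addition here: recent Pandas versions raise an error when asked for the mean of text columns such as names, so the option restricts the calculation to numerical columns.

```python
import pandas as pd

# Hand-made stand-in for the Titanic data (illustrative values only)
titanic_demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.25, 8.0, 8.0, 8.5, 52.0],
})

# One row per category of "Sex", one column per numerical variable
by_sex = titanic_demo.groupby("Sex").mean(numeric_only=True)
print(by_sex)

# Selecting the column before aggregating computes only what is needed
age_by_sex = titanic_demo.groupby("Sex")["Age"].mean()
print(age_by_sex)
```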

It is possible to group data by several categories at the same time. So, if we wanted to know the average cost of a ticket for both genders and for different cabin classes, we could do:

In [23]:
# write code here
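Grouping by several columns means passing a list of column names to `groupby()`. The sketch below uses a made-up stand-in table; `titanic_demo`, its values and the column names `Pclass` and `Fare` are assumptions for illustration.

```python
import pandas as pd

# Hand-made stand-in for the Titanic data (illustrative values only;
# "Pclass" and "Fare" are assumed column names)
titanic_demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 3, 3, 1],
    "Fare": [7.25, 71.25, 8.0, 8.0, 8.5, 52.0],
})

# One group per (Sex, Pclass) combination present in the data
fare_by_sex_class = titanic_demo.groupby(["Sex", "Pclass"])["Fare"].mean()
print(fare_by_sex_class)

# Each result is labelled by a pair of values, one per grouping column
print(fare_by_sex_class.loc[("female", 3)])
```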

### Count the number of records per category

What is the number of passengers in each cabin class?

In [24]:
# write code here

The `value_counts()` method counts the number of records for each category in a column. The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:

In [25]:
# write code here

Both `size` and `count` can be used in combination with `groupby`. Whereas `size` includes `NaN` values and just gives the number of rows in each group, `count` excludes the missing values. In the `value_counts` method, use the `dropna` argument to include or exclude the `NaN` values. More info: https://pandas.pydata.org/docs/user_guide/basics.html#basics-discretization
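The differences can be seen side by side on a made-up stand-in table; `titanic_demo`, its values and the column names `Pclass` and `Embarked` are assumptions for illustration.

```python
import pandas as pd

# Hand-made stand-in for the Titanic data (illustrative values only)
titanic_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 3, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Embarked": ["S", "C", "S", "S", None, "S"],
})

# size: number of rows per group, missing values included
rows_per_class = titanic_demo.groupby("Pclass").size()

# count: number of non-missing values per group
ages_per_class = titanic_demo.groupby("Pclass")["Age"].count()

# value_counts gives the same numbers as size, ordered by frequency
class_counts = titanic_demo["Pclass"].value_counts()

# By default NaN is left out; dropna=False adds it as its own category
ports = titanic_demo["Embarked"].value_counts()
ports_with_missing = titanic_demo["Embarked"].value_counts(dropna=False)

print(rows_per_class, ages_per_class, class_counts, sep="\n\n")
print(ports, ports_with_missing, sep="\n\n")
```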

REMEMBER
- Aggregation statistics can be calculated on entire columns or rows
- `groupby` provides the power of the split-apply-combine pattern
- `value_counts` is a convenient shortcut to count the number of entries in each category of a variable

These are some of the most common operations we can perform on a Pandas DataFrame.
For more details, you can always refer to the official documentation, which also includes very detailed tutorials: https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html