# Querying

Let's continue with our analysis of California wildfires. In the previous section we learned how to answer questions like "what were the largest fires?", but very often we're more interested in answering questions like "what were the largest fires *in the year 2019*?"

In [None]:
import babypandas as bpd
fires = bpd.read_csv("../../data/calfire-full.csv").set_index('name')
fires

Since we already know how to sort rows by fire size, the natural next question is how to look at only the fires that took place in 2019. Here's how; we'll break down the next line of code in just a moment.

In [None]:
fires[fires.get('year') == 2019]

So what's going on in the expression above?

## Boolean Arrays

We can select a subset of rows by placing a comparison inside square brackets after the table's name. The comparison applies a {dterm}`comparison operator` to one of the table's columns. Recall from our introduction to {dterm}`Booleans` that a comparison operator returns whether a comparison between two values is True or False. In the expression above, we're getting the column of years and checking, row by row, whether each year equals 2019.

This starts to give us an idea of how the expression works beneath the surface. In fact, what happens if we just look at the result of the expression within the square brackets?

In [None]:
fires.get('year') == 2019

The result is a {dterm}`Boolean array` -- a sequence of True and False values.

```{note}
The term Boolean array isn't confined to just arrays. We can actually refer to any sequence of True and False as a Boolean array, including lists and Series.
```

Just as arrays and Series support element-wise arithmetic, they also support element-wise *comparisons*!
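
As a small illustration of element-wise comparison, here is a sketch using a NumPy array. The years below are made up for the example and are not taken from the fires data; a column returned by `.get` should behave analogously.

```python
import numpy as np

# Hypothetical years, made up for illustration (not from the fires data).
years = np.array([2017, 2019, 2018, 2019])

# The comparison is applied to each element, giving one Boolean per element.
is_2019 = years == 2019
print(is_2019)

# Other comparison operators work element-wise in the same way.
after_2017 = years > 2017
print(after_2017)
```

Each result has the same length as `years`, with True exactly in the positions where the comparison holds.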

When we use square brackets on a table, Babypandas expects to receive a Boolean array that has the same length as the number of rows in the table. If the Boolean array is True in the first position, then the first row of the table will be included in the result. More generally, a row is included when the Boolean array is True in that row's position, and left out when the Boolean array is False there.
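
To make this rule concrete, here is a sketch in plain Python that mimics it with lists. The helper `select_rows` and the rows below are hypothetical, written only for illustration; this is not how Babypandas is implemented internally.

```python
def select_rows(rows, mask):
    """Keep each row whose matching mask entry is True (illustrative helper)."""
    if len(rows) != len(mask):
        raise ValueError("mask must have one entry per row")
    return [row for row, keep in zip(rows, mask) if keep]

# Hypothetical (name, year) rows, made up for illustration.
rows = [("Fire A", 2018), ("Fire B", 2019), ("Fire C", 2019)]

# One Boolean per row: True where the year equals 2019.
mask = [year == 2019 for _, year in rows]

selected = select_rows(rows, mask)
print(selected)
```

Only the rows paired with True survive, and they keep their original order.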

````{tip}
Using square brackets on a table can be read aloud as "*where*".

So the expression
```python
fires[fires.get('year') == 2019]
```
would be read aloud as **"fires *where* the year column of fires equals 2019"**.
````

As long as the Boolean array has the same length as the number of rows in the table, we can use it. Though it's rather impractical, we could even create a list of True and False values by hand and use it to select rows.

Take the following table, `algos`, of five common sorting algorithms as an example:

In [None]:
algos = bpd.DataFrame().assign(
    Algorithm=['Insertion sort', 'Merge sort', 'Quick sort', 'Bubble sort', 'Heap sort'],
    Efficiency=['O(n^2)', 'O(n log n)', 'O(n^2)', 'O(n^2)', 'O(n log n)']
).set_index('Algorithm')
algos

If we wanted to select only the rows that have an efficiency of $O(n\log n)$, we *could* manually create a Boolean array and select using that.

```{margin}
What is this mysterious notation with $O$ and $n$s? It's related to the *efficiency* of an algorithm.

No need to know what they are now -- you'll see them again in DSC 40A!
```

In [None]:
bool_arr = [False, True, False, False, True]
algos[bool_arr]

It almost always makes more sense, though, to calculate the Boolean array programmatically.

In [None]:
bool_arr_calculated = algos.get('Efficiency') == 'O(n log n)'
algos[bool_arr_calculated]

## ...

- Gotcha: sorting and then selecting (using a Boolean array computed from a column of the unsorted table)
- In a broader sense, we're often interested in answering questions on a subset of data where some {dterm}`condition` is satisfied by the columns.