# Calculations on Groups

The [WIFIRE Lab](https://wifire.ucsd.edu/index.php/) uses footage streamed from a collection of wireless cameras across California to research and model the spread of wildfires. Suppose WIFIRE has a new camera, and they want to send it to whichever county has the largest fires -- they've asked you to calculate the average fire size for each of the counties.

We know from the basics of querying that we can ask for just the rows of a single county and then calculate the average fire size in acres using some of the Series calculations we've learned about. Let's do this for San Diego County.

In [None]:
import babypandas as bpd

fires = bpd.read_csv("data/calfire-full.csv").set_index("name")

In [None]:
san_diego = fires.loc[fires.get('county') == 'San Diego County']
san_diego.get('acres').mean()

And we could use the same pattern to find the average fire size for Yolo County.

In [None]:
yolo = fires.loc[fires.get('county') == 'Yolo County']
yolo.get('acres').mean()

And we could write the same pattern for every county in our dataset...

As you can imagine, we don't want this process to be so manual. Namely, it would take a ton of time to write, a ton of time to run, and it would require us to look through the dataset to find the names of all of the counties.

How many counties are there again? Yikes.

In [None]:
fires.get('county').unique()

Fortunately, this pattern is common enough that it has a special function to carry it out.

## GroupBy

Whenever we want to perform a calculation on all the rows that belong to a single *group*, and we want to perform that calculation across all of our groups, we can use the `.groupby` function.

```html
<table>.groupby('<column_name>').<calculation>()
```

When we call `.groupby` on a DataFrame, we pass a column name as the argument and are essentially telling Babypandas to split up our DataFrame based on the values of that column. Any rows which have the same value in that column will show up in the same group.

In [None]:
fires.groupby('county')

We don't get a table back when doing this step. Instead, we get a special type of object which serves as a placeholder for the next step.

The groupby object owns a handful of methods that we can use to perform calculations on each group, such as `.mean`, `.sum`, `.max`, and `.count` -- the count method simply counts how many rows there are in each group. Notice that these calculations only return *one* value for each group. So, once the function operates on a group, all of the rows in that group essentially get condensed -- or *aggregated* -- into a single row. For this reason, these calculations are called {dterm}`aggregation functions`.

```{note}
You may have already heard of 'aggregation functions' by some different names. If someone has ever asked for a '*summary statistic*' or a '*metric*', it basically means the same thing.

All of these terms refer to a calculation that summarizes multiple data points by using a single value.
```
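To make the split-then-aggregate idea concrete, here is a minimal sketch of the same two steps in plain Python rather than Babypandas. The county names and acre values are made up for illustration, and the names `rows`, `groups`, and `mean_acres` are hypothetical; Babypandas does not necessarily work this way internally.

```python
# Made-up (county, acres) records standing in for rows of a fires table.
rows = [
    ("County A", 100.0),
    ("County B", 40.0),
    ("County A", 300.0),
    ("County B", 60.0),
    ("County C", 500.0),
]

# Step 1 (split): collect the acres values of all rows that share a county.
groups = {}
for county, acres in rows:
    groups.setdefault(county, []).append(acres)

# Step 2 (aggregate): condense each group's values into a single number.
mean_acres = {
    county: sum(values) / len(values)
    for county, values in groups.items()
}

print(mean_acres)
```

Five rows go in, but only three values come out: one per group. That is the sense in which the rows are aggregated.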

Now that we know how to aggregate the acres burnt in every fire in each county into just the *mean* acres burnt, let's try it!

In [None]:
fires.groupby('county').mean()

Whoa there! We only wanted to look at the average acres burnt across each county, but we got the average of every numeric column.

This is because the function called on the groupby object will indeed attempt to apply itself to all the columns. That being said, our original table also has 'unit' and 'cause' columns which both contain strings -- but we're not seeing the average cause show up because you can't take the mean of strings! So, the aggregation function applies to all *possible* columns and just drops the rest.

To avoid confusion, and to avoid forcing the computer to do unnecessary work, it's good practice to select only the columns you need before conducting the groupby -- you should select the column you want to group by, and any columns you want to perform the aggregation function on.

In [None]:
counties_and_acres = fires.get(['county', 'acres'])
counties_and_acres.groupby('county').mean()

Much better.

Finally, it's important to notice what's going on with our table index.

When we loaded our dataset, we set the table index to the 'name' column, but after conducting the groupby the table index has changed to the 'county' column. In fact, whenever we call groupby, the index will always be replaced by the grouping column.

This is fortunate, since it allows us to very easily get the value for a specific group!

In [None]:
fire_sizes = counties_and_acres.groupby('county').mean()
fire_sizes.loc['San Diego County']

And we can now use the same sort-and-grab-the-first-index pattern we covered earlier in order to find the county with the largest average fire size.

In [None]:
fire_sizes.sort_values(by='acres', ascending=False).iloc[0]

```{hiddenanswer}
---
question: If we were to run `fires.groupby('county').max()` (without first selecting a subset of the columns) which columns would we expect to remain or disappear from the result?
answer: The name index will disappear and be replaced by county, all other columns will remain because the max can be calculated on a collection of numbers and max can be calculated on a collection of strings!
```
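As a quick check on the idea that a maximum exists for strings as well as numbers, here is a small plain-Python sketch. The values are made up, and the variable names are hypothetical; it only illustrates how Python itself compares strings, which is the comparison we would expect an aggregation over a text column to rely on.

```python
# Made-up values standing in for one group's 'acres' and 'cause' columns.
acres_in_group = [120.5, 3000.0, 42.0]
causes_in_group = ["Lightning", "Arson", "Powerline"]

# Numbers are compared by size.
largest_acres = max(acres_in_group)

# Strings are compared character by character, so the "largest" string
# is the one that would come last in a dictionary-style ordering.
last_cause = max(causes_in_group)

print(largest_acres, last_cause)
```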

## Grouping by multiple columns

The WIFIRE Lab is grateful for your assistance, and they have another task for you. They're trying out a new camera with the capability to detect fires at the edge.[^edge: Basically, instead of sending their video back to a data center that processes it to check for fires, those fire-detecting algorithms are run on the camera itself. That way, instead of sending back lots of video it can just send back whether it sees a fire or not.] They're going to deploy the camera for one month and then collect it again.

### Is a wildfire likely to occur?

Since they're hoping that the experimental camera will actually witness a wildfire, but they can only have the camera deployed for one month, they want to find out which combination of county *and* month is most likely to experience a fire. WIFIRE has again asked you to perform these calculations. But how do we find the combination with the greatest {dterm}`likelihood` of a fire?

Surprise! Here's our first official introduction to statistics. In order to calculate which combination of county and month is most *likely* to experience a fire, frequentist statistics would dictate that we should find which combination has the greatest *incidence* (occurrence) of fires.

### Multiple grouping columns

This calculation involves us counting the number of fires that took place in each county... for each month. Sounds like groupby but with an additional level. Indeed, we can use groupby to create groups out of *combinations* of columns by passing a list of column names to the `.groupby` function.

```html
<table>.groupby([<list_of_column_names>, <...>]).<calculation>()
```

Similar to before, we're telling Babypandas to split up our rows based on the values of *both* (all) the columns we specify. Then, every row which has the same value in *both* (all) of the grouping columns will be put into the same group. Everything else remains the same, including our use of an aggregation function. The only difference is that the groups are now constructed from multiple columns instead of just one.
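One way to picture grouping on a combination of columns is to treat each pair of values as a single key. The plain-Python sketch below does this with made-up records; the names `fire_records` and `fires_per_group` are hypothetical, and this is an illustration of the idea rather than how Babypandas implements it.

```python
from collections import Counter

# Made-up (county, month) pairs, one per fire; not taken from the real dataset.
fire_records = [
    ("County A", 7),
    ("County A", 7),
    ("County A", 8),
    ("County B", 7),
    ("County B", 7),
    ("County A", 7),
]

# Each distinct (county, month) tuple is one group;
# Counter tallies how many records fall into each group.
fires_per_group = Counter(fire_records)

print(fires_per_group)
```

Two records land in the same group only when *both* the county and the month match, so County A in month 7 and County A in month 8 are counted separately.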

So, let's go ahead and use the `.count` aggregation function to get the number of fires which historically occurred in each county for each month.

In [None]:
fire_incidence = fires.groupby(['county', 'month']).count()
fire_incidence

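Notice that every column in the result holds a count. The `.count` method tallies the non-missing values of each column within each group, so any column with no missing values gives the number of fires in that group.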

Just like before, the table index will be replaced by the grouping columns. So, this time there are two(!) levels to the index. In the table above each county will have a maximum of twelve rows assigned to it -- one for each month. It looks like Alameda County never had a recorded fire take place before May.

If the hierarchical nature of the index is making things too confusing, we can always 'reset' the index so that the values in the index become columns again.

In [None]:
fire_incidence.reset_index()
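If it helps, resetting a two-level index can be pictured as flattening keys back into ordinary fields. The plain-Python sketch below uses made-up counts and hypothetical names (`fire_counts`, `flat_rows`); it is only an analogy for what `.reset_index` does to the table.

```python
# Hypothetical counts keyed by a two-level (county, month) "index".
fire_counts = {
    ("County A", 7): 3,
    ("County A", 8): 1,
    ("County B", 7): 2,
}

# "Resetting" the index: each level of the key becomes an ordinary field
# in a flat row, alongside the value it used to label.
flat_rows = [
    {"county": county, "month": month, "fires": count}
    for (county, month), count in fire_counts.items()
]

for row in flat_rows:
    print(row)
```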

Let's conclude this section by finally answering the question posed to us -- which county and month should we send the camera to such that we'll have the greatest likelihood of observing a wildfire?

The benefit of leaving our groupby index untouched is that it allows us to use the same sort-and-grab-the-first-index pattern again.

In [None]:
fire_incidence.sort_values('year', ascending=False).index[0]
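The sort-and-grab-the-first-index pattern works the same way when the index has two levels: the first index entry is simply a pair of values. Here is a minimal plain-Python sketch of that idea, using made-up counts and hypothetical names (`fire_counts`, `ranked`).

```python
# Made-up counts keyed by (county, month); these are hypothetical numbers.
fire_counts = {
    ("County A", 7): 3,
    ("County A", 8): 1,
    ("County B", 7): 2,
}

# Sort the (key, count) pairs from the largest count to the smallest...
ranked = sorted(fire_counts.items(), key=lambda pair: pair[1], reverse=True)

# ...then take the key (the "index") of the first entry and unpack both levels.
top_county, top_month = ranked[0][0]

print(top_county, top_month)
```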

We did it! Let's go report back to WIFIRE that they should test out the experimental camera in Los Angeles County during July if they want the greatest likelihood of catching a wildfire.