# Grouping data in pandas

You can group and aggregate data in pandas in ways that will be familiar if you've ever built a pivot table in Excel or written a GROUP BY statement in SQL. In this notebook we'll use the eel import data that lives at `../data/eels.csv`.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../data/eels.csv')

In [None]:
df.head()

### `groupby()`

Let's group the data by country and sum the kilos for each country.

If this were a pivot table, we'd drag the `country` column into Rows and the `kilos` column into Values, then summarize by Sum.

If this were SQL, we might write something like:

```sql
SELECT country, sum(kilos)
FROM table
GROUP BY country
ORDER BY 2 desc
```

Let's do the same thing in pandas using [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html):

- Select our two columns of interest (`country` and `kilos`)
- Call the `groupby()` method on the result, passing it the grouping column (`country`)
- Call the `sum()` method to total the kilos in each group
- Sort by kilos descending

In [None]:
df[['country', 'kilos']].groupby('country').sum().sort_values('kilos', ascending=False)
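To see what each step produces without loading the full file, here is a minimal sketch of the same chain on a tiny data frame. The `sample` rows and their values are made up for illustration; only the column names match the eel data.

```python
import pandas as pd

# Hypothetical rows standing in for the eel data (values are invented)
sample = pd.DataFrame({
    'country': ['China', 'Japan', 'China', 'Taiwan', 'Japan'],
    'kilos': [100, 40, 250, 30, 10],
})

# Select the two columns, group by country, sum, then sort largest first
totals = (
    sample[['country', 'kilos']]
    .groupby('country')
    .sum()
    .sort_values('kilos', ascending=False)
)

# Equivalent idea, but selecting the column after grouping returns a Series
totals_series = sample.groupby('country')['kilos'].sum().sort_values(ascending=False)

print(totals)
print(totals_series)
```

The first version gives you a data frame indexed by `country`; the second gives you a Series with the same index. Which one is handier depends on what you plan to do next.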

### Value counts

If all you need to do is count the occurrences of a value in a column, you can use the `value_counts()` method for a Series.

In our eel data, every row is one month's worth of shipments of a particular eel product from one country. In how many months is mainland China represented, period? Of those, how many times did its monthly exports to the U.S. exceed 25,000 kilos?

Our steps:
- Get the `value_counts()` of the country data and peep China's total
- Filter the data to get just shipments over 25,000 kilos, then get the `value_counts()` on country again and peep China's total

In [None]:
df.country.value_counts().sort_values(ascending=False).head()

In [None]:
df[df['kilos'] > 25000].country.value_counts().sort_values(ascending=False)
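The two counts above can be sketched on a small made-up data frame, which makes it easier to check the numbers by eye. The `sample` rows are hypothetical and are not taken from the eel file.

```python
import pandas as pd

# Hypothetical rows (values are invented for illustration)
sample = pd.DataFrame({
    'country': ['China', 'China', 'Japan', 'China', 'Japan', 'Taiwan'],
    'kilos': [30000, 12000, 26000, 41000, 500, 900],
})

# How many rows does each country appear in?
all_counts = sample['country'].value_counts()

# Same count, restricted to rows above 25,000 kilos
big_counts = sample[sample['kilos'] > 25000]['country'].value_counts()

print(all_counts['China'])
print(big_counts['China'])
```

Two things to notice: `value_counts()` already returns its counts sorted largest first by default, so the explicit `sort_values()` is optional, and a country with no rows left after filtering (Taiwan here) simply drops out of the result.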

### Pivot tables

Now we want to get the total kilos by country by year. We could use `groupby()` again, but pass it multiple columns. We'd get something like this:

In [None]:
df[['country', 'year', 'kilos']].groupby(['country', 'year']).sum()
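Grouping by two columns gives you a result with a two-level index, which can be a little awkward to read and to pull values out of. Here is a rough sketch, on made-up rows, of how you might look up one country-year pair and how to flatten the index back into ordinary columns.

```python
import pandas as pd

# Hypothetical rows (values are invented for illustration)
sample = pd.DataFrame({
    'country': ['China', 'China', 'China', 'Japan', 'Japan'],
    'year': [2016, 2017, 2017, 2016, 2017],
    'kilos': [100, 200, 50, 40, 10],
})

by_country_year = (
    sample[['country', 'year', 'kilos']]
    .groupby(['country', 'year'])
    .sum()
)

# Each (country, year) pair is one row, addressed with a tuple
china_2017 = by_country_year.loc[('China', 2017), 'kilos']

# reset_index() turns the two index levels back into regular columns
flat = by_country_year.reset_index()

print(china_2017)
print(flat)
```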

... which is fine, but (I think) there's a more intuitive way to look at this data: using the [`pivot_table()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) method.

If we were making this pivot table in Excel, we would drag `country` to Rows, `kilos` to Values and `year` to Columns. But we're gonna do it in pandas. We're gonna hand the `pivot_table()` method five things:
- A reference to the data frame you're pivoting (`df`)
- The `index` column -- what to group your data by (`index='country'`)
- The `columns` column -- the second grouping factor (`columns='year'`)
- The `values` column -- the numbers to aggregate (`values='kilos'`)
- The `aggfunc` -- what function to use to aggregate the data; the default is the mean, but we'll use Python's built-in `sum` function

Then we'll sort the results by the latest year of data -- 2017.

In [None]:
pd.pivot_table(df, index='country', columns='year', values='kilos', aggfunc=sum).sort_values(2017, ascending=False)
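Here is a small sketch of the same pivot on made-up rows, so you can see the shape of the result. The `sample` data is hypothetical. One deliberate difference from the cell above: this sketch passes the string `'sum'` as the `aggfunc`, which pandas also accepts, because recent pandas releases may emit a warning when handed the built-in `sum`.

```python
import pandas as pd

# Hypothetical rows (values are invented for illustration)
sample = pd.DataFrame({
    'country': ['China', 'China', 'China', 'Japan', 'Japan', 'Taiwan'],
    'year': [2016, 2017, 2017, 2016, 2017, 2016],
    'kilos': [100, 200, 50, 40, 10, 30],
})

# One row per country, one column per year, summed kilos in each cell
pivoted = pd.pivot_table(
    sample,
    index='country',
    columns='year',
    values='kilos',
    aggfunc='sum',
).sort_values(2017, ascending=False)

print(pivoted)
```

Taiwan has no 2017 rows in this sample, so that cell comes back as `NaN`, and `sort_values()` puts it at the bottom. The column labels are the actual year values (integers here), which is why we sort by `2017` and not `'2017'`.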