# Pandas Groupby


Groupby is one of the most important features in pandas. It lets us group data together, call aggregate functions on each group, and combine the results, following a three-step pattern known as *split-apply-combine*.

Before we move on to the hands-on part, let's look at how split-apply-combine works, using data shown in different colours.

* **Split:** Data contained in a pandas object (e.g. Series, DataFrame) is split into groups based on one or more keys that we provide. The splitting is performed along a particular axis of the object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1), although grouping on columns with `axis=1` is deprecated in recent pandas versions.
* **Apply:** Once splitting is done, a function is applied to each group independently, producing a new value.
* **Combine:** Finally, the results of all those function applications are combined into a resulting object. The form of the resulting object usually depends on what is being done to the data.
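
As a rough illustration of the three steps, the sketch below performs split, apply, and combine by hand on a small made-up DataFrame and compares the result with a single `groupby` call. The `toy` DataFrame and its `colour` and `value` columns are assumptions invented for this example.

```python
import pandas as pd

# Hypothetical toy data: each row has a colour (the key) and a value
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "blue", "green"],
    "value": [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per distinct colour
pieces = {colour: part for colour, part in toy.groupby("colour")}

# Apply: compute a sum for each piece independently
sums = {colour: part["value"].sum() for colour, part in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name="value")

# groupby performs all three steps in one chained call
auto = toy.groupby("colour")["value"].sum()

print(manual.to_dict())
print(auto.to_dict())
```

Both approaches should produce the same per-colour totals; the chained form is simply shorter and lets pandas handle the bookkeeping.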

Let's explore with some examples:


![group](https://media.giphy.com/media/5LcfoE5u34kfNvW1Oi/giphy.gif)


## Introduction

In this lab, you'll learn how to use `.groupby()` statements in Pandas to summarize datasets.

## Objectives
You will be able to:
* Understand what a groupby object is and split a DataFrame using a groupby
* Create aggregate views of data using the groupby method on a pandas DataFrame

## Using `.groupby()` statements

Consider an example of the titanic DataFrame:

<img src='images/titanic_1.png'>

During the Exploratory Data Analysis phase, one of the most common tasks is to split your dataset into subgroups and compare them to see if you can notice any trends. For instance, you may want to group the passengers together by gender or age. You can do this with the `.groupby()` method built into pandas DataFrames.

To group passengers by gender, you would type:

```python
df.groupby('Sex')

# This line of code is equivalent to the one above
df.groupby(df['Sex'])
```

Note that this alone will not display a result. Although you have split the dataset into groups, there is no meaningful way to display information until you chain an **_Aggregation Function_** onto the groupby, which lets you compute summary statistics for each group.
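
Before aggregating, it can help to inspect the groupby object itself. The sketch below uses a tiny made-up stand-in for the Titanic data (the `passengers` DataFrame and its values are assumptions, not the real dataset) and looks at the groups without computing any statistics.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22, 38, 26, 35, 54],
    "Survived": [0, 1, 1, 0, 0],
})

grouped = passengers.groupby("Sex")

print(type(grouped).__name__)       # the class of the grouped object
print(grouped.ngroups)              # number of distinct groups
print(grouped.size())               # number of rows in each group
print(grouped.get_group("female"))  # the rows that belong to one group
```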

You can quickly use an aggregation function by chaining the call to the end of the groupby method.

```python
# numeric_only=True restricts the sum to numeric columns
df.groupby('Sex').sum(numeric_only=True)
```


The code above displays the following DataFrame:

<img src='images/titanic_2.png'>

Aggregation functions let you quickly compare subsets of your data. For example, the aggregate statistics displayed above show that there were more female survivors overall than male survivors.

## Aggregation Functions


There are many built-in aggregate functions provided for you in the pandas package, and you can even write and apply your own. Some of the most common aggregate functions you may want to use are:

* `.min()` -- returns the minimum value for each column by group
* `.max()` -- returns the maximum value for each column by group
* `.mean()` -- returns the average value for each column by group
* `.median()` -- returns the median value for each column by group
* `.count()` -- returns the number of non-null values in each column by group
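
To illustrate applying several aggregations at once, including one you write yourself, here is a minimal sketch using `.agg()`. The `passengers` data and the `age_range` helper are hypothetical names made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22, 38, 26, 35, 54],
})


def age_range(ages):
    """Custom aggregation: gap between the oldest and youngest in a group."""
    return ages.max() - ages.min()


# Built-in aggregations are named by string; custom ones are passed as functions
summary = passengers.groupby("Sex")["Age"].agg(["min", "max", "mean", age_range])
print(summary)
```

Each entry in the list becomes one column of the result, so the output has one row per group and one column per aggregation.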


You can also see a list of all of the built-in aggregation methods by creating a grouped object and then using tab completion to inspect the available methods:

```python
grouped_df = df.groupby('Sex')
grouped_df.<TAB>
```

This will display output similar to the following (in this listing the grouped object is named `gb`):

```
In [26]: grouped_df.<TAB>
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight
```

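This listing shows the attributes and methods available on a grouped object. Entries such as `gb.gender`, `gb.height`, and `gb.weight` are column names from the example DataFrame used to produce the listing, not built-in methods. Note that some entries are aggregation functions, while others, such as `gb.fillna()`, fill null values within each group independently.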
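
As a sketch of filling null values group by group, the example below replaces each missing age with the mean age of that passenger's group. It uses `transform` with an ordinary `Series.fillna` rather than the grouped `fillna` method, which recent pandas releases have deprecated. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age in each group
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [20.0, 30.0, np.nan, np.nan, 50.0],
})

# For each group, replace missing ages with that group's mean age.
# transform returns a result aligned with the original rows.
filled = passengers.groupby("Sex")["Age"].transform(
    lambda ages: ages.fillna(ages.mean())
)
print(filled.tolist())
```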

## Grouping With Multiple Groups

You can also split data into multiple levels of groups by passing in a list containing the name of every column you want to group by, for instance, every combination of `Sex` and `Pclass`.

```python
# numeric_only=True avoids errors on text columns in pandas 2.0+
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)
```

The code above would return the following DataFrame:

<img src="./images/titanic_3.png">
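
Grouping by more than one column produces a result with a multi-level index. The sketch below, on made-up data, shows two common ways to reshape such a result: `reset_index()` to turn the group keys back into columns, and `unstack()` to pivot the inner level into columns.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Fare": [7.25, 71.0, 8.0, 52.0, 80.0, 8.75],
})

by_sex_class = passengers.groupby(["Sex", "Pclass"]).mean(numeric_only=True)
print(by_sex_class.index.names)  # the two grouping columns form the index

# Turn the group keys back into ordinary columns
flat = by_sex_class.reset_index()
print(flat)

# Or pivot the inner level (Pclass) into columns
wide = by_sex_class["Fare"].unstack()
print(wide)
```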

## Selecting Information From Grouped Objects

You can also select just the columns you're interested in, either by indexing the groupby object before aggregating or by slicing the DataFrame that an aggregation returns.

The example below demonstrates the syntax for returning the mean of the `Survived` column for every combination of `Sex` and `Pclass`:

```python
df.groupby(['Sex', 'Pclass'])['Survived'].mean()
```

The code above returns the following Series:

<img src='./images/titanic_4.png'>

The above example slices by column, but you can also slice by index. Take a look:
```python
grouped = df.groupby(['Sex', 'Pclass'])['Survived'].mean()
print(grouped['female'])

# Output:
# Pclass
# 1    0.968085
# 2    0.921053
# 3    0.500000
# Name: Survived, dtype: float64

print(grouped['female'][1])
# Output:
# 0.968085
```

Note that you need to provide only the value `female` as the index, and you get back all the groups where the passenger is female, regardless of the `Pclass` value. The second example shows the result for female passengers with a 1st-class ticket.
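
A couple of alternative ways to index a multi-level result are sketched below on made-up data: `.loc` with a tuple selects on both levels at once, and `.xs()` selects on the inner level alone.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})
rates = passengers.groupby(["Sex", "Pclass"])["Survived"].mean()

# Both index levels at once, passed as a tuple
female_first = rates.loc[("female", 1)]
print(female_first)

# Every group with Pclass == 1, regardless of Sex
first_class = rates.xs(1, level="Pclass")
print(first_class)
```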



```python
import pandas as pd
```

Let's create a dictionary and convert it into a pandas DataFrame:

```python
# Create a DataFrame from a dictionary of columns
data = {'Store': ['Walmart', 'Walmart', 'Costco', 'Costco', 'Target', 'Target'],
        'Customer': ['Tim', 'Jermy', 'Mark', 'Denice', 'Ray', 'Sam'],
        'Sales': [150, 200, 550, 90, 430, 120]}
df = pd.DataFrame(data)
df
```

| | Store | Customer | Sales |
|---|---|---|---|
| 0 | Walmart | Tim | 150 |
| 1 | Walmart | Jermy | 200 |
| 2 | Costco | Mark | 550 |
| 3 | Costco | Denice | 90 |
| 4 | Target | Ray | 430 |
| 5 | Target | Sam | 120 |


In `df`, each row holds a unique customer name, a sales amount, and a store name.

Let's group the data in `df` by the "Store" column using the `groupby` method. This creates a `DataFrameGroupBy` object.

Take `df`, access the `groupby` method with `.`, and pass the column we want to group the data on. Notice that we get back a groupby object stored at a memory address (0x....).

```python
df.groupby("Store")
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x113051b70>
```

```python
by_store = df.groupby("Store")
```

Now that the grouped data is stored in the `by_store` object, we can call aggregation methods on it.

```python
# numeric_only=True skips the string column "Customer" (needed in pandas 2.0+)
by_store.mean(numeric_only=True)
```

| Store | Sales |
|---|---|
| Costco | 320 |
| Target | 275 |
| Walmart | 175 |


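Here pandas applies `mean()` to the numeric column "Sales" only. Older pandas versions dropped non-numeric columns automatically. In pandas 2.0 and later you need to pass `numeric_only=True` (or select the numeric columns first), otherwise `mean()` raises an error on string columns such as "Customer". The same applies to `std`, while `sum`, `min`, and `max` also operate on string columns unless `numeric_only=True` is passed.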

```python
# The same aggregation, chained directly onto groupby
df.groupby("Store").mean(numeric_only=True)
```

| Store | Sales |
|---|---|
| Costco | 320 |
| Target | 275 |
| Walmart | 175 |
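
An alternative to `numeric_only=True` is to select the column of interest before aggregating. The sketch below rebuilds the same store data under the name `stores` (a name chosen here for illustration) and shows both the Series and the one-column DataFrame form.

```python
import pandas as pd

# Same data as the store example above, rebuilt so the block runs on its own
stores = pd.DataFrame({
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
    "Customer": ["Tim", "Jermy", "Mark", "Denice", "Ray", "Sam"],
    "Sales": [150, 200, 550, 90, 430, 120],
})

# Single brackets: the result is a Series indexed by store
mean_sales = stores.groupby("Store")["Sales"].mean()
print(mean_sales)

# Double brackets: the result stays a one-column DataFrame
mean_sales_df = stores.groupby("Store")[["Sales"]].mean()
print(mean_sales_df)
```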


Notice that the result is a DataFrame with "Store" as the index and "Sales" as the column. After aggregating, we can use `.loc` to look up the row for a particular store. This gives us the value (e.g. sales) for a single store.

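```python
# numeric_only=True keeps the string column "Customer" out of the sum
df.groupby("Store").sum(numeric_only=True).loc["Walmart"]
```

```
Sales    350
Name: Walmart, dtype: int64
```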
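
If a single number is wanted rather than a one-element Series, `.loc` also accepts a row label and a column label together. This is a small sketch on the same store data, rebuilt here under the assumed name `stores`.

```python
import pandas as pd

stores = pd.DataFrame({
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
    "Customer": ["Tim", "Jermy", "Mark", "Denice", "Ray", "Sam"],
    "Sales": [150, 200, 550, 90, 430, 120],
})

totals = stores.groupby("Store").sum(numeric_only=True)

# Row label first, then column label: returns a scalar
walmart_sales = totals.loc["Walmart", "Sales"]
print(walmart_sales)
```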

We can perform many other aggregation operations on the `by_store` object.

```python
by_store.min()
```

| Store | Customer | Sales |
|---|---|---|
| Costco | Denice | 90 |
| Target | Ray | 120 |
| Walmart | Jermy | 150 |


```python
by_store.max()
```

| Store | Customer | Sales |
|---|---|---|
| Costco | Mark | 550 |
| Target | Sam | 430 |
| Walmart | Tim | 200 |


```python
# numeric_only=True skips the string column "Customer"
by_store.std(numeric_only=True)
```

| Store | Sales |
|---|---|
| Costco | 325.269119 |
| Target | 219.203102 |
| Walmart | 35.355339 |


```python
# Count the number of entries in each column; this works with strings as well.
# We have 2 customers and 2 sales in each store.
by_store.count()
```

| Store | Customer | Sales |
|---|---|---|
| Costco | 2 | 2 |
| Target | 2 | 2 |
| Walmart | 2 | 2 |
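
One detail worth knowing is that `count()` skips missing values, while `size()` counts every row in the group. The sketch below uses a made-up variant of the store data with one missing sale to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Walmart sale is missing
stores = pd.DataFrame({
    "Store": ["Walmart", "Walmart", "Costco", "Costco"],
    "Sales": [150, np.nan, 550, 90],
})
grouped = stores.groupby("Store")

counts = grouped["Sales"].count()  # non-null values per group
sizes = grouped.size()             # rows per group, nulls included

print(counts)
print(sizes)
```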


`describe()` is a useful method that reports summary statistics, such as the mean, min, and quartile values, for each store.

```python
by_store.describe()
```

| Store | (Sales, count) | (Sales, mean) | (Sales, std) | (Sales, min) | (Sales, 25%) | (Sales, 50%) | (Sales, 75%) | (Sales, max) |
|---|---|---|---|---|---|---|---|---|
| Costco | 2.0 | 320.0 | 325.269119 | 90.0 | 205.0 | 320.0 | 435.0 | 550.0 |
| Target | 2.0 | 275.0 | 219.203102 | 120.0 | 197.5 | 275.0 | 352.5 | 430.0 |
| Walmart | 2.0 | 175.0 | 35.355339 | 150.0 | 162.5 | 175.0 | 187.5 | 200.0 |


Let's call `transpose()` after `describe()` to make the output easier to read:

```python
by_store.describe().transpose()
```

| | Store | Costco | Target | Walmart |
|---|---|---|---|---|
| Sales | count | 2.0 | 2.0 | 2.0 |
| Sales | mean | 320.0 | 275.0 | 175.0 |
| Sales | std | 325.269119 | 219.203102 | 35.355339 |
| Sales | min | 90.0 | 120.0 | 150.0 |
| Sales | 25% | 205.0 | 197.5 | 162.5 |
| Sales | 50% | 320.0 | 275.0 | 175.0 |
| Sales | 75% | 435.0 | 352.5 | 187.5 |
| Sales | max | 550.0 | 430.0 | 200.0 |


After `transpose()`, each store is a column, so we can select a single store by name to isolate its statistics:

```python
by_store.describe().transpose()['Costco']
```

```
Sales  count      2.000000
       mean     320.000000
       std      325.269119
       min       90.000000
       25%      205.000000
       50%      320.000000
       75%      435.000000
       max      550.000000
Name: Costco, dtype: float64
```
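
A possible shortcut, sketched below, is to select the "Sales" column before calling `describe()`. The result then has a single level of column labels, and one store's statistics can be read with `.loc` instead of transposing. The `stores` name is an assumption used to keep the block self-contained.

```python
import pandas as pd

stores = pd.DataFrame({
    "Store": ["Walmart", "Walmart", "Costco", "Costco", "Target", "Target"],
    "Customer": ["Tim", "Jermy", "Mark", "Denice", "Ray", "Sam"],
    "Sales": [150, 200, 550, 90, 430, 120],
})

# Describing one column gives flat column labels: count, mean, std, ...
sales_stats = stores.groupby("Store")["Sales"].describe()

# One row per store, so a store's statistics are a simple row lookup
costco = sales_stats.loc["Costco"]
print(costco)
```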

## Summary

In this lab, you learned how to split a DataFrame into subgroups using the `.groupby()` method. You also learned how to generate aggregate views of these groups by applying built-in methods to a groupby object.

# Great Job!
Let's have a quick overview before we move on to the next section.