# Grouping and aggregation

In [1]:
import pandas as pd

In [2]:
url = 'https://github.com/alx2202/DataAnalysis/raw/main/Day13/emps.csv'
emps = pd.read_csv(url, sep=';', encoding='utf-8', index_col='employee_id', parse_dates=['hire_date'])
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


## Aggregate functions

- `.sum()` - sum
- `.mean()` - mean (average): the sum of all values in the data set divided by the number of values
- `.median()` - median: the middle value when the data set is ordered from least to greatest
- `.min()` - minimum
- `.max()` - maximum
- `.std()` - [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)
- ...

In [4]:
emps.salary.sum(), emps.salary.min(), emps.salary.max(), emps.salary.mean(), emps.salary.median()

(691400, 2100, 24000, 6461.682242990654, 6200.0)
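
To verify these functions by hand, here is a minimal sketch on a made-up Series (the values are illustrative and do not come from `emps.csv`). One detail worth knowing: `.std()` computes the sample standard deviation (`ddof=1`) by default, so it differs slightly from the population standard deviation.

```python
import pandas as pd

# Hypothetical salaries, chosen so the results are easy to check by hand
salaries = pd.Series([2000, 4000, 6000, 12000])

total = salaries.sum()        # 2000 + 4000 + 6000 + 12000
mean = salaries.mean()        # total divided by the 4 values
median = salaries.median()    # average of the two middle values, 4000 and 6000

# .std() defaults to the sample standard deviation (ddof=1);
# pass ddof=0 to get the population standard deviation instead.
sample_std = salaries.std()
population_std = salaries.std(ddof=0)

print(total, mean, median)
print(sample_std, population_std)
```

With an even number of values there is no single middle element, so the median is the average of the two middle values.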

We can call an aggregate function not only on a Series (column) but also on a whole DataFrame. In that case, the function is computed for each column (Series) of the DataFrame where possible.

In [5]:
emps.sum()

first_name    StevenNeenaLexAlexanderBruceDavidValliDianaNan...
last_name     KingKochharDe HaanHunoldErnstAustinPataballaLo...
job_title     PresidentAdministration Vice PresidentAdminist...
salary                                                   691400
dtype: object
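
Summing text columns simply concatenates the strings, which is rarely what we want. A minimal sketch on a made-up DataFrame (the column names and values are assumptions for illustration) shows two ways to restrict the aggregation to numeric columns: the `numeric_only=True` argument and `select_dtypes()`.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns
df = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Cid'],
    'salary': [3000, 4000, 5000],
    'bonus': [100.0, 200.0, 300.0],
})

# Aggregate only the numeric columns; 'first_name' is skipped
numeric_totals = df.sum(numeric_only=True)
print(numeric_totals)

# Alternative: select the numeric columns explicitly, then aggregate
same_totals = df.select_dtypes('number').sum()
print(same_totals)
```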

We can apply several functions to one column, and we can also aggregate multiple columns at the same time (only the ones we choose, not necessarily all of them).

For this approach, which is very common, we will use `agg()` (an alias of `aggregate()`).

We pass `agg()` a dictionary with the following structure:
- key = the column (Series) to which the aggregate functions are applied
- value = the name of the aggregate function to apply, or a list of names to apply several functions to the same column.

In [7]:
emps.agg({
    'salary': 'mean',
    'hire_date': 'median'
})

salary               6461.682243
hire_date    2007-09-28 00:00:00
dtype: object

In [8]:
emps.agg({
    'salary': ['min', 'max', 'mean', 'median', 'sum', 'std'],
    'hire_date': ['min', 'max', 'median']
})

,salary,hire_date
min,2100.0,1987-09-17
max,24000.0,2011-02-06
mean,6461.682243,NaT
median,6200.0,2007-09-28
sum,691400.0,NaT
std,3909.365746,NaT
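
In the output above, `NaT` marks the cells where a function was not requested for `hire_date`. The same effect can be reproduced on a small made-up DataFrame (the column names and values are assumptions for illustration), where the missing combination shows up as `NaN`:

```python
import pandas as pd

# Hypothetical data with two numeric columns
df = pd.DataFrame({
    'salary': [3000, 4000, 5000],
    'age': [30, 40, 50],
})

# 'mean' is requested for salary only, so the (mean, age) cell stays empty
summary = df.agg({
    'salary': ['min', 'max', 'mean'],
    'age': ['min', 'max'],
})
print(summary)
```

Each requested function becomes a row and each column of the dictionary becomes a column of the result.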


Alternatively, instead of providing the names of aggregate functions (as strings), we can provide Python or NumPy functions.

In [9]:
import numpy as np

In [11]:
emps.agg({
    'salary': [min, max, np.median, 'mean', 'std'],
    'hire_date': [min, max, 'median'],
})

,salary,hire_date
min,2100.0,1987-09-17
max,24000.0,2011-02-06
median,6200.0,2007-09-28
mean,6461.682243,NaT
std,3909.365746,NaT
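
Since `agg()` accepts callables, a user-defined function can sit in the same list as the built-in names. The sketch below is an illustration on made-up data: `salary_range` is a hypothetical helper, not part of pandas, and it is written to operate on a whole Series.

```python
import pandas as pd


def salary_range(values):
    """Hypothetical helper: difference between the largest and smallest value."""
    return values.max() - values.min()


# Hypothetical data, chosen so the result is easy to check by hand
df = pd.DataFrame({'salary': [2000, 4000, 6000, 12000]})

# Mix string names with a custom function; the function's name
# becomes the row label in the result
summary = df.agg({'salary': ['min', 'max', salary_range]})
print(summary)
```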


## Grouping

In [12]:
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


In [15]:
# when calling .agg() on a column (Series), we can pass a list instead of a dictionary
emps[emps.city == 'Seattle'].salary.agg([min, max])

min     2500
max    24000
Name: salary, dtype: int64

So far we have applied aggregate functions to a whole Series. Very often, though, we want the result of an aggregate function for each distinct value of another column, such as the minimum and maximum salary per city.

To solve this without filtering (which quickly becomes tedious, because we would have to filter for every possible value), we can use the `.groupby()` method, which creates groups of rows that share the same value in a chosen column.

We can group our DataFrame by `city`. We then have access to groups of rows for the same city, such as `Seattle`, `London`, etc., and can apply aggregate functions to those groups.
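
The per-city minimum and maximum salary mentioned above can be sketched as follows. The DataFrame here is a small made-up stand-in for `emps` (the values are assumptions for illustration), so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical stand-in for the emps DataFrame
df = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Toronto', 'Toronto', 'London'],
    'salary': [24000, 2500, 13000, 6000, 6500],
})

# Group rows by city, pick the salary column, then aggregate each group
per_city = df.groupby('city')['salary'].agg(['min', 'max'])
print(per_city)
```

The result has one row per city (sorted by the group key by default) and one column per aggregate function, which replaces a separate filter for every city.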

In [16]:
groups = emps.groupby('city')
groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb0e8071760>

The result of the `.groupby()` operation is a `DataFrameGroupBy` object that contains all the groups and their rows.

In [17]:
len(groups)  # 7, because NaN are not included

7

In [18]:
emps.city.unique()

array(['Seattle', 'Southlake', 'South San Francisco', 'Oxford', nan,
       'Toronto', 'London', 'Munich'], dtype=object)
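
`unique()` reports eight values while `len(groups)` is seven because `.groupby()` drops rows whose key is NaN by default. A minimal sketch on made-up data, assuming pandas 1.1 or newer (where the `dropna` parameter is available), shows how to keep those rows as their own group:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing city
df = pd.DataFrame({
    'city': ['Seattle', 'Toronto', np.nan, 'Seattle'],
    'salary': [2500, 6000, 4000, 24000],
})

# Default behaviour: the row with a NaN city is left out
without_nan = df.groupby('city')['salary'].sum()

# dropna=False keeps NaN as a separate group key
with_nan = df.groupby('city', dropna=False)['salary'].sum()

print(without_nan)
print(with_nan)
```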

In [19]:
groups.size()  # how many rows we have within each group

city
London                  1
Munich                  1
Oxford                 34
Seattle                18
South San Francisco    45
Southlake               5
Toronto                 2
dtype: int64

In [20]:
# we can get the data for particular group
groups.get_group('Toronto')

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
201,Michael,Hartstein,Marketing Manager,13000,2006-02-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada


In [23]:
type(groups.get_group('Toronto'))

pandas.core.frame.DataFrame

In [22]:
for city, group in groups:
    print(f'Employees from {city}')
    for idx, employee in group.iterrows():
        print(f'\t{employee.first_name} {employee.last_name}')

Employees from London
	Susan Mavris
Employees from Munich
	Hermann Baer
Employees from Oxford
	John Russell
	Karen Partners
	Alberto Errazuriz
	Gerald Cambrault
	Eleni Zlotkey
	Peter Tucker
	David Bernstein
	Peter Hall
	Christopher Olsen
	Nanette Cambrault
	Oliver Tuvault
	Janette King
	Patrick Sully
	Allan McEwen
	Lindsey Smith
	Louise Doran
	Sarath Sewall
	Clara Vishney
	Danielle Greene
	Mattea Marvins
	David Lee
	Sundar Ande
	Amit Banda
	Lisa Ozer
	Harrison Bloom
	Tayler Fox
	William Smith
	Elizabeth Bates
	Sundita Kumar
	Ellen Abel
	Alyssa Hutton
	Jonathon Taylor
	Jack Livingston
	Charles Johnson
Employees from Seattle
	Steven King
	Neena Kochhar
	Lex De Haan
	Nancy Greenberg
	Daniel Faviet
	John Chen
	Ismael Sciarra
	Jose Manuel Urman
	Luis Popp
	Den Raphaely
	Alexander Khoo
	Shelli Baida
	Sigal Tobias
	Guy Himuro
	Karen Colmenares
	Jennifer Whalen
	Shelley Higgins
	William Gietz
Employees from South San Francisco
	Matthew Weiss
	Adam Fripp
	Payam Kaufling
	Shanta Vollman
	Kevin M

Since each group is a `DataFrame`, we can use all the DataFrame operations we know so far, including methods that export a `DataFrame` to another format such as CSV or Excel (many other formats are available as well).

Available `.to_*` methods for exporting a `DataFrame` to a different format: [pandas docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv).

In [24]:
for city, group in groups:
    group.to_csv(f'group_{city}.csv')
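
A self-contained variant of the export loop is sketched below on made-up data. The temporary output directory and the replacement of spaces in file names are assumptions added for illustration; they are not part of the original loop.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the emps DataFrame
df = pd.DataFrame(
    {
        'first_name': ['Steven', 'Pat', 'Michael'],
        'city': ['Seattle', 'Toronto', 'Toronto'],
    },
    index=pd.Index([100, 202, 201], name='employee_id'),
)

# Write into a temporary directory so the working directory stays clean
output_dir = Path(tempfile.mkdtemp())

written = []
for city, group in df.groupby('city'):
    # Replace spaces so a name like 'South San Francisco' gives a tidy file name
    path = output_dir / f"group_{city.replace(' ', '_')}.csv"
    group.to_csv(path)
    written.append(path.name)

# Read one file back to confirm the round trip
toronto = pd.read_csv(output_dir / 'group_Toronto.csv', index_col='employee_id')
print(written)
print(toronto)
```

Because `employee_id` is the index, `to_csv()` writes it as the first column, and `index_col='employee_id'` restores it when reading the file back.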