# Pandas Groupby: Summarising, Aggregating, and Grouping data in Python

In this short project we will explore grouping large DataFrames by different variables and applying summary functions to each group. This is accomplished in pandas using the "groupby()" and "agg()" methods of DataFrame objects.

For our project we will be using the Campus Placements dataset from Kaggle. For more details on this dataset you can check this link - https://www.kaggle.com/benroshan/factors-affecting-campus-placement.

We have covered the basics of DataFrames, including how to load them and manipulate data, in the project "Pandas - Loading and Manipulating Data". We suggest you complete that before you start with this project.

## What is the GroupBy function?

Pandas’ GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.

Let me take an example to elaborate on this. Let’s say we are trying to analyze the salary of a person in an MBA batch. We can easily get a fair idea of their salary by determining the mean salary of the whole batch. But here’s a question – would the salary be affected by the gender of a person?

We can group the batch into different gender groups and calculate their mean salary. This would give us a better insight into the salary of a student in the batch we are analyzing. But we can probably get an even better picture if we further separate these gender groups by the students' prior degree and then take their mean salary. Maybe the prior degree has an impact on the earning potential after an MBA?

You can see how separating people into groups and then computing a statistic for each group allows a better analysis than just looking at that statistic for the entire population. This is what makes GroupBy so great!

GroupBy allows us to group our data based on different features and get a more accurate idea about our data. It is a one-stop-shop for deriving deep insights from your data!
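The salary example above can be sketched on a tiny, made-up table. The rows, column names, and salary figures here are hypothetical and are not taken from the placements dataset; the point is only to show how the overall mean hides differences that the grouped means reveal.

```python
import pandas as pd

# Hypothetical toy data (not from the placements dataset)
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "degree": ["Sci", "Sci", "Comm", "Comm", "Sci", "Comm"],
    "salary": [300, 320, 250, 270, 350, 220],
})

# One number for the whole batch
overall_mean = toy["salary"].mean()

# One number per gender
mean_by_gender = toy.groupby("gender")["salary"].mean()

# One number per (gender, prior degree) combination
mean_by_gender_degree = toy.groupby(["gender", "degree"])["salary"].mean()

print(overall_mean)
print(mean_by_gender)
print(mean_by_gender_degree)
```

Each extra grouping variable splits the batch into smaller groups, so the means become more specific but are based on fewer students.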

## This Lab Contains - 

1. [Loading the Data](#section1)
2. [Accessing the Data](#section2)
3. [Summarising the DataFrame](#section3)
4. [Summarising by Groups in the DataFrame](#section4)
5. [Summary Statistics by the Group](#section5)
    - [Multiple statistics by the Group](#section6)
6. [Renaming Grouped Aggregation Columns](#section7)
7. [Pivot Tables](#section8)

----
<a id='section1'></a>
## Loading the Data
Let's start then!

We will start by loading the data from the .csv file -

In [2]:
# We will use pandas for storing and manipulating data
import pandas as pd

# Load data from csv file
data = pd.read_csv('https://raw.githubusercontent.com/anikannal/data/master/Placement_Data.csv')


<a id='section2'></a>
## Accessing the Data

Now that we have the data loaded in a pandas dataframe, let's understand how to access the data.

In [None]:
# Look at all the column names and datatypes
# This is the placements data of an MBA batch

data.info()

In [None]:
# To access first few rows of the dataframe

data.head()

What did we learn about our dataset?

1. sl_no - serial number / roll number - this is a unique identifier
2. gender - gender of the student
3. ssc_p - SSC percentage scored by the student
4. ssc_b - SSC board the student graduated from (central, others)
5. hsc_p - HSC percentage scored by the student
6. hsc_b - HSC board the student graduated from (central, others)
7. hsc_s - HSC specialization (science, commerce, arts)
8. degree_p - student's prior degree percentage
9. degree_t - student's degree type (sci and tech, comm & management, etc.)
10. workex - work experience yes/no
11. etest_p - employability test percentage
12. specialisation - this is the student's mba specialization
13. mba_p - mba percentage for the student
14. status - placed/not placed
15. salary - salary offered by the recruiting company

In [None]:
# To access one of the columns -

data['ssc_p']

In [None]:
# To access two or more columns - 

data[['ssc_p','ssc_b']]

In [None]:
# To filter data based on some condition - 
# Can you show me all students with ssc percentages greater than 70?

data[data['ssc_p']>70]

In [None]:
# To filter a column based on some condition - 
# Can you find the placement status of all students whose HSC stream was Science?

data['status'][data['hsc_s']=='Science']
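As a possible alternative to the chained indexing used above, the same kind of selection can be written with .loc, and several conditions can be combined with & (each condition in its own parentheses). The small table below is hypothetical and only borrows the column names from the dataset.

```python
import pandas as pd

# Hypothetical toy rows that reuse the dataset's column names
toy = pd.DataFrame({
    "ssc_p": [65.0, 80.0, 75.0],
    "hsc_s": ["Science", "Commerce", "Science"],
    "status": ["Placed", "Placed", "Not Placed"],
})

# Row condition and column label in a single .loc call
science_status = toy.loc[toy["hsc_s"] == "Science", "status"]

# Two conditions combined with & (note the parentheses)
high_science = toy[(toy["ssc_p"] > 70) & (toy["hsc_s"] == "Science")]

print(science_status)
print(high_science)
```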

<a id='section3'></a>
## Summarising the DataFrame

Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, mean, max, min, standard deviations and more for columns are easily calculable.

In [None]:
# How many rows are in the dataset?
# sl_no is the unique identifier, so counting it counts the rows

data['sl_no'].count()

In [None]:
# What was the highest employability test score?

data['etest_p'].max()

In [None]:
# What was the highest salary for a student with a Sci&Tech undergraduate degree?

data['salary'][data['degree_t']=='Sci&Tech'].max()

In [None]:
# What is the total of all salary packages offered to students in 'Mkt&Fin' specialization?

data['salary'][data['specialisation'] == 'Mkt&Fin'].sum()

In [None]:
# How many students are there in each specialization?

data['specialisation'].value_counts()

In [None]:
# What is the number of non-null unique degree types?

data['degree_t'].nunique()

In [None]:
# What are the unique degree types?

data['degree_t'].unique()

In [None]:
# Do any of the columns have NaN or Null values?

data.isnull().any()

In [None]:
# How many?

data.isnull().sum()

The salary column has 67 'nan' values! Let's replace those with a zero for ease of analysis -

In [None]:
data['salary'] = data['salary'].fillna(0)
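Filling with zero is convenient, but it is worth knowing how it changes later statistics: pandas skips NaN values when computing a mean, whereas zeros are counted like any other value. A minimal sketch with made-up salary figures:

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: two offers and two missing values
salaries = pd.Series([200000.0, 300000.0, np.nan, np.nan])

# NaN values are ignored: (200000 + 300000) / 2
mean_skipping_nan = salaries.mean()

# After fillna(0) the zeros are included: 500000 / 4
mean_with_zeros = salaries.fillna(0).mean()

print(mean_skipping_nan, mean_with_zeros)
```

So after the replacement, a mean salary reads as "average over all students, counting unplaced students as zero" rather than "average offer among placed students".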

The .describe() function is a useful summarisation tool that will quickly display statistics for any variable or group it is applied to. The describe() output varies depending on whether you apply it to a numeric or character column.

In [None]:
# Let's find out more about the salary column

data['salary'].describe()

In [None]:
# Let's find out more about the gender ratio

data['gender'].describe()
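To make the difference between the two kinds of describe() output concrete, here is a small sketch on hypothetical data. It only inspects which statistics are reported for a numeric column versus a text column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "M"],
    "salary": [250.0, 300.0, 350.0, 300.0],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy["salary"].describe()

# Text column: count, number of unique values, most frequent value and its frequency
text_summary = toy["gender"].describe()

print(numeric_summary)
print(text_summary)
```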

<a id='section4'></a>
## Summarising by Groups in the DataFrame

We have learnt to access specific columns and rows, filtering based on conditions, and summarizing specific columns.

There’s further power put into your hands by mastering the pandas "groupby()" functionality. Groupby essentially splits the data into different groups depending on a variable of your choice. For example, the expression data.groupby('specialisation') will split our current DataFrame by specialisation.

In [None]:
# Let's group the dataframe by the students' specialization

data.groupby(['specialisation'])

Notice - the groupby() function returns a DataFrameGroupBy object. Unlike a DataFrame, this object does not display its contents!

The GroupBy object essentially describes how the rows of the original dataset have been split. Its .groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. For example:

In [None]:
# The GroupBy object .groups variable is a dictionary

type(data.groupby(['specialisation']).groups)

In [None]:
# The GroupBy object .groups variable is a dictionary whose keys are the computed unique groups 
# and corresponding values being the groupings

data.groupby(['specialisation']).groups.keys()

In [None]:
# Accessing the groups
# How many students have a Marketing and HR specialization?

len(data.groupby(['specialisation']).groups['Mkt&HR'])

There are 95 students with a Marketing and HR specialization.
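Besides .groups, a GroupBy object offers get_group() to pull out the rows of one group and size() to count the rows of every group at once. The sketch below uses a hypothetical five-row table rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names
toy = pd.DataFrame({
    "specialisation": ["Mkt&Fin", "Mkt&HR", "Mkt&Fin", "Mkt&HR", "Mkt&Fin"],
    "mba_p": [60.0, 65.0, 70.0, 55.0, 62.0],
})

grouped = toy.groupby("specialisation")

# Mapping of group key -> index labels of the rows in that group
row_labels = grouped.groups

# The actual rows belonging to one group
hr_rows = grouped.get_group("Mkt&HR")

# Number of rows in every group
group_sizes = grouped.size()

print(row_labels)
print(hr_rows)
print(group_sizes)
```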

<a id='section5'></a>
## Summary Statistics 
Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to obtain summary statistics for each group – an immensely useful function.

In [None]:
# Using grouping and summary functions together - 
# What is the top MBA percentage in each specialization?

data.groupby('specialisation')['mba_p'].max()

In [None]:
# What is the total salary offered to students in each specialization?

data.groupby('specialisation')['salary'].sum()

In [None]:
# What is the mean salary offered to students in each specialization?

data.groupby('specialisation')['salary'].mean()

In [3]:
# What is the mean salary for students with prior degree in Sci&Tech grouped by specialization?
# Do students with prior degree in Sci&Tech get better salaries by doing a Mkt&Fin or Mkt&HR specialization?

data[data['degree_t'] == 'Sci&Tech'].groupby('specialisation')['salary'].mean()

specialisation
Mkt&Fin    322720.0
Mkt&HR     301937.5
Name: salary, dtype: float64

**Grouping by multiple variables**

You can also group by more than one variable, allowing more complex queries.

In [4]:
# Grouping by multiple variables/column values.
# Can you find the male/female counts by specialization? Do you see females favoring one over the other?

data.groupby(['specialisation','gender'])['gender'].count()

# Do you think female students prefer HR or Finance?

specialisation  gender
Mkt&Fin         F         37
                M         83
Mkt&HR          F         39
                M         56
Name: gender, dtype: int64

In [None]:
# Does the prior degree or specialization matter for expected salary for a student?
# What is the mean salary for students with different specializations and prior degrees?

data.groupby(['degree_t','specialisation'])['salary'].mean()


#### Groupby output format – Series or DataFrame?

The output from a groupby and aggregation operation varies between Pandas Series and Pandas Dataframes, which can be confusing for new users. As a rule of thumb, if you calculate more than one column of results, your result will be a Dataframe. For a single column of results, the agg function, by default, will produce a Series.

In [None]:
# produces Pandas Series

print(type(data.groupby('specialisation')['salary'].mean()))

data.groupby('specialisation')['salary'].mean()

In [None]:
# To get a Pandas DataFrame you could select your operation column differently
# notice the double brackets around 'salary'

print(type(data.groupby('specialisation')[['salary']].mean()))

data.groupby('specialisation')[['salary']].mean()

In [None]:
# To keep the group keys as a regular column instead of the index, pass as_index=False

data.groupby('specialisation', as_index=False)['salary'].mean()
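The three output shapes discussed above can be compared side by side on a small, hypothetical table. The sketch also shows reset_index() as another way to turn the group keys back into a regular column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "specialisation": ["Mkt&Fin", "Mkt&HR", "Mkt&Fin"],
    "salary": [300.0, 200.0, 400.0],
})

# Single brackets: the result is a Series indexed by the group key
as_series = toy.groupby("specialisation")["salary"].mean()

# Double brackets: the result is a one-column DataFrame
as_frame = toy.groupby("specialisation")[["salary"]].mean()

# as_index=False: the group key becomes an ordinary column
flat = toy.groupby("specialisation", as_index=False)["salary"].mean()

# reset_index() on the Series gives the same layout
flat_alt = as_series.reset_index()

print(type(as_series), type(as_frame))
print(flat)
```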

<a id='section6'></a>
### Multiple Statistics per Group
The final piece of syntax that we’ll examine is the “agg()” function for Pandas. The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation.

We will look at two special cases - 

1. Applying a single function to columns in groups
2. Applying multiple functions to columns in groups

Applying functions to the groups is the most important step of a GroupBy operation: we can perform a variety of operations through aggregation, transformation, filtration, or even our own functions. We have already applied various statistical functions like sum(), mean(), count(), etc.

Let’s have a look at how to apply these in further detail.
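Before going into agg() in detail, here is a brief sketch of the four kinds of operation mentioned above, run on a hypothetical table. The threshold of three rows and the "salary range" function are arbitrary choices made for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "specialisation": ["Mkt&Fin", "Mkt&Fin", "Mkt&HR", "Mkt&HR", "Mkt&HR"],
    "salary": [300.0, 400.0, 200.0, 250.0, 300.0],
})
grouped = toy.groupby("specialisation")

# Aggregation: one value per group
group_means = grouped["salary"].mean()

# Transformation: one value per original row (here, distance from the group mean)
deviation = toy["salary"] - grouped["salary"].transform("mean")

# Filtration: keep only the rows of groups that pass a test (at least 3 rows)
large_groups = grouped.filter(lambda g: len(g) >= 3)

# Your own function: salary range (max minus min) per group
salary_range = grouped["salary"].agg(lambda s: s.max() - s.min())

print(group_means)
print(deviation)
print(large_groups)
print(salary_range)
```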

#### 1. Applying a single function to columns in groups
Instructions for aggregation are provided in the form of a python dictionary or list. The dictionary keys are used to specify the columns upon which you’d like to perform operations, and the dictionary values to specify the function to run.

In [None]:
# Group the data frame by degree type and specialisation and extract a number of stats from each group
data.groupby(['degree_t', 'specialisation']).agg(
    {
         'salary': 'mean',   # mean salary per group
         'gender': 'count',  # number of students per group
         'etest_p': 'max'    # highest employability test score per group
    }
)

The aggregation dictionary syntax is flexible and can be defined before the operation. 

In [None]:
# Define the aggregation procedure outside of the groupby operation
# Same example as above, just defining the aggregations as a dictionary

aggregations = {
         'salary': 'mean',   # mean salary per group
         'gender': 'count',  # number of students per group
         'etest_p': 'max'    # highest employability test score per group
    }

data.groupby(['degree_t', 'specialisation']).agg(aggregations)

#### 2. Applying multiple functions to columns in groups
To apply multiple functions to a single column in your grouped data, expand the syntax above to pass in a list of functions as the value in your aggregation dictionary. See below:

In [None]:
# Group the data frame by degree type and specialisation and extract a number of stats from each group
# Function names are passed as strings; recent pandas versions warn when the
# Python builtins min and max are passed directly
data.groupby(
    ['degree_t', 'specialisation']
).agg(
    {
        # Find the min, max, and mean of the salary column
        'salary': ['min', 'max', 'mean'],
        # Find the number of students in each group
        'gender': 'count',
        # Find the min, max, and mean of the employability test scores
        'etest_p': ['min', 'max', 'mean']
    }
)
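When lists of functions are passed, the result has two levels of column labels (the original column and the function name). One common way to get back to simple column names is to join the two levels. The sketch below does this on a hypothetical table; the underscore separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names
toy = pd.DataFrame({
    "degree_t": ["Sci&Tech", "Sci&Tech", "Comm&Mgmt"],
    "salary": [300.0, 400.0, 250.0],
    "etest_p": [70.0, 80.0, 60.0],
})

stats = toy.groupby("degree_t").agg(
    {"salary": ["min", "max"], "etest_p": ["mean"]}
)

# stats.columns is a MultiIndex of (column, function) pairs;
# join each pair into a single label such as 'salary_min'
stats.columns = ["_".join(col) for col in stats.columns]

print(stats)
```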

<a id='section7'></a>
## Renaming grouped aggregation columns
Introduced in Pandas 0.25.0, groupby aggregation with relabelling is supported using "named aggregation" with simple tuples. Each keyword argument names an output column, and its tuple provides the input column to work on along with the function to apply.


In [None]:
# Is there an impact of hsc board or the prior degree on min, max, or mean salaries?
# Are salaries dependent on hsc board or the prior degree of a student?

data[data['specialisation'] == 'Mkt&Fin'].groupby(['hsc_b','degree_t']).agg(
    # Get max of the salary column for each group
    max_salary=('salary', 'max'),
    # Get min of the salary column for each group
    min_salary=('salary', 'min'),
    # Get mean of the salary column for each group
    mean_salary=('salary', 'mean'),
)
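For reference, the tuple form is shorthand for pd.NamedAgg, which spells out the column and aggfunc fields by name. A minimal sketch on hypothetical data, mixing both spellings:

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names
toy = pd.DataFrame({
    "degree_t": ["Sci&Tech", "Sci&Tech", "Comm&Mgmt"],
    "salary": [300.0, 400.0, 250.0],
})

summary = toy.groupby("degree_t").agg(
    # Tuple form: (input column, function)
    max_salary=("salary", "max"),
    # Equivalent explicit form
    mean_salary=pd.NamedAgg(column="salary", aggfunc="mean"),
)

print(summary)
```

The keyword names become the output column names directly, so no separate renaming step is needed.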

<a id='section8'></a>
## Pivot Tables

Pivot tables are another way of summarizing your data.

A pivot table is composed of counts, sums, or other aggregations derived from a table of data. You may have used this feature in spreadsheets, where you would choose the rows and columns to aggregate on, and the values for those rows and columns. It allows us to summarize data as grouped by different values, including values in categorical columns.

**You can define how values are grouped by:**

- index= ("Rows" in Excel)
- columns= ("Columns" in Excel)


**We define which values are summarized by:**

values= the name of the column of values to be aggregated in the ultimate table, then grouped by the Index and Columns and aggregated according to the Aggregation Function


**We define how values are summarized by:**

aggfunc= (Aggregation Function) how rows are summarized, such as sum, mean, or count

Let's go ahead and create a pivot table for our dataframe - 

In [97]:
pivot_data = data.pivot_table(
    index='degree_t', 
    columns=['specialisation','status'], 
    values=['gender'], 
    aggfunc='count'
)

In [98]:
pivot_data

                   gender
specialisation    Mkt&Fin            Mkt&HR
status         Not Placed Placed Not Placed Placed
degree_t
Comm&Mgmt              18     68         25     34
Others                  2      2          4      3
Sci&Tech                5     25         13     16
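A pivot table is closely related to the groupby operations used earlier: grouping by the index and column variables, aggregating, and then moving one grouping level into the columns with unstack() gives the same layout. The sketch below checks this on a hypothetical five-row table.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names
toy = pd.DataFrame({
    "degree_t": ["Sci&Tech", "Sci&Tech", "Comm&Mgmt", "Comm&Mgmt", "Comm&Mgmt"],
    "status": ["Placed", "Not Placed", "Placed", "Placed", "Not Placed"],
    "sl_no": [1, 2, 3, 4, 5],
})

# Pivot table: rows by degree type, columns by placement status, counting students
pivot = toy.pivot_table(
    index="degree_t", columns="status", values="sl_no", aggfunc="count"
)

# The same counts via groupby, then moving 'status' into the columns
via_groupby = toy.groupby(["degree_t", "status"])["sl_no"].count().unstack("status")

print(pivot)
print(via_groupby)
```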


This project was created using many online resources, the most important being Shane Lynn's [blog](https://www.shanelynn.ie/) on pandas.