## Groupby
<b>DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)</b>

Groups a DataFrame using a mapper (a dict or a key function applied to the index labels) or by one or more columns. A function can then be applied to each group, and the per-group results are returned as a single combined object.
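As a minimal sketch of the mapper forms mentioned above, the example below groups a small Series first with a dict and then with a key function. The `grades` data, the `section` mapping and the odd/even rule are made up for illustration and are not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical grades indexed by student label (illustrative data only).
grades = pd.Series([87, 79, 83, 81], index=['S1', 'S2', 'S3', 'S4'])

# Dict mapper: each index label is looked up to find its group name.
section = {'S1': 'A', 'S2': 'A', 'S3': 'B', 'S4': 'B'}
by_dict = grades.groupby(section).mean()

# Key function: called once per index label; the return value is the group key.
by_func = grades.groupby(
    lambda label: 'odd' if int(label[1:]) % 2 else 'even'
).sum()

print(by_dict)
print(by_func)
```

In both cases the grouping is driven by the index labels rather than by a column, which is the main difference from `groupby('column_name')`.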

### Any groupby operation involves one or more of the following operations on the original object:

* Splitting the Object

* Applying a function

* Combining the results

<img style="float: left;" src="https://i.stack.imgur.com/sgCn1.jpg"></img>
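To make the three steps concrete, here is a rough sketch that performs split, apply and combine by hand and compares the result with a single `groupby` call. The four-row frame is illustrative only, and the manual version is just one way to spell the steps out, not how pandas implements them internally.

```python
import pandas as pd

# Small illustrative frame (not the notebook's dataset).
df = pd.DataFrame({'Students': ['S1', 'S2', 'S1', 'S2'],
                   'Grade': [87, 79, 74, 94]})

# Split: one sub-DataFrame per distinct value of 'Students'.
pieces = {key: group for key, group in df.groupby('Students')}

# Apply: compute a result for each piece independently.
means = {key: group['Grade'].mean() for key, group in pieces.items()}

# Combine: collect the per-group results into a single object.
manual = pd.Series(means, name='Grade')

# The same three steps expressed as one chained call.
direct = df.groupby('Students')['Grade'].mean()

print(manual)
print(direct)
```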

In [1]:
# import library
import pandas as pd

In [4]:
data = {'Students': ['S1', 'S2', 'S3', 'S3', 'S1',
         'S4', 'S4', 'S3', 'S2', 'S2', 'S4', 'S3'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Grade':[87,79,83,73,74,81,56,78,94,70,80,69]}
df = pd.DataFrame(data)
df

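| | Grade | Rank | Students | Year |
|---|---|---|---|---|
| 0 | 87 | 1 | S1 | 2014 |
| 1 | 79 | 2 | S2 | 2015 |
| 2 | 83 | 2 | S3 | 2014 |
| 3 | 73 | 3 | S3 | 2015 |
| 4 | 74 | 3 | S1 | 2014 |
| 5 | 81 | 4 | S4 | 2015 |
| 6 | 56 | 1 | S4 | 2016 |
| 7 | 78 | 1 | S3 | 2017 |
| 8 | 94 | 2 | S2 | 2016 |
| 9 | 70 | 4 | S2 | 2014 |
| 10 | 80 | 1 | S4 | 2015 |
| 11 | 69 | 2 | S3 | 2017 |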


### Split Data into Groups
A pandas object can be split into groups along any of its axes. There are multiple ways to split an object, such as:

* obj.groupby('key')
* obj.groupby(['key1','key2'])
* obj.groupby(key,axis=1)
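A note on the last form: as far as I know, recent pandas releases deprecate the `axis` argument of `groupby`. One workaround, sketched here under that assumption, is to transpose the frame, group its rows, and transpose back. The `scores` frame and the `subject` mapping are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical per-test scores, one column per test (illustrative only).
scores = pd.DataFrame({'math_1': [80, 70], 'math_2': [90, 60],
                       'art_1': [50, 65]}, index=['S1', 'S2'])

# Hypothetical mapping from column name to subject.
subject = {'math_1': 'math', 'math_2': 'math', 'art_1': 'art'}

# Transpose so the columns become rows, group those rows by subject,
# sum within each subject, then transpose back to one row per student.
per_subject = scores.T.groupby(subject).sum().T

print(per_subject)
```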

Let us now apply these grouping calls to the DataFrame object.

In [7]:
# group the rows by student
df.groupby('Students')

<pandas.core.groupby.DataFrameGroupBy object at 0x0000015F54F9B748>

In [8]:
# to view groups 
df.groupby('Students').groups

{'S1': Int64Index([0, 4], dtype='int64'),
 'S2': Int64Index([1, 8, 9], dtype='int64'),
 'S3': Int64Index([2, 3, 7, 11], dtype='int64'),
 'S4': Int64Index([5, 6, 10], dtype='int64')}

In [9]:
# you can group by multiple columns
df.groupby(['Students','Year']).groups

{('S1', 2014): Int64Index([0, 4], dtype='int64'),
 ('S2', 2014): Int64Index([9], dtype='int64'),
 ('S2', 2015): Int64Index([1], dtype='int64'),
 ('S2', 2016): Int64Index([8], dtype='int64'),
 ('S3', 2014): Int64Index([2], dtype='int64'),
 ('S3', 2015): Int64Index([3], dtype='int64'),
 ('S3', 2017): Int64Index([7, 11], dtype='int64'),
 ('S4', 2015): Int64Index([5, 10], dtype='int64'),
 ('S4', 2016): Int64Index([6], dtype='int64')}
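When only the number of rows per group matters, `size()` and `ngroups` give a more compact view than the full `.groups` mapping. The sketch below rebuilds the notebook's data so that it runs on its own.

```python
import pandas as pd

data = {'Students': ['S1', 'S2', 'S3', 'S3', 'S1',
                     'S4', 'S4', 'S3', 'S2', 'S2', 'S4', 'S3'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                 2016, 2017, 2016, 2014, 2015, 2017],
        'Grade': [87, 79, 83, 73, 74, 81, 56, 78, 94, 70, 80, 69]}
df = pd.DataFrame(data)

grouped = df.groupby(['Students', 'Year'])

# Number of distinct (student, year) combinations.
print(grouped.ngroups)

# Number of rows in each combination, indexed by (Students, Year).
sizes = grouped.size()
print(sizes)
```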

In [11]:
# iterating through groups
grouped = df.groupby('Students')
for student, group in grouped:
    print(student)   # the group key
    print(group)     # the sub-DataFrame holding that student's rows

S1
   Grade  Rank Students  Year
0     87     1       S1  2014
4     74     3       S1  2014
S2
   Grade  Rank Students  Year
1     79     2       S2  2015
8     94     2       S2  2016
9     70     4       S2  2014
S3
    Grade  Rank Students  Year
2      83     2       S3  2014
3      73     3       S3  2015
7      78     1       S3  2017
11     69     2       S3  2017
S4
    Grade  Rank Students  Year
5      81     4       S4  2015
6      56     1       S4  2016
10     80     1       S4  2015


In [13]:
# select group by value
grouped = df.groupby('Year')
print(grouped.get_group(2014))

   Grade  Rank Students  Year
0     87     1       S1  2014
2     83     2       S3  2014
4     74     3       S1  2014
9     70     4       S2  2014
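`get_group` also works when grouping by several columns; in that case the key is a tuple with one value per grouping column. A short sketch using the same data:

```python
import pandas as pd

data = {'Students': ['S1', 'S2', 'S3', 'S3', 'S1',
                     'S4', 'S4', 'S3', 'S2', 'S2', 'S4', 'S3'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                 2016, 2017, 2016, 2014, 2015, 2017],
        'Grade': [87, 79, 83, 73, 74, 81, 56, 78, 94, 70, 80, 69]}
df = pd.DataFrame(data)

grouped = df.groupby(['Students', 'Year'])

# The key is a (Students, Year) tuple, in the same order as the groupby columns.
s3_2017 = grouped.get_group(('S3', 2017))
print(s3_2017)
```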


In [16]:
# find the mean grade for each year
import numpy as np
grouped = df.groupby('Year')
# passing the string 'mean' avoids the warning newer pandas emits for np.mean
print(grouped['Grade'].agg('mean'))

Year
2014    78.50
2015    78.25
2016    75.00
2017    73.50
Name: Grade, dtype: float64


In [18]:
# find the average grade for each student, rounded to a whole number
grouped = df.groupby('Students')
print(grouped['Grade'].agg('mean').round())

Students
S1    80.0
S2    81.0
S3    76.0
S4    72.0
Name: Grade, dtype: float64
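Several statistics can be computed in one pass with `agg`. The sketch below uses named aggregation, where each output column is given as `name=(column, function)`. The output names `mean_grade`, `best_rank` and `n_records` are arbitrary labels chosen for this example.

```python
import pandas as pd

data = {'Students': ['S1', 'S2', 'S3', 'S3', 'S1',
                     'S4', 'S4', 'S3', 'S2', 'S2', 'S4', 'S3'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                 2016, 2017, 2016, 2014, 2015, 2017],
        'Grade': [87, 79, 83, 73, 74, 81, 56, 78, 94, 70, 80, 69]}
df = pd.DataFrame(data)

# One row per student, one column per requested statistic.
summary = df.groupby('Students').agg(
    mean_grade=('Grade', 'mean'),   # average grade
    best_rank=('Rank', 'min'),      # lowest (best) rank achieved
    n_records=('Grade', 'size'),    # number of rows for the student
)
print(summary)
```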


In [19]:
# count how often each grade occurs within each year
grouped = df.groupby('Year')
print(grouped['Grade'].value_counts())

Year  Grade
2014  70       1
      74       1
      83       1
      87       1
2015  73       1
      79       1
      80       1
      81       1
2016  56       1
      94       1
2017  69       1
      78       1
Name: Grade, dtype: int64
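Because every grade in this dataset is unique within its year, all the counts above are 1. Counting a column with repeated values shows `value_counts` more clearly; as a sketch, here it is applied to `Rank` per student.

```python
import pandas as pd

data = {'Students': ['S1', 'S2', 'S3', 'S3', 'S1',
                     'S4', 'S4', 'S3', 'S2', 'S2', 'S4', 'S3'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                 2016, 2017, 2016, 2014, 2015, 2017],
        'Grade': [87, 79, 83, 73, 74, 81, 56, 78, 94, 70, 80, 69]}
df = pd.DataFrame(data)

# How many times each rank was achieved by each student.
rank_counts = df.groupby('Students')['Rank'].value_counts()
print(rank_counts)
```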


In [21]:
# Filtration keeps or drops whole groups based on a criterion and returns the matching rows.
# The function passed to filter() is called once per group and must return True or False.
# here we keep only the students who have at least 3 records
df.groupby('Students').filter(lambda x: len(x) >= 3)

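| | Grade | Rank | Students | Year |
|---|---|---|---|---|
| 1 | 79 | 2 | S2 | 2015 |
| 2 | 83 | 2 | S3 | 2014 |
| 3 | 73 | 3 | S3 | 2015 |
| 5 | 81 | 4 | S4 | 2015 |
| 6 | 56 | 1 | S4 | 2016 |
| 7 | 78 | 1 | S3 | 2017 |
| 8 | 94 | 2 | S2 | 2016 |
| 9 | 70 | 4 | S2 | 2014 |
| 10 | 80 | 1 | S4 | 2015 |
| 11 | 69 | 2 | S3 | 2017 |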
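`filter()` keeps or drops whole groups, so on its own it does not rank them. One possible way to pick the three students with the highest mean grade is to aggregate first and then call `nlargest`. The sketch also shows `filter()` with a per-group statistic as its criterion; the threshold of 80 is an arbitrary value chosen for illustration.

```python
import pandas as pd

data = {'Students': ['S1', 'S2', 'S3', 'S3', 'S1',
                     'S4', 'S4', 'S3', 'S2', 'S2', 'S4', 'S3'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015,
                 2016, 2017, 2016, 2014, 2015, 2017],
        'Grade': [87, 79, 83, 73, 74, 81, 56, 78, 94, 70, 80, 69]}
df = pd.DataFrame(data)

# Mean grade per student, then the three largest values.
top3 = df.groupby('Students')['Grade'].mean().nlargest(3)
print(top3)

# filter() with a statistic: keep every row belonging to a student
# whose mean grade is at least 80 (arbitrary threshold).
high = df.groupby('Students').filter(lambda g: g['Grade'].mean() >= 80)
print(high)
```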


### I'll be updating this notebook soon, using a real dataset!