---
<center><h1>Lesson 2 - Basic intro into pandas</h1></center> 
---
---

<center><h2>Part 4. Work with pandas DataFrames: grouping</h2></center>
---

## Table of Contents

- [Work with pandas DataFrames: grouping](#Work-with-pandas-DataFrames:-grouping)
    * [Splitting of a DataFrame into groups](#Splitting-of-a-DataFrame-into-groups)
    * [Selection and filtering](#Selection-and-filtering)
    * [Aggregation and function application](#Aggregation-and-function-application)
    - [*Exercise 4.1*](#Exercise-4.1)

In [1]:
import pandas as pd
import numpy as np
import random

## Work with pandas DataFrames: grouping

[[back to top]](#Table-of-Contents)

To make the demonstrations easier to follow, let's load `movies` and then select a shortened DataFrame from it, for instance like this:

In [2]:
movies = pd.read_csv('data/movies.csv')

In [3]:
movies.dtypes

user_id           int64
movie_id          int64
rating            int64
timestamp         int64
age             float64
gender           object
occupation       object
zip_code         object
movie_title      object
release_date     object
IMDb_URL         object
unknown           int64
Action            int64
Adventure         int64
Animation         int64
Childrens         int64
Comedy            int64
Crime             int64
Documentary       int64
Drama             int64
Fantasy           int64
Film-Noir         int64
Horror            int64
Musical           int64
Mystery           int64
Romance           int64
Sci-Fi            int64
Thriller          int64
War               int64
Western           int64
dtype: object

In [4]:
# select a subset of columns and the rows whose index labels fall within the given ranges
short_movies = movies[['user_id', 'movie_id', 'rating','timestamp', 'age', 'occupation','movie_title','release_date']] \
                     .loc[list(range(50)) + list(range(100,150)) + list(range(250,500))].reset_index(drop=True)
short_movies.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,occupation,movie_title,release_date
0,196,242,3,881250949,49.0,writer,Kolya (1996),1997-01-24
1,305,242,5,886307828,23.0,programmer,Kolya (1996),1997-01-24
2,6,242,4,883268170,42.0,executive,Kolya (1996),1997-01-24
3,234,242,4,891033261,60.0,retired,Kolya (1996),1997-01-24
4,63,242,3,875747190,31.0,marketing,Kolya (1996),1997-01-24
5,181,242,1,878961814,26.0,executive,Kolya (1996),1997-01-24
6,201,242,4,884110598,27.0,writer,Kolya (1996),1997-01-24
7,249,242,5,879571438,25.0,student,Kolya (1996),1997-01-24
8,13,242,2,881515193,47.0,educator,Kolya (1996),1997-01-24
9,279,242,3,877756647,33.0,programmer,Kolya (1996),1997-01-24


In short, after reading this part you will know:
1.	how to split a DataFrame into groups;
2.	how to select a group in a grouped DataFrame and filter the grouped data;
3.	how to aggregate grouped data and how to apply several functions to it at once.


### Splitting of a DataFrame into groups

[[back to top]](#Table-of-Contents)

pandas DataFrames can be split along either of their axes. Grouping means providing a mapping from labels to group names, and pandas provides the `groupby()` method for this. `groupby()` returns a `GroupBy` object, which does not hold any computed result yet; it essentially describes how the rows of the original data set have been split.


In [5]:
movies_grouped = movies.groupby('movie_id')
movies_grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x7f7e280594d0>

In [6]:
movies_grouped['user_id']

<pandas.core.groupby.SeriesGroupBy object at 0x7f7e28059890>

Use `list()` to show what a grouping looks like

In [7]:
short_grouped = short_movies.groupby('movie_id')
list(short_grouped)[:3]

[(242,
      user_id  movie_id  rating  timestamp   age     occupation   movie_title  \
  0       196       242       3  881250949  49.0         writer  Kolya (1996)   
  1       305       242       5  886307828  23.0     programmer  Kolya (1996)   
  2         6       242       4  883268170  42.0      executive  Kolya (1996)   
  3       234       242       4  891033261  60.0        retired  Kolya (1996)   
  4        63       242       3  875747190  31.0      marketing  Kolya (1996)   
  5       181       242       1  878961814  26.0      executive  Kolya (1996)   
  6       201       242       4  884110598  27.0         writer  Kolya (1996)   
  7       249       242       5  879571438  25.0        student  Kolya (1996)   
  8        13       242       2  881515193  47.0       educator  Kolya (1996)   
  9       279       242       3  877756647  33.0     programmer  Kolya (1996)   
  10      145       242       5  875269755  31.0  entertainment  Kolya (1996)   
  11       90       2

As can be seen, iterating over a `GroupBy` object yields pairs, much like the items of a dictionary: the key is one of the computed unique group values, and the value is the part of the DataFrame (the rows) belonging to that group.

In [8]:
# to see every group, iterate over short_grouped directly instead of the sliced list:
# for key, value in short_grouped:
for key, value in list(short_grouped)[:3]:
    print(key)
    print(value)

242
    user_id  movie_id  rating  timestamp   age     occupation   movie_title  \
0       196       242       3  881250949  49.0         writer  Kolya (1996)   
1       305       242       5  886307828  23.0     programmer  Kolya (1996)   
2         6       242       4  883268170  42.0      executive  Kolya (1996)   
3       234       242       4  891033261  60.0        retired  Kolya (1996)   
4        63       242       3  875747190  31.0      marketing  Kolya (1996)   
5       181       242       1  878961814  26.0      executive  Kolya (1996)   
6       201       242       4  884110598  27.0         writer  Kolya (1996)   
7       249       242       5  879571438  25.0        student  Kolya (1996)   
8        13       242       2  881515193  47.0       educator  Kolya (1996)   
9       279       242       3  877756647  33.0     programmer  Kolya (1996)   
10      145       242       5  875269755  31.0  entertainment  Kolya (1996)   
11       90       242       4  891382267  60.0  
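
The same `(key, group)` structure can be checked on something small. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are invented for illustration and are not taken from `movies`):

```python
import pandas as pd

# hypothetical toy data, invented for illustration
toy = pd.DataFrame({'movie_id': [242, 242, 251],
                    'rating': [3, 5, 4]})

toy_grouped = toy.groupby('movie_id')

# each iteration step yields a (group key, sub-DataFrame) pair
pairs = [(key, group) for key, group in toy_grouped]
for key, group in pairs:
    print(key, type(group).__name__, len(group))
```

Each `group` is an ordinary DataFrame, so anything you can do with a DataFrame can be done with a single group.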

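You can also look at the first and the last item of each group. Note that `first()` and `last()` return, for every column, the first (or last) non-null value within the group.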

In [9]:
movies_grouped.first().head(10)

Unnamed: 0_level_0,user_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,244,4,880604405,28.0,M,technician,80525,Toy Story (1995),1995-01-01,http://us.imdb.com/M/title-exact?Toy%20Story%2...,...,0,0,0,0,0,0,0,0,0,0
2,22,2,878887925,25.0,M,writer,40206,GoldenEye (1995),1995-01-01,http://us.imdb.com/M/title-exact?GoldenEye%20(...,...,0,0,0,0,0,0,0,1,0,0
3,244,5,880602451,28.0,M,technician,80525,Four Rooms (1995),1995-01-01,http://us.imdb.com/M/title-exact?Four%20Rooms%...,...,0,0,0,0,0,0,0,1,0,0
4,22,5,878886571,25.0,M,writer,40206,Get Shorty (1995),1995-01-01,http://us.imdb.com/M/title-exact?Get%20Shorty%...,...,0,0,0,0,0,0,0,0,0,0
5,303,2,879484534,19.0,M,student,14853,Copycat (1995),1995-01-01,http://us.imdb.com/M/title-exact?Copycat%20(1995),...,0,0,0,0,0,0,0,1,0,0
6,63,3,875747439,31.0,M,marketing,75240,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,1995-01-01,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,...,0,0,0,0,0,0,0,0,0,0
7,244,4,880602558,28.0,M,technician,80525,Twelve Monkeys (1995),1995-01-01,http://us.imdb.com/M/title-exact?Twelve%20Monk...,...,0,0,0,0,0,0,1,0,0,0
8,196,5,881251753,49.0,M,writer,55105,Babe (1995),1995-01-01,http://us.imdb.com/M/title-exact?Babe%20(1995),...,0,0,0,0,0,0,0,0,0,0
9,244,5,880604179,28.0,M,technician,80525,Dead Man Walking (1995),1995-01-01,http://us.imdb.com/M/title-exact?Dead%20Man%20...,...,0,0,0,0,0,0,0,0,0,0
10,234,3,891227851,60.0,M,retired,94702,Richard III (1995),1996-01-22,http://us.imdb.com/M/title-exact?Richard%20III...,...,0,0,0,0,0,0,0,0,1,0


In [10]:
movies_grouped.last().head(10)

Unnamed: 0_level_0,user_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,941,5,875049144,20.0,M,student,97229,Toy Story (1995),1995-01-01,http://us.imdb.com/M/title-exact?Toy%20Story%2...,...,0,0,0,0,0,0,0,0,0,0
2,943,5,888639953,22.0,M,student,77841,GoldenEye (1995),1995-01-01,http://us.imdb.com/M/title-exact?GoldenEye%20(...,...,0,0,0,0,0,0,0,1,0,0
3,936,4,886833148,24.0,M,other,32789,Four Rooms (1995),1995-01-01,http://us.imdb.com/M/title-exact?Four%20Rooms%...,...,0,0,0,0,0,0,0,1,0,0
4,940,2,885922040,32.0,M,administrator,2215,Get Shorty (1995),1995-01-01,http://us.imdb.com/M/title-exact?Get%20Shorty%...,...,0,0,0,0,0,0,0,0,0,0
5,925,4,884718156,18.0,F,salesman,49036,Copycat (1995),1995-01-01,http://us.imdb.com/M/title-exact?Copycat%20(1995),...,0,0,0,0,0,0,0,1,0,0
6,936,5,886832636,24.0,M,other,32789,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,1995-01-01,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,...,0,0,0,0,0,0,0,0,0,0
7,941,4,875048952,20.0,M,student,97229,Twelve Monkeys (1995),1995-01-01,http://us.imdb.com/M/title-exact?Twelve%20Monk...,...,0,0,0,0,0,0,1,0,0,0
8,930,3,879535713,28.0,F,administrator,7310,Babe (1995),1995-01-01,http://us.imdb.com/M/title-exact?Babe%20(1995),...,0,0,0,0,0,0,0,0,0,0
9,936,4,886832373,24.0,M,other,32789,Dead Man Walking (1995),1995-01-01,http://us.imdb.com/M/title-exact?Dead%20Man%20...,...,0,0,0,0,0,0,0,0,0,0
10,906,4,879435339,45.0,M,librarian,70124,Richard III (1995),1996-01-22,http://us.imdb.com/M/title-exact?Richard%20III...,...,0,0,0,0,0,0,0,0,1,0
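
One detail is easy to miss in the large tables above: `first()` works column by column and skips missing values, so it is not always the same as "the first row of each group". A minimal sketch on a hypothetical toy frame (the names `toy`, `k` and `v` are invented for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: the first row of group 'a' has a missing value
toy = pd.DataFrame({'k': ['a', 'a', 'b'],
                    'v': [np.nan, 2.0, 3.0]})

grouped = toy.groupby('k')

# first(): first non-null value of each column within every group
first_values = grouped.first()

# head(1): literally the first row of every group, NaN included
first_rows = grouped.head(1)

print(first_values)
print(first_rows)
```

If you need the literal first row, `head(1)` is the safer choice; if you need the first available value, use `first()`.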


To close this subsection, note that `groupby()` also accepts a function instead of a column name. In that case the function is called on each index label, and the values it returns are used as the group keys.

In [11]:
# let's use the values of the 'timestamp' column as the index
short_dated = short_movies.set_index('timestamp')
short_dated.head(10)

Unnamed: 0_level_0,user_id,movie_id,rating,age,occupation,movie_title,release_date
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
881250949,196,242,3,49.0,writer,Kolya (1996),1997-01-24
886307828,305,242,5,23.0,programmer,Kolya (1996),1997-01-24
883268170,6,242,4,42.0,executive,Kolya (1996),1997-01-24
891033261,234,242,4,60.0,retired,Kolya (1996),1997-01-24
875747190,63,242,3,31.0,marketing,Kolya (1996),1997-01-24
878961814,181,242,1,26.0,executive,Kolya (1996),1997-01-24
884110598,201,242,4,27.0,writer,Kolya (1996),1997-01-24
879571438,249,242,5,25.0,student,Kolya (1996),1997-01-24
881515193,13,242,2,47.0,educator,Kolya (1996),1997-01-24
877756647,279,242,3,33.0,programmer,Kolya (1996),1997-01-24


In [12]:
from datetime import datetime
# the function passed to groupby() is called on every index label (here, the raw timestamp);
# the identity function below makes each distinct timestamp its own group
res = short_dated.groupby(lambda x: x)
# to see every group, iterate over res directly instead of the sliced list:
# for key, value in res:
for key, value in list(res)[:3]:
    print(key)
    print(value)

874730320
           user_id  movie_id  rating   age occupation            movie_title  \
timestamp                                                                      
874730320      712       393       3  22.0    student  Mrs. Doubtfire (1993)   

          release_date  
timestamp               
874730320   1993-01-01  
874781628
           user_id  movie_id  rating   age  occupation         movie_title  \
timestamp                                                                    
874781628      119       655       5  32.0  programmer  Stand by Me (1986)   

          release_date  
timestamp               
874781628   1986-01-01  
874787350
           user_id  movie_id  rating   age occupation  \
timestamp                                               
874787350       23       381       4  30.0     artist   

                       movie_title release_date  
timestamp                                        
874787350  Muriel's Wedding (1994)   1994-01-01  
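
The identity function is only the simplest case. As a hedged sketch of a more useful key function, the example below groups rows by the calendar year of a Unix timestamp stored in the index. It assumes the timestamps are seconds since the epoch (interpreted as UTC); the toy frame and the helper name `year_of` are invented for illustration:

```python
import pandas as pd

# hypothetical toy data indexed by Unix timestamps (seconds)
toy = pd.DataFrame({'rating': [3, 5, 4]},
                   index=[881250949, 886307828, 875747190])
toy.index.name = 'timestamp'

def year_of(ts):
    """Return the calendar year (UTC) of a Unix timestamp given in seconds."""
    return pd.Timestamp(int(ts), unit='s').year

# the function is called on every index label; its return value is the group key
ratings_per_year = toy.groupby(year_of).size()
print(ratings_per_year)
```

Here two of the three timestamps fall in 1997 and one in 1998, so the result has two groups.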


### Selection and filtering

[[back to top]](#Table-of-Contents)

Descriptive statistics functions such as `sum()`, `count()`, `max()`, `min()` and `mean()` can be applied directly to a `GroupBy` object to obtain summary statistics for each group, which is an immensely useful feature. The same holds for functions such as `describe()`, which return general information about an object.

Let's look at a few examples of applying descriptive statistics functions to a `GroupBy` object.

In [13]:
movies_grouped.count().head(10)

Unnamed: 0_level_0,user_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,452,452,452,417,452,427,452,452,452,452,...,452,452,452,452,452,452,452,452,452,452
2,131,131,131,120,131,127,131,131,131,131,...,131,131,131,131,131,131,131,131,131,131
3,90,90,90,82,90,87,90,90,90,90,...,90,90,90,90,90,90,90,90,90,90
4,209,209,209,201,209,201,209,209,209,209,...,209,209,209,209,209,209,209,209,209,209
5,86,86,86,82,86,81,86,86,86,86,...,86,86,86,86,86,86,86,86,86,86
6,26,26,26,25,26,25,26,26,26,26,...,26,26,26,26,26,26,26,26,26,26
7,392,392,392,365,392,379,392,392,392,392,...,392,392,392,392,392,392,392,392,392,392
8,219,219,219,208,219,198,219,219,219,219,...,219,219,219,219,219,219,219,219,219,219
9,299,299,299,279,299,284,299,299,299,299,...,299,299,299,299,299,299,299,299,299,299
10,89,89,89,82,89,86,89,89,89,89,...,89,89,89,89,89,89,89,89,89,89


In [14]:
movies_grouped.sum().add_prefix('sum_of_').head(10)
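# sum() is meaningful only for numeric columns; in the pandas version used here
# columns of other types are silently dropped from the result
# (newer versions may require sum(numeric_only=True) for the same behavior)
# add_prefix() adds a common prefix to all column names, which is very convenient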

Unnamed: 0_level_0,sum_of_user_id,sum_of_rating,sum_of_timestamp,sum_of_age,sum_of_unknown,sum_of_Action,sum_of_Adventure,sum_of_Animation,sum_of_Childrens,sum_of_Comedy,...,sum_of_Fantasy,sum_of_Film-Noir,sum_of_Horror,sum_of_Musical,sum_of_Mystery,sum_of_Romance,sum_of_Sci-Fi,sum_of_Thriller,sum_of_War,sum_of_Western
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,215609,1753,399028021059,13419.0,0,0,0,452,452,452,...,0,0,0,0,0,0,0,0,0,0
2,64453,420,115727673079,3526.0,0,131,131,0,0,0,...,0,0,0,0,0,0,0,131,0,0
3,41322,273,79400420109,2168.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,90,0,0
4,98125,742,184487964275,6573.0,0,209,0,0,0,209,...,0,0,0,0,0,0,0,0,0,0
5,37786,284,75902581757,2453.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,86,0,0
6,11819,93,22968466745,894.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,174585,1489,345926648221,11165.0,0,0,0,0,0,0,...,0,0,0,0,0,0,392,0,0,0
8,99574,875,193431714253,6883.0,0,0,0,0,219,219,...,0,0,0,0,0,0,0,0,0,0
9,137636,1165,263909983760,9792.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,40069,341,78556407374,3189.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,89,0
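
`describe()`, mentioned at the start of this section, is not shown in the cells here. A minimal sketch on a hypothetical toy frame (values invented for illustration) looks like this:

```python
import pandas as pd

# hypothetical toy data, invented for illustration
toy = pd.DataFrame({'movie_id': [1, 1, 2, 2],
                    'rating': [4, 2, 5, 3]})

# one row per group, one column per summary statistic
rating_stats = toy.groupby('movie_id')['rating'].describe()
print(rating_stats)
```

The result contains the usual `describe()` statistics (count, mean, std, min, quartiles, max) computed separately for every group.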


In [15]:
movies.groupby(['movie_id'])['rating'].sum()
# calculate the sum for a single column
# this is faster than movies.groupby(['movie_id']).sum()['rating'],
# which sums every column before selecting one

movie_id
1       1753
2        420
3        273
4        742
5        284
6         93
7       1489
8        875
9       1165
10       341
11       908
12      1171
13       629
14       726
15      1107
16       125
17       287
18        28
19       273
20       246
21       232
22      1233
23       750
24       600
25      1009
26       252
27       177
28      1085
29       304
30       146
        ... 
1653       5
1654       1
1655       2
1656       7
1657       3
1658       9
1659       1
1660       2
1661       1
1662       5
1663       2
1664      13
1665       2
1666       2
1667       3
1668       3
1669       2
1670       3
1671       1
1672       4
1673       3
1674       4
1675       3
1676       2
1677       3
1678       1
1679       3
1680       2
1681       3
1682       3
Name: rating, dtype: int64
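
To see that the two spellings give the same numbers, here is a small sketch on a hypothetical toy frame (invented for illustration). `numeric_only=True` is passed explicitly so that the non-numeric `title` column is skipped regardless of the pandas version:

```python
import pandas as pd

# hypothetical toy data with one non-numeric column
toy = pd.DataFrame({'movie_id': [1, 1, 2],
                    'rating': [4, 5, 3],
                    'title': ['A', 'A', 'B']})

# select the column first, then aggregate: only 'rating' is summed
select_then_sum = toy.groupby('movie_id')['rating'].sum()

# aggregate everything first, then select: all numeric columns are summed
sum_then_select = toy.groupby('movie_id').sum(numeric_only=True)['rating']

print(select_then_sum.equals(sum_then_select))
```

The first form does less work, which matters on a wide table such as `movies`.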

In [16]:
movies.groupby(['movie_id', 'user_id']).sum()
# it is possible to group by many columns

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,timestamp,age,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movie_id,user_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1,1,5,874965758,24.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,4,888550871,53.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,5,4,875635748,33.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,6,4,883599478,42.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,10,4,877888877,53.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,13,3,882140487,47.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,15,1,879455635,49.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,16,5,877717833,21.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,17,4,885272579,30.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,18,5,880130802,35.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
movies.groupby(['user_id', 'movie_id']).sum()
# note that changing the order of the grouping columns
# in the list also changes the resulting table

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,timestamp,age,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
user_id,movie_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1,1,5,874965758,24.0,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,3,876893171,24.0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,3,4,878542960,24.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,4,3,876893119,24.0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,5,3,889751712,24.0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
1,6,5,887431973,24.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,4,875071561,24.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,8,1,875072484,24.0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,9,5,878543541,24.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,10,3,875693118,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


It is also possible to show all groups

In [18]:
short_grouped.groups

{242: [0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63,
  64,
  65,
  66],
 251: [259,
  260,
  261,
  262,
  263,
  264,
  265,
  266,
  267,
  268,
  269,
  270,
  271,
  272,
  273,
  274,
  275,
  276,
  277,
  278,
  279,
  280,
  281,
  282,
  283,
  284,
  285,
  286,
  287,
  288,
  289,
  290,
  291,
  292,
  293,
  294,
  295,
  296,
  297,
  298,
  299,
  300,
  301,
  302,
  303,
  304],
 381: [159,
  160,
  161,
  162,
  163,
  164,
  165,
  166,
  167,
  168,
  169,
  170,
  171,
  172,
  173,
  174,
  175,
  176,
  177,
  178,
  179,
  180,
  181,
  182,
  183,
  184,
  185,
  186,
  187,
  188,
  189,
  190,
  191,
  192,
  193,
  194,
  195,
  196,
  1

In [19]:
movie_user_id_grouped = movies.groupby(['movie_id', 'user_id'])
movie_user_id_grouped.groups

{(109, 365): [29444],
 (259, 938): [67483],
 (150, 805): [89157],
 (747, 795): [38570],
 (270, 302): [78406],
 (1008, 537): [65284],
 (203, 889): [16707],
 (76, 201): [70181],
 (385, 660): [7628],
 (734, 864): [82340],
 (461, 385): [76470],
 (1351, 896): [96150],
 (380, 311): [36419],
 (751, 800): [55656],
 (820, 903): [18133],
 (534, 782): [76107],
 (403, 504): [22440],
 (96, 870): [22066],
 (1039, 295): [40457],
 (29, 334): [22205],
 (527, 499): [49158],
 (216, 363): [28161],
 (414, 664): [90722],
 (739, 889): [44352],
 (100, 805): [12213],
 (485, 749): [63659],
 (144, 535): [30621],
 (515, 552): [26881],
 (787, 207): [95167],
 (56, 323): [16223],
 (939, 504): [11495],
 (134, 666): [59862],
 (28, 185): [53010],
 (739, 593): [44305],
 (473, 798): [58088],
 (346, 532): [54450],
 (648, 22): [23508],
 (498, 763): [57727],
 (136, 330): [75402],
 (583, 94): [85244],
 (198, 248): [67532],
 (491, 398): [77545],
 (2, 399): [30775],
 (274, 680): [58624],
 (244, 590): [91350],
 (117, 504): [102

and to display the contents of only the group you need

In [20]:
short_grouped.get_group(251)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,occupation,movie_title,release_date
259,196,251,3,881251274,49.0,writer,Shall We Dance? (1996),1997-07-11
260,305,251,5,886321764,23.0,programmer,Shall We Dance? (1996),1997-07-11
261,286,251,5,876521678,27.0,student,Shall We Dance? (1996),1997-07-11
262,303,251,4,879544533,19.0,student,Shall We Dance? (1996),1997-07-11
263,299,251,5,877877434,29.0,doctor,Shall We Dance? (1996),1997-07-11
264,63,251,4,875747514,31.0,marketing,Shall We Dance? (1996),1997-07-11
265,181,251,1,878962052,26.0,executive,Shall We Dance? (1996),1997-07-11
266,293,251,4,888904734,24.0,writer,Shall We Dance? (1996),1997-07-11
267,1,251,4,875071843,24.0,technician,Shall We Dance? (1996),1997-07-11
268,15,251,2,879455541,49.0,educator,Shall We Dance? (1996),1997-07-11
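
Because the real outputs above are long, the relationship between `groups` and `get_group()` may be easier to see on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# hypothetical toy data, invented for illustration
toy = pd.DataFrame({'movie_id': [242, 242, 251],
                    'rating': [3, 5, 4]})
toy_grouped = toy.groupby('movie_id')

# .groups maps every group key to the index labels of its rows
labels_242 = list(toy_grouped.groups[242])

# get_group() returns the rows of one group as a DataFrame
group_251 = toy_grouped.get_group(251)

print(labels_242)
print(group_251)
```

In other words, `get_group(key)` selects the same rows that `groups[key]` lists.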


There is one more helpful and powerful tool: `filter()`. To demonstrate how to use it, let's first build a new grouped object by grouping `movies` by the `'age'` column.

In [21]:
age = movies[['user_id', 'movie_id', 'rating', 'timestamp', 'age', 'occupation', 'movie_title', 'release_date']] \
            .groupby('age')

# to see every group, iterate over age directly instead of the sliced list:
# for key, value in age:
for key, value in list(age)[:3]:
    print(key)
    print(value)

7.0
       user_id  movie_id  rating  timestamp  age occupation  \
41          30       242       5  885941156  7.0    student   
1992        30       286       5  885941156  7.0        NaN   
3523        30       257       4  885941257  7.0    student   
4763        30      1007       5  885941156  7.0    student   
17506       30       258       5  885941156  7.0    student   
19035       30       294       4  875140648  7.0    student   
22203       30        29       3  875106638  7.0    student   
22293       30       683       3  885941798  7.0    student   
22396       30       403       2  875061066  7.0    student   
22667       30       435       5  885941156  7.0    student   
24343       30       172       4  875060742  7.0    student   
27656       30       231       2  875061066  7.0    student   
30027       30       181       4  875060217  7.0    student   
30761       30         2       3  875061066  7.0    student   
30925       30       161       4  875060883  7.0   

Suppose you need to keep only those groups that contain fewer than 100 items. You could, of course, laboriously count the items in each group and then remove the groups that do not satisfy the condition, but `filter()` does it very easily.

In [22]:
age.filter(lambda x: len(x) < 100).head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,occupation,movie_title,release_date
28,131,242,5,883681723,59.0,administrator,Kolya (1996),1997-01-24
41,30,242,5,885941156,7.0,student,Kolya (1996),1997-01-24
69,520,242,5,885168819,62.0,healthcare,Kolya (1996),1997-01-24
108,845,242,4,885409493,64.0,doctor,Kolya (1996),1997-01-24
215,471,393,5,889827918,10.0,student,Mrs. Doubtfire (1993),1993-01-01
220,481,393,3,885829045,73.0,retired,Mrs. Doubtfire (1993),1993-01-01
391,819,381,4,884105841,59.0,administrator,Muriel's Wedding (1994),1994-01-01
422,131,251,5,883681723,59.0,administrator,Shall We Dance? (1996),1997-07-11
871,845,306,2,885409374,64.0,doctor,"Mrs. Brown (Her Majesty, Mrs. Brown) (1997)",1997-01-01
1022,481,238,4,885828245,73.0,retired,Raising Arizona (1987),1987-01-01


One more example:

In [46]:
# keep groups whose sum of 'movie_id' is between 4000 and 10000
age.filter(lambda x: 4000 < x['movie_id'].sum() < 10000)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,occupation,movie_title,release_date
8881,289,742,4,876789463,11.0,none,Ransom (1996),1996-11-08
10197,289,117,4,876789514,11.0,none,"Rock, The (1996)",1996-06-07
13936,289,1016,5,876789843,11.0,none,Con Air (1997),1997-06-06
14811,289,121,3,876789736,11.0,none,Independence Day (ID4) (1996),1996-07-03
17281,289,477,2,876790323,11.0,none,Matilda (1996),1996-08-02
19923,289,405,2,876790576,11.0,none,Mission: Impossible (1996),1996-05-22
20210,289,147,3,876789581,11.0,none,"Long Kiss Goodnight, The (1996)",1996-10-05
23220,289,222,2,876789463,11.0,none,Star Trek: First Contact (1996),1996-11-22
23786,289,455,4,876790464,11.0,,Jackie Chan's First Strike (1996),1997-01-10
29021,289,24,4,876790292,11.0,none,Rumble in the Bronx (1995),1996-02-23
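
The key point about `filter()` is that the function receives a whole group and returns a single `True` or `False`; the rows of every group that passes are kept unchanged. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# hypothetical toy data: age 7 occurs twice, age 30 occurs three times
toy = pd.DataFrame({'age': [7.0, 7.0, 30.0, 30.0, 30.0],
                    'movie_id': [242, 286, 1, 2, 3]})

# keep only the rows that belong to groups with fewer than 3 members
small_groups = toy.groupby('age').filter(lambda g: len(g) < 3)
print(small_groups)
```

Note that the result is a plain DataFrame with the original index labels, not a grouped object.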


### Aggregation and function application

[[back to top]](#Table-of-Contents)

As we have seen previously, pandas allows you to apply any function to a Series or a DataFrame. Grouping is no exception: you can use the same `apply()` method directly on `GroupBy` objects. Suppose that for each `movie_id` you need to find the minimal and maximal values of `age` and also determine the rows where they occur. It can be done easily like this:

In [24]:
# let's create a function that calculates the min and max values and the index labels of these values
def get_min_max(x):
    return {'min': x.min(), 'min_index': x.idxmin(), 'max': x.max(), 'max_index': x.idxmax()}
# and apply this function to the 'age' column of the short_dated DataFrame, grouped by 'movie_id'
short_dated['age'].groupby(short_dated['movie_id']).apply(get_min_max)

movie_id           
242       max                 70.0
          max_index    891462614.0
          min                  7.0
          min_index    885941156.0
251       max                 59.0
          max_index    883681723.0
          min                 18.0
          min_index    876954752.0
381       max                 70.0
          max_index    885990998.0
          min                 13.0
          min_index    880174808.0
393       max                 70.0
          max_index    885991129.0
          min                 13.0
          min_index    880174926.0
655       max                 60.0
          max_index    892333616.0
          min                 19.0
          min_index    879483568.0
dtype: float64

We had an ulterior motive for writing the function so that it returns a dictionary: the `unstack()` method lets us present this result in a more intuitive form (it moves the inner index level into the columns, somewhat like transposing the inner matrix):

In [25]:
short_dated['age'].groupby(short_dated['movie_id']).apply(get_min_max).unstack()

Unnamed: 0_level_0,max,max_index,min,min_index
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
242,70.0,891462614.0,7.0,885941156.0
251,59.0,883681723.0,18.0,876954752.0
381,70.0,885990998.0,13.0,880174808.0
393,70.0,885991129.0,13.0,880174926.0
655,60.0,892333616.0,19.0,879483568.0
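
To isolate what `unstack()` itself does, the sketch below builds a small two-level Series by hand, reusing the min and max ages shown above for movies 242 and 251. The level name `stat` is an assumption added for illustration:

```python
import pandas as pd

# a Series with a two-level index: (movie_id, statistic name)
index = pd.MultiIndex.from_tuples(
    [(242, 'max'), (242, 'min'), (251, 'max'), (251, 'min')],
    names=['movie_id', 'stat'])
long_form = pd.Series([70.0, 7.0, 59.0, 18.0], index=index)

# unstack() moves the innermost index level into the columns
wide_form = long_form.unstack()
print(wide_form)
```

The outer level (`movie_id`) stays as the row index, and every distinct value of the inner level becomes a column.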


The recipe above is not the only one. pandas provides more flexible ways to apply your own functions to a Series or a DataFrame. An obvious one is aggregation via the `aggregate()` method or its equivalent, `agg()`:

In [26]:
movies_grouped.agg(np.sum).head(10)
# or movies_grouped.aggregate(np.sum)

Unnamed: 0_level_0,user_id,rating,timestamp,age,unknown,Action,Adventure,Animation,Childrens,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,215609,1753,399028021059,13419.0,0,0,0,452,452,452,...,0,0,0,0,0,0,0,0,0,0
2,64453,420,115727673079,3526.0,0,131,131,0,0,0,...,0,0,0,0,0,0,0,131,0,0
3,41322,273,79400420109,2168.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,90,0,0
4,98125,742,184487964275,6573.0,0,209,0,0,0,209,...,0,0,0,0,0,0,0,0,0,0
5,37786,284,75902581757,2453.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,86,0,0
6,11819,93,22968466745,894.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,174585,1489,345926648221,11165.0,0,0,0,0,0,0,...,0,0,0,0,0,0,392,0,0,0
8,99574,875,193431714253,6883.0,0,0,0,0,219,219,...,0,0,0,0,0,0,0,0,0,0
9,137636,1165,263909983760,9792.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,40069,341,78556407374,3189.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,89,0


Of course, the variant above is equivalent to `movies_grouped.sum()`.
One of the advantages of the `agg()` method is that it can apply several functions to the same data set at once:

In [27]:
# first let's write a function that returns the number of non-null values
def count_not_null(x):
    return len([i for i in x if pd.notnull(i)])
movies_grouped['age'].agg([np.sum, count_not_null, np.mean]).head(10)

Unnamed: 0_level_0,sum,count_not_null,mean
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,13419.0,417.0,32.179856
2,3526.0,120.0,29.383333
3,2168.0,82.0,26.439024
4,6573.0,201.0,32.701493
5,2453.0,82.0,29.914634
6,894.0,25.0,35.76
7,11165.0,365.0,30.589041
8,6883.0,208.0,33.091346
9,9792.0,279.0,35.096774
10,3189.0,82.0,38.890244


We fully agree with you: the name of the central column is not good and is too long. No problem, let's rename it (and the last column too, at the same time):

In [28]:
movies_grouped['age'].agg({'sum': np.sum, 'amount': count_not_null, 'avg': np.mean}).head(10)

Unnamed: 0_level_0,sum,avg,amount
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,13419.0,32.179856,417.0
2,3526.0,29.383333,120.0
3,2168.0,26.439024,82.0
4,6573.0,32.701493,201.0
5,2453.0,29.914634,82.0
6,894.0,35.76,25.0
7,11165.0,30.589041,365.0
8,6883.0,33.091346,208.0
9,9792.0,35.096774,279.0
10,3189.0,38.890244,82.0
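
A caveat for readers on a newer setup: renaming with a dictionary on a single grouped column is tied to older pandas releases and is no longer supported in recent ones. A hedged alternative, assuming pandas 0.25 or later, is named aggregation, where each keyword is the output column name. The toy frame below is invented for illustration, and the built-in `'count'` is used in place of `count_not_null` because it also counts non-null values:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing age
toy = pd.DataFrame({'movie_id': [1, 1, 2],
                    'age': [20.0, np.nan, 40.0]})

# named aggregation: keyword = output column name, value = function to apply
age_stats = toy.groupby('movie_id')['age'].agg(
    total='sum',     # sum of non-null ages
    amount='count',  # number of non-null ages
    avg='mean',      # mean of non-null ages
)
print(age_stats)
```

A side benefit is that the output columns appear in the order the keywords are written.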


Of course, you can also apply several functions to a whole DataFrame at once:

In [29]:
movies_grouped.agg([np.sum, count_not_null, np.mean]).head(10)

Unnamed: 0_level_0,user_id,user_id,user_id,rating,rating,rating,timestamp,timestamp,timestamp,age,...,Sci-Fi,Thriller,Thriller,Thriller,War,War,War,Western,Western,Western
Unnamed: 0_level_1,sum,count_not_null,mean,sum,count_not_null,mean,sum,count_not_null,mean,sum,...,mean,sum,count_not_null,mean,sum,count_not_null,mean,sum,count_not_null,mean
movie_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,215609,452,477.011062,1753,452,3.878319,399028021059,452,882805356,13419.0,...,0,0,452,0,0,452,0,0,452,0
2,64453,131,492.007634,420,131,3.206107,115727673079,131,883417351,3526.0,...,0,131,131,1,0,131,0,0,131,0
3,41322,90,459.133333,273,90,3.033333,79400420109,90,882226890,2168.0,...,0,90,90,1,0,90,0,0,90,0
4,98125,209,469.497608,742,209,3.550239,184487964275,209,882717532,6573.0,...,0,0,209,0,0,209,0,0,209,0
5,37786,86,439.372093,284,86,3.302326,75902581757,86,882588159,2453.0,...,0,86,86,1,0,86,0,0,86,0
6,11819,26,454.576923,93,26,3.576923,22968466745,26,883402567,894.0,...,0,0,26,0,0,26,0,0,26,0
7,174585,392,445.369898,1489,392,3.798469,345926648221,392,882465939,11165.0,...,1,0,392,0,0,392,0,0,392,0
8,99574,219,454.675799,875,219,3.995434,193431714253,219,883249836,6883.0,...,0,0,219,0,0,219,0,0,219,0
9,137636,299,460.32107,1165,299,3.896321,263909983760,299,882642086,9792.0,...,0,0,299,0,0,299,0,0,299,0
10,40069,89,450.213483,341,89,3.831461,78556407374,89,882656262,3189.0,...,0,0,89,0,89,89,1,0,89,0
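
The result above has two levels of column labels: the original column name and the function name. The sketch below, built on a hypothetical `toy` DataFrame, shows a few common ways to work with such columns: selecting one (column, function) pair, selecting all statistics of one column, and flattening the labels into plain strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'movie_id': [1, 1, 2, 2],
    'rating': [3, 5, 4, 2],
    'age': [20.0, np.nan, 30.0, 50.0],
})

wide = toy.groupby('movie_id').agg(['sum', 'mean'])

# One (column, function) pair is addressed with a tuple.
age_mean = wide[('age', 'mean')]

# The top-level name alone returns all statistics of that column.
rating_block = wide['rating']

# Flatten the two levels into single names such as 'age_mean'.
flat = wide.copy()
flat.columns = ['_'.join(col) for col in wide.columns]
print(flat)
```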


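However, you cannot rename the columns with a flat dictionary as above, because for a grouped DataFrame the dictionary keys passed to `agg()` are treated as column names. This lets you apply one or more functions to different columns, which is one more advantage of this method.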

In [30]:
movies_grouped.agg({ 'user_id': np.sum, 
                     'age': {'amount': count_not_null},  
                     'rating': np.mean,                  
                     'timestamp': [np.min, np.max]}).head(10)

Unnamed: 0_level_0,rating,age,user_id,timestamp,timestamp
Unnamed: 0_level_1,mean,amount,sum,amin,amax
movie_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,3.878319,417.0,215609,874784615,893264174
2,3.206107,120.0,64453,874778568,893119336
3,3.033333,82.0,41322,874786924,892790676
4,3.550239,201.0,98125,874730179,893265741
5,3.302326,82.0,37786,874792692,893194607
6,3.576923,25.0,11819,875028165,891384357
7,3.798469,365.0,174585,874775185,893264300
8,3.995434,208.0,99574,874785474,893265710
9,3.896321,279.0,137636,874805804,893263994
10,3.831461,82.0,40069,874862734,893264335
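
The nested dictionary `{'amount': count_not_null}` used above for renaming is not accepted by pandas 1.0 and later. A possible replacement, sketched here under the assumption of pandas 0.25 or newer, is `pd.NamedAgg`, which names the source column and the function for every output column. The `toy` DataFrame and the output names (`user_id_sum`, `first_ts`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as `movies`.
toy = pd.DataFrame({
    'movie_id': [1, 1, 2],
    'user_id': [10, 20, 30],
    'age': [20.0, np.nan, 30.0],
    'rating': [3, 5, 4],
    'timestamp': [100, 300, 200],
})

def count_not_null(x):
    # number of non-null values in the group
    return x.notnull().sum()

# Each keyword is an output column; NamedAgg says where it comes from.
per_column = toy.groupby('movie_id').agg(
    user_id_sum=pd.NamedAgg(column='user_id', aggfunc='sum'),
    amount=pd.NamedAgg(column='age', aggfunc=count_not_null),
    rating_mean=pd.NamedAgg(column='rating', aggfunc='mean'),
    first_ts=pd.NamedAgg(column='timestamp', aggfunc='min'),
    last_ts=pd.NamedAgg(column='timestamp', aggfunc='max'),
)
print(per_column)
```

Unlike the dictionary form, this produces a flat set of column names instead of two header levels.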


Another simple aggregation is computing the size of each group. `GroupBy` provides this as the `size()` method. It returns a Series whose index holds the group names and whose values are the sizes of the groups.

In [31]:
movie_user_id_grouped.size()

movie_id  user_id
1         1          1
          2          1
          5          1
          6          1
          10         1
          13         1
          15         1
          16         1
          17         1
          18         1
          20         1
          21         1
          23         1
          25         1
          26         1
          38         1
          41         1
          42         1
          43         1
          44         1
          45         1
          49         1
          54         1
          56         1
          57         1
          58         1
          59         1
          62         1
          63         1
          64         1
                    ..
1658      894        1
1659      747        1
1660      747        1
1661      751        1
1662      762        1
          782        1
1663      782        1
1664      782        1
          839        1
          870        1
          880        1
1665      782   
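
It may help to compare `size()` with `count()`: `size()` counts all rows of a group, while `count()` counts only the non-null values of a column. The sketch below uses a hypothetical `toy` DataFrame grouped by two keys, as `movie_user_id_grouped` is.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 2 rated movie 1 twice, once without an age.
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 2],
    'user_id': [1, 2, 2, 1],
    'age': [20.0, np.nan, 25.0, 30.0],
})

pair_grouped = toy.groupby(['movie_id', 'user_id'])

sizes = pair_grouped.size()           # rows per group, NaN included
counts = pair_grouped['age'].count()  # non-null ages per group

print(sizes)
print(counts)
```

For the group `(1, 2)` the two results differ because one of its two rows has a missing `age`.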

>### Exercise 4.1

> - Group the `movies` DataFrame by `release_date` and `occupation` (in that order) and display the resulting `GroupBy` object as we did before. Call this `GroupBy` object `grouped`.

> - Using the `GroupBy` object created in the previous task, calculate the average value of the `age` column for each group. Find the group with the maximal average `age`. If there are several such groups, find all of them. Write all the groups found to the `groups` pandas DataFrame and the maximal average `age` value to the `max_avg_age` variable.

> - Select from `grouped` the rows with `release_date = '16-Feb-1996'` and `occupation = 'student'` using one of the selection recipes learned in this part.

> - For the `age` column, calculate for each group in the `grouped` object:
    * the percentage of null items; call this column "null"
    * the percentage of non-null items; call this column "not_null"
    * the product of all non-null items; call this column "product"
    * the integer part of the difference between the maximal and minimal values divided by the average of the non-null values in the respective group; call this column "ratio".

>   Note that all calculations should be done in a single step, without creating additional columns. Write the result to the `aggr` variable.

In [100]:
# type your code here


# To check the correctness of your answers we are using the "data/movies.csv".
# So, if you have changed the `movies` DataFrame in some way, please read "data/movies.csv" again before continuing.
#movies = pd.read_csv('data/movies.csv')
#movies['release_date'] = movies['release_date'].map(pd.to_datetime)

groupby_data_0 = movies.groupby(['release_date','occupation'])
groupby_data = groupby_data_0.mean().reset_index()
#print groupby_data['age'].max()
#print groupby_data_0.head(5)
max_avg_age = groupby_data['age'].max()

groups = movies[(movies['release_date']=='16-Feb-1996') & (movies['occupation']=='student')]
groups = groups.groupby(['release_date','occupation']).mean().reset_index()
print(groups.head(5))

def count_null(x):
    return len([i for i in x if not pd.notnull(i)])

agg = groupby_data_0['age'].agg({ 'null': count_null, 
                     'not_null': count_not_null,  
                     'product': np.sum,                  
                     'ratio': np.std})
aggr = agg

  release_date occupation     user_id    movie_id    rating     timestamp  \
0   1996-02-16    student  486.519417  355.009709  3.854369  8.820670e+08   

         age  unknown    Action  Adventure   ...     Fantasy  Film-Noir  \
0  21.889447      0.0  0.412621   0.087379   ...         0.0        0.0   

   Horror   Musical   Mystery   Romance  Sci-Fi  Thriller       War  Western  
0     0.0  0.087379  0.038835  0.038835     0.0  0.334951  0.325243      0.0  

[1 rows x 26 columns]


In [101]:
from test_helper import Test

Test.assertEqualsHashed(max_avg_age, '3e02e7f62a64ba4e760f0cec0b44b479b974d267', 
                                     'Incorrect value of "max_avg_age"', "Exercise 4.1.1 is successful")
Test.assertEqualsHashed(groups, '71201ef71723a0337ad15e84cc9a86cfd3eeac01', 
                                'Incorrect content of "groups" DataFrame', "Exercise 4.1.2 is successful")
Test.assertEqualsHashed(aggr[['null', 'not_null', 'product',
                          'ratio']], '513ac52cc95c70657b76c0458355f1b45efc14f8', 
                              'Incorrect content of "aggr" DataFrame', "Exercise 4.1.3 is successful")

1 test passed. Exercise 4.1.1 is successful
1 test failed. Incorrect content of "groups" DataFrame
1 test failed. Incorrect content of "aggr" DataFrame


<center><h3>Presented by <a target="_blank" rel="noopener noreferrer nofollow" href="http://datascience-school.com">datascience-school.com</a></h3></center>