# `pandas` - Summarization and Grouping

__Contents__: 
1. Setup
1. Grouping data
1. Aggregation with the `agg()` method
1. Other aggregating functions

## Reference
- http://pandas.pydata.org/pandas-docs/stable/index.html
- https://pandas.pydata.org/pandas-docs/stable/dsintro.html
- https://pandas.pydata.org/pandas-docs/stable/groupby.html

## 1. Setup

Load the libraries.

In [1]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

('0.24.2', '1.16.4')

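Clone the sample-data repository, then load the `imports-85.csv` file into a DataFrame. The column names are set explicitly (with hyphens replaced by underscores), and `?` entries are read as missing values.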

In [2]:
%%sh
git clone https://github.com/datalab-datasets/file-samples.git

Cloning into 'file-samples'...


In [3]:
%ls /content/file-samples/imports-85.csv

/content/file-samples/imports-85.csv


In [0]:
column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type',
                'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
                'engine-location', 'wheel-base', 'length', 'width',
                'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
                'engine-size', 'fuel-system', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
                'highway-mpg', 'price']
import_df = pd.read_csv('/content/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )

In [5]:
import_df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_of_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


Display basic information about each column of the DataFrame.

In [6]:
import_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized_losses    164 non-null float64
make                 205 non-null object
fuel_type            205 non-null object
aspiration           205 non-null object
num_of_doors         203 non-null object
body_style           205 non-null object
drive_wheels         205 non-null object
engine_location      205 non-null object
wheel_base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb_weight          205 non-null int64
engine_type          205 non-null object
num_of_cylinders     205 non-null object
engine_size          205 non-null int64
fuel_system          205 non-null object
bore                 201 non-null float64
stroke               201 non-null float64
compression_ratio    205 non-null float64
horsepower           203 non-

## 2. Grouping data

Pandas objects can be split into groups on any of their axes. To create a GroupBy object (more on the GroupBy object later), use one of the following forms:
- `grouped = obj.groupby(key)`
- `grouped = obj.groupby(key, axis=1)` (the default is `axis=0`)
- `grouped = obj.groupby([key1, key2])`

Here `obj` is a pandas DataFrame object.
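
The single-key and multi-key forms listed above can be sketched on a small, made-up DataFrame. The `toy_df` data and its values are hypothetical and are not taken from `imports-85.csv`; `ngroups` and `groups` are used only to inspect the result.

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':       ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'body_style': ['sedan', 'wagon', 'sedan', 'sedan', 'wagon'],
    'city_mpg':   [19, 21, 23, 20, 18],
})

# Single key: one group per distinct value of 'make'.
by_make = toy_df.groupby('make')
print(by_make.ngroups)

# Two keys: one group per distinct (make, body_style) pair.
by_make_style = toy_df.groupby(['make', 'body_style'])
print(by_make_style.ngroups)

# With several keys, each group is labelled by a tuple.
print(sorted(by_make_style.groups.keys()))
```

Nothing is computed at this point beyond the group labels; the split data is only materialized when a group is selected or an aggregation is applied.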

Create a GroupBy object by calling the `groupby()` method of the dataframe. The cell below groups by the `make` column.

In [7]:
grouped = import_df.groupby('make')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f473947abe0>

A single group can be selected using `get_group()`. The following code cell displays all the records that meet the condition of `make='toyota'` in the dataframe. Recall that `make` was specified above as the "group by" variable.

In [8]:
grouped.get_group('toyota')

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_of_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
150,1,87.0,toyota,gas,std,two,hatchback,fwd,front,95.7,158.7,63.6,54.5,1985,ohc,four,92,2bbl,3.05,3.03,9.0,62.0,4800.0,35,39,5348.0
151,1,87.0,toyota,gas,std,two,hatchback,fwd,front,95.7,158.7,63.6,54.5,2040,ohc,four,92,2bbl,3.05,3.03,9.0,62.0,4800.0,31,38,6338.0
152,1,74.0,toyota,gas,std,four,hatchback,fwd,front,95.7,158.7,63.6,54.5,2015,ohc,four,92,2bbl,3.05,3.03,9.0,62.0,4800.0,31,38,6488.0
153,0,77.0,toyota,gas,std,four,wagon,fwd,front,95.7,169.7,63.6,59.1,2280,ohc,four,92,2bbl,3.05,3.03,9.0,62.0,4800.0,31,37,6918.0
154,0,81.0,toyota,gas,std,four,wagon,4wd,front,95.7,169.7,63.6,59.1,2290,ohc,four,92,2bbl,3.05,3.03,9.0,62.0,4800.0,27,32,7898.0
155,0,91.0,toyota,gas,std,four,wagon,4wd,front,95.7,169.7,63.6,59.1,3110,ohc,four,92,2bbl,3.05,3.03,9.0,62.0,4800.0,27,32,8778.0
156,0,91.0,toyota,gas,std,four,sedan,fwd,front,95.7,166.3,64.4,53.0,2081,ohc,four,98,2bbl,3.19,3.03,9.0,70.0,4800.0,30,37,6938.0
157,0,91.0,toyota,gas,std,four,hatchback,fwd,front,95.7,166.3,64.4,52.8,2109,ohc,four,98,2bbl,3.19,3.03,9.0,70.0,4800.0,30,37,7198.0
158,0,91.0,toyota,diesel,std,four,sedan,fwd,front,95.7,166.3,64.4,53.0,2275,ohc,four,110,idi,3.27,3.35,22.5,56.0,4500.0,34,36,7898.0
159,0,91.0,toyota,diesel,std,four,hatchback,fwd,front,95.7,166.3,64.4,52.8,2275,ohc,four,110,idi,3.27,3.35,22.5,56.0,4500.0,38,47,7788.0


The code cell below groups by multiple columns and displays records that meet the conditions of `make='toyota'` and `body_style='sedan'`.

In [9]:
import_df.groupby(['make','body_style']).get_group(('toyota','sedan'))

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_of_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
156,0,91.0,toyota,gas,std,four,sedan,fwd,front,95.7,166.3,64.4,53.0,2081,ohc,four,98,2bbl,3.19,3.03,9.0,70.0,4800.0,30,37,6938.0
158,0,91.0,toyota,diesel,std,four,sedan,fwd,front,95.7,166.3,64.4,53.0,2275,ohc,four,110,idi,3.27,3.35,22.5,56.0,4500.0,34,36,7898.0
160,0,91.0,toyota,gas,std,four,sedan,fwd,front,95.7,166.3,64.4,53.0,2094,ohc,four,98,2bbl,3.19,3.03,9.0,70.0,4800.0,38,47,7738.0
162,0,91.0,toyota,gas,std,four,sedan,fwd,front,95.7,166.3,64.4,52.8,2140,ohc,four,98,2bbl,3.19,3.03,9.0,70.0,4800.0,28,34,9258.0
163,1,168.0,toyota,gas,std,two,sedan,rwd,front,94.5,168.7,64.0,52.6,2169,ohc,four,98,2bbl,3.19,3.03,9.0,70.0,4800.0,29,34,8058.0
165,1,168.0,toyota,gas,std,two,sedan,rwd,front,94.5,168.7,64.0,52.6,2265,dohc,four,98,mpfi,3.24,3.08,9.4,112.0,6600.0,26,29,9298.0
173,-1,65.0,toyota,gas,std,four,sedan,fwd,front,102.4,175.6,66.5,54.9,2326,ohc,four,122,mpfi,3.31,3.54,8.7,92.0,4200.0,29,34,8948.0
174,-1,65.0,toyota,diesel,turbo,four,sedan,fwd,front,102.4,175.6,66.5,54.9,2480,ohc,four,110,idi,3.27,3.35,22.5,73.0,4500.0,30,33,10698.0
176,-1,65.0,toyota,gas,std,four,sedan,fwd,front,102.4,175.6,66.5,54.9,2414,ohc,four,122,mpfi,3.31,3.54,8.7,92.0,4200.0,27,32,10898.0
180,-1,90.0,toyota,gas,std,four,sedan,rwd,front,104.5,187.8,66.5,54.1,3131,dohc,six,171,mpfi,3.27,3.35,9.2,156.0,5200.0,20,24,15690.0
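
As a rough cross-check, `get_group()` should select the same rows as an ordinary boolean filter on the grouping columns. The sketch below illustrates this on a hypothetical `toy_df` (made-up values, not the notebook's data), comparing row labels rather than whole frames.

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':       ['toyota', 'toyota', 'audi', 'toyota'],
    'body_style': ['sedan', 'wagon', 'sedan', 'sedan'],
    'price':      [6938.0, 6918.0, 13950.0, 7898.0],
})

# Single grouping column: the group name is a scalar.
via_group = toy_df.groupby('make').get_group('toyota')
via_mask = toy_df[toy_df['make'] == 'toyota']
print(via_group.index.tolist() == via_mask.index.tolist())

# Several grouping columns: the group name is a tuple.
pair_group = toy_df.groupby(['make', 'body_style']).get_group(('toyota', 'sedan'))
pair_mask = toy_df[(toy_df['make'] == 'toyota') & (toy_df['body_style'] == 'sedan')]
print(pair_group.index.tolist() == pair_mask.index.tolist())
```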


## 3. Aggregation with the `agg()` method

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. 
The `aggregate()` or equivalently `agg()` method is the most general way to summarize grouped dataframes. 
The following cells in this section demonstrate these methods.

Apply the `np.mean` function to the dataframe with the `make`, `body_style`, `city_mpg` and `highway_mpg` columns grouped by the `make` and `body_style` columns.

In [10]:
import_df \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style']) \
  .agg(np.mean)

make,body_style,city_mpg,highway_mpg
alfa-romero,convertible,21.0,27.0
alfa-romero,hatchback,19.0,26.0
audi,hatchback,16.0,22.0
audi,sedan,19.4,24.4
audi,wagon,19.0,25.0
bmw,sedan,19.375,25.375
chevrolet,hatchback,42.5,48.0
chevrolet,sedan,38.0,43.0
dodge,hatchback,28.4,34.2
dodge,sedan,28.666667,35.333333


Apply multiple functions to all non-grouping columns by passing a list of functions to the `agg` or `aggregate` method.

In [11]:
import_df \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style']) \
  .agg([np.mean, np.max, np.min]) \
  .head()

,,city_mpg,city_mpg,city_mpg,highway_mpg,highway_mpg,highway_mpg
make,body_style,mean,amax,amin,mean,amax,amin
alfa-romero,convertible,21.0,21,21,27.0,27,27
alfa-romero,hatchback,19.0,19,19,26.0,26,26
audi,hatchback,16.0,16,16,22.0,22,22
audi,sedan,19.4,24,17,24.4,30,20
audi,wagon,19.0,19,19,25.0,25,25


Apply specific functions to specific dataframe columns by passing a dict to the `agg` or `aggregate` method.

In [12]:
import_df \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style']) \
  .agg({'city_mpg'   : 'mean',
        'highway_mpg': lambda x: np.mean(x),
         }) \
  .head()

make,body_style,city_mpg,highway_mpg
alfa-romero,convertible,21.0,27.0
alfa-romero,hatchback,19.0,26.0
audi,hatchback,16.0,22.0
audi,sedan,19.4,24.4
audi,wagon,19.0,25.0


Notice that the summary function can be specified either as a string or as an anonymous function.
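
A minimal sketch of that equivalence, on a hypothetical `toy_df` with made-up values: the string form, the lambda form, and the dedicated `mean()` method are expected to give the same per-group means.

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':        ['audi', 'audi', 'bmw', 'bmw'],
    'city_mpg':    [19, 21, 23, 20],
    'highway_mpg': [25, 27, 29, 28],
})
grouped = toy_df.groupby('make')

# The same summary, specified three ways.
by_string = grouped.agg({'city_mpg': 'mean'})
by_lambda = grouped.agg({'city_mpg': lambda x: x.mean()})
by_method = grouped[['city_mpg']].mean()

print(by_string)
print(by_lambda)
print(by_method)
```

The string form is usually the most convenient, since pandas can map it to its own optimized implementation, while a lambda is called once per group.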

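To name the output columns, use a nested dict (as in the following example).
- Keys of the outer dictionary name existing columns in the grouped dataframe
- Keys of the inner dictionary name the summary columns of the resulting dataframe

Values of the inner dictionary can be strings, numpy functions or lambda functions.

Note that this nested-dict form was deprecated in pandas 0.20 and removed in pandas 1.0. It still runs under the version 0.24.2 used here, but emits a `FutureWarning`.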

In [13]:
import_df.columns

Index(['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
       'highway_mpg', 'price'],
      dtype='object')

In [14]:
import_df \
  .groupby(['engine_type']) \
  .agg({'width'   : {'mean': np.mean,
                     'min' : np.min,
                     'max' : np.max
                    },
        'length'  : {'mean': 'mean',
                     'min' : np.min,
                     'max' : lambda x: np.max(x)}
       })



,width,width,width,length,length,length
engine_type,mean,min,max,mean,min,max
dohc,66.4,64.0,69.6,182.5,168.7,199.6
dohcv,72.3,72.3,72.3,175.7,175.7,175.7
l,67.716667,60.3,68.4,186.966667,141.1,198.9
ohc,65.527703,61.8,71.7,171.953378,144.6,202.6
ohcf,64.96,63.4,65.4,168.866667,156.9,173.6
ohcv,68.776923,65.5,72.0,185.592308,170.7,208.1
rotor,65.7,65.7,65.7,169.0,169.0,169.0
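
The nested-dict renaming shown above is not supported by newer pandas releases. Assuming pandas 0.25 or later (newer than the 0.24.2 used in this notebook), named aggregation is one way to get custom output column names. The sketch below uses a hypothetical `toy_df` and hypothetical output names such as `width_mean`.

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'engine_type': ['ohc', 'ohc', 'dohc', 'dohc'],
    'width':       [64.0, 66.0, 65.0, 69.0],
    'length':      [170.0, 176.0, 168.0, 199.0],
})

# Named aggregation (assumes pandas >= 0.25):
# output_name=(input_column, aggregation function).
summary = toy_df.groupby('engine_type').agg(
    width_mean=('width', 'mean'),
    width_max=('width', 'max'),
    length_min=('length', 'min'),
)
print(summary)
```

Unlike the nested dict, this form produces flat column names rather than a two-level column index.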


## 4. Other aggregating functions

Other aggregation functions are available directly as GroupBy methods. Some commonly used ones are:
- `mean()`: compute the mean of each group
- `sum()`: compute the sum of group values
- `size()`: compute group sizes
- `std()`: standard deviation of each group
- `sem()`: standard error of the mean of each group
- `describe()`: generate descriptive statistics
- `min()`: compute the minimum of group values
- `max()`: compute the maximum of group values

The following cells in this section are some examples showing the use of those aggregation functions.

In each case the summary function is applied to all non-grouping columns of the grouped dataframe.
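
Before turning to the real data, here is a small sketch of a few of these methods on a hypothetical `toy_df` (made-up values). A single column is selected after grouping so that each result is a Series with one value per group.

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':     ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'city_mpg': [19, 21, 23, 20, 17],
})
grouped = toy_df.groupby('make')['city_mpg']

sizes = grouped.size()   # number of rows per group
totals = grouped.sum()   # sum of values per group
lows = grouped.min()     # smallest value per group
highs = grouped.max()    # largest value per group
spread = grouped.std()   # sample standard deviation per group
errors = grouped.sem()   # standard error of the mean per group

print(sizes, totals, lows, highs, spread, errors, sep='\n')
```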

Display the mean of `city_mpg` and `highway_mpg` for each kind of `make`.

In [15]:
import_df[['make','city_mpg','highway_mpg']].groupby('make').mean()

make,city_mpg,highway_mpg
alfa-romero,20.333333,26.666667
audi,18.857143,24.142857
bmw,19.375,25.375
chevrolet,41.0,46.333333
dodge,28.0,34.111111
honda,30.384615,35.461538
isuzu,31.0,36.0
jaguar,14.333333,18.333333
mazda,25.705882,31.941176
mercedes-benz,18.5,21.0


Display the mean of `city_mpg` and `highway_mpg` for each unique combination of `make` and `body_style`. Sort the result by `city_mpg` in descending order.

In [16]:
import_df[['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style'])\
  .mean()\
  .sort_values(by='city_mpg',ascending=False)

make,body_style,city_mpg,highway_mpg
chevrolet,hatchback,42.5,48.0
chevrolet,sedan,38.0,43.0
isuzu,sedan,33.333333,38.333333
honda,hatchback,33.142857,38.285714
plymouth,sedan,31.0,38.0
nissan,hardtop,31.0,37.0
volkswagen,sedan,30.0,36.666667
honda,wagon,30.0,34.0
nissan,sedan,29.222222,35.111111
toyota,sedan,29.1,34.0
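
The sorted result keeps `make` and `body_style` as a two-level row index. If ordinary columns are preferred, one option is to append `reset_index()` to the chain. The sketch below shows this on a hypothetical `toy_df` with made-up values.

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':       ['honda', 'honda', 'audi', 'audi'],
    'body_style': ['hatchback', 'hatchback', 'sedan', 'wagon'],
    'city_mpg':   [30, 36, 19, 21],
})

ranked = (
    toy_df
    .groupby(['make', 'body_style'])
    .mean()
    .sort_values(by='city_mpg', ascending=False)
    .reset_index()  # turn the group keys back into regular columns
)
print(ranked)
```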


Display the number of records in each unique combination of `make` and `body_style` in the dataframe.

In [17]:
import_df[['make','body_style','city_mpg','highway_mpg']]\
       .groupby(['make','body_style'])\
       .size()

make           body_style 
alfa-romero    convertible     2
               hatchback       1
audi           hatchback       1
               sedan           5
               wagon           1
bmw            sedan           8
chevrolet      hatchback       2
               sedan           1
dodge          hatchback       5
               sedan           3
               wagon           1
honda          hatchback       7
               sedan           5
               wagon           1
isuzu          hatchback       1
               sedan           3
jaguar         sedan           3
mazda          hatchback      10
               sedan           7
mercedes-benz  convertible     1
               hardtop         2
               sedan           4
               wagon           1
mercury        hatchback       1
mitsubishi     hatchback       9
               sedan           4
nissan         hardtop         1
               hatchback       5
               sedan           9
               w
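
One point worth keeping in mind with `size()`: it counts rows, including rows with missing values, whereas `count()` counts only the non-null entries of each column. Since this dataset contains missing values (for example in `normalized_losses`), the two can differ. The sketch below uses a hypothetical `toy_df` with one deliberately missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':              ['audi', 'audi', 'bmw'],
    'normalized_losses': [164.0, np.nan, 192.0],
})
grouped = toy_df.groupby('make')

sizes = grouped.size()                         # rows per group, NaN included
counts = grouped['normalized_losses'].count()  # non-null values per group

print(sizes)
print(counts)
```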

Display the descriptive statistics of `city_mpg` and `highway_mpg` for every value of the `make` column.

In [18]:
import_df[['make','city_mpg','highway_mpg']]\
  .groupby(['make'])\
  .describe()

,city_mpg,city_mpg,city_mpg,city_mpg,city_mpg,city_mpg,city_mpg,city_mpg,highway_mpg,highway_mpg,highway_mpg,highway_mpg,highway_mpg,highway_mpg,highway_mpg,highway_mpg
make,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
alfa-romero,3.0,20.333333,1.154701,19.0,20.0,21.0,21.0,21.0,3.0,26.666667,0.57735,26.0,26.5,27.0,27.0,27.0
audi,7.0,18.857143,2.544836,16.0,17.5,19.0,19.0,24.0,7.0,24.142857,3.236694,20.0,22.0,25.0,25.0,30.0
bmw,8.0,19.375,3.248626,15.0,16.0,20.5,21.5,23.0,8.0,25.375,3.622844,20.0,22.0,26.5,28.25,29.0
chevrolet,3.0,41.0,5.196152,38.0,38.0,38.0,42.5,47.0,3.0,46.333333,5.773503,43.0,43.0,43.0,48.0,53.0
dodge,9.0,28.0,5.545268,19.0,24.0,31.0,31.0,37.0,9.0,34.111111,5.710614,24.0,30.0,38.0,38.0,41.0
honda,13.0,30.384615,6.589619,24.0,27.0,30.0,30.0,49.0,13.0,35.461538,6.462912,28.0,33.0,34.0,34.0,54.0
isuzu,4.0,31.0,8.082904,24.0,24.0,31.0,38.0,38.0,4.0,36.0,8.082904,29.0,29.0,36.0,43.0,43.0
jaguar,3.0,14.333333,1.154701,13.0,14.0,15.0,15.0,15.0,3.0,18.333333,1.154701,17.0,18.0,19.0,19.0,19.0
mazda,17.0,25.705882,6.282562,16.0,19.0,26.0,31.0,36.0,17.0,31.941176,6.339071,23.0,27.0,32.0,38.0,42.0
mercedes-benz,8.0,18.5,3.817254,14.0,15.5,19.0,22.0,22.0,8.0,21.0,4.342481,16.0,17.5,21.5,25.0,25.0
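
The result of `describe()` on a grouped dataframe has two-level column labels: the original column name, then the statistic. As a sketch on a hypothetical `toy_df` (made-up values), a whole block of statistics or a single statistic can be picked out like this:

```python
import pandas as pd

# Hypothetical toy data (not the imports-85 dataset).
toy_df = pd.DataFrame({
    'make':        ['audi', 'audi', 'bmw', 'bmw'],
    'city_mpg':    [19, 21, 23, 20],
    'highway_mpg': [25, 27, 29, 28],
})
stats = toy_df.groupby('make').describe()

# All statistics for one original column (a DataFrame).
city_stats = stats['city_mpg']
print(city_stats)

# One statistic for one original column (a Series), selected with a tuple.
city_mean = stats[('city_mpg', 'mean')]
print(city_mean)
```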


__The End__