# Aggregating and Summarizing Dataframes

- How to calculate the sum, mean, median and mode of a column?
- How to get a summary of the numerical variables?
- How to get the number of missing values in each column?
- How to group data based on the categories of one column?
- How to group data based on the categories of multiple columns?
- How to create a new feature using the aggregated results of a column?

## READ THE DATASET

- We are going to use the Big Mart sales data that we used previously. It is stored in the folder named datasets.

In [1]:
# import the pandas library

import pandas as pd

In [2]:
# read the big mart sales data
data = pd.read_csv('datasets/big_mart_sales.csv')

In [3]:
# view the top rows of the data
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


### How to calculate the sum, mean, median and mode of a column?

In [4]:
# calculate the sum of all Item_MRP values
data.Item_MRP.sum()

1201681.4808

In [5]:
# calculate the average Item_MRP
data.Item_MRP.mean()

140.9927819781767

In [6]:
# calculate the median of Item_MRP
data.Item_MRP.median()

143.0128

In [7]:
# calculate the most frequent outlet type
data.Outlet_Type.mode()

0    Supermarket Type1
Name: Outlet_Type, dtype: object

In [8]:
# count the rows for each outlet type to confirm the mode
data.Outlet_Type.value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64
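
The same four statistics can be checked by hand on a tiny frame. The sketch below uses a hypothetical dataframe named toy with made-up price and store columns (not the Big Mart file), and it also shows why mode() returns a Series rather than a single value: several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
toy = pd.DataFrame({
    "price": [10.0, 20.0, 20.0, 50.0],
    "store": ["A", "B", "B", "A"],
})

total = toy["price"].sum()       # 10 + 20 + 20 + 50
average = toy["price"].mean()    # total divided by 4 rows
middle = toy["price"].median()   # midpoint of the two central values

# mode() returns a Series because more than one value can be most frequent
price_modes = toy["price"].mode()   # a single mode: 20.0
store_modes = toy["store"].mode()   # a tie: both "A" and "B" appear twice

print(total, average, middle)
print(price_modes.tolist(), store_modes.tolist())
```

When only one most frequent value is needed, taking the first element of the result (for example price_modes[0]) is a common choice, with the caveat that it silently ignores ties.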

### How to get the summary of the numerical variables?

- To get a summary of the numerical variables, pandas provides the describe function.

In [9]:
# get the summary
data.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648
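
In the output above, the count for Item_Weight is lower than for the other columns because describe() counts only non-missing values. A minimal sketch of that behaviour, using a hypothetical frame named toy with made-up values, is shown below; it also passes include="all" so that the text column is summarised too.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing weight
toy = pd.DataFrame({
    "weight": [9.3, np.nan, 17.5, 19.2],
    "fat": ["Low Fat", "Regular", "Low Fat", "Low Fat"],
})

# By default only numeric columns are summarised
numeric_summary = toy.describe()
# count ignores the missing value, so it is 3 rather than 4
print(numeric_summary.loc["count", "weight"])

# include="all" adds count, unique, top and freq for non-numeric columns
full_summary = toy.describe(include="all")
print(full_summary)
```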


### How to get the number of missing values in each column?

In [10]:
# isna() marks each missing cell as True; sum() counts the True values per column
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
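
Counts are easier to judge next to the share of rows they represent. The sketch below, on a hypothetical frame named toy with made-up columns, computes both and then keeps only the columns that actually have gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns
toy = pd.DataFrame({
    "weight": [9.3, np.nan, 17.5, np.nan],
    "size": ["Medium", None, "High", "Small"],
    "sales": [100.0, 200.0, 300.0, 400.0],
})

# Number of missing cells per column
missing_count = toy.isna().sum()

# Fraction of rows missing per column (mean of True/False values)
missing_share = toy.isna().mean()

# Names of the columns that contain at least one missing value
columns_with_gaps = missing_count[missing_count > 0].index.tolist()

print(missing_count)
print(missing_share)
print(columns_with_gaps)
```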

### How to group the data based on categories of one column?

### GROUP BY
- Calculate the average MRP of each Item_Type using groupby.

In [11]:
# group by a single column; a scalar key lets get_group() take a plain label
d = data.groupby('Item_Type')
d

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000188A3D21910>

In [12]:
d.first()

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Baking Goods,FDP36,10.395,Regular,0.0,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
Breads,FDO23,17.85,Low Fat,0.0,93.1436,OUT045,2002,Medium,Tier 2,Supermarket Type1,2174.5028
Breakfast,FDP49,9.0,Regular,0.069089,56.3614,OUT046,1997,Small,Tier 1,Supermarket Type1,1547.3192
Canned,FDC14,21.35,Regular,0.072222,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
Dairy,FDA15,9.3,Low Fat,0.016047,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
Frozen Foods,FDH17,16.2,Regular,0.016687,96.9726,OUT045,2002,Small,Tier 2,Supermarket Type1,1076.5986
Fruits and Vegetables,FDX07,19.2,Regular,0.0,182.095,OUT010,1998,Medium,Tier 3,Grocery Store,732.38
Hard Drinks,DRI11,11.65,Low Fat,0.034238,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.668
Health and Hygiene,NCB42,11.8,Low Fat,0.008596,115.3492,OUT018,2009,Medium,Tier 3,Supermarket Type2,1621.8888
Household,NCD19,8.93,Low Fat,0.0,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [13]:
d.get_group('Baking Goods')

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
5,FDP36,10.395,Regular,0.000000,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
21,FDW12,,Regular,0.035400,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
48,FDL12,15.850,Regular,0.121633,Baking Goods,60.6220,OUT046,1997,Small,Tier 1,Supermarket Type1,2576.6460
71,FDL12,15.850,Regular,0.121532,Baking Goods,59.2220,OUT013,1987,High,Tier 3,Supermarket Type1,599.2200
...,...,...,...,...,...,...,...,...,...,...,...,...
8435,FDT48,,Low Fat,0.000000,Baking Goods,196.5084,OUT027,1985,Medium,Tier 3,Supermarket Type3,793.6336
8441,FDK60,16.500,Regular,0.094010,Baking Goods,95.2068,OUT049,1999,Medium,Tier 1,Supermarket Type1,777.6544
8465,FDX11,16.000,Regular,0.106969,Baking Goods,180.5634,OUT045,2002,,Tier 2,Supermarket Type1,2726.4510
8515,FDH24,20.700,Low Fat,0.021518,Baking Goods,157.5288,OUT018,2009,Medium,Tier 3,Supermarket Type2,1571.2880


In [14]:
d.size()

Item_Type
Baking Goods              648
Breads                    251
Breakfast                 110
Canned                    649
Dairy                     682
Frozen Foods              856
Fruits and Vegetables    1232
Hard Drinks               214
Health and Hygiene        520
Household                 910
Meat                      425
Others                    169
Seafood                    64
Snack Foods              1200
Soft Drinks               445
Starchy Foods             148
dtype: int64

In [15]:
d.last()

Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
Baking Goods,FDS36,8.38,Regular,0.046982,108.157,OUT045,2002,Medium,Tier 2,Supermarket Type1,549.285
Breads,FDW59,13.15,Low Fat,0.020712,82.7566,OUT035,2004,Small,Tier 2,Supermarket Type1,1691.132
Breakfast,FDO49,10.6,Regular,0.033104,48.9008,OUT049,1999,Medium,Tier 1,Supermarket Type1,708.4112
Canned,FDA01,15.0,Regular,0.054489,57.5904,OUT045,2002,Medium,Tier 2,Supermarket Type1,468.7232
Dairy,FDR26,20.7,Low Fat,0.042801,178.3028,OUT013,1987,High,Tier 3,Supermarket Type1,2479.4392
Frozen Foods,FDF53,20.75,reg,0.083607,178.8318,OUT046,1997,Small,Tier 1,Supermarket Type1,3608.636
Fruits and Vegetables,FDG45,8.1,Low Fat,0.214306,213.9902,OUT010,1998,Small,Tier 3,Grocery Store,424.7804
Hard Drinks,DRI11,8.26,Low Fat,0.034474,117.0834,OUT045,2002,Medium,Tier 2,Supermarket Type1,1612.5676
Health and Hygiene,NCJ29,10.6,Low Fat,0.035186,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
Household,NCN18,8.42,Low Fat,0.124111,111.7544,OUT027,1985,Medium,Tier 3,Supermarket Type3,4138.6128
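
A groupby object does no computation until a method such as first(), size() or get_group() is called on it. The sketch below repeats those calls on a hypothetical frame named toy with made-up item_type and mrp columns, and also shows that the object can be iterated as (group label, sub-dataframe) pairs.

```python
import pandas as pd

# Hypothetical toy data with two categories
toy = pd.DataFrame({
    "item_type": ["Dairy", "Meat", "Dairy", "Meat", "Dairy"],
    "mrp": [250.0, 140.0, 110.0, 160.0, 90.0],
})

grouped = toy.groupby("item_type")

# Iterating yields one (label, sub-dataframe) pair per group
for name, frame in grouped:
    print(name, len(frame))

sizes = grouped.size()               # number of rows in each group
dairy = grouped.get_group("Dairy")   # all rows of a single group
first_rows = grouped.first()         # first non-missing value per column, per group
```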


### GROUPBY MEAN

In [16]:
d.mean(numeric_only=True)

Item_Type,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
Baking Goods,12.277108,0.069169,126.380766,1997.728395,1952.971207
Breads,11.346936,0.066255,140.952669,1997.657371,2204.132226
Breakfast,12.768202,0.085723,141.788151,1997.336364,2111.808651
Canned,12.305705,0.068129,139.763832,1998.152542,2225.194904
Dairy,13.426069,0.072427,148.499208,1997.681818,2232.542597
Frozen Foods,12.867061,0.065645,138.503366,1998.024533,2132.867744
Fruits and Vegetables,13.224769,0.068513,144.581235,1997.719968,2289.009592
Hard Drinks,11.400328,0.064943,137.077928,1998.17757,2139.221622
Health and Hygiene,13.142314,0.055216,130.818921,1997.734615,2010.000265
Household,13.384736,0.061322,149.424753,1997.784615,2258.7843


In [17]:
# select the column first so that only Item_MRP is averaged
d['Item_MRP'].mean()

Item_Type
Baking Goods             126.380766
Breads                   140.952669
Breakfast                141.788151
Canned                   139.763832
Dairy                    148.499208
Frozen Foods             138.503366
Fruits and Vegetables    144.581235
Hard Drinks              137.077928
Health and Hygiene       130.818921
Household                149.424753
Meat                     139.882032
Others                   132.851430
Seafood                  141.841719
Snack Foods              146.194934
Soft Drinks              131.492506
Starchy Foods            147.838023
Name: Item_MRP, dtype: float64

### GROUPBY MAX

In [18]:
d.max(numeric_only=True)

Item_Type,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
Baking Goods,20.85,0.31109,265.5568,2009,7931.6754
Breads,20.85,0.28151,263.6594,2009,8958.339
Breakfast,21.1,0.274592,234.93,2009,8209.314
Canned,21.35,0.328391,266.8884,2009,10306.584
Dairy,20.7,0.304737,266.6884,2009,10256.649
Frozen Foods,20.85,0.294939,264.891,2009,9678.0688
Fruits and Vegetables,21.35,0.321115,264.2252,2009,12117.56
Hard Drinks,19.7,0.298205,261.4278,2009,7843.124
Health and Hygiene,21.25,0.255348,266.6884,2009,9779.9362
Household,21.25,0.325781,264.791,2009,13086.9648
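
When more than one statistic is needed per group, the separate mean() and max() calls can be combined with agg(). The following is a minimal sketch on a hypothetical frame named toy (made-up values, not the Big Mart data).

```python
import pandas as pd

# Hypothetical toy data with two categories
toy = pd.DataFrame({
    "item_type": ["Dairy", "Meat", "Dairy", "Meat", "Dairy"],
    "mrp": [250.0, 140.0, 110.0, 160.0, 90.0],
})

# agg() with a list returns one output column per statistic
summary = toy.groupby("item_type")["mrp"].agg(["mean", "max", "count"])
print(summary)
```

The result has one row per group and one column per requested function, so it can be read much like the tables above.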


### How to group the data based on categories of multiple columns?

- Calculate the average MRP of each Item_Type for each category of Outlet_Size.

In [19]:
# group by Outlet_Size first, then by Item_Type within each size

d = data.groupby(['Outlet_Size', 'Item_Type'])

In [20]:
d.first()

Outlet_Size,Item_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
High,Baking Goods,FDL12,15.85,Regular,0.121532,59.222,OUT013,1987,Tier 3,Supermarket Type1,599.22
High,Breads,FDU11,4.785,Low Fat,0.092517,120.1098,OUT013,1987,Tier 3,Supermarket Type1,1325.6078
High,Breakfast,FDO13,7.865,Low Fat,0.061009,166.0526,OUT013,1987,Tier 3,Supermarket Type1,3617.9572
High,Canned,FDL50,12.15,Regular,0.042278,126.5046,OUT013,1987,Tier 3,Supermarket Type1,373.5138
High,Dairy,FDU50,5.75,Regular,0.075108,112.8176,OUT013,1987,Tier 3,Supermarket Type1,1374.2112
High,Frozen Foods,FDM40,10.195,Low Fat,0.159804,141.5154,OUT013,1987,Tier 3,Supermarket Type1,850.8924
High,Fruits and Vegetables,FDF32,16.35,Low Fat,0.068024,196.4426,OUT013,1987,Tier 3,Supermarket Type1,1977.426
High,Hard Drinks,DRJ59,11.65,low fat,0.019356,39.1164,OUT013,1987,Tier 3,Supermarket Type1,308.9312
High,Health and Hygiene,NCI17,8.645,Low Fat,0.143303,96.341,OUT013,1987,Tier 3,Supermarket Type1,193.082
High,Household,NCD19,8.93,Low Fat,0.0,53.8614,OUT013,1987,Tier 3,Supermarket Type1,994.7052


In [21]:
d.mean(numeric_only=True)

Outlet_Size,Item_Type,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
High,Baking Goods,12.036918,0.05949,129.202044,1987.0,2050.901134
High,Breads,11.048,0.065586,133.75896,1987.0,2080.731528
High,Breakfast,12.564231,0.075118,147.490585,1987.0,2104.286508
High,Canned,11.922231,0.056733,135.442708,1987.0,2211.265203
High,Dairy,13.071875,0.068907,153.509173,1987.0,2453.181713
High,Frozen Foods,13.250707,0.065639,136.82925,1987.0,2214.096189
High,Fruits and Vegetables,13.259613,0.061302,145.57287,1987.0,2405.118103
High,Hard Drinks,11.741957,0.062271,141.927522,1987.0,2363.59
High,Health and Hygiene,13.02877,0.051031,135.11098,1987.0,1953.042439
High,Household,14.033398,0.053742,147.097522,1987.0,2408.217992


In [22]:
# select the column first so that only Item_MRP is averaged
d['Item_MRP'].mean()

Outlet_Size  Item_Type            
High         Baking Goods             129.202044
             Breads                   133.758960
             Breakfast                147.490585
             Canned                   135.442708
             Dairy                    153.509173
             Frozen Foods             136.829250
             Fruits and Vegetables    145.572870
             Hard Drinks              141.927522
             Health and Hygiene       135.110980
             Household                147.097522
             Meat                     137.244790
             Others                   132.576613
             Seafood                  134.864240
             Snack Foods              145.847086
             Soft Drinks              131.758473
             Starchy Foods            158.157074
Medium       Baking Goods             126.178568
             Breads                   140.861039
             Breakfast                134.537511
             Canned               
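
One detail worth keeping in mind here: Outlet_Size has missing values, and groupby leaves out rows whose key is missing unless told otherwise. The sketch below illustrates this on a hypothetical frame named toy with made-up values; it assumes pandas 1.1 or later, where groupby accepts the dropna parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where two rows have no outlet size
toy = pd.DataFrame({
    "outlet_size": ["High", np.nan, "High", "Small", np.nan],
    "item_type": ["Dairy", "Dairy", "Meat", "Dairy", "Dairy"],
    "mrp": [100.0, 200.0, 300.0, 400.0, 600.0],
})

keys = ["outlet_size", "item_type"]

# Default behaviour: rows with a missing key are silently excluded
default_means = toy.groupby(keys)["mrp"].mean()

# dropna=False keeps them as a separate group with a NaN label
all_means = toy.groupby(keys, dropna=False)["mrp"].mean()

print(default_means)
print(all_means)
```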

### We can also do this using a pivot table

Calculate the average MRP of each Item_Type using a pivot table.

In [26]:
pd.pivot_table(data, index='Item_Type', values='Item_MRP', aggfunc='mean')

Item_Type,Item_MRP
Baking Goods,126.380766
Breads,140.952669
Breakfast,141.788151
Canned,139.763832
Dairy,148.499208
Frozen Foods,138.503366
Fruits and Vegetables,144.581235
Hard Drinks,137.077928
Health and Hygiene,130.818921
Household,149.424753


- **Calculate the average MRP of each Item_Type for each category of Outlet_Size using a pivot table**

In [27]:
pd.pivot_table(data, index=['Outlet_Size', 'Item_Type'], values='Item_MRP', aggfunc='mean')

Outlet_Size,Item_Type,Item_MRP
High,Baking Goods,129.202044
High,Breads,133.75896
High,Breakfast,147.490585
High,Canned,135.442708
High,Dairy,153.509173
High,Frozen Foods,136.82925
High,Fruits and Vegetables,145.57287
High,Hard Drinks,141.927522
High,Health and Hygiene,135.11098
High,Household,147.097522
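
The pivot table and the two-column groupby produce the same numbers in different shapes. As a rough illustration on a hypothetical frame named toy (made-up values), the sketch below also passes one of the keys as columns, which lays the result out as a grid instead of a long list.

```python
import pandas as pd

# Hypothetical toy data with two sizes and two item types
toy = pd.DataFrame({
    "outlet_size": ["High", "High", "Small", "Small", "High"],
    "item_type": ["Dairy", "Meat", "Dairy", "Meat", "Dairy"],
    "mrp": [100.0, 300.0, 400.0, 200.0, 200.0],
})

# Long form: one row per (outlet_size, item_type) pair
long_form = toy.groupby(["outlet_size", "item_type"])["mrp"].mean()

# Wide form: outlet sizes as rows, item types as columns
wide_form = pd.pivot_table(
    toy, index="outlet_size", columns="item_type", values="mrp", aggfunc="mean"
)

print(long_form)
print(wide_form)
```

Calling unstack() on the long form gives the same grid as the wide form, so the choice between the two is mostly about which layout is easier to read.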


### CROSS TAB

- The crosstab() function is used to compute a cross-tabulation of two or more factors.
- By default, it computes a frequency table of the factors, unless an array of values and an aggregation function are passed.


In [28]:
pd.crosstab(data['Outlet_Size'], data['Item_Type'])

Outlet_Size / Item_Type,Baking Goods,Breads,Breakfast,Canned,Dairy,Frozen Foods,Fruits and Vegetables,Hard Drinks,Health and Hygiene,Household,Meat,Others,Seafood,Snack Foods,Soft Drinks,Starchy Foods
High,73,25,13,65,80,92,142,23,61,103,41,16,5,125,49,19
Medium,203,83,36,217,218,274,413,75,170,289,149,52,21,408,137,48
Small,187,71,30,189,198,249,328,50,136,257,119,55,20,335,126,38
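
The second point above, passing values together with an aggregation function, can be sketched as follows. The frame named toy and its values are hypothetical; the first call gives the default frequency table and the second gives the mean of mrp in each cell.

```python
import pandas as pd

# Hypothetical toy data with two sizes and two item types
toy = pd.DataFrame({
    "outlet_size": ["High", "High", "Small", "Small", "High"],
    "item_type": ["Dairy", "Meat", "Dairy", "Meat", "Dairy"],
    "mrp": [100.0, 300.0, 400.0, 200.0, 200.0],
})

# Default: count how many rows fall in each combination
counts = pd.crosstab(toy["outlet_size"], toy["item_type"])

# With values and aggfunc: aggregate another column per combination
mean_mrp = pd.crosstab(
    toy["outlet_size"], toy["item_type"], values=toy["mrp"], aggfunc="mean"
)

print(counts)
print(mean_mrp)
```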


### How to create new feature using the aggregated results of a column?
**Let's create a dataframe that contains each Item_Identifier and its average Item_Visibility, using the groupby function.**

In [30]:
average_item_visibility = data.groupby(['Item_Identifier'])['Item_Visibility'].mean().reset_index()

average_item_visibility

Unnamed: 0,Item_Identifier,Item_Visibility
0,DRA12,0.031956
1,DRA24,0.048062
2,DRA59,0.134718
3,DRB01,0.082126
4,DRB13,0.008002
...,...,...
1554,NCZ30,0.024956
1555,NCZ41,0.051623
1556,NCZ42,0.009044
1557,NCZ53,0.027775


### Now, we want to create a new feature, average_item_visibility, using the above dataframe.
Let's first define a function that takes an Item_Identifier as its parameter and returns the corresponding average Item_Visibility from the average_item_visibility dataframe.

In [31]:
def get_item_visibility(x):
    # find the row whose Item_Identifier equals x and return its average visibility
    return average_item_visibility.loc[(average_item_visibility.Item_Identifier == x), 'Item_Visibility'].values[0]

In [32]:
# let's test it on the sample Item_Identifier
get_item_visibility('DRA24')

0.04806226414285714

### Now, use the apply function to create the new feature. Access the Item_Identifier column, call its apply method, and pass in the function defined above.

In [33]:
data['average_item_visibility'] = data.Item_Identifier.apply(get_item_visibility)

In [34]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,average_item_visibility
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,0.017387
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,0.019219
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,0.020145
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38,0.015274
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,0.008082


#### A better way of doing the above task is to use the transform function.
- On a large dataframe, the transform function takes comparatively less time than the first approach, because apply calls get_item_visibility once per row and each call searches the average_item_visibility dataframe. The saving is significant compared with the first approach.


In [35]:
data['average_item_visibility_2'] = data.groupby(['Item_Identifier'])['Item_Visibility'].transform('mean')

In [36]:
data[['average_item_visibility', 'average_item_visibility_2']]

Unnamed: 0,average_item_visibility,average_item_visibility_2
0,0.017387,0.017387
1,0.019219,0.019219
2,0.020145,0.020145
3,0.015274,0.015274
4,0.008082,0.008082
...,...,...
8518,0.061705,0.061705
8519,0.046952,0.046952
8520,0.035203,0.035203
8521,0.120686,0.120686
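
To see what transform does row by row, the sketch below works on a hypothetical frame named toy with made-up item_id and visibility columns. It computes the per-item mean once, broadcasts it back with map (a lookup-based alternative that is not used above and is shown only for comparison), and then does the same in one step with transform.

```python
import pandas as pd

# Hypothetical toy data: item A and item B each appear twice
toy = pd.DataFrame({
    "item_id": ["A", "B", "A", "C", "B"],
    "visibility": [0.10, 0.20, 0.30, 0.40, 0.60],
})

# One mean per item, indexed by item_id
per_item = toy.groupby("item_id")["visibility"].mean()

# Lookup-based alternative: map each row's item_id to its mean
toy["avg_vis_map"] = toy["item_id"].map(per_item)

# transform returns a result aligned with the original rows directly
toy["avg_vis_transform"] = toy.groupby("item_id")["visibility"].transform("mean")

print(toy)
```

Both new columns have the same length as the original frame, which is what makes transform convenient for feature creation: no merge or per-row function call is needed.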
