# Reporting using Pandas - Going Beyond Basics 🐼

<img src="images/00_reporting.PNG">


## What's covered in this notebook?
1. Aggregating statistics grouped by category  
	- Reading a .csv File - Online Store Sales Data  
	- Grouping the Data on the basis of Product Category  
		- Returning all the groups and row indexes  
		- Get unique group keys  
		- Filter data on the basis of group keys  
		- Returning first row, last row and nth row for each group  
	- Grouping the Data Based on Product Category and Sub-Category  
		- Returning all the groups and row indexes
		- Get unique group keys
		- Filter data on the basis of group keys
		- Returning first row, last row and nth row for each group
	- split-apply-combine  
	- Aggregation  
		- Built-in Aggregation Methods
		- Aggregation with User-Defined Functions
		- Applying different aggregation functions to DataFrame columns
	- Filtration  
		- Built-in Filtration
		- Filtration with User-Defined Functions
	- Transformation  
		- Built-in Transformation
		- Transformation with User-Defined Functions
2. Solving a Case Study using groupby()  
	- Reading a .csv File - Online Store Sales Data  
	- What are the different customer segments?  
	- How many sales records do we have in the dataset?  
	- What are the different product categories?  
	- How many days on average does it take for the products to get shipped?  
	- Are there more orders placed on weekends?  
	- What is the minimum order amount and maximum order amount?  
	- What is the revenue generated in the year 2017?  
	- Which customer contributed to the maximum revenue in 2017 and how much?  
	- Who is the customer with customer_id == TC-20980 ?  
	- Which region recorded maximum sales count?  
	- Which product category is doing best? (revenue and count)
3. Analysing and Summarizing using pivot_table()
	- What is the region-wise revenue?
	- What is the region-wise count of sales?
	- What is the region-wise count and sum of sales?
	- What is the region-wise revenue generated of each product category?
	- What is the region-wise revenue generated of each product sub-category under product category?

## Aggregating statistics grouped by category

<img style="float: right;" width="400" height="400" src="images/01_groupby.PNG">

**Question: How to calculate summary statistics?**  
**Answer:** Basic statistics (mean, median, min, max, counts…) are easy to compute. These or custom aggregations can be applied to the entire data set, to a sliding window of the data, or to groups defined by categories. The last of these is also known as the split-apply-combine approach.

**Important Note**  
`groupby()` and `pivot_table()` are both very powerful tools for analysing and summarizing data. `pivot_table()` is the more powerful of the two when applying complex aggregation operations.
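
To make the relationship between the two concrete, here is a minimal sketch on a toy DataFrame. The `toy` frame and its values are made up for illustration and are not taken from the sales file. It shows the same per-region total computed both ways, and then a `pivot_table()` call that spreads a second key across the columns.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the sales file)
toy = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "category": ["Furniture", "Technology", "Furniture", "Furniture", "Technology"],
    "sales": [100.0, 250.0, 80.0, 20.0, 300.0],
})

# groupby: one total per region, returned as a Series
by_region = toy.groupby("region")["sales"].sum()

# pivot_table: the same totals, returned as a one-column DataFrame
pivot_region = toy.pivot_table(index="region", values="sales", aggfunc="sum")

# pivot_table can also spread a second key across the columns
pivot_region_cat = toy.pivot_table(
    index="region", columns="category", values="sales", aggfunc="sum"
)

print(by_region)
print(pivot_region)
print(pivot_region_cat)
```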

### Reading a .csv File - Online Store Sales Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data/online_store_sales.csv', parse_dates=["Order Date", "Ship Date"], dayfirst=True)

df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368
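
The `dayfirst=True` argument matters when dates are written as day/month/year. The sketch below assumes the raw file stores dates that way (the argument suggests it, but the raw text is not shown here). It uses a small in-memory CSV with made-up rows to show how the same strings parse under each setting.

```python
import io

import pandas as pd

# Hypothetical two-row CSV with ambiguous day/month values
csv_text = "Order Date,Sales\n08/11/2017,261.96\n12/06/2017,14.62\n"

# Interpret 08/11/2017 as 8 November 2017
day_first = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Order Date"], dayfirst=True
)

# Interpret 08/11/2017 as 11 August 2017
month_first = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Order Date"], dayfirst=False
)

print(day_first["Order Date"].dt.month.tolist())
print(month_first["Order Date"].dt.month.tolist())
```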


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         9800 non-null   int64         
 1   Order ID       9800 non-null   object        
 2   Order Date     9800 non-null   datetime64[ns]
 3   Ship Date      9800 non-null   datetime64[ns]
 4   Ship Mode      9800 non-null   object        
 5   Customer ID    9800 non-null   object        
 6   Customer Name  9800 non-null   object        
 7   Segment        9800 non-null   object        
 8   Country        9800 non-null   object        
 9   City           9800 non-null   object        
 10  State          9800 non-null   object        
 11  Postal Code    9789 non-null   float64       
 12  Region         9800 non-null   object        
 13  Product ID     9800 non-null   object        
 14  Category       9800 non-null   object        
 15  Sub-Category   9800 n

In [4]:
col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df.columns ]

df.columns = col_names

df.columns

Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'country', 'city', 'state',
       'postal_code', 'region', 'product_id', 'category', 'sub_category',
       'product_name', 'sales'],
      dtype='object')

### Grouping the Data on the basis of Product Category

In [5]:
grouped_df = df.groupby('category')

#### Returning all the groups and row indexes

The `groups` attribute is a dictionary whose keys are the unique group values and whose values are the row (axis) labels belonging to each group.

In [6]:
grouped_df.groups

{'Furniture': [0, 1, 3, 5, 10, 23, 24, 27, 29, 36, 38, 39, 51, 52, 57, 65, 66, 72, 73, 76, 78, 85, 93, 96, 104, 110, 117, 119, 124, 125, 128, 129, 139, 140, 146, 149, 157, 167, 173, 177, 189, 192, 201, 204, 213, 222, 226, 228, 229, 231, 232, 234, 238, 239, 241, 242, 244, 249, 254, 272, 282, 292, 293, 294, 295, 301, 303, 304, 309, 310, 311, 313, 317, 325, 328, 338, 354, 362, 364, 369, 377, 384, 387, 399, 408, 412, 413, 415, 417, 422, 424, 425, 439, 440, 444, 446, 453, 456, 457, 462, ...], 'Office Supplies': [2, 4, 6, 8, 9, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 25, 28, 30, 31, 32, 33, 34, 37, 42, 43, 45, 46, 49, 50, 53, 55, 56, 58, 60, 61, 63, 64, 67, 69, 70, 71, 74, 75, 77, 79, 80, 81, 82, 83, 84, 87, 88, 89, 91, 92, 94, 95, 97, 98, 99, 101, 102, 105, 108, 111, 112, 113, 114, 115, 116, 118, 120, 121, 122, 126, 127, 131, 132, 133, 134, 135, 136, 137, 138, 141, 142, 143, 144, 145, 150, 151, 153, 154, 155, 156, 158, 160, 162, 163, 164, ...], 'Technology': [7, 11, 19, 26, 35, 40, 41, 44, 

#### Get unique group keys

In [7]:
grouped_df.groups.keys()

dict_keys(['Furniture', 'Office Supplies', 'Technology'])
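
The full `groups` output is hard to read on 9800 rows, so here is a small sketch of the same idea on a four-row toy frame (the frame and its values are illustrative only). It also shows that iterating over a GroupBy object yields one `(key, sub-DataFrame)` pair per group.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "category": ["Furniture", "Technology", "Furniture", "Office Supplies"],
    "sales": [10.0, 20.0, 30.0, 40.0],
})
grouped = toy.groupby("category")

# Map each group key to the list of row labels in that group
labels_by_group = {key: list(labels) for key, labels in grouped.groups.items()}
print(labels_by_group)

# Group keys as a plain list
group_keys = list(grouped.groups.keys())
print(group_keys)

# Iterating a GroupBy yields (key, sub-DataFrame) pairs
for key, frame in grouped:
    print(key, len(frame))
```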

#### Filter data on the basis of group keys

In [8]:
# Selecting a group

grouped_df.get_group("Technology")

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales
7,8,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,TEC-PH-10002275,Technology,Phones,Mitel 5320 IP Phone VoIP phone,907.152
11,12,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,TEC-PH-10002033,Technology,Phones,Konftel 250 Conference phone - Charcoal black,911.424
19,20,CA-2015-143336,2015-08-27,2015-09-01,Second Class,ZD-21925,Zuschuss Donatelli,Consumer,United States,San Francisco,California,94109.0,West,TEC-PH-10001949,Technology,Phones,Cisco SPA 501G IP Phone,213.480
26,27,CA-2017-121755,2017-01-16,2017-01-20,Second Class,EH-13945,Eric Hoffmann,Consumer,United States,Los Angeles,California,90049.0,West,TEC-AC-10003027,Technology,Accessories,Imation 8GB Mini TravelDrive USB 2.0 Flash Drive,90.570
35,36,CA-2017-117590,2017-12-08,2017-12-10,First Class,GH-14485,Gene Hale,Corporate,United States,Richardson,Texas,75080.0,Central,TEC-PH-10004977,Technology,Phones,GE 30524EE4,1097.544
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9780,9781,CA-2017-153178,2017-09-14,2017-09-18,Standard Class,CL-12565,Clay Ludtke,Consumer,United States,Long Beach,New York,11561.0,East,TEC-PH-10001944,Technology,Phones,Wi-Ex zBoost YX540 Cellular Phone Signal Booster,437.850
9789,9790,CA-2018-144491,2018-03-27,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,TEC-AC-10004901,Technology,Accessories,Kensington SlimBlade Notebook Wireless Mouse w...,39.992
9797,9798,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.188
9798,9799,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.376


#### Returning first row, last row and nth row for each group

In [9]:
grouped_df.first()

category,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,sub_category,product_name,sales
Furniture,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Bookcases,Bush Somerset Collection Bookcase,261.96
Office Supplies,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
Technology,8,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,TEC-PH-10002275,Phones,Mitel 5320 IP Phone VoIP phone,907.152


In [10]:
grouped_df.last()

category,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,sub_category,product_name,sales
Furniture,9793,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10003396,Chairs,Global Deluxe Steno Chair,107.772
Office Supplies,9797,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,OFF-AR-10001374,Art,"BIC Brite Liner Highlighters, Chisel Tip",10.368
Technology,9800,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-AC-10000487,Accessories,SanDisk Cruzer 4 GB USB Flash Drive,10.384


In [11]:
grouped_df.nth(10)

category,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,sub_category,product_name,sales
Furniture,39,CA-2016-117415,2016-12-27,2016-12-31,Standard Class,SN-20710,Steve Nguyen,Home Office,United States,Houston,Texas,77041.0,Central,FUR-BO-10002545,Bookcases,"Atlantic Metals Mobile 3-Shelf Bookcases, Cust...",532.3992
Office Supplies,18,CA-2015-167164,2015-05-13,2015-05-15,Second Class,AG-10270,Alejandro Grove,Consumer,United States,West Jordan,Utah,84084.0,West,OFF-ST-10000107,Storage,Fellowes Super Stor/Drawer,55.5
Technology,55,CA-2017-105816,2017-12-11,2017-12-17,Standard Class,JM-15265,Janet Molinari,Corporate,United States,New York City,New York,10024.0,East,TEC-PH-10002447,Phones,AT&T CL83451 4-Handset Telephone,1029.95
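
The three calls differ in a subtle way, which a toy frame makes easier to see (the data below is made up). `first()` and `last()` return one row per group, indexed by the group key. `nth(n)` selects the row at 0-based position `n` within each group. As far as I know, from pandas 2.0 onward `nth` keeps the original row labels instead of indexing by group key, so the output above, which is indexed by `category`, would come from an earlier version. Treat that version detail as an assumption to verify against your installed release.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "category": ["A", "B", "A", "B", "A"],
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = toy.groupby("category")

first_rows = grouped.first()   # first non-null value per column, indexed by group key
last_rows = grouped.last()     # last non-null value per column, indexed by group key
second_rows = grouped.nth(1)   # row at position 1 (0-based) within each group

print(first_rows)
print(last_rows)
print(second_rows)
```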


### Grouping the Data Based on Product Category and Sub-Category

In [12]:
# Grouping based on category first and then sub_category

grouped_df = df.groupby(['category', 'sub_category'])

#### Returning all the groups and row indexes

In [13]:
# Return each group key together with the row labels that belong to it

grouped_df.groups

{('Furniture', 'Bookcases'): [0, 27, 38, 189, 192, 213, 292, 354, 369, 399, 412, 468, 472, 485, 688, 708, 736, 783, 841, 906, 954, 1042, 1114, 1211, 1247, 1302, 1369, 1386, 1534, 1539, 1545, 1594, 1610, 1714, 1723, 1760, 1762, 1860, 1875, 1932, 2007, 2025, 2115, 2122, 2225, 2262, 2281, 2305, 2326, 2353, 2403, 2415, 2471, 2543, 2546, 2558, 2603, 2650, 2654, 2737, 2777, 2796, 2808, 2825, 2860, 3023, 3030, 3074, 3098, 3100, 3102, 3175, 3351, 3365, 3368, 3466, 3507, 3512, 3762, 3820, 3845, 3910, 3928, 3985, 3994, 3999, 4023, 4071, 4088, 4110, 4184, 4217, 4223, 4266, 4284, 4383, 4385, 4389, 4423, 4453, ...], ('Furniture', 'Chairs'): [1, 23, 39, 52, 57, 66, 72, 85, 124, 128, 149, 157, 167, 173, 177, 228, 229, 244, 249, 294, 310, 317, 328, 362, 413, 415, 417, 424, 439, 444, 456, 457, 498, 502, 526, 531, 539, 551, 569, 586, 622, 635, 657, 730, 769, 777, 787, 791, 799, 819, 829, 847, 880, 916, 960, 980, 983, 990, 1021, 1030, 1045, 1060, 1067, 1081, 1126, 1158, 1177, 1190, 1198, 1200, 1202, 1212

#### Get unique group keys

In [14]:
grouped_df.groups.keys()

dict_keys([('Furniture', 'Bookcases'), ('Furniture', 'Chairs'), ('Furniture', 'Furnishings'), ('Furniture', 'Tables'), ('Office Supplies', 'Appliances'), ('Office Supplies', 'Art'), ('Office Supplies', 'Binders'), ('Office Supplies', 'Envelopes'), ('Office Supplies', 'Fasteners'), ('Office Supplies', 'Labels'), ('Office Supplies', 'Paper'), ('Office Supplies', 'Storage'), ('Office Supplies', 'Supplies'), ('Technology', 'Accessories'), ('Technology', 'Copiers'), ('Technology', 'Machines'), ('Technology', 'Phones')])

#### Filter data on the basis of group keys

In [15]:
grouped_df.get_group(('Technology', 'Phones'))

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales
7,8,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,TEC-PH-10002275,Technology,Phones,Mitel 5320 IP Phone VoIP phone,907.152
11,12,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,TEC-PH-10002033,Technology,Phones,Konftel 250 Conference phone - Charcoal black,911.424
19,20,CA-2015-143336,2015-08-27,2015-09-01,Second Class,ZD-21925,Zuschuss Donatelli,Consumer,United States,San Francisco,California,94109.0,West,TEC-PH-10001949,Technology,Phones,Cisco SPA 501G IP Phone,213.480
35,36,CA-2017-117590,2017-12-08,2017-12-10,First Class,GH-14485,Gene Hale,Corporate,United States,Richardson,Texas,75080.0,Central,TEC-PH-10004977,Technology,Phones,GE 30524EE4,1097.544
40,41,CA-2016-117415,2016-12-27,2016-12-31,Standard Class,SN-20710,Steve Nguyen,Home Office,United States,Houston,Texas,77041.0,Central,TEC-PH-10000486,Technology,Phones,Plantronics HL10 Handset Lifter,371.168
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9764,9765,CA-2015-123855,2015-06-18,2015-06-23,Standard Class,MC-18100,Mick Crebagga,Consumer,United States,Los Angeles,California,90036.0,West,TEC-PH-10000215,Technology,Phones,Plantronics Cordless Phone Headset with In-lin...,139.800
9773,9774,CA-2017-160234,2017-06-26,2017-07-03,Standard Class,PF-19225,Phillip Flathmann,Consumer,United States,Atlanta,Georgia,30318.0,South,TEC-PH-10004434,Technology,Phones,Cisco IP Phone 7961G VoIP phone - Dark gray,135.950
9780,9781,CA-2017-153178,2017-09-14,2017-09-18,Standard Class,CL-12565,Clay Ludtke,Consumer,United States,Long Beach,New York,11561.0,East,TEC-PH-10001944,Technology,Phones,Wi-Ex zBoost YX540 Cellular Phone Signal Booster,437.850
9797,9798,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.188
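
When grouping by two columns, each group key is a tuple, which is why `get_group` takes `('Technology', 'Phones')` rather than a single string. The sketch below, on made-up toy data, shows that the result contains the same rows as an ordinary boolean mask on both columns.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "category": ["Technology", "Technology", "Furniture", "Technology"],
    "sub_category": ["Phones", "Copiers", "Chairs", "Phones"],
    "sales": [100.0, 900.0, 50.0, 200.0],
})
grouped = toy.groupby(["category", "sub_category"])

# With two grouping columns, each key is a (category, sub_category) tuple
phones = grouped.get_group(("Technology", "Phones"))

# The same rows selected with a boolean mask
mask = (toy["category"] == "Technology") & (toy["sub_category"] == "Phones")
phones_by_mask = toy[mask]

print(phones)
print(phones_by_mask)
```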


#### Returning first row, last row and nth row for each group

In [16]:
grouped_df.first()

category,sub_category,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,product_name,sales
Furniture,Bookcases,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Bush Somerset Collection Bookcase,261.96
Furniture,Chairs,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
Furniture,Furnishings,6,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,FUR-FU-10001487,Eldon Expressions Wood and Plastic Desk Access...,48.86
Furniture,Tables,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Bretford CR4500 Series Slim Rectangular Table,957.5775
Office Supplies,Appliances,10,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,OFF-AP-10002892,Belkin F5C206VTEL 6 Outlet Surge,114.9
Office Supplies,Art,7,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,OFF-AR-10002833,Newell 322,7.28
Office Supplies,Binders,9,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,OFF-BI-10003910,DXL Angle-View Binders with Locking Rings by S...,18.504
Office Supplies,Envelopes,31,US-2016-150630,2016-09-17,2016-09-21,Standard Class,TB-21520,Tracy Blumstein,Consumer,United States,Philadelphia,Pennsylvania,19140.0,East,OFF-EN-10001509,Poly String Tie Envelopes,3.264
Office Supplies,Fasteners,54,CA-2017-105816,2017-12-11,2017-12-17,Standard Class,JM-15265,Janet Molinari,Corporate,United States,New York City,New York,10024.0,East,OFF-FA-10000304,Advantus Push Pins,15.26
Office Supplies,Labels,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Self-Adhesive Address Labels for Typewriters b...,14.62


In [17]:
grouped_df.last()

category,sub_category,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,product_name,sales
Furniture,Bookcases,9788,CA-2018-144491,2018-03-27,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-BO-10001811,"Atlantic Metals Mobile 5-Shelf Bookcases, Cust...",1023.332
Furniture,Chairs,9793,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10003396,Global Deluxe Steno Chair,107.772
Furniture,Furnishings,9785,CA-2016-149748,2016-05-31,2016-06-02,Second Class,EM-13825,Elizabeth Moffitt,Corporate,United States,Paterson,New Jersey,7501.0,East,FUR-FU-10001847,Eldon Image Series Black Desk Accessories,8.28
Furniture,Tables,9757,CA-2018-113705,2018-03-27,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,FUR-TA-10002533,BPI Conference Tables,292.1
Office Supplies,Appliances,9780,CA-2015-169019,2015-07-26,2015-07-30,Standard Class,LF-17185,Luke Foster,Consumer,United States,San Antonio,Texas,78207.0,Central,OFF-AP-10003281,Acco 6 Outlet Guardian Standard Surge Suppressor,4.836
Office Supplies,Art,9797,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,OFF-AR-10001374,"BIC Brite Liner Highlighters, Chisel Tip",10.368
Office Supplies,Binders,9796,CA-2017-125920,2017-05-21,2017-05-28,Standard Class,SH-19975,Sally Hughsby,Corporate,United States,Chicago,Illinois,60610.0,Central,OFF-BI-10003429,"Cardinal HOLDit! Binder Insert Strips,Extra St...",3.798
Office Supplies,Envelopes,9792,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,OFF-EN-10003134,Staple envelope,56.064
Office Supplies,Fasteners,9702,CA-2017-105291,2017-10-30,2017-11-04,Standard Class,SP-20920,Susan Pistek,Consumer,United States,San Luis Obispo,California,93405.0,West,OFF-FA-10003059,Assorted Color Push Pins,3.62
Office Supplies,Labels,9754,CA-2018-113705,2018-03-27,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,OFF-LA-10000476,Avery 05222 Permanent Self-Adhesive File Folde...,8.26


In [18]:
grouped_df.nth(10)

category,sub_category,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,product_name,sales
Furniture,Bookcases,413,CA-2018-117457,2018-12-08,2018-12-12,Standard Class,KH-16510,Keith Herrera,Consumer,United States,San Francisco,California,94110.0,West,FUR-BO-10001972,O'Sullivan 4-Shelf Bookcase in Odessa Pine,1336.829
Furniture,Chairs,150,CA-2017-114489,2017-12-05,2017-12-09,Standard Class,JE-16165,Justin Ellison,Corporate,United States,Franklin,Wisconsin,53132.0,Central,FUR-CH-10000454,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",1951.84
Furniture,Furnishings,105,US-2016-156867,2016-11-13,2016-11-17,Standard Class,LC-16870,Lena Cacioppo,Consumer,United States,Aurora,Colorado,80013.0,West,FUR-FU-10004006,"Deflect-o DuraMat Lighweight, Studded, Beveled...",102.36
Furniture,Tables,283,CA-2016-130890,2016-11-02,2016-11-06,Standard Class,JO-15280,Jas O'Carroll,Consumer,United States,Los Angeles,California,90004.0,West,FUR-TA-10002903,"Bevis Round Bullnose 29"" High Table Top",1038.84
Office Supplies,Appliances,203,CA-2015-133690,2015-08-03,2015-08-05,First Class,BS-11755,Bruce Stewart,Consumer,United States,Denver,Colorado,80219.0,West,OFF-AP-10003622,"Bravo II Megaboss 12-Amp Hard Body Upright, Re...",2.6
Office Supplies,Art,112,CA-2017-128867,2017-11-03,2017-11-10,Standard Class,CL-12565,Clay Ludtke,Consumer,United States,Urbandale,Iowa,50322.0,Central,OFF-AR-10000380,"Hunt PowerHouse Electric Pencil Sharpener, Blue",75.96
Office Supplies,Binders,64,CA-2016-135545,2016-11-24,2016-11-30,Standard Class,KM-16720,Kunst Miller,Consumer,United States,Los Angeles,California,90004.0,West,OFF-BI-10001078,"Acco PRESSTEX Data Binder with Storage Hooks, ...",25.824
Office Supplies,Envelopes,270,US-2018-145366,2018-12-09,2018-12-13,Standard Class,CA-12310,Christine Abelman,Corporate,United States,Cincinnati,Ohio,45231.0,East,OFF-EN-10004386,Recycled Interoffice Envelopes with String and...,57.576
Office Supplies,Fasteners,340,CA-2016-128167,2016-06-22,2016-06-26,Second Class,KL-16645,Ken Lonsdale,Consumer,United States,Layton,Utah,84041.0,West,OFF-FA-10000490,"OIC Binder Clips, Mini, 1/4"" Capacity, Black",4.96
Office Supplies,Labels,361,CA-2018-155698,2018-03-08,2018-03-11,First Class,VB-21745,Victoria Brennan,Corporate,United States,Columbus,Georgia,31907.0,South,OFF-LA-10001158,"Avery Address/Shipping Labels for Typewriters,...",20.7


### split-apply-combine

<img style="float: right;" width="400" height="400" src="images/02_split_apply_combine.PNG">

Calculating a given statistic (e.g. mean sales) for each category in a column (e.g. Furniture/Office Supplies/Technology in the `category` column) is a common pattern. The `groupby` method supports this type of operation. It is an instance of the more general split-apply-combine pattern:
- **Split** the data into groups
- **Apply** a function to each group independently
- **Combine** the results into a data structure

In the `apply` step, we might wish to do one of the following:
- **Aggregation:** compute a summary statistic (or statistics) for each group. Some examples:  
> Compute group sums or means.  
> Compute group sizes / counts.

- **Filtration:** discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
> Discard data that belong to groups with only a few members.  
> Filter out data based on the group sum or mean.

- **Transformation:** perform some group-specific computations and return a like-indexed object. Some examples:
> Standardize data (zscore) within a group.  
> Filling NAs within groups with a value derived from each group.
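
The three kinds of apply step can be sketched side by side on a toy frame. The data and the threshold of two members are made up for illustration. Note how aggregation returns one value per group, filtration returns a subset of the original rows, and transformation returns a result aligned with the original index.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "C", "C"],
    "sales": [10.0, 20.0, 30.0, 100.0, 5.0, 15.0],
})
grouped = toy.groupby("category")

# Aggregation: one value per group
totals = grouped["sales"].sum()

# Filtration: keep only rows whose group has at least 2 members
big_groups = grouped.filter(lambda g: len(g) >= 2)

# Transformation: same length as the input, each row centred on its group mean
centred = toy["sales"] - grouped["sales"].transform("mean")

print(totals)
print(big_groups)
print(centred)
```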

### Aggregation

#### Built-in Aggregation Methods

Many common aggregations are built into GroupBy objects as methods. Of the methods listed below, those marked with a * do not have a Cython-optimized implementation.


| Method | Description |
|:--------|:--------|
| any() | Compute whether any of the values in the groups are truthy |
| all() | Compute whether all of the values in the groups are truthy |
| count()| Compute the number of non-NA values in the groups |
| cov() * | Compute the covariance of the groups |
| first() | Compute the first occurring value in each group |
| idxmax() * | Compute the index of the maximum value in each group |
| idxmin() * | Compute the index of the minimum value in each group |
| last() | Compute the last occurring value in each group |
| max() | Compute the maximum value in each group |
| mean() | Compute the mean of each group |
| median() | Compute the median of each group |
| min() | Compute the minimum value in each group |
| nunique() | Compute the number of unique values in each group |
| prod() | Compute the product of the values in each group |
| quantile() | Compute a given quantile of the values in each group |
| sem() | Compute the standard error of the mean of the values in each group |
| size() | Compute the number of values in each group |
| skew() * | Compute the skew of the values in each group |
| std() | Compute the standard deviation of the values in each group |
| sum() | Compute the sum of the values in each group |
| var() | Compute the variance of the values in each group |
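
Two entries in the table are easy to confuse: `size()` counts every row in a group, while `count()` counts only non-NA values. The sketch below uses a toy frame with one missing value (made-up data) to show the difference. It also passes a list of names to `agg` to run several built-in aggregations in one call.

```python
import numpy as np
import pandas as pd

# Toy data with one missing sales value (hypothetical values)
toy = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "sales": [10.0, np.nan, 1.0, 2.0, 3.0],
})
grouped = toy.groupby("category")["sales"]

sizes = grouped.size()     # counts every row, including NaN
counts = grouped.count()   # counts only non-NA values

# Several built-in aggregations at once, one output column per function
summary = grouped.agg(["min", "max", "mean"])

print(sizes)
print(counts)
print(summary)
```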

In [19]:
grouped_df = df.groupby('category')

In [20]:
grouped_df['category'].count()

category
Furniture          2078
Office Supplies    5909
Technology         1813
Name: category, dtype: int64

In [21]:
grouped_df['sales'].min()

category
Furniture          1.892
Office Supplies    0.444
Technology         0.990
Name: sales, dtype: float64

In [22]:
grouped_df['sales'].max()

category
Furniture           4416.174
Office Supplies     9892.740
Technology         22638.480
Name: sales, dtype: float64

In [23]:
grouped_df['sales'].mean()

category
Furniture          350.653790
Office Supplies    119.381001
Technology         456.401474
Name: sales, dtype: float64

In [24]:
grouped_df = df.groupby(['category', 'sub_category'])

In [25]:
grouped_df['sub_category'].count()

category         sub_category
Furniture        Bookcases        226
                 Chairs           607
                 Furnishings      931
                 Tables           314
Office Supplies  Appliances       459
                 Art              785
                 Binders         1492
                 Envelopes        248
                 Fasteners        214
                 Labels           357
                 Paper           1338
                 Storage          832
                 Supplies         184
Technology       Accessories      756
                 Copiers           66
                 Machines         115
                 Phones           876
Name: sub_category, dtype: int64

In [26]:
grouped_df['sales'].min()

category         sub_category
Furniture        Bookcases        35.490
                 Chairs           26.640
                 Furnishings       1.892
                 Tables           24.368
Office Supplies  Appliances        0.444
                 Art               1.344
                 Binders           0.556
                 Envelopes         1.632
                 Fasteners         1.240
                 Labels            2.088
                 Paper             3.380
                 Storage           4.464
                 Supplies          1.744
Technology       Accessories       0.990
                 Copiers         299.990
                 Machines         11.560
                 Phones            2.970
Name: sales, dtype: float64

In [27]:
grouped_df['sales'].max()

category         sub_category
Furniture        Bookcases        4404.900
                 Chairs           4416.174
                 Furnishings      1336.440
                 Tables           4297.644
Office Supplies  Appliances       2625.120
                 Art              1113.024
                 Binders          9892.740
                 Envelopes         604.656
                 Fasteners          93.360
                 Labels            786.480
                 Paper             733.950
                 Storage          2934.330
                 Supplies         8187.650
Technology       Accessories      3347.370
                 Copiers         17499.950
                 Machines        22638.480
                 Phones           4548.810
Name: sales, dtype: float64

In [28]:
grouped_df['sales'].mean()

category         sub_category
Furniture        Bookcases        503.598224
                 Chairs           531.833165
                 Furnishings       95.823865
                 Tables           645.893720
Office Supplies  Appliances       227.926804
                 Art               34.019631
                 Binders          134.067550
                 Envelopes         65.032444
                 Fasteners         14.027850
                 Labels            34.587468
                 Paper             57.420257
                 Storage          263.633885
                 Supplies         252.284283
Technology       Accessories      217.178175
                 Copiers         2215.880212
                 Machines        1645.553313
                 Phones           374.180877
Name: sales, dtype: float64

In [29]:
grouped_df['sales'].idxmax()

category         sub_category
Furniture        Bookcases       9741
                 Chairs          7243
                 Furnishings     7387
                 Tables          9639
Office Supplies  Appliances      7579
                 Art               67
                 Binders         9039
                 Envelopes       2516
                 Fasteners       8006
                 Labels          1621
                 Paper           3262
                 Storage         3070
                 Supplies        2505
Technology       Accessories      251
                 Copiers         6826
                 Machines        2697
                 Phones          2492
Name: sales, dtype: int64

In [30]:
df.loc[2492]

row_id                          2493
order_id              CA-2015-144624
order_date       2015-11-19 00:00:00
ship_date        2015-11-23 00:00:00
ship_mode             Standard Class
customer_id                 JM-15865
customer_name            John Murray
segment                     Consumer
country                United States
city                       Jamestown
state                       New York
postal_code                  14701.0
region                          East
product_id           TEC-PH-10002885
category                  Technology
sub_category                  Phones
product_name          Apple iPhone 5
sales                        4548.81
Name: 2492, dtype: object

In [31]:
df.loc[2697]

row_id                                                        2698
order_id                                            CA-2015-145317
order_date                                     2015-03-18 00:00:00
ship_date                                      2015-03-23 00:00:00
ship_mode                                           Standard Class
customer_id                                               SM-20320
customer_name                                          Sean Miller
segment                                                Home Office
country                                              United States
city                                                  Jacksonville
state                                                      Florida
postal_code                                                32216.0
region                                                       South
product_id                                         TEC-MA-10002412
category                                                Techno

#### Aggregation with User-Defined Functions

In [32]:
grouped_df['sales'].agg(["min", "mean", "max"])

Unnamed: 0_level_0,Unnamed: 1_level_0,min,mean,max
category,sub_category,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Furniture,Bookcases,35.49,503.598224,4404.9
Furniture,Chairs,26.64,531.833165,4416.174
Furniture,Furnishings,1.892,95.823865,1336.44
Furniture,Tables,24.368,645.89372,4297.644
Office Supplies,Appliances,0.444,227.926804,2625.12
Office Supplies,Art,1.344,34.019631,1113.024
Office Supplies,Binders,0.556,134.06755,9892.74
Office Supplies,Envelopes,1.632,65.032444,604.656
Office Supplies,Fasteners,1.24,14.02785,93.36
Office Supplies,Labels,2.088,34.587468,786.48


In [33]:
grouped_df['sales'].agg(lambda values: values.min())

category         sub_category
Furniture        Bookcases        35.490
                 Chairs           26.640
                 Furnishings       1.892
                 Tables           24.368
Office Supplies  Appliances        0.444
                 Art               1.344
                 Binders           0.556
                 Envelopes         1.632
                 Fasteners         1.240
                 Labels            2.088
                 Paper             3.380
                 Storage           4.464
                 Supplies          1.744
Technology       Accessories       0.990
                 Copiers         299.990
                 Machines         11.560
                 Phones            2.970
Name: sales, dtype: float64
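
The lambda above only reproduces the built-in `min`. A user-defined function becomes useful when no built-in covers the statistic. The sketch below is a minimal illustration on made-up data; the `toy` frame and the `sales_range` helper are assumptions for this example and are not part of the notebook's dataset.

```python
import pandas as pd

# Toy data standing in for the sales DataFrame (values are made up).
toy = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Technology", "Technology", "Technology"],
    "sales": [100.0, 250.0, 40.0, 10.0, 90.0],
})

def sales_range(values):
    """Spread between the largest and smallest sale in one group."""
    return values.max() - values.min()

# agg() calls the function once per group and collects the scalar results.
range_by_category = toy.groupby("category")["sales"].agg(sales_range)
print(range_by_category)
```

A named function also gives the resulting column a readable name when it is mixed with string aliases, for example `agg(["min", sales_range])`.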

#### Applying different aggregation functions to DataFrame columns

In [34]:
grouped_df.agg({'order_date' : ['min', 'max'], 'sales': ['mean', 'std']})

Unnamed: 0_level_0,Unnamed: 1_level_0,order_date,order_date,sales,sales
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,std
category,sub_category,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Furniture,Bookcases,2015-01-13,2018-12-30,503.598224,641.41928
Furniture,Chairs,2015-01-06,2018-12-29,531.833165,551.180296
Furniture,Furnishings,2015-01-07,2018-12-29,95.823865,148.42149
Furniture,Tables,2015-01-27,2018-12-25,645.89372,598.584981
Office Supplies,Appliances,2015-01-18,2018-12-30,227.926804,378.006735
Office Supplies,Art,2015-01-05,2018-12-29,34.019631,60.301752
Office Supplies,Binders,2015-01-04,2018-12-30,134.06755,568.09997
Office Supplies,Envelopes,2015-01-13,2018-12-23,65.032444,85.170691
Office Supplies,Fasteners,2015-01-06,2018-12-30,14.02785,12.466864
Office Supplies,Labels,2015-01-04,2018-12-28,34.587468,74.802711


### Filtration

A filtration is a GroupBy operation that subsets the original grouping object. It may filter out entire groups, parts of groups, or both. Filtrations return a filtered version of the calling object, including the grouping columns when provided. In the examples below, the grouping column `category` is therefore part of the result.

#### Built-in Filtration
 | Method | Description | 
 |:------|:------|
 | head() | Select the top row(s) of each group | 
 | nth() | Select the nth row(s) of each group | 
 | tail() | Select the bottom row(s) of each group | 
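
As a rough illustration of the table above, the sketch below applies `head()` and `tail()` to a small made-up frame. The rows keep their original index labels and the grouping column stays in the result, which is what distinguishes a filtration from an aggregation.

```python
import pandas as pd

# Toy data (made up); the index labels 0..4 make it easy to see which rows survive.
toy = pd.DataFrame({
    "category": ["Furniture", "Technology", "Furniture", "Technology", "Furniture"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped = toy.groupby("category")

first_rows = grouped.head(1)  # first row of each group, original index kept
last_rows = grouped.tail(1)   # last row of each group, original index kept

print(first_rows)
print(last_rows)
```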

#### Filtration with User-Defined Functions

The `filter` method takes a User-Defined Function (UDF) that, when applied to an entire group, returns either `True` or `False`. The result of the `filter` method is then the subset of groups for which the UDF returned `True`.

In [35]:
grouped_df = df.groupby('category')

In [36]:
grouped_df['category'].count()

category
Furniture          2078
Office Supplies    5909
Technology         1813
Name: category, dtype: int64

In [37]:
grouped_df['sales'].mean()

category
Furniture          350.653790
Office Supplies    119.381001
Technology         456.401474
Name: sales, dtype: float64

In [38]:
grouped_df.filter(lambda group: group['sales'].mean() > 200)

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
5,6,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,FUR-FU-10001487,Furniture,Furnishings,Eldon Expressions Wood and Plastic Desk Access...,48.8600
7,8,CA-2015-115812,2015-06-09,2015-06-14,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032.0,West,TEC-PH-10002275,Technology,Phones,Mitel 5320 IP Phone VoIP phone,907.1520
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9790,9791,CA-2018-144491,2018-03-27,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10001714,Furniture,Chairs,"Global Leather & Oak Executive Chair, Burgundy",211.2460
9792,9793,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10003396,Furniture,Chairs,Global Deluxe Steno Chair,107.7720
9797,9798,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.1880
9798,9799,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.3760
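
The same selection can usually be written as a boolean mask built with `transform()`, which tends to be faster than a Python-level UDF on large frames. The sketch below compares both approaches on made-up data; it is an illustration of the technique, not a replacement for the cell above.

```python
import pandas as pd

# Toy data (made up): mean sales are 250 (Furniture), 75 (Office Supplies), 400 (Technology).
toy = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Office Supplies", "Office Supplies", "Technology"],
    "sales": [300.0, 200.0, 50.0, 100.0, 400.0],
})

# filter(): the lambda receives each group as a DataFrame and returns one boolean.
kept_by_filter = toy.groupby("category").filter(lambda g: g["sales"].mean() > 200)

# Mask: transform("mean") broadcasts each group's mean back onto its rows.
group_mean = toy.groupby("category")["sales"].transform("mean")
kept_by_mask = toy[group_mean > 200]

print(kept_by_filter)
```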


### Transformation

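A transformation is a GroupBy operation whose result is indexed the same as the object being grouped. Unlike aggregations, the groupings that are used to split the original object are not included in the result.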

In [39]:
df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [40]:
grouped_df = df.groupby('category')

In [41]:
grouped_df.cumsum()

Unnamed: 0,row_id,postal_code,sales
0,1,42420.0,261.9600
1,3,84840.0,993.9000
2,3,90036.0,14.6200
3,7,118151.0,1951.4775
4,8,123347.0,36.9880
...,...,...,...
9795,28842865,324811420.0,705411.9660
9796,28852662,324855035.0,705422.3340
9797,8884651,100590645.0,827419.1130
9798,8894450,100634260.0,827445.4890
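
Running totals of `row_id` and `postal_code` carry no meaning, and depending on the pandas version, calling `cumsum()` on the whole grouped frame may either drop or reject non-numeric columns. A more explicit variant selects the column first. The sketch below uses the first five rows shown above as toy data; the `running_sales` column name is an assumption for this example.

```python
import pandas as pd

# Toy data copied from the first five rows displayed above.
toy = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Office Supplies", "Furniture", "Office Supplies"],
    "product_name": ["Bookcase", "Chairs", "Labels", "Table", "Storage"],
    "sales": [261.96, 731.94, 14.62, 957.5775, 22.368],
})

# Select the numeric column before transforming; the result aligns with toy's index.
toy["running_sales"] = toy.groupby("category")["sales"].cumsum()
print(toy)
```

Because the result has the same index as the original frame, it can be assigned straight back as a new column.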


#### Built-in Transformation

| Method | Description |
|:-----|:-----|
| bfill() | Back fill NA values within each group |
| cumcount() | Compute the cumulative count within each group |
| cummax() | Compute the cumulative max within each group |
| cummin() | Compute the cumulative min within each group |
| cumprod() | Compute the cumulative product within each group |
| cumsum() | Compute the cumulative sum within each group |
| diff() | Compute the difference between adjacent values within each group |
| ffill() | Forward fill NA values within each group |
| fillna() | Fill NA values within each group |
| pct_change() | Compute the percent change between adjacent values within each group |
| rank() | Compute the rank of each value within each group |
| shift() | Shift values up or down within each group |

#### Transformation with User-Defined Functions

Similar to the aggregation method, the `transform()` method can accept string aliases to the built-in transformation methods in the previous section. It can also accept string aliases to the built-in aggregation methods. When an aggregation method is provided, the result will be broadcast across the group.

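In addition to string aliases, the `transform()` method can also accept User-Defined Functions (UDFs). The UDF must:

- Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (for example, a scalar).
- Operate column-by-column on the group chunk.
- Not perform in-place operations on the group chunk; group chunks should be treated as immutable.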

**Note:** 
Transforming by supplying `transform` with a UDF is often less performant than using the built-in methods on GroupBy. Consider breaking up a complex operation into a chain of operations that utilize the built-in methods.
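
The sketch below illustrates both routes on made-up data: an aggregation alias whose result is broadcast across each group, and a UDF that returns a value per row. The column names `category_total`, `share_of_category`, and `centered_sales` are assumptions for this example.

```python
import pandas as pd

# Toy data (made up).
toy = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Technology", "Technology"],
    "sales": [100.0, 300.0, 50.0, 150.0],
})
grouped_sales = toy.groupby("category")["sales"]

# Aggregation alias: the group total is broadcast back to every row of the group.
toy["category_total"] = grouped_sales.transform("sum")
toy["share_of_category"] = toy["sales"] / toy["category_total"]

# UDF: returns a Series of the same length as the group chunk.
toy["centered_sales"] = grouped_sales.transform(lambda s: s - s.mean())
print(toy)
```

The centered column could also be computed without a UDF as `toy["sales"] - grouped_sales.transform("mean")`, which follows the advice in the note above.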

## Solving a Case Study using Groupby

### Reading a .csv File - Online Store Sales Data

In [42]:
df = pd.read_csv('data/online_store_sales.csv', parse_dates=["Order Date", "Ship Date"], dayfirst=True)

df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


**What comes to my mind immediately after looking at the dataset?**

> 1. What are the different customer segments?
> 2. How many sales records do we have in the dataset?
> 3. What are the different product categories?
> 4. How many days on average does it take for the products to get shipped?
> 5. Are there more orders placed on weekends?
> 6. What is the minimum order amount and maximum order amount?
> 7. Which customer contributed to the maximum revenue in 2017 and how much?
> 8. What is the revenue generated in the year 2017?
> 9. Which region recorded maximum sales count?
> 10. Which product category is doing best? (revenue and count)

**Let's try to answer all the questions.**

In [43]:
col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df.columns ]

df.columns = col_names

df.columns

Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'country', 'city', 'state',
       'postal_code', 'region', 'product_id', 'category', 'sub_category',
       'product_name', 'sales'],
      dtype='object')

### What are the different customer segments?

In [44]:
print("Customer Segments:\n", df['segment'].unique())

Customer Segments:
 ['Consumer' 'Corporate' 'Home Office']


### How many sales records do we have in the dataset?

In [45]:
print("Total Sales Records:", df.shape[0])

Total Sales Records: 9800


### What are the different product categories?

In [46]:
print("Product Categories:\n", df['category'].unique())

Product Categories:
 ['Furniture' 'Office Supplies' 'Technology']


In [47]:
print("Product Sub-Categories:\n", df['sub_category'].unique())

Product Sub-Categories:
 ['Bookcases' 'Chairs' 'Labels' 'Tables' 'Storage' 'Furnishings' 'Art'
 'Phones' 'Binders' 'Appliances' 'Paper' 'Accessories' 'Envelopes'
 'Fasteners' 'Supplies' 'Machines' 'Copiers']


### How many days on average does it take for the products to get shipped?

In [48]:
df['ship_time'] = df['ship_date'] - df['order_date']

df['ship_time'] = df['ship_time'].dt.days

df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,ship_time
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,3
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,4
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,7
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,7


In [49]:
print("Average ship time is", df['ship_time'].mean(), 'days.')

Average ship time is 3.9611224489795918 days.
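
An overall average can hide differences between shipping modes. As a possible follow-up, the sketch below groups the ship time by `ship_mode` on a few made-up rows; the dates and the `mean_ship_time` name are illustrative assumptions.

```python
import pandas as pd

# Toy data (made up) with the notebook's column names.
toy = pd.DataFrame({
    "ship_mode": ["Second Class", "Standard Class", "Standard Class", "Second Class"],
    "order_date": pd.to_datetime(["2017-11-08", "2016-10-11", "2015-06-09", "2017-06-12"]),
    "ship_date": pd.to_datetime(["2017-11-11", "2016-10-18", "2015-06-14", "2017-06-16"]),
})

# Subtracting two datetime columns gives a timedelta; .dt.days turns it into whole days.
toy["ship_time"] = (toy["ship_date"] - toy["order_date"]).dt.days

mean_ship_time = toy.groupby("ship_mode")["ship_time"].mean().round(2)
print(mean_ship_time)
```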


### Are there more orders placed on weekends?

In [50]:
df['order_date'].dt.day_name()

0       Wednesday
1       Wednesday
2          Monday
3         Tuesday
4         Tuesday
          ...    
9795       Sunday
9796      Tuesday
9797      Tuesday
9798      Tuesday
9799      Tuesday
Name: order_date, Length: 9800, dtype: object

In [51]:
df['week_day'] = df['order_date'].dt.day_name()

df.week_day.value_counts()

Tuesday      1889
Saturday     1786
Sunday       1695
Monday       1593
Wednesday    1229
Friday       1067
Thursday      541
Name: week_day, dtype: int64

In [52]:
grouped_df = df.groupby('week_day')

grouped_df['week_day'].count().sort_values(ascending=False)

week_day
Tuesday      1889
Saturday     1786
Sunday       1695
Monday       1593
Wednesday    1229
Friday       1067
Thursday      541
Name: week_day, dtype: int64

In [53]:
# We can also find out which weekday generated the most revenue.

grouped_df['sales'].sum().sort_values(ascending=False)

week_day
Saturday     420901.4763
Tuesday      420535.9243
Sunday       377868.7779
Monday       348791.5516
Wednesday    315888.9722
Friday       234710.8402
Thursday     142839.2402
Name: sales, dtype: float64
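
From the counts above, Saturday and Sunday together account for 1786 + 1695 = 3481 of the 9800 rows (about 35.5%), more than the 2/7 (about 28.6%) expected if rows were spread evenly over the week. One way to compute this comparison directly is sketched below on made-up data; the `day_type` column and the `rows_per_day` name are assumptions for this example, and each row is a line item rather than a whole order.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one Wednesday, one Saturday, one Sunday, one Tuesday.
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2017-11-08", "2017-11-11", "2017-11-12", "2017-11-14"]),
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
toy["day_type"] = np.where(toy["order_date"].dt.dayofweek >= 5, "weekend", "weekday")

summary = toy.groupby("day_type")["sales"].agg(["count", "sum"])

# Normalize by the number of days of each type so the comparison is fair.
days_in_type = pd.Series({"weekday": 5, "weekend": 2})
rows_per_day = summary["count"] / days_in_type
print(summary)
print(rows_per_day)
```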

### What is the minimum order amount and maximum order amount?

In [54]:
df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,ship_time,week_day
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,3,Wednesday
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,Wednesday
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,4,Monday
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,7,Tuesday
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,7,Tuesday


In [55]:
grouped_df = df.groupby('order_id')

In [56]:
print('Minimum Order Amount:', grouped_df['sales'].min())
print('Maximum Order Amount:', grouped_df['sales'].max())

Minimum Order Amount: order_id
CA-2015-100006    377.970
CA-2015-100090    196.704
CA-2015-100293     91.056
CA-2015-100328      3.928
CA-2015-100363      2.368
                   ...   
US-2018-168802     18.368
US-2018-169320     11.680
US-2018-169488     16.900
US-2018-169502     21.810
US-2018-169551     13.392
Name: sales, Length: 4922, dtype: float64
Maximum Order Amount: order_id
CA-2015-100006    377.970
CA-2015-100090    502.488
CA-2015-100293     91.056
CA-2015-100328      3.928
CA-2015-100363     19.008
                   ...   
US-2018-168802     18.368
US-2018-169320    159.750
US-2018-169488     39.960
US-2018-169502     91.600
US-2018-169551    683.988
Name: sales, Length: 4922, dtype: float64


**What just happened? 🤯**

**This is not what I expected. 😥**

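**Always remember the basics: `groupby()` splits the data, the aggregation is applied to each group, and the results are combined.** Here `min()` and `max()` returned the smallest and largest line item within each order, not the smallest and largest order total. To compare orders, first sum the sales per order, then take the minimum and maximum of those totals.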

In [57]:
grouped_df['sales'].sum()

order_id
CA-2015-100006     377.970
CA-2015-100090     699.192
CA-2015-100293      91.056
CA-2015-100328       3.928
CA-2015-100363      21.376
                    ...   
US-2018-168802      18.368
US-2018-169320     171.430
US-2018-169488      56.860
US-2018-169502     113.410
US-2018-169551    1344.838
Name: sales, Length: 4922, dtype: float64

In [58]:
order_df = grouped_df['sales'].sum()

order_df = order_df.reset_index()

order_df.head()

Unnamed: 0,order_id,sales
0,CA-2015-100006,377.97
1,CA-2015-100090,699.192
2,CA-2015-100293,91.056
3,CA-2015-100328,3.928
4,CA-2015-100363,21.376


In [59]:
print('Minimum Order Amount:', order_df['sales'].min())
print('Maximum Order Amount:', order_df['sales'].max())

Minimum Order Amount: 0.556
Maximum Order Amount: 23661.228
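
The two-step logic (sum per order, then min and max across orders) can also be written as one chain. The sketch below shows the idea on made-up data and additionally recovers which order holds each extreme; the order IDs and values are illustrative only.

```python
import pandas as pd

# Toy data (made up): orders A-1 and C-3 have two line items each.
toy = pd.DataFrame({
    "order_id": ["A-1", "A-1", "B-2", "C-3", "C-3"],
    "sales": [100.0, 50.0, 20.0, 5.0, 400.0],
})

# Step 1: one total per order. Step 2: aggregate across those totals.
order_totals = toy.groupby("order_id")["sales"].sum()
extremes = order_totals.agg(["min", "max"])

smallest_order = order_totals.idxmin()
largest_order = order_totals.idxmax()
print(extremes)
print(smallest_order, largest_order)
```

Note that the smallest line item in this toy data (5.0) belongs to the largest order, which is exactly the trap the per-group `min()` fell into.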


### What is the revenue generated in the year 2017?

In [60]:
df['order_year'] = df['order_date'].dt.year

In [61]:
# Method 1 - Using filtering and aggregation
df.loc[ df['order_year'] == 2017, 'sales' ].sum()

600192.55

In [62]:
# Method 2 - Using splitting and aggregation
grouped_df = df.groupby(['order_year'])

grouped_df['sales'].sum()

order_year
2015    479856.2081
2016    459436.0054
2017    600192.5500
2018    722052.0192
Name: sales, dtype: float64

In [63]:
yearwise_revenue_df = grouped_df['sales'].sum()

yearwise_revenue_df.loc[2017]

600192.55
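
If the helper column is not needed elsewhere, `groupby()` also accepts a Series as the key, so the year can be derived on the fly. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (made up).
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2016-05-01", "2017-01-15", "2017-08-30", "2018-02-02"]),
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Group directly by the year extracted from the datetime column.
revenue_by_year = toy.groupby(toy["order_date"].dt.year)["sales"].sum()
revenue_2017 = revenue_by_year.loc[2017]
print(revenue_by_year)
```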

### Which customer contributed to the maximum revenue in 2017 and how much?

In [64]:
grouped_df = df.groupby(['order_year', 'customer_id'])

In [65]:
grouped_df['sales'].sum()

order_year  customer_id
2015        AA-10315        756.048
            AA-10375         50.792
            AA-10480         27.460
            AA-10645       1434.330
            AB-10015        322.216
                             ...   
2018        XP-21865        449.312
            YC-21895        750.680
            YS-21880       5340.264
            ZC-21910        227.066
            ZD-21925         61.440
Name: sales, Length: 2481, dtype: float64

In [66]:
yearwise_cust_revenue_contribution = grouped_df['sales'].sum()

yearwise_cust_revenue_contribution

order_year  customer_id
2015        AA-10315        756.048
            AA-10375         50.792
            AA-10480         27.460
            AA-10645       1434.330
            AB-10015        322.216
                             ...   
2018        XP-21865        449.312
            YC-21895        750.680
            YS-21880       5340.264
            ZC-21910        227.066
            ZD-21925         61.440
Name: sales, Length: 2481, dtype: float64

In [67]:
yearwise_cust_revenue_contribution.loc[2017].max()

18344.052000000003

In [68]:
yearwise_cust_revenue_contribution.loc[2017].idxmax()

'TC-20980'

In [69]:
cust_id = yearwise_cust_revenue_contribution.loc[2017].idxmax()
revenue_contributed = yearwise_cust_revenue_contribution.loc[(2017, cust_id)]

print("Customer ID:", cust_id, ", contributed to the maximum revenue of", revenue_contributed, "in 2017")

Customer ID: TC-20980 , contributed to the maximum revenue of 18344.052000000003 in 2017


In [70]:
print("Total company revenue in 2017:")
print(yearwise_cust_revenue_contribution.loc[2017].sum())

Total company revenue in 2017:
600192.55


In [71]:
yearwise_cust_revenue_contribution.loc[2017].sort_values()

customer_id
SJ-20215        1.964
SH-20395        2.214
JW-15955        2.610
BM-11650        2.907
RW-19690        3.282
              ...    
BS-11365     9199.780
SE-20110     9879.220
AB-10105    10403.865
CC-12370    11901.184
TC-20980    18344.052
Name: sales, Length: 635, dtype: float64
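
When more than the single best customer is of interest, `nlargest()` returns the top entries already sorted in descending order, which avoids sorting the whole Series. The sketch below uses made-up customers and values.

```python
import pandas as pd

# Toy data (made up): C3's large 2016 purchase must not count towards 2017.
toy = pd.DataFrame({
    "order_year": [2017, 2017, 2017, 2016, 2017],
    "customer_id": ["C1", "C2", "C1", "C3", "C3"],
    "sales": [100.0, 250.0, 200.0, 999.0, 50.0],
})

# Restrict to 2017 first, then total the revenue per customer.
revenue_2017 = toy[toy["order_year"] == 2017].groupby("customer_id")["sales"].sum()

top_two = revenue_2017.nlargest(2)
print(top_two)
```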

### Who is the customer with `customer_id == TC-20980` ?

In [72]:
df.loc[(df.customer_id == 'TC-20980') , ['order_date', 'customer_name', 'city', 'state', 'postal_code']]

Unnamed: 0,order_date,customer_name,city,state,postal_code
2072,2017-11-26,Tamara Chand,Seattle,Washington,98105.0
3185,2015-11-07,Tamara Chand,Houston,Texas,77041.0
3186,2015-11-07,Tamara Chand,Houston,Texas,77041.0
6825,2017-10-02,Tamara Chand,Lafayette,Indiana,47905.0
6826,2017-10-02,Tamara Chand,Lafayette,Indiana,47905.0
6827,2017-10-02,Tamara Chand,Lafayette,Indiana,47905.0
6828,2017-10-02,Tamara Chand,Lafayette,Indiana,47905.0
6829,2017-10-02,Tamara Chand,Lafayette,Indiana,47905.0
8060,2016-09-20,Tamara Chand,Long Beach,New York,11561.0
8061,2016-09-20,Tamara Chand,Long Beach,New York,11561.0


### Which region recorded maximum sales count?

In [73]:
df.columns

Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'country', 'city', 'state',
       'postal_code', 'region', 'product_id', 'category', 'sub_category',
       'product_name', 'sales', 'ship_time', 'week_day', 'order_year'],
      dtype='object')

In [74]:
# Method 1 - Using value_counts()
df.region.value_counts()

West       3140
East       2785
Central    2277
South      1598
Name: region, dtype: int64

In [75]:
# Method 2 - Using split and aggregation

grouped_df = df.groupby("region")

grouped_df['sales'].count()

region
Central    2277
East       2785
South      1598
West       3140
Name: sales, dtype: int64

In [76]:
# What if the question is: Which region recorded maximum sales revenue?

grouped_df['sales'].sum()

region
Central    492646.9132
East       669518.7260
South      389151.4590
West       710219.6845
Name: sales, dtype: float64

### Which product category is doing best? (revenue and count)

In [77]:
grouped_df = df.groupby('category')

grouped_df['sales'].count()

category
Furniture          2078
Office Supplies    5909
Technology         1813
Name: sales, dtype: int64

In [78]:
grouped_df['sales'].sum()

category
Furniture          728658.5757
Office Supplies    705422.3340
Technology         827455.8730
Name: sales, dtype: float64
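
In the outputs above, Office Supplies has the most rows (5909) while Technology generates the highest revenue (827455.8730), so "best" depends on the metric. Computing both metrics in one `agg()` call puts them side by side; the sketch below does this on made-up data, and the `best_by_*` names are assumptions for this example.

```python
import pandas as pd

# Toy data (made up): many small Office Supplies rows, few large Technology rows.
toy = pd.DataFrame({
    "category": ["Office Supplies", "Office Supplies", "Office Supplies",
                 "Technology", "Furniture", "Furniture"],
    "sales": [10.0, 20.0, 30.0, 500.0, 100.0, 150.0],
})

summary = toy.groupby("category")["sales"].agg(["count", "sum", "mean"])

# idxmax() returns the index label (here: the category) of the largest value.
best_by_count = summary["count"].idxmax()
best_by_revenue = summary["sum"].idxmax()
print(summary)
```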

## Analysing and Summarizing using pivot_table()

**NOTE: `pivot_table()` works like a PivotTable in MS Excel, so reproducing the results there can help in understanding them.**  

### What is the region-wise revenue?

In [83]:
df.pivot_table(values="sales", 
               index=["region"], 
               aggfunc="sum")

Unnamed: 0_level_0,sales
region,Unnamed: 1_level_1
Central,492646.9132
East,669518.726
South,389151.459
West,710219.6845


In [11]:
df.pivot_table(values="sales", 
               index=["region"],
               margins=True, 
               aggfunc="sum").round(2)

Unnamed: 0_level_0,sales
region,Unnamed: 1_level_1
Central,492646.91
East,669518.73
South,389151.46
West,710219.68
All,2261536.78


In [7]:
df.pivot_table(values="sales", 
               index=["region"], 
               aggfunc="sum").apply(lambda values: values*100/sum(values))

Unnamed: 0_level_0,sales
region,Unnamed: 1_level_1
Central,21.783723
East,29.604591
South,17.20739
West,31.404295
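
The percentage step can also be expressed with pandas arithmetic instead of `apply()` and the built-in `sum`, which keeps the computation vectorized. A minimal sketch on made-up data (the `share` name is an assumption for this example):

```python
import pandas as pd

# Toy data (made up): totals are East 50, South 10, West 40.
toy = pd.DataFrame({
    "region": ["West", "East", "West", "South"],
    "sales": [30.0, 50.0, 10.0, 10.0],
})

revenue = toy.pivot_table(values="sales", index="region", aggfunc="sum")

# revenue.sum() is the column total; dividing aligns on the column name.
share = (revenue / revenue.sum() * 100).round(2)
print(share)
```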


### What is the region-wise count of sales?

In [86]:
df.pivot_table(values="sales", 
               index=["region"], 
               aggfunc="count")

Unnamed: 0_level_0,sales
region,Unnamed: 1_level_1
Central,2277
East,2785
South,1598
West,3140


In [91]:
df.pivot_table(values="sales", 
               index=["region"], 
               aggfunc="count").apply(lambda values: values*100/sum(values))

Unnamed: 0_level_0,sales
region,Unnamed: 1_level_1
Central,23.234694
East,28.418367
South,16.306122
West,32.040816


### What is the region-wise count and sum of sales?

In [93]:
df.pivot_table(values="sales", 
               index=["region"], 
               aggfunc=["count", "sum"])

Unnamed: 0_level_0,count,sum
Unnamed: 0_level_1,sales,sales
region,Unnamed: 1_level_2,Unnamed: 2_level_2
Central,2277,492646.9132
East,2785,669518.726
South,1598,389151.459
West,3140,710219.6845


In [92]:
df.pivot_table(values="sales", 
               index=["region"], 
               aggfunc=["count", "sum"]).apply(lambda values: values*100/sum(values))

Unnamed: 0_level_0,count,sum
Unnamed: 0_level_1,sales,sales
region,Unnamed: 1_level_2,Unnamed: 2_level_2
Central,23.234694,21.783723
East,28.418367,29.604591
South,16.306122,17.20739
West,32.040816,31.404295


### What is the region-wise revenue generated by each product category?

In [84]:
df.pivot_table(values="sales", 
               index=["region"], 
               columns=["category"], 
               aggfunc="sum")

category,Furniture,Office Supplies,Technology
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Central,160317.4622,163590.243,168739.208
East,206461.388,199940.811,263116.527
South,116531.48,124424.771,148195.208
West,245348.2455,217466.509,247404.93


In [15]:
df.pivot_table(values="sales", 
               index=["region"], 
               columns=["category"], 
               margins=True, 
               aggfunc="sum").round(3)

category,Furniture,Office Supplies,Technology,All
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Central,160317.462,163590.243,168739.208,492646.913
East,206461.388,199940.811,263116.527,669518.726
South,116531.48,124424.771,148195.208,389151.459
West,245348.246,217466.509,247404.93,710219.684
All,728658.576,705422.334,827455.873,2261536.783


In [85]:
df.pivot_table(values="sales", 
               index=["category"], 
               columns=["region"], 
               aggfunc="sum")

region,Central,East,South,West
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Furniture,160317.4622,206461.388,116531.48,245348.2455
Office Supplies,163590.243,199940.811,124424.771,217466.509
Technology,168739.208,263116.527,148195.208,247404.93


In [87]:
df.pivot_table(values="sales", 
               index=["category"], 
               columns=["region"], 
               aggfunc="count")

region,Central,East,South,West
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Furniture,470,591,326,691
Office Supplies,1399,1667,983,1860
Technology,408,527,289,589
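
Raw totals make it hard to compare categories of different size. One option is to convert each row of the pivot table into percentages, so that every category sums to 100 across regions. The sketch below shows this on made-up data; `row_share` is a name chosen for this example.

```python
import pandas as pd

# Toy data (made up).
toy = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Technology", "Technology"],
    "region": ["East", "West", "East", "West"],
    "sales": [30.0, 70.0, 80.0, 20.0],
})

pivot = toy.pivot_table(values="sales", index="category", columns="region", aggfunc="sum")

# Divide every row by its own total (axis=0 aligns the divisor on the row index).
row_share = pivot.div(pivot.sum(axis=1), axis=0) * 100
print(row_share)
```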


### What is the region-wise count of sales for each product sub-category under each product category?

In [94]:
df.pivot_table(values="sales", 
               index=["category", 'sub_category'], 
               columns=["region"], 
               aggfunc="count")

Unnamed: 0_level_0,region,Central,East,South,West
category,sub_category,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Furniture,Bookcases,49,70,28,79
Furniture,Chairs,151,167,86,203
Furniture,Furnishings,198,275,162,296
Furniture,Tables,72,79,50,113
Office Supplies,Appliances,122,123,81,133
Office Supplies,Art,175,225,140,245
Office Supplies,Binders,362,427,241,462
Office Supplies,Envelopes,58,70,54,66
Office Supplies,Fasteners,53,61,29,71
Office Supplies,Labels,75,105,64,113
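
The cell above counts rows because it uses `aggfunc="count"`. For revenue per sub-category, the same call can be made with `aggfunc="sum"`. The sketch below shows this on made-up data and adds `fill_value=0` so that combinations without any sales show 0 instead of NaN.

```python
import pandas as pd

# Toy data (made up): there are no Tables sales in the West and no Phones sales in the East.
toy = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Furniture", "Technology"],
    "sub_category": ["Chairs", "Chairs", "Tables", "Phones"],
    "region": ["East", "West", "East", "West"],
    "sales": [100.0, 50.0, 200.0, 300.0],
})

revenue = toy.pivot_table(values="sales",
                          index=["category", "sub_category"],
                          columns="region",
                          aggfunc="sum",
                          fill_value=0)
print(revenue)
```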
