# Lambda Functions and Pivot Tables

Until now, we have not modified the data. In this section, we will:
* Use lambda functions to create new and alter existing columns
* Use pandas pivot tables as an alternative to ```df.groupby()``` to summarise data

Let's first read all the files and create a ```master_df```. 

In [2]:
# Loading libraries and files

import numpy as np
import pandas as pd

# Read the five source tables; they are merged into master_df in the next cell
market_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/market_fact.csv")
customer_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/cust_dimen.csv")
orders_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/orders_dimen.csv")
product_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/prod_dimen.csv")
shipping_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/shipping_dimen.csv")

In [9]:
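# Join each dimension table onto the fact table, one shared key column at a time
df_1 = pd.merge(market_df, customer_df, how = 'inner', on = 'Cust_id')

df_2 = pd.merge(df_1, product_df, how = 'inner', on = 'Prod_id')

df_3 = pd.merge(df_2, shipping_df, how = 'inner', on = 'Ship_id')

# Both inputs of this merge carry an Order_ID column, so pandas suffixes them as Order_ID_x / Order_ID_y
master_df = pd.merge(df_3, orders_df, how = 'inner', on = 'Ord_id')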

master_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Region,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,...,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
1,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,...,WEST,CORPORATE,TECHNOLOGY,TELEPHONES AND COMMUNICATION,36262,EXPRESS AIR,27-07-2010,36262,27-07-2010,NOT SPECIFIED
2,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,WEST,CORPORATE,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
3,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,...,ONTARIO,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",37863,REGULAR AIR,26-02-2011,37863,24-02-2011,HIGH
4,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,...,WEST,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",53026,REGULAR AIR,03-03-2012,53026,26-02-2012,LOW
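The `Order_ID_x` and `Order_ID_y` columns in the output above are a side effect of merging tables that share a non-key column name. The sketch below reproduces this on toy data; the frames `facts`, `shipping` and `orders` and their values are hypothetical stand-ins, and it assumes (based on the output above) that both the shipping and orders tables contain an `Order_ID` column.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real CSV files
facts = pd.DataFrame({"Ord_id": ["Ord_1", "Ord_2"],
                      "Ship_id": ["SHP_1", "SHP_2"],
                      "Sales": [100.0, 250.0]})
shipping = pd.DataFrame({"Ship_id": ["SHP_1", "SHP_2"],
                         "Order_ID": [11, 22]})
orders = pd.DataFrame({"Ord_id": ["Ord_1", "Ord_2"],
                       "Order_ID": [11, 22]})

# First merge brings in Order_ID from the shipping table
step_1 = pd.merge(facts, shipping, how="inner", on="Ship_id")

# Second merge meets another Order_ID, so pandas adds the default _x / _y suffixes
merged = pd.merge(step_1, orders, how="inner", on="Ord_id")

print(merged.columns.tolist())
```

If the two `Order_ID` columns always agree, one of them can be dropped after the merge; that is a cleanup choice and is not done in this notebook.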


### Lambda Functions

Say you want to create a new column indicating whether a given order was profitable or not (1/0). 

You need to apply a function that returns 1 if Profit > 0 and 0 otherwise (a boolean True/False works just as well, since pandas counts True as 1 when summing or averaging). This can be done easily with the ```apply()``` method on a column of the dataframe. 
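As a minimal sketch of that idea on toy data (the Series `profit` and the helper name `profit_flag` are made up for illustration), a named function can be passed to `apply()` to get a 1/0 flag. A vectorised comparison is shown alongside it as one possible alternative.

```python
import pandas as pd

# Hypothetical profit values; note that 0.0 is not counted as profitable
profit = pd.Series([-30.51, 1148.90, 0.0, 23.12], name="Profit")

def profit_flag(x):
    """Return 1 for a strictly positive profit, else 0."""
    return 1 if x > 0 else 0

# apply() calls profit_flag once per element
flags = profit.apply(profit_flag)

# Same result without a Python-level function call per element
vectorised = (profit > 0).astype(int)

print(flags.tolist())
print(vectorised.tolist())
```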

In [12]:
# Warm-up: a regular named function, defined with def
def greet(name):
    return f"Hello {name}"

In [15]:
greet("Yaswanth")

'Hello Yaswanth'

In [16]:
# A lambda is an anonymous (nameless), single-expression function, typically written inline for one-off use

square = lambda x: x * x

print(square(5))

25


In [19]:
# The same logic as a standard def function: reusable and more readable
def square(x):
    return x*x

square(5)

25

In [21]:
numbers = [1,2,3,4,5]

list(map(lambda x:x**2, numbers))

[1, 4, 9, 16, 25]

In [23]:
students = [("Raj", 90),("Rajeev", 70),("Raji", 80)]

students.sort(key = lambda x: x[1])   # the key function returns index 1 of each tuple (the score), so the list is sorted by score

students

[('Rajeev', 70), ('Raji', 80), ('Raj', 90)]

Back to the profitability column: you can define a named function and pass it to ```apply()```, or do the same in just one line of code using a lambda function. 

In [26]:
# First, the named-function version

def is_positive(x):
    return x > 0

In [33]:
master_df['is_profitable'] = master_df['Profit'].apply(is_positive)

master_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority,is_profitable
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,...,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED,False
1,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,...,CORPORATE,TECHNOLOGY,TELEPHONES AND COMMUNICATION,36262,EXPRESS AIR,27-07-2010,36262,27-07-2010,NOT SPECIFIED,True
2,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,CORPORATE,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED,False
3,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,...,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",37863,REGULAR AIR,26-02-2011,37863,24-02-2011,HIGH,True
4,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,...,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",53026,REGULAR AIR,03-03-2012,53026,26-02-2012,LOW,False


In [32]:
# Create a new column using a lambda function
master_df['is_profitable'] = master_df['Profit'].apply(lambda x:x > 0)

master_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority,is_profitable
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,...,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED,False
1,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,...,CORPORATE,TECHNOLOGY,TELEPHONES AND COMMUNICATION,36262,EXPRESS AIR,27-07-2010,36262,27-07-2010,NOT SPECIFIED,True
2,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,CORPORATE,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED,False
3,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,...,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",37863,REGULAR AIR,26-02-2011,37863,24-02-2011,HIGH,True
4,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,...,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",53026,REGULAR AIR,03-03-2012,53026,26-02-2012,LOW,False


Now you can use the new column to compare the percentage of profitable orders across groups.

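In [None]:
# Comparing percentage of profitable orders across customer segments
master_df.groupby('Customer_Segment')['is_profitable'].mean() * 100

In [None]:
# Comparing percentage of profitable orders across product categories
master_df.groupby('Product_Category')['is_profitable'].mean() * 100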


In FURNITURE, about 46.5% of orders are profitable, compared to about 57.3% in TECHNOLOGY. 
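Percentages like these follow from the fact that the mean of a True/False column is the share of True values. The sketch below shows this on a small made-up frame (`toy` and its values are hypothetical, not taken from the sales data).

```python
import pandas as pd

# Hypothetical orders: 1 of 2 FURNITURE rows and 3 of 4 TECHNOLOGY rows are profitable
toy = pd.DataFrame({
    "Product_Category": ["FURNITURE", "FURNITURE",
                         "TECHNOLOGY", "TECHNOLOGY", "TECHNOLOGY", "TECHNOLOGY"],
    "Profit": [-5.0, 10.0, 20.0, 30.0, -1.0, 4.0],
})

toy["is_profitable"] = toy["Profit"].apply(lambda x: x > 0)

# Mean of a boolean column = fraction of True values; multiply by 100 for a percentage
share = toy.groupby("Product_Category")["is_profitable"].mean() * 100

print(share)
```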

In [35]:
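# You can also use apply and lambda to alter existing columns
# E.g. to see Profit rounded to one decimal place, apply the round() function
# Note: apply() returns a new Series; assign it back to master_df['Profit'] to change the column itself

master_df['Profit'].apply(lambda x: round(x, 1))  # rounds each profit value to one decimal place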

0        -30.5
1       1148.9
2        -47.6
3         23.1
4        -17.6
         ...  
8394    1899.2
8395   -4437.9
8396    -379.3
8397    -735.3
8398    -391.9
Name: Profit, Length: 8399, dtype: float64

You sometimes need to create new columns from existing ones. For instance, say you want a column for ```Profit / Order_Quantity```, the profit per unit ordered. 

In [39]:
# Creating a column Profit / Order_Quantity

master_df['profit_per_qty'] = master_df['Profit']/master_df['Order_Quantity']

master_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority,is_profitable,profit_per_qty
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56,...,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED,False,-1.326522
1,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59,...,TECHNOLOGY,TELEPHONES AND COMMUNICATION,36262,EXPRESS AIR,27-07-2010,36262,27-07-2010,NOT SPECIFIED,True,44.188462
2,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED,False,-2.071304
3,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,...,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",37863,REGULAR AIR,26-02-2011,37863,24-02-2011,HIGH,True,0.856296
4,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,...,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",53026,REGULAR AIR,03-03-2012,53026,26-02-2012,LOW,False,-0.502286
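Column arithmetic like the division above works on whole columns at once. When the logic needs several columns and cannot be written as simple arithmetic, one option is a row-wise lambda with `apply(..., axis=1)`. The sketch below uses a hypothetical toy frame to show that both routes give the same values here.

```python
import pandas as pd

# Hypothetical values chosen so the divisions are exact
toy = pd.DataFrame({"Profit": [-30.0, 120.0, 45.0],
                    "Order_Quantity": [10, 40, 9]})

# Whole-column arithmetic, as in the cell above
toy["profit_per_qty"] = toy["Profit"] / toy["Order_Quantity"]

# Row-wise lambda: axis=1 passes each row to the function as a Series
toy["profit_per_qty_apply"] = toy.apply(
    lambda row: row["Profit"] / row["Order_Quantity"], axis=1
)

print(toy)
```

For plain arithmetic the whole-column form is usually the simpler choice; the row-wise form is mainly useful for conditional or multi-step logic.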


In [43]:
# Composing two lambda functions: f1 runs first, then f2 is applied to its result
f1 = lambda x: x + 2
f2 = lambda x: x * 3
f2(f1(5))  # (5 + 2) * 3

21

In [50]:
# Why use map() with lambda?

# Using map() with a lambda is:

# * More concise than writing an explicit for loop
# * Clean for small, one-off transformations
# * Useful in data cleaning, transformation, and feature engineering


numbers = [1,2,3,4,5]

doubled = list(map(lambda x : x*2, numbers))

doubled

[2, 4, 6, 8, 10]

In [57]:
list(filter(lambda x:x>5, doubled))

[6, 8, 10]

In [59]:
list(map(lambda x: x**2 if x % 2 == 0 else x, [1, 2, 3, 4, 5]))  # squares x if it is even, otherwise keeps x unchanged

[1, 4, 3, 16, 5]
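The `map()` and `filter()` calls above can also be written as list comprehensions, which many Python programmers find easier to read. The following sketch repeats the three examples both ways; the variable names are chosen here for illustration.

```python
numbers = [1, 2, 3, 4, 5]

# map() with a lambda vs. an equivalent list comprehension
doubled_map = list(map(lambda x: x * 2, numbers))
doubled_comp = [x * 2 for x in numbers]

# filter() with a lambda vs. a comprehension with an if clause
big_filter = list(filter(lambda x: x > 5, doubled_map))
big_comp = [x for x in doubled_comp if x > 5]

# Conditional expression inside a comprehension: square even numbers, keep odd ones
mixed = [x ** 2 if x % 2 == 0 else x for x in numbers]

print(doubled_comp, big_comp, mixed)
```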

In [60]:
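# A lambda can return another lambda; the inner one remembers x (a closure)
outer = lambda x: (lambda y: x + y)

add = outer(5)  # add is now a function that adds 5 to its argument

add(3)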

8

### Pivot Tables

You may want to use pandas pivot tables as an alternative to ```groupby()```. They provide Excel-like functionality for building aggregate tables. 

In [61]:
# Read documentation
help(pd.DataFrame.pivot_table)

Help on function pivot_table in module pandas.core.frame:

pivot_table(self, values=None, index=None, columns=None, aggfunc: 'AggFuncType' = 'mean', fill_value=None, margins: 'bool' = False, dropna: 'bool' = True, margins_name: 'Level' = 'All', observed: 'bool' = False, sort: 'bool' = True) -> 'DataFrame'
    Create a spreadsheet-style pivot table as a DataFrame.
    
    The levels in the pivot table will be stored in MultiIndex objects
    (hierarchical indexes) on the index and columns of the result DataFrame.
    
    Parameters
    ----------
    values : list-like or scalar, optional
        Column or columns to aggregate.
    index : column, Grouper, array, or list of the previous
        Keys to group by on the pivot table index. If a list is passed,
        it can contain any of the other types (except list). If an array is
        passed, it must be the same length as the data and will be used in
        the same manner as column values.
    columns : column, Grouper, array, 

Aggregate functions take a group of values and return a single summary result, such as a total, average, or count.

| Function          | Description                     | Module                    |
| ----------------- | ------------------------------- | ------------------------- |
| `sum()`           | Adds all numbers                | built-in / NumPy / Pandas |
| `mean()`          | Average of values               | NumPy / Pandas            |
| `min()` / `max()` | Smallest / largest value        | built-in / NumPy / Pandas |
| `count()`         | Total non-null values           | Pandas only               |
| `std()` / `var()` | Std deviation / Variance        | NumPy / Pandas            |
| `agg()`           | Custom aggregate logic (Pandas) | Pandas                    |


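The general syntax is ```pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', ...)```. When called as a method, as in ```df.pivot_table(...)```, the dataframe itself takes the place of ```data```.
* ```data``` is the dataframe to summarise
* ```values``` is the column (or columns) to aggregate
* ```index``` is the column (or columns) whose unique values become the rows of the pivot table
* ```columns``` is the column (or columns) whose unique values become the columns of the pivot table
* ```aggfunc``` is the aggregate function (```'mean'``` by default)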
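To make the arguments concrete, here is a small sketch on made-up data (the frame `toy` and its values are hypothetical). It also shows that a one-dimensional pivot table gives the same numbers as the corresponding `groupby()` call.

```python
import pandas as pd

# Hypothetical data: one row per (segment, category) pair
toy = pd.DataFrame({
    "Customer_Segment": ["CONSUMER", "CONSUMER", "CORPORATE", "CORPORATE"],
    "Product_Category": ["FURNITURE", "TECHNOLOGY", "FURNITURE", "TECHNOLOGY"],
    "Sales": [100.0, 300.0, 200.0, 400.0],
})

# values = what to aggregate, index = what forms the rows, aggfunc = how to aggregate
by_segment = toy.pivot_table(values="Sales", index="Customer_Segment", aggfunc="mean")

# The groupby equivalent returns a Series instead of a one-column DataFrame
grouped = toy.groupby("Customer_Segment")["Sales"].mean()

# Adding columns= spreads a second categorical variable across the columns
cross = toy.pivot_table(values="Sales", index="Product_Category",
                        columns="Customer_Segment", aggfunc="sum")

print(by_segment)
print(cross)
```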

Let's see some examples.

In [63]:
# E.g. Compare average Sales across customer segments
master_df.pivot_table(values = "Sales",
                     index = "Customer_Segment",
                     aggfunc = "mean")

| Customer_Segment | Sales       |
| ---------------- | ----------- |
| CONSUMER         | 1857.859965 |
| CORPORATE        | 1787.680389 |
| HOME OFFICE      | 1754.312931 |
| SMALL BUSINESS   | 1698.124841 |


In [65]:
# E.g. compare total number of profitable orders across regions
# Note that is_profitable is True/False and True counts as 1, so we can directly compute the sum

master_df.pivot_table(values = 'is_profitable',
                     index = 'Region',
                     aggfunc = "sum")

| Region                | is_profitable |
| --------------------- | ------------- |
| ATLANTIC              | 544           |
| NORTHWEST TERRITORIES | 194           |
| NUNAVUT               | 38            |
| ONTARIO               | 916           |
| PRARIE                | 852           |
| QUEBEC                | 360           |
| WEST                  | 969           |
| YUKON                 | 262           |


In [72]:
# Grouping by both rows and columns
# Compare the total profit across product categories and customer segments
# Since there are two categorical variables, we use both rows (index) and columns
master_df.pivot_table(values = "Profit",
                     index ="Product_Category",
                     columns = "Customer_Segment",
                     aggfunc ="sum", margins = True, margins_name = "Total")



| Product_Category / Customer_Segment | CONSUMER  | CORPORATE | HOME OFFICE | SMALL BUSINESS | Total      |
| ----------------------------------- | --------- | --------- | ----------- | -------------- | ---------- |
| FURNITURE                           | 42728.26  | 22008.08  | 23979.2     | 28717.49       | 117433.03  |
| OFFICE SUPPLIES                     | 88532.29  | 203037.38 | 121145.65   | 105306.11      | 518021.43  |
| TECHNOLOGY                          | 156699.39 | 374700.54 | 173229.18   | 181684.41      | 886313.52  |
| Total                               | 287959.94 | 599746.0  | 318354.03   | 315708.01      | 1521767.98 |
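The `margins = True` argument used above appends an extra row and column holding the aggregate over all groups, and `margins_name` sets their label. A minimal sketch on hypothetical toy data, where the totals are easy to check by hand:

```python
import pandas as pd

# Hypothetical profits, one row per (category, segment) pair
toy = pd.DataFrame({
    "Product_Category": ["FURNITURE", "TECHNOLOGY", "FURNITURE", "TECHNOLOGY"],
    "Customer_Segment": ["CONSUMER", "CONSUMER", "CORPORATE", "CORPORATE"],
    "Profit": [10.0, 20.0, 30.0, 40.0],
})

cross = toy.pivot_table(values="Profit",
                        index="Product_Category",
                        columns="Customer_Segment",
                        aggfunc="sum",
                        margins=True,          # add row and column totals
                        margins_name="Total")  # label used for both of them

print(cross)
```

With `aggfunc = "sum"` the margins are plain totals; with another aggregate such as `"mean"`, the margins hold that aggregate computed over all rows in the group, not a sum of the cells.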


In [67]:
master_df.columns

Index(['Ord_id', 'Prod_id', 'Ship_id', 'Cust_id', 'Sales', 'Discount',
       'Order_Quantity', 'Profit', 'Shipping_Cost', 'Product_Base_Margin',
       'Customer_Name', 'Province', 'Region', 'Customer_Segment',
       'Product_Category', 'Product_Sub_Category', 'Order_ID_x', 'Ship_Mode',
       'Ship_Date', 'Order_ID_y', 'Order_Date', 'Order_Priority',
       'is_profitable', 'profit_per_qty'],
      dtype='object')

You don't necessarily need to specify all four arguments, since ```pivot_table()``` has sensible defaults. For instance, if you provide only ```columns```, it computes the **mean of every numeric column** for each category of that column. For example:

In [73]:
# Computes the mean of each selected numeric column for every product category
# The ID columns (e.g. Order_ID_x, Order_ID_y) are left out of the selection, since their means would be meaningless
master_df[['Sales', 'Discount',
       'Order_Quantity', 'Profit', 'Shipping_Cost', 'Product_Base_Margin',
       'is_profitable', 'profit_per_qty','Product_Category']].pivot_table(columns = 'Product_Category')
# Pattern: df[numeric_columns + [categorical_column]].pivot_table(columns = categorical_column)
# Here pivot_table is used in place of groupby.

| Product_Category    | FURNITURE  | OFFICE SUPPLIES | TECHNOLOGY  |
| ------------------- | ---------- | --------------- | ----------- |
| Discount            | 0.049287   | 0.05023         | 0.048746    |
| Order_Quantity      | 25.709977  | 25.656833       | 25.266344   |
| Product_Base_Margin | 0.598555   | 0.46127         | 0.556305    |
| Profit              | 68.116607  | 112.369074      | 429.207516  |
| Sales               | 3003.82282 | 814.048178      | 2897.941008 |
| Shipping_Cost       | 30.883811  | 7.829829        | 8.954886    |
| is_profitable       | 0.465197   | 0.466161        | 0.573366    |
| profit_per_qty      | -3.606958  | 1.736015        | -52.274315  |
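For comparison, the same kind of table can be produced with `groupby()` followed by a transpose. The sketch below uses a hypothetical toy frame to show that the two approaches agree; only the row order of the result may differ.

```python
import pandas as pd

# Hypothetical data with one categorical and two numeric columns
toy = pd.DataFrame({
    "Product_Category": ["FURNITURE", "FURNITURE", "TECHNOLOGY"],
    "Sales": [100.0, 300.0, 500.0],
    "Order_Quantity": [10, 30, 20],
})

# Only columns= is given: values defaults to the remaining columns, aggfunc to 'mean'
pivoted = toy.pivot_table(columns="Product_Category")

# groupby gives categories as rows, so transpose (.T) to match the pivot layout
grouped = toy.groupby("Product_Category").mean().T

print(pivoted)
print(grouped)
```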
