# Merging and Concatenating Dataframes


In this section, you will merge and concatenate multiple dataframes. Merging is one of the most common operations you will perform, since data is often spread across several files.

In our case, the sales data of a retail store is spread across multiple files. We will work with all of these files and learn to:
* Merge multiple dataframes on common columns/keys using ```pd.merge()```
* Concatenate dataframes using ```pd.concat()```

Let's first read all the data files.

In [13]:
# loading libraries and reading the data
import numpy as np
import pandas as pd

market_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/market_fact.csv")
cust_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/cust_dimen.csv")
orders_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/orders_dimen.csv")
prod_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/prod_dimen.csv")
shipping_df = pd.read_csv("/Volumes/Personal/Python_pandas/Session4_Session_Materials/global_sales_data/shipping_dimen.csv")

In [14]:
# Customer dimension table: Each row contains metadata about customers
cust_df.head()

Unnamed: 0,Customer_Name,Province,Region,Customer_Segment,Cust_id
0,MUHAMMED MACINTYRE,NUNAVUT,NUNAVUT,SMALL BUSINESS,Cust_1
1,BARRY FRENCH,NUNAVUT,NUNAVUT,CONSUMER,Cust_2
2,CLAY ROZENDAL,NUNAVUT,NUNAVUT,CORPORATE,Cust_3
3,CARLOS SOLTERO,NUNAVUT,NUNAVUT,CONSUMER,Cust_4
4,CARL JACKSON,NUNAVUT,NUNAVUT,CORPORATE,Cust_5


In [15]:
# Orders dimension table
orders_df.head()

Unnamed: 0,Order_ID,Order_Date,Order_Priority,Ord_id
0,3,13-10-2010,LOW,Ord_1
1,293,01-10-2012,HIGH,Ord_2
2,483,10-07-2011,HIGH,Ord_3
3,515,28-08-2010,NOT SPECIFIED,Ord_4
4,613,17-06-2011,HIGH,Ord_5


In [16]:
# Product dimension table
prod_df.head()

Unnamed: 0,Product_Category,Product_Sub_Category,Prod_id
0,OFFICE SUPPLIES,STORAGE & ORGANIZATION,Prod_1
1,OFFICE SUPPLIES,APPLIANCES,Prod_2
2,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,Prod_3
3,TECHNOLOGY,TELEPHONES AND COMMUNICATION,Prod_4
4,FURNITURE,OFFICE FURNISHINGS,Prod_5


In [17]:
# Shipping metadata
shipping_df.head()

Unnamed: 0,Order_ID,Ship_Mode,Ship_Date,Ship_id
0,3,REGULAR AIR,20-10-2010,SHP_1
1,293,DELIVERY TRUCK,02-10-2012,SHP_2
2,293,REGULAR AIR,03-10-2012,SHP_3
3,483,REGULAR AIR,12-07-2011,SHP_4
4,515,REGULAR AIR,30-08-2010,SHP_5


In [18]:
# Already familiar with market data: Each row is an order
market_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38



### Merging Dataframes Using ```pd.merge()```

There are five data files:
1. The ```market_fact``` table contains the sales data of each order.
2. The other four files are 'dimension tables' and contain metadata about customers, products, shipping details and order details.

If you are familiar with star schemas and data warehouse design, you will recognise this layout: one fact table and four dimension tables.
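
In a fact/dimension layout, each key is expected to appear at most once in its dimension table. The sketch below shows one way to have pandas check that expectation during a merge, using the `validate` argument of `pd.merge()`. The two tiny dataframes are hypothetical stand-ins and are not taken from the data files.

```python
import pandas as pd

# Hypothetical miniature fact table: several rows may share one customer key
fact = pd.DataFrame({
    "Cust_id": ["Cust_1", "Cust_1", "Cust_2"],
    "Sales": [100.0, 250.0, 75.0],
})

# Hypothetical miniature dimension table: one row per customer key
dimension = pd.DataFrame({
    "Cust_id": ["Cust_1", "Cust_2"],
    "Customer_Segment": ["CORPORATE", "CONSUMER"],
})

# validate="many_to_one" makes pandas raise a MergeError if the key
# is duplicated in the right-hand (dimension) dataframe
checked = pd.merge(fact, dimension, how="inner", on="Cust_id",
                   validate="many_to_one")
print(checked)
```

If the dimension table unexpectedly contained a duplicated key, the merge would fail loudly instead of silently multiplying rows in the result.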


### Merging Dataframes

Say you want to select all orders and observe the ```Sales``` of the customer segment *Corporate*. Since customer segment details are present in the dataframe ```cust_df```, we will first need to merge it with ```market_df```.


The ```how``` argument of ```pd.merge()``` controls which rows are kept:

- inner join - Only the records whose key appears in both dataframes.
- left join - All records from the left dataframe, plus the matching records from the right.
- right join - All records from the right dataframe, plus the matching records from the left.
- outer join - All records from both dataframes; values with no match are filled with NaN (Not a Number).
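
The four join types can be compared side by side on a small example. This is a minimal sketch with two hypothetical dataframes (`left` and `right`, sharing a column named `key`); the names and values are illustrative only.

```python
import pandas as pd

# Keys 'b' and 'c' appear in both dataframes; 'a' and 'd' appear in only one
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(left, right, how="inner", on="key")        # keys b, c
left_join = pd.merge(left, right, how="left", on="key")     # keys a, b, c
right_join = pd.merge(left, right, how="right", on="key")   # keys b, c, d

# indicator=True adds a '_merge' column recording where each row came from
outer = pd.merge(left, right, how="outer", on="key", indicator=True)

print(inner)
print(left_join)
print(right_join)
print(outer)
```

In the left join, key `a` has no match, so its `right_val` is NaN; the outer join keeps all four keys and marks each row as `left_only`, `right_only` or `both`.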

#### Inner Join

In [33]:
# Merging the dataframes
# Note that Cust_id is the common column/key, which is provided to the 'on' argument
# how = 'inner' makes sure that only the customer ids present in both dfs are included in the result

df_1 = pd.merge(market_df,cust_df, how = 'inner', on = 'Cust_id')

df_1

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.8100,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.2700,0.01,13,4.56,0.93,0.54,AARON BERGMAN,ALBERTA,WEST,CORPORATE
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.6900,0.00,26,1148.90,2.50,0.59,AARON BERGMAN,ALBERTA,WEST,CORPORATE
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.8900,0.09,43,729.34,14.30,0.37,AARON BERGMAN,ALBERTA,WEST,CORPORATE
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.1500,0.08,35,1219.87,26.30,0.38,AARON BERGMAN,ALBERTA,WEST,CORPORATE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8394,Ord_5353,Prod_4,SHP_7479,Cust_1798,2841.4395,0.08,28,374.63,7.69,0.59,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER
8395,Ord_5411,Prod_6,SHP_7555,Cust_1798,127.1600,0.10,20,-74.03,6.92,0.37,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER
8396,Ord_5388,Prod_6,SHP_7524,Cust_1798,243.0500,0.02,39,-70.85,5.35,0.40,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER
8397,Ord_5348,Prod_15,SHP_7469,Cust_1798,3872.8700,0.03,23,565.34,30.00,0.62,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER


In [37]:
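# Now, you can subset the orders made by customers from the 'CORPORATE' segment
# Note: .loc returns a new dataframe and does not modify df_1. The result is not
# assigned here, so displaying df_1 afterwards still shows every segment.

df_1.loc[df_1['Customer_Segment'] == 'CORPORATE', :]

df_1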

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.8100,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.2700,0.01,13,4.56,0.93,0.54,AARON BERGMAN,ALBERTA,WEST,CORPORATE
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.6900,0.00,26,1148.90,2.50,0.59,AARON BERGMAN,ALBERTA,WEST,CORPORATE
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.8900,0.09,43,729.34,14.30,0.37,AARON BERGMAN,ALBERTA,WEST,CORPORATE
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.1500,0.08,35,1219.87,26.30,0.38,AARON BERGMAN,ALBERTA,WEST,CORPORATE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8394,Ord_5353,Prod_4,SHP_7479,Cust_1798,2841.4395,0.08,28,374.63,7.69,0.59,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER
8395,Ord_5411,Prod_6,SHP_7555,Cust_1798,127.1600,0.10,20,-74.03,6.92,0.37,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER
8396,Ord_5388,Prod_6,SHP_7524,Cust_1798,243.0500,0.02,39,-70.85,5.35,0.40,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER
8397,Ord_5348,Prod_15,SHP_7469,Cust_1798,3872.8700,0.03,23,565.34,30.00,0.62,YOSEPH CARROLL,ALBERTA,WEST,CONSUMER


In [41]:
# Example 2: Select all orders from product category = office supplies and from the corporate segment
# We now need to merge prod_df

df_2 = pd.merge(df_1, prod_df, how = 'inner', on = 'Prod_id')

df_2

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment,Product_Category,Product_Sub_Category
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
1,Ord_2978,Prod_16,SHP_4112,Cust_1088,305.05,0.04,27,23.12,3.37,0.57,AARON HAWKINS,ONTARIO,ONTARIO,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
2,Ord_5484,Prod_16,SHP_7663,Cust_1820,322.82,0.05,35,-17.58,3.98,0.56,ADRIAN SHAMI,ALBERTA,WEST,CONSUMER,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
3,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.60,ALEKSANDRA GANNAWAY,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
4,Ord_4143,Prod_16,SHP_5771,Cust_1417,207.21,0.06,24,-78.64,6.14,0.59,ALLEN ARMOLD,NEW BRUNSWICK,ATLANTIC,HOME OFFICE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8394,Ord_5467,Prod_14,SHP_7639,Cust_1803,5964.19,0.10,33,988.20,24.49,0.46,TONJA TURNELL,ALBERTA,WEST,CONSUMER,TECHNOLOGY,COPIERS AND FAX
8395,Ord_825,Prod_14,SHP_1132,Cust_247,27663.92,0.05,8,-391.92,24.49,0.37,TONY CHAPMAN,BRITISH COLUMBIA,WEST,CONSUMER,TECHNOLOGY,COPIERS AND FAX
8396,Ord_5368,Prod_14,SHP_7497,Cust_1795,17279.62,0.04,40,4176.25,24.49,0.52,TONY SAYRE,ALBERTA,WEST,SMALL BUSINESS,TECHNOLOGY,COPIERS AND FAX
8397,Ord_1765,Prod_14,SHP_2446,Cust_595,14647.26,0.07,25,5485.15,24.49,0.37,VICTORIA WILSON,ONTARIO,ONTARIO,HOME OFFICE,TECHNOLOGY,COPIERS AND FAX


In [43]:
# Select all orders from product category = office supplies and from the corporate segment

df_2 = df_2[(df_2['Product_Category'] == "OFFICE SUPPLIES") & (df_2['Customer_Segment'] == "CORPORATE")]

df_2

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment,Product_Category,Product_Sub_Category
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
3,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.60,ALEKSANDRA GANNAWAY,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
7,Ord_4506,Prod_16,SHP_6273,Cust_1544,92.02,0.07,9,-24.88,4.68,0.59,AMY COX,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
9,Ord_1551,Prod_16,SHP_2145,Cust_531,184.77,0.00,29,-71.96,5.30,0.55,ANDY YOTOV,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
11,Ord_1429,Prod_16,SHP_1976,Cust_510,539.06,0.05,42,-123.07,4.59,0.82,ANNA HABERLIN,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7545,Ord_4629,Prod_1,SHP_6447,Cust_1587,848.19,0.06,25,120.02,5.49,0.60,VICTORIA PISTEKA,BRITISH COLUMBIA,WEST,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7546,Ord_4604,Prod_1,SHP_6403,Cust_1522,234.24,0.09,24,-151.80,9.45,0.60,VICTORIA PISTEKA,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7551,Ord_3543,Prod_1,SHP_4905,Cust_1266,1184.11,0.07,6,-145.07,19.99,0.71,WILLIAM BROWN,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION
7552,Ord_2722,Prod_1,SHP_3731,Cust_1006,3508.33,0.04,21,-546.98,35.00,0.85,XYLONA PRICE,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION



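Similarly, you can merge the other dimension tables, ```shipping_df``` and ```orders_df```, to create a ```master_df``` and filter on any column in the master dataframe. Note that ```df_2``` was overwritten with the filtered rows above, so the merges below contain only office supplies orders from the corporate segment.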


In [46]:
# Merging shipping_df

df_3 = pd.merge(df_2, shipping_df, how = 'inner', on = 'Ship_id')

df_3

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,Customer_Name,Province,Region,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID,Ship_Mode,Ship_Date
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,AARON BERGMAN,ALBERTA,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010
1,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.60,ALEKSANDRA GANNAWAY,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36992,EXPRESS AIR,09-12-2009
2,Ord_4506,Prod_16,SHP_6273,Cust_1544,92.02,0.07,9,-24.88,4.68,0.59,AMY COX,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",19524,REGULAR AIR,15-01-2010
3,Ord_1551,Prod_16,SHP_2145,Cust_531,184.77,0.00,29,-71.96,5.30,0.55,ANDY YOTOV,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",39330,EXPRESS AIR,18-07-2010
4,Ord_1429,Prod_16,SHP_1976,Cust_510,539.06,0.05,42,-123.07,4.59,0.82,ANNA HABERLIN,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",20259,REGULAR AIR,25-05-2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1675,Ord_4629,Prod_1,SHP_6447,Cust_1587,848.19,0.06,25,120.02,5.49,0.60,VICTORIA PISTEKA,BRITISH COLUMBIA,WEST,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,34980,EXPRESS AIR,04-08-2010
1676,Ord_4604,Prod_1,SHP_6403,Cust_1522,234.24,0.09,24,-151.80,9.45,0.60,VICTORIA PISTEKA,YUKON,YUKON,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,30885,REGULAR AIR,22-06-2010
1677,Ord_3543,Prod_1,SHP_4905,Cust_1266,1184.11,0.07,6,-145.07,19.99,0.71,WILLIAM BROWN,SASKACHEWAN,PRARIE,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,8803,REGULAR AIR,21-12-2012
1678,Ord_2722,Prod_1,SHP_3731,Cust_1006,3508.33,0.04,21,-546.98,35.00,0.85,XYLONA PRICE,ONTARIO,ONTARIO,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,36896,REGULAR AIR,03-11-2009


In [49]:
# Merging the orders table to create a master df
master_df = pd.merge(df_3,orders_df, how = 'inner', on = 'Ord_id')

master_df

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin,...,Region,Customer_Segment,Product_Category,Product_Sub_Category,Order_ID_x,Ship_Mode,Ship_Date,Order_ID_y,Order_Date,Order_Priority
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.60,0.56,...,WEST,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36262,REGULAR AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
1,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37,...,WEST,CORPORATE,OFFICE SUPPLIES,PAPER,36262,EXPRESS AIR,28-07-2010,36262,27-07-2010,NOT SPECIFIED
2,Ord_3730,Prod_16,SHP_5175,Cust_1314,459.08,0.04,34,61.57,3.14,0.60,...,PRARIE,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",36992,EXPRESS AIR,09-12-2009,36992,07-12-2009,MEDIUM
3,Ord_4506,Prod_16,SHP_6273,Cust_1544,92.02,0.07,9,-24.88,4.68,0.59,...,YUKON,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",19524,REGULAR AIR,15-01-2010,19524,14-01-2010,CRITICAL
4,Ord_1551,Prod_16,SHP_2145,Cust_531,184.77,0.00,29,-71.96,5.30,0.55,...,ONTARIO,CORPORATE,OFFICE SUPPLIES,"SCISSORS, RULERS AND TRIMMERS",39330,EXPRESS AIR,18-07-2010,39330,17-07-2010,CRITICAL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1675,Ord_4549,Prod_1,SHP_6332,Cust_1530,297.85,0.04,5,-69.89,19.99,0.59,...,YUKON,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,23366,REGULAR AIR,19-11-2010,23366,19-11-2010,NOT SPECIFIED
1676,Ord_4758,Prod_1,SHP_6637,Cust_1638,448.10,0.10,35,-15.07,4.51,0.59,...,WEST,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,50950,EXPRESS AIR,30-12-2012,50950,30-12-2012,NOT SPECIFIED
1677,Ord_3339,Prod_1,SHP_4632,Cust_1180,257.41,0.09,43,-131.08,4.69,0.68,...,PRARIE,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,25601,REGULAR AIR,06-01-2009,25601,05-01-2009,MEDIUM
1678,Ord_4604,Prod_1,SHP_6403,Cust_1522,234.24,0.09,24,-151.80,9.45,0.60,...,YUKON,CORPORATE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,30885,REGULAR AIR,22-06-2010,30885,21-06-2010,HIGH


In [50]:
master_df.shape

(1680, 22)
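
The `master_df` output above contains `Order_ID_x` and `Order_ID_y` because `Order_ID` is present on both sides of the last merge without being the join key. The sketch below shows two common ways of handling such an overlap, using small hypothetical stand-ins (`shipped` and `orders`) rather than the real tables.

```python
import pandas as pd

# Hypothetical stand-ins: both dataframes carry an 'Order_ID' column
shipped = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2"],
    "Order_ID": [3, 293],
    "Ship_Mode": ["REGULAR AIR", "DELIVERY TRUCK"],
})
orders = pd.DataFrame({
    "Ord_id": ["Ord_1", "Ord_2"],
    "Order_ID": [3, 293],
    "Order_Priority": ["LOW", "HIGH"],
})

# Default behaviour: overlapping non-key columns get the suffixes _x and _y
default = pd.merge(shipped, orders, how="inner", on="Ord_id")

# Option 1: choose more descriptive suffixes
labelled = pd.merge(shipped, orders, how="inner", on="Ord_id",
                    suffixes=("_ship", "_order"))

# Option 2: drop the redundant column from one side before merging
deduped = pd.merge(shipped, orders.drop(columns="Order_ID"),
                   how="inner", on="Ord_id")

print(default.columns.tolist())
print(labelled.columns.tolist())
print(deduped.columns.tolist())
```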

### Chaining Merges

In [52]:
# Syntax: df_1.merge(df_2, on = 'ID').merge(df_3, on = 'ID').merge(df_4, on = 'ID')

Similarly, you can perform left, right and outer merges (joins) by using the argument ```how = 'left' / 'right' / 'outer'```.
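
A chained merge can be sketched as follows. The three dataframes and the shared `ID` column are hypothetical and exist only to show the pattern; note that each `.merge()` call can take its own `how` argument.

```python
import pandas as pd

# Hypothetical dataframes sharing an 'ID' column
sales = pd.DataFrame({"ID": [1, 2, 3], "Sales": [10.0, 20.0, 30.0]})
names = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
regions = pd.DataFrame({"ID": [1, 2], "Region": ["WEST", "EAST"]})

chained = (
    sales
    .merge(names, on="ID")                  # how defaults to 'inner'
    .merge(regions, on="ID", how="left")    # keep ID 3 even though it has no region
)
print(chained)
```

Because the second merge is a left join, the row with `ID` 3 is kept and its `Region` is NaN.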

### Concatenating Dataframes

Concatenation is much more straightforward than merging. It is used when you have dataframes with the same columns and want to stack them one on top of the other, or dataframes with the same rows and want to place them side by side.

#### Concatenating Dataframes Having the Same Columns

Say you have two dataframes having the same columns, like so:

In [54]:
# dataframes having the same columns

df1 = pd.DataFrame({"Name" : ['Raju','Somu','Mouni'],
                    'Age' : [21,23,25],
                    'Gender' : ['M','M','F']}
                  )

df2 = pd.DataFrame({"Name" : ['Yash','Bunny','Manu'],
                    'Age' : [26,22,28],
                    'Gender' : ['M','M','F']}
                  )



In [56]:
df1

Unnamed: 0,Name,Age,Gender
0,Raju,21,M
1,Somu,23,M
2,Mouni,25,F


In [57]:
df2

Unnamed: 0,Name,Age,Gender
0,Yash,26,M
1,Bunny,22,M
2,Manu,28,F


In [58]:
# To concatenate them, one on top of the other, you can use pd.concat
# The first argument is a sequence (list) of dataframes
# axis = 0 indicates that we want to concat along the row axis

pd.concat([df1,df2], axis =0)   # Row Concatenation (axis = 0)

Unnamed: 0,Name,Age,Gender
0,Raju,21,M
1,Somu,23,M
2,Mouni,25,F
0,Yash,26,M
1,Bunny,22,M
2,Manu,28,F


In [59]:
pd.concat([df1,df2], axis = 1)  # Column Concatenation (axis = 1)

Unnamed: 0,Name,Age,Gender,Name,Age,Gender
0,Raju,21,M,Yash,26,M
1,Somu,23,M,Bunny,22,M
2,Mouni,25,F,Manu,28,F


In [64]:
# Older pandas versions offered df1.append(df2) as an alternative for concatenating along the rows.
# DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat() instead.


# df1.append(df2)  # works only in pandas versions before 2.0
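
Since `append()` is no longer available in current pandas, row-wise stacking goes through `pd.concat()`. The sketch below uses two small hypothetical dataframes (`df_a`, `df_b`) and also shows `ignore_index=True`, which renumbers the rows instead of keeping each dataframe's original index.

```python
import pandas as pd

df_a = pd.DataFrame({"Name": ["Raju", "Somu"], "Age": [21, 23]})
df_b = pd.DataFrame({"Name": ["Yash", "Bunny"], "Age": [26, 22]})

# Original index labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([df_a, df_b], axis=0)

# ignore_index=True builds a fresh index 0..n-1
renumbered = pd.concat([df_a, df_b], axis=0, ignore_index=True)

print(stacked)
print(renumbered)
```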

#### Concatenating Dataframes Having the Same Rows

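You may also have dataframes having the same rows but different columns (and having no common columns). In this case, you may want to concatenate them side by side. The example below reuses ```df1``` and ```df2```, which share column names, so the column names repeat in the result: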

In [65]:
df1

Unnamed: 0,Name,Age,Gender
0,Raju,21,M
1,Somu,23,M
2,Mouni,25,F


In [66]:
df2

Unnamed: 0,Name,Age,Gender
0,Yash,26,M
1,Bunny,22,M
2,Manu,28,F


In [67]:
pd.concat([df1,df2], axis = 1)

Unnamed: 0,Name,Age,Gender,Name,Age,Gender
0,Raju,21,M,Yash,26,M
1,Somu,23,M,Bunny,22,M
2,Mouni,25,F,Manu,28,F


In [None]:
# To join the two dataframes, use axis = 1 to indicate joining along the columns axis
# The join is possible because the corresponding rows have the same indices
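
To illustrate the comments above with dataframes that really do have different columns, here is a minimal sketch. The `cities` dataframe and its values are hypothetical. It also shows what happens when the index labels do not line up: rows are matched by label, not by position.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Raju", "Somu", "Mouni"], "Age": [21, 23, 25]})
# Hypothetical second dataframe with a different column and the same index (0, 1, 2)
cities = pd.DataFrame({"City": ["Pune", "Delhi", "Chennai"]})

# Same index labels: the rows line up one to one
side_by_side = pd.concat([people, cities], axis=1)

# Different index labels (1, 2, 3): unmatched labels are filled with NaN
shifted = cities.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([people, shifted], axis=1)

print(side_by_side)
print(misaligned)
```

The second result has four rows: label 0 has no `City`, and label 3 has no `Name` or `Age`.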


Note that you can also use the ```pd.concat()``` function to merge dataframes using common keys, though we will not discuss that here. For simplicity, we have used ```pd.merge()``` for database-style merging and ```pd.concat()``` for stacking dataframes or placing them side by side.

#### Performing Arithmetic Operations on Two or More Dataframes

We can also perform simple arithmetic operations on two or more dataframes. The two small dataframes below are indexed by `Name` and hold numeric `Age` and `Gender` columns (`Gender` is encoded as a number).

In [87]:
# First dataframe: Age and Gender (encoded as a number) for three people

df7 = pd.DataFrame({'Name' : ['Raju','Somu','Mouni'],
                    'Age' : [21,23,25],
                    'Gender' : [2,2,1]}
                  )

df7.set_index('Name', inplace = True)
# Set the 'Name' column as the index so that arithmetic operations align rows by name

df7

Name,Age,Gender
Raju,21,2
Somu,23,2
Mouni,25,1


In [88]:
# Similarly, a second dataframe with three different names

df8 = pd.DataFrame({"Name" : ['Yash','Bunny','Manu'],
                    'Age' : [26,22,28],
                    'Gender' : [2,2,1]}
                  )

df8.set_index('Name', inplace = True)

df8

Name,Age,Gender
Yash,26,2
Bunny,22,2
Manu,28,1


In [89]:
# Simply add the two dataframes using the + operator

df7 + df8   # the plain '+' operator gives NaN for names missing from either dataframe; see below

Name,Age,Gender
Bunny,,
Manu,,
Mouni,,
Raju,,
Somu,,
Yash,,


Notice that every value is NaN. This is because the two dataframes have no index labels (names) in common: the `+` operator aligns rows by index, and a label that is missing from either dataframe produces NaN. We can handle these NaN values by using `df.add()` with the `fill_value` argument instead of the plain `+` operator. Let's see how.

In [91]:
# The fill_value argument of df.add() substitutes 0 for a value that is missing from one of the two dataframes before adding.

Total = df7.add(df8, fill_value=0)

Total

Name,Age,Gender
Bunny,22.0,2.0
Manu,28.0,1.0
Mouni,25.0,1.0
Raju,21.0,2.0
Somu,23.0,2.0
Yash,26.0,2.0


Also notice that the resultant dataframe is sorted alphabetically by its index, `Name`.

In [94]:
# Creating a new column, 'Win Percentage', as the ratio of 'Gender' to 'Age' (the column name is only a label in this toy example)

Total['Win Percentage'] = Total['Gender']/Total['Age']

Total['Win Percentage']

Name
Bunny    0.090909
Manu     0.035714
Mouni    0.040000
Raju     0.095238
Somu     0.086957
Yash     0.076923
Name: Win Percentage, dtype: float64

In [95]:
Total

Name,Age,Gender,Win Percentage
Bunny,22.0,2.0,0.090909
Manu,28.0,1.0,0.035714
Mouni,25.0,1.0,0.04
Raju,21.0,2.0,0.095238
Somu,23.0,2.0,0.086957
Yash,26.0,2.0,0.076923


In [96]:
# Sort by 'Gender' in descending order; where two rows have the same 'Gender' value, sort by 'Win Percentage' (also descending).

Total.sort_values(by = (['Gender', 'Win Percentage']), ascending = False)

Name,Age,Gender,Win Percentage
Raju,21.0,2.0,0.095238
Bunny,22.0,2.0,0.090909
Somu,23.0,2.0,0.086957
Yash,26.0,2.0,0.076923
Mouni,25.0,1.0,0.04
Manu,28.0,1.0,0.035714
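
When sorting by several columns, `ascending` can also be given as a list with one entry per column, so each column can have its own direction. The sketch below uses a hypothetical dataframe (`scores`) with illustrative column names.

```python
import pandas as pd

scores = pd.DataFrame(
    {"Group": [2, 2, 1, 1], "Ratio": [0.3, 0.1, 0.9, 0.5]},
    index=["a", "b", "c", "d"],
)

# 'Group' descending first; ties are broken by 'Ratio' ascending
mixed = scores.sort_values(by=["Group", "Ratio"], ascending=[False, True])
print(mixed)
```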


Apart from `add()`, there are other operator-equivalent functions that you can use on dataframes. Below are the common ones, each listed with the operator it corresponds to:
-  `add()`: +
-  `sub()`: -
-  `mul()`: *
-  `div()`: /
-  `floordiv()`: //
-  `mod()`: %
-  `pow()`: **
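
A few of these functions are shown in the minimal sketch below, using two hypothetical dataframes (`a` and `b`) whose indexes overlap only partly. Like `add()`, they align on the index and accept `fill_value`.

```python
import pandas as pd

a = pd.DataFrame({"x": [10, 20, 30]}, index=["p", "q", "r"])
b = pd.DataFrame({"x": [3, 4]}, index=["p", "q"])

# sub() with fill_value=0: label 'r' is missing from b, so 0 is subtracted
difference = a.sub(b, fill_value=0)

# Without fill_value, a label missing from either side gives NaN
quotient = a.floordiv(b)    # same as a // b
remainder = a.mod(b)        # same as a % b

print(difference)
print(quotient)
print(remainder)
```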