# Merging DataFrames

In [26]:
import pandas as pd

## Our Dataset
- Our datasets are spread across multiple files in this section. Each file has a `restaurant_` prefix.
- The `customers.csv` file stores our restaurant's customers.
- The `foods.csv` file stores our restaurant's menu items.
- The `week_1_sales` and `week_2_sales` files store our orders.

In [27]:
# These datasets store menu (foods) and customer data
foods = pd.read_csv("restaurant_foods.csv")
foods.head()
customers = pd.read_csv("restaurant_customers.csv")
customers.head()

Unnamed: 0,ID,First Name,Last Name,Gender,Company,Occupation
0,1,Joseph,Perkins,Male,Dynazzy,Community Outreach Specialist
1,2,Jennifer,Alvarez,Female,DabZ,Senior Quality Engineer
2,3,Roger,Black,Male,Tagfeed,Account Executive
3,4,Steven,Evans,Male,Fatz,Registered Nurse
4,5,Judy,Morrison,Female,Demivee,Legal Assistant


In [28]:
# These datasets store sales (order) data
week1 = pd.read_csv("restaurant_week_1_sales.csv")
week2 = pd.read_csv("restaurant_week_2_sales.csv")

## The pd.concat Function I
- The `concat` function concatenates one **DataFrame** to the end of another.
- **The original index labels will be kept by default.** Set `ignore_index` to True to generate a new index.
- The `keys` parameter creates a **MultiIndex** using the specified keys/labels.

##### By default, `concat` works along the vertical axis, i.e. in the direction of the index. <br> This can be changed using the `axis` parameter.

In [29]:
len(week1)
len(week2)
pd.concat([week1,week2])
pd.concat([week1,week2],ignore_index=True)

Unnamed: 0,Customer ID,Food ID
0,537,9
1,97,4
2,658,1
3,202,2
4,155,9
...,...,...
495,783,10
496,556,10
497,547,9
498,252,9


##### We can add a MultiIndex to keep track of which dataset each record originated from.<br> This is done using the `keys` parameter.

In [31]:
pd.concat([week1,week2],keys=["Week 1","Week 2"])

Unnamed: 0,Unnamed: 1,Customer ID,Food ID
Week 1,0,537,9
Week 1,1,97,4
Week 1,2,658,1
Week 1,3,202,2
Week 1,4,155,9
...,...,...,...
Week 2,245,783,10
Week 2,246,556,10
Week 2,247,547,9
Week 2,248,252,9


##### Once the DataFrames are concatenated into a single DataFrame, we can access the records using the regular `loc` / `iloc` accessors.

In [36]:
allweeks = pd.concat([week1,week2],keys=["Week 1","Week 2"])
allweeks.loc[[("Week 1",4)]]

Unnamed: 0,Unnamed: 1,Customer ID,Food ID
Week 1,4,155,9


## The pd.concat Function II
- Pandas will concatenate the **DataFrames** along the row/index axis.
- Pandas will include all columns that exist in either **DataFrame**. Where a **DataFrame** has no values for a column, pandas will fill that column with `NaN` for its rows.
- We can pass the `axis` parameter an argument of `"columns"` to concatenate on the column axis.

In [None]:
pd.concat([week1,week2],keys=["Week 1","Week 2"])

##### Remember that `concat` by default adds one df to the end of another df. If the column names are the same, the values are stacked in the same column. If the column names are different, df2 is still added to the end of df1, but each df's rows hold `NaN` in the columns that exist only in the other df.

In [46]:
df1 = pd.DataFrame([1,2,3],columns=["A"])
df2 = pd.DataFrame([4,5,6],columns=["B"])

In [49]:
pd.concat([df1,df2])

Unnamed: 0,A,B
0,1.0,
1,2.0,
2,3.0,
0,,4.0
1,,5.0
2,,6.0


#### Adding one df to the right of another df
Note that rows are aligned on their index labels, so if the two df's had a different number of rows, the missing values would show up as `NaN` in the output.

In [50]:
pd.concat([df1,df2],axis="columns")

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6
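
To illustrate the note above about differing row counts, here is a minimal sketch. The names `short_df` and `side_by_side` are made up for this example and are not part of the restaurant datasets.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], columns=["A"])
short_df = pd.DataFrame([4, 5], columns=["B"])  # hypothetical: one row fewer than df1

# Rows are aligned on index labels; label 2 has no partner in short_df,
# so column "B" holds NaN in that row.
side_by_side = pd.concat([df1, short_df], axis="columns")
print(side_by_side)
```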


## Left Joins
- The `merge` method joins two **DataFrames** together based on shared values in a column or an index.
- A left join merges one **DataFrame** into another based on values in the first one.
- The "left" **DataFrame** is the one we invoke the `merge` method on.
- If the left **DataFrame's** value is not found in the right **DataFrame**, the row will hold `NaN` in the columns that come from the right **DataFrame**.
<img src="SQL_Joins.png" width="700" height="400"/>

##### *Keep the left df, and add data from the right df wherever the common column has a matching value*
Joining the `week1` df with the `foods` df (menu) to get the name of the menu item for each purchase.<br>
Using `Food ID` as the common column.

In [59]:
week1
foods

Unnamed: 0,Food ID,Food Item,Price
0,1,Sushi,3.99
1,2,Burrito,9.99
2,3,Taco,2.99
3,4,Quesadilla,4.25
4,5,Pizza,2.49
5,6,Pasta,13.99
6,7,Steak,24.99
7,8,Salad,11.25
8,9,Donut,0.99
9,10,Drink,1.75


#### In the `merge` method, the `on` parameter specifies the common column to use as the reference when joining the two df's.

In [61]:
week1.merge(foods,how="left",on="Food ID")

Unnamed: 0,Customer ID,Food ID,Food Item,Price
0,537,9,Donut,0.99
1,97,4,Quesadilla,4.25
2,658,1,Sushi,3.99
3,202,2,Burrito,9.99
4,155,9,Donut,0.99
...,...,...,...,...
245,413,9,Donut,0.99
246,926,6,Pasta,13.99
247,134,3,Taco,2.99
248,396,6,Pasta,13.99
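
Every order above finds a match in the menu, so no `NaN` appears. The sketch below uses small made-up frames (`orders`, `menu`, and a nonexistent `Food ID` of 11 are assumptions for illustration) to show what a left join does when the right **DataFrame** lacks a value.

```python
import pandas as pd

# Hypothetical miniature versions of week1 and foods
orders = pd.DataFrame({"Customer ID": [537, 97, 658], "Food ID": [9, 4, 11]})
menu = pd.DataFrame({
    "Food ID": [4, 9],
    "Food Item": ["Quesadilla", "Donut"],
    "Price": [4.25, 0.99],
})

# All three orders are kept; Food ID 11 is not on the menu,
# so its Food Item and Price are NaN.
merged = orders.merge(menu, how="left", on="Food ID")
print(merged)
```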


## The left_on and right_on Parameters
- The `left_on` and `right_on` parameters designate the column names from each **DataFrame** to use in the merge.
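
As a minimal sketch of these parameters, the frames below mimic the column names seen earlier (`Customer ID` in the sales data, `ID` in the customers data). The values and the names `orders` and `people` are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({"Customer ID": [1, 2, 2], "Food ID": [9, 4, 1]})
people = pd.DataFrame({"ID": [1, 2, 3],
                       "First Name": ["Joseph", "Jennifer", "Roger"]})

# The key column is named differently in each DataFrame
merged = orders.merge(people, how="left",
                      left_on="Customer ID", right_on="ID")

# Both key columns survive the merge; drop the redundant one
merged = merged.drop(columns="ID")
print(merged)
```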

## Inner Joins I
- Inner joins merge two tables based on *shared*/*common* values in columns.
- If only one **DataFrame** has a value, pandas will exclude it from the final result set.
- If the same ID occurs multiple times, pandas will store each possible combination of the values.
- The design of the join ensures that the same rows are matched no matter which **DataFrame** the `merge` method is invoked upon (only the order of rows and columns may differ).
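
The points above can be sketched with two small made-up frames. The names `week_a` and `week_b`, their values, and the `suffixes` labels are assumptions for illustration, not the real sales files.

```python
import pandas as pd

week_a = pd.DataFrame({"Customer ID": [1, 2, 2, 3], "Food ID": [9, 4, 1, 2]})
week_b = pd.DataFrame({"Customer ID": [2, 2, 4], "Food ID": [5, 6, 7]})

# Only customer 2 appears in both frames. It occurs twice on each side,
# so the result holds every combination: 2 x 2 = 4 rows.
both = week_a.merge(week_b, how="inner", on="Customer ID",
                    suffixes=(" - A", " - B"))

# Invoking merge on the other DataFrame matches the same rows
reverse = week_b.merge(week_a, how="inner", on="Customer ID",
                       suffixes=(" - B", " - A"))
print(both)
```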
<img src="SQL_Joins.png" width="800" height="800"/>

## Inner Joins II
- We can pass multiple arguments to the `on` parameter of the `merge` method. Pandas will require matches in both columns across the **DataFrames**.
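
A minimal sketch of matching on two columns at once, using invented frames (`week_a` and `week_b` are hypothetical stand-ins for the weekly sales data):

```python
import pandas as pd

week_a = pd.DataFrame({"Customer ID": [1, 2, 3], "Food ID": [9, 4, 2]})
week_b = pd.DataFrame({"Customer ID": [1, 2, 3], "Food ID": [9, 5, 2]})

# A row survives only if BOTH Customer ID and Food ID match.
# Customer 2 ordered different items in each frame, so that row is dropped.
same_order = week_a.merge(week_b, how="inner", on=["Customer ID", "Food ID"])
print(same_order)
```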

## Full/Outer Join
- A **full/outer** join keeps values that are found in either **DataFrame** or both **DataFrames**.
- Pandas does not mind if a value exists in one **DataFrame** but not the other.
- If a value does not exist in one **DataFrame**, it will have a `NaN`.
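
A small sketch of an outer join on made-up data follows. The frames and column names are assumptions for illustration. The `indicator=True` argument of `merge` adds a `_merge` column that records whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

week_a = pd.DataFrame({"Customer ID": [1, 2, 3], "Week 1 Food ID": [9, 4, 1]})
week_b = pd.DataFrame({"Customer ID": [2, 3, 4], "Week 2 Food ID": [5, 6, 7]})

# Customers 1 and 4 exist in only one frame; they are kept,
# with NaN in the columns from the other frame.
everyone = week_a.merge(week_b, how="outer", on="Customer ID", indicator=True)
print(everyone)
```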

<img src="SQL_Joins.png" width="800" height="800"/>

## Merging by Indexes with the left_index and right_index Parameters
- Use the `on` parameter if the column(s) to be matched on have the same names in both **DataFrames**.
- Use the `left_on` and `right_on` parameters if the column(s) to be matched on have different names in the two **DataFrames**.
- Use the `left_index` or `right_index` parameters (set to True) if the values to be matched on are found in the index of a **DataFrame**.
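
As a hedged sketch of the last case, suppose the menu has `Food ID` as its index rather than as a column. The frames `orders` and `menu` below are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({"Customer ID": [537, 97], "Food ID": [9, 4]})
menu = pd.DataFrame({
    "Food ID": [4, 9],
    "Food Item": ["Quesadilla", "Donut"],
    "Price": [4.25, 0.99],
}).set_index("Food ID")

# Match the "Food ID" column on the left against the index on the right
merged = orders.merge(menu, how="left", left_on="Food ID", right_index=True)
print(merged)
```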

## The join Method
- The `join` method is a shortcut for combining two **DataFrames** by their index labels, instead of calling `merge` with `left_index` and `right_index`.
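
A minimal sketch comparing `join` with the equivalent `merge` call. The frames `prices` and `names` are hypothetical and indexed by a made-up food ID.

```python
import pandas as pd

prices = pd.DataFrame({"Price": [3.99, 9.99, 2.99]}, index=[1, 2, 3])
names = pd.DataFrame({"Food Item": ["Sushi", "Burrito"]}, index=[1, 2])

# join matches on index labels and performs a left join by default
joined = prices.join(names)

# The longer spelling with merge
via_merge = prices.merge(names, how="left", left_index=True, right_index=True)
print(joined)
```

Index label 3 has no entry in `names`, so its `Food Item` is `NaN` in both results.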