# Working with Multiple DataFrames

- In most real-world projects, the data does not come from a single source.
- You will often need to combine data gathered from multiple sources.

## Reading and Loading Data

In [2]:
# import the pandas library
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

print(pd.__version__)

2.1.1



Read the three files **`outlet_size_small.csv`**, **`outlet_size_medium.csv`** and **`outlet_size_high.csv`** stored in the **datasets** folder.

In [3]:
# Read the datasets
data1 = pd.read_csv('datasets/outlet_size_small.csv')
data2 = pd.read_csv('datasets/outlet_size_medium.csv')
data3 = pd.read_csv('datasets/outlet_size_high.csv')

print(data1.shape, data2.shape, data3.shape)

(2388, 9) (2793, 9) (932, 9)


#### Concatenate all the dataframes

- We will use the **`concat()`** function to concatenate all the dataframes. You just need to pass the list of dataframes to concatenate.
    - For row wise concatenation, set `axis = 0`
    - For column wise concatenation, set `axis = 1`

In [4]:
# concatenate all the dataframes by rows
data = pd.concat([data1, data2, data3], axis=0)
print(data.shape)

(6113, 9)
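The effect of `axis` and of the row index can be seen on a small scale. The following is a minimal sketch that uses two made-up frames (`small` and `medium` are hypothetical stand-ins, not the real outlet files). It shows that row-wise concatenation keeps each frame's original index unless `ignore_index=True` is passed, and that column-wise concatenation aligns rows on their index labels.

```python
import pandas as pd

# Hypothetical stand-ins for two of the outlet files
small = pd.DataFrame({'Outlet_Identifier': ['OUT019', 'OUT046'],
                      'Item_MRP': [56.36, 144.11]})
medium = pd.DataFrame({'Outlet_Identifier': ['OUT018'],
                       'Item_MRP': [173.18]})

# Row-wise: each frame keeps its own index, so labels repeat
stacked = pd.concat([small, medium], axis=0)
print(stacked.index.tolist())        # [0, 1, 0]

# ignore_index=True builds a fresh 0..n-1 index instead
stacked_clean = pd.concat([small, medium], axis=0, ignore_index=True)
print(stacked_clean.index.tolist())  # [0, 1, 2]

# Column-wise: frames are aligned on index labels and placed side by side
side_by_side = pd.concat([small, medium], axis=1)
print(side_by_side.shape)            # (2, 4)
```

A repeated index is harmless for many operations, but label-based lookups such as `stacked.loc[0]` then return more than one row, so resetting the index is often the safer choice.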


## Performing SQL-Like Joins in Pandas

This section performs join operations on pandas dataframes using the `merge` function.

In [5]:
# Creating a student dataframe
student_df = pd.DataFrame({
    'roll_no': [ 102, 101, 104, 103, 105],
    'name': ['Aravind', 'Rahul', 'Prateek', 'Piyuesh', 'Kartik'],
    'city': ['Gurugram', 'Delhi', 'Delhi', 'Gurugram', 'Hyderabad']
})
student_df.head()

Unnamed: 0,roll_no,name,city
0,102,Aravind,Gurugram
1,101,Rahul,Delhi
2,104,Prateek,Delhi
3,103,Piyuesh,Gurugram
4,105,Kartik,Hyderabad


In [6]:
# Creating a city dataframe
city_df = pd.DataFrame({
    'city' :  ['Gurugram', 'Delhi', 'Hyderabad', 'Faridabad'],
    'state' : ['Haryana',  'Delhi', 'Telangana', 'Haryana']
})
city_df.head()

Unnamed: 0,city,state
0,Gurugram,Haryana
1,Delhi,Delhi
2,Hyderabad,Telangana
3,Faridabad,Haryana


- We want to add another column, `state`, to `student_df` using `city_df`.
- We can do this with a left join: call the `merge` function with the parameters **`how='left'`** and **`on='city'`**.

In [7]:
# Merging dataframes with the help of left join
student_df.merge(city_df, how = 'left', on = 'city')

Unnamed: 0,roll_no,name,city,state
0,102,Aravind,Gurugram,Haryana
1,101,Rahul,Delhi,Delhi
2,104,Prateek,Delhi,Delhi
3,103,Piyuesh,Gurugram,Haryana
4,105,Kartik,Hyderabad,Telangana
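In the result above every city found a match. As an illustration of what a left join does when a key is missing on the right, here is a small sketch with a made-up student whose city (`'Pune'`) is assumed to be absent from the city table. The `indicator=True` option of `merge` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Hypothetical data: 'Pune' has no entry in the cities frame
students = pd.DataFrame({
    'roll_no': [101, 102, 106],
    'city': ['Delhi', 'Gurugram', 'Pune'],
})
cities = pd.DataFrame({
    'city': ['Gurugram', 'Delhi'],
    'state': ['Haryana', 'Delhi'],
})

# A left join keeps every student; unmatched cities get NaN in 'state'
merged = students.merge(cities, how='left', on='city', indicator=True)
print(merged)

# Rows that found no match on the right are flagged 'left_only'
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['roll_no'].tolist())
```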


In [8]:
roll_no = pd.DataFrame({
    'roll_no' : [102, 103]
})

- We have another dataframe, `roll_no`, that contains the roll numbers of some students, and we need to look up the remaining details of those students.
- We can do this with a right join by setting **`how='right'`**, which keeps every row of the right dataframe (`roll_no`).

In [9]:
# Merging dataframes with the help of right join
student_df.merge(roll_no, how = 'right', on = 'roll_no')

Unnamed: 0,roll_no,name,city
0,102,Aravind,Gurugram
1,103,Piyuesh,Gurugram
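A right join is the mirror image of a left join. The sketch below rebuilds the two frames from this section under different names (`students` and `wanted` are illustrative names) and checks that `A.merge(B, how='right')` returns the same rows as `B.merge(A, how='left')`.

```python
import pandas as pd

students = pd.DataFrame({
    'roll_no': [102, 101, 104, 103, 105],
    'name': ['Aravind', 'Rahul', 'Prateek', 'Piyuesh', 'Kartik'],
    'city': ['Gurugram', 'Delhi', 'Delhi', 'Gurugram', 'Hyderabad'],
})
wanted = pd.DataFrame({'roll_no': [102, 103]})

# Right join: keep every row of 'wanted'
right = students.merge(wanted, how='right', on='roll_no')

# Equivalent left join with the two frames swapped
left = wanted.merge(students, how='left', on='roll_no')

# Both results contain the same students in the same order
print(right['name'].tolist() == left['name'].tolist())
```

Which form to use is mostly a matter of readability: many people prefer to put the frame whose rows must be preserved on the left and always use `how='left'`.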


In [10]:
# creating a dataframe
student_selection = pd.DataFrame({
    'roll_no' : [102, 105, 101],
    'company' : ['ABC', 'XYZ', 'ABC'],
})

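- We want to combine `student_df` and `student_selection` while keeping every student from both dataframes.
- We can do this with an outer (full) join by setting the parameter **`how='outer'`**. Students without a matching row in `student_selection` get a missing value in `company`.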

In [11]:
# Merging dataframes with the help of full/outer join
# 'roll_no' is the only shared column, so it is named explicitly as the key
student_df.merge(student_selection, how = 'outer', on = 'roll_no')

Unnamed: 0,roll_no,name,city,company
0,102,Aravind,Gurugram,ABC
1,101,Rahul,Delhi,ABC
2,104,Prateek,Delhi,
3,103,Piyuesh,Gurugram,
4,105,Kartik,Hyderabad,XYZ
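In the example above every selected student also appears in `student_df`, so the outer join looks like a left join. The sketch below uses a made-up selection that contains a roll number (`106`) assumed to be missing from the student table, and uses `indicator=True` to count how many rows come from each side.

```python
import pandas as pd

# Hypothetical data: roll_no 106 exists only in the selection frame
students = pd.DataFrame({
    'roll_no': [101, 102, 104],
    'name': ['Rahul', 'Aravind', 'Prateek'],
})
selection = pd.DataFrame({
    'roll_no': [102, 106],
    'company': ['ABC', 'XYZ'],
})

# The outer join keeps the union of keys from both frames
merged = students.merge(selection, how='outer', on='roll_no', indicator=True)
print(merged)

# Count rows by origin: 'both', 'left_only' or 'right_only'
counts = merged['_merge'].value_counts()
print(counts)
```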


## Example Problem

We have another dataset, **`outlet_data.csv`**, in the **datasets** folder. It has the columns `Outlet_Identifier`, `Outlet_Establishment_Year`, `Outlet_Size` and `Outlet_Location_Type`. Both datasets contain `Outlet_Identifier`, so we can use it as the key to bring the remaining outlet columns into our data. We will do a left join so that every row of the item data is kept.

### Load and read the data

In [12]:
# Load and read the data
data = pd.read_csv('datasets/outlet_size_concatenated_data.csv')
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Type,Item_Outlet_Sales
0,FDA03,18.5,Regular,0.045464,Dairy,144.1102,OUT046,Supermarket Type1,2187.153
1,FDS46,17.6,Regular,0.047257,Snack Foods,119.6782,OUT046,Supermarket Type1,2145.2076
2,FDP49,9.0,Regular,0.069089,Breakfast,56.3614,OUT046,Supermarket Type1,1547.3192
3,FDU02,13.35,Low Fat,0.102492,Dairy,230.5352,OUT035,Supermarket Type1,2748.4224
4,NCB30,14.6,Low Fat,0.025698,Household,196.5084,OUT035,Supermarket Type1,1587.2672


In [13]:
# Read the outlet data
data1 = pd.read_csv('datasets/outlet_data.csv')
data1.head()

Unnamed: 0,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type
0,OUT013,1987,High,Tier 3
1,OUT018,2009,Medium,Tier 3
2,OUT019,1985,Small,Tier 1
3,OUT027,1985,Medium,Tier 3
4,OUT035,2004,Small,Tier 2


Use the merge function with the parameters **`how = 'left'`** and **`on = 'Outlet_Identifier'`** for the left join.

In [14]:
# Merge the data
combined_data = data.merge(data1, how = 'left', on = 'Outlet_Identifier')
combined_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Type,Item_Outlet_Sales,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type
0,FDA03,18.5,Regular,0.045464,Dairy,144.1102,OUT046,Supermarket Type1,2187.153,1997,Small,Tier 1
1,FDS46,17.6,Regular,0.047257,Snack Foods,119.6782,OUT046,Supermarket Type1,2145.2076,1997,Small,Tier 1
2,FDP49,9.0,Regular,0.069089,Breakfast,56.3614,OUT046,Supermarket Type1,1547.3192,1997,Small,Tier 1
3,FDU02,13.35,Low Fat,0.102492,Dairy,230.5352,OUT035,Supermarket Type1,2748.4224,2004,Small,Tier 2
4,NCB30,14.6,Low Fat,0.025698,Household,196.5084,OUT035,Supermarket Type1,1587.2672,2004,Small,Tier 2


In [15]:
# Shape of the data
print('Shape:', combined_data.shape)

Shape: (6113, 12)
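The row count stays at 6113 because each outlet appears only once in the outlet table. If the right-hand table had duplicate keys, a left join would silently add rows. One way to guard against this is the `validate` parameter of `merge`. The sketch below uses small made-up frames (`items`, `outlets` and `dup_outlets` are hypothetical) rather than the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the item and outlet tables
items = pd.DataFrame({
    'Item_Identifier': ['FDA03', 'FDS46', 'FDU02'],
    'Outlet_Identifier': ['OUT046', 'OUT046', 'OUT035'],
})
outlets = pd.DataFrame({
    'Outlet_Identifier': ['OUT035', 'OUT046'],
    'Outlet_Size': ['Small', 'Small'],
})

# validate='many_to_one' asserts that keys are unique in the right frame
combined = items.merge(outlets, how='left', on='Outlet_Identifier',
                       validate='many_to_one')
print(len(combined) == len(items))   # row count is preserved

# With a duplicated outlet row the same call raises MergeError
dup_outlets = pd.concat([outlets, outlets.iloc[[0]]], ignore_index=True)
try:
    items.merge(dup_outlets, how='left', on='Outlet_Identifier',
                validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```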


#### Read the item_identifier data 

In [16]:
# read the data
item_data = pd.read_csv('datasets/item_idenifier.csv')
item_data.head()

Unnamed: 0,Item_Identifier
0,DRI51
1,FDL48
2,FDL38
3,FDF17
4,FDN56


In [17]:
# Shape of the data
print('Shape:', item_data.shape)

Shape: (100, 1)


- We have 100 values of `Item_Identifier` and we need to provide the other details for them.
- We can do this with the merge function by setting the parameter **`how='right'`**, so that every identifier in `item_data` is kept.

In [18]:
# Combined data
combined_item_data = data.merge(item_data, how='right', on='Item_Identifier')
combined_item_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Type,Item_Outlet_Sales
0,DRI51,17.25,Low Fat,0.042234,Dairy,173.3764,OUT035,Supermarket Type1,2061.3168
1,DRI51,17.25,Low Fat,0.042414,Dairy,173.1764,OUT018,Supermarket Type2,4466.1864
2,DRI51,,Low Fat,0.042037,Dairy,172.6764,OUT027,Supermarket Type3,6183.9504
3,FDL48,19.35,Regular,0.082251,Baking Goods,48.7034,OUT035,Supermarket Type1,534.6374
4,FDL48,19.35,Regular,0.082266,Baking Goods,48.8034,OUT046,Supermarket Type1,340.2238
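As the output shows, one identifier can match several rows, because the same item is sold in more than one outlet. The sketch below, built on made-up data (`sales` and `wanted` are hypothetical, and `ZZZ99` is an invented identifier with no sales), counts the rows returned per identifier and lists identifiers that found no match at all.

```python
import pandas as pd

# Hypothetical sales rows: DRI51 is sold in two outlets
sales = pd.DataFrame({
    'Item_Identifier': ['DRI51', 'DRI51', 'FDL48'],
    'Item_MRP': [173.38, 173.18, 48.70],
})
# Identifiers we want details for; 'ZZZ99' has no sales rows
wanted = pd.DataFrame({'Item_Identifier': ['DRI51', 'FDL48', 'ZZZ99']})

# The right join keeps every wanted identifier, repeated once per match
result = sales.merge(wanted, how='right', on='Item_Identifier')
print(result.shape)

# Number of result rows for each identifier
rows_per_item = result.groupby('Item_Identifier').size()
print(rows_per_item)

# Identifiers without any match have NaN in the columns from 'sales'
missing = result.loc[result['Item_MRP'].isna(), 'Item_Identifier'].tolist()
print(missing)
```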
