# Lesson 3E: Data Joining

## Learning Objectives:
1. Understand and apply different types of joins (inner, left, right, outer) to merge customer demographics with transaction data.

2. Understand how to use groupby() in Pandas to aggregate and analyze customer transactions based on attributes like gender and price.



### Now let's examine the Restaurant Transaction Data and Customer Demographics Data

In [2]:
import pandas as pd

In [3]:
restaurant = pd.read_csv("restaurant_transaction_processed_02.csv")
demographics = pd.read_csv("customer_demographics.csv")

In [4]:
restaurant.head()

Unnamed: 0.1,Unnamed: 0,Customer_ID,Food_Item,Category,Date_of_Visit,Time,Weather,Price,Weekend,Public_Holiday,Year,Month,Day,Day_of_Week,Discounted_Price,Category_Encoded,Default_Date
0,0,1075,Smoothie,Cold,2023-03-24,12:30,sunny,14.88,No,No,2023,3,24,Fri,13.39,0,03-24-2023
1,1,1030,Soup,Hot,2023-03-19,14:30,raining,6.72,Yes,No,2023,3,19,Sun,6.05,1,03-19-2023
2,2,1055,Ice Cream,Cold,2023-03-24,10:30,sunny,14.27,No,No,2023,3,24,Fri,12.84,0,03-24-2023
3,3,1058,Ice Cream,Cold,2023-03-05,22:00,sunny,14.69,Yes,No,2023,3,5,Sun,13.22,0,03-05-2023
4,4,1084,Smoothie,Cold,2023-03-29,16:30,sunny,8.84,No,Yes,2023,3,29,Wed,7.96,0,03-29-2023


In [5]:
restaurant.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 989 entries, 0 to 988
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        989 non-null    int64  
 1   Customer_ID       989 non-null    int64  
 2   Food_Item         989 non-null    object 
 3   Category          989 non-null    object 
 4   Date_of_Visit     989 non-null    object 
 5   Time              989 non-null    object 
 6   Weather           989 non-null    object 
 7   Price             989 non-null    float64
 8   Weekend           989 non-null    object 
 9   Public_Holiday    989 non-null    object 
 10  Year              989 non-null    int64  
 11  Month             989 non-null    int64  
 12  Day               989 non-null    int64  
 13  Day_of_Week       989 non-null    object 
 14  Discounted_Price  989 non-null    float64
 15  Category_Encoded  989 non-null    int64  
 16  Default_Date      989 non-null    object 
dtypes: float64(2), int64(6), object(9)

In [6]:
demographics.head()

Unnamed: 0,Customer_ID,Gender,Age,Occupation,Marital_Status
0,1000,Male,36.0,Blue Collar,Single
1,1001,Female,33.0,Blue Collar,Single
2,1002,Female,,Student,Single
3,1003,Male,62.0,Executive,Married
4,1004,Female,52.0,Manager,Married


In [7]:
demographics.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Customer_ID     99 non-null     int64  
 1   Gender          99 non-null     object 
 2   Age             98 non-null     float64
 3   Occupation      99 non-null     object 
 4   Marital_Status  99 non-null     object 
dtypes: float64(1), int64(1), object(3)
memory usage: 4.0+ KB


In [8]:
print(f"Data shape for restaurant:{restaurant.shape} \nData shape for customer demographics:{demographics.shape}")

Data shape for restaurant:(989, 17) 
Data shape for customer demographics:(99, 5)
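Before joining, it can help to confirm that Customer_ID is unique in the demographics table and to see which transaction IDs would find no match. The block below is a minimal sketch on toy data: `demographics_toy` and `restaurant_toy` are hypothetical stand-ins with made-up values, not the lesson's files.

```python
import pandas as pd

# Hypothetical toy tables (made-up values, for illustration only).
demographics_toy = pd.DataFrame({
    "Customer_ID": [1000, 1001, 1002],
    "Gender": ["Male", "Female", "Female"],
})
restaurant_toy = pd.DataFrame({
    "Customer_ID": [1000, 1000, 1001, 1003],
    "Price": [14.88, 6.72, 14.27, 8.84],
})

# The demographics side should hold exactly one row per customer.
ids_are_unique = demographics_toy["Customer_ID"].is_unique

# IDs that appear in transactions but have no demographics row.
unmatched_ids = set(restaurant_toy["Customer_ID"]) - set(demographics_toy["Customer_ID"])

# validate="one_to_many" makes merge raise an error if the left keys repeat.
checked = demographics_toy.merge(
    restaurant_toy, on="Customer_ID", how="left", validate="one_to_many"
)

print(ids_are_unique)
print(unmatched_ids)
print(checked.shape)
```

In this toy case customer 1003 has a transaction but no demographics row, so a left join starting from the demographics table would drop that transaction.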


### ✅ 1. Merge Function

In [9]:
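# how="left" keeps every customer in demographics and attaches their matching transactions
merged_df = demographics.merge(restaurant, on="Customer_ID", how="left")
# Count missing values per column in the merged result
merged_df.isnull().sum()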

Customer_ID          0
Gender               0
Age                 13
Occupation           0
Marital_Status       0
Unnamed: 0           0
Food_Item            0
Category             0
Date_of_Visit        0
Time                 0
Weather              0
Price                0
Weekend              0
Public_Holiday       0
Year                 0
Month                0
Day                  0
Day_of_Week          0
Discounted_Price     0
Category_Encoded     0
Default_Date         0
dtype: int64
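To see how the four join types from the learning objectives differ, the sketch below runs each one on two small toy tables and counts the resulting rows. The tables `customers` and `orders` are hypothetical and their values are made up; only `merge`, `how`, and `indicator` are real pandas features.

```python
import pandas as pd

# Hypothetical toy tables: customer 2 has no orders, order ID 4 has no customer row.
customers = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
})
orders = pd.DataFrame({
    "Customer_ID": [1, 1, 3, 4],
    "Price": [5.0, 7.5, 6.0, 9.0],
})

# Run each join type and record how many rows it produces.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = customers.merge(orders, on="Customer_ID", how=how)
    row_counts[how] = len(joined)

print(row_counts)  # {'inner': 3, 'left': 4, 'right': 4, 'outer': 5}

# indicator=True adds a "_merge" column saying where each row came from:
# "both", "left_only", or "right_only".
outer = customers.merge(orders, on="Customer_ID", how="outer", indicator=True)
print(outer["_merge"].value_counts())
```

The inner join keeps only matched IDs, the left join adds the unmatched customer, the right join adds the unmatched order, and the outer join keeps both.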

### ✅ 2. GroupBy Function


### Average spending by gender

Now the "Gender" column and the "Price" column sit together in merged_df, so we can average Price within each gender.

In [33]:
avg_spending_by_gender = merged_df.groupby("Gender")["Price"].mean().reset_index()
avg_spending_by_gender

Unnamed: 0,Gender,Price
0,Female,12.806818
1,Male,12.650735
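A mean alone hides how many transactions sit behind each group. As a hedged sketch, named aggregation can return several statistics in one pass. The frame `toy` and the output column names (`mean_price`, `n_transactions`, `total_spent`) are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data (made-up prices, for illustration only).
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Price": [10.0, 14.0, 9.0, 12.0, 18.0],
})

# Named aggregation: each keyword becomes an output column,
# and its value names the aggregation applied to Price.
summary = (
    toy.groupby("Gender")["Price"]
    .agg(mean_price="mean", n_transactions="count", total_spent="sum")
    .reset_index()
)

print(summary)
```

The same call should work on merged_df by replacing `toy`, assuming the Gender and Price columns are present as shown above.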


### ✅ 3. GroupBy Function - For Categorical Variable

In [35]:
popular_food_by_occupation = merged_df.groupby(["Occupation","Food_Item"]).size().reset_index(name="Count")
popular_food_by_occupation

Unnamed: 0,Occupation,Food_Item,Count
0,Blue Collar,Coffee,24
1,Blue Collar,Ice Cream,78
2,Blue Collar,Smoothie,80
3,Blue Collar,Soup,26
4,Blue Collar,Tea,23
5,Executive,Coffee,20
6,Executive,Ice Cream,45
7,Executive,Smoothie,56
8,Executive,Soup,21
9,Executive,Tea,15
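The count table lists every occupation and food pair. To pull out the single most popular item per occupation, one option is to take the row with the largest Count in each group. The sketch below rebuilds a few rows copied from the output above in a hypothetical frame named `counts`. It is one possible approach, not necessarily the lesson's.

```python
import pandas as pd

# A subset of the rows shown in the output above.
counts = pd.DataFrame({
    "Occupation": ["Blue Collar", "Blue Collar", "Blue Collar", "Executive", "Executive"],
    "Food_Item": ["Coffee", "Ice Cream", "Smoothie", "Coffee", "Smoothie"],
    "Count": [24, 78, 80, 20, 56],
})

# idxmax returns, for each occupation, the index label of its largest Count.
top_index = counts.groupby("Occupation")["Count"].idxmax()

# Select those rows to get one "most popular" item per occupation.
top_rows = counts.loc[top_index].reset_index(drop=True)

print(top_rows)
```

If two items tie for the highest count, `idxmax` keeps only the first one it encounters.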


### ✅ 4. Export the Cleaned Dataset

After applying the transformations, we save the cleaned dataset for further analysis in Lesson 4B.

In [36]:
merged_df.to_csv("merged_data.csv", index=False)
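The `index=False` argument matters when the file is read back. Without it, pandas writes the row index as an extra column, which reappears as "Unnamed: 0" on reload. The "Unnamed: 0" column in the restaurant table may be such a leftover, though the source does not say. The sketch below shows both cases on a hypothetical two-row frame written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame (made-up values, for illustration only).
toy = pd.DataFrame({"Customer_ID": [1000, 1001], "Price": [14.88, 6.72]})

with tempfile.TemporaryDirectory() as tmp_dir:
    clean_path = os.path.join(tmp_dir, "toy_no_index.csv")
    index_path = os.path.join(tmp_dir, "toy_with_index.csv")

    toy.to_csv(clean_path, index=False)  # row index is not written
    toy.to_csv(index_path)               # default: row index is written

    reloaded_clean = pd.read_csv(clean_path)
    reloaded_with_index = pd.read_csv(index_path)

print(list(reloaded_clean.columns))       # ['Customer_ID', 'Price']
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Customer_ID', 'Price']
```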