# **4.6 IC Combining and Exporting Data – Part 1**

# **Merging Orders and Orders-Products Data**
This notebook focuses on combining the cleaned orders data with the orders_products_prior dataset to create a unified dataset for analysis. The workflow involves importing both datasets into a single working environment and merging them on the shared order_id column. After the merge, indicators are used to verify the accuracy and completeness of the results. Finally, the combined dataset is exported in Pickle format to ensure efficient storage and facilitate later analysis.

## Table of Contents  
- [1. Import Libraries](#1-import-libraries)  
- [2. Set File Path](#2-set-file-path)
- [3. Import Orders-Products Prior Data](#3-import-orders-products-prior-data)  
- [4. Import Cleaned Orders Data](#4-import-cleaned-orders-data)  
- [5. Check Data Dimensions](#5-check-data-dimensions)  
- [6. Merge Orders with Orders-Products Prior (Inner Join)](#6-merge-orders-with-orders-products-prior-inner-join)
- [7. Merge Orders with Orders-Products Prior (Outer Join)](#7-merge-orders-with-orders-products-prior-outer-join)  
- [8. Export Combined Data](#8-export-combined-data)

---

## 1. Import Libraries
Here, I import the standard Python libraries required for this task:
- `pandas` and `numpy` for data handling
- `os` for file path management

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import os

---

## 2. Set File Path
Here, I set up the base project path so that I can easily access all required data files in a consistent way.

In [3]:
# Set the base project path
path = r'/Users/yaseminmustafa/Desktop/CareerFoundry/Exercise 4/15-05-2025_Instacart Basket Analysis'

In [4]:
path

'/Users/yaseminmustafa/Desktop/CareerFoundry/Exercise 4/15-05-2025_Instacart Basket Analysis'

---

## 3. Import Orders-Products Prior Data
Here, I import the orders_products_prior dataset, which contains details of products linked to past customer orders.

In [5]:
df_ords_prior = pd.read_csv(os.path.join(path, '02_Data', 'Original Data', 'orders_products_prior.csv'), index_col = False)

In [6]:
# Check the output
df_ords_prior.head()

,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


---

## 4. Import Cleaned Orders Data
Here, I import the cleaned orders dataset (orders_checked.csv) that I prepared earlier.
I also drop unnecessary index columns created during export.

In [8]:
# Importing df_ords - df_orders_checked
df_ords = pd.read_csv(os.path.join(path, '02_Data', 'Prepared Data', 'orders_checked.csv'), index_col = False)

In [9]:
# Drop unnecessary columns
df_ords = df_ords.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])

In [10]:
df_ords.head()

,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2398795,1,prior,2,3,7,15.0
1,473747,1,prior,3,3,12,21.0
2,2254736,1,prior,4,4,7,29.0
3,431534,1,prior,5,4,15,28.0
4,3367565,1,prior,6,2,7,19.0
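
The `Unnamed: 0` columns dropped above typically appear when a dataframe is written to CSV with its row index and then read back. The following is a minimal sketch of that behaviour on a hypothetical toy frame (the names `toy`, `reread`, and `clean` are illustrative and not part of this project):

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the orders data
toy = pd.DataFrame({"order_id": [10, 11], "user_id": [1, 1]})

# Default to_csv writes the row index as a first column with no header
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reread = pd.read_csv(buf_with_index)
print(list(reread.columns))  # includes an 'Unnamed: 0' column

# Writing with index=False leaves nothing to drop on re-import
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
clean = pd.read_csv(buf_no_index)
print(list(clean.columns))
```

Exporting with `index=False` in the earlier cleaning step would make the drop in this section unnecessary, assuming the row index carries no information worth keeping.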


---

## 5. Check Data Dimensions
Before merging, I check the number of rows and columns in each dataset to understand their size.

In [11]:
# Checking dimensions of df_ords_prior
df_ords_prior.shape

(32434489, 4)

In [12]:
# Checking dimensions of df_ords
df_ords.shape

(3214874, 7)
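
Alongside the shapes, it can help to confirm how the merge key behaves in each dataframe before joining. The sketch below uses hypothetical toy frames and assumes that each `order_id` appears once in the orders table and possibly many times in the order-products table. That assumption is not verified in this notebook.

```python
import pandas as pd

# Hypothetical toy frames; the real dataframes are df_ords and df_ords_prior
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [7, 7, 8]})
order_products = pd.DataFrame(
    {"order_id": [2, 2, 3, 4], "product_id": [100, 101, 100, 102]}
)

# Check whether the key is unique on each side
print(orders["order_id"].is_unique)
print(order_products["order_id"].is_unique)

# validate="one_to_many" makes pandas raise a MergeError
# if the left-hand keys turn out not to be unique
merged = orders.merge(order_products, on="order_id", validate="one_to_many")
print(merged.shape)
```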

---

## 6. Merge Orders with Orders-Products Prior (Inner Join)
Here, I merge the orders and orders_products_prior datasets on the order_id column using an inner join, which is the default for `merge` when `how` is not specified.
I also include a merge indicator (`_merge`) to verify matches.

In [13]:
# Merging dataframes
df_merged_large = df_ords.merge(df_ords_prior, on = 'order_id', indicator = True)

In [14]:
# Check output
df_merged_large.head()

,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2398795,1,prior,2,3,7,15.0,196,1,1,both
1,2398795,1,prior,2,3,7,15.0,10258,2,0,both
2,2398795,1,prior,2,3,7,15.0,12427,3,1,both
3,2398795,1,prior,2,3,7,15.0,13176,4,0,both
4,2398795,1,prior,2,3,7,15.0,26088,5,1,both


In [15]:
# Checking the shape of df_merged_large
df_merged_large.shape

(30356421, 11)

In [16]:
# Frequency check
df_merged_large['_merge'].value_counts()

_merge
both          30356421
left_only            0
right_only           0
Name: count, dtype: int64
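
The zero counts for `left_only` and `right_only` follow from the join type: an inner join keeps only keys present in both dataframes, so every surviving row is flagged `both`. A minimal sketch on hypothetical toy frames (not the project data) illustrates this:

```python
import pandas as pd

# Hypothetical toy frames: order 1 has no products, order 4 has no order record
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [7, 7, 8]})
order_products = pd.DataFrame(
    {"order_id": [2, 2, 3, 4], "product_id": [100, 101, 100, 102]}
)

# Inner join with the indicator column
inner = orders.merge(order_products, on="order_id", how="inner", indicator=True)

# Only matched rows remain, even though both inputs contain unmatched keys
print(inner["_merge"].value_counts())
print(len(inner))
```

The indicator on an inner join therefore cannot reveal unmatched keys, so an outer join is needed for that check.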

---

## 7. Merge Orders with Orders-Products Prior (Outer Join)
Next, I run an outer join to check whether either dataset contains order_id values that have no match in the other.

In [17]:
df_merged_large = df_ords.merge(df_ords_prior, on = 'order_id', indicator = True, how = 'outer')

In [18]:
# Check output
df_merged_large.head()

,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,1,112108.0,train,4.0,4.0,10.0,9.0,,,,left_only
1,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,both
2,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,both
3,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,both
4,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,both


In [19]:
# Checking the shape of df_merged_large
df_merged_large.shape

(32640698, 11)

In [20]:
# Frequency check
df_merged_large['_merge'].value_counts()

_merge
both          30356421
right_only     2078068
left_only       206209
Name: count, dtype: int64
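
These counts reconcile with the earlier shapes: `both` plus `right_only` (30,356,421 + 2,078,068) equals the 32,434,489 rows of `df_ords_prior`, and all three categories sum to the 32,640,698 rows of the outer join. The sketch below shows one way to separate the three groups, using hypothetical toy frames and assuming `order_id` is unique on the orders side:

```python
import pandas as pd

# Hypothetical toy frames: order 1 has no products, order 4 has no order record
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [7, 7, 8]})
order_products = pd.DataFrame(
    {"order_id": [2, 2, 3, 4], "product_id": [100, 101, 100, 102]}
)

outer = orders.merge(order_products, on="order_id", how="outer", indicator=True)

# Split the result by merge status
matched = outer[outer["_merge"] == "both"]
left_only = outer[outer["_merge"] == "left_only"]    # orders without product rows
right_only = outer[outer["_merge"] == "right_only"]  # product rows without an order

# Every product row is either matched or right_only
print(len(matched) + len(right_only) == len(order_products))

# Unmatched rows hold NaN, which turns integer columns into float64
print(outer.dtypes)
```

The float conversion is why values such as `user_id` display with a trailing `.0` in the outer-join output above.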

---

## 8. Export Combined Data
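Finally, I export the merged dataframe as a Pickle file for efficiency and easy re-import later. Because `df_merged_large` was reassigned in step 7, the exported file holds the outer-join result, including the unmatched rows and the `_merge` column.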

In [21]:
# Export data to pkl
df_merged_large.to_pickle(os.path.join(path, '02_Data','Prepared Data', 'orders_products_combined.pkl'))
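
As a rough illustration of why Pickle suits re-import, the sketch below round-trips a hypothetical toy frame through a temporary file and checks that values and dtypes, including the categorical `_merge` column, come back unchanged. The file name `toy_combined.pkl` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with a categorical column like the merge indicator
toy = pd.DataFrame({"order_id": [2, 2, 3], "product_id": [100, 101, 100]})
toy["_merge"] = pd.Categorical(
    ["both"] * 3, categories=["left_only", "right_only", "both"]
)

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy_combined.pkl")
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

print(restored.equals(toy))
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.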