# Instacart Data Analysis: Merging and Exporting Data

## Table of Contents
1. [Introduction](#1-Introduction)
2. [Import Libraries](#2-Import-Libraries)
3. [Load Data](#3-Load-Data)
4. [Merge Data](#4-Merge-Data)
   - 4.1 [Merge Orders and Orders_Products_Prior](#41-Merge-Orders-and-Orders_Products_Prior)
   - 4.2 [Merge with Products Data](#42-Merge-with-Products-Data)
5. [Export Data](#5-Export-Data)
6. [Validate Exported File](#6-Validate-Exported-File)
7. [Conclusion](#7-Conclusion)

## 1. Introduction

In this notebook, I combine several Instacart datasets into one **enriched dataset** for analysis.
Here’s what I do, step by step:

- Merging the `orders` and `orders_products_prior` dataframes to add product-level details to each order.
- Combining that result with the cleaned `products` dataframe to include more product information.
- Exporting the final merged dataset as a Pickle file, which keeps the column data types intact.
- Re-importing the exported file and checking that its shape matches the merged dataset.

The idea is to have all the **important data in one place** so I can dig into customer shopping habits and trends. I’ll also make sure the notebook is clean and **organized**, with clear comments and headings, so it’s easy to follow.

**Goals:**
1. Use the `merge()` function to combine data.
2. Check if the merges worked using merge flags.
3. Save the final dataset in a suitable format (Pickle).
4. Keep the notebook neat, clear, and easy to understand.

## 2. Import Libraries

I imported **pandas** for handling dataframes, **numpy** for any calculations I might need, and **os** to make file paths easier to work with. I also included **matplotlib** and **seaborn** for visualizations, which I can use later if needed.

In [4]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [15]:
# Optional: For visualizations or additional exports
import matplotlib.pyplot as plt
import seaborn as sns

##### Explanation:
- pandas (`pd`): For working with dataframes.
- numpy (`np`): For numerical operations.
- os: For handling file paths.
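
As a minimal sketch of how `os` helps with file paths: later cells refer to a `path` variable, so the example below builds file paths from a folder with `os.path.join`. The folder name used here is a hypothetical placeholder, not the actual location of my data.

```python
import os

# Hypothetical project folder (placeholder); replace with the real data directory.
path = os.path.join("Instacart Basket Analysis", "02 Data")

# os.path.join inserts the correct separator for the current operating system.
prior_csv = os.path.join(path, "orders_products_prior.csv")
print(prior_csv)
```

Keeping the folder in one variable means every later read or export only needs the file name.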

## 3. Load Data

I loaded the **cleaned orders and products** datasets and the `orders_products_prior` dataset so that I’m working with the **most up-to-date** and **accurate data**. I then used `head()` on the orders and products dataframes and printed the shape of `orders_products_prior` to confirm that each dataset loaded correctly.

In [31]:
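# Load the orders_products_prior.csv file with the default read_csv settings
# NOTE: `path`, `df_orders`, and `df_products` are not defined in the cells shown here;
# they are assumed to have been created in cells not included in this export.
df_orders_prior = pd.read_csv(os.path.join(path, 'orders_products_prior.csv'))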

# Check the first few rows of each dataframe to confirm they loaded correctly
print("Orders DataFrame:")
print(df_orders.head())

print("\nProducts DataFrame:")
print(df_products.head())

print("\nOrders_Products_Prior DataFrame Shape:", df_orders_prior.shape)

Orders DataFrame:
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2398795        1    prior             2          3                  7   
2    473747        1    prior             3          3                 12   
3   2254736        1    prior             4          4                  7   
4    431534        1    prior             5          4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0  

Products DataFrame:
   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Sma
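
Beyond looking at `head()`, one extra check that could be useful before merging is whether each merge key is unique in the table where it should be (one row per `order_id` in orders, one row per `product_id` in products). The sketch below uses small made-up dataframes, and `key_is_unique` is a hypothetical helper of my own, not something from the original notebook.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are illustrative only).
toy_orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 11]})
toy_products = pd.DataFrame({
    "product_id": [196, 196, 14084],          # 196 is duplicated on purpose
    "product_name": ["Soda", "Soda", "Milk"],
})

def key_is_unique(df, key):
    """Return True when no value of `key` is repeated in `df`."""
    return bool(df[key].is_unique)

orders_ok = key_is_unique(toy_orders, "order_id")
products_ok = key_is_unique(toy_products, "product_id")
print("order_id unique:", orders_ok)
print("product_id unique:", products_ok)
```

A duplicated key on the lookup side would multiply rows during the merge, so it is worth catching early.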

## 4. Merge Data
### 4.1 Merge Orders and Orders_Products_Prior

This step **combines the orders** and `orders_products_prior` datasets so that each order is linked to its **product-level details**. I used an **inner merge** so that only **matching rows** are kept, then checked the **merge flags** and **shape**. Note that with an inner merge every surviving row is flagged `both`, so the flag count confirms the row total but cannot reveal unmatched rows; an outer merge would be needed for that.

In [34]:
# Merge orders with orders_products_prior using 'order_id' as the key
df_merged = df_orders.merge(df_orders_prior, on='order_id', how='inner', indicator=True)

# Check the shape and preview the merged dataframe
print("Merged DataFrame Shape:", df_merged.shape)
print(df_merged.head())

# Check merge flag to confirm the success of the merge
print("\nMerge Flag Counts:")
print(df_merged['_merge'].value_counts())

Merged DataFrame Shape: (32434489, 11)
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2539329        1    prior             1          2                  8   
2   2539329        1    prior             1          2                  8   
3   2539329        1    prior             1          2                  8   
4   2539329        1    prior             1          2                  8   

   days_since_prior_order  product_id  add_to_cart_order  reordered _merge  
0                     NaN         196                  1          0   both  
1                     NaN       14084                  2          0   both  
2                     NaN       12427                  3          0   both  
3                     NaN       26088                  4          0   both  
4                     NaN       26405                  5          0   both  

Merge Flag Counts:
_merge
both     
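
To illustrate what the merge flag can and cannot show, here is a small sketch on made-up data (the toy dataframes are assumptions for illustration, not the Instacart files). It compares `how='inner'` with `how='outer'`, both with `indicator=True`.

```python
import pandas as pd

# Toy data: order 3 has no products, and order 4 exists only on the products side.
toy_orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 11]})
toy_prior = pd.DataFrame({
    "order_id": [1, 1, 2, 4],
    "product_id": [196, 14084, 196, 12427],
})

inner = toy_orders.merge(toy_prior, on="order_id", how="inner", indicator=True)
outer = toy_orders.merge(toy_prior, on="order_id", how="outer", indicator=True)

# With an inner merge, every row is flagged 'both' by construction.
print(inner["_merge"].value_counts())

# With an outer merge, unmatched rows show up as 'left_only' or 'right_only'.
outer_counts = outer["_merge"].value_counts()
print(outer_counts)
```

On the real data, an outer merge of this size may be memory-hungry, so this is a diagnostic to run deliberately rather than by default.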

### 4.2 Merge with Products Data

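Here, I merged the **combined dataset** with the **products data** to add product details like **names, aisle and department IDs, and prices.** This step is important for creating a dataset that’s ready for **in-depth analysis.** I double-checked the merge using **flags** to make sure the data lined up properly. The result has 277 fewer rows than the previous merge (32,434,489 vs. 32,434,212), which means some `product_id` values had no match in the cleaned products data and were dropped by the inner merge.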

In [39]:
# Drop the _merge column from the previous merge to avoid conflicts
df_merged.drop(columns=['_merge'], inplace=True)

# Merge the merged dataframe with the products dataframe using 'product_id' as the key
df_final = df_merged.merge(df_products, on='product_id', how='inner', indicator=True)

# Check the shape and preview the final merged dataframe
print("Final Merged DataFrame Shape:", df_final.shape)
print(df_final.head())

# Check merge flag to confirm the success of the merge
print("\nMerge Flag Counts:")
print(df_final['_merge'].value_counts())

Final Merged DataFrame Shape: (32434212, 15)
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2539329        1    prior             1          2                  8   
2   2539329        1    prior             1          2                  8   
3   2539329        1    prior             1          2                  8   
4   2539329        1    prior             1          2                  8   

   days_since_prior_order  product_id  add_to_cart_order  reordered  \
0                     NaN         196                  1          0   
1                     NaN       14084                  2          0   
2                     NaN       12427                  3          0   
3                     NaN       26088                  4          0   
4                     NaN       26405                  5          0   

                              product_name  aisle_id  department_
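
The final shape has fewer rows than the merge in 4.1 (32,434,212 vs. 32,434,489). One way to see which `product_id` values cause such a drop is a left merge with `indicator=True`, sketched below on made-up data. The `validate="many_to_one"` argument is an optional pandas check that raises an error if `product_id` repeats in the products table; the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data: product 99999 is missing from the products table.
toy_merged = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [196, 14084, 99999]})
toy_products = pd.DataFrame({"product_id": [196, 14084], "product_name": ["Soda", "Milk"]})

# A left merge keeps every order row, so unmatched rows are flagged 'left_only'.
checked = toy_merged.merge(
    toy_products,
    on="product_id",
    how="left",
    indicator=True,
    validate="many_to_one",  # raises MergeError if product_id is duplicated in toy_products
)

is_unmatched = checked["_merge"] == "left_only"
unmatched_ids = checked.loc[is_unmatched, "product_id"].unique()
rows_dropped_by_inner = int(is_unmatched.sum())

print("Unmatched product_ids:", list(unmatched_ids))
print("Rows an inner merge would drop:", rows_dropped_by_inner)
```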

## 5. Export Data

I exported the **final merged dataset** as a **pickle file** so it’s saved in a **compact format** that’s easy to load later. This way, I can jump straight into analysis without having to repeat these steps.

In [42]:
# Export the final merged dataframe to a pickle file
df_final.to_pickle(os.path.join(path, 'orders_products_combined.pkl'))

# Report completion (this line is reached only if to_pickle raised no error)
print("Export completed successfully!")

Export completed successfully!
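
The print statement only shows that `to_pickle` did not raise an error. A slightly stronger check, sketched here on a toy dataframe and a temporary folder (both assumptions for illustration), is to confirm that the file exists on disk and is not empty.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final merged dataframe.
toy_final = pd.DataFrame({"order_id": [1, 2], "product_id": [196, 14084]})

# A temporary folder keeps this sketch from touching the real project files.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "orders_products_combined.pkl")
    toy_final.to_pickle(out_file)

    file_exists = os.path.isfile(out_file)
    file_size = os.path.getsize(out_file)

print("File written:", file_exists, "| size in bytes:", file_size)
```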


## 6. Validate Exported File

In this section, I re-import the exported `orders_products_combined.pkl` file to confirm that it was saved correctly and that its shape matches the final merged dataframe (32,434,212 rows and 15 columns).

In [15]:
# Validating the exported file is critical to ensure the data was saved and can be loaded back correctly.
# This step confirms that no issues occurred during the export process, preserving data integrity for future analysis.

import pandas as pd

# Import the exported pickle file
df_combined_check = pd.read_pickle(r"D:/YVC/Data Analytics (CF)/Python Fundamentals for Data Analysts/Instacart Basket Analysis/02 Data/Original Data/orders_products_combined.pkl")

# Check the shape of the imported DataFrame
print("Imported DataFrame Shape:", df_combined_check.shape)

# Optionally, display the first few rows to confirm successful import
print(df_combined_check.head())

Imported DataFrame Shape: (32434212, 15)
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2539329        1    prior             1          2                  8   
2   2539329        1    prior             1          2                  8   
3   2539329        1    prior             1          2                  8   
4   2539329        1    prior             1          2                  8   

   days_since_prior_order  product_id  add_to_cart_order  reordered  \
0                     NaN         196                  1          0   
1                     NaN       14084                  2          0   
2                     NaN       12427                  3          0   
3                     NaN       26088                  4          0   
4                     NaN       26405                  5          0   

                              product_name  aisle_id  department_id  
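
Matching shapes show that no rows or columns were lost, but not that the values are identical. If the original dataframe is still in memory, `DataFrame.equals` can compare the two directly. The sketch below demonstrates the idea on a toy dataframe with a temporary file; the names and values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with a NaN, like days_since_prior_order on a first order.
toy_final = pd.DataFrame({
    "order_id": [1, 2],
    "days_since_prior_order": [float("nan"), 15.0],
    "product_name": ["Soda", "Milk"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_file = os.path.join(tmp_dir, "roundtrip.pkl")
    toy_final.to_pickle(pkl_file)
    reloaded = pd.read_pickle(pkl_file)

same_shape = reloaded.shape == toy_final.shape
# equals() compares values and dtypes and treats NaNs in the same position as equal.
same_content = reloaded.equals(toy_final)

print("Same shape:", same_shape)
print("Same content:", same_content)
```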

## 7. Conclusion
In this notebook, I brought together multiple Instacart datasets into a single enriched file: orders were merged with `orders_products_prior` on `order_id`, and the result was merged with the cleaned products data on `product_id`. The final dataset of 32,434,212 rows and 15 columns was exported as a pickle file and re-imported to verify its shape. With the final dataset verified and exported successfully, I’m ready to dive deeper into customer behavior and trends.