# 4.6: Combining and Exporting Data

## Table of Contents
1. [Introduction](#1-Introduction)
2. [Import Libraries](#2-Import-Libraries)
3. [Load and Explore Data](#3-Load-and-Explore-Data)
   3.1 [Load `orders.csv`](#31-Load-orderscsv)
   3.2 [Load `products.csv`](#32-Load-productscsv)
4. [Combine Data](#4-Combine-Data)
   4.1 [Load `orders_products_prior.csv`](#41-Load-orders_products_priorcsv)
   4.2 [Merge Orders and Products](#42-Merge-Orders-and-Products)
5. [Export Data](#5-Export-Data)

## 1. Introduction
This notebook combines and exports data from the Instacart dataset. It covers three steps:
- Load and explore the source datasets (`orders.csv`, `products.csv`, and `orders_products_prior.csv`).
- Combine dataframes by merging them on a shared key (`product_id`).
- Export the combined data to a file (CSV here; Pickle is an alternative format).

## 2. Import Libraries

In [47]:
# Import necessary libraries
import pandas as pd  # For working with dataframes
import os  # For handling file paths

# Define the project folder path
path = r"D:\YVC\Data Analytics (CF)\Python Fundamentals for Data Analysts\Instacart Basket Analysis"

## 3. Load and Explore Data

### 3.1 Load `orders.csv`

In [59]:
# Load the orders.csv file
df_orders = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'))

# Display the first few rows and shape of the dataframe
print(df_orders.head())  # Preview the first 5 rows
print(df_orders.shape)   # Check the number of rows and columns

# Check for missing values
print(df_orders.isnull().sum())  # Consistency check for nulls

   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2398795        1    prior             2          3                  7   
2    473747        1    prior             3          3                 12   
3   2254736        1    prior             4          4                  7   
4    431534        1    prior             5          4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0  
(3421083, 7)
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64
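
All 206,209 missing values fall in `days_since_prior_order`. One plausible explanation, which is an assumption here and not something the output above confirms, is that a customer's first order has no earlier order to measure from. A minimal sketch of how that could be checked, using a small hypothetical dataframe (`toy_orders`) in place of `df_orders`:

```python
import pandas as pd

# Hypothetical stand-in for df_orders (toy values, not the real data)
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [None, 15.0, 21.0, None, 7.0],
})

# Flag rows where days_since_prior_order is missing
missing = toy_orders["days_since_prior_order"].isnull()

# Flag rows that are a user's first order
is_first = toy_orders["order_number"] == 1

# Cross-tabulate the two flags to see where the nulls sit
print(pd.crosstab(is_first, missing))

# True only if nulls occur on first orders and nowhere else
nulls_only_on_first_orders = bool((missing == is_first).all())
print(nulls_only_on_first_orders)
```

If the same check holds on the real data, the nulls are structural and need no imputation; if it does not, the missing values deserve a closer look.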


### 3.2 Load `products.csv`

In [25]:
# Load the products.csv file
df_products = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))

# Display the first few rows and shape of the dataframe
print(df_products.head())  # Preview the first 5 rows
print(df_products.shape)   # Check the number of rows and columns

# Count distinct product_id values; a count below the row count signals duplicate IDs
print(df_products['product_id'].nunique())

   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  prices  
0             19     5.8  
1             13     9.3  
2              7     4.5  
3              1    10.5  
4             13     4.3  
(49693, 5)
49686
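
The dataframe has 49,693 rows but only 49,686 distinct `product_id` values, so some IDs appear more than once. The sketch below shows one way to locate such rows and drop exact copies. It uses a hypothetical toy dataframe (`toy_products`), and whether the real duplicates are identical across every column is an assumption that would need checking first.

```python
import pandas as pd

# Hypothetical stand-in for df_products (toy values, not the real data)
toy_products = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "product_name": ["Cookies", "Salt", "Salt", "Tea"],
    "prices": [5.8, 9.3, 9.3, 4.5],
})

# Row count minus distinct keys = number of surplus rows
surplus = len(toy_products) - toy_products["product_id"].nunique()
print(surplus)

# Show every row whose product_id appears more than once
dupes = toy_products[toy_products.duplicated(subset="product_id", keep=False)]
print(dupes)

# One possible fix: drop rows that are identical across all columns
toy_products_clean = toy_products.drop_duplicates()
print(toy_products_clean.shape)
```

`drop_duplicates()` with no arguments removes only full-row copies. Rows that share a `product_id` but differ in other columns would remain and need a manual decision.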


## 4. Combine Data

### 4.1 Load `orders_products_prior.csv`

In [50]:
# Load the orders_products_prior.csv file with encoding specified
df_orders_prior = pd.read_csv(
    os.path.join(path, '02 Data', 'Original Data', 'orders_products_prior.csv'),
    encoding='utf-8'
)

# Display the first few rows and shape of the dataframe
print(df_orders_prior.head())  # Preview the first 5 rows
print(df_orders_prior.shape)   # Check the number of rows and columns

# Check for missing values
print(df_orders_prior.isnull().sum())  # Consistency check for nulls

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
(32434489, 4)
order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64


### 4.2 Merge Orders and Products

In [53]:
# Merge orders_prior with products data on 'product_id'
df_merged = df_orders_prior.merge(df_products, on='product_id', how='inner')

# Display the first few rows and shape of the merged dataframe
print(df_merged.head())  # Preview the first 5 rows
print(df_merged.shape)   # Check the number of rows and columns

# Check for missing values in the merged dataframe
print(df_merged.isnull().sum())  # Consistency check for nulls

   order_id  product_id  add_to_cart_order  reordered           product_name  \
0         2       33120                  1          1     Organic Egg Whites   
1         2       28985                  2          1  Michigan Organic Kale   
2         2        9327                  3          0          Garlic Powder   
3         2       45918                  4          1         Coconut Butter   
4         2       30035                  5          0      Natural Sweetener   

   aisle_id  department_id  prices  
0        86             16    11.3  
1        83              4    13.4  
2       104             13     3.6  
3        19             13     8.4  
4        17             13    13.7  
(32434212, 8)
order_id                 0
product_id               0
add_to_cart_order        0
reordered                0
product_name         28171
aisle_id                 0
department_id            0
prices                   0
dtype: int64
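
The merged dataframe has 32,434,212 rows, 277 fewer than the 32,434,489 rows in `df_orders_prior`. An inner merge silently drops rows whose `product_id` has no match in `df_products`, while duplicate keys on the right side add rows, so the net difference alone does not show what happened. The sketch below is one way to diagnose this, using the `indicator` and `validate` parameters of `merge` on hypothetical toy dataframes (`toy_prior`, `toy_products`).

```python
import pandas as pd

# Hypothetical stand-ins for df_orders_prior and df_products (toy values)
toy_prior = pd.DataFrame({
    "order_id": [2, 2, 3, 3],
    "product_id": [10, 11, 10, 99],  # 99 has no match in toy_products
})
toy_products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "product_name": ["Kale", "Garlic Powder", "Coconut Butter"],
})

# A left merge keeps every order row; the _merge column records whether
# each row matched ("both") or was found only on the left ("left_only").
# validate="many_to_one" raises MergeError if product_id repeats on the right.
checked = toy_prior.merge(
    toy_products,
    on="product_id",
    how="left",
    indicator=True,
    validate="many_to_one",
)
print(checked["_merge"].value_counts())

# Rows that an inner merge would have dropped
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

On data with repeated `product_id` values in the products table, the `validate` check would raise an error before any rows are multiplied, which makes the duplicate problem visible at merge time.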


## 5. Export Data

In [56]:
# Export the merged dataframe to CSV
output_path = os.path.join(path, '02 Data', 'Prepared Data', 'merged_orders_products.csv')
df_merged.to_csv(output_path, index=False)

# Confirm that the file has been saved
print(f"File saved to {output_path}")

File saved to D:\YVC\Data Analytics (CF)\Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Prepared Data\merged_orders_products.csv
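
CSV is portable, but it stores everything as text, so column dtypes must be re-inferred on every load. Pickle, a binary format that pandas supports through `to_pickle` and `read_pickle`, preserves the dataframe as-is. A minimal sketch on a hypothetical toy dataframe, writing to a temporary directory; the file name `merged_orders_products.pkl` is an assumption, not a file the notebook creates:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_merged (toy values, not the real data)
toy_merged = pd.DataFrame({
    "order_id": [2, 2],
    "product_id": [33120, 28985],
    "prices": [11.3, 13.4],
})

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "merged_orders_products.pkl")
    toy_merged.to_pickle(pkl_path)       # binary export
    restored = pd.read_pickle(pkl_path)  # round-trip load

# The restored dataframe matches the original, dtypes included
print(restored.equals(toy_merged))
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, and they are readable only from Python, so CSV remains the better choice when the data must be shared with other tools.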
