# Step 3 Order Products Merged

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing Data](#2.-Importing-Data)
    * [2.1 Importing Libraries](#2.1-Importing-Libraries)
    * [2.2 Importing Data](#2.2-Importing-Data)
* [3. Combining Data](#3.-Combining-Data)
    * [3.1 RAM Memory Space](#3.1-RAM-Memory-Space)
    * [3.2 Key Variables](#3.2-Key-Variables)
    * [3.3 Merge](#3.3-Merge)
* [4. Exporting Data](#4.-Exporting-Data) 

# 1. Introduction

To address Instacart's research question, all dataframes need to be combined into one comprehensive dataset. In the previous step, the Orders and Order Products Prior dataframes were combined into a unified dataframe called "orders_products_combined". This step merges that dataframe with the products dataframe, enriching the dataset with product information such as names and prices.


# 2. Importing Data

## 2.1 Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import os

## 2.2 Importing Data

In [3]:
path = r'C:\Users\kuohe\Instacart Basket Analysis 12-18-2023'

### Order Products Combined

In [4]:
# Importing the file
orders_products_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_combined_step2.pkl'))

In [5]:
orders_products_combined.shape

(32434489, 9)

In [6]:
orders_products_combined.head()

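| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_sequence | reordered |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 |
| 1 | 2539329 | 1 | 1 | 2 | 8 | NaN | 14084 | 2 | 0 |
| 2 | 2539329 | 1 | 1 | 2 | 8 | NaN | 12427 | 3 | 0 |
| 3 | 2539329 | 1 | 1 | 2 | 8 | NaN | 26088 | 4 | 0 |
| 4 | 2539329 | 1 | 1 | 2 | 8 | NaN | 26405 | 5 | 0 |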


### Products

In [7]:
# Importing the file
products = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'products_step1.csv'))

In [8]:
products.shape

(49688, 5)

In [9]:
products.head()

| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |


# 3. Combining Data

## 3.1 RAM Memory Space

### Order Products Combined

In [10]:
# Some operations on a dataframe this large use a lot of memory, so check the data types and downcast where possible
orders_products_combined.dtypes

order_id                    int32
user_id                     int32
order_number                 int8
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float64
product_id                  int32
add_to_cart_sequence        int32
reordered                    int8
dtype: object

Some of the data types were already converted in step 2, so no further action is needed.
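The effect of smaller data types on memory can be measured with `DataFrame.memory_usage`. The sketch below uses a small made-up dataframe (`demo` and its values are assumptions for illustration, not the Instacart file) to compare the bytes used before and after downcasting.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a large dataframe (hypothetical data, not the Instacart file)
demo = pd.DataFrame({
    "order_id": np.arange(1_000, dtype="int64"),
    "reordered": np.zeros(1_000, dtype="int64"),
})

# Bytes used per column before downcasting (index excluded)
before = demo.memory_usage(index=False, deep=True)

# Downcast to smaller integer types that still hold every value
demo["order_id"] = demo["order_id"].astype("int32")
demo["reordered"] = demo["reordered"].astype("int8")

# Bytes used per column after downcasting
after = demo.memory_usage(index=False, deep=True)

print(before.sum(), "->", after.sum())  # 16000 -> 5000
```

Running the same two `memory_usage` calls on the real dataframe would show how much the step 2 conversions saved.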

### Products

In [11]:
products.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

In [12]:
# Downcasting "product_id" from int64 to int32 to reduce memory use
products['product_id'] = products['product_id'].astype('int32')

In [13]:
# Downcasting "aisle_id" to int8 to reduce memory use (int8 holds values from -128 to 127)
products['aisle_id'] = products['aisle_id'].astype('int8')

In [14]:
# Downcasting "department_id" to int8 to reduce memory use (int8 holds values from -128 to 127)
products['department_id'] = products['department_id'].astype('int8')

In [15]:
# Check if implemented successfully
products.dtypes

product_id         int32
product_name      object
aisle_id            int8
department_id       int8
prices           float64
dtype: object
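Because `astype` does not raise an error when a value is too large for the target integer type, it can help to confirm that a column's range fits before downcasting. The following is a minimal sketch, assuming a hypothetical helper `fits_in` and a made-up ID column; `int8` only holds values from -128 to 127, so any ID above 127 would need a wider type.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value in `series` fits in the integer `dtype`."""
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)

# Hypothetical ID column with one value above the int8 maximum of 127
ids = pd.Series([5, 61, 104, 130], dtype="int64")

print(fits_in(ids, "int8"))   # False
print(fits_in(ids, "int16"))  # True

# pandas can also pick the smallest signed integer type that holds all values
safe = pd.to_numeric(ids, downcast="integer")
print(safe.dtype)  # int16
```

Calling a check like `fits_in(products['aisle_id'], 'int8')` before the conversion would confirm whether `int8` is wide enough for the real data.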

## 3.2 Key Variables

In [17]:
# "product_id" is the variable shared by both dataframes, so it serves as the merge key
orders_products_combined['product_id'].describe()

count    3.243449e+07
mean     2.557634e+04
std      1.409669e+04
min      1.000000e+00
25%      1.353000e+04
50%      2.525600e+04
75%      3.793500e+04
max      4.968800e+04
Name: product_id, dtype: float64

In [18]:
# Double-checking the number of unique values in "product_id" in the orders products combined file
values_orders_products_combined = orders_products_combined['product_id'].nunique()
print(f"The number of unique values in 'product_id': {values_orders_products_combined}")

The number of unique values in 'product_id': 49677


In [19]:
# Now checking "product_id" in the products file
products['product_id'].describe()

count    49688.000000
mean     24844.500000
std      14343.834425
min          1.000000
25%      12422.750000
50%      24844.500000
75%      37266.250000
max      49688.000000
Name: product_id, dtype: float64

In [20]:
values_products = products['product_id'].nunique()
print(f"The number of unique values in 'product_id': {values_products}")

The number of unique values in 'product_id': 49688
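The two counts differ (49677 versus 49688), which suggests that some products in the products file never appear in an order. One way to see which IDs are unmatched on each side is a set difference. This is a sketch on made-up miniature dataframes (`orders_demo` and `products_demo` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
orders_demo = pd.DataFrame({"product_id": [196, 196, 14084, 12427]})
products_demo = pd.DataFrame({"product_id": [196, 12427, 14084, 26088]})

order_ids = set(orders_demo["product_id"].tolist())
catalog_ids = set(products_demo["product_id"].tolist())

# Products in the catalog that never appear in an order
never_ordered = catalog_ids - order_ids

# Ordered products missing from the catalog
# (rows with these IDs would be dropped by an inner merge)
unmatched = order_ids - catalog_ids

print(sorted(never_ordered))  # [26088]
print(sorted(unmatched))      # []
```

An empty `unmatched` set means an inner merge on "product_id" keeps every order row.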


## 3.3 Merge

In [21]:
# Merging the dataframes on "product_id" (merge uses an inner join by default)
orders_products_merged = orders_products_combined.merge(products, on = 'product_id')

In [22]:
# Checking
orders_products_merged.head()

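| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_sequence | reordered | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | 77 | 7 | 9.0 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |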
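The merge above relies on pandas defaults. As an optional check, `merge` accepts `indicator=True` to flag where each row came from and `validate` to confirm the key relationship. The sketch below uses made-up miniature dataframes and placeholder product names (all assumptions for illustration) and a left join so that unmatched order rows would remain visible instead of being dropped.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
orders_demo = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [196, 14084, 196],
})
products_demo = pd.DataFrame({
    "product_id": [196, 14084, 26088],
    "product_name": ["Product A", "Product B", "Product C"],  # placeholder names
})

merged_demo = orders_demo.merge(
    products_demo,
    on="product_id",
    how="left",              # keep every order row, matched or not
    validate="many_to_one",  # raises MergeError if product_id repeats in products_demo
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Every order row should have found a product
print(merged_demo["_merge"].value_counts())

# A many-to-one left merge must not change the number of order rows
assert len(merged_demo) == len(orders_demo)
```

If any row showed `left_only`, its "product_id" would be missing from the products file.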


# 4. Exporting Data

In [23]:
# Check size
orders_products_merged.shape

(32434489, 13)

In [24]:
orders_products_merged.head()

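| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_sequence | reordered | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | 77 | 7 | 9.0 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |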


In [25]:
# Exporting to prepared data folder
# Pickle is used as the file type because the dataframe has over 32 million rows
orders_products_merged.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_merged_step3.pkl'))
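A further benefit of pickle is that it keeps the data types set earlier, so the downcast columns do not need converting again in the next step. Here is a minimal round-trip sketch, assuming a made-up dataframe and a temporary directory in place of the project's folder:

```python
import os
import tempfile

import pandas as pd

# Hypothetical small dataframe with a downcast column (values are made up)
demo = pd.DataFrame({
    "product_id": pd.Series([196, 14084], dtype="int32"),
    "prices": [9.0, 12.5],
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.pkl")
    demo.to_pickle(target)
    restored = pd.read_pickle(target)

print(restored.dtypes)        # product_id stays int32
print(restored.equals(demo))  # True
```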