# Step 2 Instacart Orders Products Combined

## Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing data](#2.-Importing-Data)
    * [2.1 Importing libraries](#2.1-Importing-Libraries)
    * [2.2 Importing data](#2.2-Importing-Data)
* [3. Data combining](#3.-Data-Combining)
    * [3.1 RAM memory space](#3.1-RAM-Memory-Space)
    * [3.2 Key variable](#3.2-Key-Variable)
    * [3.3 Merge](#3.3-Merge)
* [4. Exporting data](#4.-Exporting-Data) 

# 1. Introduction

Answering Instacart's research question requires merging all of the relevant dataframes. This step starts by combining the Orders dataframe with the Orders Products Prior dataframe. The Orders dataframe contains 3,421,083 orders in total; the merge enriches these records by attaching the purchased items to each order that has a match in Orders Products Prior.


# 2. Importing Data

## 2.1 Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## 2.2 Importing Data

In [2]:
path = r'C:\Users\kuohe\Instacart Basket Analysis 12-18-2023'

### Order

In [3]:
orders = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'orders_step1.csv'), index_col = False)

In [4]:
orders.shape

(3421083, 6)

In [5]:
orders.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


### Orders Products Prior

In [6]:
orders_products_prior = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_prior_step1.pkl'))

In [7]:
orders_products_prior.shape

(32434489, 4)

In [8]:
orders_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_sequence,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


# 3. Data Combining

## 3.1 RAM Memory Space

### Orders

In [9]:
# Taking a look at the data types to save memory
orders.dtypes

order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

In [10]:
# Converting the data type to be more efficient
orders['order_id'] = orders['order_id'].astype('int32')

In [11]:
# Converting the data type to be more efficient
orders['user_id'] = orders['user_id'].astype('int32')

In [12]:
# Converting the data type to be more efficient
orders['order_number'] = orders['order_number'].astype('int8')

In [13]:
# Converting the data type to be more efficient
orders['orders_day_of_week'] = orders['orders_day_of_week'].astype('int8')

In [14]:
# Converting the data type to be more efficient
orders['order_hour_of_day'] = orders['order_hour_of_day'].astype('int8')

In [16]:
# Checking if successful
orders.dtypes

order_id                    int32
user_id                     int32
order_number                 int8
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float64
dtype: object
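Downcasting with `astype` can silently wrap values that do not fit the smaller integer type, so it may be worth checking each column's range first. The following is a minimal sketch on a toy dataframe, not the notebook's actual data; `toy_orders`, `target_dtypes`, and `fits` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the orders dataframe (values are illustrative only)
toy_orders = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'order_number': [1, 2, 100],
    'order_hour_of_day': [8, 7, 23],
})

# Hypothetical mapping of columns to the smaller dtypes we would like to use
target_dtypes = {
    'order_id': 'int32',
    'order_number': 'int8',
    'order_hour_of_day': 'int8',
}

def fits(series, dtype):
    """Return True if every value in the series lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Keep only the conversions that are safe for the observed value range
safe = {col: dt for col, dt in target_dtypes.items() if fits(toy_orders[col], dt)}

before = toy_orders.memory_usage(deep=True).sum()
toy_small = toy_orders.astype(safe)  # convert several columns in one call
after = toy_small.memory_usage(deep=True).sum()

print(f"Memory before: {before} bytes, after: {after} bytes")
```

Passing a dictionary to `astype` converts several columns in one call, and comparing `memory_usage` before and after shows how much space the conversion actually saves.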

### Orders Products Prior

In [17]:
orders_products_prior.dtypes

order_id                int64
product_id              int64
add_to_cart_sequence    int64
reordered               int64
dtype: object

In [18]:
# Converting the data type to be more efficient
orders_products_prior['order_id'] = orders_products_prior['order_id'].astype('int32')

In [19]:
# Converting the data type to be more efficient
orders_products_prior['product_id'] = orders_products_prior['product_id'].astype('int32')

In [20]:
# Converting the data type to be more efficient
orders_products_prior['add_to_cart_sequence'] = orders_products_prior['add_to_cart_sequence'].astype('int32')

In [21]:
# Converting the data type to be more efficient
orders_products_prior['reordered'] = orders_products_prior['reordered'].astype('int8')

In [22]:
# Checking if successful
orders_products_prior.dtypes

order_id                int32
product_id              int32
add_to_cart_sequence    int32
reordered                int8
dtype: object

## 3.2 Key Variable

In [23]:
# order_id is shared by both dataframes, so it is the key variable to inspect
orders['order_id'].describe()

count    3.421083e+06
mean     1.710542e+06
std      9.875817e+05
min      1.000000e+00
25%      8.552715e+05
50%      1.710542e+06
75%      2.565812e+06
max      3.421083e+06
Name: order_id, dtype: float64

In [24]:
unique_orders = orders['order_id'].nunique()
print(f"Number of unique values in 'order_id': {unique_orders}")

Number of unique values in 'order_id': 3421083


In [25]:
# Inspecting order_id in orders_products_prior as well
orders_products_prior['order_id'].describe()

count    3.243449e+07
mean     1.710749e+06
std      9.873007e+05
min      2.000000e+00
25%      8.559430e+05
50%      1.711048e+06
75%      2.565514e+06
max      3.421083e+06
Name: order_id, dtype: float64

In [26]:
unique_orders_products_prior = orders_products_prior['order_id'].nunique()
print(f"Number of unique values in 'order_id': {unique_orders_products_prior}")

Number of unique values in 'order_id': 3214874
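The two counts differ (3,421,083 versus 3,214,874), so 206,209 `order_id` values appear in `orders` but not in `orders_products_prior`. A rough way to find such keys is sketched below on toy data; `toy_orders` and `toy_prior` are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Toy stand-ins: orders 1 and 4 have no rows in the prior table
toy_orders = pd.DataFrame({'order_id': [1, 2, 3, 4]})
toy_prior = pd.DataFrame({
    'order_id': [2, 2, 3, 3, 3],
    'product_id': [10, 11, 10, 12, 13],
})

# order_ids present in orders but absent from the prior table
only_in_orders = toy_orders.loc[
    ~toy_orders['order_id'].isin(toy_prior['order_id']), 'order_id'
]

# order_ids present in the prior table but absent from orders
only_in_prior = toy_prior.loc[
    ~toy_prior['order_id'].isin(toy_orders['order_id']), 'order_id'
].unique()

print(f"Only in orders: {only_in_orders.tolist()}")
print(f"Only in prior: {list(only_in_prior)}")
```

Running the same check on the real dataframes would show which orders will be left without product rows after the merge.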


## 3.3 Merge

In [27]:
# Merging the dataframes on order_id (default inner join: orders without a match in orders_products_prior are dropped)
orders_products_combined = orders.merge(orders_products_prior, on = 'order_id')

In [28]:
# Checking if successful
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


In [29]:
# Checking the number of unique order_id values in the combined dataframe
unique_orders_products_combined = orders_products_combined['order_id'].nunique()
print(f"Number of unique values in 'order_id': {unique_orders_products_combined}")

Number of unique values in 'order_id': 3214874
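The combined dataframe keeps 3,214,874 unique orders because `merge` performs an inner join by default. To see how many rows each side contributes, one option is an outer merge with `indicator=True`, optionally with `validate` to confirm that `order_id` is unique on the orders side. This is a sketch on toy data under those assumptions, not the notebook's own method; the dataframe names are hypothetical.

```python
import pandas as pd

# Toy stand-ins: order 1 has no products in the prior table
toy_orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [7, 7, 8]})
toy_prior = pd.DataFrame({'order_id': [2, 2, 3], 'product_id': [10, 11, 12]})

# Outer merge keeps unmatched rows; the _merge column records where each row came from.
# validate='one_to_many' raises an error if order_id is duplicated in toy_orders.
checked = toy_orders.merge(
    toy_prior,
    on='order_id',
    how='outer',
    validate='one_to_many',
    indicator=True,
)
print(checked['_merge'].value_counts())

# Rows flagged 'both' correspond to what the default inner merge would return
inner_equivalent = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(inner_equivalent.shape)
```

On the full data an outer merge uses more memory than the inner join, so this kind of check may be better run on the key columns only.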


# 4. Exporting Data

In [30]:
orders_products_combined.shape

(32434489, 9)

In [31]:
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


In [32]:
# Exporting to prepared data folder
# This will be exported as a pickle due to the large size of data
orders_products_combined.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_combined_step2.pkl'))
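Besides handling the size, a pickle also preserves the downcast dtypes, which a CSV round trip would not. A small sketch of verifying an export by reading it back is shown below; it writes a toy dataframe to a temporary directory, and the file name `toy_combined.pkl` is a hypothetical example.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe using the same compact dtypes as in the notebook
toy = pd.DataFrame({
    'order_id': pd.Series([1, 2], dtype='int32'),
    'reordered': pd.Series([0, 1], dtype='int8'),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_combined.pkl')
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The restored frame should match the original, including dtypes
print(restored.dtypes)
print(restored.equals(toy))
```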