# Step 4 Instacart Orders Products All

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing Data](#2.-Importing-Data)
    * [2.1 Importing Libraries](#2.1-Importing-Libraries)
    * [2.2 Importing Data](#2.2-Importing-Data)
* [3. Combining Data](#3.-Combining-Data)
    * [3.1 RAM Memory Space](#3.1-RAM-Memory-Space)
    * [3.2 Key Variables](#3.2-Key-Variables)
    * [3.3 Merge](#3.3-Merge)
* [4. Exporting Data](#4.-Exporting-Data) 

# 1. Introduction

This notebook continues joining the dataframes needed to address Instacart's research question. In earlier steps, the Orders and Orders Products Prior dataframes were combined to form the "orders_product_combined" dataframe, and the Products dataframe was then merged in to create "orders_products_merged". This step adds the Customers dataframe, which links each order to a specific customer and makes it possible to analyze purchasing patterns in depth at the customer level.


# 2. Importing Data

## 2.1 Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## 2.2 Importing Data

In [3]:
path = r'C:\Users\kuohe\Instacart Basket Analysis 12-18-2023'

In [4]:
orders_products_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_merged_step3.pkl'))

In [5]:
# Checking Size
orders_products_merged.shape

(32434489, 13)

In [7]:
# Checking Columns
orders_products_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0


### Customers

In [8]:
customers = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'customers_step1.csv'))

In [9]:
# Checking Size
customers.shape

(206209, 10)

In [10]:
# Checking Columns
customers.head()

Unnamed: 0,user_id,First Name,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,2017-01-01,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,2017-01-01,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,2017-01-01,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,2017-01-01,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,2017-01-01,1,married,40374
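Section 3.1 below converts several customer columns after loading. As an alternative, the smaller data types can be requested at read time through the `dtype` argument of `pd.read_csv`. The following is a minimal sketch under assumptions: the inline CSV text stands in for the real file (only three of the rows shown above are used, with a subset of columns), and `dtype_map` and `demo` are hypothetical names.

```python
import io

import pandas as pd

# Inline stand-in for customers_step1.csv (three rows, subset of columns)
csv_text = """user_id,Gender,STATE,Age,No._of_dependants,family_status,income
26711,Female,Missouri,48,3,married,165665
33890,Female,New Mexico,36,0,single,59285
65803,Male,Idaho,35,2,married,99568
"""

# Hypothetical mapping of column names to compact data types
dtype_map = {
    'user_id': 'int32',
    'Gender': 'category',
    'STATE': 'category',
    'Age': 'int32',
    'No._of_dependants': 'int32',
    'family_status': 'category',
}

# Columns not listed in dtype_map (here: income) keep their inferred type
demo = pd.read_csv(io.StringIO(csv_text), dtype=dtype_map)
print(demo.dtypes)
```

Declaring the types up front avoids holding the larger default types in memory first, which can matter for files much bigger than this one.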


# 3. Combining Data

## 3.1 RAM Memory Space

### Orders Product Merged

In [11]:
# Checking data type to save RAM
orders_products_merged.dtypes

order_id                    int32
user_id                     int32
order_number                 int8
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float64
product_id                  int32
add_to_cart_sequence        int32
reordered                    int8
product_name               object
aisle_id                     int8
department_id                int8
prices                    float64
dtype: object

Most of the data types were already converted to smaller types in previous steps, so no further changes are needed here.

### Customers

In [12]:
# Checking data type to save RAM
customers.dtypes

user_id               int64
First Name           object
Surname              object
Gender               object
STATE                object
Age                   int64
date_joined          object
No._of_dependants     int64
family_status        object
income                int64
dtype: object

In [13]:
# Downcasting to int32 to reduce memory use and match the user_id type in orders_products_merged
customers['user_id'] = customers['user_id'].astype('int32')

In [15]:
# Converting to category: only a few distinct values, so this saves memory
customers['Gender'] = customers['Gender'].astype('category')

In [16]:
# Converting to category: only a few distinct values, so this saves memory
customers['STATE'] = customers['STATE'].astype('category')

In [17]:
# Downcasting to int32 to reduce memory use
customers['Age'] = customers['Age'].astype('int32')

In [19]:
# Downcasting to int32 to reduce memory use
customers['No._of_dependants'] = customers['No._of_dependants'].astype('int32')

In [20]:
# Converting to category: only a few distinct values, so this saves memory
customers['family_status'] = customers['family_status'].astype('category')
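To see how much conversions like these save, memory usage can be compared before and after. The sketch below is illustrative only: it uses a small synthetic frame whose values are made up (only the column names follow the customers file), and the names `demo`, `converted`, `before`, and `after` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customers dataframe (values are illustrative only)
rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({
    'user_id': np.arange(1, n + 1, dtype='int64'),
    'Gender': rng.choice(['Female', 'Male'], size=n),
    'STATE': rng.choice(['Missouri', 'Idaho', 'Iowa', 'Maryland'], size=n),
    'Age': rng.integers(18, 82, size=n).astype('int64'),
})

# deep=True also counts the memory held by the string values themselves
before = demo.memory_usage(deep=True).sum()

# Same kind of conversions as in the cells above, applied in one call
converted = demo.astype({
    'user_id': 'int32',
    'Gender': 'category',
    'STATE': 'category',
    'Age': 'int32',
})

after = converted.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The same `memory_usage(deep=True)` call can be run on the real `customers` dataframe before and after the conversions to measure the actual saving.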

## 3.2 Key Variables

In [22]:
# Since "user_id" is shared by both dataframes, it will be used as the merge key
orders_products_merged['user_id'].describe()

count    3.243449e+07
mean     1.029372e+05
std      5.946648e+04
min      1.000000e+00
25%      5.142100e+04
50%      1.026110e+05
75%      1.543910e+05
max      2.062090e+05
Name: user_id, dtype: float64

In [23]:
values_orders_products_merged = orders_products_merged['user_id'].nunique()
print(f"Number of unique 'user_id' values: {values_orders_products_merged}")

Number of unique 'user_id' values: 206209


In [25]:
# As "user_id" is also in the customer file, we must check the values as well
customers['user_id'].describe()

count    206209.000000
mean     103105.000000
std       59527.555167
min           1.000000
25%       51553.000000
50%      103105.000000
75%      154657.000000
max      206209.000000
Name: user_id, dtype: float64

In [26]:
values_customers = customers['user_id'].nunique()
print(f"Number of unique 'user_id' values: {values_customers}")

Number of unique 'user_id' values: 206209
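Both dataframes contain 206,209 unique `user_id` values with the same minimum and maximum. A direct comparison of the key sets makes this check explicit, and it would also work when the IDs are not a contiguous range. The sketch below uses toy stand-ins for the two dataframes; `orders_demo`, `customers_demo`, and the values in them are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (values are illustrative only)
orders_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'product_id': [196, 14084, 196, 12427, 196, 26088],
})
customers_demo = pd.DataFrame({
    'user_id': [1, 2, 3],
    'STATE': ['Alabama', 'Idaho', 'Iowa'],
})

# 1) The key should be unique on the customer side (one row per customer)
key_is_unique = customers_demo['user_id'].is_unique

# 2) The two key sets should be identical, not just equal in size
order_keys = set(orders_demo['user_id'].unique())
customer_keys = set(customers_demo['user_id'].unique())
missing_in_customers = order_keys - customer_keys
missing_in_orders = customer_keys - order_keys

print(f"key unique in customers: {key_is_unique}")
print(f"user_ids missing from customers: {missing_in_customers}")
print(f"user_ids missing from orders: {missing_in_orders}")
```

If either difference is non-empty, an inner join would silently drop the unmatched rows, so it is worth checking before merging.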


## 3.3 Merge

In [27]:
# Merging the dataframes on the shared key "user_id" (pandas defaults to an inner join)
orders_products_all = orders_products_merged.merge(customers, on = 'user_id')

In [28]:
# Checking Columns
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,prices,First Name,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423


In [29]:
# Double-checking the merge: the number of unique user_id values should be unchanged
values_orders_products_all = orders_products_all['user_id'].nunique()
print(f"Number of unique 'user_id' values: {values_orders_products_all}")

Number of unique 'user_id' values: 206209


In [30]:
orders_products_all['user_id'].describe()

count    3.243449e+07
mean     1.029372e+05
std      5.946648e+04
min      1.000000e+00
25%      5.142100e+04
50%      1.026110e+05
75%      1.543910e+05
max      2.062090e+05
Name: user_id, dtype: float64

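The merge was successful: the row count (about 32.4 million) and the number of unique user_id values (206,209) are the same as before the merge, so no rows were dropped or duplicated.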
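The same verification can be built into the merge call itself. The sketch below is one possible approach, not the method used in this notebook: it swaps the default inner join for `how='left'` so unmatched order rows stay visible, uses `validate='many_to_one'` so pandas raises an error if a `user_id` repeats in the customer table, and uses `indicator=True` to record where each row came from. The toy dataframes and their values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (values are illustrative only)
orders_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'product_id': [196, 14084, 196, 12427, 196, 26088],
})
customers_demo = pd.DataFrame({
    'user_id': [1, 2, 3],
    'STATE': ['Alabama', 'Idaho', 'Iowa'],
})

merged_demo = orders_demo.merge(
    customers_demo,
    on='user_id',
    how='left',              # keep every order row, matched or not
    validate='many_to_one',  # raise MergeError if user_id repeats in customers
    indicator=True,          # add a '_merge' column: both / left_only / right_only
)

# Rows that found no matching customer
unmatched = (merged_demo['_merge'] != 'both').sum()

# A many-to-one left join should never change the row count
rows_preserved = len(merged_demo) == len(orders_demo)

print(merged_demo['_merge'].value_counts())
print(f"unmatched rows: {unmatched}, rows preserved: {rows_preserved}")
```

Once the check passes, the `_merge` column can be dropped with `merged_demo.drop(columns='_merge')` so it does not end up in the exported file.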

# 4. Exporting Data

In [31]:
# Checking Size
orders_products_all.shape

(32434489, 22)

In [32]:
# Checking Column name
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,prices,First Name,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423


In [33]:
# Exporting to prepared data folder
# With over 32 million rows, a pickle is used instead of a CSV: it preserves the converted data types and is faster to reload
orders_products_all.to_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_all_step4.pkl'))
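One practical reason to prefer a pickle here is that it keeps the compact data types set earlier, whereas a CSV round trip falls back to pandas' default type inference. The sketch below illustrates this on a tiny frame written to a temporary directory; `demo` and the file names are hypothetical and unrelated to the project folders.

```python
import os
import tempfile

import pandas as pd

# Tiny frame with the kind of compact dtypes used in this notebook
demo = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int32'),
    'STATE': pd.Series(['Alabama', 'Idaho', 'Iowa'], dtype='category'),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'demo.pkl')
    csv_path = os.path.join(tmp, 'demo.csv')

    demo.to_pickle(pkl_path)
    demo.to_csv(csv_path, index=False)

    # Read both files back while the temporary directory still exists
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps int32 and category; the CSV is re-inferred on load
pickle_keeps_dtypes = list(from_pickle.dtypes) == list(demo.dtypes)
print(f"pickle keeps dtypes: {pickle_keeps_dtypes}")
print(from_csv.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.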