# IC 4.6 Part 2_Combining and Exporting orders_products_combined data with products data

# Table of Contents
##### 1. Import libraries
##### 2. Import data
##### 3. Check shape of each dataframe
    3.1 df_ords_prods_combined
    3.2 df_prods
##### 4. Merge Data (orders_products_combined / products)
##### 5. Export data

## 1. Import libraries

In [3]:
# Import pandas, NumPy, and os

import pandas as pd
import numpy as np
import os

## 2. Import data

In [8]:
path = r'C:\Users\ashle\Documents\07-2024 Instacart Basket Analysis' # Assigns the master file path to variable 'path'

# Import orders_products_combined
df_ords_prods_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))

# Import products_checked
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'), index_col = False)

## 3. Check shape of each dataframe

### 3.1 df_ords_prods_combined

In [11]:
df_ords_prods_combined.head()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,True,196,1,0,both
1,2539329,1,1,2,8,,True,14084,2,0,both
2,2539329,1,1,2,8,,True,12427,3,0,both
3,2539329,1,1,2,8,,True,26088,4,0,both
4,2539329,1,1,2,8,,True,26405,5,0,both


In [13]:
# Drop '_merge' column on df_ords_prods_combined
df_ords_prods_combined = df_ords_prods_combined.drop(columns=['_merge'])

In [17]:
# Confirm column was successfully dropped
df_ords_prods_combined.head()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered
0,2539329,1,1,2,8,,True,196,1,0
1,2539329,1,1,2,8,,True,14084,2,0
2,2539329,1,1,2,8,,True,12427,3,0
3,2539329,1,1,2,8,,True,26088,4,0
4,2539329,1,1,2,8,,True,26405,5,0


In [19]:
# Check shape of df_ords_prods_combined
df_ords_prods_combined.shape

(32434489, 10)

### 3.2 df_prods

In [21]:
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [23]:
# Drop the unnecessary index column 'Unnamed: 0'
df_prods = df_prods.drop(columns=['Unnamed: 0'])

In [25]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [27]:
df_prods.shape

(49672, 5)
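
A leftover `Unnamed: 0` column like the one dropped above usually comes from a CSV that was saved with its row index included. The sketch below reproduces this on a small made-up table (`toy_prods` and its values are illustrative assumptions, not the Instacart data) and shows two ways to deal with it.

```python
import io
import pandas as pd

# Toy stand-in for the products table (values are made up for illustration)
toy_prods = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Writing without index=False stores the row index as an unnamed first column
buffer = io.StringIO()
toy_prods.to_csv(buffer)
buffer.seek(0)

# Read back, the unnamed column is loaded as data and named 'Unnamed: 0'
reloaded = pd.read_csv(buffer, index_col=False)
print(reloaded.columns.tolist())

# Option 1: drop the leftover column after loading
cleaned = reloaded.drop(columns=['Unnamed: 0'])

# Option 2: avoid creating it by exporting with index=False
buffer_no_index = io.StringIO()
toy_prods.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
clean_reload = pd.read_csv(buffer_no_index)
print(clean_reload.columns.tolist())
```

Exporting with `index=False` in the earlier cleaning step would likely make the drop in this notebook unnecessary.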

## 4. Merge Data (orders_products_combined / products)

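Approach: I chose an inner join because the key questions in this analysis (order times, product popularity, and spending patterns) all need both order and product information on every row. An inner join keeps only the rows whose product_id appears in both dataframes, so every row in the result carries a product name, aisle, department, and price.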
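
To illustrate the trade-off, here is a minimal sketch comparing an inner join with a left join. The dataframes `toy_orders` and `toy_prods` are made-up examples, not the Instacart data; product_id 3 is deliberately missing from the products table.

```python
import pandas as pd

# Toy data, made up for illustration: product_id 3 has no entry in toy_prods
toy_orders = pd.DataFrame({'order_id': [10, 10, 11], 'product_id': [1, 3, 2]})
toy_prods = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Inner join: keeps only rows whose product_id exists in both dataframes
inner = toy_orders.merge(toy_prods, on='product_id', how='inner')

# Left join: keeps every order row; unmatched rows get NaN for prices
left = toy_orders.merge(toy_prods, on='product_id', how='left')

print(len(inner), len(left))        # row counts for each join type
print(left['prices'].isna().sum())  # order rows without a price
```

With the inner join, the unmatched order row disappears; with the left join, it stays but has no price, which would need handling before any spending analysis.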

In [29]:
# Merge df_ords_prods_combined with df_prods using an inner join (the merge default, stated explicitly here)
df_ords_prods_merge = df_ords_prods_combined.merge(df_prods, on = 'product_id', how = 'inner', indicator = True)

In [30]:
df_ords_prods_merge

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both
1,2539329,1,1,2,8,,True,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,both
2,2539329,1,1,2,8,,True,12427,3,0,Original Beef Jerky,23,19,4.4,both
3,2539329,1,1,2,8,,True,26088,4,0,Aged White Cheddar Popcorn,23,19,4.7,both
4,2539329,1,1,2,8,,True,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32404854,2977660,206209,13,1,12,7.0,False,14197,5,1,Tomato Paste,9,9,5.6,both
32404855,2977660,206209,13,1,12,7.0,False,38730,6,0,Brownie Crunch High Protein Bar,3,19,5.9,both
32404856,2977660,206209,13,1,12,7.0,False,31477,7,0,High Protein Bar Chunky Peanut Butter,3,19,4.2,both
32404857,2977660,206209,13,1,12,7.0,False,6567,8,0,Chocolate Peanut Butter Protein Bar,3,19,4.9,both


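The 'indicator = True' argument in the merge code above adds a merge flag: the '_merge' column, which records whether each row was found in both dataframes or only one. In the next cell, I'll do a frequency check on the '_merge' column to see the counts of each result (both, left_only, or right_only). Because this is an inner join, only matched rows are kept, so every row should be flagged 'both'.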

In [33]:
# Frequency check on the merge flag
df_ords_prods_merge['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64
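
To see rows that fail to match, the join has to keep them. The row count fell from 32,434,489 in section 3.1 to 32,404,859 in the merged dataframe, a difference of 29,630 rows, which suggests some order rows had no matching product_id in df_prods. The sketch below uses made-up toy data (not the Instacart files) to show how an outer join with the indicator surfaces such rows.

```python
import pandas as pd

# Toy data, made up for illustration:
# product_id 3 exists only in the orders, product_id 4 only in the products
toy_orders = pd.DataFrame({'order_id': [10, 10, 11], 'product_id': [1, 3, 2]})
toy_prods = pd.DataFrame({'product_id': [1, 2, 4], 'prices': [5.8, 9.3, 4.5]})

# An outer join keeps unmatched rows from both sides, so the flag can vary
outer = toy_orders.merge(toy_prods, on='product_id', how='outer', indicator=True)

# Count how many rows fall into each merge category
counts = outer['_merge'].value_counts()
print(counts)

# Order rows with no matching product
unmatched_orders = outer[outer['_merge'] == 'left_only']
print(unmatched_orders['product_id'].tolist())
```

Running the same kind of check on the real dataframes would need enough memory for an outer join of more than 32 million rows, so it may be worth testing on a sample first.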

In [35]:
# Check the shape of the new merged dataframe df_ords_prods_merge
df_ords_prods_merge.shape

(32404859, 15)
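
The 15 columns are the 10 original order columns, the 4 added product columns, and '_merge'. A shape check can also catch the opposite problem from dropped rows: if a product_id appeared more than once in the products table, each matching order row would be duplicated. As a hedged sketch on made-up toy data, the `validate` argument of `merge` can guard against that by raising an error when the right-hand keys are not unique.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy data, made up for illustration: product_id 1 is duplicated in dup_prods
toy_orders = pd.DataFrame({'order_id': [10, 11], 'product_id': [1, 1]})
dup_prods = pd.DataFrame({'product_id': [1, 1], 'prices': [5.8, 5.8]})

# Without validation, each order row matches both product rows: 2 rows become 4
inflated = toy_orders.merge(dup_prods, on='product_id')
print(inflated.shape)

# validate='many_to_one' requires product_id to be unique in the right dataframe
try:
    toy_orders.merge(dup_prods, on='product_id', validate='many_to_one')
    raised = False
except MergeError:
    raised = True

print(raised)
```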

## 5. Export data

In [None]:
# Export the merged dataframe as ords_prods_merge.pkl
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge.pkl'))
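
One optional way to confirm an export worked is to read the file back and compare it with the dataframe in memory. The sketch below does this with a small made-up dataframe and a temporary folder (both are illustrative assumptions, not the project paths).

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the merged dataframe (values are made up for illustration)
toy_merge = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Write to a temporary folder so nothing is left behind after the check
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_merge.pkl')
    toy_merge.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# True when values, dtypes, and index all survived the round trip
round_trip_ok = reloaded.equals(toy_merge)
print(round_trip_ok)
```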