## 4.6.2 Combining & Exporting Data

### This script contains the following points:

#### 1. Import the data sets into Jupyter

#### 2. Check the dimensions of the imported dataframes

#### 3. Determine a suitable way to combine the orders_products_combined dataframe with the products data set

#### 4. Confirm the results of the merge using the merge flag

#### 5. Export the newly created dataframe as ords_prods_merge in a suitable format

In [24]:
# Import libraries
import pandas as pd
import numpy as np
import os
from os.path import join

### 1. Import the data sets into Jupyter

In [25]:
# Tell Python to remember a main folder path
path = "../02 Data/Prepared Data"

In [29]:
# Import dataset orders_products_combined.csv
df_ords_prods_combined = pd.read_csv("../02 Data/Prepared Data/orders_products_combined.csv")

In [33]:
# Import dataset products_checked.csv
df_prods = pd.read_csv('../02 Data/Prepared Data/products_checked.csv')

### 2. Check the dimensions of the imported dataframes

In [34]:
# Check the output
df_ords_prods_combined.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,0,0,2539329,1,prior,1,2,8,-1.0,196,1,0,both
1,1,0,2539329,1,prior,1,2,8,-1.0,14084,2,0,both
2,2,0,2539329,1,prior,1,2,8,-1.0,12427,3,0,both
3,3,0,2539329,1,prior,1,2,8,-1.0,26088,4,0,both
4,4,0,2539329,1,prior,1,2,8,-1.0,26405,5,0,both


In [35]:
df_ords_prods_combined.shape

(32434489, 13)

In [36]:
# Check the output
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [37]:
# Drop unnecessary columns
df_prods = df_prods.drop(['Unnamed: 0'], axis=1)

In [38]:
# Check the output
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [39]:
df_prods.shape

(49693, 5)
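The `Unnamed: 0` columns seen above typically appear when a dataframe was saved with `to_csv` without `index=False` and later read back. A minimal sketch of dropping every such column in one step, using a small made-up dataframe (`df_demo` and `df_clean` are hypothetical names, not part of the project data):

```python
import pandas as pd

# Hypothetical toy frame standing in for an imported CSV with leftover index columns
df_demo = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "product_id": [1, 2],
    "prices": [5.8, 9.3],
})

# Keep only the columns whose name does not start with "Unnamed"
df_clean = df_demo.loc[:, ~df_demo.columns.str.startswith("Unnamed")]

print(df_clean.columns.tolist())
print(df_clean.shape)
```

This avoids listing each leftover column by name, at the cost of also removing any genuine column whose name happens to start with "Unnamed".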

### 3. Determine a suitable way to combine the orders_products_combined dataframe with the products data set

#### Although the two dataframes have different shapes, df_ords_prods_combined and df_prods share the column "product_id", so they can be merged on that key. For the purposes of this project, the default inner join is suitable: it keeps only the rows whose product_id appears in both dataframes.
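To illustrate what an inner join on a shared key does, here is a minimal sketch on made-up data (`prods_demo`, `ords_demo` and `merged_demo` are hypothetical names). The `validate="one_to_many"` argument is an optional extra, not something the notebook uses; it asks pandas to raise an error if `product_id` is duplicated on the products side.

```python
import pandas as pd

# Hypothetical toy data: product 3 is never ordered, product 99 is not in the products table
prods_demo = pd.DataFrame({"product_id": [1, 2, 3], "product_name": ["A", "B", "C"]})
ords_demo = pd.DataFrame({"order_id": [10, 10, 11, 12], "product_id": [1, 2, 1, 99]})

# how="inner" is the default; validate checks that product_id is unique in prods_demo
merged_demo = prods_demo.merge(
    ords_demo, on="product_id", how="inner", validate="one_to_many"
)

# Only rows whose product_id exists in both frames survive the inner join
print(merged_demo.shape)
print(sorted(merged_demo["product_id"].unique()))
```

In this toy case the order row for product 99 and the never-ordered product 3 both disappear from the result.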

In [41]:
df_ords_prods_merge = df_prods.merge(df_ords_prods_combined, on = 'product_id', indicator = 'merge_status')

#### I think the problem here is that a column named "_merge" already exists in the df_ords_prods_combined dataframe. Therefore, I have to drop it from the dataframe before performing the merge.

In [42]:
# Drop unnecessary columns
df_ords_prods_combined = df_ords_prods_combined.drop(['_merge'], axis=1)
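The name clash can be reproduced on a small scale. The sketch below uses made-up dataframes (`left_demo`, `right_demo` are hypothetical names) and assumes the merge is called with `indicator=True`, which makes pandas try to create a column called `_merge`:

```python
import pandas as pd

# Hypothetical toy frames; right_demo still carries a "_merge" column from an earlier merge
left_demo = pd.DataFrame({"product_id": [1, 2], "product_name": ["A", "B"]})
right_demo = pd.DataFrame(
    {"product_id": [1, 2], "order_id": [10, 11], "_merge": ["both", "both"]}
)

try:
    left_demo.merge(right_demo, on="product_id", indicator=True)
    conflict = False
except ValueError as err:
    # pandas refuses to reuse an existing column name for the indicator
    conflict = True
    print(err)

# errors="ignore" makes the drop a no-op if the column is already gone
right_demo = right_demo.drop(columns=["_merge"], errors="ignore")
merged_demo = left_demo.merge(right_demo, on="product_id", indicator=True)
print(merged_demo["_merge"].tolist())
```

Passing `errors="ignore"` is one way to keep the cell safe to re-run after the column has already been dropped.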

In [43]:
# Check output
df_ords_prods_combined.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,0,0,2539329,1,prior,1,2,8,-1.0,196,1,0
1,1,0,2539329,1,prior,1,2,8,-1.0,14084,2,0
2,2,0,2539329,1,prior,1,2,8,-1.0,12427,3,0
3,3,0,2539329,1,prior,1,2,8,-1.0,26088,4,0
4,4,0,2539329,1,prior,1,2,8,-1.0,26405,5,0


In [44]:
df_ords_prods_combined.shape

(32434489, 12)

In [45]:
# Merge the updated dataframes
df_ords_prods_merge = df_prods.merge(df_ords_prods_combined, on = 'product_id', indicator = True)

In [46]:
# Check the output
df_ords_prods_merge.head()

Unnamed: 0.2,product_id,product_name,aisle_id,department_id,prices,Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,add_to_cart_order,reordered,_merge
0,1,Chocolate Sandwich Cookies,61,19,5.8,16802,1987,3139998,138,prior,28,6,11,3.0,5,0,both
1,1,Chocolate Sandwich Cookies,61,19,5.8,16806,1989,1977647,138,prior,30,6,17,20.0,1,1,both
2,1,Chocolate Sandwich Cookies,61,19,5.8,106710,11433,389851,709,prior,2,0,21,6.0,20,0,both
3,1,Chocolate Sandwich Cookies,61,19,5.8,113536,12198,652770,764,prior,1,3,13,-1.0,10,0,both
4,1,Chocolate Sandwich Cookies,61,19,5.8,113575,12200,1813452,764,prior,3,4,17,9.0,11,1,both


In [47]:
df_ords_prods_merge.shape

(32434212, 17)

### 4. Confirm the results of the merge using the merge flag

In [48]:
df_ords_prods_merge['_merge'].value_counts()

_merge
both          32434212
left_only            0
right_only           0
Name: count, dtype: int64

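#### The resulting dataframe (after the merge) has 32,434,212 rows, and each of those rows has information found in both input data sets, as we used an inner join for the purposes of this project. That is 277 fewer rows than df_ords_prods_combined (32,434,489), which suggests that a small number of order rows have a product_id with no match in df_prods.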
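Because an inner join keeps only matched rows, its merge flag can only ever show "both". One way to see which rows fail to match is an outer join with the indicator switched on. The following is a sketch on made-up data (`prods_demo`, `ords_demo`, `check_demo` are hypothetical names), not a step the notebook performs:

```python
import pandas as pd

# Hypothetical toy data: product 3 is never ordered, product 99 is missing from the products table
prods_demo = pd.DataFrame({"product_id": [1, 2, 3], "product_name": ["A", "B", "C"]})
ords_demo = pd.DataFrame({"order_id": [10, 11, 12], "product_id": [1, 2, 99]})

# An outer join keeps unmatched rows from both sides and labels them in "_merge"
check_demo = prods_demo.merge(ords_demo, on="product_id", how="outer", indicator=True)
counts = check_demo["_merge"].value_counts()
print(counts)

# product_id values that appear in the orders data but not in the products data
unmatched = check_demo.loc[check_demo["_merge"] == "right_only", "product_id"].tolist()
print(unmatched)
```

On the full data set an outer join of this size may need considerably more memory than the inner join, so it is probably best run as a one-off check.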

### 5. Export the newly created dataframe as ords_prods_merge in a suitable format

In [61]:
# Export data to csv
df_ords_prods_merge.to_csv(os.path.join(path, 'orders_products_merged.csv'))
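As a possible refinement of the export step, the sketch below compares two formats on a small made-up dataframe (`df_demo` and the file names are hypothetical). It assumes the goal is to avoid new `Unnamed: 0` columns on the next import: `index=False` stops `to_csv` from writing the index, while `to_pickle` stores the dataframe as-is, including its dtypes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for df_ords_prods_merge
df_demo = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    pkl_path = os.path.join(tmp_dir, "demo.pkl")

    # index=False keeps the row index out of the CSV file
    df_demo.to_csv(csv_path, index=False)
    # Pickle preserves the dataframe exactly, without a text round trip
    df_demo.to_pickle(pkl_path)

    csv_back = pd.read_csv(csv_path)
    pkl_back = pd.read_pickle(pkl_path)

print(csv_back.columns.tolist())
print(pkl_back.equals(df_demo))
```

For a dataframe with tens of millions of rows, pickle is often the more convenient choice between notebooks, while CSV remains easier to share with other tools.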