# Combining_Exporting_Data_Part2

### This script contains the following points:

#### 01. Importing libraries
#### 02. Importing orders_products_combined.pkl (merged) and products_checked.pkl (cleaned) data
#### 03. Examining df_merged_large (merged orders_products_prior.csv (original) and orders_checked.pkl (cleaned) files) and df_prods_checked (products_checked.pkl (cleaned) file) dataframes
#### 04. Exploring how to merge the df_merged_large and df_prods_checked dataframes
* Dropping the *'_merge'* column from df_merged_large
* Testing the merge without overwriting on an inner join
* Testing the merge without overwriting on a left join
* Merging on a left join
* Investigating why the results are not as expected i.e. 32434489 rows matched
* Define functions to find the missing values and to check for the missing values in the product_id column of the dataframes

#### 05. Merging the df_merged_large and df_prods_checked dataframes
#### 06. Exporting the df_ords_prior and df_ords_clean merged dataframe as pkl file

## 01. Importing libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

## 02. Importing orders_products_combined.pkl (merged) and products_checked.pkl (cleaned) data

In [2]:
# Assign the main project folder path to the variable path and view
path = r'/Users/elsaekevall/Jupyter_Notebook/Career_Foundry/06_2022_Instacart_Basket_Analysis/'
path

'/Users/elsaekevall/Jupyter_Notebook/Career_Foundry/06_2022_Instacart_Basket_Analysis/'

In [3]:
# Use the os.path.join() function to import the orders_products_combined.pkl and products_checked.pkl files as pandas dataframes
df_merged_large = pd.read_pickle(os.path.join(path, '02_Data', '02_2_Prepared_Data', 'orders_products_combined.pkl'))
df_prods_checked = pd.read_pickle(os.path.join(path, '02_Data', '02_2_Prepared_Data', 'products_checked.pkl'))

## 03. Examining df_merged_large *(merged orders_products_prior.csv (original) and orders_checked.pkl (cleaned) files)* and df_prods_checked *(products_checked.pkl (cleaned) file)* dataframes

In [4]:
# View the first five rows of the dataframe
df_merged_large.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,True,196,1,False,both
1,2539329,1,1,2,8,,True,14084,2,False,both
2,2539329,1,1,2,8,,True,12427,3,False,both
3,2539329,1,1,2,8,,True,26088,4,False,both
4,2539329,1,1,2,8,,True,26405,5,False,both


In [5]:
# View the shape
df_merged_large.shape

(32434489, 11)

In [6]:
# View the first five rows of the dataframe
df_prods_checked.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [7]:
# View the shape
df_prods_checked.shape

(49672, 5)

## 04. Exploring how to merge the df_merged_large and df_prods_checked dataframes 

* How many rows and columns do the dataframes to be combined contain? Do they have the same number of columns? <br>
    *Not the same number of columns or rows, therefore can't concatenate or append the data*
* Do the dataframes contain information on the same subject? Do they share a common column or columns? <br>
    *They contain different information and have the column 'product_id' in common.*
* Should the combined dataframe be long format or wide format? <br>
    *The combined dataframe should be wide format to include the new columns from df_prods_checked*
* Is a full match expected after combining the dataframes? <br>
    *No, a full match is not expected; it is possible that some of the product_id's may not be in the df_merged_large dataframe.*
* If not, what type of join should you use? Which part of the data should you keep in the final dataframe? <br>
    *Merge the dataframes using a left join keeping the information in the df_merged_large (left) dataframe and combining it with any information in the df_prods_checked (right) dataframe that can be mapped back to the dataframe on the left.*
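The difference between the two join types considered here can be sketched on toy data. The frames `orders` and `products` below are made-up stand-ins for df_merged_large and df_prods_checked, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for df_merged_large (left) and df_prods_checked (right)
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product_id": [196, 14084, 196, 999],  # 999 has no product record
})
products = pd.DataFrame({
    "product_id": [196, 14084, 12427],
    "product_name": ["Soda", "Almond Milk", "Beef Jerky"],
})

# Inner join: only rows whose product_id exists in both frames
inner = orders.merge(products, on="product_id", how="inner")

# Left join: every row of the left frame is kept; unmatched rows get NaN
left = orders.merge(products, on="product_id", how="left")

print(len(inner), len(left))
print(left["product_name"].isnull().sum())  # rows without product details
```

With unique keys on the right, the left join returns exactly as many rows as the left frame, while the inner join silently drops the unmatched ones.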

### Dropping the '_merge' column from df_merged_large
**Using the test merge without overwriting shows an error explaining that the '_merge' column must be dropped from df_merged_large before merging**

In [8]:
# Drop the '_merge' column from df_merged_large and view the first five rows
df_merged_large.drop('_merge', axis=1, inplace=True)
df_merged_large.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered
0,2539329,1,1,2,8,,True,196,1,False
1,2539329,1,1,2,8,,True,14084,2,False
2,2539329,1,1,2,8,,True,12427,3,False
3,2539329,1,1,2,8,,True,26088,4,False
4,2539329,1,1,2,8,,True,26405,5,False
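A minimal sketch of why the drop is needed, using two made-up frames: pandas refuses to create the indicator column when a column with the same name already exists. Passing a string to `indicator` to choose a different flag name is an alternative I did not use in this notebook.

```python
import pandas as pd

# Hypothetical frames; the left one still carries an old '_merge' flag
left = pd.DataFrame({"product_id": [1, 2], "_merge": ["both", "both"]})
right = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# Merging with indicator=True while '_merge' already exists raises ValueError
try:
    left.merge(right, on="product_id", indicator=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Option 1 (used in this notebook): drop the old flag first
merged = left.drop(columns="_merge").merge(right, on="product_id", indicator=True)

# Option 2 (alternative): keep the old flag and give the new one another name
merged_named = left.merge(right, on="product_id", indicator="_merge_products")
print(merged_named.columns.tolist())
```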


### Testing the merge without overwriting on an inner join

In [9]:
# Test merge without overwriting
pd.merge(df_merged_large,df_prods_checked, on = ['product_id'], indicator = True)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,True,196,1,False,Soda,77,7,9.0,both
1,2398795,1,2,3,7,15.0,False,196,1,True,Soda,77,7,9.0,both
2,473747,1,3,3,12,21.0,False,196,1,True,Soda,77,7,9.0,both
3,2254736,1,4,4,7,29.0,False,196,1,True,Soda,77,7,9.0,both
4,431534,1,5,4,15,28.0,False,196,1,True,Soda,77,7,9.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32404854,1320836,202557,17,2,15,1.0,False,43553,2,True,Orange Energy Shots,64,7,3.7,both
32404855,31526,202557,18,5,11,3.0,False,43553,2,True,Orange Energy Shots,64,7,3.7,both
32404856,758936,203436,1,2,7,,True,42338,4,False,"Zucchini Chips, Pesto",50,19,6.9,both
32404857,2745165,203436,2,3,5,15.0,False,42338,16,True,"Zucchini Chips, Pesto",50,19,6.9,both


**The final data frame has the expected number of columns 15, but only 32404859 rows, whereas the initial dataframe df_merged_large had 32434489 rows**

### Testing the merge without overwriting on a left join

In [10]:
# Test merge without overwriting
pd.merge(df_merged_large,df_prods_checked, on = ['product_id'], how = 'left', indicator = True)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,True,196,1,False,Soda,77,7,9.0,both
1,2539329,1,1,2,8,,True,14084,2,False,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,both
2,2539329,1,1,2,8,,True,12427,3,False,Original Beef Jerky,23,19,4.4,both
3,2539329,1,1,2,8,,True,26088,4,False,Aged White Cheddar Popcorn,23,19,4.7,both
4,2539329,1,1,2,8,,True,26405,5,False,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32435054,2977660,206209,13,1,12,7.0,False,14197,5,True,Tomato Paste,9,9,5.6,both
32435055,2977660,206209,13,1,12,7.0,False,38730,6,False,Brownie Crunch High Protein Bar,3,19,5.9,both
32435056,2977660,206209,13,1,12,7.0,False,31477,7,False,High Protein Bar Chunky Peanut Butter,3,19,4.2,both
32435057,2977660,206209,13,1,12,7.0,False,6567,8,False,Chocolate Peanut Butter Protein Bar,3,19,4.9,both


**The final data frame has the expected number of columns 15, but has 32435059 rows, whereas the initial dataframe df_merged_large had 32434489 rows**
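The left join returns 32435059 − 32434489 = 570 more rows than df_merged_large contains. A left join can only add rows when a key value occurs more than once in the right dataframe, so one possible explanation (an assumption, not verified in this notebook) is that some product_id's are duplicated in df_prods_checked. The sketch below shows the effect on made-up data and two ways to detect it.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: product_id 196 appears twice in the right frame
orders = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [196, 196, 34]})
products = pd.DataFrame({
    "product_id": [196, 196, 12427],
    "product_name": ["Soda", "Soda (duplicate entry)", "Beef Jerky"],
})

# Each order row with product_id 196 is matched twice, so rows multiply
left = orders.merge(products, on="product_id", how="left")
print(len(orders), len(left))

# Check 1: list the key values that occur more than once on the right
dup_mask = products["product_id"].duplicated(keep=False)
dup_keys = products.loc[dup_mask, "product_id"].unique()
print(dup_keys)

# Check 2: let pandas enforce a many-to-one relationship during the merge
try:
    orders.merge(products, on="product_id", how="left", validate="many_to_one")
    validation_error = None
except MergeError as exc:
    validation_error = str(exc)
print(validation_error)
```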

### Merging on a left join

In [11]:
# Create new df_merged_large_products dataframe with a left join (how = 'left') and merge flag, then view the first five rows; with a left join the data set will contain all the observations in df_merged_large and any information in the df_prods_checked (right) dataframe that can be mapped back to the dataframe on the left
df_merged_large_products = df_merged_large.merge(df_prods_checked, on = 'product_id', how = 'left', indicator = True)
df_merged_large_products.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,True,196,1,False,Soda,77,7,9.0,both
1,2539329,1,1,2,8,,True,14084,2,False,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,both
2,2539329,1,1,2,8,,True,12427,3,False,Original Beef Jerky,23,19,4.4,both
3,2539329,1,1,2,8,,True,26088,4,False,Aged White Cheddar Popcorn,23,19,4.7,both
4,2539329,1,1,2,8,,True,26405,5,False,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,both


In [12]:
# View the shape
df_merged_large_products.shape

(32435059, 15)

In [13]:
# Use value_counts() function to sum up all the values in the “_merge” column to see if there is a full match
df_merged_large_products['_merge'].value_counts()

both          32404859
left_only        30200
right_only           0
Name: _merge, dtype: int64

### Investigating why the results are not as expected i.e. 32434489 rows matched

First check for missing values, then create a dataframe containing only the rows with missing values, and finally run value_counts() on the product_id column to find which product_id's are missing.

In [14]:
# Check for missing values in the df_merged_large_products dataframe
df_merged_large_products.isnull().sum()

order_id                        0
user_id                         0
order_number                    0
orders_day_of_week              0
order_hour_of_day               0
days_since_prior_order    2078102
first_order                     0
product_id                      0
add_to_cart_order               0
reordered                       0
product_name                30200
aisle_id                    30200
department_id               30200
prices                      30200
_merge                          0
dtype: int64

In [15]:
# Create a dataframe of the 30200 rows with missing values in the product_name column
df_nan_prodname = df_merged_large_products[df_merged_large_products['product_name'].isnull()]
df_nan_prodname

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
2963,7099,27,63,3,10,1.0,False,6799,1,False,,,,,left_only
3205,1837192,27,80,2,8,6.0,False,6799,9,True,,,,,left_only
3253,3331846,28,3,0,16,2.0,False,34,6,False,,,,,left_only
3395,1385910,28,21,2,17,1.0,False,34,1,True,,,,,left_only
4694,2873174,38,1,0,10,,True,116,11,False,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32425294,443233,206139,2,1,17,17.0,False,34,2,True,,,,,left_only
32425300,1507543,206139,3,1,8,14.0,False,34,2,True,,,,,left_only
32427098,1282274,206155,4,6,9,3.0,False,2240,18,False,,,,,left_only
32431906,1586626,206194,9,1,20,17.0,False,1511,6,False,,,,,left_only


In [16]:
# Use value_counts() function to see which product_id's are missing
df_nan_prodname['product_id'].value_counts()

1511     13397
34        6536
116       4359
6799      1978
4790      1804
2240      1689
262        179
3230        55
26519       51
1780        39
2586        29
69          19
525         18
4283        17
40440       13
3736         8
3159         5
38183        4
Name: product_id, dtype: int64

**There are 30200 rows with missing values in the df_merged_large_products dataframe, with the product_id's shown in the count above. These product_id's have no details in the columns that come from the products_checked dataframe and are therefore most likely missing from that file.**

### Define functions to find the missing values and to check for the missing values in the product_id column of the dataframes

To find out whether the df_prods_checked dataframe has these product_id's, take the unique missing product_id's and, for each one, check whether it appears among the unique product_id's of the dataframe being checked. If it does, append it to the contained list; otherwise append it to the not_contained list.

In [17]:
# Return (and print) the unique product_id's in the dataframe of missing values
def missing_list_function(df):
    missing_list = df['product_id'].unique()
    print("Missing product_id list " + str(missing_list))
    return missing_list

# Check which product_id's from df appear in df1; relies on the global variables df and df1 being set before the call
def check_list_function():
    # Create empty lists contained and not_contained for the results
    print("Results:")
    contained = []
    not_contained = []

    # For every unique product_id in df, check whether it is in the df1 product_id column; if yes, append it to the contained list, otherwise append it to the not_contained list
    for product_id in df['product_id'].unique():
        if product_id in df1['product_id'].unique():
            contained.append(product_id)
        else:
            not_contained.append(product_id)
    print("Product_id column in the checked dataframe contains " + str(contained))
    print("Product_id column in the checked dataframe does not contain " + str(not_contained))
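The same check can also be written without a loop or global variables. This is a sketch of an alternative using `Series.isin`, not the function used in this notebook; the name `split_by_membership` and the sample ids are made up for illustration.

```python
import pandas as pd

def split_by_membership(missing_ids, reference_ids):
    """Return (contained, not_contained) lists, keeping first-seen order."""
    missing = pd.Series(missing_ids).drop_duplicates()
    # True where the id also occurs among the reference ids
    mask = missing.isin(set(reference_ids))
    return missing[mask].tolist(), missing[~mask].tolist()

# Hypothetical ids: only '196' exists in the reference list
contained, not_contained = split_by_membership(
    ["6799", "34", "196", "34"],
    ["196", "14084", "12427"],
)
print(contained)
print(not_contained)
```

Passing both inputs as arguments makes each call self-describing, so there is no need to reassign `df` and `df1` before every check.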

#### Check the df_prods_checked dataframe

In [18]:
#Run the missing_list_function to find the missing product_id's
missing_list_function(df_nan_prodname) 

#set df and df1 variables and run the check_list_function to see if df1 contains the missing product_id's
df = df_nan_prodname
df1 = df_prods_checked           
check_list_function()

Missing product_id list ['6799' '34' '116' '1511' '2240' '4790' '26519' '3230' '3159' '262' '1780'
 '525' '38183' '2586' '4283' '40440' '3736' '69']
Results:
Product_id column in the checked dataframe contains []
Product_id column in the checked dataframe does not contain ['6799', '34', '116', '1511', '2240', '4790', '26519', '3230', '3159', '262', '1780', '525', '38183', '2586', '4283', '40440', '3736', '69']


**The df_prods_checked product_id column does not contain any of the missing product_id's, hence the product name and other details are missing.**

#### Check the df_merged_large dataframe

In [19]:
#set df and df1 variables and run the check_list_function to see if df1 contains the missing product_id's
df = df_nan_prodname
df1 = df_merged_large           
check_list_function()

Results:
Product_id column in the checked dataframe contains ['6799', '34', '116', '1511', '2240', '4790', '26519', '3230', '3159', '262', '1780', '525', '38183', '2586', '4283', '40440', '3736', '69']
Product_id column in the checked dataframe does not contain []


#### Import the orders_products_prior.csv file and check the df_ords_prior dataframe

In [20]:
# Use the os.path.join() function to import the orders_products_prior.csv file as a pandas dataframe without the index column
df_ords_prior = pd.read_csv(os.path.join(path, '02_Data', '02_1_Original_Data', 'orders_products_prior.csv'), index_col = False)

#set df and df1 variables and run the check_list_function to see if df1 contains the missing product_id's
df = df_nan_prodname
df1 = df_ords_prior         
check_list_function()

Results:


  if product_id in df1['product_id'].unique():


Product_id column in the checked dataframe contains []
Product_id column in the checked dataframe does not contain ['6799', '34', '116', '1511', '2240', '4790', '26519', '3230', '3159', '262', '1780', '525', '38183', '2586', '4283', '40440', '3736', '69']
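This result is surprising, because df_merged_large was built from this file and does contain the ids. The missing ids print with quotes (strings), whereas read_csv normally parses a numeric column as integers, so a data type mismatch is one possible cause. That is an assumption on my part, sketched here on made-up data rather than confirmed on the real file.

```python
import pandas as pd

# Hypothetical data: ids stored as strings on one side, integers on the other
missing_ids = pd.Series(["34", "116"], name="product_id")
raw = pd.DataFrame({"product_id": [34, 116, 196]})

print(missing_ids.dtype, raw["product_id"].dtype)

# Comparing strings with integers never matches: '34' is not equal to 34
naive_overlap = set(missing_ids) & set(raw["product_id"])

# Casting both sides to the same type before comparing finds the ids
aligned_overlap = set(missing_ids) & set(raw["product_id"].astype(str))

print(naive_overlap)
print(aligned_overlap)
```

Printing the dtype of the product_id column in both dataframes before running the check would show whether this applies to the real data.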


**At this point I would ask whether the data for these missing product_id's is available, e.g. are they old products that have since been dropped? With that option not available, and the 30200 missing values making up less than 0.1% of the data, the decision was made to drop the unmatched records and merge the data on an inner join.**
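The share of unmatched rows can be checked directly from the counts reported above (30200 left_only rows out of 32435059 rows in the left-joined dataframe):

```python
# Counts taken from the '_merge' value_counts() and shape outputs above
left_only_rows = 30200
total_rows = 32435059

share_pct = left_only_rows / total_rows * 100
print(f"{share_pct:.4f}% of rows have no product details")
```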

*#Code to find unmatched foreign keys: filter the records with 'left_only' values using the "_merge" flag and apply the unique function*<br>
df[df['_merge'].eq('left_only')]['col_name'].unique()
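A runnable version of this pattern on made-up data, with `product_id` standing in for `col_name`; the frames are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Hypothetical stand-ins: only product_id '196' has a product record
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "product_id": ["196", "34", "34", "1511"],
})
products = pd.DataFrame({"product_id": ["196"], "product_name": ["Soda"]})

merged = orders.merge(products, on="product_id", how="left", indicator=True)

# Keep rows flagged 'left_only', then list the distinct unmatched keys
unmatched = merged.loc[merged["_merge"].eq("left_only"), "product_id"].unique()
print(unmatched)
```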

## 05. Merging the df_merged_large and df_prods_checked dataframes 

Due to the missing product_id's an inner join was performed in order to achieve a full match.

In [21]:
# Create new df_merged_orders_products dataframe with the default “inner” join, so the data set will only contain observations present in both input data sets, then view the first five rows
df_merged_orders_products = df_merged_large.merge(df_prods_checked, on = 'product_id', indicator = True)
df_merged_orders_products.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,True,196,1,False,Soda,77,7,9.0,both
1,2398795,1,2,3,7,15.0,False,196,1,True,Soda,77,7,9.0,both
2,473747,1,3,3,12,21.0,False,196,1,True,Soda,77,7,9.0,both
3,2254736,1,4,4,7,29.0,False,196,1,True,Soda,77,7,9.0,both
4,431534,1,5,4,15,28.0,False,196,1,True,Soda,77,7,9.0,both


In [22]:
# View the shape
df_merged_orders_products.shape

(32404859, 15)

In [23]:
# Use value_counts() function to sum up all the values in the “_merge” column to see if there is a full match
df_merged_orders_products['_merge'].value_counts()

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64

## 06. Exporting the df_ords_prior and df_ords_clean merged dataframe as pkl file

In [24]:
# Export the df_merged_orders_products dataframe to pkl
df_merged_orders_products.to_pickle(os.path.join(path, '02_Data', '02_2_Prepared_Data', 'orders_products_merged.pkl'))
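After exporting, a quick round-trip check can confirm the file reads back unchanged. This sketch uses a small made-up dataframe and a temporary folder rather than the project path, so it can be run anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the merged dataframe
df_small = pd.DataFrame({
    "order_id": [2539329, 2398795],
    "product_name": ["Soda", "Soda"],
    "prices": [9.0, 9.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "orders_products_merged.pkl")
    df_small.to_pickle(out_path)         # export
    reloaded = pd.read_pickle(out_path)  # import again

# True when values, dtypes and index all survived the round trip
round_trip_ok = reloaded.equals(df_small)
print(round_trip_ok, reloaded.shape)
```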