# Table of Contents

This notebook contains the following: 

* Importing libraries 
* Turning project path into a string 
* Importing data sets 
* Wrangling data (renaming columns, identifying missing values, etc.)
* Carrying out data quality and consistency checks 
* Converting data types
* Merging data sets 


# This notebook is Part 1 of Task 4.9


## Step 1


## Download the customer data set and add it to your “Original Data” folder.


### Done


## Step 2


## Create a new notebook in your “Scripts” folder for part 1 of this task.


### Done


## Step 3


## Import your analysis libraries, as well as your new customer data set as a dataframe.

### 01. Importing libraries

In [1]:
# Import libraries 

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

### 02. Turning project path into a string

In [2]:
#Turn project folder path into a string

'/Users/aysha/Documents/Instacart Basket Analysis/'

'/Users/aysha/Documents/Instacart Basket Analysis/'

In [3]:
path = r'/Users/aysha/Documents/Instacart Basket Analysis/'

In [4]:
path

'/Users/aysha/Documents/Instacart Basket Analysis/'

### 03. Importing data (customer data as a dataframe)

In [5]:
# Import the "customers" data as a dataframe

df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))
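
A minimal sketch of how the file path above is assembled. The existence check is an optional extra, not part of the original cell, and `path_demo` and `csv_path` are illustrative names.

```python
import os

# Same project folder string as in the notebook
path_demo = r'/Users/aysha/Documents/Instacart Basket Analysis/'

# os.path.join inserts the separators between the folder names
csv_path = os.path.join(path_demo, '02 Data', 'Original Data', 'customers.csv')
print(csv_path)

# Optional guard: confirm the file is there before calling pd.read_csv
file_found = os.path.isfile(csv_path)
print(file_found)
```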

In [6]:
# Check dataframe

df_customers

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
...,...,...,...,...,...,...,...,...,...,...
206204,168073,Lisa,Case,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Jeremy,Robbins,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Doris,Richmond,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Rose,Rollins,Female,California,27,4/1/2020,1,married,99799


In [7]:
df_customers.shape

(206209, 10)

## Step 4


## Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

### Renaming the following columns:

1. First Name --> first_name
2. Surnam --> surname
3. Gender --> gender
4. STATE --> state
5. Age --> age
6. n_dependants --> number_of_dependents
7. fam_status --> marital_status

In [8]:
# Renaming columns

In [9]:
df_customers.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [10]:
df_customers.rename(columns = {'Surnam' : 'surname'}, inplace = True)

In [11]:
df_customers.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [12]:
df_customers.rename(columns = {'STATE' : 'state'}, inplace = True)

In [13]:
df_customers.rename(columns = {'Age' : 'age'}, inplace = True)

In [14]:
df_customers.rename(columns = {'n_dependants' : 'number_of_dependents'}, inplace = True)

In [15]:
df_customers.rename(columns = {'fam_status' : 'marital_status'}, inplace = True)
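
The seven separate rename calls above could also be written as one call with a single mapping. This is a sketch on a small toy dataframe (`df_demo` and `rename_map` are illustrative names), not the cell that was actually run.

```python
import pandas as pd

# Toy frame with the same raw column names as the customers file
df_demo = pd.DataFrame(
    {
        'user_id': [26711, 33890],
        'First Name': ['Deborah', 'Patricia'],
        'Surnam': ['Esquivel', 'Hart'],
        'Gender': ['Female', 'Female'],
        'STATE': ['Missouri', 'New Mexico'],
        'Age': [48, 36],
        'n_dependants': [3, 0],
        'fam_status': ['married', 'single'],
    }
)

# One mapping holds every old name -> new name pair
rename_map = {
    'First Name': 'first_name',
    'Surnam': 'surname',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'number_of_dependents',
    'fam_status': 'marital_status',
}

# rename returns a new dataframe, so the result is assigned back
df_demo = df_demo.rename(columns=rename_map)
print(df_demo.columns.tolist())
```

Keeping the mapping in one dictionary makes the list of renamed columns easy to review in a single place.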

In [16]:
# Check dataframe after renaming

df_customers

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependents,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
...,...,...,...,...,...,...,...,...,...,...
206204,168073,Lisa,Case,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Jeremy,Robbins,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Doris,Richmond,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Rose,Rollins,Female,California,27,4/1/2020,1,married,99799


## Step 5


## Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.


### Descriptive Statistics for numeric values

In [17]:
# Checking for descriptive statistics for the numeric values in the dataframe (df_customers)

df_customers.describe()

Unnamed: 0,user_id,age,number_of_dependents,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


### Checking for Data types for each column

In [18]:
# Checking Data types

df_customers.dtypes

user_id                  int64
first_name              object
surname                 object
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependents     int64
marital_status          object
income                   int64
dtype: object

### Checking for missing values

In [20]:
# Checking for missing values

df_customers.isnull().sum()

user_id                     0
first_name              11259
surname                     0
gender                      0
state                       0
age                         0
date_joined                 0
number_of_dependents        0
marital_status              0
income                      0
dtype: int64

In [22]:
# Create a new dataframe, df_nan, containing only the rows where "first_name" is missing
# (isnull() already returns True/False, so no comparison with True is needed)

df_nan = df_customers[df_customers['first_name'].isnull()]

In [23]:
df_nan

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependents,marital_status,income
53,76659,,Gilbert,Male,Colorado,26,1/1/2017,2,married,41709
73,13738,,Frost,Female,Louisiana,39,1/1/2017,0,single,82518
82,89996,,Dawson,Female,Oregon,52,1/1/2017,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1/1/2017,1,married,155673
105,29778,,Dawson,Female,Utah,63,1/1/2017,3,married,151819
...,...,...,...,...,...,...,...,...,...,...
206038,121317,,Melton,Male,Pennsylvania,28,3/31/2020,3,married,87783
206044,200799,,Copeland,Female,Hawaii,52,4/1/2020,2,married,108488
206090,167394,,Frost,Female,Hawaii,61,4/1/2020,1,married,45275
206162,187532,,Floyd,Female,California,39,4/1/2020,0,single,56325


### There are 11,259 missing values in the column titled 'first_name'. This will probably not affect the analysis, as user_id is used as the key identifier for customers. Therefore, leaving the missing values as they are seems like a reasonable solution.
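
The 11,259 missing first names are roughly 5.5% of the 206,209 rows. A small sketch of how that share, and the uniqueness of the key that justifies ignoring the gaps, could be checked. It uses a toy dataframe, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {
        'user_id': [76659, 26711, 13738],
        'first_name': [np.nan, 'Deborah', np.nan],
        'surname': ['Gilbert', 'Esquivel', 'Frost'],
    }
)

# isnull() returns booleans, so it can be used directly as a row mask
missing_mask = df_demo['first_name'].isnull()
df_nan_demo = df_demo[missing_mask]

# The mean of a boolean mask is the share of rows that are missing
missing_share = missing_mask.mean()

# The key used to identify customers should have no repeats
key_is_unique = df_demo['user_id'].is_unique

print(len(df_nan_demo), round(missing_share, 3), key_is_unique)
```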

### Checking for mixed type data

In [24]:
# Checking for mixed type data 

for col in df_customers.columns.tolist():
  weird = (df_customers[[col]].applymap(type) != df_customers[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_customers[weird]) > 0:
    print (col)

first_name


### The column titled 'first_name' has mixed type data.
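
A possible alternative to the loop above, sketched on a toy dataframe: count the distinct Python types per column with `Series.map(type)`. This is offered as an assumption-labelled variant, partly because `DataFrame.applymap` is deprecated in recent pandas releases (2.1 and later) in favour of `DataFrame.map`. The name `mixed_cols` is illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {
        'user_id': [1, 2, 3],
        # dtype=object keeps the raw Python values, so NaN stays a float
        'first_name': pd.Series(['Linda', np.nan, 'Rose'], dtype=object),
    }
)

# A column counts as "mixed" when its values have more than one Python type
mixed_cols = [
    col for col in df_demo.columns
    if df_demo[col].map(type).nunique() > 1
]
print(mixed_cols)
```

Here the missing value is a float while the names are strings, which is the same reason 'first_name' is flagged in the customers dataframe.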

### Converting mixed type data for column 'first_name'

In [25]:
# Convert data type for column titled 'first_name' to string

df_customers['first_name'] = df_customers['first_name'].astype('str')
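
One side effect worth knowing: on an object column, `astype('str')` typically turns missing values into the literal text 'nan', which is consistent with `info()` reporting 206209 non-null 'first_name' values later in this notebook. If an explicit label is preferred, a hedged alternative is to fill the gaps first. 'Unknown' below is a hypothetical placeholder, not a value from the data set.

```python
import numpy as np
import pandas as pd

first_name_demo = pd.Series(['Linda', np.nan, 'Rose'], dtype=object)

# Fill the gaps with an explicit label first, then cast to string
# ('Unknown' is a hypothetical placeholder chosen for this sketch)
cleaned = first_name_demo.fillna('Unknown').astype('str')

print(cleaned.tolist())
```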

In [26]:
# Re-run checks for: 
# Mixed Type Data 

for col in df_customers.columns.tolist():
  weird = (df_customers[[col]].applymap(type) != df_customers[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_customers[weird]) > 0:
    print (col)

### No mixed type data now

In [27]:
# Re-run checks for: 
# Data type 

df_customers.dtypes

user_id                  int64
first_name              object
surname                 object
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependents     int64
marital_status          object
income                   int64
dtype: object

### Checking for duplicates

In [28]:
df_dups = df_customers[df_customers.duplicated()]

In [29]:
df_dups

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependents,marital_status,income


### There are no duplicates in the dataframe
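
`duplicated()` with no arguments only flags rows that are identical in every column. Since 'user_id' is the merge key, an extra check on that column alone may also be useful. This is a sketch on toy data, not a check that was run in this notebook.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        'user_id': [1, 2, 2],
        'surname': ['Nguyen', 'Hart', 'Hartt'],
    }
)

# Fully identical rows (what duplicated() checks by default)
full_dups = df_demo[df_demo.duplicated()]

# Rows that repeat the key even though another column differs
key_dups = df_demo[df_demo.duplicated(subset=['user_id'])]

print(len(full_dups), len(key_dups))
```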

## Step 6

## Combine your customer data with the rest of your prepared Instacart data. Hint: Make sure the key columns are the same data type!

In [30]:
# Check data usage of customers dataframe 

df_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   user_id               206209 non-null  int64 
 1   first_name            206209 non-null  object
 2   surname               206209 non-null  object
 3   gender                206209 non-null  object
 4   state                 206209 non-null  object
 5   age                   206209 non-null  int64 
 6   date_joined           206209 non-null  object
 7   number_of_dependents  206209 non-null  int64 
 8   marital_status        206209 non-null  object
 9   income                206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


### Convert and reduce data types to save memory before merge

In [31]:
# Convert data type for column titled 'user_id' to string

df_customers['user_id'] = df_customers['user_id'].astype('str')

In [32]:
# Reduce data type for columns titled 'age' and 'number_of_dependents' from int64 to int8 (their maximum values, 81 and 3, fit within the int8 range)

df_customers['age'] = df_customers['age'].astype('int8')
df_customers['number_of_dependents'] = df_customers['number_of_dependents'].astype('int8')

In [33]:
# Reduce data type for column titled 'income' from int64 to int32 (its maximum value, 593901, fits within the int32 range)

df_customers['income'] = df_customers['income'].astype('int32')

In [34]:
# Rechecking data usage of dataframe

df_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   user_id               206209 non-null  object
 1   first_name            206209 non-null  object
 2   surname               206209 non-null  object
 3   gender                206209 non-null  object
 4   state                 206209 non-null  object
 5   age                   206209 non-null  int8  
 6   date_joined           206209 non-null  object
 7   number_of_dependents  206209 non-null  int8  
 8   marital_status        206209 non-null  object
 9   income                206209 non-null  int32 
dtypes: int32(1), int8(2), object(7)
memory usage: 12.2+ MB
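
Before downcasting, the value range can be compared against the limits of the target type. The sketch below uses the minimum and maximum values reported by `describe()` earlier. The helper `fits` is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Min, median and max values taken from the describe() output
df_demo = pd.DataFrame(
    {'age': [18, 49, 81], 'income': [25903, 93547, 593901]}
)

def fits(series, dtype):
    """Return True if every value of `series` is representable in `dtype`."""
    limits = np.iinfo(dtype)
    return bool(limits.min <= series.min() and series.max() <= limits.max)

age_fits_int8 = fits(df_demo['age'], np.int8)            # int8 holds -128..127
income_fits_int16 = fits(df_demo['income'], np.int16)    # int16 holds -32768..32767
income_fits_int32 = fits(df_demo['income'], np.int32)

# Only downcast when the check passes, otherwise values would overflow
if age_fits_int8:
    df_demo['age'] = df_demo['age'].astype('int8')
if income_fits_int32:
    df_demo['income'] = df_demo['income'].astype('int32')

print(age_fits_int8, income_fits_int16, income_fits_int32)
print(df_demo.dtypes)
```

The check shows why 'income' needs int32 rather than int16: its maximum of 593901 is above the int16 limit.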


### Merge customers dataset with the rest of your prepared Instacart data

In [35]:
# Import rest of the prepared Instacart dataset 

df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_updated_4_8.pkl'))

In [37]:
# Check shape of imported df

df_ords_prods_merged.shape

(32404859, 25)

In [38]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_time_of_day,days_since_prior_order,first_time_customers,product_id,add_to_cart_order,reordered,...,price_range_loc,busiest day,Busiest days,Busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag,Frequency_of_customer,order_frequency_flag
0,2539329,1,1,2,8,,True,196,1,0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,False,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,False,196,1,1,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer


In [39]:
# Check columns of imported df

df_ords_prods_merged.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_time_of_day', 'days_since_prior_order', 'first_time_customers',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', '_merge', 'price_range_loc',
       'busiest day', 'Busiest days', 'Busiest_period_of_day', 'max_order',
       'loyalty_flag', 'average_price', 'spending_flag',
       'Frequency_of_customer', 'order_frequency_flag'],
      dtype='object')

In [40]:
# Drop column '_merge' from imported df (the result is not assigned here, so the dataframe itself stays unchanged)

df_ords_prods_merged.drop(columns = ['_merge'])

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_time_of_day,days_since_prior_order,first_time_customers,product_id,add_to_cart_order,reordered,...,price_range_loc,busiest day,Busiest days,Busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag,Frequency_of_customer,order_frequency_flag
0,2539329,1,1,2,8,,True,196,1,0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Average orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,False,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,False,196,1,1,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low Spender,20.5,Non-frequent customer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32404854,1320836,202557,17,2,15,1.0,False,43553,2,1,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.905655,Low Spender,8.0,Frequent customer
32404855,31526,202557,18,5,11,3.0,False,43553,2,1,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.905655,Low Spender,8.0,Frequent customer
32404856,758936,203436,1,2,7,,True,42338,4,0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,3,New customer,7.631579,Low Spender,15.0,Regular customer
32404857,2745165,203436,2,3,5,15.0,False,42338,16,1,...,Mid-range product,Regularly busy,Slowest days,Fewest orders,3,New customer,7.631579,Low Spender,15.0,Regular customer


In [41]:
# Re-check columns

df_ords_prods_merged.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_time_of_day', 'days_since_prior_order', 'first_time_customers',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', '_merge', 'price_range_loc',
       'busiest day', 'Busiest days', 'Busiest_period_of_day', 'max_order',
       'loyalty_flag', 'average_price', 'spending_flag',
       'Frequency_of_customer', 'order_frequency_flag'],
      dtype='object')

In [42]:
# Drop column name '_merge' from imported df and assign to new df 'df_ords_prods_merged_updated'

df_ords_prods_merged_updated = df_ords_prods_merged.drop(columns = ['_merge'])
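
The two drop cells behave differently because `drop` returns a new dataframe and leaves the original untouched unless the result is assigned (or `inplace = True` is passed). A minimal sketch on toy data:

```python
import pandas as pd

df_demo = pd.DataFrame({'order_id': [1, 2], '_merge': ['both', 'both']})

# Without assignment the result is discarded; df_demo keeps '_merge'
df_demo.drop(columns=['_merge'])
still_there = '_merge' in df_demo.columns

# Assigning the result to a new name keeps the change
df_demo_updated = df_demo.drop(columns=['_merge'])

print(still_there, df_demo_updated.columns.tolist())
```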

In [43]:
# Re-check columns

df_ords_prods_merged_updated.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_time_of_day', 'days_since_prior_order', 'first_time_customers',
       'product_id', 'add_to_cart_order', 'reordered', 'product_name',
       'aisle_id', 'department_id', 'prices', 'price_range_loc', 'busiest day',
       'Busiest days', 'Busiest_period_of_day', 'max_order', 'loyalty_flag',
       'average_price', 'spending_flag', 'Frequency_of_customer',
       'order_frequency_flag'],
      dtype='object')

In [44]:
df_ords_prods_merged_updated.shape

(32404859, 24)

In [45]:
# Check data type of merged dataframe (df_ords_prods_merged_updated)

df_ords_prods_merged_updated.dtypes

order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_time_of_day           int64
days_since_prior_order    float64
first_time_customers         bool
product_id                  int64
add_to_cart_order           int64
reordered                   int64
product_name               object
aisle_id                    int64
department_id               int64
prices                    float64
price_range_loc            object
busiest day                object
Busiest days               object
Busiest_period_of_day      object
max_order                   int64
loyalty_flag               object
average_price             float64
spending_flag              object
Frequency_of_customer     float64
order_frequency_flag       object
dtype: object

### As the merge is on 'user_id', the column must have the same data type in both dataframes (as per the hint in the task step). It is a string in df_customers, so it is converted to a string here as well.

In [46]:
# Convert data type for column titled 'user_id' to string

df_ords_prods_merged_updated['user_id'] = df_ords_prods_merged_updated['user_id'].astype('str')

In [47]:
# Re-check data type of df (df_ords_prods_merged_updated)

df_ords_prods_merged_updated.dtypes

order_id                    int64
user_id                    object
order_number                int64
orders_day_of_week          int64
order_time_of_day           int64
days_since_prior_order    float64
first_time_customers         bool
product_id                  int64
add_to_cart_order           int64
reordered                   int64
product_name               object
aisle_id                    int64
department_id               int64
prices                    float64
price_range_loc            object
busiest day                object
Busiest days               object
Busiest_period_of_day      object
max_order                   int64
loyalty_flag               object
average_price             float64
spending_flag              object
Frequency_of_customer     float64
order_frequency_flag       object
dtype: object
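
A small sketch of checking that the key columns agree before merging. It uses toy dataframes, and `orders_demo` and `customers_demo` are illustrative names. Casting both keys to string mirrors the choice made in this notebook; keeping both as integers would be another valid option.

```python
import pandas as pd

orders_demo = pd.DataFrame({'user_id': [1, 1, 2], 'order_id': [10, 11, 12]})
customers_demo = pd.DataFrame({'user_id': ['1', '2'], 'state': ['Alabama', 'Iowa']})

# Compare the key dtypes before merging
same_before = orders_demo['user_id'].dtype == customers_demo['user_id'].dtype

# Align both keys to one dtype (string here, as in the notebook)
orders_demo['user_id'] = orders_demo['user_id'].astype('str')
customers_demo['user_id'] = customers_demo['user_id'].astype('str')

same_after = orders_demo['user_id'].dtype == customers_demo['user_id'].dtype
print(same_before, same_after)
```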

### Merge both dataframes on 'user_id'

In [48]:
# Merging dataframes (df_customers to the df_ords_prods_merged_updated dataframe)

df_ords_prods_custs = df_ords_prods_merged_updated.merge(df_customers, on = 'user_id', indicator = True)

In [49]:
# Check shape of merged df

df_ords_prods_custs.shape

(32404859, 34)

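The merge added 10 new columns (nine from the customers dataframe plus the '_merge' indicator), taking the total from 24 to 34. The row count is unchanged at 32,404,859, so every order row found a matching customer, which is as expected.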
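
The shape check can be made more explicit, sketched here on toy data. `validate = 'm:1'` is an optional extra that was not used in the merge above. It asks pandas to raise an error if a 'user_id' appears more than once on the customers side.

```python
import pandas as pd

orders_demo = pd.DataFrame({'user_id': ['1', '1', '2'], 'order_id': [10, 11, 12]})
customers_demo = pd.DataFrame(
    {'user_id': ['1', '2'], 'state': ['Alabama', 'Iowa'], 'age': [31, 40]}
)

merged_demo = orders_demo.merge(
    customers_demo,
    on='user_id',
    how='inner',       # the default join type, written out for clarity
    validate='m:1',    # many order rows to one customer row
    indicator=True,    # adds the '_merge' column
)

# Columns: left + right, minus the shared key, plus the indicator
expected_cols = orders_demo.shape[1] + customers_demo.shape[1] - 1 + 1

print(merged_demo.shape, expected_cols)
print(merged_demo['_merge'].value_counts())
```

With the notebook's dataframes the same arithmetic gives 24 + 10 - 1 + 1 = 34 columns.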

In [50]:
# Print head of merged df

df_ords_prods_custs.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_time_of_day,days_since_prior_order,first_time_customers,product_id,add_to_cart_order,reordered,...,first_name,surname,gender,state,age,date_joined,number_of_dependents,marital_status,income,_merge
0,2539329,1,1,2,8,,True,196,1,0,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
1,2398795,1,2,3,7,15.0,False,196,1,1,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
2,473747,1,3,3,12,21.0,False,196,1,1,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
3,2254736,1,4,4,7,29.0,False,196,1,1,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
4,431534,1,5,4,15,28.0,False,196,1,1,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both


## Step 7 

### Ensure your notebook contains logical titles, section headings, and descriptive code comments.

### Done

## Step 8

### Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.

In [51]:
# Export dataframe as a pickle file (pkl)

df_ords_prods_custs.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_customers_all_data.pkl'))
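
A quick way to confirm an export worked is to read the file back and compare it with the dataframe in memory. The sketch below does this on toy data in a temporary folder, so it does not touch the project files. All names in it are illustrative.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({'user_id': ['1', '2'], 'income': [40423, 59285]})

# Write to a temporary folder, then read the pickle back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.pkl')
    df_demo.to_pickle(demo_path)
    df_back = pd.read_pickle(demo_path)

# equals() compares values and dtypes of the two dataframes
round_trip_ok = df_demo.equals(df_back)
print(round_trip_ok)
```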

## Step 9 

### Save your notebook so that you can send it to your tutor for review after completing part 2.