# **4.9 IC Intro to Data Visualisation with Python – Part 1**

# **Incorporating Customer Data into Instacart Project**
In this notebook, I incorporate the new customer dataset into the Instacart project. I wrangle the customer data, check for inconsistencies and duplicates, and merge it with the previously prepared orders and products dataset to create a single comprehensive dataframe.

## Table of Contents  
- [1. Import Libraries](#1-import-libraries)  
- [2. Import Data](#2-import-data)
- [3. Wrangle Data](#3-wrangle-data)  
- [4. Data Quality Checks](#4-data-quality-checks)  
- [5. Merge Customer Data with Orders & Products](#5-merge-customer-data-with-orders-&-products)  
- [6. Export Merged Data](#6-export-merged-data)

---

## 1. Import Libraries
I begin by importing the libraries needed for this analysis:
- `pandas` and `numpy` for data handling and numerical operations
- `os` for file handling

In [1]:
# Import libraries 
import pandas as pd
import numpy as np
import os

---

## 2. Import Data
I define the project path and import the new customer dataset into a dataframe. This will allow me to incorporate customer details into my analysis.

In [2]:
# Define the project folder path
path = r'/Users/yaseminmustafa/Desktop/CareerFoundry/Exercise 4/15-05-2025_Instacart Basket Analysis'

In [3]:
# Import customer dataframe
df_customers = pd.read_csv(os.path.join(path,"02_Data/Original Data/customers.csv"), index_col = False)

---

## 3. Wrangle Data
The dataset has inconsistent column names. Here, I rename the columns to be more intuitive and consistent. This will make later analysis and merging with other datasets easier.

In [4]:
# Check output
df_customers.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [5]:
# Rename 'Surnam' column
df_customers.rename(columns = {'Surnam' : 'surname'}, inplace = True)

In [6]:
# Rename 'First Name' column
df_customers.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [7]:
# Rename 'STATE' column
df_customers.rename(columns = {'STATE' : 'state'}, inplace = True)

In [8]:
# Rename 'n_dependants' column
df_customers.rename(columns = {'n_dependants' : 'number_of_dependants'}, inplace = True)

In [9]:
# Rename 'fam_status' column
df_customers.rename(columns = {'fam_status' : 'family_status'}, inplace = True)

In [10]:
# 'income' already follows the naming convention, so this rename leaves it unchanged
df_customers.rename(columns = {'income' : 'income'}, inplace = True)

In [11]:
# Rename 'Age' column
df_customers.rename(columns = {'Age' : 'age'}, inplace = True)

In [12]:
# Rename 'Gender' column
df_customers.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [13]:
# Check output
df_customers.head()

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [14]:
# Check shape
df_customers.shape

(206209, 10)
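
The renames above can also be expressed as a single call that takes one dictionary mapping old names to new names. The sketch below uses a small, made-up dataframe (`df_demo`) with the original column names, so it is an illustration under that assumption rather than the code used in this notebook.

```python
import pandas as pd

# Toy dataframe with the original column names (values are invented for illustration)
df_demo = pd.DataFrame({
    'user_id': [1, 2],
    'First Name': ['Ann', 'Bob'],
    'Surnam': ['Lee', 'Ray'],
    'Gender': ['Female', 'Male'],
    'STATE': ['Iowa', 'Idaho'],
    'Age': [30, 41],
    'date_joined': ['1/1/2017', '1/2/2017'],
    'n_dependants': [0, 2],
    'fam_status': ['single', 'married'],
    'income': [50000, 72000],
})

# One mapping from old column names to new column names
column_map = {
    'First Name': 'first_name',
    'Surnam': 'surname',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'number_of_dependants',
    'fam_status': 'family_status',
}

# Columns not listed in the mapping (user_id, date_joined, income) keep their names
df_demo = df_demo.rename(columns=column_map)
print(df_demo.columns.tolist())
```

Keeping the mapping in one dictionary makes it easier to review every rename at a glance.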

---

## 4. Data Quality Checks
### 4a. Check for Missing Values

I check for missing values to understand the completeness of the dataset. Missing values can affect analysis and need to be addressed.

In [15]:
# Check for missing values
df_customers.isnull().sum()

user_id                     0
first_name              11259
surname                     0
gender                      0
state                       0
age                         0
date_joined                 0
number_of_dependants        0
family_status               0
income                      0
dtype: int64
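
Raw counts are easier to judge next to the share of rows they represent (here, 11259 of 206209 rows is roughly 5.5%). A minimal sketch of that calculation, using a made-up dataframe `df_demo` rather than the real customer data:

```python
import numpy as np
import pandas as pd

# Toy dataframe with two missing first names (values are invented for illustration)
df_demo = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'first_name': ['Ann', np.nan, 'Cara', np.nan],
})

# Count of missing values per column
missing_count = df_demo.isnull().sum()

# Share of missing values per column, as a percentage of all rows
missing_percent = df_demo.isnull().mean() * 100

summary = pd.DataFrame({'missing': missing_count, 'percent': missing_percent})
print(summary)
```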

### 4b. Check for Mixed Data Types
I verify that each column contains consistent data types. Mixed types in a column can cause errors in analysis or calculations.

In [16]:
# Check for mixed types: flag any column containing a value whose type differs from the type in its first row
for col in df_customers.columns.tolist():
    weird = (df_customers[[col]].map(type) != df_customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_customers[weird]) > 0:
        print(col)

first_name
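
An alternative way to run the same kind of check is to count how many distinct Python types each column holds. The helper below, `mixed_type_columns`, is a hypothetical name introduced for this sketch, and `df_demo` is made-up data; it is not the check used in this notebook.

```python
import numpy as np
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    mixed = []
    for col in df.columns:
        # Map every value to its type, then count the distinct types
        n_types = df[col].map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed

# Toy dataframe: first_name mixes strings with a NaN (a float)
df_demo = pd.DataFrame({
    'user_id': [1, 2, 3],
    'first_name': ['Ann', np.nan, 'Cara'],
})

print(mixed_type_columns(df_demo))
```

Returning a list instead of printing makes the result easy to reuse, for example to loop over the flagged columns.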


### 4c. Handle Missing Values
The `first_name` column has 11,259 missing values. I replace them with the placeholder 'Unknown' to maintain consistency and then convert the column to string type.

In [17]:
# Inspect the types of data in 'first_name'
df_customers['first_name'].map(type).value_counts()

first_name
<class 'str'>      194950
<class 'float'>     11259
Name: count, dtype: int64

*This is because names are stored as strings (e.g. 'Alice'), whereas missing values (NaN) are stored as floats. These missing values will be replaced with the placeholder 'Unknown'.*
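
To see this behaviour in isolation, here is a small sketch on a made-up Series (`names`), assuming an object-dtype column like the one produced by `read_csv` here:

```python
import numpy as np
import pandas as pd

# Toy object-dtype Series with one missing value (invented for illustration)
names = pd.Series(['Alice', np.nan, 'Bob'], dtype=object)

print(type(names[0]))  # the name is a str
print(type(names[1]))  # NaN is a float

# After filling the gap with a placeholder, only one type remains
filled = names.fillna('Unknown')
print(filled.map(type).nunique())
```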

In [18]:
# Replace NaN in 'first_name' column with placeholder 'Unknown'
df_customers['first_name'] = df_customers['first_name'].fillna('Unknown')

In [19]:
# Convert 'first_name' column to string
df_customers['first_name'] = df_customers['first_name'].astype(str)

In [20]:
# Re-check for mixed types (no output means every column now has a single type)
for col in df_customers.columns.tolist():
    weird = (df_customers[[col]].map(type) != df_customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_customers[weird]) > 0:
        print(col)

In [21]:
# Check shape
df_customers.shape

(206209, 10)

### 4d. Check for Duplicates
I check for duplicate rows to avoid double-counting customers in the analysis.

In [22]:
# Checking for duplicates 
df_customers.duplicated().sum()

0
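
`duplicated()` with no arguments only flags rows that match on every column. Since `user_id` is the merge key, it can also be worth checking that the key itself is unique. The sketch below shows the difference on a made-up dataframe (`df_demo`); it is an optional extra check, not one performed in this notebook.

```python
import pandas as pd

# Toy dataframe: user_id 2 appears twice, but with different first names
df_demo = pd.DataFrame({
    'user_id': [1, 2, 2],
    'first_name': ['Ann', 'Bob', 'Robert'],
})

# Rows that are identical across all columns
full_row_dupes = df_demo.duplicated().sum()

# Rows that repeat an earlier user_id, regardless of the other columns
id_dupes = df_demo.duplicated(subset=['user_id']).sum()

print(full_row_dupes, id_dupes)
```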

---

## 5. Merge Customer Data with Orders & Products
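Next, I import the prepared `ords_prods_merge` dataframe (which combines orders and products) and merge it with the customer dataframe on `user_id`. Because `merge` performs an inner join by default, only rows whose `user_id` appears in both dataframes are kept, and each order row is linked to the matching customer's details.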

In [23]:
# Import df_ords_prods_merge
df_ords_prods_merge = pd.read_pickle(os.path.join(path,"02_Data/Prepared Data/ords_prods_merge_grouped.pkl"))

In [24]:
# Merge df_customers with df_ords_prods_merge on 'user_id'
df_ords_prods_all = df_customers.merge(df_ords_prods_merge, on = 'user_id')

In [25]:
# Check output
df_ords_prods_all.head()

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,family_status,income,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_spending,spending_flag,median_order,order_frequency_flag
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665,...,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.205882,Low spender,19.0,Regular customer
1,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665,...,Mid-range product,Regularly busy,Regularly busy,Most orders,8.0,New customer,8.205882,Low spender,19.0,Regular customer
2,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665,...,Low-range product,Regularly busy,Regularly busy,Most orders,8.0,New customer,8.205882,Low spender,19.0,Regular customer
3,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665,...,Mid-range product,Regularly busy,Regularly busy,Most orders,8.0,New customer,8.205882,Low spender,19.0,Regular customer
4,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665,...,Low-range product,Regularly busy,Busiest days,Average orders,8.0,New customer,8.205882,Low spender,19.0,Regular customer


In [26]:
# Check shape
df_ords_prods_all.shape

(30328763, 34)
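
One way to confirm how well two dataframes match before relying on an inner join is to run an outer merge with `indicator=True`, and to use `validate` to assert that each customer appears only once. The sketch below uses two made-up dataframes (`customers` and `orders`); it illustrates the technique and is not a step run in this notebook.

```python
import pandas as pd

# Toy data (invented): customer 3 has no orders, and order 13 has no customer record
customers = pd.DataFrame({
    'user_id': [1, 2, 3],
    'state': ['Iowa', 'Idaho', 'Maryland'],
})
orders = pd.DataFrame({
    'user_id': [1, 1, 2, 4],
    'order_id': [10, 11, 12, 13],
})

# Outer merge keeps unmatched rows; the '_merge' column records where each row came from.
# validate='one_to_many' raises an error if user_id is not unique in the customers dataframe.
merged = customers.merge(orders, on='user_id', how='outer',
                         validate='one_to_many', indicator=True)
print(merged['_merge'].value_counts())

# The default (inner) merge keeps only the rows flagged as 'both'
inner = customers.merge(orders, on='user_id')
print(inner.shape)
```

On the full dataset an outer merge can be memory-intensive, so comparing the row count before and after the inner merge is a lighter alternative.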

## 6. Export Merged Data
Finally, I export the merged dataframe as a pickle file. Pickle preserves the dataframe structure and data types, which matters for the next notebook, where I will create visualisations and perform further analysis.

In [27]:
# Export data to pkl
df_ords_prods_all.to_pickle(os.path.join(path, '02_Data','Prepared Data', 'ords_prods_all.pkl'))
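
To illustrate why pickle is a good fit here, the sketch below writes a small made-up dataframe (`df_demo`) to a temporary folder and reads it back, checking that values and data types survive the round trip. The file name and temporary folder are assumptions for the example only.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with integer, float, and categorical columns (values are invented)
df_demo = pd.DataFrame({
    'user_id': [1, 2],
    'income': [50000.0, 72000.0],
    'state': pd.Categorical(['Iowa', 'Idaho']),
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'demo.pkl')
    df_demo.to_pickle(file_path)
    df_check = pd.read_pickle(file_path)

# Values and dtypes (including the categorical column) are unchanged
print(df_check.equals(df_demo))
print(df_check.dtypes)
```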