## Introduction
In this task, I will incorporate customer data into my Instacart dataset and perform exploratory data analysis using visualizations. 
The goal is to uncover insights that can help Instacart stakeholders make data-driven marketing and business decisions. 

#### Table of Contents
1. Introduction
2. Importing Libraries and Loading Data
3. Data Quality and Cleaning
    - Handling Missing Values
    - Handling Duplicates
4. Checking and Fixing Data Types
5. Merging Customer Data with Instacart Data
    - Checking Key Columns Before Merging
    - Merging the Datasets
6. Exporting the Merged Dataset
7. Next Steps

### **Key Objectives:**
- Merge customer data with the Instacart dataset.
- Perform data cleaning and consistency checks.
- Generate **four types of visualizations**: histograms, bar charts, scatterplots, and line charts.
- Analyze relationships between customer demographics and purchasing behavior.
- Save visualizations for final report documentation.

## Table of Contents
2. **Importing Libraries and Loading Data**
   - Import necessary libraries  
   - Load customer data and Instacart data  
   - Initial data exploration  

3. **Data Cleaning and Wrangling**  
   - Checking for missing values and duplicates  
   - Renaming columns and correcting data types  
   - Merging customer data with Instacart dataset

4. **Exploratory Data Analysis & Visualizations**  
   - Creating a histogram for order times  
   - Creating a bar chart for customer loyalty distribution  
   - Creating a line chart for spending behavior by hour  
   - Creating a scatterplot for income vs. age  

5. **Insights & Conclusions**  
   - Summary of findings  
   - Key takeaways for Instacart stakeholders  

6. **Exporting Visualizations & Saving Notebook**  
   - Save plots as PNG files  
   - Export cleaned dataset for future use


## 2. Importing Libraries and Loading Data
Before I can start working with the data, I need to set up my environment by importing the libraries I'll be using. These will help me handle data, perform calculations, and create visualizations later on.  

### **Libraries I'll Be Using:**  
- **pandas** → for handling and manipulating data  
- **numpy** → for numerical operations  
- **matplotlib & seaborn** → for data visualization  

```python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make sure all columns are visible when displaying data
pd.set_option('display.max_columns', None)
```

### Loading the Customer Data  

Now that the libraries are ready, it's time to load the customer dataset. This data contains information about Instacart users, which I'll be merging with my existing Instacart order data later.  

First, I'll load the file and check the first few rows to make sure everything looks good.  

```python
# Define the file path to the customer dataset
file_path = r"D:\YVC\Data Analytics (CF)\🐍 Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Original Data\customers.csv"

# Load the dataset into a DataFrame
df_customers = pd.read_csv(file_path)

# Display the first few rows to confirm the data loaded correctly
df_customers.head()
```

| | user_id | First Name | Surnam | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |
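The column names mix several styles (`First Name`, `Surnam`, `STATE`), and `Surnam` looks like a typo. Here is a minimal sketch of how they could be made consistent, using a two-row stand-in for `df_customers`. The new names are only my suggestions, and the rest of this notebook keeps the original names.

```python
import pandas as pd

# Small stand-in with the same column names as the customer data
sample = pd.DataFrame({
    "user_id": [26711, 33890],
    "First Name": ["Deborah", "Patricia"],
    "Surnam": ["Esquivel", "Hart"],
    "Gender": ["Female", "Female"],
    "STATE": ["Missouri", "New Mexico"],
    "Age": [48, 36],
})

# Hypothetical snake_case names (a suggestion, not applied elsewhere in this notebook)
new_names = {
    "First Name": "first_name",
    "Surnam": "surname",
    "Gender": "gender",
    "STATE": "state",
    "Age": "age",
}

# rename() returns a new DataFrame; columns missing from the mapping stay unchanged
renamed = sample.rename(columns=new_names)
print(list(renamed.columns))
```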


## 3. Data Quality and Cleaning  

Before I merge the customer dataset with my Instacart data, I need to make sure everything is clean and ready to go. That means checking for:  
- **Missing values** → to see if any important data is missing.  
- **Duplicates** → to avoid errors from repeated data.  
- **Data types** → to make sure everything is formatted correctly.  

Once I know what (if anything) needs fixing, I’ll clean up the dataset before moving forward.

```python
# Check for missing values in the customer dataset
df_customers.isnull().sum()
```

```
user_id             0
First Name      11259
Surnam              0
Gender              0
STATE               0
Age                 0
date_joined         0
n_dependants        0
fam_status          0
income              0
dtype: int64
```

### Handling Missing Values  

The check shows that `First Name` is the only column with missing values (11,259 rows). I’ll decide whether to fill or drop it after checking for duplicates and data types.
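Raw counts are easier to judge as a share of all rows. A small sketch of that calculation on a made-up four-row frame (the values are placeholders, not the real customer data):

```python
import numpy as np
import pandas as pd

# Made-up stand-in: two of four first names are missing
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "First Name": ["Linda", np.nan, "Ann", np.nan],
    "income": [40423, 59285, 99568, 42049],
})

# The mean of a boolean mask is the fraction of missing rows per column
missing_share = sample.isnull().mean() * 100
print(missing_share.round(1))
```

Running the same two lines on `df_customers` would give the percentage of missing values for each column.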

```python
# Check for duplicate rows in the customer dataset
df_customers.duplicated().sum()
```

```
0
```

### Handling Duplicates  

The check returns 0, so there are no fully duplicated rows and nothing needs to be removed.
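Called without arguments, `duplicated()` only flags rows that repeat in every column. Since `user_id` will be the merge key, it may also be worth checking that no ID appears twice with different details. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up stand-in: user 2 appears twice with different incomes
sample = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "income": [40423, 59285, 61000, 99568],
})

# Rows identical in every column
full_row_dupes = sample.duplicated().sum()

# Rows whose user_id has already appeared, regardless of the other columns
key_dupes = sample.duplicated(subset=["user_id"]).sum()

print("Full-row duplicates:", full_row_dupes)
print("Repeated user_id values:", key_dupes)
```

A repeated key in the customer table would multiply the matching order rows during a merge, so this check is a useful complement to the full-row one.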

```python
# Check data types of each column
df_customers.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB
```


Since the "First Name" column has **11,259 missing values** (about 5.5% of the 206,209 rows), I need to decide whether to drop it or fill in the blanks.  

- **Dropping the column** makes sense if it’s not useful for my analysis.  
- **Filling missing values** with something like `"Unknown"` could help if I need to keep it.  

Since names aren’t critical for my analysis, I’m going to drop the column to keep my dataset clean.

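```python
# Drop the 'First Name' column since it's not needed
df_customers = df_customers.drop(columns=['First Name'])

# Verify that the column was removed
df_customers.head()
```

This cell needs to run before the steps that follow. The outputs later in this notebook still list `First Name`, so they were produced without it.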
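For comparison, the fill option mentioned above could look like the sketch below. It uses a made-up three-row frame, and `"Unknown"` is simply the placeholder suggested earlier.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with one missing first name
sample = pd.DataFrame({
    "user_id": [1, 2, 3],
    "First Name": ["Linda", np.nan, "Ann"],
})

# Work on a copy so the original frame keeps its missing value
filled = sample.copy()
filled["First Name"] = filled["First Name"].fillna("Unknown")

print(filled)
```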

## 4. Checking and Fixing Data Types  

Now that my dataset is cleaned up, I need to make sure all the columns have the correct data types. This is important because:  
- If a numeric column is stored as text, calculations won’t work properly.  
- If a date column is stored as text, I can’t use date-based functions.  
- If my key column (**user_id**) has a different data type in each dataset, the merge can raise an error, so I’ll store it as a string in both.  

I’ll check the data types now and fix any issues before moving on.

```python
# Check data types of each column
df_customers.dtypes
```

```
user_id          int64
First Name      object
Surnam          object
Gender          object
STATE           object
Age              int64
date_joined     object
n_dependants     int64
fam_status      object
income           int64
dtype: object
```
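To illustrate why the key's data type matters, here is a small sketch with two made-up frames whose `user_id` columns differ in type. In the pandas versions I am aware of, merging an integer key with a string key raises a `ValueError`; the `try`/`except` records whether that happens rather than assuming it.

```python
import pandas as pd

# Made-up stand-ins: the key is int64 on one side and string on the other
orders = pd.DataFrame({"user_id": [1, 1, 2], "order_id": [10, 11, 12]})
customers = pd.DataFrame({"user_id": ["1", "2"], "income": [40423, 59285]})

try:
    orders.merge(customers, on="user_id", how="left")
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("Merge failed:", err)

# After aligning the key types, the same merge goes through
orders["user_id"] = orders["user_id"].astype(str)
merged = orders.merge(customers, on="user_id", how="left")
print(merged)
```

Converting both keys to integers would work just as well. What matters is that the two sides agree.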

### Fixing Incorrect Data Types  

Now that I’ve checked the data types, I’ll fix any issues to make sure my dataset is in the best shape before merging.  
- **user_id** is an identifier, not a quantity, so I’ll store it as a string.  
- **date_joined** should be in a proper date format.  

```python
# Convert 'user_id' to a string
df_customers['user_id'] = df_customers['user_id'].astype(str)

# Convert 'date_joined' to a datetime format
df_customers['date_joined'] = pd.to_datetime(df_customers['date_joined'])

# Verify the changes
df_customers.dtypes
```

```
user_id                 object
First Name              object
Surnam                  object
Gender                  object
STATE                   object
Age                      int64
date_joined     datetime64[ns]
n_dependants             int64
fam_status              object
income                   int64
dtype: object
```
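When no format is given, `pd.to_datetime` infers one from the data. Passing an explicit `format` removes any day/month ambiguity, and `errors="coerce"` turns entries that don't fit into `NaT` so they can be counted. This sketch assumes the dates are written month/day/year (`%m/%d/%Y`), which I have not verified for every row of the real file; the third value is a deliberately broken example.

```python
import pandas as pd

# Made-up stand-in for the raw 'date_joined' strings
raw_dates = pd.Series(["1/1/2017", "2/17/2019", "not a date"])

# Parse with an explicit month/day/year format; bad entries become NaT
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable entries:", parsed.isna().sum())
```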

## 5. Merging Customer Data with Instacart Data  

Now that my customer data is cleaned and formatted correctly, I’m going to merge it with my **Instacart orders dataset**.  

### **Why This Merge is Important:**  
- It allows me to connect **customer details** with their **order history**.  
- I can analyze **spending habits**, **loyalty trends**, and **demographics**.  

Before merging, I need to make sure the key column (**user_id**) is the same in both datasets to avoid errors.

```python
# Define file path for the Instacart orders dataset
orders_path = r"D:\YVC\Data Analytics (CF)\🐍 Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Prepared Data\cleaned_orders.csv"

# Load the dataset
df_orders = pd.read_csv(orders_path)

# Display first few rows to confirm it loaded correctly
df_orders.head()
```

| | order_id | user_id | eval_set | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |


### Checking Key Columns Before Merging  

To make sure the merge works properly, I need to check that **user_id** is formatted correctly in both datasets.  

- **Both columns should be the same data type (string).**  
- If not, I’ll convert them to match.  

```python
# Ensure user_id is a string in both datasets
df_orders['user_id'] = df_orders['user_id'].astype(str)
df_customers['user_id'] = df_customers['user_id'].astype(str)

# Verify the data types again
print(df_orders.dtypes['user_id'], df_customers.dtypes['user_id'])
```

```
object object
```
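Matching data types do not guarantee matching values. As an extra check, the IDs on both sides can be compared as sets to see whether any orders would be left without customer details. A minimal sketch on made-up IDs:

```python
import pandas as pd

# Made-up stand-ins for the two key columns
orders = pd.DataFrame({"user_id": ["1", "1", "2", "4"]})
customers = pd.DataFrame({"user_id": ["1", "2", "3"]})

order_ids = set(orders["user_id"])
customer_ids = set(customers["user_id"])

# IDs that placed orders but have no customer record
missing_customers = order_ids - customer_ids

# IDs with a customer record but no orders
unused_customers = customer_ids - order_ids

print("Orders without customer details:", missing_customers)
print("Customers without orders:", unused_customers)
```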


### Merging the Datasets  

Now that everything is formatted correctly, I can merge the datasets using **user_id** as the key.  
- I’ll use a **left join** so that all existing Instacart orders stay in the dataset, even if some users don’t have matching customer details.  

```python
# Merge orders with customer data
df_merged = df_orders.merge(df_customers, on='user_id', how='left')

# Display first few rows of the merged dataset
df_merged.head()
```

| | order_id | user_id | eval_set | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | First Name | Surnam | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN | Linda | Nguyen | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | Linda | Nguyen | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | Linda | Nguyen | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | Linda | Nguyen | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | Linda | Nguyen | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
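A left join can hide two problems: orders that found no customer record, and customer IDs that appear more than once and silently duplicate order rows. The `validate` and `indicator` arguments of `merge` can surface both. This is a sketch on made-up frames, not a cell that was run in this notebook:

```python
import pandas as pd

# Made-up stand-ins: user "4" has an order but no customer record
orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": ["1", "1", "4"]})
customers = pd.DataFrame({"user_id": ["1", "2"], "income": [40423, 59285]})

checked = orders.merge(
    customers,
    on="user_id",
    how="left",
    validate="many_to_one",  # raises MergeError if a user_id repeats in customers
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# A left join with a unique right-hand key keeps the row count of the left frame
print("Row count unchanged:", len(checked) == len(orders))
print(checked["_merge"].value_counts())
```

Rows marked `left_only` are orders without customer details, and their customer columns contain `NaN`.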


## 6. Exporting the Merged Dataset  

Now that my dataset is fully cleaned and merged, I’ll save it so I can use it later for data visualization.  

### **Why Exporting Is Important:**  
- It keeps my work **organized** and **reusable**.  
- I won’t have to redo the merge every time I reopen my notebook.  
- I can quickly load this dataset in **Part 2** to create visualizations.  

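I’ll save it in **pickle format (.pkl)**, which loads faster than a CSV file and keeps the data types I just fixed (for example, `user_id` as a string and `date_joined` as a datetime).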

```python
# Define file path for export
export_path = r"D:\YVC\Data Analytics (CF)\🐍 Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Prepared Data\merged_orders_customers.pkl"

# Save the merged dataset as a pickle file
df_merged.to_pickle(export_path)

# Confirm the export
print("Dataset successfully saved as a pickle file!")
```

```
Dataset successfully saved as a pickle file!
```
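Besides loading speed, pickle keeps column data types, while a CSV round trip has to infer them again. The sketch below shows the difference on a made-up two-row frame written to a temporary folder; the file names are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in with a string key and a datetime column
sample = pd.DataFrame({
    "user_id": ["1", "2"],
    "date_joined": pd.to_datetime(["2019-02-17", "2017-01-01"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "sample.pkl"
    csv_path = Path(tmp) / "sample.csv"

    sample.to_pickle(pkl_path)
    sample.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the original types; CSV re-infers them from text
print(from_pickle.dtypes)
print(from_csv.dtypes)
```

After the CSV round trip, `user_id` is read back as an integer and `date_joined` as plain text, so both conversions would have to be repeated. Pickle files should only be loaded from sources I trust, since unpickling can execute code.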


## 7. Next Steps  

Now that my dataset is saved, I can move on to **Part 2**, where I’ll create visualizations to analyze trends in the data.