# Task 4.9 – Part 1: Data Integration and Cleaning
This notebook covers the import, cleaning, and merging of customer data into the main Instacart dataset for further visualization and analysis.

In [1]:
import pandas as pd
import numpy as np

## Importing Customer Data
I begin by importing the customer dataset so that it can be cleaned and prepared for the merge.

In [2]:
# Load customer dataset
customer_data = pd.read_csv('/Users/canancengel/A4_Instacart Basket Analysis/02_Data/Original Data/customers.csv')
customer_data.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


## Reviewing Column Names
I'll check and rename columns that are unclear or poorly formatted.

In [3]:
# Check current column names
customer_data.columns

Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [4]:
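# Rename columns for clarity.
# rename() silently ignores keys that match no column, so each key must match the
# original label exactly (the age column is 'Age', not 'AGE'). 'Gender', 'STATE' and
# 'Age' keep their original names, consistent with the outputs in this notebook.
customer_data.rename(columns={
    'First Name': 'first_name',
    'Surnam': 'last_name',
    'n_dependants': 'dependants',
    'fam_status': 'marital_status'
}, inplace=True)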

customer_data.head()

Unnamed: 0,user_id,first_name,last_name,Gender,STATE,Age,date_joined,dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
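
Because `rename` ignores unmatched keys by default, a mistyped column label goes unnoticed. The sketch below shows one way to surface such mistakes with the `errors='raise'` argument. The `toy` dataframe and its values are made up for illustration and are not part of the customer file.

```python
import pandas as pd

# Toy frame mirroring a few of the original column labels (values are made up)
toy = pd.DataFrame({'user_id': [1], 'Age': [30], 'Surnam': ['Doe']})

try:
    # 'AGE' does not match 'Age', so errors='raise' turns the silent no-op into a KeyError
    toy.rename(columns={'AGE': 'age'}, errors='raise')
    rename_failed = False
except KeyError:
    rename_failed = True

# With the exact labels, the rename succeeds
renamed = toy.rename(columns={'Age': 'age', 'Surnam': 'last_name'}, errors='raise')
print(rename_failed, list(renamed.columns))
```

Running a rename with `errors='raise'` once is a cheap check that every key in the mapping matches an existing column.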


## Dropping Unnecessary Columns
I'll drop the first-name and last-name columns, which are not needed for the analysis.

In [5]:
# Drop columns that won't be used
customer_data = customer_data.drop(columns=['first_name', 'last_name'])
customer_data.head()

Unnamed: 0,user_id,Gender,STATE,Age,date_joined,dependants,marital_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374


## Checking for Missing Values and Data Types
I will check for NaNs and confirm data types before merging.

In [6]:
# Check for missing values
print(customer_data.isnull().sum())

# Check data types
print(customer_data.dtypes)

user_id           0
Gender            0
STATE             0
Age               0
date_joined       0
dependants        0
marital_status    0
income            0
dtype: int64
user_id            int64
Gender            object
STATE             object
Age                int64
date_joined       object
dependants         int64
marital_status    object
income             int64
dtype: object
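
The dtype check shows `date_joined` stored as `object`, i.e. as plain strings. If date arithmetic is needed later, one option is to parse the column with `pd.to_datetime`. This is a minimal sketch on a toy series, assuming every value follows the month/day/year pattern seen in the preview rows. It is not a step performed in this notebook.

```python
import pandas as pd

# Toy series using the same m/d/yyyy pattern as the preview rows
date_strings = pd.Series(['1/1/2017', '2/17/2019'])

# An explicit format avoids ambiguous day/month guessing;
# values that do not fit the pattern raise an error instead of being misread
parsed = pd.to_datetime(date_strings, format='%m/%d/%Y')
print(parsed.dtype)
```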


In [7]:
# Drop rows with missing user_id or critical columns (if any)
customer_data = customer_data.dropna(subset=['user_id'])

In [8]:
# Reset index if rows were dropped
customer_data = customer_data.reset_index(drop=True)

## Checking for Duplicate Rows
I'll remove any duplicate customer records if they exist.

In [9]:
# Drop duplicate rows
customer_data = customer_data.drop_duplicates()
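
Before dropping anything, it can help to count how many duplicates there are. The sketch below uses a made-up `toy` dataframe to count both fully duplicated rows and repeated `user_id` values, then drops the duplicates and resets the index in one step.

```python
import pandas as pd

# Toy customer table with one fully duplicated row (values are made up)
toy = pd.DataFrame({'user_id': [1, 2, 2, 3],
                    'income': [100, 200, 200, 300]})

# Rows that repeat an earlier row in every column
n_full_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier user_id, even if other columns differ
n_id_dupes = int(toy.duplicated(subset='user_id').sum())

# Drop full duplicates, then reset the index so it is contiguous again
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_full_dupes, n_id_dupes, len(deduped))
```

Resetting the index after `drop_duplicates` rather than before it keeps the index contiguous in the final dataframe.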

## Loading Main Orders-Products Dataset
I'll load the full Instacart dataset to merge with the customer information.

In [10]:
ords_prods_merge = pd.read_pickle('/Users/canancengel/A4_Instacart Basket Analysis/02_Data/Prepared Data/ords_prods_merge.pkl')
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,True,196,1,0,both,Soda,77,7,9.0
1,2539329,1,1,2,8,,True,14084,2,0,both,Organic Unsweetened Vanilla Almond Milk,91,16,12.5
2,2539329,1,1,2,8,,True,12427,3,0,both,Original Beef Jerky,23,19,4.4
3,2539329,1,1,2,8,,True,26088,4,0,both,Aged White Cheddar Popcorn,23,19,4.7
4,2539329,1,1,2,8,,True,26405,5,0,both,XL Pick-A-Size Paper Towel Rolls,54,17,1.0


## Ensuring Consistent Key Column Types
I'll cast the `user_id` column to an integer type in both dataframes so that the merge key has the same data type on each side.

In [11]:
# Cast the merge key to integer in both dataframes
ords_prods_merge['user_id'] = ords_prods_merge['user_id'].astype(int)
customer_data['user_id'] = customer_data['user_id'].astype(int)
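
To confirm that the key types really match, the dtypes can be compared directly. This is a small sketch with made-up toy dataframes. The explicit `'int64'` target is my assumption, chosen so that the result does not depend on the platform's default integer size.

```python
import pandas as pd

# Toy frames whose key columns start with different integer widths
orders_toy = pd.DataFrame({'user_id': pd.Series([1, 1, 2], dtype='int32')})
customers_toy = pd.DataFrame({'user_id': pd.Series([1, 2], dtype='int64')})

same_before = orders_toy['user_id'].dtype == customers_toy['user_id'].dtype

# Cast both keys to one explicit dtype, then compare again
orders_toy['user_id'] = orders_toy['user_id'].astype('int64')
customers_toy['user_id'] = customers_toy['user_id'].astype('int64')
same_after = orders_toy['user_id'].dtype == customers_toy['user_id'].dtype
print(same_before, same_after)
```

Note that `astype` to an integer type raises an error if the column contains missing values, so this cast belongs after the missing-value check.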

## Merging Datasets
I'll merge the orders-products data with the customer information.

In [12]:
df_merged = ords_prods_merge.merge(customer_data, on='user_id', how='left')
df_merged.head()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,...,aisle_id,department_id,prices,Gender,STATE,Age,date_joined,dependants,marital_status,income
0,2539329,1,1,2,8,,True,196,1,0,...,77,7,9.0,Female,Alabama,31,2/17/2019,3,married,40423
1,2539329,1,1,2,8,,True,14084,2,0,...,91,16,12.5,Female,Alabama,31,2/17/2019,3,married,40423
2,2539329,1,1,2,8,,True,12427,3,0,...,23,19,4.4,Female,Alabama,31,2/17/2019,3,married,40423
3,2539329,1,1,2,8,,True,26088,4,0,...,23,19,4.7,Female,Alabama,31,2/17/2019,3,married,40423
4,2539329,1,1,2,8,,True,26405,5,0,...,54,17,1.0,Female,Alabama,31,2/17/2019,3,married,40423
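
A left merge keeps every order row, so it is worth checking how many rows found a matching customer. The sketch below uses made-up toy dataframes and two `merge` arguments: `validate='many_to_one'` raises an error if `user_id` is not unique on the customer side, and `indicator` adds a column that records the match status. I use a custom indicator name (`customer_match`, a hypothetical label) because the orders-products data already has a `_merge` column, which is the default indicator name.

```python
import pandas as pd

# Toy frames: user 3 has an order but no customer record (values are made up)
orders_toy = pd.DataFrame({'order_id': [10, 10, 11, 12],
                           'user_id': [1, 1, 2, 3]})
customers_toy = pd.DataFrame({'user_id': [1, 2],
                              'income': [40423, 59285]})

merged = orders_toy.merge(
    customers_toy,
    on='user_id',
    how='left',
    validate='many_to_one',      # error if a user_id repeats in customers_toy
    indicator='customer_match',  # 'both' or 'left_only' for each row
)

n_matched = int((merged['customer_match'] == 'both').sum())
n_unmatched = int((merged['customer_match'] == 'left_only').sum())
print(n_matched, n_unmatched)
```

The indicator column can be dropped once the counts have been reviewed.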


## Exporting Merged Dataset
I'll export the combined dataframe as a .pkl file for use in Part 2.

In [13]:
df_merged.to_pickle('df_merged_4_9.pkl')
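
As a quick sanity check on the export step, a pickle file can be read back and compared with the dataframe in memory. This sketch does a round trip on a made-up toy dataframe in a temporary directory, so it does not touch the project files.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the merged data (values are made up)
toy = pd.DataFrame({'user_id': [1, 2], 'prices': [9.0, 12.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)             # write
    reloaded = pd.read_pickle(path)  # read back

# equals() compares values and dtypes column by column
round_trip_ok = toy.equals(reloaded)
print(round_trip_ok)
```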