### **Task 4.9**
#### **Part 1:** Customer data set

#### **Question 3:** Import analysis libraries and **customer data set**

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Set display options for better viewing
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 100)  # Limit columns
pd.set_option('display.max_rows', 50)      # Limit rows

In [3]:
# Create shortcut for data file
path = r'/Users/anjanpakhrin/Documents/Instacart Basket Analysis'

In [4]:
# Import customer data as "df_cust" and the prepared orders/products data as "ords_prods_merge"
df_cust = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)
ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_4-8.pkl'))

In [5]:
# Check output
df_cust.head()

|   | user_id | First Name | Surnam | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Deborah | Esquivel | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Patricia | Hart | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Kenneth | Farley | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Michelle | Hicks | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Ann | Gilmore | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


In [6]:
ords_prods_merge['prices'].describe()

count    3.240486e+07
mean     7.791052e+00
std      4.100602e+00
min      1.000000e+00
25%      4.200000e+00
50%      7.400000e+00
75%      1.130000e+01
max      2.500000e+01
Name: prices, dtype: float64

#### **Question 4:** Wrangling customer data
#### Steps:
1. Dropping columns
2. Renaming columns
3. Changing data types
4. Transposing

**1. Dropping columns:**
Names typically offer little value for analyzing customer behavior or demographics. Therefore we are dropping the 'First Name' and 'Surnam' columns. The remaining columns are all essential for our planned work.

In [7]:
# Dropping columns "First Name" and "Surnam"
df_cust = df_cust.drop(columns = ['First Name', 'Surnam'])

In [8]:
# Checking columns
df_cust.head()

|   | user_id | Gender | STATE | Age | date_joined | n_dependants | fam_status | income |
|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |


**2. Renaming columns:** Some columns need renaming so that the column names are clean and consistent. The renames are listed below.

* Gender -> gender
* STATE -> state
* Age -> age
* n_dependants -> num_dependents
* fam_status -> family_status

All other column names are already clean and consistent.

In [9]:
# Renaming columns
df_cust.rename(columns = {
    'Gender' : 'gender',
    'STATE' : 'state',
    'Age' : 'age',
    'n_dependants' : 'num_dependents',
    'fam_status' : 'family_status'
}, inplace = True)

In [10]:
# Checking output
df_cust.head()

|   | user_id | gender | state | age | date_joined | num_dependents | family_status | income |
|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Female | Missouri | 48 | 1/1/2017 | 3 | married | 165665 |
| 1 | 33890 | Female | New Mexico | 36 | 1/1/2017 | 0 | single | 59285 |
| 2 | 65803 | Male | Idaho | 35 | 1/1/2017 | 2 | married | 99568 |
| 3 | 125935 | Female | Iowa | 40 | 1/1/2017 | 0 | single | 42049 |
| 4 | 130797 | Female | Maryland | 26 | 1/1/2017 | 1 | married | 40374 |
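
By default, `DataFrame.rename` silently ignores mapping keys that do not match any column, so a misspelled source name would go unnoticed. The sketch below, on a small hypothetical frame (`toy` is not part of the notebook), shows how `errors='raise'` could surface such a typo.

```python
import pandas as pd

# Hypothetical one-row frame using the original column names
toy = pd.DataFrame({'Gender': ['Female'], 'STATE': ['Iowa'], 'Age': [40]})

# errors='raise' makes rename fail if a key is not an existing column
renamed = toy.rename(
    columns={'Gender': 'gender', 'STATE': 'state', 'Age': 'age'},
    errors='raise',
)
print(list(renamed.columns))

# A misspelled key ('State' instead of 'STATE') now raises a KeyError
try:
    toy.rename(columns={'State': 'state'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print('Typo caught:', typo_caught)
```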


**3. Changing data types:**

In [11]:
# Recalling data types of columns
df_cust.dtypes

user_id            int64
gender            object
state             object
age                int64
date_joined       object
num_dependents     int64
family_status     object
income             int64
dtype: object

The data types of some columns need to be converted to make them consistent for further analysis.
* user_id: int64 --> int32
* date_joined: object --> datetime
* Data types for all other columns remain as they are.

In [12]:
# Convert user_id to int32
df_cust['user_id'] = df_cust['user_id'].astype('int32')

# Convert date_joined to datetime
df_cust['date_joined'] = pd.to_datetime(df_cust['date_joined'])

# Verify the changes
print(df_cust.dtypes)

user_id                    int32
gender                    object
state                     object
age                        int64
date_joined       datetime64[ns]
num_dependents             int64
family_status             object
income                     int64
dtype: object
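
The conversion above lets pandas infer the date format. As a hedged alternative, the format can be stated explicitly. The sketch below assumes the strings are month/day/year (which the sample value `1/1/2017` alone cannot confirm) and uses a small hypothetical frame named `sample`.

```python
import pandas as pd

# Hypothetical stand-in for the date_joined column; month/day/year is an assumption
sample = pd.DataFrame({'date_joined': ['1/1/2017', '10/23/2017', '4/1/2020']})

# An explicit format removes any day-first vs. month-first guessing
sample['date_joined'] = pd.to_datetime(sample['date_joined'], format='%m/%d/%Y')

print(sample['date_joined'].dt.year.tolist())
print(sample['date_joined'].dt.month.tolist())
```

With an explicit format, a string that does not fit the pattern raises an error instead of being parsed under a different interpretation.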


In [13]:
df_cust.head()

|   | user_id | gender | state | age | date_joined | num_dependents | family_status | income |
|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 |
| 1 | 33890 | Female | New Mexico | 36 | 2017-01-01 | 0 | single | 59285 |
| 2 | 65803 | Male | Idaho | 35 | 2017-01-01 | 2 | married | 99568 |
| 3 | 125935 | Female | Iowa | 40 | 2017-01-01 | 0 | single | 42049 |
| 4 | 130797 | Female | Maryland | 26 | 2017-01-01 | 1 | married | 40374 |


**4. Transposing data:** Not required

#### **Question 5:** Quality and consistency checks for customer data

In [14]:
# Investigating accuracy in columns with descriptive statistics
df_cust.describe().round(2)

|   | user_id | age | date_joined | num_dependents | income |
|---|---|---|---|---|---|
| count | 206209.0 | 206209.0 | 206209 | 206209.0 | 206209.0 |
| mean | 103105.0 | 49.5 | 2018-08-17 03:06:30.029532928 | 1.5 | 94632.85 |
| min | 1.0 | 18.0 | 2017-01-01 00:00:00 | 0.0 | 25903.0 |
| 25% | 51553.0 | 33.0 | 2017-10-23 00:00:00 | 0.0 | 59874.0 |
| 50% | 103105.0 | 49.0 | 2018-08-16 00:00:00 | 1.0 | 93547.0 |
| 75% | 154657.0 | 66.0 | 2019-06-10 00:00:00 | 3.0 | 124244.0 |
| max | 206209.0 | 81.0 | 2020-04-01 00:00:00 | 3.0 | 593901.0 |
| std | 59527.56 | 18.48 |  | 1.12 | 42473.79 |


**1. Checking and addressing missing values**

In [15]:
# Finding missing values in df_cust
df_cust.isnull().sum()

user_id           0
gender            0
state             0
age               0
date_joined       0
num_dependents    0
family_status     0
income            0
dtype: int64

**No missing values found --> no further action required**

**2. Checking and addressing duplicates**

In [16]:
# Subset of fully duplicated rows in df_cust
df_cust_dups = df_cust[df_cust.duplicated()]

df_cust_dups

|   | user_id | gender | state | age | date_joined | num_dependents | family_status | income |
|---|---|---|---|---|---|---|---|---|


**No duplicates found --> no further action required**
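
`duplicated()` without arguments only flags rows that are identical in every column. Since `user_id` is later used as the merge key, it can also be worth checking that the key alone is unique. The sketch below illustrates the difference on a hypothetical frame (`toy` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical data: user_id 2 appears twice with different income values
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'state': ['Iowa', 'Idaho', 'Idaho', 'Maryland'],
    'income': [42049, 99568, 100000, 40374],
})

# Full-row check: the two rows for user_id 2 differ, so nothing is flagged
full_row_dups = int(toy.duplicated().sum())

# Key-only check: the repeated user_id is flagged
key_dups = int(toy.duplicated(subset='user_id').sum())

print('Full-row duplicates:', full_row_dups)
print('Duplicate user_id values:', key_dups)
print('user_id unique:', toy['user_id'].is_unique)
```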

In [17]:
# Checking dimensions of customer data after wrangling and consistency checks
df_cust.shape

(206209, 8)

In [18]:
# Exporting cleaned & checked customer data as "pkl"
df_cust.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'customers_checked.pkl'))

#### **Question 6:** Combining cleaned customer data with prepared Instacart data

In [19]:
# Create shortcut for data file
path = r'/Users/anjanpakhrin/Documents/Instacart Basket Analysis'

In [20]:
# Import checked data set for customer as dataframe "df_cust_checked"

df_cust_checked = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'customers_checked.pkl'))

In [21]:
# Checking shape of "df_cust" (customers) and "ords_prods_merge" (orders and products)
print('Shape of df_cust:', df_cust.shape)
print('Shape of ords_prods_merge:', ords_prods_merge.shape)

Shape of df_cust: (206209, 8)
Shape of ords_prods_merge: (32404859, 24)


##### **Comparing data types**

In [22]:
# Checking data types for ords_prods_merge
ords_prods_merge.dtypes

order_id                    int32
user_id                     int32
order_number                int32
order_day_of_week           int32
order_hour_of_day           int32
days_since_prior_order    float32
first_order                  bool
product_id                  int32
add_to_cart_order           int32
reordered                   int32
product_name               object
aisle_id                    int32
department_id               int32
prices                    float32
price_range_loc            object
busiest_day                object
busiest_days               object
busiest_period_of_day      object
max_order                   int32
loyalty_flag               object
user_mean_prices          float32
spending_flag              object
median_frequency          float32
frequency_flag             object
dtype: object

In [23]:
# Checking data types for customer data
df_cust.dtypes

user_id                    int32
gender                    object
state                     object
age                        int64
date_joined       datetime64[ns]
num_dependents             int64
family_status             object
income                     int64
dtype: object

In [24]:
# Changing data types of customer data to match the data types of ords_prods_merge
df_cust['num_dependents'] = df_cust['num_dependents'].astype('int32')

In [25]:
# Verifying the changes
df_cust.dtypes

user_id                    int32
gender                    object
state                     object
age                        int64
date_joined       datetime64[ns]
num_dependents             int32
family_status             object
income                     int64
dtype: object
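
For the merge itself, only columns present in both DataFrames need matching types; converting other columns such as `num_dependents` mainly reduces memory use. A minimal sketch of comparing the shared columns programmatically follows, using two hypothetical frames (`left` and `right`) rather than the notebook's data.

```python
import pandas as pd

# Hypothetical frames whose shared key has different integer widths
left = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int64'),
    'age': [48, 36, 35],
})
right = pd.DataFrame({
    'user_id': pd.Series([1, 1, 3], dtype='int32'),
    'prices': [4.3, 12.6, 7.1],
})

# Collect shared columns whose dtypes differ between the two frames
shared = [col for col in left.columns if col in right.columns]
mismatched = {
    col: (str(left[col].dtype), str(right[col].dtype))
    for col in shared
    if left[col].dtype != right[col].dtype
}
print('Mismatched dtypes:', mismatched)

# Align the left key to the right key's dtype
left['user_id'] = left['user_id'].astype(right['user_id'].dtype)
print('Aligned:', left['user_id'].dtype == right['user_id'].dtype)
```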

##### **Identifying "key column"**

In [26]:
# Identify common columns
common_columns = list(set(df_cust.columns) & set(ords_prods_merge.columns))
print('Common column/s between both DataFrames:', common_columns)

Common column/s between both DataFrames: ['user_id']


##### **Combining customer data with the rest of prepared Instacart data with common key "user_id"**

In [27]:
# Merging both dataframes "df_cust" and "ords_prods_merge" --> "df_instacart_merged"
df_instacart_merged = df_cust.merge(ords_prods_merge, on = 'user_id', indicator = True)

In [28]:
# Checking output
df_instacart_merged.head()

|   | user_id | gender | state | age | date_joined | num_dependents | family_status | income | order_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | prices | price_range_loc | busiest_day | busiest_days | busiest_period_of_day | max_order | loyalty_flag | user_mean_prices | spending_flag | median_frequency | frequency_flag | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 | 518967 | 1 | 2 | 9 |  | True | 6184 | 1 | 0 | Clementines | 32 | 4 | 4.3 | Low-range product | Regularly busy | Regularly busy | Most orders | 8 | New customer | 7.988889 | Low spender | 19.0 | Regular customer | both |
| 1 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 | 423547 | 2 | 2 | 9 | 14.0 | False | 38928 | 1 | 0 | 0% Greek Strained Yogurt | 120 | 16 | 12.6 | Mid-range product | Regularly busy | Regularly busy | Most orders | 8 | New customer | 7.988889 | Low spender | 19.0 | Regular customer | both |
| 2 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 | 2524893 | 3 | 3 | 11 | 30.0 | False | 38928 | 1 | 1 | 0% Greek Strained Yogurt | 120 | 16 | 12.6 | Mid-range product | Regularly busy | Slowest days | Most orders | 8 | New customer | 7.988889 | Low spender | 19.0 | Regular customer | both |
| 3 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 | 2524893 | 3 | 3 | 11 | 30.0 | False | 6184 | 2 | 1 | Clementines | 32 | 4 | 4.3 | Low-range product | Regularly busy | Slowest days | Most orders | 8 | New customer | 7.988889 | Low spender | 19.0 | Regular customer | both |
| 4 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 | 2524893 | 3 | 3 | 11 | 30.0 | False | 47402 | 3 | 0 | Fuji Apples | 24 | 4 | 7.1 | Mid-range product | Regularly busy | Slowest days | Most orders | 8 | New customer | 7.988889 | Low spender | 19.0 | Regular customer | both |


In [29]:
# Check matching criteria - checking merge flag frequency
df_instacart_merged['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64
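
Note that `merge` defaults to an inner join, which keeps only matching keys, so `_merge` can only ever be `both` in the check above. The stronger evidence of a full match is that the merged row count (32,404,859) equals the row count of `ords_prods_merge`. The sketch below, on hypothetical toy frames (`customers` and `orders`), shows how `how='outer'` would expose unmatched rows and how `validate='one_to_many'` guards against duplicate keys on the customer side.

```python
import pandas as pd

# Hypothetical frames: customer 3 has no orders, and order user 4 has no customer record
customers = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Iowa', 'Idaho', 'Maryland']})
orders = pd.DataFrame({'user_id': [1, 1, 2, 4], 'prices': [4.3, 12.6, 7.1, 2.5]})

# Default inner join: unmatched rows are dropped, so only 'both' can appear
inner = customers.merge(orders, on='user_id', indicator=True)
print(inner['_merge'].value_counts())

# Outer join keeps unmatched rows from both sides;
# validate raises MergeError if user_id is not unique in customers
outer = customers.merge(
    orders, on='user_id', how='outer', indicator=True, validate='one_to_many'
)
print(outer['_merge'].value_counts())
```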

In [30]:
# Drop the "_merge" column
df_instacart_merged = df_instacart_merged.drop(columns=['_merge'])

In [31]:
df_instacart_merged.head(1)

|   | user_id | gender | state | age | date_joined | num_dependents | family_status | income | order_id | order_number | order_day_of_week | order_hour_of_day | days_since_prior_order | first_order | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | prices | price_range_loc | busiest_day | busiest_days | busiest_period_of_day | max_order | loyalty_flag | user_mean_prices | spending_flag | median_frequency | frequency_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 | 518967 | 1 | 2 | 9 |  | True | 6184 | 1 | 0 | Clementines | 32 | 4 | 4.3 | Low-range product | Regularly busy | Regularly busy | Most orders | 8 | New customer | 7.988889 | Low spender | 19.0 | Regular customer |


In [32]:
# Check shape of instacart merged data
df_instacart_merged.shape

(32404859, 31)

#### **Question 8:** Export merged Instacart data as pickle

In [33]:
# Export data as pkl
df_instacart_merged.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'customers_merged.pkl'))
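
After an export, a quick round-trip check can confirm that the pickle preserves values and data types. The sketch below is illustrative only: it writes a small hypothetical frame (`toy`) to a temporary directory rather than to the project folder, and the file name `toy_checked.pkl` is made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with the dtypes used for the customer data
toy = pd.DataFrame({
    'user_id': pd.Series([26711, 33890], dtype='int32'),
    'date_joined': pd.to_datetime(['2017-01-01', '2017-01-01']),
})

# Write to a throwaway location, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_checked.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# equals() compares shape, values, and dtypes
round_trip_ok = restored.equals(toy)
print('Round trip preserved the data:', round_trip_ok)
```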