# Customers - Data Wrangling and Data Consistency Checks:

1. Importing libraries and dataset
2. Checking for columns, datatype, shape using .info()
3. Removing columns with PII using .drop()
4. Renaming columns using .rename()
5. Addressing missing values
6. Addressing duplicates
7. Checking for mixed datatype
8. Changing datatypes to reduce memory usage
9. Performing Descriptive Analysis
10. Exporting wrangled, consistency checked dataframe

## 1. Importing libraries and dataset

In [1]:
# Importing libraries

import pandas as pd
import os

In [2]:
# Accessing EnvFile for path

%run EnvFile.ipynb

Stored 'path' (str)


In [3]:
# Importing customers.csv to dataframe

df_cust = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

In [4]:
# Checking the head

df_cust.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


## 2. Checking for columns, datatype, shape using .info()

In [5]:
# Checking the info for columns, datatypes, shape of dataframe

df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


#### The shape of df_cust before consistency checks is (206209, 10) with memory usage of 15.7+ MB.

## 3. Removing columns with PII using .drop()

In [6]:
# Removing the PII columns 'First Name' and 'Surnam'

df_cust.drop(columns = ['First Name', 'Surnam'], inplace = True)

## 4. Renaming columns using .rename()

In [7]:
# Renaming columns with appropriate names

df_cust.rename(columns = {'Gender' : 'gender', 'STATE' : 'state', 'Age' : 'age', 'n_dependants' : 'number_of_dependants', 'fam_status' : 'family_status'}, inplace = True)
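
One thing worth knowing about `.rename()`: by default it silently skips keys that do not match an existing column, so a typo in the mapping leaves the old name in place. The sketch below uses a hypothetical two-column frame (`toy`, with a deliberately misspelled key `'Agee'`) to show the default behaviour and the `errors='raise'` option.

```python
import pandas as pd

# Hypothetical toy frame, not the customers data
toy = pd.DataFrame({'STATE': ['Iowa'], 'Age': [40]})

# 'Agee' is a deliberate typo: rename ignores it and 'Age' stays unchanged
renamed = toy.rename(columns={'STATE': 'state', 'Agee': 'age'})
print(renamed.columns.tolist())

# With errors='raise', the same typo produces a KeyError instead
try:
    toy.rename(columns={'Agee': 'age'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```

Checking `df_cust.columns` after the rename cell is a simpler way to confirm the same thing.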

## 5. Addressing missing values

In [8]:
# Finding missing values

df_cust.isnull().sum()

user_id                 0
gender                  0
state                   0
age                     0
date_joined             0
number_of_dependants    0
family_status           0
income                  0
dtype: int64

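#### There are no missing values in the remaining columns. The only column with missing values, First Name (194,950 non-null out of 206,209 rows), was dropped in step 3.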
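
The result of this check depends on when it runs relative to the column drop in step 3. As a minimal sketch on hypothetical toy data (the frame `toy` is an illustration, not the customers file), counting missing values before and after dropping a name column gives different pictures:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the customers file
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'First Name': ['Ann', np.nan, 'Kenneth', np.nan],
    'income': [40374, 59285, 99568, 42049],
})

# Missing values per column before any column is dropped
missing_before = toy.isnull().sum()

# Missing values per column after dropping the name column
missing_after = toy.drop(columns=['First Name']).isnull().sum()

print(missing_before)
print(missing_after)
```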

## 6. Addressing duplicates

In [9]:
# Check for duplicates

df_cust[df_cust.duplicated()]

Unnamed: 0,user_id,gender,state,age,date_joined,number_of_dependants,family_status,income


There are no duplicates. Hence no changes to be made.
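
`.duplicated()` with no arguments only flags rows that match on every column. A possible extra check, sketched here on hypothetical toy data, is to look for repeated `user_id` values, since two rows for the same customer with different attributes would pass the full-row check.

```python
import pandas as pd

# Hypothetical toy data: user_id 2 appears twice with different states
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'state': ['Iowa', 'Idaho', 'Maryland', 'Missouri'],
})

# Full-row duplicates: none, because the two rows for user 2 differ in state
full_row_dupes = toy.duplicated().sum()

# Key duplicates: keep=False marks every row that shares a user_id
key_dupes = toy.duplicated(subset=['user_id'], keep=False).sum()

print(full_row_dupes, key_dupes)
```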

## 7. Checking for mixed datatype

In [10]:
# Check for mixed-type data in the df_cust dataframe.

for col in df_cust.columns.tolist():
    mixeddata = (df_cust[[col]].applymap(type) != df_cust[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_cust[mixeddata]) > 0:
        print(col)

There is no mixed-type data.
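
The loop above relies on `DataFrame.applymap`, which is deprecated as of pandas 2.1 in favour of `DataFrame.map`. A version-independent alternative, sketched on a hypothetical toy frame with one deliberately mixed column, is to map each column's values to their Python type and count the distinct types:

```python
import pandas as pd

# Hypothetical toy data; 'state' deliberately mixes str and int
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'state': ['Iowa', 7, 'Idaho'],
})

# A column is mixed if its values have more than one Python type
mixed_cols = [col for col in toy.columns if toy[col].map(type).nunique() > 1]

print(mixed_cols)
```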

## 8. Changing datatypes to reduce memory usage

In [11]:
# Change datatypes to reduce memory usage

df_cust['user_id'] = df_cust['user_id'].astype('int32')
df_cust['age'] = df_cust['age'].astype('int8')
df_cust['number_of_dependants'] = df_cust['number_of_dependants'].astype('int8')
df_cust['income'] = df_cust['income'].astype('int32')
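
`.astype()` to a smaller integer type does not necessarily fail when a value is out of range, so it helps to compare each column's min and max against the limits of the target type first. The helper `fits` below is a hypothetical name introduced for illustration. The toy values are the min, median and max for `age` and `income` reported by `describe()` in step 9.

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """Return True if every value in series lies within the range of dtype."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)

# Toy frame using the min / median / max values from the describe() output
toy = pd.DataFrame({
    'age': [18, 49, 81],
    'income': [25903, 93547, 593901],
})

age_ok = fits(toy['age'], 'int8')                # int8 holds -128..127
income_int16_ok = fits(toy['income'], 'int16')   # int16 holds -32768..32767
income_int32_ok = fits(toy['income'], 'int32')

print(age_ok, income_int16_ok, income_int32_ok)
```

On these values `int8` is enough for `age`, while `income` needs `int32`, which matches the types chosen in the cell above.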

In [12]:
# Checking for memory usage reduction using info

df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 8 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   user_id               206209 non-null  int32 
 1   gender                206209 non-null  object
 2   state                 206209 non-null  object
 3   age                   206209 non-null  int8  
 4   date_joined           206209 non-null  object
 5   number_of_dependants  206209 non-null  int8  
 6   family_status         206209 non-null  object
 7   income                206209 non-null  int32 
dtypes: int32(2), int8(2), object(4)
memory usage: 8.3+ MB


#### The shape of df_cust after consistency checks is (206209, 8) with memory usage of 8.3+ MB.
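
The `+` in `8.3+ MB` means the figure is a lower bound: `.info()` does not measure the strings held in `object` columns unless asked for a deep count. As a sketch of a possible further reduction that this notebook does not apply, low-cardinality text columns such as `gender` could be stored as `category`. The series below is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy column: 1000 rows with two distinct values, stored as object
gender = pd.Series(['Female', 'Male'] * 500, dtype=object)

# deep=True counts the bytes of the Python strings, not just the pointers
bytes_object = gender.memory_usage(deep=True)
bytes_category = gender.astype('category').memory_usage(deep=True)

print(bytes_object, bytes_category)
```

On the real dataframe, `df_cust.info(memory_usage='deep')` reports the full figure without the `+`.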

## 9. Performing Descriptive Analysis

In [13]:
# Checking descriptive analysis

df_cust.describe()

Unnamed: 0,user_id,age,number_of_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


#### Nothing looks off in the data. The min and max values are as expected: age runs from 18 to 81, number_of_dependants from 0 to 3, and income from 25,903 to 593,901.

## 10. Exporting wrangled, consistency checked dataframe

In [14]:
# Importing prepared Instacart data

df_ords_prods_dept = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_dept_aggcolumns.pkl'))

In [15]:
df_ords_prods_dept.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404289 entries, 0 to 32404288
Data columns (total 23 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   order_id                       int32  
 1   user_id                        int32  
 2   order_number                   int8   
 3   order_day_of_week              int8   
 4   order_hour_of_day              int8   
 5   days_since_prior_order         float16
 6   product_id                     int32  
 7   add_to_cart_order              int32  
 8   reordered                      int8   
 9   product_name                   object 
 10  aisle_id                       int8   
 11  department_id                  int8   
 12  prices                         float64
 13  department                     object 
 14  price_range                    object 
 15  day_busyness_level             object 
 16  hour_busyness_level            object 
 17  max_order                      int8   
 18  

In [16]:
# Merging both dfs on user_id (default how = 'inner', so only rows with a match in both dfs are kept)

df_all = df_ords_prods_dept.merge(df_cust, on = 'user_id', indicator = True)

In [17]:
# Checking the frequency of '_merge' column values

df_all['_merge'].value_counts()

both          32404289
left_only            0
right_only           0
Name: _merge, dtype: int64
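
Because `merge` defaults to an inner join, every row that survives is a match and `_merge` can only be `both`. The useful comparison is the row count: the merged frame has the same 32,404,289 rows as `df_ords_prods_dept`, so no order rows were lost. To see unmatched keys directly, an outer join is needed. The sketch below uses two hypothetical toy frames (`orders` and `customers`) to show the difference, plus the `validate` option that asserts each `user_id` appears at most once on the customer side.

```python
import pandas as pd

# Hypothetical toy data: user 2 has no customer record, user 3 has no orders
orders = pd.DataFrame({'user_id': [1, 1, 2], 'order_id': [10, 11, 12]})
customers = pd.DataFrame({'user_id': [1, 3], 'state': ['Alabama', 'Iowa']})

# Inner join (the default): unmatched rows are dropped, so only 'both' occurs
inner = orders.merge(customers, on='user_id', indicator=True)
inner_counts = inner['_merge'].value_counts()

# Outer join: unmatched rows are kept and labelled left_only / right_only
outer = orders.merge(customers, on='user_id', how='outer', indicator=True)
outer_counts = outer['_merge'].value_counts()

# validate raises MergeError if user_id is not unique in the right frame
checked = orders.merge(customers, on='user_id', validate='many_to_one')

print(inner_counts)
print(outer_counts)
```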

In [18]:
# Dropping the _merge indicator

df_all.drop(columns = ['_merge'], inplace = True)

In [19]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404289 entries, 0 to 32404288
Data columns (total 30 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   order_id                       int32  
 1   user_id                        int32  
 2   order_number                   int8   
 3   order_day_of_week              int8   
 4   order_hour_of_day              int8   
 5   days_since_prior_order         float16
 6   product_id                     int32  
 7   add_to_cart_order              int32  
 8   reordered                      int8   
 9   product_name                   object 
 10  aisle_id                       int8   
 11  department_id                  int8   
 12  prices                         float64
 13  department                     object 
 14  price_range                    object 
 15  day_busyness_level             object 
 16  hour_busyness_level            object 
 17  max_order                      int8   
 18  

The shape of df_all after merging is (32404289, 30) with memory usage of 4.5+ GB.

In [20]:
# Checking the output of merged df

df_all.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,spending_flag,median_days_since_prior_order,order_frequency_flag,gender,state,age,date_joined,number_of_dependants,family_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,...,Low Spender,20.5,Non-frequent Customer,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Low Spender,20.5,Non-frequent Customer,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Low Spender,20.5,Non-frequent Customer,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Low Spender,20.5,Non-frequent Customer,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Low Spender,20.5,Non-frequent Customer,Female,Alabama,31,2/17/2019,3,married,40423


In [21]:
# Export final merged dataframe

df_all.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_dept_customers.pkl'))