### This script contains the following points:
Note: Corresponding steps for the exercise task are shown as (Step x)

#### 1. (Step 3a) Importing libraries, defining project path, and importing dataset "customers.csv" (as df_customers).
#### 2. (Step 3b) Reviewing df_customers (size, shape, info, descriptive stats)
#### 3. (Step 4) Wrangling customer dataset
#### 4. (Step 4a) Reviewing and renaming column names.
#### 5. (Step 4b) Dropping columns with identifying customer information (as df_customers_2).
#### 6. (Step 5) Consistency checks on customer dataset
#### 7. (Step 5a) Checking for mixed data types (none found).
#### 8. (Step 5b) Checking for missing values (none found).
#### 9. (Step 5c) Checking for duplicate values (none found).
#### 10. (Step 5d) Reviewing frequency tables of columns 'state' and 'fam_status'.
#### 11. (Step 5e) Exporting "df_customers_2" as "customers_checked.pkl".

## 1. (Step 3a) Importing Libraries, Defining Project Path, and Importing Datasets

In [1]:
# Importing pandas, numpy, and os
import pandas as pd
import numpy as np
import os

In [2]:
# Defining project folder path
path = r'C:\Users\prena\03-2023 Instacart Basket Analysis'

In [3]:
# Importing customers.csv dataset
df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

## 2. (Step 3b) Reviewing df_customers

In [4]:
df_customers.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [5]:
df_customers.tail()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
206204,168073,Lisa,Case,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Jeremy,Robbins,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Doris,Richmond,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Rose,Rollins,Female,California,27,4/1/2020,1,married,99799
206208,80148,Cynthia,Noble,Female,New York,55,4/1/2020,1,married,57095


In [6]:
df_customers.shape

(206209, 10)

In [7]:
df_customers.dtypes

user_id          int64
First Name      object
Surnam          object
Gender          object
STATE           object
Age              int64
date_joined     object
n_dependants     int64
fam_status      object
income           int64
dtype: object

In [8]:
df_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB
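
The `info()` output above reports 194,950 non-null values for `First Name` against 206,209 rows, which implies 11,259 missing first names. As a minimal sketch of how that per-column count can be derived directly, the example below uses a small made-up frame (`toy`, not part of the original notebook) in place of the real data:

```python
import pandas as pd

# Toy stand-in for df_customers; the values are made up for illustration
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'First Name': ['Deborah', None, 'Kenneth', None],
})

# Missing values per column = total rows - non-null count,
# which is the arithmetic applied to the info() output above
missing_per_column = len(toy) - toy.count()

# The more direct equivalent used later in this notebook
missing_direct = toy.isnull().sum()

print(missing_per_column)
```

Both expressions should give the same counts; `isnull().sum()` is simply the shorter form.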


In [9]:
df_customers.describe()

Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


## 3. (Step 4) Wrangling df_customers

### 4. (Step 4a) Reviewing column names, and renaming where necessary

In [10]:
df_customers.columns

Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

Based on this information, I'll change the columns to the following:
- 'First Name' to 'first_name'
- 'Surnam' to 'last_name'
- 'Gender' to 'gender'
- 'STATE' to 'state'
- 'Age' to 'age'
- 'n_dependants' to 'dependants_count'

In [11]:
df_customers.rename(columns = {'First Name' : 'first_name', 'Surnam' : 'last_name', 'Gender' : 'gender', 'STATE' : 'state', 'Age' : 'age', 'n_dependants' : 'dependants_count'}, inplace = True)

In [12]:
df_customers.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,dependants_count,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
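
One thing worth knowing about `rename`: by default, a dictionary key that matches no existing column is silently ignored, so a typo in the mapping would go unnoticed. The sketch below, on a made-up two-column frame (`toy` is hypothetical), shows how the `errors='raise'` argument can surface such a mismatch:

```python
import pandas as pd

# Toy frame reusing two of the original column names
toy = pd.DataFrame({'Surnam': ['Hart'], 'STATE': ['Idaho']})

# Default behaviour: 'Surname' matches no column, so nothing is renamed
unchanged = toy.rename(columns={'Surname': 'last_name'})

# With errors='raise', the same mismatch raises a KeyError instead
try:
    toy.rename(columns={'Surname': 'last_name'}, errors='raise')
    raised = False
except KeyError:
    raised = True

# A correct mapping works as before
renamed = toy.rename(
    columns={'Surnam': 'last_name', 'STATE': 'state'},
    errors='raise',
)
print(list(renamed.columns))
```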


### 5. (Step 4b) Dropping columns where necessary

The user_id column is the key for merging this data with our existing orders_products_merged dataset. The remaining columns describe the variety of customers in the Instacart database. The first_name and last_name columns can be dropped to protect personally identifying customer information. I'll create a new dataframe (called df_customers_2) that omits these columns:

In [13]:
# Previewing df_customers without the first_name and last_name columns (the result is not saved here)
df_customers.drop(columns = ['first_name', 'last_name'])

Unnamed: 0,user_id,gender,state,age,date_joined,dependants_count,fam_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374
...,...,...,...,...,...,...,...,...
206204,168073,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Female,California,27,4/1/2020,1,married,99799


In [14]:
# Set new dataframe as df_customers_2
df_customers_2 = df_customers.drop(columns = ['first_name', 'last_name'])

In [15]:
df_customers_2.head()

Unnamed: 0,user_id,gender,state,age,date_joined,dependants_count,fam_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374


In [16]:
df_customers_2.shape

(206209, 8)

Confirmed that the number of columns in our dataset decreased from 10 to 8, while the number of rows (206,209) is unchanged.
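
The same confirmation can be expressed as a programmatic check. This is a minimal sketch on a made-up frame (`toy` and `toy_2` are hypothetical stand-ins for df_customers and df_customers_2); it also illustrates why the drop had to be assigned to a new variable, since `drop` returns a new dataframe and leaves the original untouched:

```python
import pandas as pd

# Toy stand-in for df_customers after renaming
toy = pd.DataFrame({
    'user_id': [26711, 33890],
    'first_name': ['Deborah', 'Patricia'],
    'last_name': ['Esquivel', 'Hart'],
    'income': [165665, 59285],
})

# drop() returns a new frame; toy itself keeps all four columns
toy_2 = toy.drop(columns=['first_name', 'last_name'])

rows_before, cols_before = toy.shape
rows_after, cols_after = toy_2.shape

# Rows should be unchanged, columns reduced by exactly two
print(rows_before == rows_after, cols_before - cols_after)
```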

## 6. (Step 5) Complete data quality and consistency checks

### 7. (Step 5a) Checking for mixed data types

In [17]:
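# Check for mixed data types: print any column that contains values
# whose type differs from the type of the column's first value
# (applymap is deprecated since pandas 2.1; newer versions offer DataFrame.map instead)
for col in df_customers_2.columns.tolist():
    weird = (df_customers_2[[col]].applymap(type) != df_customers_2[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_customers_2[weird]) > 0:
        print(col)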

#### No mixed data types were found.
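
As an alternative to the loop above, pandas provides `pd.api.types.infer_dtype`, which labels a column as `'mixed'` or `'mixed-integer'` when it holds more than one kind of value. The following is only a sketch of that approach, not the method used in this notebook, and it runs on a made-up frame (`toy`) that deliberately contains one mixed column:

```python
import pandas as pd

# Toy frame: 'age' holds one string among integers, so it is mixed
toy = pd.DataFrame({
    'state': ['Missouri', 'Idaho', 'Iowa'],
    'age': [48, '35', 40],
})

# Collect every column whose inferred type label starts with 'mixed'
mixed_cols = [
    col for col in toy.columns
    if pd.api.types.infer_dtype(toy[col], skipna=True).startswith('mixed')
]
print(mixed_cols)
```

An empty list here would correspond to the "no mixed data types" result found for df_customers_2.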

### 8. (Step 5b) Checking for missing values and addressing where necessary

In [18]:
# Search for missing values
df_customers_2.isnull().sum()

user_id             0
gender              0
state               0
age                 0
date_joined         0
dependants_count    0
fam_status          0
income              0
dtype: int64

#### No null values were found in the dataframe.

### 9. (Step 5c) Checking for duplicate values and addressing where necessary

In [19]:
# Search for full duplicates
df_dups = df_customers_2[df_customers_2.duplicated()]

In [20]:
df_dups

Unnamed: 0,user_id,gender,state,age,date_joined,dependants_count,fam_status,income


#### No full duplicate values were found in the dataframe.
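
Full-row duplicates are one kind of duplication. Because user_id is the key used for merging, it may also be worth checking that the key alone is unique. That check is not part of the original exercise; the sketch below shows how it could look on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Toy frame: user_id 2 appears twice, but the rows differ in income,
# so they are not full duplicates
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'state': ['Iowa', 'Idaho', 'Idaho', 'Iowa'],
    'income': [100, 200, 250, 300],
})

# Full-row duplicates, as checked above
full_dups = toy[toy.duplicated()]

# Rows sharing a user_id; keep=False flags every occurrence
key_dups = toy[toy.duplicated(subset=['user_id'], keep=False)]

print(len(full_dups), len(key_dups))
```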

### 10. (Step 5d) Checking frequency tables

In [21]:
df_customers_2['state'].value_counts(dropna = False).sort_index()

Alabama                 4044
Alaska                  4044
Arizona                 4044
Arkansas                4044
California              4044
Colorado                4044
Connecticut             4044
Delaware                4044
District of Columbia    4044
Florida                 4044
Georgia                 4044
Hawaii                  4044
Idaho                   4044
Illinois                4044
Indiana                 4044
Iowa                    4044
Kansas                  4043
Kentucky                4043
Louisiana               4043
Maine                   4043
Maryland                4043
Massachusetts           4043
Michigan                4043
Minnesota               4043
Mississippi             4043
Missouri                4043
Montana                 4043
Nebraska                4043
Nevada                  4043
New Hampshire           4043
New Jersey              4043
New Mexico              4043
New York                4043
North Carolina          4043
North Dakota  

In [22]:
df_customers_2['fam_status'].value_counts(dropna = False)

married                             144906
single                               33962
divorced/widowed                     17640
living with parents and siblings      9701
Name: fam_status, dtype: int64
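
To put these counts in proportion, the sketch below rebuilds the fam_status table from the figures shown above, confirms that the four categories add up to the 206,209 rows in the dataframe, and converts the counts to percentage shares. The variable names (`fam_counts`, `fam_share`) are introduced here for illustration only:

```python
import pandas as pd

# Counts copied from the fam_status frequency table above
fam_counts = pd.Series({
    'married': 144906,
    'single': 33962,
    'divorced/widowed': 17640,
    'living with parents and siblings': 9701,
})

# The categories should account for every row in the dataframe
total = fam_counts.sum()

# Share of customers per category, as a percentage
fam_share = (fam_counts / total * 100).round(1)
print(fam_share)
```

On the dataframe itself, `value_counts(normalize = True)` returns the same shares as fractions without the manual division.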

## 11. (Step 5e) Export 'df_customers_2' as 'customers_checked.pkl'

In [23]:
# Export customer data to pkl
df_customers_2.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'customers_checked.pkl'))

In [24]:
df_customers_2.shape
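
A simple way to confirm an export like this is to read the file back and compare it with the dataframe in memory. The sketch below does that with a made-up frame and a temporary folder, so it does not touch the project path; `toy`, `out_file`, and `reloaded` are hypothetical names:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_customers_2
toy = pd.DataFrame({'user_id': [1, 2], 'state': ['Iowa', 'Idaho']})

# Write to a temporary folder, then read the pickle back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, 'customers_checked.pkl')
    toy.to_pickle(out_file)
    reloaded = pd.read_pickle(out_file)

# equals() compares values, dtypes, and index in one call
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok, reloaded.shape)
```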