### Contents:
    01 Importing libraries and data
    02 Initial exploration
    03 Cleaning up
        a column names
        b data types
        c outlier identification
        d missing values
        e duplicates
    04 Export clean df

# Cleaning customers df

## 01 Importing libraries and data

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
path = r'/Users/Emily/Documents/CF Data Analysis Program/Immersion 4/Instacart Basket Analysis'

In [3]:
df = pd.read_csv(os.path.join(path, '02 Data', 'original data', 'customers.csv'), index_col = False)

## 02 Initial exploration

In [4]:
# view the top 5 rows and all column names
df.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [5]:
# check out the shape of the df (rows and columns)
df.shape

(206209, 10)

In [6]:
# check out the descriptive stats of whole df
# df.describe() would have just shown info for the numeric columns
df.describe(include = 'all')

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
count,206209.0,194950,206209,206209,206209,206209.0,206209,206209.0,206209,206209.0
unique,,207,1000,2,51,,1187,,4,
top,,Marilyn,Hamilton,Male,Florida,,9/17/2018,,married,
freq,,2213,252,104067,4044,,213,,144906,
mean,103105.0,,,,,49.501646,,1.499823,,94632.852548
std,59527.555167,,,,,18.480962,,1.118433,,42473.786988
min,1.0,,,,,18.0,,0.0,,25903.0
25%,51553.0,,,,,33.0,,0.0,,59874.0
50%,103105.0,,,,,49.0,,1.0,,93547.0
75%,154657.0,,,,,66.0,,3.0,,124244.0


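Some notes:
- why is the count of First Name (194,950) lower than the row count (206,209)? Probably missing values
- what are the 51 states? Do they include DC?
- 'Surnam' is a typo; rename it to a last-name column
- column names are inconsistent in case and style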

In [7]:
# check out the data type of each column
# can also use df.dtypes
# the date_joined column is an object (string); a datetime dtype would suit it better
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB
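
The info() output shows four int64 columns, and the contents list mentions downsampling data types. A minimal sketch of how that could look, using a small toy frame built from the head() rows above rather than the real file. The use of `pd.to_numeric(..., downcast='integer')` is an assumption about the approach, not something the notebook itself runs.

```python
import pandas as pd

# Toy frame mirroring the integer columns above (first three rows of head())
toy = pd.DataFrame({
    'user_id': [26711, 33890, 65803],
    'age': [48, 36, 35],
    'n_dependants': [3, 0, 2],
    'income': [165665, 59285, 99568],
})

before = toy.memory_usage(deep=True).sum()

# Downcast each int64 column to the smallest signed integer type that fits its values
for col in ['user_id', 'age', 'n_dependants', 'income']:
    toy[col] = pd.to_numeric(toy[col], downcast='integer')

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(before, '->', after)
```

On the full data the chosen types depend on the actual value ranges, so it is worth checking `df.dtypes` afterwards rather than assuming a particular width.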


## 03 Cleaning up

### a) Column names

In [8]:
# renaming of columns for clarity
df = df.rename(columns = {'First Name': 'first_name', 'Surnam': 'last_name', 'STATE': 'state',
                          'Gender': 'gender', 'Age': 'age'})

### b) Data types

In [9]:
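# Check for mixed-type data: print any column containing values whose type differs from its first row
# (note: applymap is deprecated from pandas 2.1 in favour of DataFrame.map)
for col in df.columns.tolist():
    weird = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df[weird]) > 0:
        print(col)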

first_name
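
The same check can be written per column by counting the distinct Python types in each Series. This is a sketch on a toy frame, and `mixed_type_columns` is a hypothetical helper name, not part of pandas or of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one column mixes str with a float NaN, the other is all integers
toy = pd.DataFrame({
    'first_name': ['Deborah', np.nan, 'Kenneth'],
    'age': [48, 36, 35],
})

def mixed_type_columns(frame):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in frame.columns if frame[col].map(type).nunique() > 1]

mixed = mixed_type_columns(toy)
print(mixed)
```

Unlike the comparison against the first row, this version does not depend on which value happens to come first in the column.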


In [10]:
# it's mixed type because the NaNs are floats while the names are strings
df['first_name'].value_counts(dropna = False)

NaN        11259
Marilyn     2213
Barbara     2154
Todd        2113
Jeremy      2104
           ...  
Merry        197
Eugene       197
Garry        191
Ned          186
David        186
Name: first_name, Length: 208, dtype: int64

In [11]:
# missing values check
df['first_name'].isnull().sum()

11259

In [12]:
# see if anything looks weird with the rows that are missing a first name
df.loc[df['first_name'].isnull()]

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,n_dependants,fam_status,income
53,76659,,Gilbert,Male,Colorado,26,1/1/2017,2,married,41709
73,13738,,Frost,Female,Louisiana,39,1/1/2017,0,single,82518
82,89996,,Dawson,Female,Oregon,52,1/1/2017,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1/1/2017,1,married,155673
105,29778,,Dawson,Female,Utah,63,1/1/2017,3,married,151819
...,...,...,...,...,...,...,...,...,...,...
206038,121317,,Melton,Male,Pennsylvania,28,3/31/2020,3,married,87783
206044,200799,,Copeland,Female,Hawaii,52,4/1/2020,2,married,108488
206090,167394,,Frost,Female,Hawaii,61,4/1/2020,1,married,45275
206162,187532,,Floyd,Female,California,39,4/1/2020,0,single,56325
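
The rows with a missing first name look otherwise complete, and 11,259 of 206,209 rows is roughly 5.5% of the data. One possible way to handle them, shown here only as a sketch on toy rows, is to keep the records and add a flag column instead of dropping them. The column name `first_name_missing` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows taken from the outputs above: two with a first name, two without
toy = pd.DataFrame({
    'user_id': [26711, 76659, 65803, 13738],
    'first_name': ['Deborah', np.nan, 'Kenneth', np.nan],
    'last_name': ['Esquivel', 'Gilbert', 'Farley', 'Frost'],
})

# Share of rows with no first name
missing_share = toy['first_name'].isnull().mean()

# Keep every row, but record which ones lack a first name
toy['first_name_missing'] = toy['first_name'].isnull()

print(missing_share)
print(toy)
```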


### c) Outlier identification

In [13]:
# date_joined is stored as strings, so sort_index orders it alphabetically rather than
# chronologically, which makes it tough to tell what's going on
df['date_joined'].value_counts().sort_index(ascending = False)

9/9/2019     181
9/9/2018     174
9/9/2017     186
9/8/2019     158
9/8/2018     164
            ... 
1/10/2017    192
1/1/2020     153
1/1/2019     153
1/1/2018     147
1/1/2017     159
Name: date_joined, Length: 1187, dtype: int64
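
Parsing the strings as dates before counting gives a chronological order. A minimal sketch on a few of the values shown above, assuming the month/day/year layout that entries such as 3/31/2020 imply:

```python
import pandas as pd

# A handful of date strings from the output above, in their original string order
dates = pd.Series(['9/9/2019', '1/10/2017', '1/1/2020', '1/1/2017'], name='date_joined')

# Convert to datetime64 using an explicit month/day/year format
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

# value_counts().sort_index() now sorts by calendar date, not by string
counts = parsed.value_counts().sort_index()
print(counts)
```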

In [14]:
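# the highest and lowest incomes shown each occur once, but there are only 108,012 distinct
# values across 206,209 rows, so incomes are not all distinct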
df['income'].value_counts().sort_index(ascending = False)

593901    1
592409    1
591089    1
590790    1
584097    1
         ..
25955     1
25941     1
25937     1
25911     1
25903     1
Name: income, Length: 108012, dtype: int64
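
The top incomes (close to 600,000) sit far above the 75th percentile of 124,244 reported by describe(). One common way to flag such values is the 1.5 × IQR rule. The notebook does not use this rule, so the sketch below is an assumption about how outliers might be identified, run on a toy Series made of incomes that appear in the outputs above.

```python
import pandas as pd

# Five incomes from head() plus the largest income listed above
income = pd.Series([40374, 42049, 59285, 99568, 165665, 593901], name='income')

# Quartiles and interquartile range
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = income[(income < lower_fence) | (income > upper_fence)]

print(flagged.tolist())
```

With the quartiles from describe() (59,874 and 124,244), the same rule would put the upper fence at 124,244 + 1.5 × 64,370 = 220,799 for the full data set.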

In [15]:
# the 51 values are the 50 states plus the District of Columbia
# roughly equal number of customers in each state (about 4,043 each; describe() shows Florida at 4,044)
df['state'].value_counts().sort_index(ascending = False)

Wyoming                 4043
Wisconsin               4043
West Virginia           4043
Washington              4043
Virginia                4043
Vermont                 4043
Utah                    4043
Texas                   4043
Tennessee               4043
South Dakota            4043
South Carolina          4043
Rhode Island            4043
Pennsylvania            4043
Oregon                  4043
Oklahoma                4043
Ohio                    4043
North Dakota            4043
North Carolina          4043
New York                4043
New Mexico              4043
New Jersey              4043
New Hampshire           4043
Nevada                  4043
Nebraska                4043
Montana                 4043
Missouri                4043
Mississippi             4043
Minnesota               4043
Michigan                4043
Massachusetts           4043
Maryland                4043
Maine                   4043
Louisiana               4043
Kentucky                4043
Kansas        

### d) Missing values

In [16]:
# find which column any missing values are in
# good to know it's only in the first_name column
df.isnull().sum()

user_id             0
first_name      11259
last_name           0
gender              0
state               0
age                 0
date_joined         0
n_dependants        0
fam_status          0
income              0
dtype: int64

### e) Duplicates

In [17]:
# check to see if any records are exact duplicates
# No dups!
df[df.duplicated()]

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,n_dependants,fam_status,income


In [18]:
# check whether any user_id appears more than once
# none do
df[df.duplicated(subset='user_id')]

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,n_dependants,fam_status,income
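
By default `duplicated()` marks only the second and later occurrences. When a key does repeat, passing `keep=False` shows every row that shares it, which makes the copies easier to compare. A small sketch on made-up ids:

```python
import pandas as pd

# Toy frame with one repeated user_id (values are illustrative only)
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'state': ['Iowa', 'Utah', 'Ohio', 'Idaho'],
})

# Quick yes/no answer on whether the key is unique
ids_unique = toy['user_id'].is_unique

# keep=False marks all rows in a duplicated group, not just the later ones
dup_rows = toy[toy.duplicated(subset='user_id', keep=False)]

print(ids_unique)
print(dup_rows)
```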


## 04 Export clean df

In [19]:
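# index = False keeps the row index out of the file, so no 'Unnamed: 0' column appears on reload
df.to_csv(os.path.join(path, '02 Data', 'prepared data', 'customers_clean.csv'), index = False)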
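
Whether the row index is written matters when the file is read back in. The sketch below round-trips a toy frame through an in-memory buffer (standing in for the CSV file) to show the difference. The buffer names are illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({'user_id': [26711, 33890], 'state': ['Missouri', 'New Mexico']})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# Export without the index: only the real columns come back
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```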