# Data preprocessing

Data preprocessing is the process of turning raw data into clean data. It is one of the most crucial parts of data science. In this section, we first explore the data, then remove unwanted columns, remove duplicates, handle missing data, and so on. After this step, we have clean data derived from the raw data.


# Data cleaning

After exploring our datasets, we may need to clean them for better analysis. Data often comes in from multiple sources, so some values may contain errors. This is where data cleaning becomes extremely important. In this section, we will delete unwanted columns, rename columns, correct data types, and so on.

In [558]:
import numpy as np
import pandas as pd

In [559]:
# read csv file

df = pd.read_csv('online_store_customer_data.csv')
df.head(3)

Unnamed: 0,Transaction_date,Transaction_ID,Gender,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent
0,1/1/2019,151200,Female,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36
1,1/1/2019,151201,Male,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04
2,1/1/2019,151202,Male,63.0,Married,New Mexico,Basic,workers,PayPal,1.0,1572.6


In [560]:
# Drop unwanted columns
df.drop(['Transaction_ID'], axis=1, inplace=True)

In [561]:
# create a new df_col dataframe with the df.copy() method
df_col = df.copy()

# rename columns
df_col.rename(columns={"Transaction_date": "Date", "Gender": "Sex"}, inplace=True)
df_col.head(3)

Unnamed: 0,Date,Sex,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent
0,1/1/2019,Female,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36
1,1/1/2019,Male,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04
2,1/1/2019,Male,63.0,Married,New Mexico,Basic,workers,PayPal,1.0,1572.6


In [562]:
# Add a new adjusted column whose value is Amount_spent * 100
df_col['new_col'] = df_col['Amount_spent'] * 100

In [563]:
df_col.head(3)

Unnamed: 0,Date,Sex,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent,new_col
0,1/1/2019,Female,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36,205136.0
1,1/1/2019,Male,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04,54404.0
2,1/1/2019,Male,63.0,Married,New Mexico,Basic,workers,PayPal,1.0,1572.6,157260.0


In [564]:
# change Female to Women and Male to Man in the Sex column
# the first argument of loc is the row condition and the second is the column name
df_col.loc[df_col.Sex == "Female", 'Sex'] = 'Women' 
df_col.loc[df_col.Sex == "Male", 'Sex'] = 'Man'

In [565]:
df_col.head(3)

Unnamed: 0,Date,Sex,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent,new_col
0,1/1/2019,Women,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36,205136.0
1,1/1/2019,Man,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04,54404.0
2,1/1/2019,Man,63.0,Married,New Mexico,Basic,workers,PayPal,1.0,1572.6,157260.0


The values in the Sex column have now changed from Female to Women and from Male to Man.
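
The same relabeling can also be written as a single mapping with `Series.replace`. This is a minimal sketch on a small made-up DataFrame (`toy` is a hypothetical stand-in, not the store data):

```python
import pandas as pd

# Hypothetical toy data standing in for the Sex column
toy = pd.DataFrame({"Sex": ["Female", "Male", None, "Male"]})

# One dictionary covers every relabeling; values that are not keys
# in the mapping (including missing ones) are left untouched
toy["Sex"] = toy["Sex"].replace({"Female": "Women", "Male": "Man"})

print(toy["Sex"].tolist())
```

A mapping keeps all label changes in one place, which helps when there are more than two categories to rename.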

In [566]:
# check the current data types

df_col.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2512 entries, 0 to 2511
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              2512 non-null   object 
 1   Sex               2484 non-null   object 
 2   Age               2470 non-null   float64
 3   Marital_status    2512 non-null   object 
 4   State_names       2512 non-null   object 
 5   Segment           2512 non-null   object 
 6   Employees_status  2486 non-null   object 
 7   Payment_method    2512 non-null   object 
 8   Referal           2357 non-null   float64
 9   Amount_spent      2270 non-null   float64
 10  new_col           2270 non-null   float64
dtypes: float64(4), object(7)
memory usage: 216.0+ KB


The Date column has the object type, so we will convert it to a datetime type. We will also convert the Referal column from float64 to float32.

In [567]:
# change object type to datetime64 format
df_col['Date'] = df_col['Date'].astype('datetime64[ns]')

# change the Referal column from float64 to float32
df_col['Referal'] = df_col['Referal'].astype('float32')

In [568]:
df_col.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2512 entries, 0 to 2511
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date              2512 non-null   datetime64[ns]
 1   Sex               2484 non-null   object        
 2   Age               2470 non-null   float64       
 3   Marital_status    2512 non-null   object        
 4   State_names       2512 non-null   object        
 5   Segment           2512 non-null   object        
 6   Employees_status  2486 non-null   object        
 7   Payment_method    2512 non-null   object        
 8   Referal           2357 non-null   float32       
 9   Amount_spent      2270 non-null   float64       
 10  new_col           2270 non-null   float64       
dtypes: datetime64[ns](1), float32(1), float64(3), object(6)
memory usage: 206.2+ KB
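
As an alternative to `astype`, `pd.to_datetime` lets us state the expected date format and decide what happens to values that cannot be parsed. The sketch below assumes month/day/year strings like the ones shown earlier; the `raw` series and its bad entry are made up for illustration:

```python
import pandas as pd

# Hypothetical date strings in month/day/year form, plus one bad entry
raw = pd.Series(["1/1/2019", "1/25/2019", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception
dates = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(dates)
print("unparseable entries:", dates.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see how many entries did not match the assumed format.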


## Remove duplicates

In [569]:
# Count duplicated rows
df.duplicated().sum()

12

In [570]:
# display duplicate rows; the keep argument accepts 'first', 'last' and False
duplicate_value = df.duplicated(keep='first')

df.loc[duplicate_value, :]

Unnamed: 0,Transaction_date,Gender,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent
64,1/25/2019,Male,73.0,Married,West Virginia,Basic,Employees,PayPal,0.0,1397.09
65,1/26/2019,Male,55.0,Married,Kansas,Basic,Employees,Other,1.0,1277.64
66,1/26/2019,Female,72.0,Married,Iowa,Silver,Unemployment,PayPal,,515.77
67,1/26/2019,Male,15.0,Married,South Carolina,Basic,self-employed,Other,1.0,790.1
68,1/27/2019,Female,63.0,Single,Texas,Gold,Employees,Card,1.0,1218.56
109,2/6/2019,Male,60.0,Married,Utah,Silver,Unemployment,Other,1.0,433.2
110,2/7/2019,Female,45.0,Married,Missouri,Platinum,workers,Other,1.0,929.89
111,2/8/2019,Male,33.0,Single,Arizona,Silver,workers,PayPal,0.0,2560.26
112,2/8/2019,Male,24.0,Married,South Carolina,Basic,Unemployment,Other,0.0,
113,2/8/2019,Female,53.0,Single,Colorado,Basic,self-employed,Other,1.0,1888.69


In [571]:
# drop duplicate rows, keeping the first occurrence of each
df.drop_duplicates(keep = 'first', inplace = True)
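
To make the effect of the `keep` argument concrete, here is a small sketch on made-up data (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical data: row (1, "a") appears twice, row (3, "c") three times
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "value": ["a", "a", "b", "c", "c", "c"],
})

first = toy.drop_duplicates(keep="first")  # keep the first copy of each row
last = toy.drop_duplicates(keep="last")    # keep the last copy of each row
none = toy.drop_duplicates(keep=False)     # drop every row that has a duplicate

print(list(first.index), list(last.index), list(none.index))
```

Note that `keep=False` removes all copies of a duplicated row, so it is the only option that truly drops every duplicate.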

In [572]:
df_col.head(3)

Unnamed: 0,Date,Sex,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent,new_col
0,2019-01-01,Women,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36,205136.0
1,2019-01-01,Man,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04,54404.0
2,2019-01-01,Man,63.0,Married,New Mexico,Basic,workers,PayPal,1.0,1572.6,157260.0


## Handling missing values

Handling missing values is a common task in data preprocessing. We will encounter missing values most of the time, for many reasons. Without dealing with them, we can't build a proper model. In this section, we first find the missing values, then decide how to handle them. We can remove the affected columns or rows, or replace the missing entries with appropriate values.

In [573]:
df.isna().sum().sort_values(ascending=False)

Amount_spent        241
Referal             154
Age                  42
Gender               28
Employees_status     26
Transaction_date      0
Marital_status        0
State_names           0
Segment               0
Payment_method        0
dtype: int64

If only a few values are NaN, we can delete the affected rows with the dropna() function. We pass the column names to its subset parameter.

In [574]:
# copy df to df_new
df_new = df.copy()

In [575]:
df_new.head(2)

Unnamed: 0,Transaction_date,Gender,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent
0,1/1/2019,Female,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36
1,1/1/2019,Male,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04


In [576]:
# Delete rows where Employees_status is NaN
df_new.dropna(subset = ["Employees_status"], inplace=True)
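
Besides `subset`, `dropna` also takes a `how` argument that controls when a row counts as missing. A minimal sketch on made-up data (the `toy` frame and its `job` and `age` columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
toy = pd.DataFrame({
    "job": ["workers", np.nan, "Employees", np.nan],
    "age": [19.0, 49.0, np.nan, np.nan],
})

by_job = toy.dropna(subset=["job"])  # drop rows where job is missing
any_nan = toy.dropna(how="any")      # drop rows with at least one missing value
all_nan = toy.dropna(how="all")      # drop rows only if every value is missing

print(len(by_job), len(any_nan), len(all_nan))
```

Using `subset` keeps rows that are incomplete in other columns, which usually loses less data than dropping on any missing value.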

### Delete entire columns

If a particular column has a large number of NaN values, dropping that column might be a better decision than imputing.

In [577]:
df_new.drop(columns=['Amount_spent'], inplace=True)

In [578]:
df_new.isna().sum().sort_values(ascending=False)

Referal             153
Age                  42
Gender               27
Transaction_date      0
Marital_status        0
State_names           0
Segment               0
Employees_status      0
Payment_method        0
dtype: int64
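
One way to decide which columns to drop is to compute the share of missing values per column and compare it with a cutoff. The sketch below uses made-up data, and the 0.5 threshold is an arbitrary assumption rather than a recommended value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" is mostly missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isna().mean()

threshold = 0.5  # assumed cutoff, chosen only for illustration
to_drop = missing_share[missing_share > threshold].index
reduced = toy.drop(columns=to_drop)

print(missing_share)
print(list(reduced.columns))
```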

### Impute missing values

Sometimes deleting an entire column is not the appropriate approach. Deleting columns can hurt model building because we may lose important features. There are many approaches to imputing; here are some of the most popular techniques.

Method 1 - Impute a fixed value such as 0, 'Unknown' or 'Missing'. Here we impute 'Unknown' in the Gender column.

In [579]:
df['Gender'] = df['Gender'].fillna('Unknown')

Method 2 - Impute the mean, median or mode

In [580]:
# Impute the mean in the Amount_spent column
mean_amount_spent = df['Amount_spent'].mean()
df['Amount_spent'] = df['Amount_spent'].fillna(mean_amount_spent)

# Impute the median in the Age column
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# Impute the mode in the Employees_status column
mode_emp = df['Employees_status'].mode().iloc[0]
df['Employees_status'] = df['Employees_status'].fillna(mode_emp)
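
The choice between mean and median matters when a column contains extreme values, because the mean is pulled toward outliers while the median is not. A small sketch with made-up amounts (the `amounts` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical amounts with one missing value and one extreme value
amounts = pd.Series([100.0, 120.0, 110.0, np.nan, 5000.0])

mean_filled = amounts.fillna(amounts.mean())      # mean of the non-missing values
median_filled = amounts.fillna(amounts.median())  # median of the non-missing values

print("filled with mean:  ", mean_filled[3])
print("filled with median:", median_filled[3])
```

Here the single large value drags the mean far above the typical amount, so the median gives a more representative fill value.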

Method 3 - Impute by forward fill or backfill with ffill and bfill. With ffill, a missing value is filled with the value from the row above; with bfill, it is taken from the row below.

In [581]:
df['Referal'] = df['Referal'].ffill()
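
The difference between the two directions shows up at the edges of the data. In this sketch on a made-up series (`s` is hypothetical), a leading gap survives `ffill` and a trailing gap survives `bfill`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle and at the end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 0.0, np.nan])

forward = s.ffill()   # copy the last seen value downward
backward = s.bfill()  # copy the next seen value upward

print(forward.tolist())
print(backward.tolist())
```

Because of this, it is worth checking `isna().sum()` after filling, or chaining both directions when gaps at either end are possible.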

In [582]:
df.isna().sum().sum()

0

In [583]:
df.head(3)

Unnamed: 0,Transaction_date,Gender,Age,Marital_status,State_names,Segment,Employees_status,Payment_method,Referal,Amount_spent
0,1/1/2019,Female,19.0,Single,Kansas,Basic,Unemployment,Other,1.0,2051.36
1,1/1/2019,Male,49.0,Single,Illinois,Basic,self-employed,Card,0.0,544.04
2,1/1/2019,Male,63.0,Married,New Mexico,Basic,workers,PayPal,1.0,1572.6


We have now dealt with all missing values using different methods, so no null values remain.

# Memory management

When we work with large datasets, memory becomes a major issue, and handling it can require substantial resources. Pandas offers some methods that help. Here are some strategies for reducing memory usage with pandas.

In [584]:
df_memory = df.copy()

In [585]:
memory_usage = df_memory.memory_usage(deep=True)
memory_usage_in_mbs = round(np.sum(memory_usage / 1024 ** 2), 3)
print(f" Total memory taking df_memory dataframe is : {memory_usage_in_mbs:.2f} MB ")

 Total memory taking df_memory dataframe is : 1.15 MB 
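
Before optimizing, it helps to see which columns use the most memory. A minimal sketch on made-up data (the `toy` frame and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column
toy = pd.DataFrame({
    "state": ["Kansas", "Illinois", "Kansas"] * 1000,
    "amount": [2051.36, 544.04, 1572.6] * 1000,
})

# deep=True measures the actual string contents, not just the references
per_column = toy.memory_usage(deep=True).sort_values(ascending=False)

print(per_column)
```

Text columns typically dominate, which is why converting them is often the first step.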


Change object to category datatypes

In [586]:
# Object datatype to category convert
df_memory[df_memory.select_dtypes(['object']).columns] = df_memory.select_dtypes(['object']).apply(lambda x: x.astype('category'))

In [587]:
# check memory usage after the conversion
df_memory.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 0 to 2511
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Transaction_date  2500 non-null   category
 1   Gender            2500 non-null   category
 2   Age               2500 non-null   float64 
 3   Marital_status    2500 non-null   category
 4   State_names       2500 non-null   category
 5   Segment           2500 non-null   category
 6   Employees_status  2500 non-null   category
 7   Payment_method    2500 non-null   category
 8   Referal           2500 non-null   float64 
 9   Amount_spent      2500 non-null   float64 
dtypes: category(7), float64(3)
memory usage: 189.1 KB


Memory usage is now reduced from 1.15 MB to 189.1 KB, roughly a sixfold reduction.
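
The saving comes from storing each distinct string once and replacing the values with small integer codes. This sketch measures the effect on a made-up series (`states` is hypothetical):

```python
import pandas as pd

# Hypothetical column with only three distinct values repeated many times
states = pd.Series(["Kansas", "Illinois", "New Mexico"] * 1000)
as_category = states.astype("category")

before = states.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)

print("distinct values:", len(as_category.cat.categories))
print("bytes before:", before, "bytes after:", after)
```

The gain is largest for columns with few distinct values; a column where almost every value is unique benefits much less from the conversion.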

Change int64 or float64 to smaller types such as int32, int16, int8 or float32
By default, pandas stores numeric values as int64 or float64, which takes more memory. If we only need to store small numbers, we can change from 64 bits to 32, 16, and so on. For example, our Referal column holds only 0 and 1, so we don't need float64 for it. Here we change it to float32.

In [588]:
# Change Referal column datatypes
df_memory['Referal'] = df_memory['Referal'].astype('float32')

In [589]:
# check memory usage after the change
df_memory.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 0 to 2511
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Transaction_date  2500 non-null   category
 1   Gender            2500 non-null   category
 2   Age               2500 non-null   float64 
 3   Marital_status    2500 non-null   category
 4   State_names       2500 non-null   category
 5   Segment           2500 non-null   category
 6   Employees_status  2500 non-null   category
 7   Payment_method    2500 non-null   category
 8   Referal           2500 non-null   float32 
 9   Amount_spent      2500 non-null   float64 
dtypes: category(7), float32(1), float64(2)
memory usage: 179.3 KB


After changing the data type of only one column, memory usage drops from 189.1 KB to 179.3 KB.
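
Instead of picking the target type by hand, `pd.to_numeric` can choose the smallest suitable type through its `downcast` argument. A minimal sketch on made-up 0/1 data (`referal` and `counts` are hypothetical):

```python
import pandas as pd

# Hypothetical 0/1 flags stored as float64, as in the Referal column
referal = pd.Series([1.0, 0.0, 1.0, 0.0], dtype="float64")

# downcast="float" goes no lower than float32
smaller_float = pd.to_numeric(referal, downcast="float")

# With no missing values, 0/1 flags also fit into an 8-bit integer
as_int8 = referal.astype("int8")

# downcast="integer" picks the smallest integer type that holds the values
counts = pd.Series([0, 1, 1, 0], dtype="int64")
smaller_int = pd.to_numeric(counts, downcast="integer")

print(smaller_float.dtype, as_int8.dtype, smaller_int.dtype)
```

Integer types cannot hold NaN, so the conversion to `int8` only works once the missing values have been filled.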

In [590]:
df.to_excel('online_store_customer_data_cleaned.xlsx')