# Content
     
**Data Preprocessing**  
   - Loading Data
   - Glossary
   - Dealing with Missing Values
   - Fixing Data Types
   - Dealing with Bad Values
   - Feature Engineering

# Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
from datetime import datetime, date, timedelta
from tabulate import tabulate
from IPython.display import HTML
import notebook_settings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Loading Data

In [2]:
data_raw = pd.read_csv('../data/ecommerce.csv', encoding='iso-8859-1')
data = data_raw.copy()

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
 8   Unnamed: 8   0 non-null       float64
dtypes: float64(3), int64(1), object(5)
memory usage: 37.2+ MB


## Glossary

In [4]:
glossary = [['Columns', 'Meaning'],
            ['InvoiceNo', 'Unique Identifier of each transaction'],
            ['StockCode', 'Internal item code'],
            ['Description', 'Item description/summary'],
            ['Quantity', 'Quantity of each item per transaction'],
            ['InvoiceDate', 'Date of the transaction'],
            ['UnitPrice', 'Product price per unit'],
            ['CustomerID', 'Unique Identifier of Customer'],
            ['Country', 'Customer\'s country of residence']
           ]
#print(tabulate(glossary, headers='firstrow', stralign='left', tablefmt='simple'))

## Dealing with Missing Values

In [5]:
data.isna().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
Unnamed: 8     541909
dtype: int64

In [6]:
data = data.drop('Unnamed: 8', axis=1)
data = data.dropna(subset=['Description','CustomerID'])

In [7]:
#data["IsCancelled"]=np.where(data.InvoiceNo.apply(lambda l: l[0]=="C"), True, False)
#data.IsCancelled.value_counts() / data.shape[0] * 100 , data.IsCancelled.value_counts()
#data[data["InvoiceNo"].str.startswith("C")]

Because the purpose of this project is to group customers, it makes no sense to classify unidentified ones. To simplify the study, we initially ignore unidentified customers: those who made a purchase but whose identity is unknown, either because they were not registered users or because their ID was not recorded at the time of purchase.
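Before dropping these rows it can help to quantify what is lost. In the `isna()` output above, 135,080 of 541,909 rows (roughly 25%) have no `CustomerID`. The snippet below is a minimal sketch of that check on a made-up toy frame; the values and the names `toy`, `share_rows_missing` and `share_revenue_missing` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real transactions
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "CustomerID": [17850.0, None, 13047.0, None],
    "Quantity": [6, 2, 3, 1],
    "UnitPrice": [2.55, 1.85, 4.25, 3.39],
})

# Share of rows without a customer identifier
missing_mask = toy["CustomerID"].isna()
share_rows_missing = missing_mask.mean()

# Share of revenue attached to those rows
revenue = toy["Quantity"] * toy["UnitPrice"]
share_revenue_missing = revenue[missing_mask].sum() / revenue.sum()

# Keep only identified customers, as the notebook does with dropna
identified = toy.dropna(subset=["CustomerID"])
```

If the revenue share turned out to be large, the unidentified rows might deserve a separate analysis rather than being discarded outright.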

## Fixing Data Types

In [8]:
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format='%d-%b-%y') #format='%m/%d/%Y %H:%M'
data['CustomerID'] = data['CustomerID'].astype(int)

data['Total'] = data['Quantity'] * data['UnitPrice']

In [9]:
#data['date'] = pd.to_datetime(data['date'], format="%Y-%m-%d")
# data['year'] = pd.DatetimeIndex(data['InvoiceDate']).year
# data['month'] = pd.DatetimeIndex(data['InvoiceDate']).month
# data['week_of_year'] = data['InvoiceDate'].dt.isocalendar().week
# data['day'] = pd.DatetimeIndex(data['InvoiceDate']).day

In [10]:
data.InvoiceDate.min(), data.InvoiceDate.max()

(Timestamp('2016-11-29 00:00:00'), Timestamp('2017-12-07 00:00:00'))
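The `format='%d-%b-%y'` argument tells `pd.to_datetime` to expect a day, an abbreviated month name and a two-digit year. The sketch below assumes date strings shaped like `29-Nov-16` (inferred from the format string and the date range above, not taken from the raw file) and uses `errors="coerce"` to surface values that do not match instead of raising.

```python
import pandas as pd

# Hypothetical sample strings; the last one is deliberately malformed
raw_dates = pd.Series(["29-Nov-16", "07-Dec-17", "not a date"])

# Values that do not match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%d-%b-%y", errors="coerce")

# Count how many values failed to parse
n_unparsed = parsed.isna().sum()
```

Checking `n_unparsed` before overwriting the column is a cheap way to confirm that the chosen format really fits the whole file.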

## Dealing with Bad Values

In [11]:
# data.loc[data['CustomerID'] == 12346] 
# data.loc[data['CustomerID'] == 16446] 

data = data[~data['CustomerID'].isin([12346, 16446])]
data = data.loc[~(data['UnitPrice'] < 0.04)]

In [12]:
# Stock codes that map to more than one description
desc_per_code = data.groupby("StockCode")["Description"].nunique()
desc_per_code[desc_per_code != 1]

StockCode
16156L    2
17107D    3
20622     2
20725     2
20914     2
         ..
85184C    2
85185B    2
90014A    2
90014B    2
90014C    2
Name: Description, Length: 213, dtype: int64
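The output above lists 213 stock codes with more than one description. The notebook does not act on this, but one possible cleanup, sketched here on made-up data, is to map each code to its most frequent description. The stock codes, descriptions and the name `canonical` are hypothetical.

```python
import pandas as pd

# Hypothetical rows: code "A1" appears with two spellings of its description
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "A1", "B2"],
    "Description": ["RED LUNCH BAG", "RED LUNCH BAG ?", "RED LUNCH BAG", "CAKE STAND"],
})

# Codes with more than one distinct description
n_desc = toy.groupby("StockCode")["Description"].nunique()
ambiguous = n_desc[n_desc > 1]

# Pick the most frequent description per code and apply it to every row
canonical = toy.groupby("StockCode")["Description"].agg(
    lambda s: s.value_counts().idxmax()
)
toy["Description"] = toy["StockCode"].map(canonical)
```

Since `Description` is not used in the customer features built later, this step is optional for the clustering itself.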

Absurdly large purchases followed by cancellations, and purchase values close to or below zero, are treated as bad input values and deleted. They may still be useful for generating insights in the EDA stage, but they significantly hurt the performance of the machine learning model.

Because this dataset is a time slice of the company's sales, it contains cancellations whose original purchases fall outside the slice, which is a serious problem. One way to solve this is to identify those cancellations one by one and delete their rows; another is to delete the customers to whom, on balance, the company owes money. I chose the second approach because it is simpler to implement; it is applied later in the code.
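The second approach can be sketched as follows: sum each customer's revenue, returns included, and keep only customers whose net total is positive. This is an illustration on made-up data, not the notebook's exact code (which later filters on `GrossRevenueTotal`); the names `toy`, `net_revenue` and `keep_ids` are assumptions.

```python
import pandas as pd

# Hypothetical transactions; negative quantities are returns/cancellations
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "Quantity": [10, -2, -5, 1, 4],
    "UnitPrice": [2.0, 2.0, 3.0, 3.0, 1.5],
})
toy["Total"] = toy["Quantity"] * toy["UnitPrice"]

# Net revenue per customer, returns included
net_revenue = toy.groupby("CustomerID")["Total"].sum()

# Customer 2 has a negative balance (the company "owes" them), so they are dropped
keep_ids = net_revenue[net_revenue > 0].index
cleaned = toy[toy["CustomerID"].isin(keep_ids)]
```

The trade-off is that a customer with an orphaned cancellation but a positive balance stays in the data with a slightly understated revenue.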

## Feature Engineering

In [13]:
df2 = data.copy()

In [14]:
df_purchase = data.loc[data['Quantity'] >= 0]
df_returns = data.loc[data['Quantity'] < 0]

In [15]:
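# Drop the transaction-level columns and de-duplicate.
# Note: 'Total' is not dropped, so this keeps one row per distinct (CustomerID, Total) pair.
data_client_extended = df2.drop(
    ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'Country'],
    axis=1,
).drop_duplicates(ignore_index=True)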

In [16]:
# GrossRevenue
df2['GrossRevenuePartial'] = df2['Quantity'] * df2['UnitPrice']
aux_revenue = df2[['CustomerID', 'GrossRevenuePartial']].groupby('CustomerID').sum().reset_index().rename(columns={'GrossRevenuePartial':'GrossRevenueTotal'})
data_client_extended = pd.merge(data_client_extended,aux_revenue, how='left',on='CustomerID')

In [17]:
# Recency: days since the customer's last purchase
last_day = data.InvoiceDate.max() + dt.timedelta(days = 1)
aux_recency = df_purchase[['CustomerID','InvoiceDate']].groupby('CustomerID').max().reset_index()
aux_recency['RecencyDays'] = (last_day - aux_recency['InvoiceDate']).dt.days
data_client_extended = pd.merge(data_client_extended, aux_recency[['CustomerID','RecencyDays']], on ='CustomerID', how='left')

In [18]:
# Frequency
aux_freq = df_purchase[['CustomerID','InvoiceNo']].drop_duplicates('InvoiceNo').groupby('CustomerID').count().reset_index().rename(columns={'InvoiceNo':'Frequency'})
data_client_extended = pd.merge(data_client_extended, aux_freq, on='CustomerID',how='left')

In [19]:
# Average ticket (mean revenue per invoice line)
aux_ticket = df2[['CustomerID','GrossRevenuePartial']].groupby('CustomerID').mean().reset_index().rename(columns={'GrossRevenuePartial':'AvarageTicket'})
data_client_extended = pd.merge(data_client_extended, aux_ticket,on='CustomerID',how='left')

In [20]:
# Number of products purchased (count of purchase lines)
aux_prod = df_purchase.loc[:,['CustomerID', 'StockCode']].groupby('CustomerID').count().reset_index().rename(columns={'StockCode':'NumberProducts'})
data_client_extended = pd.merge(data_client_extended, aux_prod, on='CustomerID', how='left')

In [21]:
# Number of returns (total quantity returned)
aux_return = df_returns[['CustomerID', 'Quantity']].groupby('CustomerID').sum().reset_index().rename(columns={'Quantity':'NumberReturn'})
aux_return['NumberReturn'] = -1*aux_return['NumberReturn']
aux_return['NumberReturn'] = aux_return['NumberReturn'].fillna(0)
data_client_extended = pd.merge(data_client_extended, aux_return, on='CustomerID', how='left')

In [22]:
data_client_extended = data_client_extended.set_index('CustomerID')
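Because the base frame built in cell 15 still carries the `Total` column, a customer can appear on several rows after the merges. As an alternative sketch (not the notebook's method), the same features can be computed with `groupby` so that the result has exactly one row per customer. The toy data and the name `features` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions; "CB2" stands in for a cancellation invoice
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "CB2"],
    "InvoiceDate": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-03-01", "2017-02-01", "2017-02-10"]
    ),
    "Quantity": [2, 1, 5, 4, -1],
    "UnitPrice": [10.0, 5.0, 2.0, 3.0, 3.0],
})
toy["Total"] = toy["Quantity"] * toy["UnitPrice"]

purchases = toy[toy["Quantity"] >= 0]
returns = toy[toy["Quantity"] < 0]

# Reference date: one day after the last transaction in the data
last_day = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

# Each Series is indexed by CustomerID, so the columns align on that index
features = pd.DataFrame({
    "GrossRevenueTotal": toy.groupby("CustomerID")["Total"].sum(),
    "RecencyDays": (last_day - purchases.groupby("CustomerID")["InvoiceDate"].max()).dt.days,
    "Frequency": purchases.groupby("CustomerID")["InvoiceNo"].nunique(),
    "NumberReturn": -returns.groupby("CustomerID")["Quantity"].sum(),
})

# Customers with no returns get NaN from the alignment; treat that as zero
features["NumberReturn"] = features["NumberReturn"].fillna(0)
```

Building every feature from a `CustomerID`-indexed Series avoids the repeated `pd.merge` calls and guarantees a unique index.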

In [23]:
data_client_extended.sample(10)

| CustomerID | Total | GrossRevenueTotal | RecencyDays | Frequency | AvarageTicket | NumberProducts | NumberReturn |
|---|---|---|---|---|---|---|---|
| 16361 | 10.4 | 896.66 | 10.0 | 4.0 | 8.459057 | 106.0 | NaN |
| 17949 | 432.0 | 52750.84 | 2.0 | 45.0 | 667.732152 | 70.0 | 2975.0 |
| 14953 | 19.8 | 289.82 | 26.0 | 1.0 | 5.269455 | 55.0 | NaN |
| 16133 | -1.65 | 14305.66 | 4.0 | 33.0 | 39.301264 | 338.0 | 150.0 |
| 17086 | 25.5 | 2050.08 | 8.0 | 6.0 | 21.355 | 96.0 | NaN |
| 14967 | 59.4 | 463.8 | 50.0 | 1.0 | 77.3 | 6.0 | NaN |
| 16050 | 15.12 | 137.9 | 174.0 | 1.0 | 13.79 | 10.0 | NaN |
| 15239 | 19.5 | 764.34 | 11.0 | 2.0 | 14.698846 | 49.0 | 3.0 |
| 15764 | 19.8 | 3245.47 | 89.0 | 6.0 | 17.263138 | 180.0 | 14.0 |
| 16303 | 99.0 | 5305.83 | 26.0 | 3.0 | 31.771437 | 162.0 | 38.0 |


In [24]:
# Some customers have more returns than purchases because the original purchase predates the data window
data_client_extended.loc[data_client_extended['NumberReturn'].isna(), 'NumberReturn'] = 0
data_client_extended=data_client_extended.dropna()
data_client_extended=data_client_extended.loc[~(data_client_extended['GrossRevenueTotal'] < 0.01)]

In [25]:
data_client_extended

| CustomerID | Total | GrossRevenueTotal | RecencyDays | Frequency | AvarageTicket | NumberProducts | NumberReturn |
|---|---|---|---|---|---|---|---|
| 17850 | 15.30 | 5288.63 | 373.0 | 34.0 | 16.950737 | 297.0 | 40.0 |
| 17850 | 20.34 | 5288.63 | 373.0 | 34.0 | 16.950737 | 297.0 | 40.0 |
| 17850 | 22.00 | 5288.63 | 373.0 | 34.0 | 16.950737 | 297.0 | 40.0 |
| 17850 | 15.30 | 5288.63 | 373.0 | 34.0 | 16.950737 | 297.0 | 40.0 |
| 17850 | 25.50 | 5288.63 | 373.0 | 34.0 | 16.950737 | 297.0 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 13777 | 51.84 | 25748.35 | 1.0 | 33.0 | 117.572374 | 197.0 | 93.0 |
| 13777 | 88.80 | 25748.35 | 1.0 | 33.0 | 117.572374 | 197.0 | 93.0 |
| 15804 | 30.00 | 3848.55 | 1.0 | 13.0 | 14.097253 | 262.0 | 52.0 |
| 12680 | 10.20 | 862.81 | 1.0 | 4.0 | 16.592500 | 52.0 | 0.0 |


In [26]:
data_client_resume = data_client_extended[['GrossRevenueTotal','RecencyDays','Frequency']]