# 0.0. BUSINESS PROBLEM

## 0.1. Challenge

Identify the most valuable customers to join the "Insiders" group program.

## 0.2. Business Questions

1. Who are the customers eligible to participate in the Insiders group?
2. How many customers will be part of the group?
3. What are the main characteristics of these customers?
4. What percentage of revenue comes from the Insiders group?
5. What is the expected revenue from the Insiders group over the next few months?
6. What conditions must a customer meet to be an Insider?
7. What conditions must a customer meet to be excluded from the Insiders group?
8. How can we ensure that the Insiders group is better than the rest of the customer base?
9. What can Marketing do to increase revenue?

# 1.0. FUNCTIONS & LIBS

## 1.1. Imports

In [98]:
import datetime
import numpy  as np
import pandas as pd

## 1.2. Helper Functions

# 2.0. DATA DESCRIPTION

## 2.1. Variables

**Invoice Number** - unique identifier of each transaction

**Stock Code Product** - item code

**Description Product** - item name

**Quantity** - quantity of each item from a transaction

**Invoice Date** - transaction day

**Unit Price** - item price

**Customer ID** - unique customer identifier

**Country** - country where customer lives

## 2.2. Load Data

In [2]:
df_raw = pd.read_csv('../data/raw/Ecommerce.csv', encoding='iso-8859-1')
df_raw = df_raw.drop(columns = 'Unnamed: 8')

## 2.3. Rename Columns

In [3]:
cols_new = ['invoice_no', 'stock_code', 'description', 'quantity', 'invoice_date', 'unit_price', 'customer_id', 'country']
df_raw.columns = cols_new

## 2.4. DF Dimensions

In [4]:
print('Number of rows: {}'.format(df_raw.shape[0]))
print('Number of columns: {}'.format(df_raw.shape[1]))

Number of rows: 541909
Number of columns: 8


## 2.5. Check NA and Duplicates

In [5]:
# check NA
print('Number of NAs:')
print(df_raw.isna().sum())

# check duplicates
print('\nNumber of Duplicated:')
print(df_raw.duplicated().sum())

Number of NAs:
invoice_no           0
stock_code           0
description       1454
quantity             0
invoice_date         0
unit_price           0
customer_id     135080
country              0
dtype: int64

Number of Duplicated:
5269


In [6]:
# df_raw[df_raw.duplicated(keep=False)].sort_values(by=['customer_id','invoice_no', 'stock_code']).tail(15)
# # after investigating duplicated rows, it was decided to consider them as authentic transactions
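The two `duplicated` calls above behave differently, which matters when reading the count of 5269. A minimal sketch on a hypothetical toy dataframe (the values are made up, not taken from the dataset): the default call flags only the repeated copies, while `keep=False` flags every row that belongs to a duplicated group.

```python
import pandas as pd

# Hypothetical toy transactions: the first two rows are identical
toy = pd.DataFrame({
    'invoice_no': ['A1', 'A1', 'A2'],
    'stock_code': ['X', 'X', 'Y'],
    'quantity': [1, 1, 5],
})

# Default keep='first': only the second copy is flagged
n_repeats = toy.duplicated().sum()

# keep=False: both copies are flagged, useful for side-by-side inspection
all_copies = toy[toy.duplicated(keep=False)]

print('Repeated rows:', n_repeats)
print(all_copies)
```

So a count from `duplicated().sum()` is the number of extra copies, not the number of rows involved in duplication.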

### 2.5.1. Replace NA

In [7]:
# first cycle, remove all NA without further analysis
# make a new dataframe
df_raw_1 = df_raw.dropna(subset=['description', 'customer_id'])
print('Removed data: {:.2f}%'.format((1-(df_raw_1.shape[0]/df_raw.shape[0]))*100))

# # check NA
# print('Number of NAs:')
# print(df_raw_1.isna().sum())

Removed data: 24.93%


## 2.6. Check DTypes

In [8]:
df_raw_1.dtypes

invoice_no       object
stock_code       object
description      object
quantity          int64
invoice_date     object
unit_price      float64
customer_id     float64
country          object
dtype: object

### 2.6.1 Change DTypes

In [11]:
df_raw_1 = df_raw_1.astype({'customer_id': 'int64'})
df_raw_1['invoice_date'] = pd.to_datetime(df_raw_1['invoice_date'], format='%d-%b-%y')

In [17]:
df_raw_1.dtypes

invoice_no              object
stock_code              object
description             object
quantity                 int64
invoice_date    datetime64[ns]
unit_price             float64
customer_id              int64
country                 object
dtype: object
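As a quick check of the date format used above, the sketch below parses a few hypothetical strings with `format='%d-%b-%y'` (day, abbreviated English month name, two-digit year). The sample values are illustrative only and are not taken from the dataset; `%b` also assumes an English locale.

```python
import pandas as pd

# Hypothetical sample values in the same day-month-year layout
sample = pd.Series(['29-Nov-16', '01-Dec-16', '07-Dec-17'])

# Explicit format avoids ambiguous day/month inference
parsed = pd.to_datetime(sample, format='%d-%b-%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```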

## 2.7. Descriptive Analysis

# 3.0. FEATURE ENGINEERING

In [120]:
df_raw_3 = df_raw_1.copy()

## 3.1. Features Creation

In [121]:
# separate unique customers
df_ref = df_raw_3[['customer_id']].drop_duplicates(ignore_index=True)

### 3.1.1. Gross Revenue - total money spent by customer

In [122]:
df_raw_3['gross_revenue'] = df_raw_3['quantity']*df_raw_3['unit_price']
df_gross_revenue = df_raw_3[['customer_id', 'gross_revenue']].groupby('customer_id').sum().reset_index()
df_ref = pd.merge(df_ref, df_gross_revenue, on='customer_id', how='left')

### 3.1.2. Recency - number of days since last purchase

In [123]:
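# recency - days since each customer's last purchase, measured from the most recent date in the data
# the +1 offset means a purchase on the latest date gets recency 1 instead of 0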
df_raw_3['recency'] = (df_raw_3['invoice_date'].max()-df_raw_3['invoice_date']).dt.days+1
df_recency = df_raw_3[['customer_id', 'recency']].groupby('customer_id').min().reset_index()
df_ref = pd.merge(df_ref, df_recency, on='customer_id', how='left')
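To make the recency logic concrete, here is a minimal sketch on a hypothetical toy dataframe (customer IDs and dates are made up). It follows the same steps as the cell above: take the latest date as reference, compute the day difference plus one for every row, then keep the minimum per customer.

```python
import pandas as pd

# Hypothetical toy transactions (not from the dataset)
toy = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'invoice_date': pd.to_datetime(['2017-12-01', '2017-12-07', '2017-11-30']),
})

# Reference date: the most recent purchase in the data
last_date = toy['invoice_date'].max()

# Days since each purchase, offset by 1 as in the cell above
toy['recency'] = (last_date - toy['invoice_date']).dt.days + 1

# The smallest value per customer corresponds to the most recent purchase
toy_recency = toy.groupby('customer_id')['recency'].min().reset_index()
print(toy_recency)
```

Customer 1 bought on the reference date and gets recency 1; customer 2 last bought 7 days earlier and gets recency 8.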

### 3.1.3. Frequency - number of purchases in last 365 days

In [124]:
# frequency - rows (invoice line items) per customer in the last 365 days; .count() does not deduplicate invoice_no
dd = datetime.timedelta(days=365)
df_last_year = df_raw_3[df_raw_3['invoice_date'] >= (df_raw_3['invoice_date'].max() - dd)]
df_frequency = (df_last_year[['invoice_no', 'customer_id']].groupby('customer_id').count()
                .reset_index().rename(columns={'invoice_no': 'frequency'}))
df_ref = pd.merge(df_ref, df_frequency, on='customer_id', how='left')
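Because each row of the dataset is one item within an invoice, counting `invoice_no` with `.count()` yields the number of line items, which can exceed the number of distinct purchases. The sketch below contrasts the two on a hypothetical toy dataframe (all values are made up); `nunique()` is shown only as a possible alternative, not as the method used in this notebook.

```python
import datetime
import pandas as pd

# Hypothetical toy data: invoice A1 has two line items
toy = pd.DataFrame({
    'invoice_no': ['A1', 'A1', 'A2', 'B1', 'C1'],
    'customer_id': [1, 1, 1, 2, 3],
    'invoice_date': pd.to_datetime(
        ['2017-12-01', '2017-12-01', '2017-12-07', '2017-06-15', '2016-01-10']),
})

# Keep only the 365 days before the most recent date
window_start = toy['invoice_date'].max() - datetime.timedelta(days=365)
recent = toy[toy['invoice_date'] >= window_start]

# Same counting as the cell above: one unit per row
line_items = recent.groupby('customer_id')['invoice_no'].count()

# Alternative: one unit per distinct invoice
distinct_invoices = recent.groupby('customer_id')['invoice_no'].nunique()

print(pd.DataFrame({'line_items': line_items, 'distinct_invoices': distinct_invoices}))
```

Customer 3 has no rows inside the window and therefore drops out of both results, which is why the left merge into `df_ref` can produce missing values.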

## 3.2. Check NA and DTypes after Feature Creation

In [137]:
# NaN count per column (per the cell numbers, this output was produced after the fillna in 3.2.1)
df_ref.isna().sum()

customer_id      0
gross_revenue    0
recency          0
frequency        0
dtype: int64

In [141]:
# DTypes
df_ref.dtypes

customer_id        int64
gross_revenue    float64
recency            int64
frequency          int64
dtype: object

### 3.2.1. Replace NA in df_ref

In [136]:
# NaN values were assigned to customers with 0 purchases in the last 365 days
df_ref['frequency'] = df_ref['frequency'].fillna(0)

### 3.2.2. Change DTypes of df_ref

In [140]:
df_ref = df_ref.astype({'frequency': 'int64'})
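The fill-then-cast sequence is needed because a left merge inserts NaN for customers missing from the right-hand table, and NaN forces an integer column to `float64`. A minimal sketch on hypothetical toy dataframes (names and values are illustrative only):

```python
import pandas as pd

# Hypothetical reference table and frequency table; customer 3 has no frequency row
customers = pd.DataFrame({'customer_id': [1, 2, 3]})
freq = pd.DataFrame({'customer_id': [1, 2], 'frequency': [3, 1]})

# Left merge keeps every customer and fills the gap with NaN
merged = pd.merge(customers, freq, on='customer_id', how='left')
dtype_after_merge = merged['frequency'].dtype

# Treat a missing frequency as zero purchases, then restore the integer dtype
merged['frequency'] = merged['frequency'].fillna(0).astype('int64')

print(dtype_after_merge, '->', merged['frequency'].dtype)
print(merged)
```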

# 4.0. DATA FILTERING

# 5.0. EDA (EXPLORATORY DATA ANALYSIS)

# 6.0. DATA PREPARATION

# 7.0. HYPER-PARAMETER FINE TUNING