## Solution Planning

# Input

    Business Problem
        Select the most valuable customers to create a loyalty program called Insiders

    Data
        One year of e-commerce sales

# Output

    A list of customers that will be part of Insiders
    A report answering business questions
        Who are the eligible customers to participate in the Insiders program?
        How many customers will be part of the program?
        What are the main characteristics of these customers?
        What revenue percentage comes from Insiders?
        What is the Insiders' expected revenue for the coming months?
        What are the conditions for a customer to be eligible for the Insiders program?
        What are the conditions for a customer to be removed from the Insiders program?
        What is the guarantee that the Insiders program is better than the regular customer database?
        What actions can the marketing team take to increase the revenue?

# Tasks

    A report answering business questions:

        Who are the eligible customers to participate in the Insiders program?
            Understand the criteria for an eligible customer.
            Criteria examples:
                Revenue
                    High average ticket
                    High LTV (lifetime value)
                    Low recency
                    High basket size
                    Low churn probability
                Expenses
                    Return rate
                Buying Experience
                    High average review ratings

        How many customers will be part of the program?
            Calculate the percentage of customers that belong to the Insiders program over the total number of customers.

        What are the main characteristics of these customers?
            Indicate customer characteristics:
                Age
                City
                Education level
                Location, etc.
            Indicate consumption characteristics:
                Cluster attributes

        What revenue percentage comes from Insiders?
            Calculate the percentage of Insiders revenue over the total revenue.

        What is the Insiders' expected revenue for the coming months?
            Calculate Insiders' LTV
            Perform a cohort analysis.

        What are the conditions for a customer to be eligible for the Insiders program?
            Define verification periodicity (monthly, quarterly, etc.)
            The customer must be similar to a customer on Insiders.

        What are the conditions for a customer to be removed from the Insiders program?
            Define verification periodicity (monthly, quarterly, etc.)
            The customer must be dissimilar to a customer on Insiders.

        What is the guarantee that the Insiders program is better than the regular customer database?
            Perform A/B Test
            Perform A/B Bayesian Test
            Perform Hypothesis Test

        What actions can the marketing team take to increase the revenue?
            Discount
            Buying preferences
            Shipping options
            Promote a visit to the company, etc.
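
Two of the tasks above (the share of customers in the program and the share of revenue coming from Insiders) reduce to simple ratios once each customer carries a cluster label. The following is a minimal sketch only: the `customers` table, its `cluster` labels and its revenue figures are hypothetical and do not come from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical per-customer table; the cluster labels and revenues are made up.
customers = pd.DataFrame({
    'customer_id':   [1, 2, 3, 4, 5],
    'cluster':       ['insiders', 'regular', 'insiders', 'regular', 'regular'],
    'gross_revenue': [500.0, 100.0, 300.0, 50.0, 50.0],
})

is_insider = customers['cluster'] == 'insiders'

# Share of customers that belong to the Insiders cluster
customer_share = is_insider.mean()

# Share of total revenue generated by the Insiders cluster
revenue_share = (
    customers.loc[is_insider, 'gross_revenue'].sum()
    / customers['gross_revenue'].sum()
)

print('Insiders: {:.1%} of customers, {:.1%} of revenue'.format(customer_share, revenue_share))
```

The same two ratios can be recomputed at each verification period to track how the program evolves.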



# Imports and helper functions

In [15]:
import dtype_diet
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import gridspec
from IPython.display import display, HTML
from IPython.display import Image
import seaborn as sns
import csv
from dython import nominal
from scipy.stats import chi2_contingency
import dataframe_image as dfi
import re
import sys
import snakecase
from datetime import datetime as dt
def data_description(df):
    print('Variables:\n\n{}'.format(df.dtypes), end='\n\n')
    print('Number of rows {}'.format(df.shape[0]), end='\n\n')
    print('Number of columns {}'.format(df.shape[1]), end='\n\n')
    print('NA analysis')
    for i in df.columns:
        print('column {}: {} {}'.format(i,df[i].isna().any(), df[i].isna().sum() ) )

# Loading data

In [115]:
df = pd.read_csv('../high_value_customer_identification/data.csv',low_memory=True)

# Data Description

In [94]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,29-Nov-16,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,29-Nov-16,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,29-Nov-16,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,29-Nov-16,3.39,17850.0,United Kingdom


In [95]:
data_description(df)

Variables:

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

Number of rows 541909

Number of columns 8

NA analysis
column InvoiceNo: False 0
column StockCode: False 0
column Description: True 1454
column Quantity: False 0
column InvoiceDate: False 0
column UnitPrice: False 0
column CustomerID: True 135080
column Country: False 0


# Data Wrangling

In [116]:
# Convert every column name to snake_case in a single rename call
df = df.rename(columns=snakecase.convert)


## Categorical attributes analysis

### invoice_no

In [117]:
# Some invoice numbers contain the letter C, which probably marks return invoices.
# For now, every invoice number containing a non-digit character is removed to focus on purchase invoices only.
df['invoice_no'].unique()


df =  df[~df['invoice_no'].astype(str).apply(lambda x: bool(re.search('[^0-9]+',x) ) ) ]
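
The same kind of filter can be written with the vectorized `str.contains` accessor, which also makes it easy to keep the excluded invoices for later inspection. This is a sketch on toy data: the invoice numbers below are illustrative, and the name `returns_and_others` is an assumption, not something used elsewhere in this notebook.

```python
import pandas as pd

# Toy invoice numbers; 'C536379' mimics a return invoice with a letter prefix,
# 'A563185' stands for any other hypothetical non-numeric invoice number.
invoices = pd.DataFrame({'invoice_no': ['536365', 'C536379', '536366', 'A563185']})

# True for any invoice number that contains at least one non-digit character
has_non_digit = invoices['invoice_no'].astype(str).str.contains(r'[^0-9]', regex=True)

purchases = invoices[~has_non_digit]          # kept for the analysis
returns_and_others = invoices[has_non_digit]  # set aside instead of discarded

print(purchases['invoice_no'].tolist())
print(returns_and_others['invoice_no'].tolist())
```

Keeping the excluded rows in a separate frame would allow a return rate per customer to be computed later, if that criterion is used.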

### stock_code

In [118]:
# Some stock codes are a single letter (D, M, m), some mix letters and digits, and others are digits only. Some consist only of letters (POST, DOT, ...).
# POST, DOT, BANK CHARGES, PADS, AMAZONFEE and the single-letter codes D, M and m are removed because they represent other operations rather than products.

np.set_printoptions(threshold=sys.maxsize)

np.array(df[df['stock_code'].astype(str).apply(lambda x: bool(re.search('[A-Z]{3,}',x) ) ) ]['stock_code'].unique())

df = df[~df['stock_code'].isin(['DOT','BANK CHARGES','AMAZONFEE','PADS','POST','M','D','m'])]
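
The pattern `[A-Z]{3,}` used above only lists codes with three or more consecutive capital letters, so single-letter codes such as D and M have to be found separately. One possible alternative, sketched here on toy values, is to list every code that contains no digit at all. The sample codes are illustrative and this is not necessarily how the exclusion list above was built.

```python
import pandas as pd

# Toy stock codes mixing product codes and non-product operations
codes = pd.Series(['85123A', '71053', 'POST', 'D', 'M', 'DOT', '84406B', 'BANK CHARGES'])

# Codes without any digit are candidates for non-product operations
no_digits = ~codes.str.contains(r'\d', regex=True)
candidates = sorted(codes[no_digits].unique())

print(candidates)
```

The resulting candidates would still need a manual check against the descriptions before being dropped.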

### country

In [99]:
# About 92% of the rows come from the United Kingdom. Because the distribution is so unbalanced, this column will not be used for now.
df['country'].value_counts(normalize=True)

United Kingdom         0.92
Germany                0.02
France                 0.02
EIRE                   0.01
Spain                  0.00
Netherlands            0.00
Switzerland            0.00
Belgium                0.00
Portugal               0.00
Australia              0.00
Norway                 0.00
Channel Islands        0.00
Italy                  0.00
Finland                0.00
Cyprus                 0.00
Unspecified            0.00
Sweden                 0.00
Austria                0.00
Denmark                0.00
Poland                 0.00
Japan                  0.00
Israel                 0.00
Hong Kong              0.00
Singapore              0.00
Iceland                0.00
USA                    0.00
Canada                 0.00
Greece                 0.00
Malta                  0.00
United Arab Emirates   0.00
European Community     0.00
RSA                    0.00
Lebanon                0.00
Lithuania              0.00
Brazil                 0.00
Czech Republic      

### invoice_date

In [119]:
df['invoice_date'] = df['invoice_date'].apply(lambda x: dt.strptime(x,'%d-%b-%y') )
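
As a possible alternative to the row-by-row `strptime` call, `pd.to_datetime` parses the whole column at once with the same format string. A minimal sketch on two made-up date strings follows. Note that `%b` matches English month abbreviations only under an English or C locale.

```python
import pandas as pd

# Made-up date strings in the same day-month-year layout as invoice_date
dates = pd.Series(['29-Nov-16', '07-Dec-17'])

# Vectorized parsing with an explicit format; malformed values raise an error
# by default instead of silently becoming NaT.
parsed = pd.to_datetime(dates, format='%d-%b-%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```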

## Quantitative analysis

### Descriptive statistics

In [101]:
df[['quantity','unit_price']].describe()

Unnamed: 0,quantity,unit_price
count,530433.0,530433.0
mean,10.25,3.26
std,159.87,4.44
min,-9600.0,0.0
25%,1.0,1.25
50%,3.0,2.08
75%,10.0,4.13
max,80995.0,649.5


In [120]:
# Some items have negative quantities and a unit_price equal to zero. Judging by the descriptions, these may be products that were in stock but were lost, damaged, and so on. These rows do not represent customer purchases, so they are removed.
# Some purchases have very high quantities per item, but they appear to be normal purchases, so they are kept.

df[df['quantity'] < 0].tail(50)
df = df[df['quantity'] > 0]

In [121]:
# Some products have a unit_price equal to zero, and most of these rows have neither a product description nor a customer id. They do not seem to be normal purchases, and since they represent less than 1% of the data, they are removed.
df = df[df['unit_price'] != 0]
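
Before dropping rows, it can help to measure how much each filter removes, so that claims such as "less than 1% of the data" can be checked. The sketch below uses a hypothetical five-row table. The column names match the notebook, but the values are made up.

```python
import pandas as pd

# Hypothetical sales lines; values are made up for illustration
sales = pd.DataFrame({
    'quantity':   [6, -2, 10, 1, 3],
    'unit_price': [2.55, 1.25, 0.0, 4.13, 2.08],
})

valid_quantity = sales['quantity'] > 0
valid_price = sales['unit_price'] != 0

# Share of rows that fail at least one of the two conditions
removed_share = 1 - (valid_quantity & valid_price).mean()

clean = sales[valid_quantity & valid_price]

print('rows removed: {:.1%}'.format(removed_share))
```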

## Checking/Replace NAs

In [122]:
# To avoid losing more than 20% of the data because some rows have no customer id, artificial ids are created, one per invoice, starting at 20000
df_aux = pd.DataFrame(df[df['customer_id'].isna()]['invoice_no'].drop_duplicates() )
df_aux['customer_id'] = np.arange(20000, 20000 + len(df_aux), 1)

df = pd.merge(df,df_aux, how='left', on='invoice_no')
df['customer_id'] = df['customer_id_x'].combine_first(df['customer_id_y'] )
df = df.drop(columns=['customer_id_x','customer_id_y'])
df['customer_id'] = df['customer_id'].astype(int)
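
The merge-based approach above can also be expressed without an auxiliary frame by numbering the invoices that lack a customer id with `pd.factorize`. This is a sketch on toy data, under the same assumptions as the cell above: each invoice without an id is treated as a distinct customer, and 20000 is assumed to be larger than every real customer id.

```python
import numpy as np
import pandas as pd

# Toy order lines; invoices 'B2' and 'C3' have no customer id
orders = pd.DataFrame({
    'invoice_no':  ['A1', 'A1', 'B2', 'C3', 'B2'],
    'customer_id': [17850.0, 17850.0, np.nan, np.nan, np.nan],
})

missing = orders['customer_id'].isna()

# One integer code per distinct invoice without a customer id: 0, 1, 2, ...
codes, _ = pd.factorize(orders.loc[missing, 'invoice_no'])

# Artificial ids start at 20000 (assumed to be above every real id)
orders.loc[missing, 'customer_id'] = 20000 + codes
orders['customer_id'] = orders['customer_id'].astype(int)

print(orders)
```

Lines from the same invoice receive the same artificial id, which keeps basket-level features consistent.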

## Feature filtering

In [123]:
df = df.drop(columns=['description','country'])

# Feature Engineering

In [124]:
df1 = df.copy()

In [None]:
# ticket_size - average of quantity x unit_price per purchase
# basket_size - sum of all quantities per purchase
# unique_basket_size - number of distinct products per purchase
# gross_revenue (monetary) - quantity x unit_price
# recency - time in days since the last purchase
# average recency days - average time between purchases
# frequency - how many purchases the customer made
# purchases per month - how many purchases the customer made in a month
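
Several of the features listed above can be derived in one `groupby` pass. The sketch below uses a hypothetical four-line table with the same column names as `df1`. The reference date for recency is assumed to be the last invoice date in the data, and `avg_basket_size` is one possible per-customer reading of basket size, not a definition taken from this notebook.

```python
import pandas as pd

# Hypothetical order lines with the same columns as df1; values are made up
lines = pd.DataFrame({
    'invoice_no':   ['1', '1', '2', '3'],
    'stock_code':   ['a', 'b', 'a', 'c'],
    'quantity':     [2, 1, 5, 4],
    'unit_price':   [10.0, 5.0, 10.0, 2.5],
    'invoice_date': pd.to_datetime(['2017-01-01', '2017-01-01', '2017-01-11', '2017-01-06']),
    'customer_id':  [100, 100, 100, 200],
})
lines['gross_revenue'] = lines['quantity'] * lines['unit_price']

# Assumption: recency is measured from the most recent invoice date in the data
reference_date = lines['invoice_date'].max()

features = lines.groupby('customer_id').agg(
    gross_revenue=('gross_revenue', 'sum'),   # monetary value
    frequency=('invoice_no', 'nunique'),      # number of distinct purchases
    last_purchase=('invoice_date', 'max'),
    total_items=('quantity', 'sum'),
)
features['recency_days'] = (reference_date - features['last_purchase']).dt.days
features['avg_basket_size'] = features['total_items'] / features['frequency']

print(features)
```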

In [125]:
df1.head()

Unnamed: 0,invoice_no,stock_code,quantity,invoice_date,unit_price,customer_id
0,536365,85123A,6,2016-11-29,2.55,17850
1,536365,71053,6,2016-11-29,3.39,17850
2,536365,84406B,8,2016-11-29,2.75,17850
3,536365,84029G,6,2016-11-29,3.39,17850
4,536365,84029E,6,2016-11-29,3.39,17850


## Ticket Size

In [108]:
df1_aux = df1.loc[:,['invoice_no','quantity','unit_price']]
df1_aux['ticket_size'] = df1['quantity']*df1['unit_price']
df1_aux

Unnamed: 0,invoice_no,quantity,unit_price,ticket_size
0,536365,6,2.55,15.30
1,536365,6,3.39,20.34
2,536365,8,2.75,22.00
3,536365,6,3.39,20.34
4,536365,6,3.39,20.34
...,...,...,...,...
527927,581587,12,0.85,10.20
527928,581587,6,2.10,12.60
527929,581587,4,4.15,16.60
527930,581587,4,4.15,16.60


In [110]:
df1_aux[['invoice_no','ticket_size']].groupby('invoice_no').agg({'ticket_size':'mean'})

invoice_no,ticket_size
536365,19.87
536366,11.10
536367,23.23
536368,17.51
536369,17.85
...,...
581583,62.30
581584,70.32
581585,15.67
581586,84.80
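
The aggregation above averages the line values within each invoice. If the goal is an average ticket per customer, one possible reading is to total each invoice first and then average those totals per customer. The sketch below illustrates that reading on hypothetical data. It is an assumption about the intended feature, not the notebook's final definition.

```python
import pandas as pd

# Hypothetical order lines; values are made up for illustration
lines = pd.DataFrame({
    'invoice_no':  ['1', '1', '2', '3'],
    'customer_id': [100, 100, 100, 200],
    'quantity':    [2, 1, 5, 4],
    'unit_price':  [10.0, 5.0, 10.0, 2.5],
})
lines['line_total'] = lines['quantity'] * lines['unit_price']

# Total value of each invoice, keeping the customer in the index
invoice_total = lines.groupby(['customer_id', 'invoice_no'])['line_total'].sum()

# Average invoice value per customer
avg_ticket = invoice_total.groupby(level='customer_id').mean()

print(avg_ticket)
```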
