## Solution Planning

# Input

    Business Problem
        Select the most valuable customers to create a loyalty program called Insiders

    Data
        One year of e-commerce sales

# Output

    A list of customers that will be part of Insiders
    A report answering business questions
        Who are the eligible customers to participate in the Insiders program?
        How many customers will be part of the program?
        What are the main characteristics of these customers?
        What revenue percentage comes from Insiders?
        What is the Insiders' expected revenue for the coming months?
        What are the conditions for a customer to be eligible for the Insiders program?
        What are the conditions for a customer to be removed from the Insiders program?
        What is the guarantee that the Insiders program is better than the regular customer database?
        What actions can the marketing team take to increase the revenue?

# Tasks

    A report answering business questions:

        Who are the eligible customers to participate in the Insiders program?
            Understand the criteria for an eligible customer.
            Criteria examples:
                Revenue
                    High average ticket
                    High LTV (lifetime value)
                    Low recency
                    High basket size
                    Low churn probability
                Expenses
                    Return rate
                Buying Experience
                    High average review rating

        How many customers will be part of the program?
            Calculate the percentage of customers that belong to the Insiders program over the total number of customers.

        What are the main characteristics of these customers?
            Indicate customer characteristics:
                Age
                City
                Education level
                Location, etc.
            Indicate consumption characteristics:
                Cluster attributes

        What revenue percentage comes from Insiders?
            Calculate the percentage of Insiders revenue over the total revenue.

        What is the Insiders' expected revenue for the coming months?
            Calculate the Insiders' LTV.
            Perform a cohort analysis.

        What are the conditions for a customer to be eligible for the Insiders program?
            Define verification periodicity (monthly, quarterly, etc.)
            The customer must be similar to the customers already in Insiders.

        What are the conditions for a customer to be removed from the Insiders program?
            Define verification periodicity (monthly, quarterly, etc.)
            The customer must be dissimilar to the customers in Insiders.

        What is the guarantee that the Insiders program is better than the regular customer database?
            Perform A/B Test
            Perform Bayesian A/B Test
            Perform Hypothesis Test

        What actions can the marketing team take to increase the revenue?
            Discount
            Buying preferences
            Shipping options
            Promote a visit to the company, etc.
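
Several of the tasks above (average ticket, recency, basket size, share of customers, share of revenue) reduce to per-customer aggregates. The following is a minimal sketch on a toy DataFrame, assuming the column names of the sales data used in this notebook (`InvoiceNo`, `Quantity`, `UnitPrice`, `InvoiceDate`, `CustomerID`). The `insiders` selection (the single top customer by revenue) is only a hypothetical placeholder, not the program's eligibility rule.

```python
import pandas as pd

# Toy sales lines; values are illustrative only.
sales = pd.DataFrame({
    'InvoiceNo':   ['1', '1', '2', '3', '4'],
    'CustomerID':  [10, 10, 10, 20, 30],
    'Quantity':    [2, 1, 4, 10, 1],
    'UnitPrice':   [5.0, 10.0, 2.5, 1.0, 3.0],
    'InvoiceDate': pd.to_datetime(['2017-01-05', '2017-01-05', '2017-03-01',
                                   '2017-02-10', '2017-01-20']),
})
sales['gross_revenue'] = sales['Quantity'] * sales['UnitPrice']

# Reference date for recency: the day after the last purchase in the data.
reference_date = sales['InvoiceDate'].max() + pd.Timedelta(days=1)

# One row per customer with the raw aggregates.
customers = sales.groupby('CustomerID').agg(
    gross_revenue=('gross_revenue', 'sum'),
    n_invoices=('InvoiceNo', 'nunique'),
    n_items=('Quantity', 'sum'),
    last_purchase=('InvoiceDate', 'max'),
)
customers['recency_days'] = (reference_date - customers['last_purchase']).dt.days
customers['avg_ticket'] = customers['gross_revenue'] / customers['n_invoices']
customers['basket_size'] = customers['n_items'] / customers['n_invoices']

# Hypothetical placeholder: the top customer by revenue stands in for Insiders.
insiders = customers.nlargest(1, 'gross_revenue')

pct_customers = 100 * len(insiders) / len(customers)
pct_revenue = 100 * insiders['gross_revenue'].sum() / customers['gross_revenue'].sum()
print(customers[['gross_revenue', 'recency_days', 'avg_ticket', 'basket_size']])
print('customers: {:.2f}%  revenue: {:.2f}%'.format(pct_customers, pct_revenue))
```

Once a real segmentation exists, the placeholder selection can be swapped for the cluster label without changing the two percentage calculations.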



# Imports and helper functions

In [75]:
import dtype_diet
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import gridspec
from IPython.display import display, HTML
from IPython.display import Image
import seaborn as sns
import csv
from dython import nominal
from scipy.stats import chi2_contingency
import dataframe_image as dfi
import re
import sys

def data_description(df):
    """Print dtypes, shape, and a per-column missing-value summary."""
    print('Variables:\n\n{}'.format(df.dtypes), end='\n\n')
    print('Number of rows {}'.format(df.shape[0]), end='\n\n')
    print('Number of columns {}'.format(df.shape[1]), end='\n\n')
    print('NA analysis')
    for i in df.columns:
        # Column name, whether it contains any NA, and the NA count.
        print('column {}: {} {}'.format(i, df[i].isna().any(), df[i].isna().sum()))
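
The per-column loop can also be expressed as a single table, which is easier to sort or filter. A minimal sketch, where `na_summary` is a hypothetical helper name and the example frame is made up:

```python
import numpy as np
import pandas as pd

def na_summary(df):
    """Return one row per column with dtype, NA count, and NA percentage."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': (100 * df.isna().mean()).round(2),
    })

# Made-up data: one NA in 'a', two NAs in 'b'.
example = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': ['x', 'y', None, None]})
summary = na_summary(example)
print(summary)
```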

# Loading data

In [121]:
df = pd.read_csv('../high_value_customer_identification/data.csv',low_memory=True)
# dtype_diet proposes a smaller dtype per column; the proposal is then applied to the DataFrame.
df_optimized = dtype_diet.report_on_dataframe(df)
df = dtype_diet.optimize_dtypes(df, df_optimized)

# Data Description

In [122]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,29-Nov-16,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,29-Nov-16,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,29-Nov-16,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,29-Nov-16,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,29-Nov-16,3.39,17850.0,United Kingdom


In [38]:
data_description(df)

Variables:

InvoiceNo      category
StockCode      category
Description    category
Quantity          int32
InvoiceDate    category
UnitPrice       float64
CustomerID      float32
Country        category
dtype: object

Number of rows 541909

Number of columns 8

NA analysis
column InvoiceNo: False 0
column StockCode: False 0
column Description: True 1454
column Quantity: False 0
column InvoiceDate: False 0
column UnitPrice: False 0
column CustomerID: True 135080
column Country: False 0
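
The description above shows `InvoiceDate` stored as `category` and `CustomerID` as `float32`, and 135080 of 541909 rows (about 25%) without a `CustomerID`. A minimal sketch of converting both columns on a toy frame, assuming the date strings follow the day-month-year pattern seen in the displayed rows (for example `29-Nov-16`) and an English locale for month abbreviations:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dtypes reported above.
raw = pd.DataFrame({
    'InvoiceDate': pd.Series(['29-Nov-16', '7-Dec-17', '4-Dec-16'], dtype='category'),
    'CustomerID': np.array([17850.0, np.nan, 14527.0], dtype='float32'),
})

# Parse strings such as '29-Nov-16' into datetime64 values.
raw['InvoiceDate'] = pd.to_datetime(raw['InvoiceDate'].astype(str), format='%d-%b-%y')

# The nullable integer dtype keeps missing IDs as <NA> instead of forcing floats.
raw['CustomerID'] = raw['CustomerID'].astype('float64').astype('Int64')

print(raw.dtypes)
```

With a datetime column, recency and cohort calculations can use date arithmetic directly instead of string comparisons.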


# Data Wrangling

## Categorical attributes analysis

### InvoiceNo

In [128]:
# Some invoice numbers contain the letter C; these are probably return invoices. They are removed so the analysis focuses on purchase invoices only.
df['InvoiceNo'].unique()


df =  df[~df['InvoiceNo'].astype(str).apply(lambda x: bool(re.search('[^0-9]+',x) ) ) ]
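
The same filter can be written with the vectorized `.str` accessor, which also makes it easy to keep the returns aside (for instance, for the return-rate criterion listed in the planning). A minimal sketch on toy data; the `purchases` and `returns` names are illustrative:

```python
import pandas as pd

invoices = pd.DataFrame({
    'InvoiceNo': pd.Series(['536365', 'C536379', '536366', 'C536383'], dtype='category'),
    'Quantity': [6, -1, 8, -1],
})

# True where the invoice number contains any non-digit character (e.g. a leading 'C').
is_return = invoices['InvoiceNo'].astype(str).str.contains(r'[^0-9]', regex=True)

purchases = invoices[~is_return]
returns = invoices[is_return]  # kept aside rather than discarded

print(len(purchases), len(returns))
```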

### StockCode

In [187]:
# Some stock codes are a single letter (D, M, m), some combine letters and digits, and others are digits only. A few consist only of letters (POST, DOT, ...).
# POST, DOT, BANK CHARGES, PADS and AMAZONFEE are removed because they represent other operations rather than products.

np.set_printoptions(threshold=sys.maxsize)

np.array(df[df['StockCode'].astype(str).apply(lambda x: bool(re.search('[A-Z]{3,}',x) ) ) ]['StockCode'].unique())

df = df[~df['StockCode'].isin(['DOT','BANK CHARGES','AMAZONFEE','PADS','POST'])]
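
To review which codes are candidates for removal before hard-coding the list, the letters-only codes can be listed explicitly. A minimal sketch on a made-up sample of codes; whether single-letter codes such as D or M should also be dropped is left open here:

```python
import pandas as pd

codes = pd.Series(['85123A', '71053', 'POST', 'D', 'M', 'm',
                   'BANK CHARGES', 'DOT', '84029G'])

# Codes made only of letters and spaces are candidates for non-product entries.
letters_only = codes[codes.str.match(r'^[A-Za-z ]+$')]
candidates = sorted(letters_only.unique())
print(candidates)
```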

## Checking/Replace NAs

In [123]:
df[df['InvoiceNo'].astype(str).apply(lambda x: bool(re.search('[^0-9]+',x) ) )]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,29-Nov-16,27.50,14527.00,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,29-Nov-16,4.65,15311.00,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,29-Nov-16,1.65,17548.00,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,29-Nov-16,0.29,17548.00,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,29-Nov-16,0.29,17548.00,United Kingdom
...,...,...,...,...,...,...,...,...
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,7-Dec-17,0.83,14397.00,United Kingdom
541541,C581499,M,Manual,-1,7-Dec-17,224.69,15498.00,United Kingdom
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,7-Dec-17,10.95,15311.00,United Kingdom
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,7-Dec-17,1.25,17315.00,United Kingdom


In [108]:
df[df['CustomerID'].isna() & df['InvoiceNo'].astype(str).str.contains('[^0-9]')]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
11502,C537251,22429,ENAMEL MEASURING JUG CREAM,-2,4-Dec-16,4.25,,United Kingdom
11503,C537251,22620,4 TRADITIONAL SPINNING TOPS,-8,4-Dec-16,1.25,,United Kingdom
11504,C537251,21890,S/6 WOODEN SKITTLES IN COTTON BAG,-2,4-Dec-16,2.95,,United Kingdom
11505,C537251,22564,ALPHABET STENCIL CRAFT,-5,4-Dec-16,1.25,,United Kingdom
11506,C537251,21891,TRADITIONAL WOODEN SKIPPING ROPE,-3,4-Dec-16,1.25,,United Kingdom
...,...,...,...,...,...,...,...,...
492207,C578097,22112,CHOCOLATE HOT WATER BOTTLE,-48,20-Nov-17,4.25,,United Kingdom
514984,C579757,47469,ASSORTED SHAPES PHOTO CLIP SILVER,-24,28-Nov-17,0.65,,United Kingdom
516454,C579907,22169,FAMILY ALBUM WHITE PICTURE FRAME,-2,29-Nov-17,7.65,,EIRE
524601,C580604,AMAZONFEE,AMAZON FEE,-1,3-Dec-17,11586.50,,United Kingdom


In [81]:
# re.match takes a pattern and a single string, so it fails on a Series; the vectorized .str accessor is used instead.
df[df['CustomerID'].isna()]['InvoiceNo'].astype(str).str.match('^C')
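
One possible treatment of the missing `CustomerID` values is sketched here on toy data. It assumes that rows without a customer cannot be used for customer-level segmentation and are dropped; this is an assumption for illustration, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'InvoiceNo': ['536365', '536366', 'C537251', '537252'],
    'CustomerID': [17850.0, np.nan, np.nan, 14527.0],
})

missing = orders['CustomerID'].isna()
print('rows without CustomerID: {} ({:.1%})'.format(missing.sum(), missing.mean()))

# How many of the unidentified rows are return invoices (InvoiceNo starts with 'C')?
n_missing_returns = orders.loc[missing, 'InvoiceNo'].str.startswith('C').sum()

# Assumed option: keep only rows that can be attributed to a customer.
identified = orders.dropna(subset=['CustomerID'])
```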