# Solution Planning

## Input

    Business Problem
        Select most valuable customers to create a loyalty program called Insiders

    Data
        One year of e-commerce sales

## Output

    A list of customers that will be part of Insiders
    A report answering business questions
        Who are the eligible customers to participate in the Insiders program?
        How many customers will be part of the program?
        What are the main characteristics of these customers?
        What revenue percentage comes from Insiders?
        What are the conditions for a customer to be eligible for the Insiders program?
        What are the conditions for a customer to be removed from the Insiders program?
        What is the guarantee that the Insiders program is better than the regular customer database?
        What actions can the marketing team take to increase revenue?

## Tasks

    A report answering business questions:

        Who are the eligible customers to participate in the Insiders program?
            Understand the criteria for an eligible customer.
            Criteria examples:
                Revenue
                    High average ticket
                    High LTV (lifetime value)
                    Low recency
                    High basket size
                    Low churn probability
                Expenses
                    Return rate
                Buying Experience
                    High average review ratings

        How many customers will be part of the program?
            Calculate the percentage of customers that belong to the Insiders program over the total number of customers.

        What are the main characteristics of these customers?
            Indicate customer characteristics:
                Age
                City
                Education level
                Location, etc.
            Indicate consumption characteristics:
                Clusters attributes

        What revenue percentage comes from Insiders?
            Calculate the percentage of Insiders revenue over the total revenue.

        What is the Insiders' expected revenue for the coming months?
            Calculate Insiders' LTV
            Perform a cohort analysis.

        What are the conditions for a customer to be eligible for the Insiders program?
            Define verification periodicity (monthly, quarterly, etc.)
            The customer must be similar to the customers already in Insiders.

        What are the conditions for a customer to be removed from the Insiders program?
            Define verification periodicity (monthly, quarterly, etc.)
            The customer must be dissimilar to the customers in Insiders.

        What is the guarantee that the Insiders program is better than the regular customer database?
            Perform A/B Test
            Perform A/B Bayesian Test
            Perform Hypothesis Test

        What actions can the marketing team take to increase revenue?
            Discount
            Buying preferences
            Shipping options
            Promote a visit to the company, etc.
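
The hypothesis-test task listed above (checking whether Insiders outperform the regular customer base) could be sketched as a one-sided test. This is a minimal illustration on synthetic data, not the project's actual test: the group sizes, the log-normal revenue distributions, the choice of the Mann-Whitney U test, and the 5% significance level are all assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic revenue per customer (assumed distributions, illustration only).
rng = np.random.default_rng(42)
insiders_revenue = rng.lognormal(mean=8.0, sigma=0.5, size=500)
regular_revenue = rng.lognormal(mean=6.0, sigma=0.5, size=500)

# H0: Insiders do not spend more than regular customers.
# H1: Insiders tend to spend more (one-sided alternative).
# A rank-based test is used here because revenue is usually heavily skewed.
result = stats.mannwhitneyu(
    insiders_revenue, regular_revenue, alternative='greater')

alpha = 0.05  # assumed significance level
reject_h0 = result.pvalue < alpha
print('p-value: {:.4g}, reject H0: {}'.format(result.pvalue, reject_h0))
```

With real data, the two arrays would hold the observed revenue of each group over the same period.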



# Imports and helper functions

In [42]:
import sqlalchemy
import sqlite3
from scipy.cluster import hierarchy
import umap
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from datetime import datetime as dt
import snakecase
import re
import numpy as np
import dtype_diet
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)
import joblib
import pickle
import s3fs
from dotenv import load_dotenv
import os


def data_description(df):
    print('Variables:\n\n{}'.format(df.dtypes), end='\n\n')
    print('Number of rows {}'.format(df.shape[0]), end='\n\n')
    print('Number of columns {}'.format(df.shape[1]), end='\n\n')
    print('NA analysis')

    for i in df.columns:
        print('column {}: {} {}'.format(
            i, df[i].isna().any(), df[i].isna().sum()))

# Loading data

In [None]:
load_dotenv()  # read the AWS credentials from a local .env file, if present
AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY')
s3 = s3fs.S3FileSystem(
    anon=False, key=AWS_ACCESS_KEY_ID, secret=AWS_SECRET_ACCESS_KEY)

In [4]:
with s3.open('s3://insiders-customers-dataset/data.csv', 'rb') as file:
    df = pd.read_csv(file)

# Data Description

In [6]:
data_description(df)

Variables:

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

Number of rows 541909

Number of columns 8

NA analysis
column InvoiceNo: False 0
column StockCode: False 0
column Description: True 1454
column Quantity: False 0
column InvoiceDate: False 0
column UnitPrice: False 0
column CustomerID: True 135080
column Country: False 0


# Data Wrangling

In [7]:
# Convert every column name to snake_case in a single rename call.
df = df.rename(columns=snakecase.convert)
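
For readers without the `snakecase` package, the renaming step can be approximated with a regular expression. This is only a sketch: `to_snake_case` is a hypothetical helper, not part of the notebook, and it is checked here only against the eight column names printed in the data description.

```python
import re


def to_snake_case(name):
    # Insert an underscore wherever a lowercase letter or digit is
    # followed by an uppercase letter, then lowercase everything.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()


columns = ['InvoiceNo', 'StockCode', 'Description', 'Quantity',
           'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
snake_columns = [to_snake_case(c) for c in columns]
print(snake_columns)
```

The resulting names match the column names used in the rest of the notebook (`invoice_no`, `customer_id`, and so on).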

## Categorical attributes analysis

### invoice_no

In [8]:
# Some invoices contain the letter C and are probably return invoices.
# They are removed for now so the analysis focuses on purchase invoices.
df['invoice_no'].unique()
df = df[~df['invoice_no']
    .astype(str)
    .apply(lambda x: bool(re.search('[^0-9]+', x)))
]
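
The same filter can be expressed with pandas' vectorized string methods instead of `apply`. The sketch below runs on a toy frame; the invoice numbers are made up for illustration.

```python
import pandas as pd

# Toy invoices: two numeric ones and two containing letters (assumed values).
toy = pd.DataFrame({
    'invoice_no': ['536365', 'C536379', '536366', 'A563185'],
    'quantity': [6, -1, 2, 1],
})

# Flag any invoice number that contains a non-digit character,
# then keep only the purely numeric ones.
has_non_digit = toy['invoice_no'].astype(str).str.contains(
    r'[^0-9]', regex=True)
purchases = toy[~has_non_digit]
print(purchases)
```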

### stock_code

In [9]:
# Some stock codes are a single letter (D, M, m), some mix letters and numbers,
# and others are purely numeric. There are also letter-only codes (POST, DOT...).
# POST, DOT, BANK CHARGES, PADS, AMAZONFEE, M, D and m are removed because
# they do not represent products but other operations.

np.array(df[df['stock_code']
                .astype(str)
                .apply(lambda x: bool(re.search('[A-Z]{3,}', x)))]['stock_code']
                .unique()
)
df = df[~df['stock_code'].isin(
    ['DOT', 'BANK CHARGES', 'AMAZONFEE', 'PADS', 'POST', 'M', 'D', 'm']
    )
]

### invoice_date

In [10]:
df['invoice_date'] = df['invoice_date'].apply(
    lambda x: dt.strptime(x, '%d-%b-%y')
)
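
Date parsing with the same format string can also be done in one vectorized call. The sample strings below are assumptions chosen to match the `'%d-%b-%y'` format used in the cell; they are not taken from the dataset.

```python
import pandas as pd

# Day, abbreviated English month name, two-digit year (assumed samples).
toy_dates = pd.Series(['29-Nov-16', '05-Dec-16', '07-Dec-17'])
parsed = pd.to_datetime(toy_dates, format='%d-%b-%y')
print(parsed)
```

Note that `%b` depends on month abbreviations being in English, so the input must follow that convention.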

## Quantitative analysis

In [11]:
# There are some items with negative quantities and unit_price equal to zero.
# Judging by the descriptions, these could be products that were in stock but
# were lost, damaged, or otherwise written off.
# These rows do not represent customer purchases, so they are removed.
# Some purchases have high quantities per item, but they appear to be
# normal purchases, so they are kept in the data.

df = df[df['quantity'] > 0]

In [12]:
# There are products with unit_price equal to zero, and most of these rows
# have neither a product description nor a customer ID.
# These do not seem to be normal purchases, and since they represent
# less than 1% of the total data, they are removed.

df = df[df['unit_price'] != 0]

## Checking/Replace NAs

In [13]:
# To avoid losing more than 20% of the data because some customers
# have no ID, artificial IDs are created, starting at 20000.
df_aux = pd.DataFrame(
    df[df['customer_id'].isna()]['invoice_no']
    .drop_duplicates()
    )
df_aux = df_aux.assign(customer_id=np.arange(20000, 20000 + len(df_aux), 1))   


df = pd.merge(df,df_aux, how='left', on='invoice_no')
df['customer_id'] = (df['customer_id_x']
                            .combine_first(df['customer_id_y'])
                            .astype(int)
)
df = df.drop(columns=['customer_id_x', 'customer_id_y'])
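
The idea of assigning one artificial ID per invoice can be checked on a toy frame. The sketch below uses `pd.factorize` rather than the merge shown above, so it is an alternative formulation and not the notebook's code; the invoice numbers and the known customer ID are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'invoice_no': ['100', '100', '101', '102', '102'],
    'customer_id': [1.0, 1.0, np.nan, np.nan, np.nan],
})

# Rows without a customer get one artificial ID per invoice, numbered
# from 20000 in order of first appearance.
missing = toy['customer_id'].isna()
codes, _ = pd.factorize(toy.loc[missing, 'invoice_no'])
toy.loc[missing, 'customer_id'] = 20000 + codes
toy['customer_id'] = toy['customer_id'].astype(int)
print(toy)
```

Every line of the same invoice ends up with the same artificial customer, which is what the later filter `customer_id < 20000` relies on.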

# Feature Engineering

In [14]:
df1 = df.drop(columns=['description', 'country'])
del(df)

### Gross Revenue

In [15]:
df_purchases = df1[['customer_id','invoice_no']].drop_duplicates()
df1_aux = (df1.loc[:,['invoice_no','quantity','unit_price']]
            .assign(gross_revenue=df1['quantity']*df1['unit_price'])
)                   
df1_aux = (df1_aux[['invoice_no','gross_revenue']]
            .groupby('invoice_no')
            .sum()
            .reset_index()
)
df_purchases = pd.merge(df_purchases,df1_aux, how='left', on='invoice_no')
df1_1 = (df_purchases[['customer_id','gross_revenue']]
        .groupby('customer_id')
        .sum()
        .reset_index()
)

###  Recency

In [16]:
df1_aux = (df1[['customer_id', 'invoice_date']]
           .groupby('customer_id')
           .max()
           .reset_index()
)
df1_aux['recency_days'] = (
    df1['invoice_date'].max() - df1_aux['invoice_date']).dt.days

df1_1 = pd.merge(
    df1_1, df1_aux[['customer_id', 'recency_days']],
    on='customer_id', how='left')

### Quantity of purchases

In [17]:
df1_aux = (df1[['customer_id','invoice_no']]
            .groupby('customer_id')
            .nunique()
            .reset_index()
            .rename(columns={'invoice_no': 'qtd_purchases'})
)
df1_1 = pd.merge(df1_1, df1_aux, on='customer_id', how='left')

### Quantity of products

In [18]:
df1_aux = (df1[['customer_id', 'stock_code']]
            .groupby('customer_id')
            .nunique()
            .reset_index()
            .rename(columns={'stock_code':'qtd_products'})
)
df1_1 = pd.merge(df1_1, df1_aux, on='customer_id', how='left')
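
The four features built above (gross revenue, recency, number of purchases, number of distinct products) can also be computed in a single `groupby` with named aggregations. This is a compact sketch on made-up rows, not a replacement for the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 2],
    'invoice_no': ['100', '100', '101', '102'],
    'stock_code': ['A', 'B', 'A', 'C'],
    'quantity': [2, 1, 3, 5],
    'unit_price': [1.5, 4.0, 1.5, 2.0],
    'invoice_date': pd.to_datetime(
        ['2017-01-01', '2017-01-01', '2017-01-10', '2017-01-04']),
})

features = (
    toy.assign(gross_revenue=toy['quantity'] * toy['unit_price'])
    .groupby('customer_id')
    .agg(
        gross_revenue=('gross_revenue', 'sum'),
        last_purchase=('invoice_date', 'max'),
        qtd_purchases=('invoice_no', 'nunique'),
        qtd_products=('stock_code', 'nunique'),
    )
    .reset_index()
)
# Recency is measured against the most recent date in the whole dataset.
features['recency_days'] = (
    toy['invoice_date'].max() - features['last_purchase']).dt.days
features = features.drop(columns='last_purchase')
print(features)
```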

# Data Preparation

In [19]:
df2 = df1_1.copy()
df2 = df2.drop(columns='customer_id')

In [104]:
# Each feature gets its own fitted RobustScaler, saved under its own name
# so it can be reloaded to transform new data.
rs = RobustScaler()
df2['gross_revenue'] = rs.fit_transform(
    df2[['gross_revenue']].values)
with s3.open(
        's3://insiders-customers-dataset/gross_revenue_scaler.pkl', 'wb') as f:
    pickle.dump(rs, f)

rs = RobustScaler()
df2['recency_days'] = rs.fit_transform(
    df2[['recency_days']].values)
with s3.open(
        's3://insiders-customers-dataset/recency_days_scaler.pkl', 'wb') as f:
    pickle.dump(rs, f)

rs = RobustScaler()
df2['qtd_purchases'] = rs.fit_transform(
    df2[['qtd_purchases']].values)
with s3.open(
        's3://insiders-customers-dataset/qtd_purchases_scaler.pkl', 'wb') as f:
    pickle.dump(rs, f)

rs = RobustScaler()
df2['qtd_products'] = rs.fit_transform(
    df2[['qtd_products']].values)
with open('qtd_products.pkl', 'wb') as f:
    pickle.dump(rs, f)




# df2['gross_revenue'] = pickle.load(
#     open('gross_revenue_scaler.pkl', 'rb')).transform(
#     df2[['gross_revenue']].values)

# df2['recency_days'] = pickle.load(
#     open('recency_days_scaler.pkl', 'rb')).transform(
#     df2[['recency_days']].values)

# df2['qtd_purchases'] = pickle.load(
#     open('qtd_purchases.pkl', 'rb')).transform(
#     df2[['qtd_purchases']].values)

# df2['qtd_products'] = pickle.load(
#     open('qtd_products.pkl', 'rb')).transform(
#     df2[['qtd_products']].values)
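
As a reminder of what the scaling step does, the sketch below compares `RobustScaler` with its default settings against a manual computation on a toy column. The values are made up; the point is that a single extreme value does not distort the scale of the others.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (assumed data).
x = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

scaled = RobustScaler().fit_transform(x)

# With default settings the scaler subtracts the median and divides by
# the interquartile range (75th minus 25th percentile).
median = np.median(x)
iqr = np.percentile(x, 75) - np.percentile(x, 25)
manual = (x - median) / iqr

print(scaled.ravel())
print(np.allclose(scaled, manual))
```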

In [69]:
df_umap

| | embedding_x | embedding_y |
|---|---|---|
| 0 | -2.75 | -5.88 |
| 1 | -2.69 | 4.14 |
| 2 | 3.35 | 18.98 |
| 3 | 2.89 | 18.68 |
| 4 | 17.17 | 4.84 |
| ... | ... | ... |
| 5702 | 15.09 | -6.80 |
| 5703 | -0.74 | 2.33 |
| 5704 | -0.95 | 0.57 |
| 5705 | -5.50 | 7.91 |


In [65]:
joblib.load('embedding_umap.pkl').transform(df2)

array([[-6.9431505, 12.602888 ],
       [-5.040854 ,  8.553329 ],
       [-2.364027 ,  4.0677004],
       ...,
       [-5.815553 , 15.085253 ],
       [-0.0329408,  6.804495 ],
       [-2.6905107,  9.012781 ]], dtype=float32)

In [63]:
pipeline = Pipeline(
    steps=[
        ('preprocessor', RobustScaler()),
        ('umap_reducer', umap.UMAP(random_state=42))
    ]
)
# fit_transform returns the 2-D embedding array; fit alone returns the
# Pipeline object, which cannot be sliced with [:, 0].
embedding_umap = pipeline.fit_transform(df2)
joblib.dump(pipeline, 'embedding_umap.pkl')

df_umap = pd.DataFrame()
df_umap['embedding_x'] = embedding_umap[:, 0]
df_umap['embedding_y'] = embedding_umap[:, 1]
df_umap

# Embedding space analysis

In [67]:
df3 = df2.copy()

## UMAP

In [68]:
reducer = umap.UMAP(random_state=42)
embedding_umap = reducer.fit_transform(df3)
pickle.dump(reducer, open('umap_embedding_space.pkl','wb'))
df_umap = pd.DataFrame()
df_umap['embedding_x'] = embedding_umap[:, 0]
df_umap['embedding_y'] = embedding_umap[:, 1]
df_umap.to_csv('umap_embedding_space.csv', index=False)
df3 = df_umap

# Model Training

## Final model

In [38]:
df4 = df_umap

In [39]:
hc_model = hierarchy.linkage(df4, 'ward')
# Model predict
labels = hierarchy.fcluster(hc_model, 6, criterion='maxclust')
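
The two calls above (Ward linkage followed by `fcluster` with `criterion='maxclust'`) can be tried on synthetic points where the expected grouping is obvious. The blob centers, spread, and sample size below are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three well-separated 2-D blobs of 20 points each (synthetic data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack(
    [c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Build the Ward linkage tree, then cut it into at most 3 flat clusters.
linkage_matrix = hierarchy.linkage(points, 'ward')
toy_labels = hierarchy.fcluster(linkage_matrix, 3, criterion='maxclust')

print(np.unique(toy_labels, return_counts=True))
```

Labels start at 1, which is why the cluster numbers in the profile run from 1 to 6.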

# Cluster analysis

## Cluster profile

In [40]:
# Remove the artificial customers (IDs >= 20000) created for rows without a customer_id
df5 = df1_1.copy()
df5['cluster'] = labels
df5 = df5[df5['customer_id'] < 20000]

In [41]:
# Number and percentage of customers per cluster
df_cluster = (df5[['customer_id', 'cluster']]
              .groupby('cluster')
              .count()
              .reset_index()
)
df_cluster = df_cluster.assign(
    perc_customer=100 * (df_cluster['customer_id']
                         / df_cluster['customer_id'].sum())
)

# Median gross revenue
df_gross_revenue = (df5[['cluster', 'gross_revenue']]
                    .groupby('cluster')
                    .median()
                    .reset_index()
)
df_cluster = pd.merge(
    df_cluster, df_gross_revenue, how='inner', on='cluster')

# Median recency
df_recency = (df5[['cluster', 'recency_days']]
              .groupby('cluster')
              .median()
              .reset_index()
)
df_cluster = pd.merge(
    df_cluster, df_recency, how='inner', on='cluster')

# Median quantity of purchases
df_qtd_purchases = (df5[['cluster', 'qtd_purchases']]
                    .groupby('cluster')
                    .median()
                    .reset_index()
)
df_cluster = pd.merge(
    df_cluster, df_qtd_purchases, how='inner', on='cluster')

# Median quantity of products
df_qtd_products = (df5[['cluster', 'qtd_products']]
                   .groupby('cluster')
                   .median()
                   .reset_index()
)
df_cluster = pd.merge(
    df_cluster, df_qtd_products, how='inner', on='cluster')

df_cluster_result = df_cluster.sort_values(
    by='gross_revenue', ascending=False)
display(df_cluster_result)

| | cluster | customer_id | perc_customer | gross_revenue | recency_days | qtd_purchases | qtd_products |
|---|---|---|---|---|---|---|---|
| 3 | 4 | 1007 | 23.23 | 3122.04 | 15.0 | 8.0 | 117.0 |
| 5 | 6 | 1036 | 23.9 | 1001.66 | 36.0 | 4.0 | 50.0 |
| 4 | 5 | 826 | 19.06 | 538.01 | 63.0 | 2.0 | 29.0 |
| 0 | 1 | 657 | 15.16 | 290.66 | 46.0 | 1.0 | 19.0 |
| 2 | 3 | 700 | 16.15 | 229.09 | 239.0 | 1.0 | 14.0 |
| 1 | 2 | 108 | 2.49 | 212.87 | 366.0 | 1.0 | 12.5 |
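
A profile table of this shape can also be produced with one `groupby` and named aggregations instead of repeated merges. The sketch below uses a made-up frame with only two of the feature columns; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'cluster': [1, 1, 2, 2, 2],
    'gross_revenue': [100.0, 300.0, 10.0, 20.0, 60.0],
    'recency_days': [5, 15, 100, 200, 300],
})

profile = (
    toy.groupby('cluster')
    .agg(
        customers=('customer_id', 'count'),
        gross_revenue=('gross_revenue', 'median'),
        recency_days=('recency_days', 'median'),
    )
    .reset_index()
)
# Share of customers in each cluster, in percent.
profile['perc_customer'] = (
    100 * profile['customers'] / profile['customers'].sum())
profile = profile.sort_values(by='gross_revenue', ascending=False)
print(profile)
```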


Cluster 4 (Insiders candidate)

- Number of customers: 1007 (23.23%)
- Median gross revenue: £3122.04
- Median recency: 15 days
- Median number of purchases in one year: 8 purchases
- Median number of distinct products bought: 117 products

Cluster 6 (Cluster more products)

- Number of customers: 1036 (23.90%)
- Median gross revenue: £1001.66
- Median recency: 36 days
- Median number of purchases in one year: 4 purchases
- Median number of distinct products bought: 50 products


Cluster 5 (Cluster even more products)

- Number of customers: 826 (19.06%)
- Median gross revenue: £538.01
- Median recency: 63 days
- Median number of purchases in one year: 2 purchases
- Median number of distinct products bought: 29 products


Cluster 1 (Cluster more purchases)

- Number of customers: 657 (15.16%)
- Median gross revenue: £290.66
- Median recency: 46 days
- Median number of purchases in one year: 1 purchase
- Median number of distinct products bought: 19 products


Cluster 3 (Cluster decrease recency days)

- Number of customers: 700 (16.15%)
- Median gross revenue: £229.09
- Median recency: 239 days
- Median number of purchases in one year: 1 purchase
- Median number of distinct products bought: 14 products


Cluster 2 (Cluster decrease even more recency days)

- Number of customers: 108 (2.49%)
- Median gross revenue: £212.87
- Median recency: 366 days
- Median number of purchases in one year: 1 purchase
- Median number of distinct products bought: 12.5 products
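
Two of the business questions (how many customers are in the program and what share of revenue they generate) reduce to simple ratios once a cluster is chosen as Insiders. The sketch below uses made-up rows; treating cluster 4 as the Insiders cluster is an assumption based on the candidate named above.

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'cluster': [4, 4, 6, 2],
    'gross_revenue': [600.0, 200.0, 150.0, 50.0],
})

insiders_cluster = 4  # assumed label of the Insiders candidate cluster
is_insider = toy['cluster'] == insiders_cluster

# Share of customers and share of revenue that belong to Insiders.
perc_customers = 100 * is_insider.mean()
perc_revenue = (
    100 * toy.loc[is_insider, 'gross_revenue'].sum()
    / toy['gross_revenue'].sum())
print('{:.1f}% of customers, {:.1f}% of revenue'.format(
    perc_customers, perc_revenue))
```

Applied to `df5`, the denominator would exclude the artificial customers, since they were filtered out before profiling.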

# Deploy to production

## Insert into PostgreSQL

In [153]:
# query_create_table = """
#     CREATE TABLE insiders (
#     customer_id INTEGER,
#     gross_revenue REAL,
#     recency_days INTEGER,
#     qtd_purchases INTEGER,
#     qtd_products INTEGER,
#     cluster INTEGER
#     )
# """

load_dotenv()
 
host = os.getenv('HOST')
user = os.getenv('USER')
password = os.getenv('PASSWORD')

endpoint = 'postgresql://{}:{}@{}'.format(user, password, host)

conn = sqlalchemy.create_engine(endpoint)
# conn.connect()
# conn.execute(sqlalchemy.text(query_create_table))
# conn.commit()
# conn.close()


df5.to_sql(
    'insiders', con=conn, if_exists='append', index=False)

334
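
The write step can be rehearsed locally without database credentials. The sketch below uses an in-memory SQLite database as a stand-in for the production database and made-up rows; the table definition follows the commented-out `CREATE TABLE` statement above.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({
    'customer_id': [1, 2],
    'gross_revenue': [100.0, 50.0],
    'recency_days': [10, 20],
    'qtd_purchases': [3, 1],
    'qtd_products': [7, 2],
    'cluster': [4, 4],
})

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE insiders (
        customer_id INTEGER,
        gross_revenue REAL,
        recency_days INTEGER,
        qtd_purchases INTEGER,
        qtd_products INTEGER,
        cluster INTEGER
    )
""")

# Append the rows, then read them back to confirm the round trip.
toy.to_sql('insiders', con=conn, if_exists='append', index=False)
stored = pd.read_sql_query(
    'SELECT * FROM insiders ORDER BY customer_id', conn)
conn.close()
print(stored)
```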