# From Data Analysis to Data Science, a marketing use case (Participant)

This notebook is a support for following the workshop. The full notebook will be released at the end of the workshop.

This tutorial's objective is to **illustrate how simple usage of data analysis and data science tools can help** in business environments. Marketing lends itself naturally to data: as the user base grows, guessing which product would interest a given user becomes more and more difficult.

This tutorial is suitable for any audience and only illustrates some of the possible analyses.

We will study the issue of **upselling**. Online services offer free products to acquire users, hoping that some of them will one day pay for products and make the company profitable. Identifying which users are likely to convert to paying products therefore has a large economic impact:

1. To earn revenue from those users and prioritize them for customer support, etc.
2. To stop spending effort on users who will never convert, and possibly to spend less on the acquisition sources that bring in these non-convertible users.

In the following study we assume that we want to convert people to a premium package. However, the methodology presented can be applied to any other paying product.

******************************************

## 0. Introduction

### Context 

You work for a company developing a mobile payment app with 300k users. Among those users, **29%** could be interested in premium. We suppose that **40%** of the users who could be interested do not know about premium and can be converted with a marketing campaign. Among the users who are actually premium, we assume that **5%** are premium although their profile does not suggest it (special offers, etc.).
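To get a feel for the orders of magnitude, the percentages above can be turned into head counts. This is only a back-of-the-envelope sketch: the variable names are made up for the illustration, and the figures are the ones quoted in this context, not values measured on the dataset.

```python
# Back-of-the-envelope sizing from the percentages quoted in the context.
n_users = 300_000
share_interested = 0.29   # users who could be interested in premium
share_unaware = 0.40      # of those, users who do not know about premium yet

n_interested = round(n_users * share_interested)
n_convertible = round(n_interested * share_unaware)
n_contacts = 3_000        # contact budget N of the first campaign setting

print(f"Potentially interested users: {n_interested}")
print(f"Unaware, hence convertible:   {n_convertible}")
print(f"Share of the base we may contact: {n_contacts / n_users:.1%}")
```

Under these assumptions, a budget of 3000 contacts is 1% of the user base, which is why the ranking of the very first users matters so much.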

You are asked to build a target of users for a marketing campaign in two settings:
- You are allowed to contact N users. (N=3000)
- You have to maximize your ROI. (Cost: 5 euros / Revenue: 20 euros)

So for the whole experiment we define two business metrics to compare our methods:
- Lift and notably the lift at 1%
- The Max ROI
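As a minimal sketch of what these two metrics measure, assume we already have a list of 0/1 outcomes sorted from the most to the least promising user. The function names `lift_at` and `max_roi` are hypothetical and are not the helpers used later in this notebook; the default cost and revenue simply mirror the figures quoted in the second setting above.

```python
import numpy as np

def lift_at(y_sorted, fraction):
    """Lift of the top `fraction` of a ranked list of 0/1 outcomes.

    Ratio between the conversion rate in the selected head of the ranking
    and the conversion rate over the whole list.
    """
    y = np.asarray(y_sorted, dtype=float)
    n_top = max(1, int(round(fraction * len(y))))
    return y[:n_top].mean() / y.mean()

def max_roi(y_sorted, cost=5.0, revenue=20.0):
    """Best cumulative profit when contacting the first n users of the ranking."""
    y = np.asarray(y_sorted, dtype=float)
    n_contacted = np.arange(1, len(y) + 1)
    profit = revenue * np.cumsum(y) - cost * n_contacted
    best = int(np.argmax(profit))
    return int(n_contacted[best]), float(profit[best])

# Toy ranking: 3 converters among 10 users, mostly placed at the top.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
top20_lift = lift_at(ranked, 0.20)
best_n, best_profit = max_roi(ranked)
print(f"Lift at 20%: {top20_lift:.2f}")
print(f"Best profit: {best_profit:.0f} euros when contacting {best_n} users")
```

A lift of 1 means the ranking does no better than picking users at random. The Max ROI is reached where contacting one more user costs more than the revenue it is expected to bring.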

### Libraries

In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp
from time import time
from sklearn import manifold
from sklearn.metrics import euclidean_distances
from sklearn.cluster import MiniBatchKMeans, KMeans, Birch
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


def find_max_roi(test):
    roi_computations = test.query('index in @users_targeted.index.tolist()').copy()
    roi_computations['dum'] = 1
    roi_computations = roi_computations.sort_values('nb_card', ascending=False)
    roi_computations = roi_computations[['premium_target','dum']].cumsum()
    roi_computations['roi'] = roi_computations['premium_target']*20 - 10*roi_computations['dum']
    max_roi = roi_computations.reset_index(drop=True).roi.idxmax()
    
    plt.figure()
    plt.title("Revenues by Nb of users targeted")
    plt.plot(roi_computations['roi'].values)
    plt.show()

    print(f'We should then target {round(max_roi/len(roi_computations)*100, 2)}% of the users targeted')

    
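# This definition replaces the one above. It expects `test_target` to be
# already sorted from the most to the least promising user.
def find_max_roi(test_target):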
    roi_computations = test_target.copy()
    roi_computations['dum'] = 1

    roi_computations = roi_computations[['premium_target','dum']].cumsum()
    roi_computations['roi'] = roi_computations['premium_target']*20 - 10*roi_computations['dum']
    max_roi = roi_computations.reset_index(drop=True).roi.idxmax()
    
    plt.figure()
    plt.title("Revenues by Nb of users targeted")
    plt.plot(roi_computations['roi'].values)
    plt.show()

    print(f'We should then target {round(max_roi/len(roi_computations)*100, 2)}% of the users targeted')
    return round(max_roi/len(roi_computations)*100, 2)

def compute_roi(users_targeted, ratio):
    users_targeted_sub = users_targeted.head(int(ratio*len(users_targeted)))
    users_targeted_sub = users_targeted_sub.query("premium_target==0")
    revenue = users_targeted_sub.premium.sum()*20 - len(users_targeted_sub)*10
    return revenue, users_targeted_sub

def compute_lift(users_targeted, df_real):
    res = pd.DataFrame(range(1, 100), columns=['split_value'])
    res['split'] = 1.0  # float column, so the lift ratios can be stored without a dtype change
    res['nb_targeted'] = len(users_targeted)
    for idx, row in res.iterrows():
        split_value = row['split_value']/100
        coeff = df_real.query('premium_target==0').premium.sum()/len(df_real.query('premium_target==0'))
        nb_selected = int(split_value*len(df_real.query('premium_target==0')))
        nb_found = users_targeted.query('premium_target==0').head(nb_selected).premium.sum()
        res.loc[res['split_value']==row['split_value'],'split'] = nb_found/(nb_selected*coeff)
        res.loc[res['split_value']==row['split_value'],'nb_targeted'] = nb_selected

    res = res.query('split>=1.')
    res.split.plot.bar(x="split_value", y="split")
    plt.title('Lift')
    plt.show()

    nb_selected=3000
    nb_found = users_targeted.query('premium_target==0').head(nb_selected).premium.sum()
    print(f"Premium Found : {nb_found} users")

### Data

The data has been generated automatically following the centered sampling procedure. A conditional dependency graph between features has been created to generate the dataset. The code to generate the data will be available at the end of this workshop.

The target indicates whether a user is premium or not. It has been generated automatically, conditionally on the features. To make the data closer to real-life data, we assume that 40% of the people who should be premium are not aware that the premium features could interest them. We also suppose that 5% of the premium users do not have a typical premium profile.

Your objective now is to make the people who could be interested in premium aware of this product. You have the data without the real target at your disposal. For evaluation purposes we will use the real target. All the material will be released tomorrow.

In [2]:
df = pd.read_csv('data/workshop_marketing_use_case_wo_real_target.csv')

| Unnamed: 0 | index | phone_price | age | phone_model_ios | nb_p2p | nb_card | place | model_age | card_type | bank | connection_hour | session_time | premium_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 236533 | 236533 | 474.203742 | 18.007359 | 1 | 1 | 8 | 2 | 7.525404 | 2 | 1 | 21.178519 | 2.865003 | 0 |
| 164970 | 164970 | 731.311845 | 21.166237 | 0 | 1 | 5 | 2 | 0.0 | 1 | 1 | 12.093523 | 0.852273 | 0 |
| 23964 | 23964 | 278.262866 | 22.570296 | 1 | 2 | 16 | 3 | 4.841244 | 1 | 1 | 8.755906 | 2.287618 | 0 |
| 180167 | 180167 | 209.89657 | 22.643436 | 0 | 2 | 24 | 2 | 2.524249 | 2 | 1 | 9.913262 | 13.81291 | 0 |
| 201905 | 201905 | 209.835367 | 22.290948 | 0 | 6 | 28 | 3 | 7.037153 | 1 | 1 | 5.633749 | 13.521256 | 1 |
| 134510 | 134510 | 355.020629 | 24.669966 | 0 | 2 | 20 | 3 | 6.802363 | 1 | 1 | 5.378326 | 11.71922 | 1 |
| 135003 | 135003 | 476.233339 | 19.371665 | 1 | 0 | 21 | 3 | 4.910793 | 3 | 1 | 22.235677 | 4.515787 | 1 |
| 57535 | 57535 | 275.668196 | 19.412672 | 0 | 11 | 50 | 3 | 0.0 | 1 | 1 | 21.299278 | 6.702291 | 1 |
| 143494 | 143494 | 313.292704 | 29.726442 | 1 | 5 | 13 | 3 | 7.267049 | 2 | 1 | 5.26049 | 4.482642 | 0 |
| 57622 | 57622 | 450.403157 | 24.970813 | 0 | 0 | 7 | 3 | 0.0 | 1 | 3 | 11.252887 | 3.959148 | 0 |
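The `Unnamed: 0` column in the sample above duplicates `index`. pandas typically creates such a column when a file was saved together with its row index, which we assume is the case here. The sketch below embeds three of the rows shown above so that it runs without the CSV file, and reads them with `index_col=0` to avoid the extra column. The empty first header field is part of that assumption.

```python
import io
import pandas as pd

# Three of the sample rows shown above, embedded so the sketch runs without the CSV file.
# The empty first header field is an assumption about how the file was saved.
csv_text = """,index,phone_price,age,phone_model_ios,nb_p2p,nb_card,place,model_age,card_type,bank,connection_hour,session_time,premium_target
236533,236533,474.203742,18.007359,1,1,8,2,7.525404,2,1,21.178519,2.865003,0
201905,201905,209.835367,22.290948,0,6,28,3,7.037153,1,1,5.633749,13.521256,1
57535,57535,275.668196,19.412672,0,11,50,3,0.0,1,1,21.299278,6.702291,1
"""

# index_col=0 uses the first CSV column as the row index instead of
# materialising it as an "Unnamed: 0" column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.columns.tolist())
print(f"Share of premium users in the sample: {sample['premium_target'].mean():.2f}")
```

The same `index_col=0` argument can be passed when reading the workshop file, provided its first column really is the saved index.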


## I. Naive Exploration

## II. Data Analysis

## III. Clustering Analysis

## IV. Supervised Analysis