# Lesson 1. Introduction to uplift modeling

Variables describing the customer before the mailing

* Recency: Months since last purchase.
* History_Segment: Categorization of dollars spent in the past year.
* History: Actual dollar value spent in the past year.
* Mens: 1/0 indicator, 1 = customer purchased Mens merchandise in the past year.
* Womens: 1/0 indicator, 1 = customer purchased Womens merchandise in the past year.
* Zip_Code: Classifies zip code as Urban, Suburban, or Rural.
* Newbie: 1/0 indicator, 1 = New customer in the past twelve months.
* Channel: Describes the channels the customer purchased from in the past year.


This variable records which group the customer was assigned to
* Segment ("Mens E-Mail", "Womens E-Mail", "No E-Mail")

Variables describing the customer during the two weeks after receiving the e-mail

* Visit: 1/0 indicator, 1 = Customer visited website in the following two weeks.
* Conversion: 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks.
* Spend: Actual dollars spent in the following two weeks.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [14]:
from typing import List

In [17]:
import catboost as cb
import pylift
import causalml.metrics as cmetrics

from causalml.inference.tree import UpliftRandomForestClassifier
from sklearn.model_selection import train_test_split

ModuleNotFoundError: No module named 'causalml'

In [19]:
# Shell commands need the `!` prefix inside a notebook cell;
# `%cd` changes the working directory for the following lines.
!git clone https://github.com/uber/causalml.git
%cd causalml
!python setup.py build_ext --inplace
!python setup.py install

# Required libs

In [23]:
def dict_coalesce(left_dict: dict, right_dict: dict) -> None:
    for key, value in right_dict.items():
        if key not in left_dict:
            left_dict[key] = value

class FunctionWrapper(object):
    def __init__(self, function, **params):
        self.params = params
        self.function = function

    def __call__(self, *args, **kwargs):
        dict_coalesce(kwargs, self.params)
        return self.function(*args, **kwargs)
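A short usage sketch may make the intent of `FunctionWrapper` clearer: it stores default keyword arguments and fills them in only when the caller has not supplied them. The `power` function below is a hypothetical example, not part of the notebook.

```python
def dict_coalesce(left_dict: dict, right_dict: dict) -> None:
    # Copy keys from right_dict into left_dict only if they are missing there.
    for key, value in right_dict.items():
        if key not in left_dict:
            left_dict[key] = value


class FunctionWrapper(object):
    def __init__(self, function, **params):
        self.params = params
        self.function = function

    def __call__(self, *args, **kwargs):
        dict_coalesce(kwargs, self.params)
        return self.function(*args, **kwargs)


def power(base, exponent=2):
    # Hypothetical function used only for illustration.
    return base ** exponent


cube = FunctionWrapper(power, exponent=3)
default_result = cube(2)                # stored default exponent=3 is used
override_result = cube(2, exponent=4)   # explicit keyword wins over the stored one
print(default_result, override_result)
```

Because `dict_coalesce` never overwrites existing keys, keyword arguments passed at call time take precedence over the ones given at construction.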

In [24]:
def get_shap_values_(upmodel: pylift.TransformedOutcome):
    return upmodel.model.get_feature_importance(
        data=cb.Pool(
            data=upmodel.x_test,
            label=upmodel.transformed_y_test
        ),
        fstr_type='ShapValues'
    )
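The `transformed_y_test` attribute used above refers to the transformed outcome that the `TransformedOutcome` approach is built around. As a minimal sketch, assuming a binary treatment flag `w`, an outcome `y`, and a constant treatment probability `p`, the standard transformation is `y * (w - p) / (p * (1 - p))`, whose mean estimates the average uplift. The helper name `transformed_outcome` and the toy arrays are illustrative assumptions, and pylift's internal implementation may differ in detail.

```python
import numpy as np


def transformed_outcome(y, w, p):
    """Transformed outcome for a binary treatment w with treatment probability p."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return y * (w - p) / (p * (1.0 - p))


# Toy data (made up): four treated and four control customers.
y = np.array([1, 1, 0, 0, 1, 0, 0, 0])
w = np.array([1, 1, 1, 1, 0, 0, 0, 0])

z = transformed_outcome(y, w, p=0.5)

# Mean of the transformed outcome vs. the plain difference in response rates.
naive_uplift = y[w == 1].mean() - y[w == 0].mean()
print(z.mean(), naive_uplift)
```

A regressor fitted on `z` can then be read as a direct estimate of uplift, which is why feature importances are computed against this label rather than the raw outcome.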

# Load data

In [3]:
data = pd.read_csv('data/part3/less1/1 Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

In [4]:
data.head()

Unnamed: 0,recency,history_segment,history,mens,womens,zip_code,newbie,channel,segment,visit,conversion,spend
0,10,2) $100 - $200,142.44,1,0,Surburban,0,Phone,Womens E-Mail,0,0,0.0
1,6,3) $200 - $350,329.08,1,1,Rural,1,Web,No E-Mail,0,0,0.0
2,7,2) $100 - $200,180.65,0,1,Surburban,1,Web,Womens E-Mail,0,0,0.0
3,9,5) $500 - $750,675.83,1,0,Rural,1,Web,Mens E-Mail,0,0,0.0
4,2,1) $0 - $100,45.34,1,0,Urban,0,Web,Womens E-Mail,0,0,0.0


In [5]:
data['segment'].value_counts()

segment
Womens E-Mail    21387
Mens E-Mail      21307
No E-Mail        21306
Name: count, dtype: int64

In [7]:
# .copy() makes an independent frame, so adding a column below
# does not raise SettingWithCopyWarning.
data_womens = data.query('segment in ("No E-Mail", "Womens E-Mail")').copy()

In [8]:
data_womens['treatment'] = (data_womens['segment'] == "Womens E-Mail").astype('int32')


In [9]:
data_liar = data.query('segment in ("No E-Mail", "Womens E-Mail")').copy()
data_liar['treatment'] = (data_liar['segment'] == "Womens E-Mail").astype('int32')


In [10]:
data_womens.head()

Unnamed: 0,recency,history_segment,history,mens,womens,zip_code,newbie,channel,segment,visit,conversion,spend,treatment
0,10,2) $100 - $200,142.44,1,0,Surburban,0,Phone,Womens E-Mail,0,0,0.0,1
1,6,3) $200 - $350,329.08,1,1,Rural,1,Web,No E-Mail,0,0,0.0,0
2,7,2) $100 - $200,180.65,0,1,Surburban,1,Web,Womens E-Mail,0,0,0.0,1
4,2,1) $0 - $100,45.34,1,0,Urban,0,Web,Womens E-Mail,0,0,0.0,1
5,6,2) $100 - $200,134.83,0,1,Surburban,0,Phone,Womens E-Mail,1,0,0.0,1
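With a binary `treatment` column in place, a natural first check is the difference in mean response between the treated and control groups. The sketch below uses a small made-up frame with the same column names (`treatment`, `visit`); the numbers are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (made up) that mimics the layout of data_womens.
toy = pd.DataFrame({
    "treatment": [1, 1, 1, 1, 0, 0, 0, 0],
    "visit":     [1, 1, 0, 1, 0, 1, 0, 0],
})

# Response rate per group, then the treated-minus-control difference.
rates = toy.groupby("treatment")["visit"].mean()
naive_uplift = rates.loc[1] - rates.loc[0]
print(rates)
print(naive_uplift)
```

The same two lines applied to `data_womens` would give the average effect of the Womens e-mail on visits. An uplift model goes further by estimating how that effect varies across customers.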


# Explore data

In [11]:
data_womens.shape

(42693, 13)

In [12]:
data_womens['zip_code'].value_counts()

zip_code
Surburban    19275
Urban        17098
Rural         6320
Name: count, dtype: int64

In [13]:
data_womens['channel'].value_counts()

channel
Phone           18781
Web             18727
Multichannel     5185
Name: count, dtype: int64

# Transform data

In [25]:
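def one_hot_encode(data: pd.DataFrame, cols: List[str] = None) -> pd.DataFrame:
    """Append one-hot columns for `cols` (all columns if None); originals are kept."""
    if cols is None:
        cols = data.columns
    # get_dummies names the new columns '<column>_<value>'
    result = pd.concat([data, pd.get_dummies(data[cols])], axis=1)
    return result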
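To see what `one_hot_encode` produces, here is a small sketch on a made-up two-column frame. It restates the function so the block runs on its own. The toy values are illustrative.

```python
import pandas as pd


def one_hot_encode(data, cols=None):
    # Same logic as the notebook function: append dummies, keep the originals.
    if cols is None:
        cols = data.columns
    return pd.concat([data, pd.get_dummies(data[cols])], axis=1)


toy = pd.DataFrame({
    "channel": ["Web", "Phone", "Web"],
    "recency": [1, 5, 3],
})

encoded = one_hot_encode(toy, cols=["channel"])
columns = list(encoded.columns)
print(columns)
print(encoded)
```

Note that the source column (`channel`) stays in the result next to its dummy columns. Calling the function with `cols=None` on a frame that has numeric columns would also duplicate those numeric columns, since `pd.get_dummies` passes them through unchanged, so listing the categorical columns explicitly is the safer choice.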

In [26]:
def transform_(data: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's frame is not mutated
    # (this also avoids SettingWithCopyWarning).
    data = data.copy()

    # Ordinal codes; 'Surburban' is spelled this way in the dataset.
    zipcode_num_dict = {
        'Urban': 0,
        'Surburban': 1,
        'Rural': 2
    }
    data['zip_code_num'] = data['zip_code'].map(zipcode_num_dict)

    channel_num_dict = {
        'Web': 0,
        'Multichannel': 1,
        'Phone': 2
    }
    data['channel_num'] = data['channel'].map(channel_num_dict)
    # The first character of e.g. '2) $100 - $200' is the segment number.
    data['history_segment__label'] = data['history_segment'].str[0]
    data = one_hot_encode(data, cols=['zip_code', 'channel', 'history_segment__label'])
    return data

In [27]:
data_womens = transform_(data_womens)

In [28]:
data_womens.head()

Unnamed: 0,recency,history_segment,history,mens,womens,zip_code,newbie,channel,segment,visit,...,channel_Multichannel,channel_Phone,channel_Web,history_segment__label_1,history_segment__label_2,history_segment__label_3,history_segment__label_4,history_segment__label_5,history_segment__label_6,history_segment__label_7
0,10,2) $100 - $200,142.44,1,0,Surburban,0,Phone,Womens E-Mail,0,...,False,True,False,False,True,False,False,False,False,False
1,6,3) $200 - $350,329.08,1,1,Rural,1,Web,No E-Mail,0,...,False,False,True,False,False,True,False,False,False,False
2,7,2) $100 - $200,180.65,0,1,Surburban,1,Web,Womens E-Mail,0,...,False,False,True,False,True,False,False,False,False,False
4,2,1) $0 - $100,45.34,1,0,Urban,0,Web,Womens E-Mail,0,...,False,False,True,True,False,False,False,False,False,False
5,6,2) $100 - $200,134.83,0,1,Surburban,0,Phone,Womens E-Mail,1,...,False,True,False,False,True,False,False,False,False,False
