Let's do some model development.

We're trying to prod people to spend money. We assume no long-term effects such as retention rates, customer annoyance, long-term habit building, or customer satisfaction, so the goal is to prod them to spend money over the short term.

Business scenarios:

- Assuming no transaction history built into model.
    - New customer, no demo info, what to offer them.
        - Basically no info at all. Offer the aggregate best, or use a model based solely on customer length.
    - New customer, demo info, what to offer them.
        - Use model based on age, gender, income, possibly together with customer length.
    - Existing customer, no demo info, what to offer them.
        - Use model based on customer length. Possibly by year as bin.
    - Existing customer, demo info, what to offer them.
        - Use model based on age, gender, income, customer length.
        
So we're looking for a way to pick out which offer to give a customer.

In our data, customers are only exposed to a maximum of 6 offers, with a median of 4 unique offers.
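
As a rough sketch of how those exposure counts could be derived, the snippet below groups 'offer_received' events by customer. The toy transcript is made up for illustration and only borrows the column names from the cleaned transcript; the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy transcript using the cleaned transcript's column names (values are made up).
toy_transcript = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'event': ['offer_received', 'offer_received', 'offer_viewed',
              'offer_received', 'transaction', 'offer_received'],
    'offer_id': [4, 4, 4, 7, None, 2],
})

# Only 'offer_received' events count as exposure to an offer.
received = toy_transcript[toy_transcript.event == 'offer_received']

# Total offers received per customer (repeats of the same offer are counted).
offers_per_customer = received.groupby('customer_id').size()
# Distinct offers received per customer.
unique_offers_per_customer = received.groupby('customer_id')['offer_id'].nunique()

max_offers = offers_per_customer.max()
median_unique_offers = unique_offers_per_customer.median()
print(max_offers, median_unique_offers)
```

Running the same two group-bys on the real transcript should give the maximum and median quoted above.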


In [1]:
# Imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Sometimes use display instead of print
from IPython.display import display

# debugging
from IPython.core.debugger import set_trace

In [2]:
# Read the cleaned data
portfolio = pd.read_csv('./data/portfolio_clean.csv')
profile = pd.read_csv('./data/profile_clean.csv')
transcript = pd.read_csv('./data/transcript_clean.csv')

In [3]:
display(portfolio.head())
display(profile.head())
display(transcript.head())

Unnamed: 0,offer_id,web,email,mobile,social,offer_type,duration,difficulty,reward
0,1,0,1,1,1,bogo,168,10,10
1,2,1,1,1,1,bogo,120,10,10
2,3,1,1,1,0,informational,96,0,0
3,4,1,1,1,0,bogo,168,5,5
4,5,1,1,0,0,discount,240,20,5


Unnamed: 0,customer_id,gender,age,income,became_member_on
0,1,,,,2017-02-12
1,2,F,55.0,112000.0,2017-07-15
2,3,,,,2018-07-12
3,4,F,75.0,100000.0,2017-05-09
4,5,,,,2017-08-04


Unnamed: 0,customer_id,time,event,amount,reward,offer_id
0,4,0,offer_received,,,4.0
1,5,0,offer_received,,,5.0
2,6,0,offer_received,,,10.0
3,7,0,offer_received,,,7.0
4,8,0,offer_received,,,2.0


How to define success?

Baseline behaviour

Split by customers? yes.
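
One possible way to split by customer, so that no customer's events end up in both sets, is scikit-learn's `GroupShuffleSplit`. This is only a sketch on made-up events; it is an alternative to splitting a list of per-customer dataframes, and the frame and variable names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy event log: five customers with two events each (values are made up).
toy_events = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'event': ['offer_received', 'transaction'] * 5,
})

# Split on customer_id groups so each customer lands in exactly one set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=7)
train_idx, test_idx = next(splitter.split(toy_events, groups=toy_events['customer_id']))

train_events = toy_events.iloc[train_idx]
test_events = toy_events.iloc[test_idx]

train_ids = set(train_events['customer_id'])
test_ids = set(test_events['customer_id'])
print(sorted(train_ids), sorted(test_ids))
```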

In [None]:
# Merge everything first.
# df = transcript.merge(profile, how='left', on='customer_id').merge(portfolio, how='left', on='offer_id')

In [None]:
# df = df.rename(columns={'reward_x':'reward_transaction', 'reward_y':'offer_reward'})

In [None]:
# df.head()

In [None]:
# A list of individual df's from grouping by customer id.
# train_customers, test_customers = train_test_split([e[1] for e in df.groupby('customer_id')], test_size=0.3, random_state=7)

In [None]:
# display(len(train_customers))
# display(len(test_customers))

In [None]:
# def split_transactions_and_offers(customer_list_of_df, transaction_key='transaction'):
#     """
#     Filters an agglomerated dataframe into transactions and offers.
    
#     Input:
#     customer_list_of_df  - individual customer dfs in a list
#     transaction_key      - str for transaction events
    
#     Returns:
#         List of tuples of transaction and offer event dfs by customer id.
#     """
#     output = []
#     # Iterate through the list and split
#     for customer in customer_list_of_df:
#         # Mask to get transactions
#         select = customer.event == transaction_key
#         # Filter for transactions and for non-transaction (offer) events
#         output.append((customer[select], customer[~select]))
    
#     return output
    

In [None]:
# train_event_split = split_transactions_and_offers(train_customers)

In [None]:
# display(train_event_split[1][0])
# print('\n'*4)
# display(train_event_split[1][1])

# Simple KNN or something for only one type of offer? e.g. offer completion.

If I had to get a very simple classifier working to:
- predict whether offer number 1, a bogo offer, was completed or not.
- predict based on demographic data.
    
To do that I would need to classify whether someone:
- got offer 1
- completed offer 1 within the specified duration

I'll split customers into training and validation.

To figure out if customers completed an offer or not, I'll need to:
- filter the transcript for 'offer_received' events for offer 1
- for each 'offer_received' event:
    - filter the transcript for 'offer_completed' events with time >= time of the received event and time <= t + duration
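
The windowed check described above could be sketched roughly as follows. The helper name `completed_within_duration` and the toy data are hypothetical, and the sketch assumes `time` and the offer `duration` are expressed in the same unit.

```python
import pandas as pd

def completed_within_duration(transcript_df, offer_id, duration,
                              action='offer_completed'):
    """Flag each 'offer_received' event with 1 if `action` followed within `duration`."""
    offer_events = transcript_df[transcript_df.offer_id == offer_id]
    received = offer_events[offer_events.event == 'offer_received']
    actions = offer_events[offer_events.event == action]

    # Pair every received event with every action event of the same customer.
    pairs = received.merge(actions, on='customer_id', how='left',
                           suffixes=('_received', '_action'))
    # Missing actions give NaN times, which compare as False here.
    in_window = ((pairs.time_action >= pairs.time_received) &
                 (pairs.time_action <= pairs.time_received + duration))
    pairs['completed'] = in_window.astype(int)

    # A received event is completed if any action falls inside its window.
    return (pairs.groupby(['customer_id', 'time_received'])['completed']
                 .max()
                 .reset_index())

# Toy transcript (made up): customer 1 completes in time, 2 too late, 3 never.
toy = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'time': [0, 100, 0, 200, 0],
    'event': ['offer_received', 'offer_completed',
              'offer_received', 'offer_completed',
              'offer_received'],
    'offer_id': [1, 1, 1, 1, 1],
})
result = completed_within_duration(toy, offer_id=1, duration=168)
print(result)
```

Note that `get_offer_action_data` further down does not apply this time window; it only checks whether the action ever occurred for the offer.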

The workflow looks like this:

1) Grab the data by offer
2) Prep the data
3) Do the modelling and report results.
    - KNN pipeline
    - other model pipeline
4) Overall function.

In [14]:
def build_offer_with_demo_action_models(transcript_df,
                                        profile_df,
                                        portfolio_df,
                                        action_dict={'bogo': 'offer_completed',
                                                     'discount': 'offer_completed',
                                                     'informational': 'offer_viewed'},
                                        model_type='knn',
                                        test_size=0.3,
                                        verbosity=3,
                                        random_state=7):
    """
    Get models for offer viewing/completion.
    
    Input:
    transcript_df - Transaction/event transcript dataframe.
    profile_df    - Customer profiles.
    portfolio_df  - Offer portfolio.
    action_dict   - Maps each offer type to the action the model should predict.
    model_type    - Classifier to use. One of 'knn' (k nearest neighbours)
                    or 'rfc' (random forest classifier).
    test_size     - Split size for training/test sets.
    verbosity     - GridSearchCV verbosity level.
    random_state  - random_state to pass down to model building etc.
                  
    
    Returns:
    A dict containing keys for the offer id and values that are dicts with:
        model   - GridSearchCV model for the offer.
        X_train - Training data features.
        X_test  - Test data features.
        y_train - Training data targets.
        y_test  - Test data targets.
    """
    output={}
    
    # Get offer ids and offer types,
    # then derive the action to predict from the offer type
    # (e.g. informational offers can only be viewed, not completed).
    offer_ids, offer_types = portfolio_df['offer_id'], portfolio_df['offer_type']
    offer_actions = [action_dict[offer_type] for offer_type in offer_types]
    
    for offer_id, action in zip(offer_ids, offer_actions):
        # Get combined customer profile + action completion df
        offer_profile = get_offer_action_data(transcript_df, profile_df, offer_id, action, contain_demo=True)
        # Get X and y 
        X, y = separate_x_y(offer_profile)
        # Split training and test
        X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state)
        # Switch to build and cross validate a model.
        if model_type == 'knn':
            model = build_KNN_pipeline_and_fit_CV(X_train, y_train, verbosity=verbosity)
        elif model_type == 'rfc':
            model = build_RFC_pipeline_and_fit_CV(X_train, y_train, verbosity=verbosity)
        else:
            raise ValueError(f"Model of type {model_type} not implemented.")
        
        # An output dict is generated for each offer id.
        output[offer_id] = {'model': model, 
                              'X_train': X_train,
                              'X_test': X_test,
                              'y_train': y_train,
                              'y_test': y_test
                             }
        
    return output

In [5]:
def get_offer_action_data(transcript_df, profile_df, offer_id, action_type, contain_demo=True):
    """
    For a given offer id, generates a df combining customer profiles with 
    whether they viewed/completed the offer (if they received the offer) over
    the course of the experiment.
    
    Also dummies gender data and converts "became_member_on" to 
    durations of how long a customer has been a customer.
    
    Input:
    transcript_df - Transaction/event transcript dataframe.
    profile_df    - Customer profiles.
    offer_id      - Offer id to process.
    action_type   - 'offer_completed' or 'offer_viewed'.
    contain_demo  - If True, returns only customers having age/gender/income demographic data.
                    If False, returns only customers without demographic data.
    
    Returns:
    offer_profile - Dataframe containing customer profiles and offer view/completion status.
    """
    tr = transcript_df
    pro = profile_df
    
    # Get the sets of customer ids that did and did not perform the action
    offer_customers = set(tr[(tr.offer_id == offer_id) & (tr.event == 'offer_received')].customer_id)
    offer_completed = set(tr[(tr.offer_id == offer_id) & (tr.event == action_type)].customer_id)
    offer_incomplete = offer_customers - offer_completed
    
    # Appending 0/1 for incomplete/complete to customer profile data
    profile_complete = pro[pro.customer_id.isin(offer_completed)].assign(offer_complete = 1)
    profile_incomplete = pro[pro.customer_id.isin(offer_incomplete)].assign(offer_complete = 0)
    
    offer_profile = pd.concat([profile_complete, profile_incomplete]).sort_values('customer_id')
    
    # Customers w/ demographic data
    if contain_demo:
        offer_profile = offer_profile.dropna()
    # Customers missing demographic data
    else:
        offer_profile = offer_profile[offer_profile.isna().any(axis=1)]
        
    # Clean the data
    ## Get dummies for gender
    offer_profile = pd.concat([offer_profile,
                               pd.get_dummies(offer_profile.gender, prefix="gender")],
                              axis=1)
    # Change membership date to duration in years of how long customer
    # has been a customer.
    customer_duration = pd.to_datetime(offer_profile.became_member_on)
    customer_duration = (customer_duration.max() - customer_duration).dt.days/365
    offer_profile['customer_duration'] = customer_duration
    
    # Drop unnecessary columns
    offer_profile = offer_profile.drop(columns=['gender', 'became_member_on', 'customer_id'])
    
    return offer_profile
    

In [None]:
offer1_profile_demo_info = get_offer_action_data(transcript, profile, offer_id=1, action_type='offer_completed', contain_demo=True)

In [None]:
offer1_profile_demo_info.head()

In [6]:
def separate_x_y(df, y_key='offer_complete'):
    """
    Takes a df and separates it into X and y by y_key.
    
    Input:
    df     - A dataframe.
    y_key  - str of the y target column.
    
    Returns:
    X      - Dataframe without y_key column.
    y      - Series from y_key column.
    """
    # Split off the target column.
    # Standardization (mean = 0, unit variance) happens later, inside the model pipeline.
    X = df.drop(columns=y_key)    
    y = df[y_key]
    
    return X, y

In [None]:
X, y = separate_x_y(offer1_profile_demo_info)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=7)

In [7]:
def build_KNN_pipeline_and_fit_CV(X_train, y_train, verbosity=3, n_neighbors_grid=[1,5,10,20,40,80,160,320,640,1000]):
    """
    Builds a KNN model and fits on training data with cross validation.
    
    Standardizes X data first before feeding into a KNN model.
    
    Input:
    X_train          - Training data features.
    y_train          - Training data targets.
    verbosity        - 0, 1, 2, or 3 to control GridSearchCV output.
    n_neighbors_grid - Search grid for the number of neighbors for the KNN model.
    
    Returns:
    model            - a GridSearchCV model.
    
    """
    pipe = Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier())])
    
    param_grid = [{'knn__n_neighbors': n_neighbors_grid}]
    model = GridSearchCV(pipe, scoring='f1', param_grid=param_grid, cv=5, refit=True, verbose=verbosity, return_train_score=True)
    
    model.fit(X_train, y_train)
    
    print(f"Best params: {model.best_params_}.")
    print(f"Best score: {round(model.best_score_, 5)}.")
    
    return model

In [8]:
def get_CV_model_scores(model):
    """
    Gets training and cross validation scores from a model.
    
    Input:
    model - an sklearn GridSearchCV model.
    
    Returns:
    mean_train_score - GridSearchCV model mean training scores.
    mean_test_scores - GridSearchCV model mean cross validation scores.
    """
    return model.cv_results_['mean_train_score'], model.cv_results_['mean_test_score']

In [15]:
knn_models = build_offer_with_demo_action_models(transcript, profile, portfolio, model_type='knn', verbosity=1, random_state=7)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 160}.
Best score: 0.75268.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 160}.
Best score: 0.71826.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 160}.
Best score: 0.70603.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 1000}.
Best score: 0.7945.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 320}.
Best score: 0.7185.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 320}.
Best score: 0.86337.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 320}.
Best score: 0.87384.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best params: {'knn__n_neighbors': 20}.
Best score: 0.94924.
Fitting 5 folds for each o

In [41]:
rfc_models = build_offer_with_demo_action_models(transcript, profile, portfolio, model_type='rfc', verbosity=1, random_state=7)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best params: {'rfc__max_depth': None, 'rfc__min_samples_leaf': 8, 'rfc__min_samples_split': 20, 'rfc__n_estimators': 10}.
Best score: 0.75759.
Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best params: {'rfc__max_depth': 100, 'rfc__min_samples_leaf': 8, 'rfc__min_samples_split': 5, 'rfc__n_estimators': 100}.
Best score: 0.72039.
Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best params: {'rfc__max_depth': 100, 'rfc__min_samples_leaf': 8, 'rfc__min_samples_split': 20, 'rfc__n_estimators': 10}.
Best score: 0.68864.
Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best params: {'rfc__max_depth': None, 'rfc__min_samples_leaf': 8, 'rfc__min_samples_split': 10, 'rfc__n_estimators': 100}.
Best score: 0.79049.
Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best params: {'rfc__max_depth': 100, 'rfc__min_samples_leaf': 8, 'rfc__min_samples_split': 20, 'rfc__n_estimators': 100

In [18]:
knn_models[1].keys()

dict_keys(['model', 'X_train', 'X_test', 'y_train', 'y_test'])
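
Since each entry keeps its held-out split, the test-set F1 score per offer could be computed along these lines. The helper `score_on_test_set` is hypothetical, and the toy dict below only mimics the shape of `knn_models` with a plain `KNeighborsClassifier` on random data.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

def score_on_test_set(models):
    """Return {offer_id: F1 score on that offer's held-out test split}."""
    return {offer_id: f1_score(entry['y_test'],
                               entry['model'].predict(entry['X_test']))
            for offer_id, entry in models.items()}

# Toy stand-in with the same keys as the real output dict (data is random).
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
y = (X[:, 0] > 0).astype(int)
toy_models = {1: {'model': KNeighborsClassifier(n_neighbors=3).fit(X[:40], y[:40]),
                  'X_train': X[:40], 'X_test': X[40:],
                  'y_train': y[:40], 'y_test': y[40:]}}

scores = score_on_test_set(toy_models)
print(scores)
```

Calling `score_on_test_set(knn_models)` should work the same way, because a fitted `GridSearchCV` with `refit=True` exposes `predict`.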

In [None]:
knn

In [22]:
knn_models[1]['X_test']

Unnamed: 0,age,income,gender_F,gender_M,gender_O,customer_duration
11645,54.0,55000.0,0,1,0,2.331507
4380,41.0,53000.0,0,1,0,0.320548
7325,51.0,62000.0,1,0,0,0.265753
4217,47.0,30000.0,1,0,0,2.961644
13002,63.0,117000.0,0,1,0,0.093151
...,...,...,...,...,...,...
15207,80.0,74000.0,0,1,0,0.583562
16753,60.0,98000.0,0,1,0,1.339726
10634,38.0,44000.0,0,1,0,0.887671
403,49.0,104000.0,0,1,0,0.356164


In [30]:
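# Hypothetical customer, in X column order:
# age, income, gender_F, gender_M, gender_O, customer_duration
test = np.array([[32, 0, 0, 1, 0, 5]])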

In [31]:
test

array([[32,  0,  0,  1,  0,  5]])

In [36]:
for offer in knn_models.keys():
    print(f"Offer: {offer}. Prediction: {knn_models[offer]['model'].predict(test)[0]}")

Offer: 1. Prediction: 0
Offer: 2. Prediction: 0
Offer: 3. Prediction: 1
Offer: 4. Prediction: 1
Offer: 5. Prediction: 0
Offer: 6. Prediction: 1
Offer: 7. Prediction: 1
Offer: 8. Prediction: 1
Offer: 9. Prediction: 1
Offer: 10. Prediction: 1


In [39]:
# Same hypothetical customer, but with gender_F=1 instead of gender_M=1
test2 = np.array([[32, 0, 1, 0, 0, 5]])

In [40]:
for offer in knn_models.keys():
    print(f"Offer: {offer}. Prediction: {knn_models[offer]['model'].predict(test2)[0]}")

Offer: 1. Prediction: 0
Offer: 2. Prediction: 0
Offer: 3. Prediction: 0
Offer: 4. Prediction: 1
Offer: 5. Prediction: 1
Offer: 6. Prediction: 1
Offer: 7. Prediction: 1
Offer: 8. Prediction: 1
Offer: 9. Prediction: 1
Offer: 10. Prediction: 1
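
To actually pick one offer rather than list 0/1 predictions, one option is to rank offers by the predicted probability of the positive class. The sketch below uses a hypothetical `rank_offers` helper and toy classifiers. Keep in mind that the predicted action differs by offer type (viewed vs. completed), so the probabilities are not strictly comparable across all offers.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def rank_offers(models, customer_features):
    """Return (offer_id, probability) pairs, most likely action first."""
    scores = {}
    for offer_id, entry in models.items():
        model = entry['model']
        # Column of predict_proba that corresponds to the positive class (label 1).
        positive_col = list(model.classes_).index(1)
        scores[offer_id] = model.predict_proba(customer_features)[0, positive_col]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy models (made up): offer 1 is taken up at high feature values, offer 2 at low ones.
X = np.array([[0], [1], [2], [3]])
toy_models = {
    1: {'model': KNeighborsClassifier(n_neighbors=1).fit(X, [0, 0, 1, 1])},
    2: {'model': KNeighborsClassifier(n_neighbors=1).fit(X, [1, 1, 0, 0])},
}

ranked = rank_offers(toy_models, np.array([[3]]))
print(ranked)
```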


In [23]:
test = knn_models[1]['X_test'].iloc[0,:]

In [25]:
test.iloc[0] = pd.Series([32, 0, 0, 1, 0, 5])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


ValueError: setting an array element with a sequence.
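
The error above appears to come from assigning a six-element Series to a single scalar position (`test.iloc[0]`). A simpler route is to build a one-row dataframe with the feature columns directly. The helper `make_customer_row` below is a hypothetical sketch; the column order is taken from the `X_test` output shown earlier.

```python
import pandas as pd

# Feature order as displayed in X_test.
feature_columns = ['age', 'income', 'gender_F', 'gender_M', 'gender_O',
                   'customer_duration']

def make_customer_row(age, income, gender, customer_duration):
    """Build a one-row feature frame with one-hot gender columns."""
    row = {'age': age,
           'income': income,
           'gender_F': int(gender == 'F'),
           'gender_M': int(gender == 'M'),
           'gender_O': int(gender == 'O'),
           'customer_duration': customer_duration}
    return pd.DataFrame([row], columns=feature_columns)

# Same values as the hand-written array used for the predictions above.
customer = make_customer_row(age=32, income=0, gender='M', customer_duration=5)
print(customer)
```

Passing a frame with the training column names to `predict` also keeps the feature order explicit.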

In [None]:
knn_model = build_KNN_pipeline_and_fit_CV(X_train, y_train, verbosity=1)

In [None]:
train_score, validation_score = get_CV_model_scores(knn_model)

In [None]:
n_neighbors = [e['knn__n_neighbors'] for e in knn_model.cv_results_['params']]

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
param = 'knn__n_neighbors'
sns.lineplot(ax=ax, x=n_neighbors, y=train_score, label='train')
sns.lineplot(ax=ax, x=n_neighbors, y=validation_score, label='validation')
# Mark the best n_neighbors found by the grid search
ax.axvline(knn_model.best_params_[param])
ax.set_xlabel(param)
ax.set_ylabel('f1 score')

In [20]:
def build_RFC_pipeline_and_fit_CV(X_train,
                                  y_train,
                                  verbosity=3,
                                  param_grid=[{'rfc__n_estimators': [10, 100],
                                               'rfc__max_depth': [100, None],
                                               'rfc__min_samples_split': [2, 5, 10, 20],
                                               'rfc__min_samples_leaf': [1, 2, 4, 8]
                                             }],
                                  random_state=7):
    """
    Builds a random forest classifier and fits on training data with cross validation.
    
    Standardizes X data first before feeding into a RF model.
    
    See https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
    for hyperparameter tuning example.
    
    Input:
    X_train          - Training data features.
    y_train          - Training data targets.
    verbosity        - 0, 1, 2, or 3 to control GridSearchCV output.
    param_grid       - Search grid for the RF model. Prefix params with 'rfc__'.
    random_state     - random int for the RF model.
    
    Returns:
    model            - a GridSearchCV model.
    """
    pipe = Pipeline([('scaler', StandardScaler()),
                         ('rfc', RandomForestClassifier(random_state=random_state))])
    
 
    
    model = GridSearchCV(pipe, scoring='f1', param_grid=param_grid, cv=5, refit=True, verbose=verbosity, return_train_score=True)
    
    model.fit(X_train, y_train)
    
    print(f"Best params: {model.best_params_}.")
    print(f"Best score: {round(model.best_score_, 5)}.")
    
    return model

In [None]:
rfc_model = build_RFC_pipeline_and_fit_CV(X_train, y_train, verbosity=1)

In [None]:
train_score, validation_score = get_CV_model_scores(rfc_model)

In [None]:
# The RF grid has several hyperparameters, so plot scores against the candidate index.
candidates = np.arange(len(rfc_model.cv_results_['params']))

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
sns.lineplot(ax=ax, x=candidates, y=train_score, label='train')
sns.lineplot(ax=ax, x=candidates, y=validation_score, label='validation')
# Mark the best candidate found by the grid search
ax.axvline(rfc_model.best_index_)
ax.set_xlabel('grid search candidate index')
ax.set_ylabel('f1 score')

## References

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74