This notebook builds a structured dataset for modeling.

# Setting up

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
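from pyspark.pandas import merge_asof
# NOTE: `ps` is the pandas-on-Spark DataFrame class, not the pyspark.pandas module;
# ps(sdf) is used below to wrap a Spark DataFrame before calling merge_asof
from pyspark.pandas import DataFrame as ps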
from pyspark.sql import Window

from pathlib import Path

# current repo path 
repo_path = Path().resolve().parent

spark = SparkSession.builder.appName('Spark Demo').master('local[*]').getOrCreate()




# Loading data

In [None]:
transactions_full = spark.read.json((repo_path / 'data' / 'processed' / 'transactions_full').as_posix())

                                                                                

# Building the target

There are several candidate targets and models to choose from. For the target, we could consider:

 - was a sent offer converted into a purchase? (binary)
 - was a sent and viewed offer converted into a purchase? (binary)
 - how much was transacted because of that offer (regression, aiming to increase the amount transacted per offer rather than just whether a transaction happened)
 - the uplift (incremental impact) of sending an offer versus not sending one.

For the modeling approach, we could consider:

- Multi-class: predict which offer the customer is most likely to convert
- Binary classifier per offer: a separate model for each offer
- Uplift model: a more robust method that identifies which customers actually need an offer in order to transact (the econml library helps here).
- Reinforcement learning / multi-armed bandits: balance testing new offers (exploration) against using the best-performing one (exploitation), for example with Thompson sampling.
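
To make the bandit option concrete, here is a minimal sketch of Beta-Bernoulli Thompson sampling in plain Python. It is not part of this notebook's pipeline: the function name `thompson_sampling`, the simulated conversion rates, and the number of rounds are all hypothetical and chosen only for illustration.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling over a set of offers.

    true_rates: hypothetical conversion probability of each offer (unknown
    to the algorithm, used only to simulate customer responses).
    Returns how many times each offer was sent.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms  # observed conversions per offer
    failures = [0] * n_arms   # observed non-conversions per offer
    pulls = [0] * n_arms      # how often each offer was chosen

    for _ in range(rounds):
        # Draw one plausible conversion rate per offer from its Beta posterior
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Send the offer whose sampled rate is highest
        arm = max(range(n_arms), key=samples.__getitem__)
        pulls[arm] += 1
        # Simulated response; a real system would observe the customer here
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling(true_rates=[0.05, 0.10, 0.50], rounds=2000)
print(pulls)
```

Offers with few observations have wide posteriors and still get sampled occasionally, which is how the method keeps exploring while mostly sending the best-performing offer.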

For simplicity, and to keep this case study short, I will use the following target: given that an offer was sent, was it converted? For modeling, I will:

 - Build a training dataset with customer features, offer features, and the customer's past transactions (only up to that offer, to avoid leakage)
 - At inference time, vary the offer features to estimate the chance that the customer converts each offer, iterating over all offers
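
The inference step described in the list can be sketched as follows. This is an illustration under stated assumptions, not the notebook's model: `rank_offers`, `toy_score`, and the offer ids are hypothetical, and `toy_score` merely stands in for a trained classifier's predicted probability. The feature names `discount_value`, `min_value`, and `num_past_offers` are borrowed from columns used elsewhere in this notebook.

```python
def rank_offers(customer, offers, score):
    """Score every offer for one customer and sort from best to worst.

    customer: dict of customer-level features
    offers:   dict mapping offer_id -> dict of offer-level features
    score:    callable taking one merged feature row, returning a probability
    """
    scored = []
    for offer_id, offer_features in offers.items():
        # Same customer features every time; only the offer features vary
        row = {**customer, **offer_features}
        scored.append((offer_id, score(row)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(row):
    """Hypothetical stand-in for a trained model's conversion probability."""
    raw = (
        0.1
        + 0.05 * row["discount_value"]
        - 0.01 * row["min_value"]
        + 0.02 * row["num_past_offers"]
    )
    return max(0.0, min(1.0, raw))


customer = {"age": 51, "num_past_offers": 2}
offers = {
    "offer_a": {"discount_value": 5, "min_value": 20},
    "offer_b": {"discount_value": 10, "min_value": 10},
}
ranking = rank_offers(customer, offers, toy_score)
print(ranking)
```

The first element of `ranking` is the offer with the highest estimated conversion chance for that customer.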

Because the data is an event log, we first need to build a dataset indicating whether a given offer sent to a given customer succeeded (target = 1) or not (target = 0). The same offer can be sent to the same customer several times at different times, so we must track when each offer was sent. The grain of the dataset is customer-offer-time.

In [16]:
df = (
    transactions_full.filter('event = "offer received"')
    .select("account_id", "offer_id", "time_since_test_start")
    .distinct()
    .orderBy("time_since_test_start","account_id")
)

df.show()

+--------------------+--------------------+---------------------+
|          account_id|            offer_id|time_since_test_start|
+--------------------+--------------------+---------------------+
|0011e0d4e6b944f99...|3f207df678b143eea...|                  0.0|
|0020c2b971eb4e918...|fafdcd668e3743c1b...|                  0.0|
|003d66b6608740288...|5a8bc65990b245e5a...|                  0.0|
|00426fe3ffde4c6b9...|5a8bc65990b245e5a...|                  0.0|
|005500a7188546ff8...|ae264e3637204a6fb...|                  0.0|
|0056df74b63b42988...|9b98b8c7a33c4b65b...|                  0.0|
|00715b6e55c3431cb...|ae264e3637204a6fb...|                  0.0|
|0082fd87c18f45f2b...|5a8bc65990b245e5a...|                  0.0|
|00840a2ca5d2408e9...|2906b810c7d441179...|                  0.0|
|00857b24b13f4fe0a...|4d5c57ea9a6940dd8...|                  0.0|
|008d7088107b46889...|f19421c1d4aa40978...|                  0.0|
|0091d2b6a5ea4defa...|4d5c57ea9a6940dd8...|                  0.0|
|0092a132e

Getting the target: which offers were completed

In [17]:
target1 = (
    transactions_full.filter('event = "offer completed"')
    .select("account_id", "offer_id", "time_since_test_start")
    .distinct()
    .withColumn("target", F.lit(1))
    .orderBy("time_since_test_start", "account_id")
)



In [30]:
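# For each received offer, look forward in time for the first completion of the
# same offer by the same account (exact-time matches are allowed).
# No `tolerance` is set, so a completion at any later time counts as a match.
dfpd = merge_asof(
    left=ps(df),
    right=ps(target1),
    on="time_since_test_start",
    by=["account_id", "offer_id"],
    direction="forward",
    allow_exact_matches=True
)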

In [46]:
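# Offers with no matching completion have a null target; set them to 0
dfpd['target'] = dfpd['target'].fillna(0)
# Back to a regular Spark DataFrame for the feature engineering below
df = dfpd.to_spark()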
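
To make the forward as-of matching easier to reason about, here is a plain-Python sketch of the same idea on a few made-up events. The helper `label_offers` and its `tolerance` argument are hypothetical; `tolerance` is only analogous to the `merge_asof` parameter of the same name and is not used in the cells above.

```python
from bisect import bisect_left


def label_offers(received, completed, tolerance=None):
    """Label each received offer with 1 if a completion follows it, else 0.

    received / completed: lists of (account_id, offer_id, time) tuples.
    tolerance: optional maximum gap between send time and completion time.
    """
    # Group completion times by (account, offer) and sort them
    completions = {}
    for account, offer, t in completed:
        completions.setdefault((account, offer), []).append(t)
    for times in completions.values():
        times.sort()

    labeled = []
    for account, offer, sent in received:
        times = completions.get((account, offer), [])
        i = bisect_left(times, sent)  # first completion at or after the send time
        hit = i < len(times) and (tolerance is None or times[i] - sent <= tolerance)
        labeled.append((account, offer, sent, 1 if hit else 0))
    return labeled


# Made-up events: the same offer is sent twice to c1 and completed once
received = [("c1", "o1", 17.0), ("c1", "o1", 24.0), ("c2", "o1", 0.0)]
completed = [("c1", "o1", 25.0)]

no_limit = label_offers(received, completed)
with_limit = label_offers(received, completed, tolerance=7.0)
print(no_limit)
print(with_limit)
```

Without a limit, the single completion at time 25.0 labels both sends of the offer as successful. With a cap of 7.0, only the send at 24.0 is credited. Whether a cap is appropriate here, and what value it should take (for example the offer's `duration`), is a modeling decision this sketch does not settle.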



In [59]:
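# Historical metrics per account, computed up to (but excluding) each received offer.
# rangeBetween operates on the values of time_since_test_start, so an upper bound
# of -1 keeps only events at least 1 time unit before the current row.
window = Window.partitionBy("account_id").orderBy("time_since_test_start")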

total_past_offers = (
    transactions_full.filter('event = "offer received"')
    .withColumn(
        "num_past_offers",
        F.count("offer_id").over(window.rangeBetween(Window.unboundedPreceding, -1)),
    )
    .select("account_id", "offer_id", "time_since_test_start", "num_past_offers")
)

total_past_offers.show()

+--------------------+--------------------+---------------------+---------------+
|          account_id|            offer_id|time_since_test_start|num_past_offers|
+--------------------+--------------------+---------------------+---------------+
|0020ccbbb6d84e358...|2298d6c36e964ae4a...|                  7.0|              0|
|0020ccbbb6d84e358...|f19421c1d4aa40978...|                 14.0|              1|
|0020ccbbb6d84e358...|5a8bc65990b245e5a...|                 17.0|              2|
|0020ccbbb6d84e358...|9b98b8c7a33c4b65b...|                 21.0|              3|
|00426fe3ffde4c6b9...|5a8bc65990b245e5a...|                  0.0|              0|
|00426fe3ffde4c6b9...|fafdcd668e3743c1b...|                  7.0|              1|
|00426fe3ffde4c6b9...|0b1e1539f2cc45b7b...|                 14.0|              2|
|00426fe3ffde4c6b9...|2906b810c7d441179...|                 17.0|              3|
|00426fe3ffde4c6b9...|2906b810c7d441179...|                 24.0|              4|
|004b041fbfe4485

In [61]:

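# `transactions` is the raw event table loaded in cell In [52] further down.
# Running count of "offer viewed" events per account before each event.
transactions.withColumn(
    "num_past_viewed",
    F.sum(F.when(F.col("event") == "offer viewed", 1).otherwise(0))
    .over(window.rangeBetween(Window.unboundedPreceding, -1))
).show()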



+--------------------+------+---------------+--------------------+------+---------------------+---------------+
|          account_id|amount|          event|            offer_id|reward|time_since_test_start|num_past_viewed|
+--------------------+------+---------------+--------------------+------+---------------------+---------------+
|0020ccbbb6d84e358...| 16.27|    transaction|                NULL|  NULL|                 1.75|           NULL|
|0020ccbbb6d84e358...|  NULL| offer received|2298d6c36e964ae4a...|  NULL|                  7.0|              0|
|0020ccbbb6d84e358...|  NULL|   offer viewed|2298d6c36e964ae4a...|  NULL|                  7.0|              0|
|0020ccbbb6d84e358...| 11.65|    transaction|                NULL|  NULL|                 9.25|              1|
|0020ccbbb6d84e358...|  NULL|offer completed|2298d6c36e964ae4a...|   3.0|                 9.25|              1|
|0020ccbbb6d84e358...| 13.86|    transaction|                NULL|  NULL|                 10.0|         
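
A note on the window frame used above: `rangeBetween(Window.unboundedPreceding, -1)` is defined on the values of the ordering column, so it keeps events whose time is at most the current time minus 1, whereas `rowsBetween` with the same bounds would keep every earlier row regardless of the time gap. The plain-Python sketch below contrasts the two on made-up event times. Both helper functions are hypothetical, and unlike Spark (which returns NULL when the frame is empty, as in the first row of the output above) they return 0 for an empty frame.

```python
def count_range_frame(times, flags, offset=1.0):
    """Value-based frame: sum flags of events with time <= current time - offset."""
    return [
        sum(f for t, f in zip(times, flags) if t <= now - offset)
        for now in times
    ]


def count_rows_frame(flags):
    """Position-based frame: sum flags of all rows strictly before the current one."""
    out, running = [], 0
    for f in flags:
        out.append(running)
        running += f
    return out


# Made-up events for one account, already sorted by time.
# flag = 1 marks an "offer viewed" event.
times = [7.0, 7.0, 7.5, 9.25]
flags = [0, 1, 0, 0]

by_range = count_range_frame(times, flags)
by_rows = count_rows_frame(flags)
print(by_range)
print(by_rows)
```

For the event at 7.5, the value-based frame misses the view at 7.0 because it happened less than 1 time unit earlier, while the position-based frame counts it. The position-based frame has its own caveat: rows that share a timestamp have no guaranteed order, so which of them counts as "before" is arbitrary unless a tiebreaker column is added to the ordering.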

                                                                                

In [None]:

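# Aggregate per-account transaction metrics and join them onto the offers.
# `reward` is only populated on "offer completed" rows, so those rows are kept
# alongside transactions; count and sum skip the NULL amounts they carry.
# NOTE: these totals cover the account's whole history, including events after
# the offer was received, so they are not yet leakage-free.
transactions_agg = (
    transactions_full
    .filter(F.col("event").isin("transaction", "offer completed"))
    .groupBy("account_id")
    .agg(
        F.count("amount").alias("total_transactions"),
        F.sum("amount").alias("total_amount"),
        F.sum("reward").alias("total_reward")
    )
)

df = df.join(transactions_agg, "account_id", "left")

# Fill nulls with 0 for accounts with no transactions
df = df.na.fill({
    "total_transactions": 0,
    "total_amount": 0.0,
    "total_reward": 0.0
})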
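
The target section calls for features built only from transactions that happened before each offer. One way to express that point-in-time restriction is sketched below in plain Python on made-up rows. The helper `past_transaction_features` is hypothetical and is not the notebook's implementation; in Spark the same idea would need a time condition in the join or a window over the event stream.

```python
def past_transaction_features(offers, transactions):
    """Aggregate, for each sent offer, only the transactions that preceded it.

    offers:       list of (account_id, offer_id, sent_time)
    transactions: list of (account_id, time, amount)
    """
    features = []
    for account, offer, sent in offers:
        # Keep only this account's transactions strictly before the send time
        past = [
            amount
            for acc, t, amount in transactions
            if acc == account and t < sent
        ]
        features.append({
            "account_id": account,
            "offer_id": offer,
            "time_since_test_start": sent,
            "total_transactions": len(past),
            "total_amount": sum(past),
        })
    return features


# Made-up events for a single account
offers = [("c1", "o1", 0.0), ("c1", "o2", 7.0)]
transactions = [("c1", 0.75, 6.03), ("c1", 1.5, 15.47), ("c1", 9.25, 11.65)]

features = past_transaction_features(offers, transactions)
for row in features:
    print(row)
```

The offer sent at time 0.0 sees no past transactions, and the one sent at 7.0 sees only the two transactions before it. The transaction at 9.25 is excluded from both, which is the property a per-account total over the whole period does not have.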







In [52]:
transactions = spark.read.json((repo_path / 'data' / 'processed' / 'transactions').as_posix())

In [53]:
transactions.filter('account_id = "00a794f62b9a48beb58f8f6c02c2f1a6"').show(100,False)

+--------------------------------+------+---------------+--------------------------------+------+---------------------+
|account_id                      |amount|event          |offer_id                        |reward|time_since_test_start|
+--------------------------------+------+---------------+--------------------------------+------+---------------------+
|00a794f62b9a48beb58f8f6c02c2f1a6|NULL  |offer received |3f207df678b143eea3cee63160fa8bed|NULL  |0.0                  |
|00a794f62b9a48beb58f8f6c02c2f1a6|6.03  |transaction    |NULL                            |NULL  |0.75                 |
|00a794f62b9a48beb58f8f6c02c2f1a6|NULL  |offer viewed   |3f207df678b143eea3cee63160fa8bed|NULL  |1.0                  |
|00a794f62b9a48beb58f8f6c02c2f1a6|15.47 |transaction    |NULL                            |NULL  |1.5                  |
|00a794f62b9a48beb58f8f6c02c2f1a6|15.88 |transaction    |NULL                            |NULL  |2.25                 |
|00a794f62b9a48beb58f8f6c02c2f1a6|17.15 

In [38]:
transactions_full.filter('account_id = "00a794f62b9a48beb58f8f6c02c2f1a6"').show(100,False)

+--------------------------------+---+------+----------------------------+-----------------+--------------+--------+---------------+------+---------+--------------------------------+-------------+-------------+------+---------------------+
|account_id                      |age|amount|channels                    |credit_card_limit|discount_value|duration|event          |gender|min_value|offer_id                        |offer_type   |registered_on|reward|time_since_test_start|
+--------------------------------+---+------+----------------------------+-----------------+--------------+--------+---------------+------+---------+--------------------------------+-------------+-------------+------+---------------------+
|00a794f62b9a48beb58f8f6c02c2f1a6|88 |NULL  |[web, email, mobile]        |54000.0          |0             |4.0     |offer received |F     |0        |3f207df678b143eea3cee63160fa8bed|informational|2015-10-24   |NULL  |0.0                  |
|00a794f62b9a48beb58f8f6c02c2f1a6|88 |6.

In [12]:
transactions_full.show(10)

+--------------------+----+------+--------------------+-----------------+--------------+--------+--------------+------+---------+--------------------+-------------+-------------+------+---------------------+
|          account_id| age|amount|            channels|credit_card_limit|discount_value|duration|         event|gender|min_value|            offer_id|   offer_type|registered_on|reward|time_since_test_start|
+--------------------+----+------+--------------------+-----------------+--------------+--------+--------------+------+---------+--------------------+-------------+-------------+------+---------------------+
|78afa995795e4d85b...|  75|  NULL|[web, email, mobile]|         100000.0|             5|     7.0|offer received|     F|        5|9b98b8c7a33c4b65b...|         bogo|   2017-05-09|  NULL|                  0.0|
|a03223e636434f42a...|NULL|  NULL|        [web, email]|             NULL|             5|    10.0|offer received|  Nulo|       20|0b1e1539f2cc45b7b...|     discount|   2

In [9]:
transactions_full.groupBy('account_id').count().orderBy('count', ascending=True).show(120,False)

+--------------------------------+-----+
|account_id                      |count|
+--------------------------------+-----+
|da7a7c0dcfcb41a8acc7864a53cf60fb|1    |
|3045af4e98794a04a5542d3eac939b1f|2    |
|df9fc9a86ca84ef5aedde8925d5838ba|2    |
|1bfe13d2453c4185a6486c6817e0d568|2    |
|3a4e53046c544134bb1e7782248631d1|2    |
|afd41b230f924f9ca8f5ed6249616114|2    |
|7ecfc592171f4844bdc05bdbb48d3847|2    |
|cae5e211053f4121a389a7da4d631f7f|2    |
|fccc9279ba56411f80ffe8ce7e0935cd|2    |
|e63e42480aae4ede9f07cac49c8c3f78|2    |
|912b9f623b9e4b4eb99b6dc919f09a93|2    |
|22617705eec442e0b7b43e5c5f56fb17|2    |
|76341c0cc6684b3eb23661e195dfc9a3|3    |
|83abd8407034461782483fb32d3d5f5c|3    |
|bc0c484263b94b0896f20c5e4fdf3585|3    |
|f67a6524092d48a788a415c453bd2e00|3    |
|3a4874d8f0ef42b9a1b72294902afea9|3    |
|d727102ac242449ab15f1bd1af28e6ff|3    |
|af63cf0ed6ad4c458a03cb321927b463|3    |
|19a0510b9ce24b9da44618f7161ae72d|3    |
|08e7c2a166ff44e4a009a18b5e8e4b81|3    |
|2e9660f6e83b49b

                                                                                

In [10]:
transactions_full.filter(F.col('account_id') == 'afd41b230f924f9ca8f5ed6249616114').show()

+--------------------+---+------+--------------------+-----------------+--------------+--------+--------------+------+---------+--------------------+----------+-------------+------+---------------------+
|          account_id|age|amount|            channels|credit_card_limit|discount_value|duration|         event|gender|min_value|            offer_id|offer_type|registered_on|reward|time_since_test_start|
+--------------------+---+------+--------------------+-----------------+--------------+--------+--------------+------+---------+--------------------+----------+-------------+------+---------------------+
|afd41b230f924f9ca...| 51|  NULL|        [web, email]|          78000.0|             5|    10.0|offer received|     M|       20|0b1e1539f2cc45b7b...|  discount|   2017-01-03|  NULL|                 17.0|
|afd41b230f924f9ca...| 51|  NULL|[email, mobile, s...|          78000.0|            10|     7.0|offer received|     M|       10|ae264e3637204a6fb...|      bogo|   2017-01-03|  NULL|   