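# Starbucks Capstone Project

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the app. An offer can be merely an advertisement for a drink, or an actual offer such as a discount or BOGO (buy one get one free). Some customers might not receive any offer for several weeks in a row.

Not all customers receive the same offer, and that is the challenge of this data set.

Your task is to combine transaction, demographic, and offer data to determine which groups of customers are influenced by which kind of offer. This data set is a simplified version of the real Starbucks app data: the underlying simulator has only one product, whereas Starbucks actually sells dozens.

Every offer has a validity period. For example, a BOGO offer might be valid for only 5 days. You will see that even informational offers have a validity period, although they merely advertise a drink. For example, if an informational offer is valid for 7 days, you can assume the customer may be influenced by it during those 7 days.

The data set also contains transactions paid through the app, including the time of purchase and the amount spent. It also records the offers each customer received (which ones and how many) and when each offer was viewed. A record is also generated when a customer makes a purchase.

Keep in mind as well that a customer may make a purchase without having received or viewed an offer.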

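### Example

For example, a customer receives a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from the day of receipt. If the customer's purchases during the validity period add up to at least 10 dollars, the customer has completed the offer.

However, there are a few things to watch out for in this data set. Offers take effect automatically: once a customer has received an offer and meets its conditions, the reward is applied even if the customer never viewed the offer. For example, a customer receives the "spend 10 dollars, get 2 dollars off" offer but never opens it during the 10-day validity period, and spends 15 dollars in those 10 days. The data set records the offer as completed, yet the customer was not influenced by it, because the customer never knew it existed.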
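The difference between a completed offer and an offer that actually influenced the customer can be made concrete. The following is a minimal sketch, assuming two hypothetical per-offer flags, `viewed` and `completed`; the function name and the labels are illustrative and do not come from the data set.

```python
def classify_offer(viewed: bool, completed: bool) -> str:
    """Label one received offer using two hypothetical flags."""
    if completed and viewed:
        return "influenced"        # customer saw the offer and met its requirement
    if completed and not viewed:
        return "auto-completed"    # requirement met without the customer knowing about the offer
    if viewed:
        return "viewed only"       # seen, but the spending threshold was not reached
    return "ignored"

# The customer from the example: never opened the offer, spent 15 dollars in 10 days.
label = classify_offer(viewed=False, completed=True)
print(label)
```

Only the first label would count as a response in an analysis of offer effectiveness. A stricter rule might also require the view to precede the completion.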

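### Cleaning

Cleaning the data is both important and tricky.

You also need to consider that some groups of customers will make purchases even if they receive no offer. From a business perspective, if a customer is going to spend 10 dollars regardless of any offer, you would not want to send that customer a "spend 10 dollars, get 2 dollars off" offer. So you may want to analyze what a given group of customers buys when it receives no offers at all.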
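One way to approximate spending without offers is to sum the transactions that fall outside every offer window of the same customer. The sketch below uses toy data; the column names `receive_time`, `duration`, and `transaction_time` are assumptions chosen for this illustration, with times in hours and durations in days.

```python
import pandas as pd

# Toy data; column names are assumptions for this sketch.
offers = pd.DataFrame({
    "person_id": ["a", "a"],
    "receive_time": [0, 240],      # hours since the start of the test
    "duration": [5, 3],            # days
})
transactions = pd.DataFrame({
    "person_id": ["a", "a", "a", "b"],
    "transaction_time": [24, 150, 250, 30],
    "amount": [10.0, 4.0, 6.0, 8.0],
})

# Pair every transaction with every offer window of the same customer.
pairs = transactions.reset_index().merge(offers, on="person_id", how="left")
end_time = pairs["receive_time"] + pairs["duration"] * 24
pairs["in_window"] = pairs["transaction_time"].between(pairs["receive_time"], end_time)

# A transaction is "baseline" if it falls inside no offer window at all.
covered = pairs.groupby("index")["in_window"].any()
baseline = transactions[~covered.reindex(transactions.index, fill_value=False)]
baseline_spend = baseline.groupby("person_id")["amount"].sum()
print(baseline_spend)
```

Customers who never received an offer (customer `b` here) contribute all of their transactions to the baseline.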

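### Final Advice

Since this is a capstone project, you are free to analyze the data in any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether a customer will respond to an offer. Or you do not need to build a machine learning model at all. You could develop a set of heuristics that determine which offer to send to each customer (for example, if 75 percent of 35-year-old women respond to offer A but only 40 percent respond to offer B, send them offer A).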
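The heuristic approach can be sketched in a few lines. The example below is an illustration on made-up data, not a result from the data set; the group labels, offer names, and the `responded` column are all hypothetical.

```python
import pandas as pd

# Hypothetical response log: one row per (customer group, offer) exposure.
log = pd.DataFrame({
    "group": ["F_35", "F_35", "F_35", "F_35", "M_50", "M_50", "M_50", "M_50"],
    "offer": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Response rate of each group to each offer.
rates = log.groupby(["group", "offer"])["responded"].mean().unstack("offer")

# Heuristic: send each group the offer with the highest observed response rate.
best_offer = rates.idxmax(axis=1)
print(rates)
print(best_offer)
```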


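# Data Sets

The data is contained in three files:

* portfolio.json – offer ids and metadata about each offer (duration, type, etc.)
* profile.json – demographic data for each customer
* transcript.json – records of transactions, offers received, offers viewed, and offers completed

The type and meaning of each variable in the files: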

**portfolio.json**
* id (string) – offer id
* offer_type (string) – type of offer, i.e. BOGO, discount, or informational
* difficulty (int) – minimum spend required to complete the offer
* reward (int) – reward given for completing the offer
* duration (int) – how long the offer is valid, in days
* channels (list of strings)

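**profile.json**
* age (int) – age of the customer
* became_member_on (int) – date when the customer first registered in the app
* gender (str) – gender of the customer (note that besides M for male and F for female, some entries contain O for other)
* id (str) – customer id
* income (float) – customer's income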

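**transcript.json**
* event (str) – description of the record (e.g. transaction, offer received, offer viewed)
* person (str) – customer id
* time (int) – time in hours since the start of the test; the data begins at time t=0
* value (dict of strings) – either an offer id or a transaction amount, depending on the record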

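**Note:** If you are using the Workspace, open a terminal before reading the files and run `conda update pandas`. The pandas version in the Workspace cannot read transcript.json correctly, so it needs to be updated to the latest version. You can open a terminal by clicking the orange jupyter icon in the upper left corner of the notebook.

The two screenshots below show how to open the terminal and how to install the update. First open the terminal:
<img src="pic1.png"/>

Then run the command above:
<img src="pic2.png"/>

Finally, return to this notebook (again by clicking the orange jupyter icon) and rerun the cell below; it should no longer raise an error.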

In [1]:
import pandas as pd
import numpy as np
import math
import json
%matplotlib inline

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

In [2]:
portfolio.head()

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5


In [3]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [4]:
profile.shape

(17000, 5)
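In the `profile.head()` output, each displayed row with age 118 also has no gender and no income, which suggests that 118 is a placeholder for a missing age. The sketch below cleans a small sample under that assumption and also parses `became_member_on`, which is stored as an integer such as 20170212. The names `profile_sample` and `cleaned` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the profile.head() output above.
profile_sample = pd.DataFrame({
    "age": [118, 55, 118],
    "became_member_on": [20170212, 20170715, 20180712],
    "gender": [None, "F", None],
    "income": [np.nan, 112000.0, np.nan],
})

# Assumption: age 118 is a placeholder for "unknown", since those rows have no gender or income.
cleaned = profile_sample.copy()
cleaned["age"] = cleaned["age"].where(cleaned["age"] != 118)   # 118 becomes NaN

# became_member_on is stored as an integer such as 20170212; parse it as a date.
cleaned["became_member_on"] = pd.to_datetime(
    cleaned["became_member_on"].astype(str), format="%Y%m%d"
)
print(cleaned)
```

Whether to drop these rows or keep them as a separate "unknown" group depends on the analysis. It is worth checking on the full table that age 118 always coincides with missing gender and income.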

### Process the value field in transcript and split it into multiple columns

In [2]:
# Split the dicts in value into separate columns (one column per dict key)
transcript_value_s = transcript['value'].apply(pd.Series)
transcript_value_s.head(6)

Unnamed: 0,offer id,amount,offer_id,reward
0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,,,
2,2906b810c7d4411798c6938adc9daaa5,,,
3,fafdcd668e3743c1bb461111dcafc2a4,,,
4,4d5c57ea9a6940dd891ad53e9dbe8da0,,,
5,f19421c1d4aa40978ebb69ca19b0e20d,,,
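The `apply(pd.Series)` call builds one Series per row, which can be slow on a table with several hundred thousand rows. A possible alternative, sketched here on a toy stand-in for `transcript['value']`, is to build the DataFrame directly from the list of dicts. The variable names `values` and `expanded` are introduced for this example only.

```python
import pandas as pd

# Toy stand-in for transcript['value'], using the three key shapes seen in the data.
values = pd.Series([
    {"offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
    {"amount": 0.83},
    {"offer_id": "2906b810c7d4411798c6938adc9daaa5", "reward": 2},
])

# Building a DataFrame from a list of dicts avoids a Python-level call per row.
expanded = pd.DataFrame(values.tolist(), index=values.index)

# Fold the two spellings of the offer id into one column.
expanded["offer_id"] = expanded["offer_id"].fillna(expanded["offer id"])
expanded = expanded.drop(columns="offer id")
print(expanded)
```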


In [3]:
# The `offer id` and `offer_id` columns hold the same information and need to be merged:
# where offer_id is missing, fill it with the value from `offer id`.
transcript_value_s['offer_id'] = transcript_value_s['offer_id'].fillna(transcript_value_s['offer id'])
transcript_value_s.head()


Unnamed: 0,offer id,amount,offer_id,reward
0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,2906b810c7d4411798c6938adc9daaa5,,2906b810c7d4411798c6938adc9daaa5,
3,fafdcd668e3743c1bb461111dcafc2a4,,fafdcd668e3743c1bb461111dcafc2a4,
4,4d5c57ea9a6940dd891ad53e9dbe8da0,,4d5c57ea9a6940dd891ad53e9dbe8da0,


In [4]:
# Drop the redundant `offer id` column
transcript_value = transcript_value_s.drop(['offer id'], axis=1)
transcript_value.head(3)

Unnamed: 0,amount,offer_id,reward
0,,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,,2906b810c7d4411798c6938adc9daaa5,


In [6]:
transcript.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2,offer received,e2127556f4f64592b11af22de27a7932,0,{'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}


In [5]:
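# Append the processed transcript_value columns to the transcript table.
# join_axes was removed in pandas 1.0; both frames share the same index, so a plain concat is equivalent.
transcript_new = pd.concat([transcript, transcript_value], axis=1)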
transcript_new.head()

Unnamed: 0,event,person,time,value,amount,offer_id,reward
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,offer received,e2127556f4f64592b11af22de27a7932,0,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},,2906b810c7d4411798c6938adc9daaa5,
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},,fafdcd668e3743c1bb461111dcafc2a4,
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},,4d5c57ea9a6940dd891ad53e9dbe8da0,


In [6]:
# Check which event types occur:
# offer received
# offer viewed
# transaction
# offer completed
transcript_new.drop_duplicates(['event'])

Unnamed: 0,event,person,time,value,amount,offer_id,reward
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},,9b98b8c7a33c4b65b9aebfe6a799e6d9,
12650,offer viewed,389bc3fa690240e798340f5a15918d5c,0,{'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'},,f19421c1d4aa40978ebb69ca19b0e20d,
12654,transaction,02c083884c7d45b39cc68e1314fec56c,0,{'amount': 0.8300000000000001},0.83,,
12658,offer completed,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,{'offer_id': '2906b810c7d4411798c6938adc9daaa5...,,2906b810c7d4411798c6938adc9daaa5,2.0


In [12]:
transcript_new.shape

(306534, 7)

In [13]:
transcript.shape

(306534, 4)

In [18]:
portfolio

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2


In [7]:
# Rename the id column to make the later merges easier
portfolio.rename(columns={'id':'offer_id'},inplace=True) 
portfolio

Unnamed: 0,channels,difficulty,duration,offer_id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2


In [8]:
# Rename the id column to make the later merges easier
profile.rename(columns={'id':'person_id'},inplace=True) 
profile.head()

Unnamed: 0,age,became_member_on,gender,person_id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


### 1. Compute each customer's score (reaction) for each offer


In [29]:
# 1.1.1 Extract the offer-received events
offer_received = transcript_new[transcript_new['event'] == 'offer received'].loc[:,['person','offer_id','time']]
offer_received.rename(columns={'time':'receive_time','person':'person_id'},inplace=True) 
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time
0,78afa995795e4d85b5d9ceeca43f5fef,9b98b8c7a33c4b65b9aebfe6a799e6d9,0
1,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0
2,e2127556f4f64592b11af22de27a7932,2906b810c7d4411798c6938adc9daaa5,0
3,8ec6ce2a7e7949b1bf142def7d0e0586,fafdcd668e3743c1bb461111dcafc2a4,0
4,68617ca6246f4fbc85e91a2a49552598,4d5c57ea9a6940dd891ad53e9dbe8da0,0


In [30]:
offer_received.shape

(76277, 3)

In [31]:
# 1.1.2 Join the portfolio table to get each offer's validity period
offer_received = pd.merge(offer_received, portfolio, how='left', on=['offer_id'])
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,channels,difficulty,duration,offer_type,reward
0,78afa995795e4d85b5d9ceeca43f5fef,9b98b8c7a33c4b65b9aebfe6a799e6d9,0,"[web, email, mobile]",5,7,bogo,5
1,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,"[web, email]",20,10,discount,5
2,e2127556f4f64592b11af22de27a7932,2906b810c7d4411798c6938adc9daaa5,0,"[web, email, mobile]",10,7,discount,2
3,8ec6ce2a7e7949b1bf142def7d0e0586,fafdcd668e3743c1bb461111dcafc2a4,0,"[web, email, mobile, social]",10,10,discount,2
4,68617ca6246f4fbc85e91a2a49552598,4d5c57ea9a6940dd891ad53e9dbe8da0,0,"[web, email, mobile, social]",10,5,bogo,10


In [32]:
# 1.2.1 Extract the offer-viewed events
offer_viewed = transcript_new[transcript_new['event'] == 'offer viewed'].loc[:,['person','offer_id','time']]
offer_viewed.rename(columns={'time':'view_time','person':'person_id'},inplace=True) 
offer_viewed.head()

Unnamed: 0,person_id,offer_id,view_time
12650,389bc3fa690240e798340f5a15918d5c,f19421c1d4aa40978ebb69ca19b0e20d,0
12651,d1ede868e29245ea91818a903fec04c6,5a8bc65990b245e5a138643cd4eb9837,0
12652,102e9454054946fda62242d2e176fdce,4d5c57ea9a6940dd891ad53e9dbe8da0,0
12653,02c083884c7d45b39cc68e1314fec56c,ae264e3637204a6fb9bb56bc8210ddfd,0
12655,be8a5d1981a2458d90b255ddc7e0d174,5a8bc65990b245e5a138643cd4eb9837,0


In [33]:
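# 1.2.2 Left-join the received offers with the view events to see which received offers were viewed.
#       (A view without a matching receipt should not occur in theory; the left join drops such rows as dirty data.)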
offer_received = pd.merge(offer_received, offer_viewed, how='left', on=['person_id','offer_id'])
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,channels,difficulty,duration,offer_type,reward,view_time
0,78afa995795e4d85b5d9ceeca43f5fef,9b98b8c7a33c4b65b9aebfe6a799e6d9,0,"[web, email, mobile]",5,7,bogo,5,6.0
1,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,"[web, email]",20,10,discount,5,6.0
2,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,"[web, email]",20,10,discount,5,624.0
3,e2127556f4f64592b11af22de27a7932,2906b810c7d4411798c6938adc9daaa5,0,"[web, email, mobile]",10,7,discount,2,18.0
4,8ec6ce2a7e7949b1bf142def7d0e0586,fafdcd668e3743c1bb461111dcafc2a4,0,"[web, email, mobile, social]",10,10,discount,2,12.0


In [34]:
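# 1.5 Helper that checks whether an action happened within an offer's validity period

def action_in_valiperd(receive_time, duration, action_time):
    # duration is in days and the times are in hours, hence duration*24.
    # A missing action_time (NaN) fails both comparisons and returns 0.
    if action_time >= receive_time and action_time <= receive_time+duration*24:
        return 1
    else:
        return 0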
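The row-by-row check can also be written on whole columns, which avoids calling a Python function once per row. This is a minimal sketch under the same convention (times in hours, duration in days); the name `action_in_window` is introduced here and is not part of the notebook.

```python
import numpy as np
import pandas as pd

def action_in_window(receive_time, duration_days, action_time):
    """Vectorized check: 1 where the action falls inside the offer window, else 0.

    A missing action_time (NaN) compares as False and therefore yields 0.
    """
    end_time = receive_time + duration_days * 24   # duration is in days, time in hours
    inside = (action_time >= receive_time) & (action_time <= end_time)
    return inside.astype(int)

# Three rows taken from the merged table shown above, plus one row without a view.
df = pd.DataFrame({
    "receive_time": [0, 0, 0, 576],
    "duration": [7, 10, 10, 7],
    "view_time": [6.0, 6.0, 624.0, np.nan],
})
df["is_viewed"] = action_in_window(df["receive_time"], df["duration"], df["view_time"])
print(df)
```

The view at hour 624 lies after the 240-hour window of the offer received at hour 0, so that row is flagged 0.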


In [35]:
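# 1.2.3 Add a column flagging whether a view happened within the validity period
#       (note: one received offer can be joined to several view events; each view is checked separately here)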
offer_received['is_viewed']=offer_received.apply(lambda offer_received: 
                                                 action_in_valiperd(offer_received['receive_time'], 
                                                                    offer_received['duration'], 
                                                                    offer_received['view_time']),axis=1)
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,channels,difficulty,duration,offer_type,reward,view_time,is_viewed
0,78afa995795e4d85b5d9ceeca43f5fef,9b98b8c7a33c4b65b9aebfe6a799e6d9,0,"[web, email, mobile]",5,7,bogo,5,6.0,1
1,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,"[web, email]",20,10,discount,5,6.0,1
2,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,"[web, email]",20,10,discount,5,624.0,0
3,e2127556f4f64592b11af22de27a7932,2906b810c7d4411798c6938adc9daaa5,0,"[web, email, mobile]",10,7,discount,2,18.0,1
4,8ec6ce2a7e7949b1bf142def7d0e0586,fafdcd668e3743c1bb461111dcafc2a4,0,"[web, email, mobile, social]",10,10,discount,2,12.0,1


In [36]:
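# 1.2.4 For each received offer, determine whether it was viewed within the validity period.
#       (Note: a customer may view an offer several times; this is collapsed to a single 0/1 flag.
#       It can be changed later if the number of views is needed.)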
offer_received = offer_received.groupby(['person_id','offer_id','receive_time','duration'],as_index=False)['is_viewed'].max()
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1


In [37]:
offer_received.query('person_id == "a03223e636434f42ac4c3df47e8bac43" and offer_id == "0b1e1539f2cc45b7b9fa7c272da2e1d7"')


Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed
48141,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,0,10,1
48142,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,504,10,1
48143,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,576,10,1


In [38]:
offer_received.shape

(76277, 5)

In [39]:
# 1.3.1 Extract the offer-completed (use) events, handled the same way as the view events
offer_completed = transcript_new[transcript_new['event'] == 'offer completed'].loc[:,['person','offer_id','time','reward']]
offer_completed.rename(columns={'time':'use_time','person':'person_id'},inplace=True) 
offer_completed.head()

Unnamed: 0,person_id,offer_id,use_time,reward
12658,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,2906b810c7d4411798c6938adc9daaa5,0,2.0
12672,fe97aa22dd3e48c8b143116a8403dd52,fafdcd668e3743c1bb461111dcafc2a4,0,2.0
12679,629fc02d56414d91bca360decdfa9288,9b98b8c7a33c4b65b9aebfe6a799e6d9,0,5.0
12692,676506bad68e4161b9bbaffeb039626b,ae264e3637204a6fb9bb56bc8210ddfd,0,10.0
12697,8f7dd3b2afe14c078eb4f6e6fe4ba97d,4d5c57ea9a6940dd891ad53e9dbe8da0,0,10.0


In [40]:
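# 1.3.2 Left-join the received offers with the completion (use) events to see which received offers were used.
#       (A completion without a matching receipt should not occur in theory; the left join drops such rows as dirty data.)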
offer_received = pd.merge(offer_received, offer_completed, how='left', on=['person_id','offer_id'])
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,use_time,reward
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,576.0,2.0
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,,
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,,
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,414.0,5.0
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,528.0,2.0


In [41]:
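# 1.3.3 Add a column flagging whether a use happened within the validity period
#       (note: one received offer can be joined to several use events, not necessarily in the same validity period)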
offer_received['is_used']=offer_received.apply(lambda offer_received: 
                                                 action_in_valiperd(offer_received['receive_time'], 
                                                                    offer_received['duration'], 
                                                                    offer_received['use_time']),axis=1)
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,use_time,reward,is_used
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,576.0,2.0,1
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,,,0
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,,,0
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,414.0,5.0,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,528.0,2.0,1


In [42]:
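# 1.3.4 For each received offer, determine whether it was used within the validity period.
#      (Note: in principle a received offer can be used only once within its validity period, but the same offer
#      may be sent twice with overlapping validity periods, so each receipt is joined to two use records,
#      possibly with different reward amounts. The data gives no way to tell which use belongs to which receipt,
#      so the maximum is taken in all cases.)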
offer_received = offer_received.groupby(['person_id','offer_id','receive_time','duration','is_viewed'],as_index=False)['is_used'].max()
offer_received.head()

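#offer_received = offer_received.groupby(['person_id','offer_id','receive_time','duration','is_viewed'],as_index=False)[['is_used','reward']].max()
# Note: max has a pitfall here. It takes the maximum of each column independently rather than selecting the row
#      that holds the maximum, so it is only accurate when a single column is aggregated.
#      With two or more use records, the resulting reward may not be the amount of the offer that was actually used.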

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,0
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,0
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,1
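When the same offer is sent more than once, one possible heuristic is to attribute each completion to the most recent receipt at or before it. This is not the approach used in this notebook, and with overlapping validity periods it can still pick the wrong receipt. The sketch below shows the idea with `pandas.merge_asof` on toy data; the frames `received`, `completed`, and `matched` are illustrative.

```python
import pandas as pd

# Toy data: the same offer is received twice and completed twice.
received = pd.DataFrame({
    "person_id": ["p1", "p1"],
    "offer_id": ["o1", "o1"],
    "receive_time": [0, 504],
    "duration": [10, 10],
})
completed = pd.DataFrame({
    "person_id": ["p1", "p1"],
    "offer_id": ["o1", "o1"],
    "use_time": [120, 528],
})

# merge_asof requires both frames to be sorted by their time keys.
completed = completed.sort_values("use_time")
received = received.sort_values("receive_time")

# Attribute each completion to the latest receipt at or before it.
matched = pd.merge_asof(
    completed, received,
    left_on="use_time", right_on="receive_time",
    by=["person_id", "offer_id"], direction="backward",
)
# Keep only completions that fall inside the matched offer window.
matched["in_window"] = matched["use_time"] <= matched["receive_time"] + matched["duration"] * 24
print(matched)
```

Because each completion is matched to exactly one receipt, the reward of that specific completion can be carried along without the column-wise `max` pitfall. Note that `merge_asof` needs the two time columns to have the same dtype.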


In [43]:
offer_received.shape

(76277, 6)

In [24]:
# 1.4.1 Extract the transaction events (all transactions of every customer)
transaction_log = transcript_new[transcript_new['event'] == 'transaction'].loc[:,['person','time','amount']]
transaction_log.rename(columns={'time':'transaction_time','person':'person_id'},inplace=True) 
transaction_log.head()

Unnamed: 0,person_id,transaction_time,amount
12654,02c083884c7d45b39cc68e1314fec56c,0,0.83
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56
12659,54890f68699049c2a04d415abc25e717,0,13.23
12670,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51
12671,fe97aa22dd3e48c8b143116a8403dd52,0,18.97


In [50]:
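# 1.4.2 Left-join the received offers with the transactions to check whether a transaction occurred
#       within the offer's validity period. (Only transactions of customers who received an offer are considered.)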
offer_received = pd.merge(offer_received, transaction_log, how='left', on=['person_id'])
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used,transaction_time,amount
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,228.0,22.16
1,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,414.0,8.57
2,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,528.0,14.11
3,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,552.0,13.56
4,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,576.0,10.27


In [51]:
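# 1.4.3 Add a column flagging whether a transaction happened within the validity period
#       (note: one received offer is joined to all transactions of that customer, most of them outside the validity period)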
offer_received['has_trans']=offer_received.apply(lambda offer_received: 
                                                 action_in_valiperd(offer_received['receive_time'], 
                                                                    offer_received['duration'], 
                                                                    offer_received['transaction_time']),axis=1)
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used,transaction_time,amount,has_trans
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,228.0,22.16,0
1,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,414.0,8.57,0
2,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,528.0,14.11,0
3,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,552.0,13.56,0
4,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,576.0,10.27,1


In [52]:
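# 1.4.4 For each received offer, determine whether a transaction occurred within the validity period.
#       (Note: a transaction does not imply that the offer was used, but a used offer always implies a transaction.)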
offer_received = offer_received.groupby(['person_id','offer_id','receive_time','duration','is_viewed','is_used'],as_index=False)['has_trans'].max()
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used,has_trans
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,1
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,0,1
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,0,1
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,1,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,1,1


In [53]:
offer_received.shape

(76277, 7)

In [54]:
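# Data check: look for offers that were used without any transaction (should not happen in theory)
offer_received.query('is_used == 1 and has_trans == 0')
# No such rows exist, which suggests the data is reasonably consistent.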

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used,has_trans


In [55]:
# 1.5.1 Join the portfolio table again to get the offer type (needed for the checks below)
offer_received = pd.merge(offer_received, portfolio, how='left', on=['offer_id'])
offer_received = offer_received.drop(['duration_y'], axis=1)
offer_received.rename(columns={'duration_x':'duration'},inplace=True) 
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration_x,is_viewed,is_used,has_trans,channels,difficulty,duration_y,offer_type,reward
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,1,"[web, email, mobile]",10,7,discount,2
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,0,1,"[web, email, mobile]",0,4,informational,0
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,0,1,"[email, mobile, social]",0,3,informational,0
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,1,1,"[web, email, mobile, social]",5,5,bogo,5
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,1,1,"[web, email, mobile, social]",10,10,discount,2


In [59]:
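### Helper that decides whether the customer's behavior was influenced by the offer

def offer_has_effect(is_viewed, is_used, has_trans, offer_type):
    if is_viewed == 0:
        return 0   # the customer never viewed the offer: no effect
    if has_trans == 0:
        return 0   # viewed but no transaction: no effect (a used offer without a transaction should not exist)
    if is_used == 1:
        return 1   # viewed, transaction made, offer used: effect
    if offer_type == 'informational':
        return 1   # viewed, transaction made, not used, but informational (cannot be used): effect
    return 0       # not informational: a transaction without using the offer counts as no effect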
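The same rule reduces to one boolean expression: an offer has an effect when it was viewed, a transaction occurred, and the offer was either used or is informational. A minimal column-wise sketch follows; `offer_effect_vectorized` and `sample` are names introduced here, and the 0/1 flag columns are assumed to be named as in the notebook.

```python
import pandas as pd

def offer_effect_vectorized(df):
    """Column-wise version of the same rule; returns a Series of 0/1."""
    viewed_and_bought = (df["is_viewed"] == 1) & (df["has_trans"] == 1)
    used = df["is_used"] == 1
    informational = df["offer_type"] == "informational"
    return (viewed_and_bought & (used | informational)).astype(int)

# First four rows of the offer_received table shown above.
sample = pd.DataFrame({
    "is_viewed": [0, 1, 1, 1],
    "is_used": [1, 0, 0, 1],
    "has_trans": [1, 1, 1, 1],
    "offer_type": ["discount", "informational", "informational", "bogo"],
})
sample["is_effect"] = offer_effect_vectorized(sample)
print(sample)
```

The first row is the auto-completion case: the discount was used but never viewed, so it counts as no effect.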

In [61]:
# 1.5.2 Add a column with the offer's effect on the customer
offer_received['is_effect']=offer_received.apply(lambda offer_received: 
                                                 offer_has_effect(offer_received['is_viewed'], 
                                                                    offer_received['is_used'], 
                                                                    offer_received['has_trans'],
                                                                    offer_received['offer_type']),axis=1)
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used,has_trans,channels,difficulty,offer_type,reward,is_effect
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,1,"[web, email, mobile]",10,discount,2,0
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,0,1,"[web, email, mobile]",0,informational,0,1
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,0,1,"[email, mobile, social]",0,informational,0,1
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,1,1,"[web, email, mobile, social]",5,bogo,5,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,1,1,"[web, email, mobile, social]",10,discount,2,1


In [62]:
# 1.5.3 Add an is_receive column set to 1 for every row, since every row is a received offer. This simplifies the groupby below.
offer_received['is_receive'] = 1
offer_received.head()

Unnamed: 0,person_id,offer_id,receive_time,duration,is_viewed,is_used,has_trans,channels,difficulty,offer_type,reward,is_effect,is_receive
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,576,7,0,1,1,"[web, email, mobile]",10,discount,2,0,1
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,336,4,1,0,1,"[web, email, mobile]",0,informational,0,1,1
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,168,3,1,0,1,"[email, mobile, social]",0,informational,0,1,1
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,408,5,1,1,1,"[web, email, mobile, social]",5,bogo,5,1,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,504,10,1,1,1,"[web, email, mobile, social]",10,discount,2,1,1


In [None]:
#######------------ The first wide table is now complete. ---------------------

### Analysis: compute each customer's reaction to each offer
If a customer received offer1 three times and reacted to it twice, that customer's reaction score for this offer is 2/3.

In [63]:
# 1.6.1 Compute each customer's reaction to each offer
user_offer_reaction = offer_received.groupby(['person_id','offer_id'],as_index=False)[['is_effect','is_receive']].sum()
user_offer_reaction.head()

Unnamed: 0,person_id,offer_id,is_effect,is_receive
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,0,1
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,1,1
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,1,1
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,1,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,1,1


In [64]:
user_offer_reaction['effect_score'] = user_offer_reaction['is_effect']/user_offer_reaction['is_receive']
user_offer_reaction.head()

Unnamed: 0,person_id,offer_id,is_effect,is_receive,effect_score
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,0,1,0.0
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,1,1,1.0
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,1,1,1.0
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,1,1,1.0
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,1,1,1.0
5,00116118485d4dfda04fdbaba9a87b5c,f19421c1d4aa40978ebb69ca19b0e20d,0,2,0.0
6,0011e0d4e6b944f998e987f904e8c1e5,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,1,1.0
7,0011e0d4e6b944f998e987f904e8c1e5,2298d6c36e964ae4a3e7e9706d1fb8c2,1,1,1.0
8,0011e0d4e6b944f998e987f904e8c1e5,3f207df678b143eea3cee63160fa8bed,0,1,0.0
9,0011e0d4e6b944f998e987f904e8c1e5,5a8bc65990b245e5a138643cd4eb9837,0,1,0.0
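Because `is_effect` is a 0/1 flag with one row per receipt, the score can also be obtained as a group mean, without the helper `is_receive` column. The sketch below reproduces the 2/3 example on toy data; the frame `events` and its values are made up for illustration.

```python
import pandas as pd

# Toy data: customer c1 received offer o1 three times and reacted twice.
events = pd.DataFrame({
    "person_id": ["c1", "c1", "c1", "c2"],
    "offer_id": ["o1", "o1", "o1", "o1"],
    "is_effect": [1, 0, 1, 0],
})

# The mean of a 0/1 column equals (number of reactions) / (number of receipts).
scores = (
    events.groupby(["person_id", "offer_id"], as_index=False)["is_effect"]
    .mean()
    .rename(columns={"is_effect": "effect_score"})
)
print(scores)
```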

