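# Starbucks Capstone Project

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends an offer to users of the app. An offer can be merely an advertisement for a drink, or an actual offer such as a discount or BOGO (buy one, get one free). Some customers might not receive any offer for several weeks in a row.

Not all customers receive the same offer, and that is the challenge this data set poses.

Your task is to combine transaction, demographic, and offer data to determine which demographic groups are influenced by which type of offer. This data set is a simplified version of the real Starbucks app data: the underlying simulator has only one drink, whereas Starbucks actually sells dozens.

Every offer has a validity period. For example, a BOGO offer might be valid for only 5 days. You will see in the data set that even informational offers have a validity period, although they are merely advertisements for a drink. For example, if an informational offer has a validity period of 7 days, you can assume the customer may be influenced by it during those 7 days.

The data set also contains transaction data for purchases made on the app, including the time of each purchase and the amount paid. It also records which offers each customer received and when the customer viewed them. A record is also generated when a customer's purchases complete an offer.

Keep in mind as well that a customer may make a purchase without having received or viewed an offer.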

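### Example

For example, a customer receives a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from the day it is received. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Offers take effect automatically: once the conditions are met, the offer is completed even if the customer never viewed it. For example, a customer might receive the "spend 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period. If that customer spends 15 dollars within those 10 days, the data set still records an offer completion, yet the customer was not influenced by the offer, because the customer never knew it existed.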
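The caveat above can be turned into a simple rule. The following is a minimal sketch, not the project's prescribed method: it assumes a toy event table with hypothetical customers `a` and `b`, hypothetical column names `person_id` and `offer_id`, and made-up times in hours. A completion is treated as influenced only if the same offer was viewed at or before the completion.

```python
import pandas as pd

# Toy events (hypothetical values); time is in hours since the test began.
events = pd.DataFrame({
    "person_id": ["a", "a", "a", "b", "b"],
    "offer_id":  ["o1", "o1", "o1", "o1", "o1"],
    "event":     ["offer received", "offer viewed", "offer completed",
                  "offer received", "offer completed"],
    "time":      [0, 6, 48, 0, 72],
})

# Earliest time of each event type per (person, offer) pair.
first = (events.pivot_table(index=["person_id", "offer_id"],
                            columns="event", values="time", aggfunc="min")
               .reset_index())

# A completion counts as influenced only if the offer was viewed first.
first["influenced"] = (first["offer viewed"].notna()
                       & (first["offer viewed"] <= first["offer completed"]))
print(first[["person_id", "offer_id", "influenced"]])
```

Taking the earliest time per pair is a simplification: a customer who receives the same offer more than once would need each receipt matched to its own view and completion.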

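### Cleaning

Cleaning this data is both very important and tricky.

You also need to account for demographic groups that make purchases even when they receive no offer. From a business perspective, if a customer is going to spend 10 dollars whether or not an offer arrives, you would not want to send that customer a "spend 10 dollars, get 2 dollars off" offer. You may therefore need to analyze what a given demographic group buys when it receives no offers at all.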
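One possible way to estimate this baseline is to separate out transactions made while no offer was active. The sketch below is an illustration under assumptions: the `windows` table (one validity window per customer, in hours), the column names, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical offer windows (hours) and transactions for two customers.
windows = pd.DataFrame({
    "person_id": ["a", "b"],
    "start": [0, 24],
    "end":   [120, 96],
})
tx = pd.DataFrame({
    "person_id": ["a", "a", "b", "b"],
    "time":   [12, 200, 6, 48],
    "amount": [5.0, 7.5, 3.0, 9.0],
})

# Pair each transaction with that customer's windows and test containment.
tx = tx.reset_index().rename(columns={"index": "tx_id"})
paired = tx.merge(windows, on="person_id", how="left")
paired["in_window"] = (paired["time"] >= paired["start"]) & (paired["time"] <= paired["end"])
covered = paired.groupby("tx_id")["in_window"].any()

# Baseline spend: transactions made while no offer was active.
baseline = tx[~tx["tx_id"].map(covered)]
print(baseline.groupby("person_id")["amount"].sum())
```

Joining the baseline rows with the demographic columns would then give spend per group in the absence of offers.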

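### Final Advice

Because this is a capstone project, you are free to analyze the data in any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether a customer will respond to an offer. You also do not have to build a machine learning model at all: you could develop a set of heuristics that decide which offer to send to each customer (for example, if 75% of 35-year-old female users respond to offer A but only 40% respond to offer B, send them offer A).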
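The heuristic in the last example amounts to picking, for each group, the offer with the highest response rate. A minimal sketch, assuming a hypothetical table of response rates per segment and offer (the segment labels and the `M_35` numbers are made up):

```python
import pandas as pd

# Hypothetical response rates per (segment, offer); values are illustrative.
rates = pd.DataFrame({
    "segment": ["F_35", "F_35", "M_35", "M_35"],
    "offer":   ["A", "B", "A", "B"],
    "response_rate": [0.75, 0.40, 0.30, 0.55],
})

# For each segment, keep the row with the highest response rate.
best = rates.loc[rates.groupby("segment")["response_rate"].idxmax()]
best_offer = dict(zip(best["segment"], best["offer"]))
print(best_offer)
```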


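# Data Sets

There are three data files:

* portfolio.json – offer ids and metadata about each offer (duration, type, and so on)
* profile.json – demographic data for each customer
* transcript.json – records of transactions, offers received, offers viewed, and offers completed

The type and meaning of each variable in the files: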

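**portfolio.json**
* id (string) – offer id
* offer_type (string) – type of offer, for example BOGO, discount, informational
* difficulty (int) – minimum spend required to complete the offer
* reward (int) – reward given for completing the offer
* duration (int) – how long the offer stays open, in days
* channels (list of strings) – delivery channels, for example web, email, mobile, social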
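Because `channels` holds a list in each row, it usually has to be flattened before it can be used as a feature. The sketch below shows one way to turn it into 0/1 flags; `portfolio_demo` is a hypothetical two-row stand-in using values that appear in portfolio.json.

```python
import pandas as pd

# Two offers with values as they appear in portfolio.json; only the relevant columns.
portfolio_demo = pd.DataFrame({
    "id": ["ae264e3637204a6fb9bb56bc8210ddfd", "0b1e1539f2cc45b7b9fa7c272da2e1d7"],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# explode() gives one row per (offer, channel); crosstab turns that into 0/1 flags.
long = portfolio_demo.explode("channels")
channel_flags = pd.crosstab(long["id"], long["channels"])
print(channel_flags)
```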

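**profile.json**
* age (int) – age of the customer
* became_member_on (int) – date the customer first registered on the app, stored as a YYYYMMDD integer
* gender (str) – gender of the customer (note that besides M for male and F for female, some entries contain O for other)
* id (str) – customer id
* income (float) – income of the customer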
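Two of these columns typically need conversion before analysis. The sketch below parses `became_member_on` into a date and masks the age value 118. Treating 118 as a placeholder for a missing age is an assumption, based on the fact that those rows also have no gender or income. `profile_demo`, `member_since`, and `age_clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Three rows in the shape of profile.json (values match the first rows of the file).
profile_demo = pd.DataFrame({
    "age": [118, 55, 75],
    "became_member_on": [20170212, 20170715, 20170509],
    "gender": [None, "F", "F"],
    "income": [np.nan, 112000.0, 100000.0],
})

# Parse the YYYYMMDD integer into a real date.
profile_demo["member_since"] = pd.to_datetime(
    profile_demo["became_member_on"].astype(str), format="%Y%m%d")

# Assumption: age 118 is a placeholder for "unknown", so mask it as missing.
profile_demo["age_clean"] = profile_demo["age"].where(profile_demo["age"] != 118)
print(profile_demo[["member_since", "age_clean"]])
```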

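**transcript.json**
* event (str) – record description (transaction, offer received, offer viewed, offer completed)
* person (str) – customer id
* time (int) – time in hours since the start of the test; the data begins at time t=0
* value (dict of strings) – either an offer id or a transaction amount, depending on the record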
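Note that `time` is measured in hours while `duration` in portfolio.json is measured in days, so one of them has to be converted before validity windows can be compared with event times. A minimal sketch, with hypothetical offer labels and receive times:

```python
import pandas as pd

# Hypothetical received-offer rows; receive_time is in hours, duration in days.
received = pd.DataFrame({
    "offer_id": ["bogo_5d", "info_7d"],
    "receive_time": [0, 168],
    "duration": [5, 7],
})

# Convert days to hours so both sides of the comparison use the same unit.
received["expire_time"] = received["receive_time"] + received["duration"] * 24
print(received)
```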

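**Note:** If you are using the Workspace, open a terminal and run the command `conda update pandas` before reading in the files. The version of pandas in the Workspace cannot read the contents of transcript.json correctly, so pandas must be updated to the latest version. You can open a terminal by clicking the orange jupyter icon in the top left of the notebook.

The two screenshots below show how to open the terminal and how to install the update. First open the terminal:
<img src="pic1.png"/>

Then run the command above:
<img src="pic2.png"/>

Finally, return to this notebook (again by clicking the orange jupyter icon) and re-run the cell below; it should no longer raise an error.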

In [1]:
import pandas as pd
import numpy as np
import math
import json
%matplotlib inline

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

In [2]:
portfolio.head()

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5


In [3]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [4]:
profile.shape

(17000, 5)

In [5]:
# Split the `value` dicts into separate columns, one column per dict key
transcript_value_s = transcript['value'].apply(pd.Series)
transcript_value_s.head(6)

Unnamed: 0,offer id,amount,offer_id,reward
0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,,
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,,,
2,2906b810c7d4411798c6938adc9daaa5,,,
3,fafdcd668e3743c1bb461111dcafc2a4,,,
4,4d5c57ea9a6940dd891ad53e9dbe8da0,,,
5,f19421c1d4aa40978ebb69ca19b0e20d,,,
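Calling `apply(pd.Series)` builds one Series per row, which can be slow on a table of this size. A possible alternative, assuming pandas 1.0 or later where `pd.json_normalize` is available at the top level, is sketched below on a hypothetical three-row sample that mirrors the key variants found in `value`.

```python
import pandas as pd

# Small sample of `value` dicts, mirroring the key variants in transcript.json.
values = pd.Series([
    {"offer id": "9b98b8c7a33c4b65b9aebfe6a799e6d9"},
    {"amount": 0.83},
    {"offer_id": "2906b810c7d4411798c6938adc9daaa5", "reward": 2},
])

# json_normalize builds all columns in one pass instead of one Series per row.
expanded = pd.json_normalize(values.tolist())
print(expanded.columns.tolist())
```

The result has a fresh 0..n-1 index, so it lines up with `transcript` only as long as `transcript` also uses the default index.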


In [3]:
# The 'offer id' and 'offer_id' columns hold the same information and need to be merged:
# where offer_id is missing, fill it with the value from 'offer id'.
transcript_value_s['offer_id'] = transcript_value_s['offer_id'].fillna(transcript_value_s['offer id'])
transcript_value_s.head()


Unnamed: 0,offer id,amount,offer_id,reward
0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,2906b810c7d4411798c6938adc9daaa5,,2906b810c7d4411798c6938adc9daaa5,
3,fafdcd668e3743c1bb461111dcafc2a4,,fafdcd668e3743c1bb461111dcafc2a4,
4,4d5c57ea9a6940dd891ad53e9dbe8da0,,4d5c57ea9a6940dd891ad53e9dbe8da0,


In [4]:
# Drop the now-redundant 'offer id' column
transcript_value = transcript_value_s.drop(['offer id'], axis=1)
transcript_value.head(3)

Unnamed: 0,amount,offer_id,reward
0,,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,,2906b810c7d4411798c6938adc9daaa5,


In [6]:
transcript.head()

Unnamed: 0,event,person,time,value
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2,offer received,e2127556f4f64592b11af22de27a7932,0,{'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}


In [5]:
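# Append the processed transcript_value columns to the transcript table.
# Both frames share the same index, so the join_axes argument (removed in pandas 1.0) is not needed.
transcript_new = pd.concat([transcript, transcript_value], axis=1)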
transcript_new.head()

Unnamed: 0,event,person,time,value,amount,offer_id,reward
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,offer received,a03223e636434f42ac4c3df47e8bac43,0,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,offer received,e2127556f4f64592b11af22de27a7932,0,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},,2906b810c7d4411798c6938adc9daaa5,
3,offer received,8ec6ce2a7e7949b1bf142def7d0e0586,0,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},,fafdcd668e3743c1bb461111dcafc2a4,
4,offer received,68617ca6246f4fbc85e91a2a49552598,0,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},,4d5c57ea9a6940dd891ad53e9dbe8da0,


In [6]:
# Inspect the distinct event types:
# offer received
# offer viewed
# transaction
# offer completed
transcript_new.drop_duplicates(['event'])

Unnamed: 0,event,person,time,value,amount,offer_id,reward
0,offer received,78afa995795e4d85b5d9ceeca43f5fef,0,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},,9b98b8c7a33c4b65b9aebfe6a799e6d9,
12650,offer viewed,389bc3fa690240e798340f5a15918d5c,0,{'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'},,f19421c1d4aa40978ebb69ca19b0e20d,
12654,transaction,02c083884c7d45b39cc68e1314fec56c,0,{'amount': 0.8300000000000001},0.83,,
12658,offer completed,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,{'offer_id': '2906b810c7d4411798c6938adc9daaa5...,,2906b810c7d4411798c6938adc9daaa5,2.0
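`drop_duplicates` shows one sample row per event type. If the number of rows per type is also of interest, `value_counts` is a common alternative. A minimal sketch on a hypothetical five-row sample:

```python
import pandas as pd

# Hypothetical sample of the event column.
events = pd.Series(["offer received", "offer viewed", "transaction",
                    "offer received", "offer completed"], name="event")

# Number of rows per event type, most frequent first.
counts = events.value_counts()
print(counts)
```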


In [13]:
transcript_new.shape

(306534, 7)

In [17]:
transcript.shape

(306534, 4)

In [18]:
portfolio

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2


In [7]:
# Rename the id column to make later merges easier
portfolio.rename(columns={'id':'offer_id'},inplace=True) 
portfolio

Unnamed: 0,channels,difficulty,duration,offer_id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2


In [8]:
# Rename the id column to make later merges easier
profile.rename(columns={'id':'person_id'},inplace=True) 
profile.head()

Unnamed: 0,age,became_member_on,gender,person_id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


In [9]:
# Bucket customers by income

def income_type(x):
    if pd.isna(x):
        return x
    elif x <= 50000:
        return "0~50000"
    elif x > 50000 and x <= 100000:
        return "50001~100000"
    elif x > 100000 and x <= 150000:
        return "100000~150000"
    elif x > 150000 and x <= 200000:
        return "150000~200000"
    else:
        return ">200000"
    
profile['income_type'] = profile['income'].apply(income_type)
profile.head()

Unnamed: 0,age,became_member_on,gender,person_id,income,income_type
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0,100000~150000
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0,50001~100000
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,,
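The same bucketing can also be expressed with `pd.cut`, which avoids the hand-written if/elif chain. This is a sketch of an alternative, not the code that produced the output above; it reuses the notebook's labels and right-closed bin edges on a hypothetical four-value sample.

```python
import numpy as np
import pandas as pd

income = pd.Series([np.nan, 112000.0, 100000.0, 30000.0])

# Right-closed bins: (-inf, 50000], (50000, 100000], ... matching the if/elif chain.
bins = [-np.inf, 50000, 100000, 150000, 200000, np.inf]
labels = ["0~50000", "50001~100000", "100000~150000", "150000~200000", ">200000"]
income_type = pd.cut(income, bins=bins, labels=labels)
print(income_type.tolist())
```

Missing incomes stay missing, as they do in `income_type()`.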


In [10]:
# Bucket customers by age

def age_type(x):
    if pd.isna(x) or x == 118:
        return "None"
    elif x <= 10:
        return "0~10y"
    elif x > 10 and x <= 20:
        return "10~20y"
    elif x > 20 and x <= 30:
        return "20~30y"
    elif x > 30 and x <= 40:
        return "30~40y"
    elif x > 40 and x <= 50:
        return "40~50y"
    elif x > 50 and x <= 60:
        return "50~60y"
    elif x > 60 and x <= 70:
        return "60~70y"
    elif x > 70 and x <= 80:
        return "70~80y"
    elif x > 80 and x <= 90:
        return "80~90y"
    elif x > 90 and x <= 100:
        return "90~100y"
    else:
        return ">100y"
    
profile['age_type'] = profile['age'].apply(age_type)
profile.head()

Unnamed: 0,age,became_member_on,gender,person_id,income,income_type,age_type
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,,,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0,100000~150000,50~60y
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,,,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0,50001~100000,70~80y
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,,,


In [11]:
# 1.1 Extract the "offer received" events into their own table
offer_received = transcript_new[transcript_new['event'] == 'offer received'].loc[:,['person','time','offer_id']]
offer_received.rename(columns={'time':'receive_time','person':'person_id'},inplace=True) 
offer_received.head()

Unnamed: 0,person_id,receive_time,offer_id
0,78afa995795e4d85b5d9ceeca43f5fef,0,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,a03223e636434f42ac4c3df47e8bac43,0,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,e2127556f4f64592b11af22de27a7932,0,2906b810c7d4411798c6938adc9daaa5
3,8ec6ce2a7e7949b1bf142def7d0e0586,0,fafdcd668e3743c1bb461111dcafc2a4
4,68617ca6246f4fbc85e91a2a49552598,0,4d5c57ea9a6940dd891ad53e9dbe8da0


In [12]:
# Count the rows that share the same 'person_id' and 'offer_id', so each (person_id, offer_id) pair appears once
received_cnt = offer_received.groupby(['person_id','offer_id'],as_index=False)['receive_time'].count()
received_cnt.rename(columns={'receive_time':'receive_cnt'},inplace=True) 
received_cnt

Unnamed: 0,person_id,offer_id,receive_cnt
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,1
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,1
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,1
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,1
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,1
5,00116118485d4dfda04fdbaba9a87b5c,f19421c1d4aa40978ebb69ca19b0e20d,2
6,0011e0d4e6b944f998e987f904e8c1e5,0b1e1539f2cc45b7b9fa7c272da2e1d7,1
7,0011e0d4e6b944f998e987f904e8c1e5,2298d6c36e964ae4a3e7e9706d1fb8c2,1
8,0011e0d4e6b944f998e987f904e8c1e5,3f207df678b143eea3cee63160fa8bed,1
9,0011e0d4e6b944f998e987f904e8c1e5,5a8bc65990b245e5a138643cd4eb9837,1


In [13]:
# 1.2 Extract the "offer viewed" events into their own table
offer_viewed = transcript_new[transcript_new['event'] == 'offer viewed'].loc[:,['person','time','offer_id']]
offer_viewed.rename(columns={'time':'view_time','person':'person_id'},inplace=True) 
offer_viewed.head()

Unnamed: 0,person_id,view_time,offer_id
12650,389bc3fa690240e798340f5a15918d5c,0,f19421c1d4aa40978ebb69ca19b0e20d
12651,d1ede868e29245ea91818a903fec04c6,0,5a8bc65990b245e5a138643cd4eb9837
12652,102e9454054946fda62242d2e176fdce,0,4d5c57ea9a6940dd891ad53e9dbe8da0
12653,02c083884c7d45b39cc68e1314fec56c,0,ae264e3637204a6fb9bb56bc8210ddfd
12655,be8a5d1981a2458d90b255ddc7e0d174,0,5a8bc65990b245e5a138643cd4eb9837


In [14]:
# Count the rows that share the same 'person_id' and 'offer_id', so each (person_id, offer_id) pair appears once
viewed_cnt = offer_viewed.groupby(['person_id','offer_id'],as_index=False)['view_time'].count()
viewed_cnt.rename(columns={'view_time':'viewed_cnt'},inplace=True) 
viewed_cnt

Unnamed: 0,person_id,offer_id,viewed_cnt
0,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,1
1,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,1
2,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,1
3,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,1
4,00116118485d4dfda04fdbaba9a87b5c,f19421c1d4aa40978ebb69ca19b0e20d,2
5,0011e0d4e6b944f998e987f904e8c1e5,0b1e1539f2cc45b7b9fa7c272da2e1d7,1
6,0011e0d4e6b944f998e987f904e8c1e5,2298d6c36e964ae4a3e7e9706d1fb8c2,1
7,0011e0d4e6b944f998e987f904e8c1e5,3f207df678b143eea3cee63160fa8bed,1
8,0011e0d4e6b944f998e987f904e8c1e5,5a8bc65990b245e5a138643cd4eb9837,1
9,0011e0d4e6b944f998e987f904e8c1e5,9b98b8c7a33c4b65b9aebfe6a799e6d9,1


In [15]:
# 1.3 Extract the "offer completed" events (offers that were used) into their own table
offer_completed = transcript_new[transcript_new['event'] == 'offer completed'].loc[:,['person','time','offer_id','reward']]
offer_completed.rename(columns={'time':'use_time','person':'person_id'},inplace=True) 
offer_completed.head()

Unnamed: 0,person_id,use_time,offer_id,reward
12658,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,2906b810c7d4411798c6938adc9daaa5,2.0
12672,fe97aa22dd3e48c8b143116a8403dd52,0,fafdcd668e3743c1bb461111dcafc2a4,2.0
12679,629fc02d56414d91bca360decdfa9288,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,5.0
12692,676506bad68e4161b9bbaffeb039626b,0,ae264e3637204a6fb9bb56bc8210ddfd,10.0
12697,8f7dd3b2afe14c078eb4f6e6fe4ba97d,0,4d5c57ea9a6940dd891ad53e9dbe8da0,10.0


In [16]:
# Count the rows that share the same 'person_id' and 'offer_id', so each (person_id, offer_id) pair appears once
completed_cnt = offer_completed.groupby(['person_id','offer_id'],as_index=False)['use_time'].count()
completed_cnt.rename(columns={'use_time':'use_cnt'},inplace=True) 
completed_cnt

Unnamed: 0,person_id,offer_id,use_cnt
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,1
1,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,1
2,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,1
3,0011e0d4e6b944f998e987f904e8c1e5,0b1e1539f2cc45b7b9fa7c272da2e1d7,1
4,0011e0d4e6b944f998e987f904e8c1e5,2298d6c36e964ae4a3e7e9706d1fb8c2,1
5,0011e0d4e6b944f998e987f904e8c1e5,9b98b8c7a33c4b65b9aebfe6a799e6d9,1
6,0020c2b971eb4e9188eac86d93036a77,4d5c57ea9a6940dd891ad53e9dbe8da0,1
7,0020c2b971eb4e9188eac86d93036a77,fafdcd668e3743c1bb461111dcafc2a4,2
8,0020ccbbb6d84e358d3414a3ff76cffd,2298d6c36e964ae4a3e7e9706d1fb8c2,1
9,0020ccbbb6d84e358d3414a3ff76cffd,9b98b8c7a33c4b65b9aebfe6a799e6d9,1


In [17]:
# 1.4 Combine the received, viewed, and completed counts, then merge in the customer and offer attributes

offer_person_detail = pd.merge(received_cnt, viewed_cnt, how='left', on=['person_id','offer_id'])
offer_person_detail = pd.merge(offer_person_detail, completed_cnt, how='left', on=['person_id','offer_id'])
offer_person_detail = pd.merge(offer_person_detail, profile, how='left', on=['person_id'])
offer_person_detail = pd.merge(offer_person_detail, portfolio, how='left', on=['offer_id'])
offer_person_detail

Unnamed: 0,person_id,offer_id,receive_cnt,viewed_cnt,use_cnt,age,became_member_on,gender,income,income_type,age_type,channels,difficulty,duration,offer_type,reward
0,0009655768c64bdeb2e877511632db8f,2906b810c7d4411798c6938adc9daaa5,1,,1.0,33,20170421,M,72000.0,50001~100000,30~40y,"[web, email, mobile]",10,7,discount,2
1,0009655768c64bdeb2e877511632db8f,3f207df678b143eea3cee63160fa8bed,1,1.0,,33,20170421,M,72000.0,50001~100000,30~40y,"[web, email, mobile]",0,4,informational,0
2,0009655768c64bdeb2e877511632db8f,5a8bc65990b245e5a138643cd4eb9837,1,1.0,,33,20170421,M,72000.0,50001~100000,30~40y,"[email, mobile, social]",0,3,informational,0
3,0009655768c64bdeb2e877511632db8f,f19421c1d4aa40978ebb69ca19b0e20d,1,1.0,1.0,33,20170421,M,72000.0,50001~100000,30~40y,"[web, email, mobile, social]",5,5,bogo,5
4,0009655768c64bdeb2e877511632db8f,fafdcd668e3743c1bb461111dcafc2a4,1,1.0,1.0,33,20170421,M,72000.0,50001~100000,30~40y,"[web, email, mobile, social]",10,10,discount,2
5,00116118485d4dfda04fdbaba9a87b5c,f19421c1d4aa40978ebb69ca19b0e20d,2,2.0,,118,20180425,,,,,"[web, email, mobile, social]",5,5,bogo,5
6,0011e0d4e6b944f998e987f904e8c1e5,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,1.0,1.0,40,20180109,O,57000.0,50001~100000,30~40y,"[web, email]",20,10,discount,5
7,0011e0d4e6b944f998e987f904e8c1e5,2298d6c36e964ae4a3e7e9706d1fb8c2,1,1.0,1.0,40,20180109,O,57000.0,50001~100000,30~40y,"[web, email, mobile, social]",7,7,discount,3
8,0011e0d4e6b944f998e987f904e8c1e5,3f207df678b143eea3cee63160fa8bed,1,1.0,,40,20180109,O,57000.0,50001~100000,30~40y,"[web, email, mobile]",0,4,informational,0
9,0011e0d4e6b944f998e987f904e8c1e5,5a8bc65990b245e5a138643cd4eb9837,1,1.0,,40,20180109,O,57000.0,50001~100000,30~40y,"[email, mobile, social]",0,3,informational,0
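With the counts and attributes in one table, view and completion rates per offer type follow from a groupby. The sketch below is illustrative only: `detail` is a hypothetical five-row stand-in with counts copied from the preview above, and the rate column names are assumptions.

```python
import numpy as np
import pandas as pd

# A few rows in the shape of offer_person_detail (counts copied from the preview).
detail = pd.DataFrame({
    "offer_type":  ["discount", "informational", "bogo", "discount", "bogo"],
    "receive_cnt": [1, 1, 1, 1, 2],
    "viewed_cnt":  [np.nan, 1.0, 1.0, 1.0, 2.0],
    "use_cnt":     [1.0, np.nan, 1.0, 1.0, np.nan],
})

# Missing counts come from the left joins and mean "never happened".
detail[["viewed_cnt", "use_cnt"]] = detail[["viewed_cnt", "use_cnt"]].fillna(0)

totals = detail.groupby("offer_type")[["receive_cnt", "viewed_cnt", "use_cnt"]].sum()
totals["view_rate"] = totals["viewed_cnt"] / totals["receive_cnt"]
totals["complete_rate"] = totals["use_cnt"] / totals["receive_cnt"]
print(totals)
```

Informational offers have no completion event in the rows shown here, so their completion rate of zero should not be read as a lack of effect.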


In [18]:
offer_person_detail.to_excel('offer_person_detail.xlsx')

In [19]:
# 2.1 How do the different customer groups transact? Extract the transaction events into their own table
transaction_log = transcript_new[transcript_new['event'] == 'transaction'].loc[:,['person','time','amount']]
transaction_log.rename(columns={'time':'transaction_time','person':'person_id'},inplace=True) 
transaction_log.head()

Unnamed: 0,person_id,transaction_time,amount
12654,02c083884c7d45b39cc68e1314fec56c,0,0.83
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56
12659,54890f68699049c2a04d415abc25e717,0,13.23
12670,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51
12671,fe97aa22dd3e48c8b143116a8403dd52,0,18.97


In [20]:
# 2.2 Merge the customer attributes into the transactions (transactions are not linked to offers)
trans_person_detail = pd.merge(transaction_log, profile, how='left', on=['person_id'])
trans_person_detail

Unnamed: 0,person_id,transaction_time,amount,age,became_member_on,gender,income,income_type,age_type
0,02c083884c7d45b39cc68e1314fec56c,0,0.83,20,20160711,F,30000.0,0~50000,10~20y
1,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56,42,20160117,M,96000.0,50001~100000,40~50y
2,54890f68699049c2a04d415abc25e717,0,13.23,36,20171228,M,56000.0,50001~100000,30~40y
3,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51,55,20171016,F,94000.0,50001~100000,50~60y
4,fe97aa22dd3e48c8b143116a8403dd52,0,18.97,39,20171217,F,67000.0,50001~100000,30~40y
5,629fc02d56414d91bca360decdfa9288,0,33.90,52,20180605,M,72000.0,50001~100000,50~60y
6,bbeb54e861614fc7b22a8844f72dca6c,0,0.22,118,20171207,,,,
7,a97e6f33219c432db82acfa0d19c602d,0,18.59,75,20130806,F,69000.0,50001~100000,70~80y
8,676506bad68e4161b9bbaffeb039626b,0,18.01,37,20170515,M,92000.0,50001~100000,30~40y
9,8f7dd3b2afe14c078eb4f6e6fe4ba97d,0,19.11,48,20150903,M,62000.0,50001~100000,40~50y


In [21]:
trans_person_detail.shape

(138953, 9)
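A first look at how groups differ can be had by aggregating the transaction amounts per demographic column. A minimal sketch, assuming a hypothetical four-row sample with amounts and genders copied from the preview above and shortened, made-up customer ids:

```python
import pandas as pd

# A few rows in the shape of trans_person_detail (amounts and genders from the preview).
tx = pd.DataFrame({
    "person_id": ["02c0", "9fa9", "5489", "b2f1"],   # shortened, hypothetical ids
    "amount": [0.83, 34.56, 13.23, 19.51],
    "gender": ["F", "M", "M", "F"],
})

# Transaction count and mean spend per gender.
spend = tx.groupby("gender")["amount"].agg(["count", "mean"])
print(spend)
```

The same pattern works with `income_type` or `age_type` as the grouping key.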

In [22]:
trans_person_detail.to_excel('trans_person_detail.xlsx')

In [16]:
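# Count received offers per customer segment and offer attributes.
# offer_received_person attaches the customer and offer attributes to every received offer.
offer_received_person = pd.merge(offer_received, profile, how='left', on=['person_id'])
offer_received_person = pd.merge(offer_received_person, portfolio, how='left', on=['offer_id'])
received_summary = offer_received_person.groupby(['gender','income_type','age_type','difficulty','duration','offer_type','reward'])['person_id'].count().to_frame()
received_summary.rename(columns={'person_id':'receive_person_cnt'},inplace=True) 
received_summary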

gender,income_type,age_type,difficulty,duration,offer_type,reward,receive_person_cnt
F,0~50000,10~20y,0,3,informational,0,18
F,0~50000,10~20y,0,4,informational,0,20
F,0~50000,10~20y,5,5,bogo,5,20
F,0~50000,10~20y,5,7,bogo,5,13
F,0~50000,10~20y,7,7,discount,3,14
F,0~50000,10~20y,10,5,bogo,10,21
F,0~50000,10~20y,10,7,bogo,10,19
F,0~50000,10~20y,10,7,discount,2,12
F,0~50000,10~20y,10,10,discount,2,7
F,0~50000,10~20y,20,10,discount,5,12


In [18]:
offer_received_person.to_excel('offer_received_person.xlsx')

In [None]:
# Count events for each customer type and each offer
#offer_received_person = pd.merge(offer_received, profile, how='left', on=['person_id'])
#offer_received_person = pd.merge(offer_received_person, portfolio, how='left', on=['offer_id'])


In [28]:
# Join all offers with the "offer received" events
#offer_detail = pd.merge(portfolio, offer_received, how='left', on=['offer_id'])
#offer_detail

In [29]:
# Join all offers with the "offer viewed" events
# offer_detail = pd.merge(offer_detail, offer_viewed, how='outer', on=['offer_id','person'])
# offer_detail

In [30]:
# Join all offers with the "offer completed" events
#offer_detail = pd.merge(offer_detail, offer_completed, how='outer', on=['offer_id','person'])
#offer_detail
