## Task: Predicting JD.com User Purchase Intent

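### Background:
As China's largest direct-sales e-commerce company, JD.com has built up hundreds of millions of loyal users and accumulated massive amounts of real data while growing rapidly. Finding patterns in historical data to predict users' future purchase needs, so that the most suitable products reach the people who need them most, is a key problem for big data in precision marketing. It is also a core technology for any e-commerce platform undertaking an intelligent upgrade.

Starting from real (anonymized) user, product, and behavior data from JD.com, the task is to use data mining techniques and machine learning algorithms to build a model that predicts which products users will buy, and to output matches between high-potential users and target products, providing a high-quality target audience for precision marketing.

Goal: using historical sales data for products across several JD.com categories, build a model that predicts users' purchase intent for products in a given target category over the next 5 days.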

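### Datasets:
All datasets used here come from JD.com:

* JData_User.csv: user data, 105,321 users
* JData_Comment.csv: product comments, 558,552 records
* JData_Product.csv: candidate products for prediction, 24,187 records
* JData_Action_201602.csv: action records for February, 11,485,424 records
* JData_Action_201603.csv: action records for March, 25,916,378 records
* JData_Action_201604.csv: action records for April, 13,199,934 records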

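**JData_User.csv: user data**

|Field|Meaning|Notes|
|-|-|-|
|user_id|User ID|Anonymized|
|age|Age|-1 means unknown|
|sex|Sex|0 male, 1 female, 2 unknown|
|user_lv_cd|User level|Enumerated level; a larger value means a higher level|
|user_reg_tm|Registration date|Day granularity|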

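**JData_Comment.csv: comment data**

|Field|Meaning|Notes|
|-|-|-|
|dt|Cut-off date|By day, up to 2016-02-01|
|sku_id|Product ID|Anonymized|
|comment_num|Cumulative comment count, bucketed|0 means no comments, 1 means 1 comment, 2 means 2-10, 3 means 11-50, 4 means more than 50|
|has_bad_comment|Has negative comments|0 means no, 1 means yes|
|bad_comment_rate|Negative comment rate|Negative comments as a share of all comments|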

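**JData_Product.csv: product data**

|Field|Meaning|Notes|
|-|-|-|
|sku_id|Product ID|Anonymized|
|a1|Attribute 1|Enumerated, -1 means unknown|
|a2|Attribute 2|Enumerated, -1 means unknown|
|a3|Attribute 3|Enumerated, -1 means unknown|
|cate|Category ID|Anonymized|
|brand|Brand ID|Anonymized|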

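**JData_Action_xx.csv: action data**

|Field|Meaning|Notes|
|-|-|-|
|user_id|User ID|Anonymized|
|sku_id|Product ID|Anonymized|
|time|Time of the action||
|model_id|ID of the page module that was clicked|Anonymized|
|type|Action type|1: view product detail page; 2: add to cart; 3: remove from cart; 4: place order; 5: favorite; 6: click|
|cate|Category ID|Anonymized|
|brand|Brand ID|Anonymized|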

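### Data Mining Workflow:
(1) Data cleaning
1. Verify the integrity of the datasets
2. Check whether the datasets contain missing values
3. Decide how the values of each feature should be handled
4. Decide which data we want and which can be filtered out
5. Turn the valuable information into a new data source
6. Remove products and users with no action records
7. Remove users with very many browses but very few purchases (inert users or crawlers)

(2) Data understanding and analysis
1. Understand what each feature means
2. Look at the characteristics of the data and whether they can be used for modeling
3. Visualize the data to support analysis
4. Check whether purchase intent changes with time and other factors

(3) Feature extraction
1. Decide which features of the cleaned dataset are valuable
2. Extract features for users, for products, and for the actions that link them
3. Which behavioral factors are central, and how can they be extracted?
4. Instantaneous or cumulative behavioral features?

(4) Model building
1. Use machine learning algorithms for prediction
2. Set and tune parameters
3. Split the dataset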

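### Dataset Integrity Check
First check that the users in JData_User and the users in JData_Action are consistent, so that every action in the action data was produced by a user present in the user data.

Approach: use pd.merge to join the user table with the action table on their shared key column, and see whether the number of action rows shrinks. The toy example uses a column named sku to show the idea: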

In [1]:
# Toy example of the approach
import pandas as pd
df1 = pd.DataFrame({'sku':['a','a','e','c'], 'data':[1,1,2,3]})
df2 = pd.DataFrame({'sku':['a','b','c']})
print(pd.merge(df1,df2))

  sku  data
0   a     1
1   a     1
2   c     3


The result contains only the rows whose key appears in both frames (an inner join).

In [2]:
# Dataset validation
def user_action_check():
    df_user = pd.read_csv('data/JData_User.csv',encoding='gbk')
    df_sku = df_user.loc[:,'user_id'].to_frame()
    df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')
    # pd.merge(df_sku, df_month2) is an inner join on the shared user_id column (intersection, not union);
    # an unchanged row count therefore shows that every user_id in the action data exists in df_user
    print ('Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2))) 
    df_month3 = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')
    print ('Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3)))
    df_month4 = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')
    print ('Is action of Apr. from User file? ', len(df_month4) == len(pd.merge(df_sku,df_month4)))

user_action_check() 

Is action of Feb. from User file?  True
Is action of Mar. from User file?  True
Is action of Apr. from User file?  True


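Conclusion: every user in the action datasets also appears in the User dataset.

Because the row count is the same before and after the merge, the user IDs in Action are guaranteed to be a subset of the user IDs in User.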
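
The same subset property can also be checked without materializing a merge. The snippet is a minimal sketch on toy data (the frames and their values are made up for illustration, not taken from the JData files) that uses `Series.isin` and also reports which IDs are unknown:

```python
import pandas as pd

# Toy stand-ins for the user table and one month of action data (hypothetical values).
users = pd.DataFrame({'user_id': [1, 2, 3]})
actions = pd.DataFrame({'user_id': [1.0, 1.0, 3.0, 4.0], 'type': [1, 6, 4, 1]})

# True for every action row whose user_id appears in the user table.
known = actions['user_id'].isin(users['user_id'])

all_known = bool(known.all())
unknown_ids = sorted(actions.loc[~known, 'user_id'].unique().tolist())
print(all_known, unknown_ids)
```

Unlike comparing row counts, this approach names the offending IDs when the check fails.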

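### Checking for Duplicate Records
Remove records that are fully duplicated within each data file. Note, however, that duplicates may be meaningful: for example, a user may buy several items at once, or add several units of a product to the cart at the same moment.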

In [3]:
# Remove duplicate records
def deduplicate(filepath, filename, newpath):
    df_file = pd.read_csv(filepath,encoding='gbk')       
    before = df_file.shape[0]
    df_file.drop_duplicates(inplace=True)  # rows equal in every column count as duplicates; inplace=True drops them from the original DataFrame
    after = df_file.shape[0]
    n_dup = before-after  # difference in row count before and after
    print ('Number of duplicate records for ' + filename + ' is: ' + str(n_dup))
    if n_dup != 0:
        df_file.to_csv(newpath, index=False)
    else:
        print ('No duplicate records in ' + filename)

In [4]:
deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')
deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')
deduplicate('data/JData_Action_201604.csv', 'Apr. action', 'data/JData_Action_201604_dedup.csv')
deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')
deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')
deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')

Number of duplicate records for Feb. action is: 2756093
Number of duplicate records for Mar. action is: 7085038
Number of duplicate records for Apr. action is: 3672710
Number of duplicate records for Comment is: 0
No duplicate records in Comment
Number of duplicate records for Product is: 0
No duplicate records in Product
Number of duplicate records for User is: 0
No duplicate records in User


In [6]:
# Inspect the duplicated rows
df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')
IsDuplicated = df_month2.duplicated()
df_d = df_month2[IsDuplicated]
df_d.groupby('type').count()  # most duplicates come from browsing (1) or clicking (6)

|type|user_id|sku_id|time|model_id|cate|brand|
|-|-|-|-|-|-|-|
|1|2176378|2176378|2176378|0|2176378|2176378|
|2|636|636|636|0|636|636|
|3|1464|1464|1464|0|1464|1464|
|4|37|37|37|0|37|37|
|5|1981|1981|1981|0|1981|1981|
|6|575597|575597|575597|545054|575597|575597|
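
To make the mechanics of `duplicated()` and `drop_duplicates()` concrete, here is a small sketch on invented rows (the values are illustrative only). A row counts as a duplicate only when every column matches an earlier row, and the first occurrence is not flagged:

```python
import pandas as pd

# Hypothetical action rows: the first three are identical in every column.
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'sku_id':  [10, 10, 10, 20, 20],
    'time':    ['t1', 't1', 't1', 't2', 't3'],
    'type':    [1, 1, 1, 6, 6],
})

dup_mask = df.duplicated()                        # marks repeats after the first occurrence
dup_counts = df[dup_mask].groupby('type').size()  # duplicates per action type
deduped = df.drop_duplicates()                    # keeps one copy of each distinct row
print(dup_counts)
```

The last two rows differ in `time`, so they are kept as distinct actions.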


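### Checking for Users Registered on or after April 15, 2016
The data covers user behavior up to April 15, so it should not contain users who registered after that date.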

In [7]:
# find users whose user_reg_tm >= '2016-4-15'
df_user = pd.read_csv('./data/JData_User.csv',encoding='gbk')
df_user['user_reg_tm']=pd.to_datetime(df_user['user_reg_tm']) 
df_user.loc[df_user.user_reg_tm>= '2016-4-15']

| |user_id|age|sex|user_lv_cd|user_reg_tm|
|-|-|-|-|-|-|
|7457|207458|-1|2.0|1|2016-04-15|
|7463|207464|26-35岁|2.0|2|2016-04-15|
|7467|207468|36-45岁|2.0|3|2016-04-15|
|7472|207473|-1|2.0|1|2016-04-15|
|7482|207483|26-35岁|2.0|3|2016-04-15|
|7492|207493|16-25岁|2.0|3|2016-04-15|
|7493|207494|16-25岁|2.0|3|2016-04-15|
|7503|207504|16-25岁|2.0|4|2016-04-15|
|7510|207511|46-55岁|2.0|5|2016-04-15|
|7512|207513|-1|2.0|1|2016-04-15|


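The check shows that some users did register on or after April 15. If the action data contains no records after April 15, these users can be kept; they would only need to be deleted if they had actions outside the observation window.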

In [8]:
df_month = pd.read_csv('data/JData_Action_201604.csv')
df_month['time'] = pd.to_datetime(df_month['time'])
df_month.loc[df_month.time >= '2016-4-16']

| |user_id|sku_id|time|model_id|type|cate|brand|
|-|-|-|-|-|-|-|-|

(0 rows)


The result is empty: there is no action data on or after April 16, so this batch of users does not need to be deleted.

### Converting the Floating-Point user_id in the Action Data to INT

In [10]:
# Apply the same conversion to each monthly action file, overwriting it in place
for path in ['data/JData_Action_201602.csv',
             'data/JData_Action_201603.csv',
             'data/JData_Action_201604.csv']:
    df_month = pd.read_csv(path, encoding='gbk')
    df_month['user_id'] = df_month['user_id'].astype('int64')  # float -> int
    print (df_month['user_id'].dtype)
    df_month.to_csv(path, index=False)

int64
int64
int64


### Handling Age Ranges
Look at the distribution of user ages and encode the age ranges as integer codes.

In [11]:
age_mapping = {  
           '15岁以下': 1,  
           '16-25岁': 2,  
           '26-35岁': 3,
           '36-45岁': 4,
           '46-55岁': 5,
           '56岁以上': 6,
           '-1' :-1
            }  
df_user['age'] = df_user['age'].map(age_mapping)
print(df_user.age.value_counts())
df_user.to_csv('data/JData_User.csv',index=False)

 3.0    46570
 4.0    30336
-1.0    14412
 2.0     8797
 5.0     3325
 6.0     1871
 1.0        7
Name: age, dtype: int64
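
The counts above sum to 105,318, three fewer than the 105,321 users in the file, and the codes are displayed as floats (`3.0` rather than `3`). Both are consistent with a few `age` values not matching any key in `age_mapping`: `Series.map` turns unmatched values into NaN, which forces a float column. A minimal sketch on invented values (the input series is hypothetical) shows the effect:

```python
import pandas as pd

# A shortened copy of the mapping, for illustration only.
age_mapping = {'16-25岁': 2, '26-35岁': 3, '-1': -1}

# Hypothetical raw column containing one missing value.
raw_age = pd.Series(['26-35岁', '-1', None, '16-25岁'])

encoded = raw_age.map(age_mapping)       # values without a matching key become NaN
n_unmapped = int(encoded.isna().sum())
print(encoded.dtype, n_unmapped)
```

Counting the NaN values after mapping is a quick way to confirm that every age label was covered.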


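To make the cleaning steps above possible, we first build simple behavioral features for users and for products (items), stored in two tables: user_table and item_table.

### user_table
user_table contains the following features:
* user_id (user ID), age, sex, user_lv_cd (user level)
* browse_num (browses), addcart_num (cart additions), delcart_num (cart removals), buy_num (purchases), favor_num (favorites), click_num (clicks)
* buy_addcart_ratio (purchase-to-cart-addition conversion rate), buy_browse_ratio (purchase-to-browse conversion rate), buy_click_ratio (purchase-to-click conversion rate), buy_favor_ratio (purchase-to-favorite conversion rate)

### item_table
item_table contains the following features:
* sku_id (product ID), attr1, attr2, attr3, cate, brand
* browse_num, addcart_num, delcart_num, buy_num, favor_num, click_num
* buy_addcart_ratio, buy_browse_ratio, buy_click_ratio, buy_favor_ratio
* comment_num (comment count), has_bad_comment (whether there are negative comments), bad_comment_rate (negative comment rate)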

### Building user_table

In [14]:
# File names
ACTION_201602_FILE = "data/JData_Action_201602.csv"  # ~11M rows
ACTION_201603_FILE = "data/JData_Action_201603.csv"  # ~26M rows
ACTION_201604_FILE = "data/JData_Action_201604.csv"  # ~13M rows
COMMENT_FILE = "data/JData_Comment.csv"      # ~560K rows
PRODUCT_FILE = "data/JData_Product.csv"      # ~24K rows
USER_FILE = "data/JData_User.csv"            # ~105K rows
 
USER_TABLE_FILE = "data/user_table.csv"
ITEM_TABLE_FILE = "data/item_table.csv"

In [15]:
import numpy as np
from collections import Counter

In [16]:
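# Helper: compute action counts for the rows of a single user group
def add_type_count(group):
    behavior_type = group.type.astype(int) 
    # Count the occurrences of each action type
    type_cnt = Counter(behavior_type)
    # 1: browse 2: add to cart 3: remove from cart
    # 4: purchase 5: favorite 6: click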
    group['browse_num'] = type_cnt[1]
    group['addcart_num'] = type_cnt[2]
    group['delcart_num'] = type_cnt[3]
    group['buy_num'] = type_cnt[4]
    group['favor_num'] = type_cnt[5]
    group['click_num'] = type_cnt[6]
 
    return group[['user_id', 'browse_num', 'addcart_num',
                  'delcart_num', 'buy_num', 'favor_num',
                  'click_num']]
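
`add_type_count` writes the same six counts onto every row of a user's group, so the result still has one row per action. As an alternative sketch, not the notebook's method, `pd.crosstab` can produce one row per user directly. The toy frame and the `names` dictionary are assumptions introduced for illustration:

```python
import pandas as pd

# Hypothetical action rows.
actions = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'type':    [1, 6, 6, 1, 4],
})

# Action type code -> feature column name, following the notebook's naming.
names = {1: 'browse_num', 2: 'addcart_num', 3: 'delcart_num',
         4: 'buy_num', 5: 'favor_num', 6: 'click_num'}

counts = (pd.crosstab(actions['user_id'], actions['type'])
            .reindex(columns=list(names), fill_value=0)  # add types that never occur
            .rename(columns=names)
            .reset_index())
print(counts)
```

The `reindex` step guarantees all six count columns exist even when some action type is absent from the data.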

Because the action data is large, reading it all at once may raise a MemoryError, so pandas' chunked reading is used instead.

In [17]:
# Aggregate statistics from the action data
# Tune chunk_size to fit the available memory
def get_from_action_data(fname, chunk_size=50000):
    reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')
    chunks = []
    loop = True
    while loop:
        try:
            # Keep only the user_id and type columns
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    # Concatenate the chunks into a single pandas DataFrame
    df_ac = pd.concat(chunks, ignore_index=True)
    # Group by user_id and compute the counts per group; as_index=False keeps user_id as a regular column
    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)
    # Drop the repeated rows, keeping one row per user
    df_ac = df_ac.drop_duplicates('user_id')
 
    return df_ac
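
The function above still concatenates every chunk before grouping, so the two selected columns of the whole file end up in memory. A possible variation, sketched here under the assumption that only per-user counts are needed, aggregates each chunk as it is read via the `chunksize` and `usecols` parameters of `pd.read_csv`. The in-memory CSV text is a made-up stand-in for an action file:

```python
import io
import pandas as pd

# Hypothetical miniature action file.
csv_text = "user_id,sku_id,type\n1,10,1\n1,10,6\n2,20,1\n1,11,6\n2,20,4\n"

total = None
# chunksize makes read_csv return an iterator of DataFrames; usecols skips unused columns.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['user_id', 'type'], chunksize=2):
    part = chunk.groupby(['user_id', 'type']).size()   # counts within this chunk
    total = part if total is None else total.add(part, fill_value=0)

# One row per user, one column per action type.
per_user = total.astype(int).unstack(fill_value=0)
print(per_user)
```

Only the running totals are kept between iterations, so peak memory depends on the number of distinct users rather than the number of rows.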

In [18]:
# Combine the statistics from the individual action files
def merge_action_data():
    df_ac = []
    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))
    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))
    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))
 
    df_ac = pd.concat(df_ac, ignore_index=True)
    # Sum each user's counts across the monthly action tables
    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()
    # Build the conversion-rate columns
    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']
    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']
    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']
    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']
    
    # Cap conversion rates above 1 at 1 (100%); this also maps inf from division by zero to 1
    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.
    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.
    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.
    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.
 
    return df_ac
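
The conversion-rate columns divide by counts that can be zero. The sketch below, on invented numbers, shows what pandas produces in each case and how `Series.clip` gives the same capping as the four `.loc` assignments; using `clip` here is a suggestion, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical counts covering the edge cases.
df = pd.DataFrame({'buy_num': [1, 0, 3, 1], 'addcart_num': [4, 0, 2, 0]})

ratio = df['buy_num'] / df['addcart_num']
# 1/4 -> 0.25, 0/0 -> NaN, 3/2 -> 1.5, 1/0 -> inf

capped = ratio.clip(upper=1.0)   # values above 1, including inf, become 1; NaN stays NaN
print(capped.tolist())
```

NaN survives the capping, which explains the empty ratio cells for users who have neither purchases nor the action in the denominator.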

In [19]:
# Select the required columns from the JData_User table
def get_from_jdata_user():
    df_usr = pd.read_csv(USER_FILE, header=0,encoding='gbk')
    df_usr = df_usr[["user_id", "age", "sex", "user_lv_cd"]]
    return df_usr

In [20]:
user_base = get_from_jdata_user()
user_base.head()

| |user_id|age|sex|user_lv_cd|
|-|-|-|-|-|
|0|200001|6.0|2.0|5|
|1|200002|-1.0|0.0|1|
|2|200003|4.0|1.0|4|
|3|200004|-1.0|2.0|1|
|4|200005|2.0|0.0|4|


In [21]:
user_behavior = merge_action_data()
user_behavior.head()

Iteration is stopped
Iteration is stopped
Iteration is stopped


| |user_id|browse_num|addcart_num|delcart_num|buy_num|favor_num|click_num|buy_addcart_ratio|buy_browse_ratio|buy_click_ratio|buy_favor_ratio|
|-|-|-|-|-|-|-|-|-|-|-|-|
|0|200001|212|22|13|1|0|414|0.045455|0.004717|0.002415|1.0|
|1|200002|238|1|0|0|0|484|0.0|0.0|0.0|NaN|
|2|200003|221|4|1|0|1|420|0.0|0.0|0.0|0.0|
|3|200004|52|0|0|0|0|61|NaN|0.0|0.0|NaN|
|4|200005|106|2|3|1|2|161|0.5|0.009434|0.006211|0.5|


In [22]:
# Join into a single table, similar to a SQL LEFT JOIN
user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')
# Save as user_table.csv
user_behavior.to_csv(USER_TABLE_FILE, index=False)
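
A left join keeps every user from `user_base`, including users with no action records, whose behavior columns become NaN. The sketch below uses invented frames to show this; the `indicator=True` option adds a `_merge` column that labels such rows and is an extra not used in the notebook:

```python
import pandas as pd

# Hypothetical inputs: user 2 has no behavior record.
user_base = pd.DataFrame({'user_id': [1, 2, 3], 'user_lv_cd': [5, 1, 4]})
behavior = pd.DataFrame({'user_id': [1, 3], 'buy_num': [2, 0]})

joined = pd.merge(user_base, behavior, on='user_id', how='left', indicator=True)

# Rows labeled 'left_only' exist in user_base but not in behavior.
no_action = joined.loc[joined['_merge'] == 'left_only', 'user_id'].tolist()
print(joined)
```

Because NaN cannot be stored in an integer column, the count columns become floats after the join, which matches the `212.0`-style values in the saved table.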

In [23]:
user_table = pd.read_csv(USER_TABLE_FILE)
user_table.head()

| |user_id|age|sex|user_lv_cd|browse_num|addcart_num|delcart_num|buy_num|favor_num|click_num|buy_addcart_ratio|buy_browse_ratio|buy_click_ratio|buy_favor_ratio|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|200001|6.0|2.0|5|212.0|22.0|13.0|1.0|0.0|414.0|0.045455|0.004717|0.002415|1.0|
|1|200002|-1.0|0.0|1|238.0|1.0|0.0|0.0|0.0|484.0|0.0|0.0|0.0|NaN|
|2|200003|4.0|1.0|4|221.0|4.0|1.0|0.0|1.0|420.0|0.0|0.0|0.0|0.0|
|3|200004|-1.0|2.0|1|52.0|0.0|0.0|0.0|0.0|61.0|NaN|0.0|0.0|NaN|
|4|200005|2.0|0.0|4|106.0|2.0|3.0|1.0|2.0|161.0|0.5|0.009434|0.006211|0.5|
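
With user_table available, cleaning steps 6 and 7 from the workflow (dropping users without any actions, and users who browse heavily but never buy) could be sketched as follows. The toy rows and the `BROWSE_THRESHOLD` value are hypothetical; a real cut-off would have to be chosen by inspecting the data:

```python
import pandas as pd

# Hypothetical slice of user_table; user 4 has no action records at all.
user_table = pd.DataFrame({
    'user_id':    [1, 2, 3, 4],
    'browse_num': [212.0, 5000.0, 52.0, None],
    'buy_num':    [1.0, 0.0, 0.0, None],
})

BROWSE_THRESHOLD = 1000  # assumed cut-off for "very many browses"

no_action = user_table['browse_num'].isna()
inert = (user_table['buy_num'] == 0) & (user_table['browse_num'] > BROWSE_THRESHOLD)

cleaned = user_table[~(no_action | inert)]
print(cleaned['user_id'].tolist())
```

Comparisons against NaN evaluate to False, so the no-action users are caught by the separate `isna()` mask rather than by the inert-user condition.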
