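Project background:
In the internet era, e-commerce platforms have made life easier for online shoppers: they widen the range of choice, display discounts on all kinds of goods more directly, and lower the cost of buying. Online shopping has gradually become part of everyday life. Taobao is a leading company in the e-commerce sector and has set many records, such as the "Double 11" shopping festival and single-day transaction volume on the order of ten billion yuan. The platform iterates continuously and offers increasingly personalized services. Taobao was founded in 2003; by 2012 it had nearly 500 million registered members and more than 120 million daily active users. The analysis below uses behavior data from a random sample of Taobao users between 18 November 2014 and 18 December 2014 to study user behavior, identify problems, and propose optimizations.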

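# Data description
Data source: https://www.kesci.com/mw/project/5fc84f4f6571040030a416a8/dataset

The dataset is described as containing roughly 1.04 million records (the file loaded below has 2,440,293 rows). It covers user behavior in the Taobao app from 18 November 2014 to 18 December 2014 and has 6 fields.

- user_id: user identifier (anonymized)
- item_id: item ID (anonymized)
- behavior_type: type of user behavior (click, favorite, add to cart, payment, encoded as 1, 2, 3, 4 respectively)
- user_geohash: geographic location
- item_category: category ID (the category the item belongs to)
- time: time at which the behavior occurred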


## 1. Data loading and description

In [494]:
import numpy as np
import pandas as pd
from pyecharts.charts import *
import pyecharts.options as opts
from pyecharts.components import Table
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB
# Declaring CurrentConfig.ONLINE_HOST once at the top is sufficient
CurrentConfig.ONLINE_HOST = "http://127.0.0.1:8000/assets/"
# From here on, all charts load their static assets from the local server started beforehand
from pyecharts.charts import Bar
bar = Bar()
from pyecharts.globals import ThemeType
from pyecharts.commons.utils import JsCode
from pyecharts.options import ComponentTitleOpts
import warnings
# Suppress warnings
warnings.filterwarnings('ignore')

In [318]:
tb=pd.read_csv('./淘宝用户行为.csv')

In [6]:
tb.head()

Unnamed: 0,user_id,item_id,behavior_type,user_geohash,item_category,time
0,98047837,232431562,1,,4245,2014-12-06 02
1,97726136,383583590,1,,5894,2014-12-09 20
2,98607707,64749712,1,,2883,2014-12-18 11
3,98662432,320593836,1,96nn52n,6562,2014-12-06 10
4,98145908,290208520,1,,13926,2014-12-16 21


In [9]:
tb.describe()

Unnamed: 0,user_id,item_id,behavior_type,item_category
count,2440293.0,2440293.0,2440293.0,2440293.0
mean,71740330.0,202300300.0,1.105082,6844.19
std,41226210.0,116692500.0,0.4570049,3808.725
min,4913.0,581.0,1.0,2.0
25%,35967320.0,101490500.0,1.0,3723.0
50%,72990070.0,202135600.0,1.0,6209.0
75%,107391500.0,303394700.0,1.0,10286.0
max,142455900.0,404562400.0,4.0,14080.0


## 2. Data cleaning (missing values, duplicates)

In [137]:
tb.isnull().sum()

user_id                0
item_id                0
behavior_type          0
user_geohash     1660562
item_category          0
time                   0
dtype: int64

In [138]:
tb.describe(include=['O'])

Unnamed: 0,user_geohash,time
count,779731,2440293
unique,298868,745
top,94ek6lj,2014-12-11 22
freq,213,10862


In [145]:
tb[tb.duplicated()]

Unnamed: 0,user_id,item_id,behavior_type,item_category,time,date,hour
51,103802946,194298205,1,11406,2014-12-18 21:00:00,2014-12-18,21
75,103891828,149380817,1,7876,2014-12-08 21:00:00,2014-12-08,21
107,116730636,303940848,1,11956,2014-12-15 12:00:00,2014-12-15,12
122,104811265,26017196,1,10585,2014-12-12 22:00:00,2014-12-12,22
144,100684618,278753736,1,1606,2014-12-14 11:00:00,2014-12-14,11
...,...,...,...,...,...,...,...
2440277,65013453,245001602,1,7223,2014-11-21 07:00:00,2014-11-21,7
2440279,89153511,98844681,1,11520,2014-11-29 11:00:00,2014-11-29,11
2440281,100917601,20284593,1,10213,2014-12-02 22:00:00,2014-12-02,22
2440283,65013453,245001602,1,7223,2014-11-21 07:00:00,2014-11-21,7


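- Many rows are flagged as duplicates only because time is recorded to the hour: the same user repeating the same action on the same item within one hour produces identical rows. These are genuine repeated events, so they are not removed.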
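
To make the point about hourly resolution concrete, here is a minimal sketch on hypothetical toy rows (the values are made up, only the column names follow the dataset). It shows how `duplicated()` flags a repeated click and how the repeats can be counted instead of dropped.

```python
import pandas as pd

# Hypothetical toy rows: user 1 clicks item 10 twice within the same hour.
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 10, 10],
    "behavior_type": [1, 1, 1],
    "item_category": [5, 5, 5],
    "time": ["2014-12-06 02", "2014-12-06 02", "2014-12-06 02"],
})

# duplicated() flags every repeat after the first occurrence of a full row.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Count how often each distinct row occurs instead of dropping the repeats.
event_counts = (
    toy.groupby(list(toy.columns))
    .size()
    .reset_index(name="n_events")
)
print(n_duplicates)
print(event_counts)
```

Keeping the repeats preserves the true event volume, which matters for PV-style metrics that count every record.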

In [319]:
del tb['user_geohash']

In [320]:
tb.drop(2440292,inplace = True)

- The row at index 2440292 is dropped because its date was found to be anomalous.

In [321]:
tb['date'] = tb['time'].apply(lambda x:x.split(' ')[0])
tb['hour'] = tb['time'].apply(lambda x:x.split(' ')[1])

In [322]:
tb['time']=pd.to_datetime(tb['time'])
tb['date']=pd.to_datetime(tb['date'])
tb['hour']=tb['hour'].astype('int64')

- time and date are converted to datetime and hour to int64, which makes later time-based sorting and grouping easier.
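
As an alternative to splitting the string by hand, the same three columns can be derived from one parsed timestamp. This is a sketch, assuming every value follows the "YYYY-MM-DD HH" pattern seen in the sample rows; the variable names are illustrative.

```python
import pandas as pd

# Two sample values in the format shown by tb.head().
raw = pd.Series(["2014-12-06 02", "2014-12-09 20"])

# Parse once with an explicit format, then derive date and hour from the result.
ts = pd.to_datetime(raw, format="%Y-%m-%d %H")
date = ts.dt.normalize()   # midnight of the same day, still a datetime
hour = ts.dt.hour          # integer hour of day, 0 to 23

print(pd.DataFrame({"time": ts, "date": date, "hour": hour}))
```

An explicit format makes parsing fail loudly on malformed values rather than guessing, which can help surface anomalous rows early.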

In [323]:
tb.loc[tb['behavior_type']==1,'behavior_type']='click'
tb.loc[tb['behavior_type']==2,'behavior_type']='fav'
tb.loc[tb['behavior_type']==3,'behavior_type']='car'
tb.loc[tb['behavior_type']==4,'behavior_type']='buy'

- The numeric codes are mapped to the corresponding behaviors: click, favorite (fav), add to cart (car), and payment (buy).
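
The same recoding can be written as a single dictionary lookup. The sketch below uses a small hypothetical series of codes; `BEHAVIOR_LABELS` is an illustrative name, while the labels themselves are the ones used in this notebook.

```python
import pandas as pd

# Code-to-label mapping as used in this notebook (the constant name is illustrative).
BEHAVIOR_LABELS = {1: "click", 2: "fav", 3: "car", 4: "buy"}

# Hypothetical sample of raw codes.
codes = pd.Series([1, 4, 2, 3, 1])

# map() translates every code in one pass; unknown codes would become NaN.
labels = codes.map(BEHAVIOR_LABELS)
unmapped = int(labels.isna().sum())

print(labels.tolist())
print(unmapped)
```

Checking `unmapped` afterwards is a cheap way to confirm that no unexpected code slipped through.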

### The data covers Taobao app user behavior from 18 November 2014 to 18 December 2014; check that every record falls within this range.

In [324]:
tb[(tb['date']>'2014-12-18') | (tb['date']<'2014-11-18')]

Unnamed: 0,user_id,item_id,behavior_type,item_category,time,date,hour


- The empty result confirms that no record falls outside the expected date range.
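
The range check can also be expressed with `Series.between`, which is inclusive on both ends by default. A minimal sketch on hypothetical dates, two of which lie just outside the window:

```python
import pandas as pd

# Hypothetical dates: one day before the window, both boundaries, one day after.
dates = pd.to_datetime(
    pd.Series(["2014-11-17", "2014-11-18", "2014-12-18", "2014-12-19"])
)

start = pd.Timestamp("2014-11-18")
end = pd.Timestamp("2014-12-18")

# between() includes both boundaries, matching the stated observation window.
in_window = dates.between(start, end)
out_of_range = dates[~in_window]

print(out_of_range.tolist())
```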

In [150]:
tb.describe(include=['O'])

Unnamed: 0,behavior_type
count,2440292
unique,4
top,click
freq,2299962


In [325]:
tb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2440292 entries, 0 to 2440291
Data columns (total 7 columns):
 #   Column         Dtype         
---  ------         -----         
 0   user_id        int64         
 1   item_id        int64         
 2   behavior_type  object        
 3   item_category  int64         
 4   time           datetime64[ns]
 5   date           datetime64[ns]
 6   hour           int64         
dtypes: datetime64[ns](2), int64(4), object(1)
memory usage: 148.9+ MB


## Macro-level metrics (the dataset covers a fixed period around an in-app sales festival)


In [326]:
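# PV: total page views (here, the total number of behavior records)
total_pv = tb["user_id"].count()
# UV: number of unique visitors
total_uv = tb["user_id"].nunique()
# OD: total number of orders (records whose behavior_type is "buy")
total_od = tb[tb['behavior_type']=="buy"].behavior_type.count()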

In [327]:
table = Table()

headers = ["Metric", "Value"]
rows = [
    ["PV: total page views",total_pv],
    ["UV: unique visitors",total_uv],
    ["OD: total orders",total_od],
    ["OD/PV: view-to-purchase conversion rate", "{:.2f}%".format((total_od/total_pv)*100)],
    ["PV/UV: average page views per visitor", total_pv/total_uv],
    ["OD/UV: average orders per visitor",total_od/total_uv],
]
table.add(headers, rows)
table.set_global_opts(
    title_opts=ComponentTitleOpts(title="Sales during the Double 12 period", subtitle="Macro-level metrics")
)
table.render_notebook()

Metric,Value
PV: total page views,2440292
UV: unique visitors,9979
OD: total orders,23987
OD/PV: view-to-purchase conversion rate,0.98%
PV/UV: average page views per visitor,244.54273975348232
OD/UV: average orders per visitor,2.4037478705281092
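
As a quick sanity check, the three ratios can be recomputed from the PV, UV, and OD totals reported above. The sketch uses plain arithmetic; the variable names for the ratios are illustrative.

```python
# Totals taken from the table above.
total_pv = 2440292   # behavior records (page views)
total_uv = 9979      # unique visitors
total_od = 23987     # orders

conversion = total_od / total_pv            # share of records that are purchases
views_per_visitor = total_pv / total_uv     # average records per visitor
orders_per_visitor = total_od / total_uv    # average orders per visitor

print(f"OD/PV = {conversion:.2%}")
print(f"PV/UV = {views_per_visitor:.2f}")
print(f"OD/UV = {orders_per_visitor:.2f}")
```

Note that OD/PV divides by all records, including favorites, cart additions, and the purchases themselves, so it is a rough record-level rate rather than a per-session conversion rate.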


## Traffic acquisition around Double 12

### User growth curve (or returning-user growth curve)

In [328]:
tb

Unnamed: 0,user_id,item_id,behavior_type,item_category,time,date,hour
0,98047837,232431562,click,4245,2014-12-06 02:00:00,2014-12-06,2
1,97726136,383583590,click,5894,2014-12-09 20:00:00,2014-12-09,20
2,98607707,64749712,click,2883,2014-12-18 11:00:00,2014-12-18,11
3,98662432,320593836,click,6562,2014-12-06 10:00:00,2014-12-06,10
4,98145908,290208520,click,13926,2014-12-16 21:00:00,2014-12-16,21
...,...,...,...,...,...,...,...
2440287,100917601,170857979,click,4605,2014-11-21 10:00:00,2014-11-21,10
2440288,95721550,346091674,click,5273,2014-12-18 19:00:00,2014-12-18,19
2440289,65013453,403663602,click,6020,2014-11-21 17:00:00,2014-11-21,17
2440290,100917601,359736366,click,5332,2014-12-01 21:00:00,2014-12-01,21


In [329]:
da_group=tb.groupby(['user_id','date']).count().reset_index()
da_group_drop=da_group.drop_duplicates(subset=['user_id'],keep='first').sort_values('date')

new_user=da_group_drop.groupby('date')['user_id'].count().reset_index().rename(columns={'user_id':'n_user'})
new_pv=da_group_drop.groupby('date')['behavior_type'].sum().reset_index().rename(columns={'behavior_type':'n_pv'})

In [330]:
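# Drop the first day of the window from both series so that they stay aligned
# (on the first day every active user counts as new, which is presumably why it is excluded).
new_user.drop(0,inplace = True)
new_pv.drop(0,inplace = True)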

In [331]:
attr = new_user.date.astype(str)
n_pv = new_pv.n_pv
n_user = new_user.n_user

bar = (Bar()
       .add_xaxis(attr.tolist())
       .add_yaxis('Daily new UV', n_user.values.tolist(), yaxis_index=0)
       # Add a second Y axis
       .extend_axis(
            yaxis=opts.AxisOpts(
                type_="value",
                position="right",
                axislabel_opts=opts.LabelOpts(formatter="{value} views"))
        )
       .set_global_opts(
        tooltip_opts=opts.TooltipOpts(
            is_show=True,trigger="axis",axis_pointer_type="cross"
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            axispointer_opts=opts.AxisPointerOpts(is_show=True,type_="shadow"),
        ),
        yaxis_opts=opts.AxisOpts(
            min_=0,
            max_=2000,
            interval=100,
            axislabel_opts=opts.LabelOpts(formatter="{value} users"),
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        title_opts=opts.TitleOpts(title="Daily new UV vs. PV generated by those new users",pos_left='center'),
        legend_opts=opts.LegendOpts(is_show=True,pos_top='95%')
    )
      )
bar.render_notebook()

line = (Line()
       .add_xaxis(attr.values.tolist())
       .add_yaxis('PV of daily new UV', n_pv.values.tolist(),yaxis_index=1,
                 label_opts=opts.LabelOpts(is_show=False))
        
      )
overlap = bar.overlap(line)
overlap.load_javascript()


<pyecharts.render.display.Javascript at 0x7f80130ee9e8>

In [495]:
overlap.render_notebook()


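Because the dataset contains no information about users' activity before the observation window, a user counted here is not necessarily a newly acquired user. The chart only shows, for each day, how many users in the sample made their first visit to the app within the window.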
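
The "first visit within the window" logic can be sketched explicitly: find each user's earliest date, then count users and records on that date. The toy data below is hypothetical, and `first_date`, `n_user`, and `n_pv` are illustrative column names.

```python
import pandas as pd

# Hypothetical activity log: one row per behavior record.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2014-11-18", "2014-11-19", "2014-11-19",
        "2014-11-19", "2014-11-19", "2014-11-20",
    ]),
})

# Earliest date on which each user appears in the window.
first_day = (
    toy.groupby("user_id")["date"].min()
    .rename("first_date")
    .reset_index()
)

# Keep only the records a user produced on their first day.
merged = toy.merge(first_day, on="user_id")
on_first_day = merged[merged["date"] == merged["first_date"]]

# First-time users per day and the page views they generated that day.
daily_new = on_first_day.groupby("date").agg(
    n_user=("user_id", "nunique"),
    n_pv=("user_id", "size"),
)
print(daily_new)
```

Computing both columns in one aggregation keeps them on the same date index, so the two chart series cannot drift out of alignment.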

In [500]:
a= da_group.set_index(['date'])
a= a.sort_values('date')
a

date,user_id,item_id,behavior_type,item_category,time,hour
2014-11-18,4913,8,8,8,8,8
2014-11-18,35014928,8,8,8,8,8
2014-11-18,74624581,4,4,4,4,4
2014-11-18,135539714,1,1,1,1,1
2014-11-18,92221962,1,1,1,1,1
...,...,...,...,...,...,...
2014-12-18,128323172,11,11,11,11,11
2014-12-18,74452601,4,4,4,4,4
2014-12-18,74418729,16,16,16,16,16
2014-12-18,74820346,1,1,1,1,1


In [501]:
uvnew= a.groupby('date')['user_id'].count().reset_index().rename(columns={'user_id':'uv'})

In [502]:
pvnew=tb.groupby('date')['behavior_type'].count().reset_index().rename(columns={'behavior_type':'pv'})
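
Daily PV and UV can also be obtained in a single grouped aggregation. This is a sketch on hypothetical toy data using pandas named aggregation; `daily` and `pv_per_uv` are illustrative names.

```python
import pandas as pd

# Hypothetical activity log: one row per behavior record.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "date": pd.to_datetime([
        "2014-12-11", "2014-12-11", "2014-12-11",
        "2014-12-12", "2014-12-12",
    ]),
})

# PV counts every record; UV counts distinct users per day.
daily = (
    toy.groupby("date")
    .agg(pv=("user_id", "size"), uv=("user_id", "nunique"))
    .reset_index()
)
daily["pv_per_uv"] = daily["pv"] / daily["uv"]
print(daily)
```

Having PV and UV in one frame guarantees that both chart series share the same dates.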


In [503]:
attr=pvnew.date.astype('str').tolist()
pv=(
    Line(init_opts=opts.InitOpts(width="1000px",height="500px"))
    .add_xaxis(xaxis_data=attr)
    .add_yaxis(
        "Page views (PV)",
        np.around(pvnew.pv/10000,decimals=2).tolist(),
        label_opts=opts.LabelOpts(is_show=False)
    )
    .add_yaxis(
        series_name="Unique visitors (UV)",
        yaxis_index=1,
        y_axis=uvnew.uv.tolist(),
        label_opts=opts.LabelOpts(is_show=False),
    )
    .extend_axis(
        yaxis=opts.AxisOpts(
            name="uv",
            type_="value",
            min_=0,
            max_=10000,
            interval=2000,
            axislabel_opts=opts.LabelOpts(formatter="{value} users"),
        )
    )
    .set_global_opts(
        tooltip_opts=opts.TooltipOpts(
            is_show=True,trigger="axis",axis_pointer_type="cross"
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            axispointer_opts=opts.AxisPointerOpts(is_show=True,type_="shadow"),
        ),
        yaxis_opts=opts.AxisOpts(
            name="pv",
            type_="value",
            min_=0,
            max_=20,
            interval=5,
            axislabel_opts=opts.LabelOpts(formatter="{value} (10k views)"),
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        title_opts=opts.TitleOpts(title="PV and UV by date",pos_left='center'),
        legend_opts=opts.LegendOpts(is_show=True,pos_top='95%')
    )
    .set_series_opts(
        # Hide the value labels so they do not clutter the mark points
        label_opts=opts.LabelOpts(is_show=False),
        markpoint_opts=opts.MarkPointOpts(
            data=[
                opts.MarkPointItem(type_="max", name="max", value_index=1),
                opts.MarkPointItem(type_="min", name="min", value_index=1),
                
            ]))
)

pv.render_notebook()

In [504]:
hours = tb.groupby('hour').count()
hours= hours['user_id']

In [505]:
hours = hours.reset_index()
hours.user_id

0     103106
1      53557
2      29311
3      19615
4      15982
5      17648
6      31628
7      57456
8      79036
9      96520
10    109438
11    105100
12    105582
13    119367
14    118212
15    119036
16    114313
17    100666
18    108800
19    146262
20    185678
21    217423
22    217265
23    169291
Name: user_id, dtype: int64
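
The hourly counts printed above can be summarized with a few lines of plain Python. The sketch copies the 24 values from that output; `evening_share` (hours 19 to 23) is an illustrative summary, not a metric defined in the original analysis.

```python
# Hourly PV copied from the output above, index = hour of day (0 to 23).
hourly_pv = [
    103106, 53557, 29311, 19615, 15982, 17648, 31628, 57456,
    79036, 96520, 109438, 105100, 105582, 119367, 118212, 119036,
    114313, 100666, 108800, 146262, 185678, 217423, 217265, 169291,
]

total = sum(hourly_pv)
peak_hour = max(range(24), key=lambda h: hourly_pv[h])
quiet_hour = min(range(24), key=lambda h: hourly_pv[h])

# Share of all records that fall between 19:00 and 23:59.
evening_share = sum(hourly_pv[19:24]) / total

print(peak_hour, quiet_hour, f"{evening_share:.1%}")
```

The total matches the 2,440,292 records in the cleaned dataset, with activity peaking at 21:00 and bottoming out at 04:00.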

In [506]:
attr = hours.index


bar1 = (Bar()
       .add_xaxis(attr.tolist())
       .add_yaxis('PV by hour of day', y_axis =(np.around(hours.user_id/1000,decimals=0)).to_list())
       .set_global_opts(
        tooltip_opts=opts.TooltipOpts(
            is_show=True,trigger="axis",axis_pointer_type="cross"
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            axispointer_opts=opts.AxisPointerOpts(is_show=True,type_="shadow"),
        ),
        yaxis_opts=opts.AxisOpts(
            min_=0,
            max_=300,
            interval=10,
            axislabel_opts=opts.LabelOpts(formatter="{value}k views"),
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        title_opts=opts.TitleOpts(title="PV by hour of day",pos_left='center'),
        legend_opts=opts.LegendOpts(is_show=True,pos_top='95%')
    )
      )

bar1.load_javascript()

<pyecharts.render.display.Javascript at 0x7f8004e8c160>

In [507]:
bar1.render_notebook()

In [509]:
clk = tb[tb['behavior_type'] == 'click']
fav = tb[tb['behavior_type'] == 'fav']
car = tb[tb['behavior_type'] == 'car']
buy = tb[tb['behavior_type'] == 'buy']

In [510]:
clk1 = clk.groupby('date').count().reset_index()
fav1 = fav.groupby('date').count().reset_index()
car1 = car.groupby('date').count().reset_index()
buy1 = buy.groupby('date').count().reset_index()
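
The four per-behavior tables can also be built as one date-by-behavior table, which fills in zeros for days on which a behavior never occurs. This is a sketch on hypothetical toy data; `daily_counts` is an illustrative name.

```python
import pandas as pd

# Hypothetical activity log using the behavior labels from this notebook.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-12-11", "2014-12-11", "2014-12-11",
        "2014-12-12", "2014-12-12",
    ]),
    "behavior_type": ["click", "click", "buy", "click", "fav"],
})

# One row per date, one column per behavior, zero where nothing happened.
daily_counts = (
    toy.groupby(["date", "behavior_type"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["click", "fav", "car", "buy"], fill_value=0)
)
print(daily_counts)
```

Because every column shares the same date index, each series lines up with the x-axis even if some behavior is missing on a given day.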

In [512]:
attr=list(pvnew.date.astype('str').tolist())
action =(
    Line(init_opts=opts.InitOpts(width="1000px",height="500px"))
    .add_xaxis(xaxis_data=attr)
    .add_yaxis(
        "Clicks",
        y_axis = np.around(clk1.behavior_type/100,decimals=2).tolist(),
        label_opts=opts.LabelOpts(is_show=False)
    )
    .add_yaxis(
        'Favorites',
        y_axis =np.around(fav1.behavior_type/100,decimals=2).tolist(),
        yaxis_index=1,
        label_opts=opts.LabelOpts(is_show=False),
    )
    .add_yaxis(
        'Add to cart',
        y_axis =np.around(car1.behavior_type/100,decimals=2).tolist(),
        yaxis_index=1,
        label_opts=opts.LabelOpts(is_show=False),
    )
    .add_yaxis(
        'Purchases',
        y_axis =np.around(buy1.behavior_type/100,decimals=2).tolist(),
        yaxis_index=1,
        label_opts=opts.LabelOpts(is_show=False),
    )
    .extend_axis(
        yaxis=opts.AxisOpts(
            name="fav / cart / buy",
            type_="value",
            min_=0,
            max_=100,
            interval=20,
            axislabel_opts=opts.LabelOpts(formatter="{value} (hundreds)"),
        )
    )
    .set_global_opts(
        tooltip_opts=opts.TooltipOpts(
            is_show=True,trigger="axis",axis_pointer_type="cross"
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            axispointer_opts=opts.AxisPointerOpts(is_show=True,type_="shadow"),
        ),
        yaxis_opts=opts.AxisOpts(
            name="num",
            type_="value",
            min_=0,
            max_=1500,
            interval=100,
            axislabel_opts=opts.LabelOpts(formatter="{value} (hundreds)"),
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=True),
        ),
        title_opts=opts.TitleOpts(title="Daily counts of the four user behaviors",pos_left='center'),
        legend_opts=opts.LegendOpts(is_show=True,pos_top='95%')
    )
    .set_series_opts(
        # Hide the value labels so they do not clutter the mark points
        label_opts=opts.LabelOpts(is_show=False),
        markpoint_opts=opts.MarkPointOpts(
            data=[
                opts.MarkPointItem(type_="max", name="max", value_index=1),
                opts.MarkPointItem(type_="min", name="min", value_index=1),
                
            ]))
)
action.load_javascript()

<pyecharts.render.display.Javascript at 0x7f8004e7d048>

In [513]:
action.render_notebook()