#  Decision Tree on Google PlayStore Apps 

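This time I want to use a decision tree to analyze the __[Google PlayStore apps dataset from Kaggle](https://www.kaggle.com/lava18/google-play-store-apps)__ and identify the factors most closely related to app __installs__. I also hope the results will help grow the app-market numbers of the overseas department I work in.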

## Initialization

In [23]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # static plots
import seaborn as sns  # statistical plots built on top of matplotlib

import plotly.offline as py  # interactive plots (requires the plotly package)
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

import warnings
warnings.filterwarnings('ignore')
from pylab import rcParams  # matplotlib settings, e.g. figure size in inches

%matplotlib inline

ModuleNotFoundError: No module named 'plotly'

## A first look at the dataset

In [13]:
df = pd.read_csv('data/googleplaystore.csv')
print('Number of apps in the dataset : ' , len(df))
df.sample(7)

Number of apps in the dataset :  10841


|  | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2225 | Frozen Free Fall | FAMILY | 4.3 | 1574204 | 37M | 50,000,000+ | Free | 0 | Everyone | Puzzle;Action & Adventure | July 27, 2018 | 6.7.0 | 4.2 and up |
| 10653 | Jigsaw Puzzles FN FAL Light Automatic Rifle | FAMILY |  | 0 | 5.6M | 50+ | Free | 0 | Teen | Puzzle | May 7, 2018 | 1.0 | 4.0.3 and up |
| 3961 | B.GOOD | FOOD_AND_DRINK | 2.9 | 155 | 20M | 10,000+ | Free | 0 | Everyone | Food & Drink | April 4, 2018 | 4.94.19 | 4.1 and up |
| 3810 | JailBase - Arrests + Mugshots | NEWS_AND_MAGAZINES | 4.0 | 17240 | Varies with device | 1,000,000+ | Free | 0 | Everyone 10+ | News & Magazines | August 3, 2018 | Varies with device | 4.1 and up |
| 1056 | CASHIER | FINANCE | 3.3 | 335738 | Varies with device | 10,000,000+ | Free | 0 | Everyone | Finance | May 3, 2018 | Varies with device | Varies with device |
| 1618 | Brit + Co | LIFESTYLE | 3.9 | 987 | 4.5M | 10,000+ | Free | 0 | Everyone | Lifestyle | August 29, 2017 | 2.0.4 | 4.0 and up |
| 1978 | Earn to Die 2 | GAME | 4.6 | 1327269 | 99M | 50,000,000+ | Free | 0 | Teen | Racing | April 12, 2017 | 1.3 | 2.3.3 and up |



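The dataset describes each app with 13 attributes, and the column titles make their meaning fairly self-explanatory.
Here I take **installs** (*Installs*) as the target. To turn this into a classification problem, the install counts can be grouped into ranges: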

| **Installs**         | **Label**         | 
| ------------- |---------------| 
| 0 - 1000    | 1 | 
| 1000 - 5000      | 2      |  
| 5000 - 10k | 3      |
| 10k - 50k | 4      | 
| 50k - 100k | 5      | 
| 100k - 500k | 6      | 
| 500k - 1m | 7      | 
| 1m - 5m | 8      | 
| 5m - 10m | 9      | 
| 10m - 50m | 10      | 
| 50m+ | 11      | 

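Our overseas apps currently have installs on the order of 5 million, so the class representing the highest install count is tentatively set to 50m+ (which I personally think is already a very high bar).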
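
As a rough illustration of the table above, the ranges can be applied with `pandas.cut`. This is only a sketch under assumptions that the table does not state: each range is treated as closed on the left and open on the right (so an install count of exactly 1000 gets label 2, and 50,000,000 gets label 11), and the names `INSTALL_BINS` and `installs_to_label` are hypothetical helpers introduced here for the example.

```python
import numpy as np
import pandas as pd

# Lower edges of the 11 ranges in the table, plus an open upper end for "50m+".
# Assumption: ranges are [low, high), i.e. the lower edge belongs to the bin.
INSTALL_BINS = [0, 1_000, 5_000, 10_000, 50_000, 100_000,
                500_000, 1_000_000, 5_000_000, 10_000_000, 50_000_000, np.inf]
INSTALL_LABELS = list(range(1, 12))  # labels 1..11, one per range


def installs_to_label(installs: pd.Series) -> pd.Series:
    """Map numeric install counts to the class labels 1..11."""
    binned = pd.cut(installs, bins=INSTALL_BINS, labels=INSTALL_LABELS, right=False)
    return binned.astype(int)


# A few install counts that appear in the dataset
example = pd.Series([0, 1_000, 5_000_000, 50_000_000, 1_000_000_000])
labels = installs_to_label(example)
print(labels.tolist())
```

If the opposite boundary convention is preferred (upper edge inclusive), `right=True` together with `include_lowest=True` would be the setting to change.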

## Data preprocessing

In [14]:
print( len(df['Installs'].unique()) , "Installs")
print("\n", df['Installs'].unique())

22 Installs

 ['10,000+' '500,000+' '5,000,000+' '50,000,000+' '100,000+' '50,000+'
 '1,000,000+' '10,000,000+' '5,000+' '100,000,000+' '1,000,000,000+'
 '1,000+' '500,000,000+' '50+' '100+' '500+' '10+' '1+' '5+' '0+' '0'
 'Free']


In [15]:
# - Installs : Remove + and ,

# Drop rows whose Installs value is the non-numeric string 'Free'; copy to avoid chained-assignment issues
df = df[df['Installs'] != 'Free'].copy()
# Strip '+' and ',' in one pass, then convert to integers
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype(int)
print( len(df['Installs'].unique()) , "Installs")
print("\n", df['Installs'].unique())

20 Installs

 [     10000     500000    5000000   50000000     100000      50000
    1000000   10000000       5000  100000000 1000000000       1000
  500000000         50        100        500         10          1
          5          0]
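
The cleaning step above relies on knowing in advance that `'Free'` is the only non-numeric value. A slightly more defensive variant, shown here as a sketch rather than what the notebook actually ran, lets `pandas.to_numeric` turn anything unparseable into `NaN` so such rows can be inspected before being dropped. The function name `parse_installs` is hypothetical, and the sample strings are taken from the unique values printed earlier.

```python
import pandas as pd


def parse_installs(raw: pd.Series) -> pd.Series:
    """Strip '+' and ',' and convert to numbers; unparseable values become NaN."""
    cleaned = raw.astype(str).str.replace(r"[+,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Sample of the raw Installs strings seen in the dataset
raw = pd.Series(["10,000+", "0+", "0", "Free", "1,000,000,000+"])
parsed = parse_installs(raw)

# Rows that could not be parsed can be reviewed before deciding to drop them
print(raw[parsed.isna()].tolist())
```

Because `NaN` forces a float column, an integer conversion such as `parsed.dropna().astype(int)` would still be needed afterwards.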


In [None]:
data = [go.Histogram(
        x = df.Rating,
        xbins = {'start': 1, 'size': 0.1, 'end' :5}
)]

print('Average app rating = ', np.mean(df['Rating']))
py.iplot(data, filename='overall_rating_distribution')  # plotly.offline was imported as py