# SVM on Google PlayStore Apps 

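In the previous [assignment](https://github.com/changyiZ/ivy1/blob/master/google-play-store_decision-tree/google_play_store_apps.ipynb), I applied a **Decision Tree** to the Google PlayStore dataset.<br>
The goal was to predict each app's download count, bucketed at the million and ten-million levels, from its basic information.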

#### Accuracy of the **Decision Tree** family of algorithms
- DecisionTree 0.867619926199262
- RandomForest 0.8906826568265682
- AdaBoost 0.889760147601476

In this assignment I use SVM for comparison, to see how efficiency and accuracy change, and try to analyze why.

### Data preprocessing

#### Initialization

In [1]:
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

#### Data processing

In [2]:
df = pd.read_csv('data/googleplaystore.csv')
print('Number of apps in the dataset : ' , len(df))
df.sample(5)

Number of apps in the dataset :  10841


| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9113 | News Dz | SOCIAL | | 3 | 9.9M | 10+ | Free | 0 | Everyone | Social | July 19, 2017 | 1.0 | 4.0 and up |
| 9907 | E.U. Trademark Search Tool | BUSINESS | | 0 | 3.1M | 10+ | Free | 0 | Everyone | Business | March 29, 2018 | 1.5 | 4.0.3 and up |
| 8609 | Svenska Dagbladet | NEWS_AND_MAGAZINES | 2.6 | 820 | Varies with device | 100,000+ | Free | 0 | Everyone | News & Magazines | February 13, 2018 | Varies with device | Varies with device |
| 10678 | HAL-9000 - FN Theme | PERSONALIZATION | 3.5 | 159 | 257k | 10,000+ | Free | 0 | Everyone | Personalization | August 16, 2013 | 1.0 | 2.2 and up |
| 1516 | Best New Ringtones 2018 Free 🔥 For Android™ | LIBRARIES_AND_DEMO | 4.6 | 3014 | 21M | 100,000+ | Free | 0 | Everyone | Libraries & Demo | June 27, 2018 | 1.1 | 5.0 and up |


In [3]:
df['Rating'] = df['Rating'].fillna(df['Rating'].median())
index = df[df['Rating'] == 19.].index
df = df.drop(index)

df = df[pd.notnull(df['Last Updated'])]
df = df[pd.notnull(df['Content Rating'])]

In [4]:
def map_content_rating(content_rating):
    if 'Teen' in content_rating:
        return 1
    elif 'Everyone 10+' in content_rating:
        return 2
    elif 'Mature 17+' in content_rating:
        return 3
    elif 'Adults only 18+' in content_rating:
        return 4
    else:
        return 0


# Encode Content Rating features
df['Content Rating'] = df['Content Rating'].map(map_content_rating)

In [5]:
df['Size'] = df['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(x))
df['Size'] = df['Size'].fillna(df['Size'].mean())


def map_size(size):
    if size < 5.0:
        return 1
    elif size < 10.0:
        return 2
    elif size < 20.0:
        return 3
    elif size < 50.0:
        return 4
    elif size < 100.0:
        return 5
    else:
        return 6


df['Size'] = df['Size'].map(map_size)


def map_reviews(number):
    number = int(number)
    if number < 10:
        return 1
    elif number < 100:
        return 2
    elif number < 1000:
        return 3
    elif number < 10000:
        return 4
    elif number < 100000:
        return 5
    elif number < 1000000:
        return 6
    elif number < 10000000:
        return 7
    else:
        return 8


df['Reviews'] = df['Reviews'].map(map_reviews)
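
The threshold ladders in `map_size` and `map_reviews` could also be written with `pd.cut`. The following is only a sketch under assumptions: the sample values are made up for illustration, and the names `sizes_mb`, `reviews`, `size_bins`, and `review_bins` do not appear in the notebook. With `right=False` each interval is closed on the left, which should match the `<` comparisons used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values (not taken from the dataset).
sizes_mb = pd.Series([0.257, 5.0, 9.9, 21.0, 99.9, 150.0])
reviews = pd.Series([3, 820, 3014, 10_000_000])

# Bin edges mirror the thresholds in map_size; right=False gives intervals
# of the form [low, high), so 5.0 lands in bucket 2, as it does in map_size.
size_edges = [-np.inf, 5, 10, 20, 50, 100, np.inf]
size_bins = pd.cut(
    sizes_mb, bins=size_edges, labels=[1, 2, 3, 4, 5, 6], right=False
).astype(int)

# Same idea for map_reviews: one bucket per power of ten up to 10 million.
review_edges = [-np.inf, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000, np.inf]
review_bins = pd.cut(
    reviews, bins=review_edges, labels=[1, 2, 3, 4, 5, 6, 7, 8], right=False
).astype(int)

print(size_bins.tolist())
print(review_bins.tolist())
```

Keeping the edges in a list makes the bucket boundaries easier to adjust than a chain of `elif` branches, though the explicit functions above are just as valid.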

In [6]:
# Map the minimum Android version string to its major version number (0 if unknown)
def map_version(version):
    version = str(version)
    if version.startswith("1."):
        return 1
    elif version.startswith("2."):
        return 2
    elif version.startswith("3."):
        return 3
    elif version.startswith("4."):
        return 4
    elif version.startswith("5."):
        return 5
    elif version.startswith("6."):
        return 6
    elif version.startswith("7."):
        return 7
    elif version.startswith("8."):
        return 8
    else:
        return 0


df['Android Ver'] = df['Android Ver'].map(map_version)
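
As an alternative sketch, the major version can be pulled out with a regular expression instead of a chain of `startswith` checks. The sample values and the names `android_ver` and `major` are assumptions for illustration; the pattern is meant to reproduce the behavior of `map_version` (versions 1 to 8, everything else mapped to 0).

```python
import pandas as pd

# Hypothetical sample values in the style of the 'Android Ver' column.
android_ver = pd.Series(
    ["4.0 and up", "4.0.3 and up", "2.2 and up", "Varies with device", None]
)

# Capture a leading major version 1-8 followed by a dot. Non-matching rows
# become missing values and are filled with "0", like the else branch of map_version.
major = (
    android_ver.astype(str)
    .str.extract(r"^([1-8])\.", expand=False)
    .fillna("0")
    .astype(int)
)

print(major.tolist())
```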

In [7]:
df['Price'] = df['Price'].apply(lambda x: x.strip('$'))


def map_price(price):
    price = float(price)
    if price > 10.0:
        return 2
    elif price > 0.0:
        return 1
    else:
        return 0


df['Price'] = df['Price'].map(map_price)

# Encode Category features
le = preprocessing.LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])

In [8]:
def map_installs(number):
    number = int(number)
    if number < 1000000:
        return 0
    elif number < 10000000:
        return 1
    else:
        return 2


# Installs cleaning
df['Installs'] = df['Installs'].apply(lambda x: x.strip('+').replace(',', ''))
df['Installs'] = df['Installs'].map(map_installs)

The preprocessing above is essentially the same as in the DT assignment, so I won't go over it again here.

#### Preparing the dataset

In [9]:
X_all = df.drop(['Installs', 'App', 'Last Updated', 'Type', 'Current Ver', 'Genres'], axis=1)
y_all = df['Installs']

In [10]:
X_all.sample(20)

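| | Category | Rating | Reviews | Size | Price | Content Rating | Android Ver |
|---|---|---|---|---|---|---|---|
| 595 | 7 | 4.1 | 4 | 3 | 0 | 3 | 4 |
| 3359 | 23 | 4.5 | 6 | 3 | 0 | 0 | 5 |
| 7773 | 25 | 3.7 | 4 | 4 | 0 | 0 | 4 |
| 8401 | 14 | 4.2 | 4 | 4 | 0 | 1 | 3 |
| 379 | 6 | 3.7 | 5 | 5 | 0 | 0 | 4 |
| 7665 | 14 | 3.5 | 2 | 4 | 1 | 0 | 4 |
| 4966 | 11 | 4.5 | 3 | 4 | 0 | 0 | 4 |
| 5646 | 14 | 4.7 | 4 | 4 | 0 | 0 | 2 |
| 4938 | 24 | 4.8 | 3 | 1 | 1 | 0 | 4 |
| 9725 | 11 | 3.9 | 3 | 4 | 0 | 0 | 2 |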


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.20, random_state=10)

### SVM classification

In [12]:
def svm_cv(kernel, params):
    # Type of scoring used to compare parameter combinations
    acc_scorer = make_scorer(accuracy_score)
    clf = GridSearchCV(svm.SVC(kernel=kernel), params, scoring=acc_scorer)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print(clf.best_params_)

    # Set the clf to the best combination of parameters
    clf = clf.best_estimator_
    predictions = clf.predict(X_test)
    print(kernel, accuracy_score(y_test, predictions))


# Candidate hyperparameter values for the cross-validated grid search
Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]
degrees = [0, 1, 2, 3, 4, 5, 6]

svm_cv('linear', [{'C': Cs}])
svm_cv('rbf', [{'gamma': gammas, 'C': Cs}])

Best parameters set found on development set:
{'C': 1}
linear 0.8948339483394834
Best parameters set found on development set:
{'C': 10, 'gamma': 0.01}
rbf 0.8943726937269373


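After tuning with GridSearchCV, the accuracy of **SVM** with the different kernels is higher than that of **Decision Tree** (about 0.894 versus 0.868 for DecisionTree and 0.891 for RandomForest), but the gain is small.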

The runtime, however, is far longer than with **DT**.<br>
The *poly* kernel takes too long to run, so only a screenshot of its results is included here.

![SVM poly result](svm_poly_results.png)

All three kernels reach an accuracy of about 89.48% with their best parameters.
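
To put rough numbers on the runtime gap, fit times can be measured directly. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Play Store dataset), and the helper `fit_seconds` is hypothetical. Absolute timings depend on the machine and the data, so it shows the measuring approach only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Play Store dataset):
# 7 features and 3 classes, loosely mirroring the shape of X_all / y_all.
X, y = make_classification(
    n_samples=2000, n_features=7, n_informative=5, n_classes=3, random_state=10
)


def fit_seconds(model):
    """Return the wall-clock seconds taken by a single fit call."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


timings = {
    "DecisionTree": fit_seconds(DecisionTreeClassifier(random_state=10)),
    "SVC linear": fit_seconds(SVC(kernel="linear")),
    "SVC rbf": fit_seconds(SVC(kernel="rbf")),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
```

A grid search multiplies each of these costs by the number of parameter combinations and cross-validation folds, which is likely part of why the SVM runs here feel so much slower.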

### Tuning

Because SVM is sensitive to the scale of its input features, I try **Feature Scaling** here to process the data further and improve accuracy.

In [13]:
print('origin: ')
print(X_all.sample(5))
x_scaled = StandardScaler().fit_transform(X_all.values)
X_all = pd.DataFrame(x_scaled)
print('scaled: ')
print(X_all.sample(5))

X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.20, random_state=10)

origin: 
      Category  Rating  Reviews  Size  Price  Content Rating  Android Ver
8320        25     4.2        6     4      0               0            0
5337        15     4.3        3     4      0               0            0
6871        18     4.3        1     1      0               0            4
8253        11     4.3        6     5      0               0            4
995          9     4.6        4     4      1               0            0
scaled: 
             0         1         2         3         4         5         6
4319 -0.326277  1.027490 -0.389908  0.736706 -0.270816  2.204360  0.489626
8981  0.511753  0.611101 -0.389908  0.736706 -0.270816 -0.430905  0.489626
5269  0.751190  0.194712 -1.453070  0.736706  3.031369 -0.430905  0.489626
4796  1.708938  0.194712  1.204835  0.736706 -0.270816  0.886728 -2.181510
751  -1.044589 -0.221677  0.673254  0.736706 -0.270816 -0.430905 -2.181510
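
The cell above fits `StandardScaler` on all of `X_all` before splitting, so the test rows influence the scaling statistics. One common alternative, sketched here under assumptions, is to put the scaler inside a `Pipeline` so that it is refit on the training portion of each cross-validation fold. The data is synthetic, and the names `pipe`, `param_grid`, and `search` are illustrative rather than part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the Play Store dataset).
X, y = make_classification(
    n_samples=1000, n_features=7, n_informative=5, n_classes=3, random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=10
)

# make_pipeline names each step after its lowercased class name,
# so SVC parameters are addressed with the "svc__" prefix.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1]}

# The scaler is fit only on the training part of each fold, then applied to the rest.
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

On a dataset of this size the difference from scaling everything up front is probably small, but the pipeline keeps the test set untouched until the final score.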


In [14]:
svm_cv('linear', [{'C': Cs}])
svm_cv('rbf', [{'gamma': gammas, 'C': Cs}])
svm_cv('poly', [{'degree': degrees}])

Best parameters set found on development set:
{'C': 0.1}
linear 0.8948339483394834
Best parameters set found on development set:
{'C': 10, 'gamma': 0.01}
rbf 0.8948339483394834
Best parameters set found on development set:
{'degree': 1}
poly 0.8948339483394834


### Reflections

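Compared with **Decision Tree**, **SVM** is slightly more accurate, which may be related to the way **SVM** kernels transform the coordinate space.<br>
This comes at a large cost in runtime, though. By comparison, **RandomForest** offers very good value: comparable accuracy with much less time.<br>
After **Feature Scaling**, **SVM** runs significantly faster, but its accuracy does not improve.<br>
Overall, using the **DT** and **SVM** algorithms with basic app information, the prediction accuracy for download counts approaches 90%. In the future I hope to raise it above 95% by improving data collection, so that the model can be applied with reasonable confidence to the corresponding data-driven growth strategies.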
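
When interpreting an overall accuracy figure for a three-class problem, it can help to compare it with a majority-class baseline and to look at the per-class errors. The sketch below shows one way to do that with `DummyClassifier` and `confusion_matrix`. It runs on synthetic, deliberately imbalanced data (an assumption, not the Play Store dataset), and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (an assumption, not the Play Store dataset).
X, y = make_classification(
    n_samples=1000,
    n_features=7,
    n_informative=5,
    n_classes=3,
    weights=[0.6, 0.25, 0.15],
    random_state=10,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=10
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = SVC(kernel="rbf").fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])

print("baseline accuracy:", baseline_acc)
print("SVC accuracy:", model_acc)
print(cm)
```

The gap between the two accuracies, and the off-diagonal counts in the matrix, show how much of the score comes from the model rather than from the class distribution.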