# Predicting Amazon Review Quality with Ensemble Learning

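## 1. Case Overview

With the rise of e-commerce platforms and the continuing impact of the pandemic, online shopping plays an increasingly important role in daily life. When choosing products online, reviews are often one of the things shoppers pay the most attention to. However, review quality on e-commerce sites varies widely, and some reviews are fake positive reviews from paid posters or malicious negative reviews, which seriously harms the shopping experience. Predicting review quality has therefore become a growing concern for e-commerce platforms: if review quality can be assessed automatically, the predictions can be used to avoid displaying low-quality reviews. In this case study we use ensemble learning to predict the quality of reviews from a real Amazon setting.

The goal is to understand the basic ideas of ensemble learning by implementing two ensemble algorithms by hand (Bagging and AdaBoost.M1). Two kinds of base classifier are required: SVM and decision tree.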

## 2. Data Overview

In [241]:
import pandas as pd 
train_df = pd.read_csv('./data/train.csv', sep='\t')
test_df = pd.read_csv('./data/test.csv', sep='\t')
test_label_df = pd.read_csv('./groundTruth.csv', sep=',')

In [242]:
train_df.head()

| Unnamed: 0 | reviewerID | asin | reviewText | overall | votes_up | votes_all | label |
|---|---|---|---|---|---|---|---|
| 0 | 7885 | 3901 | First off, allow me to correct a common mistak... | 5.0 | 6 | 7 | 0 |
| 1 | 52087 | 47978 | I am really troubled by this Story and Enterta... | 3.0 | 99 | 134 | 0 |
| 2 | 5701 | 3667 | A near-perfect film version of a downright glo... | 4.0 | 14 | 14 | 1 |
| 3 | 47191 | 40892 | Keep your expectations low. Really really low... | 1.0 | 4 | 7 | 0 |
| 4 | 40957 | 15367 | "they dont make em like this no more..."well..... | 5.0 | 3 | 6 | 0 |


In [111]:
test_df.head()

| Unnamed: 0 | Id | reviewerID | asin | reviewText | overall |
|---|---|---|---|---|---|
| 0 | 0 | 82947 | 37386 | I REALLY wanted this series but I am in SHOCK ... | 1.0 |
| 1 | 1 | 10154 | 23543 | I have to say that this is a work of art for m... | 4.0 |
| 2 | 2 | 5789 | 5724 | Alien 3 is certainly the most controversal fil... | 3.0 |
| 3 | 3 | 9198 | 5909 | I love this film...preachy? Well, of course i... | 5.0 |
| 4 | 4 | 33252 | 21214 | Even though I previously bought the Gamera Dou... | 5.0 |


In [116]:
test_label_df.head()

| Unnamed: 0 | Id | Expected |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |


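The data comes from the Amazon e-commerce platform and contains more than 50,000 reviews left by users after purchasing products. The columns have the following meanings:

* reviewerID: user ID
* asin: product ID
* reviewText: review text, in English
* overall: the user's rating of the product (1-5)
* votes_up: number of votes marking the review as helpful (training set only)
* votes_all: total number of votes the review received (training set only)
* label: review-quality label, 1 for high quality and 0 for low quality (training set only)

The review-quality label is derived from other users' votes on the review: reviews with votes_up/votes_all ≥ 0.9 are treated as high quality. In addition, the test set contains an extra column, Id, which identifies each test example.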
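The labelling rule can be checked against the five training rows previewed above. The sketch below is only an illustration: the function name `quality_label` is hypothetical, and returning 0 for a review with no votes is an assumption, since the source does not say how such reviews are handled.

```python
def quality_label(votes_up: int, votes_all: int, threshold: float = 0.9) -> int:
    """Return 1 (high quality) if votes_up / votes_all >= threshold, else 0."""
    if votes_all <= 0:
        # Assumption: a review without any votes is treated as low quality.
        return 0
    return int(votes_up / votes_all >= threshold)


# (votes_up, votes_all, label) for the five training rows shown in the preview.
preview_rows = [(6, 7, 0), (99, 134, 0), (14, 14, 1), (4, 7, 0), (3, 6, 0)]
for up, total, label in preview_rows:
    print(up, total, quality_label(up, total), label)
```

Only the third row (14 of 14 votes) clears the 0.9 threshold, which matches its label of 1.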

Base model: AUC > 0.65

Ensemble model: AUC > 0.7

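Some tips:
1. Processing text features: sklearn.feature_extraction.text.TfidfVectorizer
2. Handling large matrices: scipy.sparse
3. SVM training is slow: use LinearSVC instead of SVC
4. The base models of an ensemble should preferably output probabilities rather than binary labels, which helps the ensemble: sklearn.calibration.CalibratedClassifierCV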
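Tip 4 matters for the AUC targets because AUC depends only on how the scores rank positive examples against negative ones. The sketch below is a dependency-free illustration, not the scikit-learn implementation: it computes AUC as the share of (positive, negative) pairs ranked correctly, counting ties as half. The function name `pairwise_auc` and the scores are made up for the example.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
probabilities = [0.9, 0.7, 0.6, 0.2]                          # made-up scores
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]   # [1, 1, 1, 0]

print(pairwise_auc(y_true, probabilities))  # 1.0
print(pairwise_auc(y_true, hard_labels))    # 0.75
```

Thresholding collapses the scores 0.9, 0.7 and 0.6 into the same value 1, so the ranking information that separated the negative at 0.6 from the two positives is lost as ties.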

### Designing the feature representation based on the data format

In [191]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.calibration import CalibratedClassifierCV

In [193]:
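vectorizer = TfidfVectorizer()
# Only the first 2,000 training reviews and the first 200 test reviews are used.
# The vocabulary is fitted on the training subset and reused for the test subset.
train_X = vectorizer.fit_transform(train_df['reviewText'][:2000])
train_y = np.array(train_df['label'][:2000])
test_X = vectorizer.transform(test_df['reviewText'][:200])
test_y = np.array(test_label_df['Expected'][:200])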

In [194]:
print(train_X.shape)
print(train_y.shape)
print(test_X.shape)
print(test_y.shape)

(2000, 29267)
(2000,)
(200, 29267)
(200,)


### Bagging + SVM

In [195]:
from sklearn.model_selection import train_test_split

#### predict 0/1

In [196]:
def bagging_svm(num_model, train_X, train_y, num_sample_per, test_X):
    """Bagging of LinearSVC models; returns the share of models voting for label 1."""
    predict_result_sum = np.zeros(test_X.shape[0])
    train_num = len(train_y)
    for _ in range(num_model):
        svm_classifier = LinearSVC()
        # Bootstrap sample: draw num_sample_per * train_num rows with replacement.
        sample_index = np.random.choice(train_num, int(num_sample_per*train_num), replace=True)
        sample_X, sample_y = train_X[sample_index], train_y[sample_index]
        svm_classifier.fit(sample_X, sample_y)
        # Hard 0/1 predictions on the test set.
        predict_result = svm_classifier.predict(test_X)
        predict_result_sum += predict_result
    predict_result_sum /= num_model
    return predict_result_sum

In [197]:
predict_bagging_svm = bagging_svm(10,train_X,train_y,0.5,test_X)
roc_auc_score(test_y,predict_bagging_svm)

0.6200464267948931
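Each base model in `bagging_svm` is trained on a bootstrap sample drawn with replacement, so it sees only part of the training rows. The sketch below illustrates this with the standard-library `random` module standing in for `np.random.choice`. The helper name `bootstrap_indices` and the seed are assumptions made for the example. For a large training set, a sample of size `fraction * n_rows` is expected to cover roughly `1 - exp(-fraction)` of the distinct rows.

```python
import math
import random


def bootstrap_indices(n_rows, fraction, rng):
    """Draw int(fraction * n_rows) row indices uniformly with replacement."""
    n_draws = int(fraction * n_rows)
    return [rng.randrange(n_rows) for _ in range(n_draws)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_rows = 2000            # size of the training subset used in this notebook
sample = bootstrap_indices(n_rows, 0.5, rng)

unique_share = len(set(sample)) / n_rows   # share of distinct rows actually drawn
expected_share = 1 - math.exp(-0.5)        # large-sample approximation, about 0.39

print(len(sample), round(unique_share, 3), round(expected_share, 3))
```

With `num_sample_per=0.5`, each model therefore sees a bit under 40% of the distinct training rows, which keeps the base models diverse at the cost of less data per model.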

#### predict probability

In [198]:
def bagging_svm_calib(num_model, train_X, train_y, num_sample_per, test_X):
    """Bagging of calibrated LinearSVC models; returns the mean P(label = 1)."""
    predict_result_sum = np.zeros(test_X.shape[0])
    train_num = len(train_y)
    for _ in range(num_model):
        # Bootstrap sample: draw rows with replacement.
        sample_index = np.random.choice(train_num, int(num_sample_per*train_num), replace=True)
        sample_X, sample_y = train_X[sample_index], train_y[sample_index]
        # Hold out part of the sample (25% by default) to fit the calibrator.
        X_train, X_calib, y_train, y_calib = train_test_split(
            sample_X, sample_y, random_state=42)

        base_clf = LinearSVC()
        base_clf.fit(X_train, y_train)
        # The estimator is passed positionally because its keyword was renamed
        # from `base_estimator` to `estimator` in scikit-learn 1.2.
        calibrated_clf = CalibratedClassifierCV(base_clf, cv="prefit")
        calibrated_clf.fit(X_calib, y_calib)

        # Calibrated probability of the positive class.
        predict_result = calibrated_clf.predict_proba(test_X)[:, 1]
        predict_result_sum += predict_result
    predict_result_sum /= num_model
    return predict_result_sum

In [199]:
predict_bagging_svm_calib = bagging_svm_calib(10,train_X,train_y,0.5,test_X)
roc_auc_score(test_y,predict_bagging_svm_calib)

0.7410048084894711

### Bagging + Decision Tree

In [200]:
from sklearn.tree import DecisionTreeClassifier

In [201]:
def bagging_dc_calib(num_model, train_X, train_y, num_sample_per, test_X):
    """Bagging of calibrated decision trees; returns the mean P(label = 1)."""
    predict_result_sum = np.zeros(test_X.shape[0])
    train_num = len(train_y)
    for _ in range(num_model):
        # Bootstrap sample: draw rows with replacement.
        sample_index = np.random.choice(train_num, int(num_sample_per*train_num), replace=True)
        sample_X, sample_y = train_X[sample_index], train_y[sample_index]
        # Hold out part of the sample (25% by default) to fit the calibrator.
        X_train, X_calib, y_train, y_calib = train_test_split(
            sample_X, sample_y, random_state=42)

        base_clf = DecisionTreeClassifier()
        base_clf.fit(X_train, y_train)
        # The estimator is passed positionally because its keyword was renamed
        # from `base_estimator` to `estimator` in scikit-learn 1.2.
        calibrated_clf = CalibratedClassifierCV(base_clf, cv="prefit")
        calibrated_clf.fit(X_calib, y_calib)

        # Calibrated probability of the positive class.
        predict_result = calibrated_clf.predict_proba(test_X)[:, 1]
        predict_result_sum += predict_result
    predict_result_sum /= num_model
    return predict_result_sum

In [202]:
predict_bagging_dc_calib = bagging_dc_calib(10,train_X,train_y,0.5,test_X)
roc_auc_score(test_y,predict_bagging_dc_calib)

0.6144917924059029

In [203]:
predict_bagging_dc_calib = bagging_dc_calib(20,train_X,train_y,0.1,test_X)
roc_auc_score(test_y,predict_bagging_dc_calib)

0.6580998176090203

### AdaBoost.M1 + SVM

In [204]:
from sklearn.metrics import accuracy_score

In [225]:
def adaboost_m1_svm(num_model, train_X, train_y, test_X):
    """AdaBoost.M1 with LinearSVC base models; returns the weighted vote for label 1."""
    sample_weights = np.ones(len(train_y)) / len(train_y)
    beta_list = []
    result_list = []
    for _ in range(num_model):
        base_clf = LinearSVC()
        base_clf.fit(train_X, train_y, sample_weight=sample_weights)
        predict_result = base_clf.predict(train_X)

        # Weighted training error of this round.
        epsilon = 1 - accuracy_score(train_y, predict_result,
                                     sample_weight=sample_weights)
        if epsilon > 0.5:
            print('try a better model')
            break
        # Guard: epsilon == 0 would give beta == 0 and log(1 / 0) below.
        epsilon = max(epsilon, 1e-10)

        beta = epsilon / (1 - epsilon)
        beta_list.append(beta)
        error_flag = predict_result != train_y
        # Scale correctly classified samples by beta; misclassified ones keep their weight.
        sample_weights *= (1.0 - error_flag) * beta + error_flag
        # Rescale so that the weights average to 1.
        sample_weights /= np.sum(sample_weights) / len(sample_weights)

        predict_result_test = base_clf.predict(test_X)
        result_list.append(predict_result_test)

    # Each model votes with weight log(1 / beta), normalized to sum to 1.
    beta_list = np.log(1 / np.array(beta_list))
    beta_list /= np.sum(beta_list)

    return (np.array(result_list) * beta_list[:, None]).sum(0)

In [231]:
predict_adaboost_m1_svm = adaboost_m1_svm(30, train_X, 
                                          train_y,test_X)
roc_auc_score(test_y,predict_adaboost_m1_svm)

0.7187033659426298
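The weight update inside `adaboost_m1_svm` can be traced by hand on a toy case. The sketch below is a dependency-free illustration with made-up data: four samples with equal weights, one of which is misclassified. The helper name `adaboost_m1_round` is hypothetical. It normalizes the weights to sum to 1, whereas the notebook code rescales them to average 1; the relative weights are the same either way. It also rejects any epsilon outside (0, 0.5), which is stricter than the notebook's `epsilon > 0.5` check.

```python
import math


def adaboost_m1_round(weights, is_wrong):
    """One AdaBoost.M1 round: return (epsilon, beta, normalized new weights)."""
    total = sum(weights)
    # Weighted error: share of total weight on misclassified samples.
    epsilon = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / total
    if not 0 < epsilon < 0.5:
        raise ValueError("this sketch needs 0 < epsilon < 0.5")
    beta = epsilon / (1 - epsilon)
    # Correct samples are scaled down by beta (< 1); wrong ones keep their weight.
    updated = [w if wrong else w * beta for w, wrong in zip(weights, is_wrong)]
    norm = sum(updated)
    return epsilon, beta, [w / norm for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
is_wrong = [False, False, False, True]   # made-up outcome: the last sample is misclassified

epsilon, beta, new_weights = adaboost_m1_round(weights, is_wrong)
vote_weight = math.log(1 / beta)         # this model's unnormalized vote weight

print(epsilon, beta, new_weights, vote_weight)
```

Here epsilon is 0.25 and beta is 1/3, so after normalization the single misclassified sample carries half of the total weight and the next base model is pushed to classify it correctly. The model's vote weight is log(3), and a lower error would give a larger vote.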

### AdaBoost.M1 + Decision Tree

In [238]:
def adaboost_m1_dc(num_model, train_X, train_y, test_X):
    """AdaBoost.M1 with depth-limited decision trees; returns the weighted vote for label 1."""
    sample_weights = np.ones(len(train_y)) / len(train_y)
    beta_list = []
    result_list = []
    for _ in range(num_model):
        base_clf = DecisionTreeClassifier(max_depth=10, class_weight='balanced')
        base_clf.fit(train_X, train_y, sample_weight=sample_weights)
        predict_result = base_clf.predict(train_X)

        # Weighted training error of this round.
        epsilon = 1 - accuracy_score(train_y, predict_result,
                                     sample_weight=sample_weights)
        if epsilon > 0.5:
            print('try a better model')
            break
        # Guard: epsilon == 0 would give beta == 0 and log(1 / 0) below.
        epsilon = max(epsilon, 1e-10)

        beta = epsilon / (1 - epsilon)
        beta_list.append(beta)
        error_flag = predict_result != train_y
        # Scale correctly classified samples by beta; misclassified ones keep their weight.
        sample_weights *= (1.0 - error_flag) * beta + error_flag
        # Rescale so that the weights average to 1.
        sample_weights /= np.sum(sample_weights) / len(sample_weights)

        predict_result_test = base_clf.predict(test_X)
        result_list.append(predict_result_test)

    # Each model votes with weight log(1 / beta), normalized to sum to 1.
    beta_list = np.log(1 / np.array(beta_list))
    beta_list /= np.sum(beta_list)

    return (np.array(result_list) * beta_list[:, None]).sum(0)

In [240]:
predict_adaboost_m1_dc = adaboost_m1_dc(10, train_X, 
                                          train_y,test_X)
roc_auc_score(test_y,predict_adaboost_m1_dc)

0.664898026861217