# Assignment 10

### 1. Review the lecture content

### 2. Answer the following theory questions

#### 1. What is the independence assumption in Naive Bayes?

Naive Bayes provides a way of calculating the posterior probability $P(c|x)$ from $P(c)$, $P(x)$, and $P(x|c)$. A Naive Bayes classifier assumes that, given the class $c$, the value of one feature $x_1$ is independent of the values of the other features $(x_2, x_3, \dots, x_n)$. This assumption is called class-conditional independence, and it lets the likelihood factorize as $P(x|c) = \prod_i P(x_i|c)$.

In other words, Naive Bayes (NB) is "naive" because it assumes that the features of an observation are independent of each other once the class is known. This is naive because it is (almost) never true in practice.
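To make the factorization concrete, here is a minimal sketch of a categorical Naive Bayes classifier in plain Python. The toy weather data, the function names `train_nb` and `posterior`, and the Laplace smoothing constant `alpha` are all assumptions made for illustration; they are not part of the assignment.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies (alpha = Laplace smoothing)."""
    class_counts = Counter(labels)
    # feature_counts[c][i][v] = number of class-c rows whose i-th feature equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[c][i][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values, alpha

def posterior(model, row):
    """Return P(c | row) for every class c."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, v in enumerate(row):
            # Independence assumption: multiply the per-feature likelihoods P(x_i | c)
            score *= (feature_counts[c][i][v] + alpha) / (n_c + alpha * len(values[i]))
        scores[c] = score
    z = sum(scores.values())  # normalizer, plays the role of P(x)
    return {c: s / z for c, s in scores.items()}

# Hypothetical toy data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
probs = posterior(model, ("rainy", "cool"))
print(probs)
```

Each feature contributes one independent factor to the score, which is exactly what the class-conditional independence assumption allows.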

#### 2. What are MAP (maximum a posteriori) and ML (maximum likelihood)?

MLE (Maximum Likelihood Estimation) and MAP (Maximum A Posteriori) are both methods for estimating the parameters of a probability distribution from observed data.

**What is MLE?**

Assume we have a likelihood function P(X|θ). Then, the MLE for θ, the parameter we want to infer, is:
<img src='https://i.loli.net/2019/12/09/YNJzCxpiWLESMIj.png' width=200 height=200>

We just need to derive the log-likelihood of our model and then maximize it with respect to θ using our favorite optimization algorithm, for example gradient descent on the negative log-likelihood.

**What is MAP?**

MAP usually comes up in a Bayesian framework, as it works with the posterior distribution rather than the likelihood alone.

If we replace the likelihood in the MLE formula above with the posterior P(θ|X), which is proportional to P(X|θ)P(θ), we get:
<img src=https://i.loli.net/2019/12/09/cqIrnHMJbYpLsVo.png width=200 height=200>

The only difference is the inclusion of the prior P(θ) in MAP; otherwise the two are identical. Equivalently, MLE is the special case of MAP with a uniform prior.
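A coin-flip example shows how the prior changes the estimate. This is a sketch under stated assumptions: the data are Bernoulli, the prior is Beta(a, b) with a, b ≥ 1, and the function names are made up for illustration.

```python
def bernoulli_mle(heads, n):
    """MLE of the head probability: argmax over theta of P(X | theta) = heads / n."""
    return heads / n

def bernoulli_map(heads, n, a=2.0, b=2.0):
    """MAP estimate under a Beta(a, b) prior.

    The posterior is Beta(heads + a, n - heads + b); its mode is the MAP estimate.
    """
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 3, 3                              # three flips, all heads
mle = bernoulli_mle(heads, n)                # 1.0: the likelihood alone says "always heads"
map_est = bernoulli_map(heads, n)            # 0.8: the prior pulls the estimate toward 0.5
flat = bernoulli_map(heads, n, a=1.0, b=1.0) # uniform prior reproduces the MLE
print(mle, map_est, flat)
```

With only three observations the prior matters a lot; as `n` grows, the two estimates converge.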

#### 3. What is support vector in SVM?

Support vectors are the data points closest to the separating hyperplane; they determine its position and orientation.

Using these support vectors, we maximize the margin of the classifier.
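As a small illustration, the sketch below takes a hand-picked separating hyperplane for four toy points and finds which points lie closest to it. The data, the weights `w`, and the offset `b` are assumptions chosen so the arithmetic is easy to follow; nothing here is fitted by an SVM solver.

```python
import math

# Hypothetical hyperplane w·x + b = 0, i.e. the line x1 + x2 = 2
w, b = (0.5, 0.5), -1.0
# Toy points with labels +1 / -1
points = [((2, 2), 1), ((3, 3), 1), ((0, 0), -1), ((-1, 0), -1)]

def distance(x):
    """Euclidean distance from point x to the hyperplane."""
    return abs(w[0] * x[0] + w[1] * x[1] + b) / math.hypot(*w)

dists = {x: distance(x) for x, _ in points}
min_dist = min(dists.values())

# The support vectors are the points at the minimum distance
support_vectors = [x for x, d in dists.items() if math.isclose(d, min_dist)]
margin = 2 * min_dist  # width of the band between the two classes

print(support_vectors, margin)
```

Moving `(3, 3)` or `(-1, 0)` farther away would not change the result, whereas moving either support vector would.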

#### 4. What is the intuition behind SVM?

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates the data points by class.

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the hyperplane with the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class. Maximizing the margin provides some robustness, so that future data points can be classified with more confidence.
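The maximum-margin objective is usually written as the hard-margin optimization problem below. This is the standard textbook form rather than something taken from the assignment; $w$, $b$, $x_i$ and $y_i \in \{-1, +1\}$ are the usual symbols for the hyperplane normal, the offset, the training points and their labels, and the data are assumed to be linearly separable.

$$\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1,\quad i = 1,\dots,m$$

The margin width is $2/\lVert w\rVert$, so minimizing $\lVert w\rVert$ maximizes the margin. The points for which the constraint holds with equality are the support vectors.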

#### 5. Briefly describe what 'random' means in random forest.

1. Random samples (bootstrapping): each tree is trained on a random set of observations drawn with replacement.

2. Random subsets of features: only a random subset of the features is considered when choosing the split at each node of a decision tree.
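Both sources of randomness can be sketched with the standard library. The helper names, the fixed seed, and the square-root rule for the subset size are illustrative assumptions (the square root is a common default for classification, but the assignment does not specify it).

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's training set)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_samples, n_features = 10, 9

rows = bootstrap_indices(n_samples, rng)
# Rows never drawn are "out of bag" for this tree
oob = sorted(set(range(n_samples)) - set(rows))

k = int(n_features ** 0.5)      # assumed heuristic: sqrt of the feature count
feats = random_feature_subset(n_features, k, rng)

print("bootstrap rows:", rows)
print("out-of-bag rows:", oob)
print("features tried at this split:", feats)
```

Because sampling is with replacement, some rows repeat and others are left out; the left-out rows are what an out-of-bag estimate evaluates on.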

#### 6. What criterion does XGBoost use to find the best split point in a tree?

The criterion in Tianqi Chen's slides is called 'Gain': the loss reduction after the split.

<img src="https://s2.ax1x.com/2019/12/12/QcSI6H.png" width=400 height=400 >


For each feature, sort the instances by feature value; for each candidate value, calculate the gain; then choose the value with the maximum gain as the split point for this feature. This is what the slides mean by "Use a linear scan to decide the best split along that feature".

Finally, take the best split across all features.
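The linear scan for a single feature can be sketched as follows, using the structure-score form of the gain from the XGBoost paper, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$. The function name, the toy gradients, and the use of midpoints as thresholds are assumptions for illustration, not XGBoost's actual implementation.

```python
def best_split(feature_values, grads, hessians, lam=1.0, gamma=0.0):
    """Scan one feature left to right and return (threshold, gain) of the best split."""
    order = sorted(range(len(feature_values)), key=lambda i: feature_values[i])
    G, H = sum(grads), sum(hessians)

    def score(g, h):
        # Structure score of a leaf holding gradient sum g and hessian sum h
        return g * g / (h + lam)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += grads[i]
        HL += hessians[i]
        if feature_values[i] == feature_values[nxt]:
            continue  # cannot split between identical values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain = gain
            # Assumed convention: split halfway between neighbouring values
            best_threshold = (feature_values[i] + feature_values[nxt]) / 2
    return best_threshold, best_gain

# Toy example: gradients change sign between x = 2 and x = 3
threshold, gain = best_split([1, 2, 3, 4], [-1, -1, 1, 1], [1, 1, 1, 1])
print(threshold, gain)
```

Running the same scan for every feature and keeping the overall maximum gives the split chosen for the node.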

### 3. Practical part

##### Problem description: In this part you are going to build a classifier to detect whether a piece of news was published by the Xinhua News Agency (新华社).

#### Hints:

###### 1. Firstly, you have to come up with a way to represent the news. (Vectorize the sentences; you can find different ways to do so online.)

## Bag-of-words features (term-frequency matrix, TF-IDF, LSA and bigrams)

In [1]:
import re
import jieba
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics 
from sklearn.model_selection import train_test_split

## Step 1: Prepare the training data

In [2]:
'''
Drop rows whose source or content is missing
'''
df_data = pd.read_csv('./zh_news.csv',encoding='gb18030')
df_data.dropna(subset=['source','content'],inplace=True)
df_data = df_data[['source','content']]
print('Number of news: {}'.format(len(df_data)))
df_data.head()

Number of news: 87052


Unnamed: 0,source,content
0,快科技@http://www.kkj.cn/,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...
1,快科技@http://www.kkj.cn/,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...
2,快科技@http://www.kkj.cn/,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...
3,新华社,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n
4,深圳大件事,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...


In [3]:
'''
source is the origin of each news item. Items from Xinhua (新华社) are treated as
positive samples (1); items from all other sources are negative samples (0).
'''
def create_flag(x):
    if x == '新华社':
        return 1
    else:
        return 0
    
df_data['flag'] = df_data['source'].apply(create_flag)
df_pos = df_data[df_data['flag']==1]
df_neg = df_data[df_data['flag']==0]
print('Xinhua news items: {}; news items from other sources: {}'.format(len(df_pos),len(df_neg)))
df_data.head()

Xinhua news items: 78661; news items from other sources: 8391


Unnamed: 0,source,content,flag
0,快科技@http://www.kkj.cn/,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,0
1,快科技@http://www.kkj.cn/,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,0
2,快科技@http://www.kkj.cn/,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...,0
3,新华社,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,1
4,深圳大件事,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...,0


In [4]:
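'''
The classes are severely imbalanced, so apply a simple downsampling to the Xinhua samples:
keep the first 20000 Xinhua rows (a slice, not a random draw),
then concatenate the positive and negative samples.
'''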
df_pos_down = df_pos[:20000]
df_merge = pd.concat([df_pos_down,df_neg],ignore_index=True)
print(df_merge.shape)

(28391, 3)
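The cell above keeps the first 20,000 Xinhua rows. If the file happens to be ordered by date or topic, a random draw may be more representative. The sketch below shows that alternative on toy tuples; the function name `downsample`, the seed, and the toy data are assumptions, and it is not what the notebook ran.

```python
import random

def downsample(majority, minority, n_keep, seed=0):
    """Randomly keep n_keep majority items, add all minority items, then shuffle."""
    rng = random.Random(seed)            # seeded so the draw is repeatable
    kept = rng.sample(majority, min(n_keep, len(majority)))
    merged = kept + list(minority)
    rng.shuffle(merged)
    return merged

# Toy stand-ins for df_pos (majority class) and df_neg (minority class)
majority = [("pos", i) for i in range(100)]
minority = [("neg", i) for i in range(10)]

balanced = downsample(majority, minority, n_keep=20)
n_pos = sum(1 for label, _ in balanced if label == "pos")
print(len(balanced), n_pos)
```

With pandas, the same idea would be a seeded `sample` call on the majority frame before concatenation.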


## Step 2: Preprocess the news text

* Split into words/tokens
* Remove stop words
* Rejoin the tokens into a new sentence

In [5]:
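def clean_sent(sentence):
    '''
    Strip whitespace and punctuation from a sentence.
    A function without a return statement returns None; to avoid that,
    non-string input (e.g. a missing value) yields ''.
    '''
    if isinstance(sentence, str):
        return re.sub(
            r'[\s+\-\|\!\/\[\]\{\}_,.$%^*(+\"\')]+|[:：+——()?【】“”~@#￥%……&*（）]+',
            '', sentence)
    else:
        return ''

def filter_stopwords(words):
    # stop_words is a module-level set loaded in the next cell
    return [word for word in words if word not in stop_words]

def sentence_seg(sentence):
    # clean_sent handles non-string input and already removes all whitespace
    sentence = clean_sent(sentence)
    words = jieba.lcut(sentence)
    words = filter_stopwords(words)
    return ' '.join(words)

In [6]:
# Load the HIT (哈工大) stop-word list into a set for fast membership tests
with open('哈工大停用词表.txt',encoding='utf-8') as f:
    stop_words = {line.strip() for line in f}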

In [7]:
df_merge['content_seg'] = df_merge['content'].apply(sentence_seg)
df_merge.head()

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.577 seconds.
Prefix dict has been built succesfully.


Unnamed: 0,source,content,flag,content_seg
0,新华社,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,1,这是 6 月 18 日 葡萄牙 中部 大 佩德罗 冈 地区 拍摄 森林 大火 烧毁 汽车 新...
1,新华社,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,1,这是 6 月 18 日 葡萄牙 中部 大 佩德罗 冈 地区 拍摄 森林 大火 烧毁 汽车 新...
2,新华社,新华社韩国济州6月18日电综述：亚投行第二届年会三大亮点\r\n新华社记者 耿学鹏 严蕾\r...,1,新华社 韩国 济州 6 月 18 日电 综述 亚 投行 第二届 年会 三大 亮点 新华社 记...
3,新华社,新华社北京6月18日电 经军委领导批准，《军营理论热点怎么看·2017》日前印发全军。\r\...,1,新华社 北京 6 月 18 日电 军委 领导 批准 军营 理论 热点 看 2017 日前 印...
4,新华社,新华社兰州6月19日电（记者张钦）记者19日了解到，刚刚出台的《甘肃省网络扶贫行动的实施方案...,1,新华社 兰州 6 月 19 日电 记者 张钦 记者 19 日 了解 刚刚 出台 甘肃省 网络...
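To see what the cleaning and stop-word steps do in isolation, here is a small sketch that needs neither jieba nor the stop-word file. The regular expression is the one from `clean_sent`; the token list stands in for the output of `jieba.lcut`, and the one-word stop-word set and the helper `join_tokens` are hypothetical.

```python
import re

# Same pattern as clean_sent in the notebook, compiled once
PUNCT = re.compile(
    r'[\s+\-\|\!\/\[\]\{\}_,.$%^*(+\"\')]+|[:：+——()?【】“”~@#￥%……&*（）]+')

def clean_sent(sentence):
    """Remove whitespace and the listed punctuation; non-strings become ''."""
    return PUNCT.sub('', sentence) if isinstance(sentence, str) else ''

def join_tokens(tokens, stop_words):
    """Drop stop words and rejoin the remaining tokens with spaces."""
    return ' '.join(t for t in tokens if t not in stop_words)

cleaned = clean_sent('新华社 北京（记者）')
missing = clean_sent(None)

tokens = ['新华社', '北京', '的', '记者']   # hand-written stand-in for jieba.lcut output
stop_words = {'的'}                        # hypothetical one-word stop list
joined = join_tokens(tokens, stop_words)

print(cleaned)
print(joined)
```

The space-separated result is the format that `CountVectorizer` expects in the next step, since it splits text on word boundaries.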


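## Step 3: Extract text features

* VSM (term-frequency matrix)
* TF-IDF
* LSA
* n-gram

In [8]:
'''
Preview a random shuffle of the data. The result is not assigned back, so df_merge keeps its order;
the actual shuffle and the train/validation split happen in train_test_split below.
'''
df_merge.sample(frac=1).head()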

Unnamed: 0,source,content,flag,content_seg
4875,新华社,新华社照片，开封（河南），2017年4月1日\n清明踏春大巡游\n4月1日，来自开封铁塔公园...,1,新华社 照片 开封 河南 2017 年 4 月 1 日 \ n 清明 踏春 大 巡游 \ n...
9780,新华社,\n新华社莫斯科4月7日新媒体专电（记者安晓萌）俄罗斯国防部发言人科纳申科夫7日说，俄方将从...,1,\ n 新华社 莫斯科 4 月 7 日新 媒体 专电 记者 安晓萌 俄罗斯国防部 发言人 科...
23702,中国日报网,点击图片进入下一页\r\n,0,点击 图片 进入 下 一页
4936,新华社,新华社照片，杭州，2017年4月1日\n“捣青草为汁”和粉作青团\n4月1日，浙江临安山核桃...,1,新华社 照片 杭州 2017 年 4 月 1 日 \ n 捣 青草 汁 粉作 青团 \ n4...
20029,人民日报,“我刚刚启动发动机，就听见清脆的玻璃破裂声，然后一匹骏马出现在副驾驶座位上，含情脉脉地看着我...,0,刚刚 启动 发动机 听见 清脆 玻璃 破裂声 一匹 骏马 出现 副驾驶 座位 上 含情脉脉 ...


In [9]:
'''
Use the term-frequency matrix as text features
'''
vec_freq = CountVectorizer(max_features=5000)
vsm_freq = vec_freq.fit_transform(df_merge['content_seg']).toarray()
print("Shape of the term-frequency document-term matrix:\n\n",vsm_freq.shape)

# Optionally use TF-IDF as text features instead
# vec_tfidf = TfidfVectorizer(max_features=5000)
# vsm_tfidf = vec_tfidf.fit_transform(df_merge['content_seg']).toarray()

Shape of the term-frequency document-term matrix:

 (28391, 5000)
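The idea behind the term-frequency matrix can be sketched without scikit-learn. The function below is a simplified stand-in for `CountVectorizer`: it splits on whitespace only, whereas the real class applies its own token pattern (the default drops single-character tokens) and returns a sparse matrix. The function name and the two toy documents are assumptions.

```python
from collections import Counter

def term_frequency_matrix(docs, max_features=None):
    """Build a vocabulary of the most frequent tokens and count them per document."""
    tokenized = [doc.split() for doc in docs]
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the max_features most frequent tokens (all of them if None), sorted
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    index = {tok: j for j, tok in enumerate(vocab)}

    matrix = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            if tok in index:
                matrix[i][index[tok]] += 1
    return vocab, matrix

# Two toy documents in the space-separated format produced by sentence_seg
docs = ["新华社 记者 报道", "记者 记者 采访"]
vocab, matrix = term_frequency_matrix(docs)
print(vocab)
print(matrix)
```

Each row is one document and each column one vocabulary term, which matches the (documents, 5000) shape printed above.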


In [10]:
X_train, X_test, y_train, y_test = train_test_split(vsm_freq,
                                                    df_merge['flag'],
                                                    test_size=0.2, 
                                                    random_state=10)

###### 2. Secondly, pick a machine learning algorithm that you think is suitable for this task

## Step 4: Train a random forest classifier

In [11]:
'''
Use the out-of-bag (OOB) score as an estimate of the model's generalization performance, i.e. oob_score=True.
Use 200 trees, a maximum depth of 10, and at least 10 samples per leaf.
'''
forest1 = RandomForestClassifier(oob_score=True,
                                n_estimators = 200,
                                max_depth=10,
                                min_samples_leaf=10)
forest1.fit(X_train, y_train)

print("\nOOB score: {}\n".format(forest1.oob_score_))


OOB score: 0.9204385346953152
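The out-of-bag score works because each bootstrap sample leaves out a sizeable share of the rows. The short calculation below shows that share for a training set of 22,712 rows, which is the 28,391 merged rows minus the 5,679 held out by the 20% test split. The function name is made up for illustration.

```python
import math

def oob_fraction(n_samples):
    """Probability that a given row is absent from a bootstrap sample of size n_samples."""
    return (1 - 1 / n_samples) ** n_samples

frac = oob_fraction(22712)
print(round(frac, 4), round(math.exp(-1), 4))
```

The value approaches 1/e, so roughly 37% of the training rows are unseen by any given tree and can serve as that tree's validation data.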



## Step 5: Evaluate the classifier

In [12]:
"""Step 5: evaluate the model"""

def model_eval(x_test,y_test,forest):
    # Predict once and reuse the labels for both reports
    y_pred = forest.predict(x_test)

    print("1. Confusion matrix:\n")
    print(metrics.confusion_matrix(y_test, y_pred))

    print("\n2. Precision, recall and F1 score:\n")
    print(metrics.classification_report(y_test, y_pred))

    print("\n3. AUC score:\n")
    y_predprob = forest.predict_proba(x_test)[:,1]
    print(metrics.roc_auc_score(y_test, y_predprob))

In [13]:
print("\n====================Evaluating the model trained on term-frequency features==================\n")
model_eval(X_test,y_test,forest1)



1. Confusion matrix:

[[1283  424]
 [  43 3929]]

2. Precision, recall and F1 score:

              precision    recall  f1-score   support

           0       0.97      0.75      0.85      1707
           1       0.90      0.99      0.94      3972

   micro avg       0.92      0.92      0.92      5679
   macro avg       0.94      0.87      0.89      5679
weighted avg       0.92      0.92      0.91      5679


3. AUC score:

0.9869705247806703
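The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand from the matrix printed above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels); the function name `report` is made up for illustration.

```python
def report(cm):
    """Derive per-class precision/recall/F1 and accuracy from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm   # rows: true 0 / true 1, columns: predicted 0 / predicted 1

    def f1(p, r):
        return 2 * p * r / (p + r)

    p1, r1 = tp / (tp + fp), tp / (tp + fn)   # class 1 (Xinhua)
    p0, r0 = tn / (tn + fn), tn / (tn + fp)   # class 0 (other sources)
    acc = (tn + tp) / (tn + fp + fn + tp)
    return {
        0: (p0, r0, f1(p0, r0)),
        1: (p1, r1, f1(p1, r1)),
        "accuracy": acc,
    }

stats = report([[1283, 424], [43, 3929]])
for key, value in stats.items():
    print(key, value)
```

The low recall for class 0 comes from the 424 non-Xinhua items that were predicted as Xinhua.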


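## Step 6: Use LSA and bigrams as text features

In [14]:
"""
Compute an LSA-style document-topic matrix with NMF
(NMF serves here as a non-negative stand-in for the truncated SVD of classical LSA).
Factorizing the term-frequency document-term matrix with NMF yields the document-topic matrix.
Note that if the input were transposed, the result would be the term-topic matrix.
"""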
  
from sklearn.decomposition import NMF
nmf = NMF(n_components=200)
lsa_freq = nmf.fit_transform(vsm_freq)
print(lsa_freq.shape)

(28391, 200)
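In matrix terms, the factorization in this cell can be summarized as follows. The symbols $V$, $W$ and $H$ are the conventional names for NMF factors and are not used in the notebook itself; $V$ corresponds to `vsm_freq`, $W$ to `lsa_freq` (the output of `fit_transform`), and $H$ to `nmf.components_`.

$$V \approx W H, \qquad V \in \mathbb{R}_{\ge 0}^{28391 \times 5000},\quad W \in \mathbb{R}_{\ge 0}^{28391 \times 200},\quad H \in \mathbb{R}_{\ge 0}^{200 \times 5000}$$

Each row of $W$ describes one document by its weights on 200 topics, and each row of $H$ describes one topic by its weights on the 5000 vocabulary terms.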


In [15]:
X_train, X_test, y_train, y_test = train_test_split(lsa_freq,
                                                    df_merge['flag'],
                                                    test_size=0.2, 
                                                    random_state=10)

In [16]:
forest2 = RandomForestClassifier(oob_score=True,
                                n_estimators = 200,
                                max_depth=10,
                                min_samples_leaf=10)
forest2.fit(X_train, y_train)

print("\nOOB score: {}\n".format(forest2.oob_score_))


OOB score: 0.9658770693906304



In [17]:
model_eval(X_test,y_test,forest2)

1. Confusion matrix:

[[1607  100]
 [  86 3886]]

2. Precision, recall and F1 score:

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      1707
           1       0.97      0.98      0.98      3972

   micro avg       0.97      0.97      0.97      5679
   macro avg       0.96      0.96      0.96      5679
weighted avg       0.97      0.97      0.97      5679


3. AUC score:

0.9929813321251101


In [18]:
"""
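Now use bigrams as features.
Compute the bigrams with sklearn to obtain the document-bigram matrix.
token_pattern makes input such as "bi-gram" or "two:three" split into "bi gram" and "two three";
unlike the default pattern, it also keeps single-character tokens.
ngram_range=(2,2) means unigrams are not kept.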
"""
vec_bigram = CountVectorizer(ngram_range=(2,2),
                             token_pattern=r'\b\w+\b',
                             max_features=5000) 
vsm_bigram = vec_bigram.fit_transform(df_merge['content_seg']).toarray()

In [20]:
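print("The first 10 entries of the bigram vocabulary:\n")
# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(vec_bigram.get_feature_names()[:10])

The first 10 entries of the bigram vocabulary: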

['0 0', '0 3', '0 击败', '0 总比分', '0 战胜', '0 胜', '04 月', '04 队', '1 0', '1 1']
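The effect of the token pattern and of `ngram_range=(2,2)` can be reproduced with a few lines of standard-library code. This is a simplified sketch of what the vectorizer does to one document (tokenize, lowercase, pair adjacent tokens); the function name `bigrams` is an assumption, and vocabulary building and counting are left out.

```python
import re

def bigrams(text, token_pattern=r'\b\w+\b'):
    """Tokenize with the given pattern and join each pair of adjacent tokens."""
    tokens = re.findall(token_pattern, text.lower())
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

english = bigrams("bi-gram two:three")
chinese = bigrams("新华社 记者 张钦")
print(english)
print(chinese)
```

Because punctuation is not matched by `\w`, "bi-gram" and "two:three" are split into separate tokens before pairing.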


In [21]:
X_train, X_test, y_train, y_test = train_test_split(vsm_bigram,
                                                    df_merge['flag'],
                                                    test_size=0.2, 
                                                    random_state=10)

In [22]:
forest3 = RandomForestClassifier(oob_score=True,
                                n_estimators = 200,
                                max_depth=10,
                                min_samples_leaf=10)
forest3.fit(X_train, y_train)

print("\nOOB score: {}\n".format(forest3.oob_score_))


OOB score: 0.9819038393800634



In [23]:
model_eval(X_test,y_test,forest3)

1. Confusion matrix:

[[1671   36]
 [  36 3936]]

2. Precision, recall and F1 score:

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1707
           1       0.99      0.99      0.99      3972

   micro avg       0.99      0.99      0.99      5679
   macro avg       0.98      0.98      0.98      5679
weighted avg       0.99      0.99      0.99      5679


3. AUC score:

0.995799315182847


## Bigram features clearly give the best results!

### Congratulations! You have completed all assignments for this week. The question below is optional. If you still have time, why not try it out?

## Optional:

#### Try different machine learning algorithms with different combinations of parameters in the practical part, and compare their performance (preferably using some visualization techniques).

## Step 7: Classify with XGBoost using bigram features

In [24]:
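# Record the time spent on training
import time
from datetime import timedelta

def get_time_dif(start_time):
    '''
    :param start_time: start time in seconds, as returned by time.time()
    :return: elapsed time as a timedelta, printed as 0:00:00
    '''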
    end_time = time.time()
    time_dif = end_time - start_time 
    return timedelta(seconds=int(round(time_dif)))

In [25]:
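'''
Parameters: maximum tree depth 10, up to 200 trees,
row subsampling rate 0.8, feature (column) subsampling rate 0.8.
The test set is passed in during training so that AUC can be monitored for early stopping:
if AUC does not improve for 20 rounds, training stops and the best-AUC iteration is recorded.
Because the test set guides early stopping, the reported test scores are somewhat optimistic.
'''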

from xgboost.sklearn import XGBClassifier

xgb = XGBClassifier(max_depth=10,
                    n_estimators=200,
                    random_state=20,
                    subsample=0.8,
                    colsample_bytree=0.8)
        
eval_set = [(X_test,y_test)]
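print('----------> Start training the model:')
start_time = time.time()
# Recent XGBoost releases expect early_stopping_rounds and eval_metric
# on the XGBClassifier constructor rather than in fit()
xgb.fit(X_train,y_train,
        early_stopping_rounds=20,
        eval_set=eval_set,
        eval_metric='auc',
        verbose=True)  
print(get_time_dif(start_time))

----------> Start training the model: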
[0]	validation_0-auc:0.989816
Will train until validation_0-auc hasn't improved in 20 rounds.
[1]	validation_0-auc:0.996341
[2]	validation_0-auc:0.996486
[3]	validation_0-auc:0.997774
[4]	validation_0-auc:0.998188
[5]	validation_0-auc:0.998077
[6]	validation_0-auc:0.99819
[7]	validation_0-auc:0.998151
[8]	validation_0-auc:0.998101
[9]	validation_0-auc:0.998217
[10]	validation_0-auc:0.998341
[11]	validation_0-auc:0.998262
[12]	validation_0-auc:0.998313
[13]	validation_0-auc:0.998323
[14]	validation_0-auc:0.998494
[15]	validation_0-auc:0.998443
[16]	validation_0-auc:0.998455
[17]	validation_0-auc:0.998567
[18]	validation_0-auc:0.998592
[19]	validation_0-auc:0.998652
[20]	validation_0-auc:0.998669
[21]	validation_0-auc:0.998675
[22]	validation_0-auc:0.998687
[23]	validation_0-auc:0.998726
[24]	validation_0-auc:0.998883
[25]	validation_0-auc:0.998775
[26]	validation_0-auc:0.998883
[27]	validation_0-auc:0.998884
[28]	validation_0-auc:0.99889
[29]	validation_0-auc:0.99890
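The early-stopping rule used in this cell can be sketched in a few lines. This is an illustration of the logic only, with a made-up function name and a short hypothetical score sequence; it is not XGBoost's implementation.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, best_score), stopping once `patience` rounds pass without improvement."""
    best_round, best_score = 0, float('-inf')
    for rnd, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_score

# Hypothetical validation AUC per boosting round
scores = [0.90, 0.95, 0.94, 0.96, 0.955, 0.958, 0.959]
stopped = early_stopping_round(scores, patience=3)
print(stopped)
```

In the log above the AUC is still improving within any 20-round window, so the stopping condition has not been triggered in the rounds shown.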

In [26]:
model_eval(X_test,y_test,xgb)

1. Confusion matrix:

[[1704    3]
 [  43 3929]]

2. Precision, recall and F1 score:

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1707
           1       1.00      0.99      0.99      3972

   micro avg       0.99      0.99      0.99      5679
   macro avg       0.99      0.99      0.99      5679
weighted avg       0.99      0.99      0.99      5679


3. AUC score:

0.9993605649623521


## XGBoost really is a powerful tool!
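As a lightweight take on the comparison the optional task asks for, the sketch below derives accuracy from the four confusion matrices reported in this notebook and prints a text bar chart. The model labels and the bar scaling are arbitrary choices for display; a plotting library would be the natural next step.

```python
# Confusion matrices copied from the evaluation outputs above
results = {
    "RF + term frequency": [[1283, 424], [43, 3929]],
    "RF + NMF topics":     [[1607, 100], [86, 3886]],
    "RF + bigrams":        [[1671, 36], [36, 3936]],
    "XGBoost + bigrams":   [[1704, 3], [43, 3929]],
}

def accuracy(cm):
    """Share of correct predictions: diagonal sum over total count."""
    correct = cm[0][0] + cm[1][1]
    return correct / sum(sum(row) for row in cm)

accuracies = {name: accuracy(cm) for name, cm in results.items()}
best = max(accuracies, key=accuracies.get)

for name, acc in accuracies.items():
    bar = "#" * round((acc - 0.90) * 500)   # bar length shows accuracy above 0.90
    print(f"{name:<22}{acc:.4f} {bar}")
print("best:", best)
```

All four models share the same 5,679-row test split, so the accuracies are directly comparable, with the caveat that XGBoost also used that split for early stopping.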