In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

In [3]:
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

In [4]:
def letter_only(astr):
    for c in astr:
        if not c.isalpha():
            return False
    return True

def clean_text(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(' '.join([lemmatizer.lemmatize(word.lower()) for word in doc.split()
                                      if letter_only(word) and word not in all_names]))
    return cleaned_docs

## Binary classification

### Data preparation: this example uses two categories

In [16]:
categories = ['comp.graphics', 'sci.space']

data_train = fetch_20newsgroups(subset='train',categories=categories, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

cleaned_train = clean_text(data_train.data)
label_train = data_train.target
cleaned_test = clean_text(data_test.data)
label_test = data_test.target

In [17]:
len(label_train)
len(label_test)

1177

783

#### A good habit: check whether the classes are imbalanced

In [18]:
from collections import Counter
Counter(label_train)
Counter(label_test)

Counter({0: 584, 1: 593})

Counter({0: 389, 1: 394})
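
The counts are easier to judge as proportions. The sketch below rebuilds a label list with the same counts as the training set above; the helper name `class_ratios` is made up for illustration and is not part of the notebook.

```python
from collections import Counter

def class_ratios(labels):
    """Return each class's share of the samples (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Same class counts as Counter(label_train) above: 584 zeros, 593 ones.
labels = [0] * 584 + [1] * 593
ratios = class_ratios(labels)
print(ratios)
```

Both shares are close to one half, so plain accuracy is a reasonable metric here.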

#### Extract tf-idf features

In [5]:
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english',
                                   max_features = 8000)
term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
term_docs_test = tfidf_vectorizer.transform(cleaned_test)


### Apply SVM

In [20]:
from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=42) # linear kernel; penalty parameter C left at its default of 1.0

In [21]:
svm.fit(term_docs_train, label_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)

In [22]:
# After training, score(X, y) returns the accuracy directly; it runs the prediction internally
accuracy = svm.score(term_docs_test, label_test) 
print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))

The accuracy on testing set is: 96.4%
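
To make the shortcut explicit, the sketch below checks that `score` gives the same number as predicting first and then calling `accuracy_score`. The data is a synthetic stand-in generated with `make_classification`, which is an assumption for illustration and not the newsgroups features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the tf-idf features above).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

clf = SVC(kernel='linear', C=1.0, random_state=42).fit(X_train, y_train)

# score() predicts internally, then computes accuracy.
via_score = clf.score(X_test, y_test)
via_predict = accuracy_score(y_test, clf.predict(X_test))
print(via_score == via_predict)
```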


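## Multiclass classification
Two approaches:

### one-vs-all
![one-vs-all](one-vs-all.png)

For a new sample x', evaluate every classifier's **w**x' + b; the larger the value, the more likely x' belongs to that classifier's "positive" class.
For example: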

**w_r** x'+ b = - 0.78

**w_b** x' + b = - 0.35

**w_g** x' + b = - 0.642

so x' belongs to the blue class, because -0.35 is the largest of the three values.
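
The decision rule is just an argmax over the per-class values. A minimal sketch using the three numbers from the example, assuming the subscripts r, b and g stand for red, blue and green:

```python
# Decision values w·x' + b of the three one-vs-all classifiers in the example.
# The class names are an assumption based on the subscripts r, b, g.
scores = {'red': -0.78, 'blue': -0.35, 'green': -0.642}

# One-vs-all prediction: the class whose classifier gives the largest value.
predicted = max(scores, key=scores.get)
print(predicted)  # blue
```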

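### one-vs-one
![one-vs-one](one-vs-one.png)

Every pair of classes gets its own classifier, trained only on the samples of those two classes. A new sample x' is passed to all of these classifiers, and their predictions are combined by majority vote.

In terms of accuracy, the two strategies perform about the same. In terms of computation, one-vs-one trains more classifiers (K(K-1)/2 instead of K for K classes), but each one only sees the samples of two classes, which often makes it cheaper overall for SVMs. In sklearn, `SVC` uses one-vs-one.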
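
A small sketch of the one-vs-one bookkeeping, using the five categories from the next cell. The pairwise outcomes are invented purely to show the voting step, and `ovo_vote` is a hypothetical helper, not an sklearn function.

```python
from collections import Counter
from itertools import combinations

classes = ['alt.atheism', 'talk.religion.misc', 'comp.graphics',
           'sci.space', 'rec.sport.hockey']

# One-vs-one trains one classifier per unordered pair of classes.
pairs = list(combinations(classes, 2))
print(len(classes), 'one-vs-all classifiers')   # K
print(len(pairs), 'one-vs-one classifiers')     # K * (K - 1) / 2

def ovo_vote(pair_winners):
    """Return the class with the most pairwise wins (hypothetical helper)."""
    return Counter(pair_winners).most_common(1)[0][0]

# Made-up pairwise outcomes: 'sci.space' wins every duel it takes part in,
# and the first class of the pair wins otherwise.
winners = ['sci.space' if 'sci.space' in pair else pair[0] for pair in pairs]
print(ovo_vote(winners))
```

With five classes this gives 10 pairwise classifiers against 5 one-vs-all classifiers.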

In [23]:
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
    'rec.sport.hockey'
]
data_train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

cleaned_train = clean_text(data_train.data)
label_train = data_train.target
cleaned_test = clean_text(data_test.data)
label_test = data_test.target

term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
term_docs_test = tfidf_vectorizer.transform(cleaned_test)

svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(term_docs_train, label_train)
accuracy = svm.score(term_docs_test, label_test)
print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)

The accuracy on testing set is: 88.6%


In [24]:
from sklearn.metrics import classification_report
prediction = svm.predict(term_docs_test)
report = classification_report(label_test, prediction)
print(report)

             precision    recall  f1-score   support

          0       0.81      0.77      0.79       319
          1       0.91      0.94      0.93       389
          2       0.98      0.96      0.97       399
          3       0.93      0.93      0.93       394
          4       0.73      0.76      0.74       251

avg / total       0.89      0.89      0.89      1752
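
The `avg / total` row can be reproduced from the per-class rows. The sketch below assumes that row is a support-weighted mean of each column, which is how this older report format computes it; the numbers are copied from the report above.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
rows = [
    (0.81, 0.77, 0.79, 319),
    (0.91, 0.94, 0.93, 389),
    (0.98, 0.96, 0.97, 399),
    (0.93, 0.93, 0.93, 394),
    (0.73, 0.76, 0.74, 251),
]
total = sum(row[3] for row in rows)

def weighted(column):
    """Support-weighted mean of one metric column."""
    return sum(row[column] * row[3] for row in rows) / total

averages = [round(weighted(c), 2) for c in range(3)]
print(total, averages)
```

The weighted recall is the same quantity as overall accuracy, which is why it matches the 88.6% printed earlier after rounding.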



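The parameter C controls how strictly the margin is enforced, that is, the trade-off between bias and variance:
    
    The larger C is, the stricter the fit: narrower margin, lower bias, higher variance.
    The smaller C is, the looser the fit: wider margin, higher bias, lower variance.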
    
![C](C.png)
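
The effect of C on the margin can be observed directly on a linear SVM, where the margin width is 2 / ||w||. This is a sketch on two overlapping synthetic clusters; the data and the two C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping 2-D clusters as stand-in data (an assumption).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.0, random_state=42)

margins = {}
for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||.
    margins[C] = 2 / np.linalg.norm(clf.coef_)
    print('C={}: margin width={:.3f}, support vectors={}'.format(
        C, margins[C], clf.n_support_.sum()))
```

The smaller C should report the wider margin, in line with the rule above.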

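# Non-linearly separable problems: kernels
The idea is to map the data from a low-dimensional space to a higher-dimensional one.

The most commonly used kernel is RBF, also known as the Gaussian kernel: ![Gaussian](RBF.png)
where 𝛾 is the kernel coefficient, which determines how specifically or how generally the kernel fits the observed samples.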
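
For reference, a minimal NumPy sketch of the kernel, assuming the usual definition K(x, x') = exp(-𝛾 ||x - x'||²). The function name `rbf_kernel_value` and the sample points are made up for illustration.

```python
import numpy as np

def rbf_kernel_value(x1, x2, gamma):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

a, b = [1.0, 2.0], [2.0, 4.0]   # squared distance = 1 + 4 = 5
values = [rbf_kernel_value(a, b, gamma) for gamma in (0.01, 0.1, 1.0)]
print(values)
```

The same pair of points looks less and less similar as 𝛾 grows, which is what makes a large 𝛾 fit each training sample more locally.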

![gamma](gamma.png)

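A large 𝛾 (note the minus sign in front of it) means each Gaussian has a small variance, so the model fits the training samples very closely; this can lead to overfitting (low bias, high variance).

A small 𝛾 means each Gaussian has a large variance, so the fit is smoother and more general; this can lead to underfitting (high bias).

The best value of 𝛾 is found by cross-validation.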
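
One way to run that cross-validation is a grid search over candidate values. The sketch below uses the `make_moons` toy dataset and an arbitrary list of candidates; both are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Non-linearly separable toy data (an assumption; not from the notebook).
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

# Candidate gamma values are arbitrary picks spanning several magnitudes.
gammas = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(SVC(kernel='rbf', C=1.0), {'gamma': gammas}, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```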

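## Choosing between the linear and RBF kernels

As a rule of thumb, text data is usually linearly separable, so a linear kernel is the default choice.

In the following three cases, the linear kernel is preferred over the Gaussian kernel:

    Case 1: Both the number of instances and the number of features are large, above 10^4 or 10^5. The feature space is already high-dimensional enough that the extra features from the RBF transformation bring no performance gain while adding computational cost.
    Case 2: The number of features is much larger than the number of training samples. Besides the reason in case 1, the RBF kernel is markedly more prone to overfitting here.
    Case 3: The number of instances is much larger than the number of features. On low-dimensional data the RBF kernel generally improves performance by mapping to a higher-dimensional space, but because of its training complexity it usually stops being practical once the training set exceeds 10^6 or 10^7 samples.
    
In all other cases, the Gaussian kernel is the first choice.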

## News topic classification with SVM
Finally, we build a full SVM news topic classifier covering all the newsgroup categories (`categories = None`).

In [12]:
categories = None
data_train = fetch_20newsgroups(subset='train', categories=categories,random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

cleaned_train = clean_text(data_train.data)
label_train = data_train.target

cleaned_test = clean_text(data_test.data)
label_test = data_test.target

tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english',
                                   max_features = 8000)
term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
term_docs_test = tfidf_vectorizer.transform(cleaned_test)

In [13]:
# Rule of thumb: use a linear kernel for text classification; choose the penalty C by cross-validation
from sklearn.svm import SVC
svc_libsvm = SVC(kernel='linear')

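### GridSearchCV
Earlier, cross-validation was done by hand: split the data into folds, then loop over each parameter value with a for loop. GridSearchCV is a more elegant tool that handles the whole process: splitting the data, generating folds, cross-training and validating, and finding the best parameter combination.

All we need to specify is the parameters and their candidate values.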
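
To show what GridSearchCV automates, the sketch below runs the manual loop and the grid search side by side on the same folds. The data comes from `make_classification`, which is a stand-in assumption, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the newsgroups features).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
C_values = (0.1, 1, 10, 100)
folds = StratifiedKFold(n_splits=3)   # same splits for both approaches

# Manual version: loop over C, then over folds, and average the fold scores.
manual_scores = {}
for C in C_values:
    fold_scores = []
    for train_idx, val_idx in folds.split(X, y):
        clf = SVC(kernel='linear', C=C).fit(X[train_idx], y[train_idx])
        fold_scores.append(clf.score(X[val_idx], y[val_idx]))
    manual_scores[C] = np.mean(fold_scores)
manual_best_score = max(manual_scores.values())

# GridSearchCV version: the same search in one call.
grid = GridSearchCV(SVC(kernel='linear'), {'C': C_values}, cv=folds)
grid.fit(X, y)

print(manual_scores)
print(grid.best_params_, grid.best_score_)
```

Because both versions use the same splitter, the best mean score from the loop should agree with `grid.best_score_`.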

In [14]:
parameters = {'C': (0.1, 1, 10, 100)}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(svc_libsvm, parameters, n_jobs=-1, cv=3)
# Set up with 3-fold CV, running in parallel on all available CPU cores (n_jobs=-1)

In [15]:
import timeit  # time the hyperparameter search
start_time = timeit.default_timer()
grid_search.fit(term_docs_train, label_train)
print('--- %0.3fs seconds ---' % (timeit.default_timer() - start_time))

GridSearchCV(cv=3, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': (0.1, 1, 10, 100)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

--- 353.210s seconds ---


In [16]:
# Best parameters found
grid_search.best_params_

{'C': 10}

In [17]:
# Mean 3-fold CV accuracy under the best parameters
grid_search.best_score_

0.8665370337634789

### With the best parameters found, apply the tuned SVM model to the unseen testing set:

In [18]:
svc_libsvm_best = grid_search.best_estimator_
accuracy = svc_libsvm_best.score(term_docs_test, label_test)
print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))

The accuracy on testing set is: 76.2%


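Note that the model was tuned on the original training set, which cross-validation splits into folds, so the validation sets come from inside it.

The best model is then evaluated on the original testing set, which checks how well it generalizes to completely new data.

Training a basic SVM is a quadratic programming problem. sklearn's svm module is built on two open-source libraries, libsvm and liblinear.
The 76.2% above comes from the libsvm-based SVC model.

#### Next, try the other one: LinearSVC, which also uses a linear kernel but is implemented on top of liblinear.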

In [19]:
from sklearn.svm import LinearSVC 
svc_linear = LinearSVC()
grid_search = GridSearchCV(svc_linear, parameters, n_jobs=-1, cv=3)
start_time = timeit.default_timer()
grid_search.fit(term_docs_train,label_train)
print('--- %0.3fs seconds' % (timeit.default_timer()-start_time))

GridSearchCV(cv=3, error_score='raise',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': (0.1, 1, 10, 100)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)

--- 14.563s seconds


In [20]:
grid_search.best_params_

{'C': 1}

In [21]:
grid_search.best_score_

0.8707795651405339

In [22]:
svc_linear_best = grid_search.best_estimator_
accuracy = svc_linear_best.score(term_docs_test,label_test)
print('The accuracy on testing set is : {0:.2f}%'.format(accuracy*100))

The accuracy on testing set is : 77.88%


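The accuracy is slightly higher and the search runs about 24 times faster (14.6 s versus 353.2 s). The liblinear library is designed for large datasets, whereas libsvm's training cost grows more than quadratically with the number of samples, so it does not scale well beyond about 10^5 training samples.

### Tuning the feature extractor (the TfidfVectorizer model) to improve performance further
Feature extraction and classification are two consecutive steps and should be cross-validated together. We do this with a **pipeline**: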

In [23]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')), # tfidf feature extractor
    ('svc', LinearSVC()),                            # linear SVM classifier
])
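
A pipeline built this way takes raw text end to end. The sketch below fits the same two steps on a tiny invented corpus; the documents, labels and the name `toy_pipeline` are assumptions made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, purely for illustration.
docs = ['the shuttle reached orbit', 'nasa launched a rocket',
        'render the polygon mesh', 'the gpu draws pixels']
labels = [1, 1, 0, 0]   # 1 = space, 0 = graphics (arbitrary encoding)

toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svc', LinearSVC()),
])

# fit() runs the vectorizer's fit_transform, then trains the classifier.
toy_pipeline.fit(docs, labels)

# predict() accepts raw strings; the fitted vectorizer is applied first.
predictions = toy_pipeline.predict(['rocket orbit', 'polygon pixels'])
print(predictions)
```

During cross-validation this matters because the vectorizer is refit on the training part of each fold, so no vocabulary or idf statistics leak in from the validation part.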

To tune the parameters of both steps at once, name each one as the step name and the parameter name joined by "__", as follows:

In [24]:
parameters_pipeline = {
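    'tfidf__max_df': (0.25, 0.5),  # maximum document frequency of a term; filters out terms that appear in too many documents
    'tfidf__max_features': (40000, 50000), 
    'tfidf__sublinear_tf': (True, False), # whether to scale term frequency with a log function
    'tfidf__smooth_idf': (True, False), # whether to add 1 to document frequencies, avoiding division by zero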
    'svc__C': (0.1, 1, 10, 100),
}
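
The two boolean options correspond to small formula changes. The sketch below checks the idf formulas against `TfidfVectorizer.idf_` on a tiny invented corpus and prints the sublinear scaling for a few raw counts; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, purely for illustration.
docs = ['space shuttle launch', 'space graphics', 'graphics card']
n_docs = len(docs)

# Document frequency of each term, in the vectorizer's vocabulary order.
counts = CountVectorizer().fit_transform(docs)
df = (counts.toarray() > 0).sum(axis=0)

idf_smooth = TfidfVectorizer(smooth_idf=True).fit(docs).idf_
idf_plain = TfidfVectorizer(smooth_idf=False).fit(docs).idf_

# smooth_idf=True adds 1 to both n_docs and df before taking the log.
ok_smooth = np.allclose(idf_smooth, np.log((1 + n_docs) / (1 + df)) + 1)
ok_plain = np.allclose(idf_plain, np.log(n_docs / df) + 1)
print(ok_smooth, ok_plain)

# sublinear_tf=True replaces a raw count tf with 1 + log(tf).
tf = np.array([1, 2, 10, 100])
print(1 + np.log(tf))
```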

In [25]:
grid_search = GridSearchCV(pipeline, parameters_pipeline, n_jobs=-1, cv=3)
start_time = timeit.default_timer()
grid_search.fit(cleaned_train, label_train) # note: the raw cleaned_train text, before feature extraction
print('--- %0.3fs secondes ---' % (timeit.default_timer() - start_time))


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'tfidf__max_df': (0.25, 0.5), 'tfidf__max_features': (40000, 50000), 'tfidf__sublinear_tf': (True, False), 'tfidf__smooth_idf': (True, False), 'svc__C': (0.1, 1, 10, 100)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

--- 653.694s secondes ---
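
The runtime is easier to interpret once the number of model fits is known. The sketch below rebuilds the same pipeline and grid, checks that every grid key is a valid pipeline parameter name, and counts the fits with `ParameterGrid`; the `+ 1` assumes the default `refit=True` shown in the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svc', LinearSVC()),
])
parameters_pipeline = {
    'tfidf__max_df': (0.25, 0.5),
    'tfidf__max_features': (40000, 50000),
    'tfidf__sublinear_tf': (True, False),
    'tfidf__smooth_idf': (True, False),
    'svc__C': (0.1, 1, 10, 100),
}

# Every grid key must be a parameter name the pipeline exposes.
valid_names = pipeline.get_params()
unknown = [name for name in parameters_pipeline if name not in valid_names]

n_candidates = len(ParameterGrid(parameters_pipeline))
n_fits = n_candidates * 3 + 1   # 3 folds each, plus the final refit
print(unknown, n_candidates, n_fits)
```

That is 64 parameter combinations and 193 fits in total, each of which re-runs the tf-idf extraction as well as the SVM training.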


In [28]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'svc__C': 1, 'tfidf__max_df': 0.5, 'tfidf__max_features': 40000, 'tfidf__smooth_idf': False, 'tfidf__sublinear_tf': True}
0.888368393141


Finally, apply it to the testing set:

In [30]:
pipeline_best = grid_search.best_estimator_
accuracy = pipeline_best.score(cleaned_test, label_test)
print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))

The accuracy on testing set is: 80.6%


With the best parameter combination, the classifier reaches 80.6% accuracy on the testing set.