# Feature Enhancement

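**Feature extraction** converts each raw data record into a feature vector, a process that also involves quantifying the data's features. **Feature selection** goes one step further: from the high-dimensional, already vectorized features, it picks the combination that is most useful for the task at hand, in order to further improve model performance.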

In [1]:
# Some symbolic features are already fairly structured and stored as dictionaries.
# In that case, DictVectorizer can extract and vectorize the features directly from the dictionaries.

measurements = [{'city': 'Dubai', 'temperature': 33.}, {'city': 'London', 'temperature': 12.}, {'city': 'San Fransisco', 'temperature': 18.}]

In [4]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

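# Print the transformed feature matrix
print(vec.fit_transform(measurements).toarray())
# Print the meaning of each feature dimension
# (scikit-learn >= 1.0; older versions used vec.get_feature_names(), which was removed in 1.2)
print(vec.get_feature_names_out().tolist())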

[[  1.   0.   0.  33.]
 [  0.   1.   0.  12.]
 [  0.   0.   1.  18.]]
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
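To make the output above easier to read, here is a minimal pure-Python sketch of the encoding idea: each string-valued field becomes one 0/1 column per `key=value` pair, while a numeric field keeps its value in a single column. The helper `dict_vectorize` is hypothetical and written only for illustration; it is not scikit-learn's implementation.

```python
def dict_vectorize(records):
    """Encode a list of dicts as dense rows (illustrative helper, not sklearn code)."""
    # Collect one feature name per string category and one per numeric key.
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # one-hot column for a category
            else:
                row[index[key]] = float(value)      # numeric value kept as-is
        rows.append(row)
    return names, rows


measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]
names, rows = dict_vectorize(measurements)
print(names)
for row in rows:
    print(row)
```

The three `city=...` columns and the single `temperature` column correspond to the four dimensions of the matrix printed by DictVectorizer.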


In [5]:
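# For each training document, CountVectorizer only considers how often each word occurs in that document.
# Benchmark: naive Bayes classification on CountVectorizer features, without removing stop words.
from sklearn.datasets import fetch_20newsgroups  # loader for the 20 Newsgroups text dataset
# Download the news samples on demand; subset='all' fetches all of the nearly 20,000 documents and stores them in the variable news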
news = fetch_20newsgroups(subset='all')

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


In [8]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split  # train_test_split now lives in sklearn.model_selection
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
# Initialize CountVectorizer with its default configuration (which does not remove English stop words) and assign it to count_vec
count_vec = CountVectorizer()

In [10]:
# Convert the raw training and test texts into feature vectors using term counts only
x_count_train = count_vec.fit_transform(x_train)
x_count_test = count_vec.transform(x_test)

In [13]:
print(x_train[0])
print('----------------')
print(x_count_train[0])

From: scotts@math.orst.edu (Scott Settlemier)
Subject: FORSALE: MAG Innovision MX15F 1280x1024
Article-I.D.: gaia.1r7hir$9sk
Distribution: world
Organization: Oregon State University Math Department
Lines: 7
NNTP-Posting-Host: math.orst.edu

MAG Innovision MX15F
Fantastic 15" multiscan monitor that can display up to
1280x1024 noninterlaced (!) with .26 mm dot pitch.
If you are looking for a large crystal clear super vga
monitor then this is for you.
$430   call Scott at (503) 757-3483 or
email scotts@math.orst.edu

----------------
  (0, 60066)	1
  (0, 104942)	1
  (0, 14433)	1
  (0, 22750)	1
  (0, 17937)	1
  (0, 35665)	1
  (0, 44232)	1
  (0, 16311)	1
  (0, 79874)	1
  (0, 132903)	1
  (0, 132665)	1
  (0, 140565)	1
  (0, 129553)	1
  (0, 47467)	1
  (0, 51298)	1
  (0, 87060)	1
  (0, 65719)	2
  (0, 89395)	1
  (0, 34760)	1
  (0, 148646)	2
  (0, 76791)	1
  (0, 109290)	1
  (0, 57011)	1
  (0, 96571)	1
  (0, 11905)	1
  :	:
  (0, 88624)	1
  (0, 54291)	1
  (0, 137926)	1
  (0, 127872)	1
  (0, 105052
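Each `(row, column)  count` line in the printout above is one nonzero entry of a sparse matrix: the column is the index of a word in the learned vocabulary, and the value is how many times that word occurs in the document. The sketch below mimics this with the standard library only. The two-sentence corpus and the regex tokenizer (lowercased words of two or more characters) are assumptions made for illustration; scikit-learn's real tokenization and index assignment may differ in detail.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters (assumed tokenizer)

def tokenize(text):
    return TOKEN.findall(text.lower())

docs = [
    "monitor for sale, great monitor",
    "email for details",
]

# "Fit": build the vocabulary (word -> column index) from the training documents.
all_tokens = {token for doc in docs for token in tokenize(doc)}
vocabulary = {word: i for i, word in enumerate(sorted(all_tokens))}

# "Transform": count only the words that are already in the vocabulary.
def transform(text):
    counts = Counter(t for t in tokenize(text) if t in vocabulary)
    return sorted((vocabulary[w], c) for w, c in counts.items())

sparse_rows = [transform(doc) for doc in docs]
for row, entries in enumerate(sparse_rows):
    for col, count in entries:
        print(f"({row}, {col})\t{count}")
```

Words that were not seen while building the vocabulary are simply dropped at transform time, which is why the test texts are passed through `transform` rather than `fit_transform`.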

In [14]:
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes classifier
mnb = MultinomialNB()
mnb.fit(x_count_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
# Accuracy on the test set
print(mnb.score(x_count_test, y_test))

0.839770797963


In [18]:
# Other performance metrics
from sklearn.metrics import classification_report
y_predict = mnb.predict(x_count_test)
print(classification_report(y_test, y_predict, target_names=news.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
         

In [19]:
# Besides the frequency of a word in the current document, TfidfVectorizer also takes into account the inverse of the number of documents that contain the word. The more training documents there are, the more this weighting scheme pays off,
# because the point of counting term frequencies is to find the words that contribute most to the meaning of the document they appear in.
# Benchmark: naive Bayes classification on TfidfVectorizer features, without removing stop words.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()

x_tfidf_train = tfidf_vec.fit_transform(x_train)
x_tfidf_test = tfidf_vec.transform(x_test)

mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(x_tfidf_train, y_train)

print(mnb_tfidf.score(x_tfidf_test, y_test))

y_predict = mnb_tfidf.predict(x_tfidf_test)
print(classification_report(y_test, y_predict, target_names=news.target_names))

0.846349745331
                          precision    recall  f1-score   support

             alt.atheism       0.84      0.67      0.75       201
           comp.graphics       0.85      0.74      0.79       250
 comp.os.ms-windows.misc       0.82      0.85      0.83       248
comp.sys.ibm.pc.hardware       0.76      0.88      0.82       240
   comp.sys.mac.hardware       0.94      0.84      0.89       242
          comp.windows.x       0.96      0.84      0.89       263
            misc.forsale       0.93      0.69      0.79       257
               rec.autos       0.84      0.92      0.88       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.96      0.91      0.94       251
        rec.sport.hockey       0.88      0.99      0.93       233
               sci.crypt       0.73      0.98      0.83       238
         sci.electronics       0.91      0.83      0.87       249
                 sci.med       0.97      0.92      0.95     

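The experiment above suggests that, when there is plenty of training text, using TfidfVectorizer to damp the influence of common words on the classification decision often improves model performance: accuracy rose from about 0.840 with CountVectorizer to about 0.846 here.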
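To see why this weighting damps common words, the following sketch computes TF-IDF weights by hand for a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is taken here to match TfidfVectorizer's default settings; the three-document corpus is made up for illustration.

```python
import math
from collections import Counter

docs = [
    ["the", "monitor", "is", "for", "sale"],
    ["the", "game", "is", "tonight"],
    ["the", "key", "is", "secret"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency (assumed formula, see the text above).
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(doc):
    counts = Counter(doc)
    weights = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}  # L2-normalized row

weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

Words such as "the" and "is", which occur in every document, end up with a smaller weight than "monitor" or "sale", which occur in only one.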

The next experiment checks whether filtering out stop words with a blacklist during text feature extraction improves model performance.
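As a minimal sketch of what blacklist filtering means, the snippet below drops every token found in a stop word set before counting. The tiny `STOP_WORDS` set is a hypothetical stand-in chosen for illustration; passing `stop_words='english'` to the vectorizers uses scikit-learn's much larger built-in list instead.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "is", "for", "a", "of", "and", "to", "you", "this"}

def count_without_stop_words(text):
    tokens = text.lower().split()
    # Keep only tokens that are not on the blacklist.
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = count_without_stop_words("this monitor is for you and the monitor is clear")
print(counts)
```

Only the content-bearing words survive, so the classifier no longer spends feature dimensions on words that appear in nearly every document.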

In [21]:
count_filter_vec = CountVectorizer(analyzer='word', stop_words='english')
tfidf_filter_vec = TfidfVectorizer(analyzer='word', stop_words='english')

x_count_filter_train = count_filter_vec.fit_transform(x_train)
x_count_filter_test = count_filter_vec.transform(x_test)

x_tfidf_filter_train = tfidf_filter_vec.fit_transform(x_train)
x_tfidf_filter_test = tfidf_filter_vec.transform(x_test)

mnb_count_filter = MultinomialNB()
mnb_count_filter.fit(x_count_filter_train, y_train)
print(mnb_count_filter.score(x_count_filter_test, y_test))
print(classification_report(y_test, mnb_count_filter.predict(x_count_filter_test), target_names=news.target_names))

0.863752122241
                          precision    recall  f1-score   support

             alt.atheism       0.85      0.89      0.87       201
           comp.graphics       0.62      0.88      0.73       250
 comp.os.ms-windows.misc       0.93      0.22      0.36       248
comp.sys.ibm.pc.hardware       0.62      0.88      0.73       240
   comp.sys.mac.hardware       0.93      0.85      0.89       242
          comp.windows.x       0.82      0.85      0.84       263
            misc.forsale       0.90      0.79      0.84       257
               rec.autos       0.91      0.91      0.91       238
         rec.motorcycles       0.98      0.94      0.96       276
      rec.sport.baseball       0.98      0.92      0.95       251
        rec.sport.hockey       0.92      0.99      0.95       233
               sci.crypt       0.91      0.97      0.93       238
         sci.electronics       0.87      0.89      0.88       249
                 sci.med       0.94      0.95      0.95     

In [22]:
mnb_tfidf_filter = MultinomialNB()
mnb_tfidf_filter.fit(x_tfidf_filter_train, y_train)
print(mnb_tfidf_filter.score(x_tfidf_filter_test, y_test))
print(classification_report(y_test, mnb_tfidf_filter.predict(x_tfidf_filter_test), target_names=news.target_names))

0.882640067912
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.81      0.83       201
           comp.graphics       0.85      0.81      0.83       250
 comp.os.ms-windows.misc       0.84      0.87      0.86       248
comp.sys.ibm.pc.hardware       0.78      0.88      0.83       240
   comp.sys.mac.hardware       0.92      0.90      0.91       242
          comp.windows.x       0.95      0.88      0.91       263
            misc.forsale       0.90      0.80      0.85       257
               rec.autos       0.89      0.92      0.90       238
         rec.motorcycles       0.98      0.94      0.96       276
      rec.sport.baseball       0.97      0.93      0.95       251
        rec.sport.hockey       0.88      0.99      0.93       233
               sci.crypt       0.85      0.98      0.91       238
         sci.electronics       0.93      0.86      0.89       249
                 sci.med       0.96      0.93      0.95     

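The experiments show that TfidfVectorizer's approach to feature extraction and quantization has the edge over plain term counts, and that filtering stop words during text feature extraction improves accuracy by roughly 2 to 4 percentage points over the unfiltered models in these runs.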
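The size of the gain can be checked directly from the four accuracy scores printed above. The sketch below only does that arithmetic; the dictionary layout and variable names are chosen here for illustration.

```python
# Test-set accuracies copied from the outputs above, rounded to four decimals.
accuracy = {
    ("count", "no filter"): 0.8398,
    ("count", "stop words removed"): 0.8638,
    ("tfidf", "no filter"): 0.8463,
    ("tfidf", "stop words removed"): 0.8826,
}

gains = {}
for vectorizer in ("count", "tfidf"):
    before = accuracy[(vectorizer, "no filter")]
    after = accuracy[(vectorizer, "stop words removed")]
    gains[vectorizer] = (after - before) * 100  # gain in percentage points
    print(f"{vectorizer}: {before:.4f} -> {after:.4f} (+{gains[vectorizer]:.1f} points)")
```

These figures come from a single train/test split (`random_state=33`), so the exact gains may vary with a different split.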