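# Chapter 8. Applying Machine Learning to Sentiment Analysis
- Cleaning and preparing text data
- Building feature vectors from text documents
- Training a machine learning model to classify movie reviews as positive or negative
- Working with large text datasets using out-of-core learning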

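## 1. Obtaining the IMDb movie review dataset
Download the dataset from http://ai.stanford.edu/~amaas/data/sentiment/ and extract it. Then read the individual text files into a single pandas DataFrame, which is saved as a CSV file further below: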

In [1]:
import pandas as pd
import pyprind
import os

basepath = './aclImdb/'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
# Collect the rows in a list and build the DataFrame once at the end;
# DataFrame.append was removed in pandas 2.0 and is slow inside a loop
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:12


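The class labels in the assembled dataset are sorted, so we shuffle the rows with the `permutation` function from `np.random`. This matters later, when the data is split into training and test sets: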

In [2]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

Save the shuffled data as a CSV file:

In [3]:
df.to_csv('./movie_data.csv', index=False)

In [1]:
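# The copy of movie_data.csv provided with the book is used here. The self-assembled
# dataset caused problems during training later on, yielding an accuracy of 100%.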
import pandas as pd
df = pd.read_csv('./movie_data.csv')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


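## 2. Introducing the bag-of-words model
- Create a vocabulary of unique tokens (for example, words) from the entire set of documents;
- Construct a feature vector for each document that contains the number of times each word occurs in that document.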
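The two steps can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements it; the helper names `build_vocabulary` and `to_count_vector` are made up for this example, and tokenization is simplified to lowercasing and whitespace splitting.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
]

def build_vocabulary(documents):
    """Map every unique lowercase token to a column index (alphabetical order)."""
    tokens = sorted({word for doc in documents for word in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}

def to_count_vector(document, vocabulary):
    """Return the raw term counts of one document, aligned with the vocabulary."""
    counts = Counter(document.lower().split())
    return [counts.get(token, 0) for token in sorted(vocabulary, key=vocabulary.get)]

vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Every document ends up as a vector of the same length, one position per vocabulary entry, regardless of how many words the document contains.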

### a. Transforming words into feature vectors
scikit-learn's `CountVectorizer` class takes an array of text data as input and returns the bag-of-words representation:

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


`bag` above stores the word counts for every document. Each sample can now be printed as a feature vector:

In [3]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


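Each index position in a feature vector corresponds to the integer value stored in the vocabulary dictionary produced by `CountVectorizer`. For example, `and` has the value 0 in the vocabulary, so its count sits at index 0 of the feature vector; `one` is at index 2, and so on. The values in the feature vectors are also called raw term frequencies: ${\rm tf}(t,d)$, the number of times term $t$ occurs in document $d$.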
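The index mapping can be made explicit by inverting the vocabulary dictionary. The sketch below copies the vocabulary and the third count vector from the outputs above; the variable names are chosen for this example only.

```python
# Vocabulary and third count vector copied from the printed outputs.
vocabulary = {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8,
              'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
doc3_counts = [2, 3, 2, 1, 1, 1, 2, 1, 1]

# Invert the mapping so that position i holds the term stored with value i.
terms_by_index = sorted(vocabulary, key=vocabulary.get)
term_frequencies = dict(zip(terms_by_index, doc3_counts))

print(terms_by_index)
print(term_frequencies['is'])   # raw term frequency of 'is' in the third document
```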

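### b. Assessing word relevancy via term frequency-inverse document frequency
When analyzing text data, we often encounter words that occur across many documents of both classes. Such frequently occurring words typically carry little discriminatory information. The term frequency-inverse document frequency (tf-idf) technique downweights them: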

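$$ \text{tf-idf}(t, d) = {\rm tf}(t, d) \times {\rm idf}(t, d)$$

Here ${\rm tf}(t, d)$ is the term frequency introduced in the previous section, and the inverse document frequency is calculated as follows:

$${\rm idf}(t, d) = \log\frac{n_d}{1 + {\rm df}(d, t)}$$

where $n_d$ is the total number of documents and ${\rm df}(d, t)$ is the number of documents $d$ that contain the term $t$.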
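A small sketch of this textbook formula on the three example sentences may help. It assumes the natural logarithm and a simplified tokenizer (lowercase, commas dropped, whitespace split); the function names are hypothetical.

```python
import math

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(document):
    """Lowercase, drop commas, and split on whitespace (a simplification)."""
    return document.lower().replace(',', ' ').split()

def document_frequency(term, documents):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in documents if term in tokenize(doc))

def idf(term, documents):
    """Textbook idf: log(n_d / (1 + df)), using the natural logarithm."""
    n_d = len(documents)
    return math.log(n_d / (1 + document_frequency(term, documents)))

for term in ('is', 'sun', 'two'):
    print(term, document_frequency(term, docs), round(idf(term, docs), 3))
```

A word that occurs in every document, such as `is`, receives the lowest idf of the three, while a word that occurs in a single document, such as `two`, receives the highest.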

scikit-learn's `TfidfTransformer` takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
# Set the printing precision for floating-point output
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


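As the output shows, `is` has the highest term frequency in the third document, where it occurs 3 times. After the transformation, however, it receives a relatively small tf-idf (0.45), because it also occurs in the first and second documents and is therefore unlikely to carry useful discriminatory information.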

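If we calculate the tf-idf by hand, the result differs from the standard textbook formula defined above. scikit-learn implements the idf as:

$$\text{idf} (t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

and the tf-idf as: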

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$
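As a worked example under these two definitions, and assuming the natural logarithm, take the word `is` in the third document $d_3$. It has a term frequency of 3 and occurs in all 3 of the 3 documents:

$$\text{idf}(\text{is}, d_3) = \log\frac{1 + 3}{1 + 3} = \log 1 = 0$$

$$\text{tf-idf}(\text{is}, d_3) = 3 \times (0 + 1) = 3$$

This value of 3 is the un-normalized tf-idf. The 0.45 in the printed matrix results only after the whole row has been normalized.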

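While it is more typical to normalize the raw term frequencies before calculating the tf-idfs, `TfidfTransformer` normalizes the tf-idfs directly. By default it applies L2-normalization (`norm='l2'`), which divides an un-normalized feature vector $v$ by its L2-norm and returns a vector of length 1: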

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$
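To check these formulas against the printed matrix, the third row can be recomputed by hand. The sketch below uses only the standard library; the term counts come from the count matrix shown earlier, and the document frequencies are read off the same matrix (how many of the three rows have a non-zero entry in each column).

```python
import math

# Raw term counts of the third example document and the document frequency of
# each term across the three example documents, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]
df = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_docs = 3

# Smoothed idf plus one, multiplied by the term frequency (un-normalized tf-idf).
raw_tfidf = [tf * (math.log((1 + n_docs) / (1 + d)) + 1) for tf, d in zip(tf_doc3, df)]

# L2-normalization: divide every entry by the Euclidean length of the vector.
l2_norm = math.sqrt(sum(value ** 2 for value in raw_tfidf))
normalized = [round(value / l2_norm, 2) for value in raw_tfidf]

print(normalized)
```

The rounded values agree with the third row printed by `TfidfTransformer`, including the 0.45 for `is`.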

### c. Cleaning text data
First, display the last 50 characters of the first document (index 0) in the shuffled dataset:

In [6]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

The text contains HTML markup, punctuation, and other non-letter characters. We remove all of these and keep only emoticons.

In [7]:
import re

def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Find emoticons and store them in the list `emoticons`
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove all non-word characters and convert the text to lowercase, then
    # append the emoticons; the nose character (-) is dropped for consistency
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

In [8]:
preprocessor(df.loc[60, 'review'][-50:])

'a different angle and ethnicity it s a good rent '

In [9]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
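The emoticon pattern reads as three parts: eyes (`:`, `;` or `=`), an optional nose (`-`), and a mouth (`)`, `(`, `D` or `P`). A quick way to see what it does and does not match, using a few made-up sample strings:

```python
import re

# Same emoticon pattern as in preprocessor: eyes, optional nose, mouth.
EMOTICON_PATTERN = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

samples = ['great :-)', 'meh =(', 'wink ;D', 'tongue :P', 'no match :-|', 'time 10:30']
for sample in samples:
    print(repr(sample), re.findall(EMOTICON_PATTERN, sample))
```

Strings such as `:-|` or the colon in `10:30` are not matched, because the character after the eyes (and optional nose) is not one of the four mouth characters.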

In [10]:
df['review'] = df['review'].apply(preprocessor)
df['review'].head()

0    in 1974 the teenager martha moxley maggie grac...
1    ok so i really like kris kristofferson and his...
2     spoiler do not read this if you think about w...
3    hi for all the people who have seen this wonde...
4    i recently bought the dvd forgetting just how ...
Name: review, dtype: object

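### d. Processing documents into tokens
To break the text corpus into individual elements, a common way to tokenize documents is to split them into individual words at their whitespace characters: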

In [11]:
def tokenizer(text):
    return text.split()

In [12]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Another approach is word stemming, the process of stripping affixes from a word to reduce it to its root form:

In [13]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [14]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

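Before training the model, we apply one more technique: stop-word removal. Stop words are words that are so common in all kinds of text that they carry little or no information useful for distinguishing between classes, for example `is` and `and`.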

In [15]:
import nltk

# Download the stop-word list provided by NLTK
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/tuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

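## 3. Training a logistic regression model for document classification
Split the text documents cleaned in the previous section into 25,000 documents for training and 25,000 documents for testing: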

In [29]:
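# .loc slicing includes the end label, so stop at 24999 to get exactly
# 25,000 training rows and avoid sharing row 25000 with the test set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values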

Next, use `GridSearchCV` with 5-fold cross-validation to find the optimal set of parameters:

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
import multiprocessing

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear', random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=1)
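
The size of this search can be estimated before starting it. The sketch below counts the parameter combinations in a grid shaped like `param_grid`; the stop-word list and the two tokenizers are replaced by placeholders, since only the number of options per parameter matters for the count.

```python
from itertools import product

# Placeholders for the objects referenced in param_grid; only the number of
# options per parameter matters for counting.
stop = ['placeholder stop-word list']
tokenizer = 'tokenizer'
tokenizer_porter = 'tokenizer_porter'

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

def count_candidates(grid):
    """Number of parameter combinations in one dictionary of the grid."""
    return len(list(product(*grid.values())))

n_candidates = sum(count_candidates(grid) for grid in param_grid)
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

Each dictionary contributes 1 x 2 x 2 x 2 x 3 = 24 combinations, which gives the 48 candidates and 240 fits reported in the log of the fit. With `n_jobs=1` these fits run one after another.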

In [31]:
# After switching to the movie_data.csv provided with the book, the grid search took about 5 hours to finish
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 303.1min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...nalty='l2', random_state=0, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=Tr

In [32]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each'

In [33]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.897


## 4. Working with bigger data: online algorithms and out-of-core learning

In [18]:
import numpy as np
import re
from nltk.corpus import stopwords

# Defined here as well so that this cell runs on its own
stop = stopwords.words('english')

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

# Read and yield one document at a time
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
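
The slicing in `stream_docs` assumes that every line ends with a comma, a single-digit label, and a newline. The sketch below mirrors that logic on a tiny in-memory file with made-up contents and compares it with a hypothetical variant, `stream_docs_csv`, that lets the standard `csv` module handle quoting instead.

```python
import csv
import io

# A tiny in-memory stand-in for movie_data.csv (contents are made up for illustration).
sample_csv = io.StringIO(
    'review,sentiment\n'
    '"A good movie, really",1\n'
    'Boring plot,0\n'
)

def stream_docs_sliced(handle):
    """Mirror the slicing in stream_docs: strip the trailing comma, label digit, newline."""
    next(handle)  # skip header
    for line in handle:
        yield line[:-3], int(line[-2])

def stream_docs_csv(handle):
    """Hypothetical alternative that lets the csv module parse each record."""
    reader = csv.reader(handle)
    next(reader)  # skip header
    for text, label in reader:
        yield text, int(label)

sliced = list(stream_docs_sliced(sample_csv))
sample_csv.seek(0)
parsed = list(stream_docs_csv(sample_csv))
print(sliced)
print(parsed)
```

The sliced version keeps the surrounding CSV quotes as part of the text, while the `csv` version removes them. For a bag-of-words model this difference is usually harmless, because the tokenizer discards punctuation anyway.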

Verify that `stream_docs` works correctly; it should return a tuple consisting of the first document and its class label:

In [19]:
next(stream_docs(path='./movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [20]:
def get_minibatch(doc_stream, size):
    """
    Return the requested number of documents and their class labels
    """
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            # Accumulate documents and class labels
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
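
One detail of `get_minibatch` is easy to miss: if the stream runs out in the middle of a batch, the partially filled batch is discarded and `(None, None)` is returned. The sketch below repeats the function so that it runs on its own and feeds it a made-up stream of five documents.

```python
def get_minibatch(doc_stream, size):
    """Same logic as above: collect `size` (text, label) pairs from the stream."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical toy stream of five labelled documents.
toy_stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)   # only one document is left
print(first)
print(second)
print(third)
```

With 50,000 documents and batch sizes that divide evenly into that total, no documents are lost in this way.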

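`HashingVectorizer` does not need to keep a vocabulary dictionary in memory, which reduces memory usage. We initialize `HashingVectorizer` with the `tokenizer` function and set the number of features to $2^{21}$: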
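The idea behind it, often called the hashing trick, can be sketched in a few lines. This is an illustration only: it uses `zlib.crc32` because that function is in the standard library and deterministic across runs, whereas scikit-learn uses MurmurHash3, and the function name `hash_vectorize` is made up.

```python
import zlib

def hash_vectorize(tokens, n_features):
    """Map tokens to column indices with a hash instead of a stored vocabulary.

    Returns a sparse representation as a dict {column_index: count}.
    """
    vector = {}
    for token in tokens:
        # The column index is derived from the token itself, so no dictionary
        # of known words has to be built or kept in memory.
        index = zlib.crc32(token.encode('utf-8')) % n_features
        vector[index] = vector.get(index, 0) + 1
    return vector

tokens = 'the sun is shining the weather is sweet'.split()
large = hash_vectorize(tokens, n_features=2 ** 21)
small = hash_vectorize(tokens, n_features=4)
print(large)
print(small)
```

With only 4 columns, at least two of the six distinct words must share a column (a collision); with $2^{21}$ columns such collisions become rare.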

In [21]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

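# A large number of features for HashingVectorizer reduces the chance of hash collisions, but it also increases the number of coefficients in the logistic regression model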
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 21,
                         preprocessor=None,
                         tokenizer=tokenizer)

# loss='log_loss' selects logistic regression (named 'log' in scikit-learn < 1.1)
clf = SGDClassifier(loss='log_loss', random_state=1, max_iter=1)
doc_stream = stream_docs(path='./movie_data.csv')

In [22]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
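    # Iterate over 45 mini-batches of 1,000 documents each
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    # partial_fit updates the model incrementally with each mini-batch streamed
    # from local storage, so the full dataset never has to fit in memory
    clf.partial_fit(X_train, y_train, classes=classes)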
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:20
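
What incremental training does with each mini-batch can be illustrated with a tiny logistic regression trained by stochastic gradient descent. This is a simplified sketch in plain Python, not scikit-learn's implementation: the class `OnlineLogisticRegression`, the fixed learning rate, and the toy data generator are all assumptions made for this example.

```python
import math
import random

class OnlineLogisticRegression:
    """Tiny dense logistic regression trained by per-sample SGD (illustrative only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        """Update the weights with one mini-batch; earlier batches are not revisited."""
        for x, target in zip(X, y):
            error = self.predict_proba(x) - target   # gradient of the log loss w.r.t. z
            self.weights = [w - self.learning_rate * error * v
                            for w, v in zip(self.weights, x)]
            self.bias -= self.learning_rate * error
        return self

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

def make_batch(rng, size):
    """Hypothetical toy data: the label is 1 when the first feature exceeds the second."""
    X = [[rng.random(), rng.random()] for _ in range(size)]
    y = [1 if a > b else 0 for a, b in X]
    return X, y

rng = random.Random(0)
model = OnlineLogisticRegression(n_features=2)
for _ in range(50):                      # 50 mini-batches of 20 samples each
    X_batch, y_batch = make_batch(rng, 20)
    model.partial_fit(X_batch, y_batch)

X_test, y_test = make_batch(rng, 200)
accuracy = sum(model.predict(x) == t for x, t in zip(X_test, y_test)) / len(y_test)
print('Accuracy: %.3f' % accuracy)
```

Only the current batch and the weight vector are held in memory at any time, which is what makes this style of training suitable for datasets that are too large to load at once.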


In [23]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


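Training works as expected with the dataset provided with the book, so the earlier problem was most likely introduced when the self-assembled data was split into training and test sets.

Use the remaining 5,000 documents to update the model: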

In [24]:
clf = clf.partial_fit(X_test, y_test)