## Sections
- [Obtaining the IMDb movie review dataset](#Obtaining-the-IMDb-movie-review-dataset)
- [Text feature extraction](#Text-feature-extraction)
    - [Bag-of-words model](#Bag-of-words-model)
    - [Bigrams and N-Grams](#Bigrams-and-N-Grams)
    - [Character n-grams](#Character-n-grams)
    - [Tfidf encoding](#Tfidf-encoding)
- [Cleaning text data](#Cleaning-text-data)
- [Processing documents into tokens](#Processing-documents-into-tokens)
- [Training a logistic regression model for sentiment classification](#Training-a-logistic-regression-model-for-sentiment-classification)
- [Working with bigger data - online algorithms and out-of-core learning](#Working-with-bigger-data---online-algorithms-and-out-of-core-learning)
- [Model persistence](#Model-persistence)

- [word2vec](#word2vec)

<br>
<br>

# Obtaining the IMDb movie review dataset

[[back to top](#Sections)]

The dataset can be downloaded from [here](http://ai.stanford.edu/~amaas/data/sentiment/).

After extracting the archive, the code below reads the data into a pandas DataFrame.

In [11]:
cat 0_9.txt

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

In [1]:
# pip install pyprind
import pyprind   # module for Python Progress Indicator
import pandas as pd
import os

labels = {'pos': 1, 'neg': 0}  # 1 = positive and 0 = negative
pbar = pyprind.ProgBar(50000)
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = 'data/aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:03:57


In [2]:
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | I went and saw this movie last night after bei... | 1 |
| 1 | Actor turned director Bill Paxton follows up h... | 1 |
| 2 | As a recreational golfer with some knowledge o... | 1 |


Shuffling the DataFrame:

In [3]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

Optional: Saving the assembled data as a CSV file:

In [4]:
df.to_csv('./data/movie_data.csv', index=False, encoding='utf-8')

In [3]:
import pandas as pd
df = pd.read_csv('./data/movie_data.csv', encoding='utf-8')
df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


<br>
<br>

# Text feature extraction

[[back to top](#Sections)]

## Bag-of-words model

[[back to top](#Sections)]

Free text of variable length is very far from the fixed-length numeric representation that we need for machine learning with scikit-learn.  
However, there is an easy and effective way to go from text data to a numeric representation using the so-called [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model), which provides a data structure that is compatible with the machine learning algorithms in scikit-learn.

In [6]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [7]:
count.vocabulary_

{'and': 0,
 'is': 1,
 'one': 2,
 'shining': 3,
 'sun': 4,
 'sweet': 5,
 'the': 6,
 'two': 7,
 'weather': 8}

In [8]:
count.get_feature_names()  # scikit-learn >= 1.0: use get_feature_names_out()

['and', 'is', 'one', 'shining', 'sun', 'sweet', 'the', 'two', 'weather']

As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary that maps each unique word to an integer index. Next, let us print the feature vectors that we just created:

In [9]:
bag.toarray()  # each row is a document, each column a word; the values are word counts

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=int64)

In [10]:
count.inverse_transform(bag)

[array(['the', 'sun', 'is', 'shining'], 
       dtype='<U7'), array(['the', 'is', 'weather', 'sweet'], 
       dtype='<U7'), array(['the', 'sun', 'is', 'shining', 'weather', 'sweet', 'and', 'one',
        'two'], 
       dtype='<U7')]
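To make the mechanics concrete, the counts above can be reproduced without scikit-learn. The following is a minimal sketch, not the library's implementation: the helper names `tokenize` and `to_counts` are hypothetical, and the regular expression is assumed to mirror the default `token_pattern` shown in the `CountVectorizer` repr (tokens of two or more word characters, lowercased).

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(doc):
    # lowercase, then keep tokens of at least two word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

# vocabulary: every unique token, sorted alphabetically, mapped to a column index
vocabulary = {word: idx for idx, word in
              enumerate(sorted({tok for doc in docs for tok in tokenize(doc)}))}

def to_counts(doc):
    # one count per vocabulary entry, in column order
    counts = Counter(tokenize(doc))
    return [counts[word] for word in sorted(vocabulary, key=vocabulary.get)]

bag = [to_counts(doc) for doc in docs]
for row in bag:
    print(row)
```

Each row has one entry per vocabulary word, so documents of any length end up with the same number of features.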

<br>
<br>

## Bigrams and N-Grams

[[back to top](#Sections)]

In the last section, we used the so-called 1-gram (unigram) tokenization: each token represents a single element with regard to the splitting criterion. 

Entirely discarding word order is not always a good idea, as composite phrases often have specific meaning, and modifiers like "not" can invert the meaning of words.

A simple way to include some word order is to use n-grams, which look not only at single tokens but also at sequences of neighboring tokens. For example, in 2-gram (bigram) tokenization, we would group words together with an overlap of one word; in 3-gram (trigram) splits we would create an overlap of two words, and so forth:

- original text: "this is how you get ants"
- 1-gram: "this", "is", "how", "you", "get", "ants"
- 2-gram: "this is", "is how", "how you", "you get", "get ants"
- 3-gram: "this is how", "is how you", "how you get", "you get ants"
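The splits listed above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `word_ngrams` is a hypothetical helper that splits on whitespace and slides a window of `n` tokens over the result.

```python
def word_ngrams(text, n):
    """Return all word n-grams of `text`, splitting on whitespace."""
    tokens = text.split()
    # slide a window of n tokens over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example = "this is how you get ants"
bigrams = word_ngrams(example, 2)
trigrams = word_ngrams(example, 3)
print(bigrams)
print(trigrams)
```

A text with k tokens yields k - n + 1 n-grams, which is why the trigram list is one entry shorter than the bigram list.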

Which "n" we choose for "n-gram" tokenization to obtain the optimal performance in our predictive model depends on the learning algorithm, dataset, and task. In other words, we have to treat the "n" in "n-grams" as a tuning parameter.

The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2):

In [11]:
bigram_vectorizer = CountVectorizer(ngram_range=(2,2))
bigram_vectorizer.fit_transform(docs).toarray()

array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)

In [12]:
bigram_vectorizer.vocabulary_

{'and one': 0,
 'is shining': 1,
 'is sweet': 2,
 'is two': 3,
 'one and': 4,
 'one is': 5,
 'shining the': 6,
 'sun is': 7,
 'sweet and': 8,
 'the sun': 9,
 'the weather': 10,
 'weather is': 11}

<br>
<br>

## Character n-grams

[[back to top](#Sections)]

Sometimes it is helpful to look at single characters rather than whole words.   
That is particularly useful if we have very noisy data and want to identify the language, or if we want to predict something about a single word.
We can look at characters instead of words by setting ``analyzer="char"``.

In [13]:
X = ['Some say the world will end in fire,', 'Some say in ice.']

In [14]:
char_vectorizer = CountVectorizer(analyzer="char")
char_vectorizer.fit(X)

CountVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [15]:
print(char_vectorizer.get_feature_names())  # scikit-learn >= 1.0: use get_feature_names_out()

[' ', ',', '.', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 't', 'w', 'y']
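The character vocabulary above can also be derived by hand, which shows what a character-level analyzer does. The sketch below is an assumption-laden illustration: `char_ngrams` is a hypothetical helper that lowercases the text and slides a window of `n` characters over it.

```python
from collections import Counter

def char_ngrams(text, n=1):
    """Return all character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

X = ['Some say the world will end in fire,', 'Some say in ice.']

# unigram vocabulary: every distinct character, sorted
vocabulary = sorted({gram for doc in X for gram in char_ngrams(doc)})
print(vocabulary)

# character bigrams of a single word, with their counts
ice_bigrams = Counter(char_ngrams('ice.', n=2))
print(ice_bigrams)
```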


<br>
<br>

## Tfidf encoding

[[back to top](#Sections)]

In [16]:
np.set_printoptions(precision=2)

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

Here, tf(t, d) is the term frequency that we introduced in the previous section,
and the inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. Note that adding the constant 1 to the denominator is optional; it avoids a division by zero for terms that occur in none of the training samples. The log is used to ensure that low document frequencies are not given too much weight.

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]


As we saw in the previous subsection, the word "is" (column 2) had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.

However, if we manually calculated the tf-idfs of the individual terms in our feature vectors, we would notice that the `TfidfTransformer` calculates the tf-idfs slightly differently from the standard textbook equations that we defined earlier. With `smooth_idf=True`, the idf implemented in scikit-learn is:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that was implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

By default (`norm='l2'`), scikit-learn's TfidfTransformer applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

To make sure that we understand how `TfidfTransformer` works, let us walk
through an example and calculate the tf-idf of the word "is" in the 3rd document.

The word "is" has a term frequency of 3 (tf = 3) in document 3, and the document frequency of this term is 3 since the term "is" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}("is", d3) = \log \frac{1+3}{1+3} = 0$$

Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$

In [18]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously. The step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}("is", d3) = 0.45$$
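The whole calculation for the 3rd document can be checked by hand from the raw counts. The sketch below uses only the standard library; the count matrix is copied from the bag-of-words output earlier, and `doc_freq`, `raw`, and `l2` are illustrative names rather than anything defined by scikit-learn.

```python
import math

# raw term counts from the bag-of-words example (rows = documents, columns = words)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)

# document frequency: in how many documents does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# smoothed idf, then tf-idf = tf * (idf + 1) for the last document
idf = [math.log((1 + n_docs) / (1 + d)) for d in doc_freq]
raw = [tf * (w + 1) for tf, w in zip(counts[-1], idf)]

# L2-normalization: divide by the Euclidean length of the vector
norm = math.sqrt(sum(v * v for v in raw))
l2 = [v / norm for v in raw]

print([round(v, 2) for v in raw])
print([round(v, 2) for v in l2])
```

After normalization the vector has unit length, so the values of different documents become comparable regardless of document size.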

In [19]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

array([ 3.39,  3.  ,  3.39,  1.29,  1.29,  1.29,  2.  ,  1.69,  1.29])

In [20]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([ 0.5 ,  0.45,  0.5 ,  0.19,  0.19,  0.19,  0.3 ,  0.25,  0.19])

As we can see, the results match the values returned earlier by scikit-learn's `TfidfTransformer` with `norm='l2'`. Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

<br>
<br>

# Cleaning text data

[[back to top](#Sections)]

In [21]:
df.loc[0, 'review'][-50:]  # last 50 characters from the first document

'is seven.<br /><br />Title (Brazil): Not Available'

The text contains HTML markup tags that we need to remove.
We will also strip all punctuation marks, keeping only emoticon characters such as ":)".

In [5]:
import re  # Python's regular expression module
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'\W+', ' ', text.lower()) + \
           ' '.join(emoticons).replace('-', '')   # append the emoticons to the end of the cleaned text
    return text

In [6]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [7]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [8]:
df['review'] = df['review'].apply(preprocessor)

<br>
<br>

# Processing documents into tokens

[[back to top](#Sections)]

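Next, we split the text corpora into individual elements (tokens).

Tokenizing means cutting a long text into a sequence of words; the simplest approach is to split on whitespace.

Word stemming reduces a word to its root form (for example, running -> run). One such algorithm is the Porter stemmer.

The following cells use nltk, which has to be installed first: `pip install nltk`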

In [26]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()  # splits on whitespace by default
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [27]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [28]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

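Stop words are very common words such as "is", "and", and "has". They carry little meaning of their own and mostly play a grammatical role, but they occur very frequently, so they are usually removed.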

In [29]:
import nltk
nltk.download('stopwords')  # download the stop-word lists

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/xiaokai/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [30]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

<br>
<br>

# Training a logistic regression model for sentiment classification

[[back to top](#Sections)]

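Split the cleaned reviews into a training set and a test set:

In [14]:
# Using 25,000 documents for training and 25,000 for testing takes about 40 minutes,
# so only the first ~5,000 documents are used for training here.
# Note: .loc slicing includes the end label, so rows 0..5000 go to training
# and the test set starts at 5001 to keep the two sets disjoint.
X_train = df.loc[:5000, 'review'].values
y_train = df.loc[:5000, 'sentiment'].values
X_test = df.loc[5001:, 'review'].values
y_test = df.loc[5001:, 'sentiment'].values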

In [32]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
tfidf = TfidfVectorizer(strip_accents=None, 
                        lowercase=False, 
                        preprocessor=None)

# grid search
param_grid = [{'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
             ]

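# convert the text to a tf-idf matrix first, then fit the classifier;
# the liblinear solver supports both the 'l1' and 'l2' penalties in the grid
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])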

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, 
                           scoring='accuracy',
                           cv=5, verbose=1,
                           n_jobs=-1)

In [33]:
# with the data reduced to about 5,000 documents, this takes roughly 8 minutes
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 13.6min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'clf__penalty': ['l1', 'l2'], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 't...ect__tokenizer': [<function tokenizer at 0x11ad34158>, <function tokenizer_porter at 0x11ad340d0>]}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=1
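The number of fits reported above follows directly from the size of the parameter grid. As a rough check, the sketch below counts the combinations with `itertools.product`; the strings `'stop'`, `'tokenizer'`, and `'tokenizer_porter'` are placeholders standing in for the actual stop-word list and tokenizer functions, and `n_candidates` is a hypothetical helper.

```python
from itertools import product

# first dictionary of the grid (tf-idf with default settings)
base_grid = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': ['stop', None],
    'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0],
}
# second dictionary: same options, but raw term counts (no idf, no normalization)
raw_counts_grid = dict(base_grid, vect__use_idf=[False], vect__norm=[None])

def n_candidates(grid):
    # every combination of one value per parameter is one candidate
    return len(list(product(*grid.values())))

candidates = n_candidates(base_grid) + n_candidates(raw_counts_grid)
total_fits = candidates * 5  # cv=5 folds per candidate
print(candidates, total_fits)
```

Each additional option multiplies the number of candidates, which is why the search becomes expensive quickly.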

In [34]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__penalty': 'l2', 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too'

In [35]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.872


<br>
<br>

# Working with bigger data - online algorithms and out-of-core learning

[[back to top](#Sections)]

Out-of-Core learning is the task of training a machine learning model on a dataset that does not fit into memory or RAM. This requires the following conditions:
    
- a **feature extraction** layer with **fixed output dimensionality**
- knowing the list of all classes in advance (in this case we only have positive and negative reviews)
- a machine learning **algorithm that supports incremental learning** (the `partial_fit` method in scikit-learn).
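The first requirement, a feature extractor with fixed output dimensionality, is what the hashing trick provides: each token is hashed to a column index, so no vocabulary has to be built or kept in memory. The following is a minimal pure-Python sketch of the idea. `hash_features` and the tiny `n_features=16` are hypothetical choices for illustration, and MD5 is used only because it is deterministic; scikit-learn's `HashingVectorizer` uses a different hash function and additional details not shown here.

```python
import hashlib

def hash_features(tokens, n_features=16):
    """Map a list of tokens to a fixed-length count vector via hashing."""
    vec = [0] * n_features
    for tok in tokens:
        # a deterministic hash of the token selects the column
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec

short_doc = hash_features('great movie'.split())
long_doc = hash_features('a great great movie with a weak ending'.split())
print(len(short_doc), len(long_doc))
```

Different tokens can collide in the same column; a large `n_features` keeps such collisions rare.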


In [36]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    # generator function: reads in and yields one document at a time
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # each line ends with ',<label>\n', so the label is the second-to-last character
            text, label = line[:-3], int(line[-2])
            yield text, label

In [37]:
# To verify that our stream_docs function works correctly
gen = stream_docs(path='./data/movie_data.csv')
next(gen)

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f
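As the output shows, slicing the raw line keeps the CSV quoting in the text (the leading `"` and the doubled `""`). That is harmless here because the tokenizer strips punctuation anyway, but if the unquoted text is needed, the standard `csv` module can do the parsing. The sketch below is an alternative under stated assumptions, not the notebook's method: `stream_docs_csv` is a hypothetical helper, and the two sample rows are made up for illustration.

```python
import csv
import io

# made-up sample in the same layout as movie_data.csv: header, then review,label
sample = ('review,sentiment\n'
          '"A fine film, really.",1\n'
          '"Dull and ""slow"".",0\n')

def stream_docs_csv(handle):
    """Yield (text, label) pairs, letting csv.reader undo the quoting."""
    reader = csv.reader(handle)
    next(reader)  # skip header
    for text, label in reader:
        yield text, int(label)

# the slicing approach on one raw line, for comparison
line = '"A fine film, really.",1\n'
sliced_text, sliced_label = line[:-3], int(line[-2])

parsed = list(stream_docs_csv(io.StringIO(sample)))
print(sliced_text, sliced_label)
print(parsed)
```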

In [38]:
def get_minibatch(doc_stream, size):
    '''
    take a document stream from the stream_docs function and 
    return a particular number of documents
    '''
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

In [39]:
from sklearn.feature_extraction.text import HashingVectorizer  # makes use of the Hashing trick
from sklearn.linear_model import SGDClassifier  # train a logistic regression model using small minibatches of documents

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

clf = SGDClassifier(loss='log_loss', random_state=1, max_iter=1)  # scikit-learn >= 1.1; older releases use loss='log' and n_iter=1
doc_stream = stream_docs(path='./data/movie_data.csv')

In [40]:
# iterate over 45 minibatches of 1,000 documents each
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:39


In [41]:
#use the last 5,000 documents to evaluate the performance
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


Although the accuracy is slightly lower than before, training is much faster and uses far less memory.

In [42]:
# use the last 5,000 documents to update our model;
# partial_fit continues training from the current state
clf = clf.partial_fit(X_test, y_test)

<br>
<br>

# Model persistence

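Training a model is expensive and time-consuming. We do not want to retrain it every time the application starts, so we need to save the model and be able to load it again to make new predictions and to update it.
The pickle module can be used for this: it serializes Python objects into a byte stream that can be written to disk and read back later.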

[[back to top](#Sections)]

After we trained the logistic regression model as shown above, we can save the classifier along with the stop words, Porter Stemmer, and `HashingVectorizer` as serialized objects to our local disk so that we can use the fitted classifier in our web application later.

In [43]:
import pickle
import os
# create a movieclassifier directory with a pkl_objects subdirectory
# to hold the serialized Python objects on the local drive
dest = os.path.join('movieclassifier', 'pkl_objects')
os.makedirs(dest, exist_ok=True)

# write the objects; the with-blocks make sure the files are closed
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=2)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=2)
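The write and read steps can be checked in isolation with a small round trip. The sketch below pickles a short, made-up stop-word list into a temporary directory and loads it again; the file name and the list contents are illustrative only, and the same pattern is assumed to apply to the fitted classifier.

```python
import os
import pickle
import tempfile

stop_words = ['a', 'the', 'and']  # stand-in for the real stop-word list

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords.pkl')

    # serialize: the object is written to disk as a byte stream
    with open(path, 'wb') as f:
        pickle.dump(stop_words, f, protocol=2)

    # deserialize: reading the file restores an equal object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stop_words)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.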

Next, we save the `HashingVectorizer` in a separate file so that we can import it later.

In [45]:
%%writefile movieclassifier/vectorizer.py
from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir, 
                'pkl_objects', 
                'stopwords.pkl'), 'rb'))

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) \
                   + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

Writing movieclassifier/vectorizer.py


After executing the preceding code cells, we can now restart the IPython notebook kernel to check if the objects were serialized correctly.

First, change the current Python directory to `movieclassifier`:

In [46]:
import os
os.chdir('movieclassifier')

In [47]:
import pickle
import re
import os
from vectorizer import vect

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))

In [48]:
import numpy as np
label = {0:'negative', 1:'positive'}

example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 82.52%


<br>
<br>

# word2vec

[[back to top](#Sections)]

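word2vec is a method for training word vectors proposed by [Mikolov et al.](http://arxiv.org/pdf/1301.3781.pdf). It comes in two variants, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model: the former predicts the center word from the other words in a window of the word sequence, while the latter predicts the other words from the center word. In practice, word2vec is usually trained with the Skip-Gram model combined with [Negative Sampling](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).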

![CBOW & skip-gram](figures/word2vec.png)
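The difference between the two variants is easiest to see in the training examples they generate from a sentence. The sketch below only builds those examples and does no training; `skipgram_pairs`, `cbow_examples`, and the window size are hypothetical names and values chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: (center word, one context word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: (list of context words, center word) examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

tokens = 'the movie was good'.split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg)
print(cb)
```

Skip-Gram turns each position into several input/target pairs, while CBOW produces one example per position with the whole context as input.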

In [10]:
# the gensim library includes a word2vec module
from gensim.models.word2vec import Word2Vec

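Train the word vectors. gensim accepts any iterable of tokenized sentences as training input, for example a class that implements `__iter__`: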

In [8]:
class MySentence:
    def __init__(self, data):
        self.data = data
    
    def __iter__(self):
        for line in self.data:
            yield line.lower().split()
        
train_corpus = MySentence(df['review'])

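# alternatively: train_corpus = [s.split() for s in df['review']]

In [9]:
model = Word2Vec(train_corpus, 
                 vector_size=200,  # dimensionality of the word vectors (named `size` before gensim 4.0)
                 epochs=20,  # number of passes over the training data (named `iter` before gensim 4.0)
                 workers=4)  # number of worker threads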

A quick look at the learned word vectors:

In [10]:
model.wv.most_similar('good')

[('decent', 0.7370797991752625),
 ('bad', 0.7096108198165894),
 ('great', 0.6964930295944214),
 ('nice', 0.6179347634315491),
 ('passable', 0.6055103540420532),
 ('cool', 0.5933259129524231),
 ('fine', 0.5922046899795532),
 ('funny', 0.5853421688079834),
 ('lousy', 0.5655009746551514),
 ('weak', 0.5625821352005005)]

Get the vector for a given word:

In [11]:
model.wv['good']

array([-2.16470599,  2.21028686, -2.42560244, -0.00975356,  0.38119075,
        0.28335837,  0.42377841,  1.30055726, -0.86984408, -0.0956203 ,
       -0.51766461, -1.41315424,  0.18915185, -2.9125576 , -2.36487603,
        3.38622642, -1.72619522, -0.6556592 ,  0.21282369,  1.09593034,
       -1.89328766, -1.39257145, -3.33210802, -1.55650818,  0.67174482,
        1.74836183, -1.9717145 ,  3.95806265,  1.53746414,  1.82085156,
        1.02304912, -1.3881073 ,  0.32539955, -0.84898555,  0.55216944,
       -1.15237427, -0.86884212,  0.0770219 , -2.18501925,  0.58333641,
       -1.10986626,  0.84115869,  3.0748837 , -1.02027082,  0.76942962,
        0.19782287, -0.68174469, -0.98301965, -1.85113227, -1.31748915,
       -2.22454476, -0.40986761, -1.53826535, -1.63622868,  0.99653172,
        0.28070587, -0.92813587, -0.72385532, -1.55703568,  0.36305025,
       -0.32554394, -2.66813493, -1.6127274 ,  0.70773345,  0.10997313,
       -1.41711199,  0.05896465,  0.62872934, -2.60525918, -0.54

Saving and loading the model:

In [12]:
model.save('data/imdb.d2v')

In [11]:
model = Word2Vec.load('data/imdb.d2v')

Using the trained word vectors for sentiment analysis:

In [12]:
# use the mean of the word vectors as the document vector
def get_doc_vec(sentence, model):
    scores = [model.wv[word] for word in sentence.split() 
              if word in model.wv]  # words rarer than min_count are not in the word2vec vocabulary
    
    return np.mean(scores, axis=0)
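The averaging step can be illustrated without a trained model. The sketch below uses a made-up two-dimensional lookup table in place of the word2vec vectors; `mean_vector` and `toy_vectors` are hypothetical names. It also returns a zero vector when no word is found, which is one possible choice for that edge case, since `np.mean` of an empty list would give `nan`.

```python
def mean_vector(sentence, vectors, dim):
    """Average the vectors of all known words; zero vector if none is known."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return [0.0] * dim
    # average each dimension separately
    return [sum(col) / len(found) for col in zip(*found)]

# toy lookup table standing in for the trained word vectors
toy_vectors = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}

doc_vec = mean_vector('good movie xyz', toy_vectors, dim=2)
empty_vec = mean_vector('xyz', toy_vectors, dim=2)
print(doc_vec, empty_vec)
```

Averaging discards word order, so the document vector is a bag-of-words summary in the embedding space.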

In [15]:
X_word2vec_train = np.array([get_doc_vec(sentence, model) for sentence in X_train])
X_word2vec_test = np.array([get_doc_vec(sentence, model) for sentence in X_test])

In [16]:
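from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=0, solver='liblinear')  # liblinear supports both 'l1' and 'l2'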

param_grid = [{'penalty': ['l1', 'l2'],
               'C': [1.0, 10.0, 100.0]}]

gs = GridSearchCV(lr, param_grid, 
                  scoring='accuracy',
                  cv=5, verbose=1,
                  n_jobs=-1)

gs.fit(X_word2vec_train, y_train);

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   15.8s finished


In [17]:
gs.best_params_

{'C': 1.0, 'penalty': 'l2'}

In [18]:
gs.best_score_

0.86322735452909416

In [19]:
clf = gs.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_word2vec_test, y_test))

Test Accuracy: 0.869


## Chinese news classification

In [20]:
from os import path
import os
import re
import pandas as pd
import numpy as np

In [21]:
rootdir = 'data/SogouC.reduced/Reduced'
dirs = os.listdir(rootdir)
dirs = [path.join(rootdir,f) for f in dirs if f.startswith('C')]
dirs

['data/SogouC.reduced/Reduced/C000008',
 'data/SogouC.reduced/Reduced/C000010',
 'data/SogouC.reduced/Reduced/C000013',
 'data/SogouC.reduced/Reduced/C000014',
 'data/SogouC.reduced/Reduced/C000016',
 'data/SogouC.reduced/Reduced/C000020',
 'data/SogouC.reduced/Reduced/C000022',
 'data/SogouC.reduced/Reduced/C000023',
 'data/SogouC.reduced/Reduced/C000024']

In [74]:
import codecs
def load_file(input_file):
    # the corpus files are GBK-encoded; bytes that cannot be decoded are skipped
    with codecs.open(input_file, mode='r', encoding='gbk', errors='ignore') as input_data:
        input_text = input_data.read()
    return input_text

In [75]:
print(load_file('data/SogouC.reduced/Reduced/C000008/10.txt'))

　　本报记者陈雪频实习记者唐翔发自上海
　　一家刚刚成立两年的网络支付公司，它的目标是成为市值100亿美元的上市公司。
　　这家公司叫做快钱，说这句话的是快钱的CEO关国光。他之前曾任网易的高级副总裁，负责过网易的上市工作。对于为什么选择第三方支付作为创业方向，他曾经对媒体这样说：“我能看到这个胡同对面是什么，别人只能看到这个胡同。”自信与狂妄只有一步之遥——这几乎是所有创业者的共同特征，是自信还是狂妄也许需要留待时间来考证。
　　对于市值100亿美元的上市公司，他是这样算这笔账的，“百度上市时广告客户数量只有4万,而且它所做的只是把客户吸引过来，就可以支撑起现有的庞大市值；而我们几年后的客户数量是几千万，而且这些客户都是能直接带来利润的，说市值100亿美元一点都不夸张。”
　　这家公司2005年年底注册用户达到400万，计划今年注册用户突破1000万，号称是国内最大的第三方网络支付平台。“在美国跟支付相关的收入已经超过了所有商业银行本身利差收入的总和，我所查到的数据是3000亿美元，其中超过70％是个人消费者带来的收入。”关国光喜欢借用美国支付产业的现状与中国的情况进行比较。虽然美国和中国差异显著，但他坚信中国的第三方支付市场前景非常广阔。
　　便利和安全挑战网络支付
　　“你只需要一个手机号码或者一个邮件地址就可以网络支付。”在快钱的户外广告中这样写道，这和传统的需要银行账户才能进行网络支付的习惯形成了鲜明的对比。
　　然而这种支付模式和传统的网络支付并无本质的区别，因为每一个手机号码和邮件地址背后都会对应着一个账户——这个账户可以是信用卡账户、借记卡账户，也包括邮局汇款、手机代收、电话代收、预付费卡和点卡等多种形式。
　　“快钱的功能其实就相当于融合了很多交易工具的VISA卡，所以又被称为网络VISA。”关国光说，“从本质上讲，我们和VISA等采用的底层技术是没有差别的，我们和它的区别在于VISA卡面对的交易工具比较单一，而快钱面对的是多种分散的交易工具。”
　　因为“信用缺位”，网络支付一直是困扰中国电子商务发展的瓶颈之一。网络支付平台相当于“信用缺位”条件下的“补位产物”，它把众多的银行卡整合到一个页面端口，以支付公司作为信用中介，在买家确认收到商品前，代替买卖双方暂时保管货款。
　　目前最知名的网络支付平台包括阿里巴巴的支付宝和eBa

In [77]:
text_t = {}
for i, d in enumerate(dirs):
    files = os.listdir(d)
    files = [path.join(d, x) for x in files if x.endswith('txt') and not x.startswith('.')]
    text_t[i] = [load_file(f) for f in files]

In [78]:
print(text_t[0][0][:100])

　　本报记者陈雪频实习记者唐翔发自上海
　　一家刚刚成立两年的网络支付公司，它的目标是成为市值100亿美元的上市公司。
　　这家公司叫做快钱，说这句话的是快钱的CEO关国光。他之前曾任网易的高级副


In [101]:
flen = [len(t) for t in text_t.values()]
labels = np.repeat(list(text_t.keys()),flen)

In [103]:
# flatten the nested list
import itertools
merged = list(itertools.chain.from_iterable(text_t.values()))
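
The two cells above rely on `np.repeat` and `itertools.chain` walking the dictionary in the same order, so that each label lines up with its document. A small sketch on a toy dictionary (the names `toy`, `toy_labels` and `toy_merged` are made up for illustration):

```python
import itertools
import numpy as np

# Toy stand-in for text_t: class index -> list of documents
toy = {0: ['a1', 'a2', 'a3'], 1: ['b1'], 2: ['c1', 'c2']}

# One count per class, in dictionary order
counts = [len(docs) for docs in toy.values()]

# Repeat each class index once per document in that class
toy_labels = np.repeat(list(toy.keys()), counts)

# Concatenate the per-class lists in the same order
toy_merged = list(itertools.chain.from_iterable(toy.values()))

print(toy_labels.tolist())
print(toy_merged)
```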

In [104]:
df = pd.DataFrame({'label': labels, 'txt': merged})
df.head()

| | label | txt |
|---|---|---|
| 0 | 0 | 本报记者陈雪频实习记者唐翔发自上海\r\n 一家刚刚成立两年的网络支付公司，它的目标是... |
| 1 | 0 | 证券通：百联股份未来5年有能力保持高速增长\r\n\r\n 深度报告 权威内参... |
| 2 | 0 | 5月09日消息快评\r\n\r\n 深度报告 权威内参 来自“证券通”www.... |
| 3 | 0 | 5月09日消息快评\r\n\r\n 深度报告 权威内参 来自“证券通”www.... |
| 4 | 0 | 5月09日消息快评\r\n\r\n 深度报告 权威内参 来自“证券通”www.... |


In [106]:
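# Segment each document into words with jieba
import jieba
jieba.enable_parallel(4)  # 4 worker processes; jieba's parallel mode is not supported on Windows
def cutword_1(x):
    # Join the segmented words with spaces so scikit-learn can tokenize them
    words = jieba.cut(x)
    return ' '.join(words)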

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/7t/2bxpnffn2r9fdg94r9fk6t4h0000gn/T/jieba.cache
Loading model cost 1.690 seconds.
Prefix dict has been built succesfully.


In [107]:
df['seg_word'] = df.txt.map(cutword_1)

In [108]:
df.head()

| | label | txt | seg_word |
|---|---|---|---|
| 0 | 0 | 本报记者陈雪频实习记者唐翔发自上海\r\n 一家刚刚成立两年的网络支付公司，它的目标是... | 本报记者 陈雪频 实习 记者 唐翔 发自 上海 \r\n 一家 刚刚 成立 ... |
| 1 | 0 | 证券通：百联股份未来5年有能力保持高速增长\r\n\r\n 深度报告 权威内参... | 证券 通 ： 百联 股份 未来 5 年 有 能力 保持高速 增长 \r\n ... |
| 2 | 0 | 5月09日消息快评\r\n\r\n 深度报告 权威内参 来自“证券通”www.... | 5 月 09 日 消息 快评 \r\n \r\n 深度 报告... |
| 3 | 0 | 5月09日消息快评\r\n\r\n 深度报告 权威内参 来自“证券通”www.... | 5 月 09 日 消息 快评 \r\n \r\n 深度 报告... |
| 4 | 0 | 5月09日消息快评\r\n\r\n 深度报告 权威内参 来自“证券通”www.... | 5 月 09 日 消息 快评 \r\n \r\n 深度 报告... |


In [111]:
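from pickle import dump, load
# Cache the segmented DataFrame so the slow jieba step need not be repeated
with open('data/tmdf.pickle', 'wb') as f:
    dump(df, f)
# To restore it later:
# with open('data/tmdf.pickle', 'rb') as f:
#     df = load(f)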

In [112]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(ngram_range=(1,1), min_df = 2, max_features = 10000)
xvec = vect.fit_transform(df.seg_word)
xvec.shape

(17910, 10000)
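
One detail worth checking here: `TfidfVectorizer`'s default `token_pattern` only keeps tokens of two or more characters, so single-character Chinese words produced by jieba are dropped from the vocabulary. The sketch below shows the effect on two made-up, pre-segmented sentences. Whether keeping single-character tokens helps on this corpus is not tested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, already segmented and space-joined like df.seg_word
docs = ['我 爱 北京 天安门', '他 爱 上海']

# Default token_pattern r"(?u)\b\w\w+\b": tokens need at least 2 characters
default_vect = TfidfVectorizer()
default_vect.fit(docs)
print(sorted(default_vect.vocabulary_))

# Alternative pattern that also keeps single-character tokens
keep_single = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
keep_single.fit(docs)
print(sorted(keep_single.vocabulary_))
```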

In [113]:
y = df.label

In [114]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
train_X, test_X, train_y, test_y = train_test_split(xvec, y , train_size=0.7, random_state=1)
clf = MultinomialNB()

In [115]:
clf.fit(train_X, train_y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [117]:
from sklearn import metrics
pre = clf.predict(test_X)
print(metrics.classification_report(test_y, pre))

             precision    recall  f1-score   support

          0       0.90      0.88      0.89       577
          1       0.89      0.81      0.85       603
          2       0.88      0.82      0.85       619
          3       0.98      0.97      0.98       584
          4       0.86      0.88      0.87       570
          5       0.88      0.79      0.83       600
          6       0.77      0.90      0.83       600
          7       0.76      0.83      0.80       615
          8       0.93      0.93      0.93       605

avg / total       0.87      0.87      0.87      5373



In [118]:
num_features = 100
min_word_count = 10
num_workers = 4
context = 5
epoch = 20
sample = 1e-5

In [121]:
# word2vec training input: one list of tokens per document
txt = df.seg_word.values
txtlist = [sent.split() for sent in txt]

In [119]:
from gensim.models import word2vec

In [122]:
# Note: gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`
model = word2vec.Word2Vec(txtlist, workers = num_workers,
                          sample = sample,
                          size = num_features,
                          min_count=min_word_count,
                          window = context,
                          iter = epoch)

In [123]:
model.syn0.shape

(57669, 100)

In [125]:
for w in model.most_similar('互联网'):
    print(w[0], w[1])


网络 0.7809788584709167
门户网站 0.7471935749053955
网络广告 0.7400577068328857
无线 0.7217098474502563
搜索引擎 0.716282308101654
在线 0.7109969854354858
网民 0.7070432901382446
服务商 0.7070391178131104
B2B 0.6956398487091064
艾瑞 0.6875394582748413


In [35]:
#model.save('sogo_wv')
#model = word2vec.Word2Vec.load('sogo_wv')

In [126]:
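# Average the word vectors of a document into a single document vector
# Build the vocabulary set once: testing `w in model.index2word` scans a list
# of the whole vocabulary for every word, which makes the loop very slow
vocab = set(model.index2word)
def sentvec_1(sent, m=num_features, model=model, vocab=vocab):
    res = np.zeros(m)
    num = 0
    for w in sent.split():
        if w in vocab:
            res += model[w]
            num += 1
    if num == 0:
        return np.zeros(m)
    return res / num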
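
To make the averaging step concrete, here is a self-contained sketch that uses a hand-written dictionary of 3-dimensional vectors in place of the trained model. The vectors and the helper name `average_vector` are made up for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the word2vec model
toy_vectors = {
    '网络': np.array([1.0, 0.0, 2.0]),
    '支付': np.array([3.0, 2.0, 0.0]),
}

def average_vector(sent, vectors, dim=3):
    """Mean of the vectors of in-vocabulary words; zeros if none are known."""
    known = [vectors[w] for w in sent.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# '平台' is out of vocabulary, so only two words contribute to the mean
doc_vec = average_vector('网络 支付 平台', toy_vectors)
empty_vec = average_vector('平台', toy_vectors)
print(doc_vec)
print(empty_vec)
```

Out-of-vocabulary words are skipped rather than counted as zeros, so they do not shrink the mean.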

In [134]:
import pyprind
pbar = pyprind.ProgBar(len(df.seg_word.values))
n = df.shape[0]
sent_matrix = np.zeros([n, num_features], float)
for i, sent in enumerate(df.seg_word.values):
    sent_matrix[i, :] = sentvec_1(sent)
    pbar.update()

0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 02:44:56


In [135]:
sent_matrix.shape 

(17910, 100)

In [136]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
train_X, test_X, train_y, test_y = train_test_split(sent_matrix, y , train_size=0.7, random_state=1)
clf = GradientBoostingClassifier()



In [137]:
clf.fit(train_X, train_y)
from sklearn import metrics
pre = clf.predict(test_X)
print(metrics.classification_report(test_y, pre))

             precision    recall  f1-score   support

          0       0.89      0.85      0.87       577
          1       0.83      0.80      0.81       603
          2       0.85      0.87      0.86       619
          3       0.97      0.97      0.97       584
          4       0.84      0.88      0.86       570
          5       0.83      0.80      0.81       600
          6       0.83      0.87      0.85       600
          7       0.76      0.79      0.78       615
          8       0.94      0.92      0.93       605

avg / total       0.86      0.86      0.86      5373



### Exercise: improve the Chinese text classification results yourself