# Contents:
- Sentiment analysis with bag of words
- <a href='#word2vec'>Training word vectors with word2vec</a>
- <a href='#sentiment'>Training a sentiment analysis model on word2vec</a>

### Import the required libraries

In [1]:
import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [2]:
import nltk
from nltk.corpus import stopwords

### Read the training data with pandas

In [3]:
BASE_PATH = os.getcwd()
training_file_path = os.path.join(BASE_PATH, 'data/labeledTrainData.tsv')

df = pd.read_csv(training_file_path, sep='\t', escapechar='\\')

In [4]:
print('Num of reviews: {}'.format(len(df)))

Num of reviews: 25000


In [5]:
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"""The Classic War of the Worlds"" by Timothy Hin..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [6]:
df['review'][1]

'"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells\' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells\' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.'

A sentiment of 1 marks a positive review; 0 marks a negative review.

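### Preprocess the review text in roughly these steps:

1. Strip HTML tags
1. Remove punctuation
1. Split into words/tokens
1. Remove stop words
1. Rejoin the tokens into a new sentence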
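
The five steps can be sketched end to end on a single string. This is a minimal, dependency-free illustration: the regex tag stripper is a crude stand-in for BeautifulSoup, and `DEMO_STOPWORDS` and `clean_text_sketch` are hypothetical names that do not appear in the notebook.

```python
import re

# Tiny illustrative stop-word set; the notebook loads a much larger list from a file.
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "it"}

def clean_text_sketch(text, stop_words=DEMO_STOPWORDS):
    # 1. Strip HTML tags (crude regex stand-in for BeautifulSoup.get_text()).
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Replace every non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and split on whitespace to get tokens.
    words = text.lower().split()
    # 4. Drop stop words.
    words = [w for w in words if w not in stop_words]
    # 5. Join the remaining tokens back into one string.
    return " ".join(words)

print(clean_text_sketch("The plot is <br />GREAT, and it works!"))
```

The cells that follow walk through the same steps one at a time on a real review.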

In [7]:
def display(text, title):
    print(title)
    print('\n----- divider -----\n')
    print(text)

In [9]:
raw_example = df['review'][1]
display(raw_example, 'Raw data')

Raw data

----- divider -----

"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.


In [13]:
example = BeautifulSoup(raw_example, 'html.parser').get_text()
display(example, 'Data with HTML tags removed')

Data with HTML tags removed

----- divider -----

"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.


In [14]:
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
display(example_letters, 'Data with punctuation removed')

Data with punctuation removed

----- divider -----

 The Classic War of the Worlds  by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H  G  Wells  classic book  Mr  Hines succeeds in doing so  I  and those who watched his film with me  appreciated the fact that it was not the standard  predictable Hollywood fare that comes out every year  e g  the Spielberg version with Tom Cruise that had only the slightest resemblance to the book  Obviously  everyone looks for different things in a movie  Those who envision themselves as amateur  critics  look only to criticize everything they can  Others rate a movie on more important bases like being entertained  which is why most people never agree with the  critics   We enjoyed the effort Mr  Hines put into being faithful to H G  Wells  classic novel  and we found it to be very entertaining  This made it easy to overlook what the  critics  perceive to be its shortcomings 


In [15]:
words = example_letters.lower().split()
display(words, 'Word list')

Word list

----- divider -----

['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'timothy', 'hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'mr', 'hines', 'succeeds', 'in', 'doing', 'so', 'i', 'and', 'those', 'who', 'watched', 'his', 'film', 'with', 'me', 'appreciated', 'the', 'fact', 'that', 'it', 'was', 'not', 'the', 'standard', 'predictable', 'hollywood', 'fare', 'that', 'comes', 'out', 'every', 'year', 'e', 'g', 'the', 'spielberg', 'version', 'with', 'tom', 'cruise', 'that', 'had', 'only', 'the', 'slightest', 'resemblance', 'to', 'the', 'book', 'obviously', 'everyone', 'looks', 'for', 'different', 'things', 'in', 'a', 'movie', 'those', 'who', 'envision', 'themselves', 'as', 'amateur', 'critics', 'look', 'only', 'to', 'criticize', 'everything', 'they', 'can', 'others', 'rate', 'a', 'movie', 'on', 'more', 'important', 'bases', 'like', 'being

In [16]:
# Download the stop words and other corpora used later
# nltk.download()

In [25]:
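# Note: this rebinds the name `stopwords`, shadowing the nltk.corpus.stopwords import above.
with open('./stopwords.txt') as f:
    stopwords = dict.fromkeys(line.strip() for line in f)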

In [26]:
stopwords

{"'d": None,
 "'ll": None,
 "'m": None,
 "'re": None,
 "'s": None,
 "'t": None,
 "'ve": None,
 'ZT': None,
 'ZZ': None,
 'a': None,
 "a's": None,
 'able': None,
 'about': None,
 'above': None,
 'abst': None,
 'accordance': None,
 'according': None,
 'accordingly': None,
 'across': None,
 'act': None,
 'actually': None,
 'added': None,
 'adj': None,
 'adopted': None,
 'affected': None,
 'affecting': None,
 'affects': None,
 'after': None,
 'afterwards': None,
 'again': None,
 'against': None,
 'ah': None,
 "ain't": None,
 'all': None,
 'allow': None,
 'allows': None,
 'almost': None,
 'alone': None,
 'along': None,
 'already': None,
 'also': None,
 'although': None,
 'always': None,
 'am': None,
 'among': None,
 'amongst': None,
 'an': None,
 'and': None,
 'announce': None,
 'another': None,
 'any': None,
 'anybody': None,
 'anyhow': None,
 'anymore': None,
 'anyone': None,
 'anything': None,
 'anyway': None,
 'anyways': None,
 'anywhere': None,
 'apart': None,
 'apparently': None,
 'ap

In [27]:
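words_nostop = [w for w in words if w not in stopwords]
# NLTK alternative (needs nltk.corpus.stopwords, which the custom dict above shadows):
# words_nostop = [w for w in words if w not in stopwords.words('english')]
display(words_nostop, 'Data with stop words removed')

Data with stop words removed

----- divider -----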

['classic', 'war', 'worlds', 'timothy', 'hines', 'entertaining', 'film', 'effort', 'lengths', 'faithfully', 'recreate', 'classic', 'book', 'hines', 'succeeds', 'watched', 'film', 'appreciated', 'standard', 'predictable', 'hollywood', 'fare', 'spielberg', 'version', 'tom', 'cruise', 'slightest', 'resemblance', 'book', 'movie', 'envision', 'amateur', 'critics', 'criticize', 'rate', 'movie', 'bases', 'entertained', 'people', 'agree', 'critics', 'enjoyed', 'effort', 'hines', 'faithful', 'classic', 'entertaining', 'easy', 'overlook', 'critics', 'perceive', 'shortcomings']


In [37]:
# eng_stopwords = set(stopwords.words('english')) # nltk.corpus.stopwords
eng_stopwords = set(stopwords)

def clean_text(text):
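    # Strip HTML markup
    text = BeautifulSoup(text, 'html.parser').get_text()
    # Replace non-letter characters (punctuation, digits) with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Lowercase and tokenize
    words = text.lower().split()
    # Remove stop words
    words = [w for w in words if w not in eng_stopwords]
    # Rejoin the tokens into a new sentence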
    return ' '.join(words)

In [38]:
clean_text(raw_example)

'classic war worlds timothy hines entertaining film effort lengths faithfully recreate classic book hines succeeds watched film appreciated standard predictable hollywood fare spielberg version tom cruise slightest resemblance book movie envision amateur critics criticize rate movie bases entertained people agree critics enjoyed effort hines faithful classic entertaining easy overlook critics perceive shortcomings'

### Add the cleaned text to the DataFrame

In [39]:
df['clean_review'] = df.review.apply(clean_text)
df.head()

Unnamed: 0,id,sentiment,review,clean_review
0,5814_8,1,With all this stuff going down at the moment w...,stuff moment mj ve started listening music wat...
1,2381_9,1,"""The Classic War of the Worlds"" by Timothy Hin...",classic war worlds timothy hines entertaining ...
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,film starts manager nicholas bell investors ro...
3,3630_4,0,It must be assumed that those who praised this...,assumed praised film filmed opera didn read do...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy wondrously unpretentious explo...


In [36]:
df['clean_review'][0]

"ith stuff moment mj started listening music, watching odd documentary there, watched wiz watched moonwalker again. insight guy cool eighties mind guilty innocent. moonwalker biography, feature film remember cinema originally released. subtle messages mj's feeling press obvious message drugs bad m'kay.visually impressive michael jackson remotely mj hate boring. call mj egotist consenting movie mj fans fans true nice him.the actual feature film bit finally starts 20 minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord. mj dead bad me. mj overheard plans? nah, joe pesci's character ranted people supplying drugs dunno, hates mj's music.lots cool mj car robot speed demon sequence. also, director patience saint filming kiddy bad sequence directors hate kid bunch performing complex dance scene.bottom line, movie people mj level (which people). not, stay away. wholesome message ironically mj's bestest buddy movie girl! michael jackson talented people

### Extract bag-of-words features (with sklearn's CountVectorizer)

In [40]:
vectorizer = CountVectorizer(max_features=5000)
train_data_features = vectorizer.fit_transform(df.clean_review).toarray()
train_data_features.shape

(25000, 5000)
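
To make the (25000, 5000) shape concrete, here is a simplified, plain-Python view of what a bag-of-words matrix contains: one row per document, one column per vocabulary word, each cell a count. It ignores CountVectorizer's own tokenization rules, and `bag_of_words` is a hypothetical helper, not part of sklearn.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    # Count how often each token occurs across the whole corpus.
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Keep the most frequent tokens, then sort them alphabetically for a stable column order.
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary token.
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

vocab, rows = bag_of_words(["good film good cast", "bad film cast"], max_features=3)
print(vocab)
print(rows)
```

With `max_features=3` the rarest token (`bad`) is dropped, which mirrors how `max_features=5000` caps the vocabulary in the cell above.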

### Train the classifier

In [48]:
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, df.sentiment)

#### Predict on the training set to see how well it fits

In [49]:
confusion_matrix(df.sentiment, forest.predict(train_data_features))

array([[12500,     0],
       [    0, 12500]])
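
A perfect confusion matrix on the training set mostly shows that the forest has memorized its training data; it says little about unseen reviews. A common remedy is to hold out part of the labeled data for validation (in sklearn, `train_test_split` does this). The sketch below shows the idea in plain Python; `holdout_split` and `confusion_counts` are hypothetical helpers written for this illustration.

```python
import random

def holdout_split(n_rows, valid_fraction=0.2, seed=0):
    # Shuffle row indices reproducibly, then cut off a validation slice.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_valid = int(n_rows * valid_fraction)
    return indices[n_valid:], indices[:n_valid]

def confusion_counts(y_true, y_pred):
    # Rows are true labels (0, 1); columns are predicted labels (0, 1),
    # the same layout sklearn's confusion_matrix uses.
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

train_idx, valid_idx = holdout_split(10)
print(len(train_idx), len(valid_idx))
print(confusion_counts([0, 0, 1, 1], [0, 1, 1, 1]))
```

Fitting on the `train_idx` rows and computing the confusion matrix on the `valid_idx` rows would give a more honest estimate than the matrix above.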

#### Delete variables that are no longer needed to free memory

In [54]:
del df
del train_data_features

### Read the test data and predict

In [55]:
test_file_path = os.path.join(BASE_PATH, 'data/testData.tsv')

df = pd.read_csv(test_file_path, sep='\t', escapechar='\\')

print('Number of reviews: {}'.format(len(df)))


Number of reviews: 25000


#### Apply clean_text to the test set so it is cleaned the same way

In [56]:
df['clean_review'] = df.review.apply(clean_text)

In [57]:
df.head()

Unnamed: 0,id,review,clean_review
0,12311_10,Naturally in a film who's main themes are of m...,naturally film main themes mortality nostalgia...
1,8348_2,This movie is a disaster within a disaster fil...,movie disaster disaster film action scenes mea...
2,5828_4,"All in all, this is a movie for kids. We saw i...",movie kids tonight child loved kid excitement ...
3,7186_2,Afraid of the Dark left me with the impression...,afraid dark left impression screenplays writte...
4,12128_7,A very accurate depiction of small time mob li...,accurate depiction time mob life filmed jersey...


#### Encode clean_review on the test set with the vectorizer fitted on the training set (its 5000 most frequent words)

In [60]:
test_data_features = vectorizer.transform(df.clean_review).toarray()
test_data_features.shape

(25000, 5000)

In [63]:
result = forest.predict(test_data_features)
output = pd.DataFrame({'id': df.id, 'sentiment': result})

In [64]:
output.head()

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,1
4,12128_7,1


In [65]:
output.to_csv(os.path.join(BASE_PATH, 'data/Bag_of_Words_model_submission.tsv'), index=False)


In [66]:
del df
del test_data_features

----

<h2><a name='word2vec'>Training word vectors with word2vec</a></h2>

In [68]:
import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

import nltk.data
# from nltk.corpus import stopwords

from gensim.models.word2vec import Word2Vec

In [72]:
def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    
    data_file = os.path.join(BASE_PATH, 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    
    print('Number of reviews: {} '.format(len(df)))
    
    return df

### Read the unlabeled_train data
It is used to train the word2vec word vectors.

In [73]:
df = load_dataset('unlabeled_train')
df.head()

Number of reviews: 50000 


Unnamed: 0,id,review
0,9999_0,"Watching Time Chasers, it obvious that it was ..."
1,45057_0,I saw this film about 20 years ago and remembe...
2,15561_0,"Minor Spoilers<br /><br />In New York, Joan Ba..."
3,7161_0,I went to see this film with a great deal of e...
4,43971_0,"Yes, I agree with everyone on this site this m..."


### Preprocess the reviews as above
One small difference: stop-word removal is now optional, so the caller can choose whether to drop stop words.

In [84]:
# eng_stopwords = set(stopwords.words('english'))
with open('./stopwords.txt') as f:
    eng_stopwords = dict.fromkeys(line.strip() for line in f)

def clean_text(text, remove_stopwords=False):
    '''Preprocess text: strip HTML, keep letters only, lowercase, tokenize.'''
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words

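#### Load the English sentence tokenizer from nltk.data
tokenizers/punkt/ contains many pretrained models. They only split text into sentences, not into words.
English text separates words with spaces, and split() splits on whitespace by default, so splitting a sentence into words needs no model.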

In [86]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [85]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
data = 'Sadly, more downs than ups. The plot was pretty descent.'
for row in tokenizer.tokenize(data):
    print(row)

Sadly, more downs than ups.
The plot was pretty descent.


In [87]:
def print_call_counts(f):
    '''Decorator that prints a progress message on calls 1, 1001, 2001, ...'''
    n = 0 
    def wrapped(*args, **kwargs):
        nonlocal n
        n += 1
        if n % 1000 == 1:
            print('method {} called {} times'.format(f.__name__, n))
        return f(*args, **kwargs)
    return wrapped
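
A slightly more general variant of this decorator, offered as a sketch rather than a replacement: it takes the reporting interval as a parameter and uses `functools.wraps` so the wrapped function keeps its own name and docstring. The names `count_calls` and `double` are hypothetical.

```python
import functools

def count_calls(every=1000):
    def decorator(f):
        calls = 0

        @functools.wraps(f)  # keep f's name and docstring on the wrapper
        def wrapped(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % every == 1:
                print('function {} called {} times'.format(f.__name__, calls))
            return f(*args, **kwargs)
        return wrapped
    return decorator

@count_calls(every=2)
def double(x):
    return 2 * x

results = [double(i) for i in range(3)]
print(results)
```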

In [82]:
@print_call_counts
def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = [clean_text(s) for s in raw_sentences if s]
    return sentences

In [83]:
%time sentences = sum(df.review.apply(split_sentences), [])
print("{} reviews -> {} sentences".format(len(df), len(sentences)))

method split_sentences called 1 times


method split_sentences called 1001 times
method split_sentences called 2001 times


method split_sentences called 3001 times
method split_sentences called 4001 times
method split_sentences called 5001 times
method split_sentences called 6001 times
method split_sentences called 7001 times
method split_sentences called 8001 times
method split_sentences called 9001 times
method split_sentences called 10001 times
method split_sentences called 11001 times
method split_sentences called 12001 times
method split_sentences called 13001 times
method split_sentences called 14001 times
method split_sentences called 15001 times
method split_sentences called 16001 times
method split_sentences called 17001 times
method split_sentences called 18001 times
method split_sentences called 19001 times
method split_sentences called 20001 times
method split_sentences called 21001 times


method split_sentences called 22001 times
method split_sentences called 23001 times
method split_sentences called 24001 times
method split_sentences called 25001 times
method split_sentences called 26001 times
method split_sentences called 27001 times
method split_sentences called 28001 times
method split_sentences called 29001 times
method split_sentences called 30001 times
method split_sentences called 31001 times
method split_sentences called 32001 times
method split_sentences called 33001 times
method split_sentences called 34001 times
method split_sentences called 35001 times


method split_sentences called 36001 times
method split_sentences called 37001 times
method split_sentences called 38001 times
method split_sentences called 39001 times
method split_sentences called 40001 times
method split_sentences called 41001 times
method split_sentences called 42001 times
method split_sentences called 43001 times
method split_sentences called 44001 times
method split_sentences called 45001 times
method split_sentences called 46001 times
method split_sentences called 47001 times
method split_sentences called 48001 times


method split_sentences called 49001 times
CPU times: user 6min 20s, sys: 15 s, total: 6min 36s
Wall time: 6min 38s
50000 reviews -> 537851 sentences


The nltk.data.load('tokenizers/punkt/english.pickle') sentence tokenizer split the 50,000 reviews into 537,851 sentences.
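
Part of the run time above likely comes from flattening with `sum(..., [])`, which builds a new list at every step and is therefore quadratic in the total number of sentences. A linear alternative is `itertools.chain.from_iterable`. The sketch below uses made-up data; `per_review` is a hypothetical stand-in for the per-review sentence lists.

```python
from itertools import chain

per_review = [
    [["sadly", "more", "downs"], ["the", "plot"]],  # review 1: two sentences
    [["great", "film"]],                            # review 2: one sentence
]

# chain.from_iterable walks each inner list once, so the flattening is linear.
sentences = list(chain.from_iterable(per_review))
print(len(sentences))
```

In the notebook the same idea would read `list(chain.from_iterable(df.review.apply(split_sentences)))`.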

### Train a word embedding model with gensim

In [88]:
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

In [89]:
# Word-vector training parameters
num_features = 300   # Word vector dimensionality
min_word_count = 40  # Minimum word count
num_workers = 4      # Number of threads to run in parallel
context = 10         # Context window size
downsampling = 1e-3  # Subsampling threshold for frequent words

model_name = "{}features_{}minwords_{}context.model".format(num_features, min_word_count, context)
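
The `sample` threshold controls how aggressively very frequent words are randomly dropped during training. As a rough sketch, assuming the keep-probability formula from the original word2vec C code, a word's chance of being kept can be computed as follows; `keep_probability` is a hypothetical helper and the counts are made up.

```python
import math

def keep_probability(word_count, total_words, sample=1e-3):
    # Fraction of the corpus taken up by this word.
    freq = word_count / total_words
    # Formula from the original word2vec C code; values above 1 mean "always keep".
    prob = (math.sqrt(freq / sample) + 1) * (sample / freq)
    return min(prob, 1.0)

print(keep_probability(50_000, 1_000_000))  # a very frequent word is often dropped
print(keep_probability(100, 1_000_000))     # a rare word is always kept
```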

In [90]:
print('Training model...')

Training model...


In [92]:
# gensim < 4.0 API; from gensim 4.0 on, the `size` argument is named `vector_size`.
model = Word2Vec(sentences,
                 workers=num_workers,
                 size=num_features,
                 min_count=min_word_count,
                 window=context,
                 sample=downsampling)

2018-09-11 13:45:23,789: INFO: collecting all words and their counts
2018-09-11 13:45:23,790: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-09-11 13:45:23,843: INFO: PROGRESS: at sentence #10000, processed 225072 words, keeping 17237 word types
2018-09-11 13:45:23,905: INFO: PROGRESS: at sentence #20000, processed 443536 words, keeping 24570 word types
2018-09-11 13:45:23,952: INFO: PROGRESS: at sentence #30000, processed 666343 words, keeping 29785 word types
2018-09-11 13:45:23,998: INFO: PROGRESS: at sentence #40000, processed 886903 words, keeping 33939 word types
2018-09-11 13:45:24,046: INFO: PROGRESS: at sentence #50000, processed 1103863 words, keeping 37503 word types
2018-09-11 13:45:24,111: INFO: PROGRESS: at sentence #60000, processed 1327231 words, keeping 40738 word types
2018-09-11 13:45:24,171: INFO: PROGRESS: at sentence #70000, processed 1550828 words, keeping 43603 word types
2018-09-11 13:45:24,226: INFO: PROGRESS: at sentence #80000, 

2018-09-11 13:45:35,408: INFO: worker thread finished; awaiting finish of 0 more threads
2018-09-11 13:45:35,408: INFO: EPOCH - 1 : training on 11877522 raw words (8395105 effective words) took 8.7s, 966293 effective words/s
2018-09-11 13:45:36,423: INFO: EPOCH 2 - PROGRESS: at 11.24% examples, 939019 words/s, in_qsize 7, out_qsize 0
2018-09-11 13:45:37,428: INFO: EPOCH 2 - PROGRESS: at 21.47% examples, 897803 words/s, in_qsize 7, out_qsize 0
2018-09-11 13:45:38,432: INFO: EPOCH 2 - PROGRESS: at 32.73% examples, 912309 words/s, in_qsize 7, out_qsize 0
2018-09-11 13:45:39,432: INFO: EPOCH 2 - PROGRESS: at 43.95% examples, 920451 words/s, in_qsize 7, out_qsize 0
2018-09-11 13:45:40,446: INFO: EPOCH 2 - PROGRESS: at 55.05% examples, 919892 words/s, in_qsize 7, out_qsize 0
2018-09-11 13:45:41,457: INFO: EPOCH 2 - PROGRESS: at 66.67% examples, 927231 words/s, in_qsize 7, out_qsize 0
2018-09-11 13:45:42,457: INFO: EPOCH 2 - PROGRESS: at 79.08% examples, 942878 words/s, in_qsize 7, out_qsize 

In [93]:
# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

2018-09-11 14:26:05,461: INFO: precomputing L2-norms of word weight vectors


In [95]:
# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model.save(os.path.join(BASE_PATH, 'data', model_name))

2018-09-11 14:28:05,235: INFO: saving Word2Vec object under /Users/zoe/Documents/GitHub/July-NLP/Lec 09 Word2Vec/Word2Vec应用案例/Kaggle竞赛实例/kaggle_movie_sentiment/data/300features_40minwords_10context.model, separately None
2018-09-11 14:28:05,236: INFO: not storing attribute vectors_norm
2018-09-11 14:28:05,237: INFO: not storing attribute cum_table
2018-09-11 14:28:05,604: INFO: saved /Users/zoe/Documents/GitHub/July-NLP/Lec 09 Word2Vec/Word2Vec应用案例/Kaggle竞赛实例/kaggle_movie_sentiment/data/300features_40minwords_10context.model


### Check the quality of the trained word vectors

In [99]:
model.wv.doesnt_match('man woman child kitchen'.split())

'kitchen'

In [100]:
model.wv.most_similar('man')

[('woman', 0.6461790800094604),
 ('lad', 0.6023533344268799),
 ('lady', 0.598419189453125),
 ('chap', 0.5527979731559753),
 ('soldier', 0.5452903509140015),
 ('guy', 0.5335018634796143),
 ('person', 0.5306960940361023),
 ('monk', 0.5164638757705688),
 ('boy', 0.5161195993423462),
 ('millionaire', 0.4999878406524658)]

In [102]:
model.wv.most_similar('queen')

[('princess', 0.6499490737915039),
 ('bride', 0.6323518753051758),
 ('maid', 0.6304798722267151),
 ('angela', 0.6170095205307007),
 ('feisty', 0.6112707853317261),
 ('mistress', 0.6094938516616821),
 ('temple', 0.6072278618812561),
 ('belle', 0.6070975065231323),
 ('nurse', 0.6022297143936157),
 ('marlene', 0.6006361842155457)]

In [103]:
model.wv.most_similar('awful')

[('terrible', 0.7814360857009888),
 ('horrible', 0.7440706491470337),
 ('atrocious', 0.7384001016616821),
 ('abysmal', 0.6928448677062988),
 ('horrid', 0.6792713403701782),
 ('dreadful', 0.6701364517211914),
 ('embarrassing', 0.6612033247947693),
 ('horrendous', 0.6463928818702698),
 ('appalling', 0.6420878171920776),
 ('lousy', 0.6373677253723145)]
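
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking on made-up 2-d vectors (the `toy_vectors` values are invented for illustration, and this `most_similar` is a hypothetical plain-Python helper, not the gensim method):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    # Rank every other word by cosine similarity to the query word.
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-d vectors, purely for illustration.
toy_vectors = {
    "awful": [1.0, 0.1],
    "terrible": [0.9, 0.2],
    "great": [-1.0, 0.3],
}
print(most_similar("awful", toy_vectors, topn=1))
```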

----

<h2><a name='sentiment'>Training a sentiment analysis model on word2vec</a></h2>

In [107]:
import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

# from nltk.corpus import stopwords

from gensim.models.word2vec import Word2Vec

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans

### Load the datasets

In [108]:
def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    data_file = os.path.join(BASE_PATH, 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of reviews: {}'.format(len(df)))
    return df

In [110]:
# eng_stopwords = set(stopwords.words('english'))
with open('./stopwords.txt') as f:
    eng_stopwords = dict.fromkeys(line.strip() for line in f)

def clean_text(text, remove_stopwords=False):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words

### Load the word2vec model trained above
File name: 300features_40minwords_10context.model

In [111]:
model_name = '300features_40minwords_10context.model'
model = Word2Vec.load(os.path.join(BASE_PATH, 'data', model_name))

2018-09-11 14:37:45,316: INFO: loading Word2Vec object from /Users/zoe/Documents/GitHub/July-NLP/Lec 09 Word2Vec/Word2Vec应用案例/Kaggle竞赛实例/kaggle_movie_sentiment/data/300features_40minwords_10context.model
2018-09-11 14:37:45,604: INFO: loading wv recursively from /Users/zoe/Documents/GitHub/July-NLP/Lec 09 Word2Vec/Word2Vec应用案例/Kaggle竞赛实例/kaggle_movie_sentiment/data/300features_40minwords_10context.model.wv.* with mmap=None
2018-09-11 14:37:45,605: INFO: setting ignored attribute vectors_norm to None
2018-09-11 14:37:45,606: INFO: loading vocabulary recursively from /Users/zoe/Documents/GitHub/July-NLP/Lec 09 Word2Vec/Word2Vec应用案例/Kaggle竞赛实例/kaggle_movie_sentiment/data/300features_40minwords_10context.model.vocabulary.* with mmap=None
2018-09-11 14:37:45,606: INFO: loading trainables recursively from /Users/zoe/Documents/GitHub/July-NLP/Lec 09 Word2Vec/Word2Vec应用案例/Kaggle竞赛实例/kaggle_movie_sentiment/data/300features_40minwords_10context.model.trainables.* with mmap=None
2018-09-11 14:37:

### Encode the review text with the word2vec vectors
The encoding is rather crude: simply average the word vectors of all the words in the review.
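
A minimal plain-Python sketch of this averaging, using made-up 2-d vectors. It also shows one possible way to handle a review with no in-vocabulary words (a zero vector), which is an assumption of this sketch rather than something the notebook does; `average_vector` and `toy_vectors` are hypothetical names.

```python
def average_vector(words, vectors, dim):
    # Keep only words that have a vector (out-of-vocabulary words are skipped).
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known word: fall back to a zero vector instead of dividing by zero.
        return [0.0] * dim
    # Average each dimension across the known word vectors.
    return [sum(column) / len(known) for column in zip(*known)]

toy_vectors = {"good": [1.0, 3.0], "film": [3.0, 5.0]}
print(average_vector(["good", "film", "unseenword"], toy_vectors, dim=2))
print(average_vector(["unseenword"], toy_vectors, dim=2))
```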

In [113]:
df = load_dataset('labeled_train')
df.head()

Number of reviews: 25000


Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"""The Classic War of the Worlds"" by Timothy Hin..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [131]:
def to_review_vector(review):
    words = clean_text(review, remove_stopwords=True)
    array = np.array([model.wv[w] for w in words if w in model.wv])
    return pd.Series(array.mean(axis=0))

In [132]:
train_data_features = df.review.apply(to_review_vector)
train_data_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.038656,0.016665,0.012377,-0.006938,0.017928,0.028741,0.021159,0.002569,-0.00779,-0.005402,...,-0.000844,-0.019062,0.01302,0.011019,-0.005152,-0.017299,0.016229,0.001952,0.004006,0.026233
1,-0.027447,-0.003779,0.002941,0.012106,0.008971,-0.001523,0.007582,0.011109,0.011646,0.030603,...,-0.033001,-0.001475,0.019913,-0.003049,-0.008966,-0.014869,0.008585,-0.000171,-0.022355,0.030743
2,-0.040628,0.038908,0.015016,-0.005125,-0.004106,0.034407,0.03708,0.004509,0.006941,0.004752,...,-0.018228,-0.003175,-0.00299,0.009671,0.01236,0.016348,-0.013161,-0.012926,0.014982,0.043757
3,-0.04693,0.026483,0.015915,-0.001099,-0.003514,0.008214,0.010639,-0.015563,0.004354,-0.006209,...,-0.013935,0.021021,0.019471,0.005665,-0.008581,0.003367,-0.004734,0.003352,0.010746,0.027569
4,-0.025354,0.031252,0.016053,0.007296,0.012562,0.039233,0.024039,-0.01189,-0.004568,-0.009628,...,-0.016103,-0.001034,0.00729,0.012294,-0.0029,0.00587,0.001857,-0.010432,0.024471,0.037066


### Build a classifier with a random forest

In [133]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest = forest.fit(train_data_features, df.sentiment)

### Check the model on the training set

In [134]:
confusion_matrix(df.sentiment, forest.predict(train_data_features))

array([[12499,     1],
       [    0, 12500]])

### Free the variables that take up memory

In [135]:
del df
del train_data_features

### Predict on the test set and upload to Kaggle

In [136]:
df = load_dataset('test')
df.head()

Number of reviews: 25000


Unnamed: 0,id,review
0,12311_10,Naturally in a film who's main themes are of m...
1,8348_2,This movie is a disaster within a disaster fil...
2,5828_4,"All in all, this is a movie for kids. We saw i..."
3,7186_2,Afraid of the Dark left me with the impression...
4,12128_7,A very accurate depiction of small time mob li...


In [137]:
test_data_features = df.review.apply(to_review_vector)
test_data_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.046155,0.042196,0.015595,0.008609,0.021927,0.03979,0.034181,-0.014081,-0.026512,-0.047788,...,-0.025544,-0.006317,0.015436,0.00383,-0.003724,0.015556,-0.00546,-0.009434,0.014679,0.051093
1,0.012886,-0.004891,0.02242,-0.00054,-0.014649,0.011687,0.047358,-0.009926,0.005163,-0.018134,...,-0.004493,0.00202,0.019044,-0.009565,-0.023542,-0.029679,0.018448,-0.003557,0.004657,0.010662
2,-0.017965,0.014629,0.021074,0.000304,0.009187,0.013887,0.012884,-0.01992,-0.017748,-0.018572,...,0.002945,-0.008839,0.027904,-0.002652,-0.02208,-0.027755,0.034364,0.007655,-0.000173,0.017472
3,-0.03782,0.037639,-0.008661,0.004752,-0.009824,0.030751,0.040886,-0.007619,-0.003907,-0.019943,...,-0.018872,-0.005814,0.005971,-0.008786,-0.00032,-0.012749,0.000829,0.001278,0.017624,0.045917
4,-0.059465,0.032887,0.002643,-0.015895,0.001775,0.004698,0.027548,-0.007834,-0.007266,-0.013936,...,-0.011353,0.014154,0.001744,0.017066,0.013822,-0.003017,0.009888,-0.001988,-0.01228,0.03055


In [138]:
result = forest.predict(test_data_features)
output = pd.DataFrame({'id':df.id, 'sentiment':result})
output.to_csv(os.path.join(BASE_PATH,'data','Word2Vec_model.csv'), index=False)
output.head()

In [139]:
del df
del test_data_features
del forest

----

### Cluster the word vectors and use the clusters for encoding
Cluster with KMeans
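
For intuition, k-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below implements that loop in plain Python on four 2-d points with hand-picked starting centroids; sklearn's KMeans uses a smarter initialization and is what the notebook actually runs. `kmeans_sketch` is a hypothetical name.

```python
def kmeans_sketch(points, initial_centroids, n_iter=10):
    centroids = [list(c) for c in initial_centroids]  # copy so the input is not mutated
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans_sketch(points, [[0.0, 0.0], [10.0, 10.0]])
print(labels)
print(centroids)
```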

In [149]:
word_vectors = model.wv.vectors
num_clusters = word_vectors.shape[0] // 10  # about 10 words per cluster on average

print(num_clusters)

1305


In [150]:
%%time

# n_jobs was accepted by KMeans in older scikit-learn; it was removed in scikit-learn 1.0.
kmeans_clustering = KMeans(n_clusters=num_clusters, n_jobs=4)
idx = kmeans_clustering.fit_predict(word_vectors)


CPU times: user 1.51 s, sys: 246 ms, total: 1.75 s
Wall time: 1min 32s


In [153]:
# index2word is the gensim < 4.0 name; gensim 4.0 renamed it to index_to_key.
word_centroid_map = dict(zip(model.wv.index2word, idx))

In [157]:
import pickle 

filename = 'word_centroid_map_10avg.pickle'

with open(os.path.join(BASE_PATH, 'data', filename), 'wb') as f:
    pickle.dump(word_centroid_map, f)

# To reload the mapping later:
# with open(os.path.join(BASE_PATH, 'data', filename), 'rb') as f:
#     word_centroid_map = pickle.load(f)

### Print a few clusters to inspect them

In [158]:
for cluster in range(0,10):
    print('\nCluster %d' % cluster)
    print([w for w,c in word_centroid_map.items() if c == cluster])


Cluster 0
['positively']

Cluster 1
['prison', 'law', 'church', 'plans', 'court', 'jail', 'charge', 'hostage', 'pressure', 'orders', 'safety', 'authorities', 'arrest', 'protection', 'charges', 'papers', 'permission', 'instructions', 'lawyers', 'testify']

Cluster 2
['shu', 'qi']

Cluster 3
['bruno', 'wang', 'distinguished', 'pedro', 'wu', 'vega', 'philippe', 'milian', 'buckaroo', 'memorably', 'karyo', 'wei', 'eduardo', 'tang', 'tomas', 'berger']

Cluster 4
['remote', 'farm', 'treasure', 'paradise', 'stranded', 'deserted', 'shelter', 'boarding', 'tourist', 'ghostly', 'luxury', 'refuge', 'backwoods', 'desolate', 'housing', 'mining', 'vermont', 'lighthouse', 'manor', 'secluded', 'farmhouse', 'roam', 'hellgate', 'dilapidated', 'lobster', 'nursing']

Cluster 5
['culture', 'rules', 'lesson', 'cultural', 'myth', 'literature', 'versus', 'origin', 'mythology', 'traditions', 'milieu', 'myths', 'diversity', 'mysticism']

Cluster 6
['btw', 'whale', 'grudge', 'wraith', 'bigfoot']

Cluster 7
['ridi

### Convert the reviews into bag-of-clusters vectors

In [164]:
wordset = set(word_centroid_map.keys())

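def make_cluster_bag(review):
    words = clean_text(review, remove_stopwords=True)
    cluster_ids = pd.Series([word_centroid_map[w] for w in words if w in wordset])
    # Cluster ids run from 0 to num_clusters - 1, so range(num_clusters + 1)
    # adds one extra column (index num_clusters) that is always zero.
    return cluster_ids.value_counts().reindex(range(num_clusters + 1), fill_value=0)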
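
The same bag-of-clusters idea in plain Python, on a made-up word-to-cluster mapping: each review becomes a vector whose k-th entry counts the words that fall into cluster k. `cluster_bag` and `toy_map` are hypothetical names used only for this sketch.

```python
from collections import Counter

def cluster_bag(words, word_to_cluster, n_clusters):
    # Count how many of the review's words fall into each cluster.
    counts = Counter(word_to_cluster[w] for w in words if w in word_to_cluster)
    # Cluster ids run from 0 to n_clusters - 1, so the vector has n_clusters entries.
    return [counts[c] for c in range(n_clusters)]

toy_map = {"awful": 0, "terrible": 0, "film": 1, "movie": 1, "prison": 2}
print(cluster_bag(["awful", "film", "terrible", "unknown"], toy_map, n_clusters=3))
```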


In [166]:
df = load_dataset('labeled_train')
df.head()

Number of reviews: 25000


Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"""The Classic War of the Worlds"" by Timothy Hin..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [188]:
train_data_features = df.review.apply(make_cluster_bag)
train_data_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1296,1297,1298,1299,1300,1301,1302,1303,1304,1305
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,1,0,0


In [190]:
train_data_features.shape

(25000, 1306)

### Train a random forest again

In [191]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest = forest.fit(train_data_features, df.sentiment)

### Check the model on the training set

In [192]:
confusion_matrix(df.sentiment, forest.predict(train_data_features))

array([[12500,     0],
       [    0, 12500]])

### Delete the unused variables that take up memory

In [193]:
del df
del train_data_features

### Load the test data and predict

In [194]:
df = load_dataset('test')
df.head()

Number of reviews: 25000


Unnamed: 0,id,review
0,12311_10,Naturally in a film who's main themes are of m...
1,8348_2,This movie is a disaster within a disaster fil...
2,5828_4,"All in all, this is a movie for kids. We saw i..."
3,7186_2,Afraid of the Dark left me with the impression...
4,12128_7,A very accurate depiction of small time mob li...


In [195]:
test_data_features = df.review.apply(make_cluster_bag)
test_data_features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1296,1297,1298,1299,1300,1301,1302,1303,1304,1305
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [196]:
result = forest.predict(test_data_features)
output = pd.DataFrame({"id": df.id, 'sentiment': result})
output.to_csv(os.path.join(BASE_PATH, 'data', 'Word2Vec_BagOfClusters.csv'), index=False)
output.head()

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,1
4,12128_7,0


In [198]:
del df
del test_data_features
del forest