# Bag of Words Meets Bags of Popcorn
####  Use Google's Word2Vec for movie reviews
<a href='https://www.kaggle.com/c/word2vec-nlp-tutorial#description'>Link</a>

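# Part 1: Project Description
In this tutorial competition, we dig a little deeper into sentiment analysis. Google's Word2Vec is a deep-learning-inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural networks or deep neural networks, but is computationally more efficient. **This tutorial focuses on Word2Vec for sentiment analysis.**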

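Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which can be very misleading for both humans and computers. There is another Kaggle competition for movie review sentiment analysis. In this tutorial, we explore how Word2Vec can be applied to a similar problem.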

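Deep learning has been in the news a lot over the past few years, even making the front page of the New York Times. These machine learning techniques, inspired by the architecture of the human brain and made possible by recent advances in computing power, have been making waves through breakthrough results in image recognition, speech processing, and natural language tasks. Recently, deep learning approaches won several Kaggle competitions, including a drug discovery task and cat and dog image recognition.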

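## Tutorial Overview
This tutorial will help you get started with Word2Vec for natural language processing. It has two goals:

- Basic natural language processing: Part 1 of this tutorial is intended for beginners and covers the basic natural language processing techniques needed for the later parts of the tutorial.

- Deep learning for text understanding: in Parts 2 and 3, we delve into how to train a model using Word2Vec and how to use the resulting word vectors for sentiment analysis.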

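Since deep learning is a rapidly evolving field, a large amount of the work has not yet been published, or exists only as academic papers. Part 3 of the tutorial is more exploratory than prescriptive: we experiment with several ways of using Word2Vec rather than giving you a recipe for using the output.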

To achieve these goals, we rely on an IMDB sentiment analysis data set, which has 100,000 multi-paragraph movie reviews, both positive and negative.

## Evaluation
Submissions are judged on the area under the ROC curve (AUC). 
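
As a rough illustration of this metric, the sketch below computes the area under the ROC curve with scikit-learn's `roc_auc_score`. The labels and scores are made-up toy values, not competition data.

```python
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels (1 = positive review, 0 = negative review); illustrative only
y_true = [0, 0, 1, 1]
# Toy predicted scores; a higher score means "more likely positive"
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs ranked in the right order
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, so the score is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.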

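### Submission Instructions
You should submit a comma-separated file with 25,000 rows plus a header row.   
There should be 2 columns: "id" and "sentiment", which contain your binary predictions: 1 for positive reviews, 0 for negative reviews. For an example, see "sampleSubmission.csv" on the Data page. 
> id,sentiment   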
123_45,0    
678_90,1   
12_34,0   
...   
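
A file in this layout can be produced with pandas. The following is a minimal sketch: the ids and predictions are hypothetical, and the column names `id` and `sentiment` are assumed to match "sampleSubmission.csv", so check that file before submitting.

```python
import io
import pandas as pd

# Hypothetical ids and predictions, mirroring the sample rows above
submission = pd.DataFrame({
    "id": ["123_45", "678_90", "12_34"],
    "sentiment": [0, 1, 0],
})

# Write to an in-memory buffer here; pass a file path instead to write to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False omits the row-number column
csv_text = buffer.getvalue()
print(csv_text)
```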

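# Part 2: Introduction to NLP
## 1. What Is NLP?
NLP (natural language processing) is a set of techniques for approaching text problems.  
This page will help you get started with loading and cleaning the IMDB movie reviews, then applying a simple Bag of Words model to get surprisingly accurate predictions of whether a review is positive or negative.  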

## 2. Bag of Words Model
- Dataset: labeledTrainData.tsv
- Contents: 25,000 IMDB movie reviews, each with a label {positive, negative}

### 2.1 加载数据

In [7]:
import pandas as pd
train = pd.read_csv('./Data/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
print(train.shape)
print(train.columns.values)
train.head(3)

(25000, 3)
['id' 'sentiment' 'review']


| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |


Here, "header=0" indicates that the first line of the file contains column names, "delimiter='\t'" indicates that the fields are separated by tabs, and "quoting=3" (the value of csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters. Without it, you may encounter errors when reading the file.
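
The effect of the quoting option can be seen on a tiny in-memory sample. This is only a sketch: the row below is made up to imitate the layout of labeledTrainData.tsv.

```python
import csv
import io
import pandas as pd

# A made-up tab-separated row with quote characters inside the review text
sample = 'id\tsentiment\treview\n"1_1"\t1\t"He said "great" twice."\n'

df = pd.read_csv(io.StringIO(sample), header=0, delimiter='\t',
                 quoting=csv.QUOTE_NONE)  # csv.QUOTE_NONE has the value 3

print(df.shape)
print(df['review'][0])  # the quote characters are kept as part of the text
```

Because quotes are not interpreted, they stay in the loaded strings, which is why the ids in the table above still carry their surrounding quotes.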



In [8]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

There are HTML tags such as "<br />", abbreviations, and punctuation: all common issues when processing text from online sources. Take some time to look through other reviews in the training set while you're at it. The next section deals with how to tidy up the text for machine learning.

### 2.2 数据清洗和文本预处理

In [15]:
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object on a single movie review
example1 = BeautifulSoup(train['review'][0],'lxml')

# Print the length of the raw review and of the get_text() output, for comparison
print(len(train['review'][0]))
print(len(example1.get_text()))

2304
2256


Calling get_text() gives you the text of the review, without tags or markup. 
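
A minimal sketch of the same call on a short, made-up snippet is shown below. It uses Python's built-in "html.parser" instead of the "lxml" parser used in the notebook, purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# A short made-up snippet with the same kind of markup found in the reviews
raw = "Visually impressive.<br /><br />The plot is thin."

# "html.parser" ships with Python; the notebook itself uses "lxml"
text = BeautifulSoup(raw, "html.parser").get_text()
print(text)
```

Note that the tags are dropped without inserting whitespace, so the two sentences end up adjacent. `get_text()` also accepts a separator string if a space between fragments is preferred.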

In [17]:
# to remove punctuation and numbers 
import re

# Use regular expressions to do a find-and-replace
letters_only = re.sub('[^a-zA-Z]',           # the pattern to search for
                      ' ',                   # the replacement string
                      example1.get_text())   # the text to search
print(letters_only[:100])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching 


In [28]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words
words[:10]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with']
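
The regular-expression replacement and the lower-case/split steps can be tried together on a short string. The sample text here is made up for illustration.

```python
import re

sample = "i've seen 2 films!"  # made-up text for illustration

# Replace every character that is not an ASCII letter with a space
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# split() with no arguments collapses the resulting runs of spaces
tokens = letters_only.lower().split()
print(tokens)  # ['i', 've', 'seen', 'films']
```

The apostrophe becomes a space, which is why contractions such as "i've" turn into the two tokens "i" and "ve" in the output above.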

In [21]:
# Load the stop words
# (requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing)
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [22]:
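# Remove stop words from 'words'
# Build the stop-word set once; calling stopwords.words() for every word is slow
stops = set(stopwords.words('english'))
words = [w for w in words if w not in stops]
print(words[:100])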

['stuff', 'going', 'moment', 'mj', 've', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'm', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing']
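
The filtering step itself does not depend on NLTK. The sketch below uses a tiny hand-written stop list (a hypothetical stand-in for NLTK's much longer English list) to show the idea, and stores it in a set so that each membership test is fast.

```python
# A tiny hand-written stop list for illustration; NLTK's English list is much longer
stops = {"with", "all", "this", "down", "at", "the"}

words = ["with", "all", "this", "stuff", "going", "down", "at", "the", "moment"]

# Membership tests on a set take constant time on average,
# whereas a list has to be scanned element by element
meaningful = [w for w in words if w not in stops]
print(meaningful)  # ['stuff', 'going', 'moment']
```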


### 2.3 Porter Stemming and Lemmatizing
There are other things we could do to the data. For example, Porter Stemming and Lemmatizing (both available in NLTK) would allow us to treat "messages", "message", and "messaging" as the same word, which could certainly be useful. However, for simplicity, the tutorial will stop here.
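
Although the tutorial does not use it, a minimal sketch of Porter stemming with NLTK's `PorterStemmer` looks like this. It should map the three example words to one shared stem. The stem is a truncated form rather than a dictionary word.

```python
from nltk.stem import PorterStemmer

# Rule-based stemmer; it needs no additional NLTK data download
stemmer = PorterStemmer()

examples = ["messages", "message", "messaging"]
stems = [stemmer.stem(word) for word in examples]

for word, stem in zip(examples, stems):
    print(word, "->", stem)
```

Lemmatizing (for example with NLTK's WordNet lemmatizer) returns dictionary forms instead, but requires the WordNet corpus to be downloaded first.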

### 2.4 Putting the Preprocessing Steps Together

In [33]:
def review_to_words(raw_review):
    '''
    Convert a raw review to a string of words.
    The input is a single string (a raw movie review), and the output is a single
    string (a preprocessed movie review).
    '''
    # 1. Remove HTML Markup
    review_text = BeautifulSoup(raw_review, 'lxml').get_text()
    
    # 2. Remove non-letters
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    
    # 4. In Python, searching a set is much faster than searching a list, so convert
    # the stop words to a set
    stops = set(stopwords.words('english'))
    
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Join the words back into one string separated by space,
    # and return the result. 
    return " ".join(meaningful_words)

In [34]:
clean_review = review_to_words(train['review'][0])
print(clean_review)

stuff going moment mj ve started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad m kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate wo

In [39]:
# Loop through and clean all of the training set at once

# Get the number of reviews based on the dataframe column size
num_reviews = train['review'].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length of the movie review list
print('Cleaning and parsing the training set movie reviews ... \n')
for i in range(num_reviews):
    if ( i+1 ) % 5000 == 0:
        print('Review %d of %d \n'%(i+1, num_reviews))
    clean_train_reviews.append( review_to_words(train['review'][i]))
    
    

Cleaning and parsing the training set movie reviews ... 

Review 5000 of 25000 

Review 10000 of 25000 

Review 15000 of 25000 

Review 20000 of 25000 

Review 25000 of 25000 



In [43]:
len(clean_train_reviews)

25000
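
As an alternative to the explicit index loop, the cleaning function can be applied to the whole column with pandas' `Series.apply`. The sketch below is self-contained, so it uses `simple_clean`, a hypothetical simplified stand-in for `review_to_words` (no HTML or stop-word handling), and two made-up reviews.

```python
import re
import pandas as pd

def simple_clean(raw_review):
    """Simplified stand-in for review_to_words: letters only, lower-cased."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_review)
    return " ".join(letters_only.lower().split())

# Made-up reviews standing in for train['review']
reviews = pd.Series(["Great movie!!", "Boring... 2/10"])

# apply() calls the function once per review; tolist() gives a plain Python list
clean_reviews = reviews.apply(simple_clean).tolist()
print(clean_reviews)  # ['great movie', 'boring']
```

With the real data this would read `train['review'].apply(review_to_words).tolist()`, at the cost of losing the progress messages printed by the loop.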

### 2.5 Creating Features from a Bag of Words with scikit-learn
In the IMDB data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the 5000 most frequent words (remembering that stop words have already been removed).
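
As a toy-scale sketch of what limiting the vocabulary means, the example below uses scikit-learn's `CountVectorizer` with `max_features=2` on three made-up, already-cleaned reviews. The value 2 stands in for the 5000 mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned reviews standing in for clean_train_reviews
docs = ["movie great fun", "movie boring", "movie great plot"]

# max_features keeps only the most frequent terms across the corpus
vectorizer = CountVectorizer(analyzer="word", max_features=2)

# fit_transform learns the vocabulary and returns one count vector per review
features = vectorizer.fit_transform(docs).toarray()

print(sorted(vectorizer.vocabulary_))  # ['great', 'movie']
print(features.tolist())               # [[1, 1], [0, 1], [1, 1]]
```

Each row is one review and each column counts one kept word, so rarer words such as "fun", "boring", and "plot" are simply dropped from the feature vectors.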