<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB影评得分估计" data-toc-modified-id="IMDB影评得分估计-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB Review Score Estimation</a></span><ul class="toc-item"><li><span><a href="#分析：模型建立" data-toc-modified-id="分析：模型建立-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Analysis: Building the Model</a></span></li><li><span><a href="#数据观察" data-toc-modified-id="数据观察-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Inspecting the Data</a></span></li><li><span><a href="#数据预处理" data-toc-modified-id="数据预处理-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Preprocessing</a></span></li></ul></li></ul></div>

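# IMDB Review Score Estimation

## Analysis: Building the Model

NB (Naive Bayes): each movie review is turned into a feature vector with the bag-of-words approach, using CountVectorizer and TfidfVectorizer.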
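
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the counting step. It is not scikit-learn's `CountVectorizer` (tokenization and options differ); the helper `bag_of_words` and the toy reviews are made up for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for a list of tokenized documents."""
    # The vocabulary is every distinct token, in a fixed (sorted) order.
    vocabulary = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # One count per vocabulary entry; absent words count as 0.
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


# Two toy "reviews", already lowercased and split into words (made-up data).
toy_docs = [
    "great movie great cast".split(),
    "boring movie".split(),
]
vocabulary, vectors = bag_of_words(toy_docs)
print(vocabulary)
print(vectors)
```

Each review becomes a fixed-length vector of word counts, which is the kind of input a Naive Bayes classifier can be trained on.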

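> First train word vectors on the unlabeled review file, then use the average of the word vectors in each review as its feature vector to train a gradient boosted tree model.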
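
The averaging step can be sketched as follows. This is only an illustration under stated assumptions: `average_vector` is a hypothetical helper, and `toy_vectors` is a hand-written stand-in for embeddings that would really come from a trained word-vector model.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip words the embedding table does not know
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    if known == 0:
        return total  # all-zero vector when no token is known
    return [t / known for t in total]


# Hand-written 2-dimensional "embeddings" (assumption, not trained vectors).
toy_vectors = {
    "great": [1.0, 0.0],
    "movie": [0.0, 1.0],
}
features = average_vector(["great", "movie", "unknownword"], toy_vectors, dim=2)
print(features)  # [0.5, 0.5]
```

Every review, whatever its length, maps to one vector of the same dimension, which is what a tree-based model needs as input.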

## Inspecting the Data

`Quick facts`

`TSV` stands for tab-separated values: fields are separated by the tab character (`'\t'`).  
`CSV` stands for comma-separated values: fields are separated by the ASCII comma (`','`).

 
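> Note: in the standard TSV format registered with IANA, field values must not contain tab characters.

> Python's support for TSV files:  
> Strictly speaking, Python's `csv` module could be called a DSV module, because it handles delimiter-separated values (DSV) files in general.  
> The `delimiter` parameter defaults to a comma, so a file is treated as CSV by default.  
> With `delimiter='\t'`, the file is parsed as TSV.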
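
As a small illustration of the `delimiter` parameter, the sketch below parses an in-memory TSV with the standard `csv` module. The rows are made up and only mimic the column layout of the review files.

```python
import csv
import io

# A tiny in-memory TSV with made-up rows. The comma inside the review text
# is harmless because only tabs separate the fields.
tsv_text = "id\tsentiment\treview\n1_8\t1\tGreat cast, great story\n2_3\t0\tBoring\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)
header, records = rows[0], rows[1:]
print(header)
print(records)
```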
 

In [1]:
import pandas as pd

In [3]:
train = pd.read_csv('../Datasets/IMDB/labeledTrainData.tsv', delimiter='\t')
test = pd.read_csv('../Datasets/IMDB/testData.tsv', delimiter='\t')

In [7]:
train.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [9]:
test.head()

Unnamed: 0,id,review
0,12311_10,Naturally in a film who's main themes are of m...
1,8348_2,This movie is a disaster within a disaster fil...
2,5828_4,"All in all, this is a movie for kids. We saw i..."
3,7186_2,Afraid of the Dark left me with the impression...
4,12128_7,A very accurate depiction of small time mob li...


In [10]:
train.shape

(25000, 3)

In [11]:
test.shape

(25000, 2)

## Data Preprocessing

In [13]:
from bs4 import BeautifulSoup
import re

from nltk.corpus import stopwords

In [51]:
def review_to_text(review, remove_stop_words=True):
    # Strip HTML markup such as <br />, keeping only the text.
    raw_text = BeautifulSoup(review, 'html.parser').get_text()
    # Replace every non-letter character with a space.
    letters_only = re.sub('[^a-zA-Z]', ' ', raw_text)
    # Lowercase and split on whitespace.
    words = letters_only.lower().split()
    if remove_stop_words:
        # Requires the NLTK stopwords corpus: nltk.download('stopwords').
        stop_words = set(stopwords.words('english'))
        words = [w for w in words if w not in stop_words]
    return words

In [52]:
review = '<br /><br />This movie is full of references. Like \"Mad Max II\", \"The wild one\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future.'

review_to_text(review, remove_stop_words=True)

['movie',
 'full',
 'references',
 'like',
 'mad',
 'max',
 'ii',
 'wild',
 'one',
 'many',
 'others',
 'ladybug',
 'face',
 'clear',
 'reference',
 'tribute',
 'peter',
 'lorre',
 'movie',
 'masterpiece',
 'talk',
 'much',
 'future']
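
The output above contains `'ladybug'` but no stray `'s'`. The regex step turns the accent in `ladybug´s` into a space, and the leftover fragment is then dropped as a stop word. A minimal sketch of that interaction, using a hypothetical mini stop-word set (`TOY_STOP_WORDS`) instead of the NLTK list:

```python
import re

# Hypothetical mini stop-word list for illustration only;
# NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "s", "it", "we", "ll"}


def clean_tokens(text, stop_words):
    # Every non-letter character (including the accent) becomes a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return [w for w in letters_only.lower().split() if w not in stop_words]


without_filter = clean_tokens("The ladybug´s face", set())
with_filter = clean_tokens("The ladybug´s face", TOY_STOP_WORDS)
print(without_filter)  # ['the', 'ladybug', 's', 'face']
print(with_filter)     # ['ladybug', 'face']
```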

In [46]:
a = set(stopwords.words('english'))
a

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [27]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\eh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [36]:
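import os
# os.walk yields (dirpath, dirnames, filenames); each file name is one language.
for dirpath, dirnames, filenames in os.walk(r'C:\Users\eh\AppData\Roaming\nltk_data\corpora\stopwords'):
    print(filenames)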

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'README', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
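
The list above is the third element of the tuple that `os.walk` yields for the directory. A minimal sketch on a temporary directory shows the tuple layout; the file names `english` and `french` are placeholders created only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create two empty placeholder files in the temporary directory.
    for name in ("english", "french"):
        open(os.path.join(root, name), "w").close()

    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory.
    walked = [(dirnames, sorted(filenames)) for _, dirnames, filenames in os.walk(root)]

print(walked)  # [([], ['english', 'french'])]
```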
