# Python Natural Language Processing through IMDB Movie Review Sentiment Analysis

> Reference: Inflearn - [NLP] Python Natural Language Processing through IMDB Movie Review Sentiment Analysis

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv('dataset/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('dataset/testData.tsv', header=0, delimiter='\t', quoting=3)
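
In `read_csv`, `quoting=3` corresponds to `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters rather than as field wrappers. The sketch below uses the standard-library `csv` module to show the difference; the two-line sample is made up for illustration, and `read_rows` is a hypothetical helper, not part of the course material.

```python
import csv
import io

# Made-up sample in the same tab-separated shape as the training file.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"Great film"\n'

def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting))

rows_none = read_rows(sample, csv.QUOTE_NONE)        # quotes kept as data
rows_minimal = read_rows(sample, csv.QUOTE_MINIMAL)  # quotes stripped

print(csv.QUOTE_NONE)    # 3
print(rows_none[1])      # ['"5814_8"', '1', '"Great film"']
print(rows_minimal[1])   # ['5814_8', '1', 'Great film']
```

This is why the `id` and `review` values in the loaded frames still carry their surrounding double quotes.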

In [3]:
train.shape

(25000, 3)

In [4]:
test.shape

(25000, 2)

In [5]:
train.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In [6]:
test.head()

|   | id | review |
|---|---|---|
| 0 | `"12311_10"` | `"Naturally in a film who's main themes are of ...` |
| 1 | `"8348_2"` | `"This movie is a disaster within a disaster fi...` |
| 2 | `"5828_4"` | `"All in all, this is a movie for kids. We saw ...` |
| 3 | `"7186_2"` | `"Afraid of the Dark left me with the impressio...` |
| 4 | `"12128_7"` | `"A very accurate depiction of small time mob l...` |


In [7]:
test.columns.values

array(['id', 'review'], dtype=object)

In [8]:
train['sentiment'].value_counts()
# The classes are balanced: 12,500 positive and 12,500 negative reviews

1    12500
0    12500
Name: sentiment, dtype: int64

In [9]:
train['review'][0][:700]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely lik'

## Data cleaning before natural language processing

In [10]:
!pip show BeautifulSoup4

Name: beautifulsoup4
Version: 4.6.3
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: leonardr@segfault.org
License: MIT
Location: /Users/ticonweb/anaconda3/lib/python3.6/site-packages
Requires: 
Required-by: conda-build


In [12]:
from bs4 import BeautifulSoup

example1 = BeautifulSoup(train['review'][0], "html5lib")

# Remove HTML tags

In [14]:
example1.get_text()

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2
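
`get_text()` keeps only the text nodes and drops the markup, which is why the two `<br />` tags vanish and `m'kay.` runs straight into `Visually`. As a rough illustration of that behaviour, here is a sketch that uses the standard-library `html.parser` instead of BeautifulSoup. `TextExtractor` and `strip_tags` are hypothetical names, and this is not how BeautifulSoup is implemented.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are skipped.
        self.parts.append(data)

def strip_tags(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

snippet = "bad m'kay.<br /><br />Visually impressive"
print(strip_tags(snippet))  # bad m'kay.Visually impressive
```

Because the text on either side of a tag is joined with nothing in between, words separated only by a tag end up glued together.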

In [15]:
import re

letters_only = re.sub('[^a-zA-Z]', ' ', example1.get_text())

# Replace every non-letter character (punctuation, digits) with a space

In [16]:
letters_only[:700]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyw'
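
The pattern `[^a-zA-Z]` matches any single character that is not an ASCII letter, so digits and apostrophes are replaced along with punctuation. A small sketch on a made-up sentence (`keep_letters` is a hypothetical wrapper around the same `re.sub` call) makes the side effects visible:

```python
import re

def keep_letters(text):
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub('[^a-zA-Z]', ' ', text)

sample = "i've seen 2 films!"
cleaned = keep_letters(sample)

print(repr(cleaned))    # 'i ve seen   films '
print(cleaned.split())  # ['i', 've', 'seen', 'films']
```

The contraction `i've` becomes two tokens and the number disappears entirely, which is acceptable for a bag-of-words approach but worth keeping in mind.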

In [17]:
# Convert everything to lowercase; otherwise words that differ only in case would be counted as different words

lower_case = letters_only.lower()

In [18]:
words = lower_case.split()

In [19]:
len(words)

437

In [20]:
words[:4]

['with', 'all', 'this', 'stuff']
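
To see why lowercasing comes before splitting and counting, the sketch below counts tokens in a made-up string with and without case folding. The sample text is an assumption for illustration only.

```python
from collections import Counter

tokens = "The film the FILM".split()

# Without lowercasing, every spelling variant is its own vocabulary entry.
raw_counts = Counter(tokens)

# After lowercasing, the variants collapse into two words.
lower_counts = Counter(t.lower() for t in tokens)

print(len(raw_counts))     # 4
print(len(lower_counts))   # 2
print(lower_counts['the'], lower_counts['film'])  # 2 2
```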

## Stopword removal

In [22]:
import nltk
from nltk.corpus import stopwords

In [23]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [25]:
stops = set(stopwords.words('english'))  # build the stopword set once
removedstopwords = [w for w in words if w not in stops]

In [26]:
len(removedstopwords)

219
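
Stopword filtering is a membership test per token, so holding the stopwords in a `set` keeps each lookup cheap when the same filter is applied to all 25,000 reviews. The sketch below is self-contained and does not need NLTK: `remove_stopwords` is a hypothetical helper, and `toy_stops` is a small hand-picked stand-in for `stopwords.words('english')`, not the real list.

```python
def remove_stopwords(words, stops):
    """Return the words that are not in the stopword collection."""
    stop_set = set(stops)  # set membership is O(1) on average
    return [w for w in words if w not in stop_set]

# Hand-picked stand-in for the NLTK English stopword list (an assumption).
toy_stops = ['with', 'all', 'this', 'at', 'the']
tokens = ['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment']

kept = remove_stopwords(tokens, toy_stops)
print(kept)  # ['stuff', 'going', 'down', 'moment']
```

With NLTK installed and the stopwords corpus downloaded, `remove_stopwords(words, stopwords.words('english'))` would be the equivalent call on the review tokens.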