# [English Sentiment Analysis Practice: English Preprocessing and TDM]


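### Jupyter Notebook shortcuts

- ctrl+enter: run the cell
- shift+enter: run the cell and move to the next cell
- alt+enter: run the cell and insert a new cell below
- a: insert a new cell above
- b: insert a new cell below
- dd: delete the cell (x: cut the cell)
- y: change the cell to Code
- m: change the cell to Markdown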

## 1. Importing modules

#### import 'package to load' as 'name used for that package in Python'

In [1]:
# Data preprocessing
import konlpy
from konlpy.tag import Hannanum, Kkma, Komoran

import nltk
from nltk.tokenize import word_tokenize
#nltk.download()
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Document representation 
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Document classifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Classifier measure
from sklearn.metrics import accuracy_score

import re
import sys
import os
import time
import random
import pandas as pd
import numpy as np

import pickle

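## 2. Loading the data: English Movie Review Data

#### Data structure
- Data: English movie review data
- Number of observations: 2,000
- Number of variables: 1 explanatory variable / 1 response variable

#### Explanatory variable (cause: a variable that can explain the predicted value)
- reviews: movie review text (Document)

#### Response variable (effect: the value to predict)
- sentiment: sentiment of the review (positive: 1 / negative: -1)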

In [2]:
train_data = pd.read_csv('data/english_train.csv',encoding='utf-8', engine='python', index_col=0)
test_data = pd.read_csv('data/english_test.csv', encoding='utf-8', engine='python',index_col=0)

## 3. Data composition

### - 3.1 Check the number of train and test records

In [3]:
print('DATA COLUMNS NAMES: {}'.format(list(train_data.columns)))
print('\n')
print('TRAIN DATA SHAPE: {}'.format(train_data.shape))
print('TEST  DATA SHAPE: {}'.format(test_data.shape))

DATA COLUMNS NAMES: ['reviews', 'sentiment']


TRAIN DATA SHAPE: (1800, 2)
TEST  DATA SHAPE: (200, 2)


### - 3.2 Merge the train/test data

In [4]:
whole_data = pd.concat([train_data, test_data]).reset_index(drop=True)

In [5]:
whole_data.head()  # .head(): show the first 5 rows

| | reviews | sentiment |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | " jaws " is a rare film that grabs your atten... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
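
The effect of `reset_index(drop=True)` in the merge step can be seen on a toy example. This is a minimal sketch: `train_toy` and `test_toy` are hypothetical stand-ins with made-up rows, not the real review data.

```python
import pandas as pd

# Toy stand-ins for train_data / test_data (hypothetical rows, not the real reviews).
train_toy = pd.DataFrame({"reviews": ["good film", "dull plot"], "sentiment": [1, -1]})
test_toy = pd.DataFrame({"reviews": ["great cast"], "sentiment": [1]})

# Without reset_index, the original row labels are kept, so label 0 appears twice.
stacked = pd.concat([train_toy, test_toy])
print(list(stacked.index))    # [0, 1, 0]

# reset_index(drop=True) replaces them with a fresh 0..n-1 index.
whole_toy = stacked.reset_index(drop=True)
print(list(whole_toy.index))  # [0, 1, 2]
```

Because the row order is preserved, the first `len(train_data)` rows of the merged frame are still the training rows.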


## 4. Text preprocessing

## 5. Tokenization preprocessing
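
The preprocessing cells are not shown in this notebook, so the following is only a rough sketch of what a cleaning and tokenization step can look like, not the code that produced `train_docs` and `test_docs`. The function `preprocess`, the stopword set `TOY_STOPWORDS`, and the `lemmatize` parameter are all hypothetical. A real run would more likely rely on the NLTK tools imported in section 1 (`word_tokenize`, `stopwords`, `WordNetLemmatizer`), which require downloaded NLTK data.

```python
import re

# Hypothetical, tiny stopword list for illustration only; a full list such as
# nltk.corpus.stopwords.words('english') would normally be used instead.
TOY_STOPWORDS = {"a", "an", "the", "is", "was", "of", "and", "to", "in", "that"}

def preprocess(text, stopwords=TOY_STOPWORDS, lemmatize=lambda token: token):
    """Lowercase, keep alphanumeric tokens, drop stopwords, then lemmatize.

    The default lemmatize function returns each token unchanged; a real
    lemmatizer (for example WordNetLemmatizer().lemmatize) could be passed in.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lemmatize(tok) for tok in tokens if tok not in stopwords]

print(preprocess("The ship was built to be the largest ship in the world."))
# ['ship', 'built', 'be', 'largest', 'ship', 'world']
```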

## 6. Exploring the tokenized data

## 7. Term-Document Matrix

In [6]:
vectorizer = CountVectorizer()

In [7]:
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
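
To make the default settings concrete, here is a small sketch on two made-up documents (`toy_docs` is hypothetical). With the defaults shown above, text is lowercased and only tokens of two or more word characters are counted. Note that scikit-learn returns one row per document and one column per term, which is the transpose of a term-document matrix in the strict sense.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good movie good plot", "bad movie"]  # hypothetical documents

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; columns follow sorted term order.
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(terms)                 # ['bad', 'good', 'movie', 'plot']
print(toy_matrix.toarray())  # rows: [0 2 1 1] and [1 0 1 0]
```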

### - 7.1 Split into train / test data (load the tokenized documents)

In [8]:
# Load the preprocessed, tokenized documents (pickled lists of tokens)
with open('data/train_docs.txt', 'rb') as f:
    train_docs = pickle.load(f)

with open('data/test_docs.txt', 'rb') as f:
    test_docs = pickle.load(f)

test_docs[0]

['1912',
 'ship',
 'set',
 'sail',
 'maiden',
 'voyage',
 'across',
 'atlantic',
 'america',
 'ship',
 'wa',
 'built',
 'largest',
 'ship',
 'world',
 'wa',
 'wa',
 'also',
 'build',
 'one',
 'luxurious',
 'wa',
 'finally',
 'wa',
 'built',
 'unsinkable',
 'unfortunately',
 'wa',
 'get',
 'ticket',
 'voyage',
 'either',
 'spent',
 'life',
 'saving',
 'get',
 'america',
 'start',
 'life',
 'anew',
 'part',
 'upper',
 'class',
 'money',
 'spare',
 'finally',
 'lucky',
 'enough',
 'full',
 'house',
 'poker',
 'match',
 'dock',
 'like',
 'jack',
 'dawson',
 'jack',
 'dawson',
 'make',
 'trip',
 'happens',
 'right',
 'place',
 'right',
 'time',
 'rose',
 'dewitt',
 'bukater',
 'first',
 'class',
 'passenger',
 'climb',
 'railing',
 'aft',
 'ship',
 'thought',
 'jumping',
 'thus',
 'started',
 'tale',
 'romance',
 'intrigue',
 'tale',
 'death',
 'tragedy',
 'movie',
 'tragic',
 'event',
 'took',
 'place',
 'great',
 'many',
 'year',
 'ago',
 'even',
 'taken',
 'lightly',
 'bit',
 'historical

### - 7.2 Join the tokens of train_docs

In [9]:
train_docs = [' '.join(doc) for doc in train_docs]

In [10]:
train_docs[0]

'film adapted comic book plenty success whether superheroes batman superman spawn geared toward kid casper arthouse crowd ghost world never really comic book like hell starter wa created alan moore eddie campbell brought medium whole new level mid 80 12part series called watchman say moore campbell thoroughly researched subject jack ripper would like saying michael jackson starting look little odd book graphic novel 500 page long includes nearly 30 consist nothing footnote word dismiss film source get past whole comic book thing might find another stumbling block hell director albert allen hughes getting hughes brother direct seems almost ludicrous casting carrot top well anything riddle better direct film set ghetto feature really violent street crime mad genius behind menace ii society ghetto question course whitechapel 1888 london east end filthy sooty place whore called unfortunate starting get little nervous mysterious psychopath ha carving profession surgical precision first stif

### - 7.3 Join the tokens of test_docs

In [11]:
test_docs = [' '.join(doc) for doc in test_docs]

In [12]:
test_docs[0]

'1912 ship set sail maiden voyage across atlantic america ship wa built largest ship world wa wa also build one luxurious wa finally wa built unsinkable unfortunately wa get ticket voyage either spent life saving get america start life anew part upper class money spare finally lucky enough full house poker match dock like jack dawson jack dawson make trip happens right place right time rose dewitt bukater first class passenger climb railing aft ship thought jumping thus started tale romance intrigue tale death tragedy movie tragic event took place great many year ago even taken lightly bit historical trivia movie titanic show happened maybe 100 degree accuracy still show realisticaly titanic story backdrop story serf admirably brining forth interesting story although simple simple premise captivating movie emotional simply alone enough story brought certain style make much emotional much effective movie forgotten quickly unfortunately something produced hollywood great frequency attent
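
The tokens are joined because `CountVectorizer` expects one string per document by default. As an aside, a possible alternative is to pass the token lists directly with a callable `analyzer`. The sketch below compares both options on made-up token lists (`toy_tokens` is hypothetical).

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tokens = [["ship", "set", "sail"], ["ship", "wa", "built"]]  # hypothetical token lists

# Option 1 (as in the notebook): join the tokens into one string per document.
joined = [" ".join(doc) for doc in toy_tokens]
counts_joined = CountVectorizer().fit_transform(joined)

# Option 2: keep the token lists and use a callable analyzer that
# returns each document's tokens unchanged.
counts_tokens = CountVectorizer(analyzer=lambda doc: doc).fit_transform(toy_tokens)

print(counts_joined.toarray().tolist())  # [[0, 1, 1, 1, 0], [1, 0, 0, 1, 1]]
print(counts_tokens.toarray().tolist())  # [[0, 1, 1, 1, 0], [1, 0, 0, 1, 1]]
```

The two options agree here, but they can differ on real data: the default token pattern drops single-character tokens, while a callable analyzer keeps every token it is given.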

### - 7.4 Build the Term-Document Matrix (train_x)

In [13]:
train_x = vectorizer.fit_transform(train_docs)

### - 7.5 Build the Term-Document Matrix (test_x)

In [14]:
test_x = vectorizer.transform(test_docs)
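
The difference between `fit_transform` on the training documents and `transform` on the test documents can be sketched with made-up data (`toy_train` and `toy_test` are hypothetical). The vocabulary is learned from the training set only, so the test matrix has the same columns and words unseen during training are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["great film", "boring film"]  # hypothetical training documents
toy_test = ["great soundtrack"]            # 'soundtrack' never appears in training

toy_vectorizer = CountVectorizer()
toy_train_x = toy_vectorizer.fit_transform(toy_train)  # learns vocabulary: boring, film, great
toy_test_x = toy_vectorizer.transform(toy_test)        # reuses it; unseen words are dropped

print(toy_train_x.shape, toy_test_x.shape)  # (2, 3) (1, 3)
print(toy_test_x.toarray())                 # [[0 0 1]]
```

Calling `fit_transform` on the test documents instead would build a different vocabulary, and the resulting columns would no longer match what the classifier was trained on.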

### - 7.6 Create train_y and test_y

In [15]:
train_y = whole_data.sentiment.values.tolist()[:len(train_data)]
test_y = whole_data.sentiment.values.tolist()[-len(test_data):]
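
The slicing above works because the merged frame keeps the training rows first and the test rows last. A minimal check on hypothetical toy frames (`toy_train`, `toy_test`) shows that it gives the same labels as reading the `sentiment` column from each frame directly. It assumes, as the notebook does, that the pickled documents are stored in the same row order as the CSV files.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_data / test_data.
toy_train = pd.DataFrame({"reviews": ["a", "b", "c"], "sentiment": [1, -1, 1]})
toy_test = pd.DataFrame({"reviews": ["d", "e"], "sentiment": [-1, 1]})
toy_whole = pd.concat([toy_train, toy_test]).reset_index(drop=True)

# Slicing the merged labels, as in the notebook.
train_y_sliced = toy_whole.sentiment.values.tolist()[:len(toy_train)]
test_y_sliced = toy_whole.sentiment.values.tolist()[-len(toy_test):]

# Reading the labels from each frame directly gives the same lists.
train_y_direct = toy_train["sentiment"].tolist()
test_y_direct = toy_test["sentiment"].tolist()

print(train_y_sliced == train_y_direct, test_y_sliced == test_y_direct)  # True True
```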

## 8. Building classification models and measuring classification performance

### - 8.1 Build an SVM rbf model and predict

In [16]:
classifier_rbf = svm.SVC()
classifier_rbf.fit(train_x, train_y)
prediction_rbf = classifier_rbf.predict(test_x)

print(accuracy_score(test_y, prediction_rbf))



0.57


### - 8.2 Build an SVM linear model and predict

In [17]:
classifier_linear = svm.SVC(kernel='linear')
classifier_linear.fit(train_x, train_y)
prediction_linear = classifier_linear.predict(test_x)

print(accuracy_score(test_y, prediction_linear))

0.83


### - 8.3 Build a Random Forest model and predict

In [18]:
rf = RandomForestClassifier()
rf.fit(train_x, train_y)
predict_rf = rf.predict(test_x)

print(accuracy_score(test_y, predict_rf))

0.67
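
Random Forest training is randomized, so the accuracy can change slightly from run to run when no seed is set. A minimal sketch of fixing the seed with `random_state` follows. It uses synthetic data from `make_classification` rather than the movie reviews, and the helper `fit_and_predict`, the seed value 42, and `n_estimators=20` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the movie reviews).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def fit_and_predict(seed):
    """Train on the first 150 rows with a fixed seed and predict the rest."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X[:150], y[:150]).predict(X[150:])

# The same seed gives identical predictions on every run.
same = bool((fit_and_predict(42) == fit_and_predict(42)).all())
print(same)  # True
```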




### - 8.4 Build a Gradient Boosting classification model and predict

In [19]:
gb = GradientBoostingClassifier()
gb.fit(train_x, train_y)
predict_gb = gb.predict(test_x)

print(accuracy_score(test_y, predict_gb))

0.795


# Sentiment classification results
# Term-Document Matrix

#### - SVM rbf: 0.57
#### - SVM linear: 0.83
#### - Random Forest: 0.675
#### - Gradient Boosting: 0.795
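
A summary like the one above can also be collected programmatically rather than copied by hand. The following sketch loops over the same four classifier types and stores the accuracies in a DataFrame. It runs on synthetic data from `make_classification`, so its numbers are unrelated to the results reported here, and the variable names and `random_state=0` are assumptions for illustration.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (not the movie reviews): 150 train rows, 50 test rows.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

models = {
    "SVM rbf": svm.SVC(),
    "SVM linear": svm.SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracy = accuracy_score(y_te, model.predict(X_te))
    rows.append({"model": name, "accuracy": accuracy})

results = pd.DataFrame(rows)
print(results)
```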

