## IMDB movie review sentiment analysis (binary classification)
- Kaggle: Bag of Words Meets Bags of Popcorn

#### 1. Data exploration

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t')
df.head(3)

   id      sentiment  review
0  5814_8  1          With all this stuff going down at the moment w...
1  2381_9  1          \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3  0          The film starts with a manager (Nicholas Bell)...


In [3]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)      # 3 = csv.QUOTE_NONE: keep quote characters as plain text
df.head(3)

   id        sentiment  review
0  "5814_8"  1          "With all this stuff going down at the moment ...
1  "2381_9"  1          "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"  0          "The film starts with a manager (Nicholas Bell...
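
The value `3` passed to `quoting` matches the standard library constant `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters. That is why the quotes stay visible in the second output, while the default parsing in the first output strips them and appears to mangle the escaped inner quotes in row 1. A minimal stdlib-only sketch of the difference follows; `sample` is a made-up line for illustration, not taken from the dataset.

```python
import csv
import io

# Hypothetical tab-separated sample with a quoted field that contains doubled quotes.
sample = 'id\treview\n"1_8"\t"A ""classic"" film"\n'

# Default quoting: quotes delimit the field and doubled quotes are unescaped.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE (== 3): quote characters are kept as ordinary text.
raw_rows = list(
    csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(csv.QUOTE_NONE)    # the integer constant behind quoting=3
print(default_rows[1])   # quotes removed by the parser
print(raw_rows[1])       # quotes preserved verbatim
```

With `QUOTE_NONE` the text is left untouched at load time, so the leftover quote characters have to be removed later, which the non-letter replacement in the preprocessing step takes care of.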


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [5]:
print(df.review[0][:1000])      # the text contains <br /> tags

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [6]:
# check for missing values
df.isna().sum().sum()

0

In [7]:
# check for duplicate reviews
df.review.nunique()

24904

In [12]:
# remove duplicate reviews
df.drop_duplicates(subset=['review'], inplace=True)
df.shape

(24904, 3)

#### 2. Text preprocessing

In [13]:
# replace <br /> tags with a space
df.review = df.review.str.replace('<br />', ' ')

In [14]:
# remove punctuation and digits: replace every non-letter character with a space
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True)
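
The two cleaning steps can also be wrapped in a single function so that exactly the same preprocessing is applied to new reviews at prediction time. The sketch below uses only the standard library; the name `clean_review` and the sample string are hypothetical, introduced here for illustration.

```python
import re


def clean_review(text: str) -> str:
    """Apply the two cleaning steps above to a single review string."""
    text = text.replace('<br />', ' ')       # step 1: HTML line breaks -> space
    return re.sub('[^A-Za-z]', ' ', text)    # step 2: every non-letter -> space


sample = 'Great film!<br /><br />10/10, would watch again.'
cleaned = clean_review(sample)
print(cleaned.split())
```

The order of the steps matters: if the non-letter replacement ran first, the letters `br` from each tag would survive as a spurious token.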

#### 3. Train/test split

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values, df.sentiment.values, stratify=df.sentiment.values,
    test_size=0.2, random_state=2023
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((19923,), (4981,), (19923,), (4981,))

#### 4. Text encoding

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')

In [17]:
# fit on the training set only, so that train and test end up with the same feature columns
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((19923, 66641), (4981, 66641))
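
Both matrices have 66,641 columns because the vocabulary is learned from the training reviews only and then reused for the test reviews. The following is a simplified stdlib sketch of that fit/transform idea, not scikit-learn's actual implementation; `fit_vocabulary`, `transform`, and the toy documents are hypothetical.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct token in the training documents to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Count tokens per document; tokens missing from vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


train_docs = ['good movie good plot', 'bad movie']
test_docs = ['good acting bad ending']

vocab = fit_vocabulary(train_docs)           # learned from training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)    # 'acting' and 'ending' are dropped

print(vocab)
print(train_matrix)
print(test_matrix)
```

Words that appear only in the test data have no column and are silently dropped, which keeps the column count identical across both matrices.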

#### 5. Training and evaluation

In [20]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023, max_iter=500)

In [21]:
%time lrc.fit(X_train_cv, y_train)

CPU times: total: 27 s
Wall time: 7.01 s


In [22]:
lrc.score(X_test_cv, y_test)

0.8813491266813893

#### 6. Bigram

In [23]:
cvect2 = CountVectorizer(stop_words='english', ngram_range=(1,2))
cvect2.fit(X_train)
X_train_cv2 = cvect2.transform(X_train)
X_test_cv2 = cvect2.transform(X_test)
X_train_cv2.shape, X_test_cv2.shape

((19923, 1454639), (4981, 1454639))
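
With `ngram_range=(1,2)` the feature count grows from 66,641 to 1,454,639, since every distinct pair of adjacent words gets its own column in addition to the single words. A minimal sketch of how word n-grams are enumerated is shown below; `word_ngrams` is a hypothetical helper, and scikit-learn's own tokenization and stop-word handling differ in detail.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams for n in the inclusive range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


tokens = ['plot', 'twist', 'ending']
print(word_ngrams(tokens, (1, 1)))   # unigrams only
print(word_ngrams(tokens, (1, 2)))   # unigrams followed by bigrams
```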

In [24]:
lrc2 = LogisticRegression(random_state=2023, max_iter=500)
%time lrc2.fit(X_train_cv2, y_train)

CPU times: total: 4min 32s
Wall time: 1min 13s


In [25]:
lrc2.score(X_test_cv2, y_test)

0.8958040554105601
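
To put the two runs side by side, the short calculation below uses only the numbers printed above. The timings come from this particular machine and will differ elsewhere; the variable names are illustrative.

```python
n_test = 4981                                # test-set size from the split above
acc_unigram, acc_bigram = 0.8813491266813893, 0.8958040554105601
feat_unigram, feat_bigram = 66641, 1454639
wall_unigram, wall_bigram = 7.01, 73.0       # seconds (1min 13s = 73 s)

# Convert accuracies back into counts of correctly classified test reviews.
correct_unigram = round(acc_unigram * n_test)
correct_bigram = round(acc_bigram * n_test)

print(f'accuracy gain: {(acc_bigram - acc_unigram) * 100:.2f} percentage points')
print(f'extra correct predictions: {correct_bigram - correct_unigram}')
print(f'feature growth: {feat_bigram / feat_unigram:.1f}x')
print(f'wall-time growth: {wall_bigram / wall_unigram:.1f}x')
```

In this run the bigram model gains roughly 1.4 percentage points of accuracy at the cost of about 22 times as many features and about 10 times the training wall time.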

#### 7. Saving and loading the vectorizer and model

In [26]:
import joblib

In [27]:
# save the vectorizer and the model
joblib.dump(cvect2, 'model/imdb_cvect2.pkl')
joblib.dump(lrc2, 'model/imdb_lrc2.pkl')

['model/imdb_lrc2.pkl']

In [28]:
# load the vectorizer and the model
new_cvect = joblib.load('model/imdb_cvect2.pkl')
new_lrc = joblib.load('model/imdb_lrc2.pkl')

#### 8. Validation on real reviews

In [35]:
# a positive review (10 stars) and a negative review (3 stars)
reviews = [
    '''Maleficent is magnificent. The story is sophisticated enough to delight adult audiences with a brilliant take on the beloved tale with a delightful twist including the meaning of true love. The characters are sympathetic and there is enough excitement.
The art direction and cinematography are beautiful. The fairy land scenes resemble a pre Raphaelite painting. The castle was a bit generic CGI. The right blend of human faces with CGI so it didn't look too animated. The director Stromberg who did Oz the Great and Powerful did an even better job here.
Angelina Jolie's expressive face is the perfect showcase for the character - it is the role of her lifetime. Like the way they did her cheekbones to make it like the Disney cartoon. Sam Riley as her sidekick morphs into many fairy tale creatures crow, dragon horse. The creatures are well done not awkward in movement and not overwhelming. Elle Fanning is sweet and picture perfect for the role of Aurora and Brenton Thwaites plays her prince. The fairies including Juno Temple and Imelda Staunton are cute too.
Liked this more than the Snow White movies 'Mirror Mirror' and 'Snow White and the Hunstman'. The first was fun but a bit silly and the second was too grim. Maleficent is the perfect blend of excitement and fairy tale. Most enjoyable film of the year.
''', '''This movie was somewhat of a disappointment. The one action scene in the entire movie was short lived and lame. This movie has so many story lines. The acting was not that great and the movie was corny. It felt like a cross between Lord of the rings, harry potter, and the hobbits. The movie deserves a 3 rating because that is all that it is worth. The story was okay and nothing great. The characterization was also good. All together this movies was just dry and boring. If you like good plot twist and good story lines, this movie is not for you. It was like watching a Disney playhouses version of an action flick. Oh yeah, the action was some of the worst I had ever seen. I wanted to yank my eye balls out. The movie was also kind of annoying. The entire plot seemed long and pointless.
'''
]

In [36]:
# text preprocessing: same cleaning as in training (non-letters -> space)
import re
reviews = [re.sub('[^A-Za-z]', ' ', x) for x in reviews]

In [37]:
# convert the text to features
reviews_cv = new_cvect.transform(reviews)
reviews_cv.shape

(2, 1454639)

In [38]:
# predict (1 = positive, 0 = negative)
new_lrc.predict(reviews_cv)

array([1, 0], dtype=int64)