The IMDB dataset consists of movie reviews collected from the movie site IMDB.

Each sample holds two pieces of information: the review itself and a label marking it as positive (1) or negative (0).

Using the IMDB data, we will try sentiment classification of text, one of the tasks in natural language processing (NLP).

# Exploring the data

First, let's load the IMDB review data.

With Keras, the data can be loaded easily as follows.

In [None]:
import numpy as np
from tensorflow.keras.datasets import imdb

In [None]:
# Cell added because imdb.load_data() raised an allow_pickle=False error (skip it if the data loads without errors)

# First, save the original np.load as np_load_old
# np_load_old = np.load

# Then replace np.load with a wrapper that always passes allow_pickle=True
# np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

In [None]:
(X_train, y_train), (X_test, y_test) = imdb.load_data()

# (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = 8000)
# The num_words parameter limits how many words are kept when the data is loaded.
# With num_words=8000, only the most frequent words are kept, so the vocabulary size is 8,000.
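To illustrate what `num_words` does, the sketch below mimics the capping step in plain Python instead of calling Keras. It assumes the default `oov_char=2` of `imdb.load_data()`; `cap_vocabulary` and the sample sequence are hypothetical and are not part of Keras.

```python
OOV_CHAR = 2  # assumed: default oov_char of imdb.load_data()

def cap_vocabulary(sequence, num_words, oov_char=OOV_CHAR):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [idx if idx < num_words else oov_char for idx in sequence]

# A short, made-up encoded review used only for illustration
sample = [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
capped = cap_vocabulary(sample, num_words=1000)
print(capped)
```

Rare words (large indices) collapse into a single index, which keeps the vocabulary, and therefore the embedding table of a model, small.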

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((25000,), (25000,), (25000,), (25000,))

In [None]:
print('Number of training samples : {}'.format(len(X_train)))
print('Number of test samples : {}'.format(len(X_test)))

Number of training samples : 25000
Number of test samples : 25000


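As shown, imdb provides training and test data in a 1:1 ratio and offers no option to load them in a different ratio.

Now let's take a closer look at X and y in the training data.

First, y_train indicates whether each review is positive or negative: 0 means negative and 1 means positive.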

In [None]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

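X_train holds the review texts.

However, they are stored as sequences of numbers rather than as English sentences.

Because a computer works with numbers rather than characters, each review is split into words and every word is mapped to a number.

For example, you can think of X_train as having been built through the following steps.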

[I love you] -> ['I'/'love'/'you'] -> {'I':1, 'love':2, 'you':3 } -> [1, 2, 3]
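The steps above can be sketched in a few lines of plain Python. This is only an illustration: `build_vocab` and `encode` are hypothetical helpers, and indices are assigned in order of first appearance, whereas the IMDB data ranks words by frequency.

```python
def build_vocab(sentences):
    """Assign each new token the next free index, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a list of indices using the vocabulary."""
    return [vocab[token] for token in sentence.split()]

vocab = build_vocab(["I love you"])
encoded = encode("I love you", vocab)
print(vocab)    # {'I': 1, 'love': 2, 'you': 3}
print(encoded)  # [1, 2, 3]
```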

In [None]:
print(X_train[0])  # the first review

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


Let's check which words the review in X_train[0] was originally made of.

In [None]:
word_index = imdb.get_word_index()
index_word = {}

for key, value in word_index.items():
    # Add 3 because indices 0-3 are reserved for special purposes in the IMDB data, so real words start at 4.
    index_word[value+3] = key

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [None]:
for index, token in enumerate(("<pad>", "<sos>", "<unk>", "<eos>")): 
    index_word[index]=token

index_word[0],index_word[1], index_word[2], index_word[3]

('<pad>', '<sos>', '<unk>', '<eos>')

In [None]:
index_word[4]  # From index 4 on, words are ordered from most to least frequent, so index 4 is the most common word.

'the'

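* What are special tokens?

1. sos (Start-Of-Sentence): marks the start of a sentence.
2. eos (End-Of-Sentence): marks the end of a sentence.
3. unk (Unknown): groups words that appear too rarely, typically three times or fewer, into a single token.
4. pad (Padding): a meaningless filler value used to bring short sequences up to the required length.

Special tokens such as sos, unk, eos, and pad are used to make training more efficient. Note that with the default arguments of imdb.load_data(), index 3 is reserved but never appears in the encoded reviews, so the `<eos>` label assigned above is only a placeholder.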
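As a small illustration of the pad token, the sketch below brings sequences to a fixed length in plain Python. `pad_pre` is a hypothetical helper; it pads and truncates at the front, which mirrors the default `padding='pre'` and `truncating='pre'` behaviour of Keras `pad_sequences`, and it assumes `max_len` is a positive integer.

```python
PAD = 0  # index used for padding

def pad_pre(sequence, max_len, value=PAD):
    """Keep the last max_len items and left-pad with `value` if too short."""
    trimmed = list(sequence)[-max_len:]
    return [value] * (max_len - len(trimmed)) + trimmed

print(pad_pre([1, 14, 22], max_len=5))           # short sequence gets padded
print(pad_pre([1, 2, 3, 4, 5, 6], max_len=4))    # long sequence gets truncated
```

After this step every review has the same length, so a batch of reviews can be stacked into one 2-D array for the model.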

In [None]:
print(' '.join([index_word[index] for index in X_train[0]]))

<sos> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and shoul

# Sentiment classification

In [None]:
# Try it yourself!



# Application

* Let's use our model to run sentiment analysis on real movie reviews.
* We will take one 10-star review and one 1-star review of the movie 'Parasite' from the IMDB site.

[Source] https://www.imdb.com/title/tt6751668/reviews?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=1

In [None]:
best_review = """As I write this, I want to describe my raw initial state after I finished the film, 
I'm in a state of complete awe, staring into the wall kind of awe, Parasite is truly a work of art, a sheer masterpiece. 
This film oozes with mastery, every little detail tells a story of its own, I was drawn to it like a moth to a flame, 
it grips hard and it never lets go, it sways between genres gracefully, it offers comedy both dark and light, drama, horror, thrill, 
and it's all packaged so seamlessly, conveyed to us throughout breathtaking performances across the board, 
I've watched my share of Korean Cinema to know that's a common thing but Parasite takes it to a whole other level, 
it materializes thoughts and ideas, things words can not communicate, it is extremely rare to experience such a thing in film, 
I can only name a small number of movies that actually made me feel this way, just incredible. 
Director Bong Joon Ho proves that he is a master within his own cross-genre domain, 
he takes charge and you actually feel like you're in safe hands watching this, he shapes the scenes perfectly, A true master of his trade, Thank You. 
In all honesty, I feel like I went through an experiment where the time went still and I experienced a piece of art such as this, 
I absloutely, wholeheartedly loved every single second of this film, MINDBLOWING."""

worst_review = """This movie is trying to say that capitalism is causing both the rich and poor to become parasites - 
the poor are scam artists lying for money, and the rich are using the poor to nurture their kids. 
But it's a mess of a story, not a "masterpiece." In the real world, the poor definitely have to fight major barriers to get ahead, 
but this movie doesn't have any depth - the symbolism is amateurish, the characters are one dimensional and inconsistent, 
and the dialogue between rich and poor is totally unrealistic. The movie makes it seem like the poor are all scamming, bumbling psychopaths, 
and the rich are all naive and emotionally clueless. Sorry, it's an ignorant view of the world. 
It makes me realize how disconnected from reality the entertainment industry and its official reviewers have become. 
Or maybe Hollywood is trying to gain more money and audience share in overseas markets so they promoted this one."""

Running sentiment analysis on these two reviews takes two steps.
1. Encoding the text into numbers
2. Feeding the encoded review to the model to get a prediction

Let's write a function that handles both steps.
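The first step (text to numbers) could look roughly like the sketch below. It is not the notebook's own solution: `encode_review` and `toy_word_index` are hypothetical, the toy indices are read off the decoded first review above (for example `this` is 14 - 3 = 11), and the constants assume the default `start_char=1`, `oov_char=2`, and `index_from=3` of `imdb.load_data()`. In the notebook, `imdb.get_word_index()` would take the place of the toy dictionary.

```python
import re

PAD, START, OOV = 0, 1, 2   # assumed Keras IMDB defaults
INDEX_FROM = 3              # offset added to every word index

# Tiny stand-in for imdb.get_word_index(), for illustration only
toy_word_index = {"this": 11, "film": 19, "was": 13, "just": 40, "brilliant": 527}

def encode_review(review, word_index, max_features, max_len):
    """Lower-case, strip punctuation, map words to indices, then left-pad."""
    tokens = re.sub(r"[^a-z0-9' ]", " ", review.lower()).split()
    encoded = [START]
    for token in tokens:
        idx = word_index.get(token)
        if idx is None or idx + INDEX_FROM >= max_features:
            encoded.append(OOV)  # unknown word or outside the vocabulary cap
        else:
            encoded.append(idx + INDEX_FROM)
    encoded = encoded[-max_len:]  # keep the last max_len indices
    return [PAD] * (max_len - len(encoded)) + encoded

print(encode_review("This film was just BRILLIANT!", toy_word_index, 15000, 8))
```

The values of `max_features` and `max_len` should match the ones used when the model was trained, otherwise the model sees indices or lengths it was never fitted on.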

In [None]:
import re

word_index = imdb.get_word_index()

def review_analysis(review, max_features, max_len):
    # Exercise: encode the review into indices, then predict with the trained model.
    pass

In [None]:
max_features = 15000
max_len = 500

print(review_analysis(best_review, max_features, max_len)) 
print(review_analysis(worst_review, max_features, max_len))

This is a positive review with 99% probability.
This is a negative review with 100% probability.
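For the second step, a message like the ones above could be derived from the model output as sketched below. This assumes the model ends in a single sigmoid unit whose output is the probability of a positive review; `describe_prediction` and the 0.5 threshold are assumptions, not something the notebook specifies.

```python
def describe_prediction(positive_prob, threshold=0.5):
    """Turn a sigmoid output into a message about the predicted class."""
    if positive_prob >= threshold:
        label, confidence = "positive", positive_prob
    else:
        # For a negative prediction, report the probability of the negative class
        label, confidence = "negative", 1.0 - positive_prob
    return f"This is a {label} review with {confidence * 100:.0f}% probability."

print(describe_prediction(0.99))
print(describe_prediction(0.003))
```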
