IMDB Movie Review Sentiment Analysis

In [1]:
import numpy as np 
import pandas as pd

In [2]:
df = pd.read_csv('../00.data/IMDB/labeledTrainData.tsv',
                 header=0, sep='\t', quoting=3)
df.head(3)

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [4]:
df.review[0][:1000]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [6]:
# Replace <br /> tags with a space (literal match, not a regex)
df['review'] = df.review.str.replace('<br />', ' ', regex=False)

In [7]:
# Replace every character that is not an English letter with a space
import re

df['review'] = df.review.apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
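
The two cleaning steps above can be sketched on a single string. The helper name `clean_review` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

def clean_review(text: str) -> str:
    """Replace <br /> tags, then every non-letter character, with a space."""
    text = text.replace('<br />', ' ')
    return re.sub('[^a-zA-Z]', ' ', text)

# Hypothetical sample review, not taken from the IMDB data
sample = "Great film!<br /><br />10/10, don't miss it."
cleaned = clean_review(sample)
print(cleaned.split())
```

Note that apostrophes are also removed, so "don't" becomes the two tokens "don" and "t", which matches the cleaned reviews shown later in this notebook.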

In [8]:
from sklearn.model_selection import train_test_split

feature_df = df.drop(['id', 'sentiment'], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(
    feature_df, df.sentiment, test_size=0.3, random_state=156
)
X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [9]:
X_train.head()

Unnamed: 0,review
3724,This version moved a little slow for my taste...
23599,I really enjoyed this film because I have a t...
11331,Saw this in the theater in and fell out o...
15745,Recently I was looking for the newly issued W...
845,Escaping the life of being pimped by her fath...


In [10]:
y_train.head()

3724     0
23599    1
11331    1
15745    1
845      1
Name: sentiment, dtype: int64

In [12]:
len(X_test)

7500

In [49]:
df_test = pd.DataFrame(X_test, columns=['review'])
df_test['sentiment'] = y_test
df_test.to_csv('F:/workspace/Flask/03_Module/static/data/imdb_test1.csv', index=False)

In [52]:
df_test = pd.read_csv('F:/workspace/Flask/03_Module/static/data/imdb_test1.csv')
df_test.head(10)

Unnamed: 0,review,sentiment
0,My girlfriend and I were stunned by how bad t...,0
1,What do you expect when there is no script to...,0
2,This is a German film from that is somet...,0
3,Richard Tyler is a little boy who is scared o...,0
4,I run a group to stop comedian exploitation a...,0
5,I can watch a good gory film now and then I ...,0
6,I should admit first I am a huge fan of The D...,1
7,I got to see this film at a preview and was d...,1
8,Homeward Bound The Incredible Journey is...,1
9,Still Crazy has been compared to the Spinal T...,1


In [11]:
# Use forward slashes throughout: '\w' in a normal string literal is an invalid escape sequence
X_train.to_csv('F:/workspace/Flask/03_Module/static/data/imdb_train.csv', sep=',', encoding='utf8', index=False)
X_test.to_csv('F:/workspace/Flask/03_Module/static/data/imdb_test.csv', sep=',', encoding='utf8', index=False)


In [14]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

CountVectorizer

In [14]:
count_vect = CountVectorizer(stop_words='english', ngram_range=(1,2))
count_vect.fit(X_train.review)
X_train_count = count_vect.transform(X_train.review)
X_test_count = count_vect.transform(X_test.review)

In [15]:
lr_clf = LogisticRegression(C=10)
lr_clf.fit(X_train_count, y_train)
pred = lr_clf.predict(X_test_count)
accuracy_score(y_test, pred)

0.886

TfidfVectorizer

In [16]:
tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tfidf_vect.fit(X_train.review)
X_train_tfidf = tfidf_vect.transform(X_train.review)
X_test_tfidf = tfidf_vect.transform(X_test.review)

In [17]:
lr_clf = LogisticRegression(C=10)
lr_clf.fit(X_train_tfidf, y_train)
pred = lr_clf.predict(X_test_tfidf)
accuracy_score(y_test, pred)

0.8936
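
As a rough illustration of how the two vectorizers differ, the sketch below fits both on a three-document toy corpus (invented for this example, not the IMDB data). CountVectorizer stores raw term counts, while TfidfVectorizer down-weights terms that occur in many documents and, by default, L2-normalizes each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (assumption, for illustration only)
docs = ["good movie", "bad movie", "good good acting"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs).toarray()
c_vocab = count_vect.vocabulary_          # term -> column index

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(docs).toarray()
t_vocab = tfidf_vect.vocabulary_

# Raw count of "good" in the third document
good_count = counts[2, c_vocab['good']]

# In "bad movie", the rarer term "bad" gets a larger weight than "movie"
bad_weight = tfidf[1, t_vocab['bad']]
movie_weight = tfidf[1, t_vocab['movie']]

# Each TF-IDF row has unit L2 norm with the default settings
row_norms = np.linalg.norm(tfidf, axis=1)
print(good_count, bad_weight > movie_weight, row_norms)
```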

Saving and loading the model

In [18]:
import joblib
joblib.dump(tfidf_vect,'model/imdb_vect.pkl')
joblib.dump(lr_clf,'model/imdb_lr.pkl')

['model/imdb_lr.pkl']

In [19]:
del tfidf_vect
del lr_clf

In [20]:
new_vect = joblib.load('model/imdb_vect.pkl')
new_lr = joblib.load('model/imdb_lr.pkl')

In [21]:
new_X_test = new_vect.transform(X_test.review)

In [22]:
pred = new_lr.predict(new_X_test)
accuracy_score(y_test,pred)

0.8936
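
The save/load round trip can be checked in isolation. The following is a minimal sketch on toy data, assuming a temporary directory instead of the notebook's `model/` folder; the documents and labels are invented.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (assumption, for illustration only)
docs = ["great film", "wonderful acting", "terrible film", "awful acting"]
labels = [1, 1, 0, 0]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

with tempfile.TemporaryDirectory() as tmp:
    vect_path = os.path.join(tmp, 'vect.pkl')
    clf_path = os.path.join(tmp, 'clf.pkl')
    joblib.dump(vect, vect_path)
    joblib.dump(clf, clf_path)
    loaded_vect = joblib.load(vect_path)
    loaded_clf = joblib.load(clf_path)

# The reloaded objects should reproduce the original predictions
same = (loaded_clf.predict(loaded_vect.transform(docs)) == clf.predict(X)).all()
print(same)
```

The vectorizer and the classifier must be saved and reloaded as a pair, because the classifier's coefficients are tied to the vectorizer's vocabulary.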

Training, prediction, and evaluation with a Pipeline

In [26]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('count_vect', CountVectorizer(stop_words='english',ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))
])
pipeline.fit(X_train.review, y_train)
pred = pipeline.predict(X_test.review)
acc = accuracy_score(y_test, pred)
print(f'CountVectorizer + LogisticRegression accuracy: {acc:.4f}')

CountVectorizer + LogisticRegression accuracy: 0.8860


In [27]:
joblib.dump(pipeline,'model/pipeline.pkl')

['model/pipeline.pkl']

In [30]:
new_pipe = joblib.load('model/pipeline.pkl')

In [32]:
pred = new_pipe.predict(X_test.review)
acc = accuracy_score(y_test, pred)
print(f'CountVectorizer + LogisticRegression accuracy: {acc:.4f}')

CountVectorizer + LogisticRegression accuracy: 0.8860


Assignment:

In [15]:
from sklearn.pipeline import Pipeline

In [16]:
n_pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english',ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))
])


In [17]:
params = {
    'tfidf_vect__ngram_range': [(1,1), (1,2)],
    'tfidf_vect__max_df': [300, 700],
    'lr_clf__C': [1, 10]
}

In [18]:
from sklearn.model_selection import GridSearchCV

grid_pipe = GridSearchCV(n_pipeline, param_grid=params, cv=3,
                         scoring='accuracy', verbose=1)
grid_pipe.fit(X_train.review, y_train)
print(grid_pipe.best_params_, grid_pipe.best_score_)



Fitting 3 folds for each of 8 candidates, totalling 24 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  5.6min finished
{'lr_clf__C': 10, 'tfidf_vect__max_df': 700, 'tfidf_vect__ngram_range': (1, 2)} 0.8822287469759523
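
One detail of the grid above is that `max_df` behaves differently for integers and floats: an integer such as 300 is an absolute document count, whereas a float is a proportion of the corpus. A small sketch on a toy corpus (invented for illustration) shows both forms dropping the same overly common terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): "the" and "was" each appear in 3 of the 4 documents
docs = [
    "the film was great",
    "the film was dull",
    "the plot was thin",
    "great acting",
]

# Integer: ignore terms that appear in more than 2 documents
vect_int = TfidfVectorizer(max_df=2).fit(docs)

# Float: ignore terms that appear in more than 50% of the documents (here, more than 2 of 4)
vect_float = TfidfVectorizer(max_df=0.5).fit(docs)

kept_int = sorted(vect_int.vocabulary_)
kept_float = sorted(vect_float.vocabulary_)
print(kept_int)
```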


In [19]:
# Instructor's version:
# best_count_lr = grid_pipe.best_estimator_
# pred_count_lr = best_count_lr.predict(df_test.review.values)
# accuracy_score(df_test.sentiment.values, pred_count_lr)

# GridSearchCV refits the best pipeline by default, so predict() uses it directly
pred = grid_pipe.predict(X_test.review)
acc = accuracy_score(y_test, pred)
print(f'TfidfVectorizer + LogisticRegression accuracy: {acc:.4f}')

TfidfVectorizer + LogisticRegression accuracy: 0.8872


In [None]:
#joblib.dump(best_count_lr, '../static/model/imdb_count_lr.pkl')

In [21]:
import joblib
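# n_pipeline itself is never fitted (GridSearchCV fits clones), so save the refit best estimator
joblib.dump(grid_pipe.best_estimator_, 'F:/workspace/Flask/03_Module/static/model/pipeline_tl.pkl')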

['F:/workspace/Flask/03_Module/static/model/pipeline_tl.pkl']
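
When saving a model after a grid search, it matters which object is dumped. GridSearchCV fits clones of the estimator it receives, so the pipeline object passed in stays unfitted, and the trained model lives in `best_estimator_`. The sketch below demonstrates this on toy data; the documents, labels, and the small grid are assumptions for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (assumption, for illustration only)
docs = ["great film", "loved it", "wonderful acting", "great fun",
        "terrible film", "hated it", "awful acting", "terrible bore"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('tfidf_vect', TfidfVectorizer()),
    ('lr_clf', LogisticRegression()),
])
grid = GridSearchCV(pipe, param_grid={'lr_clf__C': [1, 10]}, cv=2, scoring='accuracy')
grid.fit(docs, labels)

# The original pipeline object was never fitted
try:
    pipe.predict(["great film"])
    original_is_fitted = True
except NotFittedError:
    original_is_fitted = False

# The refit best estimator is the one that can predict (and the one worth saving)
best_pred = grid.best_estimator_.predict(["great film"])
print(original_is_fitted, best_pred.shape)
```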

Test

In [25]:
df_test = pd.read_csv('F:/workspace/Flask/03_Module/static/data/imdb_test.csv')
df_test.head()

Unnamed: 0,review
0,My girlfriend and I were stunned by how bad t...
1,What do you expect when there is no script to...
2,This is a German film from that is somet...
3,Richard Tyler is a little boy who is scared o...
4,I run a group to stop comedian exploitation a...


In [31]:
# Test
index = 4
test_data = df_test.iloc[index].values  # instructor's version: test_data.append(df_test.iloc[index, 0])
# label = test_data.iloc[index, -1]

test_data,test_data.shape
#,label

(array([' I run a group to stop comedian exploitation and I just spent the past   months hearing horror stories from comedians who attempted to audition for    Last Comic Standing    If they don t have a GOOD agent  then they don t even get a chance to audition so more than     of the comedians who turn up are rejected before they can show anyone that they have talent  If they do make it to an audition  I was told that it s   pre determined   if they get a second chance  So what the TV audience sees is NOT the best comics in the US   If the comics do make it to the show  then most of them don t get IMDb credits  I know this because I did the credits for all   seasons of    Last Comic Standing   and I don t get paid for doing the Producers  job  It s really a disgrace  A month ago  I asked    Last Comic Standing     on Facebook why the Producers aren t giving IMDb credits and I was banned from their Facebook Page    I am not a comedian so I do not have a personal stake in this  I just w

In [39]:
test_data = df_test.iloc[4].values  # ==> a 1-D array of strings
test_data

array([' I run a group to stop comedian exploitation and I just spent the past   months hearing horror stories from comedians who attempted to audition for    Last Comic Standing    If they don t have a GOOD agent  then they don t even get a chance to audition so more than     of the comedians who turn up are rejected before they can show anyone that they have talent  If they do make it to an audition  I was told that it s   pre determined   if they get a second chance  So what the TV audience sees is NOT the best comics in the US   If the comics do make it to the show  then most of them don t get IMDb credits  I know this because I did the credits for all   seasons of    Last Comic Standing   and I don t get paid for doing the Producers  job  It s really a disgrace  A month ago  I asked    Last Comic Standing     on Facebook why the Producers aren t giving IMDb credits and I was banned from their Facebook Page    I am not a comedian so I do not have a personal stake in this  I just wa

In [40]:
test_data.shape

(1,)

In [45]:
test = []
test.append(df_test.iloc[4].values)
test[0].shape

(1,)

In [37]:
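# Reshape to a 2-D string array of shape (1, 1). Note: unlike numeric estimators,
# a text pipeline expects a 1-D sequence of strings, so the (1,) array is what predict needs.
test_data = test_data.reshape(1, -1)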
test_data

array([[' I run a group to stop comedian exploitation and I just spent the past   months hearing horror stories from comedians who attempted to audition for    Last Comic Standing    If they don t have a GOOD agent  then they don t even get a chance to audition so more than     of the comedians who turn up are rejected before they can show anyone that they have talent  If they do make it to an audition  I was told that it s   pre determined   if they get a second chance  So what the TV audience sees is NOT the best comics in the US   If the comics do make it to the show  then most of them don t get IMDb credits  I know this because I did the credits for all   seasons of    Last Comic Standing   and I don t get paid for doing the Producers  job  It s really a disgrace  A month ago  I asked    Last Comic Standing     on Facebook why the Producers aren t giving IMDb credits and I was banned from their Facebook Page    I am not a comedian so I do not have a personal stake in this  I just w

In [38]:
test_data.shape

(1, 1)
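
The shapes explored above matter because a text vectorizer iterates over its input and treats each element as one document. A minimal sketch, assuming a toy pipeline trained on invented data, compares the (1,) and (1, 1) arrays.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy pipeline (assumption, for illustration only)
docs = ["great film", "wonderful acting", "terrible film", "awful acting"]
labels = [1, 1, 0, 0]
pipe = Pipeline([
    ('tfidf_vect', TfidfVectorizer()),
    ('lr_clf', LogisticRegression()),
]).fit(docs, labels)

one_d = np.array(["great film"])   # shape (1,): one document
two_d = one_d.reshape(1, -1)       # shape (1, 1): each "document" is itself an array

pred = pipe.predict(one_d)         # works: elements are strings

try:
    pipe.predict(two_d)            # fails: the vectorizer cannot lowercase an array
    two_d_ok = True
except Exception:
    two_d_ok = False

print(pred.shape, two_d_ok)
```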

In [34]:
df_test.review[index]  # ==> a scalar string (0-D)

' I run a group to stop comedian exploitation and I just spent the past   months hearing horror stories from comedians who attempted to audition for    Last Comic Standing    If they don t have a GOOD agent  then they don t even get a chance to audition so more than     of the comedians who turn up are rejected before they can show anyone that they have talent  If they do make it to an audition  I was told that it s   pre determined   if they get a second chance  So what the TV audience sees is NOT the best comics in the US   If the comics do make it to the show  then most of them don t get IMDb credits  I know this because I did the credits for all   seasons of    Last Comic Standing   and I don t get paid for doing the Producers  job  It s really a disgrace  A month ago  I asked    Last Comic Standing     on Facebook why the Producers aren t giving IMDb credits and I was banned from their Facebook Page    I am not a comedian so I do not have a personal stake in this  I just want peop

In [48]:
review = '오늘도 잠이 오지만 달립니다'
test1= [ ]
test1.append(review)  # list(review) does not work: it splits the string into single characters
test1

['오늘도 잠이 오지만 달립니다']
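
Wrapping a single review in a list is what a fitted text pipeline needs for a one-off prediction. The sketch below uses a toy pipeline and an invented English review to show the difference between `[review]` and `list(review)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy pipeline (assumption, for illustration only)
docs = ["great film", "wonderful acting", "terrible film", "awful acting"]
labels = [1, 1, 0, 0]
pipe = Pipeline([
    ('tfidf_vect', TfidfVectorizer()),
    ('lr_clf', LogisticRegression()),
]).fit(docs, labels)

review = "what a great film"       # hypothetical user input

wrapped = [review]                 # one document
chars = list(review)               # one "document" per character, not what we want

pred = pipe.predict(wrapped)
print(len(wrapped), len(chars), pred.shape)
```

A model trained on English reviews has no vocabulary for Korean text, so a Korean input such as the one above would be transformed into an all-zero feature vector.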

In [None]:
# Redo from here! I forgot to include the positive/negative labels.

In [None]:
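# imdb_test.csv has no sentiment column; reload the file saved with labels (imdb_test1.csv)
df_test = pd.read_csv('F:/workspace/Flask/03_Module/static/data/imdb_test1.csv')
label = df_test.sentiment[index]
label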