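# **Ch08. Text Analytics Practice #3**
**Ch08-05. Sentiment Analysis**

* Data: IMDB movie reviews https://www.kaggle.com/c/word2vec-nlp-tutorial/data

1. Supervised learning practice: apply Count and TF-IDF vectorization
2. Unsupervised learning practice: use the SentiWordNet and VADER sentiment lexicons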

## **data**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip
/kaggle/input/word2vec-nlp-tutorial/sampleSubmission.csv
/kaggle/input/word2vec-nlp-tutorial/unlabeledTrainData.tsv.zip
/kaggle/input/word2vec-nlp-tutorial/testData.tsv.zip


In [None]:
import zipfile

# Extract the labeled training data into the current working directory
with zipfile.ZipFile('/kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip') as existing_zip:
    existing_zip.extractall()

In [None]:
import pandas as pd

# The file is tab-separated (\t); quoting=3 (csv.QUOTE_NONE) keeps double quotes as literal text
review_df = pd.read_csv('labeledTrainData.tsv', header=0, sep="\t", quoting=3)
review_df.head(3)

|   | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
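The `quoting=3` argument in the `pd.read_csv` call is the numeric value of `csv.QUOTE_NONE`, which is why the double quotes survive in the `id` and `review` values shown above. The following is a minimal sketch with the standard library `csv` module only. The sample row is a shortened, hypothetical stand-in for a line of the TSV file, and pandas uses its own parser, so treat this as an analogy rather than the exact code path.

```python
import csv
import io

# Hypothetical, shortened sample in the same tab-separated layout as the data file.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff going down"\n'

# csv.QUOTE_NONE has the integer value 3, matching quoting=3 in the read_csv call.
quote_none_value = csv.QUOTE_NONE

# With QUOTE_NONE the quote characters are kept as ordinary text.
rows_raw = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

# With the default (QUOTE_MINIMAL) the surrounding quotes are stripped.
rows_default = list(csv.reader(io.StringIO(sample), delimiter='\t'))

print(rows_raw[1])
print(rows_default[1])
```

Keeping the quotes as literal text avoids parse errors when a review contains unbalanced double quotes; the later regex cleanup removes them anyway.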



* sentiment: 1 = positive review, 0 = negative review


In [None]:
# Inspect the text of the first review
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [None]:
# Import the regular expression module
import re

# Replace <br /> tags with a space (literal match, not a regex)
review_df['review'] = review_df['review'].str.replace('<br />', ' ', regex=False)

# Replace every character that is not an English letter with a space
review_df['review'] = review_df['review'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))
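To see what the two cleaning steps do to a single string, here is a small standalone sketch. The helper name `clean_review` and the sample sentence are hypothetical and not part of the notebook; the logic mirrors the tag replacement followed by the `[^a-zA-Z]` substitution.

```python
import re

def clean_review(text):
    """Replace <br /> tags, then every non-letter character, with a space."""
    text = text.replace('<br />', ' ')
    return re.sub('[^a-zA-Z]', ' ', text)

# Hypothetical sample text containing tags, digits, and punctuation.
sample = "Great film!<br /><br />10/10, would watch again."
cleaned = clean_review(sample)
print(cleaned)
print(cleaned.split())
```

Note that digits and punctuation are dropped along with the tags, so ratings such as "10/10" do not reach the vectorizer.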

In [None]:
from sklearn.model_selection import train_test_split

# Target labels
class_df = review_df['sentiment']

# Feature dataset
feature_df = review_df.drop(['id','sentiment'], axis=1, inplace=False)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size=0.3, random_state=156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

## **Supervised learning**
### Applying Count vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Remove English stop words and use ngram_range=(1, 2) for count vectorization
# Set C=10 for LogisticRegression
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Use the Pipeline object to train and predict with fit() and predict().
# predict_proba() is called because ROC-AUC needs class probabilities.
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_probs)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy: 0.8860, ROC-AUC: 0.9503
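To make the `stop_words` and `ngram_range=(1, 2)` settings concrete, the sketch below counts unigrams and bigrams for one sentence using only the standard library. It is a toy illustration, not scikit-learn's implementation: the function name `toy_ngram_counts` and the five-word stop list are hypothetical, and the real English stop list is much longer. It does follow the same order of operations, removing stop words before building n-grams.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; scikit-learn's built-in English list is far longer.
TOY_STOP_WORDS = {'the', 'was', 'and', 'is', 'it'}

def toy_ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams after lowercasing and stop-word removal."""
    # Tokens of two or more word characters, similar to CountVectorizer's default pattern.
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    counts = Counter()
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

counts = toy_ngram_counts('The plot was dull and the acting was dull')
print(counts)
```

Because stop words are removed first, bigrams such as "plot dull" join words that were not adjacent in the original sentence.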


### Applying TF-IDF vectorization

In [None]:
# Remove English stop words and use ngram_range=(1, 2) for TF-IDF vectorization
# Set C=10 for LogisticRegression
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Use the Pipeline object to train and predict with fit() and predict().
# predict_proba() is called because ROC-AUC needs class probabilities.
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_probs)))

Accuracy: 0.8936, ROC-AUC: 0.9598
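The weighting behind TF-IDF can be worked through by hand on a tiny corpus. The sketch below assumes the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The function name `toy_tfidf` and the three example documents are hypothetical, and tokenization is reduced to whitespace splitting.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return (list of {term: weight} dicts, idf dict) with L2-normalized weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: c * idf[t] for t, c in tf.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors, idf

docs = ['good movie', 'bad movie', 'good good plot']
vectors, idf = toy_tfidf(docs)
print(idf)
print(vectors[1])
```

In the second document, "bad" appears in only one document and therefore receives a larger weight than "movie", which appears in two. This down-weighting of common terms is the main difference from raw counts.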


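## **Unsupervised learning**
### **Lexicon-based analysis of IMDB movie reviews with SentiWordNet**

1. Split each document into sentences
2. Tokenize each sentence into words, then apply lemmatization and POS tagging
3. Create synset and senti_synset objects from the POS-tagged words
4. Read the positive and negative sentiment scores from each senti_synset, sum them, and label the document positive if the total is at or above a threshold, otherwise negative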
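Step 4 can be illustrated without NLTK by using a hand-made lexicon. In the sketch below, `TOY_LEXICON`, its scores, and `toy_polarity` are all hypothetical stand-ins; real positive and negative scores come from SentiWordNet, and the tokenization, tagging, and synset lookup of steps 1 to 3 are skipped.

```python
# Hypothetical (positive, negative) scores; real values come from SentiWordNet.
TOY_LEXICON = {
    'good': (0.75, 0.0),
    'impressive': (0.625, 0.0),
    'boring': (0.0, 0.5),
    'bad': (0.0, 0.625),
}

def toy_polarity(tokens, threshold=0.0):
    """Return 1 (positive) if the summed pos - neg score reaches the threshold, else 0."""
    total = 0.0
    matched = 0
    for token in tokens:
        if token not in TOY_LEXICON:
            continue  # words without a lexicon entry contribute nothing
        pos, neg = TOY_LEXICON[token]
        total += pos - neg
        matched += 1
    if matched == 0:
        return 0  # no lexicon evidence at all: fall back to negative
    return 1 if total >= threshold else 0

print(toy_polarity(['good', 'but', 'boring']))
print(toy_polarity(['bad', 'boring', 'good']))
```

Since every matched word carries equal weight, a long review with many mildly positive words can outweigh a few strongly negative ones. This is one reason lexicon-based scores tend to be less accurate than a trained classifier.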

In [None]:
from nltk.corpus import wordnet as wn

# Convert NLTK Penn Treebank tags to WordNet POS tags (returns None for any other tag)
def penno_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

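# Split the document into sentences -> word tokens -> POS tags, build SentiSynset objects, and sum the polarity scores
# Total sentiment score: sum over words of (positive score - negative score)
# Total score >= 0 -> positive sentiment, otherwise negative

def swn_polarity(text):
    # Initialize the sentiment score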
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    
    for raw_sentence in raw_sentences:
        # POS-tag each sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word, tag in tagged_sentence:
            
            # Map to a WordNet POS tag and lemmatize
            wn_tag = penno_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue  # skip verbs and any tag with no WordNet mapping
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)  # lemmatize nouns, adjectives, and adverbs
            if not lemma:
                continue  # skip if lemmatization returns an empty string
            # Look up synsets for the lemma and its WordNet POS tag
            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue
            # Take the first synset and compute its SentiWordNet score (positive - negative)
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score()-swn_synset.neg_score())
            tokens_count +=1
            
    if not tokens_count:
        return 0
    
    # Return 1 (positive) if the total score is 0 or higher, otherwise 0 (negative)
    if sentiment>=0:
        return 1
    
    return 0

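> Apply the swn_polarity(text) function to each IMDB review to predict positive or negative sentiment

1. Apply swn_polarity(text) to each document with apply and a lambda
2. Reuse the review_df created in the supervised learning part
3. Add a new 'preds' column to review_df that holds the return value of swn_polarity(text)
4. Compare the 'sentiment' column (target) with the 'preds' column (prediction)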

In [None]:
review_df['preds'] = review_df['review'].apply(lambda x : swn_polarity(x))
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target, preds))
print("Accuracy:", np.round(accuracy_score(y_target, preds), 4))
print("Precision:", np.round(precision_score(y_target, preds), 4))
print("Recall:", np.round(recall_score(y_target, preds), 4))

[[7668 4832]
 [3636 8864]]
Accuracy: 0.6613
Precision: 0.6472
Recall: 0.7091
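The three metrics can be re-derived from the confusion matrix printed above, which is a useful check on how the cells are laid out. This sketch relies on scikit-learn's convention for binary labels: rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Cells of the confusion matrix printed above: [[TN, FP], [FN, TP]].
tn, fp = 7668, 4832
fn, tp = 3636, 8864

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all reviews predicted correctly
precision = tp / (tp + fp)                   # share of predicted positives that are truly positive
recall = tp / (tp + fn)                      # share of true positives that were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The 4832 false positives are the main drag on precision: the lexicon approach labels many negative reviews as positive.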


### Sentiment analysis with VADER

1. Import SentimentIntensityAnalyzer from the nltk.sentiment.vader submodule of NLTK
2. Run sentiment analysis on the IMDB reviews

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}


In [None]:
def vader_polarity(review, threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # Return 1 if the compound score is at or above the threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# Apply vader_polarity to each record with apply/lambda and store the result in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply(lambda x: vader_polarity(x,0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print(confusion_matrix(y_target, vader_preds))
print("Accuracy:", np.round(accuracy_score(y_target, vader_preds), 4))
print("Precision:", np.round(precision_score(y_target, vader_preds), 4))
print("Recall:", np.round(recall_score(y_target, vader_preds), 4))

**[Results]**
* Accuracy: 66.13% -> 69.48%, an improvement over SentiWordNet
* Recall: 70.91% -> 85.06%, a substantial improvement
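The thresholding rule inside `vader_polarity` can be checked in isolation, without loading the VADER lexicon, by feeding it the scores printed earlier for the first review. The helper name `label_from_compound` is hypothetical; it repeats only the comparison step.

```python
def label_from_compound(compound, threshold=0.1):
    """Return 1 when the compound score is at or above the threshold, else 0."""
    return 1 if compound >= threshold else 0

# Scores printed earlier in the notebook for the first review.
first_review_scores = {'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}
label = label_from_compound(first_review_scores['compound'])
print(label)
```

The first review is labeled positive (sentiment 1) in the dataset, yet its compound score is well below the threshold, so this rule classifies it as negative. Individual misses like this are expected given the overall accuracy reported above.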