
# **Python Machine Learning Complete Guide (파이썬 머신러닝 완벽 가이드)**
---
# **| Ch05** Sentiment Analysis



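## 1. Sentiment Analysis

: A method for identifying the subjective sentiment, opinion, emotion, or mood expressed in a document

* Computes a **sentiment score** from the subjective words and context that appear in the document's text
* The sentiment score consists of a positive sentiment score and a negative sentiment score, which are combined to decide whether the text is positive or negative
* From a machine learning perspective
  * `Supervised learning` : trains a model on training data and target labels, then uses it to predict the sentiment of other data; this is the same as ordinary text classification
  * `Unsupervised learning` : uses a sentiment vocabulary called a *Lexicon* to judge whether a document is positive or negative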

---
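## 2. Supervised Sentiment Analysis Practice - `IMDB movie reviews`

* `id` : the id of each record
* `sentiment` : the outcome of the review (1 = positive, 0 = negative) -> Target Label
* `review` : the text of the movie review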

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

review_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA-2/DATA/IMDB 영화평/labeledTrainData.tsv.zip', header = 0, sep = '\t', quoting = 3)
review_df.head(3)

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."


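* Inspecting the text of the `review` column

1. The text contains HTML `<br />` tags -> remove them
2. Apply replace() through the `str` accessor to turn each `<br />` tag into a space
3. Characters other than English letters (digits, special characters) carry little meaning as features, so replace them with spaces as well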
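
The cleaning steps above can be sketched on a single string. This is a minimal illustration only: the helper name `clean_review` and the sample text are assumptions, not part of the notebook, and the tag pattern is deliberately loose so that `<br>`, `<br/>` and `<br />` are all handled.

```python
import re

def clean_review(text):
    """Replace <br> tags and non-letter characters with spaces."""
    # Match <br>, <br/> and <br /> so tag spelling differences don't matter
    no_tags = re.sub(r'<br\s*/?>', ' ', text)
    # Each character that is not an English letter becomes a space
    return re.sub(r'[^a-zA-Z]', ' ', no_tags)

sample = "Great film!<br /><br />10/10"  # hypothetical review text
print(clean_review(sample).split())
```

Removing the tag before the letter filter matters: if the tag survives, the filter strips only `<`, `/` and `>`, and a stray `br` token is left in the text.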


In [3]:

print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [4]:
import re

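# Replace <br /> tags with a space (the raw text uses '<br />', with a space before the slash)
review_df['review'] = review_df['review'].str.replace('<br />', ' ', regex = False)

# Use the re module to replace every character that is not an English letter with a space
review_df['review'] = review_df['review'].apply(lambda x: re.sub('[^a-zA-Z]', " ", x))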

In [5]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id','sentiment'],axis = 1, inplace = False)
X_train,X_test,y_train,y_test = train_test_split(feature_df, class_df, test_size = 0.3, random_state = 156)

X_train.shape,X_test.shape

((17500, 1), (7500, 1))

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Filter English stop words and set ngram_range to (1,2) for count vectorization

pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words = 'english', ngram_range = (1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Run fit and predict
pipeline.fit(X_train['review'],y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test,pred),
                                                 roc_auc_score(y_test,pred_probs)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy: 0.8865, ROC-AUC: 0.9506
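
To make the `ngram_range = (1,2)` setting concrete, here is a small pure-Python sketch of the word n-grams it asks the vectorizer to count. `word_ngrams` is a hypothetical helper written for illustration; the real `CountVectorizer` also lowercases, tokenizes, and drops stop words before building n-grams.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams for every n in the inclusive ngram_range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = ['really', 'good', 'movie']
print(word_ngrams(tokens))
# ['really', 'good', 'movie', 'really good', 'good movie']
```

Each distinct n-gram becomes one feature column, which is why adding bigrams enlarges the feature space considerably.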


In [7]:
# Apply TF-IDF vectorization and measure predictive performance

pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words = 'english', ngram_range = (1,2))),
    ('lr_clf',LogisticRegression(C=10))
])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test,pred),
                                                 roc_auc_score(y_test,pred_probs)))

Accuracy: 0.8932, ROC-AUC: 0.9600
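
TF-IDF differs from raw counts by down-weighting terms that appear in many documents. The sketch below computes only the inverse-document-frequency part on a toy corpus. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's documented default; the helper name and the corpus are hypothetical, and the term-frequency multiplication and L2 normalization that `TfidfVectorizer` also applies are left out.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Toy corpus of already-tokenized documents (illustrative only)
docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'plot']]
vocab = sorted({word for doc in docs for word in doc})

for term in vocab:
    # Document frequency: number of documents that contain the term
    df = sum(term in doc for doc in docs)
    print(term, df, round(smooth_idf(len(docs), df), 3))
```

Terms that occur in fewer documents receive a larger weight, so words that appear in only a few reviews contribute more than words found everywhere.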


---
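## 3. Introduction to Unsupervised Sentiment Analysis

* `WordNet in the NLTK package` : a large English lexical database
  * Provides semantic information for words whose meaning changes with context
  * Drawback: predictive performance is not very good

* `SentiWordNet` : a WordNet built specifically for sentiment words
  * Assigns three sentiment scores (positive, negative, objectivity) to each **Synset**
  * For each sentence, sums the positive and negative scores of its words into a final sentiment score, which decides whether the sentiment is positive or negative

* `VADER` : provides sentiment analysis for social media text
  * Produces strong sentiment analysis results
  * Relatively fast execution time

* `Pattern` : the package that draws the most attention for predictive performance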


---
## 4. Sentiment Analysis with SentiWordNet


### 1) Understanding the WordNet Synset and SentiWordNet SentiSynset Classes

In [8]:
import nltk
nltk.download('all')

True

In [10]:
from nltk.corpus import wordnet as wn

term = 'present'

# Look up every synset registered for the word 'present'
synsets = wn.synsets(term)
print('synsets() return type : ', type(synsets))
print('synsets() number of returned values : ', len(synsets))
print('synsets() returned values : ', synsets)

synsets() return type :  <class 'list'>
synsets() number of returned values :  18
synsets() returned values :  [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


In [12]:
# Inspect the attributes of each Synset object

for synset in synsets :
  print('##### Synset name : ', synset.name(), '#####')
  print('POS : ', synset.lexname())
  print('Definition:', synset.definition())
  print('Lemmas : ', synset.lemma_names())

##### Synset name :  present.n.01 #####
POS :  noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas :  ['present', 'nowadays']
##### Synset name :  present.n.02 #####
POS :  noun.possession
Definition: something presented as a gift
Lemmas :  ['present']
##### Synset name :  present.n.03 #####
POS :  noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas :  ['present', 'present_tense']
##### Synset name :  show.v.01 #####
POS :  verb.perception
Definition: give an exhibition of to an interested audience
Lemmas :  ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name :  present.v.02 #####
POS :  verb.communication
Definition: bring forward and present to the mind
Lemmas :  ['present', 'represent', 'lay_out']
##### Synset name :  stage.v.01 #####
POS :  verb.creation
Definition: perform (a play), especially on a stage
Lemmas :  ['stage', 'p

In [15]:
# Create a Synset object for each word

tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree,lion, tiger, cat, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]

for entity in entities :
  similarity = [round(entity.path_similarity(compared_entity),2) for compared_entity in entities]
  similarities.append(similarity)

similarity_df = pd.DataFrame(similarities, columns = entity_names, index = entity_names)
similarity_df

Unnamed: 0,tree,lion,tiger,cat,dog
tree,1.0,0.07,0.07,0.08,0.12
lion,0.07,1.0,0.33,0.25,0.17
tiger,0.07,0.33,1.0,0.25,0.17
cat,0.08,0.25,0.25,1.0,0.2
dog,0.12,0.17,0.17,0.2,1.0
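
The values in the table follow from how `path_similarity` is defined: the reciprocal of one plus the number of edges on the shortest path between two synsets in the hypernym hierarchy. A minimal sketch of that relationship follows. The helper name is hypothetical, and the hop counts are inferred from the table values rather than looked up in WordNet.

```python
def path_similarity_from_distance(distance):
    """Similarity = 1 / (shortest path length + 1)."""
    return 1.0 / (distance + 1)

# 0 hops: a synset compared with itself (the diagonal of the table)
# 2, 3, 4 hops: consistent with lion-tiger, cat-lion and cat-dog above
for hops in (0, 2, 3, 4):
    print(hops, round(path_similarity_from_distance(hops), 2))
```

Because the score shrinks as the path grows, `tree` scores low against every animal: its shortest path to them runs through very general concepts.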


In [16]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type : ', type(senti_synsets))
print('senti_synsets() number of returned values : ', len(senti_synsets))
print('senti_synsets() returned values : ', senti_synsets)

senti_synsets() return type :  <class 'list'>
senti_synsets() number of returned values :  11
senti_synsets() returned values :  [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


In [17]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score : ',father.pos_score())
print('father negative score : ',father.neg_score())
print('father objectivity score : ',father.obj_score())
print('\n')

fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score : ', fabulous.pos_score())
print('fabulous negative score : ', fabulous.neg_score())

father positive score :  0.0
father negative score :  0.0
father objectivity score :  1.0


fabulous positive score :  0.875
fabulous negative score :  0.125
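
The three scores of a SentiSynset sum to 1, so the objectivity score is whatever remains after the positive and negative scores. A small sketch using the values printed above (the helper name `objectivity` is an assumption for illustration):

```python
def objectivity(pos_score, neg_score):
    """Objectivity is what remains after the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)

print(objectivity(0.0, 0.0))      # father.n.01   -> 1.0
print(objectivity(0.875, 0.125))  # fabulous.a.01 -> 0.0
```

A fully objective word such as `father` adds nothing to a document's polarity, while `fabulous` carries no objective weight at all.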


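### 2) Sentiment Analysis of Movie Reviews with SentiWordNet

1. Split the document into sentences
2. Tokenize each sentence into words and POS-tag them
3. Create Synset and SentiSynset objects from the POS-tagged words
4. Get the positive and negative scores from each SentiSynset, sum them all, and classify the review as positive when the total is at or above a threshold, and as negative otherwise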
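
Step 4 can be sketched in isolation, without NLTK, by assuming the per-word scores are already available. The function name, the default threshold of 0, and the second score list are hypothetical; the first list reuses the `fabulous` and `father` scores shown earlier.

```python
def polarity_from_scores(scores, threshold=0.0):
    """scores: list of (pos, neg) pairs, one per scored word."""
    if not scores:
        return 0  # nothing could be scored: treated as negative here
    # Each word contributes its positive score minus its negative score
    total = sum(pos - neg for pos, neg in scores)
    return 1 if total >= threshold else 0

print(polarity_from_scores([(0.875, 0.125), (0.0, 0.0)]))  # 1 (total = 0.75)
print(polarity_from_scores([(0.0, 0.625), (0.25, 0.0)]))   # 0 (total = -0.375)
```

The decision is taken once, after all words have been accumulated; deciding earlier would ignore most of the review.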

In [18]:
from nltk.corpus import wordnet as wn

def penn_to_wn(tag) :
  if tag.startswith('J') :
    return wn.ADJ
  elif tag.startswith('N') :
    return wn.NOUN
  elif tag.startswith('R') :
    return wn.ADV
  elif tag.startswith('V') :
    return wn.VERB

In [19]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text) :
  sentiment = 0.0
  tokens_count = 0

  lemmatizer = WordNetLemmatizer()
  raw_sentences = sent_tokenize(text)
  # For each sentence: tokenize into words -> POS-tag -> create SentiSynsets -> sum the sentiment scores
  for raw_sentence in raw_sentences :
    # POS-tag the sentence with NLTK
    tagged_sentence = pos_tag(word_tokenize(raw_sentence))
    for word,tag in tagged_sentence :
      wn_tag = penn_to_wn(tag)
      if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
        continue
      lemma = lemmatizer.lemmatize(word,pos = wn_tag)
      if not lemma :
        continue
      # Look up synsets using the lemma and its WordNet POS tag
      synsets = wn.synsets(lemma,pos = wn_tag)
      if not synsets :
        continue
      # Use the first synset and add its positive-minus-negative score
      synset = synsets[0]
      swn_synset = swn.senti_synset(synset.name())
      sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
      tokens_count +=1

  # Decide only after every sentence has been processed
  if not tokens_count :
    return 0

  if sentiment >= 0 :
    return 1
  return 0

In [24]:
review_df['preds'] = review_df['review'].apply(lambda x : swn_polarity(x))
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

In [25]:
# Predictive performance of the sentiment analysis

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target,preds))
print('Accuracy : ', np.round(accuracy_score(y_target,preds),4))
print('Precision : ', np.round(precision_score(y_target, preds),4))
print('Recall : ', np.round(recall_score(y_target,preds),4))


---
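## 5. Sentiment Analysis with VADER

* A review is classified as positive when its `compound` score is at or above a threshold, and as negative otherwise
* **neu** is the neutral score, **pos** the positive score, and **neg** the negative score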

In [26]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

{'neg': 0.127, 'neu': 0.747, 'pos': 0.125, 'compound': -0.7943}


In [27]:
def vader_polarity(review, threshold = 0.1) :
  analyzer = SentimentIntensityAnalyzer()
  scores = analyzer.polarity_scores(review)

  agg_score = scores['compound']
  final_sentiment = 1 if agg_score >= threshold else 0
  return final_sentiment

review_df['vader_preds'] = review_df['review'].apply( lambda x : vader_polarity(x,0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print(confusion_matrix(y_target, vader_preds))
print('Accuracy : ', np.round(accuracy_score(y_target, vader_preds),4))
print('Precision : ', np.round(precision_score(y_target, vader_preds),4))
print('Recall : ', np.round(recall_score(y_target,vader_preds),4))

[[ 6749  5751]
 [ 1856 10644]]
Accuracy :  0.6957
Precision :  0.6492
Recall :  0.8515
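
The three metrics can be re-derived by hand from the confusion matrix above, which is a useful check on how to read it. This sketch assumes scikit-learn's layout, with rows as actual labels and columns as predicted labels in the order 0, 1; the variable names are illustrative.

```python
# Confusion matrix from the VADER run: rows = actual, columns = predicted
tn, fp = 6749, 5751    # actual negative reviews
fn, tp = 1856, 10644   # actual positive reviews

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the reviews predicted positive, how many are positive
recall = tp / (tp + fn)      # of the positive reviews, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.6957 0.6492 0.8515
```

The high recall and lower precision show that, at a threshold of 0.1, VADER labels many negative reviews as positive (5,751 false positives).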
