---
title:  "Text Analysis Sentiment"
excerpt: "Text Analysis"  

categories:  
  - Deep-Learning  
tags:  
  - Text Analysis
  - Opinion Review  
  - Sentimental  
last_modified_at: 2020-08-10T15:20:00-05:00
---

## Reference  
* 파이썬 머신러닝 완벽가이드 (Python Machine Learning Complete Guide) - 권철민
* NCIA shkim.hi@gmail.com  
* [딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning)](https://wikidocs.net/44249)

## Supervised sentiment analysis practice: IMDB movie reviews
* [dataset](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

#### The data can be downloaded from Kaggle  
[Kaggle link](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

In [1]:
import pandas as pd

# review_df = pd.read_csv('D:/202007_JJH/AI-RNN/lab_data/ch01/dataset/labeledTrainData.tsv', header=0, sep="\t", quoting=3)
review_df = pd.read_csv('D:/★2020_ML_DL_Project/Alchemy/dataset/labeledTrainData.tsv', header=0, sep="\t", quoting=3)
review_df.head(3)
# header=0: use row 0 of the file as the column header.
# quoting=3 (csv.QUOTE_NONE): treat quote characters as ordinary text.

|   | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |


In [2]:
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [3]:
review_df['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

* Data preprocessing: remove HTML tags and replace non-alphabetic characters (digits, punctuation) with spaces

In [4]:
import re

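# Replace the <br /> HTML tag with a space using str.replace (regex=False: match the literal string)
review_df['review'] = review_df['review'].str.replace('<br />', ' ', regex=False)

# Use Python's regular expression module re to replace every non-English-letter character with a space
review_df['review'] = review_df['review'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))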
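
To make the two cleaning steps concrete, here is a minimal sketch that applies them to a single string instead of the whole DataFrame. The helper name `clean_review` and the sample text are made up for illustration.

```python
import re

def clean_review(text):
    """Apply the same two cleaning steps used on review_df['review']."""
    # Step 1: replace the literal HTML line-break tag with a space.
    text = text.replace('<br />', ' ')
    # Step 2: replace every character that is not an English letter with a space.
    return re.sub('[^a-zA-Z]', ' ', text)

sample = "It's great!<br /><br />10/10"  # hypothetical review text
cleaned = clean_review(sample)
print(repr(cleaned))
print(cleaned.split())
```

Note that apostrophes are also replaced, so contractions are split ("It's" becomes "It s"), and digits such as ratings disappear entirely.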

In [6]:
# Only English letters remain; every other character has been replaced with a space.
print(review_df['review'][0])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for  

* Split into training and test data

In [7]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id','sentiment'], axis=1, inplace=False)

X_train, X_test, y_train, y_test= train_test_split(feature_df, class_df, test_size=0.3, random_state=156)
X_train.shape, X_test.shape

((17500, 1), (7500, 1))

class_df: the class, i.e. the label (sentiment)  
feature_df: the English review text

In [11]:
print(feature_df[:5])

                                              review
0   With all this stuff going down at the moment ...
1     The Classic War of the Worlds   by Timothy ...
2   The film starts with a manager  Nicholas Bell...
3   It must be assumed that those who praised thi...
4   Superbly trashy and wondrously unpretentious ...


In [12]:
for i in range(10):
    print(len(X_train.review.values[i]))

1792
2469
598
1871
2032
1061
733
1172
713
592


The output shows that the reviews all differ in length.

* Count-based feature vectorization and machine learning training/prediction/evaluation through a Pipeline

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

In [14]:
# Run CountVectorizer with English stop-word filtering and ngram_range=(1,2).
# Set C of LogisticRegression to 10.
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(solver='liblinear', C=10))])
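
As a rough illustration of what `ngram_range=(1,2)` combined with stop-word filtering means, the sketch below counts unigrams and bigrams in plain Python. It is not scikit-learn's implementation: the tiny stop-word set and the helper `ngram_counts` are assumptions made for this example, and `CountVectorizer` uses a much larger built-in English list and its own tokenizer.

```python
from collections import Counter

# Tiny stand-in stop-word set (assumption); the real English list is far larger.
STOP_WORDS = {'the', 'is', 'a', 'and'}

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams after lowercasing and stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter()
    low, high = ngram_range
    for n in range(low, high + 1):
        # Slide a window of n tokens over the filtered token list.
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts('The acting is superb and the plot is superb')
for term in sorted(counts):
    print(term, counts[term])
```

Because stop words are dropped before the n-grams are built, a bigram such as "superb plot" can appear even though the two words were not adjacent in the original sentence.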

In [15]:
# Train and predict with fit() and predict() on the Pipeline object. predict_proba() is needed for ROC-AUC.
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_probs)))

Accuracy: 0.8860, ROC-AUC: 0.9503


In [16]:
# Quick check: predict_proba() returns one column per class (column 0: negative, column 1: positive)
t = pipeline.predict_proba(X_test['review'])
print(t[:5])

[[9.99999759e-01 2.40712242e-07]
 [9.60692236e-01 3.93077644e-02]
 [9.85290870e-01 1.47091302e-02]
 [6.42742280e-01 3.57257720e-01]
 [9.69684380e-01 3.03156197e-02]]


In [17]:
# Quick check: keep only column 1, the probability of the positive class
t = pipeline.predict_proba(X_test['review'])[:,1]
print(t[:5])

[2.40712242e-07 3.93077644e-02 1.47091302e-02 3.57257720e-01
 3.03156197e-02]
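
The two outputs above can be read as follows: the columns of `predict_proba()` follow the order of the fitted classes (0, then 1), each row sums to 1, and for a binary logistic regression the predicted label is 1 when the positive-class probability exceeds 0.5. The sketch below checks this on the five rows printed above, using plain Python lists rather than the fitted pipeline.

```python
# Probabilities copied from the predict_proba() output printed above.
probs = [
    [9.99999759e-01, 2.40712242e-07],
    [9.60692236e-01, 3.93077644e-02],
    [9.85290870e-01, 1.47091302e-02],
    [6.42742280e-01, 3.57257720e-01],
    [9.69684380e-01, 3.03156197e-02],
]

# Each row is [P(class 0), P(class 1)] and should sum to (approximately) 1.
row_sums = [p_neg + p_pos for p_neg, p_pos in probs]

# Column 1 alone is what roc_auc_score needs as the score for the positive class.
pos_probs = [p_pos for _, p_pos in probs]

# Thresholding the positive-class probability at 0.5 reproduces the hard labels.
labels = [1 if p_pos > 0.5 else 0 for p_pos in pos_probs]

print(row_sums)
print(labels)
```

All five of these reviews come out as negative (label 0), since none of the positive-class probabilities exceeds 0.5.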


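## Unsupervised sentiment analysis: analysis with a sentiment lexicon
* Unsupervised sentiment analysis is based on a lexicon.
* A lexicon generally means a vocabulary list, but here it refers to a sentiment lexicon, a dictionary built mainly for analyzing sentiment.

### Sentiment Analysis with SentiWordNet 
* Understanding the WordNet Synset and SentiWordNet SentiSynset classes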

In [None]:
import nltk
nltk.download('all')

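* WordNet can be seen as a dictionary specialized for natural language processing (NLP).
* WordNet is a lexical database of English. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these sets.
* WordNet provides semantic information for words whose meaning changes with context. To do so, it represents each word, organized by part of speech (noun, verb, adjective, adverb), through the concept of a Synset.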

In [None]:
from nltk.corpus import wordnet as wn  # WordNet: a dictionary of synonym sets (synsets)

term = 'present'

# Look up the WordNet synsets for the word 'present'.
synsets = wn.synsets(term)
print('synsets() return type:', type(synsets))
print('Number of synsets returned:', len(synsets))
print('synsets() return value:', synsets)
print(synsets[0].name())
print(synsets[0].definition())

In [None]:
for synset in synsets :
    print('##### Synset name : ', synset.name(),'#####')
    print('POS :',synset.lexname())  # lexname() is the lexicographer file name (such as noun.time); its prefix is the part of speech
    print('Definition:',synset.definition())
    print('Lemmas:',synset.lemma_names())

In [None]:
# Create a Synset object for each word.
tree = wn.synset('tree.n.01')  # look up a Synset by name: word.pos.sense_number
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')
entities = [tree , lion , tiger , cat , dog]

entity_names = [ entity.name().split('.')[0] for entity in entities]

In [None]:
entities[0].name().split('.')[0]

In [None]:
entity_names

In [None]:
print(tree.path_similarity(lion), lion.path_similarity(tiger), dog.path_similarity(cat))
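
`path_similarity()` scores two synsets by the shortest path between them in the is-a (hypernym) taxonomy, as 1 / (1 + path length). The sketch below applies that formula to a small hand-made taxonomy. The taxonomy, the helper names, and the resulting numbers are hypothetical and do not reproduce WordNet's real hierarchy or scores.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); NOT WordNet's real hierarchy.
PARENT = {
    'animal': 'entity', 'plant': 'entity',
    'feline': 'animal', 'canine': 'animal',
    'cat': 'feline', 'lion': 'feline',
    'dog': 'canine', 'tree': 'plant',
}

def shortest_path_length(a, b):
    """Breadth-first search over undirected parent/child links."""
    neighbors = {}
    for child, parent in PARENT.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # no connecting path

def toy_path_similarity(a, b):
    """Similarity = 1 / (1 + shortest path length)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)

print(round(toy_path_similarity('cat', 'lion'), 2))   # 2 links apart
print(round(toy_path_similarity('cat', 'dog'), 2))    # 4 links apart
print(round(toy_path_similarity('tree', 'lion'), 2))  # 5 links apart
```

A synset compared with itself has path length 0 and therefore similarity 1.0, which is why the diagonal of a similarity table is all ones.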

In [None]:
type(entities)

In [None]:
# Iterate over the synsets and measure the similarity of each one to all the others.
similarities = []
for entity in entities:
    similarity = [ round(entity.path_similarity(compared_entity), 2)  for compared_entity in entities ]
    similarities.append(similarity)
    
# Store the pairwise similarities between the synsets in a DataFrame.
similarity_df = pd.DataFrame(similarities , columns=entity_names,index=entity_names)
similarity_df

In [None]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type:', type(senti_synsets))
print('Number of senti_synsets returned:', len(senti_synsets))
print('senti_synsets() return value:', senti_synsets)

In [None]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score: ', father.pos_score())
print('father negative score: ', father.neg_score())
print('father objectivity score: ', father.obj_score())
print('\n')
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score: ', fabulous.pos_score())
print('fabulous negative score: ', fabulous.neg_score())
print('fabulous objectivity score: ', fabulous.obj_score())

In [None]:
from nltk.corpus import wordnet as wn

# Penn Treebank is a part-of-speech tag set (not a dictionary); nltk.pos_tag returns these tags.
# Convert an NLTK Penn Treebank tag into the corresponding WordNet POS tag.
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None  # any other tag has no WordNet equivalent
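
To see the mapping at work without loading any corpus, the sketch below mirrors `penn_to_wn` using the single-letter strings that `wn.ADJ`, `wn.NOUN`, `wn.ADV`, and `wn.VERB` stand for in NLTK. The helper name `penn_to_wn_sketch` and the sample tags are chosen for illustration only.

```python
# WordNet POS constants as plain strings (the values of wn.ADJ, wn.NOUN, wn.ADV, wn.VERB).
WN_ADJ, WN_NOUN, WN_ADV, WN_VERB = 'a', 'n', 'r', 'v'

# First letter of the Penn Treebank tag -> WordNet POS.
PREFIX_TO_WN = {'J': WN_ADJ, 'N': WN_NOUN, 'R': WN_ADV, 'V': WN_VERB}

def penn_to_wn_sketch(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if there is none."""
    return PREFIX_TO_WN.get(tag[:1])

# Sample Penn Treebank tags: adjective, plural noun, adverb, past-tense verb, determiner.
for tag in ['JJ', 'NNS', 'RB', 'VBD', 'DT']:
    print(tag, '->', penn_to_wn_sketch(tag))
```

Tags such as `DT` (determiner) map to `None`, which is how function words get filtered out in the polarity function that follows.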

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # Initialize the sentiment score
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence: word tokenize -> POS tag -> create SentiSynset -> add up sentiment scores
    for raw_sentence in raw_sentences:
        # POS-tag the sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word , tag in tagged_sentence:
            
            # Convert to a WordNet POS tag and lemmatize
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN , wn.ADJ, wn.ADV): # skip verbs and any other part of speech
                continue                   
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma: # skip the word if no lemma is returned
                continue
            # Look up the Synsets for the lemma and its WordNet POS tag.
            synsets = wn.synsets(lemma , pos=wn_tag)
            if not synsets: # skip words that WordNet does not know
                continue
            # Take the first synset and get its SentiSynset from SentiWordNet.
            # For every word, add the positive score and subtract the negative score.
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())           
            tokens_count += 1
    
    # No scorable word was found: treat the text as negative
    if not tokens_count:
        return 0
    
    # Return 1 (positive) if the total score is 0 or greater, otherwise 0 (negative)
    if sentiment >= 0 :
        return 1
    
    return 0
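
The aggregation rule inside `swn_polarity` can be isolated from NLTK. The sketch below keeps only that rule (sum of positive minus negative scores, then a threshold at 0) and runs it on a hypothetical lexicon. The words, scores, and the name `toy_polarity` are invented for illustration; real scores come from SentiWordNet.

```python
# Hypothetical (pos_score, neg_score) pairs; real values come from SentiWordNet.
TOY_LEXICON = {
    'great': (0.75, 0.0),
    'boring': (0.0, 0.5),
    'slow': (0.0, 0.25),
    'movie': (0.0, 0.0),
}

def toy_polarity(words):
    """Sum (pos - neg) over known words; return 1 if the total is >= 0, else 0."""
    sentiment = 0.0
    tokens_count = 0
    for word in words:
        if word not in TOY_LEXICON:  # unknown words are skipped
            continue
        pos, neg = TOY_LEXICON[word]
        sentiment += pos - neg
        tokens_count += 1
    if tokens_count == 0:  # nothing could be scored
        return 0
    return 1 if sentiment >= 0 else 0

print(toy_polarity(['great', 'movie']))           # total 0.75
print(toy_polarity(['boring', 'slow', 'movie']))  # total -0.75
print(toy_polarity(['movie']))                    # total 0.0
```

One consequence of the `>= 0` rule is that a text containing only neutral words (total exactly 0) is labeled positive, so the rule leans toward the positive class.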

In [None]:
review_df['preds'] = review_df['review'].apply( lambda x : swn_polarity(x) )
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

In [None]:
print(y_target[:10])
print(preds[:10])

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print(confusion_matrix( y_target, preds))
print("Accuracy:", accuracy_score(y_target , preds))
print("Precision:", precision_score(y_target , preds))
print("Recall:", recall_score(y_target, preds))
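
For reference, the metrics printed above can be computed by hand from the four confusion-matrix counts. The sketch below does this in plain Python on made-up labels and predictions. The layout `[[TN, FP], [FN, TP]]` matches what `confusion_matrix` returns for the labels 0 and 1.

```python
# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

matrix = [[tn, fp], [fn, tp]]        # rows: actual class, columns: predicted class
accuracy = (tp + tn) / len(pairs)    # share of correct predictions
precision = tp / (tp + fp)           # of the predicted positives, how many are right
recall = tp / (tp + fn)              # of the actual positives, how many are found

print(matrix)
print(accuracy, precision, recall)
```

A lexicon-based classifier that leans toward predicting positive will typically show high recall and lower precision, which is worth keeping in mind when reading the numbers above.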

### Sentiment Analysis with the VADER lexicon

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)  # compound: a value from -1 to 1 that combines neg/neu/pos; 0.1 or higher is usually treated as positive

VADER provides a different set of scores from SentiWordNet: neg, neu, pos, and a combined compound score. That level of understanding is enough here.
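
To give a feel for why the compound score stays between -1 and 1, the sketch below shows the kind of normalization VADER applies to the summed valence of a text: x / sqrt(x² + alpha). The constant `alpha=15` is assumed here to match the NLTK implementation and should be checked against the installed version; the input values are arbitrary.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, norm))

# Arbitrary summed valence scores, for illustration.
for raw in [-4.0, 0.0, 0.5, 4.0, 50.0]:
    print(raw, round(normalize(raw), 4))
```

Small sums stay close to 0 and large sums approach 1 or -1 without ever passing them, so a fixed threshold such as 0.1 can be applied to texts of any length.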

In [None]:
def vader_polarity(review,threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # Based on the compound value: return 1 if it is greater than or equal to threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# Run vader_polarity() on every record with apply and a lambda, and store the result in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply( lambda x : vader_polarity(x, 0.1) )
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

In [None]:
print('#### VADER prediction performance ####')
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print(confusion_matrix( y_target, vader_preds))
print("Accuracy:", accuracy_score(y_target , vader_preds))
print("Precision:", precision_score(y_target , vader_preds))
print("Recall:", recall_score(y_target, vader_preds))