Bag of Words Meets Bags of Popcorn : 
https://www.kaggle.com/c/word2vec-nlp-tutorial/data

Supervised-learning sentiment analysis practice (IMDB movie reviews) : https://github.com/wikibook/ml-definitive-guide/blob/master/8%EC%9E%A5/8.5%20%EA%B0%90%EC%84%B1%20%EB%B6%84%EC%84%9D.ipynb

In [1]:
import pandas as pd

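# Raw string so the backslashes in the Windows path are not treated as escape sequences.
# quoting=3 (csv.QUOTE_NONE) keeps the double quotes inside the reviews as literal characters.
review_df = pd.read_csv(r'D:\일학습과제2018\캐글\word2vec-nlp-tutorial\labeledTrainData.tsv', header=0, sep="\t", quoting=3)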
review_df.head(3)

   id        sentiment  review
0  "5814_8"  1          "With all this stuff going down at the moment ...
1  "2381_9"  1          "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"  0          "The film starts with a manager (Nicholas Bell...


In [2]:
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [3]:
import re

# Replace the HTML <br /> tags with a space
review_df['review'] = review_df['review'].str.replace('<br />',' ')

# Use a regular expression (re) to replace every character that is not an English letter with a space
review_df['review'] = review_df['review'].apply( lambda x : re.sub("[^a-zA-Z]", " ", x) )
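
To see what these two cleaning steps do to a single review, here is a minimal sketch applied to a fragment of the first review printed above. The helper name clean_review is hypothetical and only wraps the same two operations used in the cell.

```python
import re

def clean_review(text: str) -> str:
    """Replace <br /> tags, then every non-letter character, with a space."""
    text = text.replace('<br />', ' ')
    return re.sub('[^a-zA-Z]', ' ', text)

# Fragment taken from the first review shown earlier
sample = "drugs are bad m'kay.<br /><br />Visually impressive"
cleaned = clean_review(sample)

print(cleaned)
print(cleaned.split())
```

Note that the apostrophe is replaced too, so "m'kay" becomes the two tokens "m" and "kay", and any digits in a review are dropped. The runs of spaces left behind are harmless because the vectorizers used later tokenize on word boundaries.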


In [4]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id', 'sentiment'], axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size=0.3, random_state=156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# CountVectorizer: filter English stop words, use unigrams and bigrams (ngram_range=(1,2))
# LogisticRegression with C=10
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(C=10))])

# Fit the Pipeline object, then call predict() for the labels and predict_proba() for the positive-class probabilities that ROC-AUC needs
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('predict accuracy is {0:.4f}, ROC-AUC is {1:.4f}'.format(accuracy_score(y_test , pred),
                                                              roc_auc_score(y_test, pred_probs)))



predict accuracy is 0.8860, ROC-AUC is 0.9503
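
As a rough illustration of what the CountVectorizer settings above produce, the following sketch fits the same configuration on a three-sentence toy corpus. The sentences are made up for this example and are not from the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not taken from the dataset)
docs = [
    "the acting was great",
    "the acting was not great",
    "terrible boring movie",
]

vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X = vect.fit_transform(docs)

# vocabulary_ maps each unigram/bigram to its column index
features = sorted(vect.vocabulary_)
print(features)
print(X.toarray())
```

In this sketch the first two sentences end up with identical count vectors, because "the", "was" and "not" are all on scikit-learn's built-in English stop-word list and are removed before the bigrams are formed. That is worth keeping in mind when applying stop-word filtering to sentiment data, where negation can carry the label.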


In [11]:
# TfidfVectorizer: filter English stop words, use unigrams and bigrams (ngram_range=(1,2))
# LogisticRegression with C=10
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(C=10))])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('predict accuracy is {0:.4f}, ROC-AUC is {1:.4f}'.format(accuracy_score(y_test , pred),
                                                              roc_auc_score(y_test, pred_probs)))



predict accuracy is 0.8936, ROC-AUC is 0.9598
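
The only change from the previous cell is the vectorizer. As a minimal sketch of how TF-IDF weighting differs from raw counts, the example below fits a TfidfVectorizer with default settings on a made-up toy corpus (the sentences are assumptions for illustration, not data from the competition).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "movie" occurs in every document, the other words in one each
docs = ["movie great great", "movie boring", "movie terrible"]

vect = TfidfVectorizer()
X = vect.fit_transform(docs).toarray()
vocab = vect.vocabulary_

# With the default smooth_idf=True, idf = ln((1 + n) / (1 + df)) + 1
print('idf(movie)  =', vect.idf_[vocab['movie']])
print('idf(boring) =', vect.idf_[vocab['boring']])

# Each row is L2-normalized by default
print('row norms   =', np.linalg.norm(X, axis=1))
print(np.round(X, 3))
```

A term that occurs in every document gets the minimum idf, so inside a row it is weighted lower than a rarer term with the same count, and the L2 normalization puts long and short documents on the same scale.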



### Introduction to unsupervised sentiment analysis
### Sentiment Analysis using SentiWordNet
* Understanding the WordNet Synset and SentiWordNet SentiSynset classes

In [13]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Administrator.WIN-
[nltk_data]    |     UASG1AEGV4D\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | ... (download log for the remaining packages omitted) ...

True

In [None]:
from nltk.corpus import wordnet as wn

term = 'present'

# Look up the WordNet synsets for the term 'present'
synsets = wn.synsets(term)
print('synsets() return type :', type(synsets))
print(synsets)