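# Background
### Sentiment analysis: predicting sentiment from the words used in a sentence
- Example: "This movie is a bit long, but fun and exciting"
 - long -> negative
 - fun -> positive
 - exciting -> positive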
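
The word-by-word idea above can be sketched in a few lines of R. The dictionary `sentiment_dict` and the helper `score_sentence` are hypothetical names made up for this illustration, not part of any package used later.

```r
# Hypothetical mini-dictionary: +1 for positive words, -1 for negative words.
sentiment_dict <- c(long = -1, fun = 1, exciting = 1)

# Score a sentence by summing the dictionary values of the words it contains.
# Words that are not in the dictionary yield NA and are ignored.
score_sentence <- function(sentence, dict) {
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  sum(dict[words], na.rm = TRUE)
}

example_score <- score_sentence("This movie is a bit long, but fun and exciting",
                                sentiment_dict)
example_score
```

One negative word and two positive words give a net positive score, which matches the intuition that the review is favorable overall.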

## Predictive analysis
 - Linear regression: dictionary-based
  - Assumes linearity.
 - SVM
 - RandomForest
 - Deep Learning
 
 Many different kinds of analysis are possible.

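### Regression (linear, straight-line model)
 - For every 1 cm increase in height, weight increases by 1 kg.
 - For every 1,000,000 won increase in monthly income, weight decreases by 1 kg.
 - For every additional negative word, the rating drops by 0.1 points.
 - For every additional positive word, the rating rises by 0.1 points.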
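
As a rough illustration of the last two bullets, the sketch below builds made-up review data whose ratings follow exactly that straight-line rule (an assumed base rating of 3.0, plus 0.1 per positive word, minus 0.1 per negative word) and checks that `lm` recovers the slopes. The data frame `reviews` and its values are invented for the example.

```r
# Toy data (made up): counts of positive and negative words per review.
reviews <- data.frame(
  pos = c(0, 1, 2, 3, 4, 5, 1, 3),
  neg = c(5, 3, 1, 0, 2, 4, 0, 2)
)

# Ratings generated from the assumed straight-line rule.
reviews$rating <- 3.0 + 0.1 * reviews$pos - 0.1 * reviews$neg

# Linear regression: one coefficient per predictor, plus an intercept.
fit <- lm(rating ~ pos + neg, data = reviews)
round(coef(fit), 3)
```

Each coefficient reads as "the change in rating per additional word of that type", which is what makes regression coefficients usable as a sentiment dictionary.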
 
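### Problems with regression
 - As the number of variables grows, overfitting occurs.
  - If even rarely occurring words are included, they still affect the fit even when their coefficients are small, and R-squared goes up.
 - Regression coefficients become extremely large or extremely small.
 - Predictive power drops.
 - A method for preventing overfitting is needed.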
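
The R-squared point can be checked with a small simulation. In this sketch the outcome and all predictors are pure noise (simulated, not from the review data), yet in-sample R-squared keeps climbing as more predictors are added.

```r
set.seed(1)
n <- 30
y <- rnorm(n)                           # outcome: pure noise
noise <- matrix(rnorm(n * 25), n, 25)   # 25 predictors unrelated to y

# In-sample R-squared of a linear model using the first k noise predictors.
r2_with <- function(k) {
  summary(lm(y ~ noise[, 1:k, drop = FALSE]))$r.squared
}

r2 <- sapply(c(1, 5, 15, 25), r2_with)
r2
```

Because each model contains the previous one, R-squared can never go down when predictors are added, so a high in-sample R-squared by itself says little about predictive power.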

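### How to prevent overfitting
 - Lasso: sets small regression coefficients to exactly 0.
  - When two words are highly correlated, one of them may be dropped.
 - Ridge: shrinks all regression coefficients overall.
 - Elastic net: Lasso + Ridge
 - Using Lasso for sentiment analysis tends to keep only the sentiment words -> neutral words disappear.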
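
For reference, the three methods can be written as penalized versions of the same fitting problem. Here $L(\beta)$ stands for the unpenalized loss (for example the negative log-likelihood), $\lambda \ge 0$ is the penalty strength, and the elastic net form follows the parameterization used by glmnet, with a mixing weight $\alpha \in [0, 1]$. Constant factors differ between textbooks, so treat this as a sketch of the structure rather than a definitive statement.

$$
\begin{aligned}
\text{Lasso:} \quad & \min_\beta \; L(\beta) + \lambda \sum_j |\beta_j| \\
\text{Ridge:} \quad & \min_\beta \; L(\beta) + \lambda \sum_j \beta_j^2 \\
\text{Elastic net:} \quad & \min_\beta \; L(\beta) + \lambda \left[ \alpha \sum_j |\beta_j| + \frac{1 - \alpha}{2} \sum_j \beta_j^2 \right]
\end{aligned}
$$

With $\alpha = 1$ the elastic net reduces to Lasso, and with $\alpha = 0$ it reduces to Ridge (up to the factor $1/2$).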

In [1]:
mobile <- read.csv('mobile2014.csv', stringsAsFactors = FALSE)

In [3]:
dim(mobile)

In [4]:
names(mobile)

In [2]:
head(mobile)

Unnamed: 0,X,Title,Author,ReviewID,Texts,YMD,Sentiment
1,127335,Ripoff don't buy it. Would like to know how to get my money back.,Alex Cropper,R112X6CB1GTVF7,Phone does not work. Does not allow outgoing text . Incoming calls. Would not disable South African settings. Waste of money,1/14/14,0
2,161579,I am not satisfied with the product,Monica Heredia,R3MQ3FY4PWPQFM,"I am not satisfied with the iPhone 5s, because I bought an unlocked iPhone but the one that I recieved is not unlocked, I am not any more in the US, could you tell me how Will you solve this problem.In despiste of your decision, I want yo tell you I am completely not satisfied with the product.",1/7/14,0
3,152064,Stay away buy something else,Mark Ducette monte carlo,R1QVAAZ9DWLN5Q,"EATS BATTERIES, BRAND NEW BATTERIES IT EATS FASTER, LG KNOWN PROBLEM WITH THIS MODEL THOUGHT IT WAS MY DAUGHTER, BOY WAS I WRONG.",1/28/14,0
4,180094,wondering,eaamber,R3AR8LYIC3BOI9,I'm wondering if this phone is good or bad I just bought him and there's good reviews and bad and does it take att go phone sim card or a special kind please let me know,1/15/14,0
5,180037,virus,pamela,RCYC822DI3R6K,When i got the phone it had a virus on it within 3 days the phone stopped working all together.,1/10/14,0
6,137918,Sprint Error 34,Jane Paulson,R9C4R6HIQLKM2,Have the phone with Verizon and Sprint. Sprint has said they have a text issue (Error 34) that there are delays / complete shut downs along the network. It varies among phones and the Samsung (on the Sprint network) appears to be more prone to this issue than other manufacturers - love the phone on Verizon - hate it on Sprint.,1/3/14,0


In [5]:
mobile[2,]
mobile[1035,]

Unnamed: 0,X,Title,Author,ReviewID,Texts,YMD,Sentiment
2,161579,I am not satisfied with the product,Monica Heredia,R3MQ3FY4PWPQFM,"I am not satisfied with the iPhone 5s, because I bought an unlocked iPhone but the one that I recieved is not unlocked, I am not any more in the US, could you tell me how Will you solve this problem.In despiste of your decision, I want yo tell you I am completely not satisfied with the product.",1/7/14,0


Unnamed: 0,X,Title,Author,ReviewID,Texts,YMD,Sentiment
1035,41535,works,Noah Crabtree,R384KSITCNYYVL,Another way to pay then having to go to wal mart or deal with the straight talk website or the phone :),1/21/14,1


In [7]:
table(mobile$Sentiment) # 0 = negative, 1 = positive


   0    1 
 999 1000 

## Building a DocumentTermMatrix

In [8]:
library(tm)

: package 'tm' was built under R version 3.3.2
Loading required package: NLP
: package 'NLP' was built under R version 3.3.2

In [10]:
corpus <- Corpus(VectorSource(mobile$Texts)) # build the corpus

In [15]:
stopwords() # words that should be removed

In [12]:
stopwords("SMART") # the default list can be short; SMART is a somewhat larger list

 - weightTfIdf: a correction applied to the raw term frequency. (Explained further in the SNS notes.)
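
To make the correction concrete, here is a hand-rolled tf-idf on a tiny made-up count matrix in base R. It is only a sketch of the idea (term frequency normalized by document length, multiplied by a base-2 log inverse document frequency) and is not meant to reproduce the output of `weightTfIdf` exactly; the matrix `counts` is invented for the example.

```r
# Toy term counts (rows = documents, columns = terms); made up for illustration.
counts <- matrix(c(2, 1, 0,
                   1, 0, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("phone", "great", "broken")))

tf  <- counts / rowSums(counts)                   # term frequency per document
idf <- log2(nrow(counts) / colSums(counts > 0))   # rarer terms get larger weights
tfidf <- sweep(tf, 2, idf, `*`)                   # multiply each column by its idf
tfidf
```

The term "phone" appears in every document, so its idf is 0 and it drops out, while "great" and "broken" keep positive weights. This is why tf-idf downplays words that are common across all reviews.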

In [13]:
dtm <- DocumentTermMatrix(corpus,
                         control = list(tolower = TRUE,
                                       removePunctuation = TRUE,
                                       removeNumbers = TRUE,
                                       stopwords = stopwords("SMART"),
                                       weighting = weightTfIdf))

In weighting(x): empty document(s): 534 1947

In [14]:
dtm

<<DocumentTermMatrix (documents: 1999, terms: 8453)>>
Non-/sparse entries: 46358/16851189
Sparsity           : 100%
Maximal term length: 132
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

## Building a sentiment dictionary with regression

In [16]:
library(glmnet)

: package 'glmnet' was built under R version 3.3.2
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5



In [18]:
X <- as.matrix(dtm)
Y <- mobile$Sentiment

In [19]:
X[1:5,1:5] # first 5 reviews (rows) by first 5 terms (columns)

Unnamed: 0,aah,aaps,aarp,abd,ability
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0
5,0,0,0,0,0


In [20]:
dim(X)

In [22]:
Y[1:5] # negative

In [23]:
Y[1001:1005] # positive

### Regression

In [21]:
res.lm <- glmnet(X, Y, family = "binomial", lambda = 0) # family = "binomial" for a two-valued outcome; lambda = 0 gives ordinary (unpenalized) regression

In [25]:
res.lm


Call:  glmnet(x = X, y = Y, family = "binomial", lambda = 0) 

       Df  %Dev Lambda
[1,] 8453 0.999      0

In [24]:
summary(res.lm)

           Length Class     Mode     
a0            1   -none-    numeric  
beta       8453   dgCMatrix S4       
df            1   -none-    numeric  
dim           2   -none-    numeric  
lambda        1   -none-    numeric  
dev.ratio     1   -none-    numeric  
nulldev       1   -none-    numeric  
npasses       1   -none-    numeric  
jerr          1   -none-    numeric  
offset        1   -none-    logical  
classnames    2   -none-    character
call          5   -none-    call     
nobs          1   -none-    numeric  

In [26]:
coef.lm <- coef(res.lm)[,1]
pos.lm <- coef.lm[coef.lm > 0]
neg.lm <- coef.lm[coef.lm < 0]
pos.lm <- sort(pos.lm, decreasing = TRUE)  # sort: largest positive coefficient first
neg.lm <- sort(neg.lm, decreasing = FALSE) # sort: most negative coefficient first

In [28]:
coef.lm[1:5] # first 5 entries only (the intercept comes first, followed by the 8453 term coefficients)

In [29]:
pos.lm[1:5]

In [30]:
neg.lm[1:5]

In [31]:
length(pos.lm) + length(neg.lm) # no coefficient is exactly 0, so a cutoff would have to be chosen by hand; Lasso sets coefficients to 0 automatically

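## Understanding lambda (the penalty)
![overfitting](1.PNG)
- The best prediction is the one that closely follows the green line.
- M: the degree of the polynomial
- If the degree gets too high, overfitting occurs.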
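
The effect of the degree M can be reproduced with a short simulation. This sketch assumes the figure shows polynomial fits to a handful of noisy points around a smooth curve; the sine curve, noise level, and sample size here are arbitrary choices for illustration.

```r
set.seed(7)
x <- seq(0, 1, length.out = 10)
y <- sin(2 * pi * x) + rnorm(10, sd = 0.3)   # noisy samples around a sine curve

# Training RMSE of a degree-M polynomial fitted by least squares.
train_rmse <- function(M) {
  fit <- lm(y ~ poly(x, M))
  sqrt(mean(resid(fit)^2))
}

degrees <- c(1, 3, 9)
rmse <- sapply(degrees, train_rmse)
data.frame(M = degrees, train_rmse = rmse)
```

With 10 points, a degree-9 polynomial passes through every point, so its training error is essentially zero. That perfect fit to the noise, rather than to the underlying curve, is what overfitting looks like.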

![overfitting](2.PNG)
 - As the constraint (penalty) gets stronger, the training error increases but the test error decreases.
 - Beyond a certain level, however, both errors go up.
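
A minimal simulation of this trade-off, using ridge regression in closed form on made-up data so that it runs in base R without extra packages. The sample sizes, true coefficients, and lambda grid are all assumptions chosen for illustration.

```r
set.seed(42)
n <- 40; p <- 30
X <- matrix(rnorm(2 * n * p), 2 * n, p)
beta_true <- c(rep(1, 5), rep(0, p - 5))      # only 5 predictors matter
y <- drop(X %*% beta_true) + rnorm(2 * n, sd = 2)
train <- 1:n
test  <- (n + 1):(2 * n)

# Ridge estimate in closed form (no intercept; the simulated data have mean zero).
ridge <- function(lambda) {
  solve(crossprod(X[train, ]) + lambda * diag(p),
        crossprod(X[train, ], y[train]))
}
mse <- function(b, idx) mean((y[idx] - X[idx, ] %*% b)^2)

lambdas <- c(0, 1, 10, 100, 1000)
errors <- t(sapply(lambdas, function(l) {
  b <- ridge(l)
  c(lambda = l, train = mse(b, train), test = mse(b, test))
}))
errors
```

The training column can only grow as lambda increases. The test column would typically dip first and then rise once the penalty is too strong, although the exact shape depends on the simulated draw.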

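### Lasso vs Ridge 
![LassoRidge](3.PNG)
 - The Lasso solution lies where the diamond meets the contour lines.
  - The value of $\beta_1$ becomes exactly 0.
 - The Ridge solution lies where the circle meets the contour lines.
  - The $\beta$ values shrink.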
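
The geometric picture has a simple numeric counterpart in the special case of orthonormal predictors, where both solutions have closed forms: Lasso soft-thresholds each unpenalized estimate, while Ridge divides it by a constant. The sketch below assumes that special case, with the objectives scaled so that the ridge factor is 1 / (1 + lambda); the starting estimates in `ols` are hypothetical.

```r
ols <- c(beta1 = 0.3, beta2 = 2.0)   # hypothetical unpenalized estimates
lambda <- 0.5

# Lasso: soft-thresholding pulls every estimate toward 0 by lambda,
# and anything smaller than lambda in absolute value becomes exactly 0.
lasso_est <- sign(ols) * pmax(abs(ols) - lambda, 0)

# Ridge: proportional shrinkage; estimates get smaller but never reach 0.
ridge_est <- ols / (1 + lambda)

rbind(ols, lasso_est, ridge_est)
```

The small coefficient `beta1` is removed entirely by Lasso but only reduced by Ridge, which mirrors the corner of the diamond versus the smooth edge of the circle.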