# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (Check inside your classroom for a discount code)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem" (this lesson)

- [Curate a Dataset](#lesson_1)
- [Developing a "Predictive Theory"](#lesson_2)
- [**PROJECT 1**: Quick Theory Validation](#project_1)


- [Transforming Text to Numbers](#lesson_3)
- [**PROJECT 2**: Creating the Input/Output Data](#project_2)


- Putting it all together in a Neural Network (video only - nothing in notebook)
- [**PROJECT 3**: Building our Neural Network](#project_3)


- [Understanding Neural Noise](#lesson_4)
- [**PROJECT 4**: Making Learning Faster by Reducing Noise](#project_4)


- [Analyzing Inefficiencies in our Network](#lesson_5)
- [**PROJECT 5**: Making our Network Train and Run Faster](#project_5)


- [Further Noise Reduction](#lesson_6)
- [**PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary](#project_6)


- [Analysis: What's going on in the weights?](#lesson_7)

# Lesson: Curate a Dataset<a id='lesson_1'></a>


In [2]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

**Note:** All reviews in `reviews.txt` have already been preprocessed to lowercase.

In [3]:
len(reviews)

25000

In [4]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [5]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory<a id='lesson_2'></a>



In [6]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


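Let's line up the labels and reviews side by side and think about where to start.
Negative reviews often contain words like 'terrible', while positive reviews often contain words like 'excellent'.

What kind of correlation is there between a review and its label?
Suppose we look at individual letters. Can the letter 'm' alone tell us whether a review is positive or negative? Letters such as 'm' or 't' carry no such signal. So let's try whole words instead. Words like 'this' and 'movie' carry no sentiment, but 'terrible' and 'trash' sound negative on their own, and 'excellent' and 'genius' sound positive. The more often such words appear, the stronger that impression should be.

So let's build the input data at the word level.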
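
To make the letter-versus-word comparison concrete, here is a minimal sketch on two made-up one-line reviews (the sample strings and variable names are illustrative assumptions, not taken from the dataset). It looks at letters and words separately, so you can see that the two reviews share most of their letters but differ exactly on the word that carries the sentiment.

```python
from collections import Counter

# Two toy reviews, invented for illustration only.
negative_review = "this movie is terrible"
positive_review = "this movie is excellent"

# Character-level view: letters that occur in both reviews.
neg_chars = Counter(negative_review.replace(" ", ""))
pos_chars = Counter(positive_review.replace(" ", ""))
shared_chars = set(neg_chars) & set(pos_chars)

# Word-level view: words that occur in only one of the reviews.
neg_words = set(negative_review.split(" "))
pos_words = set(positive_review.split(" "))
only_negative = neg_words - pos_words
only_positive = pos_words - neg_words

print(sorted(shared_chars))
print(only_negative, only_positive)
```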

# Project 1: Quick Theory Validation<a id='project_1'></a>

Counting words seems like a reasonable way to test this theory in code.
Let's use Python's [Counter](https://docs.python.org/2/library/collections.html#collections.Counter) class.


In [9]:
from collections import Counter
import numpy as np

Now create three `Counter` objects.

They hold the word counts for positive reviews, for negative reviews, and for all reviews combined.

In [10]:
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

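Now loop over all the reviews. For a positive review, increment each word's count in `positive_counts`; for a negative review, increment it in `negative_counts`.
In both cases, increment the word's count in `total_counts` as well.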

In [11]:
# TODO: Loop over all the words in all the reviews and increment the counts in the appropriate counter objects

for index in range(len(reviews)):  
    if labels[index] == 'POSITIVE':
        for word in reviews[index].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else :
        for word in reviews[index].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
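
As an alternative sketch of the same counting step, `Counter.update` accepts any iterable of words, which removes the inner loop. The helper name `count_words` and the toy data are assumptions for illustration; the notebook's own loop above produces the same kind of counts.

```python
from collections import Counter

def count_words(reviews, labels):
    """Return (positive, negative, total) word Counters for the given reviews."""
    positive, negative, total = Counter(), Counter(), Counter()
    for review, label in zip(reviews, labels):
        words = review.split(" ")
        # Pick the per-label counter, then update both it and the total.
        target = positive if label == "POSITIVE" else negative
        target.update(words)
        total.update(words)
    return positive, negative, total

# Toy data, invented for illustration.
toy_reviews = ["great great film", "bad film"]
toy_labels = ["POSITIVE", "NEGATIVE"]
pos, neg, tot = count_words(toy_reviews, toy_labels)
print(pos["great"], neg["bad"], tot["film"])
```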

Now list each counter's entries in descending order of count.

In [14]:
# Examine the counts of the most common words in positive reviews
positive_counts.most_common()
# Examine the counts of the most common words in negative reviews
negative_counts.most_common()

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('g

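As the output above shows, words such as 'the' and 'a' appear very often regardless of whether a review is positive or negative. What we want is not the most frequent words but the words that best signal positive or negative sentiment. So we compute, for each word, **the ratio of its uses in positive reviews to its uses in negative reviews**.

> **TODO:** Compute the positive-to-negative ratio for every word and store it in `pos_neg_ratios`.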


In [13]:
# Create Counter object to store positive/negative ratios
pos_neg_ratios = Counter()

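# TODO: Calculate the ratios of positive and negative uses of the most common words
#       Consider words to be "common" if they've been used at least 100 times
# Note: this version computes the ratio for every word and does not apply the 100-use threshold.
#       The +1 in the denominator avoids division by zero.
for word in total_counts:
    pos_neg_ratios[word] = positive_counts[word] / (float(negative_counts[word])+1)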

Examine the ratios you've calculated for a few words:

In [None]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

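A few patterns now emerge.

> * Words with a positive feel (for example "amazing") have a ratio greater than 1.
> The more strongly and frequently positive a word is, the further its ratio moves above 1.
> 
> * Words with a negative feel (for example "terrible") have a ratio less than 1.
> Likewise, the more strongly and frequently negative a word is, the closer its ratio gets to 0.
> 
> * Words with a neutral feel (for example "the") have a ratio close to 1.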
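
These patterns can be checked on a toy example. The sketch below uses made-up counts (not the real dataset) and the same formula as the cell above, `positive / (negative + 1)`; the `+ 1` in the denominator avoids division by zero for words that never appear in a negative review.

```python
from collections import Counter

# Made-up counts for illustration only.
toy_positive = Counter({"amazing": 40, "the": 100, "terrible": 2})
toy_negative = Counter({"amazing": 9, "the": 99, "terrible": 19})

toy_ratios = Counter()
for word in set(toy_positive) | set(toy_negative):
    toy_ratios[word] = toy_positive[word] / (float(toy_negative[word]) + 1)

# Positive word: above 1. Neutral word: near 1. Negative word: near 0.
for word in ["amazing", "the", "terrible"]:
    print(word, toy_ratios[word])
```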

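We can now use the ratio to tell sentiment apart.
However, the raw ratio is still awkward to compute with. The strongly positive word "amazing" has a ratio of about 4, while the negative word "terrible" has a ratio close to 0, which causes a few problems.

> * 1 is neutral. "amazing" has a value of about 4 and "terrible" a value of about 0.18, so their distances from 1 differ several-fold (3 versus 0.82). The raw values therefore cannot be compared directly.
> * We need a step that corrects this imbalance around the neutral value.
> * A neutral value of 0 is more convenient to work with than 1.

We used division to compute the ratios.
A logarithm works well for this kind of correction.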
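
A quick numeric check shows why the logarithm helps. Using the two ratios quoted above (4 for "amazing" and 0.18 for "terrible"), the sketch below compares their distance from the neutral point before and after taking the log. A ratio `r` and its reciprocal `1/r` map to `log(r)` and `-log(r)`, so equally strong positive and negative words end up equally far from 0. The variable names are illustrative.

```python
import math

ratio_amazing = 4.0    # value quoted in the text above
ratio_terrible = 0.18  # value quoted in the text above

# Distance from the neutral ratio 1 on the raw scale.
raw_gap_amazing = abs(ratio_amazing - 1)
raw_gap_terrible = abs(ratio_terrible - 1)

# On the log scale the neutral point becomes 0.
log_amazing = math.log(ratio_amazing)
log_terrible = math.log(ratio_terrible)

# A ratio and its reciprocal are mirror images around 0.
mirror = math.isclose(math.log(4.0), -math.log(0.25))

print(raw_gap_amazing, raw_gap_terrible)
print(log_amazing, log_terrible)
print(mirror)
```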

> **TODO:** Convert every ratio to its logarithm.


In [None]:
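# TODO: Convert ratios to logs
print(pos_neg_ratios["iraq"])  # value before the log transform

for word in pos_neg_ratios:
    # log(0) is undefined, so words with a ratio of 0 (never seen in a
    # positive review) are skipped and keep the value 0
    if pos_neg_ratios[word] != 0:
        pos_neg_ratios[word] = np.log(pos_neg_ratios[word])

print(pos_neg_ratios["iraq"])  # value after the log transform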

Now let's look at the newly defined values.

In [None]:
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))

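Neutral words now have values close to 0, while the positive word "amazing" and the negative word "terrible" both have magnitudes greater than 1 but with opposite signs. This makes sense: a positive value indicates positive sentiment, and a negative value indicates negative sentiment.

Now let's sort the values in descending order to see the most positive words.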


In [None]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()

Now let's look at the words that appeared most often in negative reviews.

In [None]:
list(reversed(pos_neg_ratios.most_common()))[0:30]
# This could also be written as pos_neg_ratios.most_common()[:-31:-1]

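In the results above, neutral words sit near 0, words that appear more often in positive reviews reach a maximum of around +3, and words that appear more often in negative reviews reach a minimum of around -3. This symmetry around 0 is why we used the logarithm.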

In [None]:
from IPython.display import Image

review = "This was a horrible, terrible movie."

Image(filename='sentiment_network.png')

In [None]:
review = "The movie was excellent"

Image(filename='sentiment_network_pos.png')

# Project 2: Creating the Input/Output Data<a id='project_2'></a>

**TODO:** Create a [set](https://docs.python.org/3/tutorial/datastructures.html#sets) named `vocab` that contains all of the words.

In [None]:
# TODO: Create set named "vocab" containing all of the words from all of the reviews
vocab = set(total_counts.keys())

Run the cell below to see the size of the vocabulary.

In [None]:
vocab_size = len(vocab)
print(vocab_size)

Take a look at the image below.
It shows the neural network we are about to code.
`layer_0` is the input layer, `layer_1` is the hidden layer, and `layer_2` is the output layer.

In [None]:
from IPython.display import Image
Image(filename='sentiment_network_2.png')

**TODO:** Create a NumPy array for `layer_0` and initialize it to all zeros. `layer_0` is a two-dimensional matrix with 1 row and `vocab_size` columns.

In [None]:
layer_0 = np.zeros(( 1,vocab_size ))

Running the cell below should output `(1, 74074)`.

In [None]:
layer_0.shape

In [None]:
from IPython.display import Image
Image(filename='sentiment_network.png')

`layer_0` has one entry for every word in the vocabulary.
We now need to know the index of each word, so we build a lookup table that stores an index for every word.

In [None]:
# Create a dictionary of words in the vocabulary mapped to index positions
# (to be used in layer_0)
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i
    
# display the map of words to indices
word2index
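
One detail worth knowing: iterating over a `set` of strings gives no guaranteed order, so the index assigned to each word can differ between runs. If you want a reproducible mapping, one possible variation (an assumption here, not something the notebook requires) is to sort the vocabulary first. The toy vocabulary and the `toy_` names below are invented for illustration.

```python
# Toy vocabulary, invented for illustration.
toy_vocab = {"movie", "terrible", "excellent", "the"}

# Sorting fixes the order, so each word gets the same index on every run.
toy_word2index = {word: i for i, word in enumerate(sorted(toy_vocab))}

# The reverse lookup is just the sorted list itself.
toy_index2word = sorted(toy_vocab)

print(toy_word2index)
```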

**TODO:** Implement `update_input_layer`. It counts how many times each word appears in the given review and stores those counts in `layer_0`.

In [None]:
def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
    global layer_0
    # clear out previous state by resetting the layer to be all 0s
    layer_0 *= 0
        
    # TODO: count how many times each word is used in the given review and store the results in layer_0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

Run the cell below to update the input layer for the first review.

In [None]:
update_input_layer(reviews[0])
layer_0
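
To see what the resulting vector looks like without scrolling through tens of thousands of entries, here is a self-contained sketch of the same idea on a four-word vocabulary. The names `toy_word2index`, `toy_layer_0`, and `toy_update_input_layer` are hypothetical stand-ins for the notebook's globals, and skipping unknown words is a choice made for this sketch.

```python
import numpy as np

# Hypothetical four-word vocabulary with fixed indices.
toy_word2index = {"excellent": 0, "movie": 1, "terrible": 2, "the": 3}
toy_layer_0 = np.zeros((1, len(toy_word2index)))

def toy_update_input_layer(review):
    """Fill toy_layer_0 with per-word counts for the given review."""
    toy_layer_0[:] = 0  # reset in place, like `layer_0 *= 0`
    for word in review.split(" "):
        # Skip words outside the toy vocabulary instead of raising KeyError.
        if word in toy_word2index:
            toy_layer_0[0][toy_word2index[word]] += 1

toy_update_input_layer("the movie the movie excellent")
print(toy_layer_0)
```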

**TODO:** Write the `get_target_for_label` function. It returns `0` or `1` depending on whether the given label is `NEGATIVE` or `POSITIVE`.

In [None]:
def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    if label == "POSITIVE":
        return 1
    else :
        return 0

Given the `'POSITIVE'` label, the function should return `1`.

In [None]:
labels[0]

In [None]:
get_target_for_label(labels[0])

Given the `'NEGATIVE'` label, the function should return `0`.

In [None]:
labels[1]

In [None]:
get_target_for_label(labels[1])

# Project 3: Building a Neural Network<a id='project_3'></a>

**TODO:** Now let's build the `SentimentNetwork` class.
- Code a basic neural network with an input layer, a hidden layer, and an output layer.
- Do **not** add a non-linearity to the hidden layer. That is, do not apply an activation function to the hidden layer's output.
- Reuse the code written earlier in this notebook.
- The `train` function trains the network over the entire corpus.
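
Before filling in the class, it may help to see the forward pass on its own. The sketch below uses a made-up three-word input and two hidden nodes (all sizes and weight values are illustrative assumptions). The hidden layer is a plain matrix product with no activation, and only the output goes through the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative sizes: 3 input words, 2 hidden nodes, 1 output node.
layer_0 = np.array([[1.0, 0.0, 2.0]])          # word counts, shape (1, 3)
weights_0_1 = np.array([[0.1, -0.2],
                        [0.4, 0.3],
                        [-0.1, 0.2]])          # shape (3, 2)
weights_1_2 = np.array([[0.5],
                        [-0.5]])               # shape (2, 1)

layer_1 = np.dot(layer_0, weights_0_1)           # linear hidden layer, shape (1, 2)
layer_2 = sigmoid(np.dot(layer_1, weights_1_2))  # output between 0 and 1, shape (1, 1)

print(layer_1.shape, layer_2.shape)
print(layer_2)
```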

In [None]:
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducible results during development
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)
        
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        
        # TODO: populate review_vocab with all of the words in the given reviews
        #       Remember to split reviews into individual words 
        #       using "split(' ')" instead of "split()".
        review_vocab = set()
        
        for review in reviews:
            for word in review.split(' '):
                review_vocab.add(word)
        
        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        
        
        # TODO: populate label_vocab with all of the words in the given labels.
        #       There is no need to split the labels because each one is a single word.
        label_vocab = set()
        
        for label in labels :
            label_vocab.add(label)
        
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of the review and label vocabularies.
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Create a dictionary of words in the vocabulary mapped to index positions
        # TODO: populate self.word2index with indices for all the words in self.review_vocab
        #       like you saw earlier in the notebook
        self.word2index = {}
        for i, word in enumerate(self.review_vocab) :
            self.word2index[word] = i
        
        # Create a dictionary of labels mapped to index positions
        # TODO: do the same thing you did for self.word2index and self.review_vocab, 
        #       but for self.label2index and self.label_vocab instead
        self.label2index = {}
        for i, label in enumerate(self.label_vocab) :
            self.label2index[label] = i
            
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Store the number of nodes in input, hidden, and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights
        
        # TODO: initialize self.weights_0_1 as a matrix of zeros. These are the weights between
        #       the input layer and the hidden layer.
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        
        # TODO: initialize self.weights_1_2 as a matrix of random values. 
        #       These are the weights between the hidden layer and the output layer.
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,(self.hidden_nodes, self.output_nodes))
        
        # TODO: Create the input layer, a two-dimensional matrix with shape 
        #       1 x input_nodes, with all values initialized to zero
        self.layer_0 = np.zeros((1,input_nodes))
    
        
    def update_input_layer(self,review):
        # TODO: You can copy most of the code you wrote for update_input_layer 
        #       earlier in this notebook. 
        #
        #       However, MAKE SURE YOU CHANGE ALL VARIABLES TO REFERENCE
        #       THE VERSIONS STORED IN THIS OBJECT, NOT THE GLOBAL OBJECTS.
        #       For example, replace "layer_0 *= 0" with "self.layer_0 *= 0"
        
        self.layer_0 *= 0
        for word in review.split(" "):
            if word in self.word2index:
                self.layer_0[0][self.word2index[word]] += 1
        
                
    def get_target_for_label(self,label):
        # TODO: Copy the code you wrote for get_target_for_label 
        #       earlier in this notebook. 
        
        if label == 'POSITIVE':
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        # TODO: Return the result of calculating the sigmoid activation function
        #       shown in the lectures
        return (1/(1 + np.exp(-x)))
    
    def sigmoid_output_2_derivative(self,output):
        # TODO: Return the derivative of the sigmoid activation function, 
        #       where "output" is the original output from the sigmoid function
        return (1 - output)*(output)

    def train(self, training_reviews, training_labels):
        
        # make sure we have a matching number of reviews and labels
        assert(len(training_reviews) == len(training_labels))
        
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0
        
        # Remember when we started for printing time statistics
        start = time.time()

        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            
            # TODO: Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]
            
            # TODO: Implement the forward pass through the network. 
            #       That means use the given review to update the input layer, 
            #       then calculate values for the hidden layer,
            #       and finally calculate the output layer.
            # 
            #       Do not use an activation function for the hidden layer,
            #       but use the sigmoid activation function for the output layer.
            
            #Update Input Layer
            self.update_input_layer(review)
            
            #Layer 1
            layer_1 = np.dot(self.layer_0,self.weights_0_1)
            
            #Layer 2
            layer_2 = self.sigmoid(np.dot(layer_1,self.weights_1_2))
            
            # TODO: Implement the back propagation pass here. 
            #       That means calculate the error for the forward pass's prediction
            #       and update the weights in the network according to their
            #       contributions toward the error, as calculated via the
            #       gradient descent and back propagation algorithms you 
            #       learned in class.
            
            #Output Error
            # error = hat{y} - y
            # output_delta = error * sigmoid'(hat{y})
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            
            #Backpropagate Error
            layer_1_error = np.dot(layer_2_delta,self.weights_1_2.T)
            layer_1_delta = layer_1_error
            
            #Update Weight
            self.weights_0_1 -= np.dot(self.layer_0.T, layer_1_delta) * self.learning_rate
            self.weights_1_2 -= np.dot(layer_1.T, layer_2_delta) * self.learning_rate
            
            
            # TODO: Keep track of correct predictions. To determine if the prediction was
            #       correct, check that the absolute value of the output error 
            #       is less than 0.5. If so, add one to the correct_so_far count.
            if layer_2 >= 0.5 and label == "POSITIVE" : 
                correct_so_far += 1
            elif layer_2 < 0.5 and label == "NEGATIVE" :
                correct_so_far += 1
            
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the training process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        """
        
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the prediction process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        # TODO: Run a forward pass through the network, like you did in the
        #       "train" function. That means use the given review to 
        #       update the input layer, then calculate values for the hidden layer,
        #       and finally calculate the output layer.
        #
        #       Note: The review passed into this function for prediction 
        #             might come from anywhere, so you should convert it 
        #             to lower case prior to using it.
        self.update_input_layer(review.lower())
        layer_1 = np.dot(self.layer_0,self.weights_0_1)
        layer_2 = self.sigmoid(np.dot(layer_1,self.weights_1_2))
        
        
        # TODO: The output layer should now contain a prediction. 
        #       Return `POSITIVE` for predictions greater-than-or-equal-to `0.5`, 
        #       and `NEGATIVE` otherwise.
        if layer_2[0] >= 0.5: return "POSITIVE"
        else: return "NEGATIVE"
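
For reference, the weight updates in `train` can be read as gradient descent on a squared-error loss. This is a sketch under that assumption (the notebook lists mean squared error as a prerequisite but does not write the loss out). The symbols $x$, $h$, $\hat{y}$, $y$, and $\eta$ are introduced here for `layer_0`, `layer_1`, `layer_2`, the target, and the learning rate.

$$
\begin{aligned}
h &= x\,W_{01}, \qquad \hat{y} = \sigma(h\,W_{12}), \qquad E = \tfrac{1}{2}(\hat{y} - y)^2 \\
\delta_2 &= (\hat{y} - y)\,\hat{y}\,(1 - \hat{y}) \\
\delta_1 &= \delta_2\,W_{12}^{\top} \\
W_{01} &\leftarrow W_{01} - \eta\,x^{\top}\delta_1, \qquad
W_{12} \leftarrow W_{12} - \eta\,h^{\top}\delta_2
\end{aligned}
$$

Because the hidden layer has no activation function, $\delta_1$ is passed back without being multiplied by a derivative term, which is why `layer_1_delta` is simply `layer_1_error` in the code.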

Now create a `SentimentNetwork`.
The last 1,000 reviews are held out as test data.
Here the learning rate is set to `0.1`.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)

Run a test on the last 1,000 reviews (the test set).
The network has not been trained yet, so the accuracy should come out at about 50%.
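
Why roughly 50%? A short sketch makes the reason visible. Because `weights_0_1` starts as all zeros, the hidden layer is zero for any input and the output is always `sigmoid(0) = 0.5`, so `run` returns `POSITIVE` for every review. The sizes and the random input below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
input_nodes, hidden_nodes, output_nodes = 5, 10, 1

# Same initialization scheme as init_network, on a tiny illustrative size.
weights_0_1 = np.zeros((input_nodes, hidden_nodes))
weights_1_2 = np.random.normal(0.0, output_nodes ** -0.5,
                               (hidden_nodes, output_nodes))

# Any input at all: word counts for some made-up review.
layer_0 = np.random.randint(0, 3, size=(1, input_nodes)).astype(float)

layer_1 = np.dot(layer_0, weights_0_1)           # all zeros
layer_2 = sigmoid(np.dot(layer_1, weights_1_2))  # sigmoid(0)

prediction = "POSITIVE" if layer_2[0][0] >= 0.5 else "NEGATIVE"
print(layer_2, prediction)
```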

In [None]:
mlp.test(reviews[-1000:],labels[-1000:])

Now train the network.

In [None]:
mlp.train(reviews[:-1000],labels[:-1000])

Training will probably not go well. The reason is that the learning rate is set too high. Set it to a smaller value, `0.01`, and train again.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.01)
mlp.train(reviews[:-1000],labels[:-1000])

It will probably still not train well. Let's go smaller: set the learning rate to `0.001` and train again.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)
mlp.train(reviews[:-1000],labels[:-1000])

With a learning rate of `0.001`, the network finally starts making meaningful predictions. The results are not yet satisfying, but they show that the approach has potential. Let's now improve the network further.

# Understanding Neural Noise<a id='lesson_4'></a>


In [None]:
from IPython.display import Image
Image(filename='sentiment_network.png')

In [None]:
def update_input_layer(review):
    
    global layer_0
    
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])

In [None]:
layer_0

In [None]:
review_counter = Counter()

In [None]:
for word in reviews[0].split(" "):
    review_counter[word] += 1

In [None]:
review_counter.most_common()

# Project 4: Reducing Noise in Our Input Data<a id='project_4'></a>

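A neural network's performance depends heavily on the quality of its input data. The input contains not only meaningful signal but also a lot of noise. We will try to understand this noise and remove it.

**TODO:** Modify the `update_input_layer` function. Instead of counting how many times each word appears, just record whether it appears at all. In other words, each entry becomes 1 if the word is present and 0 if it is absent.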

```diff
    def update_input_layer(self,review):

        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
-               self.layer_0[0][self.word2index[word]] += 1
+               self.layer_0[0][self.word2index[word]] = 1
```
All it takes is removing a single `+`.
The result may surprise you.
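
The effect of this one-character change is easy to see on a toy review. In the sketch below (the review string is made up), raw counts are dominated by filler tokens such as `the` and `.`, while the binary version gives every word that appears the same weight of 1, so a single `excellent` is no longer drowned out.

```python
from collections import Counter

# Made-up review: "." and "the" repeat, the sentiment word appears once.
toy_review = "the plot . the cast . the music . excellent"
words = toy_review.split(" ")

# Before the change: how many times each word occurs.
counts = Counter(words)

# After the change: 1 if the word occurs at all.
presence = {word: 1 for word in set(words)}

print(counts.most_common())
print(presence)
```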

In [None]:
# TODO: -Copy the SentimentNetwork class from Project 3 lesson
#       -Modify it to reduce noise, like in the video 

import time
import sys
import numpy as np

class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        np.random.seed(1)
        self.pre_process_data(reviews, labels)
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        
        review_vocab = set()
        for review in reviews:
            for word in review.split(' '):
                review_vocab.add(word)

        self.review_vocab = list(review_vocab)
        label_vocab = set()
        for label in labels :
            label_vocab.add(label)
        
        self.label_vocab = list(label_vocab)
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab) :
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab) :
            self.label2index[label] = i
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Store the number of nodes in input, hidden, and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,(self.hidden_nodes, self.output_nodes))
        self.layer_0 = np.zeros((1,input_nodes))
    
    
    def update_input_layer(self,review):

        self.layer_0 *= 0
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] = 1
        
                
    def get_target_for_label(self,label):
        
        if label == 'POSITIVE':
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return (1/(1 + np.exp(-x)))
    
    def sigmoid_output_2_derivative(self,output):
        return (1 - output)*(output)

    def train(self, training_reviews, training_labels):
        
        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()

        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]
            
            #Update Input Layer
            self.update_input_layer(review)
            
            #Layer 1
            layer_1 = np.dot(self.layer_0,self.weights_0_1)
            
            #Layer 2
            layer_2 = self.sigmoid(np.dot(layer_1,self.weights_1_2))
            
            #Output Error
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            
            #Backpropagate Error
            layer_1_error = np.dot(layer_2_delta,self.weights_1_2.T)
            layer_1_delta = layer_1_error
            
            #Update Weight
            self.weights_0_1 -= np.dot(self.layer_0.T, layer_1_delta) * self.learning_rate
            self.weights_1_2 -= np.dot(layer_1.T, layer_2_delta) * self.learning_rate
            
            if layer_2 >= 0.5 and label == "POSITIVE" : 
                correct_so_far += 1
            elif layer_2 < 0.5 and label == "NEGATIVE" :
                correct_so_far += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the test_labels to calculate the accuracy of those predictions.
        """
        
        correct = 0
        start = time.time()

        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        self.update_input_layer(review)
        layer_1 = np.dot(self.layer_0,self.weights_0_1)
        layer_2 = self.sigmoid(np.dot(layer_1,self.weights_1_2))
        
        if layer_2[0] >= 0.5: return "POSITIVE"
        else: return "NEGATIVE"

    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        self.learning_rate = learning_rate

        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,(self.hidden_nodes, self.output_nodes))
        self.layer_0 = np.zeros((1,input_nodes))
    
        
    def update_input_layer(self,review):
        
        self.layer_0 *= 0
        
        for word in review.split(" "):
            if(word in self.word2index.keys()):
                self.layer_0[0][self.word2index[word]] = 1
        
                
    def get_target_for_label(self,label):

        if label == 'POSITIVE':
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return (1/(1 + np.exp(-x)))
    
    def sigmoid_output_2_derivative(self,output):
        return (1 - output)*(output)

    def train(self, training_reviews, training_labels):
        
        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()

        for i in range(len(training_reviews)):

            review = training_reviews[i]
            label = training_labels[i]
            self.update_input_layer(review)

            layer_1 = np.dot(self.layer_0,self.weights_0_1)
            layer_2 = self.sigmoid(np.dot(layer_1,self.weights_1_2))
            
            #Output Error
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            
            #Backpropagate Error
            layer_1_error = np.dot(layer_2_delta,self.weights_1_2.T)
            layer_1_delta = layer_1_error
            
            #Update Weight
            self.weights_0_1 -= np.dot(self.layer_0.T, layer_1_delta) * self.learning_rate
            self.weights_1_2 -= np.dot(layer_1.T, layer_2_delta) * self.learning_rate
            
            if layer_2 >= 0.5 and label == "POSITIVE" : 
                correct_so_far += 1
            elif layer_2 < 0.5 and label == "NEGATIVE" :
                correct_so_far += 1

            #debug
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the testing_labels to calculate the accuracy of those predictions.
        """

        correct = 0
        start = time.time()

        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """

        self.update_input_layer(review)
        layer_1 = np.dot(self.layer_0,self.weights_0_1)
        layer_2 = self.sigmoid(np.dot(layer_1,self.weights_1_2))
        
        if layer_2[0] >= 0.5: return "POSITIVE"
        else: return "NEGATIVE"

Let's train again, starting with a `learning rate` of `0.1`.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
mlp.train(reviews[:-1000],labels[:-1000])

Remarkable. Even with a learning rate of 0.1, accuracy improved dramatically.
Compared with before, when the network did not learn *at all*, this is a striking effect.
Let's train once more with the `learning rate` set to `0.001`.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)
mlp.train(reviews[:-1000],labels[:-1000])

~~Surprisingly(?)~~ Cutting the learning rate to 1/100 of its previous value had very little effect.
Now let's evaluate on the test set.

In [None]:
mlp.test(reviews[-1000:],labels[-1000:])

# Analyzing Inefficiencies in our Network<a id='lesson_5'></a>

In [None]:
Image(filename='sentiment_network_sparse.png')

In [None]:
layer_0 = np.zeros(10)

In [None]:
layer_0

In [None]:
layer_0[4] = 1
layer_0[9] = 1

In [None]:
layer_0

In [None]:
weights_0_1 = np.random.randn(10,5)

In [None]:
layer_0.dot(weights_0_1)

In [None]:
indices = [4,9]

In [None]:
layer_1 = np.zeros(5)

In [None]:
for index in indices:
    layer_1 += (1 * weights_0_1[index])

In [None]:
layer_1

In [None]:
Image(filename='sentiment_network_sparse_2.png')

In [None]:
layer_1 = np.zeros(5)

In [None]:
for index in indices:
    layer_1 += (weights_0_1[index])

In [None]:
layer_1
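
The cells above hint at why the dense multiplication is wasteful: when `layer_0` holds only zeros and ones, `layer_0.dot(weights_0_1)` equals the sum of the rows of `weights_0_1` at the positions holding a one. A minimal check of that equivalence, reusing the shapes from the cells above (the fixed seed and the names `dense` and `sparse` are assumptions added for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Same toy shapes as the cells above: 10 inputs, 5 hidden units.
weights_0_1 = rng.standard_normal((10, 5))
indices = [4, 9]

# Dense path: build the 0/1 input vector and multiply.
layer_0 = np.zeros(10)
layer_0[indices] = 1
dense = layer_0.dot(weights_0_1)

# Sparse path: add only the rows that correspond to active inputs.
sparse = np.zeros(5)
for index in indices:
    sparse += weights_0_1[index]

print(np.allclose(dense, sparse))
```

The dense path multiplies and adds across all 10 rows; the sparse path touches only 2.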

# Project 5: Making our Network More Efficient<a id='project_5'></a>
In the previous project we raised the network's accuracy by removing noise.
In this project we will speed up training by cutting out unnecessary computation (optimization).

**TODO:** 
* The `update_input_layer` function is no longer needed; remove it.
* Modify `init_network`:
>* There is no longer a separate input layer, so remove every reference to `self.layer_0`.
>* The hidden layer will now be handled more directly, so create `self.layer_1`, a two-dimensional matrix of shape 1 x hidden_nodes with all values initialized to 0.

* Modify `train`:
>* Rename the input parameter `training_reviews` to `training_reviews_raw`.
>* At the start of the function, convert every review into a list of indices (using `word2index`). Build a local `list` named `training_reviews` that holds one list per review in `training_reviews_raw`; each of those lists contains the indices of the words in that review.
>* Remove the call to `update_input_layer`.
>* Use `self`'s `layer_1` instead of a local `layer_1`.
>* In the forward pass, remove the matrix multiplication that computes `layer_1`. Because every input value is either 1 or 0, the multiplication can be skipped (num * 1 = num): add the weights when the input is 1 and skip them when it is 0.
* Modify `run`:
>* Remove the call to `update_input_layer`.
>* Use `self`'s `layer_1` instead of a local `layer_1`.
>* As in `train`, add a pre-processing step for `review`.
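
The pre-processing step described for `train` can be sketched on its own. The helper name `reviews_to_indices` and the toy vocabulary are hypothetical, introduced only for illustration; the class does the same work inline and does not sort the indices.

```python
def reviews_to_indices(reviews_raw, word2index):
    """Convert each raw review into a list of unique vocabulary indices."""
    reviews_as_indices = []
    for review_raw in reviews_raw:
        indices = set()  # a set, so a repeated word counts only once
        for word in review_raw.split(" "):
            if word in word2index:  # skip words outside the vocabulary
                indices.add(word2index[word])
        # Sorted only to make the output deterministic for this example.
        reviews_as_indices.append(sorted(indices))
    return reviews_as_indices

# Toy vocabulary, made up for this example.
word2index = {"great": 0, "movie": 1, "terrible": 2}
converted = reviews_to_indices(["great great movie", "terrible plot"], word2index)
print(converted)  # [[0, 1], [2]]
```

Doing this conversion once, before the training loop, means the loop never has to split strings or look up words again.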

In [None]:
# TODO: -Copy the SentimentNetwork class from Project 4 lesson
#       -Modify it according to the above instructions 

import time
import sys
import numpy as np

class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """

        np.random.seed(1)
        self.pre_process_data(reviews, labels)
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):

        review_vocab = set()
        
        for review in reviews:
            for word in review.split(' '):
                review_vocab.add(word)
        self.review_vocab = list(review_vocab)
        label_vocab = set()
        
        for label in labels :
            label_vocab.add(label)
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab) :
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab) :
            self.label2index[label] = i
            
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        self.learning_rate = learning_rate
        
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,(self.hidden_nodes, self.output_nodes))

        # There is no separate input layer any more, so nothing here refers to `self.layer_0`.
        # The hidden layer is handled directly instead: `self.layer_1` is a 2-D matrix
        # of shape 1 x hidden_nodes, initialized to zeros.
        self.layer_1 = np.zeros((1,hidden_nodes))
        
                
    def get_target_for_label(self,label):
        if label == 'POSITIVE':
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return (1/(1 + np.exp(-x)))
    
    def sigmoid_output_2_derivative(self,output):
        return (1 - output)*(output)

    def train(self, training_reviews_raw, training_labels):

        # Changes from Project 4:
        # - The input parameter `training_reviews` is renamed to `training_reviews_raw`.
        # - Each raw review is first converted to a list of word indices (via `word2index`);
        #   the local list `training_reviews` holds one such index list per review.
        # - The call to `update_input_layer` is gone.
        # - `self.layer_1` replaces the local `layer_1`.
        # - The forward pass skips the matrix multiplication: inputs are 0 or 1, so
        #   (num * 1 = num) we just add the weight row for every word that is present.

        training_reviews = list()
        for review_raw in training_reviews_raw :
            review = set()
            for word in review_raw.split(" "):
                if (word in self.word2index.keys()):
                    review.add(self.word2index[word])
            training_reviews.append(list(review))
        
        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()

        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]
            self.layer_1 *= 0

            for j in review:
                self.layer_1 += (self.weights_0_1[j])
            
            layer_2 = self.sigmoid(np.dot(self.layer_1,self.weights_1_2))
            
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            layer_1_error = np.dot(layer_2_delta,self.weights_1_2.T)
            layer_1_delta = layer_1_error

            for j in review:
                self.weights_0_1[j] -= layer_1_delta[0] * self.learning_rate
            
            self.weights_1_2 -= np.dot(self.layer_1.T, layer_2_delta) * self.learning_rate

            if layer_2 >= 0.5 and label == "POSITIVE" : 
                correct_so_far += 1
            elif layer_2 < 0.5 and label == "NEGATIVE" :
                correct_so_far += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the testing_labels to calculate the accuracy of those predictions.
        """
        correct = 0
        start = time.time()

        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """

        # - The call to `update_input_layer` is gone.
        # - `self.layer_1` replaces the local `layer_1`.
        # - The `review` is pre-processed into word indices, as in `train`.
    
        
        self.layer_1 *= 0
        indices = set()
        for word in review.lower().split(" "):
            if (word in self.word2index.keys()):
                    indices.add(self.word2index[word])
        
        for i in indices:
                self.layer_1 += self.weights_0_1[i]   
                
        layer_2 = self.sigmoid(np.dot(self.layer_1,self.weights_1_2))

        if layer_2[0] >= 0.5: return "POSITIVE"
        else: return "NEGATIVE"

Let's train the network.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
mlp.train(reviews[:-1000],labels[:-1000])

Pay attention to `Speed(reviews/sec)`, which reports the training speed.
After only a few code changes, training went from about `300 reviews/sec` to about `1600 reviews/sec`.
That is more than a 5x speedup, a remarkable improvement.
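
Part of the saving comes from the weight update, not just the forward pass. Because the input is zero everywhere except at the indices in the review, the dense update `np.dot(layer_0.T, layer_1_delta)` changes only those rows of `weights_0_1`. The sketch below checks this on toy shapes; the sizes, seed, and values are assumptions for illustration, not measurements from the network above.

```python
import numpy as np

rng = np.random.default_rng(1)
learning_rate = 0.1

weights = rng.standard_normal((6, 3))        # 6 vocabulary words, 3 hidden units
layer_1_delta = rng.standard_normal((1, 3))  # stand-in for the backpropagated delta
review = [1, 4]                              # indices of words present in the review

# Dense update: outer product of the 0/1 input row with the delta.
layer_0 = np.zeros((1, 6))
layer_0[0, review] = 1
dense_weights = weights - np.dot(layer_0.T, layer_1_delta) * learning_rate

# Sparse update: touch only the rows for words in the review.
sparse_weights = weights.copy()
for j in review:
    sparse_weights[j] -= layer_1_delta[0] * learning_rate

print(np.allclose(dense_weights, sparse_weights))
```

The dense version does work proportional to the vocabulary size for every review, while the sparse version does work proportional to the number of distinct words in that review.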

In [None]:
mlp.test(reviews[-1000:],labels[-1000:])

In [None]:
Image(filename='sentiment_network_sparse_2.png')

In [None]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common()

In [None]:
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:30]

In [None]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [None]:
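# `normed` was removed from np.histogram in recent NumPy releases; `density=True` alone normalizes the histogram.
hist, edges = np.histogram(list(map(lambda x:x[1],pos_neg_ratios.most_common())), density=True, bins=100)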

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Word Positive/Negative Affinity Distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

In [None]:
frequency_frequency = Counter()

for word, cnt in total_counts.most_common():
    frequency_frequency[cnt] += 1

In [None]:
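# `normed` was removed from np.histogram in recent NumPy releases; `density=True` alone normalizes the histogram.
hist, edges = np.histogram(list(map(lambda x:x[1],frequency_frequency.most_common())), density=True, bins=100)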

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="The frequency distribution of the words in our corpus")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

# Project 6: Reducing Noise by Strategically Reducing the Vocabulary<a id='project_6'></a>

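**TODO:** We will now use a statistical approach to improve the performance of `SentimentNetwork` further. 
Too many of the words are neutral, which costs both accuracy and computation. 
Words with neutral sentiment will be cut off before the text is fed into the network.

Implement this by adding the following rules:
* Words that appear rarely (e.g. fewer than 50 times) are all added to the vocabulary.
* Words that appear often (e.g. 50 times or more) are added to the vocabulary only when their positive-to-negative ratio is at or beyond a given cutoff.

* Modify `pre_process_data`:
>* Add the parameters `min_count` and `polarity_cutoff`.
>* Calculate the positive-to-negative ratios of the words used in the reviews. The difference from before is that this now happens inside the class rather than in a standalone function.
>* With a cutoff, only words whose positive-to-negative ratio clears the threshold are used for training. Choose a suitable cutoff.
>* Only add a word to the vocabulary if it occurs more than `min_count` times.
>* Only add a word to the vocabulary if its positive-to-negative ratio is at least `polarity_cutoff` in magnitude.
* Modify `__init__`:
>* Add the parameters `min_count` and `polarity_cutoff`, and pass them along when calling `pre_process_data`.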
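
A compact sketch of the filtering idea on a toy corpus. It follows the rule in the first two bullets above (rare words are kept; frequent words must clear the polarity cutoff), but the function name `build_vocab`, the toy reviews, the signed log ratio, and the choice to score every frequent word rather than only those seen 50 or more times are assumptions made to keep the example small.

```python
from collections import Counter
import math

def build_vocab(reviews, labels, min_count, polarity_cutoff):
    positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
    for review, label in zip(reviews, labels):
        target = positive_counts if label == "POSITIVE" else negative_counts
        for word in review.split(" "):
            target[word] += 1
            total_counts[word] += 1

    vocab = set()
    for word, count in total_counts.items():
        if count <= min_count:
            vocab.add(word)  # rare words are kept unconditionally
            continue
        # Signed log ratio: > 0 leans positive, < 0 leans negative, near 0 is neutral.
        ratio = positive_counts[word] / float(negative_counts[word] + 1)
        polarity = math.log(ratio) if ratio > 1 else -math.log(1 / (ratio + 0.01))
        if abs(polarity) >= polarity_cutoff:
            vocab.add(word)
    return vocab

reviews = ["great film great cast", "the film was great",
           "awful film awful cast", "the film was awful"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
vocab = build_vocab(reviews, labels, min_count=2, polarity_cutoff=0.5)
print(sorted(vocab))  # ['awful', 'cast', 'great', 'the', 'was']
```

Here `film` appears equally often in both classes, so its polarity is close to zero and it is dropped, while `great` and `awful` survive the cutoff.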

In [None]:
# TODO: -Copy the SentimentNetwork class from Project 5 lesson
#       -Modify it according to the above instructions 

import time
import sys
import numpy as np
from collections import Counter

class SentimentNetwork:
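    def __init__(self, reviews, labels, min_count = 10, polarity_cutoff = 0.1, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            min_count(int) - Words occurring more than this many times are kept only if they pass the polarity cutoff; rarer words are always kept
            polarity_cutoff(float) - Minimum magnitude of a word's positive-to-negative log ratio for it to be kept
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """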

        np.random.seed(1)
       
        # Pass `min_count` and `polarity_cutoff` through to `pre_process_data`.
        self.pre_process_data(reviews, labels, polarity_cutoff, min_count)
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels, polarity_cutoff, min_count):

        # `min_count` and `polarity_cutoff` are new parameters.
        # The positive-to-negative ratios of the words are now computed here,
        # inside the class, rather than in a standalone function.

        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()

        for i in range(len(reviews)) :
            if labels[i] == 'POSITIVE':
                for word in reviews[i].split(" "):
                    positive_counts[word] += 1
                    total_counts[word] += 1
            else:
                for word in reviews[i].split(" "):
                    negative_counts[word] += 1
                    total_counts[word] += 1

        # Only words whose positive-to-negative ratio clears the cutoff are used for training.
        # Words seen more than `min_count` times must have a ratio of at least
        # `polarity_cutoff` in magnitude to enter the vocabulary.

        pos_neg_ratios = Counter()

        for word, counts in list(total_counts.most_common()):
            if (counts >= 50):
                pos_neg_ratio = positive_counts[word] / float(negative_counts[word]+1)
                pos_neg_ratios[word] = pos_neg_ratio

        for word, ratio in pos_neg_ratios.most_common():
            if ratio > 1:
                pos_neg_ratios[word] = np.log(ratio)
            else:
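                # Negated so that negative-leaning words get negative scores,
                # which is what the `<= -polarity_cutoff` check below expects.
                pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))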
        #
        ## end New for Project 6
        ## ----------------------------------------

        review_vocab = set()
        for review in reviews:
            for word in review.split(' '):
                if(total_counts[word] > min_count):
                    if((pos_neg_ratios[word]>=polarity_cutoff) or (pos_neg_ratios[word]<= -polarity_cutoff)):
                        review_vocab.add(word)
                else:
                    review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        for label in labels :
            label_vocab.add(label)
        self.label_vocab = list(label_vocab)
        
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        self.word2index = {}
        for i, word in enumerate(self.review_vocab) :
            self.word2index[word] = i

        self.label2index = {}
        for i, label in enumerate(self.label_vocab) :
            self.label2index[label] = i
            
        
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        self.learning_rate = learning_rate
        
        self.weights_0_1 = np.zeros((self.input_nodes,self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5,(self.hidden_nodes, self.output_nodes))

        self.layer_1 = np.zeros((1,hidden_nodes))
        
                
    def get_target_for_label(self,label):
        if label == 'POSITIVE':
            return 1
        else:
            return 0
        
    def sigmoid(self,x):
        return (1/(1 + np.exp(-x)))
    
    def sigmoid_output_2_derivative(self,output):
        return (1 - output)*(output)

    def train(self, training_reviews_raw, training_labels):

        training_reviews = list()
        for review_raw in training_reviews_raw :
            review = set()
            for word in review_raw.split(" "):
                if (word in self.word2index.keys()):
                    review.add(self.word2index[word])
            training_reviews.append(list(review))
        
        assert(len(training_reviews) == len(training_labels))
        correct_so_far = 0
        start = time.time()

        for i in range(len(training_reviews)):
            review = training_reviews[i]
            label = training_labels[i]
            self.layer_1 *= 0

            for j in review:
                self.layer_1 += (self.weights_0_1[j])
            
            layer_2 = self.sigmoid(np.dot(self.layer_1,self.weights_1_2))
            
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            layer_1_error = np.dot(layer_2_delta,self.weights_1_2.T)
            layer_1_delta = layer_1_error

            for j in review:
                self.weights_0_1[j] -= layer_1_delta[0] * self.learning_rate
            
            self.weights_1_2 -= np.dot(self.layer_1.T, layer_2_delta) * self.learning_rate

            if layer_2 >= 0.5 and label == "POSITIVE" : 
                correct_so_far += 1
            elif layer_2 < 0.5 and label == "NEGATIVE" :
                correct_so_far += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the testing_labels to calculate the accuracy of those predictions.
        """
        correct = 0
        start = time.time()

        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")

    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        self.layer_1 *= 0
        indices = set()
        for word in review.lower().split(" "):
            if (word in self.word2index.keys()):
                    indices.add(self.word2index[word])
        
        for i in indices:
                self.layer_1 += self.weights_0_1[i]   
                
        layer_2 = self.sigmoid(np.dot(self.layer_1,self.weights_1_2))

        if layer_2[0] >= 0.5: return "POSITIVE"
        else: return "NEGATIVE"

Once the code is complete, train the network.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.05,learning_rate=0.01)
mlp.train(reviews[:-1000],labels[:-1000])

Then run the cell below to test it.

In [None]:
mlp.test(reviews[-1000:],labels[-1000:])

Let's train again with a larger polarity cutoff.

In [None]:
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=20,polarity_cutoff=0.8,learning_rate=0.01)
mlp.train(reviews[:-1000],labels[:-1000])

Run the test.

In [None]:
mlp.test(reviews[-1000:],labels[-1000:])

# Analysis: What's Going on in the Weights?<a id='lesson_7'></a>

In [None]:
mlp_full = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=0,polarity_cutoff=0,learning_rate=0.01)

In [None]:
mlp_full.train(reviews[:-1000],labels[:-1000])

In [None]:
Image(filename='sentiment_network_sparse.png')

In [None]:
def get_most_similar_words(focus = "horrible"):
    most_similar = Counter()

    for word in mlp_full.word2index.keys():
        most_similar[word] = np.dot(mlp_full.weights_0_1[mlp_full.word2index[word]],mlp_full.weights_0_1[mlp_full.word2index[focus]])
    
    return most_similar.most_common()

In [None]:
get_most_similar_words("excellent")

In [None]:
get_most_similar_words("terrible")
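
`get_most_similar_words` ranks words by the dot product between their rows of `weights_0_1`: words whose hidden-layer weights point the same way score high, and words whose weights point the opposite way score low. A toy version with a hand-made weight matrix (the words and numbers are invented for illustration, not taken from the trained network; `most_similar` is a hypothetical stand-in for the function above):

```python
from collections import Counter
import numpy as np

# Hypothetical 2-unit "hidden layer" weights for four words.
word2index = {"excellent": 0, "great": 1, "terrible": 2, "the": 3}
weights_0_1 = np.array([
    [ 0.9,  0.8],   # excellent
    [ 0.7,  0.9],   # great
    [-0.8, -0.9],   # terrible
    [ 0.0,  0.1],   # the
])

def most_similar(focus):
    """Rank every word by the dot product of its weight row with the focus word's row."""
    scores = Counter()
    focus_vector = weights_0_1[word2index[focus]]
    for word, index in word2index.items():
        scores[word] = float(np.dot(weights_0_1[index], focus_vector))
    return scores.most_common()

ranking = [word for word, score in most_similar("excellent")]
print(ranking)  # ['excellent', 'great', 'the', 'terrible']
```

The neutral word lands in the middle with a score near zero, which mirrors why neutral words contribute little signal to the network.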

In [None]:
import matplotlib.colors as colors

words_to_visualize = list()
for word, ratio in pos_neg_ratios.most_common(500):
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)
    
for word, ratio in list(reversed(pos_neg_ratios.most_common()))[0:500]:
    if(word in mlp_full.word2index.keys()):
        words_to_visualize.append(word)

In [None]:
pos = 0
neg = 0

colors_list = list()
vectors_list = list()
for word in words_to_visualize:
    if word in pos_neg_ratios.keys():
        vectors_list.append(mlp_full.weights_0_1[mlp_full.word2index[word]])
        if(pos_neg_ratios[word] > 0):
            pos+=1
            colors_list.append("#00ff00")
        else:
            neg+=1
            colors_list.append("#000000")

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(vectors_list)

In [None]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="vector T-SNE for most polarized words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_to_visualize,
                                    color=colors_list))

p.scatter(x="x1", y="x2", size=8, source=source, fill_color="color")

word_labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(word_labels)

show(p)

# green indicates positive words, black indicates negative words