# Sentiment Classification

Material by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

Edits & solution by Andrei Dyomin

- **GitHub**: https://github.com/adyomin

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [2]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# reviews.txt: what we know!
with open('./data/reviews.txt', 'r') as g:
    reviews = [line[:-1] for line in g]  # drop the trailing newline

# labels.txt: what we WANT to know!
with open('./data/labels.txt', 'r') as g:
    labels = [line[:-1].upper() for line in g]

In [3]:
len(reviews)

25000

In [4]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [5]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [6]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation

In [7]:
from collections import Counter
import numpy as np
import re

In [8]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [9]:
# Using Python's regular expressions library to keep only words of 3+ characters,
# which filters out all 'small' words

for review, label in zip(reviews, labels):
    if label == 'POSITIVE':
        positive_counts.update(re.findall(r'\w{3,}', review.lower()))
    else:
        negative_counts.update(re.findall(r'\w{3,}', review.lower()))

total_counts.update(positive_counts)
total_counts.update(negative_counts)
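
To see what the `\w{3,}` pattern keeps and drops, here is a small sketch on the tail of the first review. The `sample` string is copied from `reviews[0]`; the variable names are only for illustration. Note how `isn  t` loses its `t`, which is probably why fragments such as `didn`, `doesn` and `don` show up in the word counts.

```python
import re
from collections import Counter

# Tail of reviews[0]; punctuation has already been replaced by spaces in the data
sample = "what a pity that it isn  t   "

# Keep only runs of 3 or more word characters
tokens = re.findall(r'\w{3,}', sample.lower())
print(tokens)  # ['what', 'pity', 'that', 'isn']

# Counter.update accepts any iterable of tokens and increments their counts
counts = Counter()
counts.update(tokens)
print(counts['pity'], counts['a'])  # 1 0
```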

## Creating a stop list

Filtering out stop words (https://en.wikipedia.org/wiki/Stop_words): very common words that carry little or no sentiment.

In [10]:
# General-purpose stop words, stored as a single comma-separated line
with open('./data/stop_list.csv', 'r') as f:
    stop_list_general = set(f.read().split(', '))
print('stop_list_general - {0} words'.format(len(stop_list_general)))

# Review-specific stop words: anything seen at least 2000 times in the corpus
stop_list_reviews = set()
for word, count in total_counts.most_common():
    if count >= 2000:
        stop_list_reviews.add(word)

print('stop_list_reviews - {0} words'.format(len(stop_list_reviews)))

stop_list_general - 119 words
stop_list_reviews - 291 words


The review-specific stop list needs some manual editing: several of its words (for example 'bad', 'excellent' and 'great') clearly carry sentiment and should be kept.

In [11]:
stop_list_reviews

{'about',
 'acting',
 'action',
 'actor',
 'actors',
 'actually',
 'after',
 'again',
 'all',
 'almost',
 'also',
 'although',
 'always',
 'american',
 'and',
 'another',
 'any',
 'anyone',
 'anything',
 'are',
 'around',
 'audience',
 'away',
 'back',
 'bad',
 'beautiful',
 'because',
 'been',
 'before',
 'being',
 'believe',
 'best',
 'better',
 'between',
 'big',
 'bit',
 'black',
 'book',
 'both',
 'but',
 'can',
 'cast',
 'character',
 'characters',
 'come',
 'comedy',
 'comes',
 'could',
 'course',
 'day',
 'did',
 'didn',
 'different',
 'director',
 'does',
 'doesn',
 'don',
 'done',
 'down',
 'during',
 'dvd',
 'each',
 'effects',
 'else',
 'end',
 'ending',
 'enough',
 'especially',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'excellent',
 'fact',
 'family',
 'far',
 'father',
 'feel',
 'few',
 'film',
 'films',
 'find',
 'first',
 'for',
 'found',
 'from',
 'fun',
 'funny',
 'get',
 'gets',
 'girl',
 'give',
 'goes',
 'going',
 'good',
 'got',
 'great',
 'guy',
 

In [12]:
# super subjective list
keep_words = set('bad, dvd, nothing, old, pretty, again, beautiful, best, big, excellent, first, fun, funny, good, great, horror, interesting, little, like, love, many, money, most, never, must, nice, off, original, point, special, star, true, very, well, worst'.split(', '))
len(keep_words)

35

In [13]:
# `-` binds tighter than `|`, so keep_words is removed from stop_list_reviews only
full_stop_list = stop_list_general | (stop_list_reviews - keep_words)
len(full_stop_list)

300
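
A note on the set expression in cell 13: in Python, `-` has higher precedence than `|`, so `keep_words` is subtracted from `stop_list_reviews` only, and any keep word that also appears in the general stop list stays in the result. The toy sets below are made up for illustration and are not taken from the data.

```python
# Hypothetical toy sets, standing in for the three sets used in cell 13
general = {'the', 'and'}
review_specific = {'film', 'good', 'movie'}
keep = {'good', 'the'}

implicit = general | review_specific - keep      # how Python parses cell 13
explicit = general | (review_specific - keep)    # same thing, spelled out
alternative = (general | review_specific) - keep # a different result

print(sorted(implicit))     # ['and', 'film', 'movie', 'the']
print(sorted(alternative))  # ['and', 'film', 'movie']
```

Whether 'the' should survive depends on the intent; if the keep words should always win, the `alternative` form is the one to use.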

In [14]:
for word in full_stop_list:
    del positive_counts[word]
    del negative_counts[word]
    del total_counts[word]

In [15]:
pos_neg_ratios = Counter()
threshold = 256
max_ratio = 3
max_ratio_raw = np.exp(max_ratio)

# Raw positive-to-negative ratio for every sufficiently frequent word
for word, count in total_counts.most_common():
    if count <= threshold:
        continue  # too rare for a reliable ratio; do not reuse a stale value
    if negative_counts[word] > 0:
        pos_neg_ratios[word] = positive_counts[word] / negative_counts[word]
    else:
        # never seen in a negative review: cap the ratio
        pos_neg_ratios[word] = max_ratio_raw

# Convert to log ratios so positive and negative words are symmetric around 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 0:
        pos_neg_ratios[word] = np.log(ratio)
    elif ratio == 0:
        print(word)
        pos_neg_ratios[word] = -max_ratio
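
Why take the logarithm? A raw ratio squeezes all negative-leaning words into the interval (0, 1) while positive-leaning words spread over (1, ∞). The log makes the two sides symmetric around zero. A minimal sketch with made-up counts (the `log_ratio` helper is hypothetical, not part of the notebook):

```python
import numpy as np

def log_ratio(pos_count, neg_count):
    """Log of the positive-to-negative count ratio (both counts must be > 0)."""
    return float(np.log(pos_count / neg_count))

mostly_positive = log_ratio(300, 100)  # raw ratio 3.0
mostly_negative = log_ratio(100, 300)  # raw ratio 0.333...
neutral = log_ratio(200, 200)          # raw ratio 1.0

# Mirror-image words get scores of equal magnitude and opposite sign
print(mostly_positive, mostly_negative, neutral)
```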

In [16]:
# words most frequently seen in a review with a "POSITIVE" label
pos_neg_ratios.most_common(10)

[('wonderfully', 2.0485643031153966),
 ('delightful', 1.8262456452992242),
 ('beautifully', 1.7784436932522829),
 ('superb', 1.7189076208420597),
 ('touching', 1.6514021115331325),
 ('stewart', 1.6249021381316819),
 ('friendship', 1.588384503236268),
 ('magnificent', 1.5686159179138452),
 ('wonderful', 1.5680329974659779),
 ('finest', 1.5668782980153044)]

In [17]:
# words most frequently seen in a review with a "NEGATIVE" label
list(reversed(pos_neg_ratios.most_common()))[0:10]

[('unfunny', -2.6882475738060303),
 ('waste', -2.6186484579840514),
 ('pointless', -2.4531579514734201),
 ('redeeming', -2.3648889763302003),
 ('worst', -2.2865847516476046),
 ('laughable', -2.2617630984737906),
 ('awful', -2.2265521924307397),
 ('poorly', -2.2192034840549946),
 ('sucks', -1.9830278120118159),
 ('lame', -1.9802348915963879)]

# Transforming Text into Numbers

**review** = "This was a horrible, terrible movie."

<img src = './data/sentiment_network.png'>

**review** = "The movie was excellent"

<img src = './data/sentiment_network_pos.png'>
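
The idea in the two pictures can be sketched in a few lines: each review becomes a vector of word counts, one slot per vocabulary word. The four-word `toy_vocab` and the `to_counts` helper below are assumptions for illustration only; the real vocabulary is built in Project 2.

```python
import re
import numpy as np

# Hypothetical toy vocabulary; the real one has tens of thousands of words
toy_vocab = ['excellent', 'horrible', 'movie', 'terrible']
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

def to_counts(review):
    """Turn a review into a vector of per-word counts over toy_vocab."""
    vec = np.zeros(len(toy_vocab))
    for word in re.findall(r'\w+', review.lower()):
        if word in toy_word2index:
            vec[toy_word2index[word]] += 1
    return vec

negative_input = to_counts("This was a horrible, terrible movie.")
positive_input = to_counts("The movie was excellent")
print(negative_input.tolist())  # [0.0, 1.0, 1.0, 1.0]
print(positive_input.tolist())  # [1.0, 0.0, 1.0, 0.0]
```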

# Project 2: Creating the Input/Output Data

In [18]:
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)

73297


In [19]:
list(vocab)

['deutschland',
 'ecological',
 'pinet',
 'theissen',
 'eally',
 'ghostly',
 'stargazer',
 'mnm',
 'casablanka',
 'dimensional',
 'reprint',
 'fixture',
 'nominates',
 'volcanoes',
 'costa',
 'saigon',
 'gone',
 'slinging',
 'protelco',
 'pvc',
 'detlef',
 'haggle',
 'jaa',
 'ukulele',
 'stratham',
 'program',
 'wembley',
 'downed',
 'becoming',
 'lectured',
 'prote',
 'nickleodeon',
 'opinon',
 'syringe',
 'toolbox',
 'flamengo',
 'bellboy',
 'bayless',
 'helium',
 'tehrani',
 'lowbrow',
 'recruitment',
 'kindlings',
 'amg',
 'herredia',
 'acquaintaces',
 'jarvis',
 'infuriatingly',
 'pounding',
 'schneebaum',
 'crazed',
 'boobytraps',
 'morale',
 'whitfield',
 'manic',
 'rapacious',
 'swineherd',
 'shortcoming',
 'objectors',
 'cobb',
 'galloni',
 'cussword',
 'yomada',
 'cadaver',
 'ejemplo',
 'regresses',
 'commuppance',
 'redstacey',
 'frisch',
 'alexej',
 'symbolically',
 'landholdings',
 'elective',
 'incongruous',
 'luxuriously',
 'isabella',
 'craftiness',
 'unhumorous',
 'hoo

In [20]:
import numpy as np

layer_0 = np.zeros((1,vocab_size))
layer_0

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

<img src = './data/sentiment_network.png'>

In [21]:
word2index = {}

for i, word in enumerate(vocab):
    word2index[word] = i
word2index

{'deutschland': 0,
 'ecological': 1,
 'cabarnet': 24557,
 'pinet': 2,
 'theissen': 3,
 'stargazer': 6,
 'mnm': 7,
 'casablanka': 8,
 'reprint': 10,
 'fixture': 11,
 'nominates': 12,
 'costa': 14,
 'protelco': 18,
 'pvc': 19,
 'slinging': 17,
 'detlef': 20,
 'eally': 4,
 'jaa': 22,
 'ukulele': 23,
 'stratham': 24,
 'program': 25,
 'wembley': 26,
 'downed': 27,
 'lectured': 29,
 'hesteria': 23252,
 'prote': 30,
 'nickleodeon': 31,
 'accidentally': 63281,
 'policemen': 1483,
 'walentin': 61144,
 'agony': 22820,
 'opinon': 32,
 'syringe': 33,
 'fount': 52301,
 'toolbox': 34,
 'flamengo': 35,
 'bellboy': 36,
 'bayless': 37,
 'helium': 38,
 'tehrani': 39,
 'lowbrow': 40,
 'recruitment': 41,
 'kindlings': 42,
 'herredia': 44,
 'dimensional': 9,
 'infuriatingly': 47,
 'crazed': 50,
 'boobytraps': 51,
 'morale': 52,
 'whitfield': 53,
 'wallah': 56494,
 'qute': 61146,
 'shortcoming': 57,
 'rapacious': 55,
 'swineherd': 56,
 'schweibert': 48964,
 'objectors': 58,
 'cobb': 59,
 'galloni': 60,
 'cu

In [22]:
%%time

features = np.zeros((len(reviews), len(vocab)))
for i, review in enumerate(reviews):
    for word in review.split():
        if word in vocab:
            features[i, word2index[word]] += 1
            
features.shape

CPU times: user 4.04 s, sys: 1.61 s, total: 5.65 s
Wall time: 5.89 s


In [23]:
print(features.shape)
labels = np.array(labels, ndmin=2).T
print(labels.shape)

(25000, 73297)
(25000, 1)
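
A quick back-of-the-envelope check on the size of `features`: `np.zeros` defaults to `float64`, so the dense matrix takes 8 bytes per (review, word) pair. The sketch below only does the arithmetic from the shape printed above; switching to `float32` is one possible way to halve the footprint, not something the original notebook does.

```python
import numpy as np

n_reviews, n_words = 25000, 73297  # shape printed above

# Bytes needed for a dense matrix at two different precisions
bytes_float64 = n_reviews * n_words * np.dtype(np.float64).itemsize
bytes_float32 = n_reviews * n_words * np.dtype(np.float32).itemsize

print(bytes_float64 / 1e9)  # roughly 14.66 (GB)
print(bytes_float32 / 1e9)  # roughly 7.33 (GB)
```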


# Project 3: Building a Neural Network

- Start with your neural network from the last chapter
- 3 layer neural network
- no non-linearity in hidden layer
- use our functions to create the training data
- create a "pre_process_data" function to create vocabulary for our training data generating functions
- modify "train" to train over the entire corpus
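
As a rough sketch of the architecture listed above (input, linear hidden layer, single output), here is a forward pass in plain NumPy. It is independent of the `network` module imported below; the toy layer sizes, the weight initialisation and the sigmoid on the output are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 6, 4  # hypothetical toy sizes

# Weights: input -> hidden and hidden -> output
w_0_1 = rng.normal(0.0, n_inputs ** -0.5, (n_inputs, n_hidden))
w_1_2 = rng.normal(0.0, n_hidden ** -0.5, (n_hidden, 1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(layer_0):
    """Forward pass: linear hidden layer, sigmoid output (assumed)."""
    layer_1 = layer_0 @ w_0_1            # no non-linearity in the hidden layer
    layer_2 = sigmoid(layer_1 @ w_1_2)   # squash the output into (0, 1)
    return layer_2

# One review encoded as word counts over a 6-word toy vocabulary
x = np.array([[1.0, 0.0, 2.0, 0.0, 0.0, 1.0]])
prediction = forward(x)
print(prediction.shape)  # (1, 1)
```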

In [24]:
import sys
nn_path = '/Users/adyomin/Yandex.Disk.localized/Projects/Cornerstone'
sys.path.append(nn_path)
import network as nn

In [25]:
help(nn.Network.__init__)

Help on function __init__ in module network:

__init__(self, size, h_activation='sigmoid', o_activation='pass_input', c_function='quadratic', weights=None)
    Network class constructor method.
    
    Parameters
    ----------
    size : tuple
        size[0] - n_features, input layers width.  size[-1] - n_targets,
        width of the output layer.
    
    h_activation : string
        Choice of activation function for all hidden layers.  Current
        options include:
         - 'sigmoid' - returns 1/(1 + numpy.exp(-x))
         - 'pass_input' - returns x
    
    o_activation : string
        Choice of output layer activation function.  Current options
        include:
         - 'sigmoid' - returns 1/(1 + numpy.exp(-x))
         - 'pass_input' - returns x
    
    c_function : string
        Choice of cost/loss/objective function prime for the network.
        Current options include:
         - 'quadratic' - returns error, (0.5*(error**2))' = error
    
    weights : numpy.ar

In [31]:
nn_model = nn.Network((73297, 256, 64, 1))

In [None]:
help(nn.Network.train)

In [32]:
nn_model.train(features[:128], labels[:128], batch_size=16, eta=0.01, n_epochs=1)

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
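
The `TypeError` above is consistent with `labels` still holding the strings 'POSITIVE' and 'NEGATIVE': a NumPy string array cannot be combined arithmetically with the `float64` features. A possible fix, assuming the network expects numeric targets, is to map the labels to 1 and 0 before training. The three-item `text_labels` array below is a stand-in for the real `labels`.

```python
import numpy as np

# Stand-in for the real labels column vector built in cell 23
text_labels = np.array(['POSITIVE', 'NEGATIVE', 'POSITIVE'], ndmin=2).T
print(text_labels.dtype.kind)  # 'U' means a unicode string array

# Map POSITIVE -> 1.0 and NEGATIVE -> 0.0, keeping the (n, 1) shape
numeric_labels = (text_labels == 'POSITIVE').astype(np.float64)
print(numeric_labels.ravel().tolist())  # [1.0, 0.0, 1.0]
```

With the real data this would be `labels = (labels == 'POSITIVE').astype(np.float64)` before the call to `train`; whether that alone is enough depends on the `network` module, which is not shown here.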