# Sentiment analysis

Steps:
- Load data
- Pre-process the data, encoding words as integers
- Pad the data so that each review has a standard sequence length
- Define an RNN with embedding and hidden LSTM layers that predicts the sentiment of a given review
- Train the RNN
- See how it performs on test data
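
The padding step in the list above can be sketched as follows. This is a minimal illustration rather than the notebook's own implementation: the helper name `pad_features`, the choice to pad on the left with zeros, and the choice to truncate long reviews to their first `seq_length` tokens are all assumptions.

```python
def pad_features(encoded_reviews, seq_length):
    """Left-pad each encoded review with zeros, or truncate it, to seq_length.

    Hypothetical helper for illustration; assumes 0 is reserved for padding.
    """
    features = []
    for review in encoded_reviews:
        review = review[:seq_length]                # truncate long reviews
        padding = [0] * (seq_length - len(review))  # empty if already long enough
        features.append(padding + review)           # zeros go on the left
    return features


padded = pad_features([[5, 2, 9], [7, 1, 4, 8, 3, 6]], seq_length=5)
print(padded)  # [[0, 0, 5, 2, 9], [7, 1, 4, 8, 3]]
```

Padding on the left keeps the real words at the end of the sequence, closest to the final hidden state of the RNN; padding on the right would also give fixed-length inputs.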

In [1]:
import numpy as np
from string import punctuation
from collections import Counter, OrderedDict
import itertools


In [2]:
with open('deep-learning-v2-pytorch/sentiment-analysis-network/reviews.txt', 'r') as f:
    reviews = f.read()
with open('deep-learning-v2-pytorch/sentiment-analysis-network/labels.txt', 'r') as f:
    labels = f.read()

In [3]:
# reviews is a single string, so slicing returns characters, not whole words
reviews[:20]

'bromwell high is a c'

## Preprocessing

- Eliminate punctuation
- Separate reviews using the delimiter \n
- Combine all reviews
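
As a compact sketch of the first two steps, the function below lowercases a toy string, strips punctuation, and splits it into reviews. The helper name `clean_reviews` and the sample text are hypothetical, and `str.translate` is used here as an alternative to filtering character by character; the result should be the same.

```python
from string import punctuation


def clean_reviews(raw_text):
    """Lowercase, strip punctuation, and split into one string per review.

    Hypothetical helper; assumes reviews are separated by newlines.
    """
    text = raw_text.lower()
    # str.translate removes every punctuation character in one pass
    text = text.translate(str.maketrans('', '', punctuation))
    return text.split('\n')


sample = "Great movie!\nNot good, not bad."
sample_split = clean_reviews(sample)
print(sample_split)  # ['great movie', 'not good not bad']
```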

In [4]:
# convert all the text to lowercase
reviews = reviews.lower()

In [5]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
# keep every character that is not a punctuation mark
all_text = ''.join([c for c in reviews if c not in punctuation])

In [7]:
# first 10 characters of the cleaned text
all_text[:10]

'bromwell h'

In [8]:
# split on newlines only, giving one string per review
reviews_split = all_text.split('\n')

In [9]:
# join the reviews back into one string separated by spaces
# (still a str, so indexing it yields characters)
all_text = ' '.join(reviews_split)

In [10]:
all_text[:10]

'bromwell h'

In [11]:
# creating a list of words
words = all_text.split()

## Encoding

- Create a dictionary that maps words to integers
- Make sure the integers start at 1, because 0 is reserved for padding the input vectors
- Convert the reviews to integers
- Store the encoded reviews in a list called reviews_int
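
The steps above can be sketched on a toy corpus. The names `sample_reviews`, `sample_counts`, `ranked`, and `word_to_int` are illustrative assumptions, not the notebook's own variables. Ties in frequency keep their first-seen order because `sorted` is stable.

```python
from collections import Counter

sample_reviews = ['the movie was good', 'the plot was thin']  # toy data

# Count every word, then rank words from most to least frequent
sample_counts = Counter(w for review in sample_reviews for w in review.split())
ranked = sorted(sample_counts, key=sample_counts.get, reverse=True)

# Start at 1 so that 0 stays free for padding
word_to_int = {word: i for i, word in enumerate(ranked, start=1)}

# One list of integers per review
encoded = [[word_to_int[w] for w in review.split()] for review in sample_reviews]
print(encoded)  # [[1, 3, 2, 4], [1, 5, 2, 6]]
```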

In [12]:
# count how often each word occurs (word -> count)

counts = Counter(words)

In [13]:
# peek at the first three entries
out = dict(itertools.islice(counts.items(), 3))

In [14]:
out

{'bromwell': 8, 'high': 2161, 'is': 107328}

In [15]:
# sort the words by frequency, most frequent first
vocab_ordered = sorted(counts, key=counts.get, reverse=True)

In [16]:
# note: despite its name, this dict maps integer -> word (indices start at 1)
vocab_to_int = dict(enumerate(vocab_ordered, start=1))

In [17]:
# peek at the first three entries
out = dict(itertools.islice(vocab_to_int.items(), 3))

In [18]:
out

{1: 'the', 2: 'and', 3: 'a'}

In [19]:
# invert the mapping: despite its name, integer_to_vocab maps word -> integer
integer_to_vocab = {}
for integer, word in vocab_to_int.items():
    integer_to_vocab[word] = integer

In [20]:
print('unique words: ', len(integer_to_vocab))

unique words:  74072


In [21]:
# encode each review as its own list of integers,
# so that review boundaries are preserved for padding later
reviews_int = []
for review in reviews_split:
    reviews_int.append([integer_to_vocab[word] for word in review.split()])

## Encoding the labels

- positive = 1
- negative = 0
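
A minimal sketch of this encoding on toy data follows. The names `sample_labels`, `label_tokens`, and `sample_encoded` are hypothetical, and the trailing newline in the sample is an assumption about how such a file might end. Using `splitlines()` avoids the empty final token that `split('\n')` would produce in that case.

```python
sample_labels = 'positive\nnegative\npositive\n'  # toy data with a trailing newline

# splitlines() does not produce an empty final entry, unlike split('\n')
label_tokens = sample_labels.splitlines()

# positive -> 1, anything else -> 0
sample_encoded = [1 if label == 'positive' else 0 for label in label_tokens]
print(sample_encoded)  # [1, 0, 1]
```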

In [52]:
test = labels

In [53]:
tokens = test.split('\n')

In [49]:
# inspect the tokens (the output below reflects a two-label sample)
tokens

['positive', 'negative']

In [50]:
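# positive -> 1; every other token -> 0
# (this includes an empty string, if the text ends with a newline)
encoded_labels = []
for word in tokens:
    if word == 'positive':
        encoded_labels.append(1)
    else:
        encoded_labels.append(0)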

In [51]:
encoded_labels

[1, 0]