<a href="https://colab.research.google.com/github/garodisk/Text-Mining-using-word2vec-distributed-representation/blob/master/Sentiment_Analysis_without_using_wordtovec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

UMICH SI650 - Sentiment Classification

This is a text classification task - sentiment classification. Every document (a line in the data file) is a sentence extracted from social media (blogs). Your goal is to classify the sentiment of each sentence into "positive" or "negative".

The training data contains 7086 sentences, already labeled with 1 (positive sentiment) or 0 (negative sentiment). The test data contains 33052 sentences that are unlabeled. The submission should be a .txt file with 33052 lines. In each line, there should be exactly one integer, 0 or 1, according to your classification results.
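
The submission format described above can be sketched as follows. This is only a minimal illustration and not the notebook's actual submission code: the helper `write_submission`, the temporary file path, and the toy labels are all assumptions made for the example.

```python
from pathlib import Path
import tempfile


def write_submission(labels, path):
    """Write one integer label (0 or 1) per line to a text file."""
    for label in labels:
        if label not in (0, 1):
            raise ValueError(f"labels must be 0 or 1, got {label!r}")
    Path(path).write_text("".join(f"{label}\n" for label in labels))


# Toy predictions; a real submission would have one line per test sentence.
example_labels = [1, 0, 0, 1]
out_path = Path(tempfile.gettempdir()) / "submission_example.txt"
write_submission(example_labels, out_path)
written_lines = out_path.read_text().splitlines()
```

Checking that the number of written lines equals the number of test sentences before uploading is a cheap way to catch an off-by-one in the file.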

I will use 3 approaches for this:

**Approach 1**: Use Google's pre-trained word2vec weights (trained on Google News) as they are

**Approach 2**: Fine-tune Google's pre-trained weights with some extra convolutional layers

**Approach 3**: Use deep learning only, without pre-trained weights

This notebook covers Approach 3.

In [0]:
import tensorflow as tf

In [0]:
import numpy as np
import pandas as pd
from tensorflow import keras

In [0]:
from keras.layers import (Conv1D, Dense, Embedding,
                          GlobalMaxPooling1D, SpatialDropout1D)
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import collections
import nltk
import codecs

Using TensorFlow backend.


In [0]:
pip install tensorboardcolab




In [0]:
from tensorboardcolab import TensorBoardColab, TensorBoardColabCallback

In [0]:
tbc=TensorBoardColab()


Wait for 8 seconds...
TensorBoard link:
https://0b02d026.ngrok.io


In [0]:
tbc=TensorBoardColab(startup_waiting_time=30)


Wait for 30 seconds...
TensorBoard link:
https://0b02d026.ngrok.io


In [0]:
from google.colab import files
uploaded = files.upload()

Saving training.txt to training.txt


In [0]:
np.random.seed(42)

INPUT_FILE = "training.txt"
VOCAB_SIZE = 5000
EMBED_SIZE = 100
NUM_FILTERS = 256
NUM_WORDS = 3
BATCH_SIZE = 64
NUM_EPOCHS = 20

counter = collections.Counter()

In [0]:
fin = codecs.open(INPUT_FILE, "r", encoding='utf-8')
maxlen = 0

In [0]:
# Download the Punkt tokenizer models needed by nltk.word_tokenize.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
# First pass over the data: count word frequencies and find the longest sentence.
for line in fin:
    _, sent = line.strip().split("\t")
    words = [x.lower() for x in nltk.word_tokenize(sent)]
    maxlen = max(maxlen, len(words))
    counter.update(words)
fin.close()

In [0]:
maxlen

42
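
The counting pass above can be tried on a couple of lines without the data file. This is a rough sketch under stated assumptions: `count_words` and the sample lines are hypothetical, and `str.split()` stands in for `nltk.word_tokenize`, so token boundaries will differ from the real run.

```python
import collections


def count_words(lines):
    """Return (word counts, longest sentence length) for 'label<TAB>sentence' lines."""
    counts = collections.Counter()
    longest = 0
    for line in lines:
        _, sentence = line.strip().split("\t")
        # str.split() is a simplified stand-in for nltk.word_tokenize.
        words = [w.lower() for w in sentence.split()]
        longest = max(longest, len(words))
        counts.update(words)
    return counts, longest


# Hypothetical sample lines in the same tab-separated layout as training.txt.
sample_lines = [
    "1\tI love this movie\n",
    "0\tThis movie was awful , truly awful\n",
]
toy_counts, toy_maxlen = count_words(sample_lines)
```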

In [0]:
# Index 0 is shared by padding and by unknown (UNK) words:
# defaultdict(int) returns 0 for any word that is not in the vocabulary.
word2index = collections.defaultdict(int)
for wid, (word, _) in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word] = wid + 1
# Add one to the vocabulary size to account for index 0.
vocab_sz = len(word2index) + 1
index2word = {v: k for k, v in word2index.items()}
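
To make the indexing scheme concrete, here is a small sketch on made-up counts. `build_vocab` and the toy words are assumptions for illustration; the logic mirrors the cell above, where the most frequent words get ids starting at 1 and anything else falls back to 0.

```python
import collections


def build_vocab(counts, vocab_size):
    """Map the vocab_size most frequent words to ids 1..N; unseen words get 0."""
    mapping = collections.defaultdict(int)
    for rank, (word, _count) in enumerate(counts.most_common(vocab_size)):
        mapping[word] = rank + 1
    return mapping


toy_counts = collections.Counter({"the": 5, "movie": 3, "awesome": 2, "sucks": 1})
toy_vocab = build_vocab(toy_counts, vocab_size=3)
toy_size = len(toy_vocab) + 1  # +1 for the shared padding/unknown id 0
# .get avoids inserting unseen words into the defaultdict as a side effect.
toy_ids = [toy_vocab.get(w, 0) for w in ["the", "movie", "sucks"]]
```

Because "sucks" falls outside the top 3 words, it maps to 0, the same id that padding uses.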

In [0]:
# Second pass over the data: convert every sentence into a list of word ids.
xs, ys = [], []
with codecs.open(INPUT_FILE, "r", encoding='utf-8') as fin:
    for line in fin:
        label, sent = line.strip().split("\t")
        ys.append(int(label))
        words = [x.lower() for x in nltk.word_tokenize(sent)]
        wids = [word2index[word] for word in words]
        xs.append(wids)
# Zero-pad all sequences to the same length and one-hot encode the labels.
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)
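
What the padding and one-hot steps produce can be sketched with plain NumPy. `pad_left` and `one_hot` are hypothetical helpers written for this illustration, not Keras functions; they assume the default behaviour of `pad_sequences` (zeros added at the front, longer sequences cut at the front).

```python
import numpy as np


def pad_left(sequences, maxlen):
    """Left-pad with zeros, keeping the last maxlen ids of longer sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]
        if tail:  # skip empty sequences, which stay all zeros
            padded[row, -len(tail):] = tail
    return padded


def one_hot(labels, num_classes=2):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]


toy_X = pad_left([[4, 7], [1, 2, 3, 9, 5]], maxlen=4)
toy_Y = one_hot([1, 0])
```

Here the short sequence becomes `[0, 0, 4, 7]`, and label 1 becomes the row `[0., 1.]`, which is the two-column target layout the softmax output layer expects.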

In [0]:
Xtrain, Xtest, Ytrain, Ytest = \
    train_test_split(X, Y, test_size=0.3, random_state=42)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)

(4960, 42) (2126, 42) (4960, 2) (2126, 2)
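
The split sizes printed above follow from simple arithmetic. The sketch below assumes the test share is rounded up to a whole number of samples, which is consistent with the shapes shown; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Sizes of a train/test split when the test size is rounded up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


# 7086 labeled sentences with a 30% hold-out.
n_train, n_test = split_sizes(7086, 0.3)
```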


In [0]:
Xtrain

array([[  0,   0,   0, ..., 250,   6,  91],
       [  0,   0,   0, ...,  99, 135, 226],
       [  0,   0,   0, ...,  18, 165,   3],
       ...,
       [  0,   0,   0, ...,  39,   6,  91],
       [  0,   0,   0, ..., 207,  46,   3],
       [  0,   0,   0, ...,  17, 139,   3]], dtype=int32)

In [0]:
model = Sequential()
model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(filters=NUM_FILTERS,
                 kernel_size=NUM_WORDS,
                 activation="relu"))
model.add(GlobalMaxPooling1D())
model.add(Dense(2, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
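
The layer shapes and parameter counts of this architecture can be worked out by hand. The sketch below assumes the Conv1D defaults ("valid" padding, stride 1); `cnn_shapes` is a hypothetical helper, and `vocab_sz=2000` is a made-up value because the notebook does not print the real vocabulary size.

```python
def cnn_shapes(vocab_sz, maxlen=42, embed_size=100, num_filters=256, kernel_size=3):
    """Output shapes and parameter counts for Embedding -> Conv1D -> pool -> Dense(2)."""
    conv_steps = maxlen - kernel_size + 1  # "valid" padding, stride 1
    return {
        "embedding": ((maxlen, embed_size), vocab_sz * embed_size),
        "conv1d": ((conv_steps, num_filters),
                   kernel_size * embed_size * num_filters + num_filters),
        "global_max_pool": ((num_filters,), 0),
        "dense": ((2,), num_filters * 2 + 2),
    }


# vocab_sz=2000 is a hypothetical value chosen only for the example.
toy_layers = cnn_shapes(vocab_sz=2000)
```

The dropout layer is left out because it has no weights and does not change the shape. Each of the 256 filters looks at 3 consecutive word vectors, and global max pooling keeps the strongest response of each filter over the 40 window positions.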






In [0]:
#tensorboard, _ = make_tensorboard(
 #   set_dir_name='keras_learn_embedding_from_scratch')

history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    callbacks=[TensorBoardColabCallback(tbc)],
                    validation_data=(Xtest, Ytest))

Train on 4960 samples, validate on 2126 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [0]:
# Evaluate on the held-out 30% split (the same data passed as validation_data during training).
score = model.evaluate(Xtest, Ytest, verbose=1)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score[0], score[1]))

Test score: 0.019, accuracy: 0.995
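
To turn the two-column softmax output into 0/1 labels, one option is to take the argmax of each row. This is a sketch with made-up numbers: `probs_to_labels`, the toy probabilities, and the toy targets are assumptions, and the accuracy computed here is unrelated to the score reported above.

```python
import numpy as np


def probs_to_labels(probs):
    """Pick the class with the highest softmax probability for each row."""
    return np.argmax(np.asarray(probs), axis=1)


# Hypothetical softmax outputs and one-hot targets for four sentences.
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_targets = [[1, 0], [0, 1], [0, 1], [0, 1]]
toy_pred = probs_to_labels(toy_probs)
toy_true = probs_to_labels(toy_targets)
toy_accuracy = float(np.mean(toy_pred == toy_true))
```

In the notebook, the same argmax could be applied to the output of `model.predict` to obtain the integers that go into the submission file.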
