# I. Data Loading and Preprocessing

The data comes from `Dreaddit: A Reddit Dataset for Stress Analysis in Social Media`.

The following tasks are undertaken:
* Column Selection
* Feature Transformation
* Handling Missing Values
* Column Encoding

## Categorical Data Encoding, Feature Selection and Training...

In [1]:
import warnings
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

%matplotlib inline
%load_ext lab_black

warnings.filterwarnings("ignore")

# Data import
dreaddit_train_df = pd.read_csv("../data/unprocessed/dreaddit-train.csv")
dreaddit_test_df = pd.read_csv("../data/unprocessed/dreaddit-test.csv")

In [2]:
dreaddit_train_df.head()

Unnamed: 0,subreddit,post_id,sentence_range,text,id,label,confidence,social_timestamp,social_karma,syntax_ari,...,lex_dal_min_pleasantness,lex_dal_min_activation,lex_dal_min_imagery,lex_dal_avg_activation,lex_dal_avg_imagery,lex_dal_avg_pleasantness,social_upvote_ratio,social_num_comments,syntax_fk_grade,sentiment
0,ptsd,8601tu,"(15, 20)","He said he had not felt that way before, sugge...",33181,1,0.8,1521614353,5,1.806818,...,1.0,1.125,1.0,1.77,1.52211,1.89556,0.86,1,3.253573,-0.002742
1,assistance,8lbrx9,"(0, 5)","Hey there r/assistance, Not sure if this is th...",2606,0,1.0,1527009817,4,9.429737,...,1.125,1.0,1.0,1.69586,1.62045,1.88919,0.65,2,8.828316,0.292857
2,ptsd,9ch1zh,"(15, 20)",My mom then hit me with the newspaper and it s...,38816,1,0.8,1535935605,2,7.769821,...,1.0,1.1429,1.0,1.83088,1.58108,1.85828,0.67,0,7.841667,0.011894
3,relationships,7rorpp,"[5, 10]","until i met my new boyfriend, he is amazing, h...",239,1,0.6,1516429555,0,2.667798,...,1.0,1.125,1.0,1.75356,1.52114,1.98848,0.5,5,4.104027,0.141671
4,survivorsofabuse,9p2gbc,"[0, 5]",October is Domestic Violence Awareness Month a...,1421,1,0.8,1539809005,24,7.554238,...,1.0,1.125,1.0,1.77644,1.64872,1.81456,1.0,1,7.910952,-0.204167


In [3]:
print(dreaddit_train_df.text[5])

I think he doesn't want to put in the effort for the relationship to work (and we're both so difficult that we have to work on our relationships, doesn't matter with whom) but he can't be without me either. What should I do? I'm afraid this is gonna happen over and over again, because I'm always forgiving him at some point. Am I being strung along? TL;DR: Boyfriend [28,M] broke up with me [23,F] after on-off for 1.5 years, I thought we just got it together and am devastated...don't know what to do, want to keep fighting but should I?


In [None]:
#def preprocess(ReviewText):
#    ReviewText = ReviewText.str.replace("(<br/>)", "")
#    ReviewText = ReviewText.str.replace('(<a).*(>).*(</a>)', '')
#    ReviewText = ReviewText.str.replace('(&amp)', '')
#    ReviewText = ReviewText.str.replace('(&gt)', '')
#    ReviewText = ReviewText.str.replace('(&lt)', '')
#    ReviewText = ReviewText.str.replace('(\xa0)', ' ')  
#    return ReviewText
#df['Review Text'] = preprocess(dreaddit_train_df['Review Text'])
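
The commented-out helper above refers to a `Review Text` column, which does not appear among the Dreaddit columns shown earlier. A minimal sketch of the same kind of cleanup for a single post, assuming plain-string input. `clean_text` is a hypothetical name, and the snippet uses only the standard library:

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")  # any HTML tag, e.g. <br/> or <a href="...">
_WS_RE = re.compile(r"\s+")       # runs of whitespace, including \xa0


def clean_text(text: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = _TAG_RE.sub(" ", text)   # remove tags before decoding entities
    text = html.unescape(text)      # &amp; -> &, &gt; -> >, &lt; -> <
    text = _WS_RE.sub(" ", text)    # non-breaking spaces become plain spaces
    return text.strip()


print(clean_text("Hi<br/>there &amp; you\xa0too"))
```

If such cleaning were wanted, it could be applied with `train_df["text"].map(clean_text)`; the model in this notebook is trained on the raw text.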

In [4]:
cols = dreaddit_train_df.columns

In [5]:
new_cols = [
    "text",
    "label",
]

In [6]:
train_df = dreaddit_train_df[new_cols]
train_df.head()

Unnamed: 0,text,label
0,"He said he had not felt that way before, sugge...",1
1,"Hey there r/assistance, Not sure if this is th...",0
2,My mom then hit me with the newspaper and it s...,1
3,"until i met my new boyfriend, he is amazing, h...",1
4,October is Domestic Violence Awareness Month a...,1


In [7]:
test_cols = dreaddit_test_df.columns
dreaddit_test_df.info()
test_df = dreaddit_test_df[new_cols]
test_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 715 entries, 0 to 714
Columns: 116 entries, id to sentiment
dtypes: float64(107), int64(5), object(4)
memory usage: 648.1+ KB


Unnamed: 0,text,label
0,"Its like that, if you want or not.“ ME: I have...",0
1,I man the front desk and my title is HR Custom...,0
2,We'd be saving so much money with this new hou...,1
3,"My ex used to shoot back with ""Do you want me ...",1
4,I haven’t said anything to him yet because I’m...,0


In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2838 entries, 0 to 2837
Data columns (total 2 columns):
text     2838 non-null object
label    2838 non-null int64
dtypes: int64(1), object(1)
memory usage: 44.5+ KB


In [9]:
# train_df.subreddit.value_counts()

In [10]:
train_df.label.value_counts()

1    1488
0    1350
Name: label, dtype: int64
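
The counts above show a roughly balanced training set. As a rough reference point for later accuracy figures, the sketch below works out the accuracy of always predicting the majority label on the training data. The counts are copied by hand from the output above, and the test set may have a different balance:

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 1488, 0: 1350}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"total posts: {total}")
print(f"majority label: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```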

# III. Text Classification with TextVectorization and a Bidirectional LSTM

In [11]:
import os
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from tensorflow import keras

import numpy as np

## Create an input pipeline using tf.data

In [12]:
# Convert to tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("label")
    ds = tf.data.Dataset.from_tensor_slices((dataframe, labels))
    return ds

In [13]:
train_ds = df_to_dataset(train_df)
test_ds = df_to_dataset(test_df)

In [14]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

In [15]:
train_dataset = (
    train_ds.shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
test_dataset = test_ds.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
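
Because `batch` keeps the final partial batch by default, the number of batches per pass is the row count divided by the batch size, rounded up. A small sketch of that arithmetic, using the row counts reported by the `.info()` outputs earlier (2838 training rows, 715 test rows):

```python
import math

BATCH_SIZE = 64
n_train, n_test = 2838, 715  # row counts from the .info() outputs above

train_batches = math.ceil(n_train / BATCH_SIZE)
test_batches = math.ceil(n_test / BATCH_SIZE)
# Size of the final, partially filled training batch.
last_train_batch = n_train - (train_batches - 1) * BATCH_SIZE

print(train_batches, test_batches, last_train_batch)
```

Any `steps_per_epoch` or `validation_steps` value passed to `fit` should not exceed these batch counts unless the dataset is repeated.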

In [16]:
train_dataset.element_spec

(TensorSpec(shape=(None, 1), dtype=tf.string, name=None),
 TensorSpec(shape=(None,), dtype=tf.int64, name=None))

In [18]:
VOCAB_SIZE = 1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE
)
encoder.adapt(train_ds.map(lambda text, label: text))

In [19]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'i', 'to', 'and', 'the', 'a', 'my', 'of', 'me',
       'that', 'in', 'it', 'for', 'was', 'is', 'but', 'have', 'with',
       'this'], dtype='<U13')
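
Index 0 of the vocabulary is the padding token and index 1 is `[UNK]`, which stands in for every word outside the `VOCAB_SIZE` most frequent ones. The sketch below is a simplified, pure-Python imitation of that mapping (lowercase, strip ASCII punctuation, split on whitespace, look up ids), not the TensorFlow implementation. It uses only the 20 entries printed above, so any other word maps to `[UNK]` here even if the full vocabulary contains it; `encode` is a hypothetical helper name:

```python
import string

# First 20 vocabulary entries, copied from the output above.
vocab = ["", "[UNK]", "i", "to", "and", "the", "a", "my", "of", "me",
         "that", "in", "it", "for", "was", "is", "but", "have", "with", "this"]
token_to_id = {token: idx for idx, token in enumerate(vocab)}


def encode(text, pad_to=None):
    """Lowercase, strip ASCII punctuation, split on whitespace, map to ids."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [token_to_id.get(tok, 1) for tok in cleaned.split()]  # 1 = [UNK]
    if pad_to is not None:
        ids += [0] * (pad_to - len(ids))  # 0 = padding
    return ids


print(encode("I have a VA disability rating.", pad_to=8))
```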

In [20]:
for example, label in train_dataset.take(1):
    print("texts: ", example.numpy()[:1])
    print()
    print("labels: ", label.numpy()[:1])

texts:  [[b'I\xe2\x80\x99m an army reserve officer who has 6.5 years of service. I have a VA disability rating of 40% due to PTSD/Anxiety/depression and ringing in my ears from a tour in Afghanistan. I recently applied to go into the IRR and was turned down. I have been told that I\xe2\x80\x99m not deployable due to my mental health, but I don\xe2\x80\x99t think this was at all mentioned in my IRR packet. just moved for a job in DC and my old unit is now too far to travel to.']]

labels:  [0]


In [21]:
encoded_example = encoder(example)[:1].numpy()
encoded_example

array([[ 88,  60,   1,   1,   1,  76,  57,   1,  83,   8, 861,   2,  17,
          6,   1,   1,   1,   8,   1, 319,   3,   1,   4,   1,  11,   7,
          1,  50,   6,   1,  11,   1,   2, 280,   1,   3,  73, 106,   5,
          1,   4,  14, 499, 151,   2,  17,  45, 114,  10,  88,  28,   1,
        319,   3,   7, 296, 286,  16,   2, 180,  87,  19,  14,  33,  48,
        908,  11,   7,   1,   1,  29, 345,  13,   6, 131,  11,   1,   4,
          7, 227,   1,  15,  64, 138, 459,   3, 963,   3,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0

In [22]:
model = tf.keras.Sequential(
    [
        encoder,
        tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1),
    ]
)

In [23]:
sample_text = "I have been experiencing a depressing mood and my stress levels are rising tremendously. This should indicate something"
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

[0.00601296]


In [24]:
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path, save_weights_only=True, verbose=1
)
logdir = "logs/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

In [25]:
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(1e-4),
    metrics=["accuracy"],
)

In [26]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 64)          64000     
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         66048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                41216     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 6
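
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers, assuming a vocabulary of 1000 tokens (consistent with the 64000 embedding parameters listed) and the layer sizes from the model definition. The helper names are illustrative:

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(1000, 64),
    "bidirectional": 2 * lstm_params(64, 64),     # forward + backward LSTM(64)
    "bidirectional_1": 2 * lstm_params(128, 32),  # input is the 128-wide output above
    "dense": dense_params(64, 64),
    "dense_1": dense_params(64, 1),
}
for name, n in counts.items():
    print(f"{name}: {n}")
```

The first four values agree with the summary. The last row of the printed summary appears cut off; the formula gives 65 parameters for a single-unit dense layer on 64 inputs.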

In [27]:
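history = model.fit(
    train_dataset,
    epochs=10,
    callbacks=[cp_callback, tensorboard_callback],
    # Validate on every batch of test_dataset; the 715 test rows give only
    # 12 batches of 64, fewer than the 30 validation steps originally requested.
    validation_data=test_dataset,
)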

Epoch 1/10
Instructions for updating:
use `tf.profiler.experimental.stop` instead.

Epoch 00001: saving model to training_1/cp.ckpt
Epoch 2/10
Epoch 00002: saving model to training_1/cp.ckpt
Epoch 3/10
Epoch 00003: saving model to training_1/cp.ckpt
Epoch 4/10
Epoch 00004: saving model to training_1/cp.ckpt
Epoch 5/10
Epoch 00005: saving model to training_1/cp.ckpt
Epoch 6/10
Epoch 00006: saving model to training_1/cp.ckpt
Epoch 7/10
Epoch 00007: saving model to training_1/cp.ckpt
Epoch 8/10
Epoch 00008: saving model to training_1/cp.ckpt
Epoch 9/10
Epoch 00009: saving model to training_1/cp.ckpt
Epoch 10/10
Epoch 00010: saving model to training_1/cp.ckpt


In [28]:
# Save the entire model as a SavedModel.
!mkdir -p savedmodel
model.save("savedmodel/redmodel")

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: savedmodel/redmodel/assets


In [29]:
test_loss, test_acc = model.evaluate(test_dataset)

print("Test Loss: {}".format(test_loss))
print("Test Accuracy: {}".format(test_acc))

Test Loss: 0.5856103301048279
Test Accuracy: 0.7454545497894287


In [30]:
# Predict on a sample to test accuracy and padding.

sample_text = "I have been experiencing a depressing mood and my stress levels are rising tremendously. This should indicate something."
predictions = model.predict(np.array([sample_text]))
print(predictions)

[[-0.5048018]]
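
The final `Dense(1)` layer has no activation and the loss is configured with `from_logits=True`, so the value printed above is a logit rather than a probability. A minimal sketch of converting it with a sigmoid and thresholding at 0.5, using only the standard library (the 0.5 threshold is an assumption, not something the notebook sets):

```python
import math


def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


logit = -0.5048018  # raw model output printed above
probability = sigmoid(logit)
predicted_label = int(probability >= 0.5)

print(f"P(label=1) = {probability:.3f}, predicted label = {predicted_label}")
```

A negative logit therefore corresponds to a predicted label of 0 for this sample.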


In [32]:
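from tensorboard.plugins import projector

# Reuse the TensorBoard log directory defined earlier as `logdir`
# (the original cell referenced an undefined name, `logs`).
os.makedirs(logdir, exist_ok=True)

# Write one vocabulary token per line as projector metadata, skipping
# index 0 (the padding token) to match the weight slice below.
with open(os.path.join(logdir, "metadata.tsv"), "w") as f:
    for token in encoder.get_vocabulary()[1:]:
        f.write("{}\n".format(token))

# Save the weights we want to analyse as a variable. model.layers[0] is the
# TextVectorization encoder, so the Embedding layer is model.layers[1].
# Row 0 belongs to the padding token, which is not in the metadata, so
# we remove that row.
weights = tf.Variable(model.layers[1].get_weights()[0][1:])
# Create a checkpoint from the embedding; the filename and key are
# the name of the tensor.
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(logdir, "embedding.ckpt"))

# Set up config
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = "metadata.tsv"
projector.visualize_embeddings(logdir, config)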

In [33]:
# Load the TensorBoard notebook extension.
%load_ext tensorboard

%tensorboard --logdir logs/scalars

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 7559), started 0:02:53 ago. (Use '!kill 7559' to kill it.)