**The same task as Lab 1 except using `RNNs`**

**Watch out!<br>
Preprocessing here is a bit different**

**Dataset**
Labeled dataset collected from Twitter

**Objective**
Distinguish tweets containing hate speech from other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Total Estimated Time = 60-90 Mins**

> **Load the `clean data` preprocessed in `Lab 1`, then handle it to be used with `RNNs`**

### Import Libraries

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
import pandas as pd
import numpy as np
import nltk
import tensorflow as tf

### Load Dataset

In [3]:
df = pd.read_csv('/content/drive/MyDrive/ITI/NLP/new_dataset.csv')
df.head(5)

Unnamed: 0,label,tweet,clean_tweet
0,0,@user when a father is dysfunctional and is s...,father dysfunctional selfish drags kids dysfun...
1,0,@user @user thanks for #lyft credit i can't us...,thanks left credit not use cause not offer whe...
2,0,bihday your majesty,birthday majesty
3,0,#model i love u take with u all the time in ...,model love a take a time or
4,0,factsguide: society now #motivation,factsguide society motivation


In [5]:
df.dropna(inplace=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29497 entries, 0 to 29529
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   label        29497 non-null  int64 
 1   tweet        29497 non-null  object
 2   clean_tweet  29497 non-null  object
dtypes: int64(1), object(2)
memory usage: 921.8+ KB


### Preprocessing

In [7]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df = df[['clean_tweet', 'label']].copy()

In [13]:
df['clean_tweet'][0]

'father dysfunctional selfish drags kids dysfunction run'

In [26]:
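from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenize_and_pad_tweets(df_column):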
    """Tokenizes tweets and pads the sequences to the length of the longest sequence in the dataset.

    Args:
        df_column (pandas.Series): A DataFrame column containing tweets.

    Returns:
        tuple:
            numpy.ndarray: An array of padded sequences.
            int: The vocabulary size.
            int: The maximum sequence length.
            Tokenizer: The tokenizer object used for the tokenization.
    """
    # Create tokenizer
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(df_column)

    # Convert text to sequences of integers
    sequences = tokenizer.texts_to_sequences(df_column)

    # Get length of longest sequence in dataset
    max_seq_length = max(len(seq) for seq in sequences)

    # Pad sequences to ensure uniform length
    sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

    # Get vocabulary size
    vocab_size = len(tokenizer.word_index) + 1

    return sequences, vocab_size, max_seq_length, tokenizer


In [27]:
padded_sequences, vocab_size, max_seq_length, tokenizer = tokenize_and_pad_tweets(df['clean_tweet'])
df['tokenized_tweets'] = padded_sequences.tolist()
df.head()

Unnamed: 0,clean_tweet,label,tokenized_tweets
0,father dysfunctional selfish drags kids dysfun...,0,"[145, 13433, 2287, 5840, 148, 7151, 305, 0, 0,..."
1,thanks left credit not use cause not offer whe...,0,"[88, 210, 410, 1, 322, 498, 1, 1330, 7152, 715..."
2,birthday majesty,0,"[12, 3007, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
3,model love a take a time or,0,"[680, 3, 2, 104, 2, 15, 91, 0, 0, 0, 0, 0, 0, ..."
4,factsguide society motivation,0,"[2587, 1280, 208, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0..."
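
To make the `tokenized_tweets` column above easier to read, here is a minimal pure-Python sketch of the idea behind tokenizing and post-padding. It is only an approximation of what `Tokenizer` and `pad_sequences` do, not their implementation: words are ranked by frequency, index 0 is reserved for padding, and shorter sequences are filled with trailing zeros. The helper names and the toy tweets are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank (1 = most frequent); 0 is kept for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Stable sort: ties keep their order of first appearance.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_padded(texts, word_index):
    """Convert texts to integer sequences and right-pad them with zeros."""
    seqs = [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]
    max_len = max(len(s) for s in seqs)
    padded = [s + [0] * (max_len - len(s)) for s in seqs]
    return padded, max_len

# Hypothetical toy data, not taken from the real dataset
toy_tweets = ["love a take a time", "birthday majesty", "love birthday"]
toy_index = build_word_index(toy_tweets)
toy_padded, toy_max_len = texts_to_padded(toy_tweets, toy_index)
toy_vocab_size = len(toy_index) + 1  # +1 accounts for the padding index 0

for row in toy_padded:
    print(row)
```

The `+ 1` in the vocabulary size mirrors the function above: the embedding layer needs a row for the padding index as well as one for each word.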


In [43]:
from sklearn.model_selection import train_test_split
features = np.array(df['tokenized_tweets'].values.tolist())
labels = np.array(df['label'].values)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

### Modelling

In [44]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

embedding_dim = 64
rnn_units = 64
# Define model architecture
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_length))
model.add(SimpleRNN(units=rnn_units))
model.add(Dense(units=1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
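
As a rough illustration of what the `SimpleRNN` layer computes, the sketch below runs the standard recurrence `h_t = tanh(W_x x_t + W_h h_{t-1} + b)` over a sequence and returns the final hidden state, which is what the `Dense` layer receives. It is a plain-Python sketch with hypothetical toy weights, not the Keras implementation (Keras stores its kernels transposed relative to the layout used here).

```python
import math

def simple_rnn_last_state(inputs, W_x, W_h, b):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) step by step; return the last h.

    inputs: list of input vectors, one per time step
    W_x:    units x input_dim input weights
    W_h:    units x units recurrent weights
    b:      bias vector of length units
    """
    units = len(b)
    h = [0.0] * units  # initial hidden state
    for x in inputs:
        h = [
            math.tanh(
                sum(W_x[j][i] * x[i] for i in range(len(x)))
                + sum(W_h[j][k] * h[k] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
    return h

# Hypothetical one-unit example with two time steps
toy_sequence = [[0.5], [0.0]]
final_state = simple_rnn_last_state(toy_sequence, W_x=[[1.0]], W_h=[[1.0]], b=[0.0])
print(final_state)
```

In the toy run the second input is zero, yet the final state is not: the hidden state carries information forward from the first step, which is the property the model relies on.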


In [45]:
history = model.fit(X_train, y_train, epochs=10, batch_size=64,
                    validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Evaluation

In [46]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.29791364073753357
Test Accuracy: 0.9408474564552307
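
Accuracy alone can be misleading if one class is much rarer than the other (the class balance is not shown in this notebook, so this is only a precaution). The sketch below computes precision, recall and F1 for the positive class from true labels and predicted probabilities. The function name, the 0.5 threshold and the toy values are assumptions for illustration; with the real model, the probabilities would come from something like `model.predict(X_test).ravel()`.

```python
def binary_classification_report(y_true, y_prob, threshold=0.5):
    """Return accuracy, precision, recall and F1 for the positive class (label 1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical toy labels and predicted probabilities
toy_true = [0, 0, 0, 0, 1, 1]
toy_prob = [0.1, 0.2, 0.7, 0.3, 0.9, 0.4]
report = binary_classification_report(toy_true, toy_prob)
print(report)
```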


### Enhancement

In [47]:
from tensorflow.keras.layers import Bidirectional

embedding_dim = 64
rnn_units = 64

# Define model architecture
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_length))
model.add(Bidirectional(SimpleRNN(units=rnn_units)))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
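
To see how much the bidirectional variant grows the network, the following sketch counts trainable parameters with the usual formulas for embedding, simple recurrent and dense layers. It assumes the default `Bidirectional` behaviour of concatenating the forward and backward outputs, so the first `Dense` layer sees `2 * rnn_units` inputs. The helper functions are hypothetical; `model.summary()` reports the actual counts.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index (including padding index 0)
    return vocab_size * embedding_dim

def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # Weight matrix + bias
    return input_dim * units + units

embedding_dim, rnn_units = 64, 64

# Baseline: SimpleRNN(64) -> Dense(1)
uni_rnn = simple_rnn_params(embedding_dim, rnn_units)
uni_head = dense_params(rnn_units, 1)

# Enhanced: Bidirectional(SimpleRNN(64)) -> Dense(32) -> Dense(1)
bi_rnn = 2 * uni_rnn  # one forward and one backward RNN
bi_head = dense_params(2 * rnn_units, 32) + dense_params(32, 1)

print("unidirectional RNN + head:", uni_rnn, uni_head)
print("bidirectional RNN + head: ", bi_rnn, bi_head)
```

The embedding layer is identical in both models, so it is left out of the comparison; `embedding_params(vocab_size, embedding_dim)` gives its size once `vocab_size` is known.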


In [48]:
history = model.fit(X_train, y_train, epochs=10, batch_size=64,
                    validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [49]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.33355093002319336
Test Accuracy: 0.9506779909133911


### Results & Conclusion

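The bidirectional SimpleRNN reached slightly higher test accuracy than the unidirectional SimpleRNN in this task (about 0.951 vs 0.941), although its test loss was also higher (0.334 vs 0.298). Both scored higher than the TF-IDF approach in the previous lab, likely because recurrent layers capture sequential context that bag-of-words features discard.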

#### Done!