# Importing Libraries

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('/kaggle/input/sarcasm/train-balanced-sarcasm.csv')

In [3]:
df.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...


In [4]:
df.shape

(1010826, 10)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   label           1010826 non-null  int64 
 1   comment         1010771 non-null  object
 2   author          1010826 non-null  object
 3   subreddit       1010826 non-null  object
 4   score           1010826 non-null  int64 
 5   ups             1010826 non-null  int64 
 6   downs           1010826 non-null  int64 
 7   date            1010826 non-null  object
 8   created_utc     1010826 non-null  object
 9   parent_comment  1010826 non-null  object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB


#### Checking null values

In [8]:
df.isnull().sum()

label              0
comment           55
author             0
subreddit          0
score              0
ups                0
downs              0
date               0
created_utc        0
parent_comment     0
dtype: int64

#### Checking duplicated values

In [10]:
df.duplicated().sum()

28

------

#### The dataset has about 10 lakh (1 million) rows, so we keep only the first 10,000 rows.

In [11]:
df=df[:10000]

#### Removing unwanted columns

In [12]:
df=df[['label','comment']].copy()  # copy so later in-place edits do not act on a slice of the original frame

In [13]:
df.head()

Unnamed: 0,label,comment
0,0,NC and NH.
1,0,You do know west teams play against west teams...
2,0,"They were underdogs earlier today, but since G..."
3,0,"This meme isn't funny none of the ""new york ni..."
4,0,I could use one of those tools.


In [14]:
df.shape

(10000, 2)

----------

# 1) Data Pre-processing

#### Checking the null values again

In [15]:
df.isnull().sum()

label      0
comment    1
dtype: int64

#### Removing the null values

In [16]:
df.dropna(inplace=True)

In [17]:
df.isnull().sum()

label      0
comment    0
dtype: int64

--------

# 2) Data Cleaning

### i) Removing digits, punctuation and every other non-letter character using a regular expression

In [18]:
df['comment']=df['comment'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

### ii) Converting all of the text into lowercase

In [19]:
def lowercase(text):
    return text.lower()

df['comment']=df['comment'].apply(lowercase)

In [20]:
df.head()

Unnamed: 0,label,comment
0,0,nc and nh
1,0,you do know west teams play against west teams...
2,0,they were underdogs earlier today but since gr...
3,0,this meme isnt funny none of the new york nigg...
4,0,i could use one of those tools
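
A quick way to sanity-check the two cleaning steps is to apply them to a single string outside pandas. The sketch below uses Python's built-in `re` module with the same pattern as the cell above; `clean_comment` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
import re

def clean_comment(text):
    # Keep only ASCII letters and whitespace (same pattern as the notebook cell),
    # then convert the result to lowercase.
    letters_only = re.sub(r'[^a-zA-Z\s]', '', text)
    return letters_only.lower()

print(clean_comment("NC and NH."))              # nc and nh
print(clean_comment("This meme isn't funny!"))  # this meme isnt funny
```

Note that apostrophes are removed as well, so "isn't" becomes "isnt", and digits are dropped entirely.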


-------

# 3) Tokenization

In [21]:
from transformers import BertTokenizer

In [23]:
tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')

1. BERT comes in two standard sizes: BERT base and BERT large.

2. BERT base has 12 encoder layers stacked on top of each other.

3. BERT large has 24 encoder layers stacked on top of each other.

4. BERT large needs considerably more memory. Since we are using only 10,000 rows here, the base model is sufficient.

-----------

## Function for Tokenization

1. max_length --> With max_length=100, combined with the truncation and padding options below, every comment is encoded as exactly 100 tokens, no more and no less. Tokens are not the same as words: BERT adds the special tokens [CLS] and [SEP], and a rare word may be split into several tokens.

2. truncation=True --> If a comment produces 150 tokens but max_length is 100, the extra 50 tokens are cut off.

3. padding='max_length' --> If a comment produces only 70 tokens, 30 padding tokens are appended to bring it up to 100. (100 is the default max_length of our tokenize_data function.)

4. return_tensors='np' --> The tokenizer returns NumPy arrays. We use 'np' rather than 'tf' (TensorFlow tensors) because NumPy arrays can be passed straight to scikit-learn's train_test_split later on.
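
The effect of truncation and padding can be imitated in a few lines of plain Python. This is only a simplified sketch of the idea, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the content IDs are made-up placeholder values. The IDs 101, 102 and 0 are the [CLS], [SEP] and [PAD] tokens of bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def pad_or_truncate(content_ids, max_length):
    # Reserve two positions for [CLS] and [SEP]; cut anything beyond that.
    kept = content_ids[:max_length - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    # Fill the remaining positions with the padding ID.
    return ids + [PAD_ID] * (max_length - len(ids))

short = pad_or_truncate([7, 8, 9], max_length=8)             # needs padding
long = pad_or_truncate(list(range(10, 30)), max_length=8)    # needs truncation
print(short)  # [101, 7, 8, 9, 102, 0, 0, 0]
print(long)   # [101, 10, 11, 12, 13, 14, 15, 102]
```

Both results have length 8, which is why every row of the real output has the same length.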

In [29]:
def tokenize_data(text, max_length=100):
    return tokenizer(
        
        # Converting the text to list
        text.tolist(),
        
        # Sentence length
        max_length=max_length,
        
        # Truncation words
        truncation=True,
        
        # Padding
        padding='max_length',
        
        # Return value
        return_tensors='np'
    )

In [30]:
tokenized_data=tokenize_data(df['comment'])

In [31]:
tokenized_data

{'input_ids': array([[  101, 13316,  1998, ...,     0,     0,     0],
       [  101,  2017,  2079, ...,     0,     0,     0],
       [  101,  2027,  2020, ...,     0,     0,     0],
       ...,
       [  101,  5095,  2305, ...,     0,     0,     0],
       [  101, 29420,  2015, ...,     0,     0,     0],
       [  101,  2016, 28719, ...,     0,     0,     0]]), 'token_type_ids': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]]), 'attention_mask': array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])}

### Observing the tokenized_data 

1. In a row such as "[  101,  5095,  2305, ...,     0,     0,     0]", the trailing zeros show that padding was applied: 0 is the padding token ID, added at the end of the sequence.

2. The two important entries in tokenized_data are 'input_ids' and 'attention_mask'.

3. attention_mask takes two values, 1 and 0 --> 1 marks a real token and 0 marks a padding position.
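
The relationship between 'input_ids' and 'attention_mask' can be illustrated with a small sketch. The tokenizer builds the mask itself; deriving it from the padding ID as done here is only for illustration. `attention_mask_for` is a hypothetical helper, and the row is a shortened, made-up example.

```python
PAD_ID = 0  # padding ID seen at the end of each input_ids row

def attention_mask_for(input_ids):
    # 1 for a real token, 0 for a padding position.
    return [0 if token_id == PAD_ID else 1 for token_id in input_ids]

row = [101, 5095, 2305, 102, 0, 0, 0, 0]  # shortened illustrative row
mask = attention_mask_for(row)
print(mask)       # [1, 1, 1, 1, 0, 0, 0, 0]
print(sum(mask))  # 4 real tokens, including [CLS] and [SEP]
```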

--------

# 4) Train Test Split

In [32]:
from sklearn.model_selection import train_test_split

#### We pass only 'input_ids' from tokenized_data to the model, not 'attention_mask'

In [33]:
X=tokenized_data['input_ids']
Y=df['label']

In [34]:
X_train,X_test, Y_train, Y_test=train_test_split(X, Y, test_size=0.2, random_state=42)

In [35]:
print(X.shape, X_train.shape, X_test.shape)

(9999, 100) (7999, 100) (2000, 100)


**Here 100 is the sequence length: every comment is represented by exactly 100 token IDs after truncation or padding.**
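
The row counts printed above follow from simple arithmetic. The sketch below assumes that, when only a fractional test_size is given, the test count is rounded up and the training set receives the remaining rows; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    # Assumption: the test count is rounded up, the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(9999, 0.2))  # (7999, 2000)
```

We have 9999 rows rather than 10000 because one row with a null comment was dropped earlier.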

----------

# 5) Building the Model according to the proposed architecture

In [36]:
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')

In [43]:
class HierarchicalBERT(tf.keras.Model):
    def __init__(self, bert_model, lstm_units, cnn_filters, dense_units):
        super(HierarchicalBERT, self).__init__()
        self.bert = bert_model

        # Sentence Encoding Layer
        self.dense_sentence = tf.keras.layers.Dense(768, activation='relu')

        # Context Summarization Layer
        self.mean_pooling = tf.keras.layers.GlobalAveragePooling1D()

        # Context Encoder Layer
        self.bilstm_encoder = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units, return_sequences=True))

        # CNN Layer
        self.conv = tf.keras.layers.Conv1D(cnn_filters, 2, activation='relu')
        self.pool = tf.keras.layers.GlobalMaxPooling1D()

        # Fully Connected Layer
        self.dense_fc = tf.keras.layers.Dense(dense_units, activation='relu')
        self.output_layer = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        # BERT Embeddings
        bert_output = self.bert(inputs)[0]  # (batch_size, seq_len, hidden_size)

        # Sentence Encoding Layer
        sentence_encoded = self.dense_sentence(bert_output)  # (batch_size, seq_len, 768)

        # Context Summarization Layer
        context_summarized = self.mean_pooling(sentence_encoded)  # (batch_size, 768)

        # Expand dimensions to match the input shape required by LSTM
        context_summarized = tf.expand_dims(context_summarized, 1)  # (batch_size, 1, 768)

        # Context Encoder Layer
        context_encoded = self.bilstm_encoder(context_summarized)  # (batch_size, 1, 2 * lstm_units)

        # Squeeze the second dimension to match the input shape required by Conv1D
        context_encoded_squeezed = tf.squeeze(context_encoded, axis=1)  # (batch_size, 2 * lstm_units)

        # Add the channels dimension to match the input shape required by Conv1D
        context_encoded_expanded = tf.expand_dims(context_encoded_squeezed, axis=-1)  # (batch_size, 2 * lstm_units, 1)

        # CNN Layer
        conv_output = self.conv(context_encoded_expanded)  # (batch_size, new_seq_len, cnn_filters)
        pooled_output = self.pool(conv_output)  # (batch_size, cnn_filters)

        # Fully Connected Layer
        dense_output = self.dense_fc(pooled_output)  # (batch_size, dense_units)

        # Output Layer
        final_output = self.output_layer(dense_output)  # (batch_size, 1)
        return final_output
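
To follow how the tensor shapes change through `call`, the shape comments in the class can be collected into a small pure-Python trace. This is a sketch under stated assumptions: `trace_shapes` is a hypothetical helper, the hidden size of 768 is that of BERT base, and the Conv1D layer is assumed to use its default 'valid' padding with stride 1.

```python
def trace_shapes(batch_size, seq_len, lstm_units, cnn_filters, dense_units,
                 hidden_size=768, kernel_size=2):
    # With 'valid' padding and stride 1, Conv1D shortens the axis by kernel_size - 1.
    conv_len = 2 * lstm_units - kernel_size + 1
    return [
        ("bert_output", (batch_size, seq_len, hidden_size)),
        ("sentence_encoded", (batch_size, seq_len, 768)),
        ("context_summarized", (batch_size, 1, 768)),
        ("context_encoded", (batch_size, 1, 2 * lstm_units)),
        ("context_encoded_expanded", (batch_size, 2 * lstm_units, 1)),
        ("conv_output", (batch_size, conv_len, cnn_filters)),
        ("pooled_output", (batch_size, cnn_filters)),
        ("dense_output", (batch_size, dense_units)),
        ("final_output", (batch_size, 1)),
    ]

for name, shape in trace_shapes(32, 100, lstm_units=128, cnn_filters=64, dense_units=32):
    print(f"{name:26s}{shape}")
```

The trace makes two properties of this architecture visible: the BiLSTM receives a sequence of length 1, because mean pooling has already collapsed the token axis, and the Conv1D layer slides over the 2 * lstm_units feature axis rather than over tokens.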

--------

In [38]:
from transformers import TFBertModel

# 6) Loading the pretrained BERT model

In [39]:
bert_model= TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

------

# 7) Defining the Hierarchical BERT model

In [44]:
model=HierarchicalBERT(bert_model, lstm_units=128, cnn_filters=64, dense_units=32)

### Compiling the model

In [45]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
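
To help interpret the loss values that Keras reports, here is a minimal pure-Python sketch of binary cross-entropy. It is not the Keras implementation; the clipping constant `eps` is an assumption made here to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability so that log() never receives 0 (assumed constant).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Always predicting 0.5 gives ln(2), roughly 0.693, whatever the labels are.
print(binary_crossentropy([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5]))
# Confident, correct predictions give a much lower loss.
print(binary_crossentropy([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.2]))
```

A loss close to 0.693 therefore means the predictions carry little more information than a constant 0.5.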

### Fitting the Model

In [46]:
model.fit(X_train, Y_train, epochs=5, batch_size=32)

Epoch 1/5
250/250 ━━━━━━━━━━━━━━━━━━━━ 56s 143ms/step - accuracy: 0.6366 - loss: 0.6613
Epoch 2/5
250/250 ━━━━━━━━━━━━━━━━━━━━ 36s 143ms/step - accuracy: 0.6291 - loss: 0.6600
Epoch 3/5
250/250 ━━━━━━━━━━━━━━━━━━━━ 36s 143ms/step - accuracy: 0.6327 - loss: 0.6585
Epoch 4/5
250/250 ━━━━━━━━━━━━━━━━━━━━ 36s 142ms/step - accuracy: 0.6306 - loss: 0.6592
Epoch 5/5
250/250 ━━━━━━━━━━━━━━━━━━━━ 36s 142ms/step - accuracy: 0.6363 - loss: 0.6562


<keras.src.callbacks.history.History at 0x78e339adad70>

### Test Accuracy

In [47]:
loss, accuracy=model.evaluate(X_test, Y_test)

print(f"Model Accuracy: {accuracy*100}")

63/63 ━━━━━━━━━━━━━━━━━━━━ 12s 138ms/step - accuracy: 0.6398 - loss: 0.6540
Model Accuracy: 63.749998807907104


### Training accuracy (about 63.6% in the last epoch) and test accuracy (about 63.7%) are almost the same, so the model performs as well on unseen data as on the training data.
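
For reference, the accuracy of a sigmoid output can be sketched in plain Python: each predicted probability is turned into a 0 or 1 label and compared with the true label. The threshold of 0.5 is an assumption here, and `binary_accuracy` is a hypothetical helper rather than the Keras metric itself.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    # Turn each probability into a hard 0/1 prediction.
    predictions = [1 if p > threshold else 0 for p in y_prob]
    # Count how many predictions agree with the true labels.
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

print(binary_accuracy([1, 0, 0, 1], [0.9, 0.2, 0.6, 0.4]))  # 0.5
```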

------