# Project Part 3

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/brearenee/NLP-Project/blob/main/part3.ipynb)


**NLP Problem:** Predicting the speaker from Star Trek: The Next Generation script lines for 8 main characters.

In this third phase of my project, I'm developing a deep learning model for this NLP task.

As learned in Part 1 and Part 2, the initial dataset's structure is less than ideal. To start Part 3, we must once again parse and clean the raw JSON data and transform it into a structured DataFrame.
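For orientation, the parsing loop in the next cell walks a JSON file nested as series, then episode, then character, then a list of lines. Here is a minimal sketch of that flattening on a made-up miniature of the structure. The episode keys and dialogue are invented for illustration and are not taken from the dataset, and `flatten_series` is a hypothetical helper name.

```python
# Toy stand-in for StarTrekDialogue_v2.json; keys and lines are made up.
toy_json = {
    "TNG": {
        "episode 0": {
            "PICARD": ["Engage.", "Make it so."],
            "DATA": ["Intriguing."],
        }
    },
    "DS9": {"episode 0": {"SISKO": ["Report."]}},
}


def flatten_series(data, series="TNG"):
    """Return (line, character, episode) rows for a single series."""
    rows = []
    for episode_name, episode_data in data.get(series, {}).items():
        for character_name, character_lines in episode_data.items():
            for line_text in character_lines:
                rows.append((line_text, character_name, episode_name))
    return rows


rows = flatten_series(toy_json)
for row in rows:
    print(row)
```

Each tuple becomes one DataFrame row, which is the shape the classifier needs: one line of dialogue paired with one speaker.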

In [51]:
import pandas as pd
import json
import requests
url = 'https://raw.githubusercontent.com/brearenee/NLP-Project/main/dataset/StarTrekDialogue_v2.json'
response = requests.get(url)

##This CodeBlock is thanks to ChatGPT :-) 
if response.status_code == 200:
    json_data = json.loads(response.text)
    lines = []
    characters = []
    episodes = []
  
    # extract the information from the JSON file for the "TNG" series
    for series_name, series_data in json_data.items():
        if series_name == "TNG": 
            for episode_name, episode_data in series_data.items():
                for character_name, character_lines in episode_data.items():
                    for line_text in character_lines:
                        lines.append(line_text)
                        characters.append(character_name)
                        episodes.append(episode_name)
                     
    # Create a DataFrame from the extracted data
    df = pd.DataFrame({
        'Line': lines,
        'Character': characters,
    })

    # Remove duplicate lines, keeping the first occurrence (preserving the original order)
    df = df.drop_duplicates(subset='Line', keep='first')

    # Reset the index of the DataFrame
    df.reset_index(drop=True, inplace=True)

else:
    # Stop here: the code below needs df, which only exists after a successful download
    raise RuntimeError(f"Failed to retrieve data. Status code: {response.status_code}")
    
    
##Remove Outliers (Characters with fewer than 1207 lines)
character_counts = df['Character'].value_counts()
characters_to_remove = character_counts[character_counts < 1207].index
df = df[~df['Character'].isin(characters_to_remove)]

##Print Value Count. 
print(df['Character'].value_counts())


Character
PICARD     10798
RIKER       6454
DATA        5699
LAFORGE     4111
WORF        3185
CRUSHER     2944
TROI        2856
Name: count, dtype: int64
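The classes are clearly imbalanced, so it helps to know what a trivial model would score before judging BERT. A small sketch, using only the counts printed above, of the accuracy from always guessing the most frequent speaker (the variable names are mine):

```python
# Line counts per character, copied from the value_counts() output above.
counts = {
    "PICARD": 10798, "RIKER": 6454, "DATA": 5699, "LAFORGE": 4111,
    "WORF": 3185, "CRUSHER": 2944, "TROI": 2856,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the most common speaker.
baseline_accuracy = counts[majority_class] / total
# Accuracy expected from guessing uniformly among the 7 classes.
uniform_accuracy = 1 / len(counts)

print(total, majority_class)
print(round(baseline_accuracy, 4), round(uniform_accuracy, 4))
```

Always answering PICARD would already be right about 30% of the time, so that is the number to beat rather than 1/7.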


# BERT

In [52]:
#https://www.analyticsvidhya.com/blog/2021/12/multiclass-classification-using-transformers/


#Split the data 
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(df, test_size=0.2, random_state=20)

#Converting our Character column into Categorical data
encoded_dict = {'PICARD':0,'RIKER':1, 'DATA':2, 'LAFORGE':3, 
                'WORF':4, 'CRUSHER':5, 'TROI':6}
train_df['Character'] = train_df.Character.map(encoded_dict)
val_df['Character'] = val_df.Character.map(encoded_dict)

print(train_df['Character'].value_counts())
print(val_df['Character'].value_counts())




Character
0    8684
1    5130
2    4531
3    3339
4    2553
5    2341
6    2259
Name: count, dtype: int64
Character
0    2114
1    1324
2    1168
3     772
4     632
5     603
6     597
Name: count, dtype: int64
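As a sanity check on the split, the counts printed above can be compared directly. This sketch only redoes the arithmetic from those two outputs. `train_test_split` also accepts a `stratify` argument for forcing matching class proportions, which the split above does not use.

```python
# Per-class counts copied from the two value_counts() outputs above (labels 0..6).
train_counts = [8684, 5130, 4531, 3339, 2553, 2341, 2259]
val_counts = [2114, 1324, 1168, 772, 632, 603, 597]

n_train, n_val = sum(train_counts), sum(val_counts)
val_fraction = n_val / (n_train + n_val)
print(n_train, n_val, round(val_fraction, 4))

# Share of each class in the training set versus the validation set.
for label, (t, v) in enumerate(zip(train_counts, val_counts)):
    print(label, round(t / n_train, 3), round(v / n_val, 3))
```

The validation set is 20% of the data as requested, and the per-class shares can be read off the loop output to see how closely the random split tracked the overall distribution.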


In [53]:
val_df.head(20)

| | Line | Character |
|---|---|---|
| 16254 | Nope. | 1 |
| 56245 | I suspect the last thing Counsellor Troi would... | 2 |
| 9987 | It's still running. The programme didn't\r shu... | 3 |
| 2965 | It seems to be a network of miniature circuitr... | 3 |
| 16825 | We misinterpreted your actions as an attack on... | 0 |
| 35320 | Peace envoy, in a stolen Vulcan ship. | 0 |
| 36705 | Okay, we're going to track down any possible c... | 3 |
| 9177 | There was a moment when you smiled. | 1 |
| 11906 | Which is? | 1 |
| 50132 | Cellular peptides. That's exactly what the cre... | 3 |


In [54]:
from tensorflow.keras.utils import to_categorical

y_train = to_categorical(train_df.Character)
y_test = to_categorical(val_df.Character)

#We have successfully processed our Character column (the target);
#now it's time to process our input text data using a tokenizer.
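To make the target encoding concrete, here is a plain-Python sketch of the one-hot layout that `to_categorical` produces for integer labels. `one_hot` is a hypothetical helper written for illustration, not the Keras implementation, and the labels 1, 2, 3 are the first three rows of the `val_df` preview above.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    return [
        [1.0 if index == label else 0.0 for index in range(num_classes)]
        for label in labels
    ]


# Labels of the first three validation rows shown above: RIKER, DATA, LAFORGE.
encoded = one_hot([1, 2, 3], num_classes=7)
for row in encoded:
    print(row)
```

Each row has seven slots, one per character, which is why the final layer of the model needs seven outputs.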

In [55]:
import transformers

#Loading Model and Tokenizer from the transformers package 

from transformers import AutoTokenizer,TFBertModel
#bert-base-cased is another possible checkpoint
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
#TFBertModel = pretrained BERT model for TensorFlow
bert = TFBertModel.from_pretrained('bert-base-uncased')

#Input Data Modeling

#Before training, we need to convert the input text into 
#BERT's input format using a tokenizer.
#Since we loaded bert-base-uncased, 
#the tokenizer must also come from bert-base-uncased.
#Tokenize the input (takes some time).
x_train = tokenizer(
    text=train_df.Line.tolist(),
    add_special_tokens=True,
    max_length=40,
    truncation=True,
    padding='max_length',  # always pad to max_length so the shape matches the model's Input(shape=(40,))
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)
x_test = tokenizer(
    text=val_df.Line.tolist(),
    add_special_tokens=True,
    max_length=40,
    truncation=True,
    padding='max_length',  # always pad to max_length so the shape matches the model's Input(shape=(40,))
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)


#After tokenization, x_train is a dictionary containing 'input_ids' and 'attention_mask'
#as keys for their respective data.

input_ids = x_train['input_ids']
attention_mask = x_train['attention_mask']

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w
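To show what `max_length`, `truncation`, `padding`, and `attention_mask` mean for a single line, here is a simplified plain-Python sketch. It is not the Hugging Face implementation: the token ids are made up, `pad_or_truncate` is a hypothetical helper, and the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this toy version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it; return (ids, mask)."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets BERT skip the padded positions, and every row ends up the same length, which the fixed-size model inputs in the next section rely on.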

# Model Building

In [56]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense

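max_len = 40
# Two inputs, matching the tokenizer's output keys
input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
# Index 0 of BERT's output is the last hidden state: one 768-dim vector per token
embeddings = bert(input_ids, attention_mask=input_mask)[0]
# Max-pool over the token axis to get one fixed-size vector per line
out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(128, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(64, activation='relu')(out)
# One softmax output per character class
y = Dense(7, activation='softmax')(out)
model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)
# Keras layers are trainable by default, so this mainly documents intent.
# model.layers also counts the two Input layers, so check model.summary()
# to confirm which layer index 3 actually refers to.
model.layers[3].trainable = True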
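The pooling step is the part of this architecture that turns a variable number of token vectors into one vector per line. A minimal plain-Python sketch of global max pooling over the token axis follows. The 4x3 numbers are invented (the real tensor is 40 tokens by 768 features), and `global_max_pool_1d` is an illustrative helper, not the Keras layer.

```python
def global_max_pool_1d(sequence):
    """sequence: list of token vectors (tokens x features) -> one vector (features)."""
    # zip(*sequence) regroups the values by feature, so max() runs across tokens.
    return [max(feature_values) for feature_values in zip(*sequence)]


# Four made-up token vectors with three features each.
tokens = [
    [0.1, -2.0, 0.5],
    [0.9, -1.0, 0.0],
    [-0.3, -3.0, 0.7],
    [0.2, -1.5, 0.1],
]

pooled = global_max_pool_1d(tokens)
print(pooled)
```

Each output feature keeps only its strongest activation across the whole line, regardless of which token produced it.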

# Model Compilation

Defining learning parameters and compiling the model.

In [57]:

optimizer = tf.keras.optimizers.legacy.Adam(
    learning_rate=4e-05, # 5e-05 is the BERT learning rate taken from the Hugging Face website; this run uses 4e-05
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0)
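# Set loss and metrics
# The output layer already applies softmax, so the loss receives probabilities, not logits
loss = CategoricalCrossentropy(from_logits=False)
# Note: this is plain categorical accuracy; 'balanced_accuracy' is only the display name
metric = CategoricalAccuracy('balanced_accuracy')
# Compile the model
model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=[metric])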
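The final layer of the model uses softmax, so the loss is handed probabilities rather than raw scores. Here is a minimal plain-Python sketch of what categorical cross-entropy computes and of why it matters whether its input is treated as logits or as probabilities. The logits are made-up numbers and both functions are illustrative, not the Keras code.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t > 0)


logits = [2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0]   # made-up scores for 7 classes
target = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # true class is label 0

probs = softmax(logits)
loss_once = categorical_crossentropy(target, probs)
# Treating probabilities as if they were logits applies softmax a second time.
loss_twice = categorical_crossentropy(target, softmax(probs))
# Loss of a model that spreads probability evenly over 7 classes.
uniform_loss = math.log(7)

print(round(loss_once, 3), round(loss_twice, 3), round(uniform_loss, 3))
```

Applying softmax twice flattens the distribution, so a confident correct prediction is penalized more than it should be. The uniform value, ln 7, is also a handy reference point when reading the loss values in the training logs.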

# Model Training

In [58]:
#We have the model ready with x_train, y_train. You can now train the model.
train_history = model.fit(
    x={'input_ids': x_train['input_ids'], 'attention_mask': x_train['attention_mask']},
    y=y_train,
    validation_data=(
        {'input_ids': x_test['input_ids'], 'attention_mask': x_test['attention_mask']},
        y_test
    ),
    epochs=4,
    batch_size=32
)


Epoch 1/4




Epoch 2/4
Epoch 3/4
Epoch 4/4
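The step counts that show up in the training logs further down (`902/902` for training, `226/226` for prediction) follow directly from the split sizes and the batch size. A quick sketch of that arithmetic, using the row counts from the split above and the roughly 297 ms per step reported in those logs:

```python
import math

n_train, n_val, batch_size = 28837, 7210, 32

# The last batch may be partial, so round up.
train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)

# Rough epoch time from the logged ~297 ms per training step.
seconds_per_epoch = train_steps * 0.297

print(train_steps, val_steps, round(seconds_per_epoch))
```

That works out to roughly four and a half minutes per epoch, which is why each experiment below only runs for a handful of epochs.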


# Model Evaluation

In [59]:
import numpy as np
predicted_raw = model.predict({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']})
predicted_raw[0]
y_predicted = np.argmax(predicted_raw, axis = 1)
y_true = val_df.Character

from sklearn.metrics import classification_report
print(classification_report(y_true, y_predicted))

              precision    recall  f1-score   support

           0       0.55      0.62      0.58      2114
           1       0.38      0.37      0.38      1324
           2       0.60      0.69      0.64      1168
           3       0.46      0.52      0.49       772
           4       0.48      0.36      0.41       632
           5       0.47      0.27      0.34       603
           6       0.35      0.29      0.32       597

    accuracy                           0.50      7210
   macro avg       0.47      0.45      0.45      7210
weighted avg       0.49      0.50      0.49      7210
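The Keras metric in this notebook is a `CategoricalAccuracy` that is only named `balanced_accuracy`, so it reports plain accuracy. Balanced accuracy proper is the unweighted mean of per-class recall. A small sketch that recomputes both from the rounded recall and support columns of the report above (so the results are approximate):

```python
# Recall and support per class, copied from the classification report above.
recalls = [0.62, 0.37, 0.69, 0.52, 0.36, 0.27, 0.29]
supports = [2114, 1324, 1168, 772, 632, 603, 597]

# Balanced accuracy: every class counts equally.
balanced_accuracy = sum(recalls) / len(recalls)

# Plain accuracy: classes weighted by how many lines they have.
accuracy = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(round(balanced_accuracy, 3), round(accuracy, 3))
```

The support-weighted figure lands on the 0.50 accuracy in the report, while the balanced figure matches the macro-average recall of 0.45. The gap comes from PICARD and DATA, the classes the model recalls best, carrying more weight in plain accuracy.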





#1 1 epoch. max_length = 70. Second layer = 34. Accuracy ≈ 0.42.

#2 2 epochs. max_length = 50 (or 40?). Second layer = 64. Accuracy = 0.4797.

#3 2 epochs. max_length = 50. Second layer = 64. Removed Wesley.
              precision    recall  f1-score   support

           0       0.52      0.70      0.60      2114
           1       0.45      0.23      0.31      1324
           2       0.60      0.70      0.65      1168
           3       0.49      0.48      0.48       772
           4       0.43      0.41      0.42       632
           5       0.50      0.30      0.37       603
           6       0.35      0.34      0.34       597

    accuracy                           0.50      7210
   macro avg       0.48      0.45      0.45      7210
weighted avg       0.49      0.50      0.48      7210



#4 Because my dataset is fairly small, I'm going to adjust where I start fine-tuning the model. Right now it starts at layer 2; I'm changing that to 5. That helped slightly: accuracy went from 0.50 to 0.51.

902/902 [==============================] - 290s 297ms/step - loss: 1.4421 - balanced_accuracy: 0.4498 - val_loss: 1.3221 - val_balanced_accuracy: 0.4940
Epoch 2/2
902/902 [==============================] - 268s 297ms/step - loss: 1.2568 - balanced_accuracy: 0.5312 - val_loss: 1.2979 - val_balanced_accuracy: 0.5062
226/226 [==============================] - 23s 88ms/step
              precision    recall  f1-score   support

           0       0.54      0.65      0.59      2114
           1       0.39      0.39      0.39      1324
           2       0.64      0.65      0.65      1168
           3       0.49      0.51      0.50       772
           4       0.46      0.43      0.44       632
           5       0.46      0.29      0.35       603
           6       0.41      0.25      0.31       597

    accuracy                           0.51      7210
   macro avg       0.48      0.45      0.46      7210
weighted avg       0.50      0.51      0.50      7210



#5 So let's add another epoch to this.
226/226 [==============================] - 23s 89ms/step
              precision    recall  f1-score   support

           0       0.53      0.71      0.61      2114
           1       0.42      0.30      0.35      1324
           2       0.68      0.62      0.65      1168
           3       0.49      0.52      0.50       772
           4       0.47      0.46      0.47       632
           5       0.48      0.31      0.38       603
           6       0.33      0.31      0.32       597

    accuracy                           0.51      7210
   macro avg       0.49      0.46      0.47      7210
weighted avg       0.50      0.51      0.50      7210

902/902 [==============================] - 295s 302ms/step - loss: 1.4367 - balanced_accuracy: 0.4478 - val_loss: 1.3257 - val_balanced_accuracy: 0.4908
Epoch 2/3
902/902 [==============================] - 268s 298ms/step - loss: 1.2538 - balanced_accuracy: 0.5270 - val_loss: 1.3000 - val_balanced_accuracy: 0.5028
Epoch 3/3
902/902 [==============================] - 268s 298ms/step - loss: 1.1901 - balanced_accuracy: 0.5553 - val_loss: 1.2947 - val_balanced_accuracy: 0.5096

#6 Let's change the fine-tune layer to 7.
902/902 [==============================] - 294s 302ms/step - loss: 1.4514 - balanced_accuracy: 0.4479 - val_loss: 1.3264 - val_balanced_accuracy: 0.4896
Epoch 2/3
902/902 [==============================] - 268s 297ms/step - loss: 1.2661 - balanced_accuracy: 0.5244 - val_loss: 1.3018 - val_balanced_accuracy: 0.4964
Epoch 3/3
902/902 [==============================] - 268s 297ms/step - loss: 1.2079 - balanced_accuracy: 0.5502 - val_loss: 1.2996 - val_balanced_accuracy: 0.5037
226/226 [==============================] - 23s 89ms/step
              precision    recall  f1-score   support

           0       0.52      0.70      0.60      2114
           1       0.40      0.36      0.38      1324
           2       0.63      0.65      0.64      1168
           3       0.50      0.47      0.48       772
           4       0.52      0.36      0.43       632
           5       0.46      0.32      0.38       603
           6       0.36      0.24      0.29       597

    accuracy                           0.50      7210
   macro avg       0.48      0.44      0.46      7210
weighted avg       0.49      0.50      0.49      7210



#7 Why am I using sigmoid? Changing to softmax.
Epoch 1/3

/opt/conda/lib/python3.10/site-packages/keras/src/backend.py:5562: UserWarning: "`categorical_crossentropy` received `from_logits=True`, but the `output` argument was produced by a Softmax activation and thus does not represent logits. Was this intended?
  output, from_logits = _get_logits(

902/902 [==============================] - 294s 302ms/step - loss: 1.4475 - balanced_accuracy: 0.4466 - val_loss: 1.3240 - val_balanced_accuracy: 0.4907
Epoch 2/3
902/902 [==============================] - 268s 298ms/step - loss: 1.2492 - balanced_accuracy: 0.5335 - val_loss: 1.3032 - val_balanced_accuracy: 0.5064
Epoch 3/3
902/902 [==============================] - 268s 297ms/step - loss: 1.1846 - balanced_accuracy: 0.5579 - val_loss: 1.2991 - val_balanced_accuracy: 0.5098

Epoch 1/3
226/226 [==============================] - 23s 89ms/step
              precision    recall  f1-score   support

           0       0.52      0.72      0.61      2114
           1       0.43      0.30      0.35      1324
           2       0.66      0.63      0.65      1168
           3       0.50      0.49      0.50       772
           4       0.44      0.42      0.43       632
           5       0.51      0.30      0.38       603
           6       0.35      0.32      0.34       597

    accuracy                           0.51      7210
   macro avg       0.49      0.45      0.46      7210
weighted avg       0.50      0.51      0.50      7210



#8? layers = 3
epochs = 10
learning rate = 5e-06
Very bad, canceled. Let me try again with a different learning rate.
layers = 3
epochs = 4
learning rate = 4e-05

902/902 [==============================] - 295s 302ms/step - loss: 1.4806 - balanced_accuracy: 0.4329 - val_loss: 1.3555 - val_balanced_accuracy: 0.4843
Epoch 2/4
902/902 [==============================] - 268s 297ms/step - loss: 1.3109 - balanced_accuracy: 0.5014 - val_loss: 1.3250 - val_balanced_accuracy: 0.4954
Epoch 3/4
902/902 [==============================] - 268s 298ms/step - loss: 1.2637 - balanced_accuracy: 0.5226 - val_loss: 1.3151 - val_balanced_accuracy: 0.4932
Epoch 4/4
902/902 [==============================] - 268s 297ms/step - loss: 1.2327 - balanced_accuracy: 0.5390 - val_loss: 1.3144 - val_balanced_accuracy: 0.4958

226/226 [==============================] - 23s 90ms/step
              precision    recall  f1-score   support

           0       0.55      0.62      0.58      2114
           1       0.38      0.37      0.38      1324
           2       0.60      0.69      0.64      1168
           3       0.46      0.52      0.49       772
           4       0.48      0.36      0.41       632
           5       0.47      0.27      0.34       603
           6       0.35      0.29      0.32       597

    accuracy                           0.50      7210
   macro avg       0.47      0.45      0.45      7210
weighted avg       0.49      0.50      0.49      7210


