# Text Formality Classification

## 1. Load the Dataset

We load the GYAFC corpus (Entertainment_Music domain) from Google Drive, then inspect a few sentences and the size of each split.

In [None]:
# Importing dataset
from google.colab import drive
drive.mount('/content/drive')

# Unzip dataset (paths are quoted because they contain spaces)
#!unzip "/content/drive/MyDrive/Text Formality Project/GYAFC_Corpus.zip" -d "/content/drive/MyDrive/Text Formality Project"

# Load dataset
!ls "/content/drive/MyDrive/NLP Project/GYAFC_Corpus/Entertainment_Music"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
model_outputs  test  train  tune



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Wed May 29 19:10:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
import os

base_path = '/content/drive/MyDrive/Text Formality Project/GYAFC_Corpus/Entertainment_Music'

# Function to read the sentences from a file
def load_sentences(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        sentences = file.readlines()
    return [s.strip() for s in sentences]  # Strip to remove any extra whitespace

# Paths to the files
formal_file_path = os.path.join(base_path, 'train', 'formal')
informal_file_path = os.path.join(base_path, 'train', 'informal')

test_formal_file_path = os.path.join(base_path, 'test', 'formal')
test_informal_file_path = os.path.join(base_path, 'test', 'informal')

# Load the sentences
formal_sentences = load_sentences(formal_file_path)
informal_sentences = load_sentences(informal_file_path)

test_formal_sentences = load_sentences(test_formal_file_path)
test_informal_sentences = load_sentences(test_informal_file_path)

In [None]:
# Check the first 2 loaded sentences
print("First 2 Informal Sentences:", informal_sentences[:2])
print("First 2 Formal Sentences:", formal_sentences[:2])

print("First 2 Test Informal Sentences:", test_informal_sentences[:2])
print("First 2 Test Formal Sentences:", test_formal_sentences[:2])

# Count the number of sentences
num_formal_sentences = len(formal_sentences)
num_informal_sentences = len(informal_sentences)

print("Number of Formal Sentences:", num_formal_sentences)
print("Number of Informal Sentences:", num_informal_sentences)

print("Number of Test Formal Sentences:", len(test_formal_sentences))
print("Number of Test Informal Sentences:", len(test_informal_sentences))

First 2 Informal Sentences: ['the movie The In-Laws not exactly a holiday movie but funny and good!', 'that page did not give me viroses(i think)']
First 2 Formal Sentences: ["The In-Laws movie isn't a holiday movie, but it's okay.", "I don't think that page gave me viruses."]
First 2 Test Informal Sentences: ['Is Any Baby Really A Freak.', 'aspen colorado has he best music festivals, you sit all over the moutians its  on and just hang out']
First 2 Test Formal Sentences: ['I like Rhythm and Blue music.', "There's nothing he needs to change."]
Number of Formal Sentences: 52595
Number of Informal Sentences: 52595
Number of Test Formal Sentences: 1082
Number of Test Informal Sentences: 1416


## 2. Preprocessing Data

We will use **character-based preprocessing**: all text is converted to a uniform case, tokenized at the character level, and padded to a fixed sequence length.

We chose this method because it is reported to give higher accuracy in the paper at https://arxiv.org/pdf/2204.08975.pdf.
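As a rough illustration of what character-level preprocessing involves, the pure-Python sketch below lowercases the text, assigns integer ids to characters by descending frequency (id 0 is reserved for padding), and right-pads each sequence to a fixed length. The helper names `build_char_index`, `encode`, and `pad_post` are hypothetical and only approximate the behaviour of Keras `Tokenizer(char_level=True)` and `pad_sequences(padding='post')`.

```python
from collections import Counter

def build_char_index(texts):
    """Map each character to an integer id; more frequent characters get smaller ids."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}  # 0 is reserved for padding

def encode(text, char_index):
    """Convert a string to a list of character ids, skipping unseen characters."""
    return [char_index[ch] for ch in text.lower() if ch in char_index]

def pad_post(seq, max_length, pad_id=0):
    """Right-pad (or cut from the end) so the sequence has exactly max_length ids.

    Simplification: Keras pad_sequences truncates from the front by default.
    """
    return seq[:max_length] + [pad_id] * max(0, max_length - len(seq))

texts = ["Hello!", "hey"]
char_index = build_char_index(texts)
padded = [pad_post(encode(t, char_index), max_length=8) for t in texts]
print(char_index)
print(padded)
```

The trailing zeros in each row play the same role as the zeros visible in the padded sequences printed by the notebook.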


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Combine the datasets
all_sentences = formal_sentences + informal_sentences

test_all_sentences = test_formal_sentences + test_informal_sentences

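# Initialize a character-level tokenizer and fit it on the training sentences
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(all_sentences)

# NOTE: fitting on the test sentences as well leaks test-set character statistics
# into preprocessing; drop this line for a strictly held-out evaluation.
tokenizer.fit_on_texts(test_all_sentences)

# Convert text to sequences of character ids
sequences = tokenizer.texts_to_sequences(all_sentences)
test_sequences = tokenizer.texts_to_sequences(test_all_sentences)

# Pad every sequence to the length of the longest training sentence
# (a smaller fixed max_length would make training much cheaper)
max_length = max(len(seq) for seq in sequences)
X_padded = pad_sequences(sequences, maxlen=max_length, padding='post')
test_X_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post')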

# Prepare labels
y = [1] * len(formal_sentences) + [0] * len(informal_sentences)  # 1 for formal, 0 for informal
test_y = [1] * len(test_formal_sentences) + [0] * len(test_informal_sentences)  # 1 for formal, 0 for informal

# Print the first 2 padded sequences and labels

print("Padded Sequences:\n", X_padded[:2])  # Show first two padded sequences
print("Labels:", y[:2])  # Show first two labels

# Model inputs (the labels y and test_y are already defined above)
X = X_padded
test_X = test_X_padded

Padded Sequences:
 [[ 3  9  2 ...  0  0  0]
 [ 6  1 12 ...  0  0  0]]
Labels: [1, 1]


## 3. Train our Models

### 3.1 BiLSTM Model - Baseline Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

import numpy as np

# Assuming X_padded is already defined as shown in previous steps
# Convert X_padded and y to NumPy arrays if they aren't already
X = np.array(X_padded)
y = np.array(y)

# Verify that X and y are now NumPy arrays
print(type(X), X.shape)
print(type(y), y.shape)


# Define the BiLSTM model
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50, input_length=max_length),
    Bidirectional(LSTM(units=50)),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Model summary
model.summary()

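# Train the model
# NOTE: validation_split=0.2 holds out the last 20% of the rows before any shuffling.
# X is ordered formal-then-informal, so this validation set contains only informal
# sentences; shuffle X and y together first for a representative split.
model.fit(X, y, epochs=3, batch_size=64, validation_split=0.2)  # Adjust epochs, batch_size, and validation_split as needed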


<class 'numpy.ndarray'> (105190, 3999)
<class 'numpy.ndarray'> (105190,)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3999, 50)          6350      
                                                                 
 bidirectional (Bidirection  (None, 100)               40400     
 al)                                                             
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 46851 (183.01 KB)
Trainable params: 46851 (183.01 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7c21f9747670>
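
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. Since `X` and `y` list every formal sentence first and every informal sentence second, the held-out 20% consists of a single class. The sketch below shows the idea of shuffling features and labels with one shared permutation before splitting; it uses toy data and a hypothetical helper `shuffle_together`, not the notebook's arrays.

```python
import random

def shuffle_together(features, labels, seed=42):
    """Return features and labels shuffled with the same permutation."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]

# Toy stand-in for the notebook's layout: all formal (1) rows first, then informal (0).
features = [f"sentence_{i}" for i in range(10)]
labels = [1] * 5 + [0] * 5

# Without shuffling, the last 20% (what validation_split=0.2 would hold out) is one class.
tail_before = labels[-2:]

shuffled_features, shuffled_labels = shuffle_together(features, labels)
print("Tail before shuffling:", tail_before)
print("Labels after shuffling:", shuffled_labels)
```

With NumPy arrays the same effect can be had with `idx = np.random.permutation(len(y))` followed by `X, y = X[idx], y[idx]` before calling `model.fit`.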

In [None]:

#Evaluate test predictions.

from sklearn.metrics import precision_score, recall_score, f1_score
predictions = model.predict(test_X)
binary_predictions = (predictions > 0.5).astype(int)

# Evaluate the model
precision = precision_score(test_y, binary_predictions)
recall = recall_score(test_y, binary_predictions)
f1 = f1_score(test_y, binary_predictions)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Precision: 0.6693174287607687
Recall: 0.933456561922366
F1 Score: 0.7796217676572753
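
To make the relationship between these three numbers concrete, the sketch below recomputes them from confusion-matrix counts. The counts (`tp`, `fp`, `fn`) are back-calculated from the reported scores and the 1082 formal test sentences; they are an inference for illustration, not something the notebook printed.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive (formal) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Inferred counts: tp + fn = 1082 formal test sentences.
tp, fp, fn = 1010, 499, 72
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

If these inferred counts are right, the high recall and lower precision mean the model rarely misses a formal sentence but labels a sizeable share of informal sentences as formal.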


### 3.2 DeBERTa

In [None]:


import sentencepiece  # required by the DeBERTa-v3 tokenizer
import torch
from torch.utils.data import Dataset
from transformers import DebertaV2Model, DebertaV2Config, DebertaV2Tokenizer, DebertaV2ForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score


MODEL_NAME = 'microsoft/deberta-v3-base'
model_bert = DebertaV2ForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)  # two classes: informal (0) and formal (1)
config = DebertaV2Config.from_pretrained(MODEL_NAME)
tokenizer = DebertaV2Tokenizer.from_pretrained(MODEL_NAME)
# Prepare dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=64)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Split the training data 80/20 into train and validation sets;
# each split is tokenized with the DeBERTa tokenizer inside TextDataset
X_train, X_val, y_train, y_val = train_test_split(all_sentences, y, test_size=0.2, random_state=42)


train_dataset = TextDataset(X_train, y_train)
eval_dataset = TextDataset(X_val, y_val)
test_dataset = TextDataset(test_all_sentences, test_y)
#! pip install --force-reinstall accelerate transformers[torch]

# Free cached GPU memory before training
import gc
gc.collect()
torch.cuda.empty_cache()



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'classifier.bias', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
#Training
# !pip install transformers[torch]
#!pip install accelerate -U
#! pip install --force-reinstall accelerate transformers[torch]


def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


training_args = TrainingArguments(
    output_dir='./deBerta-results',          # where to save the model
    evaluation_strategy="epoch",     # evaluate each epoch
    save_strategy="epoch",           # save model each epoch
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    num_train_epochs=3,              # number of epochs
    weight_decay=0.01,               # weight decay
    logging_dir='./logs',            # where to store logs
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model_bert,
    args=training_args,
    train_dataset=train_dataset,  # encoded and prepared training dataset
    eval_dataset=eval_dataset,    # encoded and prepared validation dataset
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # a patience of 3 cannot trigger within 3 epochs
)

# Train the model
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2047 | 0.211942 | 0.933359 | 0.94218 | 0.9248 | 0.933409 |
| 2 | 0.1821 | 0.250059 | 0.934499 | 0.920816 | 0.952188 | 0.936239 |
| 3 | 0.1437 | 0.281912 | 0.93583 | 0.923245 | 0.952094 | 0.937448 |


TrainOutput(global_step=31557, training_loss=0.1764912257747709, metrics={'train_runtime': 6322.4194, 'train_samples_per_second': 39.93, 'train_steps_per_second': 4.991, 'total_flos': 8303144478603264.0, 'train_loss': 0.1764912257747709, 'epoch': 3.0})
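
The `global_step` above can be tied back to the data sizes with a little arithmetic, which also shows which saved checkpoint corresponds to which epoch. The sketch assumes a single GPU and no gradient accumulation, and relies on the `Trainer` convention of naming checkpoint folders `checkpoint-<global_step>`; whether every folder is still on disk depends on settings not shown here.

```python
import math

total_sentences = 105190                 # formal + informal training sentences
val_examples = total_sentences // 5      # test_size=0.2 in train_test_split
train_examples = total_sentences - val_examples
batch_size = 8                           # per_device_train_batch_size

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = 3 * steps_per_epoch        # num_train_epochs=3

# Validation loss per epoch, copied from the training log.
val_loss_by_epoch = {1: 0.211942, 2: 0.250059, 3: 0.281912}
best_epoch = min(val_loss_by_epoch, key=val_loss_by_epoch.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"

print("Steps per epoch:", steps_per_epoch)
print("Total steps:", total_steps)
print("Lowest validation loss at epoch", best_epoch, "->", best_checkpoint)
```

Because `load_best_model_at_end=True` defaults to validation loss as the selection metric, the model held by `trainer` after training is the epoch with the lowest loss, which is not necessarily the last checkpoint.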

In [None]:
# checkpoint-31557 is the final (epoch 3) checkpoint; the lowest validation loss was at epoch 1
deberta_final_model = DebertaV2ForSequenceClassification.from_pretrained('./deBerta-results/checkpoint-31557')
test_trainer_deberta = Trainer(deberta_final_model)

# Predict on the test set and take the highest-scoring class for each sentence
deberta_raw_pred, _, _ = test_trainer_deberta.predict(test_dataset)
deberta_y_pred = np.argmax(deberta_raw_pred, axis=1)

deberta_final_model.eval()


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine

In [None]:
import pandas as pd
data = {
    'Sentence': test_all_sentences,  # Adjust field name as necessary
    'Predicted Label': deberta_y_pred,
    'Actual Label': test_y  # Adjust field name as necessary
}
df = pd.DataFrame(data)


In [None]:
df_shuffled = df.sample(frac=1).reset_index(drop=True)
print(df_shuffled.head())


                                            Sentence  Predicted Label  Actual Label
0  my favorite english song is kiss from a rose b...                0             0
1                  neither...not a big fan of either                0             0
2  I am not fond of any of them. Panget Rock Band...                1             1
3           Goo Goo Dolls and Relient K are awesome!                1             0
4              Cheesy I know...but I love this joke.                0             0


### 3.3 BERT (uncased)

In [None]:

import gc

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from torch.utils.data import Dataset
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback

# Free cached GPU memory left over from the previous model
gc.collect()
torch.cuda.empty_cache()

bert_uncased = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(bert_uncased)
bert_uncased_model = BertForSequenceClassification.from_pretrained(bert_uncased, num_labels=2)

# Prepare dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=64)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Split the training data 80/20 into train and validation sets;
# each split is tokenized with the BERT tokenizer inside TextDataset
X_train, X_val, y_train, y_val = train_test_split(all_sentences, y, test_size=0.2, random_state=42)


train_dataset = TextDataset(X_train, y_train)
eval_dataset = TextDataset(X_val, y_val)
test_dataset = TextDataset(test_all_sentences, test_y)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:

#! pip install --force-reinstall accelerate transformers[torch]

# Define Trainer parameters
# Try using a different GPU or a different version of the CUDA toolkit
#!nvidia-smi
#!pip install torch==1.13.1+cu117
#!pip install transformers==4.31.0
#!pip install accelerate -U#

def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Define Trainer
args = TrainingArguments(
    output_dir="./bert-uncased",
    evaluation_strategy="epoch",     # evaluate each epoch
    save_strategy="epoch",           # save model each epoch
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    seed=0,
    load_best_model_at_end=True,
)
trainer_bert_uncased = Trainer(
    model=bert_uncased_model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

# Train pre-trained model
trainer_bert_uncased.train()



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.3818 | 0.355075 | 0.869759 | 0.849172 | 0.9024 | 0.874977 |
| 2 | 0.3145 | 0.358003 | 0.87784 | 0.851716 | 0.917929 | 0.883584 |
| 3 | 0.2688 | 0.400761 | 0.881738 | 0.873565 | 0.895435 | 0.884365 |


TrainOutput(global_step=31557, training_loss=0.33048393975896223, metrics={'train_runtime': 3822.8639, 'train_samples_per_second': 66.038, 'train_steps_per_second': 8.255, 'total_flos': 8302995573995520.0, 'train_loss': 0.33048393975896223, 'epoch': 3.0})

In [None]:
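# Final (epoch 3) checkpoint, loaded for reference; it is not used for the predictions below
model_path = "./bert-uncased/checkpoint-31557"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=2)

# Predictions come from trainer_bert_uncased, which holds the checkpoint with the
# lowest validation loss because load_best_model_at_end=True
raw_pred, _, _ = trainer_bert_uncased.predict(test_dataset)
y_pred = np.argmax(raw_pred, axis=1)
print(y_pred)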



[1 1 1 ... 0 0 0]


In [None]:
import pandas as pd

data = {
    'Sentence': test_all_sentences,  # Adjust field name as necessary
    'Predicted Label': y_pred,
    'Actual Label': test_y  # Adjust field name as necessary
}
df = pd.DataFrame(data)

df_shuffled = df.sample(frac=1).reset_index(drop=True)
print(df_shuffled.head())


                                            Sentence  Predicted Label  Actual Label
0  My little brother would ask a question like that.                1             1
1  he fired his sister because HE cant keep his m...                0             0
2          Note: i am not looking for aladdin movie.                1             0
3  Especially in regard to the chicken's pursuit ...                1             1
4  If it is an old car then you should roll the w...                1             1


### 3.4 Simple CNN

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Conv1D, GlobalMaxPooling1D, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.preprocessing.text import Tokenizer

# `tokenizer` was reassigned to the BERT tokenizer above, so rebuild the character-level
# tokenizer to recover the vocabulary size (X_padded itself was encoded in Section 2)
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(all_sentences)

# Build model
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 256

cnn_model = Sequential()
cnn_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
cnn_model.add(Conv1D(64, 3, activation='relu'))
cnn_model.add(GlobalMaxPooling1D())
cnn_model.add(Dense(32, activation='relu'))
cnn_model.add(Dropout(0.5))
cnn_model.add(Dense(1, activation='sigmoid'))

optimizer = Adam(learning_rate=0.001)

cnn_model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=["accuracy", Precision(), Recall()])

In [None]:

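# Train model
# NOTE: validation_split=0.2 takes the last 20% of the rows before shuffling,
# which are all informal sentences given how X and y were concatenated
X = np.array(X_padded)
y = np.array(y)
cnn_model.fit(X, y, batch_size=16, epochs=3, validation_split=0.2)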

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7c21f8e61840>

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
predictions = cnn_model.predict(test_X)
binary_predictions = (predictions > 0.5).astype(int)

# Evaluate the model
precision = precision_score(test_y, binary_predictions)
recall = recall_score(test_y, binary_predictions)
f1 = f1_score(test_y, binary_predictions)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Precision: 0.6909090909090909
Recall: 0.9482439926062847
F1 Score: 0.7993767043241138


In [None]:
cnn_model.summary()  # summary() prints the table itself and returns None

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 3999, 256)         32512     
                                                                 
 conv1d (Conv1D)             (None, 3997, 64)          49216     
                                                                 
 global_max_pooling1d (Glob  (None, 64)                0         
 alMaxPooling1D)                                                 
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
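
The parameter counts in this summary can be checked by hand, which is a useful sanity test of the layer shapes. The sketch below redoes the arithmetic; the vocabulary size of 127 is inferred from 32512 / 256 rather than printed by the notebook, and the total is simply the sum of the layers listed above.

```python
vocab_size = 127          # len(tokenizer.word_index) + 1, inferred from 32512 / 256
embedding_dim = 256
seq_len = 3999            # max_length used for padding
filters, kernel_size = 64, 3
hidden_units = 32

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Conv1D: each filter has kernel_size * embedding_dim weights plus one bias.
conv_params = kernel_size * embedding_dim * filters + filters
conv_out_len = seq_len - kernel_size + 1   # 'valid' padding, stride 1

# Dense layers: inputs * outputs weights plus one bias per output.
dense1_params = filters * hidden_units + hidden_units
dense2_params = hidden_units * 1 + 1

total_params = embedding_params + conv_params + dense1_params + dense2_params
print(embedding_params, conv_params, conv_out_len, dense1_params, dense2_params)
print("Sum of listed layers:", total_params)
```

Pooling and dropout layers contribute no parameters, which matches the zeros in their rows.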
                                                      

## 4. Formality Score Calculations

In [None]:
# Use the fine-tuned DeBERTa model to predict formality scores
tokenizer = DebertaV2Tokenizer.from_pretrained(MODEL_NAME)  # tokenizer matching the fine-tuned checkpoint
model = DebertaV2ForSequenceClassification.from_pretrained('./deBerta-results/checkpoint-31557')
test_trainer = Trainer(model)

# test_dataset was last built with the BERT tokenizer in Section 3.3,
# so rebuild it with the DeBERTa tokenizer before predicting
test_dataset = TextDataset(test_all_sentences, test_y)
raw_pred, _, _ = test_trainer.predict(test_dataset)
y_pred = np.argmax(raw_pred, axis=1)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
test_50_sentences = df_shuffled["Sentence"].tolist()[:50]
test_50_labels = df_shuffled["Actual Label"].tolist()[:50]
test_y_pred = df_shuffled["Predicted Label"].tolist()[:50]

In [None]:
model.eval()


In [None]:

tokenizer = DebertaV2Tokenizer.from_pretrained(MODEL_NAME)
inputs = tokenizer(test_50_sentences, return_tensors="pt", padding=True, truncation=True, max_length=512)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # keep the model and its inputs on the same device
y_pred = y_pred[:50]
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Softmax over the two classes; column 1 is the probability of "formal"
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    formality_scores = probabilities[:, 1].tolist()  # List of formality scores for each sentence


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
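
The formality score is the softmax probability of class 1 (formal). A minimal pure-Python sketch of that computation follows; the logits are hypothetical values chosen for illustration and do not come from the fine-tuned model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def formality_score(logits):
    """Probability of class 1 (formal) given [informal_logit, formal_logit]."""
    return softmax(logits)[1]

# Hypothetical logits, for illustration only.
examples = {
    "confidently formal": [-3.0, 4.0],
    "confidently informal": [4.5, -4.0],
    "uncertain": [0.1, 0.1],
}
for name, logits in examples.items():
    print(f"{name}: {formality_score(logits):.4f}")
```

For two classes this is the same as a sigmoid applied to the difference between the formal and informal logits, so a large gap in either direction pushes the score towards 1 or 0.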


In [None]:
# "Original Prediction" prints the actual label; "Model Prediction" prints the label stored
# in df_shuffled, which was last built from the BERT predictions in Section 3.3
for text, score, actual_label, predicted_label in zip(test_50_sentences, formality_scores, test_50_labels, test_y_pred):
    print(f"Text: {text}\nFormality Score: {score:.4f}\nOriginal Prediction: {actual_label}\nModel Prediction: {predicted_label}\n")

# 1 is formal and 0 is informal: the closer the score is to 1, the more formal the sentence; the closer it is to 0, the more informal.

Text: My little brother would ask a question like that.
Formality Score: 0.9905
Original Prediction: 1
Model Prediction: 1

Text: he fired his sister because HE cant keep his mouth shut!
Formality Score: 0.0002
Original Prediction: 0
Model Prediction: 0

Text: Note: i am not looking for aladdin movie.
Formality Score: 0.0009
Original Prediction: 0
Model Prediction: 1

Text: Especially in regard to the chicken's pursuit of the man during the conclusion!
Formality Score: 0.9999
Original Prediction: 1
Model Prediction: 1

Text: If it is an old car then you should roll the window down; otherwise, unlock the door.
Formality Score: 1.0000
Original Prediction: 1
Model Prediction: 1

Text: it might be an interesting show, but never got into it at all.
Formality Score: 0.0008
Original Prediction: 0
Model Prediction: 1

Text: I do not know who originally sang the song, Cry Me A River.
Formality Score: 1.0000
Original Prediction: 1
Model Prediction: 1

Text: Where in the world do you come up with