<a href="https://colab.research.google.com/github/JacobJ215/Sentiment-Analysis-with-DistilBERT/blob/main/fine_tuning_distilbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning DistilBERT on Amazon Polarity

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece] -q

In [None]:
import numpy as np
from datasets import load_dataset

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay

from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer, DataCollatorWithPadding

In [None]:
# Load a subset of the 'amazon_polarity' dataset:
# the first 20,000 training reviews and the first 2,000 test reviews
amazon_train = load_dataset('amazon_polarity', split='train[:20000]')
amazon_test = load_dataset('amazon_polarity', split='test[:2000]')

print("Train Dataset : ", amazon_train.shape)
print("Test Dataset : ", amazon_test.shape)

Train Dataset :  (20000, 3)
Test Dataset :  (2000, 3)


In [None]:
amazon_train

Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 20000
})

In [None]:
# Initialize the DistilBERT tokenizer and a 2-class sequence classification model
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [None]:
# Tokenize the review text; padding is left to the data collator (dynamic padding)
def preprocess_data(data):
    return tokenizer(data['content'], truncation=True)

In [None]:
# Tokenize text
tokenized_datasets = amazon_train.map(preprocess_data, batched=True)
tokenized_test_datasets = amazon_test.map(preprocess_data, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

# Create datasets
tf_train_dataset = tokenized_datasets.to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'label'],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_test_datasets.to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'label'],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)


Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
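
`DataCollatorWithPadding` pads each batch only up to the length of its longest sequence, which is why `preprocess_data` truncates but does not pad. The pure-Python sketch below imitates that behaviour on made-up token ids. `pad_batch` is a hypothetical helper written for illustration (it is not part of Transformers), and the pad id of 0 is assumed to match the `[PAD]` token of `distilbert-base-uncased`.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad variable-length lists of token ids to the longest list in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their ids; padding positions get pad_token_id
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids for illustration only, not real tokenizer output
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padded length depends on the batch, short reviews grouped together cost less compute than they would if every example were padded to the model's maximum length.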


In [None]:
# Create a learning rate scheduler (PolynomialDecay with its default power of 1.0 decays linearly)
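num_epochs = 2  # keep in sync with the epochs argument of model.fit
# len() of the batched dataset is the number of batches (optimizer steps) per epoch
num_train_steps = len(tf_train_dataset) * num_epochs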

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)
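
To make the schedule concrete, here is a minimal pure-Python sketch of the formula that Keras documents for `PolynomialDecay` (with `cycle=False`). The function name `polynomial_decay` and the toy value of 1,000 decay steps are assumptions for illustration. The notebook's real step count comes from `num_train_steps`.

```python
import math


def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1000, power=1.0):
    """Learning rate at a given optimizer step (power=1.0 is a straight line)."""
    step = min(step, decay_steps)  # the rate stays at end_lr once decay_steps is reached
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


# Toy schedule: inspect the rate at the start, midpoint, end, and beyond the end
for s in (0, 500, 1000, 2000):
    print(s, polynomial_decay(s))
```

If training runs for fewer steps than `decay_steps`, the learning rate never reaches `end_learning_rate`, so the epoch count used for the schedule should match the one used for training.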



In [None]:
# The model returns raw logits, so the loss applies softmax itself (from_logits=True)
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=opt, metrics=['accuracy'])

model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset,
          epochs=2)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7bb14dfa0190>
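
As a rough illustration of what the loss computes for a single example, the sketch below evaluates sparse categorical cross-entropy directly from logits using a numerically stable log-sum-exp. `sparse_ce_from_logits` is a hypothetical helper and the logits are made up. Keras additionally averages this value over the batch.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Made-up logits in the order [negative, positive]
logits = [2.0, -1.0]
print(sparse_ce_from_logits(logits, 0))  # small loss: the model favours class 0
print(sparse_ce_from_logits(logits, 1))  # large loss: the model is confidently wrong
```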

In [None]:
preds = model.predict(tf_validation_dataset)["logits"]
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)

(2000, 2) (2000,)
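
Since the validation dataset is built with `shuffle=False`, the predictions line up with the rows of `amazon_test`, so they can be compared against the `label` column. Below is a minimal sketch of that comparison on toy values. The `accuracy` helper is hypothetical, and in the notebook one would presumably pass `class_preds` and `amazon_test["label"]` instead of the toy lists.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


# Toy values for illustration only, not results from the fine-tuned model
toy_preds = [1, 0, 1, 1]
toy_labels = [1, 0, 0, 1]
print(accuracy(toy_preds, toy_labels))  # 0.75
```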


In [None]:
# Predict a class for each test review (0 = negative, 1 = positive)
class_preds = np.argmax(model.predict(tf_validation_dataset)["logits"], axis=1)

# Retrieve the original sentences
original_sentences = amazon_test["content"]

# Create lists to store positive and negative sentences
positive_sentences = []
negative_sentences = []

for i, pred in enumerate(class_preds):
    if pred == 1:
        positive_sentences.append(original_sentences[i])
    else:
        negative_sentences.append(original_sentences[i])

# Print up to 5 reviews predicted as positive (slicing avoids an IndexError on short lists)
print("Positive Sentences:")
for sentence in positive_sentences[:5]:
    print(sentence)
    print("\n")

Positive Sentences:
My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"


Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the 

In [None]:
# Print up to 5 reviews predicted as negative
print("Negative Sentences:")
for sentence in negative_sentences[:5]:
    print(sentence)
    print("\n")

Negative Sentences:
I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power.


I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot.


I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd

In [None]:
def predict_sentiment(input_text):
    # Preprocess input text
    inputs = tokenizer(input_text, truncation=True, padding=True, return_tensors='tf')

    # Get model prediction
    logits = model(inputs)["logits"]
    predicted_class = np.argmax(logits, axis=1)[0]

    # Determine sentiment label
    sentiment_label = "positive" if predicted_class == 1 else "negative"

    return sentiment_label

# Example usage
input_text = "This is a horrible product it sucks"
predicted_sentiment = predict_sentiment(input_text)
print(f"The sentiment is {predicted_sentiment}")

The sentiment is negative
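
`predict_sentiment` returns only a label. If a confidence score is also wanted, the logits can be passed through a softmax first. The following is a minimal pure-Python sketch on made-up logits. The `softmax` helper is written here for illustration, and the class order (index 0 negative, index 1 positive) follows the mapping used in `predict_sentiment`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits in the order [negative, positive], not real model output
probs = softmax([-2.0, 3.0])
label = "positive" if probs[1] > probs[0] else "negative"
print(f"The sentiment is {label} (confidence {max(probs):.3f})")
```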


In [None]:
model.save_pretrained('fine_tuned_distilbert')

In [None]:
from google.colab import files
!zip -r /content/fine_tuned_distilbert.zip /content/fine_tuned_distilbert/
files.download('/content/fine_tuned_distilbert.zip')

  adding: content/fine_tuned_distilbert/ (stored 0%)
  adding: content/fine_tuned_distilbert/tf_model.h5 (deflated 8%)
  adding: content/fine_tuned_distilbert/config.json (deflated 44%)

