
In [1]:
!pip install -U transformers datasets evaluate --quiet

## Installs the Hugging Face Transformers, Datasets, and Evaluate libraries.
## -U upgrades each package to its latest version.
## --quiet reduces pip's console output (errors are still shown).

In [2]:
import os
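## Disable Weights & Biases logging. WANDB_DISABLED is deprecated; report_to="none" in TrainingArguments does the same job.
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "offline"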

In [3]:
## Hugging Face libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

## numerical libraries
import numpy as np
import torch

In [4]:
dataset = load_dataset("amazon_polarity")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
print("Type of dataset: ", type(dataset))
print("Structure of dataset: \n", str(dataset))

print("\n First record in training dataset:")
dataset["train"][0]

## The dataset has 2 splits: train and test.
## Each record is a dictionary with 3 features: label, title, and content.
## The train split has 3,600,000 records; the test split has 400,000.

Type of dataset:  <class 'datasets.dataset_dict.DatasetDict'>
Structure of dataset: 
 DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })
})

 First record in training dataset:


{'label': 1,
 'title': 'Stuning even for the non-gamer',
 'content': 'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'}

In [6]:
model_name = "distilbert-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, token=False)

## I'm adding a classification head with 2 outputs to this model, because I want only 2 outcome sentiments.
## For DistilBERT the head is two new layers: pre_classifier, Linear(hidden_size → hidden_size), and classifier, Linear(hidden_size → num_labels).
## Note that this does not freeze the pretrained parameters (weights and biases). To train only the head, they have to be frozen separately, which might speed up training.

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
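
The comment in this cell notes that the pretrained weights are not frozen by default. Below is a minimal sketch of how freezing could look. It assumes a tiny, randomly initialized DistilBERT (`tiny_config` and `demo_model` are made-up names and sizes) so that it runs without downloading a checkpoint; in this notebook the same loop would presumably be applied to `model` before building the `Trainer`.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Assumption: a tiny, randomly initialized DistilBERT (made-up sizes) stands in
# for the pretrained checkpoint so that nothing has to be downloaded.
tiny_config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2, num_labels=2
)
demo_model = DistilBertForSequenceClassification(tiny_config)

# Freeze the transformer body; the head (pre_classifier, classifier) stays trainable.
for param in demo_model.distilbert.parameters():
    param.requires_grad = False

# List the parameters that would still receive gradient updates.
trainable = [name for name, p in demo_model.named_parameters() if p.requires_grad]
print(trainable)
```

With the body frozen, each step computes far fewer gradients, though accuracy may end up lower than with full fine-tuning.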


In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name, token=False)

def tokenize_function(examples):
    # Combine title + content for input
    texts = [t + " " + c for t, c in zip(examples["title"], examples["content"])]
    return tokenizer(texts, padding="max_length", truncation=True, max_length=128)
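
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch. It is not the DistilBERT tokenizer: `toy_encode` is a hypothetical helper that splits on whitespace and skips the WordPiece step and the special `[CLS]`/`[SEP]` tokens. It only shows how a batch (a dictionary of lists) is combined and then cut or padded to a fixed length.

```python
def combine_fields(examples):
    # Same combination step as tokenize_function: title, a space, then content.
    return [t + " " + c for t, c in zip(examples["title"], examples["content"])]

def toy_encode(text, max_length, pad_token="[PAD]"):
    # Hypothetical stand-in for the real tokenizer: whitespace split only.
    tokens = text.lower().split()[:max_length]            # truncation
    n_pad = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad       # 1 = real token, 0 = padding
    return tokens + [pad_token] * n_pad, attention_mask    # padding to max_length

# Made-up batch in the dictionary-of-lists shape that map(batched=True) passes in.
batch = {
    "title": ["Great sound", "Bad"],
    "content": ["Loved every track on this album", "Broke"],
}
texts = combine_fields(batch)
encoded = [toy_encode(t, max_length=6) for t in texts]
for tokens, mask in encoded:
    print(tokens, mask)
```

The first review is longer than 6 words and gets cut; the second is shorter and gets padded, with zeros in the attention mask marking the padding.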

In [8]:
train_subset = dataset["train"].shuffle(seed=20).select(range(2000))   ## taking 2K records from 3.6M
test_subset = dataset["test"].shuffle(seed=20).select(range(500))      ## taking 500 records from 400K

In [9]:
tokenized_train = train_subset.map(tokenize_function, batched=True)
tokenized_test = test_subset.map(tokenize_function, batched=True)


In [10]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
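
A small worked example of what `compute_metrics` does, using made-up logits for four reviews. The accuracy here is computed by hand with NumPy rather than with `evaluate`, so the arithmetic is visible; the column order (negative, positive) assumes the dataset's label convention of 0 for negative and 1 for positive.

```python
import numpy as np

# Made-up logits for 4 reviews; columns are [negative, positive].
logits = np.array([
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
    [-1.5, 2.2],
])
labels = np.array([0, 1, 1, 1])  # made-up reference labels

# The predicted class is the index of the larger logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
manual_accuracy = float((predictions == labels).mean())
print(predictions, manual_accuracy)
```

The third review is the only miss: its logits favor class 0 while its label is 1, so 3 of 4 predictions are correct.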

In [11]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    logging_dir="./logs",
    save_strategy="no",  # to save time
    load_best_model_at_end=False,
    report_to="none",
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [12]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

In [13]:
trainer.train()

Epoch  Training Loss  Validation Loss  Accuracy
1      No log         0.256357         0.896
2      No log         0.302448         0.908


TrainOutput(global_step=250, training_loss=0.25661209106445315, metrics={'train_runtime': 2693.5518, 'train_samples_per_second': 1.485, 'train_steps_per_second': 0.093, 'total_flos': 132467398656000.0, 'train_loss': 0.25661209106445315, 'epoch': 2.0})
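
The numbers in `TrainOutput` can be reproduced from the settings used earlier. The sketch below assumes a single device and no gradient accumulation, so one optimizer step consumes exactly one batch of 16 examples.

```python
import math

num_train_examples = 2000     # size of train_subset
batch_size = 16               # per_device_train_batch_size
num_epochs = 2                # num_train_epochs
train_runtime_s = 2693.5518   # 'train_runtime' reported in TrainOutput

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This gives 125 steps per epoch and 250 steps in total, matching `global_step=250`, and the two rates round to the reported 1.485 and 0.093. The "No log" entries for training loss are consistent with this count: assuming the Trainer's default `logging_steps` of 500, a 250-step run never reaches a logging step.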

In [14]:
model.save_pretrained("yk_model")
tokenizer.save_pretrained("yk_model")

('yk_model/tokenizer_config.json',
 'yk_model/special_tokens_map.json',
 'yk_model/vocab.txt',
 'yk_model/added_tokens.json',
 'yk_model/tokenizer.json')
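
A directory written by `save_pretrained` can be loaded back with `from_pretrained`. The sketch below checks that round trip on a tiny, randomly initialized DistilBERT (made-up sizes and token ids, saved to a temporary directory) so that it runs without the fine-tuned weights; for the model saved here, the directory would be `"yk_model"` instead.

```python
import tempfile

import torch
from transformers import (
    AutoModelForSequenceClassification,
    DistilBertConfig,
    DistilBertForSequenceClassification,
)

# Assumption: a tiny random model (made-up sizes) stands in for the fine-tuned one.
config = DistilBertConfig(
    vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2, num_labels=2
)
original = DistilBertForSequenceClassification(config).eval()

dummy_ids = torch.tensor([[1, 5, 7, 2]])  # made-up token ids, one sequence of length 4

# Save to a temporary directory and load the weights back from it.
with tempfile.TemporaryDirectory() as save_dir:
    original.save_pretrained(save_dir)
    reloaded = AutoModelForSequenceClassification.from_pretrained(save_dir).eval()

# Both models should produce the same logits for the same input.
with torch.no_grad():
    logits_before = original(input_ids=dummy_ids).logits
    logits_after = reloaded(input_ids=dummy_ids).logits

same_outputs = torch.allclose(logits_before, logits_after)
print(same_outputs)
```

Calling `.eval()` turns off dropout, which is what makes the two sets of logits comparable.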

In [15]:
!zip -r yk_model.zip yk_model

  adding: yk_model/ (stored 0%)
  adding: yk_model/config.json (deflated 45%)
  adding: yk_model/tokenizer_config.json (deflated 75%)
  adding: yk_model/tokenizer.json (deflated 71%)
  adding: yk_model/vocab.txt (deflated 53%)
  adding: yk_model/model.safetensors (deflated 8%)
  adding: yk_model/special_tokens_map.json (deflated 42%)
