<a href="https://colab.research.google.com/github/anddennn/IAT360_TravelCompanyComparison_NLPProject/blob/main/NLP_FinalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

## Set up Python Libraries

In [None]:
#install some Python packages with pip

!pip install optuna nltk numpy torch datasets transformers requests beautifulsoup4 pandas evaluate scikit-learn --quiet

In [None]:
# let's check the version we are using

!pip freeze | grep -E '^numpy|^torch|^datasets|^transformers|^evaluate'

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup ## for scraping
import re

In [None]:
import numpy as np
import pandas as pd
from datasets import Dataset

## Instantiate Model

In [None]:
# let's load the pretrained fast tokenizer from Hugging Face
# source: (https://huggingface.co/distilbert-base-uncased)

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
tokenizer

## Scrape for Trust Pilot Reviews

Scrape the first three review pages for Expedia, TripAdvisor, Booking.com, Airline Ticket Centre, and Gala Travels from ca.trustpilot.com.

In [None]:
def soup2list(src, list_, attr=None):
    if attr:
        for val in src:
            list_.append(val[attr])
    else:
        for val in src:
            list_.append(val.get_text())

In [None]:
reviews = []
ratings = []

for i in range(1, 4): # Loop through pages 1 to 3
    url = f'https://ca.trustpilot.com/review/www.expedia.com?page={i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser') # Use r.text and html.parser for robust parsing

    # Find review headers with class containing 'styles_reviewHeader'
    # and extract 'data-service-review-rating' attribute
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewHeader.*')}), ratings, attr='data-service-review-rating')

    # Find review content divs with class containing 'styles_reviewContent'
    # and extract their text content
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewContent.*')}), reviews)


In [None]:
for i in range(1, 4): # Loop through pages 1 to 3
    url = f'https://ca.trustpilot.com/review/www.tripadvisor.ca?page={i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser') # Use r.text and html.parser for robust parsing

    # Find review headers with class containing 'styles_reviewHeader'
    # and extract 'data-service-review-rating' attribute
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewHeader.*')}), ratings, attr='data-service-review-rating')

    # Find review content divs with class containing 'styles_reviewContent'
    # and extract their text content
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewContent.*')}), reviews)


In [None]:
for i in range(1, 4): # Loop through pages 1 to 3
    url = f'https://ca.trustpilot.com/review/www.booking.com?page={i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser') # Use r.text and html.parser for robust parsing

    # Find review headers with class containing 'styles_reviewHeader'
    # and extract 'data-service-review-rating' attribute
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewHeader.*')}), ratings, attr='data-service-review-rating')

    # Find review content divs with class containing 'styles_reviewContent'
    # and extract their text content
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewContent.*')}), reviews)


In [None]:
for i in range(1, 4): # Loop through pages 1 to 3
    url = f'https://ca.trustpilot.com/review/www.airlineticketcentre.ca?page={i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser') # Use r.text and html.parser for robust parsing

    # Find review headers with class containing 'styles_reviewHeader'
    # and extract 'data-service-review-rating' attribute
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewHeader.*')}), ratings, attr='data-service-review-rating')

    # Find review content divs with class containing 'styles_reviewContent'
    # and extract their text content
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewContent.*')}), reviews)


In [None]:
for i in range(1, 4): # Loop through pages 1 to 3
    url = f'https://ca.trustpilot.com/review/galatravels.com?page={i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser') # Use r.text and html.parser for robust parsing

    # Find review headers with class containing 'styles_reviewHeader'
    # and extract 'data-service-review-rating' attribute
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewHeader.*')}), ratings, attr='data-service-review-rating')

    # Find review content divs with class containing 'styles_reviewContent'
    # and extract their text content
    soup2list(soup.find_all('div', {'class': re.compile(r'.*styles_reviewContent.*')}), reviews)


review_data = pd.DataFrame(
{
   'text':reviews,
   'label': ratings
})
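
The five scraping cells above differ only in the company part of the URL. As a minimal sketch (the names `BASE_URL`, `COMPANY_SLUGS`, and `review_page_urls` are hypothetical and not part of the original notebook), the page URLs could be generated in one place and then fetched in a single loop:

```python
# Hypothetical helper: builds the same Trustpilot page URLs used in the cells above.
BASE_URL = "https://ca.trustpilot.com/review"

# Path segments copied from the scraping cells above.
COMPANY_SLUGS = [
    "www.expedia.com",
    "www.tripadvisor.ca",
    "www.booking.com",
    "www.airlineticketcentre.ca",
    "galatravels.com",
]


def review_page_urls(slugs, first_page=1, last_page=3):
    """Return one URL per (company, page) pair; the page range is inclusive."""
    return [
        f"{BASE_URL}/{slug}?page={page}"
        for slug in slugs
        for page in range(first_page, last_page + 1)
    ]


urls = review_page_urls(COMPANY_SLUGS)
print(len(urls))  # 5 companies x 3 pages = 15
print(urls[0])
```

Each URL in `urls` could then go through the same `requests.get` and `soup2list` steps shown above.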


In [None]:
review_data

Subtract 1 from every rating so that the labels run from 0 to 4 instead of 1 to 5, which is the range the classifier expects during training.

In [None]:
# Convert the 'label' column to numeric type (it's currently a string from scraping)
review_data['label'] = pd.to_numeric(review_data['label'])

# Subtract 1 from the 'label' column
review_data['label'] = review_data['label'] - 1

# Display the updated DataFrame head to confirm the change
print(review_data.head())
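
To make the shift concrete, here is a small pure-Python sketch of the same conversion. The sample ratings are made up for illustration, and the variable names are not from the notebook:

```python
# Ratings come out of the HTML as strings such as "1" to "5".
raw_ratings = ["5", "1", "3"]  # made-up sample values

# Shift to zero-based class ids: "1" -> 0, ..., "5" -> 4.
class_ids = [int(rating) - 1 for rating in raw_ratings]

# Going back from a class id to a star rating is just the reverse shift.
stars = [class_id + 1 for class_id in class_ids]

print(class_ids)  # [4, 0, 2]
print(stars)      # [5, 1, 3]
```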

## Load Reviews into DataFrame and Make csv

In [None]:
# Save the raw reviews to a CSV file
import pandas as pd
review_data.to_csv('trustpilot_reviews.csv', index=False)
print(f"DataFrame with {len(review_data)} entries (text and label) saved to 'trustpilot_reviews.csv'")

## Tokenize and Split Dataset

Define a function that tokenizes the dataset.

In [None]:
# Preprocessing function: tokenize each review with the tokenizer loaded above
def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",  # Ensures uniform input size
        max_length=128  # Adjust based on task
    )
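
The effect of `truncation=True` and `padding="max_length"` can be pictured with a simplified sketch. This is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, a pad id of 0 is assumed, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded up to max_length.
ids, mask = pad_or_truncate([101, 2307, 2326, 102], max_length=6)
print(ids)   # [101, 2307, 2326, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

# A long sequence is cut down to max_length.
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(long_ids)  # [1, 2, 3, 4, 5, 6]
```

Every example ends up with the same length, which is what lets the reviews be stacked into fixed-size batches.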

Load the CSV, tokenize it, and split it into train (80%), validation (10%), and test (10%) sets.

In [None]:
df = pd.read_csv('trustpilot_reviews.csv')
# Initial dataset before tokenization, named 'raw_dataset' to avoid confusion with the final 'dataset' DatasetDict
raw_dataset = Dataset.from_pandas(df)

# Tokenize the raw dataset
tokenized_dataset = raw_dataset.map(preprocess_function, batched=True)

# Rename the 'label' column to 'labels' as expected by the Trainer
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

# Split the tokenized_dataset into training (80%) and a temporary set (20%)
train_test_valid_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# Split the temporary set (train_test_valid_split['test']) into validation (50% of temp) and test (50% of temp)
test_valid_split = train_test_valid_split['test'].train_test_split(test_size=0.5, seed=42)

# Collect the three splits in a DatasetDict named 'dataset'
from datasets import DatasetDict

dataset = DatasetDict({
    'train': train_test_valid_split['train'],
    'val': test_valid_split['train'],
    'test': test_valid_split['test']
})

# Each split can now be accessed by name, e.g. dataset["train"][0]
print(f"Dataset split into: {len(dataset['train'])} training samples, {len(dataset['val'])} validation samples, {len(dataset['test'])} test samples.")
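
The two-stage split works out to roughly 80/10/10. The sketch below reproduces the arithmetic on plain indices. It assumes a hypothetical dataset of 300 reviews, and `split_indices` is an illustrative helper, so the exact rows chosen (and possibly the rounding by one row) will differ from what `train_test_split` does.

```python
import random


def split_indices(n, seed=42):
    """Shuffle range(n) and cut it into 80% train, 10% validation, 10% test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    n_train = n * 8 // 10          # 80% for training
    n_val = (n - n_train) // 2     # half of the remainder for validation

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # whatever is left is the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(300)
print(len(train_idx), len(val_idx), len(test_idx))  # 240 30 30
```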

Sample tokenized output:

In [None]:
print(dataset["train"][0])

## Evaluating the Model

In [None]:
import evaluate
import numpy as np

# we setup the training to evaluate the accuracy and f1 scores

accuracy_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    # use 'weighted' averaging: the default 'binary' only works for two classes
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    return {**accuracy, **f1}
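
To see what the weighted F1 score measures, here is a small hand-rolled version on made-up labels. `weighted_f1` is a hypothetical helper written for illustration; the notebook itself relies on the `evaluate` library for this.

```python
def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its number of true samples."""
    total = 0.0
    for cls in set(labels):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn              # number of true samples of this class
        total += support * f1
    return total / len(labels)


labels = [0, 0, 1, 1, 2]   # made-up ground truth
preds = [0, 1, 1, 1, 2]    # made-up predictions

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(accuracy)                              # 0.8
print(round(weighted_f1(labels, preds), 4))  # 0.7867
```

Class 0 has an F1 of 2/3, class 1 has 0.8, and class 2 has 1.0; weighting by supports 2, 2, and 1 gives (2 × 2/3 + 2 × 0.8 + 1 × 1.0) / 5 ≈ 0.7867.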

In [None]:
import os
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# get the DistilBERT model with a sequence classification head for sentiment analysis
# source: (https://huggingface.co/distilbert-base-uncased)
checkpoint = 'distilbert-base-uncased'
num_labels = 5
id2label = {0:'1 star', 1:'2 stars', 2:'3 stars', 3:'4 stars', 4:'5 stars'}
label2id = {'1 star':0, '2 stars':1, '3 stars':2, '4 stars':3, '5 stars':4}
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id)

# set up custom training arguments
# 1. store training checkpoints in the 'results' output directory
# 2. fine-tune for just 1 epoch
# 3,4. use a batch size of 16 for training and evaluation to speed things up
# 5. evaluate on the validation set and save a checkpoint at the end of each epoch
# 6. load the best checkpoint (lowest validation loss) at the end of training
training_args = TrainingArguments(
    seed=42,
    output_dir = './results',
    num_train_epochs = 1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    load_best_model_at_end=True,
    eval_strategy = "epoch",
    save_strategy = 'epoch'
)

# setup trainer with custom metrics (accuracy, f1)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['val'],
    compute_metrics=compute_metrics,
)

# disable wandb logging (a v4 huggingface artifact)
os.environ['WANDB_DISABLED']= "true"

Evaluate the model on the test set before fine-tuning to get a baseline.

In [None]:
trainer.evaluate(dataset['test'])

In [None]:
trainer.train()

In [None]:
trainer.evaluate(dataset['test']) ## evaluate on test set

## Testing with examples

In [None]:
from transformers import pipeline
import torch

# create pipeline for sentiment classifier with custom model and tokenizer
sentiment_classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=tokenizer)

In [None]:
# let's see how our model classifies a bad review (1 star)
# this is from (https://ca.trustpilot.com/review/www.expedia.ca?page=5)

review = """
Hote bill error, expedia.ca fault
I used to like Expedia. Now I will certainly reconsider further bookings with them.
I have booked a hotel near Cancun and the invoice from Expedia included a tax to be charged by the hotel.
This is normal, there are various and numerous "environmental" taxes these days. However, the hotel actually
requested the amount that was 3x more than what Expedia.ca provided in the invoice. As far as I understand,
it was Expedia's fault, they calculated it incorrectly. However, Expedia has rejected my claim and even the proposal
to compensate me the difference in Expedia points! I guess, this is the warning sign for me - with Expedia,
the customer is always wrong.
"""
sentiment_classifier(review)
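
The classifier returns a label and a score. Assuming the pipeline applies a softmax over the five logits and reports the most probable class (its usual behaviour for single-label classification), the conversion can be sketched as follows. The logits are made up, and `logits_to_prediction` is a hypothetical helper:

```python
import math

# Same mapping as the id2label dictionary defined for the model above.
id2label = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}


def logits_to_prediction(logits):
    """Softmax the logits and return the most probable label with its probability."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


example_logits = [2.0, 0.5, 0.1, -1.0, -0.5]  # made-up model outputs
print(logits_to_prediction(example_logits))
```

The largest logit is at index 0, so this example is labelled `1 star`.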

In [None]:
# let's see how our model classifies a 5 star review
# this is from (https://ca.trustpilot.com/review/www.airlineticketcentre.ca)

review = """
A Grateful Customer's Appreciation for Airline Ticket
I’ve been a loyal client of Airline Ticket for over 20 years,
and their service has consistently been outstanding.
They are incredibly reliable—whether it's booking, inquiries,
or last-minute changes, they always deliver with professionalism and care.

One of the things I value most is their commitment to personal service.
They always answer their landline promptly, and it’s always a real human on
the other end—ready to help, not just route you through automated systems.
Their customer service is truly exceptional: responsive, knowledgeable, and
 genuinely dedicated to guiding and supporting customers at any time.

I’m especially thankful to Michal, Judy, and most recently Shahir
for their continued professionalism and kindness. Their expertise and
personal touch make every interaction smooth and reassuring.

Thank you for two decades of excellence!
"""
sentiment_classifier(review)

## Fine-tuning parameters

Find the best hyperparameters with Optuna. Rather than trying every combination, Optuna uses the results of earlier trials to concentrate the search on promising regions of the hyperparameter space. I've used this in several projects, and it often finds better configurations than manual tuning.

In [None]:
import optuna
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted")
    }


def objective(trial):
    # Hyperparameters to tune
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])

    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        num_train_epochs=1,
        weight_decay=0.01,
        eval_strategy="epoch",
        logging_dir="./logs",
        report_to="none",         # avoid wandb warnings
    )

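    # Start each trial from the pretrained checkpoint; reusing the global `model`
    # would carry fine-tuned weights over from one trial to the next.
    trial_model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id
    )

    trainer = Trainer(
        model=trial_model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["val"],  # tune on validation; keep the test set held out
        compute_metrics=compute_metrics
    )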

    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"]


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

print("Best params:", study.best_params)
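
The `log=True` flag on the learning rate means values are drawn on a logarithmic scale, so small and large learning rates within the range get comparable coverage. The sketch below shows only that idea with plain random sampling. It is not Optuna's sampler, which also takes earlier trial results into account, and `sample_log_uniform` is a hypothetical helper:

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 5e-5, rng) for _ in range(5)]

for value in samples:
    print(f"{value:.2e}")
```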


Retrain from the pretrained checkpoint using the best parameters found by Optuna.

In [None]:
import os
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# get the DistilBERT model with a sequence classification head for sentiment analysis
# source: (https://huggingface.co/distilbert-base-uncased)
checkpoint = 'distilbert-base-uncased'
num_labels = 5
id2label = {0:'1 star', 1:'2 stars', 2:'3 stars', 3:'4 stars', 4:'5 stars'}
label2id = {'1 star':0, '2 stars':1, '3 stars':2, '4 stars':3, '5 stars':4}
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id)

# set up custom training arguments
# 1. store training checkpoints in the 'results' output directory
# 2. fine-tune for just 1 epoch
# 3,4. use the learning rate and batch size (8) found by Optuna
# 5. evaluate on the validation set and save a checkpoint at the end of each epoch
# 6. load the best checkpoint (lowest validation loss) at the end of training
training_args = TrainingArguments(
    seed=42,
    output_dir = './results',
    num_train_epochs = 1,

    # Best parameters from Optuna
    learning_rate=2.734763905921527e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,

    load_best_model_at_end=True,
    eval_strategy = "epoch",
    save_strategy = 'epoch'
)

# setup trainer with custom metrics (accuracy, f1)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['val'],
    compute_metrics=compute_metrics,
)

# disable wandb logging (a v4 huggingface artifact)
os.environ['WANDB_DISABLED']= "true"

In [None]:
trainer.train()

In [None]:
trainer.evaluate(dataset['test']) ## evaluate on test set

In [None]:
from transformers import pipeline
import torch

# create pipeline for sentiment classifier with custom model and tokenizer
sentiment_classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=tokenizer)

In [None]:
# let's see how our model classifies a 1 star review
# this is from (https://ca.trustpilot.com/review/www.expedia.ca?page=5)

review = """
Hote bill error, expedia.ca fault
I used to like Expedia. Now I will certainly reconsider further bookings with them.
I have booked a hotel near Cancun and the invoice from Expedia included a tax to be charged by the hotel.
This is normal, there are various and numerous "environmental" taxes these days. However, the hotel actually
requested the amount that was 3x more than what Expedia.ca provided in the invoice. As far as I understand,
it was Expedia's fault, they calculated it incorrectly. However, Expedia has rejected my claim and even the proposal
to compensate me the difference in Expedia points! I guess, this is the warning sign for me - with Expedia,
the customer is always wrong.
"""
sentiment_classifier(review)

In [None]:
# let's see how our model classifies a 5 star review
# this is from (https://ca.trustpilot.com/review/www.airlineticketcentre.ca)

review = """
A Grateful Customer's Appreciation for Airline Ticket
I’ve been a loyal client of Airline Ticket for over 20 years,
and their service has consistently been outstanding.
They are incredibly reliable—whether it's booking, inquiries,
or last-minute changes, they always deliver with professionalism and care.

One of the things I value most is their commitment to personal service.
They always answer their landline promptly, and it’s always a real human on
the other end—ready to help, not just route you through automated systems.
Their customer service is truly exceptional: responsive, knowledgeable, and
 genuinely dedicated to guiding and supporting customers at any time.

I’m especially thankful to Michal, Judy, and most recently Shahir
for their continued professionalism and kindness. Their expertise and
personal touch make every interaction smooth and reassuring.

Thank you for two decades of excellence!
"""
sentiment_classifier(review)
