<a href="https://colab.research.google.com/github/calebtan2002/Twitter-NLP-Sentiment-Analysis-/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing of libraries** and **Loading of datasets**

In [30]:
import os
import pandas as pd
import numpy as np
from google.colab import drive
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

In [31]:
df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                encoding='ISO-8859-1',
                header=None,
                engine='python',
                on_bad_lines='skip') # Skip lines with errors
df = df[[0, 5]]
df.columns = ['label', 'text']

In [32]:
df.head()

| | label | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


# **Mapping of Labels** and **Sampling**

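The original dataset labels are:

*   0 for negative
*   2 for neutral
*   4 for positive

I will remap these to 0, 1, and 2 so that the class ids are contiguous, which is what the classification head expects. To my knowledge, the training file loaded here contains only labels 0 and 4, so class 1 (neutral) may end up with no examples; `df['label'].value_counts()` is a quick way to check.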

To reduce training time, I will use a random sample of 100,000 tweets out of the 1,600,000 in the dataset, drawn with a fixed seed so that the run is reproducible. For real-world scenarios, a larger sample can be used.

In [33]:
label_mapping = {0: 0, 2: 1, 4: 2}
df['label'] = df['label'].map(label_mapping)

df_subset = df.sample(n=100000, random_state=42)
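
Before sampling, it can be worth confirming that every raw label is covered by the mapping and seeing how many examples land in each class. The following is a dependency-free sketch of that check; `remap_labels` and the toy label list are hypothetical and only stand in for the real `label` column.

```python
from collections import Counter

# Same mapping as in the cell above.
label_mapping = {0: 0, 2: 1, 4: 2}

def remap_labels(raw_labels, mapping):
    """Remap labels and count how many examples fall into each new class.

    Raises ValueError if a raw label is not covered by the mapping, instead
    of silently producing a missing value.
    """
    unknown = sorted(set(raw_labels) - set(mapping))
    if unknown:
        raise ValueError(f"labels missing from mapping: {unknown}")
    remapped = [mapping[label] for label in raw_labels]
    return remapped, Counter(remapped)

# Hypothetical toy labels (not taken from the dataset).
raw = [0, 4, 4, 0, 4]
remapped, counts = remap_labels(raw, label_mapping)
print(remapped)
print({cls: counts[cls] for cls in (0, 1, 2)})
```

A class that shows a count of zero here would still get an output unit in a model loaded with `num_labels=3`, but it would never be seen during training.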

# **Splitting the Dataset** and **Conversion to Hugging Face Datasets**

I will split the data 80/20 into training and evaluation sets. I will then convert the pandas DataFrames into Hugging Face `Dataset` objects so that they can be passed to the Trainer.

In [34]:
# Split the data: 80% for training, 20% for evaluation
train_df, test_df = train_test_split(df_subset, test_size=0.2, random_state=42)

In [35]:
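# Convert to Hugging Face datasets.
# preserve_index=False keeps the shuffled pandas index from being stored
# as an extra `__index_level_0__` column.
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)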

# **Tokenizing the Dataset**
I will use a pre-trained tokenizer, **distilbert-base-uncased**, to convert the text into the numerical format required by the model. With `padding='max_length'`, `truncation=True`, and `max_length=128`, every tweet is padded or cut to exactly 128 tokens, so all inputs have the same length.

In [36]:
# Tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

Map:   0%|          | 0/80000 [00:00<?, ? examples/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]
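
To make the effect of padding and truncation concrete, here is a toy sketch that works on lists of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, the pad id of 0 is an assumption, and the real tokenizer also takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    The attention mask is 1 for real tokens and 0 for padding, so the model
    can ignore the padded positions.
    """
    ids = list(token_ids[:max_length])      # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)           # padding needed to reach the limit
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask

# Hypothetical token ids for a short and a long tweet.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=6)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every tweet to 128 tokens is simple but spends compute on padded positions, since most tweets are much shorter than that.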

# **Setting Format for PyTorch**
I will format the datasets to be used with PyTorch by specifying the columns to be included.

In [37]:
# Set format for PyTorch
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

# **Loading the Pre-trained Model**
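I will load a pre-trained **DistilBERT model** with a sequence-classification head, which I will then fine-tune. The `num_labels` parameter specifies three sentiment categories. The warning below is expected: the classification layers are not part of the pre-trained checkpoint, so they start from random weights and are learned during fine-tuning.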

In [38]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# **Defining Training Arguments**
I will configure training parameters such as the **number of epochs, batch size**, **learning rate warmup**, and **weight decay**. I also set **load_best_model_at_end to True** so that the best checkpoint (by default, the one with the lowest validation loss) is reloaded when training finishes.

In [39]:
# Define Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    num_train_epochs=2,  # Train for 2 full passes over the training set
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    save_total_limit=1,
    save_strategy="epoch",
    report_to="none"  # Explicitly disable W&B and other logging integrations
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
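
To see what `warmup_steps=500` means for this run, the sketch below works out the number of optimiser steps and the learning rate at a few points. It assumes the defaults that the cell above leaves untouched (a base learning rate of 5e-5 and a linear schedule), a single device, and no gradient accumulation; `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))

train_examples = 80_000   # 80% of the 100,000 sampled tweets
batch_size = 16           # per_device_train_batch_size, assuming one device
epochs = 2
base_lr = 5e-5            # assumed default, not set in the notebook
warmup_steps = 500

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, warmup_fraction)
for step in (0, 250, 500, 5250, total_steps):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

Under these assumptions, warmup covers the first 5% of training, after which the learning rate decays linearly to zero at the final step.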


# **Defining the Evaluation Metrics**
I will use **accuracy** and support-weighted **precision, recall, and F1-score** to assess the model's performance on the test set.

In [40]:
# Define compute metrics function
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}
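
To show what `average='weighted'` computes, here is a dependency-free sketch that calculates the same kind of support-weighted averages by hand on a toy example. `weighted_scores` and the toy labels are hypothetical; the notebook itself relies on scikit-learn for these numbers.

```python
import math
from collections import Counter

def weighted_scores(labels, preds):
    """Per-class precision/recall/F1, averaged with weights = class support."""
    support = Counter(labels)
    n = len(labels)
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_predicted = sum(1 for p in preds if p == cls)
        p_cls = tp / n_predicted if n_predicted else 0.0
        r_cls = tp / n_cls
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_cls / n              # share of true examples in this class
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / n
    return {'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall}

# Hypothetical toy labels and predictions.
toy_labels = [0, 0, 0, 0, 2, 2]
toy_preds = [0, 0, 2, 2, 2, 2]
scores = weighted_scores(toy_labels, toy_preds)
print(scores)
```

One consequence of the weighting is that weighted recall always equals accuracy, because each class's recall is multiplied by its share of the examples. That is why the Accuracy and Recall columns in the training log are identical.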

# **Initialising the Trainer**
The Trainer class from Hugging Face simplifies the training process by combining the model, training arguments, datasets, and evaluation metrics. To avoid logging to Weights & Biases, I will also disable it using an environment variable.

In [41]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

In [42]:
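# Disable Weights & Biases (W&B).
# On a fresh runtime, set this before the training arguments and Trainer
# are created, otherwise it has no effect on them.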
os.environ["WANDB_DISABLED"] = "true"

# **Training the Model**
Finally, I will train the model using the previously defined Trainer.

In [43]:
# Fine-tune the model on the 80,000-tweet training split
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3944 | 0.384133 | 0.83385 | 0.833788 | 0.834364 | 0.83385 |
| 2 | 0.2429 | 0.418785 | 0.8364 | 0.836368 | 0.836678 | 0.8364 |


TrainOutput(global_step=10000, training_loss=0.35008572387695314, metrics={'train_runtime': 2009.9883, 'train_samples_per_second': 79.602, 'train_steps_per_second': 4.975, 'total_flos': 5298790440960000.0, 'train_loss': 0.35008572387695314, 'epoch': 2.0})
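
Since `load_best_model_at_end=True` is set, it is worth checking which epoch counts as "best". The sketch below assumes the Trainer's default criterion (lowest evaluation loss, as `metric_for_best_model` is not set in this notebook) and compares it with selecting by accuracy. `best_epoch` is a hypothetical helper, and the numbers are copied from the log above.

```python
# Evaluation results copied from the training log above.
eval_history = [
    {'epoch': 1, 'eval_loss': 0.384133, 'eval_accuracy': 0.83385},
    {'epoch': 2, 'eval_loss': 0.418785, 'eval_accuracy': 0.8364},
]

def best_epoch(history, metric, greater_is_better):
    """Return the epoch whose value for `metric` is best."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda row: row[metric])['epoch']

by_loss = best_epoch(eval_history, 'eval_loss', greater_is_better=False)
by_accuracy = best_epoch(eval_history, 'eval_accuracy', greater_is_better=True)
print(by_loss, by_accuracy)
```

Under that assumption, the model reloaded at the end would be the epoch 1 checkpoint, even though epoch 2 has slightly higher accuracy. Rising validation loss alongside falling training loss in epoch 2 also hints at the start of overfitting. Passing `metric_for_best_model='accuracy'` to `TrainingArguments` would be one way to select on accuracy instead.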