# Phase 1: Dataset Preparation

This notebook demonstrates the process of loading, cleaning, and converting the Twitter Entity Sentiment Analysis dataset from Kaggle into a format ready for machine learning using Hugging Face tools.

## Steps:

1. **Data Loading**
   - The dataset is sourced from Kaggle under the name *twitter-entity-sentiment-analysis*.

2. **Data Cleaning**
   - Entries labeled *Irrelevant* (as described in the dataset description on Kaggle) are removed so that only tweets with a usable sentiment label are kept.

3. **Sentiment Conversion**
   - The original sentiment labels are mapped to numerical values:
     - **Positive** → 1
     - **Negative** → 0
     - **Neutral**  → 2

4. **Text Tokenization**
   - The text entries are tokenized using the `bert-base-uncased` tokenizer, preparing the text for subsequent modeling tasks.

5. **Dataset Conversion**
   - After preprocessing, the train and test splits are converted into Hugging Face `Dataset` objects and packed into a `DatasetDict`.

6. **Saving Locally**
   - The final processed dataset is pickled to `data/dataset_sentiment_analysis.pkl` so it can be reloaded for later experiments or model training.



---

In [1]:
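import sys
# add the parent directory to the import path so the Pipeline package can be found
sys.path.insert(0, '../')

from transformers import AutoTokenizer

import kagglehub, os, random
import pandas as pd

from Pipeline.preprocessing.sentiment_analysis_Preprocessor import SentimentAnalysisPreprocessor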


In [None]:
# define the model that will be used
model_name = 'bert-base-uncased'

### Load the data

In [None]:
# Download latest version
path = kagglehub.dataset_download("jp797498e/twitter-entity-sentiment-analysis")

In [4]:
# show the downloaded files
os.listdir(path)

['twitter_training.csv', 'twitter_validation.csv']
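
Neither file has a header row (the first line of each is already a data record), so the column names have to be supplied when reading them. The sketch below illustrates the same idea with the standard library's `csv.DictReader`; the two rows are made-up stand-ins, not records from the dataset.

```python
import csv
import io

# Made-up rows in the same four-column layout (id, entity, sentiment, text).
raw = io.StringIO(
    '1,Borderlands,Positive,"great game, loving it"\n'
    "2,Google,Irrelevant,unrelated message\n"
)

# With no header row in the file, the column names are supplied explicitly.
columns = ["id", "entity", "sentiment", "text"]
rows = list(csv.DictReader(raw, fieldnames=columns))

for row in rows:
    print(row["entity"], row["sentiment"], row["text"])
```

Passing `names=[...]` to `pd.read_csv` plays the same role as `fieldnames` here.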

In [5]:
training_df = pd.read_csv(os.path.join(path, 'twitter_training.csv'), names=['id', 'entity', 'sentiment', 'text'])
training_df.head()

|   | id | entity | sentiment | text |
|---|------|-------------|-----------|------|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |


In [6]:
test_df = pd.read_csv(os.path.join(path, 'twitter_validation.csv'), names=['id', 'entity', 'sentiment', 'text'])
test_df.head()

|   | id | entity | sentiment | text |
|---|------|-----------|------------|------|
| 0 | 3364 | Facebook | Irrelevant | I mentioned on Facebook that I was struggling ... |
| 1 | 352 | Amazon | Neutral | BBC News - Amazon boss Jeff Bezos rejects clai... |
| 2 | 8312 | Microsoft | Negative | @Microsoft Why do I pay for WORD when it funct... |
| 3 | 4371 | CS-GO | Negative | CSGO matchmaking is so full of closet hacking,... |
| 4 | 4433 | Google | Neutral | Now the President is slapping Americans in the... |


---

### Preprocess the dataset

The dataset documentation states:

`There are three classes in this dataset: Positive, Negative and Neutral. We regard messages that are not relevant to the entity (i.e. Irrelevant) as Neutral.`

The _Irrelevant_ label only indicates that a message is unrelated to its entity; it says nothing about whether the message itself is _Positive_, _Negative_ or _Neutral_.
For this reason, rather than following the convention above, all entries labeled _Irrelevant_ are removed.
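
Mapping _Irrelevant_ to _Neutral_ (the dataset's convention) and dropping those rows (the choice made here) lead to different class distributions. The sketch below contrasts the two on a made-up list of labels; the counts are illustrative only and do not come from the dataset.

```python
from collections import Counter

# Made-up labels for illustration only; these are not counts from the dataset.
labels = ["Positive", "Irrelevant", "Negative", "Neutral", "Irrelevant", "Positive"]

# Option A (dataset convention): treat Irrelevant as Neutral.
merged = ["Neutral" if label == "Irrelevant" else label for label in labels]

# Option B (this notebook): drop Irrelevant entries entirely.
dropped = [label for label in labels if label != "Irrelevant"]

print(Counter(merged))
print(Counter(dropped))
```

Option A inflates the Neutral class with messages whose tone is unknown, while option B keeps the three classes clean at the cost of fewer examples.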

In [7]:
# print some examples of irrelevant tweets
irrelevant_df = training_df[training_df['sentiment'] == 'Irrelevant']

for index, example in enumerate(random.sample(irrelevant_df['text'].to_list(), 5)):
    print(f'{index}: {example}')
    print()

0: forums.com is the greatest spot to get free modz

1: this baseball clip looks better than anything from the entire overwatch league

2: Watch Dogs 2 is a copy of GTA, but the driving is lagged and poorly optimised, the characters are plain and boring (Wrench and Sitara are ok tho), disproportionate enemies' AI response, hacking mechanics and missions are repetitive (stealth, kill, hack) and shooting isn't worth

3: Girls, what information???? Google conspiracies are your references when there are FOXY interviews that literally deny this....

4: bruuuhhh oh wtf is wrong with me these good gay ass pc players



In [8]:
# remove the entries that have Irrelevant sentiment
training_df = training_df[training_df['sentiment'] != 'Irrelevant']
test_df = test_df[test_df['sentiment'] != 'Irrelevant']

In [9]:
# create a function to convert sentiment string to a digit, following the convention
def convert_sentiment_to_digit(sentiment):
    """
    Convert sentiment string to a digit.
    
    This function takes a string from the sentiment column and returns a digit.
    The conversion is done as follows:
        - 'Positive' -> 1
        - 'Negative' -> 0
        - 'Neutral' -> 2
    
    Args:
        sentiment (str): The sentiment string to be converted.
    
    Returns:
        int: The digit corresponding to the sentiment string.
    """
    if sentiment == 'Positive':
        return 1
    elif sentiment == 'Negative':
        return 0
    else:
        return 2
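
Because the `else` branch returns 2 for every string other than `Positive` or `Negative`, an unexpected value (for example a leftover `Irrelevant` label or a misspelling) would silently be treated as Neutral. A stricter variant, sketched here under the hypothetical name `convert_sentiment_strict`, uses a lookup table and raises on unknown labels.

```python
# Same convention as above: Positive -> 1, Negative -> 0, Neutral -> 2.
SENTIMENT_TO_DIGIT = {"Positive": 1, "Negative": 0, "Neutral": 2}

def convert_sentiment_strict(sentiment):
    """Hypothetical strict variant: fail loudly instead of defaulting to 2."""
    try:
        return SENTIMENT_TO_DIGIT[sentiment]
    except KeyError:
        raise ValueError(f"Unexpected sentiment label: {sentiment!r}") from None

print([convert_sentiment_strict(s) for s in ["Positive", "Negative", "Neutral"]])
```

Whether the extra strictness is worth it depends on how much the upstream filtering is trusted; the notebook's version is sufficient as long as only the three expected labels reach it.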

In [None]:
# define the desired preprocessing parameters
input_preprocessing_params = {
    'tokenizer': AutoTokenizer.from_pretrained(model_name),
    'max_length': 128,
    'truncation': True
}

output_preprocessing_params = {
    'label_preprocess_fn': convert_sentiment_to_digit
}
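
How `SentimentAnalysisPreprocessor` consumes these parameters is not shown in this notebook. Assuming it forwards `max_length` and `truncation` to the tokenizer, every sequence is capped at 128 token ids, special tokens included. The pure-Python sketch below mimics that cap for a single sequence; `truncate_to_max_length` is a hypothetical helper, the token ids are made up, and 101 and 102 are the `[CLS]` and `[SEP]` ids in the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in bert-base-uncased

def truncate_to_max_length(token_ids, max_length=128):
    """Hypothetical helper: cap a sequence at max_length, special tokens included."""
    # Reserve two positions for [CLS] and [SEP], then cut the remaining tokens.
    body = token_ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]

long_sequence = list(range(1000, 1300))  # 300 made-up token ids
short_sequence = [1500, 1600]            # 2 made-up token ids

print(len(truncate_to_max_length(long_sequence)))
print(len(truncate_to_max_length(short_sequence)))
```

Sequences shorter than the limit are left at their natural length in this sketch; padding them to a common length is a separate step.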

In [11]:
# create an instance of the Preprocessor class, from the Pipeline
preprocessor = SentimentAnalysisPreprocessor(input_preprocessing_params, output_preprocessing_params)

# preprocess the data
preprocessed_training_data = preprocessor.preprocess_input_data(training_df['text'])
preprocessed_test_data = preprocessor.preprocess_input_data(test_df['text'])

preprocessed_training_labels = preprocessor.preprocess_output_data(training_df['sentiment'])
preprocessed_test_labels = preprocessor.preprocess_output_data(test_df['sentiment'])

### Save the dataset locally, using the Hugging Face Dataset format

In [12]:
from datasets import Dataset, DatasetDict

In [13]:
# first create a dictionary with one entry per split
dataset = {
    'train_set': {
        'input_ids': preprocessed_training_data,
        'labels': preprocessed_training_labels
    },
    'test_set': {
        'input_ids': preprocessed_test_data,
        'labels': preprocessed_test_labels
    }
}
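
`Dataset.from_dict` needs every column of a split to have the same number of rows. A small check before the conversion, sketched here with a hypothetical helper `check_columns_aligned` and made-up values, makes a mismatch between inputs and labels easier to diagnose.

```python
def check_columns_aligned(split):
    """Hypothetical helper: verify all columns of a split have the same length."""
    lengths = {name: len(values) for name, values in split.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # Return the shared number of rows (0 for an empty split).
    return next(iter(lengths.values()), 0)

# Made-up split with two examples.
toy_split = {"input_ids": [[101, 1500, 102], [101, 1600, 102]], "labels": [1, 0]}
print(check_columns_aligned(toy_split))
```

In the notebook, the same check could be applied to `dataset['train_set']` and `dataset['test_set']`.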

In [14]:
# then convert each split into a Hugging Face Dataset
train_dataset = Dataset.from_dict(dataset["train_set"])
test_dataset = Dataset.from_dict(dataset["test_set"])

In [15]:
# and finally pack it into a DatasetDict structure
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset,
})

In [16]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 61692
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 828
    })
})

In [17]:
# save locally
import pickle

# make sure the target directory exists
os.makedirs("data", exist_ok=True)

# open the file in write-binary mode
with open("data/dataset_sentiment_analysis.pkl", "wb") as file:
    # serialize the DatasetDict and save it to the file
    pickle.dump(dataset, file)
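
Loading the file back mirrors the save step. The sketch below round-trips a plain dictionary standing in for the `DatasetDict`, writing to a temporary directory so that it does not touch `data/`.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the DatasetDict.
toy_dataset = {"train": {"labels": [1, 0, 2]}, "test": {"labels": [2]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "dataset_sentiment_analysis.pkl")

    # Save: serialize the object to disk in binary mode.
    with open(file_path, "wb") as file:
        pickle.dump(toy_dataset, file)

    # Load: read it back in binary mode.
    with open(file_path, "rb") as file:
        restored = pickle.load(file)

print(restored == toy_dataset)
```

The `datasets` library also provides `DatasetDict.save_to_disk` and `load_from_disk`, which may be a better fit than pickle when the files are shared across environments, since unpickling requires a compatible `datasets` installation and should only be done with trusted files.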