# Step-by-step guide to preprocessing data

This reading will guide you through the following steps:

- **Step 1:** Load and organize the data
- **Step 2:** Clean the text
- **Step 3:** Tokenize
- **Step 4:** Handle missing data
- **Step 5:** Prepare the data for fine-tuning
- **Step 6:** Split the data

## Step 1: Load and organize the data

Before diving into cleaning and tokenization, you need to import the raw data and organize it into a structured format. We begin by loading the dataset with pandas, renaming the text column, and mapping each emotion to a numeric label.

In [None]:
import pandas as pd
import torch

# Load the dataset from the URL
url = "https://huggingface.co/datasets/stepp1/tweet_emotion_intensity/resolve/main/train.csv"
data = pd.read_csv(url)

print("Dataset loaded successfully.")
print(data.head())

# Preprocessing for this specific dataset:
# 1. Rename 'tweet' to 'text' so it works with our cleaning function later
if 'tweet' in data.columns:
    data = data.rename(columns={'tweet': 'text'})

# 2. Create numeric labels from the 'emotion' column
if 'emotion' in data.columns:
    # Create a mapping from emotion string to number (e.g., anger -> 0, fear -> 1)
    label_mapping = {label: idx for idx, label in enumerate(data['emotion'].unique())}
    data['label'] = data['emotion'].map(label_mapping)
    print(f"\nLabel mapping applied: {label_mapping}")

# Convert labels to PyTorch tensor
if 'label' in data.columns:
    labels = torch.tensor(data['label'].tolist())

print(f"\nTotal samples: {len(data)}")
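
The label mapping above follows the order in which emotions first appear in the file, so it can change if the rows are reordered. A minimal sketch of a deterministic alternative sorts the unique labels first and also builds the inverse mapping for decoding predictions. The short list of emotions is hypothetical and stands in for `data['emotion']`:

```python
# Hypothetical emotion values standing in for data['emotion']
emotions = ["joy", "anger", "fear", "joy", "sadness", "anger"]

# Sort the unique labels so the mapping does not depend on row order
label_mapping = {label: idx for idx, label in enumerate(sorted(set(emotions)))}

# Inverse mapping, useful for turning predicted IDs back into emotion names
id_to_label = {idx: label for label, idx in label_mapping.items()}

# Encode every emotion string as its integer label
encoded = [label_mapping[e] for e in emotions]
print(label_mapping)
print(encoded)
```

Whether a sorted mapping is preferable depends on your project; the point is only that the mapping should be saved or reproducible so that predictions can be decoded later.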

## Step 2: Clean the text

Text cleaning removes URLs, punctuation, and excess whitespace from each example and converts the text to lowercase, so that every data point follows the same format.

**Explanation:**
Consistent formatting reduces noise in the input: the same word is no longer represented in several ways (for example, "Happy!", "happy", and "HAPPY"), which makes patterns easier for the model to learn.

In [None]:
import re

# Function to clean the text
def clean_text(text):
    if not isinstance(text, str):  # Leave missing values (NaN) for Step 4
        return text
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace last
    return text

# Apply cleaning function to your dataset
data['cleaned_text'] = data['text'].apply(clean_text)
print(data['cleaned_text'].head())
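
To see what the cleaning steps do to a single tweet, here is a small self-contained sketch. The sample string is made up, and `clean_text_sketch` is a stand-in written for this illustration. It applies the same four operations, collapsing whitespace last so that removing a URL or punctuation cannot leave double spaces behind.

```python
import re

def clean_text_sketch(text):
    """Lowercase, strip URLs and punctuation, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)   # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)   # Remove punctuation
    return re.sub(r'\s+', ' ', text).strip()  # Collapse whitespace

# Invented example tweet
sample = "So ANGRY right now!!   See https://t.co/abc123 , #fuming"
print(clean_text_sketch(sample))  # so angry right now see fuming
```

Note that removing punctuation also strips the `#` from hashtags while keeping the hashtag word itself.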

## Step 3: Tokenize

Tokenization converts text into tokens and then into the integer IDs that a machine-learning model takes as input. Use the tokenizer that matches the pretrained model you plan to fine-tune (here, BERT's `bert-base-uncased`), so that the IDs line up with the model's vocabulary.

**Explanation:**
The call below pads shorter texts, truncates longer ones to at most 128 tokens, and returns PyTorch tensors, so the cleaned text is ready to use as model input.

In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the cleaned text
tokens = tokenizer(
    data['cleaned_text'].tolist(), padding=True, truncation=True, return_tensors='pt', max_length=128
)

print(tokens['input_ids'][:5])  # Check the first 5 tokenized examples
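
To make the roles of `padding`, `truncation`, and the attention mask concrete, the following sketch mimics them on plain Python lists. It is an illustration only, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are invented, and a padding ID of 0 is assumed. The real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Invented token IDs for three texts of different lengths
ids, mask = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 102]],
    max_length=5,
)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row ends up with the same length, which is what allows the batch to be stored as a single tensor.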

## Step 4: Handle missing data

Missing data is common in real-world datasets. You can handle it either by removing incomplete entries or by imputing missing values. In practice, run this check before tokenizing: dropping rows after tokenization leaves the tokens and labels with different lengths.

**Explanation:**
Handling missing data ensures that your dataset is complete, which prevents errors during training and reduces bias introduced by missing information.

In [None]:
# Check for missing data
print(data.isnull().sum())

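# Option 1: Drop rows with missing data
data = data.dropna().reset_index(drop=True)

# Option 2 (instead of Option 1): Fill missing text with a placeholder
# data['cleaned_text'] = data['cleaned_text'].fillna('missing')

# If any rows were dropped or filled, re-run Step 3 so the tokens match the data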
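
The two options behave differently, which is easiest to see on a toy DataFrame. The rows below are invented for illustration; the column names follow the ones used in this guide.

```python
import pandas as pd

# Toy data with one missing text value
toy = pd.DataFrame({
    "cleaned_text": ["so happy today", None, "this is scary"],
    "label": [0, 1, 2],
})

print(toy.isnull().sum())  # Count missing values per column

# Option 1: drop incomplete rows and renumber the index
dropped = toy.dropna().reset_index(drop=True)

# Option 2: keep every row and substitute a placeholder string
filled = toy.assign(cleaned_text=toy["cleaned_text"].fillna("missing"))

print(len(dropped), len(filled))
```

Dropping changes the number of rows, so anything computed from the data earlier (such as tokens) has to be recomputed afterward. Filling keeps the row count but adds a placeholder example that still carries a label.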

## Step 5: Prepare the data for fine-tuning

After cleaning and tokenizing your text, the next step is to prepare the data for fine-tuning. This involves structuring the tokenized data and labels into a format suitable for training, such as PyTorch DataLoader objects.

**Example code:**

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Create PyTorch tensors from the tokenized data
input_ids = tokens['input_ids']
attention_masks = tokens['attention_mask']
labels = torch.tensor(data['label'].tolist())

# Create a DataLoader for training
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

print("DataLoader created successfully!")
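
Conceptually, a `DataLoader` with `shuffle=True` visits the sample indices in a random order and groups them into batches. The sketch below shows that idea with the standard library only; `iter_batches` is a hypothetical helper written for this illustration, not part of PyTorch, and the sample count of 50 is arbitrary.

```python
import math
import random

def iter_batches(n_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator makes the shuffled order repeatable
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iter_batches(n_samples=50, batch_size=16))
print([len(b) for b in batches])  # [16, 16, 16, 2]
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size, which is also what you should expect to see when iterating over the real `DataLoader` with its default settings.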

## Step 6: Split the data

Before training, it’s important to split your data into training, validation, and test sets. The training set is used to train the model, the validation set helps to tune model hyperparameters, and the test set is used for final evaluation to ensure that the model generalizes well to unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# First, split data into a combined training + validation set and a test set.
# The attention masks are split alongside the input IDs so they stay aligned.
(train_val_inputs, test_inputs,
 train_val_masks, test_masks,
 train_val_labels, test_labels) = train_test_split(
    input_ids, attention_masks, labels, test_size=0.1, random_state=42
)

# Now, split the combined set into training and validation sets
(train_inputs, val_inputs,
 train_masks, val_masks,
 train_labels, val_labels) = train_test_split(
    train_val_inputs, train_val_masks, train_val_labels, test_size=0.15, random_state=42
)

# Create DataLoader objects for training, validation, and test sets
train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
val_dataset = TensorDataset(val_inputs, val_masks, val_labels)
test_dataset = TensorDataset(test_inputs, test_masks, test_labels)

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16)
test_dataloader = DataLoader(test_dataset, batch_size=16)

print("DataLoader objects for training, validation, and test sets created successfully!")

### Explanation

The `train_test_split` function from the `sklearn.model_selection` module splits arrays into two random subsets, for example a training set and a test set. Here's a breakdown of how it works:

- **`input_ids` and `labels`**: These are the inputs and labels you are splitting.
- **`test_size=0.1`**: This indicates that 10 percent of the data will be set aside for the test set.
- **`random_state=42`**: This ensures the split is reproducible—using the same random state will produce the same split every time.

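In this case, the split happens in two stages. The first call separates a test set (`test_inputs`, 10 percent of the data) from a combined training + validation set (`train_val_inputs`). The second call splits `train_val_inputs` into `train_inputs` and `val_inputs`. Because `test_size=0.15` applies to the remaining 90 percent, the validation set is about 13.5 percent of the full dataset and the training set about 76.5 percent.

This gives you separate data for training, validating, and testing the model.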
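
The share of the full dataset that ends up in each set follows from the two `test_size` values. A short arithmetic sketch makes this explicit; `split_fractions` is a hypothetical helper, and the exact row counts in practice depend on how `train_test_split` rounds.

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions of the full dataset."""
    test_frac = test_size
    # The second split only sees what is left after removing the test set
    val_frac = (1 - test_size) * val_size
    train_frac = 1 - test_frac - val_frac
    return train_frac, val_frac, test_frac

train_frac, val_frac, test_frac = split_fractions(test_size=0.1, val_size=0.15)
print(f"train={train_frac:.3f}, val={val_frac:.3f}, test={test_frac:.3f}")
# train=0.765, val=0.135, test=0.100
```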

## Conclusion

Following this walkthrough, you’ve loaded, cleaned, tokenized, and split your dataset for fine-tuning. Clean, well-prepared data gives your model a better chance of performing well during fine-tuning, and you can reuse these preprocessing steps in your own machine-learning projects.