<h2>Tasks</h2>
* Pre-process your own data so that it is ready for training.
* Prepare a dataset for fine-tuning.
* Clean the raw text, handle missing data, tokenize the text, and structure the result as training-ready input for a fine-tuning task.

<h3>Step 1: Import dataset</h3>

In [None]:
import pandas as pd

# Load the training split from the Hugging Face Hub.
# The hf:// path is resolved through fsspec and needs the huggingface_hub package installed.
splits = {'train': 'train.csv', 'test': 'test.csv'}
data = pd.read_csv('hf://datasets/stepp1/tweet_emotion_intensity/' + splits['train'])
data.head()

In [None]:
data.shape

<h3>Step 2: Clean the text</h3>

In [None]:
import re

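def clean_text(text):
    """Lowercase a tweet and strip URLs, HTML tags, and punctuation."""
    if not isinstance(text, str):
        return text  # leave missing values (NaN/None) untouched so they can be handled later
    text = text.lower()
    text = re.sub(r'http\S+', '', text)   # remove URLs
    text = re.sub(r'<.*?>', '', text)     # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation, including '@' and '#'
    return text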

data['cleaned_text'] = data['tweet'].apply(clean_text)
print(data['cleaned_text'].head())
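
To see what each substitution does, here is a small self-contained check on a made-up tweet (the sample string is an assumption, not a row from the dataset). It mirrors the `clean_text` steps above and adds one optional step that the original function does not perform: collapsing the runs of whitespace left behind when a URL or tag is removed.

```python
import re

def clean_text_demo(text):
    """Same steps as clean_text, plus an optional whitespace collapse."""
    text = text.lower()                        # normalize case
    text = re.sub(r'http\S+', '', text)        # drop URLs
    text = re.sub(r'<.*?>', '', text)          # drop HTML tags
    text = re.sub(r'[^\w\s]', '', text)        # drop punctuation, including '@' and '#'
    text = re.sub(r'\s+', ' ', text).strip()   # extra step: collapse runs of whitespace
    return text

sample = "So HAPPY today!! <b>@friend</b> check https://example.com #blessed"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note that mentions and hashtags keep their words but lose the `@` and `#` symbols. If those markers matter for your task, adjust the punctuation pattern before fine-tuning.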

<h3>Step 3: Handle missing data</h3>

In [None]:
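print(data.isnull().sum())  # missing values per column

# Drop rows without text. The alternative, fillna('unknown'), keeps them with a
# placeholder; calling it after dropna has no effect, so pick one strategy.
data = data.dropna(subset=['cleaned_text']).reset_index(drop=True)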
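
Cleaning can also leave empty strings behind, for example when a tweet consisted only of a URL. `isnull()` does not count those as missing. The following is a minimal sketch of one way to catch them, using a made-up toy frame with the same `cleaned_text` column; the rows are illustrative assumptions, not dataset contents.

```python
import pandas as pd

# Toy frame standing in for `data`; the rows are made up for illustration.
toy = pd.DataFrame(
    {'cleaned_text': ['so happy today', '', '   ', None, 'feeling anxious']}
)

# Treat None/NaN as empty, strip surrounding whitespace, then keep non-empty rows.
text = toy['cleaned_text'].fillna('').str.strip()
keep = text != ''
toy = toy.assign(cleaned_text=text).loc[keep].reset_index(drop=True)

print(toy['cleaned_text'].tolist())
```

The same three lines can be applied to `data` in place of `toy` if dropping empty rows suits your task.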

<h3>Step 4: Tokenization</h3>

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

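tokens = tokenizer(
    data['cleaned_text'].tolist(),
    padding=True,          # pad every sequence to the longest one in the batch
    truncation=True,       # cut sequences longer than max_length
    max_length=128,
    return_tensors='pt',   # return PyTorch tensors
)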

print(tokens['input_ids'][:5])
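
The task list also asks for training-ready input, which needs integer labels and a train/validation split alongside the token ids. Below is a minimal sketch using only the standard library. The helper names `encode_labels` and `train_val_split` are hypothetical, the label values are made up, and the label column name mentioned in the comment (`sentiment_intensity`) is an assumption about the dataset that should be checked against `data.columns`.

```python
import random

def encode_labels(labels):
    """Map string labels to integer ids, assigned in sorted order for reproducibility."""
    label2id = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [label2id[label] for label in labels], label2id

def train_val_split(n_rows, val_fraction=0.2, seed=42):
    """Return shuffled, non-overlapping train and validation row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_val = int(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]

# Made-up labels standing in for a label column such as data['sentiment_intensity'].
labels = ['low', 'high', 'medium', 'low', 'high',
          'medium', 'low', 'high', 'medium', 'low']
label_ids, label2id = encode_labels(labels)
train_idx, val_idx = train_val_split(len(labels))

print(label2id)
print(len(train_idx), len(val_idx))
```

The resulting index lists can then be used to select matching rows from `tokens['input_ids']`, `tokens['attention_mask']`, and the encoded labels, so that each split stays aligned.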