## Data Processing

### Phase 1: Data Cleaning & Normalization
The first step is to convert raw, noisy data into a uniform format.

In [1]:
# ---
# STEP 1: DATA CLEANING
# Removing unwanted noise such as HTML tags, irrelevant symbols, or ads.
# ---
import re

raw_data = "Check out our LLM tutorial!!! <ads>Visit site</ads> It's very helpfull."

def clean_text(text):
    # Remove HTML tags (the text between the tags is kept)
    text = re.sub(r'<.*?>', '', text)
    # Remove every character that is not a word character or whitespace,
    # including apostrophes ("It's" becomes "Its")
    text = re.sub(r'[^\w\s]', '', text)
    return text

cleaned_data = clean_text(raw_data)
print(f"Cleaned: {cleaned_data}")

# ---
# STEP 2: TEXT NORMALIZATION
# Converting text to lowercase so the same word always appears in the same form.
# ---
normalized_data = cleaned_data.lower()
print(f"Normalized: {normalized_data}")

Cleaned: Check out our LLM tutorial Visit site Its very helpfull
Normalized: check out our llm tutorial visit site its very helpfull
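
The cell above strips the `<ads>` tags but keeps the text between them, and it also deletes apostrophes. A minimal sketch of a stricter variant follows, assuming ad content is always wrapped in a paired `<ads>` element as in `raw_data`. The function name `clean_text_strict` is hypothetical and not part of the original notebook.

```python
import re

def clean_text_strict(text):
    # Drop <ads>...</ads> elements together with their contents (assumed ad markup)
    text = re.sub(r'<ads>.*?</ads>', ' ', text, flags=re.DOTALL)
    # Drop any remaining tags but keep their inner text
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove punctuation except apostrophes, so contractions such as "It's" survive
    text = re.sub(r"[^\w\s']", ' ', text)
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r'\s+', ' ', text).strip()

sample = "Check out our LLM tutorial!!! <ads>Visit site</ads> It's very helpfull."
print(clean_text_strict(sample))
```

Whether to keep apostrophes depends on the tokenizer used later; this sketch keeps them only to show that the choice is made in the cleaning step.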


### Phase 2: Tokenization & Numerical Conversion
Before training, text must be split into tokens and each token mapped to a number, because a model operates on numeric input.

In [None]:
# ---
# STEP 3: TOKENIZATION
# Breaking sentences into smaller units (tokens) such as words or sub-words.
# ---
import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models used by word_tokenize; newer NLTK releases may require "punkt_tab" instead.
nltk.download("punkt")

tokens = word_tokenize(normalized_data)
print(f"Tokens: {tokens}")

# ---
# STEP 4: NUMERICAL REPRESENTATION (Mock Example)
# Converting tokens into numbers so the neural network can process them.
# ---
# Note: set() has no guaranteed order, so the assigned IDs can differ between runs.
vocab = {word: i for i, word in enumerate(set(tokens))}
numerical_form = [vocab[token] for token in tokens]

print(f"Numerical Mapping: {vocab}")
print(f"Model Input: {numerical_form}")

Tokens: ['check', 'out', 'our', 'llm', 'tutorial', 'visit', 'site', 'its', 'very', 'helpfull']
Numerical Mapping: {'very': 0, 'helpfull': 1, 'visit': 2, 'out': 3, 'our': 4, 'its': 5, 'tutorial': 6, 'check': 7, 'llm': 8, 'site': 9}
Model Input: [7, 3, 4, 8, 6, 2, 9, 5, 0, 1]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/edwardlance/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
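
Because `set()` does not preserve order, the IDs printed above can change from run to run. A minimal sketch of a reproducible mapping follows, assigning IDs in order of first appearance and reserving ID 0 for unseen tokens. The helper names `build_vocab` and `encode` and the `<unk>` token are assumptions for illustration, not part of the original notebook.

```python
def build_vocab(tokens, unk_token="<unk>"):
    # Reserve ID 0 for tokens that were never seen while building the vocabulary
    vocab = {unk_token: 0}
    for token in tokens:
        # Assign the next free ID the first time a token appears
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, unk_token="<unk>"):
    # Map each token to its ID, falling back to the unknown-token ID
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

example_tokens = "check out our llm tutorial visit site its very helpfull".split()
example_vocab = build_vocab(example_tokens)
print(encode(example_tokens, example_vocab))              # IDs 1 through 10, in order
print(encode(["our", "gpu", "tutorial"], example_vocab))  # "gpu" is unseen and maps to 0
```

Production tokenizers typically use sub-word vocabularies instead; this word-level mapping only illustrates the token-to-ID step.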


### Phase 3: Dataset Preparation for Training
Finally, the prepared data is split into separate sets so that the model's performance can be measured on data it did not see during training.

In [None]:
# ---
# STEP 5: DATA SPLITTING
# Dividing the data into subsets. Training: for learning. Validation: for tuning.
# Test: for final evaluation. This cell performs only the train/test split.
# ---
from sklearn.model_selection import train_test_split

# Mock dataset of cleaned sentences
dataset = [
    "llm tutorial is helpful",
    "data preparation is key",
    "model training requires data",
]
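labels = [1, 1, 1]  # Placeholder labels, one per sentence

# Split sentences and labels together so each sentence stays paired with its label.
train_data, test_data, train_labels, test_labels = train_test_split(
    dataset, labels, test_size=0.2, random_state=42
)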

print(f"Training set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

Training set size: 2
Test set size: 1
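
The cell above produces only a training set and a test set. A minimal sketch of a three-way split follows, written with the standard library so the mechanics are visible. The function name `three_way_split`, the 80/10/10 fractions, and the ten-sentence mock dataset are assumptions for illustration.

```python
import random

def three_way_split(items, val_fraction=0.1, test_fraction=0.1, seed=42):
    # Shuffle a copy so the caller's list is left untouched
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Compute subset sizes; whatever is left over goes to training
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

mock_sentences = [f"sentence {i}" for i in range(10)]
train_set, val_set, test_set = three_way_split(mock_sentences)
print(len(train_set), len(val_set), len(test_set))  # 8 1 1
```

With a dataset as small as the three-sentence mock above, these fractions would leave the validation and test sets empty, so a three-way split only makes sense with more data.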
