# Step 1: Import Required Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 2: Load Cleaned & Labeled Dataset

This code loads the **cleaned and labeled comment dataset** from a CSV file into a DataFrame, then prints the **number of comments** and the **count of each sentiment label** as a quick check before any NLP or ML work.

In [2]:
# Load the cleaned and labeled dataset
df = pd.read_csv("youtube_comments_cleaned.csv")

# Display basic info
print("Total comments loaded:", len(df))
print(df["label"].value_counts())

Total comments loaded: 4454
label
Positive    2240
Negative    1730
Neutral      484
Name: count, dtype: int64
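Beyond row and label counts, a few extra checks can catch problems before the split. The sketch below is a minimal example, not part of the original notebook: `summarize_quality` is a hypothetical helper, and the small `toy` frame is made-up data standing in for `df`. It only assumes the `comment` and `label` column names used in this notebook.

```python
import pandas as pd


def summarize_quality(frame, text_col="comment", label_col="label"):
    """Return simple data-quality counts for a labeled comment DataFrame.

    Hypothetical helper for illustration; column names default to the
    ones used in this notebook.
    """
    missing = [c for c in (text_col, label_col) if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")

    text = frame[text_col]
    # Blank means non-null text that is empty after stripping whitespace
    blank = text.dropna().astype(str).str.strip().eq("").sum()
    return {
        "rows": len(frame),
        "null_comments": int(text.isna().sum()),
        "null_labels": int(frame[label_col].isna().sum()),
        "blank_comments": int(blank),
        "duplicate_comments": int(text.duplicated().sum()),
    }


# Made-up stand-in data; replace `toy` with `df` in the notebook
toy = pd.DataFrame({
    "comment": ["great video", "great video", None, "  ", "bad"],
    "label": ["Positive", "Positive", "Neutral", "Neutral", None],
})
report = summarize_quality(toy)
print(report)
```

If any of these counts is nonzero on the real data, it may be worth dropping or fixing those rows before splitting, since `train_test_split` with `stratify` does not accept missing labels.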


# Step 3: Split Data into Training and Testing Sets

This code splits the dataset into **training (80%) and testing (20%) subsets** for machine learning. Passing `stratify=y` preserves the **sentiment label distribution** in both subsets, and a fixed `random_state` makes the split reproducible.

In [3]:
X = df["comment"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training size:", len(X_train))
print("Testing size:", len(X_test))

Training size: 3563
Testing size: 891
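To confirm that stratification keeps the label proportions aligned across subsets, the proportions can be compared side by side. The sketch below is self-contained and uses synthetic stand-in data built from the class counts reported above (2240 / 1730 / 484); the `demo_` names are assumptions chosen so the notebook's own variables are not overwritten. In the notebook itself, the same comparison can be made with `y`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the loaded dataset
demo_labels = pd.Series(
    ["Positive"] * 2240 + ["Negative"] * 1730 + ["Neutral"] * 484,
    name="label",
)
demo_comments = pd.Series(
    [f"comment {i}" for i in range(len(demo_labels))], name="comment"
)

# Same split settings as the cell above
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_comments, demo_labels,
    test_size=0.2,
    random_state=42,
    stratify=demo_labels,
)

# Class proportions in the full set and in each subset, aligned by label
proportions = pd.DataFrame({
    "full": demo_labels.value_counts(normalize=True),
    "train": demo_y_train.value_counts(normalize=True),
    "test": demo_y_test.value_counts(normalize=True),
})

# Largest absolute deviation of any subset proportion from the full set
max_gap = (
    proportions[["train", "test"]]
    .sub(proportions["full"], axis=0)
    .abs()
    .max()
    .max()
)

print(proportions.round(4))
print("Largest deviation from full-set proportion:", round(max_gap, 4))
```

With a stratified split, the deviation should be very small; a noticeably larger gap would suggest that `stratify` was not applied.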


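This code calculates the **relative frequency of each sentiment class** in the full dataset to check **class balance**. Here the classes are imbalanced: Neutral accounts for only about 11% of comments, compared with roughly 50% Positive and 39% Negative. An imbalance like this can bias a model toward the majority classes, so it should be kept in mind during training and evaluation.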

In [4]:
df["label"].value_counts(normalize=True)

label
Positive    0.502919
Negative    0.388415
Neutral     0.108666
Name: proportion, dtype: float64
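One common way to account for this kind of imbalance is to weight classes inversely to their frequency. The sketch below is one possible approach, not something the original notebook does: it uses scikit-learn's `compute_class_weight` with the `"balanced"` setting on synthetic labels that reproduce the class counts shown above. The names `demo_y` and `class_weights` are assumptions for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the same class counts as the dataset above
demo_y = np.array(
    ["Positive"] * 2240 + ["Negative"] * 1730 + ["Neutral"] * 484
)

classes = np.unique(demo_y)

# "balanced" assigns each class n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=demo_y
)

# Plain Python types make the mapping easy to print or pass along
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)
```

The rarest class (Neutral) receives the largest weight. Many scikit-learn classifiers accept such a mapping through their `class_weight` parameter, or accept `class_weight="balanced"` directly; whether weighting actually helps here would need to be checked against the held-out test set.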