# **YouTube Spam Detection**



# 1: Library Imports

In this step, we import libraries needed for data handling, feature engineering, model building, and evaluation.

In [2]:
from fastai.text.all import *
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from textblob import TextBlob


# 2: Load and Split Dataset

This cell loads the dataset, renames the `CONTENT` column to `text`, and holds out 20% of the rows as a test set. The fixed `random_state` makes the split reproducible.

In [3]:
# Load dataset and split into train and test sets
file_path = '/content/drive/MyDrive/Colab Notebooks/Youtube-Spam-Dataset.csv'
data = pd.read_csv(file_path)
data = data.rename(columns={'CONTENT': 'text'})
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
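
A quick sanity check after splitting is to compare the class proportions of the two sets. The sketch below uses made-up label lists in place of `train_df['CLASS']` and `test_df['CLASS']`; the helper name `class_fractions` is illustrative and not part of the notebook.

```python
from collections import Counter

def class_fractions(labels):
    """Map each label to its share of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels standing in for train_df['CLASS'] and test_df['CLASS']
train_labels = [0, 1, 1, 0, 1, 0, 1, 1]
test_labels = [1, 0]

print(class_fractions(train_labels))
print(class_fractions(test_labels))
```

On the real data this would be called as `class_fractions(train_df['CLASS'])`. If the proportions differ noticeably between the two sets, passing `stratify=data['CLASS']` to `train_test_split` keeps them aligned.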

# 3: Feature Engineering

TextBlob polarity (from -1 to 1) and subjectivity (from 0 to 1) scores are computed for each comment as candidate signals for separating spam from non-spam comments.

In [4]:
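def feature_engineering(df):
    # Copy first so the slice returned by train_test_split is not modified in place
    df = df.copy()
    df['text'] = df['text'].fillna('')
    # TextBlob polarity lies in [-1, 1]; subjectivity lies in [0, 1]
    df['polarity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
    df['subjectivity'] = df['text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
    return df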

# Apply feature engineering to training data
train_df = feature_engineering(train_df)


# 4: Prepare Data for FastAI Text Classifier

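`TextDataLoaders.from_df` tokenizes the `text` column and builds training and validation batches, holding out 20% of the training rows for validation. Note that `cont_names` is not one of its documented parameters, and the AWD_LSTM classifier trained in the next step takes only token sequences as input, so the sentiment columns are probably not used by the model.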


In [5]:
dls = TextDataLoaders.from_df(
    train_df,
    text_col='text',
    label_col='CLASS',
    valid_pct=0.2,
    is_lm=False,
    y_block=CategoryBlock(),
    cont_names=['polarity', 'subjectivity']
)


# 5: Model Training

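This cell builds a text classifier on FastAI’s pretrained AWD_LSTM model and fine-tunes it for spam detection. `fine_tune(10)` first trains the new classifier head for one epoch with the pretrained layers frozen, then unfreezes the whole model and trains it for 10 more epochs, which is why two result tables appear in the output.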

In [6]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=[accuracy])
learn.fine_tune(10, base_lr=1e-3)




| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.617895 | 0.536575 | 0.753205 | 00:05 |

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.469445 | 0.426178 | 0.842949 | 00:02 |
| 1 | 0.452072 | 0.389508 | 0.839744 | 00:02 |
| 2 | 0.444852 | 0.378632 | 0.823718 | 00:02 |
| 3 | 0.429782 | 0.336121 | 0.846154 | 00:02 |
| 4 | 0.410244 | 0.317763 | 0.871795 | 00:03 |
| 5 | 0.383916 | 0.312682 | 0.871795 | 00:02 |
| 6 | 0.369441 | 0.306521 | 0.858974 | 00:02 |
| 7 | 0.359717 | 0.310439 | 0.86859 | 00:02 |
| 8 | 0.344527 | 0.308072 | 0.86859 | 00:02 |
| 9 | 0.341146 | 0.308504 | 0.86859 | 00:03 |
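
The 10-epoch log above can be scanned for the epoch with the lowest validation loss, a common criterion for choosing a checkpoint. The sketch below hard-codes the logged values (the `time` column is omitted) and uses only the standard library; the variable names are illustrative.

```python
import csv
import io

# Values copied from the 10-epoch training log above
log = """epoch,train_loss,valid_loss,accuracy
0,0.469445,0.426178,0.842949
1,0.452072,0.389508,0.839744
2,0.444852,0.378632,0.823718
3,0.429782,0.336121,0.846154
4,0.410244,0.317763,0.871795
5,0.383916,0.312682,0.871795
6,0.369441,0.306521,0.858974
7,0.359717,0.310439,0.86859
8,0.344527,0.308072,0.86859
9,0.341146,0.308504,0.86859
"""

rows = list(csv.DictReader(io.StringIO(log)))

# Epoch with the lowest validation loss
best_loss = min(rows, key=lambda r: float(r["valid_loss"]))

# Epochs that reach the highest validation accuracy
best_acc = max(float(r["accuracy"]) for r in rows)
best_acc_epochs = [int(r["epoch"]) for r in rows if float(r["accuracy"]) == best_acc]

print(best_loss["epoch"], best_loss["valid_loss"])
print(best_acc_epochs)
```

In this run the validation loss levels off at about 0.31 from epoch 5 onward while the training loss keeps falling, which suggests that the later epochs add little.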


# 6: Model Evaluation

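The trained model is evaluated on the held-out test data, and a classification report gives per-class precision, recall, and F1-score. `classification_report` expects the true labels as its first argument and the predictions as its second. The report recorded below was generated with the two arrays in the opposite order, so its precision and recall columns are interchanged and its support column counts predicted rather than true labels. Accuracy and the per-class F1-scores are unaffected.

In [7]:
# Apply the same feature engineering to the test set and build a labeled test DataLoader
test_df = feature_engineering(test_df)
test_dls = dls.test_dl(test_df, with_labels=True)

# With with_decoded=True, get_preds returns (probabilities, targets, decoded predictions)
probs, targets, decoded_preds = learn.get_preds(dl=test_dls, with_decoded=True)

# Generate and print classification report (true labels first, then predictions)
fine_tuned_targets_np = targets.numpy()
fine_tuned_preds_np = decoded_preds.numpy()
report = classification_report(fine_tuned_targets_np, fine_tuned_preds_np)
print("Classification Report:\n", report)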


Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.90      0.93       188
           1       0.91      0.97      0.94       204

    accuracy                           0.93       392
   macro avg       0.94      0.93      0.93       392
weighted avg       0.94      0.93      0.93       392
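
Per-class precision and recall depend on which array is treated as the ground truth. The following sketch computes both by hand on a small made-up label set; the labels, the assumed encoding (1 = spam, 0 = not spam), and the helper name `precision_recall` are illustrative and not taken from the notebook. Swapping the two arguments exchanges the two metrics.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels for illustration (assumed encoding: 1 = spam, 0 = not spam)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

correct_order = precision_recall(y_true, y_pred)
swapped_order = precision_recall(y_pred, y_true)

print(correct_order)
print(swapped_order)
```

Because F1 is the harmonic mean of precision and recall, it is the same in both cases; only the two individual columns trade places.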

