# Classifying YouTube Comments using Supervised Learning
In this notebook, we use supervised learning techniques to classify YouTube comments into categories such as positive, negative, or spam.

## Steps Involved:
1. **Data Collection**: Gather a dataset of YouTube comments along with their corresponding labels.
2. **Data Preprocessing**: Clean the comments by removing special characters and stop words, and apply any other necessary preprocessing.
3. **Feature Extraction**: Convert the text data into numerical features using techniques such as TF-IDF or word embeddings.
4. **Model Selection**: Choose a suitable supervised learning algorithm (e.g., Logistic Regression, SVM, Random Forest) for classification.
5. **Training the Model**: Train the model using the training dataset.
6. **Evaluation**: Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1-score.
7. **Prediction**: Use the trained model to classify new YouTube comments.
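
The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not the notebook's actual pipeline: the comments and labels below are invented toy data, and TF-IDF with logistic regression is just one of the options named in steps 3 and 4.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only (not from the dataset).
train_texts = [
    "Great video, very informative",
    "I love this channel",
    "Excellent reporting, thank you",
    "Terrible coverage, very biased",
    "This is awful and misleading",
    "Worst video I have ever seen",
]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

test_texts = ["I love this reporting", "awful and biased video"]
test_labels = ["positive", "negative"]

# Steps 3-5: TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_texts, train_labels)

# Steps 6-7: predict on held-out comments and score the predictions.
predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print(list(predictions), accuracy)
```

On a real dataset, the held-out set would come from a proper train/test split, and `sklearn.metrics.classification_report` would give precision, recall, and F1-score per class alongside accuracy.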

Let's get started!

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from transformers import pipeline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [2]:
# Load the scraped comments into a DataFrame
file_name = "BBC_News_comments.csv"

# The CSV is expected under <parent of the working directory>/Data/RawData/
current_dir = Path.cwd().parent
data_path = current_dir / "Data" / "RawData" / file_name

df = pd.read_csv(data_path)

In [4]:
df.head()

Unnamed: 0,comment
0,Skip navigation\r\nSign in\r\nChina announces ...
1,"You tariff me, I tariff you. It seems fair to ..."
2,Wonder how Trump administration is going to be...
3,"if Fentanyl enter US through Canada, why don't..."
4,"So trump said he’d tariff Canada, Canada threa..."
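
The preview shows that row 0 holds scraped page text ("Skip navigation", "Sign in", ...) rather than a comment. A minimal sketch of one way to filter such rows before classification follows. The helper name `drop_non_comments`, the example frame `df_example`, and the "Skip navigation" prefix rule are assumptions made for illustration; the real data may need a different rule.

```python
import pandas as pd

# Hypothetical stand-in rows mirroring the shape of the preview above.
df_example = pd.DataFrame(
    {
        "comment": [
            "Skip navigation\r\nSign in\r\nsome scraped page header",
            "You tariff me, I tariff you. It seems fair to me.",
            None,
        ]
    }
)


def drop_non_comments(frame: pd.DataFrame, column: str = "comment") -> pd.DataFrame:
    """Remove missing rows and rows that look like scraped page chrome."""
    cleaned = frame.dropna(subset=[column]).copy()
    cleaned[column] = cleaned[column].astype(str).str.strip()
    # Assumption: page chrome is recognizable by the "Skip navigation" prefix.
    is_page_chrome = cleaned[column].str.startswith("Skip navigation")
    return cleaned.loc[~is_page_chrome].reset_index(drop=True)


cleaned_df = drop_non_comments(df_example)
print(cleaned_df)
```

The `Unnamed: 0` column usually appears when a CSV was written with its index; if that is the case here, passing `index_col=0` to `pd.read_csv` would avoid it.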


In [6]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cpu
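
A zero-shot classifier needs no labeled training data. For a single text, the pipeline returns a dict with the keys `sequence`, `labels`, and `scores`, with the labels ordered from highest to lowest score. The sketch below shows one way to keep only the top label per comment. To stay runnable without downloading the model, it uses a stand-in function, `fake_classifier`, that mimics that dict shape with made-up scores; both `fake_classifier` and `top_labels` are hypothetical names introduced here.

```python
from typing import Callable, Dict, List


def top_labels(
    texts: List[str],
    classify: Callable[[str, List[str]], Dict],
    candidate_labels: List[str],
) -> List[Dict]:
    """Return one row per text holding its highest-scoring label."""
    rows = []
    for text in texts:
        result = classify(text, candidate_labels)
        # Labels are ordered by descending score, so index 0 is the best match.
        rows.append(
            {"comment": text, "label": result["labels"][0], "score": result["scores"][0]}
        )
    return rows


def fake_classifier(text: str, labels: List[str]) -> Dict:
    """Stand-in for the real pipeline: made-up scores, first label ranked on top."""
    scores = [0.9] + [0.1] * (len(labels) - 1)
    return {"sequence": text, "labels": list(labels), "scores": scores}


rows = top_labels(["You tariff me, I tariff you."], fake_classifier, ["Positive", "Negative"])
print(rows)
```

With the real pipeline, the call would presumably be `top_labels(texts, classifier, candidate_labels)`, and the resulting rows can be passed to `pd.DataFrame` for further analysis.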


In [7]:
candidate_labels = ["Positive", "Negative"]

texts = df['comment'].tolist()
print(texts)



In [8]:
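# Classify each comment and print it next to the zero-shot result
for text in texts:
    # result is a dict with the keys "sequence", "labels", and "scores"
    result = classifier(text, candidate_labels)
    print(text, result)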

Skip navigation
Sign in
China announces retaliatory action as Donald Trump's tariffs take effect | BBC News
BBC News
17.4M subscribers
Subscribe
4.1K
Share
BBC is a British public broadcast service. Wikipedia
501K views  1 day ago  #BBCNews #China #US
China has announced retaliatory tariffs on some American goods, as US tariffs on all Chinese goods come into force.
 …
...more
2,994 Comments
Sort by
Add a comment...
@manwingchi9156
1 day ago
You tariff me, I tariff you. It seems fair to me. 
907
Reply
59 replies
@NareshKumar-vz5jm
1 day ago
Wonder how Trump administration is going to be pumping Qardun Token
5K
Reply
2 replies
@yusamer9836
1 day ago
if Fentanyl enter US through Canada, why don't Canada experience the same problem of Fentanyl abuse like US??? logic says it should be abundant and cheaper in Canada compared to US
742
Reply
108 replies
@WhichDoctor1
1 day ago
So trump said he’d tariff Canada, Canada threatened to tariff back. Then trump g