# Spam Comments Detection with Machine Learning

Spam comment detection means classifying comments as spam or not spam. YouTube is one of the platforms that use Machine Learning to filter spam comments automatically, so that its creators do not have to deal with them manually.

Detecting spam comments is a text classification task in Machine Learning. On social media platforms, spam comments are typically posted to redirect users to another social media account, a website, or some other piece of content.

To detect spam comments with Machine Learning, we need a labelled dataset of comments, in which each comment is marked as spam or not spam.

In the section below, you will learn how to detect spam comments with machine learning using the Python programming language.

Let’s start this task by importing the necessary Python libraries and loading the dataset:

In [22]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

In [23]:
data = pd.read_csv("Youtube01-Psy.csv")
print(data.sample(5))

                                COMMENT_ID             AUTHOR  \
279      z13ytrypzmmxwrqsw22hfhf4ivbot5w2q            Hollz C   
101  z130gviqarmshdnau04cdzigvs3jepx4qw00k        khir abqari   
128  z121jvbjakmbehsys04ce5yj1y3hxx0hzsk0k  abdellah chafouai   
127    z123gnabnwbqtr1e022jejij0zzzez2os04      Luat ha hpuoc   
110      z123jlf4lzjbgpbcr23yhxyqbpe3gxpvm           TIGERIO_   

                    DATE                                            CONTENT  \
279  2014-11-08T09:17:52                         I'm watching this in 2014﻿   
101  2014-11-04T07:37:28  they said this video are not deserve 2billion ...   
128  2014-11-05T16:12:51  Discover a beautiful song of A young Moroccan ...   
127  2014-11-05T15:38:10  so crazy, over 2 billion views, not US, not Uk...   
110  2014-11-04T19:46:38  EHI GUYS CAN YOU SUBSCRIBE IN MY CHANNEL? I AM...   

     CLASS  
279      0  
101      0  
128      1  
127      0  
110      1  


We only need the content and class columns from the dataset for the rest of the task. So let’s select both columns and move on:

In [24]:
# Copy the selection so that later column assignments do not raise SettingWithCopyWarning
data = data[["CONTENT", "CLASS"]].copy()
print(data.sample(5))

                                               CONTENT  CLASS
8      You should check my channel for Funny VIDEOS!!﻿      1
281  how does this video have 2,127,322,484 views i...      0
282                             What my gangnam style﻿      0
90   https://www.indiegogo.com/projects/cleaning-th...      1
34                           2 billion....Coming soon﻿      0


The class column contains the values 0 and 1, where 0 indicates not spam and 1 indicates spam. To make the output easier to read, I will replace 0 and 1 with the labels "Not Spam" and "Spam Comment":

In [25]:
data["CLASS"] = data["CLASS"].map({0: "Not Spam",
                                   1: "Spam Comment"})
print(data.sample(5))

                                               CONTENT         CLASS
59                                      Subscribe ME!﻿  Spam Comment
104  need money?Enjoy https://www.tsu.co/emerson_za...  Spam Comment
29    Subscribe to me for free Android games, apps.. ﻿  Spam Comment
80     http://woobox.com/33gxrf/brt0u5 FREE CS GO!!!!﻿  Spam Comment
299     I am so awesome and smart!!! Sucscribe to me!﻿  Spam Comment
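
Before training, it can help to check how balanced the two labels are, because an accuracy score is easier to interpret when the share of each class is known. The sketch below is only an illustration: the `toy` DataFrame and its rows are made up and stand in for the real dataset, using the same column names.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: same columns, made-up rows.
toy = pd.DataFrame({
    "CONTENT": [
        "Subscribe to my channel!",
        "Great song, love it",
        "Check out my new video",
        "2 billion views, wow",
    ],
    "CLASS": ["Spam Comment", "Not Spam", "Spam Comment", "Not Spam"],
})

# Absolute count and relative share of each label.
counts = toy["CLASS"].value_counts()
shares = toy["CLASS"].value_counts(normalize=True)
print(counts)
print(shares)
```

Running the same two `value_counts` calls on the real `data` DataFrame would show whether one label dominates the dataset.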


### Training a Classification Model

Now let’s train a Machine Learning model to classify comments as spam or not spam. This is a binary classification problem, and I will use the Bernoulli Naive Bayes algorithm, which treats each word as a feature that is either present or absent in a comment, to train the model:

In [26]:
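# Raw comment text and its label
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])

# Turn each comment into a vector of word counts.
# Note: the vocabulary is learned from all comments, including the test split.
cv = CountVectorizer()
x = cv.fit_transform(x)
xtrain, xtest, ytrain, ytest = train_test_split(x, y,
                                                test_size=0.2,
                                                random_state=42)

# Fit Bernoulli Naive Bayes and print the accuracy on the held-out 20%
model = BernoulliNB()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))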

0.9857142857142858
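
To make the vectorization step more concrete, here is a small sketch of what `CountVectorizer` produces and what `BernoulliNB` effectively sees. The two example comments are made up for illustration and are not taken from the dataset. With its default `binarize=0.0` setting, `BernoulliNB` reduces every count to present or absent, so a word repeated twice carries the same weight as a word used once.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example comments, not taken from the dataset.
comments = ["subscribe subscribe to my channel", "love this song"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(comments)

# Vocabulary words ordered by their column index in the count matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
# ['channel', 'love', 'my', 'song', 'subscribe', 'this', 'to']

print(counts.toarray())
# [[1 0 1 0 2 0 1]
#  [0 1 0 1 0 1 0]]

# BernoulliNB (default binarize=0.0) only uses presence or absence,
# which corresponds to thresholding the counts at zero.
presence = (counts.toarray() > 0).astype(int)
print(presence)
# [[1 0 1 0 1 0 1]
#  [0 1 0 1 0 1 0]]
```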


Now let’s test the model by giving it a spam comment and a non-spam comment as input:

In [27]:
sample = "Check this out: https://amanxai.com/"
# Use a separate name so the `data` DataFrame is not overwritten
features = cv.transform([sample]).toarray()
print(model.predict(features))

['Spam Comment']


In [28]:
sample = "Lack of information!"
# Use a separate name so the `data` DataFrame is not overwritten
features = cv.transform([sample]).toarray()
print(model.predict(features))

['Not Spam']
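
If you want to classify several comments at once, one option is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that raw text can be passed in directly. This is a minimal sketch under stated assumptions: the training comments are made up, and `spam_pipeline` and `classify_comments` are hypothetical names introduced here, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical training comments standing in for the real dataset.
train_comments = [
    "Subscribe to my channel for free games",
    "Check out my new video http://example.com",
    "Please subscribe and check my channel",
    "I love this song so much",
    "This video has over two billion views",
    "Still watching this song in 2014",
]
train_labels = ["Spam Comment"] * 3 + ["Not Spam"] * 3

# The pipeline vectorizes the text and classifies it in a single call.
spam_pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
spam_pipeline.fit(train_comments, train_labels)


def classify_comments(comments):
    """Return (comment, predicted label, spam probability) per comment."""
    labels = spam_pipeline.predict(comments)
    spam_index = list(spam_pipeline.classes_).index("Spam Comment")
    probabilities = spam_pipeline.predict_proba(comments)[:, spam_index]
    return list(zip(comments, labels, probabilities))


results = classify_comments(["Subscribe to my channel", "I love this song"])
for comment, label, probability in results:
    print(f"{label:>12}  {probability:.2f}  {comment}")
```

The probability column gives a rough sense of how confident the model is, which can be useful if borderline comments should be reviewed by hand rather than filtered automatically.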


So this is how you can train a Machine Learning model for spam comment detection using Python.
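
The accuracy printed earlier is a single number, so it does not show whether the errors are spam that slipped through or legitimate comments that were flagged. The sketch below shows one way to look at both, using a confusion matrix and a per-class report. It also fits the vectorizer on the training split only, whereas the training code above fits it on all comments before splitting. The comments here are made up for illustration, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical comments standing in for the real dataset.
comments = [
    "Subscribe to my channel",
    "Check out my new video",
    "Free games, visit my page",
    "Please subscribe to me",
    "Follow me for free gift cards",
    "Click the link on my channel",
    "I love this song",
    "Two billion views, amazing",
    "Still watching in 2014",
    "This song never gets old",
    "Best music video ever",
    "How does this have so many views",
]
labels = ["Spam Comment"] * 6 + ["Not Spam"] * 6

train_x, test_x, train_y, test_y = train_test_split(
    comments, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fitting the pipeline on the training split means the vocabulary
# is learned without looking at the test comments.
pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
pipeline.fit(train_x, train_y)
predicted = pipeline.predict(test_x)

# Rows are true labels, columns are predicted labels, in label_order.
label_order = ["Not Spam", "Spam Comment"]
matrix = confusion_matrix(test_y, predicted, labels=label_order)
print(matrix)
print(classification_report(test_y, predicted,
                            labels=label_order, zero_division=0))
```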

## Summary

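Spam comment detection means classifying comments as spam or not spam. On social media platforms, spam comments are typically posted to redirect users to another social media account, a website, or some other piece of content. In this article, a Bernoulli Naive Bayes model trained on word counts from labelled YouTube comments reached an accuracy of about 98.6% on the held-out test split.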