# Spam Comments Detection with Machine Learning

Spam comments detection means classifying comments as spam or not spam. YouTube is one of the platforms that use Machine Learning to filter spam comments automatically and protect its creators from them.

Detecting spam comments is a text classification task in Machine Learning. On social media platforms, spam comments are typically posted to redirect users to another social media account, website, or other piece of content.

To detect spam comments with Machine Learning, we need a labelled dataset in which each comment is marked as spam or not spam.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

In [2]:
data = pd.read_csv("Youtube01-Psy.csv")
print(data.sample(5))

                                COMMENT_ID                    AUTHOR  \
203      z13cc1abmqz5cjpkc223ybzavyibznjey                       전광용   
84   z13vs53zipmszf1ib04cerd5xlmzdz5qlzw0k          Minecraft-Viasat   
298    z12mubjyckfjvhmni23bc5nxbnrggtvjq04  Bishwaroop Bhattacharjee   
27     z13kszcinpnvc34v2234fnpxkpmlw3nhc04                Kyle Jaber   
162      z13dvrmxorf0cnj0423rsfcjlxevgjwll                   ZodexHD   

                    DATE                                            CONTENT  \
203  2014-11-07T15:14:48                                        Fantastic!﻿   
84   2014-11-03T14:38:53                                  Check my channel﻿   
298  2014-11-08T12:34:11  https://www.facebook.com/SchoolGeniusNITS/phot...   
27   2014-01-19T00:21:29            Check me out! I'm kyle. I rap so yeah ﻿   
162  2014-11-06T17:50:59  look at my channel i make minecraft pe lets pl...   

     CLASS  
203      0  
84       1  
298      1  
27       1  
162      1  


We only need the CONTENT and CLASS columns from the dataset for the rest of the task, so let’s select those two columns and move on:

In [7]:
# .copy() avoids a possible SettingWithCopyWarning when CLASS is reassigned later
data = data[["CONTENT", "CLASS"]].copy()

In [8]:
print(data.sample(5))

                                               CONTENT  CLASS
263                                               LoL﻿      0
78   -----&gt;&gt;&gt;&gt;  https://www.facebook.co...      1
156    Search "Chubbz Dinero - Ready Or Not " Thanks ﻿      1
74   http://www.guardalo.org/best-of-funny-cats-gat...      1
320  If the shitty Chinese Government didn't block ...      0


The CLASS column contains the values 0 and 1, where 0 indicates not spam and 1 indicates spam. To make the output easier to read, I will replace 0 and 1 with the labels "Not Spam" and "Spam Comment":

In [9]:
data["CLASS"] = data["CLASS"].map({0: "Not Spam",
                                   1: "Spam Comment"})
print(data.sample(5))

                                               CONTENT         CLASS
228  Like if you came here too see how many views t...      Not Spam
288  if i reach 100 subscribers i will go round in ...  Spam Comment
265  9 year olds be like, 'How does this have 2 bil...      Not Spam
274  You know a song sucks dick when you need to us...      Not Spam
91   There is one video on my channel about my brot...  Spam Comment
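
Before training, it can help to check how balanced the two labels are, since an accuracy score is easier to interpret when neither class dominates. The snippet below is a minimal sketch that uses a small made-up DataFrame named `toy` as a stand-in for the real dataset; the rows are illustrative and not taken from the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; these rows are illustrative only.
toy = pd.DataFrame({
    "CONTENT": ["Check my channel", "Great song", "Subscribe to me please", "LoL"],
    "CLASS": ["Spam Comment", "Not Spam", "Spam Comment", "Not Spam"],
})

# Share of each label, expressed as fractions that sum to 1.
class_share = toy["CLASS"].value_counts(normalize=True)
print(class_share)
```

The same `value_counts(normalize=True)` call can be run on the `CLASS` column of the real `data` DataFrame.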


## Training a Classification Model

Now let’s train a Machine Learning model to classify comments as spam or not spam. This is a binary classification problem on short texts, so I will use the Bernoulli Naive Bayes algorithm, which models whether each word is present in or absent from a comment:

In [10]:
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])

In [12]:
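# Convert each comment into a vector of word counts (bag of words)
cv = CountVectorizer()
# Note: the vocabulary is built from all comments before the split,
# so words that only occur in the test set still become features
x = cv.fit_transform(x)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)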
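
`CountVectorizer` produces word counts, while Bernoulli Naive Bayes only models whether a word occurs at all. In scikit-learn, `BernoulliNB` binarizes its input by default (`binarize=0.0`), so feeding it raw counts should give the same fitted model as feeding it 0/1 presence features. The sketch below checks this on a few made-up comments; the texts and labels are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy comments, for illustration only.
comments = [
    "check my channel",
    "check check check my channel",
    "great song",
    "love this song",
]
labels = ["Spam Comment", "Spam Comment", "Not Spam", "Not Spam"]

# Raw word counts versus 0/1 word presence.
counts = CountVectorizer().fit_transform(comments)
presence = CountVectorizer(binary=True).fit_transform(comments)

# BernoulliNB thresholds features at 0.0 by default, so any count > 0 becomes 1.
nb_counts = BernoulliNB().fit(counts, labels)
nb_presence = BernoulliNB().fit(presence, labels)

# Compare the learned per-class word log-probabilities.
same_model = np.allclose(nb_counts.feature_log_prob_, nb_presence.feature_log_prob_)
print(same_model)
```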

In [13]:
model = BernoulliNB()
model.fit(xtrain, ytrain)
print(model.score(xtest, ytest))

0.9857142857142858
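
A single accuracy number does not show which kind of mistake the model makes. In the cells above, the vectorizer is also fitted on all comments before the split. One possible alternative, sketched below under the assumption of a small made-up dataset, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so the vocabulary is learned from the training split only, and then to look at a confusion matrix. The texts, labels and variable names here are illustrative and are not the author's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only.
texts = [
    "check my channel", "subscribe to my channel", "visit my website now",
    "check out my new video", "please subscribe to me", "follow my page please",
    "great song", "love this video", "this never gets old",
    "best music ever", "still listening in 2014", "what a classic",
]
labels = ["Spam Comment"] * 6 + ["Not Spam"] * 6

# Split the raw texts first, keeping the class ratio in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vocabulary on the training split only.
pipeline = make_pipeline(CountVectorizer(), BernoulliNB())
pipeline.fit(x_train, y_train)

accuracy = pipeline.score(x_test, y_test)
predictions = pipeline.predict(x_test)

# Rows are true labels, columns are predicted labels, in the order given.
matrix = confusion_matrix(y_test, predictions, labels=["Not Spam", "Spam Comment"])
print(accuracy)
print(matrix)
```

With a pipeline, new comments can be passed to `pipeline.predict` as plain strings, without calling the vectorizer separately.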


Now let’s test the model by giving it a spam comment and a non-spam comment as input:

In [14]:
sample = "Check this out: https://thecleverprogrammer.com/"
# Use a new name so the DataFrame stored in `data` is not overwritten
features = cv.transform([sample]).toarray()
print(model.predict(features))

['Spam Comment']


In [15]:
sample = "Lack of information!"
# Use a new name so the DataFrame stored in `data` is not overwritten
features = cv.transform([sample]).toarray()
print(model.predict(features))

['Not Spam']
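
To check several comments at once, the transform and predict steps can be wrapped in a small helper that also reports the estimated spam probability. This is a minimal sketch: `classify_comments` is a hypothetical helper name, and the model here is trained on a few made-up comments so the example runs on its own. With the real `cv` and `model` from the cells above, only the function is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy training data, for illustration only.
train_texts = [
    "check my channel", "subscribe to my channel", "visit my website",
    "great song", "love this song", "nice video",
]
train_labels = ["Spam Comment"] * 3 + ["Not Spam"] * 3

cv = CountVectorizer()
model = BernoulliNB().fit(cv.fit_transform(train_texts), train_labels)


def classify_comments(comments):
    """Return (comment, predicted label, spam probability) for each comment."""
    features = cv.transform(comments)  # sparse input is accepted directly
    predicted = model.predict(features)
    # Look up which probability column belongs to the spam label.
    spam_index = list(model.classes_).index("Spam Comment")
    spam_probability = model.predict_proba(features)[:, spam_index]
    return list(zip(comments, predicted, spam_probability))


results = classify_comments(["check my channel", "great song"])
for comment, label, probability in results:
    print(f"{label:>12}  {probability:.2f}  {comment}")
```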
