### Problem
Spam comment detection. Spam refers to unsolicited, unwanted, or irrelevant messages sent in bulk through electronic messaging systems such as email, text messages, or social media. It is usually sent for commercial purposes, such as advertising products or services, but it can also be malicious, as with phishing scams or malware distribution.

Spam is generally considered a nuisance and can be harmful: it consumes network resources, slows down systems, and exposes users to fraud and other security risks. Many email providers and social media platforms filter or block spam using techniques such as spam filters, blacklists, or CAPTCHA tests. Definitely not something we want around.

#### About Dataset
Abstract: A public set of comments collected for spam research. It comprises five datasets with a total of 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.
Data Set Information:
The table below lists the datasets, the YouTube video ID, the number of samples in each class and the total number of samples per dataset.

| Dataset | YouTube ID | # Spam | # Ham | Total |
|---|---|---|---|---|
| Psy | 9bZkp7q19f0 | 175 | 175 | 350 |
| KatyPerry | CevxZvSJLk8 | 175 | 175 | 350 |
| LMFAO | KQ6zr6kCPj8 | 236 | 202 | 438 |
| Eminem | uelHwf8o7_U | 245 | 203 | 448 |
| Shakira | pRpeEdMmmQ0 | 174 | 196 | 370 |

Note: the chronological order of the comments was kept.
The collection consists of one CSV file per dataset, where each line has the following attributes (the label column appears as CLASS in the CSV files loaded below):
COMMENT_ID,AUTHOR,DATE,CONTENT,TAG
For further details, visit http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection#
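
As a quick sanity check on the table above, the per-video counts can be summed to confirm the stated total and to see how balanced the two classes are. This is a minimal sketch: the numbers are copied from the table, and the variable names are illustrative only.

```python
# Per-video class counts copied from the table above: (spam, ham)
counts = {
    "Psy": (175, 175),
    "KatyPerry": (175, 175),
    "LMFAO": (236, 202),
    "Eminem": (245, 203),
    "Shakira": (174, 196),
}

total_spam = sum(spam for spam, _ in counts.values())
total_ham = sum(ham for _, ham in counts.values())
total = total_spam + total_ham

# Share of spam across all five datasets
spam_share = total_spam / total

print(f"spam={total_spam}, ham={total_ham}, total={total}, spam share={spam_share:.3f}")
```

The classes come out close to balanced, which is one reason plain accuracy is a reasonable first metric for this data.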

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ML Libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

In [2]:
# import data
psy = pd.read_csv('Youtube01-Psy.csv')
eminem = pd.read_csv('Youtube04-Eminem.csv')
katy = pd.read_csv('Youtube02-KatyPerry.csv')
lmfao = pd.read_csv('Youtube03-LMFAO.csv')
shakira = pd.read_csv('Youtube05-Shakira.csv')

In [3]:
# Data
psy.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [4]:
# Limiting columns to necessary columns for spam detection
psy = psy[["CONTENT", "CLASS"]]
katy = katy[["CONTENT", "CLASS"]]
eminem = eminem[["CONTENT", "CLASS"]]
lmfao = lmfao[["CONTENT", "CLASS"]]
shakira = shakira[["CONTENT", "CLASS"]]

In [5]:
# Data
psy.head()

Unnamed: 0,CONTENT,CLASS
0,"Huh, anyway check out this you[tube] channel: ...",1
1,Hey guys check out my new channel and our firs...,1
2,just for test I have to say murdev.com,1
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [6]:
# Splitting Data to X and y
X = np.array(psy.CONTENT)
y = np.array(psy.CLASS)

In [7]:
# Count Vectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)
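
To make the vectorization step concrete, here is a simplified pure-Python sketch of what a bag-of-words count matrix looks like. It approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) but is not the library's implementation; the helper names and toy comments are hypothetical.

```python
import re
from collections import Counter

# Approximates CountVectorizer's default token pattern: words of 2+ characters
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    terms = sorted({token for doc in docs for token in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def transform(docs, vocab):
    """Count occurrences of each vocabulary term per document.

    Tokens missing from the vocabulary are ignored, as they are when
    transforming new text with an already fitted vectorizer.
    """
    rows = []
    for doc in docs:
        term_counts = Counter(tokenize(doc))
        rows.append([term_counts.get(term, 0) for term in vocab])
    return rows


# Hypothetical toy comments, not taken from the dataset
toy_comments = ["Check out my channel", "check this out", "I love this song"]
vocab = fit_vocabulary(toy_comments)
matrix = transform(toy_comments, vocab)

print(list(vocab))
for row in matrix:
    print(row)
```

Each row is one comment and each column one vocabulary term, which is the shape of the (sparse) matrix that the classifiers are trained on.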

In [14]:
# Function to evaluate models
def model(X, y):
    # Hold out 20% of the comments for testing
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
    # Local names avoid shadowing the function name `model`
    nb, rf, cb = MultinomialNB(), RandomForestClassifier(random_state=23), CatBoostClassifier()
    for estimator in (nb, rf, cb):
        estimator.fit(Xtrain, ytrain)
    # score() reports mean accuracy on the held-out split
    print(f'MultinomialNB = {nb.score(Xtest, ytest)}, Random Forest = {rf.score(Xtest, ytest)}, Catboost = {cb.score(Xtest, ytest)}')
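
The evaluation function above reports `score`, which for these classifiers is mean accuracy. For spam filtering it can also help to look at precision and recall, since flagging a legitimate comment and missing a spam comment are different kinds of error. The following is a minimal pure-Python sketch of how these numbers relate; the function names and the label lists are hypothetical and not taken from the notebook's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the spam label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall(y_true, y_pred):
    """Precision: flagged comments that are spam. Recall: spam that was flagged."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for illustration (1 = spam, 0 = ham)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(accuracy(y_true, y_pred), precision_recall(y_true, y_pred))
```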

In [15]:
# Train model with Psy DataSet
model(X, y)

Learning rate set to 0.005982
0:	learn: 0.6880886	total: 10.6ms	remaining: 10.6s
1:	learn: 0.6841643	total: 20.5ms	remaining: 10.2s
2:	learn: 0.6796977	total: 30.5ms	remaining: 10.1s
3:	learn: 0.6743778	total: 40.2ms	remaining: 10s
4:	learn: 0.6709078	total: 50.1ms	remaining: 9.97s
5:	learn: 0.6674541	total: 59.6ms	remaining: 9.88s
6:	learn: 0.6626340	total: 67.4ms	remaining: 9.56s
7:	learn: 0.6587286	total: 77.4ms	remaining: 9.59s
8:	learn: 0.6564586	total: 87.2ms	remaining: 9.6s
9:	learn: 0.6524486	total: 97ms	remaining: 9.6s
10:	learn: 0.6481186	total: 107ms	remaining: 9.59s
11:	learn: 0.6430150	total: 116ms	remaining: 9.58s
12:	learn: 0.6397553	total: 122ms	remaining: 9.27s
13:	learn: 0.6365211	total: 132ms	remaining: 9.31s
14:	learn: 0.6333633	total: 142ms	remaining: 9.33s
15:	learn: 0.6298585	total: 152ms	remaining: 9.33s
16:	learn: 0.6247041	total: 162ms	remaining: 9.34s
17:	learn: 0.6214006	total: 171ms	remaining: 9.34s
18:	learn: 0.6165818	total: 181ms	remaining: 9.34s
...

179:	learn: 0.3250269	total: 1.91s	remaining: 8.69s
180:	learn: 0.3242163	total: 1.92s	remaining: 8.69s
181:	learn: 0.3238657	total: 1.93s	remaining: 8.69s
182:	learn: 0.3229861	total: 1.94s	remaining: 8.68s
183:	learn: 0.3215621	total: 1.95s	remaining: 8.66s
184:	learn: 0.3207597	total: 1.96s	remaining: 8.65s
185:	learn: 0.3199549	total: 1.97s	remaining: 8.63s
186:	learn: 0.3188932	total: 1.98s	remaining: 8.62s
187:	learn: 0.3180399	total: 1.99s	remaining: 8.6s
188:	learn: 0.3171259	total: 2s	remaining: 8.59s
189:	learn: 0.3162683	total: 2.01s	remaining: 8.57s
190:	learn: 0.3151888	total: 2.02s	remaining: 8.56s
191:	learn: 0.3147765	total: 2.03s	remaining: 8.54s
192:	learn: 0.3135691	total: 2.04s	remaining: 8.53s
193:	learn: 0.3133001	total: 2.05s	remaining: 8.5s
194:	learn: 0.3123130	total: 2.06s	remaining: 8.49s
195:	learn: 0.3115498	total: 2.07s	remaining: 8.47s
196:	learn: 0.3106087	total: 2.08s	remaining: 8.46s
197:	learn: 0.3096519	total: 2.08s	remaining: 8.45s
...

344:	learn: 0.2309488	total: 3.57s	remaining: 6.78s
345:	learn: 0.2306270	total: 3.58s	remaining: 6.78s
346:	learn: 0.2303996	total: 3.6s	remaining: 6.77s
347:	learn: 0.2301132	total: 3.61s	remaining: 6.76s
348:	learn: 0.2296450	total: 3.62s	remaining: 6.75s
349:	learn: 0.2294613	total: 3.63s	remaining: 6.74s
350:	learn: 0.2291751	total: 3.64s	remaining: 6.72s
351:	learn: 0.2286935	total: 3.65s	remaining: 6.71s
352:	learn: 0.2283148	total: 3.65s	remaining: 6.7s
353:	learn: 0.2281089	total: 3.67s	remaining: 6.69s
354:	learn: 0.2276733	total: 3.67s	remaining: 6.68s
355:	learn: 0.2274670	total: 3.68s	remaining: 6.67s
356:	learn: 0.2271216	total: 3.69s	remaining: 6.65s
357:	learn: 0.2267317	total: 3.7s	remaining: 6.64s
358:	learn: 0.2266227	total: 3.71s	remaining: 6.63s
359:	learn: 0.2264070	total: 3.72s	remaining: 6.62s
360:	learn: 0.2258901	total: 3.73s	remaining: 6.61s
361:	learn: 0.2254808	total: 3.74s	remaining: 6.59s
362:	learn: 0.2248689	total: 3.75s	remaining: 6.58s
...

511:	learn: 0.1840292	total: 5.22s	remaining: 4.98s
512:	learn: 0.1837400	total: 5.24s	remaining: 4.97s
513:	learn: 0.1835892	total: 5.25s	remaining: 4.96s
514:	learn: 0.1833175	total: 5.26s	remaining: 4.95s
515:	learn: 0.1829979	total: 5.27s	remaining: 4.94s
516:	learn: 0.1828266	total: 5.28s	remaining: 4.93s
517:	learn: 0.1826262	total: 5.29s	remaining: 4.92s
518:	learn: 0.1824090	total: 5.3s	remaining: 4.91s
519:	learn: 0.1820753	total: 5.31s	remaining: 4.9s
520:	learn: 0.1818453	total: 5.32s	remaining: 4.89s
521:	learn: 0.1817035	total: 5.33s	remaining: 4.88s
522:	learn: 0.1814258	total: 5.34s	remaining: 4.87s
523:	learn: 0.1811555	total: 5.35s	remaining: 4.86s
524:	learn: 0.1808481	total: 5.36s	remaining: 4.85s
525:	learn: 0.1806992	total: 5.37s	remaining: 4.84s
526:	learn: 0.1802155	total: 5.38s	remaining: 4.83s
527:	learn: 0.1799002	total: 5.39s	remaining: 4.82s
528:	learn: 0.1797233	total: 5.4s	remaining: 4.8s
529:	learn: 0.1792625	total: 5.41s	remaining: 4.79s
...

677:	learn: 0.1543594	total: 6.88s	remaining: 3.27s
678:	learn: 0.1541656	total: 6.89s	remaining: 3.26s
679:	learn: 0.1538892	total: 6.91s	remaining: 3.25s
680:	learn: 0.1538409	total: 6.92s	remaining: 3.24s
681:	learn: 0.1538040	total: 6.93s	remaining: 3.23s
682:	learn: 0.1533965	total: 6.94s	remaining: 3.22s
683:	learn: 0.1533573	total: 6.95s	remaining: 3.21s
684:	learn: 0.1533115	total: 6.96s	remaining: 3.2s
685:	learn: 0.1530571	total: 6.97s	remaining: 3.19s
686:	learn: 0.1529874	total: 6.98s	remaining: 3.18s
687:	learn: 0.1529497	total: 6.99s	remaining: 3.17s
688:	learn: 0.1528405	total: 7s	remaining: 3.16s
689:	learn: 0.1527917	total: 7.01s	remaining: 3.15s
690:	learn: 0.1525753	total: 7.02s	remaining: 3.14s
691:	learn: 0.1522897	total: 7.03s	remaining: 3.13s
692:	learn: 0.1522526	total: 7.04s	remaining: 3.12s
693:	learn: 0.1522144	total: 7.05s	remaining: 3.11s
694:	learn: 0.1521663	total: 7.06s	remaining: 3.1s
695:	learn: 0.1520242	total: 7.07s	remaining: 3.09s
...

852:	learn: 0.1306743	total: 8.79s	remaining: 1.51s
853:	learn: 0.1305219	total: 8.81s	remaining: 1.5s
854:	learn: 0.1304839	total: 8.82s	remaining: 1.5s
855:	learn: 0.1302302	total: 8.83s	remaining: 1.49s
856:	learn: 0.1301824	total: 8.84s	remaining: 1.48s
857:	learn: 0.1301545	total: 8.85s	remaining: 1.47s
858:	learn: 0.1299660	total: 8.86s	remaining: 1.45s
859:	learn: 0.1298962	total: 8.87s	remaining: 1.44s
860:	learn: 0.1298614	total: 8.88s	remaining: 1.43s
861:	learn: 0.1296426	total: 8.89s	remaining: 1.42s
862:	learn: 0.1296174	total: 8.9s	remaining: 1.41s
863:	learn: 0.1295665	total: 8.91s	remaining: 1.4s
864:	learn: 0.1294298	total: 8.92s	remaining: 1.39s
865:	learn: 0.1290912	total: 8.93s	remaining: 1.38s
866:	learn: 0.1290388	total: 8.94s	remaining: 1.37s
867:	learn: 0.1289318	total: 8.95s	remaining: 1.36s
868:	learn: 0.1285637	total: 8.96s	remaining: 1.35s
869:	learn: 0.1283813	total: 8.97s	remaining: 1.34s
870:	learn: 0.1280992	total: 8.98s	remaining: 1.33s
...

In [22]:
import pickle

In [24]:
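# Dump model: `model` above is the evaluation function, not a fitted classifier,
# so fit a MultinomialNB on the Psy features and pickle that (an assumption about the intent)
nb = MultinomialNB().fit(X, y)
with open('model.pkl', 'wb') as f:
    pickle.dump(nb, f)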

In [50]:
# load model 
with open('model.pkl', 'rb') as f:
    clf = pickle.load(f)

In [51]:
sample = katy.CONTENT[3]
data = cv.transform([sample]).toarray()
print(clf.predict(data))

[1]
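
Note that the prediction above still relies on the in-memory `cv` object: the pickle file holds only the classifier, so a fresh process would also need the fitted vectorizer. One possible way around this, sketched here under the assumption that a scikit-learn `Pipeline` suits the use case, is to bundle vectorizer and classifier into a single estimator and pickle that. The toy comments and variable names are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments standing in for the CONTENT column (1 = spam, 0 = ham)
toy_comments = [
    "check out my channel",
    "subscribe to my channel please",
    "visit my website for free gifts",
    "I love this song",
    "this video is amazing",
    "best song ever",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier as one estimator
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_comments, toy_labels)

# Round-trip through pickle in memory; pickle.dump to a file works the same way
restored = pickle.loads(pickle.dumps(spam_pipeline))

# The restored object accepts raw strings, so no separate cv.transform call is needed
new_comment = ["please check out my new channel"]
prediction = restored.predict(new_comment)
print(prediction)
```

A side benefit of this design is that, when the pipeline is fitted only on the training split, the vocabulary is learned from training comments alone rather than from the full dataset.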


In [52]:
katy.head(50)

Unnamed: 0,CONTENT,CLASS
0,i love this so much. AND also I Generate Free ...,1
1,http://www.billboard.com/articles/columns/pop-...,1
2,Hey guys! Please join me in my fight to help a...,1
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...,1
4,Hey everyone. Watch this trailer!!!!!!!! http...,1
5,check out my rapping hope you guys like it ht...,1
6,"Subscribe pleaaaase to my instagram account , ...",1
7,hey guys!! visit my channel pleaase (i'm searc...,1
8,Nice! http://www.barnesandnoble.com/s/BDP?csrf...,1
9,http://www.twitch.tv/daconnormc﻿,1
