[View in Colaboratory](https://colab.research.google.com/github/emjames/neural-networks/blob/dev/spam_ham.ipynb)

### TubeSpam: Comment Spam Filtering on YouTube
[White paper](https://ieeexplore.ieee.org/document/7424299/)

### Semantically Enriched Machine Learning Approach to Filter YouTube Comments for Socially Augmented User Models
[White paper](https://pdfs.semanticscholar.org/8979/0601d6373766e5a7b99729c8f8bd8b4abf2b.pdf)


In [0]:
# Get the data
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip'
!mkdir -p /data
!unzip 'YouTube-Spam-Collection-v1.zip' -d /data

In [0]:
!ls /data

In [0]:
import os
import glob
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
path = r'/data'
all_files = glob.glob(os.path.join(path, '*.csv'))
# Concat the data files into one
data = pd.concat(pd.read_csv(file) for file in all_files)
df_train, df_test = train_test_split(data, test_size=0.33, random_state=42)

In [0]:
# Explore the data
df_train.head(5)

In [5]:
# Keep only the CONTENT (comment text) and CLASS (1 = spam, 0 = ham) columns
df_train = df_train[['CONTENT', 'CLASS']]
df_test = df_test[['CONTENT', 'CLASS']]
df_train.head(10)

,CONTENT,CLASS
355,I love dis song!! 3,0
99,Hey guys! Can you please SUBSCRIBE to my chan...,1
88,The great mother of the jungle. Sweet and natu...,0
31,"Came here to check the views, goodbye.﻿",0
197,What does that tattoo on her right tricep say?﻿,0
79,<3 this song so much.SHAKIRA YOUR A REALLY ...,0
293,How did you know that people makes another acc...,1
386,Check out this video on YouTube:﻿,1
193,katy perry is awesome﻿,0
177,https://www.surveymonkey.com/s/CVHMKLT﻿,1


In [0]:
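# Pre-process the comments: remove English stop words
import string
# ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module has been removed
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def text_preprocess(text):
  # Optional: also strip punctuation (left disabled, so punctuation is kept)
  # text = text.translate(str.maketrans('', '', string.punctuation))
  # Drop stop words, matching case-insensitively; the remaining words keep their case
  words = [word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]
  return ' '.join(words)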

In [0]:
df_content = df_train['CONTENT'].apply(text_preprocess)

In [8]:
df_content.head(5)

355                                    love dis song!! 3
99     Hey guys! SUBSCRIBE channel,because I'm gonna ...
88     great mother jungle. Sweet natural. like videos.﻿
31                           Came check views, goodbye.﻿
197                       does tattoo right tricep say?﻿
Name: CONTENT, dtype: object

In [0]:
# TF-IDF (term frequency-inverse document frequency) scores how relevant
# a word is to a comment. A word's count within the comment is discounted
# by the number of comments that contain it, so words that appear almost
# everywhere get a low weight and distinctive words get a high one.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
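
To make the weighting concrete, here is a minimal sketch of fitting a `TfidfVectorizer` on a few toy comments. The `comments` list is made-up data standing in for the preprocessed CONTENT column, and the `tfidf`, `vocab` and `idf` names are only illustrative; the notebook itself may apply the vectorizer differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy comments, standing in for the preprocessed CONTENT column
comments = [
    "love song",
    "subscribe channel free",
    "love video subscribe channel",
]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then build the sparse
# document-term matrix (one row per comment, one column per term)
tfidf = vectorizer.fit_transform(comments)

# vocabulary_ maps each term to its column index in the matrix
vocab = vectorizer.vocabulary_
# Pair every term with its learned inverse document frequency
idf = {term: vectorizer.idf_[col] for term, col in vocab.items()}

print(tfidf.shape)
for term in sorted(idf, key=idf.get):
    print(f"{term:10s} idf={idf[term]:.3f}")
```

Terms that occur in only one comment (such as `free`) end up with a higher IDF than terms shared by several comments (such as `love`). On the real data, the same `fit_transform` call would presumably be run on `df_content`, with `transform` reused for the test comments.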