# 6.6 Assignment 6: Naïve Bayes

### TABLE OF CONTENTS
[Bayes' Theorem](#Bayes'-Theorem)

[YouTube Spam Comments: NB application](#YouTube-Spam-Comments:-NB-application)

# Bayes' Theorem

This section explains how to turn P(E|H) into P(H|E), with E = Evidence and H = Hypothesis, in layman's terms, and uses a real-life example to demonstrate it.

P(E|H) and P(H|E) are generally not equal, because conditional probability is not symmetric. P(E|H) is the probability of seeing the evidence given that the hypothesis is true, while P(H|E) is the probability that the hypothesis is true given the evidence we observed. The two questions involve the same pair of events but with a different focus, and Bayes' Theorem is what connects them:

P(H|E) = P(E|H) × P(H) / P(E)

Here P(H) is how likely the hypothesis is before seeing any evidence (the prior), and P(E) is how likely the evidence is overall, whether or not the hypothesis is true.

For example, in a scenario of spam email classification:
Let's say E is a given word in an email (like "promotion"), and H is our hypothesis that the email is indeed spam.

We could ask P(E|H): what is the probability the email contains the word "promotion" (E) given the email is spam (H)? Or we could ask P(H|E): what is the probability that the email is spam given that it contains the word "promotion"? These are different questions. The first is easy to estimate from training data by counting how often "promotion" appears in known spam. The second is what a spam filter actually needs, and Bayes' Theorem gets us there by combining P(E|H) with how common spam is in general, P(H), and how common the word is across all emails, P(E).
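As a rough numerical sketch of the spam example, the snippet below plugs numbers into Bayes' Theorem. All three probabilities are made up purely for illustration (they are not estimated from any dataset in this notebook), and the variable names are hypothetical.

```python
# Hypothetical probabilities, chosen only for illustration
p_spam = 0.40              # P(H): prior probability that an email is spam
p_word_given_spam = 0.30   # P(E|H): "promotion" appears, given the email is spam
p_word_given_ham = 0.02    # P(E|not H): "promotion" appears, given the email is not spam

# P(E): overall probability of seeing the word (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(E|H) = {p_word_given_spam:.2f}")
print(f"P(H|E) = {p_spam_given_word:.2f}")
```

With these assumed inputs, P(E|H) is 0.30 while P(H|E) comes out near 0.91, which shows that the two conditional probabilities can be very different.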

# YouTube Spam Comments: NB application

This section uses five datasets, four for training and the fifth for testing, to apply the NB algorithm. Together the datasets contain 1,956 real comments extracted from five videos, all popular songs that were among the 10 most viewed during the collection period.

#### Dataset Attributes:
- COMMENT_ID: Unique ID representing the comment

- AUTHOR: Author ID

- DATE: Date the comment is posted

- CONTENT: The comment

- CLASS (called TAG in the dataset description): The label, 1 for spam and 0 for non-spam


#### GOALS:
For this exercise use any four of these five datasets to build a spam filter with the Naïve Bayes approach. 

Use that filter to check the accuracy of the remaining dataset.

Make sure to report the details of your training and the model.

### TABLE OF CONTENTS
[Libraries](#Libraries)

[Data Preprocessing](#Data-Preprocessing)

[NB Model Training](#NB-Model-Training)

[Model Testing](#Model-Testing)

[Overall Conclusions](#Overall-Conclusions)

# Libraries

In [61]:
import pandas as pd
import numpy as np
import string
import re

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize




In [69]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
#PorterStemmer is used by tokenize_text further down
from nltk.stem import PorterStemmer



In [87]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [125]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Data Preprocessing

In [2]:
#load first dataset
psy_data = pd.read_csv('Youtube01-Psy.csv')
psy_data.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [3]:
#drop unnecessary cols
psy_clean = psy_data.copy()
psy_clean = psy_clean.drop(columns=['COMMENT_ID', 'AUTHOR', 'DATE'])

In [4]:
#examine data distribution by class
psy_clean.groupby(by = ['CLASS']).size()

CLASS
0    175
1    175
dtype: int64

In [5]:
#load next dataset
kp_data = pd.read_csv('Youtube02-KatyPerry.csv')
kp_data.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12pgdhovmrktzm3i23es5d5junftft3f,lekanaVEVO1,2014-07-22T15:27:50,i love this so much. AND also I Generate Free ...,1
1,z13yx345uxepetggz04ci5rjcxeohzlrtf4,Pyunghee,2014-07-27T01:57:16,http://www.billboard.com/articles/columns/pop-...,1
2,z12lsjvi3wa5x1vwh04cibeaqnzrevxajw00k,Erica Ross,2014-07-27T02:51:43,Hey guys! Please join me in my fight to help a...,1
3,z13jcjuovxbwfr0ge04cev2ipsjdfdurwck,Aviel Haimov,2014-08-01T12:27:48,http://psnboss.com/?ref=2tGgp3pV6L this is the...,1
4,z13qybua2yfydzxzj04cgfpqdt2syfx53ms0k,John Bello,2014-08-01T21:04:03,Hey everyone. Watch this trailer!!!!!!!! http...,1


In [6]:
#clean data & examine distribution by class
kp_clean = kp_data.copy()
#drop unnecessary cols
kp_clean = kp_clean.drop(columns=['COMMENT_ID', 'AUTHOR', 'DATE'])
kp_clean.groupby(by=['CLASS']).size()

CLASS
0    175
1    175
dtype: int64

In [7]:
#load third df
lmfao_data = pd.read_csv('Youtube03-LMFAO.csv')
lmfao_clean = lmfao_data.copy()
#drop unnecessary cols
lmfao_clean = lmfao_clean.drop(columns=['COMMENT_ID', 'AUTHOR', 'DATE'])
#examine distributions
lmfao_clean.groupby(by=['CLASS']).size()

CLASS
0    202
1    236
dtype: int64

In [8]:
#load fourth df
eminem_data = pd.read_csv('Youtube04-Eminem.csv')
eminem_clean = eminem_data.copy()
#drop unnecessary cols
eminem_clean = eminem_clean.drop(columns=['COMMENT_ID', 'AUTHOR', 'DATE'])
#examine distributions
eminem_clean.groupby(by=['CLASS']).size()

CLASS
0    203
1    245
dtype: int64

In [9]:
#load last df
shakira_data = pd.read_csv('Youtube05-Shakira.csv')
shakira_clean = shakira_data.copy()
#drop unnecessary cols
shakira_clean = shakira_clean.drop(columns=['COMMENT_ID', 'AUTHOR', 'DATE'])
#examine distributions
shakira_clean.groupby(by=['CLASS']).size()

CLASS
0    196
1    174
dtype: int64

In [133]:
psy_list = psy_clean['CONTENT'].tolist()
#psy_list

In [107]:
#create function to clean up text, remove links
sample_text = 'watch?v=vtaRGgvGtWQ   2014 Check this out .\ufeff'

def remove_punctuation(text):
    #lowercase first so the patterns below match regardless of case
    text = text.lower()
    #remove links
    text = re.sub(r'htt\S+', '', text)
    #remove youtube specific links
    text = re.sub(r'watch\?v=\S+', '', text)
    #remove numbers
    text = re.sub(r'\d+', '', text)
    #remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

new_sample = remove_punctuation(sample_text)
new_sample

'    check this out \ufeff'
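The cleaned sample above still ends with `\ufeff` (a byte-order mark that appears at the end of many comments in these files) and keeps the leading spaces. A minimal sketch of a variant that also drops that character and collapses whitespace is shown below. The function name `clean_comment` is hypothetical, and this is not the function used for the results in this notebook.

```python
import re
import string

def clean_comment(text):
    """Variant of remove_punctuation that also strips U+FEFF and extra whitespace."""
    # lowercase and drop the byte-order mark wherever it appears
    text = text.lower().replace('\ufeff', '')
    # remove links and YouTube-specific link fragments
    text = re.sub(r'htt\S+', '', text)
    text = re.sub(r'watch\?v=\S+', '', text)
    # remove numbers and punctuation
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # collapse runs of whitespace and trim both ends
    return ' '.join(text.split())

print(clean_comment('watch?v=vtaRGgvGtWQ   2014 Check this out .\ufeff'))  # check this out
```

Stripping the mark would make tokens such as `song` and `song\ufeff` count as the same word, at the cost of changing the vocabulary (and therefore the scores) reported later.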

In [108]:
pd.set_option('display.max_colwidth', None)

In [109]:
#examine clean text
processed_psy = psy_clean.copy()
processed_psy['processed_comments'] = processed_psy['CONTENT'].apply(remove_punctuation)
processed_psy[:9]

Unnamed: 0,CONTENT,CLASS,processed_comments
0,"Huh, anyway check out this you[tube] channel: kobyoshi02",1,huh anyway check out this youtube channel kobyoshi
1,"Hey guys check out my new channel and our first vid THIS IS US THE MONKEYS!!! I'm the monkey in the white shirt,please leave a like comment and please subscribe!!!!",1,hey guys check out my new channel and our first vid this is us the monkeys im the monkey in the white shirtplease leave a like comment and please subscribe
2,just for test I have to say murdev.com,1,just for test i have to say murdevcom
3,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,me shaking my sexy ass on my channel enjoy ﻿
4,watch?v=vtaRGgvGtWQ Check this out .﻿,1,check this out ﻿
5,"Hey, check out my new website!! This site is about kids stuff. kidsmediausa . com",1,hey check out my new website this site is about kids stuff kidsmediausa com
6,Subscribe to my channel ﻿,1,subscribe to my channel ﻿
7,i turned it on mute as soon is i came on i just wanted to check the views...﻿,0,i turned it on mute as soon is i came on i just wanted to check the views﻿
8,You should check my channel for Funny VIDEOS!!﻿,1,you should check my channel for funny videos﻿


In [110]:
#used this documentation to remove stop words: https://pythonspot.com/nltk-stop-words/#google_vignette
stop_words = set(stopwords.words('english'))

In [111]:
#create function to tokenize words and remove stop words & stem words
def tokenize_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens

tokenize_text(new_sample)

['check', '\ufeff']

In [112]:
#apply tokenization function to processed_comments
processed_psy_list = processed_psy['processed_comments'].apply(tokenize_text)
processed_psy_list.head()

0                                                                                 [huh, anyway, check, youtub, channel, kobyoshi]
1    [hey, guy, check, new, channel, first, vid, us, monkey, im, monkey, white, shirtpleas, leav, like, comment, pleas, subscrib]
2                                                                                                          [test, say, murdevcom]
3                                                                                           [shake, sexi, ass, channel, enjoy, ﻿]
4                                                                                                                      [check, ﻿]
Name: processed_comments, dtype: object

In [113]:
processed_psy_df = pd.DataFrame(processed_psy_list)
processed_psy_df = processed_psy_df.rename(columns={'processed_comments': 'stemmed_tokens'})
psy_df = processed_psy_df.join(processed_psy, how = 'outer')
psy_df

Unnamed: 0,stemmed_tokens,CONTENT,CLASS,processed_comments
0,"[huh, anyway, check, youtub, channel, kobyoshi]","Huh, anyway check out this you[tube] channel: kobyoshi02",1,huh anyway check out this youtube channel kobyoshi
1,"[hey, guy, check, new, channel, first, vid, us, monkey, im, monkey, white, shirtpleas, leav, like, comment, pleas, subscrib]","Hey guys check out my new channel and our first vid THIS IS US THE MONKEYS!!! I'm the monkey in the white shirt,please leave a like comment and please subscribe!!!!",1,hey guys check out my new channel and our first vid this is us the monkeys im the monkey in the white shirtplease leave a like comment and please subscribe
2,"[test, say, murdevcom]",just for test I have to say murdev.com,1,just for test i have to say murdevcom
3,"[shake, sexi, ass, channel, enjoy, ﻿]",me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,me shaking my sexy ass on my channel enjoy ﻿
4,"[check, ﻿]",watch?v=vtaRGgvGtWQ Check this out .﻿,1,check this out ﻿
...,...,...,...,...
345,"[billion, view, there, planet, lol﻿]",How can this have 2 billion views when there's only me on the planet? LOL﻿,0,how can this have billion views when theres only me on the planet lol﻿
346,"[dont, im, watch, ﻿]",I don't now why I'm watching this in 2014﻿,0,i dont now why im watching this in ﻿
347,"[subscrib, call, duti, vid, give, away, goal, subs﻿]",subscribe to me for call of duty vids and give aways Goal-100 subs﻿,1,subscribe to me for call of duty vids and give aways goal subs﻿
348,"[hi, guy, pleas, android, photo, editor, download, thank]",hi guys please my android photo editor download. thanks https://play.google.com/store/apps/details?id=com.butalabs.photo.editor﻿,1,hi guys please my android photo editor download thanks


In [120]:
psy_df['stemmed_tokens'] = psy_df['stemmed_tokens'].astype(str)
psy_df.head()

Unnamed: 0,stemmed_tokens,CONTENT,CLASS,processed_comments
0,"['huh', 'anyway', 'check', 'youtub', 'channel', 'kobyoshi']","Huh, anyway check out this you[tube] channel: kobyoshi02",1,huh anyway check out this youtube channel kobyoshi
1,"['hey', 'guy', 'check', 'new', 'channel', 'first', 'vid', 'us', 'monkey', 'im', 'monkey', 'white', 'shirtpleas', 'leav', 'like', 'comment', 'pleas', 'subscrib']","Hey guys check out my new channel and our first vid THIS IS US THE MONKEYS!!! I'm the monkey in the white shirt,please leave a like comment and please subscribe!!!!",1,hey guys check out my new channel and our first vid this is us the monkeys im the monkey in the white shirtplease leave a like comment and please subscribe
2,"['test', 'say', 'murdevcom']",just for test I have to say murdev.com,1,just for test i have to say murdevcom
3,"['shake', 'sexi', 'ass', 'channel', 'enjoy', '\ufeff']",me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,me shaking my sexy ass on my channel enjoy ﻿
4,"['check', '\ufeff']",watch?v=vtaRGgvGtWQ Check this out .﻿,1,check this out ﻿
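One side effect of `astype(str)` is worth seeing: it stores the printed form of each list, so an invisible `\ufeff` character becomes the literal text `\ufeff`. The sketch below uses plain `re` with the pattern that scikit-learn documents as `TfidfVectorizer`'s default `token_pattern`, to show which tokens would be extracted from each representation. It approximates the vectorizer's tokenization step only and does not call scikit-learn.

```python
import re

# Default token_pattern documented for TfidfVectorizer: words of 2+ characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

tokens = ['song\ufeff']

as_repr = str(tokens)          # what astype(str) stores: "['song\\ufeff']"
as_joined = ' '.join(tokens)   # the commented-out alternative: 'song\ufeff'

print(re.findall(TOKEN_PATTERN, as_repr))    # ['song', 'ufeff']
print(re.findall(TOKEN_PATTERN, as_joined))  # ['song']
```

Under this assumption, the string form adds a spurious `ufeff` token to the vocabulary, while joining the tokens with spaces does not.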


In [137]:
#clean 4 other datasets: kp_clean, lmfao_clean, eminem_clean, shakira_clean
#kp:
processed_kp = kp_clean.copy()
#remove punctuation step
processed_kp['processed_comments'] = processed_kp['CONTENT'].apply(remove_punctuation)
#tokenization step
processed_kp_list = processed_kp['processed_comments'].apply(tokenize_text)
#wrap the tokenized Series in a DataFrame
processed_kp_df = pd.DataFrame(processed_kp_list)
#rename col
processed_kp_df = processed_kp_df.rename(columns={'processed_comments': 'stemmed_tokens'})
#join stemmed, tokenized col
kp_df = processed_kp_df.join(processed_kp, how = 'outer')
#convert stemmed tokens to strings for model
kp_df['stemmed_tokens'] = kp_df['stemmed_tokens'].astype(str)
kp_df.head()

Unnamed: 0,stemmed_tokens,CONTENT,CLASS,processed_comments
0,"['love', 'much', 'also', 'gener', 'free', 'lead', 'auto', 'pilot', 'amp']",i love this so much. AND also I Generate Free Leads on Auto Pilot &amp; You Can Too! http://www.MyLeaderGate.com/moretraffic﻿,1,i love this so much and also i generate free leads on auto pilot amp you can too
1,"['vote', 'sone', 'pleasewer', 'vipspleas', 'help', 'us', 'gtlt\ufeff']",http://www.billboard.com/articles/columns/pop-shop/6174122/fan-army-face-off-round-3 Vote for SONES please....we're against vips....please help us.. &gt;.&lt;﻿,1,vote for sones pleasewere against vipsplease help us gtlt﻿
2,"['hey', 'guy', 'pleas', 'join', 'fight', 'help', 'abusedmistr', 'anim', 'fund', 'go', 'help', 'pay', 'vet', 'billsand', 'help', 'find', 'home', 'place', 'extra', 'emphasi', 'help', 'disabl', 'anim', 'one', 'otherwis', 'would', 'put', 'sleep', 'anim', 'organ', 'donat', 'pleas']","Hey guys! Please join me in my fight to help abused/mistreated animals! All fund will go to helping pay for vet bills/and or helping them find homes! I will place an extra emphasis on helping disabled animals, ones otherwise would just be put to sleep by other animal organizations. Donate please. http://www.gofundme.com/Angels-n-Wingz﻿",1,hey guys please join me in my fight to help abusedmistreated animals all fund will go to helping pay for vet billsand or helping them find homes i will place an extra emphasis on helping disabled animals ones otherwise would just be put to sleep by other animal organizations donate please
3,['song\ufeff'],http://psnboss.com/?ref=2tGgp3pV6L this is the song﻿,1,this is the song﻿
4,"['hey', 'everyon', 'watch', 'trailer']",Hey everyone. Watch this trailer!!!!!!!! http://believemefilm.com?hlr=h2hQBUVB﻿,1,hey everyone watch this trailer


In [138]:
#lmfao_clean
processed_lmfao = lmfao_clean.copy()
#remove punctuation step
processed_lmfao['processed_comments'] = processed_lmfao['CONTENT'].apply(remove_punctuation)
#tokenization step
processed_lmfao_list = processed_lmfao['processed_comments'].apply(tokenize_text)
#wrap the tokenized Series in a DataFrame
processed_lmfao_df = pd.DataFrame(processed_lmfao_list)
#rename col
processed_lmfao_df = processed_lmfao_df.rename(columns={'processed_comments': 'stemmed_tokens'})
#join stemmed, tokenized col
lmfao_df = processed_lmfao_df.join(processed_lmfao, how = 'outer')
#convert stemmed tokens to strings for model
lmfao_df['stemmed_tokens'] = lmfao_df['stemmed_tokens'].astype(str)
lmfao_df.head()

Unnamed: 0,stemmed_tokens,CONTENT,CLASS,processed_comments
0,"['href', 'best', 'part\ufeff']","<a href=""http://www.youtube.com/watch?v=KQ6zr6kCPj8&amp;t=2m19s"">2:19</a> best part﻿",0,a href best part﻿
1,"['wierd', 'funny\ufeff']",wierd but funny﻿,0,wierd but funny﻿
2,"['hey', 'guy', 'im', 'humanbr', 'br', 'br', 'dont', 'want', 'human', 'want', 'sexi', 'fuck', 'giraffebr', 'br', 'br', 'alreadi', 'money', 'surgeri', 'elong', 'spinal', 'core', 'surgeri', 'chang', 'skin', 'pigment', 'everyth', 'els', 'like', 'post', 'other', 'root', 'dreambr', 'br', 'br', 'im', 'fuck', 'make', 'music', 'check', 'first', 'song', 'relnofollow', 'classothashtag', 'href']","Hey guys, I&#39;m a human.<br /><br /><br />But I don&#39;t want to be a human, I want to be a sexy fucking giraffe.<br /><br /><br />I already have the money for the surgery to elongate my spinal core, the surgery to change my skin pigment, and everything else! Like this post so others can root me on in my dream!!!!<br /><br /><br />Im fucking with you, I make music, check out my first song! <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23giraffebruuh"">#giraffebruuh</a>﻿",1,hey guys im a humanbr br br but i dont want to be a human i want to be a sexy fucking giraffebr br br i already have the money for the surgery to elongate my spinal core the surgery to change my skin pigment and everything else like this post so others can root me on in my dreambr br br im fucking with you i make music check out my first song a relnofollow classothashtag href
3,"['parti', 'rocklolwho', 'want', 'shuffle\ufeff']",Party Rock....lol...who wants to shuffle!!!﻿,0,party rocklolwho wants to shuffle﻿
4,"['parti', 'rock\ufeff']",Party rock﻿,0,party rock﻿


In [139]:
#eminem_clean
processed_eminem = eminem_clean.copy()
#remove punctuation step
processed_eminem['processed_comments'] = processed_eminem['CONTENT'].apply(remove_punctuation)
#tokenization step
processed_eminem_list = processed_eminem['processed_comments'].apply(tokenize_text)
#wrap the tokenized Series in a DataFrame
processed_eminem_df = pd.DataFrame(processed_eminem_list)
#rename col
processed_eminem_df = processed_eminem_df.rename(columns={'processed_comments': 'stemmed_tokens'})
#join stemmed, tokenized col
eminem_df = processed_eminem_df.join(processed_eminem, how = 'outer')
#convert stemmed tokens to strings for model
eminem_df['stemmed_tokens'] = eminem_df['stemmed_tokens'].astype(str)
eminem_df.head()

Unnamed: 0,stemmed_tokens,CONTENT,CLASS,processed_comments
0,"['love', 'girl', 'talk', 'xxx\ufeff']",+447935454150 lovely girl talk to me xxx﻿,1,lovely girl talk to me xxx﻿
1,"['alway', 'end', 'come', 'back', 'songbr', '\ufeff']",I always end up coming back to this song<br />﻿,0,i always end up coming back to this songbr ﻿
2,"['sister', 'receiv', 'new', 'relnofollow', 'classothashtag', 'href', 'youtub', 'view', 'right', 'thing', 'use', 'pimpmyview', 'com\ufeff']","my sister just received over 6,500 new <a rel=""nofollow"" class=""ot-hashtag"" href=""https://plus.google.com/s/%23active"">#active</a> youtube views Right now. The only thing she used was pimpmyviews. com﻿",1,my sister just received over new a relnofollow classothashtag href youtube views right now the only thing she used was pimpmyviews com﻿
3,['cool\ufeff'],Cool﻿,0,cool﻿
4,"['hello', 'iam', 'palastine\ufeff']",Hello I&#39;am from Palastine﻿,1,hello iam from palastine﻿


In [140]:
#shakira_clean
processed_shakira = shakira_clean.copy()
#remove punctuation step
processed_shakira['processed_comments'] = processed_shakira['CONTENT'].apply(remove_punctuation)
#tokenization step
processed_shakira_list = processed_shakira['processed_comments'].apply(tokenize_text)
#wrap the tokenized Series in a DataFrame
processed_shakira_df = pd.DataFrame(processed_shakira_list)
#rename col
processed_shakira_df = processed_shakira_df.rename(columns={'processed_comments': 'stemmed_tokens'})
#join stemmed, tokenized col
shakira_df = processed_shakira_df.join(processed_shakira, how = 'outer')
#convert stemmed tokens to strings for model
shakira_df['stemmed_tokens'] = shakira_df['stemmed_tokens'].astype(str)
shakira_df.head()

Unnamed: 0,stemmed_tokens,CONTENT,CLASS,processed_comments
0,"['nice', 'song\ufeff']",Nice song﻿,0,nice song﻿
1,"['love', 'song', '\ufeff']",I love song ﻿,0,i love song ﻿
2,"['love', 'song', '\ufeff']",I love song ﻿,0,i love song ﻿
3,"['let', 'make', 'first', 'femal', 'reach', 'one', 'billion', 'share', 'replay', '\ufeff']","860,000,000 lets make it first female to reach one billion!! Share it and replay it! ﻿",0,lets make it first female to reach one billion share it and replay it ﻿
4,"['shakira', 'best', 'worldcup\ufeff']",shakira is best for worldcup﻿,0,shakira is best for worldcup﻿
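The same preprocessing steps are repeated for each dataset above, so they could be wrapped in one helper. The sketch below is one possible way to do that. The name `preprocess_comments` is hypothetical, and the cleaning and tokenizing functions are passed in as arguments so the toy example can run with simple stand-ins instead of `remove_punctuation` and `tokenize_text`. The column order differs slightly from the joined DataFrames above.

```python
import pandas as pd

def preprocess_comments(df, clean_fn, tokenize_fn):
    """Return a copy of df with processed_comments and stemmed_tokens columns added."""
    out = df.copy()
    # clean the raw comment text
    out['processed_comments'] = out['CONTENT'].apply(clean_fn)
    # tokenize, then store the token list as a string, as the notebook does
    out['stemmed_tokens'] = out['processed_comments'].apply(tokenize_fn).astype(str)
    return out

# Toy data and stand-in functions, for illustration only
toy = pd.DataFrame({'CONTENT': ['Check THIS out', 'Nice song'], 'CLASS': [1, 0]})
result = preprocess_comments(toy, str.lower, str.split)
print(result['stemmed_tokens'].tolist())
```

In the notebook this would be called as, for example, `preprocess_comments(kp_clean, remove_punctuation, tokenize_text)`, assuming those two functions are already defined.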


# NB Model Training

### Models:
[Psy dataset](#Psy-dataset)

[Katy Perry dataset](#Katy-Perry-dataset)

[LMFAO dataset](#LMFAO-dataset)

[Eminem dataset](#Eminem-dataset)

[Testing: Shakira dataset](#Testing:-Shakira-dataset)

# Psy dataset

In [130]:
#set features
X_psy = psy_df['stemmed_tokens']
y_psy = psy_df['CLASS']

In [131]:
#build model in a function
#note: every call fits a new vectorizer and a new model, so nothing carries over between datasets
def naive_bayes_model(X, y):
    #TF-IDF (fit on all of X before the split, so the test rows also shape the vocabulary and IDF weights)
    tfidf = TfidfVectorizer()
    #X_string = [' '.join(tokens) for tokens in X]
    X_vectorized = tfidf.fit_transform(X)

    #hold out 20% of this dataset to evaluate the model
    X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size = .2, random_state = 5)

    #initialize model
    nb_model = MultinomialNB()

    #fit model
    nb_model.fit(X_train, y_train)
    #evaluate model on the held-out 20%
    y_classification_preds = nb_model.predict(X_test)

    print(classification_report(y_test, y_classification_preds))

In [132]:
naive_bayes_model(X_psy, y_psy)

              precision    recall  f1-score   support

           0       0.73      1.00      0.84        32
           1       1.00      0.68      0.81        38

    accuracy                           0.83        70
   macro avg       0.86      0.84      0.83        70
weighted avg       0.88      0.83      0.83        70



I split each dataset 80/20 inside the model function so I could evaluate performance on the first dataset, compare it with the results on the other datasets, and ultimately compare it with the final testing dataset. Each call to naive_bayes_model fits a new vectorizer and a new classifier, so the results in this section come from separate models, one per dataset.

The model didn't perform too poorly. Starting with precision, every comment it predicted as spam was actually spam (precision of 1.00), whereas only 73% of its non-spam predictions were actually non-spam. This gap is interesting because the full dataset had an even number of spam and non-spam comments.
Next, the recall shows that the model correctly identified all of the true non-spam comments, but only 68% of the true spam comments. In other words, it leaned toward labeling comments as non-spam. The resulting f1-scores of 0.84 and 0.81 show a slightly better balance between precision and recall for non-spam comments than for spam comments. Overall, an accuracy of 83% shows that the model performed pretty well.
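To make the precision and recall reasoning concrete, the sketch below rebuilds the Psy report from a confusion matrix. The four counts are not notebook output: they are inferred from the rounded report (32 non-spam and 38 spam test comments, non-spam recall of 1.00, spam recall of 0.68), so treat them as an assumption. The helper name `prf` is hypothetical.

```python
# Counts inferred from the rounded Psy classification report (an assumption)
tn, fp = 32, 0    # true non-spam: predicted non-spam / predicted spam
fn, tp = 12, 26   # true spam:     predicted non-spam / predicted spam

def prf(correct, predicted_total, actual_total):
    """Precision, recall and f1 for one class."""
    precision = correct / predicted_total   # share of this class's predictions that are right
    recall = correct / actual_total         # share of this class's true members that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

nonspam = prf(tn, tn + fn, tn + fp)
spam = prf(tp, tp + fp, tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print([round(v, 2) for v in nonspam])  # [0.73, 1.0, 0.84]
print([round(v, 2) for v in spam])     # [1.0, 0.68, 0.81]
print(round(accuracy, 2))              # 0.83
```

Read this way, the 12 spam comments that were labeled non-spam explain both the lower non-spam precision (32 of 44 predictions correct) and the lower spam recall (26 of 38 found).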

# Katy Perry dataset

In [141]:
#set X and y features:
X_kp = kp_df['stemmed_tokens']
y_kp = kp_df['CLASS']


#call model function
naive_bayes_model(X_kp, y_kp)

              precision    recall  f1-score   support

           0       0.91      0.79      0.85        38
           1       0.78      0.91      0.84        32

    accuracy                           0.84        70
   macro avg       0.85      0.85      0.84        70
weighted avg       0.85      0.84      0.84        70



The Katy Perry model performed slightly better than the Psy model: overall accuracy went up by one point, from 83% to 84%.
The test split had the same class sizes as before, but inverted: 38 non-spam and 32 spam comments this time, versus 32 and 38 for Psy. This is worth noting because the precision for non-spam comments went from 73% to 91%, meaning 91% of the model's non-spam predictions were actually non-spam. For spam, only 78% of the model's spam predictions were actually spam, down from 100%. The recall shows that the model identified only 79% of the true non-spam comments, while identifying 91% of the true spam comments, so this model leaned toward spam where the Psy model leaned toward non-spam. The f1-scores for both classes were higher than before: 0.85 versus 0.84 for non-spam, and 0.84 versus 0.81 for spam. Because naive_bayes_model fits a new model on each call, this difference reflects the Katy Perry data itself rather than learning carried over from the Psy dataset.

# LMFAO dataset


In [142]:
#features
X_lmfao = lmfao_df['stemmed_tokens']
y_lmfao = lmfao_df['CLASS']

#call model
naive_bayes_model(X_lmfao, y_lmfao)

              precision    recall  f1-score   support

           0       0.90      0.88      0.89        40
           1       0.90      0.92      0.91        48

    accuracy                           0.90        88
   macro avg       0.90      0.90      0.90        88
weighted avg       0.90      0.90      0.90        88



The results keep improving! The overall accuracy went from 84% to 90%. The LMFAO dataset is noticeably larger (438 comments versus 350), so this model, which is fit from scratch on this dataset alone, had more training examples to learn from.
Surprisingly, the precision for non-spam and spam was identical despite the uneven distribution of true labels: 90% of its predictions in each category were correct. Its recall for non-spam comments was a bit lower, which could be tied to having fewer non-spam comments in the data. The model identified 88% of the true non-spam comments and 92% of the true spam comments.
The f1-scores for both classes were well above the Katy Perry results: the non-spam f1-score went from 0.85 to 0.89, and the spam f1-score went from 0.84 to 0.91, a 7-point difference. We'll see if the trend continues.

# Eminem dataset

In [143]:
#features
X_eminem = eminem_df['stemmed_tokens']
y_eminem = eminem_df['CLASS']

#call model
naive_bayes_model(X_eminem, y_eminem)

              precision    recall  f1-score   support

           0       0.90      0.97      0.94        38
           1       0.98      0.92      0.95        52

    accuracy                           0.94        90
   macro avg       0.94      0.95      0.94        90
weighted avg       0.95      0.94      0.94        90



The Eminem dataset is slightly larger still than the LMFAO dataset (448 comments versus 438), so it is not too surprising that the results improved again: the model had more data points to train on and more comments to be tested on.
The precision for the non-spam label stayed at 90%, meaning 90% of the model's non-spam predictions were actually non-spam, and the model identified 97% of all true non-spam comments (per the recall metric). The model did exceptionally well with the spam category: 98% of its spam predictions were indeed spam. Its spam recall stayed at 92%, meaning it found 92% of the true spam comments, and the overall accuracy still rose by 4 points, from 90% to 94%.
The upcoming testing dataset is smaller than the previous two, so I wonder whether the results will remain relatively high or potentially even improve.

# Model Testing
This section tests my Naïve Bayes model with the Shakira dataset, which has the split:

- non-spam comments: 196

- spam comments: 174

# Testing: Shakira dataset

In [144]:
#features
X_shakira = shakira_df['stemmed_tokens']
y_shakira = shakira_df['CLASS']

#call model
naive_bayes_model(X_shakira, y_shakira)

              precision    recall  f1-score   support

           0       0.83      0.97      0.89        39
           1       0.96      0.77      0.86        35

    accuracy                           0.88        74
   macro avg       0.90      0.87      0.88        74
weighted avg       0.89      0.88      0.88        74



Performance was worse on the Shakira dataset I used as my testing set. This is a bit surprising because the overall accuracy was 6 points lower than on the Eminem dataset (88% versus 94%), which is pretty significant given that the results had been improving so far. The balance between precision and recall was also worse: only 83% of its non-spam predictions were right, although it found 97% of the true non-spam comments, while 96% of its spam predictions were correct, although it found only 77% of the true spam comments. One caveat: naive_bayes_model splits whatever dataset it receives, so these scores come from a model trained on 80% of the Shakira comments and evaluated on the remaining 20%, not from a model trained on the other four datasets.
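The assignment goal of training on four datasets and checking accuracy on the fifth could be sketched as below. This is an illustrative outline rather than the code that produced the reports above: the function name `train_on_many_test_on_one` and the toy comments are hypothetical, and a scikit-learn pipeline is used so that TF-IDF is fit on the training comments only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_on_many_test_on_one(train_sets, test_set):
    """Fit one model on several (texts, labels) pairs and score it on a held-out pair."""
    # pool the comments and labels from every training dataset
    train_texts = [text for texts, _ in train_sets for text in texts]
    train_labels = [label for _, labels in train_sets for label in labels]

    # the pipeline fits TF-IDF on the training texts only, then reuses
    # that vocabulary when it transforms the held-out texts
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    test_texts, test_labels = test_set
    predictions = model.predict(list(test_texts))
    return model, accuracy_score(list(test_labels), predictions)

# Toy stand-ins for the real datasets (made-up comments; 1 = spam, 0 = non-spam)
toy_a = (["check out my channel", "love this song"], [1, 0])
toy_b = (["subscribe to my channel", "best song ever"], [1, 0])
toy_test = (["please subscribe to my channel", "i love this song"], [1, 0])

model, accuracy = train_on_many_test_on_one([toy_a, toy_b], toy_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With the notebook's DataFrames, each pair would be something like `(psy_df['stemmed_tokens'], psy_df['CLASS'])` for the four training sets and `(shakira_df['stemmed_tokens'], shakira_df['CLASS'])` for the held-out set. The scores would then measure how well a single model generalizes to an unseen video.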

# Overall Conclusions

It was surprising to see the overall accuracy, and the balance between precision and recall, improve steadily across the four training datasets I used (Psy, Katy Perry, LMFAO, and Eminem). Since naive_bayes_model fits a separate model for each dataset, I attribute these differences to the datasets themselves rather than to accumulated learning. The first two datasets were the same size with an even split of spam and non-spam comments, so the small difference between them likely comes from their content and from the particular 80/20 split. The third and fourth datasets were larger, which gave their models a larger pool of training samples.
I wonder if the results were worse on Shakira's video due to the potential presence of comments in languages other than English. I would have expected performance to stay about the same rather than drop as much as it did, so I attribute this discrepancy to potential anomalies in the data. Since Shakira is primarily a Spanish-speaking singer, perhaps some of the comments in the dataset were in Spanish, which the English stop-word list and stemmer used in preprocessing would not handle well.