## Problem 1:

##### Bayes' theorem shows us how to turn $P(D|H)$ into $P(H|D)$, with $D = \text{Data / Evidence}$ and $H = \text{Hypothesis}$. But what does that really mean? Imagine you have to explain this to someone who doesn't understand machine learning or probability at all. How would you do it in a paragraph or two without using any jargon? Use an example from real life to ground the explanation.

***

_Problems like this usually involve a cause and an effect, one of which is known while the other is not. In Bayes' theorem, the hypothesis plays the role of the cause and the data/evidence plays the role of the effect. The forward direction, $P(D|H)$, asks how likely a certain effect is given some cause. For example, suppose there are two dice, one 4-sided and the other 6-sided, and we are told which die was chosen at random. We can then work out how likely each number is to be rolled. In this case, the cause/hypothesis (which die was chosen) is known while the effect/data/evidence (the number rolled) is unknown._

_Often, however, the objective is the reverse: we observe the effect and want to know how likely each cause is. Consider the same dice, but now we see the number that was rolled and need to determine how likely it is that each die was the one chosen. The cause and effect are the same as in the previous case, though this time the knowns and unknowns are swapped: the effect/data/evidence (the number rolled) is known and the cause/hypothesis (which die was chosen) is unknown. This reverse direction is $P(H|D)$._

_Bayes' theorem connects the two directions. Once we know how likely each effect is under each cause, and how likely each cause was to begin with, it lets us work backwards from an observed effect (data/evidence) to the likeliness of each cause (hypothesis)._
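
_To make the dice example above concrete, the short sketch below works through the numbers. It assumes (these details are not stated in the answer) that each die is equally likely to be picked and that a 3 was rolled; the variable names are illustrative only._

```python
from fractions import Fraction

# Assumption: each die is equally likely to be chosen
prior = {"d4": Fraction(1, 2), "d6": Fraction(1, 2)}

# P(roll = 3 | die): every face of a fair die is equally likely
likelihood = {"d4": Fraction(1, 4), "d6": Fraction(1, 6)}

# Overall chance of seeing a 3, whichever die was chosen
evidence = sum(prior[die] * likelihood[die] for die in prior)

# Bayes' theorem: P(die | roll = 3) = P(roll = 3 | die) * P(die) / P(roll = 3)
posterior = {die: prior[die] * likelihood[die] / evidence for die in prior}

print(posterior["d4"], posterior["d6"])  # 3/5 2/5
```

_Under these assumptions, seeing a 3 makes the 4-sided die the more likely explanation, because a 3 comes up more often on a die with fewer faces._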

***
***
***

## Problem 2:

##### Download the YouTube spam collection dataset.

##### This is a public set of comments collected for spam research. It consists of five datasets containing 1,956 real messages extracted from five videos. These five videos are popular pop songs that were among the 10 most viewed during the collection period. All five datasets have the following attributes:

##### `COMMENT_ID`: Unique ID representing the comment
##### `AUTHOR`: Author ID
##### `DATE`: Date the comment was posted
##### `CONTENT`: The comment text
##### `TAG`: 1 for spam, otherwise 0 (this column is named `CLASS` in the CSV files used below)

##### For this exercise, use any 4 of these 5 datasets to build a spam filter with a Naive Bayes approach, and use that filter to check the accuracy on the remaining dataset. Make sure to report the details of your training and the model.

***

_First we need to import the packages we will use: `pandas`, `GaussianNB`, and `CountVectorizer`._

In [1]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import CountVectorizer

_Now we can load in all of our datasets. The last one (`Youtube05-Shakira`) will act as our test data so we can assign it as such._

In [2]:
# Forward slashes avoid invalid escape sequences such as "\Y" and work on every OS
df_psy = pd.read_csv("YouTube-Spam-Collection-v1/Youtube01-Psy.csv")
df_kp = pd.read_csv("YouTube-Spam-Collection-v1/Youtube02-KatyPerry.csv")
df_lmfao = pd.read_csv("YouTube-Spam-Collection-v1/Youtube03-LMFAO.csv")
df_eminem = pd.read_csv("YouTube-Spam-Collection-v1/Youtube04-Eminem.csv")

df_test = pd.read_csv("YouTube-Spam-Collection-v1/Youtube05-Shakira.csv")

_Since we will be using the first four datasets for training, we can combine them using `concat()` into a single dataframe._

> _Note: this source was used to understand the `concat()` function: https://pandas.pydata.org/docs/reference/api/pandas.concat.html_

In [3]:
df_train = pd.concat([df_psy, df_kp, df_lmfao, df_eminem])
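
_One detail worth knowing about `concat()`: by default each dataframe keeps its own row labels, so the combined dataframe contains repeated index values. The sketch below illustrates this with two toy dataframes (made-up data, not the real CSV files) and shows the `ignore_index=True` option, which assigns a fresh index instead._

```python
import pandas as pd

# Toy frames standing in for two of the per-video datasets (hypothetical data)
df_a = pd.DataFrame({"CONTENT": ["check out my channel", "love this song"], "CLASS": [1, 0]})
df_b = pd.DataFrame({"CONTENT": ["subscribe please", "great video"], "CLASS": [1, 0]})

# Default behaviour: each frame keeps its own row labels, so labels repeat
combined = pd.concat([df_a, df_b])
print(combined.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
combined_reset = pd.concat([df_a, df_b], ignore_index=True)
print(combined_reset.index.tolist())  # [0, 1, 2, 3]
```

_Repeated labels do not affect the training steps that follow, but they matter whenever rows are later selected by label rather than by position._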

_Then, using this combined training data, we can assign `CONTENT` as our training features and `CLASS` as our training label._

In [4]:
train_features = df_train['CONTENT']
train_label = df_train['CLASS']

_Before continuing, we can also check the number of both spam (1) and non-spam (0) instances in our training data._

In [5]:
train_label.value_counts()

1    831
0    755
Name: CLASS, dtype: int64

_From this we can see that about 52.4% of the training comments are spam and about 47.6% are not, so the two classes are fairly balanced and the model will see a similar number of examples of each._

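_Then we can use `CountVectorizer()` to convert the training comments into a matrix of token counts that tracks how many times each word appears in each message. Note that `lowercase = False` keeps the original casing, so `Check` and `check` are counted as separate tokens; the default (`lowercase = True`) would convert the messages to lowercase before tokenizing._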

> _Note: this source was used to understand `CountVectorizer()`: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html_

In [6]:
vectorizer = CountVectorizer(lowercase = False)

_Now we can use `vectorizer` to learn the vocabulary dictionary of the messages and return a matrix of the word counts._

In [7]:
vectorize_train_features = vectorizer.fit_transform(train_features)
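
_To see what `fit_transform()` produces, here is a minimal sketch on two made-up comments (hypothetical data, not taken from the dataset). It uses the same `lowercase = False` setting as the cell above; the name `toy_vectorizer` is illustrative._

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up comments (hypothetical data, not taken from the dataset)
comments = ["Check out my channel", "check this out"]

# Same setting as above: keep the original casing
toy_vectorizer = CountVectorizer(lowercase=False)
counts = toy_vectorizer.fit_transform(comments)

# vocabulary_ maps each token to its column in the count matrix
print(sorted(toy_vectorizer.vocabulary_))
# ['Check', 'channel', 'check', 'my', 'out', 'this']

# One row per comment, one column per token
print(counts.toarray())
```

_Because the casing is preserved, `Check` and `check` end up in separate columns._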

_Next we can create an instance of the Gaussian Naive Bayes model._

> _Note: this source was used to understand the Gaussian Naive Bayes model: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html_

In [8]:
NB_model = GaussianNB()

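_Using the Gaussian Naive Bayes model, we can fit it to our vectorized training features (`vectorize_train_features`) and training label (`train_label`). Because `GaussianNB` does not accept sparse input, we first convert the count matrix to a dense array with `.toarray()`._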

In [9]:
NB_fitted_model = NB_model.fit(vectorize_train_features.toarray(), train_label)
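
_`GaussianNB` treats each feature as a continuous, normally distributed value. For word counts, a commonly used alternative is `MultinomialNB`, which models counts directly and accepts the sparse matrix without `.toarray()`. The sketch below is only an illustration on made-up comments; it is not the model used in this notebook, and its accuracy on the real data is not reported here._

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training comments (hypothetical data): 1 = spam, 0 = not spam
toy_comments = [
    "check out my channel",
    "subscribe to my channel",
    "free gift check this link",
    "love this song",
    "great video love it",
    "this song is amazing",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_comments)  # sparse matrix

# MultinomialNB models word counts directly and accepts the sparse matrix as-is
toy_model = MultinomialNB().fit(toy_counts, toy_labels)

new_comments = ["subscribe and check my channel", "love this song"]
predictions = toy_model.predict(toy_vectorizer.transform(new_comments))
print(predictions)  # [1 0]
```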

_Now we can measure the training accuracy of our model._

In [10]:
NB_fitted_model.score(vectorize_train_features.toarray(), train_label)

0.9899117276166457

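_From this we see that our Gaussian Naive Bayes model classifies 98.99% of the training comments correctly. Since the model has already seen these comments, this number only shows how well it fits the training data; the held-out test set gives a better picture of how well it generalizes._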

_Then, to test our model we first need to assign our test features and label._

In [11]:
test_features = df_test['CONTENT']
test_label = df_test['CLASS']

_Now we can transform the test features into a matrix of word occurrences, using the vocabulary learned from the training data, so that we can run our model on the test data._

In [12]:
vectorize_test_features = vectorizer.transform(test_features)

_Then we can measure the accuracy of our model on the test data._

In [13]:
NB_fitted_model.score(vectorize_test_features.toarray(), test_label)

0.8918918918918919

_While the test accuracy is about 10 percentage points lower than the training accuracy (89.19% versus 98.99%), an accuracy of 89.19% on comments from an unseen video is still high._
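
_Accuracy alone does not show which kind of mistake the filter makes (spam let through versus genuine comments flagged). As a sketch, scikit-learn's `confusion_matrix` breaks the predictions down by class. The labels below are made up for illustration and are not the model's actual output._

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (hypothetical, not the model's output)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are the true classes (0, 1); columns are the predicted classes (0, 1)
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
```

_On the real data, the same call would take `test_label` and `NB_fitted_model.predict(vectorize_test_features.toarray())` as its two arguments._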

In [225]:
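from sklearn.model_selection import StratifiedKFold

# `features` and `label` (defined in a cell not shown here) hold the comment text and spam labels
skf = StratifiedKFold(n_splits = 5, shuffle = True)

# Note: each iteration overwrites X_train/X_test, so only the last fold is kept
for train_index, test_index in skf.split(features, label):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Positional indexing (.iloc) is needed because the index labels repeat after concat()
    X_train, X_test = features.iloc[train_index], features.iloc[test_index]
    y_train, y_test = label.iloc[train_index], label.iloc[test_index]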

TRAIN: [   0    2    3 ... 1582 1583 1584] TEST: [   1    7    9   22   36   49   51   59   64   69   73   80   81   93
   99  106  113  115  117  119  122  126  128  135  144  168  169  174
  176  178  182  183  188  189  199  205  211  213  218  220  224  231
  234  237  239  242  249  259  270  271  276  281  287  288  293  309
  312  313  329  333  335  343  346  351  359  360  362  363  364  366
  367  370  379  383  393  402  403  405  406  410  419  423  427  432
  434  437  446  449  450  463  467  468  477  496  497  505  507  509
  517  518  520  525  536  541  549  557  560  561  568  569  570  588
  592  593  601  602  604  616  626  628  636  650  651  655  656  670
  673  682  688  690  697  698  706  716  717  719  730  731  736  737
  746  750  752  756  757  759  762  782  788  790  792  797  807  811
  821  832  839  840  851  852  854  866  873  875  876  877  879  882
  884  889  891  898  900  905  909  911  912  917  922  923  930  931
  941  942  949  950  956  9
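
_The manual split loop above can also be delegated to `cross_validate`. The sketch below wraps the vectorizer and the classifier in a pipeline so that raw comment strings can be passed in directly. It runs on made-up comments, and it swaps in `MultinomialNB` for `GaussianNB` because the former accepts the vectorizer's sparse output; both choices are assumptions made for the illustration, and all `toy_` names are hypothetical._

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments (hypothetical data): 1 = spam, 0 = not spam
toy_features = pd.Series([
    "check out my channel", "subscribe to my channel",
    "free gift click this link", "visit my page for free stuff",
    "love this song", "great video love it",
    "this song is amazing", "best song ever",
])
toy_label = pd.Series([1, 1, 1, 1, 0, 0, 0, 0])

# The pipeline re-fits the vectorizer on the training part of every fold
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

toy_skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
results = cross_validate(toy_pipeline, toy_features, toy_label, cv=toy_skf)

print(results["test_score"])  # one accuracy value per fold
```

_Fitting the vectorizer inside each fold keeps words from the held-out comments out of the training vocabulary._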

In [139]:
features_split = features.str.split()
features_split

0      [huh,, anyway, check, out, this, you[tube], ch...
1      [hey, guys, check, out, my, new, channel, and,...
2        [just, for, test, i, have, to, say, murdev.com]
3      [me, shaking, my, sexy, ass, on, my, channel, ...
4            [watch?v=vtarggvgtwq, check, this, out, .﻿]
                             ...                        
443     [subscribe, to, my, channel, x, please!., spare]
444    [check, out, my, videos, guy!, :), hope, you, ...
445    [3, yrs, ago, i, had, a, health, scare, but, t...
446    [rihanna, looks, so, beautiful, with, red, hai...
447    [857.482.940, views, awesome, !!!!!!!!!!!!!!!!...
Name: CONTENT, Length: 1586, dtype: object

Note: the following resource was used to resolve an error:
> https://stackoverflow.com/questions/51852551/key-error-not-in-index-while-cross-validation

In [184]:
vectorizer = CountVectorizer(lowercase = False)
X = vectorizer.fit_transform(X_train)
#vectorizer.get_feature_names_out()



#vectorizer = CountVectorizer().fit(X_train)
#X_train_vectorized = vectorizer.transform(X_train)
#X_train_vectorized.toarray().shape

TypeError: expected string or bytes-like object
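
_The `TypeError` above most likely arises because `CountVectorizer` expects each document to be a string by default, whereas `X_train` holds lists of tokens produced by `str.split()`. Two possible ways around this are sketched below on made-up token lists (hypothetical data): joining the tokens back into strings, or passing a callable `analyzer` that returns the tokens unchanged._

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-tokenized comments (hypothetical data)
tokenized = pd.Series([["check", "out", "my", "channel"], ["best", "song", "ever"]])

# Option 1: join each token list back into a single string
joined = tokenized.str.join(" ")
counts_from_text = CountVectorizer(lowercase=False).fit_transform(joined)

# Option 2: pass a callable analyzer that returns the tokens unchanged
token_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
counts_from_tokens = token_vectorizer.fit_transform(tokenized)

print(counts_from_text.shape, counts_from_tokens.shape)  # (2, 7) (2, 7)
```

_The two options are not always equivalent: option 1 re-tokenizes the text with the default pattern, which drops punctuation and single-character tokens, while option 2 keeps every token exactly as given._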

In [150]:
# Build "dictionary" of all words in the comments
dict = []

for comment in X_train:
    for word in comment:
        if (word not in dict):
            dict.append(word)

Note: the following two resources were used to understand how to structure this as well as how to find if an element is not in a list:
> https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html

> https://stackoverflow.com/questions/10406130/check-if-something-is-not-in-a-list-in-python/10406143

In [170]:
dict_spam_count = {}

for word in dict:
    word_appearance = 0
    for comment in X_train:
        if word in comment:
            word_appearance = word_appearance + 1
            
#    total_spam = len(train_spam)
#    spamicity = (emails_with_w+1)/(total_spam+2)
#    dict_spamicity[w.lower()] = spamicity

In [171]:
dict_spam_count

{}
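
_In the cell above, `word_appearance` is computed but never stored, which is why `dict_spam_count` is still empty. The sketch below shows one way the idea could be finished, following the add-one smoothing formula in the commented-out lines. It assumes only spam comments are counted, runs on made-up token lists, and all `toy_` names are hypothetical._

```python
# Made-up tokenized spam comments (hypothetical data)
toy_spam_comments = [
    ["check", "out", "my", "channel"],
    ["subscribe", "to", "my", "channel"],
]

# Vocabulary as a set, so membership checks stay fast
toy_vocabulary = {word for comment in toy_spam_comments for word in comment}

total_spam = len(toy_spam_comments)
toy_spamicity = {}

for word in toy_vocabulary:
    # Number of spam comments that contain the word at least once
    spam_with_word = sum(1 for comment in toy_spam_comments if word in comment)
    # Add-one smoothing, following the commented-out formula above
    toy_spamicity[word] = (spam_with_word + 1) / (total_spam + 2)

print(toy_spamicity["channel"], toy_spamicity["check"])  # 0.75 0.5
```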

In [None]:
X_train

In [167]:
base_data = [0] * len(X_train)

In [159]:
df_dict = pd.DataFrame()

for word in dict:
   df_dict[word] = base_data



Unnamed: 0,"huh,",anyway,check,out,this,you[tube],channel:,kobyoshi02,hey,guys,...,things?,motivate,they’ve,you’re,far!,1000,started!,red,857.482.940,!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!﻿
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1264,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1265,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1266,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1267,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [168]:
#for index, row in enumerate(X_train):
#    for word in X_train:
#        base_data[word][index] += 1
        
for index, row in enumerate(X_train):
    for word in row:
        base_data[word][index] += 1

TypeError: list indices must be integers or slices, not str
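
_The `TypeError` above occurs because `base_data` is a plain list, which can only be indexed by integer position and not by a word. A minimal sketch of one way to build the per-comment word counts is shown below, using a dataframe with one column per word. The data is made up and the `toy_` names are hypothetical._

```python
import pandas as pd

# Made-up tokenized comments (hypothetical data)
toy_comments = [["my", "channel", "my"], ["song"]]
toy_vocabulary = sorted({word for comment in toy_comments for word in comment})

# One row per comment, one column per word, all counts starting at zero
toy_counts = pd.DataFrame(0, index=range(len(toy_comments)), columns=toy_vocabulary)

for position, comment in enumerate(toy_comments):
    for word in comment:
        # .at addresses a single cell by (row label, column label)
        toy_counts.at[position, word] += 1

print(toy_counts)
```

_Here the row labels are the positions from `enumerate()`, which sidesteps the gaps and repeats in the index of `X_train`._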

In [161]:
X_train

0      [huh,, anyway, check, out, this, you[tube], ch...
1      [hey, guys, check, out, my, new, channel, and,...
2        [just, for, test, i, have, to, say, murdev.com]
3      [me, shaking, my, sexy, ass, on, my, channel, ...
4            [watch?v=vtarggvgtwq, check, this, out, .﻿]
                             ...                        
441                             [best., song., ever, 🙌﻿]
443     [subscribe, to, my, channel, x, please!., spare]
445    [3, yrs, ago, i, had, a, health, scare, but, t...
446    [rihanna, looks, so, beautiful, with, red, hai...
447    [857.482.940, views, awesome, !!!!!!!!!!!!!!!!...
Name: CONTENT, Length: 1269, dtype: object

In [54]:
NB_model = GaussianNB()

In [57]:
cross_val = cross_validate(NB_model, features, label, cv = kf)

Traceback (most recent call last):
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 207, in fit
    X, y = self._validate_data(X, y)
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 871, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\lefta\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 673, in check_array
 