# Spam Detection Support


## Bag-of-Words Processing

The bag-of-words model represents a piece of text, such as a sentence or a document, as the bag (multiset) of its words. It disregards grammar and word order but keeps multiplicity.
The words are stored as tokens, each paired with a count of how often it appears.
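
To make the "bag" idea concrete, here is a minimal sketch that uses `collections.Counter` as the multiset. The two example strings are made up for illustration.

```python
from collections import Counter

# Two hypothetical messages with the same words in a different order
first = "win money win"
second = "money win win"

bag_first = Counter(first.split())
bag_second = Counter(second.split())

# Word order is discarded, so both messages map to the same bag
print(bag_first == bag_second)
# Multiplicity is kept: "win" is counted twice
print(bag_first["win"])
```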

Building this representation by hand takes four steps:

1. Convert strings to lower case
2. Remove punctuation
3. Tokenize the message and give an integer ID to each token
4. Count frequencies

In [None]:
# Demo data: a plain Python list of short messages (not a pandas DataFrame)
demo_df = ['Hello, how are you!',
    'Win money, win from home.',
    'Call me now',
    'Hello, Call you tomorrow?']


In [None]:
# Convert it to lower case
lower_case_df = []
for i in demo_df:
    lower_case_df.append(i.lower())
print(lower_case_df)

In [None]:
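# Remove punctuation
import string
# Translation table that deletes every ASCII punctuation character
remove_punctuation = str.maketrans('', '', string.punctuation)
no_punctuation = []
for i in lower_case_df:
    no_punctuation.append(i.translate(remove_punctuation))
print(no_punctuation)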

In [None]:
# Split into tokens
tokenized = []
for i in no_punctuation:
    # split() with no argument also collapses repeated whitespace
    tokenized.append(i.split())
print(tokenized)

In [None]:
# Count the frequency
import pprint
from collections import Counter
frequency_list = []

for i in tokenized:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)
    
# Pretty-print the list of per-message counts
pprint.pprint(frequency_list)
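
Step 3 of the list calls for an integer ID per token, which the cells above do not assign. One possible sketch follows, assuming IDs are handed out in alphabetical order of the vocabulary. The names `vocabulary`, `token_to_id` and `count_rows` are illustrative, and the tokenized messages are repeated so the block runs on its own.

```python
from collections import Counter

# Tokenized demo messages, as produced by the cells above
tokenized = [['hello', 'how', 'are', 'you'],
             ['win', 'money', 'win', 'from', 'home'],
             ['call', 'me', 'now'],
             ['hello', 'call', 'you', 'tomorrow']]

# Assumption: IDs follow the alphabetical order of the vocabulary
vocabulary = sorted({token for message in tokenized for token in message})
token_to_id = {token: index for index, token in enumerate(vocabulary)}

# One row per message, one column per token ID, cells hold the counts
count_rows = []
for message in tokenized:
    counts = Counter(message)  # a missing token counts as 0
    count_rows.append([counts[token] for token in vocabulary])

print(token_to_id)
for row in count_rows:
    print(row)
```

Each row has one entry per vocabulary token, so the rows can be stacked into a document-by-token matrix of counts.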

## Bag of Words in scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
print(count_vector)

In [None]:
count_vector.fit(demo_df)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_vector.get_feature_names_out()
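
`CountVectorizer` performs the manual steps internally. With default settings it lowercases the text (`lowercase=True`) and extracts tokens with the pattern `(?u)\b\w\w+\b`, so punctuation is dropped and single-character tokens are ignored. The sketch below imitates only that tokenization step with the standard `re` module. `default_like_tokens` is a hypothetical helper, not part of scikit-learn, and the second message is made up.

```python
import re

# Default token_pattern of CountVectorizer: runs of 2 or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    """Rough imitation of CountVectorizer's default lowercasing and tokenization."""
    return TOKEN_PATTERN.findall(text.lower())

print(default_like_tokens('Hello, how are you!'))
# Single-character words such as "I" and "a" are not matched
print(default_like_tokens('I won a prize'))
```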

In [None]:
# Create a matrix of features
# columns: the tokens
# rows: the documents
# cells: the frequency of appearance of this word in this document
doc_array = count_vector.transform(demo_df).toarray()
doc_array

In [None]:
# Improve the printing
import pandas as pd
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())
frequency_matrix