# Machine Learning Project - Text Classification
### Text Processing and Naive Bayes

**Dataset:**
https://www.kaggle.com/uciml/sms-spam-collection-dataset/version/1

**Objective:** to classify each SMS message as spam or not spam (ham).

**Methods:**

From the given data set, use Naïve Bayes to classify the SMS message.
The framework for text classification is briefly summarized here:
* Preprocessing of the dataset (change to lower case, remove numbers, remove punctuation, stop words, white space, word stemming, etc.)
* Document-Term-Matrix creation – matrix of word counts for each individual document in the corpus (e.g. documents as rows, words as columns or vice versa)
* Text Analysis (e.g. word counts, visualizations using wordclouds)
* Predict Spam or Not

### Import Data

In [1]:
import pandas as pd

df = pd.read_csv("spam.csv", encoding="latin-1")
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [35]:
# Drop every column that contains missing values (the mostly empty "Unnamed" columns)
df = pd.read_csv("spam.csv", encoding="latin-1")

df = df.dropna(how="any", axis=1)

### EDA

In [36]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [38]:
# name the columns
df.columns = ['target', 'message']
df.head()

Unnamed: 0,target,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [39]:
# Find out how many of each class there are
balance_counts = df.groupby('target')['target'].agg('count').values
balance_counts

array([4825,  747], dtype=int64)

This dataset is imbalanced: 4,825 ham messages versus 747 spam messages.  There isn't much I can do about this, but it is important to note.
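To make the imbalance concrete, the short sketch below computes the class shares from the counts printed above, along with the accuracy of a trivial baseline that always predicts ham. The variable names are my own; only the two counts come from the notebook output.

```python
# Class counts taken from the groupby output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

ham_share = ham_count / total
spam_share = spam_count / total

# A classifier that always answers "ham" is right for every ham message,
# so its accuracy equals the ham share. Any real model should beat this.
baseline_accuracy = ham_share

print(f"Ham share:  {ham_share:.4f}")
print(f"Spam share: {spam_share:.4f}")
print(f"Always-ham baseline accuracy: {baseline_accuracy:.4f}")
```

This baseline is a useful reference point when reading the accuracy score later on.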

In [45]:
# find average length of message

ham_lengths = []
spam_lengths = []

for x in range(len(df['message'])):
    if df['target'][x] == 'ham':
        ham_lengths.append(len(df['message'][x]))
    else:
        spam_lengths.append(len(df['message'][x]))

avg_ham = sum(ham_lengths)/len(ham_lengths)
avg_spam = sum(spam_lengths)/len(spam_lengths)

print("Ham messages average characters:",avg_ham)
print("Spam messages average characters:",avg_spam)

Ham messages average characters: 71.02362694300518
Spam messages average characters: 138.8661311914324


It is interesting to see that spam messages have almost double the characters of ham messages.  Perhaps this is because real texts in the course of a conversation can be only one or two words, while spam is usually trying to sell something or provide lots of information to get the reader to act.  Spam is naturally longer.

In [48]:
len(df['message'][0].split())

20

In [50]:
# find average number of words per message

ham_words_count = []
spam_words_count = []

for x in range(len(df['message'])):
    if df['target'][x] == 'ham':
        ham_words_count.append(len(df['message'][x].split()))
    else:
        spam_words_count.append(len(df['message'][x].split()))

avg_ham_words = sum(ham_words_count)/len(ham_words_count)
avg_spam_words = sum(spam_words_count)/len(spam_words_count)

print("Ham messages average words:",avg_ham_words)
print("Spam messages average words:",avg_spam_words)

Ham messages average words: 14.20062176165803
Spam messages average words: 23.85140562248996


Again, spam messages are longer, this time measured by the average number of words per message. I'm guessing this is for the same reason as above.
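Both averages can be computed in a single pass by grouping on the label. The sketch below does this in plain Python on a made-up sample; the helper name `average_lengths` and the sample rows are assumptions for illustration, not part of the notebook.

```python
from collections import defaultdict

def average_lengths(rows):
    """rows: iterable of (label, message) pairs.
    Returns {label: (average characters, average words)}."""
    chars = defaultdict(list)
    words = defaultdict(list)
    for label, message in rows:
        chars[label].append(len(message))
        words[label].append(len(message.split()))
    return {
        label: (sum(chars[label]) / len(chars[label]),
                sum(words[label]) / len(words[label]))
        for label in chars
    }

# Hypothetical sample rows, not taken from the dataset
sample = [
    ("ham", "Ok lar"),
    ("ham", "See you at 5"),
    ("spam", "WIN a FREE prize now"),
]
result = average_lengths(sample)
print(result)
```

On the real data, the same function could be fed `zip(df['target'], df['message'])`.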

To get more insight, I will need to clean the data.

### Clean Text

Cleaning the text makes it possible to feed it into a model.  It is important to standardize the data and remove the most common words, which will not really help uncover any patterns because they are so prevalent in both classes.  Finally, word stemming acts as a normalization technique too.  Trimming words to their roots makes different but related words (those with the same stem) appear as the same token to the model.

- Make Lowercase
- Remove Punctuation
- Remove Numbers
- Strip Whitespace
- Remove Stopwords
- Word Stemming
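The first five steps in the list can be sketched as one function using only the standard library. This is a simplified illustration: the stop list here is a tiny hypothetical one (the notebook uses NLTK's English list), and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical, tiny stop list for illustration only
TOY_STOPWORDS = {"a", "in", "to", "the", "is", "until", "only"}

def clean_message(text, stopwords=TOY_STOPWORDS):
    text = text.lower()                    # make lowercase
    text = re.sub(r"[^\w\s]+", "", text)   # remove punctuation
    text = re.sub(r"\d+", "", text)        # remove numbers
    tokens = text.split()                  # split() also drops extra white space
    tokens = [t for t in tokens if t not in stopwords]  # remove stop words
    return " ".join(tokens)

cleaned = clean_message("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)
```

Keeping the steps in one function makes it easy to apply exactly the same cleaning to new messages at prediction time.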

In [4]:
# Make Messages All Lowercase
df['message_lower'] = df['message'].str.lower()
df.head()

# Source: https://www.kite.com/python/answers/how-to-make-a-pandas-dataframe-string-column-lowercase-in-python

Unnamed: 0,target,message,message_lower
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro..."


In [5]:
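# Remove Punctuation
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
df['message_no_punct'] = df['message_lower'].str.replace(r'[^\w\s]+', '', regex=True)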
df.head()

#source = https://stackoverflow.com/questions/50444346/fast-punctuation-removal-with-pandas

Unnamed: 0,target,message,message_lower,message_no_punct
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...


In [6]:
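# Remove Numbers
# Raw string and explicit regex=True so the pattern is interpreted as a regular expression
df['message_no_nums'] = df['message_no_punct'].str.replace(r'\d+', '', regex=True)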
df.head()

#source = https://stackoverflow.com/questions/41719259/how-to-remove-numbers-from-string-terms-in-a-pandas-dataframe

Unnamed: 0,target,message,message_lower,message_no_punct,message_no_nums
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in a wkly comp to win fa cup final...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah i dont think he goes to usf he lives aroun...


In [7]:
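# Strip leading and trailing white space
# str.strip() returns a new Series, so the result has to be assigned back
df['message_no_nums'] = df['message_no_nums'].str.strip()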
df.head()

#source = https://www.geeksforgeeks.org/pandas-strip-whitespace-from-entire-dataframe/

Unnamed: 0,target,message,message_lower,message_no_punct,message_no_nums
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in a wkly comp to win fa cup final...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah i dont think he goes to usf he lives aroun...


In [8]:
# Drop Unneeded Columns
df.drop(['message_lower', 'message_no_punct'],axis=1, inplace=True)
df.head(10)

Unnamed: 0,target,message,message_no_nums
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in a wkly comp to win fa cup final...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...
5,spam,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey there darling its been weeks now ...
6,ham,Even my brother is not like to speak with me. ...,even my brother is not like to speak with me t...
7,ham,As per your request 'Melle Melle (Oru Minnamin...,as per your request melle melle oru minnaminun...
8,spam,WINNER!! As a valued network customer you have...,winner as a valued network customer you have b...
9,spam,Had your mobile 11 months or more? U R entitle...,had your mobile months or more u r entitled t...


In [9]:
# Remove Stop Words

# Import stopwords with nltk.
from nltk.corpus import stopwords
stop = stopwords.words('english')

df['clean_message'] = df['message_no_nums'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df.head(15)

#Source = https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe

Unnamed: 0,target,message,message_no_nums,clean_message
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in a wkly comp to win fa cup final...,free entry wkly comp win fa cup final tkts st ...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah dont think goes usf lives around though
5,spam,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey there darling its been weeks now ...,freemsg hey darling weeks word back id like fu...
6,ham,Even my brother is not like to speak with me. ...,even my brother is not like to speak with me t...,even brother like speak treat like aids patent
7,ham,As per your request 'Melle Melle (Oru Minnamin...,as per your request melle melle oru minnaminun...,per request melle melle oru minnaminunginte nu...
8,spam,WINNER!! As a valued network customer you have...,winner as a valued network customer you have b...,winner valued network customer selected receiv...
9,spam,Had your mobile 11 months or more? U R entitle...,had your mobile months or more u r entitled t...,mobile months u r entitled update latest colou...


From the above output, there are still some words remaining that would probably be classified as stopwords if they were spelled correctly or were normal characters.  Some examples include: "ok","u","c","r","å","id".

I will remove these as well.

In [10]:
# Remove More Stopwords

# Add more stopwords
extra_words = ["ok","u","c","r","å","id"]
stop_words = stop + extra_words

# Remove stopwords
df['clean_message'] = df['message_no_nums'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
df.head(15)

Unnamed: 0,target,message,message_no_nums,clean_message
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,lar joking wif oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in a wkly comp to win fa cup final...,free entry wkly comp win fa cup final tkts st ...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,dun say early hor already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah dont think goes usf lives around though
5,spam,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey there darling its been weeks now ...,freemsg hey darling weeks word back like fun s...
6,ham,Even my brother is not like to speak with me. ...,even my brother is not like to speak with me t...,even brother like speak treat like aids patent
7,ham,As per your request 'Melle Melle (Oru Minnamin...,as per your request melle melle oru minnaminun...,per request melle melle oru minnaminunginte nu...
8,spam,WINNER!! As a valued network customer you have...,winner as a valued network customer you have b...,winner valued network customer selected receiv...
9,spam,Had your mobile 11 months or more? U R entitle...,had your mobile months or more u r entitled t...,mobile months entitled update latest colour mo...


In [11]:
# Word Stemming

from nltk.stem.snowball import SnowballStemmer

# Drop Unneeded Columns
df.drop(['message_no_nums'],axis=1, inplace=True)

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Stem every word
df['stemmed'] = df['clean_message'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
df.head()

# Source: https://stackoverflow.com/questions/37443138/python-stemming-with-pandas-dataframe

Unnamed: 0,target,message,clean_message,stemmed
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...,go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,lar joking wif oni,lar joke wif oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...,free entri wkli comp win fa cup final tkts st ...
3,ham,U dun say so early hor... U c already then say...,dun say early hor already say,dun say earli hor alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think goes usf lives around though,nah dont think goe usf live around though


### Document-Term-Matrix creation

Steps:
- First, make the labels binary.  'ham' and 'spam' work better for classification when encoded as 0 and 1.
- Split into train and test sets.
- Fit the vectorizer on the train set only (fitting on the test set too would give the model an undue advantage).
- Vectorize the train and test sets, based on the fit from the train set.
- Apply TF-IDF.

Then the text will be ready to be used to train the model.

#### Encode Targets

In [12]:
# Encode the target from 'ham' and 'spam' to 0 and 1
df['b_target'] = df['target'].map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,target,message,clean_message,stemmed,b_target
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...,go jurong point crazi avail bugi n great world...,0
1,ham,Ok lar... Joking wif u oni...,lar joking wif oni,lar joke wif oni,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry wkly comp win fa cup final tkts st ...,free entri wkli comp win fa cup final tkts st ...,1
3,ham,U dun say so early hor... U c already then say...,dun say early hor already say,dun say earli hor alreadi say,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think goes usf lives around though,nah dont think goe usf live around though,0


#### Train-Test Splits

In [13]:
# Set X (predictive features) and Y (target feature)
X = df['stemmed']
Y = df['b_target']

# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

#### Vectorization

Vectorization is a key step in data preparation.  It creates a matrix with one column for every word in the training vocabulary and one row for every message.  Each cell holds the number of times that word occurs in that message. This helps quantify the text data in a way that the model can analyze.

It is important to fit the vectorizer on the train data only.  The fitted vectorizer is then used to transform both the train and test sets.
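The fit-then-transform idea can be sketched in plain Python. The functions `fit_vocabulary` and `transform` below are hypothetical stand-ins, not scikit-learn's implementation; for instance, as far as I know CountVectorizer's default tokenizer also drops single-character tokens, which this sketch does not.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the training documents only."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: i for i, word in enumerate(vocab)}

def transform(docs, vocab):
    """Build one row of word counts per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(w for w in doc.split() if w in vocab)
        row = [0] * len(vocab)
        for word, count in counts.items():
            row[vocab[word]] = count
        rows.append(row)
    return rows

# Toy stemmed messages (illustrative only)
train = ["free entri win", "lar joke wif oni", "free free call"]
test = ["free prize call"]

vocab = fit_vocabulary(train)          # fit on train only
train_dtm = transform(train, vocab)
test_dtm = transform(test, vocab)      # "prize" is not in the vocabulary

print(vocab)
print(train_dtm)
print(test_dtm)
```

Because the vocabulary comes from the train set alone, the test matrix has the same columns and nothing about the test messages leaks into the fit.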

In [14]:
# Vectorize the data
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer and fit on the training set
vect = CountVectorizer()
vect.fit(x_train)

# Use the fitted vectorizer to create a document-term matrix from the train and test sets
x_train_dtm = vect.transform(x_train)
x_test_dtm = vect.transform(x_test)

#source: https://www.kaggle.com/andreshg/nlp-glove-bert-tf-idf-lstm-explained

#### TF-IDF

TF-IDF is also important for text analysis.  It has two components:

- TF = "Term Frequency"
- IDF = "Inverse Document Frequency"

In a text database, the frequency of words could be inversely correlated to how important the word is.  Common words might not reveal much about patterns for classifications.  But some words that are used less frequently are key words in discovering patterns.   

TF-IDF transformation provides word frequency scores for each word in the data, with less important common words having less significance than key words.  It acts to highlight interesting words in a message - the ones that might be frequent in a message, but not necessarily frequent across messages.
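A small worked example may help here. The sketch below applies a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row; to my knowledge this matches scikit-learn's default settings, but treat that as an assumption. The function name `tfidf` and the toy counts are made up for illustration.

```python
import math

def tfidf(count_rows):
    """count_rows: list of equal-length lists of word counts (one list per document)."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term appear?
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return weighted_rows

# Columns: a term found in every document, and two terms found in one document each
counts = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row both terms occur once, yet the rarer term ends up with the larger weight, which is the behavior described above.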

In [15]:
# TF-IDF

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

tfidf_transformer.fit(x_train_dtm)
x_train_tfidf = tfidf_transformer.transform(x_train_dtm)

x_train_tfidf

#source: https://www.kaggle.com/andreshg/nlp-glove-bert-tf-idf-lstm-explained

<4457x6303 sparse matrix of type '<class 'numpy.float64'>'
	with 35945 stored elements in Compressed Sparse Row format>

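A row of the matrix looks like the output below.  A 0 means the word is not in the SMS message for that row, and a nonzero value is the TF-IDF weight of a word that is present.  Because each message contains only a few of the 6,303 vocabulary words, almost every entry is 0.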

In [17]:
formatted = x_train_tfidf.todense()
formatted[0]

#source: https://stackoverflow.com/questions/15115765/how-to-access-sparse-matrix-elements

matrix([[0., 0., 0., ..., 0., 0., 0.]])

## Model

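Now the data is prepared and I am ready to train the model.  I use a Multinomial Naive Bayes model.  Note that the model below is fit on the count matrix (x_train_dtm), not on the TF-IDF matrix computed above.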
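To make the classifier less of a black box, here is a minimal sketch of multinomial Naive Bayes with Laplace (add-one) smoothing in plain Python. The names `train_nb` and `predict_nb`, the toy messages, and the smoothing value are assumptions for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counter in word_counts.values() for word in counter}
    model = {"log_prior": {}, "log_lik": {}, "vocab": vocab}
    for label, n in class_counts.items():
        model["log_prior"][label] = math.log(n / len(labels))
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_lik"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score; unseen words are skipped."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for word in doc.split():
            if word in model["vocab"]:
                score += model["log_lik"][label][word]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy stemmed messages: 1 = spam, 0 = ham (illustrative only)
docs = ["free entri win prize", "call free mobil", "im come home", "go lunch im late"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

pred_spam = predict_nb(model, "free prize call")
pred_ham = predict_nb(model, "im go home")
print(pred_spam, pred_ham)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared in a class from forcing that class's score to zero.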

In [18]:
# Create a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Train the model
nb.fit(x_train_dtm, y_train)

# Make class predictions on the test set
y_pred_class = nb.predict(x_test_dtm)

## Analysis Questions

### What is the accuracy of the model?  Report your finding with corresponding tables/graphs.

In [19]:
# calculate accuracy of class predictions
from sklearn import metrics
print("Accuracy of the model is:", metrics.accuracy_score(y_test, y_pred_class))
print('')
print("Confusion Matrix:")
conf_matrix = (metrics.confusion_matrix(y_test, y_pred_class))
conf_matrix

Accuracy of the model is: 0.9820627802690582

Confusion Matrix:


array([[961,   4],
       [ 16, 134]], dtype=int64)

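The model scores an accuracy of 0.982, which seems pretty successful.  Reading the confusion matrix with true classes as rows and predicted classes as columns, 961 ham messages were classified correctly, while 4 ham messages were classified as spam.  Also, 134 spam messages were classified correctly, while 16 spam messages were classified as ham.  Accuracy alone is flattering on an imbalanced dataset, so it is worth noting that the model catches 134 of the 150 spam messages in the test set (about 89%).  These scores are pretty good overall!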
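Because the classes are imbalanced, precision and recall for the spam class add useful detail to the accuracy score. The sketch below derives them from the four cells of the confusion matrix printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are my own.

```python
# Cells of the confusion matrix above: [[tn, fp], [fn, tp]] with spam as the positive class
tn, fp = 961, 4      # true ham:  predicted ham, predicted spam
fn, tp = 16, 134     # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many really are spam
recall = tp / (tp + fn)      # of the real spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Recall is the weakest of these numbers, which suggests the model's main error is letting some spam through rather than blocking legitimate messages.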

### Print the 5 most frequent words in each class, and their posterior probability generated by the model.

I interpreted this to mean the 5 words from each class that occur in the most messages of that class, not necessarily the words that occur most often in the entire dataset.  Some words could occur multiple times in a message, which could skew the results, so I focus on which words occur in the most messages of each class.

In [20]:
# Get all the words in the Ham and Spam messages

# Split messages into ham and spam
ham_messages = []
spam_messages = []

for x in range(len(df['stemmed'])):
    if df['b_target'][x] == 0:
        ham_messages.append(df['stemmed'][x])
    else:
        spam_messages.append(df['stemmed'][x])
        
# Put words in lists for each class
ham_words = []
spam_words = []

for message in ham_messages:
    words = message.split()
    for word in words:
        ham_words.append(word)
        
for message in spam_messages:
    words = message.split()
    for word in words:
        spam_words.append(word)
        
#Remove duplicates
unique_ham_words = set(ham_words)
unique_spam_words = set(spam_words)

print("Unique words in Ham messages:",len(unique_ham_words))
print("Unique words in Spam messages:",len(unique_spam_words))

Unique words in Ham messages: 6057
Unique words in Spam messages: 1951


In [23]:
from tqdm import tqdm 

# Initiate dictionaries 
ham_word_freq = {}
spam_word_freq = {}

# Go through unique words for ham
for word in tqdm(list(unique_ham_words)):
    # Loop through each row of the database
    for x in range(len(df['stemmed'])):
        # For the specific class
        if df['b_target'][x] == 0:
            # if the word isn't yet in the dictionary, instantiate it
            if word not in ham_word_freq:
                ham_word_freq[word] = 0
        
            # If the word is in the message for that row, increment the counter
            # We are counting how many messages (rows) that each word in ham occurs, and getting the most frequent
            if word in df['stemmed'][x].split():
                ham_word_freq[word] += 1
    
for word in tqdm(list(unique_spam_words)):
    for x in range(len(df['stemmed'])):
        if df['b_target'][x] == 1:
            if word not in spam_word_freq:
                spam_word_freq[word] = 0
        
            if word in df['stemmed'][x].split():
                spam_word_freq[word] += 1        
    


100%|██████████| 6057/6057 [06:28<00:00, 15.59it/s]
100%|██████████| 1951/1951 [01:13<00:00, 26.69it/s]
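The nested loops above took several minutes because every unique word is checked against every row. The same per-class count (how many messages contain each word) can be sketched with `collections.Counter` in a single pass over the messages. The helper name `document_frequency` and the toy messages are assumptions for illustration.

```python
from collections import Counter

def document_frequency(messages):
    """Count, for each word, the number of messages that contain it at least once."""
    freq = Counter()
    for message in messages:
        # set() makes sure a word repeated inside one message is counted only once
        freq.update(set(message.split()))
    return freq

# Toy stemmed ham messages (illustrative only)
toy_ham = ["im go home", "im come late", "go go go"]
toy_ham_freq = document_frequency(toy_ham)

print(toy_ham_freq["go"], toy_ham_freq["im"], toy_ham_freq["home"])
```

On the notebook's data, this could be called once on the stemmed ham messages and once on the stemmed spam messages, and `most_common(5)` on each result would give the top 5 words.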


In [24]:
# Sort the dictionary above.  The structure is word as the key and number of messages containing that word as the value
# Sort by value
sorted_ham_word_freq = {k: v for k, v in sorted(ham_word_freq.items(), key=lambda item: item[1], reverse=True)}
# Save a list of the top 5 values.  These are the words that occur in the most rows of Ham
top_5_ham = list(sorted_ham_word_freq.items())[0:5]

sorted_spam_word_freq = {k: v for k, v in sorted(spam_word_freq.items(), key=lambda item: item[1], reverse=True)}
top_5_spam = list(sorted_spam_word_freq.items())[0:5]

print("The top 5 Ham Words:",top_5_ham)
print("The top 5 Spam Words:",top_5_spam)

print(len(list(sorted_ham_word_freq.items())),"unique words in ham.")
print(len(list(sorted_spam_word_freq.items())),"unique words in spam.")

The top 5 Ham Words: [('im', 412), ('go', 383), ('get', 337), ('come', 273), ('call', 265)]
The top 5 Spam Words: [('call', 328), ('free', 169), ('txt', 142), ('text', 122), ('mobil', 118)]
6057 unique words in ham.
1951 unique words in spam.


#### Posterior Probability Calculation

Info Needed to Calculate Posterior:

- Ham Messages: 4,825 
- Spam Messages: 747
- Total Messages: 5,572
- % Ham: 86.59%
- % Spam: 13.41%


Intuitively, posterior probability in this case answers the question: given that a message contains one of the top 5 words, what is the probability that the message is ham (or spam, in the next section)?

Posterior = (Likelihood * Prior) / Evidence

where, in this case:

- Posterior = P(Ham | Word) = The probability that a message is ham, given it contains the word
- Likelihood = P(Word | Ham) = The probability that a message contains the word, given that the message is ham
- Prior = P(Ham) = The probability that a message is ham
- Evidence = P(Word) = The probability of the word occurring in a message = Messages with the word / total messages
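As a worked example of the formula, the sketch below plugs in the counts for the word "call" reported in the top-5 lists (265 ham messages and 328 spam messages contain it). The variable names are my own; the counts come from the notebook output.

```python
ham_total = 4825
spam_total = 747
all_total = ham_total + spam_total

# Number of messages in each class that contain the word "call"
ham_with_call = 265
spam_with_call = 328

likelihood = ham_with_call / ham_total                    # P(Word | Ham)
prior = ham_total / all_total                             # P(Ham)
evidence = (ham_with_call + spam_with_call) / all_total   # P(Word)

posterior_ham = likelihood * prior / evidence             # P(Ham | Word)
posterior_spam = 1 - posterior_ham                        # only two classes

print(f"P(Ham | call)  = {posterior_ham:.4f}")
print(f"P(Spam | call) = {posterior_spam:.4f}")
```

The class totals cancel out, so the posterior reduces to 265 / (265 + 328): the share of messages containing "call" that are ham.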


#### Ham Posteriors

In [31]:
ham_messages = 4825
spam_messages = 747
total_messages = ham_messages + spam_messages

# Loop through each word, frequency pairing
for tup in top_5_ham:    
    # Save the word 
    word = tup[0]
    
    print("########################")
    
    ##################### Likelihood ##########################
    # count how many ham messages contain the word / total ham messages
    # how many ham messages is equal to the second value in the tuple, but it is good to calculate again to understand the intuition
    class_messages_with_word = 0

    # Go through every row of the df
    for x in range(len(df['stemmed'])):
        # If the row is labeled as ham
        if df['b_target'][x] == 0:
            # and if the word is in the row's message
            if word in df['stemmed'][x].split():
                # Count it
                class_messages_with_word += 1
                    
    likelihood = class_messages_with_word / ham_messages 
    
    #################### Prior ##################################
    # Ham Messages / Total Messages
    prior = ham_messages / total_messages
    
    ###################### Evidence ############################
    # Messages with the word / total messages
    all_messages_with_word = 0

    # Similar loop to above, except counting all times the word occurs in a message, not just ham messages
    for x in range(len(df['stemmed'])):
        if word in df['stemmed'][x].split():
            all_messages_with_word += 1
        
    evidence = all_messages_with_word / total_messages
                
                
    posterior = likelihood * prior / evidence
    
    print("The posterior for:",word,"is:",posterior)


########################
The posterior for: im is: 0.9716981132075471
########################
The posterior for: go is: 0.911904761904762
########################
The posterior for: get is: 0.7985781990521327
########################
The posterior for: come is: 0.9820143884892087
########################
The posterior for: call is: 0.4468802698145026


#### Spam Posteriors

In [32]:
for tup in top_5_spam:    
    word = tup[0]
    
    print("########################")
    
    ##################### Likelihood ##########################
    # count how many spam messages contain the word / total spam messages
    class_messages_with_word = 0

    for x in range(len(df['stemmed'])):
        if df['b_target'][x] == 1:
            if word in df['stemmed'][x].split():
                class_messages_with_word += 1
                    
    likelihood = class_messages_with_word / spam_messages 
    
    #################### Prior ##################################
    # Spam Messages / Total Messages
    prior = spam_messages / total_messages
    
    ###################### Evidence ############################
    # Messages with the word / total messages
    all_messages_with_word = 0

    for x in range(len(df['stemmed'])):
        if word in df['stemmed'][x].split():
            all_messages_with_word += 1
        
    evidence = all_messages_with_word / total_messages
                
                
    posterior = likelihood * prior / evidence
    
    print("The posterior for:",word,"is:",posterior)

########################
The posterior for: call is: 0.5531197301854975
########################
The posterior for: free is: 0.7444933920704847
########################
The posterior for: txt is: 0.922077922077922
########################
The posterior for: text is: 0.5951219512195122
########################
The posterior for: mobil is: 0.887218045112782


#### Posterior Analysis

- Messages with the words "im", "go", or "come" have a > 0.90 probability of being ham.
- Messages with the word "txt" have a 0.92 probability of being spam, but the word "text" has a probability of only 0.59.  This indicates that the word is more likely to be abbreviated in spam messages.
- The word "call" occurs frequently in both ham and spam messages, so its posterior is close to 0.5 for either class (0.45 for ham, 0.55 for spam), lower than for words that appear frequently in only one class.
- Even though these words seem common, a few give really good insight into whether a message is spam or ham.  This is surprising to me but demonstrates how effective Naive Bayes classifiers can be.

### How would you improve the model performance?

- More data
- More balanced data
- Perhaps using lemmatization instead of stemming would improve the model.
- Increasing the n-grams to give the model context for strings of words.  This added context might be helpful in classifying ham vs spam because it would take into account word combinations.
- Add more words to the stop list
- Also, feature engineering might help improve the model.  In my EDA, I saw that spam messages were much longer on average than ham messages.  Engineering a 'message_length' feature and inputting it into the model might improve accuracy.
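The n-gram idea from the list above can be illustrated with a few lines of plain Python. The helper `ngrams` is hypothetical; in practice, scikit-learn's CountVectorizer exposes an `ngram_range` parameter that should serve the same purpose.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Stemmed tokens from the start of one spam message in the data
tokens = "free entri wkli comp".split()

# Unigrams plus bigrams: each becomes its own column in the document-term matrix
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

With bigrams, a pair such as "free entri" becomes a feature in its own right, so the model can weigh the combination separately from the two individual words.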

### If the data set is bigger, do you think the accuracy increases? Discuss.

In general, I think a bigger dataset would usually improve the model's accuracy, because there is more input to train on.  I especially think more data would help if the additional data shifted the balance of spam and ham.  Right now, the dataset includes many more ham datapoints than spam.  Also, if more features were added, I think that would help the model's accuracy significantly. For example, if a feature was added that indicated whether the message came from someone in the recipient's contact list, I would guess that it would be easier to classify ham vs. spam.