## Status : *COMPLETED*

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

***
Instructions:
- Import the dataset into a pandas dataframe using the read_table method. 
- Because this is a tab-separated dataset, we will use '\t' as the value of the 'sep' argument, which specifies this format.
- Also, set the column names by passing the list ['label', 'sms_message'] to the 'names' argument of read_table().
- Print the first five values of the dataframe with the new column names.
***

In [2]:
# Raw string so the backslashes are not read as escape sequences
localpath = r'..\data\SMSSpamCollection'
colnames = ['label', 'sms_message']
df = pd.read_table(localpath, sep='\t', names=colnames)
df.head(5)

Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


>Instructions:
- Convert the values in the 'label' column to numerical values using the map method as follows: {'ham':0, 'spam':1}. This maps the 'ham' value to 0 and the 'spam' value to 1.
- Also, to get an idea of the size of the dataset we are dealing with, print out the number of rows and columns using 'shape'.

In [3]:
df['label'] = df['label'].map({'ham':0, 'spam':1})
print(df.shape)
df.head()

(5572, 2)


Unnamed: 0,label,sms_message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


>```python
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
```  

>**Instructions**:
Convert all the strings in the documents set to their lower case. Save them into a list called 'lower_case_documents'. You can convert strings to their lower case in python by using the `lower()` method.

In [10]:
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
lower_case_documents = []

for doc in documents:
    item = doc.lower()
    lower_case_documents.append(item)

In [11]:
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


>**Instructions**: Remove all punctuation from the strings in the document set. Save them into a list called 'sans_punctuation_documents'.

In [13]:
import string as s

sans_punctuation_documents = []

for doc in lower_case_documents:
    # Strip every punctuation character; replace() is a no-op when it is absent
    for char in s.punctuation:
        doc = doc.replace(char, '')
    sans_punctuation_documents.append(doc)
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


>**Udacity solution**

>```python
sans_punctuation_documents = []
import string
for i in lower_case_documents:
    sans_punctuation_documents.append(
        i.translate(
            str.maketrans('', '', string.punctuation)
            )
    )
print(sans_punctuation_documents)
```

The following builds a translation table. The first two arguments are empty, so no character is mapped to another; the third argument lists the characters to delete, here everything in `string.punctuation`:
```python
str.maketrans('', '', string.punctuation)
```
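
As a quick check of how that table behaves, the sketch below inspects one entry and applies the table to a single string. The sample sentence is reused from the toy `documents` list purely for illustration.

```python
import string

# Build a deletion-only table: no character is mapped to another,
# and every character in string.punctuation is mapped to None (deleted).
table = str.maketrans('', '', string.punctuation)

# Keys are Unicode code points; a value of None means "drop this character".
print(table[ord('!')])

sample = 'Hello, Call hello you tomorrow?'  # illustrative input
cleaned = sample.lower().translate(table)
print(cleaned)
```

Because the table is built once and applied in a single pass, this avoids the nested loop over every punctuation character.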

>Instructions: Tokenize the strings stored in 'sans_punctuation_documents' using the split() method, and store the final document set in a list called 'preprocessed_documents'.

In [19]:
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


In [20]:
preprocessed_documents = []

for doc in sans_punctuation_documents:
    docList = doc.split()
    for t in docList:
        if t not in preprocessed_documents:
            preprocessed_documents.append(t)

print(preprocessed_documents)

['hello', 'how', 'are', 'you', 'win', 'money', 'from', 'home', 'call', 'me', 'now', 'tomorrow']


The above works, but it produces a single de-duplicated word list, which is not the desired output.

The desired output is one list of tokens per entry/doc. Revision below.

In [22]:
preprocessed_documents = []

for doc in sans_punctuation_documents:
    docList = doc.split()
    preprocessed_documents.append(docList)
print(preprocessed_documents)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
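
The three preprocessing steps so far (lower-casing, punctuation removal, tokenizing) can also be folded into one helper. This is only a minimal sketch; the function name `preprocess` is hypothetical and not part of the exercise.

```python
import string

def preprocess(doc):
    """Lower-case a string, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return doc.lower().translate(table).split()

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# One token list per document, duplicates kept
preprocessed_documents = [preprocess(doc) for doc in documents]
print(preprocessed_documents)
```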


>**Instructions:**
Using Counter() with preprocessed_documents as the input, create a dictionary for each document whose keys are the words in that document and whose values are the frequency of occurrence of each word. Save each Counter dictionary as an item in a list called 'frequency_list'.


In [33]:
from collections import Counter

frequency_list = []
for doc in preprocessed_documents:
    frequency_list.append(Counter(doc))
    
    # What I originally did: convert each Counter to a plain dict
    # frequency_list.append(dict(Counter(doc)))
print(frequency_list) 
# for pretty printing, use 'pprint' module

[Counter({'how': 1, 'you': 1, 'are': 1, 'hello': 1}), Counter({'win': 2, 'from': 1, 'money': 1, 'home': 1}), Counter({'call': 1, 'now': 1, 'me': 1}), Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
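
A list of per-document Counters like this already holds everything needed for a document-term matrix. The sketch below builds one by hand, using a sorted vocabulary as the column order; the names `vocabulary` and `matrix` are illustrative assumptions, not part of the exercise.

```python
from collections import Counter

preprocessed_documents = [['hello', 'how', 'are', 'you'],
                          ['win', 'money', 'win', 'from', 'home'],
                          ['call', 'me', 'now'],
                          ['hello', 'call', 'hello', 'you', 'tomorrow']]

frequency_list = [Counter(doc) for doc in preprocessed_documents]

# Column order: every distinct word across all documents, sorted alphabetically.
vocabulary = sorted(set().union(*frequency_list))

# A Counter returns 0 for missing keys, so absent words need no special case.
matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in matrix:
    print(row)
```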


>**Instructions:**
Import the sklearn.feature_extraction.text.CountVectorizer class and create an instance of it called 'count_vector'. 

In [41]:
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

from sklearn import feature_extraction
# count_vector = feature_extraction.text.CountVectorizer(input=documents)
# No need to pass the documents to the constructor ('input' only names the
# kind of input: 'content', 'file' or 'filename'); just create an instance
count_vector = feature_extraction.text.CountVectorizer()

Alternatively:
>```python
from sklearn.feature_extraction.text import CountVectorizer
```

This is probably better/simpler

In [42]:
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [45]:
count_vector.fit(documents)
# scikit-learn >= 1.2: use get_feature_names_out() instead
count_vector.get_feature_names()

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

>**Instructions:**
Create a matrix with the rows being each of the 4 documents, and the columns being each word. 
The corresponding (row, column) value is the frequency of occurrence of that word (in the column) in a particular
document (in the row). You can do this using the transform() method and passing in the document data set as the 
argument. The transform() method returns a sparse matrix of numpy integers; you can convert this to an array using
toarray(). Call the array 'doc_array'.


In [61]:
from sklearn.feature_extraction.text import CountVectorizer
# transform() here is the CountVectorizer method, not the pandas DataFrame one
doc_matrix = count_vector.transform(documents)  # sparse matrix
doc_array = doc_matrix.toarray()                # dense numpy array

In [60]:
print(doc_matrix) # lists only the nonzero entries: (row, column) and the value itself
print(doc_array)

  (0, 0)	1
  (0, 3)	1
  (0, 5)	1
  (0, 11)	1
  (1, 2)	1
  (1, 4)	1
  (1, 7)	1
  (1, 10)	2
  (2, 1)	1
  (2, 6)	1
  (2, 8)	1
  (3, 1)	1
  (3, 3)	2
  (3, 9)	1
  (3, 11)	1
[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]


>**Instructions:**
Convert the array we obtained, loaded into 'doc_array', into a dataframe and set the column names to 
the word names (which you computed earlier using get_feature_names()). Call the dataframe 'frequency_matrix'.

In [63]:
frequency_matrix = pd.DataFrame(data=doc_array, 
                                columns=count_vector.get_feature_names())
frequency_matrix

Unnamed: 0,are,call,from,hello,home,how,me,money,now,tomorrow,win,you
0,1,0,0,1,0,1,0,0,0,0,0,1
1,0,0,1,0,1,0,0,1,0,0,2,0
2,0,1,0,0,0,0,1,0,1,0,0,0
3,0,1,0,2,0,0,0,0,0,1,0,1


***
>**Instructions:**
Split the dataset into a training and testing set by using the train_test_split method in sklearn. Split the data
using the following variables:
* `X_train` is our training data for the 'sms_message' column.
* `y_train` is our training data for the 'label' column
* `X_test` is our testing data for the 'sms_message' column.
* `y_test` is our testing data for the 'label' column
Print out the number of rows we have in each of our training and testing sets.

In [67]:
from sklearn.model_selection import train_test_split
X = df['sms_message']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train),len(X_test),'\n',len(y_train),len(y_test))

4179 1393 
 4179 1393
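
The 4179/1393 split follows from `train_test_split` holding out 25% of the rows when neither `test_size` nor `train_size` is given. The arithmetic below reproduces those sizes; rounding the test size up is my understanding of scikit-learn's behaviour and should be read as an assumption.

```python
import math

n_samples = 5572          # rows reported by df.shape earlier
test_fraction = 0.25      # default hold-out fraction when no size is specified

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```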


In [71]:
# Instantiate the vectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the training data and return its count matrix in one step
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the already-fitted vocabulary (no refitting)
testing_data = count_vector.transform(X_test)
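
The test set only gets `transform()` because its columns must line up with the vocabulary learned from the training set. The pure-Python sketch below imitates that behaviour with plain whitespace tokenization; `fit_vocabulary` and `transform` are hypothetical helpers, far simpler than CountVectorizer's real tokenizer, and the sample messages are made up for illustration.

```python
def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document, skipping tokens that are not in the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train_docs = ['free prize now', 'call me now']   # illustrative, not from the dataset
test_docs = ['free call tomorrow']               # 'tomorrow' never appears in training

vocab = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)
print(vocab)
print(train_rows)
print(test_rows)
```

Refitting on the test set would instead produce a different vocabulary, so the column positions would no longer match what the classifier was trained on.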

In [72]:
print(training_data)
print(testing_data)

# Note:
# Both are sparse matrices, printed as (row, column) followed by the count, nonzero entries only

  (0, 2022)	1
  (0, 4779)	1
  (0, 4662)	1
  (0, 6892)	1
  (0, 6656)	1
  (0, 50)	1
  (0, 4743)	1
  (0, 4375)	1
  (0, 1552)	1
  (0, 264)	1
  (0, 4983)	1
  (0, 7424)	1
  (0, 3170)	1
  (0, 2864)	2
  (0, 4987)	1
  (0, 1572)	1
  (0, 3880)	1
  (0, 5479)	1
  (0, 3971)	1
  (0, 4781)	1
  (0, 5193)	1
  (0, 3181)	1
  (0, 509)	1
  (1, 6758)	1
  (1, 3316)	1
  :	:
  (4177, 3700)	1
  (4177, 837)	1
  (4177, 307)	1
  (4177, 6662)	1
  (4177, 6034)	1
  (4177, 4508)	1
  (4177, 2556)	1
  (4177, 5490)	1
  (4177, 254)	1
  (4177, 2744)	1
  (4177, 4778)	1
  (4177, 4446)	1
  (4177, 4255)	1
  (4177, 3629)	1
  (4177, 7257)	1
  (4177, 1574)	1
  (4177, 6887)	1
  (4177, 3738)	1
  (4177, 5656)	1
  (4177, 6514)	1
  (4177, 6656)	1
  (4178, 5999)	1
  (4178, 7257)	1
  (4178, 4238)	1
  (4178, 1691)	1
  (0, 1538)	1
  (0, 5189)	1
  (0, 6542)	1
  (0, 7405)	1
  (1, 1016)	1
  (1, 3050)	1
  (1, 4163)	1
  (1, 4238)	1
  (1, 4370)	1
  (1, 5200)	1
  (1, 6656)	1
  (1, 7407)	1
  (1, 7420)	1
  (2, 986)	1
  (2, 3244)	1
  (2, 7162)	1
  (

***
The following code is provided (skeleton code):

In [73]:
# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

# P(Pos)
p_pos = (p_diabetes * p_pos_diabetes) + (p_no_diabetes * (1 - p_neg_no_diabetes))
print('The probability of getting a positive test result P(Pos) is: {}'.format(p_pos))

The probability of getting a positive test result P(Pos) is: 0.10799999999999998


Note: the chance of someone getting a positive test result is the sum of two cases:  
[[chance they have diabetes] \* [chance the test is positive given diabetes]] **+** [[chance they don't have diabetes] \* [chance of a false positive]]

>**Instructions**:
Compute the probability of an individual having diabetes, given that the individual got a positive test result.
In other words, compute P(D|Pos).

>The formula is: P(D|Pos) = (P(D) * P(Pos|D)) / P(Pos)

>**Instructions**:
Compute the probability of an individual not having diabetes, given that the individual got a positive test result.
In other words, compute P(~D|Pos).

>The formula is: P(~D|Pos) = (P(~D) * P(Pos|~D)) / P(Pos)

>Note that P(Pos|~D) can be computed as 1 - P(Neg|~D). 

>Therefore:
P(Pos|~D) = p_pos_no_diabetes = 1 - 0.9 = 0.1


In [76]:
p_diabetes_pos = p_diabetes * p_pos_diabetes / p_pos 
print('The probability of having diabetes if test results positive: ', format(p_diabetes_pos))

p_no_diabetes_pos = p_no_diabetes * (1 - p_neg_no_diabetes) / p_pos
print('The probability of having no diabetes if test results positive: ', format(p_no_diabetes_pos))
# P(pos|~D) = 1 - P(neg|~D)

The probability of having diabetes if test results positive:  0.08333333333333336
The probability of having no diabetes if test results positive:  0.9166666666666666


In [78]:
p_no_diabetes_pos + p_diabetes_pos

1.0
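
The same calculation can be wrapped in a small function to see how strongly the result depends on the prior. This is a sketch only; the function name `posterior_given_positive` and the extra priors in the loop are assumptions added for illustration.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(D|Pos) from P(D), P(Pos|D) and P(Neg|~D) via Bayes' theorem."""
    false_positive_rate = 1 - specificity              # P(Pos|~D)
    p_pos = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_pos

# Values from the exercise: P(D) = 0.01, sensitivity = specificity = 0.9
p_d_pos = posterior_given_positive(prior=0.01, sensitivity=0.9, specificity=0.9)
print(round(p_d_pos, 4))

# The same test applied to populations with different (illustrative) priors
for prior in (0.01, 0.1, 0.5):
    print(prior, round(posterior_given_positive(prior, 0.9, 0.9), 4))
```

With a 1% prior, most positives are false positives, which is why P(D|Pos) stays near 8% even though the test is 90% sensitive and 90% specific.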

***
>**Instructions**: Compute the probability of the words 'freedom' and 'immigration' being said in a speech, or
P(F,I).

>The first step is multiplying the probabilities of Jill Stein giving a speech with her individual 
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_j_text

>The second step is multiplying the probabilities of Gary Johnson giving a speech with his individual 
probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_g_text

>The third step is to add both of these probabilities and you will get P(F,I).

In [87]:
p_j, p_g = [0.5, 0.5]
p_f_j, p_i_j, p_e_j = [0.1, 0.1, 0.8]
p_f_g, p_i_g, p_e_g = [0.7, 0.2, 0.1]

p_j_text = p_j * p_f_j * p_i_j
p_g_text = p_g * p_f_g * p_i_g

p_f_i = p_j_text + p_g_text

print("Probability of words 'freedom' and 'immigration' being said in speech: ", format(p_f_i) )

Probability of words 'freedom' and 'immigration' being said in speech:  0.075


In [89]:
# Posterior P(J|F,I): probability the speaker was Jill Stein, given both words were said
p_j_fi = p_j_text / p_f_i
print("Probability of Jill Stein saying 'freedom' and 'immigration': ", format(p_j_fi))

Probability of Jill Stein saying 'freedom' and 'immigration':  0.06666666666666668


In [90]:
# Posterior P(G|F,I): probability the speaker was Gary Johnson, given both words were said
p_g_fi = p_g_text / p_f_i
print("Probability of Gary Johnson saying 'freedom' and 'immigration': ", format(p_g_fi))

Probability of Gary Johnson saying 'freedom' and 'immigration':  0.9333333333333332
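
The three cells above follow one pattern: multiply each speaker's prior by the word likelihoods, then normalize by the sum. A generic sketch of that pattern is shown below; the dictionary layout and the function name `speaker_posteriors` are assumptions, while the numbers are the ones used in the exercise.

```python
from math import prod

priors = {'jill': 0.5, 'gary': 0.5}
# P(word | speaker), using the exercise values for the two words of interest
likelihoods = {
    'jill': {'freedom': 0.1, 'immigration': 0.1},
    'gary': {'freedom': 0.7, 'immigration': 0.2},
}

def speaker_posteriors(words, priors, likelihoods):
    """Naive Bayes: P(speaker | words), treating words as independent given the speaker."""
    joint = {s: priors[s] * prod(likelihoods[s][w] for w in words) for s in priors}
    evidence = sum(joint.values())          # P(words), the normalizing constant
    return {s: p / evidence for s, p in joint.items()}

posteriors = speaker_posteriors(['freedom', 'immigration'], priors, likelihoods)
print(posteriors)
```

The "naive" part is the product inside `joint`: it assumes the two words occur independently once the speaker is known.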


***
>**Instructions**:

>We have loaded the training data into the variable 'training_data' and the testing data into the 
variable 'testing_data'.

>Import the MultinomialNB classifier and fit it to the training data using fit(). 
Name your classifier 'naive_bayes'. You will be training the classifier using 'training_data' and 'y_train' from our split earlier. 

In [91]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

>**Instructions**:
Now that our algorithm has been trained using the training data set, we can make some predictions on the test data
stored in '`testing_data`' using `predict()`. Save your predictions into the 'predictions' variable.

In [93]:
predictions = naive_bayes.predict(testing_data)

>**Instructions**:
Compute the accuracy, precision, recall and F1 scores of your model using your test data '`y_test`' and the predictions
you made earlier stored in the '`predictions`' variable.

In [98]:
from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, predictions)
precision = metrics.precision_score(y_test, predictions)
recall = metrics.recall_score(y_test, predictions)
f1 = metrics.f1_score(y_test, predictions)

valList = [accuracy, precision, recall, f1]
strList = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
iterList = zip(valList,strList)

for val, s in iterList:
    print('{:10}'.format(s), ': ', format(val))

Accuracy   :  0.9885139985642498
Precision  :  0.9720670391061452
Recall     :  0.9405405405405406
F1 Score   :  0.9560439560439562
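
To see what these four scores measure, they can be recomputed from confusion-matrix counts. The counts below are an assumption: they are back-calculated from the printed ratios and the 1393 test rows, and are consistent with them, but the notebook never prints a confusion matrix.

```python
# Assumed counts for the spam class (label 1), inferred from the scores above
tp, fp, fn = 174, 5, 11          # true positives, false positives, false negatives
tn = 1393 - (tp + fp + fn)       # everything else among the 1393 test rows

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all messages classified correctly
precision = tp / (tp + fp)                   # of messages flagged as spam, share that are spam
recall = tp / (tp + fn)                      # of actual spam, share that was flagged
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

for name, value in [('Accuracy', accuracy), ('Precision', precision),
                    ('Recall', recall), ('F1 Score', f1)]:
    print('{:10}'.format(name), ': ', value)
```

Because spam is the minority class, precision and recall say more about the classifier than accuracy alone.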
