#### Feature Selection
Question: Do stopwords and non-alphanumeric characters occur at different rates in spam and human (coded "ham") messages? The answer informs which words and characters to remove.
Approach: Apply two-sample t-tests to the mean number of occurrences per message for three types of input: NLTK's English stopwords, a custom stopword set, and non-alphanumeric characters.


In [38]:
from IPython.display import display, HTML # for pretty two table side x side display 
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from re import sub # import sub to replace items in the following list comprehension
from collections import defaultdict
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
import nltk


#### Read Inputs

In [4]:
# Read Data
data = pd.read_table('SMSSpamCollection',header= None, names = ('outcome', 'content'))
ham = data[data.outcome == 'ham']
spam = data[data.outcome == 'spam']

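# Read stopword and symbol sets; each becomes a regex passed to str.count
stopwords_set1 = set(nltk.corpus.stopwords.words('english'))
# Intended pattern: each word surrounded by whitespace, or at the start of the message
stopwords_set1 = '\\s'+'\\s|\\s'.join(stopwords_set1)+'\\s'+'|^'.join(stopwords_set1)
stopwords_set2 = set('for a of the and to in or'.split())
stopwords_set2 = '\\s'+'\\s|\\s'.join(stopwords_set2)+'\\s'+'|^'.join(stopwords_set2)
# One or more consecutive non-alphanumeric characters (this includes whitespace)
symbol_set1 = symbol_remover = '[^A-Za-z0-9]+'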
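
As written, the two joined halves of each stopword pattern are concatenated without a `|` between them, and a mid-message word followed by punctuation is not matched. Below is a minimal sketch of an alternative, assuming whole-word matching with `\b` boundaries is acceptable. `build_stopword_pattern` is a hypothetical helper, not part of this notebook, and counts produced with it would differ from the results reported in this notebook.

```python
import re

def build_stopword_pattern(words):
    """Return a regex matching any of `words` as a whole word (hypothetical helper)."""
    # Sort for a deterministic pattern; escape in case a word contains regex metacharacters
    alternatives = '|'.join(re.escape(w) for w in sorted(words))
    return r'\b(?:' + alternatives + r')\b'

pattern = build_stopword_pattern('for a of the and to in or'.split())
# "winner" contains "in" and "draw" contains "a", but neither is a whole word
matches = re.findall(pattern, 'a gift for the winner of the draw')
print(matches)
```

The resulting string can be passed to `Series.str.count` in the same way as `stopwords_set2`.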


#### Part 0: Quick dataset descriptives

In [33]:

CSS = """
.output {
    flex-direction: row;
}
"""

HTML('<style>{}</style>'.format(CSS))

In [37]:
display(spam.head())
display(ham.head())

| | outcome | content |
|---|---|---|
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |


| | outcome | content |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 6 | ham | Even my brother is not like to speak with me. ... |


In [35]:
display(spam.describe())
display(ham.describe())

| | outcome | content |
|---|---|---|
| count | 747 | 747 |
| unique | 1 | 653 |
| top | spam | Please call our customer service representativ... |
| freq | 747 | 4 |


| | outcome | content |
|---|---|---|
| count | 4825 | 4825 |
| unique | 1 | 4516 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


#### Part 1: Test Stopword / Symbol occurrences

Two-sample, two-sided test of means:
This is a two-sided test of the null hypothesis that two independent samples have identical average (expected) values. By default, the test (`scipy.stats.ttest_ind`) assumes that the populations have identical variances.
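
The ham and spam groups differ considerably in size (4,825 vs. 747 messages), so the equal-variance assumption may be worth checking. `ttest_ind` accepts `equal_var=False`, which runs Welch's t-test instead of the pooled-variance test. The sketch below compares the two on synthetic counts. The data and group sizes are illustrative assumptions, not the SMS data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic per-message counts (illustrative only, not the SMS data)
group_a = rng.poisson(lam=3.0, size=500)
group_b = rng.poisson(lam=4.0, size=80)

pooled = ttest_ind(group_a, group_b)                   # default equal_var=True: pooled-variance t-test
welch = ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test: no equal-variance assumption
print('pooled p-value:', pooled.pvalue)
print('Welch p-value: ', welch.pvalue)
```

If the two p-values lead to the same decision, the conclusion does not hinge on the variance assumption.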

In [39]:
def word_test(find_me, data_spam, data_ham):
    """Compare mean per-message counts of the regex `find_me` in ham vs. spam.

    Returns a tuple: ('P-value:', p, 'ham:', ham_mean, 'spam:', spam_mean).
    """
    # Count regex matches in each message
    check_ham = data_ham.content.str.count(find_me)
    check_spam = data_spam.content.str.count(find_me)
    # Two-sample, two-sided t-test; element [1] of the result is the p-value
    results = ttest_ind(check_ham, check_spam)[1]
    return 'P-value:', results, 'ham:', check_ham.mean(),'spam:',check_spam.mean()
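
To make concrete what `word_test` compares, here is a small sketch on two made-up messages (not taken from the dataset). `Series.str.count` treats its argument as a regular expression and returns the number of non-overlapping matches in each row. With the symbol pattern, a run of consecutive non-alphanumeric characters counts once, and whitespace is included.

```python
import pandas as pd

# Made-up messages for illustration
toy = pd.DataFrame({'content': ['WIN a prize!!! Call now', 'ok see you']})

# '!!! ' is a single match because of the trailing '+'; each lone space is also a match
symbol_counts = toy.content.str.count('[^A-Za-z0-9]+')
print(symbol_counts.tolist())  # [4, 2]
```

The t-test is then run on these per-message counts, one sample for ham and one for spam.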

##### Test Mean occurrence of stopwords and symbols

Hypothesis Tests:



Test A.
- H0: No difference in the mean occurrence of NLTK English stopwords between spam and ham text messages
- H1: There is a difference in the mean occurrence of NLTK English stopwords

Test B.
- H0: No difference in the mean occurrence of custom stopwords between spam and ham text messages
- H1: There is a difference in the mean occurrence of custom stopwords

Test C.
- H0: No difference in the mean occurrence of non-alphanumeric characters between spam and ham text messages
- H1: There is a difference in the mean occurrence of non-alphanumeric characters

In [40]:
# A. Test stop word set 1
test_a = word_test(find_me =stopwords_set1, data_ham= ham, data_spam=spam)
'reject at the .001 level?', test_a[1] < .001, test_a

('reject at the .001 level?',
 True,
 ('P-value:',
  2.4624610347942327e-15,
  'ham:',
  3.2876683937823836,
  'spam:',
  4.2971887550200805))

In [41]:
# B. Test stop word set 2
test_b = word_test(find_me =stopwords_set2, data_ham= ham, data_spam=spam)
'reject at the .001 level?', test_b[1] < .001, test_b

('reject at the .001 level?',
 True,
 ('P-value:',
  3.0363882818963718e-65,
  'ham:',
  1.1133678756476684,
  'spam:',
  2.1311914323962515))

In [202]:
# C. Test symbol set 1
test_c = word_test(find_me =symbol_set1, data_ham= ham, data_spam=spam)
'reject at the .001 level?', test_c[1] < .001, test_c

('reject at the .001 level?',
 True,
 ('P-value:',
  6.7010105678767329e-117,
  'ham:',
  14.41160621761658,
  'spam:',
  24.811244979919678))

Results:

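- Test A: Reject the null that the means are the same at the .001 level (ham 3.29 vs. spam 4.30 NLTK stopword matches per message)
- Test B: Reject the null that the means are the same at the .001 level (ham 1.11 vs. spam 2.13 custom stopword matches per message)
- Test C: Reject the null that the means are the same at the .001 level (ham 14.41 vs. spam 24.81 non-alphanumeric matches per message)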


#### Part 2: Test Confidence Intervals

Question: What accuracy, precision, and recall should we expect in future samples, assuming future text messages are acquired the same way and do not differ from the approximately 5,000 text messages used here?

Approach: Using resampling, draw samples of 100 text messages 10,000 times. Use these to build sampling distributions for accuracy, precision, and recall, and report the ranges we expect to see with 90% likelihood in future cases.

The analysis is in the project notebook, as it requires additional code to run:
[Confidence Intervals in Accuracy, Precision, and Recall](https://github.com/chrisgian/Capstone1-Spam-Detection-NLP/blob/master/sb_c1_nlp.ipynb)
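
The resampling approach described above can be sketched as follows. This is a minimal illustration and not the code from the linked notebook: `bootstrap_metrics` is a hypothetical helper, `y_true` and `y_pred` are synthetic stand-ins for the true labels and a classifier's predictions (1 = spam, 0 = ham), and the 90% range is taken as the 5th to 95th percentile of each metric.

```python
import numpy as np

def bootstrap_metrics(y_true, y_pred, sample_size=100, n_resamples=10_000, seed=0):
    """Resample (label, prediction) pairs with replacement; return per-resample metrics."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # One row of indices per resample
    idx = rng.integers(0, len(y_true), size=(n_resamples, sample_size))
    t, p = y_true[idx], y_pred[idx]
    tp = ((t == 1) & (p == 1)).sum(axis=1)
    fp = ((t == 0) & (p == 1)).sum(axis=1)
    fn = ((t == 1) & (p == 0)).sum(axis=1)
    accuracy = (t == p).mean(axis=1)
    with np.errstate(divide='ignore', invalid='ignore'):
        precision = tp / (tp + fp)  # NaN when a resample has no predicted spam
        recall = tp / (tp + fn)     # NaN when a resample has no actual spam
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall}

# Synthetic stand-in data (assumption): about 13% spam, about 5% of predictions wrong
rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.13).astype(int)
flip = rng.random(5000) < 0.05
y_pred = np.where(flip, 1 - y_true, y_true)

metrics = bootstrap_metrics(y_true, y_pred)
# 5th and 95th percentiles bound the middle 90% of each sampling distribution
intervals = {name: np.nanpercentile(values, [5, 95]) for name, values in metrics.items()}
for name, (low, high) in intervals.items():
    print(f'{name}: {low:.3f} to {high:.3f}')
```

With real data, `y_true` and `y_pred` would come from a held-out set scored by the fitted model.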


