### Project: Capstone Project 1: Exploratory Data Analysis  
At this point, you’ve obtained the dataset for your capstone project, cleaned it, and wrangled it into a form that's ready for analysis. It's now time to apply the inferential statistics techniques you’ve learned to explore the data.  
  
Based on your dataset, the questions that interest you, and the results of the visualization techniques that you used previously, you might end up using only a few of the inferential techniques that you’ve learned. Your specific situation determines how much time it’ll take you to complete this project. Talk to your mentor to determine the most appropriate approach to take for your project. You may find yourself revisiting the analytical framework that you first used to develop your proposal questions. It’s fine to refine your questions more as you get deeper into your data and find interesting patterns and answers. Remember to stay in touch with your mentor to remain focused on the scope of your project.
  
Think of the following questions and apply them to your dataset:  

* Are there variables that are particularly significant in terms of explaining the answer to your project question?

* Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?

* What are the most appropriate tests to use to analyse these relationships?
  
Submission: Write a 1-2 page report on the steps and findings of your inferential statistical analysis. Upload this report to   your GitHub and submit a link. Eventually, this report can be incorporated into your milestone report.

Our SMS dataset contains spam and ham messages. We want to find out whether some features are more important than others for predicting whether a message is spam or ham. For the sake of demonstration, we will pick just one word, "call" (matched in lowercase, case-sensitively).   
  
First, we need to find out whether we can apply the central limit theorem (CLT): 
* Is $n \geq 30$? Yes, we have 5572 messages in total. 
* We assume that the independence condition is satisfied as well.  

Therefore we can roll up our sleeves.
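
For proportions, a common companion to the sample-size check is the success-failure condition: each group should contain at least about 10 messages with the word and 10 without it. The threshold of 10 is a conventional rule of thumb, not something taken from this notebook, and the counts are the ones the later cells report (188 of 747 spam and 192 of 4825 ham messages contain 'call'). A minimal sketch, with a hypothetical helper name:

```python
# Counts reported later in this notebook: (messages containing 'call', group size)
groups = {"spam": (188, 747), "ham": (192, 4825)}

MIN_COUNT = 10  # assumed rule-of-thumb threshold, not from the original analysis


def success_failure_ok(successes, n, min_count=MIN_COUNT):
    """Return True when both successes and failures reach min_count."""
    failures = n - successes
    return successes >= min_count and failures >= min_count


checks = {name: success_failure_ok(x, n) for name, (x, n) in groups.items()}
for name, ok in checks.items():
    print(f"{name}: success-failure condition met = {ok}")
```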

In [1]:
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt

from nltk.tokenize import TweetTokenizer
from collections import Counter

In [2]:
df = pd.read_csv('SMSSpamCollection.txt', sep='\t', header=None)

In [3]:
df.columns = ['spam', 'text']
df['spam'] = df['spam'] == 'spam'  # True/False instead of "spam" and "ham"
df['spam'] = df['spam'].astype(int)  # numeric 1/0 instead of boolean values

In [4]:
# Get rid of the punctuation
translator = str.maketrans('', '', string.punctuation)
df.text = df.text.apply(lambda x: x.translate(translator))
df.head()

|  | spam | text |
|---|---|---|
| 0 | 0 | Go until jurong point crazy Available only in ... |
| 1 | 0 | Ok lar Joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor U c already then say |
| 4 | 0 | Nah I dont think he goes to usf he lives aroun... |


After tokenizing the text, we will create a new column with value 0 or 1: 1 indicates that the message contains the word "call", and 0 that it does not.


In [5]:
# Use TweetTokenizer 
tknzr = TweetTokenizer()
df['text'] = df.text.apply(tknzr.tokenize)
df['text'].head()

0    [Go, until, jurong, point, crazy, Available, o...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, U, c, already, t...
4    [Nah, I, dont, think, he, goes, to, usf, he, l...
Name: text, dtype: object

In [6]:
# 1 if the exact (case-sensitive) token 'call' appears in the message, else 0
contains_call = [int('call' in tokens) for tokens in df.text]

In [7]:
# to make sure that I have desired list, with the same length as DataFrame
len(contains_call)

5572

In [8]:
# adding new feature 'call' to our DataFrame
df['call'] = contains_call

In [9]:
df.head()

|  | spam | text | call |
|---|---|---|---|
| 0 | 0 | [Go, until, jurong, point, crazy, Available, o... | 0 |
| 1 | 0 | [Ok, lar, Joking, wif, u, oni] | 0 |
| 2 | 1 | [Free, entry, in, 2, a, wkly, comp, to, win, F... | 0 |
| 3 | 0 | [U, dun, say, so, early, hor, U, c, already, t... | 0 |
| 4 | 0 | [Nah, I, dont, think, he, goes, to, usf, he, l... | 0 |
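
Note that the membership test used to build `contains_call` is an exact, case-sensitive match on whole tokens. The toy token lists below are made up for illustration and are not rows of the dataset; the sketch shows which tokens would and would not be counted:

```python
# Hypothetical tokenized messages (not taken from the SMS dataset)
toy_messages = [
    ["please", "call", "me"],      # exact token 'call' -> counted
    ["Call", "now", "to", "win"],  # capitalised 'Call' -> not counted
    ["I", "called", "you"],        # 'called' is a different token -> not counted
]

# Same test as in the notebook: exact membership in the token list
toy_flags = [int("call" in tokens) for tokens in toy_messages]
print(toy_flags)  # [1, 0, 0]

# A case-insensitive variant would lowercase each token first
toy_flags_ci = [int("call" in [t.lower() for t in tokens]) for tokens in toy_messages]
print(toy_flags_ci)  # [1, 1, 0]
```

Whether to lowercase before matching is a modelling choice; the counts reported in this notebook come from the case-sensitive version.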


Now we are ready to state a hypothesis and test it. Let $p_{spam}$ and $p_{ham}$ be the proportions of spam and ham messages that contain the word "call":  
* $H_0$: $p_{spam} = p_{ham}$
* $H_A$: $p_{spam} \neq p_{ham}$

We use a significance level of $\alpha=0.05$.

In [10]:
spam = df[df.spam==1]
ham = df[df.spam==0]
print("Total word 'call':", df.call.values.sum())
print("Total word 'call' in spam:", spam.call.values.sum())
print("Total word 'call' in ham:", ham.call.values.sum())

Total word 'call': 380
Total word 'call' in spam: 188
Total word 'call' in ham: 192


In [11]:
df.spam.value_counts()

0    4825
1     747
Name: spam, dtype: int64

In [13]:
no_call = [(747-188)/747, (4825-192)/4825]
call = [188/747, 192/4825]

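In [14]:
# one row per class, one column per rate: zip pairs the spam values and the ham values
rates = pd.DataFrame(data=list(zip(no_call, call)), index=['spam', 'ham'])

In [15]:
rates.columns = ["no 'call'", "'call'"]

In [16]:
rates

|  | no 'call' | 'call' |
|---|---|---|
| spam | 0.748327 | 0.251673 |
| ham | 0.960207 | 0.039793 |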


In [17]:
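# sample proportions of messages containing 'call'
p_s = spam.call.values.sum()/747
p_h = ham.call.values.sum()/4825
# variances of the two sample proportions (despite the sigma names)
sigma_s = (p_s * (1 - p_s)) / 747
sigma_h = (p_h * (1 - p_h)) /4825


# difference of the sample proportions and its standard error
mu = p_s - p_h
sigma = sigma_s + sigma_h
std_dev = np.sqrt(sigma)
print(mu)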

0.21188061399310543
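
In symbols, writing $\hat{p}_s$ and $\hat{p}_h$ for `p_s` and `p_h`, and $n_s = 747$, $n_h = 4825$ for the group sizes, the cell above computes the point estimate of the difference and its standard error. The symbol $\hat{\Delta}$ is introduced here only for illustration; the formula is the standard unpooled one for two independent proportions:

$$
\hat{\Delta} = \hat{p}_s - \hat{p}_h, \qquad
SE(\hat{\Delta}) = \sqrt{\frac{\hat{p}_s(1-\hat{p}_s)}{n_s} + \frac{\hat{p}_h(1-\hat{p}_h)}{n_h}}
$$

Here `mu` corresponds to $\hat{\Delta}$, `sigma` to the variance $SE(\hat{\Delta})^2$, and `std_dev` to $SE(\hat{\Delta})$.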


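The half-width of the interval is $d = z \cdot std\_dev$, where $z=1.96$ for a $95\%$ confidence level according to the z-table. This $d$ is the margin of error; the full width of the interval is $2d$.
Keep in mind that our sample is large enough that we can use the sample proportions as approximations of the population proportions when computing the standard error.

In [18]:
d = 1.96*std_dev  # margin of error (half-width of the interval)
low = mu - d
up = mu + d
interval_width = 2*d  # full width of the interval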
print("Low:",  low)
print("Up:", up)

Low: 0.18027417251991973
Up: 0.24348705546629112
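
The interval calculation can also be wrapped in a small helper so that it can be repeated for other words. This is a sketch rather than part of the original analysis: the function name `diff_proportion_ci` and its parameters are hypothetical, and it uses the same unpooled standard error and $z=1.96$ as the cells above.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Confidence interval for p1 - p2 using the unpooled standard error.

    x1, x2 are the counts of messages containing the word;
    n1, n2 are the sizes of the two groups.
    """
    p1, p2 = x1 / n1, x2 / n2
    std_err = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * std_err, diff + z * std_err


# Counts from this notebook: 188 of 747 spam and 192 of 4825 ham messages
low, up = diff_proportion_ci(188, 747, 192, 4825)
print("Low:", low)
print("Up:", up)
```

With these counts the helper should reproduce the bounds printed by the notebook cell above.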


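Our $95\%$ confidence interval for the difference between the spam and ham proportions is $(0.18027, 0.24349)$. Because the interval does not contain $0$, we reject the null hypothesis at $\alpha=0.05$: the word 'call' appears in a significantly larger share of spam messages (about $25\%$) than of ham messages (about $4\%$). Even so, because ham messages are far more numerous, 192 of the 380 messages that contain 'call' are ham, so the word 'call' alone is not sufficient to decide whether a message is spam or ham.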
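
As a cross-check on the confidence interval, the same question can be posed as a two-proportion z-test. The sketch below is an addition, not the original author's method: it uses the pooled standard error that is conventional under $H_0$, relies only on the Python standard library, and takes the counts reported above.

```python
import math

# Counts from this notebook
x_spam, n_spam = 188, 747   # spam messages containing 'call', total spam
x_ham, n_ham = 192, 4825    # ham messages containing 'call', total ham

p_spam = x_spam / n_spam
p_ham = x_ham / n_ham

# Pooled proportion: under H0 both groups share a single proportion
p_pool = (x_spam + x_ham) / (n_spam + n_ham)
std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_spam + 1 / n_ham))

z_stat = (p_spam - p_ham) / std_err
# Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

alpha = 0.05
reject_null = p_value < alpha
print("z statistic:", z_stat)
print("p-value:", p_value)
print("Reject H0 at alpha = 0.05:", reject_null)
```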