### Population proportion: using statsmodels for confidence intervals and hypothesis tests

### Introduction
After conducting exploratory data analysis, we can use the sample to draw conclusions about the population from which it was drawn.
Inference about a parameter takes one of two forms: estimating the parameter, or testing a hypothesis about its value.
A parameter is a number that describes the population (p, μ, σ), while a statistic is a number computed from the sample (p̂, x̄, *s*).
<br>

<br><u>*Point estimation*</u>

Point estimation is the form of statistical inference in which, based on the sample data, we estimate the unknown parameter of interest using a single value (for example, the sample proportion p̂ as an estimate of p).

<br><u>*Confidence Interval*</u>

Interval estimation improves on a simple point estimate by attaching a margin of error, which indicates how far the estimate is likely to be from the true parameter.

<br><u>*Hypothesis Testing*</u>

Statistical hypothesis testing is defined as assessing evidence provided by the data in favor of or against some claim about the population.

### Examples using statsmodels package

In [1]:
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.proportion import proportions_ztest

**Dataset**

I will be using the [UCI](https://archive.ics.uci.edu/ml/datasets/census+income) adult census dataset, also available from [Kaggle](https://www.kaggle.com/uciml/adult-census-income).

In [2]:
df_income = pd.read_csv('data/cleaned_census_income.csv')
df_income.shape

(30162, 15)

In [3]:
df_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30162 entries, 0 to 30161
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30162 non-null  int64 
 1   workclass       30162 non-null  object
 2   fnlwgt          30162 non-null  int64 
 3   education       30162 non-null  object
 4   education.num   30162 non-null  int64 
 5   marital.status  30162 non-null  object
 6   occupation      30162 non-null  object
 7   relationship    30162 non-null  object
 8   race            30162 non-null  object
 9   sex             30162 non-null  object
 10  capital.gain    30162 non-null  int64 
 11  capital.loss    30162 non-null  int64 
 12  hours.per.week  30162 non-null  int64 
 13  native.country  30162 non-null  object
 14  income          30162 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.5+ MB


In [4]:
df_income.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
1,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
2,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
3,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
4,38,Private,150601,10th,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
5,74,State-gov,88638,Doctorate,16,Never-married,Prof-specialty,Other-relative,White,Female,0,3683,20,United-States,>50K
6,68,Federal-gov,422013,HS-grad,9,Divorced,Prof-specialty,Not-in-family,White,Female,0,3683,40,United-States,<=50K
7,45,Private,172274,Doctorate,16,Divorced,Prof-specialty,Unmarried,Black,Female,0,3004,35,United-States,>50K
8,38,Self-emp-not-inc,164526,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,0,2824,45,United-States,>50K
9,52,Private,129177,Bachelors,13,Widowed,Other-service,Not-in-family,White,Female,0,2824,20,United-States,>50K


**1. Point estimate and Confidence Interval**

The confidence interval for the population proportion p is:

<img src="data/ci_proportion.png" alt="Drawing" style="width:300px; float:left"/>
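
In case the image does not render, the interval can be written out as follows. This assumes the image shows the standard large-sample (normal approximation) interval, which is also what `proportion_confint` computes with its default `method='normal'`; here z* denotes the critical value of the standard normal distribution (about 1.96 for 95% confidence).

$$
\hat{p} \pm z^{*} \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
$$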

<u>Example</u>:

Find the 95% confidence interval for the population proportion of adults with workclass=Private.

In [5]:
# point estimate (p̂, stored as p_hat)
count = len(df_income[df_income.workclass =='Private'])
n = len(df_income)
p_hat = count / n
print("Point estimate:", p_hat)

Point estimate: 0.7388767323121809


In [6]:
# determining the confidence interval
conf_int = proportion_confint(count=count, nobs=n, alpha=0.05)  # alpha=0.05 corresponds to a 95% confidence level
print('Confidence interval - statsmodel:', conf_int)

Confidence interval - statsmodel: (0.7339196423066949, 0.7438338223176668)
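
As a cross-check, the interval above can be reproduced by hand from the printed point estimate. This is a minimal sketch using only the Python standard library; the helper name `wald_interval` is hypothetical, and it assumes the normal-approximation formula that `proportion_confint` uses by default.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(p_hat: float, n: int, confidence: float = 0.95):
    """Normal-approximation confidence interval for a proportion (hypothetical helper)."""
    alpha = 1 - confidence
    # two-sided critical value z*, about 1.96 for 95% confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    margin_of_error = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin_of_error, p_hat + margin_of_error

# point estimate and sample size taken from the cells above
lower, upper = wald_interval(p_hat=0.7388767323121809, n=30162)
print(lower, upper)
```

Both endpoints should agree with the statsmodels result to several decimal places.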


**2. Hypothesis Testing**

Hypothesis testing involves four steps:

<img src="data/hyp_steps.jpg" alt="Drawing" style="width:500px; float:left"/>

***Test statistic: z-score***<br>
The test statistic measures the difference between the sample proportion p̂ and the null value p0 by the z-score, assuming that the null hypothesis is true.

Under the null hypothesis p = p0, the z-score of p̂ is:

<img src="data/z-score_proportion.png" alt="Drawing" style="width:200px; float:left"/>
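
Written out, and assuming the image shows the usual one-sample z statistic for a proportion (the same expression implemented in `calculate_test_statistic` later in this notebook):

$$
z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0\,(1 - p_0)}{n}}}
$$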

***p-value***<br>
The p-value is the probability of observing a test statistic (z-score) at least as extreme as the one actually observed, assuming that the null hypothesis is true.
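
How "extreme" is measured depends on the alternative hypothesis. The sketch below is one way to turn a z-score into a p-value using only the standard library; the function name `p_value_from_z` is hypothetical, and its `alternative` labels are chosen to mirror those accepted by `proportions_ztest`.

```python
from statistics import NormalDist

def p_value_from_z(z: float, alternative: str = "two-sided") -> float:
    """p-value of a z-score under the standard normal distribution (hypothetical helper)."""
    cdf = NormalDist().cdf
    if alternative == "larger":      # Ha: p > p0, right tail
        return 1 - cdf(z)
    if alternative == "smaller":     # Ha: p < p0, left tail
        return cdf(z)
    if alternative == "two-sided":   # Ha: p != p0, both tails
        return 2 * (1 - cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")

print(p_value_from_z(1.96, "two-sided"))   # close to 0.05
print(p_value_from_z(1.96, "larger"))      # close to 0.025
```

A strongly negative z-score combined with a right-tailed alternative gives a p-value near 1, because almost the entire distribution lies to the right of the observed value.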

In [7]:
def print_report(p_value:float, alpha:float):
    if p_value > alpha:
        print('Since p_value(%g) > alpha(%g), the data does not provide enough evidence to reject the null hypothesis.' %(p_value, alpha))
    else:
        print('Since p_value(%g) <= alpha(%g), the data provides enough evidence to reject the null hypothesis.' %(p_value, alpha))

<u>Example</u>: 

A recent study claimed that the percentage of American adults working for the Private sector and having Income > 50K was very low in the 1990s, just about 25%. The head of the US Department of Labor believes that the percentage should be much higher.

In [8]:
# computing the point estimate
count = len(df_income[(df_income.workclass=='Private') & (df_income.income=='>50K')])
n = len(df_income)
p_hat = count / n

# step 1: state the claims
'''
H0: p = 0.25
Ha: p > 0.25
'''

# step 2: collecting and summarizing
p0 = 0.25

# using the statsmodels package
# (by default, proportions_ztest estimates the standard error from p_hat rather than from p0)
ztest = proportions_ztest(count=count, nobs=n, value=p0, alternative='larger')
print('Test statistic z-test - statsmodel:', ztest)
print('z-score - statsmodel:', ztest[0])

Test statistic z-test - statsmodel: (-41.674833855600205, 1.0)
z-score - statsmodel: -41.674833855600205


In [9]:
# step 3: finding the p-value
print('P_value:', ztest[1])

'''
The same result can be obtained from the z-score (requires `from scipy import stats`):
p_less = stats.norm.cdf(ztest[0])
p_value = 1 - p_less
'''

P_value: 1.0


In [10]:
# step 4: draw conclusions
alpha = 0.05
p_value = ztest[1]
print_report(p_value, alpha)

Since p_value(1) > alpha(0.05), the data does not provide enough evidence to reject the null hypothesis.


### Another option: custom functions

In [11]:
# confidence interval (approximately 95%: the multiplier 2 rounds the critical value z* = 1.96)
def calculate_conf_interval(p_hat:float, n:int):
    margin_of_error = 2 * np.sqrt((p_hat*(1-p_hat))/n)
    return (p_hat - margin_of_error, p_hat + margin_of_error)

# z-score (standard error computed from the null value p0)
def calculate_test_statistic(p_hat:float, p0:float, n:int):
    test_stat = (p_hat-p0)/np.sqrt(p0*(1-p0)/n)
    return test_stat
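
One caveat when comparing the two approaches: `calculate_test_statistic` puts the null value p0 in the standard error, whereas `proportions_ztest` by default (`prop_var=False`) estimates the standard error from the sample proportion; passing `prop_var=p0` should make statsmodels follow the p0-based formula. The sketch below shows the size of the difference using only the standard library. The count of 4876 is an assumption: it is not printed in the notebook, although it is consistent with the z-score reported in the hypothesis-testing example.

```python
from math import sqrt

n = 30162        # sample size from the notebook
count = 4876     # ASSUMED number of Private & >50K rows (not printed in the notebook)
p0 = 0.25        # null value
p_hat = count / n

# standard error from the sample proportion (statsmodels default, prop_var=False)
z_sample_var = (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)

# standard error from the null value (the formula used by calculate_test_statistic)
z_null_var = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print("z with sample-based standard error:", z_sample_var)
print("z with null-based standard error:  ", z_null_var)
```

Under these assumptions the first value is about -41.7 and the second about -35.4. Both lead to the same conclusion here, but the two conventions can disagree when the z-score is close to the critical value.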