In [2]:
import pandas as pd
from scipy import stats

In [3]:
education_districtwise = pd.read_csv("education_districtwise.csv")

In [4]:
education_districtwise.shape

(680, 7)

In [5]:
education_districtwise.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0


In [6]:
education_districtwise.isna().sum()

DISTNAME       0
STATNAME       0
BLOCKS         0
VILLAGES       0
CLUSTERS       0
TOTPOPULAT    46
OVERALL_LI    46
dtype: int64

In [7]:
education_districtwise = education_districtwise.dropna()

In [8]:
education_districtwise.isna().sum()

DISTNAME      0
STATNAME      0
BLOCKS        0
VILLAGES      0
CLUSTERS      0
TOTPOPULAT    0
OVERALL_LI    0
dtype: int64

In [9]:
education_districtwise.shape

(634, 7)

In [10]:
state21 = education_districtwise[education_districtwise['STATNAME']=='STATE21']
state28 = education_districtwise[education_districtwise['STATNAME']=='STATE28']

In [11]:
print(state21.shape)
print(state28.shape)

(71, 7)
(38, 7)


In [12]:
sample_state21 = state21.sample(n=20, replace=True, random_state=13490)
sample_state28 = state28.sample(n=20, replace=True, random_state=39103)

In [15]:
print(sample_state21.shape)
print(sample_state28.shape)

(20, 7)
(20, 7)


In [17]:
sample_state21['OVERALL_LI'].mean()

70.82900000000001

In [18]:
sample_state28['OVERALL_LI'].mean()

64.60100000000001

In these samples, STATE21 has a mean district literacy rate of about 70.8%, while STATE28 has a mean district literacy rate of about 64.6%.
However, because of sampling variability, this observed difference might simply be due to chance rather than an actual difference in the corresponding population means. A hypothesis test can help determine whether the result is statistically significant.
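
To make sampling variability concrete, here is a minimal sketch that uses a synthetic population rather than the districts data. The population values, the seed, and names such as `population` and `sample_means` are assumptions for illustration only. It draws many samples of size 20 from a single population and shows that the sample means vary even though the population mean stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical population of 1,000 literacy rates (synthetic, not the real data)
population = rng.normal(loc=70, scale=10, size=1000)

# Draw 500 samples of size 20 with replacement and record each sample mean
sample_means = np.array([
    rng.choice(population, size=20, replace=True).mean()
    for _ in range(500)
])

print("Population mean:", round(population.mean(), 2))
print("Smallest sample mean:", round(sample_means.min(), 2))
print("Largest sample mean:", round(sample_means.max(), 2))
print("Std. dev. of sample means:", round(sample_means.std(ddof=1), 2))
```

The spread between the smallest and largest sample mean is the kind of chance variation a hypothesis test has to account for.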

A two-sample t-test is the standard approach for comparing the means of two independent samples. To review, the steps for conducting a hypothesis test are:
1.   State the null hypothesis and the alternative hypothesis.
2.   Choose a significance level.
3.   Find the p-value. 
4.   Reject or fail to reject the null hypothesis.
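
The four steps above can be sketched as a small helper. This is an illustrative outline, not part of the original analysis: the function name `two_sample_test`, its return keys, and the synthetic samples are all assumptions. It relies on `scipy.stats.ttest_ind` with `equal_var=False`, the same call used later in this notebook.

```python
import numpy as np
from scipy import stats


def two_sample_test(sample_a, sample_b, alpha=0.05):
    """Run Welch's two-sample t-test and apply the decision rule."""
    # Step 1 (hypotheses) is implicit: H0 says the population means are equal.
    # Step 2: alpha is the chosen significance level.
    # Step 3: find the p-value.
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: reject H0 only if the p-value is below alpha.
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "reject_null": bool(result.pvalue < alpha),
    }


# Synthetic samples (hypothetical values, not the districts data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=70, scale=8, size=20)
group_b = rng.normal(loc=60, scale=8, size=20)
outcome = two_sample_test(group_a, group_b, alpha=0.05)
print(outcome)
```

Keeping `alpha` as a parameter makes the significance level an explicit choice rather than a value buried in the comparison.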

1.   State the null hypothesis and the alternative hypothesis.

H0: There is no difference in the mean district literacy rates between STATE21 and STATE28.
HA: There is a difference in the mean district literacy rates between STATE21 and STATE28.

where H0 = null hypothesis, HA = alternative hypothesis

2.   Choose a significance level.

The **significance level** is the threshold at which we will consider a result statistically significant. It is also the probability of rejecting the null hypothesis when the null hypothesis is actually true. The Department of Education asks you to use their standard level of 5%, or 0.05.
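
One way to see what a 5% significance level means is a small simulation. The sketch below is an assumption-laden illustration on synthetic data, not the districts data: both groups are drawn from the same population, so the null hypothesis is true by construction, and roughly 5% of the tests should still reject it. The seed, the number of experiments, and the variable names are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
alpha = 0.05
n_experiments = 2000

# Both groups come from the SAME synthetic population, so H0 is true by design
group_a = rng.normal(loc=70, scale=10, size=(n_experiments, 20))
group_b = rng.normal(loc=70, scale=10, size=(n_experiments, 20))

# One Welch t-test per row (axis=1 runs the test across each row's 20 values)
pvalues = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False).pvalue

# Share of experiments that wrongly reject a true null hypothesis
false_positive_rate = np.mean(pvalues < alpha)
print("False positive rate:", false_positive_rate)
```

The printed rate should land near `alpha`, which is the sense in which the significance level is the probability of rejecting a true null hypothesis.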

3.   Find the p-value. 

The p-value is the probability of observing results as extreme as, or more extreme than, those actually observed, assuming the null hypothesis is true.

Based on your sample data, the difference between the mean district literacy rates of STATE21 and STATE28 is about 6.2 percentage points. Under the null hypothesis the population means are equal, so a difference of this size would arise only from sampling variability. The p-value is the probability of observing an absolute difference in sample means of 6.2 percentage points or more if the null hypothesis is true. If such an outcome is very unlikely (in particular, if your p-value is less than your significance level of 5%), you will reject the null hypothesis.
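
To show where such a p-value comes from, the sketch below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then cross-checks them against `scipy.stats.ttest_ind`. The samples `x` and `y` are synthetic and hypothetical, so the numbers will not match the districts data.

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical values, not the districts data)
rng = np.random.default_rng(7)
x = rng.normal(loc=70, scale=8, size=20)
y = rng.normal(loc=64, scale=10, size=20)

# Squared standard error of each sample mean: s^2 / n
se2_x = x.var(ddof=1) / len(x)
se2_y = y.var(ddof=1) / len(y)

# Welch's t statistic: difference in means over its estimated standard error
t_stat = (x.mean() - y.mean()) / np.sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation for the degrees of freedom
df = (se2_x + se2_y) ** 2 / (
    se2_x ** 2 / (len(x) - 1) + se2_y ** 2 / (len(y) - 1)
)

# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against SciPy's built-in Welch test
check = stats.ttest_ind(a=x, b=y, equal_var=False)
print(t_stat, df, p_value)
print(check.statistic, check.pvalue)
```

The factor of 2 reflects the two-sided alternative hypothesis: differences in either direction count as evidence against the null.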

In [19]:
# Two-sample t-test
# a: observations from the first sample
# b: observations from the second sample
# equal_var: a boolean (True/False) indicating whether the two populations
#   are assumed to have equal variance. You don't have access to data for
#   the entire population, so you don't want to assume anything about the
#   variance. Setting equal_var=False avoids that assumption; SciPy then
#   performs Welch's t-test.
stats.ttest_ind(a=sample_state21['OVERALL_LI'], b=sample_state28['OVERALL_LI'], equal_var=False)

TtestResult(statistic=2.8980444277268735, pvalue=0.006421719142765237, df=35.20796133045557)

The p-value is about 0.0064, or 0.64%, which is less than the significance level of 0.05 (5%). Therefore, I reject the null hypothesis. In conclusion, there is a statistically significant difference between the mean district literacy rates of STATE21 and STATE28.

