<a href="https://colab.research.google.com/github/byruzyayandy1/Hypothesis_Testing-Population-and-Sample-of-Kaduna-/blob/master/Hypothesis_Testing_Basics_of_Kaduna.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Testing Basics

Statistical hypothesis tests are based on a statement called the null hypothesis, which assumes nothing interesting is going on between the variables you are testing. The exact form of the null hypothesis varies from one type of test to another: if you are testing whether groups differ, the null hypothesis states that the groups are the same. For instance, if you wanted to test whether the average age of voters in your home state differs from the national average, the null hypothesis would be that there is no difference between the average ages.
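For the voter-age example, the two competing hypotheses can be written compactly. The symbols here are notation introduced for illustration: $\mu_{\text{state}}$ is the true mean voter age in the state and $\mu_0$ is the national mean.

$$
H_0:\ \mu_{\text{state}} = \mu_0
\qquad\qquad
H_a:\ \mu_{\text{state}} \neq \mu_0
$$

The alternative $H_a$ is two-sided: it allows the state mean to be either higher or lower than the national mean.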

# Sample Data of Population and Kaduna
Let's create some dummy age data for the population of voters in the entire country and a sample of voters in Kaduna, then test whether the average age of voters in Kaduna differs from the population average:

In [0]:
%matplotlib inline

In [0]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math

In [0]:
np.random.seed(6)

population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_ages1, population_ages2))

kaduna_ages1 = stats.poisson.rvs(loc=18, mu=30, size=30)
kaduna_ages2 = stats.poisson.rvs(loc=18, mu=10, size=20)
kaduna_ages = np.concatenate((kaduna_ages1, kaduna_ages2))

print( population_ages.mean() )
print( kaduna_ages.mean() )

43.000112
39.26


Notice that we used a slightly different combination of distributions to generate the sample data for Kaduna, so we know that the two means are different. Let's conduct a t-test at a 95% confidence level and see if it correctly rejects the null hypothesis that the Kaduna sample has the same mean as the population. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [0]:
stats.ttest_1samp(a= kaduna_ages,                  #Sample data 
                 popmean= population_ages.mean())  #Pop mean

Ttest_1sampResult(statistic=-2.5742714883655027, pvalue=0.013118685425061678)
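As a cross-check, the statistic can be reproduced by hand from the one-sample formula t = (sample mean - population mean) / (s / sqrt(n)), where s is the sample standard deviation computed with an n - 1 denominator. The sketch below regenerates the same dummy data so that it runs on its own; the helper name `make_ages` is hypothetical and not part of the original notebook.

```python
import math
import numpy as np
import scipy.stats as stats

def make_ages(seed=6):
    """Recreate the dummy population and Kaduna sample (hypothetical helper)."""
    np.random.seed(seed)
    pop = np.concatenate((stats.poisson.rvs(loc=18, mu=35, size=150000),
                          stats.poisson.rvs(loc=18, mu=10, size=100000)))
    sample = np.concatenate((stats.poisson.rvs(loc=18, mu=30, size=30),
                             stats.poisson.rvs(loc=18, mu=10, size=20)))
    return pop, sample

population_ages, kaduna_ages = make_ages()

n = len(kaduna_ages)
s = kaduna_ages.std(ddof=1)          # sample standard deviation (n - 1 denominator)
standard_error = s / math.sqrt(n)    # estimated standard error of the sample mean
t_manual = (kaduna_ages.mean() - population_ages.mean()) / standard_error

t_scipy = stats.ttest_1samp(kaduna_ages, popmean=population_ages.mean()).statistic
print(t_manual, t_scipy)             # the two values should agree
```

Note the `ddof=1` argument: NumPy's `.std()` divides by n by default, while the t-test uses the n - 1 version.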

The test result shows the test statistic "t" is equal to -2.5743. This test statistic tells us how far the sample mean deviates from the value stated in the null hypothesis, measured in standard errors. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():

In [0]:
stats.t.ppf(q= 0.025,           #Quantile to check
           df= 49)              #Degrees of freedom (n - 1)

-2.0095752344892093

In [0]:
stats.t.ppf(q= 0.975,   #Quantile to check
           df= 49)      #Degrees of freedom (n - 1)

2.009575234489209
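The decision rule described above can be sketched in a few lines. The t-statistic is copied from the ttest_1samp() output, and the variable names are chosen here for illustration.

```python
import scipy.stats as stats

t_stat = -2.5742714883655027   # statistic reported by ttest_1samp above
df = 49                        # sample size of 50 minus 1
alpha = 0.05                   # significance level for a 95% confidence level

# Upper critical value of the two-tailed test (about 2.0096);
# the lower one is its negative because the t-distribution is symmetric.
t_crit = stats.t.ppf(1 - alpha / 2, df=df)

reject_null = abs(t_stat) > t_crit
print(reject_null)   # True
```

Since |-2.5743| is larger than 2.0096, the statistic falls in the rejection region.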

We can calculate the chance of seeing a result at least as extreme as the one we observed (known as the p-value) by passing the t-statistic in as the quantile to the stats.t.cdf() function:

In [0]:
stats.t.cdf(x= -2.5742,      #T-test statistic
               df= 49) * 2   #Multiply by two for two-tailed test*

0.013121066545690117

*Note: The alternative hypothesis we are checking is whether the sample mean differs from (is not equal to) the population mean. Since the sample could differ in either the positive or negative direction, we multiply the one-tailed probability by two.
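Doubling the CDF only works when the t-statistic is negative; for a positive statistic it would return a value above 1. A sign-independent sketch uses the survival function stats.t.sf() on the absolute value instead. The helper name `two_tailed_p` is hypothetical.

```python
import scipy.stats as stats

def two_tailed_p(t_stat, df):
    """Two-tailed p-value for a t-statistic of either sign (hypothetical helper)."""
    # sf(x) = 1 - cdf(x): the probability in the upper tail beyond |t|
    return 2 * stats.t.sf(abs(t_stat), df=df)

p_value = two_tailed_p(-2.5742714883655027, df=49)
print(p_value)   # matches the p-value reported by ttest_1samp to several decimals
```

Because the t-distribution is symmetric, `two_tailed_p(2.5742714883655027, df=49)` returns the same value.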

Notice this value matches the p-value listed in the original t-test output; the small difference in the later digits comes from rounding the t-statistic to -2.5742. A p-value of about 0.0131 means that, if the null hypothesis were true, we would expect to see data at least as extreme as our sample about 1.3% of the time due to chance. In this case, the p-value is lower than our significance level α (equal to 1 - confidence level, or 0.05), so we reject the null hypothesis. If we were to construct a 95% confidence interval for the sample mean, it would not capture the population mean of 43:

In [0]:
sigma = kaduna_ages.std()/math.sqrt(50)  # Sample stdev / sqrt(sample size); .std() defaults to ddof=0

stats.t.interval(0.95,                        # Confidence level
                 df = 49,                     # Degrees of freedom
                 loc = kaduna_ages.mean(),    # Sample mean
                 scale= sigma)                # Standard error estimate

(36.369669080722176, 42.15033091927782)

On the other hand, since there is a 1.3% chance of seeing a result this extreme due to chance, and 1.3% is more than 1%, the result is not significant at the 99% confidence level. This means that if we were to construct a 99% confidence interval, it would capture the population mean:

In [0]:
stats.t.interval(0.99,                        # Confidence level (positional; the keyword name varies by SciPy version)
                 df = 49,                     # Degrees of freedom
                 loc = kaduna_ages.mean(),    # Sample mean
                 scale= sigma)                # Standard error estimate

(35.40547994092107, 43.11452005907893)

With a higher confidence level, we construct a wider confidence interval and increase the chances that it captures the true mean, making it less likely that we'll reject the null hypothesis. In this case, the p-value of 0.013 is greater than our significance level of 0.01, so we fail to reject the null hypothesis.
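The link between the confidence interval and the test decision can be checked directly: a two-sided interval at confidence level c captures the population mean exactly when the p-value is at least 1 - c. The sketch below regenerates the dummy data so that it runs on its own. It uses stats.sem(), which computes the standard error with ddof=1 as the t-test does, so the interval endpoints differ slightly from the ones printed above, which came from .std() with its default ddof=0.

```python
import numpy as np
import scipy.stats as stats

# Recreate the dummy data with the same seed and draw order as the notebook
np.random.seed(6)
population_ages = np.concatenate((stats.poisson.rvs(loc=18, mu=35, size=150000),
                                  stats.poisson.rvs(loc=18, mu=10, size=100000)))
kaduna_ages = np.concatenate((stats.poisson.rvs(loc=18, mu=30, size=30),
                              stats.poisson.rvs(loc=18, mu=10, size=20)))

pop_mean = population_ages.mean()
p_value = stats.ttest_1samp(kaduna_ages, popmean=pop_mean).pvalue
std_err = stats.sem(kaduna_ages)   # standard error with ddof=1, as the t-test uses

results = {}
for conf_level in (0.95, 0.99):
    low, high = stats.t.interval(conf_level, df=len(kaduna_ages) - 1,
                                 loc=kaduna_ages.mean(), scale=std_err)
    captured = low <= pop_mean <= high     # does the interval contain the population mean?
    rejected = p_value < 1 - conf_level    # does the test reject at this level?
    results[conf_level] = (captured, rejected)
    print(conf_level, (low, high), "captured:", captured, "rejected:", rejected)
```

At each confidence level, exactly one of `captured` and `rejected` is true, which is the equivalence the text describes.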

For more information, you can check my GitHub account below.

<a href ="https://github.com/byruzyayandy1"> Mohammed Bayero Yayandi </a>