
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import pandas as pd
from IPython.display import display

# Load the census income dataset

In [4]:
census_income_df = pd.read_csv('/content/drive/MyDrive/DataScience_UWinnipeg/cleaned_census_income.csv')

## The US Department of Labor is interested in studying the population proportion of American adults who work in the private sector (workclass = Private) and have an annual income greater than $50k.

### To obtain a point estimate of the population proportion of American adults who work in the Private sector and have an annual income greater than $50k, we calculate the sample proportion from the given dataset.

>First, we can filter the dataset to only include those who work for the Private sector and have an income greater than $50k:

In [5]:
private_over_50k = census_income_df[(census_income_df['workclass'] == 'Private') & (census_income_df['income'] == '>50K')]


>Then, denoting this group by X, we can calculate the point estimate of the population proportion of X as the number of matching rows divided by the total number of rows:

In [6]:
p_hat = len(private_over_50k) / len(census_income_df)
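
>The same proportion can also be read off as the mean of a boolean mask, without building an intermediate DataFrame. The sketch below shows this on a small made-up table; the rows and the names `toy_df`, `p_hat_toy`, and `ratio_toy` are illustrative assumptions, not part of the census data.

```python
import pandas as pd

# Toy data for illustration only; not the census dataset.
toy_df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "Private", "Self-emp-inc"],
    "income": [">50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Boolean mask: True where both conditions hold.
mask = (toy_df["workclass"] == "Private") & (toy_df["income"] == ">50K")

# The mean of a boolean Series is the fraction of True values.
p_hat_toy = mask.mean()

# Same value as the len-based ratio used in the cell above.
ratio_toy = len(toy_df[mask]) / len(toy_df)
print(p_hat_toy, ratio_toy)
```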

>To estimate the 95% confidence interval for the population proportion of X, we can use the formula:

>* CI = (p_hat - z_star * se, p_hat + z_star * se)

>where z_star = 1.96 for a 95% confidence level and se is the standard error, which can be calculated as:


In [8]:
se = np.sqrt(p_hat * (1 - p_hat) / len(census_income_df))

# calculate confidence interval
z_star = 1.96 # for a 95% confidence level
CI = (p_hat - z_star * se, p_hat + z_star * se)

print("Point estimation for population proportion of X:", p_hat)
print("95% Confidence interval for population proportion of X:", CI)

Point estimation for population proportion of X: 0.16166036734964526
95% Confidence interval for population proportion of X: (0.15750568539951126, 0.16581504929977925)
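
>The interval above hard-codes z_star = 1.96. As a minimal sketch, the critical value can instead be derived from the confidence level with `st.norm.ppf`, and the whole calculation wrapped in a helper. The function name `proportion_ci` and the counts passed to it are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np
import scipy.stats as st

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    # Two-sided critical value, about 1.96 for 95% confidence.
    z_crit = st.norm.ppf(1 - (1 - confidence) / 2)
    se_p = np.sqrt(p * (1 - p) / n)
    return p - z_crit * se_p, p + z_crit * se_p

# Hypothetical counts for illustration only.
low, high = proportion_ci(successes=160, n=1000)
print(low, high)
```

>Passing a different `confidence` (for example 0.99) widens the interval without any change to the formula.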


## To conduct a hypothesis test for the population proportion of X, we can set up the null and alternative hypotheses:

* H0: The population proportion of American adults who work for the Private sector and have an annual income greater than $50k is at most 0.25 (the value stated in the recent study).

* Ha: The population proportion of American adults who work for the Private sector and have an annual income greater than $50k is greater than 0.25 (as suggested by the head of the US Department of Labor).

>* H0: p <= 0.25
>* Ha: p > 0.25

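>Then, we can calculate the test statistic using the formula:

>* z = (p_hat - p0) / se

>where **p0** = 0.25 is the hypothesized proportion under the null hypothesis and se is the sample-based standard error computed above. The textbook one-proportion z-test instead uses the standard error under the null, sqrt(p0 * (1 - p0) / n); because p_hat is below p0 here, z is negative under either choice and the conclusion is the same.

>Finally, we calculate the p-value and compare it to the significance level (here 0.05) to draw a conclusion.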




In [9]:
# set up hypotheses
p0 = 0.25
H0 = "p <= 0.25"
Ha = "p > 0.25"

# calculate test statistic
z = (p_hat - p0) / se

# calculate p-value
p_value = 1 - st.norm.cdf(z)

# compare p-value to significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is sufficient evidence to support the claim that the proportion of American adults working for the Private sector and having income >50K is higher than 25%.")
else:
    print("Fail to reject the null hypothesis. There is not sufficient evidence to support the claim that the proportion of American adults working for the Private sector and having income >50K is higher than 25%.")


Fail to reject the null hypothesis. There is not sufficient evidence to support the claim that the proportion of American adults working for the Private sector and having income >50K is higher than 25%.
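
>The cell above divides by the standard error estimated from the sample. A common alternative evaluates the standard error at the null value p0. The sketch below shows that variant for a right-tailed test; the function name `one_sided_prop_ztest` and its inputs are hypothetical and are not the notebook's own numbers.

```python
import numpy as np
import scipy.stats as st

def one_sided_prop_ztest(p_hat_obs, p_null, n):
    """Right-tailed one-proportion z-test (Ha: p > p_null)."""
    # Standard error evaluated at the null value, not at the sample proportion.
    se_null = np.sqrt(p_null * (1 - p_null) / n)
    z_stat = (p_hat_obs - p_null) / se_null
    # The survival function gives P(Z > z_stat) directly.
    return z_stat, st.norm.sf(z_stat)

# Hypothetical inputs for illustration only.
z_demo, p_demo = one_sided_prop_ztest(p_hat_obs=0.16, p_null=0.25, n=1000)
print(z_demo, p_demo)
```

>When the sample proportion is below the null value, the statistic is negative and the right-tailed p-value is above 0.5, so the null cannot be rejected.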


## The US Department of Labor is also interested in studying the population proportion of American adults in the 1990s who work in the private sector (workclass = Private) and have education = Bachelors.

### Let Y denote the American adults who have workclass = Private and education = Bachelors.

> *Point estimation for population proportion of Y:*

In [10]:
p_hat = len(census_income_df[(census_income_df.workclass == 'Private') & (census_income_df.education == 'Bachelors')]) / len(census_income_df)
print('Point estimation for population proportion of Y:', p_hat)

Point estimation for population proportion of Y: 0.11454810688946357


> *Confidence interval for population proportion of Y:*

In [11]:
z_star = 1.96 # 95% confidence level
n = len(census_income_df)
y_count = len(census_income_df[(census_income_df.workclass == 'Private') & (census_income_df.education == 'Bachelors')])
p_hat = y_count / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - z_star * se
upper = p_hat + z_star * se
print('Confidence interval for population proportion of Y: ({}, {})'.format(lower, upper))

Confidence interval for population proportion of Y: (0.11095390517508498, 0.11814230860384216)


### Hypothesis testing for population proportion of Y:
>Null Hypothesis (H0): The population proportion of American adults who work for the workclass = Private and have education = Bachelors in the 1990s is equal to or less than 0.05.

>Alternative Hypothesis (Ha): The population proportion of American adults who work for the workclass = Private and have education = Bachelors in the 1990s is greater than 0.05.

In [12]:
# Calculate the test statistic and p-value
z = (p_hat - 0.05) / se
p_value = 1 - st.norm.cdf(z)

# Print the results
print('Test statistic:', z)
print('p-value:', p_value)

# Check if we can reject the null hypothesis at the 5% significance level
alpha = 0.05
if p_value < alpha:
    print('We reject the null hypothesis.')
else:
    print('We fail to reject the null hypothesis.')

Test statistic: 35.19955182182142
p-value: 0.0
We reject the null hypothesis.
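
>A p-value printed as exactly 0.0 is a floating-point artifact: for a very large z, `st.norm.cdf(z)` rounds to 1.0, so `1 - st.norm.cdf(z)` becomes 0. As a minimal sketch, `st.norm.sf` and `st.norm.logsf` keep the upper-tail probability representable. The value `z_large` is an illustrative stand-in of the same order as the statistic above.

```python
import scipy.stats as st

z_large = 35.0  # illustrative value, not the notebook's exact statistic

# cdf rounds to 1.0 in double precision, so the subtraction gives exactly 0.0.
p_subtract = 1 - st.norm.cdf(z_large)

# The survival function computes the upper tail directly: tiny but nonzero.
p_sf = st.norm.sf(z_large)

# Natural log of the upper-tail probability, useful when even sf would underflow.
log_p = st.norm.logsf(z_large)

print(p_subtract, p_sf, log_p)
```

>The decision at alpha = 0.05 is the same either way; the difference only matters when the p-value itself is reported.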


>Based on the results of the hypothesis test, we reject the null hypothesis (H0: p <= 0.05) at the 0.05 significance level and conclude that there is evidence that the true population proportion of American adults who have workclass = Private and education = Bachelors in the 1990s is greater than 0.05.

>In other words, the data support the belief of the head of the US Department of Labor that more than 5% of American adults in the 1990s had workclass = Private and education = Bachelors.