
# Objectives
## In this lab, students will learn how to:
- Work with quantitative variables from real datasets for statistical estimation
and hypothesis testing.
- Calculate mean and standard deviation of sample data
- Estimate confidence intervals for population means
- Conduct hypothesis testing for population means

>#  Import packages

In [3]:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import pandas as pd
from IPython.display import display

># Import dataset

In [5]:
census_income_df = pd.read_csv('/content/drive/MyDrive/DataScience_UWinnipeg/cleaned_census_income.csv')
display(census_income_df.head(10))

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
1,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
2,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
3,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
4,38,Private,150601,10th,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
5,74,State-gov,88638,Doctorate,16,Never-married,Prof-specialty,Other-relative,White,Female,0,3683,20,United-States,>50K
6,68,Federal-gov,422013,HS-grad,9,Divorced,Prof-specialty,Not-in-family,White,Female,0,3683,40,United-States,<=50K
7,45,Private,172274,Doctorate,16,Divorced,Prof-specialty,Unmarried,Black,Female,0,3004,35,United-States,>50K
8,38,Self-emp-not-inc,164526,Prof-school,15,Never-married,Prof-specialty,Not-in-family,White,Male,0,2824,45,United-States,>50K
9,52,Private,129177,Bachelors,13,Widowed,Other-service,Not-in-family,White,Female,0,2824,20,United-States,>50K


># Calculate the number of samples, sample mean, and sample standard deviation:

In [7]:
income_private_sector_df = census_income_df[census_income_df.workclass == 'Private']
# number of samples
n = income_private_sector_df.age.count()
# sample mean
x_bar = income_private_sector_df.age.mean()
# sample standard deviation
s = income_private_sector_df.age.std()

# The US Department of Labor is interested in studying the population mean of ages of the American adults who work for the workclass = private sector and have an annual income greater than $50k.

- Denote X as the age variable for American adults who work for the Private sector
and have an annual Income greater than $50k (‘>50K’)
- Based on the sample dataset, compute a point estimate of the population mean of X
- Based on the sample dataset, estimate a 95% confidence interval for the population
mean of X
> Situation: A recent study claimed that the average age of American adults working for the Private sector and having Income > 50K was very young in the 1990s, about 25. The head of the US Department of Labor believes that the mean should be higher.
>
>Based on this dataset sample, conduct a hypothesis test for the population mean
of X to draw conclusions about those statements.

>## State the null hypothesis:
The null hypothesis is that the population mean of X is equal to 25.

+ H0: μ = 25

>## State the alternative hypothesis:
The alternative hypothesis is that the population mean of X is greater than 25.

+ Ha: μ > 25

>## Choose a significance level:
Assuming a 95% confidence level, the significance level is 0.05.

+ α = 0.05

>## Calculate the test statistic:
We can use a one-sample t-test to calculate the test statistic, which is given by:

+ t = (x̄ - μ) / (s / √n)

> where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
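
As a quick numeric illustration of the formula above, the following sketch evaluates t from summary statistics. The helper name `t_statistic` and the values passed to it are hypothetical placeholders, not results from the census dataset.

```python
import math

def t_statistic(x_bar, mu0, s, n):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    return (x_bar - mu0) / (s / math.sqrt(n))

# Hypothetical summary statistics, chosen only for illustration
t_example = t_statistic(x_bar=44.0, mu0=25, s=10.0, n=100)
print(t_example)  # 19.0, because the standard error is 10 / sqrt(100) = 1
```

A larger sample size shrinks the standard error s / √n, so the same gap between x̄ and μ produces a larger t.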

>## Calculate the p-value:
>Because the alternative hypothesis is right-tailed, the p-value is the probability P(T ≥ t), where T follows a t-distribution with df = n - 1 degrees of freedom.
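
A minimal sketch of this step, assuming a hypothetical t-statistic and degrees of freedom rather than values from the dataset. It computes the right-tail probability with SciPy's survival function `st.t.sf` and, for comparison, with `1 - st.t.cdf`. The two are mathematically identical, but for a very large t the subtraction is expected to lose all precision in floating point.

```python
import scipy.stats as st

# Hypothetical test statistic and degrees of freedom, for illustration only
t_value = 40.0
dof = 1000

# Right-tailed p-value: P(T >= t_value) under the t distribution
p_sf = st.t.sf(t_value, dof)

# Same quantity on paper, but the subtraction loses precision once
# the CDF rounds to 1.0 in double precision
p_cdf = 1 - st.t.cdf(t_value, dof)

print("sf:     ", p_sf)
print("1 - cdf:", p_cdf)
```

With sample sizes in the thousands, as in this lab, t-statistics this large are plausible, which is one reason to prefer the survival function.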

>## Compare p-value with significance level:
+ If p-value is less than the significance level, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

>## Draw conclusions:
+ If we reject the null hypothesis, we can conclude that the population mean of X is greater than 25. Otherwise, we cannot conclude that the population mean of X is greater than 25.

In [8]:
# Extract the necessary data
# Filter the DataFrame to keep only individuals who work in the private sector and have an income greater than $50k:
income_private_sector_df = census_income_df[(census_income_df.workclass == 'Private') & (census_income_df.income == '>50K')]
# Extract the age column from the filtered DataFrame, then calculate the sample size, sample mean, and sample standard deviation:
ages = income_private_sector_df.age
n = len(ages)
x_bar = ages.mean()

# std() divides the sum of squared deviations by (N - ddof), where ddof is the
# "Delta Degrees of Freedom". pandas defaults to ddof=1 and NumPy to ddof=0.
s = ages.std(ddof=1)  # ddof=1 gives the sample (not population) standard deviation

# Define the null and alternative hypotheses
mu0 = 25
Ha = 'greater'

# Set the significance level
alpha = 0.05

# Calculate the t-statistic
t = (x_bar - mu0) / (s / np.sqrt(n))

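# Calculate the right-tailed p-value; sf (the survival function) equals 1 - cdf
# but keeps precision when the p-value is very small
df = n - 1
p_val = st.t.sf(t, df)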

# Compare p-value with significance level and draw conclusions
if p_val < alpha:
    print("Reject the null hypothesis.")
    print("There is sufficient evidence to suggest that the population mean age of Americans working for the Private sector and having Income > 50K is greater than 25.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is not enough evidence to suggest that the population mean age of Americans working for the Private sector and having Income > 50K is greater than 25.")


Reject the null hypothesis.
There is sufficient evidence to suggest that the population mean age of Americans working for the Private sector and having Income > 50K is greater than 25.
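
The first two tasks for X, the point estimate and the 95% confidence interval, can be sketched as follows. The helper name `mean_confidence_interval` and the small `ages_demo` list are hypothetical and exist only to make the example self-contained; with the real data, the filtered `ages` series would take the list's place.

```python
import numpy as np
import scipy.stats as st

def mean_confidence_interval(values, confidence=0.95):
    """Return (point_estimate, lower, upper) for the population mean,
    using the t distribution with n - 1 degrees of freedom."""
    values = np.asarray(values, dtype=float)
    n = values.size
    point_estimate = values.mean()
    std_error = values.std(ddof=1) / np.sqrt(n)
    # Upper-tail quantile, so the margin of error is positive
    t_crit = st.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_crit * std_error
    return point_estimate, point_estimate - margin, point_estimate + margin

# Made-up ages, for illustration only (not taken from the census data)
ages_demo = [38, 45, 52, 41, 60, 47, 39, 55]
estimate, lower, upper = mean_confidence_interval(ages_demo)
print(f"Point estimate: {estimate:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The sample mean serves as the point estimate, and the interval is centered on it with a half-width of the critical t value times the standard error.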


# The US Department of Labor is also interested in studying the population mean of the number of education years ‘education.num’ of American adults who work for the workclass = Private and have an annual Income greater than $50k (‘>50K’) in the 1990s.

- Denote Y as the variable for number of education years ‘education.num’ of American adults who work for the workclass = Private and have an annual Income greater than $50k (‘>50K’) in the 1990s.
- Based on the sample dataset, compute a point estimate of the population mean of Y

- Based on the sample dataset, estimate 95% confidence interval for population
mean of Y
>
> Situation: A recent study claimed that the average number of education years ‘education.num’ of American adults who work for the workclass = Private and have an annual Income greater than $50k (‘>50K’) in the 1990s was small, just about 3 years. The head of the US Department of Labor believes that the mean should be higher.
>
> Based on this dataset sample, conduct a hypothesis test for the population mean
of Y to draw conclusions about those statements.

In [None]:
# select data for Private workclass with income > 50k
income_private_sector_df = census_income_df[(census_income_df.workclass == 'Private') & (census_income_df.income == '>50K')]

# calculate the sample mean
y_bar = income_private_sector_df['education.num'].mean()
print('Sample mean of Y:', y_bar)


In [None]:
# calculate sample standard deviation
s_y = income_private_sector_df['education.num'].std(ddof=1)

# calculate standard error of the mean
se_y = s_y / np.sqrt(len(income_private_sector_df))

# calculate margin of error for 95% confidence level (normal approximation, reasonable for a large sample)
z_critical = st.norm.ppf(q=0.975)
margin_of_error = z_critical * se_y

# calculate the confidence interval
confidence_interval = (y_bar - margin_of_error, y_bar + margin_of_error)
print('95% Confidence Interval for population mean of Y:', confidence_interval)
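
The interval above uses the normal critical value, while the hypothesis tests in this lab use the t distribution. The following sketch, run on synthetic data rather than the census sample, compares the two critical values and the resulting intervals. The sample size of 5000 is an assumption chosen to be of the same order as a large survey subset.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic "education years" sample, for illustration only
sample = rng.normal(loc=11.4, scale=2.3, size=5000)

n = sample.size
mean = sample.mean()
std_error = sample.std(ddof=1) / np.sqrt(n)

# Critical values for a two-sided 95% interval
z_crit = st.norm.ppf(0.975)
t_crit = st.t.ppf(0.975, df=n - 1)

z_interval = (mean - z_crit * std_error, mean + z_crit * std_error)
t_interval = (mean - t_crit * std_error, mean + t_crit * std_error)

print("z critical:", z_crit, "t critical:", t_crit)
print("z interval:", z_interval)
print("t interval:", t_interval)
```

The t critical value is always slightly larger than the z critical value, so the t interval is slightly wider; with thousands of observations the difference is negligible.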


In [None]:
# define null and alternative hypothesis
mu = 3
H0 = 'The population mean of Y is equal to 3'
Ha = 'The population mean of Y is greater than 3'

# calculate t-statistic and p-value for one-sample t-test
t_stat, p_val = st.ttest_1samp(income_private_sector_df['education.num'], mu, alternative='greater')

# calculate degrees of freedom
df = len(income_private_sector_df) - 1

# calculate critical value for 95% confidence level
alpha = 0.05
t_critical = st.t.ppf(1 - alpha, df)

# compare p-value with significance level
if p_val < alpha:
    print(f"{Ha}. We reject the null hypothesis. The t-statistic is {t_stat:.4f}, the p-value is {p_val:.4f}, and the critical value is {t_critical:.4f}.")
else:
    print(f"{H0}. We fail to reject the null hypothesis. The t-statistic is {t_stat:.4f}, the p-value is {p_val:.4f}, and the critical value is {t_critical:.4f}.")
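
The cell above computes a critical value but bases its decision on the p-value alone. For a right-tailed test the two rules are equivalent, which the following sketch checks on synthetic data (not the census sample). It also recomputes the statistic by hand to confirm what `st.ttest_1samp` returns.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
# Synthetic sample, for illustration only
sample = rng.normal(loc=11.4, scale=2.3, size=200)
mu0 = 3
alpha = 0.05

# Library result (the `alternative` argument requires SciPy 1.6 or newer)
t_stat, p_val = st.ttest_1samp(sample, mu0, alternative='greater')

# Manual computation of the same quantities
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = st.t.sf(t_manual, df=n - 1)

# Two equivalent decision rules for a right-tailed test
t_critical = st.t.ppf(1 - alpha, df=n - 1)
reject_by_p = p_val < alpha
reject_by_critical = t_stat > t_critical

print("Manual and library results agree:",
      np.isclose(t_stat, t_manual), np.isclose(p_val, p_manual))
print("Reject by p-value:", reject_by_p)
print("Reject by critical value:", reject_by_critical)
```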


# Review

In [11]:
# Subset data
subset = census_income_df[(census_income_df['workclass'] == 'Private') & (census_income_df['income'] == '>50K')]

# Calculate sample mean
sample_mean = subset['education.num'].mean()

# Calculate sample standard deviation
sample_std = subset['education.num'].std()

# Set confidence level
conf_level = 0.95

# Set degrees of freedom
df = len(subset) - 1

# Calculate the two-sided critical t-score
# (the lower-tail quantile is negative, so take its absolute value)
t_score = abs(st.t.ppf(q=(1-conf_level)/2, df=df))

# Calculate margin of error
margin_of_error = t_score * (sample_std / np.sqrt(len(subset)))

# Calculate confidence interval
conf_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Print results
print(f"Sample mean: {sample_mean:.2f}")
print(f"Confidence interval ({conf_level*100:.0f}%): {conf_interval}")
print("")

# Set null hypothesis
null_hyp = 3

# Calculate t-score for null hypothesis
t_score_null = (sample_mean - null_hyp) / (sample_std / np.sqrt(len(subset)))

# Calculate the one-sided (right-tailed) p-value, matching the claim that the mean is higher than 3
p_value = st.t.sf(t_score_null, df=df)

# Set significance level
sig_level = 0.05

# Print results
print(f"Null hypothesis: population mean = {null_hyp}")
print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample standard deviation: {sample_std:.2f}")
print(f"t-score: {t_score_null:.2f}")
print(f"p-value: {p_value:.4f}")
if p_value < sig_level:
    print("Reject null hypothesis: there is sufficient evidence to support the claim that the population mean is higher than 3.")
else:
    print("Fail to reject null hypothesis: there is insufficient evidence to support the claim that the population mean is higher than 3.")

Sample mean: 11.42
Confidence interval (95%): (11.352069634589204, 11.480169906017851)

Null hypothesis: population mean = 3
Sample mean: 11.42
Sample standard deviation: 2.28
t-score: 257.60
p-value: 0.0000
Reject null hypothesis: there is sufficient evidence to support the claim that the population mean is higher than 3.


>Based on the results, we reject the null hypothesis that the population mean is equal to 3: the t-score is 257.60 and the p-value is effectively 0, far below the 0.05 significance level. There is therefore sufficient evidence to support the claim that the population mean is higher than 3. In addition, we can say with 95% confidence that the true mean number of education years for people in the private sector with incomes over 50K lies between roughly 11.35 and 11.48.