# Different Hypothesis Tests

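The Z-test is used to compare a sample mean with a reference value, or the means of two sets of data, when the population standard deviation is known.
The z-test assumes that the data is normally distributed (or that the sample is large enough for the sample mean to be approximately normal). Because each group's variance is known, the two-sample version does not require the variances of the two groups to be equal.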
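
As a rough illustration of the description above, here is a minimal sketch of a two-sided one-sample z-test computed by hand with `scipy.stats.norm`. The helper name `one_sample_z_test`, the sample values, and the "known" population standard deviation of 5 are all made up for this example.

```python
import numpy as np
from scipy import stats


def one_sample_z_test(sample, pop_mean, pop_std):
    """Two-sided one-sample z-test with a known population standard deviation."""
    sample = np.asarray(sample, dtype=float)
    # Standard error of the mean uses the known population sigma, not the sample one
    standard_error = pop_std / np.sqrt(sample.size)
    z_stat = (sample.mean() - pop_mean) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * stats.norm.sf(abs(z_stat))
    return z_stat, p_value


# Hypothetical weights in kg; sigma = 5 is an assumed known population value
weights = [31, 34, 29, 36, 33, 35, 30, 32, 37, 33]
z_stat, p_value = one_sample_z_test(weights, pop_mean=32, pop_std=5)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```

The only difference from a t-test here is that the standard error uses the known population value and the p-value comes from the normal distribution rather than the t distribution.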

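The Chi-squared test is used to check whether two categorical variables are associated (test of independence) or whether observed category counts match expected counts (goodness of fit).
It does not assume that the data follows a normal distribution, but it does require independent observations and expected counts that are not too small (a common rule of thumb is at least 5 per cell).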

The T-test is used to compare two sets of data when the population standard deviation is unknown and has to be estimated from the sample.
It assumes that the data is approximately normally distributed and, in the standard two-sample form, that the variances of the two groups are equal.
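
When the equal-variance assumption is doubtful, one option is Welch's t-test, which `scipy.stats.ttest_ind` exposes through `equal_var=False`. The sketch below compares both variants on synthetic data; the group sizes, means, and spreads are invented for illustration, and Levene's test is used only as one possible check of the variance assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical groups with different spreads and different sizes
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=53, scale=12, size=12)

# Levene's test: a small p-value suggests the variances differ
levene = stats.levene(group_a, group_b)
print("Levene p-value:", levene.pvalue)

# Student's t-test (pooled variance) vs. Welch's t-test (separate variances)
student = stats.ttest_ind(group_a, group_b, equal_var=True)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, which is why the groups here are deliberately unbalanced.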


# Steps

1. Define the null and alternative hypotheses
2. Choose a significance level
3. Collect data and calculate a test statistic
4. Decide whether to reject the null hypothesis based on the test statistic (or its p-value) and the significance level
5. Interpret the results
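
The five steps above can be sketched in code as follows. This is only an outline under assumed values: the hypothesised mean of 0.5, the significance level of 0.05, and the seeded uniform sample are chosen for illustration.

```python
import numpy as np
from scipy import stats

# Step 1: hypotheses. H0: population mean == 0.5, H1: population mean != 0.5
pop_mean_h0 = 0.5

# Step 2: significance level
alpha = 0.05

# Step 3: collect data (synthetic here) and compute the test statistic
rng = np.random.default_rng(42)
sample = rng.uniform(size=50)
result = stats.ttest_1samp(sample, popmean=pop_mean_h0)

# Step 4: decide by comparing the p-value with the significance level
reject_h0 = result.pvalue < alpha

# Step 5: interpret the decision in plain language
if reject_h0:
    print(f"p = {result.pvalue:.3f} < {alpha}: reject H0, the mean differs from {pop_mean_h0}.")
else:
    print(f"p = {result.pvalue:.3f} >= {alpha}: fail to reject H0.")
```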

In [9]:
# !pip3 install statsmodels

In [7]:
import numpy as np
from scipy import stats
rng = np.random.default_rng()
rvs = stats.uniform.rvs(size=50, random_state=rng)
print(stats.ttest_1samp(rvs, popmean=0.5))

TtestResult(statistic=np.float64(0.07995994307136536), pvalue=np.float64(0.9365947967752454), df=np.int64(49))


In [9]:
import numpy as np
import scipy.stats as stats

# Create a dummy dataset of 10 year old children's weight
data = np.random.randint(20, 40, 10)
print(data)

# Define the null hypothesis
H0 = "The average weight of 10 year old children is 32kg."

# Define the alternative hypothesis (two-sided, matching the default of ttest_1samp)
H1 = "The average weight of 10 year old children is not 32kg."

# Calculate the test statistic
t_stat, p_value = stats.ttest_1samp(data, popmean=32)

# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

[22 32 38 26 38 25 37 21 23 26]
Test statistic: -1.4855627054164144
p-value: 0.17155759269801515
Fail to reject the null hypothesis.
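
`ttest_1samp` is two-sided by default, so the p-value above tests whether the mean differs from 32kg in either direction. For a directional alternative such as "more than 32kg", recent SciPy versions (1.6 and later) accept an `alternative` argument. A minimal sketch, reusing the sample printed above:

```python
import numpy as np
from scipy import stats

# Weights copied from the sample printed in the previous cell
data = np.array([22, 32, 38, 26, 38, 25, 37, 21, 23, 26])

# Default: two-sided test (H1: mean != 32)
two_sided = stats.ttest_1samp(data, popmean=32)

# One-sided test (H1: mean > 32)
one_sided = stats.ttest_1samp(data, popmean=32, alternative="greater")

print("t statistic:", two_sided.statistic)
print("two-sided p-value:", two_sided.pvalue)
print("one-sided p-value:", one_sided.pvalue)
```

The t statistic is the same in both cases; only the p-value changes. Because the sample mean (28.8) is below 32, the one-sided p-value for "greater" is large.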


In [10]:
import pandas as pd
from scipy import stats
df = pd.read_csv("data/blood_pressure.csv")
df[['bp_before','bp_after']].describe()

| | bp_before | bp_after |
|---|---|---|
| count | 120.0 | 120.0 |
| mean | 156.45 | 151.358333 |
| std | 11.389845 | 14.177622 |
| min | 138.0 | 125.0 |
| 25% | 147.0 | 140.75 |
| 50% | 154.5 | 149.5 |
| 75% | 164.0 | 161.0 |
| max | 185.0 | 185.0 |


In [12]:

stats.ttest_rel(df['bp_before'], df['bp_after'])


TtestResult(statistic=np.float64(3.3371870510833657), pvalue=np.float64(0.0011297914644840823), df=np.int64(119))
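
A paired t-test is equivalent to a one-sample t-test on the per-subject differences against a mean of zero. The sketch below checks this on a small made-up set of before/after readings (these numbers are hypothetical and are not taken from `blood_pressure.csv`).

```python
import numpy as np
from scipy import stats

# Hypothetical before/after readings for the same six subjects
before = np.array([150, 162, 148, 171, 155, 160])
after = np.array([146, 158, 149, 165, 150, 157])

# Paired test on the two columns
paired = stats.ttest_rel(before, after)

# One-sample test on the differences, H0: mean difference == 0
on_diffs = stats.ttest_1samp(before - after, popmean=0)

print("paired:     ", paired.statistic, paired.pvalue)
print("differences:", on_diffs.statistic, on_diffs.pvalue)
```

This is why the paired test needs the two columns to be aligned row by row: each row must describe the same subject.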

In [13]:
import numpy as np
from scipy import stats

# Sample data
method_A = [85, 88, 90, 92, 87]
method_B = [78, 80, 85, 86, 83]
method_C = [90, 92, 94, 96, 91]

# Perform the ANOVA test
f_statistic, p_value = stats.f_oneway(method_A, method_B, method_C)

print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")


F-statistic: 16.155737704918028
P-value: 0.0003944484121574857
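
To make the F-statistic less of a black box, the following sketch recomputes it by hand from the same three groups: the variance between group means divided by the variance within groups. The variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([85, 88, 90, 92, 87]),  # method_A
    np.array([78, 80, 85, 86, 83]),  # method_B
    np.array([90, 92, 94, 96, 91]),  # method_C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = all_scores.size - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

print(f"F-statistic: {f_manual}")
print(f"P-value: {p_manual}")
```

The result should agree with `stats.f_oneway` up to floating-point rounding.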


In [14]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the DataFrame
data = {
    'Method': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female'] * 3,
    'Score': [85, 88, 90, 92, 87, 78, 80, 85, 86, 83, 90, 92, 94, 96, 91]
}

df = pd.DataFrame(data)
df

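| | Method | Gender | Score |
|---|---|---|---|
| 0 | A | Male | 85 |
| 1 | A | Male | 88 |
| 2 | A | Male | 90 |
| 3 | A | Female | 92 |
| 4 | A | Female | 87 |
| 5 | B | Male | 78 |
| 6 | B | Male | 80 |
| 7 | B | Male | 85 |
| 8 | B | Female | 86 |
| 9 | B | Female | 83 |
| 10 | C | Male | 90 |
| 11 | C | Male | 92 |
| 12 | C | Male | 94 |
| 13 | C | Female | 96 |
| 14 | C | Female | 91 |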


In [15]:

# Fit the model
model = ols('Score ~ C(Method) * C(Gender)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                         sum_sq   df          F    PR(>F)
C(Method)            262.800000  2.0  15.526477  0.001208
C(Gender)             18.677778  1.0   2.207002  0.171548
C(Method):C(Gender)    2.755556  2.0   0.162801  0.852208
Residual              76.166667  9.0        NaN       NaN


C(Method): The p-value (0.001208) is less than 0.05, indicating a significant effect of the teaching method on the scores. <br>
C(Gender): The p-value (0.171548) is greater than 0.05, indicating that gender does not have a significant effect on the scores. <br>
C(Method):C(Gender): The p-value (0.852208) is much greater than 0.05, indicating no significant interaction effect between teaching method and gender on the scores. <br>

# Chi Square Test

In [16]:
import numpy as np
from scipy.stats import chi2_contingency

# Example contingency table
# Rows: Gender (Male, Female)
# Columns: Preference (A, B, C)
data = np.array([[30, 10, 20], [35, 15, 25]])

# Perform Chi-square test
chi2, p, dof, expected = chi2_contingency(data)

print(f"Chi-square statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)

# Interpret the p-value
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis (variables are dependent)")
else:
    print("Fail to reject the null hypothesis (variables are independent)")

Chi-square statistic: 0.27692307692307694
P-value: 0.870696738961232
Degrees of freedom: 2
Expected frequencies:
[[28.88888889 11.11111111 20.        ]
 [36.11111111 13.88888889 25.        ]]
Fail to reject the null hypothesis (variables are independent)
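
The expected frequencies and the statistic printed above can be reproduced by hand from the row and column totals, which shows what `chi2_contingency` does under the hood. This is a minimal sketch with variable names chosen for this example; it omits Yates' continuity correction, which `chi2_contingency` applies by default only when there is one degree of freedom.

```python
import numpy as np
from scipy import stats

# Same contingency table: rows are Gender, columns are Preference
observed = np.array([[30, 10, 20], [35, 15, 25]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

print(f"Chi-square statistic: {chi2_manual}")
print(f"P-value: {p_manual}")
print(f"Degrees of freedom: {dof}")
```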
