Section 1: Group Comparisons with Continuous Data
This exercise will be done in Python via a Snowflake Notebook.

1.	Read the males_ht_wt_cntry.csv file into a data frame
2.	Examine the data
a.	Display some rows to make sure it imported correctly
b.	Generate histograms of the heights by country using Altair or Seaborn charts
c.	Generate histograms of the weights by country using Altair or Seaborn charts

3.	Conduct an ANOVA test to determine if the weights differ by nationality and interpret your results. Use this link as a reference. Make sure you use Levene’s test to check if the variance is close to equal.

4.	ANOVA won’t tell you which sets of weights differ. You will need to compare each group against each other to determine that. Use this link as a reference.
a.	Conduct a test to determine if the weights of the Italian males were significantly different than the Dutch males (from the Netherlands) and interpret your results
b.	Conduct a test to determine if the weights of the American males were significantly different than the Dutch males (from the Netherlands) and interpret your results

5.	Conducting multiple tests like this increases the odds of getting false significant results. If you had conducted tests for 3 comparisons (Italian vs Dutch, Italian vs American, American vs Dutch), what is the probability one of these t-tests is not actually significant (i.e. false positive)?

6.	When comparing these groups, it’s better to control the Family-Wise Error Rate (FWER). Use a multiple comparison procedure with a Tukey adjustment. See this link for how to do this using the pairwise_tukeyhsd() function (statsmodels.stats.multicomp.pairwise_tukeyhsd).



In [None]:
"""
1.Read the males_ht_wt_cntry.csv file into a data frame
"""

import pandas as pd

# read the file into a DataFrame
df = pd.read_csv("males_ht_wt_cntry.csv")

# display first few rows
df.head()

In [None]:
"""
2.	Examine the data
    a.	Display some rows to make sure it imported correctly
    b.	Generate histograms of the heights by country using Altair or Seaborn charts
    c.	Generate histograms of the weights by country using Altair or Seaborn charts
"""

# data types and missing values (info() prints its report directly)
df.info()

# summary statistics; printed explicitly because only a cell's last expression is displayed
print(df.describe())

# number of rows and columns
print(df.shape)

# a. display some rows to make sure the data imported correctly
print(df.head())

# b. and c. histograms of heights and weights by country (Seaborn)

import seaborn as sns
import matplotlib.pyplot as plt

#histogram of height by country
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="Height", hue="Country", kde=True, element="step", stat="density", bins=20)

plt.title("Histogram of Heights by Country")
plt.xlabel("Height")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

#histogram of weight by country
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="Weight", hue="Country", kde=True, element="step", stat="density", bins=20)

plt.title("Histogram of Weights by Country")
plt.xlabel("Weight")
plt.ylabel("Density")
#plt.legend(title="Country")
plt.tight_layout()
plt.show()
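
As a numeric companion to the histograms, per-country summary statistics can be tabulated with a groupby. This is only a sketch: the data frame below is randomly generated stand-in data (an assumption, not the real file), and only the column names Country, Height, and Weight are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; in the notebook, df from read_csv would be used instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Country": np.repeat(["Italy", "Netherlands", "USA"], 50),
    "Height": rng.normal(70, 3, 150),
    "Weight": rng.normal(170, 20, 150),
})

# Count, mean, and standard deviation of each measurement per country
summary = demo.groupby("Country")[["Height", "Weight"]].agg(["count", "mean", "std"])
print(summary.round(2))
```

Comparing the std columns across countries gives a first informal look at the equal-variance question that Levene's test addresses formally.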

In [None]:
"""
3.	Conduct an ANOVA test to determine if the weights differ by nationality and interpret your results. 
Use this link as a reference. Make sure you use Levene’s test to check if the variance is close to equal.
"""
import pandas as pd
from scipy.stats import f_oneway, levene

# Get list of countries
countries = df['Country'].unique()

# Create a list of weight arrays, one per country
weight_groups = [df[df['Country'] == country]['Weight'].dropna() for country in countries]

# Levene's test checks homogeneity of variances (null: equal variance)
levene_stat, levene_p = levene(*weight_groups)

print(f"Levene’s Test Statistic: {levene_stat:.4f}")
print(f"Levene’s Test p-value: {levene_p:.4f}")

if levene_p < 0.05:
    print("Variances are significantly different (p < 0.05), use Welch's ANOVA.\n\n")
else:
    print("Variances are not significantly different (p ≥ 0.05), standard ANOVA is appropriate.\n\n")


anova_stat, anova_p = f_oneway(*weight_groups)

print(f"ANOVA F-Statistic: {anova_stat:.4f}")
print(f"ANOVA p-value: {anova_p:.4f}")

if anova_p < 0.05:
    print("Reject the null hypothesis: There is a statistically significant difference in average weight by country.")
else:
    print("Fail to reject the null hypothesis: No significant difference in weight by country.")


Levene’s Test for Equal Variance
	Statistic: 2.6579
	p-value: 0.0722

Interpretation: Since p = 0.0722 > 0.05, we fail to reject the null hypothesis of equal variances. The variances in weight are not significantly different between countries, so standard one-way ANOVA is appropriate.

One-Way ANOVA (Weight by Country)
	F-statistic: 73.0317
	p-value: < 0.0001 (prints as 0.0000 at four decimals)

Interpretation: The p-value is far below 0.05, so we reject the null hypothesis and conclude that mean weight differs significantly for at least one country.

So Levene says the "spread" of data in each group is similar (as tested by the variance), while ANOVA says the "center" of at least one group is different (as tested by the mean).
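
To make the "center versus spread" distinction concrete, the F-statistic can be rebuilt by hand as between-group variability divided by within-group variability. The sketch below uses randomly generated groups (an assumption, not the course data) and checks the result against scipy.stats.f_oneway.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical groups standing in for the three countries' weights
rng = np.random.default_rng(1)
groups = [rng.normal(160, 15, 40), rng.normal(175, 15, 40), rng.normal(176, 15, 40)]

k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total sample size
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
```

A large F means the group means are far apart relative to the noise within groups, which is why the equal-variance check comes first.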

4.	ANOVA won’t tell you which sets of weights differ. You will need to compare each group against each other to determine that. Use this link as a reference.
a.	Conduct a test to determine if the weights of the Italian males were significantly different than the Dutch males (from the Netherlands) and interpret your results
b.	Conduct a test to determine if the weights of the American males were significantly different than the Dutch males (from the Netherlands) and interpret your results



In [None]:
import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (update path as needed if you're not in a local Jupyter or Snowflake Notebook)
df = pd.read_csv("males_ht_wt_cntry.csv")  # or the path to your file

# Extract weight data by country
italy_weights = df[df['Country'] == 'Italy']['Weight']
netherlands_weights = df[df['Country'] == 'Netherlands']['Weight']
usa_weights = df[df['Country'] == 'USA']['Weight']

# --- a. Italy vs Netherlands ---
t_stat_italy_nl, p_val_italy_nl = ttest_ind(italy_weights, netherlands_weights, equal_var=True)
print("Italy vs Netherlands:")
print(f"  t-statistic: {t_stat_italy_nl:.4f}")
print(f"  p-value: {p_val_italy_nl:.4f}")
if p_val_italy_nl < 0.05:
    print("  ➜ Statistically significant difference in mean weight.\n")
else:
    print("  ➜ No statistically significant difference.\n")

# --- b. USA vs Netherlands ---
t_stat_usa_nl, p_val_usa_nl = ttest_ind(usa_weights, netherlands_weights, equal_var=True)
print("USA vs Netherlands:")
print(f"  t-statistic: {t_stat_usa_nl:.4f}")
print(f"  p-value: {p_val_usa_nl:.4f}")
if p_val_usa_nl < 0.05:
    print("  ➜ Statistically significant difference in mean weight.\n")
else:
    print("  ➜ No statistically significant difference.\n")

a. Italy vs Netherlands
	•	t-statistic: -11.1358
	•	p-value: 0.0000

Interpretation:
There IS a statistically significant difference in the average weight between Italian and Dutch males.
Since the p-value is well below the 0.05 threshold, we can confidently reject the null hypothesis that their mean weights are equal.
The negative t-statistic suggests that, on average, Italian males weigh less than Dutch males.

⸻

b. USA vs Netherlands
	•	t-statistic: 0.3921
	•	p-value: 0.6955

Interpretation:
There is NO statistically significant difference in the average weight between American and Dutch males.
Because the p-value is well above 0.05, we fail to reject the null hypothesis. The observed difference in mean weight is small enough to be consistent with random sampling variation, so the data give no evidence of a true difference between the two populations.
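
The sign and size of the t-statistic follow directly from the pooled-variance formula. The sketch below recomputes it by hand on randomly generated samples (an assumption, not the course data) and compares it with ttest_ind(..., equal_var=True).

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples; a plays the role of the first argument to ttest_ind
rng = np.random.default_rng(2)
a = rng.normal(160, 15, 45)
b = rng.normal(175, 15, 55)

n_a, n_b = len(a), len(b)
# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
# t = difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_scipy, p_scipy = ttest_ind(a, b, equal_var=True)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```

Because the numerator is mean(a) minus mean(b), a negative t simply means the first sample has the smaller mean.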

5.	Conducting multiple tests like this increases the odds of getting false significant results. If you had conducted tests for 3 comparisons (Italian vs Dutch, Italian vs American, American vs Dutch), what is the probability one of these t-tests is not actually significant (i.e. false positive)?

When I run a t-test I typically use a significance level (α) of 0.05, meaning that if the null hypothesis is actually true there is a 5% chance of rejecting it anyway (a false positive). When I conduct multiple tests, the probability that at least one of them yields a false positive increases.

If I run 3 independent tests, what is the probability that at least one of them results in a false positive (assuming all null hypotheses are actually true)? The three pairwise comparisons share groups, so they are not strictly independent; treating them as independent gives a simple approximation.

⸻

I compute this using the complement rule:

P(\text{at least one false positive}) = 1 - (1 - \alpha)^n

Where:
	•	alpha = 0.05 (significance level)
	•	n = 3 (number of tests)

P = 1 - (1 - alpha)^n = 1 - (1 - 0.05)^3 = 1 - 0.857375 = 0.142625



In [None]:
# Parameters
alpha = 0.05       # significance level
num_tests = 3      # number of independent tests

# Compute the probability of at least one false positive
prob_false_positive = 1 - (1 - alpha) ** num_tests

# Display result
print(f"Probability of at least one false positive (Type I error) out of {num_tests} tests: {prob_false_positive:.4f}")
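
The formula can be sanity-checked by simulation. Under a true null hypothesis, the p-value of a continuous test is uniformly distributed on [0, 1], so the sketch below draws three independent uniform p-values per simulated experiment and counts how often at least one falls below alpha. The number of simulations and the seed are arbitrary choices.

```python
import numpy as np

alpha, num_tests, n_sims = 0.05, 3, 200_000
rng = np.random.default_rng(3)

# Each row is one experiment with num_tests independent p-values under the null
p_values = rng.uniform(0, 1, size=(n_sims, num_tests))

# An experiment has a family-wise error if any of its tests rejects
any_false_positive = (p_values < alpha).any(axis=1)
simulated = any_false_positive.mean()
analytic = 1 - (1 - alpha) ** num_tests

print(f"simulated FWER: {simulated:.4f}")
print(f"analytic FWER:  {analytic:.4f}")
```

The two numbers should agree closely for independent tests; correlated comparisons, such as pairwise tests that share groups, would generally differ somewhat from this figure.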

6.	When comparing these groups, it’s better to control the Family-Wise Error Rate (FWER). Use a multiple comparison procedure with a Tukey adjustment. See this link for how to do this using the pairwise_tukeyhsd() function (statsmodels.stats.multicomp.pairwise_tukeyhsd).

In [None]:
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Load the dataset
df = pd.read_csv("males_ht_wt_cntry.csv")  # Adjust path if needed

# Run Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df['Weight'], groups=df['Country'], alpha=0.05)

# Print the summary
print(tukey.summary())
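
Beyond the printed summary, the result object exposes the per-pair decisions as arrays, which is handy for reporting. The sketch below runs the same call on randomly generated stand-in data (the groups A, B, C and their means are assumptions) and lists each pair with its mean difference and reject flag.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: group A differs clearly from B and C, which share a mean
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "Country": np.repeat(["A", "B", "C"], 50),
    "Weight": np.concatenate([
        rng.normal(150, 5, 50),
        rng.normal(170, 5, 50),
        rng.normal(170, 5, 50),
    ]),
})

tukey = pairwise_tukeyhsd(endog=demo["Weight"], groups=demo["Country"], alpha=0.05)

# Pairs are reported in the order (0,1), (0,2), (1,2) over the sorted group labels
labels = list(tukey.groupsunique)
pairs = [(labels[i], labels[j]) for i in range(len(labels)) for j in range(i + 1, len(labels))]
for (g1, g2), diff, reject in zip(pairs, tukey.meandiffs, tukey.reject):
    print(f"{g1} vs {g2}: mean difference = {diff:.2f}, reject H0 = {bool(reject)}")
```

The reject flags already account for the family-wise adjustment, so they can be read directly at the chosen alpha without a further correction.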

Section 2: Group Comparisons with Categorical Data
1.	Create a new BMI column. Use the Imperial formula BMI = (Weight × 703) / Height^2.

2.	Create another new column ‘Overweight’ that is 1 if BMI >= 25 and 0 otherwise. There are several ways to do this in Python.

3.	Create a contingency table of overweight by nationality and examine it. Describe any differences you see between nationalities.

4.	Conduct a Chi-Sq test using scipy.stats to see if the differences are significant. Explain your findings.



In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
df = pd.read_csv("males_ht_wt_cntry.csv")  # Adjust path if needed

# Step 1: Calculate BMI using the imperial formula
df['BMI'] = (df['Weight'] * 703) / (df['Height'] ** 2)

# Step 2: Flag overweight (BMI >= 25)
df['Overweight'] = (df['BMI'] >= 25).astype(int)

# Step 3: Create contingency table (Country vs Overweight)
contingency = pd.crosstab(df['Country'], df['Overweight'])

# Step 4: Perform Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency)

# Step 5: Print results
print("Contingency Table (0 = not overweight, 1 = overweight):")
print(contingency)
print("\nExpected Frequencies:")
print(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns))
print("\nChi-Square Test Results:")
print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"Degrees of Freedom: {dof}")
print(f"p-value: {p:.4f}")

# Optional: Interpretation based on p-value
if p < 0.05:
    print("\nThere is a statistically significant relationship between nationality and overweight status.")
else:
    print("\nNo statistically significant relationship found.")

1.	USA:
	•	Has the highest number and proportion of overweight males, which I know from our frequent travels to Europe!
	•	About 58% (52 out of 90) are classified as overweight.
	•	Indicates a strong trend toward higher BMI values compared to other countries.
2.	Italy:
	•	Has the lowest number and proportion of overweight males.
	•	Only about 23% (16 out of 70) are overweight.
	•	Suggests that Italian males in this dataset tend to have lower BMI values.
3.	Netherlands:
	•	Falls in the middle.
	•	40% (32 out of 80) are overweight, though this does not align with my personal observation.
	•	This shows a moderate level of overweight prevalence.

⸻

Summary:
	•	The proportion of overweight individuals increases from Italy → Netherlands → USA.
	•	These patterns reflect clear differences in average body mass across the countries in this sample.
	•	The Chi-Square test confirms these differences are statistically significant: they are unlikely to be due to random variation alone.
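
The proportions quoted above can be reproduced from a contingency table. The sketch below rebuilds the table from the counts in the write-up (16 of 70, 32 of 80, 52 of 90), assuming those figures are exact, then computes row proportions and reruns the chi-square test on the reconstructed table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# (overweight count, group total) as quoted in the interpretation above
counts = {"Italy": (16, 70), "Netherlands": (32, 80), "USA": (52, 90)}

# Columns: 0 = not overweight, 1 = overweight
table = pd.DataFrame(
    [(total - over, over) for over, total in counts.values()],
    index=list(counts.keys()),
    columns=[0, 1],
)

# Share of each country's sample in each category (rows sum to 1)
row_share = table.div(table.sum(axis=1), axis=0)
print(row_share.round(3))

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2g}")
```

In the notebook itself, pd.crosstab(df['Country'], df['Overweight'], normalize='index') would give the same row proportions directly from the data frame.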

1.	Build a linear regression to see whether height predicts weight. There are two main modules for conducting linear regression in Python. Use statsmodels. Explain the results. 

2.	Fit the same regression model using linear algebra. Compare your resultant β’s to the ones you obtained earlier. 




In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the dataset
df = pd.read_csv("males_ht_wt_cntry.csv")  # Adjust path if needed

# --- Part 1: Linear Regression using statsmodels ---
# Define independent and dependent variables
X = df['Height']
y = df['Weight']

# Add a constant (intercept) to the model
X_with_const = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X_with_const).fit()

# Print results
print("=== Statsmodels Regression Results ===")
print(model.summary())
print()

# --- Part 2: Linear Regression using Linear Algebra ---
# Prepare matrices
X_matrix = np.vstack([np.ones(len(X)), X]).T  # Add intercept column manually
y_vector = y.values.reshape(-1, 1)

# Calculate beta coefficients using normal equation
beta_hat = np.linalg.inv(X_matrix.T @ X_matrix) @ (X_matrix.T @ y_vector)

# Display results
print("=== Linear Algebra Regression Coefficients ===")
print(f"Intercept (β₀): {beta_hat[0][0]:.4f}")
print(f"Slope     (β₁): {beta_hat[1][0]:.4f}")

Regression Results (from statsmodels):
	•	Intercept (β₀): -28.2547
	•	Slope (β₁): 2.8537
	•	R-squared: 0.3635
	•	F-statistic p-value: < 0.0001 (very significant)

Interpretation:
	•	For every additional inch in height, predicted weight is about 2.85 lbs higher, on average.
	•	The intercept is negative: at a height of 0 inches the predicted weight would be negative. That is not physically meaningful; it is simply where the fitted line crosses the y-axis, far outside the range of observed heights.
	•	The R² value of 0.364 tells us that about 36% of the variation in weight is explained by height, which is moderate explanatory power.
	•	The F-statistic and p-value show that the model is statistically significant overall.
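
As a quick illustration of what the fitted line implies, the reported coefficients can be plugged into a small prediction helper. The function name predict_weight and the example heights are hypothetical; the intercept and slope are the values reported above.

```python
# Coefficients as reported in the statsmodels results above
INTERCEPT = -28.2547
SLOPE = 2.8537


def predict_weight(height_in: float) -> float:
    """Predicted weight in lbs for a height in inches under the fitted line."""
    return INTERCEPT + SLOPE * height_in


# Example heights chosen for illustration only
for h in (66, 70, 74):
    print(f"{h} in -> {predict_weight(h):.1f} lbs")
```

Predictions are only meaningful near the range of heights actually observed, which is why the negative intercept is not a concern.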

⸻

2. Linear Regression Using Linear Algebra

We used the normal equation:

\hat{\beta} = (X^TX)^{-1}X^Ty
	•	Intercept (β₀): -28.2547
	•	Slope (β₁): 2.8537

These match the statsmodels coefficients to four decimal places, confirming the accuracy of the linear algebra implementation.
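
Forming the explicit inverse works here, but solving the linear system or calling a least-squares routine is usually preferred numerically. The sketch below compares the three approaches on randomly generated data (the true intercept, slope, and noise level are assumptions, not the course data).

```python
import numpy as np

# Hypothetical height/weight data with a known linear trend plus noise
rng = np.random.default_rng(5)
height = rng.normal(70, 3, 200)
weight = -30 + 2.9 * height + rng.normal(0, 15, 200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(height), height])

# 1) Normal equation with an explicit inverse (as in the cell above)
beta_normal = np.linalg.inv(X.T @ X) @ (X.T @ weight)
# 2) Same normal equation, solved without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ weight)
# 3) Least-squares solver working on X directly
beta_lstsq, *_ = np.linalg.lstsq(X, weight, rcond=None)

print("inverse:", beta_normal)
print("solve:  ", beta_solve)
print("lstsq:  ", beta_lstsq)
```

All three return the same intercept and slope up to floating-point error on well-conditioned data like this.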

⸻

Summary:
	•	Both methods yield identical coefficients.
	•	Height is a statistically significant predictor of weight.
	•	However, since R² is only ~0.36, other factors (e.g., body composition, nationality, age) likely also influence weight.

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv("males_ht_wt_cntry.csv")

# Create BMI column
df['BMI'] = (df['Weight'] * 703) / (df['Height'] ** 2)

# Create Overweight column (1 if BMI >= 25, else 0)
df['Overweight'] = (df['BMI'] >= 25).astype(int)

# Display the updated dataframe
# print(df.head())  # Use .head() to show the first 5 rows
print(df)
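
Since the assignment notes there are several ways to build the Overweight flag, here is a sketch of three equivalent options on a tiny made-up data frame (the three rows are assumptions chosen so the flags are easy to verify by hand).

```python
import numpy as np
import pandas as pd

# Hypothetical rows: heights in inches, weights in lbs
demo = pd.DataFrame({"Height": [70, 70, 66], "Weight": [180, 150, 170]})
demo["BMI"] = demo["Weight"] * 703 / demo["Height"] ** 2

# Option 1: boolean comparison cast to int (the approach used above)
flag_astype = (demo["BMI"] >= 25).astype(int)
# Option 2: vectorized conditional with NumPy
flag_where = np.where(demo["BMI"] >= 25, 1, 0)
# Option 3: row-wise function; readable but slower on large frames
flag_apply = demo["BMI"].apply(lambda bmi: 1 if bmi >= 25 else 0)

demo["Overweight"] = flag_astype
print(demo.round(2))
```

The first two options are vectorized and scale well; apply is mainly useful when the rule is too complex for a single comparison.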