# Week 8 - Daniel Solis Toro

## Exercise 9-1.

As sample size increases, the power of a hypothesis test increases, which means it is more likely to be positive if the effect is real. Conversely, as sample size decreases, the test is less likely to be positive even if the effect is real. To investigate this behavior, run the tests in this chapter with different subsets of the NSFG data. You can use thinkstats2.SampleRows to select a random subset of the rows in a DataFrame.
What happens to the p-values of these tests as the sample size decreases? What is the smallest sample size that yields a positive test?

In [4]:
import thinkstats2
import thinkplot
import nsfg
import pandas as pd
import numpy as np
from scipy import stats

# Load NSFG dataset
df = nsfg.ReadFemPreg()

def run_test(sample_size):
    # Take a random subset of the data
    sample = thinkstats2.SampleRows(df, sample_size)
    
    # Separate first-born and other births
    firsts = sample[sample['birthord'] == 1]['birthwgt_lb']
    others = sample[sample['birthord'] > 1]['birthwgt_lb']
    
    # Perform a t-test for difference in means
    t_stat, p_value = stats.ttest_ind(firsts.dropna(), others.dropna())
    
    return p_value

sample_sizes = [5000, 2000, 1000, 500, 200, 100, 50, 30, 20, 10]  # Different subset sizes
results = {}

for size in sample_sizes:
    p_value = run_test(size)
    results[size] = p_value
    print(f"Sample Size: {size}, p-value: {p_value:.5f}")


Sample Size: 5000, p-value: 0.00180
Sample Size: 2000, p-value: 0.04637
Sample Size: 1000, p-value: 0.19244
Sample Size: 500, p-value: 0.57157
Sample Size: 200, p-value: 0.39916
Sample Size: 100, p-value: 0.43245
Sample Size: 50, p-value: 0.84727
Sample Size: 30, p-value: 0.50287
Sample Size: 20, p-value: 0.90467
Sample Size: 10, p-value: 0.15680



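Conclusion:
- In this run, p-values generally rose as the sample shrank: 0.0018 at n = 5000, 0.046 at n = 2000, and above 0.15 for every sample of 1000 rows or fewer. Larger samples make it more likely to detect a real effect.
- The trend is not monotonic (for example, p = 0.157 at n = 10 but p = 0.905 at n = 20), because each p-value comes from a single random subset. With small samples the test has low power, so the p-value is dominated by sampling noise.
- The smallest sample size that yielded a significant result at the 5% level was 2000 (p = 0.04637). Because SampleRows draws a different subset on every call, this threshold will vary from run to run.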
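
Because a single draw per sample size is noisy, one way to make the "smallest sample size" question more stable is to repeat the test many times at each size and report the fraction of runs with p < 0.05, which is an estimate of the test's power. The sketch below does this with a permutation test on synthetic normal data. The effect size (0.1), standard deviation (1.4), and the helper names `perm_test_pvalue` and `estimate_power` are illustrative assumptions, not values or functions taken from the NSFG data or from thinkstats2.

```python
import numpy as np

def perm_test_pvalue(a, b, iters=200, rng=None):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng() if rng is None else rng
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    n = len(a)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)  # relabel the two groups at random
        if abs(pooled[:n].mean() - pooled[n:].mean()) >= observed:
            count += 1
    return count / iters

def estimate_power(n, effect=0.1, sd=1.4, trials=100, alpha=0.05, seed=0):
    """Fraction of simulated experiments (n values per group) with p < alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        # Synthetic groups: the effect size and spread are assumed, not NSFG values.
        a = rng.normal(0.0, sd, n)
        b = rng.normal(effect, sd, n)
        if perm_test_pvalue(a, b, rng=rng) < alpha:
            hits += 1
    return hits / trials

for n in [50, 500, 2000]:
    print(f"n per group = {n}: estimated power = {estimate_power(n):.2f}")
```

The same loop could wrap `run_test` from the cell above, calling it repeatedly at each sample size and counting how often the p-value falls below 0.05.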

## Exercise 10-1.

Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone's weight, how much would it help to know their height?

Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt. Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?

In [26]:
# Import necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Step 1: Load the BRFSS data
import brfss
data = brfss.ReadBrfss(nrows=None)

# Drop rows with missing height or weight values
df = data.dropna(subset=['htm3', 'wtkg2'])

# Extract heights and weights
heights, weights = df.htm3, df.wtkg2

# Log-transform the weights (base 10)
log_weights = np.log10(weights)

# Step 2: Linear least squares fit for log(weight) vs. height
# Add a constant for the intercept term
X = sm.add_constant(heights)
y = log_weights

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Print the model summary
print("Linear Regression Model Summary:")
print(model.summary())

# Step 3: Present the estimated parameters
# Back-transform the intercept, since the fit used log10(weight)
intercept = 10 ** model.params['const']
slope = model.params['htm3']  # change in log10(weight) per cm of height

print("\nEstimated Parameters:")
print(f"Intercept (exponentiated): {intercept}")
print(f"Slope: {slope}")

# Step 4: Evaluate the model's predictive power
r_squared = model.rsquared
print(f"\nR-squared: {r_squared}")

# Step 5: Resampling without weights
n_iterations = 1000
means = []

for _ in range(n_iterations):
    sample = df.sample(n=len(df), replace=True)
    means.append(sample['htm3'].mean())

mean_height = np.mean(means)
std_error = np.std(means, ddof=1)
confidence_interval = np.percentile(means, [5, 95])

print("\nResampling Without Weights:")
print(f"Mean height (unweighted): {mean_height}")
print(f"Standard error (unweighted): {std_error}")
print(f"90% confidence interval (unweighted): {confidence_interval}")

# Step 6: Resampling with weights
# The exercise text calls the sampling weights totalwt; the DataFrame
# returned by brfss.ReadBrfss stores them in the finalwt column.

weighted_means = []

for _ in range(n_iterations):
    sample = df.sample(n=len(df), replace=True)
    weighted_mean = np.average(sample['htm3'], weights=sample['finalwt'])
    weighted_means.append(weighted_mean)

mean_height_weighted = np.mean(weighted_means)
std_error_weighted = np.std(weighted_means, ddof=1)
confidence_interval_weighted = np.percentile(weighted_means, [5, 95])

print("\nResampling With Weights:")
print(f"Mean height (weighted): {mean_height_weighted}")
print(f"Standard error (weighted): {std_error_weighted}")
print(f"90% confidence interval (weighted): {confidence_interval_weighted}")

# Step 7: Compare weighted and unweighted estimates
print("\nComparison of Weighted and Unweighted Estimates:")
print(f"Difference in mean height: {mean_height_weighted - mean_height}")
print(f"Difference in standard error: {std_error_weighted - std_error}")
print(f"Difference in confidence intervals: {confidence_interval_weighted - confidence_interval}")


Linear Regression Model Summary:
                            OLS Regression Results                            
Dep. Variable:                  wtkg2   R-squared:                       0.283
Model:                            OLS   Adj. R-squared:                  0.283
Method:                 Least Squares   F-statistic:                 1.560e+05
Date:                Fri, 31 Jan 2025   Prob (F-statistic):               0.00
Time:                        22:18:26   Log-Likelihood:             4.0305e+05
No. Observations:              395832   AIC:                        -8.061e+05
Df Residuals:                  395830   BIC:                        -8.061e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.99


Resampling Without Weights:
Mean height (unweighted): 168.954679904606
Standard error (unweighted): 0.01654161552575265
90% confidence interval (unweighted): [168.92695449 168.98262306]

Resampling With Weights:
Mean height (weighted): 170.49499834307684
Standard error (weighted): 0.0368282444668808
90% confidence interval (weighted): [170.4309811  170.55162418]

Comparison of Weighted and Unweighted Estimates:
Difference in mean height: 1.5403184384708482
Difference in standard error: 0.020286628941128152
Difference in confidence intervals: [1.50402661 1.56900112]
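
The first part of the exercise asks how to present the parameters when the response is log-transformed and how much height helps when guessing weight. One common approach, sketched here on synthetic data, is to convert the log10 slope into a percent change in weight per centimetre and to compare the spread of the residuals with the spread of log-weight. The heights, slope, intercept, and noise level below are made-up assumptions, not BRFSS values. Applied to the R-squared of 0.283 reported above, the same arithmetic gives 1 - sqrt(1 - 0.283) ≈ 0.15, so knowing height would cut the RMSE of a log-weight guess by roughly 15%.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for heights (cm) and log10 weights (kg); not real data.
heights = rng.normal(170, 10, 10_000)
assumed_slope, assumed_intercept = 0.005, 1.0  # hypothetical parameters
log_weights = (assumed_intercept + assumed_slope * heights
               + rng.normal(0, 0.09, heights.size))

# Least squares fit of log10(weight) on height.
slope, inter = np.polyfit(heights, log_weights, 1)

# Presenting the slope: each extra cm multiplies weight by 10**slope.
pct_per_cm = (10 ** slope - 1) * 100

# How much does height help? Compare RMSE with and without the predictor.
residuals = log_weights - (inter + slope * heights)
std_ys = log_weights.std()    # RMSE when guessing the mean log-weight
std_res = residuals.std()     # RMSE when using height
r_squared = 1 - residuals.var() / log_weights.var()
reduction = 1 - std_res / std_ys

print(f"Weight changes by about {pct_per_cm:.2f}% per cm of height")
print(f"R-squared: {r_squared:.3f}, RMSE reduction: {reduction:.1%}")
```

Reporting the slope as a percentage per centimetre avoids asking the reader to think in log10 kilograms, and the RMSE reduction answers the "how much would it help" question more directly than R-squared does.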


## Difference in Mean Height:
- Unweighted mean height: 168.95 cm
- Weighted mean height: 170.49 cm
- Difference: 1.54 cm

The weighted mean is about 1.5 cm higher, and the two 90% confidence intervals do not overlap, so the gap is far larger than the resampling error. This suggests that the unweighted sample underrepresents taller respondents. The weighted estimate corrects for the sampling design and should be closer to the true population mean height.

## Standard Error:
- Unweighted standard error: 0.0165 cm
- Weighted standard error: 0.0368 cm
- Difference: 0.0203 cm

The weighted standard error is more than twice the unweighted one, so the weighted estimate is less precise. With unequal weights, a relatively small number of heavily weighted respondents dominates the average, which lowers the effective sample size. Weighting therefore trades a small increase in variance for a reduction in bias.
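
The larger weighted standard error can be understood through the effective sample size. A minimal sketch, assuming Kish's approximation n_eff = (sum of w)^2 / (sum of w^2) and using hypothetical lognormal weights rather than the BRFSS finalwt values:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish's approximate effective sample size for a weighted mean."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(0)
n = 10_000

equal = np.ones(n)                                   # every row counts the same
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # hypothetical unequal weights

n_eff_equal = effective_sample_size(equal)
n_eff_skewed = effective_sample_size(skewed)

# Under this approximation the standard error grows by about sqrt(n / n_eff).
inflation = np.sqrt(n / n_eff_skewed)

print(f"Equal weights:  n_eff = {n_eff_equal:.0f}")
print(f"Skewed weights: n_eff = {n_eff_skewed:.0f}, SE inflation = {inflation:.2f}")
```

If the approximation held for the results above, the ratio of the two standard errors (about 2.2) would correspond to an effective sample size of roughly one fifth of the actual number of respondents. Running `effective_sample_size(df['finalwt'])` on the real data would be one way to check this.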

## Confidence Interval:
- Unweighted 90% Confidence Interval: [168.93, 168.98]
- Weighted 90% Confidence Interval: [170.43, 170.55]

The weighted interval is roughly twice as wide (about 0.12 cm versus 0.06 cm), reflecting the larger standard error of the weighted mean. The two intervals do not overlap: weighting shifts the point estimate by far more than the width of either interval, so its main effect is on the location of the estimate rather than on its precision.

## Conclusion:
Correct weighting shifts the estimated mean height by about 1.5 cm, roughly 40 times the weighted standard error (0.037 cm), so ignoring the weights would give a clearly biased estimate. Weighting makes the estimates reflect the population rather than the sample design, at the cost of a standard error and confidence interval that are about twice as large.
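
The code above resamples rows uniformly and then takes a weighted average. An alternative, similar in spirit to the thinkstats2.ResampleRowsWeighted helper, is to draw rows with probability proportional to their weight and then take a plain mean. The sketch below compares the two resampling schemes on a synthetic survey in which a shorter group is deliberately oversampled. The group means, sampling shares, and the helper names `resample_means` and `summarize` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic survey, not BRFSS data: two equally common population groups,
# but group A (shorter on average) makes up about 70% of the sample.
n = 5000
is_a = rng.random(n) < 0.7
heights = np.where(is_a, rng.normal(163, 7, n), rng.normal(177, 7, n))
# Sampling weight = population share / sample share for each group.
weights = np.where(is_a, 0.5 / 0.7, 0.5 / 0.3)

def resample_means(values, weights=None, iters=500):
    """Bootstrap means; rows are drawn in proportion to weights if given."""
    p = None if weights is None else weights / weights.sum()
    idx = np.arange(len(values))
    return np.array([
        values[rng.choice(idx, size=len(values), replace=True, p=p)].mean()
        for _ in range(iters)
    ])

def summarize(means):
    """Mean, standard error, and 90% confidence interval of the estimates."""
    return means.mean(), means.std(ddof=1), np.percentile(means, [5, 95])

unweighted = summarize(resample_means(heights))
weighted = summarize(resample_means(heights, weights))
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In this synthetic setup the population mean is 170 cm by construction, so the weighted estimate should land near 170 while the unweighted one is pulled toward the oversampled shorter group.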