In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#  read the dataset
df = pd.read_csv("2012-sat-results.csv")

print(df.info())
print("")

# Convert the three score columns to numeric; non-numeric entries become NaN
df["SAT Critical Reading Avg. Score"] = pd.to_numeric(df["SAT Critical Reading Avg. Score"], errors="coerce")
df["SAT Math Avg. Score"] = pd.to_numeric(df["SAT Math Avg. Score"], errors="coerce")
df["SAT Writing Avg. Score"] = pd.to_numeric(df["SAT Writing Avg. Score"], errors="coerce")

# Drop rows with NaN values
df = df.dropna(subset=["SAT Critical Reading Avg. Score", "SAT Math Avg. Score", "SAT Writing Avg. Score"])

print(df.info())
print("")

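# Population parameters for the variable of interest (SAT Writing Avg. Score)
mu = df["SAT Writing Avg. Score"].mean()             # population mean
tau = df["SAT Writing Avg. Score"].sum()             # population total
sigmasq = df["SAT Writing Avg. Score"].var(ddof=0)   # population variance (divides by N)

print(f"The mu is: {mu}")
print(f"The tau is: {tau}")
print(f"The sigma^2 is: {sigmasq}")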

print("")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 6 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   DBN                              478 non-null    object
 1   SCHOOL NAME                      478 non-null    object
 2   Num of SAT Test Takers           478 non-null    object
 3   SAT Critical Reading Avg. Score  478 non-null    object
 4   SAT Math Avg. Score              478 non-null    object
 5   SAT Writing Avg. Score           478 non-null    object
dtypes: object(6)
memory usage: 22.5+ KB
None

<class 'pandas.core.frame.DataFrame'>
Index: 421 entries, 0 to 477
Data columns (total 6 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              421 non-null    object 
 1   SCHOOL NAME                      421 non-null    object 
 2   Num of SA

#### **Choose an auxiliary variable x that should be related to your variable of interest y. Take an SRS of size n (the same size as in Report 2)**
Our variable of interest is SAT Writing Avg. Score. As the auxiliary variable we use SAT Math Avg. Score, which has a correlation of 0.8885 with SAT Writing Avg. Score across the full data set.

In [2]:
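# Correlation with the auxiliary variable (computed on the full data set)
corr_xy = df["SAT Math Avg. Score"].corr(df["SAT Writing Avg. Score"])

n = 80
seed = 440
# NOTE: replace=True samples with replacement, whereas the finite population
# correction (N - n) / N used in the variance estimates assumes sampling
# without replacement (replace=False).
sampled_df = df.sample(n=n, replace=True, random_state=seed)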
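For reference, a simple random sample in the usual sense is drawn without replacement, so no unit can appear twice. The block below is a minimal sketch using only the standard library; the helper name `draw_srs` is hypothetical, and the sizes (421 units, sample of 80) are assumed here only to mirror the notebook.

```python
import random


def draw_srs(population_size, sample_size, seed):
    """Return sorted unit indices of a simple random sample drawn without replacement."""
    rng = random.Random(seed)
    # random.sample never returns the same element twice
    return sorted(rng.sample(range(population_size), sample_size))


# Hypothetical sizes mirroring the notebook (N = 421 schools, n = 80)
srs_indices = draw_srs(population_size=421, sample_size=80, seed=440)
print(len(srs_indices), len(set(srs_indices)))  # both counts equal n, so no unit repeats
```

With pandas, the equivalent design would be `df.sample(n=n, replace=False, random_state=seed)`; the indices drawn would differ from the ones in this sketch.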

#### **Perform a diagnostic analysis to determine if x and y have a linear relationship based on the sample data. Do regression analysis y ∼ x**

In [3]:
import statsmodels.api as sm

X = sampled_df["SAT Math Avg. Score"]
y = sampled_df["SAT Writing Avg. Score"]

# Add a constant (intercept)
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Get the summary
print(model.summary())

                              OLS Regression Results                              
Dep. Variable:     SAT Writing Avg. Score   R-squared:                       0.815
Model:                                OLS   Adj. R-squared:                  0.813
Method:                     Least Squares   F-statistic:                     343.6
Date:                    Sat, 29 Mar 2025   Prob (F-statistic):           2.62e-30
Time:                            17:14:36   Log-Likelihood:                -377.75
No. Observations:                      80   AIC:                             759.5
Df Residuals:                          78   BIC:                             764.3
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------

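The p-value for the slope is effectively 0: in simple linear regression the t-test for the slope is equivalent to the overall F-test, for which Prob (F-statistic) = 2.62e-30. We therefore reject the null hypothesis that the slope is zero and conclude that there is a statistically significant linear relationship between SAT math score and SAT writing score in the sample ($R^2 = 0.815$).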
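The slope test reported in the summary can be reproduced by hand. The sketch below uses a small made-up data set (`toy_x`, `toy_y` are illustrative values, not the SAT data) and a hypothetical helper `slope_t_and_f`; it computes the slope, its t-statistic, and the F-statistic, and shows that $F = t^2$ when there is a single predictor.

```python
import math


def slope_t_and_f(x, y):
    """Return (slope, t-statistic for H0: slope = 0, F-statistic) for y ~ x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # least-squares slope
    a = y_bar - b * x_bar              # least-squares intercept
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                # residual mean square
    t_stat = b / math.sqrt(mse / sxx)  # slope divided by its standard error
    f_stat = (b ** 2 * sxx) / mse      # regression mean square over residual mean square
    return b, t_stat, f_stat


# Made-up data for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, t_stat, f_stat = slope_t_and_f(toy_x, toy_y)
print(f"slope={b:.3f}, t={t_stat:.2f}, F={f_stat:.1f}, t^2={t_stat ** 2:.1f}")
```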

#### **Based on the results of the regression analysis, make a conclusion about the appropriateness of using ratio and regression estimators.**

The p-value for the intercept is effectively 0, so we reject the null hypothesis that the intercept is zero and conclude that the intercept is significantly different from 0. As seen in the previous part, there is also a statistically significant linear relationship between SAT math scores and SAT writing scores.

Ratio estimators assume that the line has an intercept of 0, but this assumption is not met in this case. In contrast, the linear regression model does not require the intercept to be 0. Therefore, with this data, it is more appropriate to use the regression estimator rather than a ratio estimator.
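The effect of a nonzero intercept can be seen in a small noise-free sketch. The population below is hypothetical (it lies exactly on $y = 60 + 0.8x$) and the sample is deliberately taken from the low end of $x$; the helper names `ratio_estimate` and `regression_estimate` are illustrative, not part of the notebook.

```python
def ratio_estimate(x, y, mu_x):
    """Ratio estimator of the population mean of y: (sum y / sum x) * mu_x."""
    return (sum(y) / sum(x)) * mu_x


def regression_estimate(x, y, mu_x):
    """Regression estimator of the population mean of y: a + b * mu_x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a + b * mu_x


# Hypothetical population lying exactly on the line y = 60 + 0.8 x
pop_x = list(range(300, 501, 10))
pop_y = [60 + 0.8 * xi for xi in pop_x]
mu_x = sum(pop_x) / len(pop_x)
mu_y = sum(pop_y) / len(pop_y)

# A sample whose x values are all below the population mean of x
sample_x, sample_y = pop_x[:5], pop_y[:5]

ratio_est = ratio_estimate(sample_x, sample_y, mu_x)
regression_est = regression_estimate(sample_x, sample_y, mu_x)
print(f"true mean={mu_y:.1f}, ratio={ratio_est:.1f}, regression={regression_est:.1f}")
```

Because the ratio estimator forces the fitted line through the origin, it misses the true mean whenever the sample mean of $x$ differs from $\mu_x$, while the regression estimator recovers the mean in this noise-free case.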

# 4. Estimate your parameter of interest by Ratio estimator. Estimate its variance and give a confidence interval of α level chosen in Report 2.

##### Calculate $\hat \mu_r$

- $y =$ avg writing score
- $x =$ avg math score
- $\alpha = .05$
- Ratio estimator of the true population average writing score:
- $\hat\mu_r = r \cdot \mu_x$
    - where $\mu_x$ is the population mean of the $x$'s
    - where $r = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i} = \frac{\bar y}{\bar x}$

In [4]:
alpha = .05
mu_x = df["SAT Math Avg. Score"].mean()
r = sum(sampled_df["SAT Writing Avg. Score"])/sum(sampled_df["SAT Math Avg. Score"])
mu_r_hat = r * mu_x
print(f"mu_r_hat is: {mu_r_hat} \n") 

mu_r_hat is: 395.02407708915626 



##### Estimate variance of $\hat\mu_r$
- $\hat{\text{var}} (\hat\mu_r) = \frac{N-n}{Nn} s^2_r$

- $s^2_r = \frac{1}{n-1}\sum_{i=1}^n (y_i - rx_i)^2$

In [5]:
N = df.shape[0]
s_r_squared = (1 / (n-1)) * sum((sampled_df["SAT Writing Avg. Score"] - r * sampled_df["SAT Math Avg. Score"])**2)
# Variance of the estimated mean: finite population correction times s_r^2 / n
var_hat_mu_r_hat = ((N-n)/(N*n)) * s_r_squared
print(f"var_hat_mu_r_hat is: {var_hat_mu_r_hat:.4f} \n")

var_hat_mu_r_hat is: 8.8682 



##### Confidence Interval
- Approximate $100(1-\alpha)\%$ CI for $\mu$: $\hat\mu_r \pm t_{n-1,\frac{\alpha}{2}}\sqrt{\hat{\text{var}} (\hat\mu_r)}$

In [6]:
from scipy.stats import t

t_crit = t.ppf(1-(alpha/2), n-1)
lowerBound = mu_r_hat - t_crit * np.sqrt(var_hat_mu_r_hat)
upperBound = mu_r_hat + t_crit * np.sqrt(var_hat_mu_r_hat)
print(f"95% CI for mu_r_hat is: ({lowerBound:.4f}, {upperBound:.4f}) \n")


95% CI for mu_r_hat is: (389.0966, 400.9515) 



# 5. Estimate your parameter of interest by Regression estimator. Estimate its variance and give a confidence interval of α level chosen in Report 2.

##### Calculate $\hat\mu_L$
- $\hat\mu_L = a + b*\mu_x$
- $a = \bar{y} - b * \bar{x}$
- $b = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

In [7]:
y_bar = sampled_df["SAT Writing Avg. Score"].mean()
x_bar = sampled_df["SAT Math Avg. Score"].mean()

bNumerator = sum((sampled_df["SAT Math Avg. Score"] - x_bar) * (sampled_df["SAT Writing Avg. Score"] - y_bar))
bDenominator = sum((sampled_df["SAT Math Avg. Score"] - x_bar)**2)
b = bNumerator / bDenominator

a = y_bar - b * x_bar

mu_L_hat = a + b * mu_x

print(f"mu_L_hat is: {mu_L_hat} \n")

mu_L_hat is: 396.8558588854824 



##### Estimate Variance of $\hat\mu_L$
- $\hat{\text{var}}(\hat{\mu_L}) = \frac{N-n}{Nn(n-2)}\sum_{i=1}^n(y_i - a - bx_i)^2$

In [8]:
var_hat_mu_L_hat = ((N - n) / (N * n * (n - 2))) * sum(((sampled_df["SAT Writing Avg. Score"] - a - b * sampled_df["SAT Math Avg. Score"])**2))
print(f"var_hat_mu_L_hat is: {var_hat_mu_L_hat} \n")

var_hat_mu_L_hat is: 7.678518718573295 



##### Confidence Interval
- $100(1-\alpha)\%$ CI: $\hat{\mu_L} \pm t_{n-2, \frac{\alpha}{2}}\sqrt{\hat{\text{var}}(\hat{\mu_L})}$

In [9]:
t_crit = t.ppf(1-(alpha/2), n-2)
lowerBound = mu_L_hat - t_crit * np.sqrt(var_hat_mu_L_hat)
upperBound = mu_L_hat + t_crit * np.sqrt(var_hat_mu_L_hat)
print(f"95% CI for mu_L_hat is: ({lowerBound}, {upperBound}) \n")

95% CI for mu_L_hat is: (391.3391937391112, 402.37252403185363) 



# 6. Choose the best estimator of your parameter based on estimated variance.

In [10]:
if var_hat_mu_r_hat > var_hat_mu_L_hat:
    print("mu_L_hat is a better estimator of the parameter as it has a lower estimated variance of", var_hat_mu_L_hat)
else:
    print("mu_r_hat is a better estimator of the parameter as it has a lower estimated variance of", var_hat_mu_r_hat)

mu_L_hat is a better estimator of the parameter as it has a lower estimated variance of 7.678518718573295


In this case, the regression estimator $\hat{\mu}_L$ is the better estimator of $\mu$: its estimated variance of 7.68 is lower than the estimated variance of the ratio estimator.
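The comparison above relies on variance estimates from a single sample. As a rough cross-check of the general pattern, the sketch below repeatedly draws simple random samples from a hypothetical population with a nonzero intercept and compares the empirical variances of the two estimators. The population, noise level, sample size, and replication count are all assumptions chosen for illustration; none of them come from the SAT data.

```python
import random
import statistics


def ratio_estimate(x, y, mu_x):
    """Ratio estimator of the population mean of y."""
    return (sum(y) / sum(x)) * mu_x


def regression_estimate(x, y, mu_x):
    """Regression estimator of the population mean of y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return (y_bar - b * x_bar) + b * mu_x


rng = random.Random(0)

# Hypothetical population: y = 60 + 0.8 x + noise, so the intercept is not 0
pop_x = [rng.uniform(300, 600) for _ in range(400)]
pop_y = [60 + 0.8 * xi + rng.gauss(0, 10) for xi in pop_x]
mu_x = statistics.fmean(pop_x)
mu_y = statistics.fmean(pop_y)

ratio_estimates, regression_estimates = [], []
for _ in range(2000):
    # Simple random sample of 50 units, drawn without replacement
    idx = rng.sample(range(len(pop_x)), 50)
    xs = [pop_x[i] for i in idx]
    ys = [pop_y[i] for i in idx]
    ratio_estimates.append(ratio_estimate(xs, ys, mu_x))
    regression_estimates.append(regression_estimate(xs, ys, mu_x))

ratio_var = statistics.pvariance(ratio_estimates)
regression_var = statistics.pvariance(regression_estimates)
print(f"empirical variance: ratio={ratio_var:.3f}, regression={regression_var:.3f}")
```

Under these assumptions the regression estimator shows the smaller empirical variance, which agrees with the direction of the comparison made from the sample.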

# 7. Calculate the true regression coefficients.
Namely, fit the regression $y \sim x$ using the whole data set (the population). Does your conclusion in part 3 change? If so, how?

In [12]:
x = df["SAT Math Avg. Score"]
y = df["SAT Writing Avg. Score"]

# Add a constant (intercept)
x = sm.add_constant(x)

# Fit the model
model = sm.OLS(y, x).fit()

# Get the summary
print(model.summary())

                              OLS Regression Results                              
Dep. Variable:     SAT Writing Avg. Score   R-squared:                       0.789
Model:                                OLS   Adj. R-squared:                  0.789
Method:                     Least Squares   F-statistic:                     1570.
Date:                    Sat, 29 Mar 2025   Prob (F-statistic):          8.43e-144
Time:                            17:16:23   Log-Likelihood:                -1983.0
No. Observations:                     421   AIC:                             3970.
Df Residuals:                         419   BIC:                             3978.
Df Model:                               1                                         
Covariance Type:                nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------

The true regression coefficients are $a = 61.0737$ and $b = 0.8054$. The $p$-values for the intercept and the slope are both effectively $0$, so our conclusions from part $3$ do not change. The linearity assumption is met, but the regression line does not pass through $(0,0)$, so the regression estimator is more appropriate than the ratio estimator.