# ECON 140: Econometrics (UC Berkeley)

## Homework #2: Class Size and Student Performance

### Brandon Lee Concepcion, 3037601855, UC Berkeley

Below is all my code, with accompanying answers, for ECON 140: Econometrics HW #2.

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [3]:
df = pd.read_csv('caschool.csv', delimiter=';')
df.head()

Unnamed: 0,testscr,str,avginc,el_pct
0,690.8,17.88991,22.69,0.0
1,661.2,21.52466,9.824,4.583333
2,643.6,18.69723,8.978,30.0
3,647.7,17.35714,8.978,0.0
4,640.85,18.67133,9.080333,13.85768


---

## Question 1. Estimating the regression equation 

We have been asked to report the estimated intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$ for the regression model defined below:

$$ testscr_i = \beta_0 + \beta_1 str_i + u_i$$

We can definitely make use of our equations from lecture in order to calculate our desired quantities!
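
For reference, these are the standard OLS formulas for a simple regression, written in the variable names of this model (an overline denotes a sample mean). This is a restatement of the textbook result rather than anything specific to this dataset:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (str_i - \overline{str})(testscr_i - \overline{testscr})}{\sum_{i=1}^{n} (str_i - \overline{str})^2}, \qquad \hat{\beta}_0 = \overline{testscr} - \hat{\beta}_1 \, \overline{str} $$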

In [6]:
X = df["str"]
y = df['testscr']

X_mean = X.mean()
y_mean = y.mean()

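# Sums of cross-products and squared deviations; the 1/(n - 1) factors of the
# sample covariance and variance cancel in the ratio, so they are omitted
sample_cov = np.sum((X - X_mean) * (y - y_mean))
sample_var = np.sum((X - X_mean)**2)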

b_1 = sample_cov / sample_var
b_0 = y_mean - b_1 * X_mean

print(f'We found b0 to be {b_0}')
print(f'We found b1 to be {b_1}')

We found b0 to be 698.9329772025982
We found b1 to be -2.2798095950788158


### Interpreting the Coefficients

Looking at our intercept, it tells us that when $str = 0$ the predicted test score is $698.93$, since $\hat{\beta}_0 = 698.93$. Do note that a student-teacher ratio of $0$ (or less) cannot occur in practice, so the intercept has no meaningful real-world interpretation here; it simply anchors the regression line.

Interpreting our slope coefficient, $\hat{\beta}_1$, we can see that increasing the number of students per teacher by 1 is associated with a **reduction** of $2.28$ points in the average test score. This is because $\hat{\beta}_1 = -2.28$ based on our calculations.
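
As a sanity check on the hand-rolled formulas, the same estimates can be compared against NumPy's built-in least-squares fit. The sketch below is only an illustration: the helper `ols_simple` and the `*_demo` arrays are hypothetical, and the data is simulated rather than read from `caschool.csv`.

```python
import numpy as np

def ols_simple(x, y):
    """Return (intercept, slope) from the closed-form simple-regression formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    slope = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Simulated (hypothetical) data, NOT the caschool.csv values
rng = np.random.default_rng(0)
x_demo = rng.uniform(14, 26, size=200)
y_demo = 700 - 2.3 * x_demo + rng.normal(0, 15, size=200)

b0_demo, b1_demo = ols_simple(x_demo, y_demo)

# np.polyfit returns coefficients from the highest power down: [slope, intercept]
slope_np, intercept_np = np.polyfit(x_demo, y_demo, deg=1)

print(np.isclose(b0_demo, intercept_np), np.isclose(b1_demo, slope_np))
```

If both comparisons print `True`, the closed-form calculation agrees with the library fit on that data.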

---

## Question 2. Using the regression equation 

Predict the average test score for districts with the following class sizes (student-teacher ratios):

$$STR = 14, 20, 25, 30 $$

Comment on how predicted performance changes as class size increases.

In [7]:
given_STR = np.array([14, 20, 25, 30])

predicted_scores = b_0 + b_1 * given_STR

for i in range(len(given_STR)):
    print(f'The predicted test score for a student-teacher ratio of {given_STR[i]} is {predicted_scores[i]}')

The predicted test score for a student-teacher ratio of 14 is 667.0156428714947
The predicted test score for a student-teacher ratio of 20 is 653.3367853010219
The predicted test score for a student-teacher ratio of 25 is 641.9377373256278
The predicted test score for a student-teacher ratio of 30 is 630.5386893502337


### Interpretation

Looking at our values, we can clearly see that as we increase the number of students per teacher, the predicted average test score decreases linearly. This was to be expected, since the estimated slope $\hat{\beta}_1$ of our regression equation is negative: each additional student per teacher lowers the prediction by the same $2.28$ points.

--- 

## Question 3. Suppose the government is adding more students per class

- By how many points would the predicted test score change if we increase class size by 10 students (holding everything else constant)?
- What about an increase of 5 students? 20 students?

Use your $\hat{\beta_1}$ to answer each

In [8]:
increase_by_10 = b_1 * 10
increase_by_5 = b_1 * 5
increase_by_20 = b_1 * 20

print(f'The predicted test score would change by {increase_by_10} points if we increase class size by 10 students')
print(f'The predicted test score would change by {increase_by_5} points if we increase class size by 5 students')
print(f'The predicted test score would change by {increase_by_20} points if we increase class size by 20 students')

The predicted test score would change by -22.79809595078816 points if we increase class size by 10 students
The predicted test score would change by -11.39904797539408 points if we increase class size by 5 students
The predicted test score would change by -45.59619190157632 points if we increase class size by 20 students


### Interpretation

Assuming we held everything else constant:
- The predicted test score would **decrease** by $22.80$ points (a change of $-22.80$) if we were to add 10 students
- The predicted test score would **decrease** by $11.40$ points (a change of $-11.40$) if we were to add 5 students
- The predicted test score would **decrease** by $45.60$ points (a change of $-45.60$) if we were to add 20 students
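
Because the fitted line is linear, the predicted change depends only on the size of the increase, not on the starting class size. A minimal sketch of that point, using the rounded estimates printed earlier in this notebook (the helper names `predicted_score` and `predicted_change` are illustrative and not part of the assignment):

```python
# Rounded estimates reported earlier in this notebook
b_0 = 698.933
b_1 = -2.2798

def predicted_score(str_value):
    """Predicted average test score at a given student-teacher ratio."""
    return b_0 + b_1 * str_value

def predicted_change(delta_str):
    """Change in the predicted score when the ratio rises by delta_str."""
    return b_1 * delta_str

# The difference between two predictions equals b_1 * delta, whatever the start
for start in (15, 20):
    diff = predicted_score(start + 10) - predicted_score(start)
    print(start, round(diff, 3), round(predicted_change(10), 3))
```

Both starting points give the same difference, which is why only $\hat{\beta}_1$ is needed to answer this question.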

---

## Question 4. Constructing a $95\%$ and $99\%$ confidence interval for $\beta_1$

- Based on these intervals, do you conclude that class size has a statistically significant effect on test scores?
- Explain the economic and statistical interpretation of your results

In [12]:
n = df.shape[0]

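# Homoskedasticity-only standard error of b_1:
# standard error of the regression (n - 2 degrees of freedom) divided by
# the square root of the sum of squared deviations of X
se_b1 = np.sqrt(np.sum((y - (b_0 + b_1 * X))**2) / (n - 2)) / np.sqrt(np.sum((X - X_mean)**2))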

# Calculate the t-values for 95% and 99% confidence intervals
t_95 = stats.t.ppf(0.975, n - 2)
t_99 = stats.t.ppf(0.995, n - 2)

# Calculate the confidence intervals
ci_95 = [b_1 - t_95 * se_b1, b_1 + t_95 * se_b1]
ci_99 = [b_1 - t_99 * se_b1, b_1 + t_99 * se_b1]

print(f'The 95% confidence interval for beta_1 is {ci_95}')
print(f'The 99% confidence interval for beta_1 is {ci_99}')


The 95% confidence interval for beta_1 is [np.float64(-3.222981302773272), np.float64(-1.3366378873843592)]
The 99% confidence interval for beta_1 is [np.float64(-3.5214263854300545), np.float64(-1.038192804727577)]


### Interpretation

Taking the null hypothesis $H_0 : \beta_1 = 0$ against the two-sided alternative $H_A : \beta_1 \neq 0$, we reject the null at both the $5\%$ and $1\%$ significance levels, because **neither** the $95\%$ interval $[-3.22, -1.34]$ nor the $99\%$ interval $[-3.52, -1.04]$ contains $0$. Both intervals lie entirely below $0$, so class size has a statistically significant, negative association with test scores.

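An economic interpretation of these results is that each additional student per teacher is associated with a drop of about $2.28$ points in the district's average test score, with the $95\%$ interval placing that drop between roughly $1.34$ and $3.22$ points. Note that this is an association estimated from a single-regressor model, not necessarily a causal effect.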
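
The interval construction above can be wrapped in a small function and checked against `scipy.stats.linregress`, whose `stderr` attribute is the same homoskedasticity-only standard error of the slope. This is a sketch under stated assumptions: `slope_confidence_interval` is a hypothetical helper, and the data is simulated rather than taken from `caschool.csv`.

```python
import numpy as np
from scipy import stats

def slope_confidence_interval(x, y, level=0.95):
    """Return (slope, se, (lower, upper)) using the homoskedasticity-only SE."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_dev = x - x.mean()
    slope = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    se = np.sqrt(np.sum(residuals ** 2) / (n - 2)) / np.sqrt(np.sum(x_dev ** 2))
    # Two-sided critical value from the t distribution with n - 2 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - level) / 2, n - 2)
    return slope, se, (slope - t_crit * se, slope + t_crit * se)

# Simulated (hypothetical) data, NOT the caschool.csv values
rng = np.random.default_rng(1)
x_demo = rng.uniform(14, 26, size=300)
y_demo = 700 - 2.3 * x_demo + rng.normal(0, 18, size=300)

slope, se, ci_95 = slope_confidence_interval(x_demo, y_demo, level=0.95)
_, _, ci_99 = slope_confidence_interval(x_demo, y_demo, level=0.99)

check = stats.linregress(x_demo, y_demo)
print(np.isclose(se, check.stderr))
print(ci_95, ci_99)
```

The $99\%$ interval is always wider than the $95\%$ interval around the same point estimate, which matches the pattern in the results above.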

---

## Question 5. Marginal Impact 

The California government will only invest money in reducing class sizes if the *marginal impact of decreasing class size by one student* is at least 2 points in average test scores. Formally test this claim at the $5\%$ significance level

- Write the null and alternative hypothesis in terms of parameter(s)
- Use your regression results to test it at the $95\%$ confidence level
- What is your recommendation?

In [13]:
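# Left-tailed test of H0: beta_1 = -2 against HA: beta_1 < -2
t_statistic = (b_1 - (-2)) / se_b1
# One-sided p-value: probability of a t-statistic at least this negative under H0
p_value = stats.t.cdf(t_statistic, n - 2)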

print(f'The t-statistic is {t_statistic}')
print(f'The p-value is {p_value}')

The t-statistic is -0.5831485983539039
The p-value is 0.2800538740968571


### Interpretation

We can define our hypotheses as follows: 
- Null hypothesis $H_0 : \beta_1 = -2 $
- Alternative hypothesis $H_A : \beta_1 < -2 $

Looking at our results, we see that our t-statistic is $-0.58$ and our one-sided p-value is $0.28$, which is not less than the $0.05$ cutoff required to reject the null hypothesis. As a result, we **fail to reject the null hypothesis** at the $5\%$ significance level. Although the point estimate ($-2.28$) is larger than 2 points in magnitude, the evidence is not sufficient to conclude that decreasing class size by one student raises average test scores by more than 2 points. Therefore, under the stated criterion, the California government should not invest in reducing class sizes based on this evidence.
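
The decision rule used here can be sketched as a small reusable function. The name `left_tailed_t_test` is hypothetical, and the inputs in the example are assumptions for illustration: `se=0.48` is a rounded value roughly implied by the $95\%$ interval above, and `dof=418` assumes 420 districts, which this notebook does not print.

```python
from scipy import stats

def left_tailed_t_test(estimate, null_value, se, dof, alpha=0.05):
    """Test H0: beta = null_value against HA: beta < null_value."""
    t_stat = (estimate - null_value) / se
    # Left-tail probability of the t distribution with dof degrees of freedom
    p_value = stats.t.cdf(t_stat, dof)
    return t_stat, p_value, p_value < alpha

# Assumed, rounded inputs (see the lead-in): a slope close to the one estimated here
t_stat, p_value, reject = left_tailed_t_test(-2.28, -2.0, se=0.48, dof=418)
print(t_stat, p_value, reject)

# Hypothetical contrast: a much steeper slope with the same standard error
t_alt, p_alt, reject_alt = left_tailed_t_test(-3.5, -2.0, se=0.48, dof=418)
print(t_alt, p_alt, reject_alt)
```

With a slope only slightly below $-2$, the test does not reject; the hypothetical steeper slope shows what a rejection would look like.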

---

## Question 6. Summarizing our Findings 

Summarize your findings in a short paragraph for a policy brief to the California government: 

- Should they reduce class sizes? Justify your recommendation using your regression results and the confidence intervals. 
- Be critical about why (or why not) the government should trust these results. How would you extend the analysis to address potential drawbacks?

### Our Interpretation

The confidence intervals we generated at the 95% and 99% confidence levels *do* show that class size has a statistically significant effect on test scores: decreasing class size by one student is associated with a higher average test score (about 2.28 points). However, we could not show that this improvement is statistically greater than 2 points. As seen in the previous question, if the California government requires an improvement of at least 2 points for every one-student reduction in class size, then at the 5% significance level we are **unable** to confirm that the requirement is met.

While these results do show a statistically significant relationship between student-teacher ratio and test scores, class size is the only factor included in our analysis. In reality, there are many other factors which *we did not account for* (for example district income, `avginc`, or the share of English learners, `el_pct`, both of which appear in the dataset) that could also affect test scores and may be correlated with class size, in which case our slope estimate would be biased. Extending our model from a Simple Linear Regression to a Multiple Linear Regression that controls for such factors would make the analysis more robust, and could help identify the key areas of improvement for students.
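
To illustrate why the extension matters, the sketch below fits a simple and a multiple regression on simulated data in which class size is correlated with income. Everything here is an assumption made for illustration: the helper `ols` is hypothetical, the column names only mirror the dataset, and the coefficients are invented, so the numbers say nothing about the real `caschool.csv` results.

```python
import numpy as np

rng = np.random.default_rng(2)
n_demo = 500

# Simulated (hypothetical) districts; values are invented for illustration
avginc = rng.uniform(5, 40, size=n_demo)
el_pct = rng.uniform(0, 60, size=n_demo)
# Richer districts get smaller classes, so class size and income are correlated
str_demo = 24 - 0.15 * avginc + rng.normal(0, 1.5, size=n_demo)
# Invented "true" model: the class-size effect is -1.0 once income is held fixed
testscr = (680 - 1.0 * str_demo + 1.5 * avginc - 0.5 * el_pct
           + rng.normal(0, 8, size=n_demo))

def ols(columns, y):
    """Least-squares coefficients [intercept, slopes...] for the given regressors."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

simple = ols([str_demo], testscr)
multiple = ols([str_demo, avginc, el_pct], testscr)

# Slope on class size without and with the additional controls
print(simple[1], multiple[1])
```

In this simulation the simple regression overstates the class-size effect because it also picks up the effect of income, while the multiple regression recovers a slope close to the invented value of $-1.0$.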

---