# Problem set 4


Anh Nguyen Tran

Example Homework

03/14/2024

In [None]:
*Set up your data

set maxvar 32000
use GSS_1972_2021.dta, clear
eststo clear
keep rincome age educ race sex partyid polviews

*Once again, don't forget to do "eststo clear"

### SHORT RECAP: Why do we care about heteroscedasticity?

1. Biased standard errors (SEs): SEs are used to calculate confidence intervals (CIs), and you want your CIs to reflect the population as accurately as possible. Don't forget that we also use CIs for hypothesis testing, so biased SEs have implications for statistical significance too.
2. OLS assumes "constant variance of errors" (relook at Kyle's notes for all the assumptions of the OLS model). If there is heteroscedasticity in your model, this assumption is violated, which calls the validity of your model's inferences into question.

## 1. Visually assess heteroscedasticity [1 pt]

    A. Create a scatterplot of a dependent variable and independent variable of interest from a dataset of your choice.
    
    B. Include a fitted line with an area graph of the confidence interval for the prediction.
    
    C. Write a couple sentences describing how the distribution of the data in the graph does or does not appear to be heteroscedastic.

In [None]:
%set graph_height = 8

In [None]:
%set graph_width = 11

In [None]:
****NOTE: Once again, for simplicity, I recommend doing just bivariate models.

* PART A & B: Scatterplot between an interval-ratio dependent variable and a nominal variable
// Charlie used an interval-ratio dependent variable and an interval-ratio independent variable
    // that makes it easier for him to graphically see heteroscedasticity (he just looks across his data for any unevenness)

tw (scatter rincome partyid) ///
(lfitci rincome partyid), ///
ytitle("Income", size(large)) ///
xtitle(,size(large)) legend(off) scheme(538w) ///
title("Income and Party Identification" " ", span size(large)) ///
aspect(1, place(west))

*REMINDER: The "lfitci" completes the requirement for Part B (create a fitted line with a confidence interval shadow).

In [None]:
codebook partyid

### PART C: Interpreting heteroscedasticity

Remember: nominal variables have no inherent order (e.g., being Catholic (number label 1) is not lesser than being Protestant (number label 2)) or spacing (e.g., the number labels 1 and 2 just stand for categories; the numbers themselves carry no inherent value). That's why it's hard to just glance at this plot and see whether heteroscedasticity is occurring.

You can still look at the spread of the income within each category of partyid to see if there are any patterns. If the spread or variability of income appears to be different across the various categories of partyid, that might suggest a form of heteroscedasticity.

From my plot, the vertical spread of the points does not seem to change systematically across the categories of party identification. The scatter appears relatively constant, so I see no visual evidence of heteroscedasticity.
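
One way to back up the eyeball test is to compare the spread of income within each party category directly. Here is a minimal sketch, assuming the same `rincome` and `partyid` variables loaded above. The choice of statistics and the box plot are my own illustration, not part of the assignment.

```stata
* Standard deviation and count of income within each partyid category
* (roughly similar SDs across categories are consistent with constant variance)
tabstat rincome, by(partyid) statistics(sd n)

* Box plots by category show the same comparison visually
graph box rincome, over(partyid)
```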

## 2. Test for heteroscedasticity [1 pt]

    A. Do a Breusch-Pagan postestimation test for heteroscedasticity in your dependent variable and independent variable relationship.

    B. Reestimate the regression after logging the DV or IV if appropriate and do another Breusch-Pagan test. Does logging reduce heteroscedasticity?

In [None]:
*PART A: Breusch-Pagan test
// Run basic regression first before doing any postestimation command

quietly reg rincome partyid
estat hettest

### Breusch-Pagan test recap and interpretation

WHAT CHARLIE STATED IN LECTURE: "A large chi2 statistic and low probability for chi2 indicates that heteroskedasticity is a problem."

Why is this? The Breusch-Pagan test has a null hypothesis that the variance of the residuals is constant. The alternative hypothesis is that the variance of the residuals is NOT constant (which is heteroscedasticity). If the p-value is less than 0.05, you can reject the null hypothesis.

LOOKING AT MY EXAMPLE (focus on the p-value when you are doing this):
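1. The degrees of freedom are 1. By default, `estat hettest` tests against the fitted values of the dependent variable, which count as one variable (here that matches my single independent variable).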
2. Since the p-value of 0.0176 is less than 0.05, I am rejecting the null hypothesis of constant variance. This suggests that there is statistically significant evidence of heteroscedasticity in my model.
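
If you want to pull the test results out instead of reading them off the output, `estat hettest` leaves them behind in `r()`. Here is a minimal sketch, assuming the same bivariate model as above. The local macro names (`chi2`, `df`, `p`) are hypothetical.

```stata
quietly reg rincome partyid
estat hettest

* estat hettest stores the test statistic, its degrees of freedom, and the p-value
local chi2 = r(chi2)
local df   = r(df)
local p    = r(p)

display "chi2(`df') = " %6.2f `chi2' "   p = " %6.4f `p'
if `p' < 0.05 {
    display "Reject the null of constant variance at the 5% level"
}
else {
    display "Cannot reject the null of constant variance at the 5% level"
}
```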

In [None]:
*PART B: Logging variables if needed (most of you may be skipping this step)
// Log any dependent variables you all have that are very wide-ranging (don't log age or years of education)
    // Log variables like college net price or state grant aid (numbers that are very wide-ranging, huge, and positive)

gen rincomeln = log(rincome)

quietly reg rincomeln partyid
estat hettest

### tldr...

There's still heteroscedasticity going on here even after I log income.
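
Before logging, it can help to check that the variable is actually a good candidate. Here is a minimal sketch, assuming `rincome` and the `rincomeln` variable generated above. What counts as "skewed enough" is a judgment call.

```stata
* log() returns missing for zero or negative values, so count them first
count if rincome <= 0

* Skewness and the percentiles show how wide-ranging the variable is
summarize rincome, detail

* After logging, check how many observations were lost to missing values
count if missing(rincomeln) & !missing(rincome)
```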

## 3. Bootstrap your standard errors [1 pt]

    A. Quietly reestimate your regression coefficient with conventional OLS and store the results.
    
    B. Quietly reestimate your regression with bootstrapped standard errors and store the results.
    
    C. Use esttab to output the results of the two models and tell us how the bootstrap standard errors differ from the conventional results.
    
    D. Explain in your own words what the bootstrap procedure is doing and why it yields similar or different standard errors to the conventional model.

In [None]:
*PART A: Basic OLS

eststo: quietly reg rincomeln partyid

In [None]:
*PART B: OLS with Bootstrapped SEs
// NOTE: To be clear, bootstrapping standard errors still works when you have a nominal independent variable for OLS.

eststo: quietly bootstrap _b[partyid], rep(1000) nodots : ///
    reg rincomeln partyid

In [None]:
* PART C: Creating the table

esttab, ///
mlabels("OLS" "Bootstrap") ///
collabels(none)  ///
cells(b(star fmt(2)) se(fmt(2) par)) ///
starlevels(^ .1 * .05 ** .01 *** .001) 

### PART D: Interpretation

Basically, you may see changes in the standard errors (the numbers in the parentheses) and potentially in statistical significance as well. When heteroscedasticity is present, the bootstrap standard errors are often larger than the conventional ones, because they do not rely on the constant-variance assumption.

For my example, there is no change, which suggests that the conventional OLS standard errors are already reliable here. Basically, my original SEs already provide a good estimate of the variability in my coefficient, and the coefficients estimated from the bootstrap resamples vary about as much as the conventional formula predicts.

What is bootstrapping doing though? Bootstrapping is a resampling technique used to estimate the sampling distribution of an estimator by sampling (with replacement) from the original data. Basically, Stata draws many new samples (1,000 here) of the same size as your data, with replacement, reruns the regression on each one, and uses the standard deviation of the resulting coefficients as the standard error.
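
To see what the `bootstrap` prefix is doing under the hood, here is a hand-rolled sketch of the same idea. It assumes `rincomeln` and `partyid` exist as above. The seed, the 200 resamples, and the names `results`, `bsfile`, and `b_partyid` are arbitrary choices for illustration, so the number will be close to, but not identical to, what `bootstrap` reports.

```stata
* Hand-rolled bootstrap of the partyid coefficient (illustration only)
set seed 12345                              // arbitrary seed so the draws can be reproduced

tempname results
tempfile bsfile
postfile `results' b_partyid using `bsfile'

forvalues i = 1/200 {                       // 200 resamples keeps the sketch quick
    preserve
    keep if !missing(rincomeln, partyid)    // resample only the usable observations
    bsample                                 // draw N observations with replacement
    quietly reg rincomeln partyid
    post `results' (_b[partyid])            // save this resample's coefficient
    restore
}
postclose `results'

* The SD of the saved coefficients is the bootstrap standard error
preserve
use `bsfile', clear
summarize b_partyid
restore
```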

## 4. Estimate robust standard errors [1 pt]

    A. Reestimate your model with robust standard errors and store the results.
    
    B. Use esttab to output the results of the robust model alongside the conventional and bootstrap models and explain how the results compare in 1 or 2 sentences.
    
    C. In your own words, explain how the robust standard errors procedure differs from conventional procedures.

In [None]:
*PART A: OLS with Robust SEs

eststo: quietly reg rincomeln partyid, robust

In [None]:
*PART B: Creating the table (again but this time with the new model too)

esttab, ///
mlabels("OLS" "Bootstrap" "Robust") ///
collabels(none)  ///
cells(b(star fmt(2)) se(fmt(2) par)) ///
starlevels(^ .1 * .05 ** .01 *** .001) 

### PART C: Interpretation

What is going on in my data? There appears to be no change in standard errors between my OG model and the robust model. I can conclude this: the conventional OLS standard errors seem adequate for these data, and the effect of any heteroscedasticity (if it's there) might be minor enough not to noticeably affect the standard errors.

What are robust standard errors though? As Charlie stated in lecture, they are "error estimates that apply more weight to larger deviations and less weight to smaller deviations". In practice, each observation's own squared residual enters the formula, so the estimate no longer assumes that every observation has the same error variance.

In general, you should report robust standard errors by default. Even if nothing changes, this shows that your conclusions do not hinge on the constant-variance assumption.
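
To make the "more weight to larger deviations" idea concrete, here is a sketch that rebuilds the robust SE of the slope by hand for a bivariate model. It assumes `rincomeln` and `partyid` as above. The variable names `ehat`, `xdev2`, and `wt` are hypothetical helpers that get dropped at the end. The formula is the standard heteroscedasticity-robust variance for a single slope, with the n/(n-2) small-sample correction that Stata's `robust` option applies.

```stata
quietly reg rincomeln partyid
predict double ehat if e(sample), residuals             // OLS residuals

quietly summarize partyid if e(sample)
local n    = r(N)
local xbar = r(mean)
gen double xdev2 = (partyid - `xbar')^2 if e(sample)    // squared deviation of x from its mean
gen double wt    = xdev2 * ehat^2                       // each squared residual, weighted by xdev2

quietly summarize xdev2
local sxx = r(sum)
quietly summarize wt
local num = r(sum)

* Robust variance of the slope: n/(n-2) * sum(xdev2 * e^2) / (sum(xdev2))^2
display "Robust SE by hand:    " sqrt((`n'/(`n'-2)) * `num' / (`sxx')^2)

* Compare with Stata's own number
quietly reg rincomeln partyid, robust
display "Robust SE from Stata: " _se[partyid]

drop ehat xdev2 wt
```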

## 5. Cluster robust standard errors [1 pt]

    A. Explain why or why not your model should be estimated with cluster robust standard errors. If yes, what is the clustering unit and why?
    
    B. If yes, reestimate your model with cluster robust standard errors and use esttab to output the results of the robust model alongside your other models and explain how the results compare in 1 or 2 sentences.
    
    C. Write a couple sentences explaining what is the best method of standard error estimation for your models and why.

### PART A

As Charlie stated in lecture, I would not do cluster robust standard errors for party identification or political ideology. Why? There are too few "clusters" there. The clusters for party identification would be, for instance, Democrats, Republicans, and Independents (just 3 groups).

You would want to do cluster robust SEs for people from different regions (like the 50 US states) or different schools (most datasets with schools again have over 50 of them).
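
A quick way to check whether a candidate clustering variable gives you enough groups is to count them. Here is a minimal sketch using `polviews` only because it is in the data kept above; it is not a recommendation to cluster on it.

```stata
* How many distinct groups would the clustering variable give me?
tabulate polviews

* After a clustered regression, Stata also stores the number of clusters
quietly reg rincomeln partyid, vce(cluster polviews)
display "Number of clusters: " e(N_clust)
```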

In [None]:
*PART B: Cluster robust SEs (most of you will be skipping this)
// NOTE: I am doing it just to show it, but I wouldn't do it for polviews in a normal circumstance.

eststo: quietly reg rincomeln partyid,  cluster(polviews)

esttab, ///
mlabels("OLS" "Bootstrap" "Robust" "Cluster") ///
collabels(none) drop(_cons) ///
cells(b(star fmt(2)) se(fmt(2) par)) ///
starlevels(^ .1 * .05 ** .01 *** .001)  legend

### Interpreting for Part B if needed:

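Once again, there's no change in the SEs for this model, which indicates that there may not be much clustering in the variance. However, the cluster model (the last column) is no longer statistically significant. One likely reason: with clustered SEs, Stata bases the significance test on the number of clusters minus one, and polviews gives only a handful of clusters, so the bar for significance is higher even when the SE looks the same. If you have a strong technical basis for clustering (e.g., you are looking at students from different schools and clustering on school), a result like this would indicate that your findings may not be significant after all once you account for the correlation between the error terms within each cluster.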

### PART C: Which model should you choose?

Base your choice on your data and your model.

BOOTSTRAP:
1. This can be done on representative but smaller samples (it depends on the complexity of your model, but with a sample size between 30 and 200, you might want to consider this method).
2. Use for more complex models (like ones with interaction terms, non-linear relationships, or a lot of covariates).

ROBUST:
1. This is more appropriate for larger sample sizes (again, it depends on the complexity of your model; with simpler models, you can get by with smaller samples).
2. Use for linear models if you see evidence of heteroscedasticity.

CLUSTER:
1. This should be based primarily on your dataset. Use this if you have enough clusters (see Part A) and you believe there may be similarities between people within those clusters.