Drop regressors that are collinear with the fixed effects (depending on tolerance for partialling-out) #221

Closed
moritzdrechselgrau opened this issue Feb 27, 2023 · 0 comments

Comments

@moritzdrechselgrau
Contributor

With large datasets and multiple fixed effects, under the default tolerance setting of tol = 1e-6, regressors that are collinear with the fixed effects may not be omitted even though they clearly should be.

In Stata's reghdfe, these regressors are dropped because of an additional check that compares the sum of squares of each variable before and after partialling out the fixed effects (for residualized collinear variables, the sum of squares is very close to zero).
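
The idea behind this check can be illustrated with a single fixed effect, where partialling out reduces to demeaning within groups. The following is only a minimal sketch in Base Julia plus Statistics; `demean_by_group` and `ratio` are hypothetical helpers written for this illustration, not functions from FixedEffectModels or reghdfe.

```julia
using Statistics

# Hypothetical helper: partial out one fixed effect by subtracting group means.
function demean_by_group(x::AbstractVector{<:Real}, g::AbstractVector)
    sums = Dict{eltype(g), Float64}()
    counts = Dict{eltype(g), Int}()
    for (xi, gi) in zip(x, g)
        sums[gi] = get(sums, gi, 0.0) + xi
        counts[gi] = get(counts, gi, 0) + 1
    end
    return [xi - sums[gi] / counts[gi] for (xi, gi) in zip(x, g)]
end

state = repeat(1:4, inner = 5)                 # 4 groups, 5 observations each
highstate = Float64.(state .< median(state))   # constant within state, so collinear with the state FE
pop = Float64.(1:20)                           # varies within state

# Sum of squares after partialling out, relative to the sum of squares before.
ratio(x) = sum(abs2, demean_by_group(x, state)) / sum(abs2, x)

ratio_collinear = ratio(highstate)   # essentially zero
ratio_regular = ratio(pop)           # clearly positive
```

With several fixed effects the partialling-out is iterative and only approximate, so the ratio for a collinear variable is small rather than exactly zero, which is why a threshold is needed.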

Here is a minimal working example using the Cigar.csv data in the repo. The dataset has to be enlarged a bit to reproduce the problem.

using DataFrames, CSV, FixedEffectModels, Random, StatsBase

# read the data
df = DataFrame(CSV.File(joinpath(dirname(pathof(FixedEffectModels)), "../dataset/Cigar.csv")))

# create a bigger dataset
sort!(df, [:State, :Year])
nstates = maximum(df.State)
dflarge = copy(df)
for i in 1:100
    dfnew = copy(df)
    dfnew.State .+= i .* nstates
    append!(dflarge, dfnew)
end

# create a dummy variable that is collinear with the State-FE
dflarge.highstate = dflarge.State .< median(dflarge.State)

# create a second 'high-dimensional' categorical variable
Random.seed!(1234)
dflarge.catvar = rand(1:200, nrow(dflarge))

# run the regression with the default setting (tol = 1e-6)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-6)

# run the regression with a lower tolerance (tol = 1e-8)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-8)

With the default setting (tol = 1e-6), highstate is not recognized as collinear:

                             Fixed Effect Model
============================================================================
Number of obs:                 139380  Degrees of freedom:                 1
R2:                             0.988  R2 Adjusted:                    0.988
F-Stat:                       549.718  p-value:                        0.000
R2 within:                      0.026  Iterations:                         7
============================================================================
Price     |   Estimate  Std.Error    t value Pr(>|t|)   Lower 95%  Upper 95%
----------------------------------------------------------------------------
highstate |   0.445994    21244.6 2.09933e-5    1.000    -41649.1    41650.0
Pop       | 0.00102457 3.09008e-5    33.1569    0.000 0.000963994 0.00108515
============================================================================

Reducing the tolerance 'fixes' the issue because the function FixedEffectModels.invsym! essentially uses sqrt(eps()) as the tolerance criterion for variables with very small sums of squares, i.e. collinear ones. The more precise the partialling-out, the more likely this function detects the collinearity. With tol = 1e-8, highstate is dropped:

                           Fixed Effect Model
=========================================================================
Number of obs:               139380   Degrees of freedom:               1
R2:                           0.988   R2 Adjusted:                  0.988
F-Stat:                     1101.09   p-value:                      0.000
R2 within:                    0.026   Iterations:                       9
=========================================================================
Price     |   Estimate  Std.Error t value Pr(>|t|)   Lower 95%  Upper 95%
-------------------------------------------------------------------------
highstate |        0.0        NaN     NaN      NaN         NaN        NaN
Pop       | 0.00102414 3.08638e-5 33.1827    0.000 0.000963636 0.00108465
=========================================================================

I do not think that simply changing the default tolerance solves this issue. I will shortly submit a PR that implements the procedure of Stata's reghdfe: drop a variable when its sum of squares after residualizing, divided by its sum of squares before residualizing, is smaller than min(1e-6, tol / 10).
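
The rule can be sketched as follows. The function name `collinear_with_fe` and its arguments are hypothetical and only illustrate the criterion stated above; the code in the actual PR may look different.

```julia
# Hypothetical sketch: flag a variable as collinear with the fixed effects when
# its sum of squares after residualizing, divided by its sum of squares before
# residualizing, is below min(1e-6, tol / 10).
function collinear_with_fe(x_before::AbstractVector{<:Real},
                           x_after::AbstractVector{<:Real};
                           tol::Real = 1e-6)
    ss_before = sum(abs2, x_before)
    ss_after = sum(abs2, x_after)
    # Assumption made for this sketch: an all-zero variable counts as collinear,
    # which also avoids dividing by zero.
    ss_before == 0 && return true
    return ss_after / ss_before < min(1e-6, tol / 10)
end

x = [1.0, 1.0, 0.0, 0.0]

# Tiny residual left over by inexact partialling-out: flagged.
flag_small = collinear_with_fe(x, [1e-5, -1e-5, 1e-5, -1e-5])

# Substantial within variation remains: not flagged.
flag_large = collinear_with_fe(x, [0.5, -0.5, 0.5, -0.5])
```

Because the threshold is tied to tol, a looser partialling-out tolerance does not loosen the collinearity check beyond 1e-6, while a tighter one makes the check stricter.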

eloualiche pushed a commit that referenced this issue Mar 6, 2023
…pending on tolerance (#222)

* drop regressors that are collinear with the FE depending on tolerance

* fix bug (initialize collinearity status to false instead of true)

* remove one-line functions sumofsquares and iscollinear_fe

* add test for detecting regressors that are collinear with fixed effects

* use sum(abs2, x) instead of x'*x to compute sum of squares