Drop regressors that are collinear with the fixed effects (depending on tolerance for partialling-out) #221

moritzdrechselgrau · 2023-02-27T16:37:46Z

With large datasets and multiple fixed effects, regressors that are collinear with the fixed effects may not be omitted under the default tolerance setting of tol = 1e-6, even though they clearly should be.

In Stata's reghdfe, these regressors are dropped because of an additional check that compares the sum of squares of each variable before and after partialling out the fixed effects (for residualized collinear variables, the sum of squares is very close to zero).
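
As a rough illustration of that check (a minimal sketch, not the code of reghdfe or FixedEffectModels), the snippet below partials out a single fixed effect by subtracting group means and compares the sum of squares before and after. The helper `demean_by` and the toy data are made up for this example.

```julia
using Statistics

# Hypothetical helper: partial out one fixed effect by subtracting group means.
function demean_by(x::AbstractVector{<:Real}, g::AbstractVector)
    out = float.(x)
    for level in unique(g)
        idx = findall(==(level), g)
        out[idx] .-= mean(out[idx])
    end
    return out
end

state = repeat(1:10, inner = 5)        # 10 states, 5 observations each
highstate = state .< median(state)     # constant within each state
noise = sin.(1:50)                     # varies within each state

# Share of the sum of squares that survives partialling out the state FE.
ss_ratio(x) = sum(abs2, demean_by(x, state)) / sum(abs2, float.(x))

ratio_collinear = ss_ratio(highstate)  # essentially zero
ratio_regular = ss_ratio(noise)        # clearly positive
```

With a single fixed effect the demeaning is exact, so the ratio for highstate is zero up to floating-point error. With several fixed effects the partialling-out is iterative, and the ratio only gets as small as the solver tolerance allows.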

Here is a minimal working example based on the Cigar.csv data in the repo. The data has to be enlarged a bit to reproduce the problem.

using DataFrames, CSV, FixedEffectModels, Random, StatsBase

# read the data
df = DataFrame(CSV.File(joinpath(dirname(pathof(FixedEffectModels)), "../dataset/Cigar.csv")))

# create a bigger dataset
sort!(df, [:State, :Year])
nstates = maximum(df.State)
dflarge = copy(df)
for i in 1:100
    dfnew = copy(df)
    dfnew.State .+= i .* nstates
    append!(dflarge, dfnew)
end

# create a dummy variable that is collinear with the State-FE
dflarge.highstate = dflarge.State .< median(dflarge.State)

# create a second 'high-dimensional' categorical variable
Random.seed!(1234)
dflarge.catvar = rand(1:200, nrow(dflarge))

# run the regression with the default setting (tol = 1e-6)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-6)

# run the regression with a lower tolerance (tol = 1e-8)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-8)

Running the regression with the default settings, highstate is not recognized as collinear:

                             Fixed Effect Model
============================================================================
Number of obs:                 139380  Degrees of freedom:                 1
R2:                             0.988  R2 Adjusted:                    0.988
F-Stat:                       549.718  p-value:                        0.000
R2 within:                      0.026  Iterations:                         7
============================================================================
Price     |   Estimate  Std.Error    t value Pr(>|t|)   Lower 95%  Upper 95%
----------------------------------------------------------------------------
highstate |   0.445994    21244.6 2.09933e-5    1.000    -41649.1    41650.0
Pop       | 0.00102457 3.09008e-5    33.1569    0.000 0.000963994 0.00108515
============================================================================

Reducing the tolerance 'fixes' the issue because the function FixedEffectModels.invsym! essentially uses sqrt(eps()) as the tolerance criterion for variables with very small sums of squares, i.e. collinear ones. The more precise the partialling-out, the more likely this function detects the collinearity.

                           Fixed Effect Model
=========================================================================
Number of obs:               139380   Degrees of freedom:               1
R2:                           0.988   R2 Adjusted:                  0.988
F-Stat:                     1101.09   p-value:                      0.000
R2 within:                    0.026   Iterations:                       9
=========================================================================
Price     |   Estimate  Std.Error t value Pr(>|t|)   Lower 95%  Upper 95%
-------------------------------------------------------------------------
highstate |        0.0        NaN     NaN      NaN         NaN        NaN
Pop       | 0.00102414 3.08638e-5 33.1827    0.000 0.000963636 0.00108465
=========================================================================
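
The dependence on the precision of the partialling-out can be reproduced in a toy setting. The sketch below uses plain alternating demeaning over two groupings, which is only a stand-in for illustration (the package's actual solver may work differently). The helpers `demean!` and `residual_share` and the data are assumptions made up for this example.

```julia
using Statistics

# Hypothetical helper: subtract group means in place.
function demean!(x::Vector{Float64}, g::Vector{Int})
    for level in unique(g)
        idx = findall(==(level), g)
        x[idx] .-= mean(x[idx])
    end
    return x
end

# Share of the sum of squares left after `sweeps` rounds of alternating demeaning.
function residual_share(x, g1, g2, sweeps)
    r = float.(x)
    for _ in 1:sweeps
        demean!(r, g1)
        demean!(r, g2)
    end
    return sum(abs2, r) / sum(abs2, float.(x))
end

n = 600
state = repeat(1:20, inner = 30)            # 20 states, 30 observations each
catvar = [mod(i, 7) + 1 for i in 1:n]       # second grouping, not nested in state
highstate = state .<= 10                    # collinear with the state FE

share_1 = residual_share(highstate, catvar, state, 1)
share_5 = residual_share(highstate, catvar, state, 5)
```

After a single sweep a small but nonzero share of the sum of squares of highstate remains, and it shrinks with every further sweep. A check against a fixed cutoff therefore only fires once the partialling-out has been iterated far enough.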

I do not think that simply changing the default tolerance solves this issue. I will shortly submit a PR that implements the procedure of Stata's reghdfe, which drops a variable when its sum of squares after residualizing, divided by its sum of squares before residualizing, is smaller than min(1e-6, tol / 10).
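
The rule can be sketched as follows. This is only an illustration of the threshold described above, not the code from the PR. The function name `drops_as_collinear` is hypothetical, and treating an all-zero variable as collinear is an extra assumption of this sketch.

```julia
# Hypothetical helper illustrating the rule: flag a regressor when the share of
# its sum of squares that survives residualizing falls below min(1e-6, tol / 10).
function drops_as_collinear(ss_before::Real, ss_after::Real; tol::Real = 1e-6)
    # Assumption of this sketch: an all-zero variable counts as collinear.
    ss_before == 0 && return true
    return ss_after / ss_before < min(1e-6, tol / 10)
end

flag_default = drops_as_collinear(100.0, 1e-6)             # ratio 1e-8, cutoff 1e-7
flag_tight = drops_as_collinear(100.0, 1e-6; tol = 1e-8)   # ratio 1e-8, cutoff 1e-9
flag_regular = drops_as_collinear(100.0, 1.0)              # ratio 1e-2, cutoff 1e-7
```

Because the cutoff scales with tol, a tighter solver tolerance also demands a smaller residual before a variable is dropped, so the check keeps pace with the precision of the partialling-out.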

…pending on tolerance (#222)
* drop regressors that are collinear with the FE depending on tolerance
* fix bug (initialize collinearity status to false instead of true)
* remove one-line functions sumofsquares and iscollinear_fe
* add test for detecting regressors that are collinear with fixed effects
* use sum(abs2, x) instead of x'*x to compute sum of squares

moritzdrechselgrau mentioned this issue Mar 6, 2023

address issue #221: drop regressors that are collinear with the FE depending on tolerance #222

Merged

matthieugomez closed this as completed Mar 6, 2023
