With large datasets and multiple fixed effects, and with the default tolerance setting of tol = 1e-6, regressors that are collinear with the fixed effects may not be omitted even though they clearly should be.
In Stata's reghdfe, these regressors are dropped because of an additional check that compares the sum of squares of each variable before and after partialling out the fixed effects (for residualized collinear variables, the sum of squares is very close to zero).
Here is a minimal working example using the Cigar.csv data in the repo, which has to be tweaked a bit (enlarged and given an extra categorical variable) to reproduce the problem.
using DataFrames, CSV, FixedEffectModels, Random, StatsBase
# read the data
df = DataFrame(CSV.File(joinpath(dirname(pathof(FixedEffectModels)), "../dataset/Cigar.csv")))
# create a bigger dataset
sort!(df, [:State, :Year])
nstates = maximum(df.State)
dflarge = copy(df)
for i in 1:100
    dfnew = copy(df)
    dfnew.State .+= i .* nstates
    append!(dflarge, dfnew)
end
# create a dummy variable that is collinear with the State-FE
dflarge.highstate = dflarge.State .< median(dflarge.State)
# create a second 'high-dimensional' categorical variable
Random.seed!(1234)
dflarge.catvar = rand(1:200, nrow(dflarge))
# run the regression with the default setting (tol = 1e-6)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-6)
# run the regression with a lower tolerance (tol = 1e-8)
reg(dflarge, @formula(Price ~ highstate + Pop + fe(Year) + fe(catvar) + fe(State)), Vcov.cluster(:State); tol=1e-8)
Running the regression with the default settings where highstate is not recognized as collinear:
                             Fixed Effect Model
============================================================================
Number of obs:                139380  Degrees of freedom:                  1
R2:                            0.988  R2 Adjusted:                     0.988
F-Stat:                      549.718  p-value:                         0.000
R2 within:                     0.026  Iterations:                          7
============================================================================
Price     |   Estimate  Std.Error    t value Pr(>|t|)   Lower 95%  Upper 95%
----------------------------------------------------------------------------
highstate |   0.445994    21244.6 2.09933e-5    1.000    -41649.1    41650.0
Pop       | 0.00102457 3.09008e-5    33.1569    0.000 0.000963994 0.00108515
============================================================================
Reducing the tolerance 'fixes' the issue because the function FixedEffectModels.invsym! essentially uses sqrt(eps()) as the tolerance criterion for variables with very small sums of squares, i.e. collinear ones. The more precise the partialling-out, the more likely this function detects the collinearity.
                             Fixed Effect Model
============================================================================
Number of obs:                139380  Degrees of freedom:                  1
R2:                            0.988  R2 Adjusted:                     0.988
F-Stat:                      1101.09  p-value:                         0.000
R2 within:                     0.026  Iterations:                          9
============================================================================
Price     |   Estimate  Std.Error    t value Pr(>|t|)   Lower 95%  Upper 95%
----------------------------------------------------------------------------
highstate |        0.0        NaN        NaN      NaN         NaN        NaN
Pop       | 0.00102414 3.08638e-5    33.1827    0.000 0.000963636 0.00108465
============================================================================
I do not think that simply changing the default tolerance solves this issue. I will shortly submit a PR that implements the procedure of Stata's reghdfe, which is to drop variables for which the sum of squares after residualizing, divided by the sum of squares before residualizing, is smaller than min(1e-6, tol / 10).
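To make the ratio check concrete, here is a rough sketch on toy data. It is not the PR code: demean_by_group and is_collinear_with_fe are hypothetical helper names, and the residualization is simplified to demeaning within a single fixed effect rather than the iterative partialling-out that FixedEffectModels performs.

```julia
using Statistics

# Hypothetical helper: residualize x on ONE fixed effect by demeaning within groups.
function demean_by_group(x::AbstractVector{<:Real}, g::AbstractVector)
    out = float.(x)
    for level in unique(g)
        idx = findall(==(level), g)
        out[idx] .-= mean(out[idx])
    end
    return out
end

# Hypothetical helper: flag x as collinear with the fixed effect when the
# sum of squares after residualizing is tiny relative to the one before.
function is_collinear_with_fe(x, g; tol = 1e-6)
    before = sum(abs2, x)
    after = sum(abs2, demean_by_group(x, g))
    return after / before < min(1e-6, tol / 10)
end

# Toy data: 10 states with 5 observations each.
state = repeat(1:10, inner = 5)
highstate = state .< median(state)   # constant within state, so collinear with the state FE
pop = collect(1.0:50.0)              # varies within state

is_collinear_with_fe(highstate, state)   # true
is_collinear_with_fe(pop, state)         # false
```

Because the threshold is relative to the variable's own sum of squares and scales with tol, the check does not depend on how precisely the fixed effects were partialled out in the way an absolute cutoff does.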
…pending on tolerance (#222)
* drop regressors that are collinear with the FE depending on tolerance
* fix bug (initialize collinearity status to false instead of true)
* remove one-line functions sumofsquares and iscollinear_fe
* add test for detecting regressors that are collinear with fixed effects
* use sum(abs2, x) instead of x'*x to compute sum of squares