Initial design meeting #16

Closed
korenmiklos opened this issue Apr 9, 2024 · 22 comments

@korenmiklos
Member

With @gergelyattilakiss we are building a Julia package to help typical applied economics workflows for data cleaning, as well as exploratory and regression analysis. The syntax follows that of Stata (R), a statistical tool widely used by economists. Our work is inspired by https://github.com/TidierOrg/Tidier.jl and https://github.com/jmboehm/Douglass.jl.

The motivation is to help more economists adopt Julia, giving them a performant scientific computing language that they can use not only for macroeconomics simulations, but also applied microeconomics data work.

We have some ideas for the design of the tool (see code example below). We are also analyzing Stata code produced by economists, as submitted to journals, to study the patterns of coding and the most frequent commands used.

Please join us for a design meeting to discuss what such a tool should and should not do. If you have strong views about Julia, Stata, or applied economics, please come and share them.

If you can join for a 2-hour Zoom meeting at the end of April, please let us know in a comment below.

@chain df begin
    @keep country_code year gdp population
    @generate log_gdp = log(gdp)
    @generate log_population = log(population)
    @egen min_log_gdp = min(log_gdp)
    @replace log_gdp = min_log_gdp @if missing(log_gdp)
    @collapse log_gdp log_population, by(country_code)
    @regress log_gdp log_population, robust
end
@korenmiklos
Member Author

Can you join @kdpsingh, @jmboehm, @cpfiffer, @tpapp? It would be great to hear your thoughts on this!

@tpapp

tpapp commented Apr 9, 2024

@korenmiklos, have you looked at Query.jl, DataFramesMeta.jl, and SplitApplyCombine.jl? What are you missing from these (and Douglass.jl) that requires a new package?

@floswald

floswald commented Apr 9, 2024

I feel a bit like the uninvited guest who comes to spoil the party - sorry! Anyway, my aim is not to spoil the party. I've talked with @jmboehm about this many times, and I think @tpapp's question is spot on. What exactly is missing, and what is peculiar about applied economists' workflows here? Here is a plain vanilla DataFrames.jl pipeline. @pdeffebach could probably chip in some useful bits from DataFramesMeta.jl.

using CSV
using DataFrames
using GLM
using Chain
using Statistics

gapminder() = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/gapminder/gapminder.csv"), 
                    DataFrame)

function pipeline(d::DataFrame)
    @chain d begin
        select(:country, :year, :gdpPercap, :pop)
        transform([:pop, :gdpPercap] .=> (x -> log.(x)) .=> [:logpop, :loggdpPercap])
        # Fill missing values with the column minimum
        transform(:loggdpPercap => (x -> replace(x, missing => minimum(skipmissing(x)))) => :loggdpPercap)
        groupby(:country)
        combine([:logpop,:loggdpPercap] .=> mean .=> [:logpop,:loggdpPercap])
        lm(@formula(loggdpPercap ~ logpop), _)
    end
end

function run()
    d = gapminder()
    pipeline(d)
end

run()

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

loggdpPercap ~ 1 + logpop

Coefficients:
───────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   8.21888      1.00019     8.22    <1e-12   6.24144   10.1963
logpop       -0.00381135   0.0631324  -0.06    0.9519  -0.128627   0.121005
───────────────────────────────────────────────────────────────────────────

@maiaguell

yes, please!! I would love to join and thanks for the initiative!

@kdpsingh

kdpsingh commented Apr 9, 2024

@korenmiklos, thanks for including me. I don't have the bandwidth this month to join a 2-hour Zoom mainly because I have a lot of travel coming up this month and next month.

I agree with the sentiment that it's worth figuring out the value proposition, but I think that's generally true of all new packages and shouldn't stop you from experimenting.

A great example of this is the finalfit package in R. Even though regression is baked into R, finalfit provides a unifying interface to fixed effects and mixed effects models as well as publication-ready tables. Similarly, I would think through where you have friction in your workflow and would prioritize those things for Kezdi.

@pdeffebach

pdeffebach commented Apr 9, 2024

Glad you are trying to expand applied micro-economics uses in Julia!

I agree with the commenters above that the current data cleaning ecosystem has very good "bones". I don't think we necessarily need a new data cleaning package. Here is the pipeline you have written, expressed in DataFramesMeta.jl:

julia> df = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/gapminder/gapminder.csv"), DataFrame);

julia> m = @chain df begin
           @select :country :year :gdpPercap :pop
           @rtransform begin
               :logpop = log(:pop)
               :loggdpPercap = log(:gdpPercap)
           end
           @by :country begin
               :logpop = mean(:logpop)
               :loggdpPercap = mean(:loggdpPercap)
           end
           lm(@formula(loggdpPercap ~ logpop), _)
       end;

A few differences from above

  • We explicitly write mean(:loggdpPercap) instead of having @collapse automatically use the mean. I view explicitly saying mean as a good thing. To avoid writing mean twice we could also do [:logpop, :loggdpPercap] .=> mean, which seems okay.
  • We use :x instead of x to refer to variable names. This is intentional! A major hassle in dplyr is that what x means depends on whether a column exists in a data frame or not. Being able to visually distinguish local variables and columns is a major plus. Stata doesn't have this problem, of course. But Julia is a "real" programming language, unlike Stata. Greater flexibility requires more syntax to avoid confusion. Note that you can also use variable names programmatically in DataFramesMeta.jl via $.
  • We have two versions of macros, @rtransform and @transform. The first is for row-wise operations, and the second for column-wise. This is somewhat akin to gen vs egen in Stata.
  • Side note: You may be interested in the recent @label and @note macros introduced in DataFramesMeta.jl, to emulate Stata's metadata features.
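
To make the row-wise versus column-wise distinction concrete, here is a minimal sketch in plain Julia, without DataFramesMeta.jl. The vector `pop` and its values are made up for illustration; only Base broadcasting and the Statistics standard library are used.

```julia
using Statistics

# Hypothetical population figures, chosen so the logs are round numbers.
pop = [100.0, 1_000.0, 10_000.0]

# Row-wise (in the spirit of @rtransform, or Stata's gen):
# each output element depends only on the same row of the input.
logpop = log10.(pop)

# Column-wise (in the spirit of @transform, or Stata's egen):
# each output element can depend on the whole column, here through its mean.
demeaned = logpop .- mean(logpop)
```

The first operation can be evaluated one row at a time, while the second needs the full column before any row can be computed.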

This is to say: Maybe the syntax is occasionally more complicated in DataFramesMeta.jl. But the differences are the result of real tradeoffs. I don't think re-writing a new data-cleaning library is worth it at the moment.

What should be done instead

However I think new developers devoted to micro-economics is a great idea! The number one issue I would like to see is for our statistics and regression packages. Currently, we have three main regression packages.

  • FixedEffectModels.jl, maintained by @matthieugomez . Matthieu is a busy AP at Columbia and I don't think he has time to maintain this complicated package in a way to make it as robust as, say, fixest in R.
  • Econometrics.jl, maintained by @Nosferican, who is working at the Fed and is likewise very busy
  • GLM.jl, which aims to be a pretty minimal package, emulating Base R's glm and lm without the features economists need.

Additionally, the HypothesisTests.jl and StatsBase.jl packages don't really have maintainers anymore. @andreasnoack occasionally reviews PRs but there are still bugs to be fixed. Additionally, StatsBase.jl could use much more support accommodating missing values and PRs would (probably) be welcomed. (Missing values are contentious for a variety of reasons, and probably not a good place for a newcomer to start working on PRs).

As for other estimation tools

  • There is nascent work on GMM with GMMTools.jl, again run by an AP at an R1 university. I think Gabriel would appreciate a summer intern who can contribute to this estimation.
  • We don't have a good DiD package similar to did in R.
  • @nilshg has a SynthControl.jl package for Synthetic controls. Maybe someone can help with the development of that package.
  • We have no package for marginal effects, certainly not as good as the excellent marginaleffects package in R.

So my take is that if a micro-economist wants to use Julia, data cleaning is not the issue. There are some data cleaning things that would make life easier. Two things I think about are (1) better missing values support (which is kind of in my court, I have some PRs that can improve things) and (2) better data viewing. Stata's viewer is excellent, but it shouldn't be impossible to write a Qt-based data viewer.

These are marginal, however, compared to the minuscule size of the Julia statistics and econometrics ecosystem compared to R and Stata. That's where the energy needs to be.

@korenmiklos
Member Author

Thanks all!

My purpose with this package is as follows. Stata is doing something well. A large chunk of applied micro work is in Stata (we will have precise numbers by end of April). If we want applied microeconomists to use Julia, we need to offer something that is as easy to use as Stata and easy to switch to. Julia has great existing tools, but they don't fill this gap.

@pdeffebach: I will look more into these packages. I agree there is value to be added there. Let's get more users, bigger communities, maybe it will help build the more innovative packages, too.

@korenmiklos
Member Author

From @gbekes

Oh you can guess what I'll say.

  1. Why not base it on #fixest in R. It's comprehensive. It's the basis of PyFixest? lrberge.github.io/fixest/
  2. In terms of use cases, what could be better than github.com/gabors-data-an…
  3. Also check out LOST lost-stats.github.io/Model_Estimati…

@gbekes

gbekes commented Apr 9, 2024

Indeed. And let me tag Laurent @lrberge for R fixest and Alex @s3alfisc for his Python version, PyFixest.
Lemme also tag @vincentarelbundock for modelsummary

My strong view is that the fixest way -- a regression wrapper with a clear stance + flexibility is the way ahead, and the same syntax should work in all languages.

@matthieugomez

matthieugomez commented Apr 9, 2024 via email

@jmboehm

jmboehm commented Apr 10, 2024

Thanks Miklos for tagging me.

As discussed with some of you, I feel that Julia would be able to bridge the two-language problem of economics, which is that (1) data cleaning and reduced-form work is much simpler in Stata/R; (2) structural work is much easier/faster in Python/C/Matlab/Julia. Karandeep has done terrific work on bringing R's tidyverse to Julia, but I'm not an R user so I still do my reduced-form work mostly in Stata.

My goal for starting Douglass.jl was to have a package that can take Stata code almost one-for-one and generate Julia code (DataFrames.jl/DataFramesMeta.jl). My difficulty when using Julia's tabular data packages is exactly that I have to think very carefully about how missing values propagate, and I'd prefer to have exactly the behavior of Stata (that's somehow hardwired into my brain). I haven't had as much time for free software development as I would have liked, so I haven't taken this all the way to the end and developed a polished package. Perhaps, if I get a large grant one day, I'll get a student to spend a summer expanding and polishing the package. But as always the cost is in the maintenance, not in the initial development. For that you need to get a critical mass of users and developers.

Tbh I don't see a point in developing a package that departs significantly from existing syntax or behavior: the lower the switching costs, the better. I share @pdeffebach's view that missing values are a problem (well, at least for me), and that data viewing is not as smooth as in e.g. Stata.

I'm travelling a lot for seminars at the end of the month, so not sure I can join, but keep me in the loop about the date/time.

@grantmcdermott

grantmcdermott commented Apr 11, 2024

Someone just forwarded me this link, so please excuse another uninvited house guest.

@korenmiklos it’s your package and your prerogative, but I’d just like to plus 1 all the comments about caution against reinventing the wheel here and introducing yet another syntax to Julia. DataFrames.jl and derivatives are all very well developed now. If you do want another frontend, then it makes much, much more sense to mimic the tidyverse API than Stata’s, as @kdpsingh has already done. I mean this both in terms of numbers of absolute users, econ or not, and also the clear influence that the tidyverse has had on other languages and packages from Ibis to Polars to Tidier.jl etc.

What is clearly missing from Julia’s applied micro toolset IMO has already been mentioned above: marginaleffects, table-writing (at least that is as good as modelsummary & co.), inconsistent missings treatment, hypothesis testing and vcov adjustments, etc. Moreover, quite a lot of the Julia universe still suffers from poor documentation. I wrote something to this effect on the Julia Forums back in 2021 and I still feel like this is where the biggest bang for your buck is going to be. At the same time, you have to make a value proposition for why an applied economist should learn Julia instead of say R (or even Python). And I think the sales pitch here is much harder, since you are ignoring some of Julia’s obvious advantages (which are really felt in structural and macro).

@droodman

droodman commented Apr 15, 2024

I'm curious to join. I'm in DC, but will be in California April 29-30.

I posted a working paper yesterday about using Julia as a back end for Stata and other environments, the motivating example being reghdfejl for Stata, which wraps FixedEffectModels.jl. I think the underlying julia package for Stata is getting pretty solid. I wouldn't be surprised if, say, Stata version 20 officially supports Julia.

I agree with the comments above that while confusing syntax is sometimes a barrier (contrasts=Dict(:v1=>DummyCoding()) instead of i.v1 for non-CategoricalVectors), and I appreciate the efforts to overcome that, bigger issues are poor documentation and lack of basic features like a data viewer and common regression and inference methods. It sounds like there are issues with missing. And I have seen no language for expressing hypotheses. (Stata has a standard for expressing hypotheses and constraints, especially linear ones.)

I wonder about institutional vs technical solutions. E.g., can Julia establish a documentation standard with incentives for compliance, whereby it would become normal for package developers to document all the functions and all their options in one place?! Another example of an institutional solution: Stata created the Stata Journal to reward academics for contributing software, in a currency academics appreciate, publication. It also has a mechanism for generating top-10 lists of most-downloaded packages, from SSC anyway.

I wonder if it matters more what would make developers switch to Julia than what would make regular users switch. If I'm a young econometrician with a clever new method, I'm probably going to implement in R first and maybe Stata and/or Matlab too, for obvious reasons. My new paper makes the case for Julia as a universal back end development environment. Maybe back ends are the back door to getting Julia used more as a front end too, in the long run? If the cores of useful, well-maintained packages are already in Julia, polishing the user experience becomes easier.

There's also the issue of money. IIRC NumPy and the like were solidified with public or private grant funds, on the idea that those packages are global public goods. Chan-Zuckerberg has given grants for Julia work. To get $ for stats work in Julia, one would probably need to make the case for its distinctive potential to serve users of many software platforms.

@korenmiklos
Member Author

@andrasvereckei

@korenmiklos
Member Author

I sent out invites for the meeting next Thursday, 2-4pm Budapest time. If you want to join but did not get an invite, please reply to this post.

All inputs are super useful and will be taken into account. I will also share the statistical analysis we are currently doing and the outcome of the design discussion here.

@droodman

droodman commented Apr 25, 2024 via email

@maiaguell

maiaguell commented Apr 26, 2024 via email

@pdeffebach

I did not get the invite, either, but I can try and hop on the call

@pdeffebach

I apologize for the abrupt and rude entry to the meeting.

I wanted to clarify the parsing rules situation. You are correct that xxx @when yyy parses as a tuple. I apologize for being dismissive. I was wrong.

The issue I was recalling arises when expressions are in a begin block. I think Julia wants either a new line or an end, but not multiple expressions on the same line.

julia> macro rsubset(args...)
           for arg in args
               dump(arg)
           end
       end;

julia> macro when(args...)
           nothing
       end;

julia> @rsubset begin
           :y = 1 + 2 @when :z == 1
       end
ERROR: ParseError:
# Error @ REPL[23]:2:16
@rsubset begin
    :y = 1 + 2 @when :z == 1
#              └───────────┘ ── Expected `end`
Stacktrace:
 [1] top-level scope
   @ none:1

Without the begin block, things work fine:

julia> @rsubset :y = 1 + 2 @when :z == 1

DataFramesMeta.jl uses begin ... end for many transformations that get passed to the same transform call. Putting each transformation on its own line with @gen might solve this problem, but might also cause issues inside a @chain block. Having multiple transformations within a begin ... end block also helps leverage DataFrames.jl's performance features for many transformations at once (which get multithreaded). In contrast to something like Tidier.jl, DataFramesMeta.jl tries to adhere closely to the DataFrames.jl API.

I don't want to discourage you from a new data manipulation library. I think it's good for the ecosystem to have many iterations on the same overall design.

A Wald Test package or Marginal Effects package, that has a "Julian" API (interacts nicely with the StatsModels API, for example) would be of broad interest to the community. I think a focus on a Stata-like API for running regressions runs two risks.

  1. It does not have enough features, maybe it's easy to run OLS but hard to run high-dimensional fixed effects, or a Wald Test, marginal effects, etc.
  2. Features are reliant on macros or idiosyncratic API specifications, making it hard for other packages to leverage any innovations and hard to make things work programmatically and "at scale".

Again, I apologize for my rudeness this morning.

Peter

@droodman

droodman commented May 2, 2024

I'll just add that WildBootTests.jl can do non-bootstrapped Wald and score tests after OLS, IV, and even ML estimation. However the interface is low-level. You express a linear hypothesis, or set of hypotheses, Rb=r by passing R and r. And you tell it the model, and in the case of ML the estimation result, in a similarly low-level way rather than passing a fitted reg() result. That's OK when using it as a back end in Stata or R. I intend to make it accept fitted Julia regression results. But I can't make it accept hypotheses in a nicer way until a formula-like language for expressing them is developed.
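
For readers unfamiliar with the R and r notation, here is a minimal sketch of a Wald statistic for a linear hypothesis Rb = r after OLS, written in plain Julia. It is not the WildBootTests.jl interface: the data, the helper name `wald_statistic`, and the homoskedastic variance estimator are all assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical toy data: a constant and two regressors, five observations.
X = [1.0 0.0 1.0;
     1.0 1.0 0.0;
     1.0 2.0 1.0;
     1.0 3.0 3.0;
     1.0 4.0 2.0]
y = [4.1, 2.9, 8.2, 15.9, 15.0]

# OLS coefficients and a homoskedastic variance estimate (an assumption;
# robust or clustered variances would replace V here).
b = X \ y
resid = y - X * b
n, k = size(X)
s2 = dot(resid, resid) / (n - k)
V = s2 * inv(X' * X)

# Wald statistic for H0: R*b = r, i.e. (Rb - r)' (R V R')^(-1) (Rb - r).
function wald_statistic(b, V, R, r)
    d = R * b - r
    return dot(d, (R * V * R') \ d)
end

# Hypothesis: the two slope coefficients are equal, b[2] - b[3] = 0.
R = [0.0 1.0 -1.0]
r = [0.0]
W = wald_statistic(b, V, R, r)
```

Each row of R encodes one linear restriction, so a formula-like language for hypotheses would essentially have to translate an expression such as "b2 = b3" into a pair (R, r) like the one above. Under the null, W is asymptotically chi-squared with as many degrees of freedom as R has rows.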

@korenmiklos
Member Author

No worries, @pdeffebach, thanks for joining and sharing your thoughts.

The command ... if condition syntax is both ubiquitous in Stata and genuinely helpful, so I want to support it. Every Stata command begins with a reserved word, which we can use to parse @when (or @where) statements in the rest of the expression.

I will look into the other stats packages mentioned.

Tomorrow I will create+share a summary of this meeting and close this issue.

@korenmiklos korenmiklos added this to the Design milestone May 7, 2024
@korenmiklos
Member Author

Thanks @droodman, @maiaguell, @gergelyattilakiss, @floswald, @jmboehm, @pdeffebach, @andrasvereckei for the productive meeting. My summary notes follow; they do not imply that you agree with all of this.

I am closing this, but we can continue the discussion under individual issues related to design.


State of the art

The Stata universe

  1. Most applied microeconomists use Stata (70% of REStud packages, followed by Matlab 50%, and R 16%)
  2. Often combined with another language (e.g. Python for cleaning, Matlab for simulation)
  3. Vast majority (88%) of Stata scripts are devoted to data cleaning

The Julia universe

  1. DataFrames.jl de facto standard for tabular data
  2. Many grammars for data cleaning: Query, DataFramesMeta, TidierData

Broad goal:

Port Stata syntax and tools to Julia, like Tidier.jl did tidyverse.

Key tradeoff:

Users like convenience and sensible default choices. But explicit, verbose software is less bug prone.

Be mindful of trade-off throughout the project. Maybe the user can calibrate their level of risk tolerance.

Missing pieces in the Julia data universe

  1. Missing values
    1. Stata has common sense defaults
      1. also some quirky behavior, like . > anything
    2. Risky choices, make them explicit
    3. Input/output (how to read and write missing values) vs algebra (what is 4 + missing?)
    4. Type conversion is a pain, cannot put a missing into a vector of Floats
  2. Better documentation for existing packages
  3. Maintainers, curation for existing regression packages
  4. Wald test
    1. formula language for linear constraints
    2. test gender == schooling + 5
  5. ML estimation package
    1. standard errors, clustering
    2. regtables
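
The algebra and type-conversion points in item 1 can be illustrated in plain Julia. This is a small sketch using only Base; the vector `x` is made up for the example.

```julia
# Algebra: missing propagates through arithmetic and comparisons.
sum_with_missing = 4 + missing        # missing
cmp_with_missing = missing > 3        # missing, whereas Stata evaluates . > 3 as true

# Type conversion: a Vector{Float64} cannot hold a missing value.
x = [1.0, 2.0, 3.0]
assignment_failed = try
    x[1] = missing
    false
catch err
    err isa MethodError
end

# Widening the element type first makes the assignment legal.
xm = convert(Vector{Union{Missing, Float64}}, x)
xm[1] = missing

# Reductions need an explicit choice about what to do with missing values.
total = sum(skipmissing(xm))          # 5.0
```

The explicit `skipmissing` call is an instance of the "risky choices, make them explicit" point: a Stata-like default would make that choice silently on the user's behalf.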

Best of Stata

replace y = 0 if y < 0
regress y x if x > 0

contrasted with much harder syntax in Pandas, R, Julia.

if can be used with almost all commands. Convenient and verbose, no trade-off here. This feature should be implemented if at all possible.
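
For comparison, here is a sketch of what `replace y = 0 if y < 0` amounts to in plain Julia when `y` may contain missing values. The data are made up, and treating a missing condition as false is an assumption chosen to match Stata, where a missing value compares as larger than any number.

```julia
# Hypothetical column with one missing value.
y = [-1.5, 2.0, missing, -0.5]

# `y .< 0` is `missing` where y is missing, and a mask containing missing
# cannot be used for indexing, so missing conditions are mapped to false.
mask = coalesce.(y .< 0, false)

# Stata: replace y = 0 if y < 0
y[mask] .= 0.0
```

After this runs, the negative entries are zero and the missing entry is left untouched. The extra `coalesce` step is the kind of friction a Stata-like `@if` could hide behind a sensible default.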

Sensible default choices for missing values.

By default, operations are on variables (columns, vectors).

Opinion: variable scoping is interesting.

scalar n_users = 5
generate y = n_users + 1
replace y = . if y < n_users

BUT can lead to dangerous bugs:

scalar y = 5
generate y = y + 1

Contrasted with some existing grammars

  • explicitly refer to a df column, df.x, df[!, :x]
  • refer to symbol, like :x, :y or strings, "x" "y"
  • TidierData does it well, i.e., most like Stata

Explicit merge m:1 vs merge 1:1

by x: egen z = sum(1)
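
As a reference point for what a Julia equivalent has to compute, here is a plain-Julia sketch of `by x: egen z = sum(1)`, which attaches the group size to every row. The data and variable names are made up; a DataFrames.jl version would rely on groupby and transform rather than an explicit dictionary.

```julia
# Hypothetical grouping variable.
x = ["a", "b", "a", "c", "a"]

# First pass: count the rows in each group.
group_size = Dict{String, Int}()
for key in x
    group_size[key] = get(group_size, key, 0) + 1
end

# Second pass: broadcast each group's size back to its rows.
z = [group_size[key] for key in x]   # [3, 1, 3, 1, 3]
```

Unlike collapse, the result keeps one value per original row, which is what distinguishes egen from an aggregation.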

Value labels are different for categorical vectors.

  • BUT: no strings as factors
  • in Stata, variables don't have coding, i.gender and c.gender can be in the same regression
  • i. notation, changing the base, subset of categories

Not so good in Stata

  • quirky syntax, like egen vs collapse
  • no proper function returns

Code examples

using TidierData
@chain data begin
    @select command canonical_form
    @filter canonical_form == "generate"
    @group_by command
    @summarize n = n()
    @ungroup
    @arrange desc(n)
end

using Kezdi
@chain data begin
    @keep command canonical_form
    @keep @if canonical_form == "generate"
    @egen n = count(), by(canonical_form)
    @sort -n
end

replace y = . if y < 5

const n_users = 5
model_object = @chain data begin
    @replace y = 0 @if y < 0
    @regress y x @if x > n_users, vce(cluster country)
end

const n_users = 5
@chain data begin
    @replace y = 0 @if y < 0
    @aside model_object = @regress y x @if x > n_users, vce(cluster country)
    @keep @if x < 0
end
