Initial design meeting #16
Comments
@korenmiklos, have you looked at Query.jl, DataFramesMeta.jl, and SplitApplyCombine.jl? What are you missing from these (and Douglass.jl) that requires a new package? |
I feel a bit like the uninvited guest who comes to spoil the party - sorry! Anyway, my aim is not to spoil the party. I've been talking many times with @jmboehm about this and I think the question of @tpapp is exactly spot on. What exactly is missing, and what's peculiar about applied economists' workflows here? Here is a plain vanilla DataFrames.jl pipeline. @pdeffebach could probably chip in some useful bits from DataFramesMeta.jl.
using CSV
using DataFrames
using GLM
using Chain
using Statistics
gapminder() = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/gapminder/gapminder.csv"),
                       DataFrame)
function pipeline(d::DataFrame)
    @chain d begin
        select(:country, :year, :gdpPercap, :pop)
        transform([:pop, :gdpPercap] .=> (x -> log.(x)) .=> [:logpop, :loggdpPercap])
        # replace missing values with the column minimum, writing back to the same column
        transform(:loggdpPercap => (x -> replace(x, missing => minimum(skipmissing(x)))) => :loggdpPercap)
        groupby(:country)
        combine([:logpop, :loggdpPercap] .=> mean .=> [:logpop, :loggdpPercap])
        lm(@formula(loggdpPercap ~ logpop), _)
    end
end
function run()
    d = gapminder()
    pipeline(d)
end
run()
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
loggdpPercap ~ 1 + logpop
Coefficients:
───────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  8.21888      1.00019      8.22    <1e-12   6.24144   10.1963
logpop      -0.00381135   0.0631324   -0.06    0.9519  -0.128627   0.121005
─────────────────────────────────────────────────────────────────────────── |
yes, please!! I would love to join and thanks for the initiative! |
@korenmiklos, thanks for including me. I don't have the bandwidth this month to join a 2-hour Zoom mainly because I have a lot of travel coming up this month and next month. I agree with the sentiment that it's worth figuring out the value proposition, but I think that's generally true of all new packages and shouldn't stop you from experimenting. A great example of this is the |
Glad you are trying to expand applied micro-economics uses in Julia! I agree with the above commentators that the current data cleaning ecosystem has very good "bones". I don't think we necessarily need a new data cleaning package. Here is the current pipe you have written
A few differences from above
This is to say: Maybe the syntax is occasionally more complicated in DataFramesMeta.jl. But the differences are the result of real tradeoffs. I don't think re-writing a new data-cleaning library is worth it at the moment.
What should be done instead
However, I think having new developers devoted to micro-economics is a great idea! The number one issue I would like to see addressed is our statistics and regression packages. Currently, we have three main regression packages.
Additionally, the HypothesisTests.jl and StatsBase.jl packages don't really have maintainers anymore. @andreasnoack occasionally reviews PRs but there are still bugs to be fixed. Additionally, StatsBase.jl could use much more support accommodating missing values and PRs would (probably) be welcomed. (Missing values are contentious for a variety of reasons, and probably not a good place for a newcomer to start working on PRs). As for other estimation tools
So my take is that if a micro-economist wants to use Julia, data cleaning is not the issue. There are some data cleaning things that would make life easier. Two things I think about are (1) Better missing values support (which is kind of in my court, I have some PRs that can improve things) and (2) Better data viewing. Stata's viewer is excellent, but it shouldn't be impossible to write a Qt-based data viewer. These are marginal, however, compared to the minuscule size of the Julia statistics and econometrics ecosystem compared to R and Stata. That's where the energy needs to be. |
Thanks all! My purpose with this package is as follows. Stata is doing something well. A large chunk of applied micro work is in Stata (we will have precise numbers by end of April). If we want applied microeconomists to use Julia, we need to offer something that is as easy to use as Stata and easy to switch to. Julia has great existing tools, but they don't fill this gap. @pdeffebach: I will look more into these packages. I agree there is value to be added there. Let's get more users, bigger communities, maybe it will help build the more innovative packages, too. |
From @gbekes Oh you can guess what I'll say.
|
Indeed. And let me tag Laurent @lrberge for R fixest and Alex @s3alfisc for his Python version PyFixest. My strong view is that the fixest way -- a regression wrapper with a clear stance + flexibility -- is the way ahead, and the same syntax should work in all languages. |
I think FixedEffectModels is pretty good and flexible (should be a drop-in replacement for reghdfe). please file issues if you encounter problems or think of important missing functionalities
…On Tue, Apr 9, 2024 at 10:42 AM Gábor Békés ***@***.***> wrote:
Indeed. And let me tag Laurent @lrberge <https://github.com/lrberge> for
R fixest <https://lrberge.github.io/fixest/> and Alex @s3alfisc
<https://github.com/s3alfisc> for his Python version PyFixest
<https://github.com/s3alfisc/pyfixest>. .
Lemme also tag @vincentarelbundock <https://github.com/vincentarelbundock>
for modelsummary <https://modelsummary.com/>
My strong view is that the fixest way -- a regression wrapper with a clear
stance + flexibility is the way ahead, and the same syntax should work in
all languages.
|
Thanks Miklos for tagging me. As discussed with some of you, I feel that Julia would be able to bridge the two-language problem of economics which is that (1) data cleaning and reduced-form work is much simpler in Stata/R; (2) Structural work is much easier/faster in Python/C/Matlab/Julia. Karandeep has done terrific work on bringing R's tidyverse to Julia, but I'm not an R user so I still do my reduced-form work mostly in Stata. My goal for starting Douglass.jl was to have a package that can take Stata code almost one-for-one and generate Julia code (DataFrames.jl/DataFramesMeta.jl). My difficulty when using Julia's tabular data packages is exactly that I have to think very carefully about how missing values propagate, and I'd prefer to have exactly the behavior of Stata (that's somehow hardwired into my brain). I haven't had as much time for free software development as I would have liked, so I haven't taken this all the way to the end and developed a polished package. Perhaps, if I get a large grant one day, I'll get a student to spend a summer expanding and polishing the package. But as always the cost is in the maintenance, not in the initial development. For that you need to get a critical mass of users and developers. Tbh I don't see a point in developing a package that departs significantly from existing syntax or behavior: the lower the switching costs, the better. I share @pdeffebach's view that missing values are a problem (well, at least for me), and that data viewing is not as smooth as in e.g. Stata. I'm travelling a lot for seminars at the end of the month, so not sure I can join, but keep me in the loop about the date/time. |
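To make the missing-value point above concrete, here is a minimal sketch using only Base Julia and the Statistics standard library. The column `x` is made-up data, and `stata_gt` is a hypothetical helper (not part of Douglass.jl, DataFrames.jl, or any other package) that mimics Stata's convention of treating a missing value as larger than any number.

```julia
using Statistics  # standard library; provides `mean`

# Toy column with one missing entry (made-up values, not from the thread).
x = [1.0, missing, -2.0, 4.0]

# Julia: a comparison involving `missing` is itself `missing` (three-valued logic).
julia_cmp = x .> 0

# Reductions propagate `missing` unless it is skipped explicitly.
total_propagated = sum(x)
total_skipped = sum(skipmissing(x))
avg_skipped = mean(skipmissing(x))

# Hypothetical helper mimicking Stata, where a missing value compares as larger
# than any number, so `x > 0` holds for a missing `x`.
stata_gt(a, b::Real) = ismissing(a) ? true : a > b
stata_cmp = stata_gt.(x, 0)
```

The same condition therefore selects different rows in the two systems: the Julia comparison is undecided for the missing row, while the Stata-style comparison includes it. That is the kind of difference one has to keep in mind when porting Stata code line by line.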
Someone just forwarded me this link, so please excuse another uninvited house guest. @korenmiklos it’s your package and your prerogative, but I’d just like to plus 1 all the comments about caution against reinventing the wheel here and introducing yet another syntax to Julia. DataFrames.jl and derivatives are all very well developed now. If you do want another frontend, then it makes much, much more sense to mimic the tidyverse API than Stata’s, as @kdpsingh has already done. I mean this both in terms of numbers of absolute users, econ or not, and also the clear influence that the tidyverse has had on other languages and packages from Ibis to Polars to Tidier.jl etc. What is clearly missing from Julia’s applied micro toolset IMO has already been mentioned above: marginaleffects, table-writing (at least that is as good as modelsummary & co.), inconsistent missings treatment, hypothesis testing and vcov adjustments, etc. Moreover, quite a lot of the Julia universe still suffers from poor documentation. I wrote something to this effect on the Julia Forums back in 2021 and I still feel like this is where the biggest bang for your buck is going to be. At the same time, you have to make a value proposition for why an applied economist should learn Julia instead of say R (or even Python). And I think the sales pitch here is much harder, since you are ignoring some of Julia’s obvious advantages (which are really felt in structural and macro). |
I'm curious to join. I'm in DC, but will be in California April 29-30. I posted a working paper yesterday about using Julia as a back end for Stata and other environments, the motivating example being I agree with the comments above that while confusing syntax is sometimes a barrier ( I wonder about institutional vs technical solutions. E.g., can Julia establish a documentation standard with incentives for compliance, whereby it would become normal for package developers to document all the functions and all their options in one place?! Another example of an institutional solution: Stata created the Stata Journal to reward academics for contributing software, in a currency academics appreciate, publication. It also has a mechanism for generating top-10 lists of most-downloaded packages, from SSC anyway. I wonder if it matters more what would make developers switch to Julia than what would make regular users switch. If I'm a young econometrician with a clever new method, I'm probably going to implement in R first and maybe Stata and/or Matlab too, for obvious reasons. My new paper makes the case for Julia as a universal back end development environment. Maybe back ends are the back door to getting Julia used more as a front end too, in the long run? If the cores of useful, well-maintained packages are already in Julia, polishing the user experience becomes easier. There's also the issue of money. IIRC NumPy and the like were solidified with public or private grant funds, on the idea that those packages are global public goods. Chan-Zuckerberg has given grants for Julia work. To get $ for stats work in Julia, one would probably need to make the case for its distinctive potential to serve users of many software platforms. |
I sent out invites for the meeting next Thursday, 2-4pm Budapest time. If you want to join but did not get an invite, please reply to this post. All inputs are super useful and will be taken into account. I will also share the statistical analysis we are currently doing and the outcome of the design discussion here. |
I would be interested. Thank you.
|
Hi Miklos, I did not get the invite!
Maia
|
I did not get the invite, either, but I can try and hop on the call |
I apologize for the abrupt and rude entry to the meeting. I wanted to clarify the parsing rules situation. You are correct that if you The issue I was recalling arises when expressions are in a begin block. I think Julia wants either a new line or an end, but not multiple expressions on the same line.
without the begin block things work fine
DataFramesMeta.jl uses begin ... end for many transformations that get passed to the same transform call. Making transformations on their own line with @gen might solve this problem, but might also cause issues inside a I don't want to discourage you from a new data manipulation library. I think it's good for the ecosystem to have many iterations on the same overall design. A Wald Test package or Marginal Effects package, that has a "Julian" API (interacts nicely with the StatsModels API, for example) would be of broad interest to the community. I think a focus on a Stata-like API for running regressions runs two risks.
Again, I apologize for my rudeness this morning. Peter |
I'll just add that WildBootTests.jl can do non-bootstrapped Wald and score tests after OLS, IV, and even ML estimation. However the interface is low-level. You express a linear hypothesis, or set of hypotheses, Rb=r by passing R and r. And you tell it the model, and in the case of ML the estimation result, in a similarly low-level way rather than passing a fitted reg() result. That's OK when using it as a back end in Stata or R. I intend to make it accept fitted Julia regression results. But I can't make it accept hypotheses in a nicer way until a formula-like language for expressing them is developed. |
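For readers unfamiliar with the Rb=r notation, here is a minimal sketch of a Wald statistic for a set of linear hypotheses, written against Base Julia and the LinearAlgebra standard library. `wald_statistic` is a hypothetical helper for illustration and is not the WildBootTests.jl interface; the coefficient vector `b` and covariance matrix `V` are made-up numbers.

```julia
using LinearAlgebra  # standard library; provides `dot`

# Hypothetical helper, NOT the WildBootTests.jl API: Wald statistic for H0: R*b = r,
# given coefficient estimates `b` and their estimated covariance matrix `V`.
function wald_statistic(b::AbstractVector, V::AbstractMatrix,
                        R::AbstractMatrix, r::AbstractVector)
    d = R * b - r                      # deviation of the estimates from the null
    return dot(d, (R * V * R') \ d)    # d' * inv(R V R') * d, without forming the inverse
end

# Made-up estimates for a model with an intercept and two slopes.
b = [1.0, 0.5, -0.5]
V = [0.04 0.0  0.0;
     0.0  0.01 0.0;
     0.0  0.0  0.01]

# Joint null: both slopes are zero. Each row of R picks out one coefficient.
R = [0.0 1.0 0.0;
     0.0 0.0 1.0]
r = [0.0, 0.0]

W = wald_statistic(b, V, R, r)
```

Under the null, `W` is asymptotically chi-squared with as many degrees of freedom as `R` has linearly independent rows. A formula-like layer for hypotheses would essentially translate something like "x1 = 0, x2 = 0" into the `R` and `r` built by hand here.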
No worries, @pdeffebach, thanks for joining and sharing your thoughts. The I will look into the other stats packages mentioned. Tomorrow I will create+share a summary of this meeting and close this issue. |
Thanks @droodman, @maiaguell, @gergelyattilakiss, @floswald, @jmboehm, @pdeffebach, @andrasvereckei for the productive meeting. My summary notes, without implying that you agree with all this. I am closing this, but we can continue the discussion under individual issues related to design.
State of the art
The Stata universe
The Julia universe
Broad goal:
Key tradeoff:
Be mindful of the trade-off throughout the project. Maybe the user can calibrate their level of risk tolerance.
Missing pieces in the Julia data universe
Best of Stata
replace y = 0 if y < 0
regress y x if x > 0
contrasted with much harder syntax in Pandas, R, Julia.
Sensible default choices for missing values.
By default, operations are on variables (columns, vectors).
Opinion: variable scoping is interesting.
scalar n_users = 5
generate y = n_users + 1
replace y = . if y < n_users
BUT can lead to dangerous bugs:
scalar y = 5
generate y = y + 1
Contrasted with some existing grammars
Explicit
Value labels are different for categorical vectors.
Not so good in Stata
Code examples
using TidierData
@chain data begin
    @select command canonical_form
    @filter canonical_form == "generate"
    @group_by command
    @summarize n = n()
    @ungroup
    @arrange desc(n)
end
using Kezdi
@chain data begin
    @keep command canonical_form
    @keep @if canonical_form == "generate"
    @egen n = count(), by(canonical_form)
    @sort -n
end
replace y = . if y < 5
const n_users = 5
model_object = @chain data begin
    @replace y = 0 @if y < 0
    @regress y x @if x > n_users, vce(cluster country)
end
const n_users = 5
@chain data begin
    @replace y = 0 @if y < 0
    @aside model_object = @regress y x @if x > n_users, vce(cluster country)
    @keep @if x < 0
end |
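As a point of comparison for the `replace y = 0 if y < 0` example in the summary, here is a rough sketch of what the same operation looks like on a single column in plain Julia. `replace_if!` is a hypothetical helper written for this illustration; it is not part of Kezdi.jl, DataFrames.jl, or any other package, and the data are made up. It assumes that a row whose condition evaluates to `missing` should be left untouched.

```julia
# Hypothetical helper, not part of Kezdi.jl or DataFrames.jl: a plain-Julia analogue
# of Stata's `replace y = <value> if <condition>` on a single column.
function replace_if!(y::AbstractVector, value, cond)
    for i in eachindex(y)
        # Assumption: a `missing` condition counts as false, so that row is not modified.
        if coalesce(cond(y[i]), false)
            y[i] = value
        end
    end
    return y
end

# Made-up column with one missing entry.
y = [3.0, -1.0, missing, -4.0]

# Analogue of `replace y = 0 if y < 0`.
replace_if!(y, 0.0, v -> v < 0)
```

The explicit loop, the anonymous function, and the `coalesce` call are the boilerplate that the one-line Stata command hides.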
With @gergelyattilakiss we are building a Julia package to help typical applied economics workflows for data cleaning, as well as exploratory and regression analysis. The syntax follows that of Stata (R), a statistical tool widely used by economists. Our work is inspired by https://github.com/TidierOrg/Tidier.jl and https://github.com/jmboehm/Douglass.jl.
The motivation is to help more economists adopt Julia, giving them a performant scientific computing language that they can use not only for macroeconomics simulations, but also applied microeconomics data work.
We have some ideas for the design of the tool (see code example below). We are also analyzing Stata code produced by economists, as submitted to journals, to study the patterns of coding and the most frequent commands used.
Please join us for a design meeting to discuss what such a tool should and should not do. If you have strong views about either Julia, Stata or applied economics, please come and share.
If you can join for a 2-hour Zoom meeting at the end of April, please let us know in a comment below.