Parallel processing in python 3: RDataFrame ? #51

Open
IzaakWN opened this issue Nov 22, 2023 · 2 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@IzaakWN
Collaborator

IzaakWN commented Nov 22, 2023

Unfortunately, the current parallel processing functionality for creating histograms from trees in SampleSet.gethist broke when switching to Python 3. The segmentation faults seem to be caused by a conflict between how Python and ROOT manage their objects in memory. (The parallel processing is done by (ab)using Python's multithreading.)

As a consequence, I am starting to look into completely redesigning the SampleSet.gethist/MergedSample.gethist/Sample.gethist routines using RDataFrame, which is native to ROOT since v6.14. I will probably make this the default, replacing the old routine based on Python's multithreading in Plotter/python/plot/MultiThread.py and MultiDraw.py/MultiDraw.cxx. The latter also shows some unexpected behavior for array branches of variable length.

Besides solving the memory issues, this should be more performant because we can string together multiple instances of RDataFrame (see this section of the class reference):

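from ROOT import RDataFrame, RDF, EnableImplicitMT
EnableImplicitMT() # must come before creating the RDataFrames; without it RunGraphs runs the loops sequentially
df1 = RDataFrame("tree", "DY.root")
df2 = RDataFrame("tree", "TT.root")
sel = "q_1*q_2<0 && pt_1>50 && pt_2>50"
# Histo1D expects column names, so define the weight expression as a column first
df1_sel = df1.Filter(sel).Define("evtweight", "genweight*idisoweight")
df2_sel = df2.Filter(sel).Define("evtweight", "genweight*idisoweight")
# histogram model: (name, title, nbins, xmin, xmax)
res1 = df1_sel.Histo1D(("pt_1_DY","pt_1",50,0,250),"pt_1","evtweight")
res2 = df2_sel.Histo1D(("pt_1_TT","pt_1",50,0,250),"pt_1","evtweight")
RDF.RunGraphs([res1,res2]) # runs the event loops of df1 and df2 concurrently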

and let RDataFrame optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself.

Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup would only process multiple variables and samples in parallel, but sequentially for selections.)
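As a rough sketch of what processing multiple variables and selections in one go could look like, the helper below books one histogram per (sample, selection, variable) without triggering any event loop. The function name `book_histograms`, the column name `evtweight` and the dictionary layouts are assumptions made for illustration; they are not part of TauFW or of the planned implementation. It only relies on the `Define`, `Filter` and `Histo1D` methods of RDataFrame nodes.

```python
def book_histograms(frames, selections, variables,
                    weight_expr="genweight*idisoweight"):
    """Lazily book one histogram per (sample, selection, variable).

    frames:     dict, sample name -> RDataFrame (or any node with Define/Filter)
    selections: dict, selection name -> cut string
    variables:  dict, column name -> (nbins, xmin, xmax)
    Returns a dict mapping (sample, selection, variable) -> booked result.
    """
    booked = {}
    for sample, df in frames.items():
        # Histo1D takes a column name as weight, so define the expression once per sample.
        # "evtweight" is a hypothetical column name; it must not clash with an existing branch.
        df_w = df.Define("evtweight", weight_expr)
        for selname, cut in selections.items():
            # One filter node per selection, shared by all variables booked below.
            df_sel = df_w.Filter(cut, selname)
            for var, (nbins, xmin, xmax) in variables.items():
                # Histogram model: (name, title, nbins, xmin, xmax)
                model = (f"{var}_{selname}_{sample}", var, nbins, xmin, xmax)
                booked[(sample, selname, var)] = df_sel.Histo1D(model, var, "evtweight")
    return booked

# Possible usage with ROOT (tree, file and branch names are placeholders):
#   import ROOT
#   ROOT.EnableImplicitMT()
#   frames = {"DY": ROOT.RDataFrame("tree", "DY.root"),
#             "TT": ROOT.RDataFrame("tree", "TT.root")}
#   selections = {"os": "q_1*q_2<0 && pt_1>50", "ss": "q_1*q_2>0 && pt_1>50"}
#   variables = {"pt_1": (50, 0, 250), "pt_2": (50, 0, 250)}
#   booked = book_histograms(frames, selections, variables)
#   ROOT.RDF.RunGraphs(list(booked.values()))  # one event loop per sample
#   hist = booked[("DY", "os", "pt_1")].GetValue()
```

Because every histogram of a sample hangs off the same RDataFrame, all selections and variables for that sample are filled in a single pass over its tree, and RunGraphs can then run the per-sample loops together.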

@IzaakWN IzaakWN added bug Something isn't working enhancement New feature or request labels Nov 22, 2023
@IzaakWN
Collaborator Author

IzaakWN commented Nov 23, 2023

The segmentation fault in parallel processing should be solved by PR #52.

In the coming weeks, I will try to integrate RDataFrame into TauFW's SampleSet, similar to this standalone example:
https://github.com/cms-tau-pog/TauFW/blob/master/Plotter/test/testRDataFrame.py

@IzaakWN
Collaborator Author

IzaakWN commented Dec 20, 2023

Draft PR #56
The issue with multithreading and the plans for the RDataFrame implementation were presented and discussed in the TauPOG meeting (18/12/2023) here: https://indico.cern.ch/event/1358491/#3-plans-status-of-taufw
