Parallel processing in python 3: `RDataFrame` ? #51

IzaakWN · 2023-11-22T16:32:04Z

Unfortunately, the current parallel processing functionality for creating histograms from trees in SampleSet.gethist broke when switching to python 3. The segmentation faults seem to be caused by a conflict between how python and ROOT manage their objects in memory. (The parallel processing is done by (ab)using python's multithreading.)

As a consequence, I am starting to look into completely redesigning the SampleSet.gethist/MergedSample.gethist/Sample.gethist routines using RDataFrame, which has been part of ROOT since v6.14. I will probably make this the default routine, replacing the old one based on python's multithreading in Plotter/python/plot/MultiThread.py and MultiDraw.py/MultiDraw.cxx. The latter also shows some unexpected behavior for array branches of variable length.

Besides solving the memory issues, this should be more performant because we can string together multiple instances of RDataFrame (see this section of the class reference):

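import ROOT
from ROOT import RDataFrame, RDF
ROOT.EnableImplicitMT() # RunGraphs only runs the graphs concurrently if implicit multithreading is enabled
df1 = RDataFrame("tree", "DY.root")
df2 = RDataFrame("tree", "TT.root")
sel = "q_1*q_2<0 && pt_1>50 && pt_2>50"
# Histo1D takes the name of a weight column, not an expression, so define the weight first
df1_sel = df1.Filter(sel).Define("weight", "genweight*idisoweight")
df2_sel = df2.Filter(sel).Define("weight", "genweight*idisoweight")
# histogram model: (name, title, nbins, xmin, xmax)
res1 = df1_sel.Histo1D(("pt_1_DY","pt_1",50,0,250),"pt_1","weight")
res2 = df2_sel.Histo1D(("pt_1_TT","pt_1",50,0,250),"pt_1","weight")
RDF.RunGraphs([res1,res2]) # runs df1 and df2 concurrently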

and let RDataFrame optimize the parallel processing of many histograms (multiple samples x variables x selections) by itself.

Furthermore, we could even think of processing multiple variables and selections in one go. (The previous setup would only process multiple variables and samples in parallel, but sequentially for selections.)
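A minimal sketch of what booking multiple variables and selections in one go could look like. This is not the TauFW implementation: the helper `book_histograms`, the `weight` column name, and the selection and variable dictionaries are hypothetical names chosen for illustration. The helper only relies on the `Filter` and `Histo1D` methods of `RDataFrame`, and since both are lazy, nothing is read from file until `RDF.RunGraphs` is called.

```python
def book_histograms(df, selections, variables, weight="weight"):
    """Book one histogram per (selection, variable) pair on one dataframe.

    df:         RDataFrame-like object providing Filter() and Histo1D()
    selections: dict mapping a selection name to a cut expression
    variables:  dict mapping a column name to (nbins, xmin, xmax)
    weight:     name of a weight column, assumed to be defined already
    """
    results = {}
    for selname, cut in selections.items():
        df_sel = df.Filter(cut, selname)  # lazy: no event loop yet
        for var, (nbins, xmin, xmax) in variables.items():
            # histogram model: (name, title, nbins, xmin, xmax)
            model = ("%s_%s" % (var, selname), var, nbins, xmin, xmax)
            results[(selname, var)] = df_sel.Histo1D(model, var, weight)
    return results


def main():
    # Hypothetical usage; file, tree and branch names follow the example above.
    import ROOT
    ROOT.EnableImplicitMT()
    samples = {"DY": "DY.root", "TT": "TT.root"}
    selections = {
        "os": "q_1*q_2<0 && pt_1>50 && pt_2>50",
        "ss": "q_1*q_2>0 && pt_1>50 && pt_2>50",
    }
    variables = {"pt_1": (50, 0, 250), "pt_2": (50, 0, 250)}
    booked = {}
    for name, fname in samples.items():
        df = ROOT.RDataFrame("tree", fname)
        df = df.Define("weight", "genweight*idisoweight")
        booked[name] = book_histograms(df, selections, variables)
    # One event loop per sample fills all its histograms; the loops of
    # different samples are run concurrently.
    handles = [h for hists in booked.values() for h in hists.values()]
    ROOT.RDF.RunGraphs(handles)
    return booked
```

With this layout, each sample is read only once, no matter how many selections and variables are booked on it. Note that histograms of the same variable and selection get the same name across samples here; a real implementation would probably add the sample name to the histogram name.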


IzaakWN · 2023-11-23T17:46:06Z

The segmentation fault in parallel processing should be solved by PR #52.

In the coming weeks, I will try to integrate RDataFrame into TauFW's SampleSet, similar to this standalone example:
https://github.com/cms-tau-pog/TauFW/blob/master/Plotter/test/testRDataFrame.py

IzaakWN · 2023-12-20T15:50:35Z

Draft PR #56
The issue with multithreading and the plans for the RDataFrame implementation were presented and discussed in the TauPOG meeting (18/12/2023) here: https://indico.cern.ch/event/1358491/#3-plans-status-of-taufw

IzaakWN added the "bug" (Something isn't working) and "enhancement" (New feature or request) labels Nov 22, 2023

IzaakWN mentioned this issue Nov 23, 2023

Fix parallel processing for python3 #52

Merged

IzaakWN mentioned this issue Dec 15, 2023

Implement RDataFrame in Sample and SampleSet for plotting #56

Merged
