This repository has been archived by the owner on May 2, 2022. It is now read-only.

update README.md
TaiSakuma committed Oct 2, 2019
1 parent f3cc4e4 commit 3153d7a
Showing 1 changed file with 11 additions and 246 deletions.
257 changes: 11 additions & 246 deletions README.md
@@ -6,10 +6,10 @@

---

A python library for summarizing event data into multivariate categorical data
A Python library for summarizing event data into multivariate categorical data

### Description
_AlphaTwirl_ is a python library that summarizes event data into multivariate categorical data as data frames. Event data, input to AlphaTwirl, are data with one entry (or row) for one event: for example, data in [ROOT](https://root.cern.ch/) [TTrees](https://root.cern.ch/doc/master/classTTree.html) with one entry per collision event of an [LHC](https://home.cern/topics/large-hadron-collider) experiment at [CERN](http://home.cern/). Event data are often large—too large to be loaded in memory—because they have as many entries as events. Multivariate categorical data, the output of AlphaTwirl, have one row for one category. They are usually small—small enough to be loaded in memory—because they only have as many rows as categories. Users can, for example, import them as data frames into [R](https://www.r-project.org/) and [pandas](http://pandas.pydata.org/), which usually load all data in memory, and can perform categorical data analyses with a rich set of data operations available in R and pandas.
_AlphaTwirl_ is a Python library that summarizes event data into multivariate categorical data as data frames. Event data, input to AlphaTwirl, are data with one entry (or row) for one event: for example, data in [ROOT](https://root.cern.ch/) [TTrees](https://root.cern.ch/doc/master/classTTree.html) with one entry per collision event of an [LHC](https://home.cern/topics/large-hadron-collider) experiment at [CERN](http://home.cern/). Event data are often large—too large to be loaded in memory—because they have as many entries as events. Multivariate categorical data, the output of AlphaTwirl, have one row for one category. They are usually small—small enough to be loaded in memory—because they only have as many rows as categories. Users can, for example, import them as data frames into [R](https://www.r-project.org/) and [pandas](http://pandas.pydata.org/), which usually load all data in memory, and can perform categorical data analyses with a rich set of data operations available in R and pandas.

****

@@ -21,255 +21,20 @@ _AlphaTwirl_ is a python library that summarizes event data into multivariate ca

****

### CHEP 2018
### Publication

- Tai Sakuma, *"AlphaTwirl: a python library for summarizing event data into
multi-dimensional categorical data"*, **CHEP 2018**, 9-13 July 2018
Sofia, Bulgaria, ([indico](https://indico.cern.ch/event/587955/contributions/2937634/))

[<img src="https://raw.githubusercontent.com/alphatwirl/alphatwirl/v0.18.6/images/tai_20180709_CHEP2018_corrected_01_1900.png" width="200">](https://indico.cern.ch/event/587955/contributions/2937634/attachments/1680105/2731035/tai_20180709_CHEP2018_corrected_01.pdf)
- Tai Sakuma, *"AlphaTwirl: A Python library for summarizing event data into multivariate categorical data"*,
EPJ Web of Conferences **214**, 02001 (2019), [doi:10.1051/epjconf/201921402001](https://doi.org/10.1051/epjconf/201921402001),
[1905.06609](https://arxiv.org/abs/1905.06609)

****

### Features

#### Input format

- **Event data:** input data of alphatwirl are event data in general
- Event data are any data with one entry (row) for one event.
- Data in ROOT trees are typically event data
- e.g., one entry for one proton-proton collision event
- Event data are often large because they have as many entries as
events
- e.g., they are often stored in many files on a server
machine or a dedicated storage system
- [**ROOT trees:**](https://root.cern.ch/root/html/guides/users-guide/Trees.html) the main input format of alphatwirl
- **Flat trees:** ROOT trees with only primitive
types such as _int_ and _float_ and an array of those.
- [Delphes trees](https://cp3.irmp.ucl.ac.be/projects/delphes/wiki/WorkBook/RootTreeDescription)
- [Heppy trees](https://github.com/cbernet/heppy)
- [CMS NanoAOD](https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookNanoAOD)
- With additional code to access each class, it is also possible to read
trees with persistent objects
- [CMS EDM formats](https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookCMSSWFramework)
- Users can write modules to support other formats

#### Output format

- **Multivariate categorical data**: output data of alphatwirl
are multivariate categorical data
- They are usually small because they only have as many entries as categories.
- Often small enough to be stored as text files on a laptop computer.
- **Fixed width format**: fixed-width text files have been
the primary output format
- This format is convenient as long as the data size is small. You
can browse it with a text editor. You can import it in
[R](https://www.r-project.org/) and
[pandas](http://pandas.pydata.org/) as a data frame.
- An example output file looks like

```
process htbin njetbin minChi n
QCD 400 2 0 8.15e+05
QCD 400 2 0.05 3.49e+05
QCD 400 2 0.1 1.18e+05
QCD 400 2 0.15 3.78e+04
TTJets 1200 6 1.45 0.00
TTJets 1200 6 1.5 0.00
```

- There are plans to support [_feather_](https://github.com/wesm/feather).
- Users can write modules to support other formats
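The fixed width output shown above is plain whitespace-delimited text, so it can be read with very little code. The following is a minimal sketch using only the standard library; `SAMPLE` is copied from the example output, and `parse_table` is a hypothetical helper, not part of alphatwirl.

```python
# Sample text copied from the example output file above.
SAMPLE = """\
process  htbin njetbin minChi        n
QCD        400       2      0 8.15e+05
QCD        400       2   0.05 3.49e+05
QCD        400       2    0.1 1.18e+05
QCD        400       2   0.15 3.78e+04
TTJets    1200       6   1.45     0.00
TTJets    1200       6    1.5     0.00
"""


def parse_table(text):
    """Parse a whitespace-delimited table into a list of dicts.

    The first non-empty line is the header; every following line is
    one row, i.e., one category. Values are kept as strings.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


rows = parse_table(SAMPLE)
# Sum the column 'n' over the rows of one process.
n_qcd = sum(float(row['n']) for row in rows if row['process'] == 'QCD')
```

With pandas, `pandas.read_csv(path, sep=r'\s+')` should read a file in this layout directly into a data frame.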


#### Split-apply-combine strategy

- The general idea of alphatwirl is to employ the [_split-apply-combine
strategy_](https://www.jstatsoft.org/article/view/v040i01) on event data.
- _split_ event data into groups by categories, _apply_ a
function to data in each group, and _combine_ the results as a
**table** of multivariate categorical data.
- Histograms can be created in this strategy&mdash;split data into
bins, count the number of entries in each bin, and combine the
results as a table.
- Summarizing events in alphatwirl is a generalization of creating
histograms.
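The strategy can be sketched in a few lines of plain Python. This is an illustration of split-apply-combine in general, not alphatwirl's implementation; `summarize`, `ht_bin`, and the toy events are made up for the example.

```python
from collections import defaultdict


def summarize(events, key_func, val_func=lambda ev: 1, combine=sum):
    """Split events by key, then apply `combine` to the values in each group.

    With the defaults, each event contributes 1 and the values are
    summed, which gives a histogram (a count per category).
    """
    groups = defaultdict(list)
    for ev in events:
        # split: collect the value of each event under its category
        groups[key_func(ev)].append(val_func(ev))
    # apply and combine: one row per category, sorted by key
    return sorted((key, combine(vals)) for key, vals in groups.items())


def ht_bin(ev):
    """Lower edge of the 200-wide bin that contains ev['ht']."""
    return (ev['ht'] // 200) * 200


events = [
    {'ht': 250, 'njet': 2},
    {'ht': 450, 'njet': 3},
    {'ht': 270, 'njet': 2},
    {'ht': 810, 'njet': 4},
]

table = summarize(events, ht_bin)
```

Passing a different `val_func` and `combine` (for example, the jet multiplicity and `max`) turns the same loop into a different summary, which is the sense in which summarizing generalizes histogramming.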


#### Keys and values

- **Keys:** categories are defined in terms of keys
- **Values:** values are summarized in each group defined by
categories
- Keys and values are attributes of the event object. They are either
- stored in the input file
- or created by _scribblers_

#### Table configuration

- Tables can be configured by a list of python dictionaries.
- The example code below configures five tables

```python
# Assumption: the binning classes are importable from alphatwirl.binning
from alphatwirl.binning import Binning, Round, RoundLog

htbin = Binning(boundaries=(0, 200, 400, 800))
njetbin = Binning(boundaries=(1, 2, 3, 4, 5))
tblcfg = [
    dict(keyAttrNames=('mht40', ),
         binnings=(Round(10, 0), ),
         keyOutColumnNames=('mht', )),
    dict(keyAttrNames=('ht40', 'mht40'),
         binnings=(htbin, Round(10, 0)),
         keyOutColumnNames=('ht', 'mht')),
    dict(keyAttrNames=('ht40', 'nJet40', 'mht40'),
         binnings=(htbin, njetbin, Round(10, 0)),
         keyOutColumnNames=('ht', 'njet', 'mht')),
    dict(keyAttrNames=('ht40', 'jet_pt'),
         binnings=(htbin, RoundLog(0.1, 100)),
         keyIndices=(None, 0),
         keyOutColumnNames=('ht', 'jet_pt')),
    dict(keyAttrNames=('ht40', 'jet_pt'),
         binnings=(htbin, RoundLog(0.1, 100)),
         keyIndices=(None, '*'),
         keyOutColumnNames=('ht', 'jet_pt')),
]
```

- A more complex example

```python
dict(
keyAttrNames=('ieta', 'iphi', 'depth', 'QIE10_index'),
keyIndices=('(*)', '\\1', '\\1', '\\1'),
binnings=(echo, echo, echo, echo),
valAttrNames=('QIE10_energy', ),
valIndices=('\\1', ),
keyOutColumnNames=('ieta', 'iphi', 'depth', 'idxQIE10'),
valOutColumnNames=('energy', ),
summaryClass=alphatwirl.Summary.Sum
)
```

##### Indices

- Variables are scalars or arrays. Indices specify elements of an array
- Indices can be flexibly configured
- a simple example:<br />
`dict(keyAttrNames=('ht40', 'jet_pt'), keyIndices=(None, 0), ⋯ )`<br />
`ht40` is scalar; the index is `None`. `jet_pt` is an array; `0`
specifies the first element of `jet_pt`.
- inclusive:<br />
`dict(keyAttrNames=('ht40', 'jet_pt'), keyIndices=(None, '*'), ⋯ )`<br />
`'*'` means all elements: every pair of `ht40` and an element of
`jet_pt` is used.
- all combinations:<br />
`dict(keyAttrNames=('jet_pt', 'muon_pt'), keyIndices=('*', '*'), ⋯ )`<br />
all combinations of `jet_pt` and `muon_pt`
- back reference:<br />
`dict(keyAttrNames=('jet_pt', 'jet_eta'), keyIndices=('(*)', '\\1'), ⋯ )`<br />
pairs of `jet_pt` and `jet_eta` with the same index.
The parentheses in `'(*)'` tell alphatwirl to remember the index.
`'\\1'` refers to the index remembered by the first pair of parentheses.
- a more complex example:<br />
`dict(keyAttrNames=('jet_pt', 'jet_eta', 'muon_pt', 'muon_eta'), keyIndices=('(*)', '\\1', '(*)', '\\2'), ⋯ )`
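The key combinations that each index setting selects can be spelled out in plain Python. This sketch only illustrates the meaning described above and is not how alphatwirl expands indices internally; the event content is made up, and the arrays paired by back reference are assumed to have equal length.

```python
from itertools import product

# A toy event: one scalar and three arrays (values are made up).
ht40 = 400.0
jet_pt = [120.0, 45.0]
jet_eta = [0.5, -1.2]
muon_pt = [30.0, 10.0, 5.0]

# keyIndices=(None, 0): the scalar paired with the first element
first = [(ht40, jet_pt[0])]

# keyIndices=(None, '*'): the scalar paired with every element
inclusive = [(ht40, pt) for pt in jet_pt]

# keyIndices=('*', '*'): every combination of the two arrays
all_combinations = list(product(jet_pt, muon_pt))

# keyIndices=('(*)', '\\1'): elements of two arrays at the same index
back_reference = list(zip(jet_pt, jet_eta))
```

Here `all_combinations` has one entry per (jet, muon) pair, while `back_reference` has one entry per jet.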
##### Binnings
- Four binning classes are implemented
- **Binning:** bin boundaries are manually specified by a user<br />
`Binning(boundaries=(0, 200, 400, 800))`
- **Round:** equal bin width<br />
`Round(10, 0)`<br />
`10` is the bin width and `0` is a boundary. The lower edge of a
bin is included. The upper edge belongs to the next bin.
- **RoundLog:** equal bin width in logarithm<br />
`RoundLog(0.1, 100)`<br />
- **Echo:** the value itself<br />
`Echo(0.1, 100)`<br />
- Users can write their own custom binning classes
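The bin-edge rules described for `Binning` and `Round` can be sketched as two small functions. These are hypothetical stand-ins that return the lower edge of the bin, not the library's classes; returning `None` outside the boundaries is an assumption made for this sketch, and alphatwirl's own underflow and overflow handling may differ.

```python
import bisect
import math


def boundary_bin(value, boundaries):
    """Lower edge of the bin containing value, for user-given boundaries.

    The lower edge of a bin is included; the upper edge belongs to the
    next bin. Returns None below the first or at/above the last boundary
    (an assumption of this sketch).
    """
    i = bisect.bisect_right(boundaries, value) - 1
    if 0 <= i < len(boundaries) - 1:
        return boundaries[i]
    return None


def round_bin(value, width, boundary=0):
    """Lower edge of the equal-width bin containing value.

    `boundary` is any one bin edge; all other edges are at
    boundary + k * width for integer k.
    """
    return math.floor((value - boundary) / width) * width + boundary


ht_edge = boundary_bin(250, (0, 200, 400, 800))   # falls in [200, 400)
mht_edge = round_bin(25, 10, 0)                   # falls in [20, 30)
```

With non-integer widths, values very close to an edge can land in the neighboring bin because of floating-point rounding.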
##### Scribblers
- If variables necessary for table configuration or event selection are not in
the input file, users can write _scribblers_ to create them on the fly
- The variables stored in the input files and the variables created by
scribblers can be used as keys and values in the same way in the
table configuration and event selection
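As a rough sketch of the idea, a scribbler can be pictured as an object that is called once per event and attaches a new attribute to it. The class below is hypothetical: the method name `event`, the attribute name `ht40`, and the toy event are assumptions made for illustration, and the real interface may differ. Storing the scalar as a one-element list mirrors the `ev.ht[0]` style used in this README's event selection example.

```python
from types import SimpleNamespace


class HtScribbler:
    """Hypothetical scribbler: adds `ht40`, the sum of jet pT above 40."""

    def event(self, event):
        # Create the new variable on the fly from what is in the event.
        event.ht40 = [sum(pt for pt in event.jet_pt if pt > 40)]


# A stand-in for an event object read from an input file.
ev = SimpleNamespace(jet_pt=[120.0, 45.0, 30.0])
HtScribbler().event(ev)
```

After the call, `ev.ht40` can be used as a key or a value just like `ev.jet_pt`.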
#### Event selection
- Conditions of event selections can be specified by nested tuples and dictionaries.
```python
dict(All=(
    'ev : ev.ht[0] >= 400',
    'ev : ev.mht[0] >= 200',
    dict(Any=(
        'ev : ev.nJet[0] == 1',
        dict(All=(
            'ev : ev.nJet[0] >= 2',
            'ev : ev.minChi[0] >= 0.7'))
    ))))
```
- A nested combination of _all_ and _any_
- **All:** all conditions need to be met
- **Any:** at least one of the conditions needs to be met
- Users can write their own implementations of All and Any to add
functionality, for example, to count the number of events that satisfy
each condition
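One way such a nested configuration could be turned into a callable is sketched below. `build_selection` is a hypothetical function, and treating each string as the body of a `lambda` is an assumption based on its `'ev : ...'` form; this is not necessarily how alphatwirl evaluates it. Because it uses `eval`, it should only be given trusted configuration.

```python
from types import SimpleNamespace


def build_selection(cfg):
    """Recursively build a function ev -> bool from a nested configuration."""
    if isinstance(cfg, str):
        # 'ev : ev.ht[0] >= 400' becomes lambda ev : ev.ht[0] >= 400
        return eval('lambda ' + cfg)
    if isinstance(cfg, dict) and len(cfg) == 1:
        (name, sub_cfgs), = cfg.items()
        subs = [build_selection(c) for c in sub_cfgs]
        if name == 'All':
            return lambda ev: all(s(ev) for s in subs)
        if name == 'Any':
            return lambda ev: any(s(ev) for s in subs)
    raise ValueError('cannot interpret: {!r}'.format(cfg))


cfg = dict(All=(
    'ev : ev.ht[0] >= 400',
    'ev : ev.mht[0] >= 200',
    dict(Any=(
        'ev : ev.nJet[0] == 1',
        dict(All=(
            'ev : ev.nJet[0] >= 2',
            'ev : ev.minChi[0] >= 0.7'))
    ))))

selection = build_selection(cfg)

# Toy events (values are made up).
monojet = SimpleNamespace(ht=[500], mht=[250], nJet=[1], minChi=[0.1])
low_chi = SimpleNamespace(ht=[500], mht=[250], nJet=[3], minChi=[0.5])
```

`selection(monojet)` passes through the `nJet == 1` branch, while `selection(low_chi)` fails because neither branch of `Any` is satisfied.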
#### Dependency injection
- Classes in alphatwirl generally operate on abstract classes (in
Python, abstract classes don't actually need to exist; duck typing
is used instead).
- Particular implementations of most operations are determined at run
time: input formats, output formats, a concurrency method, event
selections, object selections, categorization, event summarizing
methods, summary collecting methods, delivery methods, and even
progress bars.
- Furthermore, each particular implementation doesn’t generally
depend on the framework either. In fact, the same event
selection code can be used in Heppy.
- Particular implementations are specified by configuration.
#### Fast branch access by addresses
- Although alphatwirl uses [PyROOT](https://root.cern.ch/pyroot), it does
not access branches as attributes of a tree object. Instead, it
uses `SetBranchAddress()`, which is much faster; it can be more
than ten times faster.
### License

#### Multiprocessing
- AlphaTwirl is licensed under the BSD license.

- [Multiprocessing](https://docs.python.org/3.6/library/multiprocessing.html)
can be used to concurrently process events
- Progress bars grow in parallel on terminal screen to indicate the progress of each process.
*****

```
25.10% :::::::::: | 753 / 3000 |: WJetsToLNu_HT1200to2500_madgraph
30.47% :::::::::::: | 914 / 3000 |: WJetsToLNu_HT1200to2500_madgraph
29.30% ::::::::::: | 879 / 3000 |: WJetsToLNu_HT1200to2500_madgraph
85.40% :::::::::::::::::::::::::::::::::: | 854 / 1000 |: WJetsToLNu_HT1200to2500_madgraph
27.57% ::::::::::: | 827 / 3000 |: WJetsToLNu_HT2500toInf_madgraphM
25.47% :::::::::: | 764 / 3000 |: WJetsToLNu_HT2500toInf_madgraphM
79.60% ::::::::::::::::::::::::::::::: | 796 / 1000 |: WJetsToLNu_HT2500toInf_madgraphM
25.50% :::::::::: | 765 / 3000 |: WJetsToLNu_HT2500toInf_madgraphM
```
#### Batch system
### Contact

- Instead of multiprocessing, a batch system can also be used
- Currently, the interface to
[HTCondor](https://research.cs.wisc.edu/htcondor/) is implemented.
- Users can write modules to use other batch systems.
- While jobs are running in a batch system, the main process is
running in the foreground, monitoring the progress of the jobs, and
collecting the results as the jobs finish.
- Failed jobs are automatically resubmitted.
- Jobs can be split in terms of the number of input files and events.
- one input file can be split into multiple jobs
- one job can include multiple input files
- Tai Sakuma - tai.sakuma@gmail.com
