# Nested jagged array implementation



A jagged array is an array of arrays in which the inner arrays are not required to have the same length. In other words, each element of the parent array is itself an array, and those inner arrays can differ in size.

Why do we need this kind of data structure? The answer becomes clear once we look at the `dummy_nanoevents.cc` file. For convenience, the relevant code is reproduced below:


```C++
#include <vector>

#include "TFile.h"
#include "TRandom.h"
#include "TTree.h"

void
dummy_nanoevent()
{
  UInt_t run, event, luminosityBlock;

  std::vector<double> Jets_Pt;
  std::vector<double> Jets_Eta;
  std::vector<double> Jets_Phi;
  std::vector<double> Jets_E;
  std::vector<double> Jets_ID;
  std::vector<int> Jets_SubjetsCounts;
  std::vector<double> Jets_subjet_Pt;
  std::vector<double> Jets_subjet_Eta;
  std::vector<double> Jets_subjet_Phi;
  std::vector<double> Jets_subjet_E;

  TFile* Tfile = TFile::Open( "dummy_nanoevents.root", "RECREATE" );
  TTree* ttree = new TTree( "Events", "" );

  // The variables are UInt_t, so the leaf type code is lowercase "i" (unsigned 32-bit)
  ttree->Branch( "run",                &run,             "run/i" );
  ttree->Branch( "luminosityBlock",    &luminosityBlock, "luminosityBlock/i" );
  ttree->Branch( "event",              &event,           "event/i" );

  ttree->Branch( "Jet_pt",            &Jets_Pt );
  ttree->Branch( "Jet_eta",           &Jets_Eta );
  ttree->Branch( "Jet_phi",           &Jets_Phi );
  ttree->Branch( "Jet_energy",             &Jets_E );
  ttree->Branch( "Jet_ID",            &Jets_ID );
  ttree->Branch( "Jet_SubjetsCounts", &Jets_SubjetsCounts );
  ttree->Branch( "Subjet_pt",     &Jets_subjet_Pt );
  ttree->Branch( "Subjet_eta",    &Jets_subjet_Eta );
  ttree->Branch( "Subjet_phi",    &Jets_subjet_Phi );
  ttree->Branch( "Subjet_energy",      &Jets_subjet_E );

  for( Int_t ev = 0; ev < 100; ev++ ){
    run             = 1;
    event           = ev;
    luminosityBlock = 1000 * ev;
    Int_t njet = Int_t( 3 * gRandom->Rndm() + 1 );

    Jets_Pt.clear();
    Jets_Eta.clear();
    Jets_Phi.clear();
    Jets_E.clear();
    Jets_ID.clear();
    Jets_SubjetsCounts.clear();
    Jets_subjet_Pt.clear();
    Jets_subjet_Eta.clear();
    Jets_subjet_Phi.clear();
    Jets_subjet_E.clear();

    for( int i = 0; i < njet; i++ ){
      Jets_Pt.push_back( 10 * gRandom->Rndm() );
      Jets_Eta.push_back( gRandom->Rndm() );
      Jets_Phi.push_back( gRandom->Rndm() );
      Jets_E.push_back( gRandom->Gaus( 50, 10 ) );
      Jets_ID.push_back( int( 7*gRandom->Rndm() ) );
      // subjets
      Int_t jets_sub = Int_t( 3 * gRandom->Rndm() );
      Jets_SubjetsCounts.push_back( jets_sub );

      for( int j = 0; j < jets_sub; j++ ){
        Jets_subjet_Pt.push_back( 10 * gRandom->Rndm() );
        Jets_subjet_Eta.push_back( gRandom->Rndm() );
        Jets_subjet_Phi.push_back( gRandom->Rndm() );
        Jets_subjet_E.push_back( gRandom->Gaus( 25, 10 ) );
      }
    }

    ttree->Fill();
  }

  ttree->Write();
  Tfile->Close();
}
```



In this file, each event contains `njet` jets, and each jet has `jets_sub` subjets. The structure therefore looks like this:
```
jet: {
    pt:      [n],
    eta:     [n],
    phi:     [n],
    energy:  [n],
    ID:      [n],
    subjets: [
        {pt: [m0], eta: [m0], phi: [m0], energy: [m0]},   # jet 0
        {pt: [m1], eta: [m1], phi: [m1], energy: [m1]},   # jet 1
        ...
        {pt: [m(n-1)], eta: [m(n-1)], phi: [m(n-1)], energy: [m(n-1)]},   # jet n-1
    ]
}
```
Here there are `n` jets, and each jet has its own number of subjets `m`. `m` does not need to be the same for all jets, which is where the name "jagged" comes from. `jet.pt` returns a single array of `n` numbers (one per jet, indexed `0, 1, ..., n-1`), while `jet.subjets.pt` returns `n` arrays (`[[m0], [m1], ..., [m(n-1)]]`), one per jet.

Note that when creating the TBranches, we could define them to store a vector of vectors, `vector< vector<double> >`, which intuitively matches the structure shown above. However, Coffea doesn't recognize this structure.

Instead, we store the jets as flat vectors of length `n` and all the subjets of an event as flat vectors of length `N`, where `N` is the total number of subjets summed over all jets. We also need an additional branch, `Jet_SubjetsCounts` (of length `n`), that records how many subjets each jet contains.
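As a rough illustration of this layout, the pure-Python sketch below flattens a nested per-jet list of subjet values into one flat list plus a counts list. The helper name `flatten_with_counts` and the sample values are made up for illustration; they are not part of coffea or ROOT.

```python
def flatten_with_counts(nested):
    """Flatten a list of per-jet lists into (flat values, per-jet counts)."""
    flat = []
    counts = []
    for subjets in nested:
        counts.append(len(subjets))  # one count per jet, even if it is zero
        flat.extend(subjets)         # subjets are stored back to back
    return flat, counts


# Hypothetical subjet energies for three jets in one event
nested_energy = [[10.0], [], [5.1, 5.2]]
flat_energy, subjet_counts = flatten_with_counts(nested_energy)
print(flat_energy)    # [10.0, 5.1, 5.2]
print(subjet_counts)  # [1, 0, 2]
```

Note that a jet with no subjets still contributes a `0` to the counts list, so the counts list always has one entry per jet.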

For example, suppose we have the following vectors:
```
Jet_energy        = [10, 10.1, 10.2];
Jet_SubjetsCounts = [1, 3, 2];
Subjet_energy     = [10, 1.1, 5, 4, 5.1, 5.1];
```
This describes 3 jets, and the mapping is: `Jet: [10, 10.1, 10.2]  -> Subjets: [ {10}, {1.1, 5, 4}, {5.1, 5.1}]`
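The mapping above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not what coffea does internally; `unflatten` is a hypothetical helper name.

```python
def unflatten(flat, counts):
    """Split a flat list into consecutive chunks whose lengths are given by counts."""
    if sum(counts) != len(flat):
        raise ValueError("counts must sum to the length of the flat list")
    nested = []
    start = 0
    for n in counts:
        nested.append(flat[start:start + n])
        start += n
    return nested


jet_subjets_counts = [1, 3, 2]
subjet_energy = [10, 1.1, 5, 4, 5.1, 5.1]
print(unflatten(subjet_energy, jet_subjets_counts))
# [[10], [1.1, 5, 4], [5.1, 5.1]]
```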

So how do we implement this structure in coffea? It is actually quite easy: we only need to import `nest_jagged_forms` and follow the rules:

```python
from coffea.nanoevents import NanoEventsFactory, BaseSchema, NanoAODSchema
from coffea.nanoevents.schemas import zip_forms, nest_jagged_forms

class DummySchema(BaseSchema):
    mixins = {
      'Electron': "PtEtaPhiMLorentzVector",
      'Muon': 'PtEtaPhiMLorentzVector',
      'Jet': 'PtEtaPhiELorentzVector',
      'Subjet': 'PtEtaPhiELorentzVector',
    }

    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}

        # Build the basic collections by zipping branches that share a name prefix
        for name in self.mixins:
            mixin = self.mixins.get(name, "NanoCollection")
            output[name] = zip_forms({
                k[len(name) + 1 :]: branch_forms[k]
                for k in branch_forms
                if k.startswith(name + "_")
            },
            name,
            record_name=mixin)

        # Nest the Subjet collection inside Jet, using Jet_SubjetsCounts as the counts
        nest_jagged_forms(output['Jet'],
                          output.pop('Subjet'),
                          'SubjetsCounts',
                          'Subjets')
        return output


    @property
    def behavior(self):
        from coffea.nanoevents.methods import base, vector
        behavior = {}
        behavior.update(base.behavior)
        behavior.update(vector.behavior)
        return behavior
```


[`nest_jagged_forms` is defined as follows](https://github.com/yihui-lai/coffea/blob/351cc727845ab83a8e31a193dc06e534bedb97fe/coffea/nanoevents/schemas/base.py#L62). The first parameter is the parent array and the second is the child array. The third is the name of the counts field inside the parent, and the last is the name under which the child array is stored in the parent.

```python 
def nest_jagged_forms(parent, child, counts_name, name):
    """Place child listarray inside parent listarray as a double-jagged array"""
    if not parent["class"].startswith("ListOffsetArray"):
        raise ValueError
    if parent["content"]["class"] != "RecordArray":
        raise ValueError
    if not child["class"].startswith("ListOffsetArray"):
        raise ValueError
    counts = parent["content"]["contents"][counts_name]
    offsets = transforms.counts2offsets_form(counts)
    inner = listarray_form(child["content"], offsets)
    parent["content"]["contents"][name] = inner
```
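The key step in the function above is turning the counts into offsets. Conceptually this is a cumulative sum with a leading zero, so that the subjets of jet `i` occupy the slice between offset `i` and offset `i + 1`. The sketch below shows the idea on plain Python lists; `counts_to_offsets` is a hypothetical helper for illustration and is not the coffea `counts2offsets_form` function, which works on array forms rather than on values.

```python
from itertools import accumulate


def counts_to_offsets(counts):
    """Return offsets such that element i spans flat[offsets[i]:offsets[i + 1]]."""
    return [0] + list(accumulate(counts))


counts = [1, 3, 2]
flat = [10, 1.1, 5, 4, 5.1, 5.1]
offsets = counts_to_offsets(counts)
print(offsets)  # [0, 1, 4, 6]

# The subjets of jet 1 are found by slicing between consecutive offsets
print(flat[offsets[1]:offsets[2]])  # [1.1, 5, 4]
```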




We can try this DummySchema now:

```python
fname = "file:dummy_nanoevents.root"
events = NanoEventsFactory.from_root(
    fname,
    schemaclass=DummySchema,
).events()
print(events.fields)
print(events.Jet.fields)
print(events.Jet.energy[2])
print(events.Jet.Subjets.energy[2])
```

```
['Electron', 'Muon', 'Jet']
['pt', 'eta', 'phi', 'energy', 'ID', 'SubjetsCounts', 'Subjets']
[51.8, 47.9, 56]
[[17.4], [22.5, 24.7], [28.3, 23]]
```
