# Writing coffea Schema files

TO Be Written -- assigned Yihui Lai

The interpretation of the TTree data is configurable via schema objects.

A schema teaches the event processor how to group variables into collections, so that operations can run over an entire collection at once. A schema can also attach handy [behaviors](https://awkward-array.readthedocs.io/en/latest/ak.behavior.html) to a specific collection.

In this demo, we will create our own dummy schema and implement our own behavior.

First, let's open the `dummy_nanoevents.root` file with `BaseSchema` to see what it contains. We will then show how to construct a schema that can interpret this file.

The events object can be instantiated as follows:


In [1]:
from coffea.nanoevents import NanoEventsFactory, BaseSchema, NanoAODSchema
fname = "file:dummy_nanoevents.root"
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=BaseSchema
         ).events()
print(events.fields)

['run', 'luminosityBlock', 'event', 'nMuon', 'Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_mass', 'Muon_charge', 'Muon_flag', 'Muon_dxy', 'Muon_dxyErr', 'Muon_dz', 'Muon_dzErr', 'Electron_pt', 'Electron_eta', 'Electron_phi', 'Electron_mass', 'Electron_charge', 'Electron_flag', 'Electron_dxy', 'Electron_dxyErr', 'Electron_dz', 'Electron_dzErr', 'Jet_pt', 'Jet_eta', 'Jet_phi', 'Jet_energy', 'Jet_ID', 'Jet_SubjetsCounts', 'Subjet_pt', 'Subjet_eta', 'Subjet_phi', 'Subjet_energy']


Now we can copy the skeleton of a schema class:

In [2]:
class DummySchema(BaseSchema):
    """
    """
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        return output

    @property
    def behavior(self):
        from coffea.nanoevents.methods import base, vector
        behavior = {}
        behavior.update(base.behavior)
        behavior.update(vector.behavior)
        return behavior

As you can see, this schema is very simple and not yet useful: `_build_collections` returns an empty dictionary. If we build the events again with our own schema, we find that it contains nothing.

In [3]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=DummySchema
         ).events()
events.fields

[]

## Create collections

In a schema, `branch_forms` is a Python dictionary that describes how branches are grouped.

By default (`BaseSchema`), it is completely flat:
```python
branch_forms={
  "particle_pt":{},
  "particle_eta":{},
  "particle_phi":{},
  "particle_mass":{},
  ...
}
```
This is what we saw when we opened the dummy root file with `BaseSchema`: `print(events.fields)` listed every branch.

We would like to put related branches into the same collection, as follows:

```python
new_branch_forms={
  "particle": schemas.zip_forms({
      "pt" : branch_forms["particle_pt"],
      "eta" : branch_forms["particle_eta"],
      "phi" : branch_forms["particle_phi"],
      "mass" : branch_forms["particle_mass"],
  })
}
```
With this grouping, what used to be `particle_pt` is accessed as `particle.pt`.

All of this is implemented in the schema's `_build_collections` method.
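
As a rough mental model of the regrouping, the flat-to-nested step can be sketched with plain dictionaries. This is only an illustration: the string values stand in for the real form objects, and `group_fields` is a hypothetical helper, not part of coffea.

```python
# Plain-dict stand-in for branch_forms; the string values are placeholders
# for the real form objects that coffea passes to the schema.
branch_forms = {
    "particle_pt": "form of particle_pt",
    "particle_eta": "form of particle_eta",
    "particle_phi": "form of particle_phi",
    "particle_mass": "form of particle_mass",
}


def group_fields(forms, collection, fields):
    """Hypothetical helper: collect `<collection>_<field>` entries under short names."""
    return {field: forms[f"{collection}_{field}"] for field in fields}


new_branch_forms = {
    "particle": group_fields(branch_forms, "particle", ["pt", "eta", "phi", "mass"]),
}

print(list(new_branch_forms))              # top-level collections
print(list(new_branch_forms["particle"]))  # short field names inside the collection
```

The real `zip_forms` does more than build a nested dictionary, but the renaming from `particle_pt` to `particle` plus `pt` is the same idea.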

For example, let's add the `Electron` collection to our schema. To do this we also need to import `zip_forms`.

In [4]:
from coffea.nanoevents.schemas import zip_forms, nest_jagged_forms
class DummySchema(BaseSchema):
    """
    """
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        output["Electron"] = zip_forms(
            {
                "pt": branch_forms["Electron_pt"],
                "eta": branch_forms["Electron_eta"],
                "phi": branch_forms["Electron_phi"],
                "mass": branch_forms["Electron_mass"],
                #"xx": branch_forms["Electron_xx"],
            },
            "Electron",
        )
        return output

    @property
    def behavior(self):
        from coffea.nanoevents.methods import base, vector
        behavior = {}
        behavior.update(base.behavior)
        behavior.update(vector.behavior)
        return behavior

We have now created a schema with a single collection, `Electron`. It recognizes the branches `Electron_pt`, `Electron_eta`, `Electron_phi`, and `Electron_mass`.
Let's build the `events` again.

In [5]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=DummySchema
         ).events()
print(events.fields)
print(events.Electron.fields)

['Electron']
['pt', 'eta', 'phi', 'mass']


Congratulations, it works. We can now build a mask and apply a selection to the whole collection at once:

In [6]:
mask = (events.Electron.pt>3) & (events.Electron.pt<60)
good_elec = events.Electron[mask]
print(good_elec.pt)
print(good_elec.eta)

[[49.3, 38.1], [48.9, 43.9, 50.8, 43.9], ... 40.4], [44.6, 45.5, 49.2, 58.3, 51.9]]
[[0.366, 0.437], [0.149, 0.236, 0.472, ... [0.442, 0.854, 0.344, 0.156, 0.564]]


However, if you add a branch that does not exist in the file to the collection, an error is raised.
For example, uncomment the following line in `DummySchema`:
```python
"xx": branch_forms["Electron_xx"],
```
Run the code above again and you will see:
```bash
KeyError: 'Electron_xx'
```
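
One way to get a friendlier message than a bare `KeyError` is to check the requested names before building the collection. The sketch below assumes only that `branch_forms` behaves like a dictionary keyed by branch name; `missing_branches` is a hypothetical helper, not a coffea function.

```python
def missing_branches(branch_forms, collection, fields):
    """Hypothetical helper: return the `<collection>_<field>` names absent from branch_forms."""
    wanted = [f"{collection}_{field}" for field in fields]
    return [name for name in wanted if name not in branch_forms]


# Stand-in for the real branch_forms dictionary; only the keys matter here.
branch_forms = {
    "Electron_pt": {},
    "Electron_eta": {},
    "Electron_phi": {},
    "Electron_mass": {},
}

missing = missing_branches(branch_forms, "Electron", ["pt", "eta", "phi", "mass", "xx"])
if missing:
    print("Branches not found in the file:", missing)
```

Such a check could be called at the top of `_build_collections` to skip or report the missing fields before `zip_forms` is reached.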
Of course, we could write a long list and group all the TBranches manually.
But relying on naming conventions lets you write the schema much more neatly. In the dummy root file, branches are named `object_variable`.
So we can define some collections and put every TBranch named `{collection_name}_xx` into the matching collection:

In [7]:
class DummySchema(BaseSchema):
    """
    """
    mixins = {
        'Electron': "PtEtaPhiMLorentzVector",
        'Muon': 'PtEtaPhiMLorentzVector',
        'Jet': 'PtEtaPhiELorentzVector',
        'Subjet': 'PtEtaPhiELorentzVector',
    }
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        # Build each collection from the branches that share its name prefix
        for name in self.mixins:
            mixin = self.mixins.get(name, "NanoCollection")
            output[name] = zip_forms({
                k[len(name) + 1 :]: branch_forms[k]
                for k in branch_forms
                if k.startswith(name + "_")
            },
            name,
            record_name=mixin)
        return output

    @property
    def behavior(self):
        from coffea.nanoevents.methods import base, vector
        behavior = {}
        behavior.update(base.behavior)
        behavior.update(vector.behavior)
        return behavior

We defined four collections (`Electron`, `Muon`, `Jet`, `Subjet`) in the `DummySchema` above and used `for k in branch_forms` to find all the branches whose names start with `{collection_name}_`. Opening the dummy root file again, we can see the collections we defined.
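
The prefix matching can be tried out on a plain list of branch names, without coffea. The sketch below uses an abbreviated copy of the names printed by `BaseSchema` and the same `startswith` and slicing logic as `_build_collections`. It also shows which branches match no prefix and are therefore left out of the output.

```python
# Branch names as listed by BaseSchema for the dummy file (abbreviated).
branch_names = [
    "run", "luminosityBlock", "event", "nMuon",
    "Muon_pt", "Muon_eta", "Electron_pt", "Electron_eta",
    "Jet_pt", "Jet_SubjetsCounts", "Subjet_pt", "Subjet_eta",
]
collections = ["Electron", "Muon", "Jet", "Subjet"]

# Same matching rule as in _build_collections: keep names that start with
# "<collection>_" and strip that prefix to obtain the field name.
grouped = {
    name: [k[len(name) + 1:] for k in branch_names if k.startswith(name + "_")]
    for name in collections
}

# Anything that matched no prefix is not copied into the schema output.
used = {f"{name}_{field}" for name, fields in grouped.items() for field in fields}
leftover = [k for k in branch_names if k not in used]

print(grouped)
print(leftover)
```

Because the match is on the start of the name, `Subjet_pt` is not picked up by the `Jet` collection. Event-level branches such as `run` and `event` match no prefix, which is why they are absent from `events.fields` when this schema is used.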

In [8]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=DummySchema
         ).events()
print(events.fields)
print(events.Electron.fields)
print(events.Electron.px)
print(events.Electron.theta)

['Electron', 'Muon', 'Jet', 'Subjet']
['pt', 'eta', 'phi', 'mass', 'charge', 'flag', 'dxy', 'dxyErr', 'dz', 'dzErr']
[[48.7, 33.9], [26.6, 40.2, 29.8, 31.6], ... 47.3, 24.6], [33, 30.2, 47.8, 31.5, 45]]
[[1.21, 1.15], [1.42, 1.34, 1.12, 1.47], ... 1.56], [1.14, 0.805, 1.23, 1.42, 1.03]]


You may have noticed that the last block printed `events.Electron.px` and `events.Electron.theta`, which do not exist in the dummy root file. How were they created? They come from the `behavior` attached to this collection, which we discuss in the next section.

Look at the collection list:
```python
mixins = {
        'Electron': "PtEtaPhiMLorentzVector",
        'Muon': 'PtEtaPhiMLorentzVector',
        'Jet': 'PtEtaPhiELorentzVector',
        'Subjet': 'PtEtaPhiELorentzVector',
    }
```
`PtEtaPhiMLorentzVector` is the name of the `behavior` assigned to the `Electron` and `Muon` collections (`Jet` and `Subjet` use `PtEtaPhiELorentzVector`). It is defined [here](https://github.com/yihui-lai/coffea/blob/351cc727845ab83a8e31a193dc06e534bedb97fe/coffea/nanoevents/methods/vector.py#L497).
We imported it through `from coffea.nanoevents.methods import base, vector`.
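
For orientation, derived quantities such as `px` and `theta` follow from the stored `pt`, `eta`, and `phi` through standard kinematic relations. The sketch below is plain Python written for illustration under that assumption. It is not coffea's implementation, which may compute the same quantities differently internally, and the function names are hypothetical.

```python
import math


def px_from_pt_phi(pt, phi):
    """x component of the momentum from transverse momentum and azimuthal angle."""
    return pt * math.cos(phi)


def theta_from_eta(eta):
    """Polar angle from pseudorapidity, inverting eta = -ln(tan(theta / 2))."""
    return 2.0 * math.atan(math.exp(-eta))


# eta values of the first two electrons printed earlier in this demo
print(round(theta_from_eta(0.366), 2))
print(round(theta_from_eta(0.437), 2))
```

With `eta = 0.366` and `eta = 0.437` this gives about 1.21 and 1.15, which appears consistent with the first entries printed for `events.Electron.theta`.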



## Create behavior

Aside from putting different branches into collections, we can also add a `behavior` to a collection. This means that additional awkward arrays are generated on the fly by a predefined algorithm, as with `px` and `theta` earlier.

A bunch of other common physics behaviors are already provided in coffea, and you can find them in [methods](https://github.com/CoffeaTeam/coffea/tree/a95401cad91e88ceac47a4c693068bc4cbc7d338/coffea/nanoevents/methods).

To write our own coffea behavior, we first need to define it.
In the following code, we define `DummyBehavior`. It has a single property, `plus1`, which returns `particle.mass + 1` when you access `particle.plus1`.


In [9]:
import awkward1
dummybehavior={}
@awkward1.mixin_class(dummybehavior)
class DummyBehavior:
    @property
    def plus1(self):
        return self.mass+1 

class DummySchema(BaseSchema):
    """
    """
    mixins = {
        'Electron': "DummyBehavior",
        'Muon': 'PtEtaPhiMLorentzVector',
        'Jet': 'PtEtaPhiELorentzVector',
        'Subjet': 'PtEtaPhiELorentzVector',
    }
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        # Build each collection from the branches that share its name prefix
        for name in self.mixins:
            mixin = self.mixins.get(name, "NanoCollection")
            output[name] = zip_forms({
                k[len(name) + 1 :]: branch_forms[k]
                for k in branch_forms
                if k.startswith(name + "_")
            },
            name,
            record_name=mixin)
        return output

    @property
    def behavior(self):
        from coffea.nanoevents.methods import base, vector
        behavior = {}
        behavior.update(base.behavior)
        behavior.update(vector.behavior)
        behavior.update(dummybehavior)
        return behavior

The `behavior` of `Electron` is changed to `DummyBehavior` in the `DummySchema` above.
Now let's try our self-defined behavior:

In [10]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=DummySchema
         ).events()
print(events.fields)
print(events.Electron.fields)
print(events.Electron.mass)
print(events.Electron.plus1)

['Electron', 'Muon', 'Jet', 'Subjet']
['pt', 'eta', 'phi', 'mass', 'charge', 'flag', 'dxy', 'dxyErr', 'dz', 'dzErr']
[[0.00051, 0.00051], [0.00051, ... 0.00051, 0.00051, 0.00051, 0.00051, 0.00051]]
[[1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, ... 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1]]


Since we replaced the `behavior` of `Electron`, the vector methods are no longer attached to it, so `print(events.Electron.theta)` no longer works.

In [11]:
print(events.Electron.theta)

AttributeError: no field named 'theta'

(https://github.com/scikit-hep/awkward-1.0/blob/0.4.5/src/awkward1/highlevel.py#L1084)
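
The reason `theta` disappears is that each collection carries a single record name, and only the methods of the class registered under that name are available. The following plain-Python analogy illustrates the idea without awkward or coffea. All class and function names in it are hypothetical, and it does not reproduce how awkward resolves behaviors internally.

```python
import math


class Record:
    """Stand-in for a record holding the stored fields."""

    def __init__(self, eta, mass):
        self.eta = eta
        self.mass = mass


class VectorMethods:
    """Stand-in for a vector behavior: derives theta from eta."""

    @property
    def theta(self):
        return 2.0 * math.atan(math.exp(-self.eta))


class PlusOneOnly:
    """Stand-in for DummyBehavior: only knows plus1."""

    @property
    def plus1(self):
        return self.mass + 1


class PlusOneWithVector(PlusOneOnly, VectorMethods):
    """Inherits from both, so plus1 and theta are both available."""


def make_record(mixin, **fields):
    """Build a record whose extra methods come from the chosen mixin class."""
    cls = type(mixin.__name__ + "Record", (mixin, Record), {})
    return cls(**fields)


plain = make_record(PlusOneOnly, eta=0.366, mass=0.00051)
both = make_record(PlusOneWithVector, eta=0.366, mass=0.00051)

print(plain.plus1)
print(hasattr(plain, "theta"))  # the mixin without vector methods has no theta
print(hasattr(both, "theta"))   # inheriting the vector methods restores it
```

In the same spirit, a custom behavior that should keep the vector methods would presumably need to inherit from the corresponding vector class. That variant is not tested in this demo.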