# Belle II Advanced Tutorial: Skimming

Welcome to the Jupyter notebook skimming tutorial. If you are unfamiliar with Jupyter, the main thing you need to know is how to execute the cells below: select the desired cell and press `Shift + Enter`.

This tutorial contains:
1. [What are skims?](#Intro)
1. [The structure of skims](#Structure)
1. [How to create your own skim](#DiY)
    1. [Run a sample script](#ExA)
    1. [Determine your skim statistics](#ExB)
    1. [Run on different generic backgrounds](#ExC)
    1. [Skim Validation](#ExD)
  

### Imports

In [1]:
# Wide style
from ipython_tools import handler
handler.style()

### What are skims? <a name="Intro"></a>

The aim of a skim is to produce MC and data samples with a more manageable size. This:
* reduces computing time.
* makes files more analysis oriented.

Eventually all analysts will be required to use skimmed data and MC. They will not have access to the original unskimmed files.

Information about each skim is available at the [Skimming Homepage](https://confluence.desy.de/pages/viewpage.action?spaceKey=BI&title=Skimming+Homepage)

The requirements for each skim must be chosen wisely. They must reduce the size of the data samples without constraining the analysis. 
Each skim must be submitted to the skimming coordinator for testing before full production.

### Skim Structure <a name="Structure"></a>

**Format**: Skims are simple basf2 steering scripts, usually written by the analyst or their skim liaison to meet the needs of the analysis.

**Output**: Skimmed files are produced as uDST files, which contain particle reconstruction information.

## How to create your own skim <a name="DiY"></a>

A skim for a specific analysis usually consists of two scripts:

* `standalone/YourSkimName_Skim_Standalone.py` - a file that calls the necessary particle lists and declares the input and output files.
* `scripts/skim/skimType.py` - a file that contains the definition of the particle lists you want to form and the associated cuts. You will likely be able to add your definition to an existing file (see later).

Code that you may find useful:

* Standard Particle Lists : [stdCharged.py](https://stash.desy.de/projects/B2/repos/software/browse/analysis/scripts/stdCharged.py)
* Standard Photon List: [stdPhotons.py](https://stash.desy.de/projects/B2/repos/software/browse/analysis/scripts/stdPhotons.py)
* Standard Pi0 List: [stdPi0s.py](https://stash.desy.de/projects/B2/repos/software/browse/analysis/scripts/stdPi0s.py)
* Standard Ks List: [stdV0s.py](https://stash.desy.de/projects/B2/repos/software/browse/analysis/scripts/stdV0s.py)
* Standard Kl List: [stdKlongs.py](https://stash.desy.de/projects/B2/repos/software/browse/analysis/scripts/stdKlongs.py)

### Exercise A: Run a sample script <a name="ExA"></a>

Create a development setup of basf2 in your preferred workspace according to the instructions on [Sphinx](https://b2-master.belle2.org/software/development/sphinx/build/tools_doc/index-01-tools.html#development-setup). If this link is out of date, you should be able to find the instructions through a simple Sphinx search.

After following these instructions you should have a new basf2 installation in a folder called `development/` (or whatever you have named it).

Make sure to compile your code before trying to do anything. This is done by running the command `scons`.
If you make any changes within `development/`, you will need to recompile.

First, go to `development/skim`, and take a look around. As you work more with skims you'll use more of what's available in the package.

For a beginner, however, the important directories are `standalone/` and `scripts/skim/`. 

Let's take a look at our first script, for example `standalone/ALP3Gamma_Skim_Standalone.py`.
With the Standalone script, the important analysis tools and scripts are loaded, the input and output files are specified, and the statistics are printed. Read through the code and try to understand the steps.

Now let's take a look at the particle lists script stored in `scripts/skim/`. The relevant one for the example standalone file is `dark.py` (you can see that it is called within the standalone script).

In the list script, you specify the type of cuts or particle lists that you want.  It is best to start with loose requirements and tighten them in the analysis stage.
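As a toy illustration of loose versus tight requirements, the sketch below applies the photon cut that appears in the example log output (`E >= 0.1 and theta >= 0.297 and theta <= 2.618`) to a handful of made-up candidates. It is plain Python, not basf2 code: in a real skim the cut string is handed to basf2 and evaluated there. The function name, the candidate values and the tighter 0.5 GeV threshold are all hypothetical.

```python
# Toy illustration only: a real skim passes the cut string to basf2,
# which evaluates it on reconstructed particles.

def passes_photon_cut(energy, theta, min_energy=0.1,
                      min_theta=0.297, max_theta=2.618):
    """Plain-Python mirror of 'E >= 0.1 and theta >= 0.297 and theta <= 2.618'.

    energy is in GeV and theta in radians.
    """
    return energy >= min_energy and min_theta <= theta <= max_theta


# Hypothetical photon candidates as (energy, polar angle) pairs.
candidates = [(0.05, 1.0), (0.15, 1.2), (0.40, 0.2), (0.90, 2.0), (2.50, 2.7)]

# Loose selection, as one might use at skim level.
loose = [c for c in candidates if passes_photon_cut(*c)]

# Tighter energy requirement (hypothetical), better left to the analysis stage.
tight = [c for c in candidates if passes_photon_cut(*c, min_energy=0.5)]

print("loose selection keeps", len(loose), "of", len(candidates))
print("tight selection keeps", len(tight), "of", len(candidates))
```

Every candidate kept by the tight selection is also kept by the loose one, which is why a loose skim does not constrain the tighter cuts applied later.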

Now let's run over a sample number of events. When testing skims, we only need small samples.

(When doing this in your command line you do not need the `!`)

In [2]:
!basf2 ~/skimming/development/skim/standalone/ALP3Gamma_Skim_Standalone.py -n 100
# Change this to your directory if you need to!

[INFO] Steering file: /home/belle2/hmwakel/skimming/development/skim/standalone/ALP3Gamma_Skim_Standalone.py
[INFO] Adding new particle 'beam' (pdg=55, mass=999 GeV, width=999 GeV, charge=0, spin=0)
[INFO] Adding new particle 'ALP' (pdg=9000006, mass=999 GeV, width=999 GeV, charge=0, spin=0)
ALP:0 -> gamma:cdcAndMinimumEnergy  gamma:cdcAndMinimumEnergy
beam:0 -> gamma:minimumEnergy ALP:0
[INFO] Starting event processing, random seed is set to '2799e6b056f84c75623af838cee9d163488d9a104c18ac78e1be1fa2736f9baa'
[INFO] Added file /ghi/fs01/belle2/bdata/MC/release-03-01-00/DB00000547/MC12b/prod00007392/s00/e1003/4S/r00000/mixed/mdst/sub00/mdst_000141_prod00007392_task10020000141.root
[INFO] ParticleLoader's Summary of Actions:
[INFO]  o) creating (anti-)ParticleList with name: gamma:cdcAndMinimumEnergy (gamma:cdcAndMinimumEnergy)
[INFO]    -> MDST source: ECLClusters
[INFO]    -> With cuts  : E >= 0.1 and theta >= 0.297 and theta <= 2.618
[INFO] ParticleLoader's S

##### Congratulations!

You just ran your first skim script. Feel free to increase the number of events, but be aware that the script will take much longer to run.

### Exercise B: Determine your skim statistics <a name="ExB"></a>

Your skim script will have output a `.root` file. You now want to analyse the printed output to ensure your skim will qualify for the skim package.

Important factors to look for:

* **Retention Rate**: fraction of events that survive the skim. Retention rate is required to be around 10% for a skim to qualify.
* **Average Candidate multiplicity**: Number of candidates per event. 
* **Processing time per event**
* **Size of output uDST files**.
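
The first three quantities above are simple ratios, and a minimal sketch of how they relate to raw counts is given below. The function name and the example numbers are hypothetical, and the multiplicity is assumed here to be averaged over retained events; check the convention used by the skim package before comparing values.

```python
def skim_statistics(n_input_events, n_retained_events, n_candidates, total_time_s):
    """Return basic skim statistics from raw counts.

    Assumption: candidate multiplicity is averaged over retained events.
    """
    if n_input_events <= 0:
        raise ValueError("n_input_events must be positive")

    retention_rate = n_retained_events / n_input_events
    # Avoid dividing by zero when no event survives the skim.
    if n_retained_events > 0:
        multiplicity = n_candidates / n_retained_events
    else:
        multiplicity = 0.0
    time_per_event_s = total_time_s / n_input_events

    return {
        "retention_rate": retention_rate,
        "average_candidate_multiplicity": multiplicity,
        "time_per_event_s": time_per_event_s,
    }


# Hypothetical counts, for illustration only.
stats = skim_statistics(
    n_input_events=10000,
    n_retained_events=950,
    n_candidates=1140,
    total_time_s=250.0,
)

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```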


Look for retention rate, average candidate multiplicity, and processing time in the output of `ALP3Gamma_Skim_Standalone.py`.

This can be done by reading the results printed once your script has finished. A handy trick is to store the output of the script in a text file.

In [1]:
!basf2 ALP3Gamma_Skim_Standalone.py -n 10 > output.txt
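
If you prefer to stay in Python rather than use a shell redirect, the same step can be sketched with the standard `subprocess` module. The helper names below are hypothetical. Unlike the `>` redirect above, this sketch also captures the error stream, and it strips terminal colour codes, which otherwise show up as stray characters in a saved log.

```python
import re
import subprocess

# Matches ANSI colour/reset sequences such as ESC[m or ESC[31m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def build_basf2_command(steering_file, n_events):
    """Return the argument list for 'basf2 <steering_file> -n <n_events>'."""
    return ["basf2", steering_file, "-n", str(n_events)]


def clean_log(text):
    """Remove terminal colour codes from captured output."""
    return ANSI_ESCAPE.sub("", text)


def run_and_save(steering_file, n_events, log_path="output.txt"):
    """Run the skim and write its cleaned output to log_path.

    Requires basf2 to be set up in the current environment.
    """
    result = subprocess.run(
        build_basf2_command(steering_file, n_events),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same log
        universal_newlines=True,   # return text rather than bytes
    )
    with open(log_path, "w") as log_file:
        log_file.write(clean_log(result.stdout))
    return result.returncode


# Show the command that would be run; call run_and_save(...) to execute it.
print(" ".join(build_basf2_command("ALP3Gamma_Skim_Standalone.py", 10)))
```

Calling `run_and_save("ALP3Gamma_Skim_Standalone.py", 10)` from a cell then leaves a plain-text `output.txt` that you can search for the statistics.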

Now you can work on improving the retention rate. Take a look at PID requirements, mass cuts, deltaE cuts and see if there are some generic cuts that would improve the retention rate without cutting out physics for those that would use your skim.

### Exercise C: Test your skim on generic MC and data <a name="ExC"></a>

Now test your skim on a small subset of generic backgrounds from a recent MC campaign by running `b2skim-stats-submit -s your_skim_name` _within_ the skim directory. This command runs over 10,000 events for each of the generic MC and certain data collections.

Once your skim has finished running on the different types of MC and data (you can check whether your jobs have finished running on kekcc with `bjobs`), you can determine the skim statistics by running `b2skim-stats-print -s your_skim_name`. Add the `-C` option to print the results as a Confluence-friendly table.

Every approved skim in use has to be tested on the various background types and validated with data, and its statistics should be made available via the Skimming Homepage.

### Exercise D: Skim Validation <a name="ExD"></a>

It is important to check our skims for each build of the software. Go to `validation/` and look at the structure within. There is an automatic nightly build validation that must be set up for each skim.

3 scripts are required per skim in the validation directory: 

* `test0_myScript.py` ==> this is used to generate a SIGNAL mode that will have a high survival rate in the skim to be tested. 

* `test1_myScript.py` ==> this is used to run the skim on the .mdst produced by `test0_myScript.py` and produce the output .udst

* `test2_myScript.py` ==> this is used to run on the output .udst produced by `test1_myScript.py` and  make at least 3-4 plots of important variables that will be used for validation. 

Use the current scripts as examples to write your own.
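
Before submitting, it can help to check that all three scripts are in place. The sketch below assumes the `test0_`, `test1_`, `test2_` naming pattern described above; the helper names are hypothetical and not part of the skim package.

```python
from pathlib import Path


def expected_validation_scripts(name):
    """Return the three script names expected for a skim called `name`."""
    return [f"test{i}_{name}.py" for i in range(3)]


def missing_validation_scripts(validation_dir, name):
    """Return the expected scripts that are not present in validation_dir."""
    directory = Path(validation_dir)
    return [
        script
        for script in expected_validation_scripts(name)
        if not (directory / script).is_file()
    ]


print(expected_validation_scripts("myScript"))
```

For example, `missing_validation_scripts("validation", "myScript")` returns an empty list once all three files exist.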

Once this skim is accepted into the skim package, the output `.root` file produced using `test2_myScript.py` should also be pushed to the validation directory as a reference file.

That's it! Thank you for your interest and involvement in developing skims. If you have any feedback or questions about the skimming tutorial, please contact [hannah.wakeling@mail.mcgill.ca](mailto:hannah.wakeling@mail.mcgill.ca).