# Notes on Dataset Discovery in Coffea 2024.8.1

### For analysis we often need simulated (Monte Carlo, "MC" for short) data; it's crucial for building our analysis tools before we apply them to real collision data. We generate huge numbers of simulated events, and the resulting datasets (terabytes in size) are split into many .root files stored at computing facilities all across the world. These facilities are [tiered](https://cms.cern/detector/computing-grid) from 0 to 3, which has some implications for the availability of the files, but that is not too important at the moment.

### Querying datasets and finding the right kind of MC events for a specific analysis can be quite a chore. [CMS DAS]()

### Searching for datasets can be sped up with `coffea.dataset_tools`. These tools are not pulled in by a plain `import coffea`, so they must be explicitly imported, e.g. `from coffea.dataset_tools.dataset_query import DataDiscoveryCLI`, or whichever other tool you'd like to use from the package. In this example, the *DataDiscoveryCLI* tool is used to interactively search for data samples. It works for both MC and collision data since it uses rucio to query grid sites.

#### Here is a basic example: 

In [1]:
import coffea
coffea.__version__

'2024.8.1'

In [2]:
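# Only DataDiscoveryCLI is needed for the calls below; the other imports are kept from the original example
from rich.console import Console
from coffea.dataset_tools import rucio_utils
from coffea.dataset_tools.dataset_query import DataDiscoveryCLI, print_dataset_query

ddc = DataDiscoveryCLI()
# Wildcard query: every dataset whose name starts with /SlepSnuCascade
ddc.do_query("/SlepSnuCascade*")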



Output()

___
### This example code is taken from the [coffea documentation on the topic](https://coffea-hep.readthedocs.io/en/latest/notebooks/dataset_discovery.html); I made some changes to tinker with it and use it in a simpler way, as I see fit.

### The API documentation can be found [here](https://coffea-hep.readthedocs.io/en/latest/dataset_tools.html#coffea.dataset_tools.dataset_query.DataDiscoveryCLI). The query above is about as basic as it gets: I pass the `do_query` method a string naming the datasets I'm interested in (with a `*` wildcard to simplify the search), and the DataDiscoveryCLI returns a nicely formatted table of the available datasets. I am only interested in the NanoAOD simulation files, so let's refine the query:

(we're looking for a dataset like: `/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer23BPixNanoAODv12-130X_mcRun3_2023_realistic_postBPix_v6-v3/NANOAODSIM`)
___
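
### A dataset name like the one quoted above has three slash-separated fields: the primary dataset (the physics process), the processed dataset (campaign and conditions), and the data tier. Here is a minimal sketch that splits a name into those fields. `split_dataset_name` is my own illustrative helper, not something coffea provides.

```python
# Illustrative helper (an assumption of this note, not part of coffea):
# split a dataset name of the form /<primary>/<processed>/<tier>.
def split_dataset_name(name: str) -> dict:
    parts = name.strip().split("/")
    # A valid name starts with "/", so parts[0] is empty and three fields follow.
    if len(parts) != 4 or parts[0] != "":
        raise ValueError(f"not a /primary/processed/tier name: {name!r}")
    primary, processed, tier = parts[1:]
    return {"primary": primary, "processed": processed, "tier": tier}


name = (
    "/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8"
    "/Run3Summer23BPixNanoAODv12-130X_mcRun3_2023_realistic_postBPix_v6-v3"
    "/NANOAODSIM"
)
fields = split_dataset_name(name)
print(fields["tier"])  # NANOAODSIM
```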

In [3]:
ddc.do_query("/SlepSnuCascade*NanoAOD*v3*")

Output()
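
### To get a feel for how the `*` wildcards narrow things down, here is a small offline sketch using Python's `fnmatch`. The real matching is done by rucio on the server side, so this only illustrates glob semantics. The second candidate name is a made-up placeholder, not a real dataset.

```python
from fnmatch import fnmatchcase

candidates = [
    # The dataset we are after (name taken from the text above)
    "/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8"
    "/Run3Summer23BPixNanoAODv12-130X_mcRun3_2023_realistic_postBPix_v6-v3"
    "/NANOAODSIM",
    # Hypothetical placeholder, only here to have something to filter out
    "/SlepSnuCascade_EXAMPLE/Run3Summer23BPixMiniAODv4-EXAMPLE-v3/MINIAODSIM",
]


def glob_filter(pattern, names):
    """Return the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [n for n in names if fnmatchcase(n, pattern)]


broad = glob_filter("/SlepSnuCascade*", candidates)
narrow = glob_filter("/SlepSnuCascade*NanoAOD*v3*", candidates)
print(len(broad), len(narrow))  # 2 1
```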

___
### You can also use the query interactively:
#### (no need to put the string in "quotes"; just type something like /SlepSnu\*NanoAOD\*v3* at the prompt and you should get the same result as above)
___

In [4]:
ddc.do_query()

 /SlepSnu*NanoAOD*v3*


Output()

___
### Very cool. **Please note: the last query is stored in the DataDiscoveryCLI object and is overwritten each time you run a new query.** This differs from how selected datasets are handled, as we will see below. As proof that the query results are stored in the object, you can retrieve them at any time by calling `do_query_results()`:
___

In [5]:
ddc.do_query_results()

### Since the previous query is always stored, we can run further methods on it, like `do_select`:

In [6]:
ddc.do_select("all")

___
### Now we can print the list of our selected datasets:
___

In [7]:
ddc.do_list_selected()

___
### One lesson learned, though: the internal list of selected datasets is always appended to when you run `do_select`. If you go back up and rerun the `ddc.do_select("all")` cell and then rerun `ddc.do_list_selected()`, you will see the same datasets listed again. As far as I know, they are never overwritten.
___
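
### The append-only behaviour is easy to reproduce with a toy object. This sketch does not touch coffea at all: `SelectionSketch` and the dataset names are placeholders I made up, and `unique_in_order` is just one way to clean up duplicates afterwards.

```python
class SelectionSketch:
    """Toy stand-in for an object that only ever appends to its selection."""

    def __init__(self):
        self.selected = []

    def select(self, names):
        # Append without checking for duplicates, mimicking the behaviour seen above
        self.selected.extend(names)


def unique_in_order(names):
    """Drop duplicates while keeping the first occurrence of each name."""
    return list(dict.fromkeys(names))


query_result = ["/A/B/NANOAODSIM", "/C/D/NANOAODSIM"]  # placeholder names

sel = SelectionSketch()
sel.select(query_result)
sel.select(query_result)  # same effect as rerunning the select cell

print(len(sel.selected))                   # 4
print(len(unique_in_order(sel.selected)))  # 2
```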

### We would love to know where the copies of these files, their "replicas", are stored. The next command prints the sites at which a given selected dataset is stored:
("first" means "take the first site from the rucio query" and "all" means "get replicas for all the selected datasets")
___

In [8]:
ddc.do_replicas("first", "all")

Output()

Output()

Output()

{'/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer23BPixNanoAODv12-130X_mcRun3_2023_realistic_postBPix_v6-v3/NANOAODSIM': {'files': {'root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/0238744e-b5c3-48d0-9415-97854405b503.root': 'Events',
   'root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/1b117b99-b288-4121-a3ad-e9cdee0deebd.root': 'Events',
   'root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/1c3d452f-62da-4dd1-9734-fbd7024de64d.root': 'Events',
   'roo

### The output above is very large for just 3 files, but it's complete (I think?). If we would like to filter by site, i.e. forbid certain sites from the replica query, we can block them with `do_blocklist_sites`:

In [9]:
blocked_sites = ["T1_UK_RAL_Disk", "T1_IT_CNAF_Disk", "T2_US_Vanderbilt"]  # arbitrary sites, picked only as an example
ddc.do_blocklist_sites(blocked_sites)

### The block list also only appends. However, there is a command that prints the block list and offers to clear it if you wish:

In [10]:
ddc.do_sites_filters()

 y


### The more convenient way to filter sites is probably to allow only the specific sites you want, the inverse of the block command. `do_allowlist_sites` handles that:

In [11]:
allowed_sites = ["T2_DE_DESY", "T1_US_FNAL_Disk", "T2_FR_IPHC", "T2_IT_Legnaro"]
ddc.do_allowlist_sites(allowed_sites)
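
### To make the "inverse" relationship concrete, here is a plain-Python sketch of allow/block filtering over a list of site names. `filter_sites` is a hypothetical helper, and applying the allow list before the block list is my assumption; I have not checked how coffea combines the two internally.

```python
def filter_sites(sites, allowlist=None, blocklist=None):
    """Keep sites that pass the allow list (if given) and are not blocked (if given)."""
    kept = []
    for site in sites:
        if allowlist is not None and site not in allowlist:
            continue  # not explicitly allowed
        if blocklist is not None and site in blocklist:
            continue  # explicitly blocked
        kept.append(site)
    return kept


# Pretend these are the sites holding replicas of one file (illustrative only)
sites = ["T1_US_FNAL_Disk", "T2_DE_DESY", "T1_UK_RAL_Disk", "T2_US_Vanderbilt"]

allowed = ["T2_DE_DESY", "T1_US_FNAL_Disk", "T2_FR_IPHC", "T2_IT_Legnaro"]
blocked = ["T1_UK_RAL_Disk", "T1_IT_CNAF_Disk", "T2_US_Vanderbilt"]

by_allow = filter_sites(sites, allowlist=allowed)
by_block = filter_sites(sites, blocklist=blocked)
print(by_allow)  # ['T1_US_FNAL_Disk', 'T2_DE_DESY']
print(by_block)  # ['T1_US_FNAL_Disk', 'T2_DE_DESY']
```

For this particular set of sites the two lists happen to select the same replicas, which is why an allow list is usually the shorter thing to write.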

### Now let's look at the site filters:

In [12]:
ddc.do_sites_filters()

 y


### Site filtering can be greatly simplified by passing a regular expression to `do_regex_sites`:

In [13]:
#ddc.do_regex_sites(r"T[123]_(CH|IT|UK|FR|DE)_\w+")
ddc.do_regex_sites(r"T[12]_(US)_\w+")
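
### It is worth checking offline what a site regex actually selects before handing it to the CLI. The sketch below runs the two patterns from the cell above against the site names used earlier in these notes, using Python's `re` module. Using `fullmatch` here is my choice for the illustration; whether coffea anchors the pattern the same way is an assumption I have not verified.

```python
import re

# Site names that appear earlier in these notes
sites = [
    "T1_UK_RAL_Disk",
    "T1_IT_CNAF_Disk",
    "T2_US_Vanderbilt",
    "T2_DE_DESY",
    "T1_US_FNAL_Disk",
    "T2_FR_IPHC",
    "T2_IT_Legnaro",
]

us_pattern = re.compile(r"T[12]_(US)_\w+")                  # US Tier-1 and Tier-2 sites
eu_pattern = re.compile(r"T[123]_(CH|IT|UK|FR|DE)_\w+")     # the commented-out alternative

us_sites = [s for s in sites if us_pattern.fullmatch(s)]
eu_sites = [s for s in sites if eu_pattern.fullmatch(s)]

print(us_sites)  # ['T2_US_Vanderbilt', 'T1_US_FNAL_Disk']
print(len(eu_sites))  # 5
```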

In [14]:
ddc.do_replicas("first", "all")

Output()

Output()

Output()

{'/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer23BPixNanoAODv12-130X_mcRun3_2023_realistic_postBPix_v6-v3/NANOAODSIM': {'files': {'root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/0238744e-b5c3-48d0-9415-97854405b503.root': 'Events',
   'root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/1b117b99-b288-4121-a3ad-e9cdee0deebd.root': 'Events',
   'root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/1c3d452f-62da-4dd1-9734-fbd7024de64d.root': 'Events',
   'roo
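
### The printed dictionary is nested as dataset name -> `'files'` -> file URL -> tree name. Here is a sketch of walking that structure and counting files per storage host. The `replicas` dict below is rebuilt by hand from the three entries visible in the (truncated) output above; the real object may carry more keys and more files.

```python
from collections import Counter
from urllib.parse import urlparse

dataset = (
    "/SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8"
    "/Run3Summer23BPixNanoAODv12-130X_mcRun3_2023_realistic_postBPix_v6-v3"
    "/NANOAODSIM"
)
prefix = (
    "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/Run3Summer23BPixNanoAODv12/"
    "SlepSnuCascade_MN1-220_MN2-260_MC1-240_TuneCP5_13p6TeV_madgraphMLM-pythia8/"
    "NANOAODSIM/130X_mcRun3_2023_realistic_postBPix_v6-v3/2520000/"
)
# Hand-copied subset of the output above: {dataset: {"files": {url: tree_name}}}
replicas = {
    dataset: {
        "files": {
            prefix + "0238744e-b5c3-48d0-9415-97854405b503.root": "Events",
            prefix + "1b117b99-b288-4121-a3ad-e9cdee0deebd.root": "Events",
            prefix + "1c3d452f-62da-4dd1-9734-fbd7024de64d.root": "Events",
        }
    }
}

hosts = Counter()
for ds_name, info in replicas.items():
    for url, tree_name in info["files"].items():
        # The host part of the XRootD URL tells us which storage endpoint serves the file
        hosts[urlparse(url).hostname] += 1

print(dict(hosts))  # {'cmsdcadisk.fnal.gov': 3}
```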

### To bring it all home, we would like to create a preprocess file from the selected datasets. This file can be saved as a .json and later loaded in a different script/notebook to run an analysis on. Let's try:

In [15]:
fileset_total = ddc.do_preprocess(
    output_file="fileset",
    step_size=10000,  # chunk size for file splitting
    align_to_clusters=False,
    scheduler_url=None,
)

 y


 1


 


Output()

RuntimeError: Nanny failed to start.
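
### For reference, the `step_size` argument above is the chunk size used to split each file's events into ranges that can be processed independently. Here is a minimal sketch of that idea with plain uniform chunks. `make_steps` and the entry count are made up for illustration; coffea's real preprocessing may balance or align the steps differently.

```python
def make_steps(num_entries, step_size):
    """Split [0, num_entries) into consecutive [start, stop) ranges of at most step_size."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    return [
        [start, min(start + step_size, num_entries)]
        for start in range(0, num_entries, step_size)
    ]


# A hypothetical file with 25,000 events, chunked as in the cell above
steps = make_steps(25_000, 10_000)
print(steps)  # [[0, 10000], [10000, 20000], [20000, 25000]]
```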