# The probe phase
There are several ways of analyzing topology and trajectory pairs, depending on the number of trajectory files per topology file, the continuity of the trajectory files, the organization of files in a directory, and the parallel or sequential arrangement of the computational workhorse.


## To-do-list

This notebook is a living document. It integrates all the past notebooks and scripts written for the *probe* phase in the *PolyPhys* package (formerly called the *extraction* phase in the deprecated *sumrule* package).

The list below is used to review the past notebooks and scripts and to combine them into this notebook:

- [ ] Sum-rule: segments: bug test and run on **date:NODATE-YET**
- [ ] Sum-rule: segments: all test and run on **date:NODATE-YET**
- [ ] Sum-rule: wholes: bug test and run on **date:NODATE-YET**
- [ ] Sum-rule: wholes: all test and run on **date:NODATE-YET**
- [ ] Sum-rule: all_in_one: test and run on **date:NODATE-YET**
- [X] Trans-Foci: segments: bug test and run on **date:20220621**
- [ ] Trans-Foci: segments: all test and run on **date:NODATE-YET**
- [ ] Trans-Foci: wholes: bug test and run on **date:NODATE-YET**
- [ ] Trans-Foci: wholes: all test and run on **date:NODATE-YET**
- [ ] Trans-Foci: all_in_one: test and run on **date:NODATE-YET**

### Settings for testing and running on a PC.

In [None]:
# settings for testing and running on a PC.
from glob import glob
import pathlib
from polyphys.manage import organizer
from polyphys.manage.parser import SumRule
from polyphys.probe import prober

In [None]:
from dask.distributed import Client
from dask import delayed
from dask import compute
client = Client(n_workers=2)
client

## HPC Cluster: gnuparallel

### Separated *whole* simulation directories

On a cluster, *whole* simulations are organized into *whole* directories, where each *whole* directory contains all the files of a given *whole* simulation. **gnuparallel** is used to parallelize the **probe** phase at the **shell** level. For this purpose, all the Python modules and scripts are installed and run separately on each core. For instance, if 32 cores are available, then the files in 32 *whole* directories are processed simultaneously. However, each *whole* directory may contain multiple topology and trajectory pairs. Thus, the parallelization is at the level of *whole* directories, not at the level of the *segment* or *whole* trajectories inside a *whole* directory. Inside each *whole* directory, a Python **main_probe.py** script analyzes the trajectories sequentially.

- trj and all *segments* on a cluster

For each *whole* directory, the following script is executed by means of *gnuparallel*. See these scripts: *probe-1.7-all_trj_segments.py* and *probe-1.7-bug_trj_segments*
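
The division of labour described above (parallel across *whole* directories, sequential inside each one) can be sketched in Python. This is only an illustration under stated assumptions: the standard library's `concurrent.futures` is used as a stand-in for *gnuparallel*, and `probe_whole_directory`, `probe_space` and the `*.lammpstrj` pattern are hypothetical placeholders for what **main_probe.py** does.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def probe_whole_directory(whole_dir):
    """Hypothetical stand-in for main_probe.py: visit the trajectories
    of one *whole* directory one after another (sequentially)."""
    probed = []
    for trj in sorted(Path(whole_dir).glob("*.lammpstrj")):
        # a real script would call a prober function here
        probed.append(trj.name)
    return probed


def probe_space(space_dir, n_workers=4):
    """Probe every *whole* directory in `space_dir`, several directories
    at a time; returns {directory name: probed trajectory files}."""
    whole_dirs = sorted(p for p in Path(space_dir).iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(probe_whole_directory, whole_dirs)
        return {d.name: r for d, r in zip(whole_dirs, results)}
```

The number of workers plays the role of the number of cores: directories are handled concurrently, while the loop inside `probe_whole_directory` stays sequential.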

#### Sum-rule project

##### segments
Each *bug/all topology* comes with **more than one** *bug/all trajectory*.

###### bug

In [None]:
from glob import glob
from polyphys.manage import organizer
from polyphys.manage.parser import SumRule
from polyphys.probe import prober


geometry = 'biaxial'
trj_lineage = 'segment'
save_to = "./"
bug_trjs = glob("./N*.bug.lammpstrj")
bug_trjs = organizer.sort_filenames(bug_trjs, fmts=['.bug.lammpstrj'])
bug_trjs = [bug_trj[0] for bug_trj in bug_trjs]
bug_topo = glob("./N*.bug.data")
bug_topo = organizer.sort_filenames(bug_topo, fmts=['.bug.data'])
bug_topo = bug_topo[0][0]
print(bug_topo)
for bug_trj in bug_trjs:
    print(bug_trj)
    trj_info = SumRule(bug_trj, geometry=geometry, group='bug',
                       lineage=trj_lineage)
    # all the frames in the last segment are probed:
    if trj_info.segment_id == len(bug_trjs):
        prober.sum_rule_bug(
            bug_topo, bug_trj, geometry, trj_lineage, save_to
        )
    # the last frame in all the other segments is ignored:
    else:
        prober.sum_rule_bug(
            bug_topo, bug_trj, geometry, trj_lineage, save_to, continuous=True
        )
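
The rule applied in the loop above can be isolated in a small helper. This is a minimal sketch, assuming that segment ids run from 1 to the number of segments (as the `segment_id == len(bug_trjs)` comparison implies); `is_continuous` is a hypothetical name and is not part of *PolyPhys*.

```python
def is_continuous(segment_id, n_segments):
    """Return the value to pass as `continuous` for one segment.

    Every segment but the last is followed by another segment, so its
    last frame is ignored (continuous=True); the last segment is
    probed in full (continuous=False).
    """
    if not 1 <= segment_id <= n_segments:
        raise ValueError(f"segment_id {segment_id} not in 1..{n_segments}")
    return segment_id != n_segments


flags = [is_continuous(i, 3) for i in (1, 2, 3)]
print(flags)  # [True, True, False]
```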

###### all

In [None]:
from glob import glob
from polyphys.manage import organizer
from polyphys.manage.parser import SumRule
from polyphys.probe import prober

geometry = 'biaxial'
trj_lineage = 'segment'
save_to = "./"
all_trjs = glob("./N*.all.lammpstrj")
all_trjs = organizer.sort_filenames(all_trjs, fmts=['.all.lammpstrj'])
all_trjs = [all_trj[0] for all_trj in all_trjs]
all_topo = glob("./N*.all.data")
all_topo = organizer.sort_filenames(all_topo, fmts=['.all.data'])
all_topo = all_topo[0][0]
print(all_topo)
for all_trj in all_trjs:
    print(all_trj)
    trj_info = SumRule(all_trj, geometry=geometry, group='all',
                       lineage=trj_lineage)
    # all the frames in the last segment are probed:
    if trj_info.segment_id == len(all_trjs):
        prober.sum_rule_all(all_topo, all_trj, geometry, trj_lineage, save_to)
    # the last frame in all the other segments is ignored:
    else:
        prober.sum_rule_all(
            all_topo,
            all_trj,
            geometry,
            trj_lineage,
            save_to,
            continuous=True
        )

#### Trans-Foci project

##### 2. wholes
Each *bug/all topology* comes with only **one** *bug/all trajectory*.

###### bug

In [None]:
from glob import glob
from polyphys.manage import organizer
from polyphys.manage.parser import TransFoci
from polyphys.probe import prober

# analyzing bug files.
geometry = 'biaxial'
trj_lineage = 'whole'
save_to = "./"
bug_pairs = glob("./eps*.bug.*")
# sort the files into (topology, trajectory) pairs:
bug_pairs = organizer.sort_filenames(bug_pairs)
for (bug_topo, bug_trj) in bug_pairs:
    prober.trans_fuci_bug(
        bug_topo,
        bug_trj,
        lineage=trj_lineage,
        geometry=geometry
    )
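
The loop above expects `organizer.sort_filenames` to return `(topology, trajectory)` pairs. How such a pairing can work is sketched below with the standard library only. `pair_by_stem` is a hypothetical helper written for illustration, not the *PolyPhys* implementation, and it assumes that a topology file and its trajectory file share everything before the final extension.

```python
from pathlib import Path


def pair_by_stem(filenames, topo_ext=".data", trj_ext=".lammpstrj"):
    """Pair each topology file with the trajectory of the same stem."""
    topos = {}
    trjs = {}
    for name in filenames:
        path = Path(name)
        if path.suffix == topo_ext:
            topos[path.stem] = name
        elif path.suffix == trj_ext:
            trjs[path.stem] = name
    # stems that have a topology or a trajectory, but not both
    unpaired = sorted(set(topos) ^ set(trjs))
    if unpaired:
        raise ValueError(f"unpaired files for: {unpaired}")
    return [(topos[stem], trjs[stem]) for stem in sorted(topos)]


files = ["eps2.bug.lammpstrj", "eps1.bug.data",
         "eps1.bug.lammpstrj", "eps2.bug.data"]
pairs = pair_by_stem(files)
print(pairs)
# [('eps1.bug.data', 'eps1.bug.lammpstrj'),
#  ('eps2.bug.data', 'eps2.bug.lammpstrj')]
```

Raising on unpaired files makes a missing topology or trajectory visible before any probing starts.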

# PC: Serial and Dask

There are four different types of directories, only one of which can be present in a *space* directory.

There are separate **whole** directories, each of which contains **all** and **bug** **whole** trajectories; or there are separate **whole** directories, each of which contains **all** and **bug** **segment** trajectories. Below are two groups of scripts for the **serial** and **parallel** running schemes.


#### *bug* *segment* trjs:

On a PC, the *whole* directories are located in a master *space-trjs* directory, and one main Python script probes all the *whole* directories in a parallel scheme via Dask. This is different from the *gnuparallel*-based approach, in which each *whole* directory has its own copy of the required scripts and a main Python script is run to probe that directory individually.

## Serial scheme

### Sum-rule project


In [None]:
from glob import glob
from polyphys.manage import organizer
from polyphys.manage.parser import SumRule
from polyphys.probe import prober

geometry = 'biaxial'
trj_lineage = 'segment'
save_to = "./"
# directory of the whole simulation whose segments are probed:
whole_dir = (
    "../test_data/trjs-continuous/N500D10.0ac0.8-trjs/"
    "N500epsilon5.0r5.5lz205.5sig0.8nc12012dt0.002bdump1000adump5000ens2/"
)

# bug segments
bug_trjs = glob(whole_dir + "N*.bug.lammpstrj")
bug_trjs = organizer.sort_filenames(bug_trjs, fmts=['.bug.lammpstrj'])
bug_trjs = [bug_trj[0] for bug_trj in bug_trjs]
bug_topo = glob(whole_dir + "N*.bug.data")
bug_topo = organizer.sort_filenames(bug_topo, fmts=['.bug.data'])
bug_topo = bug_topo[0][0]
print(bug_topo)
for bug_trj in bug_trjs:
    print(bug_trj)
    trj_info = SumRule(bug_trj, geometry=geometry, group='bug',
                       lineage=trj_lineage)
    # all the frames in the last segment are probed:
    if trj_info.segment_id == len(bug_trjs):
        prober.sum_rule_bug(bug_topo, bug_trj, geometry, trj_lineage, save_to)
    # the last frame in all the other segments is ignored:
    else:
        prober.sum_rule_bug(
            bug_topo, bug_trj, geometry, trj_lineage, save_to, continuous=True
        )

# all segments
all_trjs = glob(whole_dir + "N*.all.lammpstrj")
all_trjs = organizer.sort_filenames(all_trjs, fmts=['.all.lammpstrj'])
all_trjs = [all_trj[0] for all_trj in all_trjs]
all_topo = glob(whole_dir + "N*.all.data")
all_topo = organizer.sort_filenames(all_topo, fmts=['.all.data'])
all_topo = all_topo[0][0]
print(all_topo)
for all_trj in all_trjs:
    print(all_trj)
    trj_info = SumRule(all_trj, geometry=geometry, group='all',
                       lineage=trj_lineage)
    # all the frames in the last segment are probed:
    if trj_info.segment_id == len(all_trjs):
        prober.sum_rule_all(all_topo, all_trj, geometry, trj_lineage, save_to)
    # the last frame in all the other segments is ignored:
    else:
        prober.sum_rule_all(
            all_topo, all_trj, geometry, trj_lineage, save_to, continuous=True
        )

In [None]:
# The approach from HERE
path = pathlib.Path('../test_data/trjs-continuous/N500D10.0ac0.8-trjs')
path = path.resolve()  # convert the relative path to an absolute one
input_database = str(path)
if not pathlib.Path(input_database).exists():
    raise OSError(f"'{input_database}' path does not exist.")
# to HERE does not work if * is used in the string input for Path.
geometry = 'biaxial'
group = 'bug'
hierarchy = '/N*/N*'
observations = glob(input_database + hierarchy)
if observations == []:
    raise OSError(
        "File not found in "
        f"'{input_database + hierarchy}'"
        )
topologies = organizer.sort_filenames(observations, fmts=['.bug.data'])
trajectories = organizer.sort_filenames(observations, fmts=['.bug.lammpstrj'])
# 'bug' time series and histograms
# NOTE: `analyzer` is not imported in the settings cells above; import it before running this cell.
save_to = analyzer.database_path(input_database, phase='probe', stage='segment', group=None)
for topology in topologies:
    print(topology[0])
    topo_info = SumRule(topology[0],geometry=geometry, group=group, lineage='whole')
    save_to_whole = save_to + '/' + topo_info.whole
    save_to_whole = pathlib.Path(save_to_whole) 
    try:
        save_to_whole.mkdir(parents=True, exist_ok=False)
    except FileExistsError as error:
        print(error)
        print(
            f"Directory '{save_to_whole}'"
            " exists. Files are saved to or overwritten in the existing directory.")
    finally:
        save_to_whole = str(save_to_whole) + '/'
    for trajectory in trajectories:
        trj_info = SumRule(trajectory[0],geometry=geometry, group=group, lineage='segment')
        if trj_info.whole == topo_info.whole:
            if trj_info.segment_id ==10:
                prober.sum_rule_bug(topology[0], trajectory[0], geometry, 'segment', save_to_whole)
            else:
                prober.sum_rule_bug(topology[0], trajectory[0], geometry, 'segment', save_to_whole, continuous=True)
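
The comment in the cell above notes that a `*` cannot be used in the string given to `pathlib.Path`. One possible alternative, sketched here under the assumption that the same two-level `N*/N*` layout is used, is to resolve the wildcard-free base path first and then pass the pattern to `Path.glob`. The function name `find_observations` is hypothetical.

```python
from pathlib import Path


def find_observations(database, pattern="N*/N*"):
    """Resolve `database`, then expand `pattern` relative to it."""
    base = Path(database).resolve()
    if not base.is_dir():
        raise OSError(f"'{base}' path does not exist.")
    # Path.glob accepts wildcards, unlike the Path constructor
    observations = sorted(str(p) for p in base.glob(pattern))
    if not observations:
        raise OSError(f"File not found in '{base / pattern}'")
    return observations
```

The returned list of strings can be handed to `organizer.sort_filenames` in the same way as the output of `glob` in the cell above.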

#### *bug* *whole* trjs

In [None]:
# The approach from HERE
path = pathlib.Path('/Users/amirhsi_mini/trjs/N500D10.0ac0.8-trjs')
path = path.resolve()  # convert the relative path to an absolute one
input_database = str(path)
if not pathlib.Path(input_database).exists():
    raise OSError(f"'{input_database}' path does not exist.")
# to HERE does not work if * is used in the string input for Path.
geometry = 'biaxial'
group = 'bug'
hierarchy = '/N*/N*'
observations = glob(input_database + hierarchy)
if observations == []:
    raise OSError(
        "File not found in "
        f"'{input_database + hierarchy}'"
        )
topologies = organizer.sort_filenames(observations, fmts=['.bug.data'])
trajectories = organizer.sort_filenames(observations, fmts=['.bug.lammpstrj'])
# 'bug' time series and histograms
save_to = analyzer.database_path(input_database, phase='probe', stage='segment', group=None)
for topology in topologies:
    print(topology[0])
    topo_info = SumRule(topology[0],geometry=geometry, group=group, lineage='whole')
    save_to_whole = save_to + '/' + topo_info.whole
    save_to_whole = pathlib.Path(save_to_whole) 
    try:
        save_to_whole.mkdir(parents=True, exist_ok=False)
    except FileExistsError as error:
        print(error)
        print(
            f"Directory '{save_to_whole}'"
            " exists. Files are saved to or overwritten in the existing directory.")
    finally:
        save_to_whole = str(save_to_whole) + '/'
    for trajectory in trajectories:
        trj_info = SumRule(trajectory[0],geometry=geometry, group=group, lineage='whole')
        if trj_info.whole == topo_info.whole:
            prober.sum_rule_bug(topology[0], trajectory[0], geometry, 'whole', save_to_whole, continuous=False)

### Trans-Foci project

#### 1. whole trjs

In [None]:
# analyzing bug files.
geometry = 'biaxial'
lineage = 'segment'
#macmini_path = "/Users/amirhsi_mini/trjs/epss5.0epsl5.0r10.5al5.0nl5ml125ns200ac1.0nc*lz77.0dt0.005bdump5000adump5000ens1ring/*.bug*"
macbookpro_path = "/Users/amirhsi/Downloads/ns400nl5al5D20ac1-trjs/epss5epsl5r10.5al5nl5ml125ns400ac1nc*lz77dt0.005bdump2000adump5000ens*.ring/*.bug*"
bug_pairs = glob(macbookpro_path)
bug_pairs = organizer.sort_filenames(bug_pairs)
for (bug_topo, bug_trj) in bug_pairs:
    prober.trans_fuci_bug(
        bug_topo,
        bug_trj,
        lineage=lineage,
        geometry=geometry
    )

##### Explore the outputs

In [None]:
import numpy as np
npys = glob("*.npy")
npys = organizer.sort_filenames(npys)
npys_arrs = {}
for npy in npys:
    var = npy.split('-')[-1].split('.')[0]
    npys_arrs[var] = np.load(npy)
npys_arrs

## Parallel scheme with dask

#### *bug* *whole* trjs: NOT tested 20220603

In [None]:
# The approach from HERE
path = pathlib.Path('/Users/amirhsi_mini/trjs/N500D10.0ac0.8-trjs')
path = path.resolve()  # convert the relative path to an absolute one
input_database = str(path)
if not pathlib.Path(input_database).exists():
    raise OSError(f"'{input_database}' path does not exist.")
# to HERE does not work if * is used in the string input for Path.
geometry = 'biaxial'
group = 'bug'
hierarchy = '/N*/N*'
observations = glob(input_database + hierarchy)
if observations == []:
    raise OSError(
        "File not found in "
        f"'{input_database + hierarchy}'"
        )
topologies = organizer.sort_filenames(observations, fmts=['.bug.data'])
trajectories = organizer.sort_filenames(observations, fmts=['.bug.lammpstrj'])
# 'bug' time series and histograms
save_to = analyzer.database_path(input_database, phase='probe', stage='segment', group=None)
trjs_computed = []
for topology in topologies:
    print(topology[0])
    topo_info = SumRule(topology[0],geometry=geometry, group=group, lineage='whole')
    save_to_whole = save_to + '/' + topo_info.whole
    save_to_whole = pathlib.Path(save_to_whole) 
    try:
        save_to_whole.mkdir(parents=True, exist_ok=False)
    except FileExistsError as error:
        print(error)
        print(
            f"Directory '{save_to_whole}'"
            " exists. Files are saved to or overwritten in the existing directory.")
    finally:
        save_to_whole = str(save_to_whole) + '/'
    for trajectory in trajectories:
        trj_info = SumRule(trajectory[0],geometry=geometry, group=group, lineage='whole')
        if trj_info.whole == topo_info.whole:
            trj_delayed = delayed(prober.sum_rule_bug)(topology[0], trajectory[0], geometry, 'whole', save_to_whole, continuous=False)
            trjs_computed.append(trj_delayed)

In [None]:
%%time
# it takes 9min and 34s.
results = compute(trjs_computed)

#### *bug* *segment* trjs: NOT written: NOT tested 20220603

#### *all* *segment* trjs: probe_all: NOT dask style: NOT tested 20220603

In [None]:
path = pathlib.Path('../test_data/trjs-continuous/N500D10.0ac0.8-trjs')
path = path.resolve()  # convert the relative path to an absolute one
input_database = str(path)
geometry = 'biaxial'
group = 'all'
hierarchy = '/N*/N*'
if not pathlib.Path(input_database).exists():
    raise OSError(f"'{input_database}' path does not exist.")
observations = glob(input_database + hierarchy)
if observations == []:
    raise OSError(
        "File not found in "
        f"'{input_database + hierarchy}'"
        )
topologies = organizer.sort_filenames(observations, fmts=['.all.data'])
trajectories = organizer.sort_filenames(observations, fmts=['.all.lammpstrj'])
# 'all' time series and histograms
save_to = analyzer.database_path(input_database, phase='probe', stage='segment', group=None)
for topology in topologies:
    topo_info = SumRule(topology[0],geometry=geometry, group=group, lineage='whole')
    save_to_whole = save_to + '/' + topo_info.whole
    save_to_whole = pathlib.Path(save_to_whole) 
    try:
        save_to_whole.mkdir(parents=True, exist_ok=False)
    except FileExistsError as error:
        print(error)
        print(
            f"Directory '{save_to_whole}'"
            " exists. Files are saved to or overwritten in the existing directory.")
    finally:
        save_to_whole = str(save_to_whole) + '/'
    for trajectory in trajectories:
        trj_info = SumRule(trajectory[0],geometry=geometry, group=group, lineage='segment')
        # probe only the segments that belong to this whole, as in the 'bug' cells:
        if trj_info.whole == topo_info.whole:
            if trj_info.segment_id ==10:
                prober.sum_rule_all(topology[0], trajectory[0], geometry, 'segment', save_to_whole)
            else:
                prober.sum_rule_all(topology[0], trajectory[0], geometry, 'segment', save_to_whole, continuous=True)

In [None]:
path = pathlib.Path('../test_data/trjs-continuous/N500D10.0ac0.8-trjs')
path = path.resolve()  # convert the relative path to an absolute one
input_database = str(path)
geometry = 'biaxial'
hierarchy = '/N*/N*'
observations = glob(input_database + hierarchy)
all_tuples =  organizer.sort_filenames(observations,fmts=['all.lammpstrj'])
all_trjs = [all_tuple[0] for all_tuple in all_tuples]
all_data =  organizer.sort_filenames(observations,fmts=['all.data'])
all_data = all_data[0][0]

    
for all_trj in all_trjs:
    print(all_trj)
    #PipeLine.extract_trj_all(all_data, all_trj, geom, save_to)

#### *all* *segment* trjs: probe_all_new: NOT tested 20220603

In [None]:
path = pathlib.Path('/Users/amirhsi_mini/trjs/N500D10.0ac0.8-trjs')
path = path.resolve()  # convert the relative path to an absolute one
input_database = str(path)
geometry = 'biaxial'
group = 'all'
hierarchy = '/N*/N*'
if not pathlib.Path(input_database).exists():
    raise OSError(f"'{input_database}' path does not exist.")
observations = glob(input_database + hierarchy)
if observations == []:
    raise OSError(
        "File not found in "
        f"'{input_database + hierarchy}'"
        )
topologies = organizer.sort_filenames(observations, fmts=['.all.data'])
trajectories = organizer.sort_filenames(observations, fmts=['.all.lammpstrj'])
# 'all' time series and histograms
save_to = analyzer.database_path(input_database, phase='probe', stage='segment', group=None)
trjs_computed = []
for topology in topologies:
    topo_info = SumRule(topology[0],geometry=geometry, group=group, lineage='whole')
    save_to_whole = save_to + '/' + topo_info.whole
    save_to_whole = pathlib.Path(save_to_whole) 
    try:
        save_to_whole.mkdir(parents=True, exist_ok=False)
    except FileExistsError as error:
        print(error)
        print(
            f"Directory '{save_to_whole}'"
            " exists. Files are saved to or overwritten in the existing directory.")
    finally:
        save_to_whole = str(save_to_whole) + '/'
    for trajectory in trajectories:
        trj_info = SumRule(trajectory[0],geometry=geometry, group=group, lineage='segment')
        if trj_info.whole == topo_info.whole:
            if trj_info.segment_id ==14:
                trj_delayed = delayed(prober.probe_all_new)(topology[0], trajectory[0], geometry, 'segment', save_to_whole, continuous=False)
                trjs_computed.append(trj_delayed)
            else:
                trj_delayed = delayed(prober.probe_all_new)(topology[0], trajectory[0], geometry, 'segment', save_to_whole, continuous=True)
                trjs_computed.append(trj_delayed)  

In [None]:
results = compute(trjs_computed)