# The analyze phase:
In this phase, any *segment* files produced in the *probe* phase are first merged into *whole* files. Then ensemble, ensemble-averaged, and space files are created from the whole files.
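The flow above can be pictured with a toy example. This is only a sketch: it assumes that merging segments means concatenating their time series along the time axis and that ensemble-averaging is a plain mean over wholes. The names `segments`, `wholes`, `ensemble`, and `ensemble_avg` are hypothetical and are not part of `polyphys`.

```python
import numpy as np
import pandas as pd

# Hypothetical time series of one property, stored as two segments per whole.
segments = {
    "whole1": [np.array([1.0, 2.0]), np.array([3.0, 4.0])],
    "whole2": [np.array([2.0, 4.0]), np.array([6.0, 8.0])],
}

# Step 1: merge the segments of each whole along the time axis (assumption).
wholes = {name: np.concatenate(parts) for name, parts in segments.items()}

# Step 2: an "ensemble" collects the wholes as columns of one table.
ensemble = pd.DataFrame(wholes)

# Step 3: the "ensemble-averaged" series is the mean over the wholes.
ensemble_avg = ensemble.mean(axis=1)
print(ensemble_avg.tolist())
```

The real analysis may merge other kinds of data (for instance histograms) differently; the example only illustrates the order of the steps.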

### To-do list:

- [x] analyzing *bug* *segment* and *whole* files in both projects in a serial manner.
- [ ] analyzing *bug* *segment* and *whole* files in both projects in a parallel manner with Dask: blocked by a memory linkage problem.
- [ ] analyzing *all* *segment* and *whole* files in both projects in a serial manner.
- [ ] analyzing *all* *segment* and *whole* files in both projects in a parallel manner with Dask.

### Naming convention:

This is the pattern of file or directory names:

1. **whole** files: whole-group-property_[-measure][-stage][.ext]
2. **ensemble** files: ensemble-group-property_[-measure][-stage][.ext]
3. **ensemble_long** files: ensemble_long-group-property_[-measure][-stage][.ext]
4. **space** files: space-group-property_[-measure][-stage][.ext]
5. **all in one** files: space-group-**species**-**allInOne**-property_[-measure][-stage][.ext]

[keyword] means that the keyword in the file name is optional. [-measure] is a physical measurement, such as the autocorrelation function (ACF), performed on the physical property 'property_'.
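As a rough illustration of the patterns above, the following helper assembles a file name from its keywords. It is a sketch under stated assumptions: `build_filename` is a hypothetical function (not part of `polyphys`), the names `N200D10ens1` and `N200D10` are made up, and `acf` is used as an example value for the optional measure keyword.

```python
from typing import Optional


def build_filename(
    lineage_name: str,
    group: str,
    property_: str,
    measure: Optional[str] = None,
    stage: Optional[str] = None,
    ext: Optional[str] = None,
) -> str:
    """Join the mandatory keywords and any optional ones with hyphens."""
    parts = [lineage_name, group, property_]
    # Optional keywords are appended only when they are given.
    parts += [kw for kw in (measure, stage) if kw is not None]
    name = "-".join(parts)
    return name if ext is None else f"{name}.{ext}"


# A 'whole' file with no optional keywords.
print(build_filename("N200D10ens1", "bug", "gyrT"))
# An ensemble-averaged file with a measure, a stage and an extension.
print(build_filename("N200D10", "bug", "gyrT", "acf", "ensAvg", "csv"))
```

Keeping the optional keywords at the end means a name can still be split on hyphens to recover the mandatory parts, provided the keywords themselves contain no hyphens.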

### Settings for testing and running on a PC

In [None]:
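from typing import Callable, Dict, Optional

import pandas as pd

# NOTE: `invalid_keyword` is not defined in this notebook; it is assumed to
# be provided by the polyphys package.


def parents_stamps(
    stamps: pd.DataFrame,
    geometry: str,
    group: str,
    lineage: str,
    properties: Optional[Dict[str, Callable]] = None,
    save_to: Optional[str] = None
) -> pd.DataFrame:
    """Perform merging/ensemble-averaging over all the 'segment'/'whole'
    simulation stamps in a 'space' in a given `geometry` for a given
    `group` based on the given `lineage`.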

    Parameters
    ----------
    stamps: DataFrame
        Dataframe of all the simulation stamps in the `group` in a space.
    geometry : {'biaxial', 'slit', 'box'}
        Shape of the simulation box
    group: {'bug', 'all'}
        Type of the particle group.
    lineage: {'segment', 'whole'}
        Lineage type of children's stamps.
    properties: dict of str
        A dictionary in which the keys are properties, such as the
        time-averaged radius of gyration, which are measured during the
        'probe' phase, and the values are user-defined or numpy functions
        which are used as the aggregation functions by pandas.
    save_to : str, default None
        Absolute or relative path of a directory to which outputs are saved.

    Notes
    -----
    If `lineage='segment'`, then stamps are for 'segments' and they have only
    different 'segment_id' for a given 'whole' parent. If `lineage='whole'`,
    then stamps are for 'wholes' and they have only different 'ensemble_id'
    for a given 'ensemble' parent. In either scenario, the 'stamp' files
    have 'segment' and 'segment_id' columns. If `lineage='whole'`, the
    values of these two columns are "N/A".

    Returns
    -------
    stamps_avg: pd.DataFrame
        Dataframe of all the parents stamps in the `group` in a space.
    """
    invalid_keyword(geometry, ['biaxial', 'slit', 'box'])
    invalid_keyword(group, ['bug', 'all'])
    invalid_keyword(lineage, ['segment', 'whole'])
    # attributes, properties and genealogy:
    stamps_cols = list(stamps.columns)
    try:
        stamps_cols.remove("lineage_name")
        stamps_cols.remove(lineage)
    except ValueError:
        print(
            f"'lineage_name' and '{lineage}'"
            " columns are not among the stamps columns: "
            f"'{stamps_cols}'; they were probably removed in "
            "a previous call of the 'parents_stamps' function."
        )
    # aggregation dictionary: See Note above.
    agg_funcs = dict()
    attr_agg_funcs = ['last'] * len(stamps_cols)
    agg_funcs.update(zip(stamps_cols, attr_agg_funcs))
    if properties is not None:  # add/update agg funcs for properties.
        agg_funcs.update(properties)
    # Handling 'lineage'-specific details:
    if lineage == 'whole':
        parent_groupby = 'ensemble_long'
        # If the 'whole' stamps are generated directly in the 'probe' phase,
        # then 'segment' and 'segment_id' columns are "N/A" and are removed
        # from the list of stamps columns that are added to the parents
        # stamps.
        # There is no need to have the "n_segment" column in the parent
        # stamps, so it is removed. The "whole" stamps directly generated
        # in the "probe" phase do not have such a column, but those generated
        # from "segment" stamps have.
        try:
            stamps_cols.remove("segment_id")  # segment_id is "N/A"
            stamps_cols.remove("segment")  # segment is "N/A"
            agg_funcs.pop("segment_id")
            agg_funcs.pop("segment")
        except (ValueError, KeyError):
            # list.remove raises ValueError; dict.pop raises KeyError.
            print(
                "'segment_id' and 'segment'"
                " columns are not among the stamps columns: "
                f"'{stamps_cols}'; they were probably removed in "
                "a previous call of the 'parents_stamps' function."
            )
        # aggregating functions for properties
        agg_funcs['ensemble_id'] = 'count'
        agg_funcs['n_frames'] = 'last'
        file_lastname = 'ensAvg'
    else:
        parent_groupby = 'whole'
        # aggregating functions for properties
        agg_funcs['segment_id'] = 'count'
        agg_funcs['n_frames'] = 'sum'
        file_lastname = 'whole'
    agg_funcs.pop(parent_groupby)
    parents_stamps = stamps.groupby([parent_groupby]).agg(agg_funcs)
    parents_stamps.reset_index(inplace=True)
    if lineage == 'whole':
        parents_stamps.rename(
            columns={'ensemble_id': 'n_ensembles'},
            inplace=True
        )
        try:
            parents_stamps.drop(columns=['n_segments'], inplace=True)
        except KeyError:
            print(
                "'n_segments' column is not among the parents' stamps "
                f"columns: '{stamps_cols}'; it was probably not created in "
                "a previous call of the 'parents_stamps' function."
            )
    else:
        parents_stamps.rename(
            columns={'segment_id': 'n_segments'},
            inplace=True
        )
    if save_to is not None:
        space_name = parents_stamps.loc[0, 'space']
        filename = '-'.join(
            [space_name, group, 'stamps', file_lastname + '.csv']
        )
        parents_stamps.to_csv(save_to + filename, index=False)
    return parents_stamps
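The core of the function is a pandas group-by with one aggregation rule per column. The toy example below mimics the `lineage='segment'` branch on made-up data; the stamp values and the `gyr_mean` property column are hypothetical, and the mean is just one possible user-chosen aggregation.

```python
import pandas as pd

# Hypothetical 'segment' stamps: two wholes with two segments each.
stamps = pd.DataFrame({
    "whole": ["w1", "w1", "w2", "w2"],
    "segment_id": [1, 2, 1, 2],
    "n_frames": [100, 150, 100, 120],
    "gyr_mean": [4.0, 6.0, 5.0, 7.0],  # hypothetical property column
})

agg_funcs = {
    "segment_id": "count",  # number of segments merged into each whole
    "n_frames": "sum",      # frames add up across the segments
    "gyr_mean": "mean",     # user-chosen aggregation for a property
}

# One row per parent 'whole', as in the 'segment' branch of the function.
parents = stamps.groupby(["whole"]).agg(agg_funcs).reset_index()
parents = parents.rename(columns={"segment_id": "n_segments"})
print(parents)
```

In the `lineage='whole'` branch the same idea applies, except that rows are grouped by 'ensemble_long', 'ensemble_id' is counted, and 'n_frames' takes the last value instead of the sum.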

In [None]:
from glob import glob
import pathlib
import pandas as pd
import numpy as np
import math
import re
from polyphys.manage import organizer
from polyphys.manage.parser import SumRule, TransFoci
from polyphys.analyze import analyzer
import warnings
warnings.filterwarnings("ignore")

In [None]:
from dask.distributed import Client
from dask import delayed
from dask import compute
client = Client(n_workers=4)
client

## PC: parallel scheme with Dask and serial scheme

Files generated in the *probe* phase can come in a *segment* or a *whole* version. The analyze phase can start from the *segments* or the *wholes*, depending on the type of simulation.

### files in *segment* lineage/format

#### bug

##### Sum-rule project

###### Serial with For-loop

In [None]:
%%time
# 10 mins on 
inupt_databases = glob("/Users/amirhsi_mini/research_data/probe/N*-probe-segment")
geometry = 'biaxial'
tseries_properties_bug = [
    # property_, species, group
    ('fsdT', 'Mon', 'bug'),
    ('gyrT', 'Mon', 'bug'),
    ('rfloryT', 'Mon', 'bug'),
    ('shapeT', 'Mon', 'bug'),
    ('asphericityT', 'Mon', 'bug')
]
acf_tseries_properties_bug = [
    # property_, species, group
    ('fsdT', 'Mon', 'bug'),
    ('gyrT', 'Mon', 'bug'),
    ('rfloryT', 'Mon', 'bug'),
    ('shapeT', 'Mon', 'bug'),
    ('asphericityT', 'Mon', 'bug')
]

#hist_properties_bug = [
    # direction, species, group
#    ('rflory', 'Mon', 'bug')
#]
for input_database in inupt_databases:
    print(input_database)
    analyzer.analyze_bug(
        input_database,
        '/N*/N*',
        SumRule,
        geometry,
        True,
        #nonscalar_hist_properties=nonscalar_hist_properties,
        #tseries_properties=tseries_properties_bug,
        #acf_tseries_properties=acf_tseries_properties_bug,
        #hist_properties=hist_properties_bug,
        #,
        #nlags=100000
    )

###### Dask: this has a problem with memory linkage

In [None]:
inupt_databases = glob("/Users/amirhsi_mini/research_data/probe/N*-probe-segment")
geometry = 'biaxial'
analyses = []
tseries_properties_bug = [
    # property_, species, group
    ('fsdT', 'Mon', 'bug'),
    ('gyrT', 'Mon', 'bug'),
    ('rfloryT', 'Mon', 'bug'),
    ('shapeT', 'Mon', 'bug'),
    ('asphericityT', 'Mon', 'bug')
]
acf_tseries_properties_bug = [
    # property_, species, group
#    ('fsdT', 'Mon', 'bug'),
#    ('gyrT', 'Mon', 'bug'),
#    ('rfloryT', 'Mon', 'bug'),
#    ('shapeT', 'Mon', 'bug'),
    ('asphericityT', 'Mon', 'bug')
]

#hist_properties_bug = [
    # direction, species, group
#    ('rflory', 'Mon', 'bug')
#]
for input_database in inupt_databases:
    print(input_database)
    analyze_delayed = delayed(analyzer.analyze_bug)(
        input_database,
        '/N*/N*',
        SumRule,
        geometry,
        True,
        #nonscalar_hist_properties=nonscalar_hist_properties,
        #tseries_properties=tseries_properties_bug,
        acf_tseries_properties=acf_tseries_properties_bug,
        #hist_properties=hist_properties_bug,
        #nlags=100000
    )
    analyses.append(analyze_delayed)

In [None]:
%%time
# it takes 9min and 34s.
# Unpack the list so that `results` is a flat tuple, one entry per database.
results = compute(*analyses)

##### Trans-foci project

### files in *whole* lineage/format

#### bug

##### Sum-rule project

##### Trans-Foci project

In [None]:
%%time
# takes 25 min with nlags=100000
input_database = "/Users/amirhsi_mini/research_data/probe/ns400nl5al5D20ac1-probe-whole"
#input_database = '../test_data/probe/N2000D30.0ac4.0-segment/'
nonscalar_hist_properties = [
    # property_, species, group
    ('bondsT', 'Foci', 'bug', 0),
#    ('clustersT', 'Foci', 'bug', 0)#,
]

#tseries_properties_bug = [
    # property_, species, group
#    ('fsdT', 'Mon', 'bug')#,
#    ('gyrT', 'Mon', 'bug')#,
#    ('rfloryT', 'Mon', 'bug'),
#    ('shapeT', 'Mon', 'bug'),
#    ('asphericityT', 'Mon', 'bug')
#]

acf_tseries_properties_bug = [
    # property_, species, group
    ('fsdT', 'Mon', 'bug'),
    ('gyrT', 'Mon', 'bug'),
    ('rfloryT', 'Mon', 'bug'),
    ('shapeT', 'Mon', 'bug'),
    ('asphericityT', 'Mon', 'bug')
]

#hist_properties_bug = [
    # direction, species, group
#    ('rflory', 'Mon', 'bug')
#]
geometry = 'biaxial'
analyzer.analyze_bug(
    input_database,
    '/eps*/eps*',
    TransFoci,
    geometry,
    False,
    #nonscalar_hist_properties=nonscalar_hist_properties,
    #tseries_properties=tseries_properties_bug,
    #acf_tseries_properties=acf_tseries_properties_bug,
    #hist_properties=hist_properties_bug,
    #nlags=100000
)