# <span style="color:teal">RBFE Network - Analysis</span>


This notebook provides a basic outline of how to run the analysis of an RBFE network.

In [None]:
# import libraries

from scipy.stats import sem
import sys
import glob
import networkx as nx

import logging

logging.getLogger().setLevel(logging.INFO)

from pipeline import *
from pipeline.utils import validate
from pipeline.analysis import *

The following variables need to be set:

net_file - the network file that describes all the perturbations that were run and which engine they were run for. Usually generated in the execution_model folder during setup.

ana_file - the analysis protocol that was used to analyse the runs. This determines the extension that is used to open the results files. If none is provided, all extensions/analysis methods are considered.

exp_file - the file containing the experimental results. This can be in yml format (preferred) or csv. In the yml file, the entry for each ligand should have the following format:

```
lig_a:
  measurement:
    comment:
    doi: source of data
    error: error
    type: ki or ic50
    unit: uM or nM 
    value: value
  name: lig_a
```

results_folder - the location of the results files computed during the analysis stage after the run. The default for this is outputs_extracted/results. 

output_folder - the location for the graphs and tables generated during this notebook.
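
To give a feel for how an entry in exp_file relates to a binding free energy, here is a minimal sketch that converts a measured value into ΔG using ΔG = RT ln(K / 1 M). The function name `exp_to_dg`, the temperature of 300 K, the example entry and the treatment of an IC50 as if it were a Ki are all assumptions made for illustration; the pipeline may handle this differently.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)
UNIT_TO_MOLAR = {"uM": 1e-6, "nM": 1e-9}  # the two units named in the yml format


def exp_to_dg(value, unit, temperature=300.0):
    """Convert a Ki (or an IC50 used as a Ki proxy) into dG in kcal/mol.

    Hypothetical helper, not part of the pipeline.
    """
    k_molar = value * UNIT_TO_MOLAR[unit]  # concentration relative to 1 M
    return R_KCAL * temperature * math.log(k_molar)


# Illustrative entry mirroring the yml layout above (values are made up).
entry = {"measurement": {"type": "ki", "unit": "nM", "value": 100.0}, "name": "lig_a"}
measurement = entry["measurement"]
dg = exp_to_dg(measurement["value"], measurement["unit"])
print(f"{entry['name']}: {dg:.2f} kcal/mol")
```

Tighter binders (smaller Ki) give more negative ΔG values, which is the ordering used later when ligands are ranked by binding affinity.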

In [None]:
bench_folder = "/home/anna/Documents/benchmark"
protein = "mcl1"
main_dir = f"/backup/GROMACS_reruns/{protein}"

# choose location for the files
net_file = f"{main_dir}/execution_model/network_failed.dat"
ana_file = f"{main_dir}/execution_model/analysis_protocol.dat"
exp_file = f"{bench_folder}/inputs/experimental/{protein}.yml"
output_folder = f"{main_dir}/outputs_extracted"

The protocol from the execution model can also be read in to gain additional parameters.

In [None]:
prot_file = f"{main_dir}/execution_model/protocol.dat"
pipeline_prot = pipeline_protocol(prot_file, auto_validate=True)

These can then be initialised into the analysis_network object, which will be used to run the rest of the functions in this notebook.

In [None]:
all_analysis_object = analysis_network(
    output_folder,
    exp_file=exp_file,
    net_file=net_file,
    analysis_prot=ana_file,
    method=pipeline_prot.name(),  # if the protocol had a name
    # engines=pipeline_prot.engines(),
)

The following will then analyse the entire network:

In [None]:
all_analysis_object.compute_results()

A ligands folder can be added to visualise any perturbations and draw the network graph of the successful runs. This is generally the folder that was also used at the start for all the ligand inputs.

In [None]:
# same location as before, built from bench_folder instead of a hard-coded path
all_analysis_object.add_ligands_folder(f"{bench_folder}/inputs/{protein}/ligands")

The network can be drawn. The edge colour indicates the error of that leg. Edges for failed runs are not drawn by default.

To visualise the whole network, it can also be drawn separately as a network object. The layout of the drawn network can be changed by passing `networkx_layout_func`, for example `graph.draw_graph(networkx_layout_func=nx.circular_layout)`.


In [None]:
graph = network_graph(
    all_analysis_object.ligands,
    all_analysis_object.perturbations,
    ligands_folder=all_analysis_object.ligands_folder,
)
graph.draw_graph()

In [None]:
# all_analysis_object._initialise_graph_object()
all_analysis_object.draw_graph(engines="GROMACS")

To check and visualise any failed perturbations:

In [None]:
failed_perts = all_analysis_object.failed_perturbations("GROMACS")

for pert in sorted(failed_perts):
    print(pert)

all_analysis_object.draw_failed_perturbations("GROMACS")

If the failed perturbations have resulted in any disconnected ligands, these can also be listed.

In [None]:
all_analysis_object.disconnected_ligands(engine="GROMACS")
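
The idea behind a disconnected-ligand check can be sketched with a plain breadth-first search over the successful perturbations. This is only an illustration, not the pipeline's implementation: the helper names are hypothetical, the ligand names are made up, and "disconnected" is assumed here to mean "not in the largest connected component".

```python
from collections import deque


def connected_components(ligands, edges):
    """Group ligands into connected components of an undirected graph."""
    neighbours = {lig: set() for lig in ligands}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in ligands:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            # the set difference is a new set, so updating `seen` here is safe
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


def find_disconnected(ligands, successful_edges):
    """Return ligands outside the largest connected component (assumed definition)."""
    components = connected_components(ligands, successful_edges)
    largest = max(components, key=len)
    return sorted(set(ligands) - largest)


ligands = ["lig_a", "lig_b", "lig_c", "lig_d"]
successful = [("lig_a", "lig_b"), ("lig_b", "lig_c")]  # every edge to lig_d failed
disconnected = find_disconnected(ligands, successful)
print(disconnected)
```

A ligand that ends up disconnected has no path to the rest of the network, so its free energy cannot be placed on the same relative scale as the others.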

The cycles can also be examined more closely. The code below gives the average cycle closure for that engine, together with its error. To look at each cycle individually, `all_analysis_object.cycle_dict[engine][0]` holds a dictionary of the individual cycles.

In [None]:
all_analysis_object.compute_cycle_closures("GROMACS")
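
As a reminder of what a cycle closure measures: the relative free energies summed around any closed loop of ligands should be zero, so the deviation from zero indicates how consistent the edges are. A minimal sketch, assuming each perturbation is stored once in one direction and that the reverse direction simply flips the sign; `cycle_closure` and the numbers are illustrative, not the pipeline's code or data.

```python
def cycle_closure(cycle, ddg):
    """Sum the ddG values around a closed cycle of ligands (ideally 0)."""
    total = 0.0
    # pair each ligand with the next one, wrapping round to the start
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        if (a, b) in ddg:
            total += ddg[(a, b)]
        else:
            total -= ddg[(b, a)]  # traversing an edge backwards flips its sign
    return total


# Illustrative ddG values in kcal/mol for a three-ligand cycle.
ddg = {("A", "B"): 1.0, ("B", "C"): -0.4, ("A", "C"): 0.9}
closure = cycle_closure(["A", "B", "C"], ddg)
print(f"cycle closure: {closure:.2f} kcal/mol")
```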

If more extensive analysis has been performed, it is also possible to check the average convergence of the runs. This requires `analysed_pert.calculate_convergence()` to have been run during the individual analysis of each run. If this was not the case, setting `compute_missing` to `True` in `compute_convergence` below will run it. Please note that this can take a while.

In [None]:
all_analysis_object.compute_convergence(main_dir=main_dir)
all_analysis_object.plot_convergence()

There are different options for plotting. "pert" refers to perturbations, so the plotting of the edges, whereas "lig" (or "val" for values) refers to the ligands, so plotting for each node following the network-wide analysis.

The following plots are available:

bar (pert or lig)

scatter (pert or lig) - can also be plotted using cinnabar

eng vs eng (pert or lig)

outliers


In [None]:
all_analysis_object.remove_outliers(threshold=10, name="GROMACS")

In [None]:
all_analysis_object.plot_scatter_ddG()

In [None]:
# bar
all_analysis_object.plot_bar_dG()
all_analysis_object.plot_bar_ddG()

# scatter
all_analysis_object.plot_scatter_dG()
all_analysis_object.plot_scatter_ddG()
all_analysis_object.plot_scatter_dG(use_cinnabar=True)
all_analysis_object.plot_scatter_ddG(use_cinnabar=True)

for eng in all_analysis_object.engines:
    all_analysis_object.plot_scatter_dG(engine=eng)
    all_analysis_object.plot_scatter_ddG(engine=eng)

    # outliers
    all_analysis_object.plot_outliers(engine=eng)
    all_analysis_object.plot_outliers(engine=eng, pert_val="val")

The statistics of the MAD (comparing engines) and MAE (compared to experimental) can also be computed. The first table shown is the value, and the second table contains the bootstrapped error.

In [None]:
df1, df2 = all_analysis_object.calc_mad_engines(pert_val="pert")
# all_analysis_object.calc_mad_engines(pert_val="val")
print(df1, df2)

In [None]:
df1, df2 = all_analysis_object.calc_mae_engines(pert_val="pert")
# all_analysis_object.calc_mae_engines(pert_val="val")
print(df1, df2)
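
To illustrate what a bootstrapped error on such a statistic means, the sketch below computes an MAE between calculated and experimental values and estimates its uncertainty by resampling the data points with replacement. The function names, the number of resamples, the use of the standard deviation as the reported error and the example values are all assumptions for illustration; the pipeline's own bootstrapping may differ.

```python
import random
from statistics import mean, stdev


def mae(x, y):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - b) for a, b in zip(x, y))


def bootstrap_mae(x, y, n_boot=1000, seed=0):
    """Return (MAE, bootstrapped error) by resampling paired points."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(x)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        samples.append(mae([x[i] for i in idx], [y[i] for i in idx]))
    return mae(x, y), stdev(samples)


# Made-up values in kcal/mol, purely for illustration.
calc = [-9.1, -8.2, -10.4, -7.5]
exp = [-9.6, -8.0, -9.9, -7.9]
value, error = bootstrap_mae(calc, exp)
print(f"MAE = {value:.2f} +/- {error:.2f} kcal/mol")
```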

The ligands can be sorted by binding affinity, and Spearman's rank correlation coefficient (rho) can be calculated.

In [None]:
all_analysis_object.sort_ligands_by_binding_affinity(engine="GROMACS")
all_analysis_object.sort_ligands_by_experimental_binding_affinity()

In [None]:
values = all_analysis_object._stats_object.compute_rho(pert_val="val", y="GROMACS")
print(values)
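
For reference, Spearman's rho is the Pearson correlation of the ranks of the two data sets, so it measures how well the calculated ordering of the ligands matches the experimental one. A minimal standard-library sketch with made-up values; `ranks` and `spearman_rho` are hypothetical helpers, not the pipeline's `compute_rho`.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up binding free energies in kcal/mol.
calc = [-9.1, -8.2, -10.4, -7.5]
exp = [-9.6, -8.0, -9.9, -8.3]
rho = spearman_rho(calc, exp)
print(f"rho = {rho:.2f}")
```

In this example two ligands swap places between the calculated and experimental rankings, so rho drops below 1.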

Additional results can also be added to the all_analysis_object. These must be in a file with the same layout as those written during the analysis: a csv file with ["lig_0", "lig_1", "freenrg", "error", "engine", "analysis", "method"] as the headers. "engine", "analysis" and "method" can be left as None, because the name variable is used to identify the results.

In [None]:
other_name = ""
other_results_folder = ""

# NOTE: `eng` must already be defined, e.g. left over from the plotting loop above
other_results_files = glob.glob(
    f"{other_results_folder}/freenrg_*_{eng}_MBAR_alchemlyb_None_eqfalse_statsfalse_truncate0end.csv"
)

all_analysis_object.compute_other_results(other_results_files, name=other_name)
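
A file in the layout described above could be produced with the standard `csv` module, as sketched here. The row values are placeholders, and writing the unused columns as the literal text `None` is an assumption; check a results file written by the pipeline for the exact convention.

```python
import csv
import io

HEADERS = ["lig_0", "lig_1", "freenrg", "error", "engine", "analysis", "method"]

# Placeholder rows: one perturbation per row, free energy and error in kcal/mol.
rows = [
    {"lig_0": "lig_a", "lig_1": "lig_b", "freenrg": 1.2, "error": 0.3,
     "engine": "None", "analysis": "None", "method": "None"},
    {"lig_0": "lig_b", "lig_1": "lig_c", "freenrg": -0.5, "error": 0.2,
     "engine": "None", "analysis": "None", "method": "None"},
]

# Written to an in-memory buffer here; use open(path, "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=HEADERS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```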

If required, any perturbations different from those in the considered network can be removed as well.

In [None]:
# remove any non main network perturbations
for eng in all_analysis_object.other_results_names:
    for pert in all_analysis_object._perturbations_dict[eng]:
        if pert not in all_analysis_object.perturbations:
            all_analysis_object.remove_perturbations(pert, name=eng)

Outliers can also be removed. First identify them, either by plotting them as in the earlier cell or by listing all outliers above a given threshold in kcal/mol:

In [None]:
all_analysis_object.get_outliers(threshold=5, name="GROMACS")
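
Conceptually, a threshold-based outlier search can be as simple as the sketch below. It assumes that an outlier is a perturbation whose calculated value differs from the experimental value by more than the threshold; that definition, the helper name `find_outliers`, the `a~b` naming of perturbations and the numbers are assumptions for illustration and may not match what `get_outliers` does internally.

```python
def find_outliers(calc, exp, threshold):
    """Return perturbations with |calc - exp| above the threshold (kcal/mol)."""
    return sorted(
        pert
        for pert, value in calc.items()
        if pert in exp and abs(value - exp[pert]) > threshold
    )


# Made-up ddG values in kcal/mol, keyed by perturbation name.
calc = {"lig_a~lig_b": 1.2, "lig_b~lig_c": -7.5, "lig_a~lig_c": 0.4}
exp = {"lig_a~lig_b": 0.9, "lig_b~lig_c": -0.8, "lig_a~lig_c": 0.1}
outliers = find_outliers(calc, exp, threshold=5)
print(outliers)
```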

These can then be removed, which also automatically recalculates the network values, and the above analysis cells can be rerun for the new visualisation / stats.

In [None]:
all_analysis_object.remove_outliers(threshold=10)

The histograms of the errors can also be plotted to compare different methods:

In [None]:
all_analysis_object.plot_histogram_legs()
all_analysis_object.plot_histogram_repeats()
all_analysis_object.plot_histogram_sem(pert_val="pert")
all_analysis_object.plot_histogram_sem(pert_val="val")