<a class="anchor" id="toc"></a>
# RUN LINEAR REGRESSION

This notebook runs linear regression on different combinations of emergent metrics, hemodynamic properties, and graph measures.

---
- [WORKSPACE VARIABLES](#workspace-variables)
- [UTILITY FUNCTIONS](#utility-functions)
- [PREPARE DATA](#prepare-data)
- [RUN REGRESSION](#run-regression)
---

We perform linear regression on four different cases that vary in the features and responses:

> **Case 1.** graph measures $\rightarrow$ hemodynamic properties <br />
> **Case 2.** hemodynamic properties $\rightarrow$ emergent metrics <br /> 
> **Case 3.** graph measures $\rightarrow$ emergent metrics <br />
> **Case 4.** graph measures + hemodynamic properties $\rightarrow$ emergent metrics

The final output of the regression is a single file `LINEAR_REGRESSION.csv` that is used as input to D3 for plotting the regression bar plots ([go to figure](http://0.0.0.0:8000/figures/linear_regression.html)).
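
As a rough illustration of how the four cases above map onto regression formulas, the sketch below builds one `response ~ feature + feature + ...` string per response. The shortened lists and the `build_formulas` helper are hypothetical stand-ins used only for this example; the full lists are defined under WORKSPACE VARIABLES.

```python
# Hypothetical, shortened stand-ins for the MEASURES, PROPERTIES, and METRICS lists.
measures = ["SHORTPATH", "CLOSENESS"]
properties = ["PRESSURE", "FLOW"]
metrics = ["GROWTH", "SYMMETRY"]

# case number -> (features, responses)
cases = {
    1: (measures, properties),
    2: (properties, metrics),
    3: (measures, metrics),
    4: (measures + properties, metrics),
}

def build_formulas(features, responses):
    """Build one 'response ~ f1 + f2 + ...' formula string per response."""
    rhs = " + ".join(features)
    return [f"{response} ~ {rhs}" for response in responses]

formulas = {case: build_formulas(*spec) for case, spec in cases.items()}
for case, items in formulas.items():
    for formula in items:
        print(case, formula)
```

Each case produces one model per response, so the number of fitted models per dataset is the sum of the response counts across the four cases.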

<a class="anchor" id="workspace-variables"></a>

### WORKSPACE VARIABLES 
<span style="float:right;">[back to top](#toc)</span>

Set up workspace variables for linear regression.

- **`ANALYSIS_PATH`** is the path for analysis files (`.json` and `.csv` files, `.tar.xz` compressed archives)

In [None]:
ANALYSIS_PATH = "/path/to/analysis/files/"

- **`NAMES`** is the list of simulation sets to use
- **`CONTEXTS`** is the list of contexts (colony and tissue)

In [None]:
NAMES = ["EXACT_HEMODYNAMICS", "VASCULAR_FUNCTION"]
CONTEXTS = ["C", "CHX"]

- **`METRICS`** is the list of emergent metrics and center concentrations
- **`PROPERTIES`** is the list of hemodynamic properties
- **`MEASURES`** is the list of graph measures

In [None]:
METRICS = ["GROWTH", "CYCLES", "SYMMETRY", "ACTIVITY", "GLUCOSE", "OXYGEN"]
PROPERTIES = ["PRESSURE", "RADIUS", "WALL", "SHEAR", "CIRCUM", "FLOW"]
MEASURES = ["SHORTPATH", "GDIAMETER", "GRADIUS", "ECCENTRICITY", "CLOSENESS", "BETWEENNESS"]

<a class="anchor" id="utility-functions"></a>

### UTILITY FUNCTIONS
<span style="float:right;">[back to top](#toc)</span>

General utility functions for data preparation.

In [None]:
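def filter_json_with_time(jsn, metric, context):
    """Filter json entries by time (15.0) and context, excluding PATTERN graphs.

    The `metric` argument is not used; it is kept so existing calls still work.
    """
    data = [d for d in jsn if d["time"] == 15.0 and d["context"] == context and d["graphs"] != "PATTERN"]
    assert len(data) == 5, f"expected 5 entries, found {len(data)}"
    return data

def filter_json_without_time(jsn, context):
    """Filter json entries by context, excluding PATTERN graphs."""
    data = [d for d in jsn if d["context"] == context and d["graphs"] != "PATTERN"]
    assert len(data) == 5, f"expected 5 entries, found {len(data)}"
    return data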

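def filter_csv(content, header, measure, context):
    """Collect values of a measure for a context, ordered by graph layout."""
    # look up column indices once instead of on every row
    measure_index = header.index(measure.lower())
    context_index = header.index("context")
    graph_index = header.index("graph")
    data = []

    for layout in ["Lav", "Lava", "Lvav", "Sav", "Savav"]:
        data.extend(float(row[measure_index]) for row in content
            if row[context_index] == context
            and row[graph_index] == layout)

    assert len(data) == 50, f"expected 50 values, found {len(data)}"
    return data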
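
To make the filtering logic concrete, here is a minimal sketch on toy records. The record values and the combined `select` helper are made up for illustration; only the field names (`context`, `graphs`, `time`) and the five layout names come from the functions above.

```python
# Toy records mimicking the fields used by the filters; all values are made up.
LAYOUTS = ["Lav", "Lava", "Lvav", "Sav", "Savav"]

records = []
for context in ["C", "CHX"]:
    for graphs in LAYOUTS + ["PATTERN"]:
        for time in [1.0, 15.0]:
            records.append({"context": context, "graphs": graphs, "time": time, "_": [0.0]})

def select(entries, context, time=None):
    """Keep non-PATTERN entries for a context, optionally at a single time."""
    return [
        d for d in entries
        if d["context"] == context
        and d["graphs"] != "PATTERN"
        and (time is None or d["time"] == time)
    ]

selected = select(records, "C", time=15.0)
print([d["graphs"] for d in selected])
```

With one entry per layout remaining after the `PATTERN` graphs are dropped, the selection has five entries, which is the count the assertions in the utility functions check for.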

<a class="anchor" id="prepare-data"></a>

### PREPARE DATA
<span style="float:right;">[back to top](#toc)</span>

Extract metrics, properties, and measures into a single dataframe.

First, we need to extract some additional properties from the `EXACT_HEMODYNAMICS` and `VASCULAR_FUNCTION` simulation sets.
The function `merge_graph` extracts individual properties from the `.GRAPH` files produced in the basic analysis step.
Values across conditions and times are merged into individual files for each property (`EXACT_HEMODYNAMICS.GRAPH.*.json`, `VASCULAR_FUNCTION.GRAPH.*.json`).

Note that these files are provided, so the next three cells can be skipped.

In [None]:
from scripts.generate import merge_graph, save_graph

In [None]:
from scripts.EXACT_HEMODYNAMICS import EXACT_HEMODYNAMICS
for prop in PROPERTIES:
    EXACT_HEMODYNAMICS.loop(ANALYSIS_PATH, merge_graph, save_graph, f".GRAPH.{prop}", timepoints=["150"])

In [None]:
from scripts.VASCULAR_FUNCTION import VASCULAR_FUNCTION
for prop in PROPERTIES:
    VASCULAR_FUNCTION.loop(ANALYSIS_PATH, merge_graph, save_graph, f".GRAPH.{prop}", timepoints=["150"])

Then, we can iterate through the simulation sets and contexts to combine all the data into a single dataframe.

In [None]:
import pandas as pd
from scripts.utilities import load_json, load_csv

In [None]:
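def load_data(path, name, context, responses, properties, measures):
    """Load responses, properties, and measures for one simulation set and context into a dataframe."""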
    df = pd.DataFrame()
    
    # add response data by loading from .SEEDS and .CENTERS files
    for response in responses:
        if response in ["GLUCOSE", "OXYGEN"]:
            D = load_json(f"{path}{name}/{name}.CENTERS.json")
            d = filter_json_without_time(D['data'], context)
            df[response] = [e for entry in d for e in entry[response.lower()]]
        else:
            D = load_json(f"{path}{name}/{name}.SEEDS.{response}.json")
            d = filter_json_with_time(D['data'], response, context)
            df[response] = [e for entry in d for e in entry["_"]]
    
    # add property data by loading from .GRAPH files
    for prop in properties:
        D = load_json(f"{path}{name}/{name}.GRAPH.{prop}.json")
        d = filter_json_without_time(D, context)
        df[prop] = [e for entry in d for e in entry["_"]["mean"]]

    # add measure data by loading from the GRAPH_MEASURES.csv file
    D = load_csv(f"{path}_/GRAPH_MEASURES.csv")
    header = D[0]
    content = D[1:]
    # EXACT_HEMODYNAMICS always uses the "C/CH" code; otherwise CHX maps to CH
    context_code = "C/CH" if name == "EXACT_HEMODYNAMICS" else context.replace("CHX", "CH")
    for measure in measures:
        df[measure] = filter_csv(content, header, measure, context_code)
        
    return df

In [None]:
all_df = {}
for context in CONTEXTS:
    for name in NAMES:
        all_df[f"{context}_{name}"] = load_data(ANALYSIS_PATH, name, context, METRICS, PROPERTIES, MEASURES)
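
The nested comprehensions in `load_data` flatten the per-layout value lists into one column. The sketch below shows the pattern on made-up entries. It assumes 10 values per layout, which is inferred from the 50-value assertion in `filter_csv` and is not stated in the data files themselves.

```python
# Hypothetical filtered entries: 5 layouts with 10 made-up values each.
entries = [
    {"graphs": layout, "_": [i * 10 + s for s in range(10)]}
    for i, layout in enumerate(["Lav", "Lava", "Lvav", "Sav", "Savav"])
]

# Same comprehension pattern as in load_data: iterate entries, then their values.
column = [e for entry in entries for e in entry["_"]]

print(len(column))                   # 50
print(column[:3], column[10:13])     # [0, 1, 2] [10, 11, 12]
```

Because every column is flattened in the same layout order, row `k` of the dataframe refers to the same layout and replicate across metrics, properties, and measures.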

<a class="anchor" id="run-regression"></a>

### RUN REGRESSION
<span style="float:right;">[back to top](#toc)</span>

Run linear regression on each of the four combinations of metrics, properties, and measures.

In [None]:
from statsmodels.formula.api import ols
from scripts.utilities import save_csv

In [None]:
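def run_regression(df):
    """Fit OLS models for all four cases and return [case, response, r2, r2adj] rows."""
    out = []
    
    # z-score each column (zero mean, unit sample standard deviation)
    ndf = (df - df.mean())/df.std()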
    
    # run regression case 1 (measures -> properties)
    for prop in PROPERTIES:
        reg = ols(prop + ' ~ ' + " + ".join(MEASURES), data=ndf).fit()
        out.append([1, prop, reg.rsquared, reg.rsquared_adj])
        
    # run regression case 2 (properties -> metrics)
    for metric in METRICS:
        reg = ols(metric + ' ~ ' + " + ".join(PROPERTIES), data=ndf).fit()
        out.append([2, metric, reg.rsquared, reg.rsquared_adj])
    
    # run regression case 3 (measures -> metrics)
    for metric in METRICS:
        reg = ols(metric + ' ~ ' + " + ".join(MEASURES), data=ndf).fit()
        out.append([3, metric, reg.rsquared, reg.rsquared_adj])
    
    # run regression case 4 (measures + properties -> metrics)
    for metric in METRICS:
        reg = ols(metric + ' ~ ' + " + ".join(MEASURES + PROPERTIES), data=ndf).fit()
        out.append([4, metric, reg.rsquared, reg.rsquared_adj])
    
    return out
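
Two steps in `run_regression` can be checked by hand: the z-scoring and the adjusted $R^2$. The sketch below uses only the standard library. The helper names and the sample values are hypothetical, and the adjusted $R^2$ formula is the usual one for a model with an intercept, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$.

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize values using the sample standard deviation (ddof=1, as in pandas' default)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Made-up values, only to show the standardization.
z = zscore([2.0, 4.0, 6.0, 8.0])
print([round(v, 3) for v in z])

# Same hypothetical R^2 with 6 features (cases 1-3) and 12 features (case 4), n = 50.
print(round(adjusted_r2(0.5, 50, 6), 3), round(adjusted_r2(0.5, 50, 12), 3))
```

For a fixed $R^2$, the adjustment penalizes the 12-feature models in Case 4 more than the 6-feature models in the other cases, which is why both values are kept in the output.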

In [None]:
out = []
for context in CONTEXTS:
    for name in NAMES:
        reg = run_regression(all_df[f"{context}_{name}"])
        out = out + [[name, context] + entry for entry in reg]

header = ",".join(["name", "context", "case", "response", "r2", "r2adj"]) + "\n"
save_csv(f"{ANALYSIS_PATH}_/LINEAR_REGRESSION", header, zip(*out), "")
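
The sketch below illustrates how the output rows are assembled and what `zip(*out)` produces. The regression values are made up, and how `save_csv` consumes the transposed data is not shown in this notebook, so this covers only the row building and the transposition.

```python
# Hypothetical regression results in the form [case, response, r2, r2adj].
reg = [[1, "PRESSURE", 0.80, 0.77], [2, "GROWTH", 0.60, 0.54]]

# Prefix each entry with the simulation set name and context.
rows = [["EXACT_HEMODYNAMICS", "C"] + entry for entry in reg]

# zip(*rows) transposes the list of rows into one tuple per column.
columns = list(zip(*rows))

print(rows[0])
print(columns[3])
```

Each row lines up with the header fields `name`, `context`, `case`, `response`, `r2`, and `r2adj`, so the transposed result has six columns.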