## Introduction

We now look at the Drosha measurements and how to featurize them onto the graphs.

In [None]:
from pyprojroot import here
import pandas as pd

df_bioc = pd.read_csv(here() / "data/df_bioc.csv", index_col=0)
df_bioc.columns

There are a lot of columns in there; however, the ones we are most interested in are:

- `frac_avg`: Gives us the activity
- `dot_bracket`: Gives us the dot-bracket notation

Things that we may be interested in include:

- The `shannon_{pos}` series of columns, which give us the Shannon entropy at each position in the folded RNA.
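
For readers unfamiliar with the quantity, Shannon entropy measures how spread out a probability distribution is. The sketch below is only an illustration: the function name `shannon_entropy` and the toy distributions are hypothetical, and both the use of pairing probabilities and the choice of log base 2 are assumptions, not necessarily how the `shannon_{pos}` columns were computed.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical distributions over pairing partners for a single position.
# These numbers are made up for illustration; they are not from df_bioc.
confident = [0.97, 0.01, 0.01, 0.01]  # position almost always pairs the same way
ambiguous = [0.25, 0.25, 0.25, 0.25]  # position has no preferred partner

print(shannon_entropy(confident))  # small: the structure here is well defined
print(shannon_entropy(ambiguous))  # 2.0: maximal for four equally likely options
```

A low value means the position's structural state is well determined, while a high value means it is uncertain.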

Our goal here is to predict `frac_avg` (or some mathematical transform of it) from the `dot_bracket` structure.
Our hypothesis is that the `dot_bracket` structure, represented as a graph,
gives us sufficient information to predict `frac_avg` accurately.
Alternatively, we might want to add in the Shannon entropy,
as we found previously that it was visually\* correlated with RNA cleavage (`frac_avg`).


> \* by visually correlated, we refer to Fig. 2 of [our previously-published paper](https://www.sciencedirect.com/science/article/abs/pii/S1097276520307358).
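
To make "dot-bracket structure represented as a graph" concrete, here is a minimal, dependency-free sketch of how such a string can be turned into edges. It is not the implementation of `drosha_gnn.graph.to_networkx`; the function name `dot_bracket_to_edges` is hypothetical, and it assumes plain dot-bracket notation using only `(`, `)`, and `.`, with one node per nucleotide.

```python
def dot_bracket_to_edges(dot_bracket):
    """Return (backbone_edges, pair_edges) for a dot-bracket string.

    Nodes are nucleotide positions (0-indexed). Backbone edges join
    neighbouring positions; pair edges join matched brackets.
    """
    backbone = [(i, i + 1) for i in range(len(dot_bracket) - 1)]
    pairs = []
    stack = []  # positions of '(' that are still waiting for a partner
    for i, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError(f"Unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif char != ".":
            raise ValueError(f"Unexpected character {char!r} at position {i}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return backbone, pairs


backbone, pairs = dot_bracket_to_edges("((..))")
print(backbone)  # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(pairs)     # [(1, 4), (0, 5)]
```

The two edge lists could then be added to a graph object with an edge attribute that distinguishes backbone bonds from base pairs.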

In [None]:
from drosha_gnn.graph import to_networkx
import janitor  # registers extra DataFrame methods such as `join_apply`

## Make graphs from dot-bracket

In [None]:
df = df_bioc.join_apply(lambda row: to_networkx(row["dot_bracket"]), "graph")
df.head()

In [None]:
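import jax.numpy as np  # note: `np` here is jax.numpy, not NumPy


def ecdf(data):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(data))
    # One cumulative probability per data point, so x and y have equal length.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y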

In [None]:
x, y = ecdf(df_bioc["frac_avg"])
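
As a sanity check on what an ECDF should return, here is a dependency-free sketch on toy values. The helper name `ecdf_toy` and the data are hypothetical, and it assumes one common convention in which the i-th smallest value is assigned cumulative probability i/n.

```python
def ecdf_toy(values):
    """Sorted values and cumulative probabilities i/n, one per data point."""
    x = sorted(values)
    n = len(x)
    y = [(i + 1) / n for i in range(n)]
    return x, y


x_toy, y_toy = ecdf_toy([0.4, 0.1, 0.3, 0.2])
print(x_toy)  # [0.1, 0.2, 0.3, 0.4]
print(y_toy)  # [0.25, 0.5, 0.75, 1.0]
```

The two outputs have the same length, which is what a plotting call needs in order to draw one point per measurement.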