# Setting up an RNA Science Environment

The computational biology field offers many software packages for working with RNA sequences and experimental data. First, let's install `arnie`, a utility library that simplifies interacting with various secondary structure prediction packages.

### Best paper for getting oriented in the current landscape of RNA modeling

https://www.nature.com/articles/s41467-021-21194-4

In [None]:
%pip install arnie

Arnie needs at least one secondary structure predictor, so let's install `EternaFold`. [EternaFold](https://www.nature.com/articles/s41592-022-01605-0) is a leading prediction package that was trained on sequences collected via the citizen science game [Eterna](http://eternagame.org). In fact, Eterna players provided many of the sequences in the data for this competition.

In [None]:
# Install EternaFold. It is not available via conda-forge quite yet, so we install directly from the pipeline build artifacts.
# !conda install eternafold
!wget https://artprodeus21.artifacts.visualstudio.com/A910fa339-c7c2-46e8-a579-7ea247548706/84710dde-1620-425b-80d0-4cf5baca359d/_apis/artifact/cGlwZWxpbmVhcnRpZmFjdDovL2NvbmRhLWZvcmdlL3Byb2plY3RJZC84NDcxMGRkZS0xNjIwLTQyNWItODBkMC00Y2Y1YmFjYTM1OWQvYnVpbGRJZC83NzI5MTAvYXJ0aWZhY3ROYW1lL2NvbmRhX3BrZ3NfbGludXg1/content?format=zip
!unzip content\?format\=zip
# -y answers the confirmation prompt, which a notebook cell cannot respond to interactively
!conda install -y conda_pkgs_linux/eternafold-1.3.1-h00ab1b0_0.conda

Ordinarily, the `EternaFold` conda package will automatically set necessary environment variables, but Kaggle's conda install works a little differently. Let's set them manually here using `%env`.

In [None]:
%env ETERNAFOLD_PATH=/opt/conda/bin/eternafold-bin
%env ETERNAFOLD_PARAMETERS=/opt/conda/lib/eternafold-lib/parameters/EternaFoldParams.v1
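
Before calling a predictor, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch: the helper name `missing_env_vars` is hypothetical (it is not part of arnie), and it only checks that the variables are set, not that the paths they point to exist.

```python
import os

# Variable names taken from the %env cell above.
REQUIRED_VARS = ("ETERNAFOLD_PATH", "ETERNAFOLD_PARAMETERS")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("EternaFold environment variables are set.")
```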

Now that we have a predictor, we can predict the structure of a given sequence. As an example, let's look at a Hammerhead ribozyme sequence. We can use arnie's `mfe` (Minimum Free Energy) function to predict a secondary structure for this RNA sequence. The structure is represented in "dot-bracket" notation, where `.` is an unpaired base and a matching `(` and `)` mark two bases that are paired with each other.

In [None]:
!git clone https://github.com/DasLab/draw_rna.git /kaggle/working/draw_rna

In [None]:
from arnie.mfe import mfe

sequence = "CGCUGUCUGUACUUGUAUCAGUACACUGACGAGUCCCUAAAGGACGAAACAGCG"
mfe(sequence, package="eternafold")
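
To make the dot-bracket notation concrete, here is a small sketch that decodes a structure string into index pairs using a stack. The function name `dot_bracket_to_pairs` is hypothetical (it is not an arnie function), indices are 0-based, and the sketch assumes only `.`, `(`, and `)` appear, so pseudoknot symbols such as `[` and `]` are not handled.

```python
def dot_bracket_to_pairs(structure):
    """Return a sorted list of 0-based (i, j) pairs for matched brackets.

    Hypothetical helper for illustration; supports only '.', '(' and ')'.
    """
    stack = []  # positions of '(' still waiting for a partner
    pairs = []
    for idx, char in enumerate(structure):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError(f"Unmatched ')' at position {idx}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((stack.pop(), idx))
        elif char != ".":
            raise ValueError(f"Unsupported character {char!r} at position {idx}")
    if stack:
        raise ValueError(f"Unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Toy structure, not a prediction: bases 0-5 and 1-4 are paired, the rest unpaired.
print(dot_bracket_to_pairs("((..))."))  # [(0, 5), (1, 4)]
```

The same function could be applied to the string returned by `mfe` to list which bases the predictor expects to pair.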

In [None]:
from arnie.bpps import bpps

bpps_dict = {}
my_sequence = 'CGCUGUCUGUACUUGUAUCAGUACACUGACGAGUCCCUAAAGGACGAAACAGCG'

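# Each package listed here must be installed and configured for arnie separately;
# this notebook has only set up EternaFold so far.
for pkg in ['vienna','nupack','RNAstructure','contrafold','RNAsoft']:
    bpps_dict[pkg] = bpps(my_sequence, package=pkg)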

Arnie provides other functions for structure prediction. We can generate a 'Base Pair Probability' matrix that gives the predicted probability of every possible base pairing (e.g., how likely base 1 is to pair with base 2, base 3, base 4, and so on).

In [None]:
from arnie.bpps import bpps
bpps(sequence, package="eternafold")
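
One common use of a base pair probability matrix is to estimate how likely each base is to stay unpaired. The sketch below uses a hand-written toy matrix rather than real predictor output, and it assumes the matrix is square and symmetric, with entry `(i, j)` holding the probability that base `i` pairs with base `j`. The helper name `unpaired_probabilities` is hypothetical.

```python
def unpaired_probabilities(bpp_matrix):
    """For each base, return 1 minus the sum of its pairing probabilities.

    Hypothetical helper; assumes row i holds P(base i pairs with base j)
    for every j.
    """
    return [1.0 - sum(row) for row in bpp_matrix]


# Toy 4-base matrix (made-up values, not predictor output):
# base 0 pairs with base 3 with p=0.9, base 1 pairs with base 2 with p=0.5.
toy_bpp = [
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
]

for base, p_unpaired in enumerate(unpaired_probabilities(toy_bpp)):
    print(f"base {base}: P(unpaired) = {p_unpaired:.2f}")
```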

In [None]:
import pandas as pd

In [None]:
train = pd.read_csv('../input/stanford-ribonanza-rna-folding/train_data.csv', nrows=10)

In [None]:
train.shape

In [None]:
train.columns