Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings

Visual Description

Update 05/21

Please refer to the lab's official implementation, as this repo will no longer be maintained.

How to run

First, install dependencies

# clone project   
git clone https://github.com/ec1340/GSAE

# install project   
cd GSAE
pip install -e .   
pip install -r requirements.txt

Workflow

Loading data from RNAfold

RNAfold produces a text file such as

rnafold_output.txt

whose contents look like this:

GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20  10.00
(.((...(((((....))))).......)).)  -5.50
(.(((..(((((....))))).....).)).)  -4.20
((.....(((((....))))).........))  -5.90
((.((..(((((....))))).....))..))  -5.60
((.(...(((((....))))).......).))  -6.40
((.(.(.(((((....))))).....).).))  -4.20
((.((..(((((....))))).....).).))  -5.10
(((....(((((....))))).(...)..)))  -4.30
(((....(((((....)))))(....)..)))  -5.30
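As a rough illustration of the file layout shown above, the following sketch splits such a file into the sequence, the dot-bracket structures, and their energies. The function name parse_rnafold_output is hypothetical and is not part of this repo; the sketch only assumes the whitespace-separated layout of the sample, and it ignores the numeric fields on the first line.

```python
def parse_rnafold_output(text):
    """Split RNAfold-style output into (sequence, structures, energies).

    Hypothetical helper for illustration; not the repo's own parser.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # First line: the sequence, followed by numeric fields that are ignored here.
    sequence = lines[0].split()[0]
    structures, energies = [], []
    for ln in lines[1:]:
        fields = ln.split()
        structures.append(fields[0])        # dot-bracket string
        energies.append(float(fields[1]))   # energy listed next to the structure
    return sequence, structures, energies


sample = """\
GGCGUUUUCGCCUUUUGGCGAUUUUUAUCGCC -14.20  10.00
(.((...(((((....))))).......)).)  -5.50
(.(((..(((((....))))).....).)).)  -4.20
"""
seq, structs, energies = parse_rnafold_output(sample)
```

Every structure string has the same length as the sequence, which is why a single graph size can later be used for all folds.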

We can use rnafold2arrays.py in gsae/data_processing to convert this text file to:

a CSV file containing the adjacency matrix of each fold (adjmats_<datestamp>.csv)
a CSV file containing the energy of each structure (energies_<datestamp>.csv)
a text file with the RNA sequence (sequence_<datestamp>.txt)
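To make the adjacency-matrix step concrete, here is a minimal sketch of how one dot-bracket string could be turned into a graph. It assumes that nodes are nucleotides, that consecutive nucleotides are joined by backbone edges, and that each matched bracket pair adds a base-pair edge. Whether rnafold2arrays.py builds its matrices exactly this way is an assumption, and the function name is hypothetical.

```python
def dot_bracket_to_adjacency(structure):
    """Build a symmetric 0/1 adjacency matrix from a dot-bracket string.

    Assumed convention (not confirmed by the repo): backbone edges between
    neighbouring positions plus one edge per matched bracket pair.
    """
    n = len(structure)
    adj = [[0] * n for _ in range(n)]
    # Backbone edges between consecutive nucleotides.
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1
    # Base-pair edges: match each ")" with the most recent unmatched "(".
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            j = stack.pop()
            adj[i][j] = adj[j][i] = 1
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return adj


adj = dot_bracket_to_adjacency("((..))")
```

In this toy hairpin, positions 0 and 5 and positions 1 and 4 are paired, so the matrix contains five backbone edges and two base-pair edges.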

rnafold2arrays.py usage:

usage: rnafold2arrays.py [-h] --data DATA --outname OUTNAME

optional arguments:
-h, --help         show this help message and exit
--data DATA        RNAfold txt file output to be converted
--outname OUTNAME  base name for the outputs

sample usage:

> python rnafold2arrays.py --data seq4_rnafold_out.txt --outname seq4
> ls
seq4_adjmat_2020-03-04-03.csv
seq4_energies_2020-03-04-03.csv
seq4_sequence_2020-03-04-03.txt
rnafold2arrays.py
seq4_rnafold_out.txt

Converting adjacency data to scattering coefficients

Once we have the adjacency matrices of the structures we're interested in, we can use scattering transforms to convert them to a new, more informative representation.

Here we use Dirac signals centered at each node (i.e. the identity matrix) as our graph signals.
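The sketch below shows one common way to compute graph scattering coefficients from Dirac signals, so the idea is easier to picture. It assumes diffusion wavelets built from a lazy random walk at dyadic scales, and it sums absolute values over nodes to get one coefficient per signal. The operator, the number of scales, and the aggregation are all assumptions, and adj2scatcoeffs.py may differ on each of them. Plain Python lists keep the sketch dependency-free; real code would normally use NumPy.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def subtract(a, b):
    """Entrywise difference of two matrices of equal shape."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def column_abs_sums(m):
    """Sum |entry| over nodes (rows), giving one value per signal (column)."""
    return [sum(abs(x) for x in col) for col in zip(*m)]


def scattering_coefficients(adj, n_scales=3):
    """Zeroth-, first-, and second-order coefficients for Dirac signals."""
    n = len(adj)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    degree = [sum(col) for col in zip(*adj)]
    # Lazy random walk P = (I + A D^-1) / 2 (assumed diffusion operator).
    p = [[0.5 * (identity[i][j] + (adj[i][j] / degree[j] if degree[j] else 0.0))
          for j in range(n)] for i in range(n)]
    # Dyadic powers P, P^2, P^4, ... by repeated squaring.
    powers = [p]
    for _ in range(n_scales - 1):
        powers.append(matmul(powers[-1], powers[-1]))
    # Wavelets: I - P, then differences of consecutive dyadic powers.
    wavelets = [subtract(identity, p)]
    wavelets += [subtract(powers[k], powers[k + 1]) for k in range(n_scales - 1)]

    signals = identity  # one Dirac per node, stored as columns
    coeffs = column_abs_sums(signals)  # zeroth order
    first = [[[abs(x) for x in row] for row in matmul(w, signals)]
             for w in wavelets]
    for u in first:
        coeffs += column_abs_sums(u)  # first order
    for j, u in enumerate(first):
        for w in wavelets[j + 1:]:
            coeffs += column_abs_sums(matmul(w, u))  # second order, coarser scale
    return coeffs


# Path graph on four nodes as a toy input.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
coeffs = scattering_coefficients(path)
```

With n nodes and J scales, this layout yields n * (1 + J + J(J-1)/2) values per graph, which grows quickly with graph size. That growth is one reason a dimensionality reduction step is useful afterwards.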

To convert the adjacency matrices, we use adj2scatcoeffs.py.

adj2scatcoeffs.py usage:

usage: adj2scatcoeffs.py [-h] --data DATA --outname OUTNAME --graph_size
                        GRAPH_SIZE [--pcs PCS]

optional arguments:
-h, --help            show this help message and exit
--data DATA           csv with adjacency matrices
--outname OUTNAME     base name for output
--graph_size GRAPH_SIZE
                        number of nodes in graphs (assume equal size)
--pcs PCS             how many principal components to use (if 0, then use raw scattering coefficients)

sample usage:

> python adj2scatcoeffs.py --data seq4_adjmat_2020-03-04-03.csv --outname seq4 --graph_size 32 --pcs 100

> ls
seq4_scat_coeff_2020-03-04-03_pca_n100.csv
seq4_adjmat_2020-03-04-03.csv
adj2scatcoeffs.py

Data

Data for the 4 sequences used in the paper are located in data/

└── raw_data
    ├── hiv_tar
    │   ├── hiv_tar_sequence.txt
    │   └── hivtar_100k_subp_n_052020.txt
    ├── hob_seq3
    │   ├── seq3_100k_subp_n_052020.txt
    │   └── seq3_sequence.txt
    ├── hob_seq4
    │   ├── seq4_100k_subp_n_052020.txt
    │   └── seq4_sequence.txt
    └── tebown
        ├── teb_100k_subp_n_052020.txt
        └── tebown_sequence.txt

IMPORTANT: Data loading for models

For the training scripts in the model files to work correctly, the global variables at the top of load_splits.py must point to wherever you saved the outputs of adj2scatcoeffs.py.

Training the models

python gsae/models/gsae_model.py

arguments:

usage: gsae_model.py [--input_dim INPUT_DIM] [--dataset DATASET] [--bottle_dim BOTTLE_DIM] [--hidden_dim HIDDEN_DIM] [--learning_rate LEARNING_RATE] [--alpha ALPHA] [--beta BETA] [--n_epochs N_EPOCHS] [--batch_size BATCH_SIZE] [--n_gpus N_GPUS] [--save_dir SAVE_DIR]
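For readers who want to see how these flags fit together, the sketch below rebuilds an argument parser that accepts the options listed in the usage line. Only the flag names come from the usage text. The value types are assumptions, no defaults are set because the source does not state any, and the example values (including the dataset name seq4) are hypothetical.

```python
import argparse


def build_parser():
    """Parser mirroring the flags in the usage line (types are assumed)."""
    parser = argparse.ArgumentParser(prog="gsae_model.py")
    # Integer-valued options (assumed).
    for name in ("input_dim", "bottle_dim", "hidden_dim",
                 "n_epochs", "batch_size", "n_gpus"):
        parser.add_argument(f"--{name}", type=int)
    # Float-valued options (assumed).
    for name in ("learning_rate", "alpha", "beta"):
        parser.add_argument(f"--{name}", type=float)
    # String-valued options (assumed).
    for name in ("dataset", "save_dir"):
        parser.add_argument(f"--{name}", type=str)
    return parser


# Hypothetical invocation; the valid dataset names are not given in the source.
args = build_parser().parse_args(
    ["--dataset", "seq4", "--bottle_dim", "25", "--n_epochs", "100"]
)
```

Flags that are left out come back as None in this sketch, whereas the real script presumably supplies its own defaults.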

Citation

@article{castro2020uncovering,
  title={Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings},
  author={Castro, Egbert and Benz, Andrew and Tong, Alexander and Wolf, Guy and Krishnaswamy, Smita},
  journal={arXiv preprint arXiv:2006.06885},
  year={2020}
}
