**************************************************************************************************************
# DETERMINING LNCRNA STRUCTURE FROM ITS SEQUENCE
**************************************************************************************************************
This project takes lncRNA transcripts from a GTF file, locates their positions in the human genome (GRCh37), and
extracts their sequences. The sequences are then written out in FASTA format to directories according to
their length. All of this is performed by the Genome_Reading.py script.
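
The extraction step described above can be sketched as follows. This is a minimal illustration, not the code in Genome_Reading.py: the helper names are hypothetical, and it assumes standard GTF conventions (tab-separated fields, 1-based inclusive coordinates, exon features carrying a `transcript_id` attribute) and that minus-strand transcripts should be reverse-complemented.

```python
# Hypothetical helpers for illustration; not taken from Genome_Reading.py.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_gtf_exon(line):
    """Return (chrom, start, end, strand, transcript_id) for an exon line.

    Comment lines (starting with '#') and non-exon features return None.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "exon":
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    transcript_id = None
    for attr in fields[8].split(";"):
        attr = attr.strip()
        if attr.startswith("transcript_id"):
            transcript_id = attr.split(" ", 1)[1].strip('"')
    return chrom, start, end, strand, transcript_id


def transcript_sequence(chrom_seq, exons, strand):
    """Join exon sequences for one transcript.

    exons is a list of (start, end) pairs in 1-based inclusive GTF coordinates,
    so an exon covers chrom_seq[start - 1:end] in Python slicing.
    """
    seq = "".join(chrom_seq[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(seq) if strand == "-" else seq
```

For example, `transcript_sequence("AAACCCGGGTTT", [(1, 3), (7, 9)], "+")` joins the two exons into `"AAAGGG"`.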

Because the end goal is to see how these lncRNA sequences fold, they are grouped into length bins of 500 bases
(1-500, 501-1000, etc.). Larger sequences take longer to fold, so the bins group sequences with similar folding times.
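
One way to compute the bin label for a given sequence length is shown below. The function name `length_bin` and the use of the label as a directory name are assumptions for illustration; the project's own naming may differ.

```python
def length_bin(length, width=500):
    """Return the inclusive bin label for a sequence length, e.g. '1-500'.

    Hypothetical helper: bins are 1-500, 501-1000, ... for the default width.
    """
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    low = ((length - 1) // width) * width + 1
    high = low + width - 1
    return f"{low}-{high}"
```

Subtracting 1 before the integer division keeps lengths that are exact multiples of the width (500, 1000, ...) in the lower bin rather than pushing them into the next one.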

G37_Parsing.py reads the human genome FASTA file and pulls out the sequences of the 23 chromosomes. This was
done so that a local BLAST database containing only the chromosomes could be used to check the validity
of the sequence extracted for each transcript.
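
A rough sketch of this kind of filtering is given below. It is not the code in G37_Parsing.py: the function names are hypothetical, and it assumes that the first whitespace-delimited token of each FASTA header is the record name and that the caller supplies the set of chromosome names to keep.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_chromosomes(records, wanted):
    """Yield (name, sequence) only for records whose name is in wanted.

    Assumption: the record name is the first token of the header line.
    """
    for header, seq in records:
        parts = header.split()
        if parts and parts[0] in wanted:
            yield parts[0], seq
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed directly, which avoids loading the whole genome file into memory as a list of lines.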

The resulting FASTA files will be uploaded to a university computer cluster, where the secondary structure of
each RNA sequence is to be determined using the Mathews Lab RNAstructure package. rna_manager.py and
rna_manager.slurm were used on the cluster to run the RNAstructure jobs.

lnc_fdb.fa and lnc_hc.fa were downloaded from https://lncipedia.org/download. These two FASTA files were parsed with
lnc_parsing.py to obtain the individual sequence of each lncRNA transcript. compare_fastas.py was used to
compare the sequences gathered by Genome_Reading.py with those from lnc_parsing.py.

verify_bases.py was used to read each letter of every FASTA sequence in a directory to ensure it contained only the
bases A, T, C, and G. Sequences that failed this check were written out to bad_bases.txt.
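
The per-letter check can be illustrated with a short function. This is a sketch under assumptions, not the contents of verify_bases.py: the name `invalid_bases` is hypothetical, and treating lowercase letters as valid (by uppercasing first) is a choice made here that the original script may not share.

```python
def invalid_bases(seq, allowed="ACGT"):
    """Return (position, letter) pairs for letters outside the allowed set.

    Positions are 1-based. The sequence is uppercased first, so lowercase
    a/c/g/t count as valid in this sketch.
    """
    allowed_set = set(allowed)
    return [(i + 1, base) for i, base in enumerate(seq.upper()) if base not in allowed_set]
```

A sequence passes when the returned list is empty; otherwise the positions show where ambiguous codes such as N occur.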

**************************************************************************************************************
# WORKFLOW
**************************************************************************************************************
Homo_sapiens.GRCh37.dna.alt --> G37_Parsing.py --> g37_chroms.fa
lncipedia_5_2_hc_hg19.gtf + Homo_sapiens.GRCh37.dna.alt --> Genome_Reading.py --> Data (folder)
lnc_fdb.fa + lnc_hc.fa --> lnc_parsing.py --> fdbData, hcData (folders)

Some manual file manipulation occurred:
1) The first line of the GTF file used in this project contained "##gtf" and was deleted before running
Genome_Reading.py.

A local BLAST database was made from g37_chroms.fa on a Unix server using the bioinfo module. Several sequences were
searched against this database with BLAST to confirm they had been extracted correctly.




