dev11ume/MinSeq-Find

Author: Devesh Bhimsaria

Five sequencing datasets were used in this study:
    1. From this study – Sequencing data can be downloaded from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA729962
    2. Zhang et al. 2018 Genome Research (SelexGLM study) – Sequencing data can be downloaded from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA379022
    3. Jolma et al. 2013 Cell (HT-SELEX study) – Sequencing data can be downloaded from ENA (European Nucleotide Archive) under accession numbers ERP001824 and ERP001826
    4. Yin et al. 2017 Science (HT-SELEX study) – Sequencing data can be downloaded from ENA (European Nucleotide Archive) under accession number PRJEB9797.
    5. Isakova et al. 2017 Nature Methods (SMiLE-seq study) – Sequencing data can be downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=SRP073361


Run on a system with 16 GB RAM.
Getting MinSeqs for Nuclear Receptor (NR) proteins:
    1. Download the NR data from the links above.
    2. Data downloaded in fastq format, for this study and for the other studies, must first be converted into sequence format. Use the provided trim_fastq_1.pl to convert fastq files to sequence format. This script also renames the files. “perl trim_fastq_1.pl Sra_mapping1.txt 20”
    3. Start with a list of NRs, such as “list_NRs_complete_1.txt” (all NRs from this study or from a previous high-throughput in vitro sequencing study that gave a valid motif) or “list_NRs_bhimsaria_1.txt” (NRs from this study only). The columns are:
        1. Sample file name with an ID.
        2. Library or previous round file name with an ID. The starting random library was used for this study, SMiLE-seq and SelexGLM, whereas previous round data was used for Jolma et al. 2013 and Yin et al. 2017 because the random library wasn’t always available for those datasets.
        3. HT-SELEX round number.
        4. Name 1 provided to the sample or run.
        5. Constant region to the left of the random DNA library (if available; otherwise x).
        6. Constant region to the right of the random DNA library (if available; otherwise x).
        7. Name 2 provided to the sample or run.
        8. Partner protein (if available; otherwise x).
        9. x.
        10. Monomer used for each sample for landscape and repeat heat maps.
        11. Name 3 provided to the sample or run.
        12. Numbering 1 (ignore).
        13. Numbering 2 (ignore).
        14. Name 4 provided to the sample or run (this is used to name the sample).
    4. Run the MinSeqFind algorithm with the run_minseqfind_1.m program (in the MinSeqFind directory) in Octave. Each NR sample typically takes around an hour to run. Set the list of samples on line 5, for example list_file='list_NRs_bhimsaria_1_demo.txt'; the remaining parameters can be set on lines 10 to 38 of run_minseqfind_1.m.
    5. The program “run_minseqfind_1.m” generates PWMs in text format and chen format, which can then be converted to logos using the seqLogo package in R or the chen2meme and ceqlogo commands from the MEME Suite (https://meme-suite.org/meme/). The PWM logos generated by ceqlogo were used in the manuscript.
    6. The program “run_minseqfind_1.m” also generates MinSeqs, which are written to a text file with the columns: a) sequence, b) reverse complement, c) weighted MinSeq score, d) length of the sequence (not counting Ns), e) enrichment/MinSeq score.
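
The 14-column list file from step 3 can be inspected before a long run. The following is a minimal Python sketch, not part of this repository: it assumes the columns are separated by whitespace (the delimiter is not stated here), and the field names are hypothetical labels for the columns described above.

```python
from typing import List, NamedTuple, Optional


class NRSample(NamedTuple):
    """One row of the NR list file; field names are hypothetical labels."""
    sample_file: str      # column 1: sample file name with an ID
    library_file: str     # column 2: library or previous round file name
    selex_round: str      # column 3: HT-SELEX round number
    name1: str            # column 4
    left_constant: str    # column 5: 'x' if not available
    right_constant: str   # column 6: 'x' if not available
    name2: str            # column 7
    partner: str          # column 8: 'x' if not available
    unused: str           # column 9: always 'x'
    monomer: str          # column 10
    name3: str            # column 11
    numbering1: str       # column 12 (ignore)
    numbering2: str       # column 13 (ignore)
    name4: str            # column 14: used to name the sample


def parse_list_line(line: str) -> NRSample:
    """Split one line into 14 columns (assumes whitespace-delimited text)."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, got {len(fields)}")
    return NRSample(*fields)


def optional(value: str) -> Optional[str]:
    """Map the placeholder 'x' to None."""
    return None if value == "x" else value


def read_list_file(path: str) -> List[NRSample]:
    """Read every non-empty line of a list file."""
    with open(path) as handle:
        return [parse_list_line(line) for line in handle if line.strip()]
```

For example, `read_list_file("list_NRs_bhimsaria_1_demo.txt")` would return one record per sample if the file matches the assumed layout, and raise a `ValueError` on any line that does not have exactly 14 columns.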

Demo:
After downloading and converting the data, the user can execute run_minseqfind_1.m with list_NRs_bhimsaria_1_demo.txt to obtain the files for the GR round 3 binding data, named S07_166_S07_998, in the MinSeqFindFunct-op directory. The results are stored in the resf.mat file in the S07_166_S07_998 directory. The corresponding MinSeqs are in the file S07_166_S07_998_1_100000.txt in the MinSeqs directory. 20 numbered PWMs appear in the TXT and CHEN directories, e.g. S07_166_S07_998_1_1.txt and S07_166_S07_998_1_1.chen, together with their reverse complements S07_166_S07_998_1_1r.txt and S07_166_S07_998_1_1r.chen.
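
A MinSeqs text file such as S07_166_S07_998_1_100000.txt can be sanity-checked against its described five columns (sequence, reverse complement, weighted MinSeq score, length without Ns, enrichment score). The Python sketch below is an illustration only and is not part of this repository: it assumes whitespace-separated columns and an A/C/G/T/N alphabet, and all function and field names are hypothetical.

```python
from typing import NamedTuple

# Complement table for the assumed alphabet; N pairs with N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence that may contain Ns."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


class MinSeq(NamedTuple):
    sequence: str
    reverse_complement: str
    weighted_score: float
    length: int          # number of non-N positions
    enrichment: float


def parse_minseq_line(line: str) -> MinSeq:
    """Parse one line (assumes five whitespace-separated columns)."""
    seq, rc, weighted, length, enrichment = line.split()
    return MinSeq(seq, rc, float(weighted), int(float(length)), float(enrichment))


def is_consistent(rec: MinSeq) -> bool:
    """Check column b against column a, and column d against the non-N count."""
    non_n = sum(1 for base in rec.sequence.upper() if base != "N")
    return (
        reverse_complement(rec.sequence) == rec.reverse_complement.upper()
        and non_n == rec.length
    )
```

Running `is_consistent` over every parsed line is a quick way to confirm that the file was read with the right column order before using the scores downstream.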

MinSeqScoring_1.m
Program to score sequences using the MinSeqs stored in a resf.mat file generated by run_minseqfind_1.m.
Sequences longer than 1000 are ignored, because scoring them takes much longer.

Define the following variables in the file:
1) 'seq_file': the file containing the sequences to score
2) 'load_resf': the MinSeq resf.mat file
3) 'op_file': the name of the output file where the output scores will be stored
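
Because sequences longer than 1000 are ignored, it can help to see in advance which entries of the input would be skipped. The Python sketch below only illustrates that length cutoff; it does not reproduce the scoring in MinSeqScoring_1.m. It assumes one sequence per line in the sequence file, which is not specified here, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_LENGTH = 1000  # sequences longer than this are ignored by the scorer


def split_by_length(
    sequences: Iterable[str], max_length: int = MAX_LENGTH
) -> Tuple[List[str], List[str]]:
    """Separate sequences into those within the cutoff and those beyond it."""
    kept: List[str] = []
    skipped: List[str] = []
    for seq in sequences:
        seq = seq.strip()
        if not seq:
            continue  # ignore blank lines
        if len(seq) <= max_length:
            kept.append(seq)
        else:
            skipped.append(seq)
    return kept, skipped


def check_sequence_file(path: str) -> Tuple[int, int]:
    """Count kept and skipped sequences (assumes one sequence per line)."""
    with open(path) as handle:
        kept, skipped = split_by_length(handle)
    return len(kept), len(skipped)
```

Calling `check_sequence_file` on the file assigned to 'seq_file' reports how many sequences fall within the cutoff and how many would be left unscored.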

License: GPL-3.0 (see LICENSE and LICENSE.txt)