Statistical analysis of genomes


Genomes_project

Statistical analysis of annotated genomes

Goals and objectives

Find correlations between genomic features (such as SNPs, methylation, and TFBSs) and functional genomic regions across different genomes

  1. Plot sequence features such as TFBSs, SNPs, methylation, and RNA-seq coverage
  2. Map them onto functional genomic regions
  3. Find correlations and check their reproducibility across different genomes
  4. Consider annotation quality and what the results imply for predicting functional features (such as promoters) in unannotated genomes

Data:

Graphs for Oryza sativa [1]

Arabidopsis thaliana

  1. reference genome TAIR10_toplevel (ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/dna/)
  2. annotation TAIR10_GFF3_genes.gff3
  3. variation: VCF file from the 1001 Genomes project (TAIR)
  4. methylation data

Medicago truncatula

  1. annotation (.gff) and assembly (.fasta) from http://www.medicagogenome.org/downloads
  2. SNP files also from http://www.medicagogenome.org/downloads

Homo sapiens

  1. annotation Release 28 (GRCh38.p12) (CHR) in .gff3 format
  2. .fasta of primary assembly (PRI)

Mus musculus

  1. annotation Release M17 (GRCm38.p6) (CHR) in .gff3 format
  2. .fasta of primary assembly (PRI)

Felis catus

  1. annotation assembly Felis_catus_9.0 in .gff format (ID 78)
  2. .fasta of assembly 9.0 (ID 78)

Drosophila melanogaster

  1. reference assembly dmel_r5.57_FB2014_03 from FlyBase, dmel-all-chromosome-r5.57.fasta.gz
  2. annotation dmel_r5.57_FB2014_03 dmel-all-filtered-r5.57.gff.gz
  3. variation: downloaded per chromosome, with all populations in one .vcf file, from the PopFly Browser (Hervas S, Sanz E, Casillas S, Pool JE, and Barbadilla A (2017) PopFly: the Drosophila population genomics browser. Bioinformatics, 33, 2779-2780)

Danio rerio

Scripts for data preprocessing:

  1. get_ATGs.py
  2. get_4tss.py
  3. get_4tts.py
  4. get_promoters.py
  5. get_fin_anno.py
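
The internals of these scripts are not described here, so the following is only a minimal sketch of how TSS positions and a promoter window could be derived from a GFF3 annotation. It is not the code of get_4tss.py or get_promoters.py. The function names, the choice of the "gene" feature type, and the 500-nucleotide flanks are assumptions made for illustration, and the two annotation lines in the usage example are made up.

```python
def tss_from_gff3(lines, feature="gene"):
    """Yield (chrom, tss, strand, attributes) for each matching GFF3 feature.

    GFF3 coordinates are 1-based and inclusive. The TSS is taken as the
    feature start on the '+' strand and the feature end on the '-' strand.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GFF3 record
        chrom, ftype, start, end, strand = (
            fields[0], fields[2], fields[3], fields[4], fields[6])
        if ftype != feature or strand not in ("+", "-"):
            continue
        tss = int(start) if strand == "+" else int(end)
        yield chrom, tss, strand, fields[8]


def promoter_bed(chrom, tss, strand, upstream=500, downstream=500):
    """Return a BED-style (chrom, start, end, strand) window around a TSS.

    BED intervals are 0-based and half-open, so the 1-based window start
    is shifted by one. The flank sizes are hypothetical defaults.
    """
    if strand == "+":
        first, last = tss - upstream, tss + downstream
    else:
        first, last = tss - downstream, tss + upstream
    return chrom, max(0, first - 1), last, strand


# Illustrative input only: two made-up gene records.
example_gff = [
    "##gff-version 3\n",
    "Chr1\tsrc\tgene\t3631\t5899\t.\t+\t.\tID=geneA\n",
    "Chr1\tsrc\tgene\t5928\t8737\t.\t-\t.\tID=geneB\n",
]
example_tss = list(tss_from_gff3(example_gff))
example_windows = [promoter_bed(c, t, s) for c, t, s, _ in example_tss]
```

The strand check matters because on the minus strand the transcript starts at the larger coordinate, so upstream and downstream swap sides.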

Data preprocessing:

  1. to create a file with ATGs: python3 get_ATGs.py annotation.gff
  2. to create a file with TSSs: python3 get_4tss.py annotation.gff
  3. to create files with promoter regions (.bed + .txt): python3 get_promoters.py 4tss.txt
  4. to obtain promoter region sequences, first make the chromosome names in the FASTA consistent with the names in the bed file: sed 's/^>1.*$/>Chr1/' Arabidopsis_thaliana.TAIR10.dna.toplevel.fa | sed 's/^>2.*$/>Chr2/' | sed 's/^>3.*$/>Chr3/' | sed 's/^>4.*$/>Chr4/' | sed 's/^>5.*$/>Chr5/' | sed 's/^>Mt.*$/>ChrM/' | sed 's/^>Pt.*$/>ChrC/' > new_ref.fa; then extract the sequences from the renamed reference: bedtools getfasta -fi new_ref.fa -bed promoters.bed -name -s -fo promoters_sequences.fasta
  5. to create fin_anno: python3 get_fin_anno.py annotation.gff
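
For readers who prefer to do the chromosome renaming from step 4 in Python rather than with a sed chain, a rough equivalent is sketched below. The mapping is copied from the sed expressions. The function name is hypothetical. Unlike the sed patterns, which match any header that merely starts with the given prefix, this sketch matches the first whitespace-delimited token of the header exactly.

```python
# Mapping taken from the sed chain in step 4:
# Ensembl-style sequence names -> names used in the bed file.
CHROM_NAMES = {
    "1": "Chr1", "2": "Chr2", "3": "Chr3", "4": "Chr4", "5": "Chr5",
    "Mt": "ChrM", "Pt": "ChrC",
}


def rename_fasta_headers(lines, mapping=CHROM_NAMES):
    """Yield FASTA lines with known header names replaced.

    Only header lines (starting with '>') are inspected. The rest of the
    header after the sequence name is dropped, as in the sed version.
    Headers whose name is not in the mapping pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split(None, 1)
            name = tokens[0] if tokens else ""
            if name in mapping:
                newline = "\n" if line.endswith("\n") else ""
                line = ">" + mapping[name] + newline
        yield line


# Illustrative input only; the header descriptions are shortened.
example_in = [">1 dna:chromosome\n", "ACGT\n", ">Mt dna:chromosome\n",
              "GG\n", ">scaffold_1 other\n"]
example_out = list(rename_fasta_headers(example_in))
```

To process a real file, open the input and output FASTA files and pass the input file object to rename_fasta_headers, writing each yielded line to the output. Because the function is a generator, the genome is streamed line by line rather than loaded into memory.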

Plot visualization:

  1. the first (and most important) file is snp_custom_annotation.r, which contains a function that creates a custom annotation of SNPs; all other scripts use this function
  2. ATG_plot.r visualizes the SNP distribution around the start codon (required packages: dplyr, scales)
  3. intron_exon_junctions.r visualizes the SNP distribution around exon-intron boundaries
  4. promoter-terminator.r visualizes the SNP distribution around the terminator
  5. transcr_stop_plot.r visualizes the SNP distribution around the transcription stop codon
  6. transfac.r visualizes the distribution of TFBSs in the promoter region (±500 nucleotides around the TSS)
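
The plots listed above all show how a feature is distributed around an anchor position (start codon, junction, TSS). The R scripts are the actual implementation in this repository. The Python sketch below only illustrates the underlying idea of counting SNPs at each strand-aware offset from a set of anchors on one chromosome. The function name, the input layout, and the default flank of 500 are assumptions, and the numbers in the usage example are made up.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def snp_offset_counts(snp_positions, anchors, flank=500):
    """Count SNPs at each offset within +-flank of the anchor positions.

    snp_positions: iterable of integer SNP positions on one chromosome.
    anchors: iterable of (position, strand) pairs on the same chromosome.
    Offsets are strand-aware: negative values are upstream of the anchor.
    """
    snps = sorted(snp_positions)
    counts = Counter()
    for pos, strand in anchors:
        # Binary search for the slice of SNPs inside the window.
        lo = bisect_left(snps, pos - flank)
        hi = bisect_right(snps, pos + flank)
        for snp in snps[lo:hi]:
            offset = snp - pos
            if strand == "-":
                offset = -offset  # flip so upstream stays negative
            counts[offset] += 1
    return counts


# Illustrative input only.
example_counts = snp_offset_counts(
    [90, 100, 105, 700], [(100, "+"), (100, "-")], flank=10)
```

The resulting counter maps each offset to a SNP count and can be plotted directly as a profile. Dividing by the number of anchors would give a per-anchor average if genomes with different gene counts are to be compared.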

Several results:

  1. Arabidopsis thaliana

  2. Medicago truncatula