RNAi gene duplications in Drosophila

This repository contains the supporting scripts that I used to analyze genomic data for my Master's dissertation. A detailed explanation of each script is given below.

Variant Calling

I used Bowtie2, samtools v1.4 and GATK v3.5 to call variants in 15-20 genes from short-read population genomic data of Drosophila pseudoobscura (12 strains), Drosophila miranda (12 strains; see McGaugh et al., 2012) and Drosophila athabasca (28 strains; see Miller et al., 2017). Finally, I used fasta_formatter from the FASTX-Toolkit to organize the FASTA files by gene. The script is available here: varcall.sh
Linkage information in heterozygous individuals is recovered using FastPhase. I wrote an R script that parses FASTA files into a format suitable for FastPhase, runs FastPhase and converts the output back into FASTA format. The script is available here: FastPhaseIntercovert.R. The original script is credited to Dr Darren Obbard; I modified it to be compatible with the current version of FastPhase.
To gain information on the effect of weakly deleterious mutations, I wrote an R script that removes variants with a minor allele frequency (MAF) below 0.15. The script parses FASTA files into a matrix, replaces every variant with MAF below 0.15 with the major variant at the corresponding site and converts the matrix back into FASTA format. The script is available here: MAFRemover.R
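The MAF filtering step described above can be sketched in shell with awk. This is a minimal illustration under stated assumptions, not the code in MAFRemover.R: the function name `maf_filter` is hypothetical, the input is assumed to be an aligned FASTA file of equal-length sequences, and characters other than A, C, G and T (gaps, N) are assumed to be missing data that is neither counted nor replaced.

```bash
#!/bin/sh
# maf_filter [threshold]  (hypothetical helper, not the repository's MAFRemover.R)
# Reads an aligned FASTA file on stdin and, at each site, replaces every allele
# whose frequency among called bases is below the threshold with the major allele.
maf_filter() {
  awk -v thr="${1:-0.15}" '
    /^>/  { n++; name[n] = $0; seq[n] = ""; next }   # header starts a new record
    n > 0 { seq[n] = seq[n] toupper($0) }            # join wrapped sequence lines
    END {
      len = length(seq[1])
      for (pos = 1; pos <= len; pos++) {
        split("", count); total = 0                  # reset per-site allele counts
        for (i = 1; i <= n; i++) {
          a = substr(seq[i], pos, 1)
          if (a ~ /^[ACGT]$/) { count[a]++; total++ }  # assumption: skip gaps and N
        }
        # Major allele = most frequent base; ties are broken alphabetically.
        major = ""; best = 0
        for (a in count)
          if (count[a] > best || (count[a] == best && a < major)) {
            best = count[a]; major = a
          }
        for (i = 1; i <= n; i++) {
          a = substr(seq[i], pos, 1)
          if ((a in count) && count[a] / total < thr) a = major
          out[i] = out[i] a
        }
      }
      for (i = 1; i <= n; i++) { print name[i]; print out[i] }
    }'
}
```

For example, `maf_filter 0.15 < gene.fasta > gene.filtered.fasta` would write the filtered alignment, where `gene.fasta` is a placeholder file name. Frequencies are computed per sequence, so the sketch assumes each record is one phased haplotype.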

I collated transcriptome datasets from ENA and DDBJ and used a custom parser to process the text files into the format expected by the expression script.
The expression script maps RNA-seq reads to a reference transcriptome with Bowtie2 and uses samtools v1.4 to retain and count only the mapped reads for each gene (using samtools idxstats).
The remaining scripts are used for normalizing, transforming and plotting the data with ggverse packages, and for mixed-model analysis with MCMCglmm.
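The counting step described above can be sketched as a small parser for `samtools idxstats` output, which is tab-separated with the reference name, its length, the number of mapped reads and the number of unmapped reads. The function name `idxstats_to_cpm` is hypothetical, and counts per million (CPM) is used only as an illustrative normalization; the repository's own scripts may normalize differently.

```bash
#!/bin/sh
# idxstats_to_cpm  (hypothetical helper)
# Reads `samtools idxstats` output on stdin and prints, per reference sequence:
#   name <TAB> mapped read count <TAB> counts per million mapped reads
# Assumed upstream commands on a coordinate-sorted BAM (file names are placeholders):
#   samtools view -b -F 4 sample.bam > sample.mapped.bam   # drop unmapped reads
#   samtools index sample.mapped.bam
#   samtools idxstats sample.mapped.bam | idxstats_to_cpm
idxstats_to_cpm() {
  awk -F '\t' '
    $1 != "*" { name[++n] = $1; mapped[n] = $3; total += $3 }  # "*" row holds unplaced reads
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%.2f\n", name[i], mapped[i],
               (total > 0 ? mapped[i] * 1e6 / total : 0)
    }'
}
```

CPM corrects only for sequencing depth, not transcript length; the length column is available in the same output if a length-aware measure is preferred.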

I collected gene sequences using tBLASTN searches against a local database. To automate the process, I wrote a script that takes query sequences, runs the BLAST search and outputs FASTA from the BLAST hits (reverse-complementing hits that map in the reverse direction). The script is available here: Reverse-ComplementBLASThits.sh. I also created a script to manually select a region of interest from the sequence database, which I find handy when inspecting a sequence and its surrounding region. This script is available here: BLAST search
For species that only have genomic reads, I used a targeted assembly approach that assembles only the reads matching known RNAi proteins. I identified the reads using DIAMOND and then assembled them with SPAdes, a single-cell/bacterial assembly program. The script is available here: Diamond_Spades.sh.
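The reverse-complementing of BLAST hits mentioned above can be sketched as follows. This is not the code in Reverse-ComplementBLASThits.sh: the function names `revcomp` and `orient_hits` are hypothetical, and the input is assumed to be a tab-separated table with the hit name, the subject start, the subject end and the forward-strand nucleotide sequence of the hit region as extracted from the database. A subject start greater than the subject end is taken to mean the hit lies on the reverse strand.

```bash
#!/bin/sh
# revcomp: reverse-complement one nucleotide sequence per line on stdin.
# Only A, C, G and T are complemented; other characters such as N pass through.
revcomp() {
  rev | tr 'ACGTacgt' 'TGCAtgca'
}

# orient_hits  (hypothetical helper)
# Input columns (tab-separated): hit name, subject start, subject end, sequence.
# Assumption: the sequence is the forward-strand region taken from the database.
# Output: FASTA records with reverse-strand hits reverse-complemented.
orient_hits() {
  tab="$(printf '\t')"
  while IFS="$tab" read -r id sstart send seq; do
    if [ "$sstart" -gt "$send" ]; then
      seq="$(printf '%s\n' "$seq" | revcomp)"   # start > end: reverse-strand hit
    fi
    printf '>%s\n%s\n' "$id" "$seq"
  done
}
```

With such a table in a placeholder file `hits.tsv`, `orient_hits < hits.tsv > hits.fasta` would write every hit in the orientation of the query.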
