Repository practice project in EEAD-CSIC
The main objective is to analyse variant calling with GATK. To achieve that goal, I will follow these steps:
- Prepare Data
- Quality Control
- Mapping
- Variant calling
Genome reference: MorexV
Samples: 32 samples of barley
This tutorial introduces the text-based workflow system Snakemake: https://snakemake.readthedocs.io/en/stable/tutorial/setup.html
All the rules are in the Snakefile in the workflow directory.
The workflow needs either a pair of paired-end read files or a single-end read file. This repository includes two read files, "A_1_20_1" and "A_1_20_2", in the data/samples directory. The data directory also contains the whole reference genome: GCA_904849725_genome.fa
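As a rough illustration of the naming convention, the sketch below groups read-file names into samples. It assumes that mates end in _1 and _2 (as A_1_20_1 and A_1_20_2 do) and that any other name is a single-end file. The function names `group_reads` and `layout` are hypothetical and are not part of the workflow.

```python
import re
from collections import defaultdict

# Assumption: mates are named <sample>_1 and <sample>_2, e.g. A_1_20_1 / A_1_20_2.
MATE_SUFFIX = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])$")


def group_reads(names):
    """Group read-file stems into {sample: {mate_label: stem}}."""
    groups = defaultdict(dict)
    for name in names:
        match = MATE_SUFFIX.match(name)
        if match is None:
            # No _1/_2 suffix: treat the file as a single-end sample.
            groups[name]["single"] = name
        else:
            groups[match["sample"]][match["mate"]] = name
    return dict(groups)


def layout(mates):
    """Return "paired" when both mates are present, otherwise "single"."""
    return "paired" if {"1", "2"}.issubset(mates) else "single"
```

With the two files in data/samples, `group_reads(["A_1_20_1", "A_1_20_2"])` yields one sample, A_1_20, whose layout is "paired".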
To configure this workflow to use your own data, go to the config directory and follow the instructions there.
Clone this repository:
git clone https://github.com/carmenmiravete/barley_variant_calling.git
Change to the barley_variant_calling directory:
cd barley_variant_calling
Run the following command from the terminal:
snakemake -p -c4
The -c (or --cores) option sets the number of cores to use.
The input files may be compressed.
If so, make a rule to decompress them.
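A quick way to tell whether a file is compressed is to look at its first two bytes, since every gzip stream starts with 0x1f 0x8b. The sketch below is only an illustration in Python and is not the rule from the Snakefile. The helper names `looks_gzipped` and `open_fastq` are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(first_bytes):
    """Return True if the given leading bytes start with the gzip magic number."""
    return first_bytes[:2] == GZIP_MAGIC


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        compressed = looks_gzipped(handle.read(2))
    # gzip.open in "rt" mode decompresses on the fly and returns text lines.
    return gzip.open(path, "rt") if compressed else open(path, "rt")
```

Checking the content rather than the .gz extension also catches files that were renamed after compression.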
In the terminal:
snakemake -p data/samples/{A_1_20_1,A_1_20_2}.fastq.gz -c4
I will start the project with a quality control step using FastQC. The raw sequences must be checked to ensure that they do not have problems that could affect the biological analysis.
Process:
- Import the data from the FastQ files, two at a time
- The program provides an overview of the data
- Summary graphs
- Export the results to a permanent HTML report
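To give an idea of the kind of number behind a quality report, the sketch below computes the mean Phred quality of each read in a FASTQ file. It is not how FastQC is implemented. It assumes plain four-line records with no blank lines and the Phred+33 encoding, and the function name `mean_read_qualities` is hypothetical.

```python
def mean_read_qualities(lines, offset=33):
    """Yield (read_id, mean Phred quality) for each 4-line FASTQ record.

    Assumes Phred+33 quality encoding and strictly four lines per record.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, _seq, _plus, quals = records[i:i + 4]
        # Each quality character encodes one score: ASCII code minus the offset.
        scores = [ord(char) - offset for char in quals]
        mean = sum(scores) / len(scores) if scores else 0.0
        yield header[1:], mean
```

Any iterable of lines works as input, for example an open file handle on one of the FASTQ files in data/samples.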
snakemake -p results/reports/data/samples/{A_1_20_1,A_1_20_2}.html -c4
- GCA_904849725_genome.fa
- SAMPLES: A_1_20_1 and A_1_20_2
The reference genome can be indexed from the terminal:
bwa index -a bwtsw data/GCA_904849725_genome.fa
Alternatively, use the corresponding rule in the Snakefile, rule ref_genome.
The read mapping is defined in the rule map_reads in the Snakefile.
To run this rule, enter the following command in the terminal:
snakemake -p results/sorted/A_1_21_RG.bam
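Single-end reads can also be mapped. Single-end read alignment (the -f option of bwa aln names the output file, so the .sai name here is only an example):
bwa aln -f Single_End_Sample.sai data/GCA_904849725_genome.fa Single_End_Sample.fastq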
There is a rule called sort_bam. To run it, enter the following command in the terminal:
snakemake -p results/sorted/A_1_20_sorted.bam -c4
# Step 5: Depth calculation
rule depth_calc:
    input:
        "results/mapped/{sample}.bam"
    output:
        "results/depth/{sample}_depth.csv"
    shell:
        "samtools depth {input} > {output}"

# Step 6: Show the depth in a plot
rule plot_depth:
    input:
        "results/depth/{sample}_depth.csv"
    output:
        "results/depth/plots/{sample}.svg"
    script:
        "plot-depth.py"
First, we need to calculate the depth. For this, we use the rule depth_calc, which produces a .csv output.
To run the rule, enter this command in the terminal:
snakemake -p results/depth/A_1_20_depth.csv -c4
For this step, we use Python 3, but you can use any program you like to visualize the data.
The script I wrote is in the directory workflow/scripts
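Per-base depth is usually too dense to plot directly, so one common option is to average it over fixed windows first. The sketch below is not the script from workflow/scripts. It assumes the default `samtools depth` output, which is tab-separated (despite the .csv extension) with the columns contig, 1-based position and depth. The function name `windowed_mean_depth` and the window size are hypothetical choices.

```python
from collections import defaultdict


def windowed_mean_depth(lines, window=1000):
    """Return {(contig, window_start): mean depth} from `samtools depth` lines.

    The mean is taken over the positions that appear in the input; positions
    that samtools did not report are not counted as zeros.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        contig, pos, depth = line.rstrip("\n").split("\t")[:3]
        # Map the 1-based position to the first position of its window.
        start = (int(pos) - 1) // window * window + 1
        key = (contig, start)
        totals[key] += int(depth)
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

The resulting dictionary can be passed to any plotting tool, with the window start on the x axis and the mean depth on the y axis.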
Software used in this practical:
- Picard
- GATK
Installing GATK
We are using the
GATK function
snakemake --dag -p | dot -Tsvg > dag.svg