This repository contains scripts and tools for analyzing whole genome sequencing data using GATK4. The pipeline includes several steps, such as quality control analysis, alignment, variant calling, variant filtering, and functional analysis.
To get started with the genome analysis pipeline, follow these instructions:
- Clone this repository: git clone https://github.com/edielam/Genome-Analysis-Pipeline.git
- Install the required dependencies (see the Dependencies section for details).
- Prepare your input data, including raw sequencing reads, reference genome, and known variant sites.
- Customize the pipeline by updating the configuration files and script parameters according to your analysis requirements.
- Execute the pipeline by running the provided scripts in the specified order.
The following dependencies are required to run the genome analysis pipeline:
- GATK4
- FastQC
- BWA
- Samtools
- Picard
- VCFtools
- Variant Effect Predictor (VEP)
- Python (with pandas, matplotlib, Biopython, and other required libraries)
Ensure that these dependencies are installed and accessible in your environment before running the pipeline.
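A quick way to confirm that the command-line tools are reachable is to look each one up on your PATH. The sketch below is only an illustration: the executable names (`gatk`, `fastqc`, `bwa`, `samtools`, `picard`, `vcftools`, `vep`) are assumptions about a typical installation and may differ on your system (for example, Picard is often run through a Java jar instead of a wrapper command). It does not check the Python libraries.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = {
    "GATK4": "gatk",
    "FastQC": "fastqc",
    "BWA": "bwa",
    "Samtools": "samtools",
    "Picard": "picard",
    "VCFtools": "vcftools",
    "VEP": "vep",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

If a tool is reported missing but you know it is installed, edit its entry in `REQUIRED_TOOLS` to the command name you actually use.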
The pipeline is divided into several steps, each implemented in a separate script. Follow the instructions in each script to perform the specific analysis step.
- QC Analysis, Alignment, and Mark Duplicates: Run qc-analysis_alignment_and_markduplicates.sh to perform quality control analysis, read alignment, and duplicate marking in a single step.
- BQSR and Variant Calling: Execute bqsr_and_variant-calling.sh to perform Base Quality Score Recalibration (BQSR) and variant calling using GATK4's HaplotypeCaller.
- Variant Filtering and Annotation: Use variant-filtering-and_annotation.sh to filter variant calls based on quality, coverage, and other criteria, and annotate the variants using Variant Effect Predictor (VEP).
- Functional Analysis: Utilize functional_analysis.py to perform functional analysis, including gene prioritization and pathway enrichment. Use protein_sequence_generation.py for protein sequence generation.
- Visualization: Use visualization_of_variants.py for generating visuals to be included in your report.
Refer to the individual scripts and their respective README files for detailed instructions on usage, inputs, and outputs.
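The order of the steps above can be sketched as a small driver that runs each script in turn and stops at the first failure. The script names are taken from this README, but everything else is an assumption: that the scripts take no command-line arguments, that `bash` and `python3` are the right interpreters, and that protein_sequence_generation.py runs between the functional analysis and visualization steps. Check each script for the inputs it actually expects before relying on this.

```python
import subprocess
from pathlib import Path

# Script names come from this README. The interpreters, the lack of
# arguments, and the position of protein_sequence_generation.py are assumptions.
PIPELINE_STEPS = [
    "qc-analysis_alignment_and_markduplicates.sh",
    "bqsr_and_variant-calling.sh",
    "variant-filtering-and_annotation.sh",
    "functional_analysis.py",
    "protein_sequence_generation.py",
    "visualization_of_variants.py",
]


def build_command(script):
    """Pick an interpreter from the file extension."""
    interpreter = "python3" if script.endswith(".py") else "bash"
    return [interpreter, script]


def run_pipeline(repo_dir=".", steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each step in order from repo_dir, stopping at the first failure."""
    for script in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # when a step exits with a non-zero status.
        runner(build_command(script), cwd=Path(repo_dir), check=True)
```

Calling `run_pipeline("path/to/Genome-Analysis-Pipeline")` from a Python session would then execute the steps in sequence; the path shown is a placeholder.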
- qc-analysis_alignment_and_markduplicates.sh: Script for performing quality control analysis, alignment, and marking duplicates.
- bqsr_and_variant-calling.sh: Script for performing Base Quality Score Recalibration (BQSR) and variant calling.
- variant-filtering-and_annotation.sh: Script for filtering variant calls based on quality, coverage, and other criteria, and annotating variants using VEP.
- functional_analysis.py: Python script for performing functional analysis, including gene prioritization, pathway enrichment, visualization of variants, and protein sequence generation.
- visualization_of_variants.py: Python script for generating visualizations of variants, such as bar charts and network plots, using annotated variant data.
- protein_sequence_generation.py: Python script for generating protein sequences incorporating gene variants using Biopython.
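To make the idea of filtering on quality and coverage concrete, here is a minimal standard-library sketch that keeps VCF records whose QUAL and INFO/DP values meet a threshold. It is not the logic of variant-filtering-and_annotation.sh, whose actual criteria are defined in that script; the function names and the default thresholds (`min_qual=30.0`, `min_depth=10`) are hypothetical values chosen for illustration.

```python
def passes_filter(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the QUAL and INFO/DP thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    qual, info = fields[5], fields[7]
    if qual == ".":
        return False  # a missing quality value is treated as failing
    depth = None
    for entry in info.split(";"):
        if entry.startswith("DP="):
            depth = int(entry[3:])
            break
    return float(qual) >= min_qual and depth is not None and depth >= min_depth


def filter_vcf_lines(lines, **thresholds):
    """Keep header lines and the records that pass the thresholds."""
    return [
        line for line in lines
        if line.startswith("#") or passes_filter(line, **thresholds)
    ]
```

Records without a DP entry are dropped here, which is a design choice of this sketch; a production filter might instead flag them in the FILTER column.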
Contributions are welcome! If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
This project is licensed under the MIT License.