Welcome to BathyBrooksiSymbionts repository! Here, you will find the in-house scripts that were used for the population structure and strain evolution analyses of chemosynthetic symbionts from the deep-sea mussel Bathymodiolus brooksi.
This folder contains R and Python scripts that were used to study symbiont population structure and selective preassure. Additionally, you can find a toy example to test the scripts, which is composed by SNVs called in 6 different samples (A-F). Note that these analyses are gene-based, thus, the reference sequences and coordinates for the SNVs are based on genes.
First, we merge all SNPs found across samples into count files for each individual sample. This script is using .vcf files that have been produced by Lofreq, and requires the DP4 field to extract the nucleotide counts.
./vcf_to_mergedcounts.py .
This will produce .list
files for each of the samples. A two columns .txt
file is required, where the name of the samples must be linked to the name of its corresponding file (see metagenomes/samples.txt
). This table is used as the input for the R script, which will perform the population structure analyses.
Rscript structure.r samples.txt
This results in the generation of a matrix samples.txtFst_pos.txt
,containing Fst and Pi values for each SNV found in each of the samples. Then, we use this table as input to genome_wise_calculations.py
, which will estimate Fst and Pi genome-wide.
./genome_wise_calculations.py samples.txtFst_pos.txt reference_genome.fasta toy_example
This will output a dataframe with Pi values for each sample (toy_example_pi.out
), and a matrix with pair-wise Fst values (toy_example_fst.mat
).
Additionally, pN/pS per gene and for the entire population are estimated with the following command:
./pN_pS_calculations.py . reference_genome.fasta toy_pNpS good_snps.txt
A file with the SNVs that must be parsed is required. This file contains in each line a SNV ID, which contains the geneID and the position in the following format: Gene1*3
.
Here you will find a python script to calculate McDonald-Kreitman and Allele Frequency Spectrum statistics.
Here you find the R and python scripts to calculate Pst (Pangenome Index). The file Example_coverage.list
shows the format expected. Presence is calculated as the gene coverage while absence is calculated as the difference between the full expected coverage and the actual coverage of the gene. For more detailed information on how to run it check the section Population structure analysis.
Metagenome-Assembled Genomes of MOX and SOX bacteria across samples.
This folder contains the pangenomes (gene content) of MOX and SOX bacteria. These comprise the latest versio of the core genomes and accessory genes. The status of the genes are shown in the header.
The one confident contig identified as a phage in the entire dataset has been deposited here. This contig is present in sample E.