Code to support the design and testing an amplicon panel for Pacific oyster (Crassostrea gigas), designed for the use of the authors and associated manuscript, and comes with no guarantees for usefulness for any other purposes.
Requires Linux or Mac OS, and all shell scripts are run from the main directory.
Skip to panel design section
Skip to chromosome positions section
Skip to panel testing section
- bedtools
- R
- Eric Normandeau's scripts
- Prefiltered single SNP data in plink format from Sutherland et al. 2020 (Evol. Appl.) (here) #TODO
- Prefiltered single SNP data populations output in VCF format (here) #TODO
- Reference genome for Pacific oyster used in identifying markers (Zhang et al. 2012; Sutherland et al. 2020) download GenBank version here
Interactively run 01_scripts/01_identifying_markers.R
to
- Import data and select only the British Columbia (BC) naturalized samples
- Perform minor allele frequency (MAF) filter to remove variants under MAF in BC populations
- Calculate per locus stats observed heterozygosity (Hobs) and FST (averaged per locus)
- Generate comparative per locus stat plots
Outputs
03_marker_selection/all_markers_per_locus_stats_incl_hobs_MAF.csv
e.g.,
mname,Fit,Fst,Fis,maf.vec,Hobs
100026_37_C,-0.011,0.00777,-0.0192,0.0173,0.0347
Custom markers will be included in the panel, specifically private allele variants specific to cultured BC populations. This will use code and resources adapted from https://github.com/bensutherland/ms_oyster_popgen
, specifically 00_archive/my_cols.csv
and 01_scripts/private_alleles.r
.
Interactively run 01_scripts/private_alleles.r
to identify marker names (mnames) of high frequency private alleles from DPB and GUR populations.
Outputs:
03_marker_selection/per_repunit_private_allele_tally_all_data.csv
03_marker_selection/DPB_selected_PA_mnames.csv
03_marker_selection/GUR_selected_PA_mnames.csv
Interactively run 01_scripts/01b_collect_mnames.sh
and set the number of markers that you want to retain based on high HOBS and high FST characteristics.
Outputs
04_extract_loci/selected_mnames.csv
e.g.,
504898_23_C, top_FST
375205_31_T, top_FST
Interactively run 01_scripts/02_info_from_vcf.sh
to obtain specific information about the markers from the larger VCF
Outputs
04_extract_loci/vcf_selection.csv
This is a text file with fields (1) chr; (2) pos of SNP in ref genome; (3) info about marker; (4) ref allele; (5) alt allele, e.g.,
JH816256.1,99460,100388:26:-,G,A
Interactively run 01_scripts/03_prepare_bed_file.sh
to prepare a bed file based on the selected markers
Outputs
04_extract_loci/vcf_selection.bed
e.g.,
#TODO note: this will include a fourth column that can be used for matching the output fasta back to the VCF
Run 01_scripts/04_extract_from_reference.sh
to extract selected marker flanking sequence from the genome
Outputs
04_extract_loci/vcf_selection.fa
04_extract_loci/selected_chr_and_seq.txt
(tab delim version)
Interactively run 01_scripts/05_make_submission_form.R
to connect sequence data to selected marker information from the VCF.
Outputs
05_submission_form/seq_and_minfo_all_data.csv
(full information for data checking)05_submission_form/seq_and_minfo_for_submission.csv
(submission info only) The submission csv has fields (1) marker name; (2) chr; (3) ref allele (based on genome); (4) alt allele; (5) strand; (6) marker type; (7) priority level; (8) formatted seq (e.g., ATGC[A/G]ATGC). More details are available in the script.
#TODO# fix: this section Note that the reference allele is given that designation based on the reference genome nucleotide and doesn't always indicate in that manner in the VCF. A switch will occur in the Rscript to switch these identities, if the VCF alt allele is the allele in the reference genome. #END TODO# fix: this section
Interactively run 01_scripts/confirm_FST.R
to confirm that the selected markers for high FST are indeed leading to higher population-level FST estimates.
Inputs
03_marker_selection/adegenet_output.RData
(generated above)04_extract_loci/selected_mnames.csv
Also see information here for additional information on the bed and VCF formats for extraction.
simple_pop_stats repo).
Currently implemented in an interactive Rscript 100-bowtie2-amplicon-mapping-roslin-genome.Rmd
here
- bowtie2
- samtools
- R
- Marker information file
thermo_submitted_sutherland_cgig_v.0.1_2022-02-08_marker_data_only_2023-02-21.xlsx
available as Additional File X here - Reference genome (Roslin Institute) available here
Outputs
- Alignment sam file
- Plot of distribution of markers across the reference genome chromosomes including singly or multiply aligning markers
Create a folder that contains both amplitools
and simple_pop_stats
, as both of these repositories will be used to analyze the data.
Download the following from FigShare
- VariantCaller output files
R_2022_08_04_S5XL.xls
andR_2022_10_07_S5XL.xls
- prepared individual to population map
my_data_ind-to-pop_annot.txt
Put the above VariantCaller files in amplitools/02_input_data
and the population map in simple_pop_stats/02_input_data
Open amplitools/01_scripts/00_initiator.R
in Rstudio and source the script. This will load amplitools.
Open amplitools/01_scripts/demo_analysis.R
for a demonstration analysis (also shown in brief below).
From within R, convert proton results to genepop results using the following amplitools script:
proton_to_genepop(neg_control="BLANK")
# neg_control is a string found in only negative control samples to filter them out
This will output, per input file, a .txt file in 02_input_data/prepped_matrices/
.
From amplitools main directory in the terminal, finalize the genepop by using the following script on each prepared matrix:
./01_scripts/format_genepop.sh 02_input_data/prepped_matrices/<filename>
This will output, per input file, a .gen file in 02_input_data/prepped_genepops/
.
Copy the output genepops into simple_pop_stats/02_input_data
.
Open simple_pop_stats/01_scripts/simple_pop_stats_start.R
in RStudio, clear space, then source the script.
Also source the development script simple_pop_stats/01_scripts/dev/comp_tech_reps.R
Run the following:
comp_tech_reps(format_type = "amplitools", max_missing = 0.5)
Where max_missing
indicates the maximum missing data to retain a sample for the technical replicate comparison.
Save out the produced genind object, which contains only the best of the technical replicate samples:
save(obj_nr_best, file = "02_input_data/obj_nr_best_<date>.RData")
Note: if you need to restart at any future time, you can always reload this file by:
load("02_input_data/obj_nr_best_<date>.RData")
This will output an RData file with only unique samples as 02_input_data/obj_nr_best_<date>.RData
.
Open simple_pop_stats/01_scripts/simple_pop_stats_start.R
in RStudio, clear space, then source the script.
Open ms_oyster_panel/01_scripts/sps_popgen_analysis.R
, and run interactively in RStudio.
note: if you are not running tech replicates, can use load_genepop(datatype="SNP")
to select your genepop
Follow the instructions of the script to:
- Prepare data including adding population attribute to the genepop
- Characterize missing data and filter as needed
- Remove monomorphic loci
- Generate per-locus statistics
- Run multivariate and differentiation analyses
- Identify private alleles at the regional or population level
- Convert the genepop to a rubias file for downstream use in parentage (see below)
Outputs
- various
simple_pop_stats
output in03_results
- a rubias file will be output to
03_results/rubias_output_SNP.txt
Copy the output rubias prepared file to amplitools, into the 03_prepped_data
folder, suggested filename:
cgig_no_monomorphs_no_multimapper.txt
Requires that the rubias file was produced in the preceding step. This will contain three generations of samples, VIU_F0
, VIU_F1
, VIU_F2
for parentage analysis.
Clear the workspace, then source amplitools/00_initiator.R
Estimate log likelihoods from the data and simulated siblings and parents, then calculate on your existing data:
Run the following two scripts to analyze each set of contrasts:
# For F1 vs. F2 (parents vs. offspring)
ckmr_from_rubias(input.FN = "03_prepped_data/cgig_all_rubias.txt", parent_pop = "VIU_F1", offspring_pop = "VIU_F2", cutoff = 5)
# For F0 vs. F1 (grandparents vs. parents)
ckmr_from_rubias(input.FN = "03_prepped_data/cgig_all_rubias.txt", parent_pop = "VIU_F0", offspring_pop = "VIU_F1", cutoff = 5)
Outputs
- Figure of logl from simulated relationships
- Parent-offspring relationships above the cutoff
- Parent and offspring full-sib results above the cutoff
- Retained individuals for both parent and offspring populations
Copy the amplitools output file po_VIU_F1_vs_VIU_F2_pw_logl_5_report.txt
into simple_pop_stats/03_results/
, as this will be used to determine the empirically-determined likely trios based on the following criteria:
- log likelihood > 10 in both parents
- No additional assignments
This will be used to evaluate any potential poorly performing loci, or likely null alleles, but this will be done in
simple_pop_stats
.
An independent dataset was generated to evaluate the function of the panel. To analyze, follow the above steps to create a folder with the present repository, amplitools, and simple_pop_stats, all at the same level.
The analysis will require that the following files are retained and stored as described:
R_2023_07_26_12_44_23_user_GSS5PR-0268-78-Ampseq_Oyster_20230725.xls
inamplitools/02_input_data
my_data_ind-to-pop_annot.txt
insimple_pop_stats/02_input_data/
Follow instructions and analysis code in the following script:
ms_oyster_panel_OCP23_v.0.3/01_scripts/OCP23_analysis_2023-08-28.R