Pacific oyster panel design and testing

Code to support the design and testing an amplicon panel for Pacific oyster (Crassostrea gigas), designed for the use of the authors and associated manuscript, and comes with no guarantees for usefulness for any other purposes.

Requires Linux or Mac OS, and all shell scripts are run from the main directory.

Skip to panel design section
Skip to chromosome positions section
Skip to panel testing section

Panel design

Requirements:

bedtools
R
Eric Normandeau's scripts

Data inputs:

Prefiltered single SNP data in plink format from Sutherland et al. 2020 (Evol. Appl.) (here) #TODO
Prefiltered single SNP data populations output in VCF format (here) #TODO
Reference genome for Pacific oyster used in identifying markers (Zhang et al. 2012; Sutherland et al. 2020) download GenBank version here

01. Filter and characterize variants

Interactively run 01_scripts/01_identifying_markers.R to

Import data and select only the British Columbia (BC) naturalized samples
Perform minor allele frequency (MAF) filter to remove variants under MAF in BC populations
Calculate per locus stats observed heterozygosity (Hobs) and FST (averaged per locus)
Generate comparative per locus stat plots

Outputs

03_marker_selection/all_markers_per_locus_stats_incl_hobs_MAF.csv
e.g.,

mname,Fit,Fst,Fis,maf.vec,Hobs
100026_37_C,-0.011,0.00777,-0.0192,0.0173,0.0347

02. Select custom markers (private alleles)

Custom markers will be included in the panel, specifically private allele variants specific to cultured BC populations. This will use code and resources adapted from https://github.com/bensutherland/ms_oyster_popgen, specifically 00_archive/my_cols.csv and 01_scripts/private_alleles.r.

Interactively run 01_scripts/private_alleles.r to identify marker names (mnames) of high frequency private alleles from DPB and GUR populations.

Outputs:

03_marker_selection/per_repunit_private_allele_tally_all_data.csv
03_marker_selection/DPB_selected_PA_mnames.csv
03_marker_selection/GUR_selected_PA_mnames.csv

03. Select markers (high FST or high HOBS)

Interactively run 01_scripts/01b_collect_mnames.sh and set the number of markers that you want to retain based on high HOBS and high FST characteristics.

Outputs

04_extract_loci/selected_mnames.csv
e.g.,

504898_23_C, top_FST
375205_31_T, top_FST

04. Obtain info from VCF on selected markers

Interactively run 01_scripts/02_info_from_vcf.sh to obtain specific information about the markers from the larger VCF

Outputs

04_extract_loci/vcf_selection.csv
This is a text file with fields (1) chr; (2) pos of SNP in ref genome; (3) info about marker; (4) ref allele; (5) alt allele, e.g.,

JH816256.1,99460,100388:26:-,G,A

05. Prepare BED file for extracting selected marker sequence

Interactively run 01_scripts/03_prepare_bed_file.sh to prepare a bed file based on the selected markers

Outputs

04_extract_loci/vcf_selection.bed e.g.,

#TODO note: this will include a fourth column that can be used for matching the output fasta back to the VCF

06. Extract FASTA from reference genome

Run 01_scripts/04_extract_from_reference.sh to extract selected marker flanking sequence from the genome

Outputs

04_extract_loci/vcf_selection.fa
04_extract_loci/selected_chr_and_seq.txt (tab delim version)

07. Create design submission file

Interactively run 01_scripts/05_make_submission_form.R to connect sequence data to selected marker information from the VCF.

Outputs

05_submission_form/seq_and_minfo_all_data.csv (full information for data checking)
05_submission_form/seq_and_minfo_for_submission.csv (submission info only) The submission csv has fields (1) marker name; (2) chr; (3) ref allele (based on genome); (4) alt allele; (5) strand; (6) marker type; (7) priority level; (8) formatted seq (e.g., ATGC[A/G]ATGC). More details are available in the script.

#TODO# fix: this section Note that the reference allele is given that designation based on the reference genome nucleotide and doesn't always indicate in that manner in the VCF. A switch will occur in the Rscript to switch these identities, if the VCF alt allele is the allele in the reference genome. #END TODO# fix: this section

08. Additional data checking (optional)

Interactively run 01_scripts/confirm_FST.R to confirm that the selected markers for high FST are indeed leading to higher population-level FST estimates.

Inputs

03_marker_selection/adegenet_output.RData (generated above)
04_extract_loci/selected_mnames.csv

Also see information here for additional information on the bed and VCF formats for extraction.

simple_pop_stats repo).

Chromosome positions

Currently implemented in an interactive Rscript 100-bowtie2-amplicon-mapping-roslin-genome.Rmd here

Requirements

bowtie2
samtools
R

Data inputs

Marker information file thermo_submitted_sutherland_cgig_v.0.1_2022-02-08_marker_data_only_2023-02-21.xlsx available as Additional File X here
Reference genome (Roslin Institute) available here

Outputs

Alignment sam file
Plot of distribution of markers across the reference genome chromosomes including singly or multiply aligning markers

Panel testing

Requirements:

Create a folder that contains both amplitools and simple_pop_stats, as both of these repositories will be used to analyze the data.

Data inputs:

Download the following from FigShare

VariantCaller output files R_2022_08_04_S5XL.xls and R_2022_10_07_S5XL.xls
prepared individual to population map my_data_ind-to-pop_annot.txt

Put the above VariantCaller files in amplitools/02_input_data and the population map in simple_pop_stats/02_input_data

01. Use amplitools to prepare input data

Open amplitools/01_scripts/00_initiator.R in Rstudio and source the script. This will load amplitools.

Open amplitools/01_scripts/demo_analysis.R for a demonstration analysis (also shown in brief below).

From within R, convert proton results to genepop results using the following amplitools script:

proton_to_genepop(neg_control="BLANK")         
# neg_control is a string found in only negative control samples to filter them out

This will output, per input file, a .txt file in 02_input_data/prepped_matrices/.

From amplitools main directory in the terminal, finalize the genepop by using the following script on each prepared matrix:
./01_scripts/format_genepop.sh 02_input_data/prepped_matrices/<filename>

This will output, per input file, a .gen file in 02_input_data/prepped_genepops/.

Copy the output genepops into simple_pop_stats/02_input_data.

02. Compare technical replicate samples and retain the best replicates

Open simple_pop_stats/01_scripts/simple_pop_stats_start.R in RStudio, clear space, then source the script.
Also source the development script simple_pop_stats/01_scripts/dev/comp_tech_reps.R

Run the following:
comp_tech_reps(format_type = "amplitools", max_missing = 0.5)
Where max_missing indicates the maximum missing data to retain a sample for the technical replicate comparison.

Save out the produced genind object, which contains only the best of the technical replicate samples:
save(obj_nr_best, file = "02_input_data/obj_nr_best_<date>.RData")

Note: if you need to restart at any future time, you can always reload this file by:
load("02_input_data/obj_nr_best_<date>.RData")

This will output an RData file with only unique samples as 02_input_data/obj_nr_best_<date>.RData.

03. General population genetic characterization of pilot study

Open simple_pop_stats/01_scripts/simple_pop_stats_start.R in RStudio, clear space, then source the script.
Open ms_oyster_panel/01_scripts/sps_popgen_analysis.R, and run interactively in RStudio.

note: if you are not running tech replicates, can use load_genepop(datatype="SNP") to select your genepop

Follow the instructions of the script to:

Prepare data including adding population attribute to the genepop
Characterize missing data and filter as needed
Remove monomorphic loci
Generate per-locus statistics
Run multivariate and differentiation analyses
Identify private alleles at the regional or population level
Convert the genepop to a rubias file for downstream use in parentage (see below)

Outputs

various simple_pop_stats output in 03_results
a rubias file will be output to 03_results/rubias_output_SNP.txt

Copy the output rubias prepared file to amplitools, into the 03_prepped_data folder, suggested filename:
cgig_no_monomorphs_no_multimapper.txt

04. Parentage analysis

Requires that the rubias file was produced in the preceding step. This will contain three generations of samples, VIU_F0, VIU_F1, VIU_F2 for parentage analysis.
Clear the workspace, then source amplitools/00_initiator.R

Estimate log likelihoods from the data and simulated siblings and parents, then calculate on your existing data:
Run the following two scripts to analyze each set of contrasts:

# For F1 vs. F2 (parents vs. offspring)
ckmr_from_rubias(input.FN = "03_prepped_data/cgig_all_rubias.txt", parent_pop = "VIU_F1", offspring_pop = "VIU_F2", cutoff = 5)

# For F0 vs. F1 (grandparents vs. parents)
ckmr_from_rubias(input.FN = "03_prepped_data/cgig_all_rubias.txt", parent_pop = "VIU_F0", offspring_pop = "VIU_F1", cutoff = 5)

Outputs

Figure of logl from simulated relationships
Parent-offspring relationships above the cutoff
Parent and offspring full-sib results above the cutoff
Retained individuals for both parent and offspring populations

Copy the amplitools output file po_VIU_F1_vs_VIU_F2_pw_logl_5_report.txt into simple_pop_stats/03_results/, as this will be used to determine the empirically-determined likely trios based on the following criteria:

log likelihood > 10 in both parents
No additional assignments This will be used to evaluate any potential poorly performing loci, or likely null alleles, but this will be done in simple_pop_stats.

05. Panel testing: independent dataset

An independent dataset was generated to evaluate the function of the panel. To analyze, follow the above steps to create a folder with the present repository, amplitools, and simple_pop_stats, all at the same level.
The analysis will require that the following files are retained and stored as described:

R_2023_07_26_12_44_23_user_GSS5PR-0268-78-Ampseq_Oyster_20230725.xls in amplitools/02_input_data
my_data_ind-to-pop_annot.txt in simple_pop_stats/02_input_data/

Follow instructions and analysis code in the following script:
ms_oyster_panel_OCP23_v.0.3/01_scripts/OCP23_analysis_2023-08-28.R

Name		Name	Last commit message	Last commit date
Latest commit History 145 Commits
00_archive		00_archive
01_scripts		01_scripts
02_input_data		02_input_data
03_marker_selection		03_marker_selection
04_extract_loci		04_extract_loci
05_submission_form		05_submission_form
100_amplicon_mapping		100_amplicon_mapping
.gitignore		.gitignore
README.md		README.md
VCF_format_info.md		VCF_format_info.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pacific oyster panel design and testing

Panel design

Requirements:

Data inputs:

01. Filter and characterize variants

02. Select custom markers (private alleles)

03. Select markers (high FST or high HOBS)

04. Obtain info from VCF on selected markers

05. Prepare BED file for extracting selected marker sequence

06. Extract FASTA from reference genome

07. Create design submission file

08. Additional data checking (optional)

Also see information here for additional information on the bed and VCF formats for extraction.

Chromosome positions

Requirements

Data inputs

Panel testing

Requirements:

Data inputs:

01. Use amplitools to prepare input data

02. Compare technical replicate samples and retain the best replicates

03. General population genetic characterization of pilot study

04. Parentage analysis

05. Panel testing: independent dataset

About

Releases

Packages

Contributors 2

Languages

bensutherland/ms_oyster_panel

Folders and files

Latest commit

History

Repository files navigation

Pacific oyster panel design and testing

Panel design

Requirements:

Data inputs:

01. Filter and characterize variants

02. Select custom markers (private alleles)

03. Select markers (high FST or high HOBS)

04. Obtain info from VCF on selected markers

05. Prepare BED file for extracting selected marker sequence

06. Extract FASTA from reference genome

07. Create design submission file

08. Additional data checking (optional)

Also see information here for additional information on the bed and VCF formats for extraction.

Chromosome positions

Requirements

Data inputs

Panel testing

Requirements:

Data inputs:

01. Use amplitools to prepare input data

02. Compare technical replicate samples and retain the best replicates

03. General population genetic characterization of pilot study

04. Parentage analysis

05. Panel testing: independent dataset

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages