A user-friendly heuristic for inferring presence and mechanism of facultative parthenogenesis from genetic and genomic data sets.
Author: Dr. Brenna A. Levine - levine.brenna.a@gmail.com
How to cite: Levine BA, Booth W. 2024. ParthenoGenius: A user-friendly heuristic for inferring presence and mechanism of facultative parthenogenesis from genetic and genomic data sets. https://github.com/brenna-levine/LEVINE_ParthenoGenius/tree/main
ParthenoGenius is written in Python 3 and requires the following modules: argparse (https://docs.python.org/3/library/argparse.html), pandas (https://pandas.pydata.org), and datetime (https://docs.python.org/3/library/datetime.html).
The help menu can be accessed using the "-h" argument.
python3 ParthenoGenius.py -h
The mandatory arguments to run ParthenoGenius are an input file formatted as a .csv (e.g., mydata.csv) and a user-defined prefix for naming the output files (e.g., "mydata-out").
python3 ParthenoGenius.py mydata.csv mydata-out
The user may provide an optional estimated per-base error rate. The default value for this parameter is 0.001.
python3 ParthenoGenius.py mydata.csv mydata-out --error 0.01
The input file is a .csv file containing diploid co-dominant locus names and the genotypes for a mother and offspring. Missing data can be coded as -9. The file format is similar to that of a Structure file, with two lines for each individual. Alleles can be written as letters or numbers, with no restrictions on the length of the allele (i.e., SNP allele vs microsatellite length). For example:
,SNP1,SNP2,SNP3,SNP4,SNP5
MOM_1,1,1,1,-9,0
MOM_2,1,0,1,1,0
OFF_1,1,0,0,0,0
OFF_2,1,0,1,0,-9
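As a rough illustration of the format above, the following sketch reads a Structure-like .csv with Python's standard csv module and lists the loci that carry the missing-data code. The helper names (read_genotypes, loci_with_missing_data) are hypothetical and are not part of ParthenoGenius; this only shows how the layout can be read, not how ParthenoGenius parses it internally.

```python
import csv
import io

MISSING = "-9"  # missing-data code described in the README


def read_genotypes(handle):
    """Return (locus_names, rows) from a Structure-like .csv handle.

    rows is a list of (sample_id, [alleles]) tuples in file order.
    """
    reader = csv.reader(handle)
    header = next(reader)
    loci = [name.strip() for name in header[1:]]  # first column holds sample IDs
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        sample_id = record[0].strip()
        alleles = [allele.strip() for allele in record[1:]]
        if len(alleles) != len(loci):
            raise ValueError(
                f"{sample_id}: expected {len(loci)} alleles, found {len(alleles)}"
            )
        rows.append((sample_id, alleles))
    return loci, rows


def loci_with_missing_data(loci, rows):
    """Return the locus names where any row carries the missing-data code."""
    return [
        locus
        for i, locus in enumerate(loci)
        if any(alleles[i] == MISSING for _, alleles in rows)
    ]


# The example input file from above, held in memory for the demonstration.
example = """,SNP1,SNP2,SNP3,SNP4,SNP5
MOM_1,1,1,1,-9,0
MOM_2,1,0,1,1,0
OFF_1,1,0,0,0,0
OFF_2,1,0,1,0,-9
"""
loci, rows = read_genotypes(io.StringIO(example))
print(loci_with_missing_data(loci, rows))  # ['SNP4', 'SNP5']
```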
Five test data sets are provided in the repository in the TEST-DATA-SETS directory to further illustrate the necessary input file format. The input file can be easily generated by modifying a Structure file, a common file format generated by bioinformatics software. The input file should contain data for only the mother and one offspring. If numerous offspring are to be considered, the file preparation script PG_file_prep.sh should be executed to generate individual files, and the wrapper script PG_wrapper.sh should be used to iterate through the files.
If you have a large genetic parentage data set containing many parents and offspring, you can use the file preparation script (PG_file_prep.sh) to generate the requisite input files for ParthenoGenius. To do so, you will need two files named ID_pairs.txt and genotypes.csv.
The file ID_pairs.txt should contain the headers OFFSPRING_ID and MOM_ID, followed by the IDs of the respective offspring and mothers. Data should be tab-delimited. Examples of these files are in the TUTORIAL directory. For example:
OFFSPRING_ID MOM_ID
NVP-003 NVP-004
NVP-006 NVP-004
NVP-009 NVP-004
The genotypes.csv file should be formatted similarly to the Structure-like input file above, but may contain many parents, offspring, and unrelated individuals. Again, if you have a population-wide Structure file, you can easily modify it to generate the input file for this script. For example:
,SNP_1,SNP_2,SNP_3,SNP_4,SNP_5
NVP-003,1,1,0,0,1
NVP-003,1,1,1,1,1
NVP-004,1,1,0,0,0
NVP-004,1,1,1,1,0
NVP-006,0,1,1,1,1
NVP-006,1,1,1,0,1
NVP-009,0,1,-9,0,0
NVP-009,0,0,0,1,0
NVP-010,0,-9,1,1,0
NVP-010,1,-9,1,1,0
Note - IDs in the ID_pairs.txt and genotypes.csv files must be the same. These files must be in the same directory as the PG_file_prep.sh script.
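Before running the file preparation script, it can help to confirm that every ID in ID_pairs.txt also appears in genotypes.csv. The sketch below is one way to do that with the standard library; the function name missing_ids is hypothetical, and the ID NVP-011 is invented purely to show a mismatch.

```python
import csv
import io


def missing_ids(pairs_handle, genotypes_handle):
    """Return sorted IDs named in the pairs file but absent from the genotype file."""
    wanted = set()
    for row in csv.DictReader(pairs_handle, delimiter="\t"):
        wanted.add(row["OFFSPRING_ID"].strip())
        wanted.add(row["MOM_ID"].strip())

    reader = csv.reader(genotypes_handle)
    next(reader)  # skip the locus-name header line
    present = {record[0].strip() for record in reader if record}
    return sorted(wanted - present)


# In-memory stand-ins for ID_pairs.txt and genotypes.csv.
# NVP-011 is a made-up ID used only to trigger a mismatch.
pairs_text = "OFFSPRING_ID\tMOM_ID\nNVP-003\tNVP-004\nNVP-011\tNVP-004\n"
genotypes_text = (
    ",SNP_1,SNP_2\n"
    "NVP-003,1,1\nNVP-003,1,1\n"
    "NVP-004,1,1\nNVP-004,1,1\n"
)
absent = missing_ids(io.StringIO(pairs_text), io.StringIO(genotypes_text))
print(absent)  # ['NVP-011']
```

An empty list means every mother and offspring listed in the pairs file has genotype rows available.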
PG_file_prep.sh can be run from the command line using the following command:
./PG_file_prep.sh
PG_file_prep.sh will generate ParthenoGenius input files for each mother/offspring pair. One can then run all generated input files through ParthenoGenius using PG_wrapper.sh.
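For readers who want to see what the file preparation step amounts to, here is a minimal Python sketch of the same idea: for each mother/offspring pair, copy the header plus the genotype rows of those two individuals into a new file. This is not the PG_file_prep.sh script itself. The function name, the output naming (offspring ID plus .csv), and the row order (mother first, then offspring) are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_pair_files(pairs_path, genotypes_path, out_dir):
    """Write one mother/offspring .csv per pair; return the paths written."""
    # Group genotype rows by sample ID (two rows per individual).
    with open(genotypes_path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        by_id = {}
        for record in reader:
            if record:
                by_id.setdefault(record[0].strip(), []).append(record)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(pairs_path, newline="") as fh:
        for pair in csv.DictReader(fh, delimiter="\t"):
            mom = pair["MOM_ID"].strip()
            off = pair["OFFSPRING_ID"].strip()
            target = out_dir / f"{off}.csv"  # assumed naming: offspring ID
            with open(target, "w", newline="") as out:
                writer = csv.writer(out, lineterminator="\n")
                writer.writerow(header)
                # Assumed order: mother rows first, then offspring rows.
                writer.writerows(by_id[mom] + by_id[off])
            written.append(target)
    return written


# Demonstration in a temporary directory, using the first two loci
# of the genotypes.csv example above.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "ID_pairs.txt").write_text("OFFSPRING_ID\tMOM_ID\nNVP-003\tNVP-004\n")
    (tmp / "genotypes.csv").write_text(
        ",SNP_1,SNP_2\n"
        "NVP-003,1,1\nNVP-003,1,1\n"
        "NVP-004,1,1\nNVP-004,1,1\n"
        "NVP-010,0,-9\nNVP-010,1,-9\n"
    )
    written = write_pair_files(tmp / "ID_pairs.txt", tmp / "genotypes.csv", tmp / "out")
    names = [path.name for path in written]
    contents = written[0].read_text()

print(names)
print(contents)
```

An ID listed in the pairs file but missing from the genotype file raises a KeyError here, which is why the IDs in both files must match.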
To do so, modify the wrapper script to reflect the name of the directory containing the input files and, optionally, the user-defined error rate. The only contents of the directory containing the input files should be the input files. Then, execute the wrapper script from the command line as follows:
./PG_wrapper.sh
Output files will be automatically named according to the input file name and can be found in the directory containing the input files.
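The loop performed by a wrapper of this kind can be sketched in Python as follows. The command it builds uses only the arguments documented above (input file, output prefix, optional --error). The function names and the choice of output prefix (the input path without its .csv extension) are assumptions for illustration and may differ from what PG_wrapper.sh actually does.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(infile_dir, error_rate=None, script="ParthenoGenius.py"):
    """Return one ParthenoGenius command (as an argument list) per .csv input file."""
    commands = []
    for infile in sorted(Path(infile_dir).glob("*.csv")):
        prefix = infile.with_suffix("")  # assumed prefix: input path minus ".csv"
        cmd = ["python3", script, str(infile), str(prefix)]
        if error_rate is not None:
            cmd += ["--error", str(error_rate)]
        commands.append(cmd)
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Demonstration: build (but do not run) the commands for two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("NVP-003.csv", "NVP-006.csv"):
        (Path(tmp) / name).touch()
    commands = build_commands(tmp, error_rate=0.01)
    # Strip the temporary directory from each argument for readable output.
    demo = [[Path(part).name for part in cmd] for cmd in commands]

print(demo)
```

Calling run_all(build_commands("./TUTORIAL/INFILES", error_rate=0.01)) from the main ParthenoGenius directory would then execute the runs, provided ParthenoGenius.py is in the working directory.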
The following tutorial demonstrates how to use PG_file_prep.sh and PG_wrapper.sh to prepare input files from a larger .csv file and iteratively run ParthenoGenius.py on all of the generated input files. For the tutorial, we are using the same data as is present in the NEPALESE-VIPER-PARTH-TEST directory. The TUTORIAL directory contains two empty directories (INFILES and OUTFILES), the PG_file_prep.sh script, a larger .csv containing genotypes of multiple offspring and their mothers (genotypes.csv), and a file that specifies mother/offspring pairs (ID_pairs.txt).
First, navigate to the tutorial directory from the ParthenoGenius home directory.
cd TUTORIAL
Next, use less to view the ID_pairs.txt and genotypes.csv files. Notice that the ID names in the ID_pairs.txt file match those in the genotypes.csv file. Now, execute the local copy of PG_file_prep.sh.
./PG_file_prep.sh
View the contents of the directory with ls. You should now see three infiles corresponding to the three mother/offspring pairs in the ID_pairs.txt file. Notice that the file names correspond to the IDs of the offspring in each mother/offspring pair.
Now, move all of the newly generated infiles to the INFILES/ directory.
mv NVP* INFILES/
Then, navigate back to the main ParthenoGenius directory.
cd ../
I have already modified the PG_wrapper.sh script to reflect the location of the infiles and created a copy of the wrapper script specific to this tutorial (PG_wrapper-TUTORIAL.sh). View this version of the wrapper script with less and notice that the infile directory location is ./TUTORIAL/INFILES. I have also modified this wrapper script to include the optional user-defined error rate. Notice that --error 0.01 is specified in the ParthenoGenius command. Now, execute the PG_wrapper-TUTORIAL.sh script.
./PG_wrapper-TUTORIAL.sh
Upon completion of the wrapper script, navigate to the tutorial infile directory.
cd TUTORIAL/INFILES
Use ls to view the contents of the directory. Notice that there are now three sets of outfiles - one set for each mother/offspring pair. You can now move the outfiles to the OUTFILES directory to maintain organization.
mv *part* ../OUTFILES/