A Docker based workflow for performing a Plink/fastStructure analysis from Excel data.

furious-luke/lizards-are-awesome

lizards-are-awesome

A Docker-based workflow for performing a Plink/fastStructure analysis on DArTseq SNP data supplied as an Excel file.

Overview

This software seeks to reduce the manual labour involved in preparing DArTseq SNP data for analysis with Plink and fastStructure. LAA is designed specifically for SNP data sets generated by DArTseq in 1 row format. Input data therefore use the genotype codes provided by DArTseq: "0" = reference allele homozygote, "1" = SNP allele homozygote, "2" = heterozygote, and "-" = double null/null allele homozygote (absence of the fragment with the SNP in the genomic representation). LAA first converts these data into ped and map files for Plink analysis.

Apart from the external packages mentioned above, most of the work is done by a Python script. The primary operations performed by the script are:

  1. Duplicating the input data.
  2. Substituting certain characters in both sets of data, in order to create Plink-compatible characters (e.g. "-" to "0").
  3. Independently indexing both sets of data.
  4. Combining both sets of data.
  5. Sorting on the combined index.
  6. Transposing the combined data.
  7. Outputting to Plink compatible ped and map formats.
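
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the script LAA ships: the function names and the two substitution tables are assumptions (the only mapping stated in this document is "-" to "0"), and the six leading PED columns are filled with placeholder values for unknown parents, sex and phenotype.

```python
# Illustrative sketch only. The allele coding below is an assumption,
# not necessarily the mapping LAA applies.
FIRST_ALLELE = {"0": "1", "1": "2", "2": "1", "-": "0"}
SECOND_ALLELE = {"0": "1", "1": "2", "2": "2", "-": "0"}


def to_ped_rows(populations, individuals, snp_rows):
    """Turn 1 row format genotype codes into PED-style rows."""
    # Steps 1-2: duplicate the data and substitute characters in each copy.
    first = [[FIRST_ALLELE[g] for g in row] for row in snp_rows]
    second = [[SECOND_ALLELE[g] for g in row] for row in snp_rows]
    # Step 3: index each copy by (SNP number, allele slot).
    indexed = [((i, 0), row) for i, row in enumerate(first)]
    indexed += [((i, 1), row) for i, row in enumerate(second)]
    # Steps 4-5: sort the combined data so both alleles of a SNP are adjacent.
    indexed.sort(key=lambda item: item[0])
    # Step 6: transpose so that each individual becomes one row.
    per_individual = list(zip(*(row for _, row in indexed)))
    # Step 7: prefix the six leading PED columns
    # (family, individual, father, mother, sex, phenotype).
    return [
        [pop, ind, "0", "0", "0", "-9", *alleles]
        for pop, ind, alleles in zip(populations, individuals, per_individual)
    ]


def to_map_rows(n_snps):
    """One MAP line per SNP: chromosome, SNP id, genetic distance, position."""
    # The SNP names are hypothetical placeholders.
    return [["0", f"snp{i + 1}", "0", "0"] for i in range(n_snps)]
```

Each returned row can be joined with spaces and written out as one line of the ped or map file.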

Whereas before these steps would have been carried out manually using various software packages, they are now performed automatically.

In addition to the conversion operation, LAA automatically runs Plink on the generated ped and map files, and the resulting bed, bim and fam files are then passed on to and analysed with fastStructure. The user can choose a maximum value of K (the number of populations) to be analysed by fastStructure. Output files include the meanQ values for each individual, giving the mean probability that the individual belongs to each of the populations K1 to Kx.
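
As an illustration of how the meanQ output might be used, the sketch below picks the most probable population for each individual. It assumes a meanQ file holds one whitespace-separated row of K values per individual, and that the rows appear in the same order as the individuals in the input data. The function name is hypothetical and not part of LAA.

```python
def assign_populations(meanq_text, individuals):
    """Map each individual to its most probable population and that value."""
    lines = [line for line in meanq_text.splitlines() if line.strip()]
    assignments = {}
    # Assumption: row order in the meanQ file matches the input order.
    for name, line in zip(individuals, lines):
        proportions = [float(value) for value in line.split()]
        best = max(range(len(proportions)), key=proportions.__getitem__)
        # Populations are numbered from 1, i.e. K1 to Kx.
        assignments[name] = (best + 1, proportions[best])
    return assignments
```

The contents of a meanQ file can be read with `open(path).read()` and passed in together with the ID row of the input sheet.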

Design Decisions

Why Docker?

Plink is written for Linux-based operating systems. On a Linux system, all operations could therefore be performed directly, without the need for any kind of virtualisation layer. However, in order to support researchers using Windows-based operating systems, the decision was made to leverage Docker virtualisation.

Docker provides a light-weight virtualisation layer enabling Linux software to run on Windows with (relative) ease. It also has the added benefit of providing a cloud-based mechanism for disseminating software "images" to users. The advantages of Docker over other systems, like VirtualBox or VMware, are:

  • cloud based distribution of prebuilt images,
  • future releases will allow native Docker containers, and
  • easy to replicate virtual image creation.

Why Python?

Python is a powerful and expressive scripting language. It comes with many diverse packages, and has excellent support from developers (for example, fastStructure is written in Python).

Dependencies

When installing on any platform there are a number of requisite dependencies:

  • Python
  • Docker

If you happen to be installing on Windows, then there are a couple of extra requirements:

  • Visual Studio Python compiler
  • MsysGit

Important

We've found that Docker has issues when running on Windows, resulting in faulty data transformation. While you may be able to install LAA on a Windows system, the accuracy of results is likely to be compromised.

To install on Windows, we recommend using a virtual machine running an Ubuntu installation, e.g. with VMware. All steps detailed below under Installation will have to be performed inside the virtual machine, including installing Docker.

Installation

Begin by installing all of the dependencies for your operating system as listed above.

Once complete, open a system terminal (please see the subsection on terminals below, under Usage).

From an open system terminal, install the LAA Python interface with:

pip install lizards-are-awesome

Next, from a system terminal, download and prepare the LAA Docker image. This image contains Plink, fastStructure, and the conversion scripts, all built into a light-weight Alpine Linux image:

laa init

Usage

Terminals

LAA is currently used directly from your operating system terminal. In Linux-like operating systems (including Mac OS X), use the system terminal emulator. In Windows operating systems, use the Docker Quickstart Terminal.

Input Format

LAA accepts XLSX (Excel) and CSV input. Unfortunately, XLSX is extremely slow to parse using open-source utilities. As such, we recommend converting your Excel data to CSV before use with LAA (simply open the file and save it as CSV using Microsoft Office or an open-source spreadsheet tool, like LibreOffice).

The data sheet should contain only columns with DArTseq SNP data (i.e. 0, 1, 2 and -); all other columns have to be removed. The first row should contain the name of the population each individual belongs to (e.g. species), and the second row should contain the ID of each individual. All following rows contain the SNP data.

A short, fictitious, example:

Pminima Pminima Pminor Pminima Pminor Pminima
lizard1 lizard2 lizard15 lizard39 lizard40 lizard44
0 1 1 2 1 1
0 0 0 1 0 0
1 - 1 0 1 1
0 0 1 0 - 0
2 2 1 1 1 2
2 2 1 2 1 0
1 1 2 1 2 1
1 1 1 2 0 1
0 0 0 0 0 0
- 1 2 1 1 1

And, in CSV format:

Pminima,Pminima,Pminor,Pminima,Pminor,Pminima
lizard1,lizard2,lizard15,lizard39,lizard40,lizard44
0,1,1,2,1,1
0,0,0,1,0,0
1,-,1,0,1,1
0,0,1,0,-,0
2,2,1,1,1,2
2,2,1,2,1,0
1,1,2,1,2,1
1,1,1,2,0,1
0,0,0,0,0,0
-,1,2,1,1,1
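
Before running LAA, it can help to confirm that a CSV file follows the layout described above. The following is a minimal sketch using only the Python standard library. The function name is hypothetical and the check is not part of LAA itself.

```python
import csv
import io

# The only genotype codes the input sheet may contain.
ALLOWED = {"0", "1", "2", "-"}


def check_input(csv_text):
    """Return (populations, individuals, snp_rows) or raise ValueError."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) < 3:
        raise ValueError("need a population row, an ID row and SNP rows")
    populations, individuals, snp_rows = rows[0], rows[1], rows[2:]
    if len(populations) != len(individuals):
        raise ValueError("population and ID rows differ in length")
    # SNP data starts on the third non-empty row of the sheet.
    for number, row in enumerate(snp_rows, start=3):
        if len(row) != len(individuals):
            raise ValueError(f"row {number}: wrong number of columns")
        unexpected = set(row) - ALLOWED
        if unexpected:
            raise ValueError(f"row {number}: unexpected values {sorted(unexpected)}")
    return populations, individuals, snp_rows
```

For a file on disk, the text can be obtained with `open("input.csv", newline="").read()`.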

Location

All LAA commands must be run from the same directory you have your CSV input file in. For the purpose of the examples, let's say we have an input file, input.csv, located at /c/workspace/data:

cd /c/workspace/data

Quick-run

To perform the complete process, including conversion, Plink, fastStructure and analysing for K values, you can just run:

laa all input.csv --maxk=5

where --maxk=5 may be replaced with a suitable value for the maximum K value to use.

This will produce a range of files in the current working directory corresponding to the outputs of the conversion, Plink, and fastStructure.

Conversion

Converting the input data will perform recombination, transposition, output to a PED file, and also generation of a suitable mapping file:

laa convert input.csv output.ped

This will generate two files: output.ped, and output.map. These files are suitable for use with Plink.
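
A quick sanity check on the two files is that every PED line should have six leading columns plus two allele columns per line of the MAP file. The sketch below assumes whitespace-separated files in the standard Plink text layout. The function name is hypothetical and not an LAA command.

```python
def check_ped_map(ped_text, map_text):
    """Verify PED column counts against the MAP file; return the SNP count."""
    n_snps = sum(1 for line in map_text.splitlines() if line.strip())
    # Six leading columns, then two alleles per SNP.
    expected = 6 + 2 * n_snps
    for number, line in enumerate(ped_text.splitlines(), start=1):
        if not line.strip():
            continue
        found = len(line.split())
        if found != expected:
            raise ValueError(
                f"ped line {number}: expected {expected} columns, found {found}"
            )
    return n_snps
```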

Plink

To process the converted input files with Plink, run:

laa plink output.ped

fastStructure

To process the Plink outputs with fastStructure, run:

laa fast output

K Choice

To run fastStructure a number of times, and then choose an appropriate K value, run:

laa choosek output --maxk=5

where --maxk=5 may be replaced with a suitable value for the maximum K value to use.
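
The individual commands above can also be chained from a script when the stages need to be run one at a time. This is a sketch only: the helper names are hypothetical, it uses just the commands shown in this document, and it assumes `laa` is on the PATH. Whether choosek requires a prior fast run is not stated here, so the sketch simply runs both.

```python
import subprocess


def laa_commands(input_csv, prefix="output", maxk=5):
    """Build the step-by-step LAA command lines described in this document."""
    return [
        ["laa", "convert", input_csv, f"{prefix}.ped"],
        ["laa", "plink", f"{prefix}.ped"],
        ["laa", "fast", prefix],
        ["laa", "choosek", prefix, f"--maxk={maxk}"],
    ]


def run_pipeline(input_csv, prefix="output", maxk=5, cwd="."):
    """Run each step in the directory holding the input file."""
    for command in laa_commands(input_csv, prefix, maxk):
        # check=True stops the pipeline at the first failing step.
        subprocess.run(command, cwd=cwd, check=True)
```

For example, `run_pipeline("input.csv", cwd="/c/workspace/data")` would run the four steps on the example file used above.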

Getting Help

Help is always available from the command-line. To get a printout of available commands, run:

laa -h

You may also get help for a specific command with something like:

laa convert -h

where convert may be replaced with the command you want help for.
