Mike Lape edited this page Jun 5, 2019 · 1 revision

smartpca

smartpca runs Principal Components Analysis on input genotype data and outputs principal components (eigenvectors) and eigenvalues. The method assumes that samples are unrelated. (However, a small number of cryptically related individuals is usually not a problem in practice as they will typically be discarded as outliers.)

Input

Five input formats are supported: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED and PACKEDANCESTRYMAP.
See convertf for documentation on using the convertf program to convert between formats.

The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate how parfile works via a toy example (see example.perl in this directory). This example takes input in EIGENSTRAT format. Input in other formats is specified the same way as for the convertf program; see ../CONVERTF/README.

The smartpca program prints various statistics to standard output. To redirect this information to a file, change the above syntax to "../bin/smartpca -p parfile >logfile". For a description of these statistics, see the documentation file smartpca.info in this directory.
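As an illustration, the invocation and redirect described above could be wrapped in a short Python script. This is only a sketch: smartpca_command and run_smartpca are hypothetical helper names, not part of the package, and the binary path is simply the relative path quoted above.

```python
import subprocess


def smartpca_command(parfile, binary="../bin/smartpca"):
    """Build the argument list for: ../bin/smartpca -p parfile"""
    return [binary, "-p", parfile]


def run_smartpca(parfile, logfile, binary="../bin/smartpca"):
    """Run smartpca and send its standard output to logfile.

    Equivalent in spirit to: ../bin/smartpca -p parfile >logfile
    Raises subprocess.CalledProcessError if smartpca exits with an error.
    """
    with open(logfile, "w") as log:
        subprocess.run(smartpca_command(parfile, binary), stdout=log, check=True)
```

Calling run_smartpca("parfile", "logfile") from the directory that contains the parfile would then leave the statistics in logfile.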

Running time

The estimated running time of the smartpca program, where nSNP is the number of SNPs and NSAMPLES is the number of samples, is
2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
2.5e-12 * nSNP * NSAMPLES^2 * (1+m) hours if performing m outlier removal iterations.
Thus, under the default of up to 5 outlier removal iterations, running time is up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
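The estimates above are easy to evaluate by hand, but a small helper can be convenient when budgeting for a particular data set. The function name and argument names below are illustrative; the constant and the (1+m) factor are taken directly from the formulas above.

```python
def estimated_hours(n_snp, n_samples, numoutlieriter=5):
    """Upper estimate of smartpca running time in hours.

    n_snp          -- number of SNPs (nSNP above)
    n_samples      -- number of samples (NSAMPLES above)
    numoutlieriter -- maximum number of outlier removal iterations (m above);
                      use 0 when outlier removal is turned off
    """
    base = 2.5e-12 * n_snp * n_samples ** 2
    return base * (1 + numoutlieriter)


if __name__ == "__main__":
    print(estimated_hours(100_000, 1_000, numoutlieriter=0))
    print(estimated_hours(100_000, 1_000))
```

For example, 100,000 SNPs and 1,000 samples come to roughly 0.25 hours without outlier removal, and at most about 1.5 hours under the default of 5 iterations.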

Note on QTL data

Some users have reported problems running smartpca on QTL data sets stored in a .ped file. The reason is that in .ped files the QTL value is stored in the population field, but our code does not allow more than 100 different population values. A solution is to convert to EIGENSTRAT format using the convertf program (see ../CONVERTF/) and then modify the last column of the individual file to contain the same population name (any string will do) for every row. This should be easy to do using a Perl script (this functionality is not currently implemented in our code).
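The column rewrite described above might look like the following sketch, written here in Python rather than Perl. It assumes the individual file is whitespace-delimited with the population in the last column; the function names and the placeholder population name "POP" are hypothetical.

```python
def unify_population(lines, population="POP"):
    """Return the rows with the last column replaced by one population name."""
    out = []
    for line in lines:
        fields = line.split()
        if not fields:
            out.append("")  # keep blank lines as they are
            continue
        fields[-1] = population  # overwrite the QTL value / population field
        out.append(" ".join(fields))
    return out


def rewrite_indiv_file(src_path, dst_path, population="POP"):
    """Read an individual file and write a copy with a single population name."""
    with open(src_path) as src:
        rows = unify_population(src, population)
    with open(dst_path, "w") as dst:
        dst.write("\n".join(rows) + "\n")
```

Writing to a separate destination file keeps the original QTL values available in the source file.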

Parameter Descriptions

Required Parameters

genotypename: input genotype file (see convertf)
snpname: input snp file (see convertf)
indivname: input indiv file (see convertf)
evecoutname: output file of eigenvectors. See numoutevec parameter below.
evaloutname: output file of all eigenvalues
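As a rough illustration, a parfile containing the five required parameters could be generated as follows. This sketch assumes the parfile consists of one "name: value" entry per line; render_parfile is a hypothetical helper, and all file names other than example.snp are placeholders chosen for the example.

```python
REQUIRED = ("genotypename", "snpname", "indivname", "evecoutname", "evaloutname")


def render_parfile(params):
    """Return parfile text, one 'name: value' line per parameter."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return "".join(f"{name}: {value}\n" for name, value in params.items())


# Placeholder file names; substitute the paths of your own data set.
example = {
    "genotypename": "example.geno",
    "snpname": "example.snp",
    "indivname": "example.ind",
    "evecoutname": "example.evec",
    "evaloutname": "example.eval",
}

if __name__ == "__main__":
    print(render_parfile(example), end="")
```

Optional parameters can be added to the same dictionary; the check only guards against leaving out a required one.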

Optional Parameters

numoutevec: number of eigenvectors to output. Default is 10.
numoutlieriter: maximum number of outlier removal iterations. Default is 5. To turn off outlier removal, set this parameter to 0.
numoutlierevec: number of principal components along which to remove outliers during each outlier removal iteration. Default is 10.
outliersigmathresh: number of standard deviations which an individual must exceed, along one of the top (numoutlierevec) principal components, in order for that individual to be removed as an outlier. Default is 6.0.
outlieroutname: output logfile of outlier individuals removed. If not specified, smartpca will print this information to stdout, which is the default.
usenorm: Whether to normalize each SNP by a quantity related to allele freq. Default is YES. (When analyzing microsatellite data, should be set to NO. See Patterson et al. 2006.)
altnormstyle: Affects very subtle details in normalization formula. Default is YES (normalization formulas of Patterson et al. 2006). To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
missingmode: If set to YES, then instead of doing PCA on # reference alleles, do PCA on whether each data point is missing or nonmissing. Default is NO. May be useful for detecting informative missingness (Clayton et al. 2005).
nsnpldregress: If set to a positive integer, then LD correction is turned on, and input to PCA will be the residual of a regression involving that many previous SNPs, according to physical location. See Patterson et al. 2006. Default is 0 (no LD correction). If desiring LD correction, we recommend 2.
maxdistldregress: If doing LD correction, this is the maximum genetic distance (in Morgans) for previous SNPs used in LD correction. Default is no maximum.
poplistname: If wishing to infer eigenvectors using only individuals from a subset of populations, and then project individuals from all populations onto those eigenvectors, this input file contains a list of population names, one population name per line, which will be used to infer eigenvectors. It is assumed that the population of each individual is specified in the indiv file. Default is to use individuals from all populations.
phylipoutname: output file containing an fst matrix which can be used as input to programs in the PHYLIP package, such as the "fitch" program for constructing phylogenetic trees.
noxdata: if set to YES, all SNPs on X chr are excluded from the data set. The smartpca default for this parameter is YES, since different variances for males vs. females on X chr may confound PCA analysis.
nomalexhet: if set to YES, any het genotypes on X chr for males are changed to missing data. The smartpca default for this parameter is YES.
badsnpname: specifies a list of SNPs which should be excluded from the data set. Same format as example.snp. Cannot be used if input is in PACKEDPED or PACKEDANCESTRYMAP format.
popsizelimit: If set to a positive integer, the result is that a randomly selected subset of size popsizelimit individuals from each population will be included in the analysis. It is assumed that the population of each individual is specified in the indiv file. Default is to use all individuals in the analysis.
snpweightoutname: output file containing SNP weightings of each principal component. Note that this output file does not contain entries for monomorphic SNPs from the input .snp file.
chrom: Only use SNPs on this chromosome.
lopos: Only use SNPs with physical position >= this value.
hipos: Only use SNPs with physical position <= this value.
blgsize: Size (in Morgans) of blocks used in FST stderr jackknife computation. The default value for this parameter is 0.05.
qtmode: If set to YES, assume that there is a single population and that the population field contains information on real-valued phenotypes. The default value for this parameter is NO.
fstonly: If set to YES, then skip PCA and just calculate FST values. The default value for this parameter is NO.
killr2: If set to YES, then eliminate SNPs in LD with nearby SNPs. The default value for this parameter is NO.
r2thresh: If killr2 is set to YES, then pairs of SNPs that have r-squared greater than this value will have one member removed. The default value for this parameter is -1.0.
r2genlim: If killr2 is set to YES, then pairs of SNPs will only be considered for elimination based on r-squared if within a genetic distance equal to this number of Morgans. The default value for this parameter is 0.01.
r2physlim: If killr2 is set to YES, then pairs of SNPs will only be considered for elimination based on r-squared if within a physical distance equal to this number of bases. The default value for this parameter is 5000000.
hashcheck: If set to YES and the input genotype file is in PACKEDANCESTRYMAP format, check the hash stored inside the file to make sure that individual and SNP files have not changed since the file was made. If they have, then exit in error. The default value for this parameter is YES. (Use with smartpca is deprecated. Much better to run convertf and create clean files with correct hash.)
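To make the interaction of numoutlierevec and outliersigmathresh more concrete, here is a minimal sketch of a single outlier-flagging pass. It is not smartpca's actual code: it assumes that "standard deviations" means deviation from the mean of each principal component measured in population standard deviations, and it takes the principal component coordinates as already computed. The function name find_outliers and the scores layout are hypothetical. In smartpca the PCA would be recomputed after each removal, up to numoutlieriter times, which this sketch does not attempt.

```python
from statistics import mean, pstdev


def find_outliers(scores, numoutlierevec=10, outliersigmathresh=6.0):
    """Return the indices of individuals flagged as outliers in one pass.

    scores[i][k] is the coordinate of individual i on principal component k.
    An individual is flagged if it lies more than outliersigmathresh standard
    deviations from the mean along any of the top numoutlierevec components.
    """
    if not scores:
        return set()
    n_pcs = min(numoutlierevec, len(scores[0]))
    outliers = set()
    for k in range(n_pcs):
        column = [row[k] for row in scores]
        mu = mean(column)
        sd = pstdev(column)
        if sd == 0:
            continue  # no spread along this component, nothing to flag
        for i, value in enumerate(column):
            if abs(value - mu) > outliersigmathresh * sd:
                outliers.add(i)
    return outliers
```

Raising outliersigmathresh makes the rule more permissive, and setting numoutlieriter to 0 skips outlier removal entirely, as described above.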

The next 5 optional parameters allow the user to output genotype, snp and indiv files which will be identical to the input files except that:
Any individuals set to Ignore in the input indiv file will be removed from the data set (see ../CONVERTF/README).
Any data excluded or set to missing based on noxdata, nomalexhet and badsnpname parameters (see above) will be removed from the data set.
The user may decide to output these files in any format.

outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
genotypeoutname: output genotype file
snpoutname: output snp file
indivoutname: output indiv file
outputgroup: see documentation in ../CONVERTF/README