ancestry

Brent Pedersen edited this page Jan 5, 2022 · 1 revision

Ancestry Estimate

Note: this feature works but is still experimental. It may change in future versions.

somalier can predict the ancestry of a set of query samples given a set of labelled samples, for example the thousand genomes samples together with their ancestry labels. This would look like:

somalier ancestry --labels ancestry-labels-1kg.tsv 1kg-somalier/*.somalier ++ query-samples-somalier/*.somalier

Here the ++ separates the labelled samples from the query samples. This command creates an HTML output along with a text file of the predictions.

ancestry-labels-1kg.tsv is here

The somalier files for thousand genomes can be downloaded from here. These were created from the thousand genomes high coverage data from here.

Note that these will work for either GRCh37 or hg38 as long as you use the most recent sites files distributed with somalier.

Example output is here

Usage

Usage:
  somalier pca [options] [extracted ...]

Arguments:
  [extracted ...]  $sample.somalier files for each sample. place labelled samples first followed by '++' then *.somalier for query samples

Options:
  --labels=LABELS            file with ancestry labels
  -o, --output-prefix=OUTPUT_PREFIX
                             prefix for output files (default: somalier-ancestry)
  --n-pcs=N_PCS              number of principal components to use in the reduced dataset (default: 5)
  --nn-hidden-size=NN_HIDDEN_SIZE
                             shape of hidden layer in neural network (default: 16)
  --nn-batch-size=NN_BATCH_SIZE
                             batch size for training neural network (default: 32)
  --nn-test-samples=NN_TEST_SAMPLES
                             number of labeled samples to test for NN convergence (default: 101)
  -h, --help                 Show this help
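
As a rough illustration of how the options above combine, the sketch below assembles an invocation with a custom output prefix and a non-default number of principal components. The helper name ancestry_cmd, the prefix my-cohort, and the value 8 are made up for this example; only the flags themselves come from the usage text, and the space-separated form (--n-pcs 8) is assumed to be accepted in the same way as --labels in the command shown earlier. The function prints the command rather than running it, so the expanded file lists can be checked first.

```shell
# Hypothetical helper (not part of somalier): print an ancestry command line.
# Arguments: output prefix, number of PCs, labels file, labelled dir, query dir.
ancestry_cmd() {
  prefix=$1; n_pcs=$2; labels=$3; labelled_dir=$4; query_dir=$5
  # Labelled samples go first, then a literal '++', then the query samples.
  echo somalier ancestry \
    --labels "$labels" \
    -o "$prefix" \
    --n-pcs "$n_pcs" \
    "$labelled_dir"/*.somalier ++ "$query_dir"/*.somalier
}

# Example values are assumptions; adjust the paths to your own layout.
ancestry_cmd my-cohort 8 ancestry-labels-1kg.tsv 1kg-somalier query-samples-somalier
```

Once the printed command looks right, removing the echo inside the function would run somalier directly. This is written for sh or bash; other shells may treat an unmatched *.somalier pattern differently.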