- Added compatibility with Python 3.8. By Alistair Miles <alimanfoo> (#329).

Important: This release drops support for Python 2.7.

- Removed support for Python 2. By Alistair Miles <alimanfoo> (#251).
- Upgraded pinned dependencies used in the CI and developer environments, and removed Python 3.5 from the CI matrix. By Alistair Miles <alimanfoo> (#270).
- Fixed a bug related to inaccessible sites being incorporated in the numerator of diversity functions in allel.stats.diversity. By Murillo Rodrigues <mufernando>, Peter Ralph <petrelharp> and Alistair Miles <alimanfoo> (#276, #294).
- Minor documentation fixes. By Murillo Rodrigues <mufernando> (#273, #294).
- Improved handling of None when defining accessibility in roh_mhmm and roh_poissonhmm. By Ólavur Mortensen <olavurmortensen> and Nick Harding <hardingnj> (#296).
- Fixed a bug in allel.GenotypeDaskArray.to_allele_counts where the shape of the output array was not being determined correctly. By Nick Harding <hardingnj> (#266).
Important: Use of the allel.stats namespace is deprecated in this release. All functions from the stats modules are available from the root allel namespace; please access them from there.

Important: Python 2.7 has had a stay of execution: this release supports Python 2.7 and 3.5-3.7. However, support for Python 2.7 will definitely be removed in version 1.3.
- Added a new function allel.pbs, which implements the population branching statistic (PBS), a test for selection based on allele frequency differentiation between three populations (#210).
- Added a new function allel.roh_poissonhmm for inferring runs of homozygosity, which uses a Poisson HMM and is orders of magnitude faster than the previously available allel.roh_mhmm multinomial HMM implementation. By Nick Harding <hardingnj> (#188, #187).
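For orientation, the PBS combines the three pairwise Fst values into an estimate of the length of the branch leading to the first population, via the transform T = -log(1 - Fst). A minimal numpy sketch of that arithmetic (an illustration of the statistic, not scikit-allel's implementation):

```python
import numpy as np

def pbs_from_fst(fst_ab, fst_ac, fst_bc):
    # Transform each pairwise Fst into a branch length T = -log(1 - Fst),
    # then estimate the length of the branch leading to population A.
    t_ab = -np.log(1 - fst_ab)
    t_ac = -np.log(1 - fst_ac)
    t_bc = -np.log(1 - fst_bc)
    return (t_ab + t_ac - t_bc) / 2
```

In practice allel.pbs computes the Fst values from allele counts in windows; the sketch shows only the final combination step.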
- Added a workaround to ensure arrays passed into Cython functions are safe to use as memoryviews, which is required to avoid errors when using distributed computing systems like dask.distributed, where data may be moved between compute nodes and passed with a read-only flag set (#208, #206).
- Added support for parsing VCF files where the chromosomes are not in lexically sorted order. Also improved handling of cases where no variants are returned (#221, #167, #213).
- Added a new index class allel.ChromPosIndex for locating data given chromosome and position locations. This behaves similarly to the existing allel.SortedMultiIndex, but the chromosome values do not need to be sorted (#201, #239).
- Added new parameters exclude_fields and rename_fields to VCF parsing functions, to add greater flexibility when selecting fields to extract. Also added several measures to protect against name clashes when converting VCF to Zarr on platforms with a case-insensitive file system (#215, #216).
- Added a convenience function allel.read_vcf_headers to obtain just the header information from a VCF file (#216).
- All functions for computing site frequency spectra now accept an optional argument n for manually specifying the number of chromosomes sampled from the population (#174, #240).
- Added start, stop and step options to allel.equally_accessible_windows (#234, #166).
- Fixed a broken implementation of allel.AlleleCountsArray.map_alleles (#241, #200).
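The n argument to the site frequency spectrum functions above fixes the length of the returned spectrum even when some allele counts are absent from the data. A plain-numpy sketch of an unfolded spectrum over n sampled chromosomes (illustrative only, not the library's implementation):

```python
import numpy as np

def sfs(derived_ac, n):
    # derived_ac: number of derived alleles observed at each variant.
    # n: number of chromosomes sampled, which fixes the spectrum length
    # at n + 1 bins (counts 0 through n) regardless of the data.
    return np.bincount(np.asarray(derived_ac), minlength=n + 1)
```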
- Fixed the functions calculating Tajima's D so that a value of np.nan is returned if there are fewer than 3 segregating sites. By Andrew Kern <andrewkern> and Alistair Miles <alimanfoo> (#175, #237).
- Fixed an incorrect fill value in GFF parsing functions (#165, #223).
- Fixed a problem in count_alleles() methods where a subpop argument was provided as a numpy array (#235, #171).
- Removed the fill option from the LD functions allel.rogers_huff_r and allel.rogers_huff_r_between; NaN is now always used where a value cannot be calculated. Also added additional tests and a check for the case where variants have no data (#197, #243).
- Allowed multiallelic variants in allel.ehh_decay (#209, #244).
- Added checks to raise appropriate errors if the user tries to rename two fields to the same name when reading VCF (#245, #220).
- Fixed setup.py so that installation of numpy prior to installation of scikit-allel is no longer required; numpy will be installed automatically as a dependency if not already installed. By haseley (#212, #211).
- Migrated to pytest instead of nose for testing (#236, #184).
- Small optimisation for writing zarr attributes (#225, #238).
- Fixed a pandas deprecation warning. By Summer Rae <summerela> (#228).
- Fixed a problem where some packages were getting clobbered by imports of other packages (#163, #232).
- Added support for Python 3.7 and compatibility with numpy 1.15 (#217, #214).
- Various documentation improvements. By Peter Ralph <petrelharp> and CJ Battey <cjbattey> (#229).
- Various VCF parsing improvements and bug fixes (#183, #189).

- Added support for Type=Character in VCF files. By Kunal Bhutani <kunalbhutani> (#159).
- Fixed the type of indexing variables in VCF reading functions to handle larger datasets (#160).
- Added an option to specify the string codec in allel.vcf_to_zarr (#156).
- Fixed a bug in the LD plotting function (#161).

- Changed the semantics of the is_snp computed field when extracting data from VCF to exclude variants where one of the alternate alleles is a spanning deletion ('*') (#155).
- Resolved a minor logging bug (#152).

- Added an option to allel.vcf_to_hdf5 to disable the use of variable-length strings, because they can cause large HDF5 file sizes (#153).
- Include fixture data in release to aid testing and binary builds.
This release includes new functions for extracting data from VCF files and loading into NumPy arrays, HDF5 files and other storage containers. These functions are backed by VCF parsing code implemented in Cython, so they should be reasonably fast. This is new code, so there may be bugs; please report any issues via GitHub.
For a tutorial and worked examples, see the following article: Extracting data from VCF.
For API documentation, see the following functions: allel.read_vcf, allel.vcf_to_npz, allel.vcf_to_hdf5, allel.vcf_to_zarr, allel.vcf_to_dataframe, allel.vcf_to_csv, allel.vcf_to_recarray, allel.iter_vcf_chunks.
Added convenience functions allel.gff3_to_dataframe and allel.gff3_to_recarray.
- scikit-allel is now compatible with Dask versions 0.12 and later (#148).
- Fixed an issue in the functions allel.joint_sfs and allel.joint_sfs_folded relating to data types (#144).
- Fixed a regression in the functions allel.ehh_decay and allel.voight_painting following refactoring of array data structures in version 1.0.0 (#142).
- HTML representations of arrays have been tweaked to look better in Jupyter notebooks (#141).
Important: This is the last version of scikit-allel that will support Python 2. The next version of scikit-allel will support Python versions 3.5 and later only.
Fix test compatibility with numpy 1.10.
Move Cython function imports outside of functions to work around a bug found when using scikit-allel with dask.
Add missing test packages so full test suite can be run to verify install.
This release includes some subtle but important changes to the architecture of the data structures modules (allel.model.ndarray, allel.model.chunked, allel.model.dask). These changes are mostly backwards-compatible, but in some cases could break existing code, hence the major version number has been incremented. Also included in this release are some new functions related to Mendelian inheritance and calling runs of homozygosity; further details below.
This release includes a new allel.stats.mendel module with functions to help with analysis of related individuals. The function allel.mendel_errors locates genotype calls within a trio or cross that are not consistent with Mendelian segregation of alleles. The function allel.phase_by_transmission will resolve unphased diploid genotypes into phased haplotypes for a trio or cross using Mendelian transmission rules. The function allel.paint_transmission can help with evaluating and visualizing the results of phasing a trio or cross.
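As a sketch of the consistency rule that allel.mendel_errors checks per call: for a diploid trio, a child genotype is Mendelian-consistent if its two alleles can be assigned one to each parent. A simplified illustration, not the library's implementation (which also handles missing calls and multi-parent crosses):

```python
def has_mendel_error(father, mother, child):
    # father, mother, child: diploid genotype calls as pairs of allele
    # indices, e.g. (0, 1). A call is consistent if one child allele can
    # have come from each parent, in either assignment.
    a, b = child
    consistent = (a in father and b in mother) or (b in father and a in mother)
    return not consistent
```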
A new allel.roh_mhmm function provides support for locating long runs of homozygosity within a single sample. The function uses a multinomial hidden Markov model to predict runs of homozygosity based on the rate of heterozygosity over the genome. The function can also incorporate information about which positions in the genome are not accessible to variant calling, and hence where there is no information about heterozygosity, to reduce false calling of ROH in regions where there is patchy data. We've run this on data from the Ag1000G project but have not performed a comprehensive evaluation with other species; feedback is very welcome.
The allel.model.ndarray module includes a new allel.model.ndarray.GenotypeVector class. This class represents an array of genotype calls for a single variant in multiple samples, or for a single sample at multiple variants. This class makes it easier, for example, to locate all variants which are heterozygous in a single sample.
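The heterozygosity test mentioned above can be sketched in plain numpy, following scikit-allel's convention of -1 for a missing allele call (an illustration of the semantics, not the library's implementation):

```python
import numpy as np

def is_het(genotype_vector):
    # genotype_vector: shape (n_variants, ploidy) for a single sample.
    gv = np.asarray(genotype_vector)
    # a call counts only if all alleles are present (>= 0)
    called = np.all(gv >= 0, axis=-1)
    # heterozygous: at least one allele differs from the first allele
    het = np.any(gv != gv[..., :1], axis=-1)
    return called & het
```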
Also in the same module are two new classes, allel.model.ndarray.GenotypeAlleleCountsArray and allel.model.ndarray.GenotypeAlleleCountsVector. These classes provide support for an alternative encoding of genotype calls, where each call is stored as the counts of each allele observed. This allows encoding of genotype calls where samples may have different ploidy for a given chromosome (e.g., Leishmania) and/or where samples carry structural variation within some genome regions, altering copy number (and hence effective ploidy) with respect to the reference sequence.
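The allele-counts encoding can be sketched as a transformation from the usual calls layout; because each row simply sums to the ploidy of that call, mixed-ploidy data fits naturally. A hypothetical helper for illustration, not the library's implementation:

```python
import numpy as np

def to_allele_counts(g, max_allele):
    # g: genotype calls, shape (variants, samples, ploidy), -1 = missing.
    # Returns per-call allele counts, shape (variants, samples, max_allele + 1).
    g = np.asarray(g)
    out = np.zeros(g.shape[:2] + (max_allele + 1,), dtype=np.int32)
    for allele in range(max_allele + 1):
        out[..., allele] = np.sum(g == allele, axis=-1)
    return out
```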
There have also been architectural changes to all data structures modules. The most important change is that all classes in the allel.model.ndarray module now wrap numpy arrays and are no longer direct sub-classes of the numpy.ndarray class. These classes still behave like numpy arrays in most respects, and so in most cases this change should not impact existing code. If you need a plain numpy array for any reason you can always use numpy.asarray or access the .values property, e.g.:
>>> import allel
>>> import numpy as np
>>> g = allel.GenotypeArray([[[0, 1], [0, 0]], [[0, 2], [1, 1]]])
>>> isinstance(g, np.ndarray)
False
>>> a = np.asarray(g)
>>> isinstance(a, np.ndarray)
True
>>> isinstance(g.values, np.ndarray)
True
This change was made because there are a number of complexities that arise when sub-classing numpy.ndarray, and these were proving tricky to manage and maintain.
The allel.model.chunked and allel.model.dask modules also follow the same wrapper pattern. For the allel.model.dask module this means a change in the way that classes are instantiated. For example, to create an allel.model.dask.GenotypeDaskArray, pass the underlying data directly into the class constructor, e.g.:
>>> import allel
>>> import h5py
>>> h5f = h5py.File('callset.h5', mode='r')
>>> h5d = h5f['3R/calldata/genotype']
>>> genotypes = allel.GenotypeDaskArray(h5d)
If the underlying data is chunked then there is no need to specify the chunks manually when instantiating a dask array; the native chunk shape will be used.
Finally, the allel.model.bcolz module has been removed; use either the allel.model.chunked or allel.model.dask module instead.
This release resolves compatibility issues with Zarr version 2.1.
- Added a parameter min_maf to allel.ihs to skip IHS calculation for variants below a given minor allele frequency.
- Minor change to the calculation of integrated haplotype homozygosity to enable values to be reported for the first and last variants if include_edges is True.
- Minor change to allel.standardize_by_allele_count to better handle missing values.
In this release the implementations of the allel.ihs and allel.xpehh selection statistics have been reworked to address a number of issues:
- Both functions can now integrate over either a genetic map (via the map_pos parameter) or a physical map.
- Both functions now accept max_gap and gap_scale parameters to perform adjustments to integrated haplotype homozygosity where there are large gaps between variants, following the standard approach. Alternatively, if a map of genome accessibility is available, it may be provided via the is_accessible parameter, in which case the distance between variants will be scaled by the fraction of accessible bases between them.
- Both functions are now faster and can make use of multiple threads to further accelerate computation.
- Several bugs in the previous implementations of these functions have been fixed (#91).
- New utility functions are provided for standardising selection scores, see allel.standardize_by_allele_count (for use with IHS and nSl) and allel.standardize (for use with XPEHH).
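The is_accessible behaviour described above can be sketched as follows: the distance between consecutive variants is multiplied by the fraction of accessible bases between them. An illustration of the idea (assuming 0-based positions for simplicity), not the library's exact implementation:

```python
import numpy as np

def accessibility_scaled_gaps(pos, is_accessible):
    # pos: sorted variant positions (0-based here for simplicity).
    # is_accessible: Boolean array over all bases in the region.
    pos = np.asarray(pos)
    acc = np.asarray(is_accessible)
    gaps = np.diff(pos).astype(float)
    scaled = np.empty_like(gaps)
    for i in range(gaps.size):
        # fraction of accessible bases between the two variants
        frac = acc[pos[i]:pos[i + 1]].mean()
        scaled[i] = gaps[i] * frac
    return scaled
```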
Other changes:
- Added functions allel.moving_tajima_d and allel.moving_delta_tajima_d (#81, #70).
- Added functions allel.moving_weir_cockerham_fst, allel.moving_hudson_fst and allel.moving_patterson_fst.
- Added functions allel.moving_patterson_f3 and allel.moving_patterson_d.
- Renamed "blockwise" to "average" in function names in allel.stats.fst and allel.stats.admixture for clarity.
- Added convenience methods allel.AlleleCountsArray.is_biallelic and allel.AlleleCountsArray.is_biallelic_01 for locating biallelic variants.
- Added support for zarr in the allel.chunked module (#101).
- Changed HDF5 default chunked storage to use gzip level 1 compression instead of no compression (#100).
- Fixed a bug in allel.sequence_divergence (#75).
- Added a workaround for chunked arrays if passed as arguments into numpy aggregation functions (#66).
- Protect against invalid coordinates when mapping from square to condensed coords (#83).
- Fixed a bug in allel.plot_sfs_folded and added docstrings for all plotting functions in allel.stats.sf (#80).
- Fixed a bug related to taking views of genotype and haplotype arrays (#77).
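The moving_* functions listed above share a common windowing pattern: slide a fixed-size window over per-variant values and apply a statistic to each window. A minimal sketch of that pattern (hypothetical helper, not the library's implementation):

```python
import numpy as np

def moving_statistic(values, statistic, size, step=None):
    # Slide a window of `size` items along `values`, advancing by `step`
    # (defaults to non-overlapping windows), applying `statistic` to each.
    if step is None:
        step = size
    values = np.asarray(values)
    starts = range(0, len(values) - size + 1, step)
    return np.array([statistic(values[i:i + size]) for i in starts])
```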
- Fixed a bug in the count_alleles() methods on genotype and haplotype array classes that manifested if the max_allele argument was provided (#59).
- Fixed a bug in the Jupyter notebook display method for chunked tables (#57).
- Fixed a bug in site frequency spectrum scaling functions (#54).
- Changed the behaviour of the subset method on genotype and haplotype arrays to better infer argument types and handle None argument values (#55).
- Changed the table eval and query methods to make python the default for expression evaluation, because it is more expressive than numexpr (#58).
- Changed allel.util.hdf5_cache to resolve issues with hashing and argument order (#51, #52).
- Changed the functions allel.weir_cockerham_fst and allel.locate_unlinked so that chunked implementations are now used by default, to avoid accidentally and unnecessarily loading very large arrays into memory (#50).
- Added a new allel.model.dask module, providing implementations of the genotype, haplotype and allele counts classes backed by dask.array (#32).
- Released the GIL where possible in Cython-optimised functions (#43).
- Changed functions in allel.stats.selection that accept a min_ehh argument, such that min_ehh = None should now be used to indicate that no minimum EHH threshold should be applied.
The major change in v0.19.0 is the addition of the new allel.model.chunked module, which provides classes for variant call data backed by chunked array storage (#31). This is a generalisation of the previously available allel.model.bcolz module to enable the use of both bcolz and HDF5 (via h5py) as backing storage. The allel.model.bcolz module is now deprecated but will be retained for backwards compatibility until the next major release.
Other changes:
- Added a function for computing the number of segregating sites by length (nSl), a summary statistic comparing haplotype homozygosity between different alleles (similar to IHS), see allel.nsl (#40).
- Added functions for computing haplotype diversity, see allel.haplotype_diversity and allel.moving_haplotype_diversity (#29).
- Added the function allel.plot_moving_haplotype_frequencies for visualising haplotype frequency spectra in moving windows over the genome (#30).
- Added vstack() and hstack() methods to genotype and haplotype arrays to enable combining data from multiple arrays (#21).
- Added the convenience function allel.equally_accessible_windows (#16).
- Added the methods from_hdf5_group() and to_hdf5_group() to allel.model.ndarray.VariantTable (#26).
- Added the allel.util.hdf5_cache utility function.
- Modified functions in the allel.stats.selection module that depend on calculation of integrated haplotype homozygosity to return NaN when haplotypes do not decay below a specified threshold (#39).
- Fixed a missing return value in allel.plot_voight_painting (#23).
- Fixed the return type from array reshape() (#34).
Contributors: alimanfoo <alimanfoo>, hardingnj <hardingnj>
- Minor change to the Garud H statistics to avoid raising an exception when the number of distinct haplotypes is very low (#20).
- Added functions for computing H statistics for detecting signatures of soft sweeps, see allel.garud_h, allel.moving_garud_h and allel.plot_haplotype_frequencies (#19).
- Added the function allel.fig_voight_painting to paint both flanks either side of some variant under selection in a single figure (#17).
- Changed the return values from allel.voight_painting to also return the indices used for sorting haplotypes by prefix (#18).
- Added a new module for computing and plotting site frequency spectra, see allel.stats.sf (#12).
- All plotting functions have been moved into the appropriate stats module that they naturally correspond to. The allel.plot module is deprecated (#13).
- Improved performance of carray and ctable loading from HDF5 with a condition (#11).
- Fixed the behaviour of the take() method on compressed arrays when indices are not in increasing order (#6).
- Minor change to the scaler argument to PCA functions in allel.stats.decomposition to avoid confusion about when to fall back to the default scaler (#7).
- Added a block-wise implementation to allel.locate_unlinked so it can be used with compressed arrays as input.
- Added a new selection module with functions for haplotype-based analyses of recent selection, see allel.stats.selection.
- Improved performance of allel.model.bcolz.carray_block_compress, allel.model.bcolz.ctable_block_compress and allel.model.bcolz.carray_block_subset for very sparse selections.
- Fix bug in IPython HTML table captions.
- Fix bug in addcol() method on bcolz ctable wrappers.
- Fix missing package in setup.py.
- Added functions to estimate Fst with standard error via a block-jackknife: allel.blockwise_weir_cockerham_fst, allel.blockwise_hudson_fst and allel.blockwise_patterson_fst.
- Fixed a serious bug in allel.weir_cockerham_fst related to incorrect estimation of heterozygosity, which manifested if the subpopulations being compared were not a partition of the total population (i.e., there were one or more samples in the genotype array that were not included in the subpopulations to compare).
- Added the method allel.AlleleCountsArray.max_allele to determine the highest allele index for each variant.
- Changed the first return value from the admixture functions allel.blockwise_patterson_f3 and allel.blockwise_patterson_d to return the estimator from the whole dataset.
- Added utility functions to the allel.stats.distance module for transforming coordinates between condensed and uncondensed forms of a distance matrix.
- Classes previously available from the allel.model and allel.bcolz modules are now aliased from the root allel module for convenience. These modules have been reorganised into an allel.model package with sub-modules allel.model.ndarray and allel.model.bcolz.
- All functions in the allel.model.bcolz module use cparams from the input carray as the default for the output carray (convenient if you, e.g., want to use zlib level 1 throughout).
- All classes in the allel.model.ndarray and allel.model.bcolz modules have changed the default value for the copy keyword argument to False. This means that not copying the input data, just wrapping it, is now the default behaviour.
- Fixed a bug in GenotypeArray.to_gt where the maximum allele index is zero.
- Added a new module allel.stats.admixture with statistical tests for admixture between populations, implementing the f2, f3 and D statistics from Patterson (2012). Functions include allel.blockwise_patterson_f3 and allel.blockwise_patterson_d, which compute the f3 and D statistics respectively in blocks of a given number of variants and perform a block-jackknife to estimate the standard error.
- Added functions for principal components analysis of genotype data. Functions in the new module allel.stats.decomposition include allel.pca to perform a PCA via full singular value decomposition, and allel.randomized_pca, which uses an approximate truncated singular value decomposition to speed up computation. In tests with real data the randomized PCA is around 5 times faster and uses half as much memory as the conventional PCA, producing highly similar results.
- Added the function allel.pcoa for principal coordinate analysis (a.k.a. classical multi-dimensional scaling) of a distance matrix.
- Added a new utility module allel.stats.preprocessing with classes for scaling genotype data prior to use as input for PCA or PCoA. By default the scaling (i.e., normalization) of Patterson (2006) is used with the principal components analysis functions in the allel.stats.decomposition module. Scaling functions can improve the ability to resolve population structure via PCA or PCoA.
- Added the method allel.GenotypeArray.to_n_ref. Also added a dtype argument to the allel.GenotypeArray.to_n_ref() and allel.GenotypeArray.to_n_alt() methods to enable direct output as float arrays, which can be convenient if these arrays are then going to be scaled for use in PCA or PCoA.
- Added the allel.GenotypeArray.mask property, which can be set with a Boolean mask to filter genotype calls from genotype and allele counting operations. A similar property is available on the allel.GenotypeCArray class. Also added the method allel.GenotypeArray.fill_masked, and a similar method on the allel.GenotypeCArray class, to fill masked genotype calls with a value (e.g., -1).
- Added functions for calculating Watterson's theta (proportional to the number of segregating variants): allel.watterson_theta for calculating over a given region, and allel.windowed_watterson_theta for calculating in windows over a chromosome/contig.
- Added functions for calculating Tajima's D statistic (balance between nucleotide diversity and number of segregating sites): allel.tajima_d for calculating over a given region, and allel.windowed_tajima_d for calculating in windows over a chromosome/contig.
- Added allel.windowed_df for calculating the rate of fixed differences between two populations.
- Added the function allel.locate_fixed_differences for locating variants that are fixed for different alleles in two different populations.
- Added the function allel.locate_private_alleles for locating alleles and variants that are private to a single population.
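For reference, Watterson's estimator divides the number of segregating sites S by the harmonic number a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of chromosomes sampled; per-base values then divide by the number of bases considered. A numpy sketch of the formula (illustrative, not the library's implementation, which also handles accessibility masks):

```python
import numpy as np

def watterson_theta(n_segregating, n_chromosomes, n_bases):
    # theta_w = S / a_n, normalised per base, where
    # a_n = 1 + 1/2 + ... + 1/(n-1) for n sampled chromosomes.
    a_n = np.sum(1.0 / np.arange(1, n_chromosomes))
    return n_segregating / a_n / n_bases
```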
- Added functions implementing the Weir and Cockerham (1984) estimators for F-statistics: allel.weir_cockerham_fst and allel.windowed_weir_cockerham_fst.
- Added functions implementing the Hudson (1992) estimator for Fst: allel.hudson_fst and allel.windowed_hudson_fst.
- Added a new module allel.stats.ld with functions for calculating linkage disequilibrium estimators, including allel.rogers_huff_r for pairwise variant LD calculation, allel.windowed_r_squared for windowed LD calculations, and allel.locate_unlinked for locating variants in approximate linkage equilibrium.
- Added the function allel.plot_pairwise_ld for visualising a matrix of linkage disequilibrium values between pairs of variants.
- Added the function allel.create_allele_mapping for creating a mapping of alleles into a different index system, i.e., if you want 0 and 1 to represent something other than REF and ALT, e.g., ancestral and derived. Also added the methods allel.GenotypeArray.map_alleles, allel.HaplotypeArray.map_alleles and allel.AlleleCountsArray.map_alleles, which will perform an allele transformation given an allele mapping.
- Added the function allel.plot_variant_locator, ported across from anhima.
- Refactored the allel.stats module into a package with sub-modules for easier maintenance.
- Added documentation for the functions allel.bcolz.carray_from_hdf5, allel.bcolz.carray_to_hdf5, allel.bcolz.ctable_from_hdf5_group and allel.bcolz.ctable_to_hdf5_group.
- Refactoring of internals within the allel.bcolz module.
- Added a subpop argument to allel.GenotypeArray.count_alleles and allel.HaplotypeArray.count_alleles to enable counting alleles within a sub-population without subsetting the array.
- Added functions allel.GenotypeArray.count_alleles_subpops and allel.HaplotypeArray.count_alleles_subpops to enable counting alleles in multiple sub-populations in a single pass over the array, without sub-setting.
- Added classes allel.model.FeatureTable and allel.bcolz.FeatureCTable for storing and querying data on genomic features (genes, etc.), with functions for parsing from a GFF3 file.
- Added the convenience function allel.pairwise_dxy for computing a distance matrix using Dxy as the metric.
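The semantics of allele counting with a subpop argument can be sketched in plain numpy. This is a functional illustration only; the library's implementation counts within the sub-population without materialising a subset of the array:

```python
import numpy as np

def count_alleles(g, max_allele, subpop=None):
    # g: genotype calls, shape (variants, samples, ploidy), -1 = missing.
    # subpop: optional sequence of sample indices to count within.
    g = np.asarray(g)
    if subpop is not None:
        g = g[:, subpop]
    out = np.zeros((g.shape[0], max_allele + 1), dtype=np.int32)
    for allele in range(max_allele + 1):
        out[:, allele] = np.sum(g == allele, axis=(1, 2))
    return out
```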
- Added the function allel.write_fasta for writing a nucleotide sequence stored as a NumPy array out to a FASTA format file.
- Added the method allel.VariantTable.to_vcf for writing a variant table to a VCF format file.