Long time no updates #5

Merged
merged 26 commits into from
Nov 5, 2023

Conversation

austinTalbot7241993
Owner

No description provided.

akotlar and others added 26 commits September 21, 2023 17:39
* Updates lint.yml to cache venv

Most of our lint.yml build time in CI is spent in dependency installation.
This PR caches the virtual env so that, unless requirements change, the
venv is served from cache rather than being installed from scratch.
Provide utility functions for loading the opensearch config and finding
the bystro project root directory.
…ic manner (#276)

At various points in annotation file processing we rely on tar to
compress or decompress files. In particular, we use GNU tar and use
certain features that are specific to that implementation. On linux, GNU
tar is the default. On MacOSX, however, the default implementation is
bsdtar, which is not fully compatible with GNU tar. If we end up calling
bsdtar when we mean GNU tar, subtle bugs can result.

So, in order to run tar correctly we need to know which OS we're running
it on and specify the correct executable name for GNU tar (or raise an
error if it isn't installed). This commit defines a utils module that
provides a runtime constant GNU_TAR_EXECUTABLE_NAME valid under linux or
MacOSX.
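
A minimal sketch of how such a constant might be resolved, assuming GNU tar is installed as `tar` on linux and as `gtar` on MacOSX (the name Homebrew's gnu-tar package uses). The helper `find_gnu_tar_executable_name` and its parameters are hypothetical and not taken from this commit:

```python
import platform
import shutil
from typing import Callable, Optional


def find_gnu_tar_executable_name(
    system: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the executable name expected to be GNU tar on this OS.

    `system` and `which` are injectable so the logic can be tested
    without depending on the machine running the tests.
    """
    system = platform.system() if system is None else system
    if system == "Linux":
        candidate = "tar"  # GNU tar is the default tar on linux
    elif system == "Darwin":
        candidate = "gtar"  # assumption: GNU tar installed under this name
    else:
        raise OSError(f"Unsupported operating system: {system}")
    if which(candidate) is None:
        raise OSError(f"GNU tar executable '{candidate}' not found on PATH")
    return candidate


# A module could then expose the runtime constant, for example:
# GNU_TAR_EXECUTABLE_NAME = find_gnu_tar_executable_name()
```

Raising at lookup time rather than silently falling back to bsdtar keeps the failure visible before any archive is touched.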

Closes #277

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
* Partially addresses issue #284 by adding API documentation

---------

Co-authored-by: cristinaetrv <24943967+cristinaetrv@users.noreply.github.com>
* Add web_api.md, which links to Postman docs
- change cpanm to cpm for faster building
- minor tweaks to dist and cpanfile
Updates github workflows for Perl portions of the Bystro project.

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
1. Use a common approach to preparing test directory, config file and
data that is placed into a temporary directory to avoid leaving behind
test artifacts.
2. Update Perl package CI to include bystro-vcf binary to ensure all
tests run (since they are skipped if bystro-vcf binary is not present)
Bumps [rustix](https://github.com/bytecodealliance/rustix) from 0.36.9
to 0.36.16.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/6534992521aaef40684b110616da2e3c1b7e6bbf"><code>6534992</code></a>
chore: Release rustix version 0.36.16</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/4928cf7a38eacb7f58a03657cd80882da77bbab2"><code>4928cf7</code></a>
Disable riscv64 testing.</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/8cc159c4c3c9fdcc3bcba9e76a9e015000dc13e6"><code>8cc159c</code></a>
Fix the <code>test_ttyname_ok</code> test when /dev/stdin is
inaccessable. (<a
href="https://redirect.github.com/bytecodealliance/rustix/issues/821">#821</a>)</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/6dc7ba947895254bca5801c71ec00e2a2c9d13d7"><code>6dc7ba9</code></a>
Downgrade dependencies and disable tests to compile under Rust
1.48.</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/ded8986e7efc888f2e185139406eff11b5ecc41c"><code>ded8986</code></a>
Disable MIPS in CI. (<a
href="https://redirect.github.com/bytecodealliance/rustix/issues/793">#793</a>)</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/739f9c3ba01425c14c39cbfb4c61e2642383a408"><code>739f9c3</code></a>
Fixes for <code>Dir</code> on macOS, FreeBSD, and WASI.</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/87481a97f4364d12d5d6f30cdd025a0fc509b8ec"><code>87481a9</code></a>
Merge pull request from GHSA-c827-hfw6-qwvm</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/5b764b597e2bb8776a59292d62e33fab83e288ec"><code>5b764b5</code></a>
chore: Release rustix version 0.36.15</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/c692a58a11cab288c3a14e4c21ab7783df1d891d"><code>c692a58</code></a>
Disable the qemu cache.</li>
<li><a
href="https://github.com/bytecodealliance/rustix/commit/f6afba4017d3e669507396de679e1997d711ed1a"><code>f6afba4</code></a>
Pin Rust nightly to 2023-07-03.</li>
<li>Additional commits viewable in <a
href="https://github.com/bytecodealliance/rustix/compare/v0.36.9...v0.36.16">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
- add automatic running of extra author tests
- tidy code using xt/author tests
- Add cpm installation instructions
#305)

* Simplifies indexing
* Fixes saving parsing regression
* Adds index string -> search _source parsing test

The aim of this PR is to simplify the source structure in indexed
documents, and fix a saving regression. We do this by ensuring that if
there are no overlap delimited values (the innermost dimension) the
innermost value is a scalar.

Briefly, this will ensure that:

het1;het2 will be shown as **[ [het1, het2] ]** and not [ [ [het1], [het2] ] ]
and
het1 as **[ [ het1 ] ]**

which will make parsing more straightforward.
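
A minimal sketch of the nesting rule described above. The function name and the three delimiters (`|` between positions, `;` between values, `/` between overlapping values) are assumptions made for illustration, not taken from this PR:

```python
from typing import List, Union

Leaf = Union[str, List[str]]


def parse_index_string(
    raw: str,
    position_delim: str = "|",  # assumed outermost delimiter
    value_delim: str = ";",     # assumed middle delimiter
    overlap_delim: str = "/",   # assumed innermost delimiter
) -> List[List[Leaf]]:
    """Parse a delimited field into a nested list.

    The innermost level stays a scalar unless the value actually
    contains the overlap delimiter.
    """
    parsed: List[List[Leaf]] = []
    for position in raw.split(position_delim):
        values: List[Leaf] = []
        for value in position.split(value_delim):
            if overlap_delim in value:
                values.append(value.split(overlap_delim))
            else:
                values.append(value)  # scalar: no overlap-delimited values
        parsed.append(values)
    return parsed


print(parse_index_string("het1;het2"))  # [['het1', 'het2']]
print(parse_index_string("het1"))       # [['het1']]
```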

Still a bit more work to do on this one, especially the needed save handler
test demonstrating parsing from a source document

cc @poneill

---------

Co-authored-by: Ubuntu <ubuntu@ip-10-98-135-15.ec2.internal>
* Updates ray to 2.7.1, msgspec to 0.18.4, pyarrow to 13.0.0, Cython to
3.0.4, pandas to 2.1.1, maturin to 1.3.0, opensearch-py to 2.3.2.

Tested end to end on dev server
…riant info (#283)

Given an opensearch query and an index name, run the query against the
index and return a dataframe of matching variants / samples.
* Takes N dbSNP 2 VCF files
(https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/), extracts
every population from the Freq=(.*) field into a separate INFO field,
drops the Freq field, and drops the first allele for each population (which
is the reference). Then it writes the population-specific fields to the info
header, and updates the yml config file to point to the newly formatted
vcf (the original vcf will not be overwritten).

This is necessary because the dbSNP 2 vcf does not make good use of the
VCF spec; the Freq field is the combination of multiple fields, each of
which is an Allelic type, but where the first allele is the reference,
which is not the standard use.
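
A minimal sketch of the INFO transformation described above. It assumes the field looks like `FREQ=PopA:ref,alt1,...|PopB:ref,alt1,...`; the function name and the example population values are illustrative, and header rewriting and config updates are left out:

```python
def split_freq_field(info: str) -> str:
    """Replace a combined FREQ entry with one INFO entry per population.

    For each population the first (reference) allele frequency is
    dropped, leaving one value per alternate allele.
    """
    out = []
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        if key.upper() != "FREQ" or not sep:
            out.append(field)  # unrelated INFO entries pass through
            continue
        for population in value.split("|"):
            name, _, freqs = population.partition(":")
            alt_freqs = freqs.split(",")[1:]  # drop the reference allele
            out.append(f"{name}={','.join(alt_freqs)}")
    return ";".join(out)


print(split_freq_field("RS=123;FREQ=GnomAD:0.9,0.1|TOPMED:0.8,.,0.2;VC=SNV"))
# RS=123;GnomAD=0.1;TOPMED=.,0.2;VC=SNV
```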

This will enable us to reproducibly fetch, transform, and build dbSNP files
from 1 yaml config, once we add 1 more utility, which translates the
RefSeq NC_* chromosome identifiers to chr1-22,X,Y,M.


Will remove [wip] once test added

---------

Co-authored-by: wingolab <thomas.wingo@emory.edu>
* Adds dbSNP chromosome renaming tool, taking us from identifiers like
NC_000001.10 to those like "1"

This script was tested on GCF_000001405.25.gz (hg19) and
GCF_000001405.39.gz (hg38); a future PR will wrap it into a utility that
can be run through bystro-utils.pl, and a test will be added at that time.

With this we have a fully working dbSNP 2 transformation; I have already
tested building dbSNP 2, and it works well.
* Use the 1st column of the assembly report. This 1) fixes the wrong column
that I committed in the final commit of the previous PR (it should have
been the 11th column for UCSC style chromosomes), and 2) avoids the problem
that using that column results in a VCF file with "na" chromosomes, as
explained in the comment in this PR.
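
A minimal sketch of renaming via the assembly report, assuming the report is tab-separated with the sequence name in the 1st column and the RefSeq accession in the 7th column. The function names are hypothetical and the column positions should be checked against the actual report:

```python
from typing import Dict, Iterable


def build_refseq_to_name(report_lines: Iterable[str]) -> Dict[str, str]:
    """Map RefSeq accessions (e.g. NC_000001.11) to 1st-column names."""
    mapping: Dict[str, str] = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed row
        sequence_name, refseq_accn = cols[0], cols[6]  # assumed columns
        if refseq_accn != "na":
            mapping[refseq_accn] = sequence_name
    return mapping


def rename_vcf_line(line: str, mapping: Dict[str, str]) -> str:
    """Rewrite the CHROM column of a VCF record; headers are untouched."""
    if line.startswith("#"):
        return line
    chrom, sep, rest = line.partition("\t")
    return mapping.get(chrom, chrom) + sep + rest


report = [
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t...\n",
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\tNC_000001.11\tPrimary Assembly\t248956422\tchr1\n",
]
mapping = build_refseq_to_name(report)
print(rename_vcf_line("NC_000001.11\t10001\t.\tT\tA\n", mapping))
```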
* Add running bystro-stats on annotation output
* Add tests for search/utils/annotation
* Fix startup.yml and ensure proteomics server is defined in
beanstalkd.yml

Partially addresses #314
Implement the joining of annotation query results to tandem mass tags
datasets on (sample_id, gene_name) pairs.
This PR provides:

- a module for bi-directional lookup of gene name from uniprot id and
vice versa
- the data necessary for that module
- a script for downloading that data
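
A minimal sketch of what a bi-directional lookup could look like, assuming the downloaded data reduces to (gene name, uniprot id) pairs. The class and method names are hypothetical, and the two pairs are only illustrative input:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class GeneUniprotLookup:
    """Look up uniprot ids by gene name and gene names by uniprot id."""

    def __init__(self, pairs: Iterable[Tuple[str, str]]) -> None:
        self._uniprot_by_gene: Dict[str, Set[str]] = defaultdict(set)
        self._gene_by_uniprot: Dict[str, Set[str]] = defaultdict(set)
        for gene_name, uniprot_id in pairs:
            # The mapping is many-to-many, so both sides store sets.
            self._uniprot_by_gene[gene_name].add(uniprot_id)
            self._gene_by_uniprot[uniprot_id].add(gene_name)

    def uniprot_ids(self, gene_name: str) -> List[str]:
        return sorted(self._uniprot_by_gene.get(gene_name, ()))

    def gene_names(self, uniprot_id: str) -> List[str]:
        return sorted(self._gene_by_uniprot.get(uniprot_id, ()))


lookup = GeneUniprotLookup([("TP53", "P04637"), ("BRCA1", "P38398")])
print(lookup.uniprot_ids("TP53"))   # ['P04637']
print(lookup.gene_names("P38398"))  # ['BRCA1']
```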
* If all delimited outputs are duplicates, output only the single unique
value
* Join ref on '', converting A|C|G|T => ACGT in the del case, or A|C =>
AC in the ins case. SNV are unaffected
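
A minimal sketch of the two output rules above, assuming `|` separates the per-position reference bases and `;` separates delimited output values. Both function names and the `;` delimiter are assumptions for illustration:

```python
def join_ref(ref: str, position_delim: str = "|") -> str:
    """Join per-position reference bases, e.g. A|C|G|T -> ACGT."""
    return "".join(ref.split(position_delim))


def collapse_duplicates(value: str, delim: str = ";") -> str:
    """If every delimited value is identical, keep the single unique value."""
    parts = value.split(delim)
    if len(set(parts)) == 1:
        return parts[0]
    return value


print(join_ref("A|C|G|T"))             # ACGT (del case)
print(join_ref("A|C"))                 # AC (ins case)
print(join_ref("A"))                   # A (SNV unaffected)
print(collapse_duplicates("x;x;x"))    # x
print(collapse_duplicates("x;y"))      # x;y
```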


Needs tests, but worth a look anyway. Live on bystro-dev.emory.edu

This is motivated by Pat's feedback and agreed upon with Dave (going to
have him double check the output of a test case just in case)

Attached before & after annotations:

[before_trio_trim_vep_vcf.annotation.tsv.zip](https://github.com/bystrogenomics/bystro/files/13161175/before_trio_trim_vep_vcf.annotation.tsv.zip)

[after_trio_trim_vep_vcf.annotation.tsv.gz](https://github.com/bystrogenomics/bystro/files/13172838/after_trio_trim_vep_vcf.annotation.tsv.gz)
* Add pipeline tagged union type to SaveJobData

This is the first step in supporting filter pipelines, which are
procedures that run in a streaming manner to filter values from the
query we're saving.

No tests written because it is trivial to test (it is live now on
bystro-dev), and we will write tests in the followup PR that actually
implements filters.
Still a WIP: it does not yet include the PIT snapshot, and user_id is missing

Query function will connect to Postgres:
- For experiment name it will return the top_n (default is 10) of the
Sample_ID->Subject_ID mappings
- For a query without an experiment name it will match/replace on the query
string, looking for subject_id and replacing it with sample_id, and
will print out the modified/updated query string

**Example cli (With Experiment Name):**
python cli.py query --query "Dennis/Alex" --experiment_name Experiment_6
--postgres_config ../../../../config/postgres.yml

**Expected output:**
Sample ID => Subject ID mappings for Experiment: Experiment_6
{
    "123": "Dennis",
    "24123": "Alex",
   
**Example cli (Without Experiment Name):**
python cli.py query --query "heterozygotes:(Alex && Cristina)"
--postgres_config ../../../../config/postgres.yml

**Expected output:**
Original Query: heterozygotes:(Alex && Cristina)
Modified Query: heterozygotes:(24123 && 34441)
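
A minimal sketch of the match/replace step only, assuming the subject_id to sample_id mapping has already been fetched from Postgres into a dict. The function name is hypothetical and the mapping values are the ones from the example above:

```python
import re
from typing import Dict


def replace_subject_ids(query: str, subject_to_sample: Dict[str, str]) -> str:
    """Replace whole-word subject ids in a query string with sample ids."""
    if not subject_to_sample:
        return query
    # Longest names first so a shorter id never shadows a longer one.
    names = sorted(subject_to_sample, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: subject_to_sample[m.group(1)], query)


original = "heterozygotes:(Alex && Cristina)"
modified = replace_subject_ids(original, {"Alex": "24123", "Cristina": "34441"})
print("Original Query:", original)
print("Modified Query:", modified)
```

Matching on word boundaries keeps field names such as `heterozygotes` intact when a subject id happens to be a substring of another token.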

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
This pull request adds an option to our PPCA (Probabilistic Principal
Component Analysis) implementation to use Sherman Woodbury
(Sherman-Morrison-Woodbury) matrix inverse methods. These methods can
significantly reduce the computational cost of inverting
high-dimensional matrices when the matrix can be expressed as the sum of a
diagonal matrix and a low-rank matrix.
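
For reference, the identity these methods rest on can be written as follows, assuming a $p \times p$ covariance of the form $D + W W^\top$ with $D$ diagonal and $W$ a $p \times q$ matrix, $q \ll p$. The symbols are illustrative and not taken from this PR's code:

$$
(D + W W^\top)^{-1} = D^{-1} - D^{-1} W \left(I_q + W^\top D^{-1} W\right)^{-1} W^\top D^{-1}
$$

Only the $q \times q$ matrix $I_q + W^\top D^{-1} W$ needs a dense inverse, so the cost is roughly $O(p q^2)$ instead of $O(p^3)$. With the isotropic PPCA noise model $D = \sigma^2 I_p$, this reduces to

$$
(\sigma^2 I_p + W W^\top)^{-1} = \sigma^{-2} I_p - \sigma^{-2} W \left(\sigma^2 I_q + W^\top W\right)^{-1} W^\top
$$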

Key Highlights of this Pull Request:

Covariance Module Expansion: This pull request adds functions to the
covariance module that implement the Sherman Woodbury matrix inverse
methods.

Integration into PPCA Base Class: The PPCA base class now uses the
Sherman Woodbury matrix inverse methods, which improves the performance
and efficiency of the PPCA implementation.

Comprehensive Unit Testing: Unit tests have been included to verify that
the Sherman Woodbury matrix inverse methods function as expected and that
the PPCA framework remains robust.

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
@austinTalbot7241993 austinTalbot7241993 merged commit 5f18a05 into austinTalbot7241993:master Nov 5, 2023
4 checks passed
4 participants