Long time no updates #5

austinTalbot7241993 · 2023-11-05T01:38:40Z

No description provided.

…ic manner (#276)

At various points in annotation file processing we rely on tar to compress or decompress files. In particular, we use GNU tar and rely on certain features that are specific to that implementation. On Linux, GNU tar is the default. On macOS, however, the default implementation is bsdtar, which is not fully compatible with GNU tar. If we end up calling bsdtar when we mean GNU tar, subtle bugs can result. So, to run tar correctly, we need to know which OS we're running on and specify the correct executable name for GNU tar (or raise an error if it isn't installed).

This commit defines a utils module that provides a runtime constant GNU_TAR_EXECUTABLE_NAME valid under Linux or macOS.

Closes #277

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
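The OS check described above can be sketched roughly as follows. This is a minimal illustration, not the module from the commit: the function name `resolve_gnu_tar_executable`, its injectable `which` parameter, and the assumption that GNU tar is installed as `gtar` on macOS (the name Homebrew's gnu-tar package uses) are all hypothetical choices made here.

```python
import platform
import shutil
from typing import Callable, Optional


def resolve_gnu_tar_executable(
    system: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the name of a GNU tar executable for the given (or current) OS.

    `system` and `which` are injectable so the logic can be exercised
    without depending on the machine the code runs on.
    """
    system = system or platform.system()
    if system == "Linux":
        candidate = "tar"  # GNU tar is the default tar on Linux
    elif system == "Darwin":
        candidate = "gtar"  # assumption: GNU tar is installed under this name
    else:
        raise OSError(f"Unsupported operating system: {system}")
    if which(candidate) is None:
        raise OSError(f"GNU tar executable '{candidate}' not found on PATH")
    return candidate
```

A module-level constant could then be derived once at import time, e.g. `GNU_TAR_EXECUTABLE_NAME = resolve_gnu_tar_executable()`, so that callers never invoke plain `tar` by accident.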

1. Use a common approach to preparing the test directory, config file, and data, placing them in a temporary directory to avoid leaving behind test artifacts.
2. Update Perl package CI to include the bystro-vcf binary to ensure all tests run (since they are skipped if the bystro-vcf binary is not present).

Bumps [rustix](https://github.com/bytecodealliance/rustix) from 0.36.9 to 0.36.16.

Commits:

- [`6534992`](https://github.com/bytecodealliance/rustix/commit/6534992521aaef40684b110616da2e3c1b7e6bbf) chore: Release rustix version 0.36.16
- [`4928cf7`](https://github.com/bytecodealliance/rustix/commit/4928cf7a38eacb7f58a03657cd80882da77bbab2) Disable riscv64 testing.
- [`8cc159c`](https://github.com/bytecodealliance/rustix/commit/8cc159c4c3c9fdcc3bcba9e76a9e015000dc13e6) Fix the `test_ttyname_ok` test when /dev/stdin is inaccessable. ([#821](https://redirect.github.com/bytecodealliance/rustix/issues/821))
- [`6dc7ba9`](https://github.com/bytecodealliance/rustix/commit/6dc7ba947895254bca5801c71ec00e2a2c9d13d7) Downgrade dependencies and disable tests to compile under Rust 1.48.
- [`ded8986`](https://github.com/bytecodealliance/rustix/commit/ded8986e7efc888f2e185139406eff11b5ecc41c) Disable MIPS in CI. ([#793](https://redirect.github.com/bytecodealliance/rustix/issues/793))
- [`739f9c3`](https://github.com/bytecodealliance/rustix/commit/739f9c3ba01425c14c39cbfb4c61e2642383a408) Fixes for `Dir` on macOS, FreeBSD, and WASI.
- [`87481a9`](https://github.com/bytecodealliance/rustix/commit/87481a97f4364d12d5d6f30cdd025a0fc509b8ec) Merge pull request from GHSA-c827-hfw6-qwvm
- [`5b764b5`](https://github.com/bytecodealliance/rustix/commit/5b764b597e2bb8776a59292d62e33fab83e288ec) chore: Release rustix version 0.36.15
- [`c692a58`](https://github.com/bytecodealliance/rustix/commit/c692a58a11cab288c3a14e4c21ab7783df1d891d) Disable the qemu cache.
- [`f6afba4`](https://github.com/bytecodealliance/rustix/commit/f6afba4017d3e669507396de679e1997d711ed1a) Pin Rust nightly to 2023-07-03.
- Additional commits viewable in [compare view](https://github.com/bytecodealliance/rustix/compare/v0.36.9...v0.36.16)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>

@poneill

#305)

* Simplifies indexing
* Fixes saving parsing regression
* Adds index string -> search _source parsing test

The aim of this PR is to simplify the source structure in indexed documents and to fix a saving regression. We do this by ensuring that if there are no overlap-delimited values (the innermost dimension), the innermost value is a scalar.

Briefly, this will ensure that het1;het2 will be shown as **[ [het1, het2] ]** and not [ [ [het1], [het2] ] ], and het1 as **[ [ het1 ] ]**, which will make parsing more straightforward.

Still a bit more work to do on this one, especially the need for a save handler test demonstrating parsing from the source document.

cc @poneill

---------

Co-authored-by: Ubuntu <ubuntu@ip-10-98-135-15.ec2.internal>
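The nesting rule above can be illustrated with a small parser. This is only a sketch under stated assumptions: the `;` value delimiter comes from the het1;het2 example, while the position delimiter `|`, the overlap delimiter `/`, and the function name `parse_field` are placeholders chosen here, not necessarily what the indexer uses.

```python
from typing import List, Union

POSITION_DELIMITER = "|"  # assumption: separates the outermost dimension
VALUE_DELIMITER = ";"     # from the het1;het2 example in the PR text
OVERLAP_DELIMITER = "/"   # assumption: placeholder for the innermost delimiter

Inner = Union[str, List[str]]


def parse_field(raw: str) -> List[List[Inner]]:
    """Parse an indexed string into the simplified nested structure.

    The innermost value stays a scalar unless it actually contains
    overlap-delimited values, in which case it becomes a list.
    """
    result: List[List[Inner]] = []
    for position in raw.split(POSITION_DELIMITER):
        values: List[Inner] = []
        for value in position.split(VALUE_DELIMITER):
            overlaps = value.split(OVERLAP_DELIMITER)
            values.append(overlaps[0] if len(overlaps) == 1 else overlaps)
        result.append(values)
    return result
```

With these placeholder delimiters, `parse_field("het1;het2")` gives `[["het1", "het2"]]` and `parse_field("het1")` gives `[["het1"]]`, matching the two cases described in the PR; only an input such as `"a/b;c"` produces a third level of nesting.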

* Takes N DbSNP 2 VCF files (https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/), extracts every population from the Freq=(.*) field in a separate INFO field, drops the Freq field, drops the first allele for each population (which is reference). Then it writes the population-specific fields to the info header, and updates the yml config file to point to the newly formatted vcf (the original vcf will not be overwritten). This is necessary because the dbSNP 2 vcf does not make good use of the VCF spec; the Freq field is the combination of multiple fields, each of which is an Allelic type, but where the first allele is the reference, which is not the standard use. This will enable us to reproducibly fetch, transform, build dbSNP files, from 1 yaml config, once we add 1 more utility, which translates the RefSeq NC_* chromosome identifiers to chr1-22,X,Y,M. Will remove [wip] once test added --------- Co-authored-by: wingolab <thomas.wingo@emory.edu>
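The Freq transformation described above might look roughly like this for a single record. The layout assumed here, `population:ref,alt1,...` entries separated by `|`, and the function names are assumptions made for illustration; the actual script also rewrites the INFO header and the YAML config, which this sketch does not attempt.

```python
from typing import Dict, List


def split_freq_field(freq: str) -> Dict[str, List[str]]:
    """Map each population to its non-reference allele frequencies.

    Assumes entries look like 'Population:ref,alt1,alt2' joined by '|'.
    The first frequency (the reference allele) is dropped.
    """
    per_population: Dict[str, List[str]] = {}
    for entry in freq.split("|"):
        population, _, values = entry.partition(":")
        frequencies = values.split(",")
        per_population[population] = frequencies[1:]  # drop the reference allele
    return per_population


def to_info_fields(freq: str) -> str:
    """Render the per-population frequencies as separate INFO key=value pairs."""
    return ";".join(
        f"{population}={','.join(values)}"
        for population, values in split_freq_field(freq).items()
    )
```

For example, `to_info_fields("1000Genomes:0.9,0.1|TOPMED:0.8,0.15,0.05")` yields `1000Genomes=0.1;TOPMED=0.15,0.05`, so each population field carries one value per alternate allele, which is the conventional per-allele layout in VCF.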

* Adds dbSNP chromosome renaming tool, taking us from identifiers like NC_000001.10 to those like "1"

This script was tested on GCF_000001405.25.gz (hg19) and GCF_000001405.39.gz (hg38). A future PR will wrap it into a utility that can be run through bystro-utils.pl; a test will be added at that time. With this we have a fully working dbSNP 2 transformation; I have already tested building dbSNP 2, and it works well.
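As a rough sketch of the renaming step, assuming the accession-to-name mapping has already been read from the assembly report: the function name `rename_chromosomes` and the choice to leave unmapped identifiers and header lines untouched are assumptions made here, not details taken from the script.

```python
from typing import Dict, Iterable, Iterator


def rename_chromosomes(
    lines: Iterable[str], accession_to_name: Dict[str, str]
) -> Iterator[str]:
    """Rewrite the CHROM column of VCF data lines using the given mapping.

    Header lines (starting with '#') are passed through unchanged, and
    identifiers missing from the mapping are left as they are.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        yield accession_to_name.get(chrom, chrom) + sep + rest
```

Given `{"NC_000001.10": "1"}`, a data line beginning with `NC_000001.10` comes out beginning with `1`. A fuller tool would probably also update any `##contig` header lines, which this sketch skips.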

* Use the 1st column of the assembly report. This addresses two problems: 1) I committed the wrong column in the final commit of the previous PR (it should have been the 11th column for UCSC-style chromosomes), and 2) using that column results in a VCF file with "na" chromosomes, as explained in the comment in this PR.

* If all delimited outputs are duplicates, output only the single unique value
* Join ref on '', converting A|C|G|T => ACGT in the del case, or A|C => AC in the ins case. SNVs are unaffected

Needs tests, but worth a look anyway. Live on bystro-dev.emory.edu

This is motivated by Pat's feedback and agreed upon with Dave (going to have him double check the output of a test case just in case).

Attached before & after annotations:
[before_trio_trim_vep_vcf.annotation.tsv.zip](https://github.com/bystrogenomics/bystro/files/13161175/before_trio_trim_vep_vcf.annotation.tsv.zip)
[after_trio_trim_vep_vcf.annotation.tsv.gz](https://github.com/bystrogenomics/bystro/files/13172838/after_trio_trim_vep_vcf.annotation.tsv.gz)
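The two output rules can be sketched as small string helpers. The function names and the `;` delimiter used for the duplicate check are assumptions for illustration; the `|` delimiter for the reference field is taken from the A|C|G|T example in the PR text.

```python
def collapse_duplicates(value: str, delimiter: str = ";") -> str:
    """If every delimited part is identical, return just that single value."""
    parts = value.split(delimiter)
    return parts[0] if len(set(parts)) == 1 else value


def join_ref(ref: str, delimiter: str = "|") -> str:
    """Join a per-base reference field on the empty string, e.g. A|C|G|T -> ACGT."""
    return "".join(ref.split(delimiter))
```

Here `join_ref("A|C|G|T")` returns `ACGT` and `join_ref("A|C")` returns `AC`, while a single-base SNV reference such as `A` passes through unchanged. `collapse_duplicates("missense;missense")` returns `missense`, and a mixed value is left as it was.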

* Add pipeline tagged union type to SaveJobData This is the first step in supporting filter pipelines, which are procedures that run in a streaming manner to filter values from the query we're saving. No tests written because it is trivial to test (it is live now on bystro-dev), and we will write tests in the followup PR that actually implements filters.

Still a WIP: it does not include the PIT snapshot, and user_id is missing.

The query function will connect to Postgres:

- For a query with an experiment name, it will return the top_n (default is 10) of the Sample_ID->Subject_ID mappings
- For a query without an experiment name, it will match/replace on the query string, looking for subject_id and replacing it with sample_id, and will print out the modified query string

**Example cli (With Experiment Name):**
python cli.py query --query "Dennis/Alex" --experiment_name Experiment_6 --postgres_config ../../../../config/postgres.yml

**Expected output:**
Sample ID => Subject ID mappings for Experiment: Experiment_6 { "123": "Dennis", "24123": "Alex",

**Example cli (Without Experiment Name):**
python cli.py query --query "heterozygotes:(Alex && Cristina)" --postgres_config ../../../../config/postgres.yml

**Expected output:**
Original Query: heterozygotes:(Alex && Cristina)
Modified Query: heterozygotes:(24123 && 34441)

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
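The match/replace step for queries without an experiment name could be sketched like this. In the PR the mapping is looked up in Postgres; here it is passed in as a plain dictionary, and the function name `replace_subject_ids` is hypothetical.

```python
import re
from typing import Mapping


def replace_subject_ids(query: str, subject_to_sample: Mapping[str, str]) -> str:
    """Replace whole-word subject IDs in a query string with sample IDs."""
    if not subject_to_sample:
        return query
    # Try longer names first so one name cannot shadow a longer one it prefixes.
    names = sorted(subject_to_sample, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(name) for name in names) + r")\b")
    return pattern.sub(lambda match: subject_to_sample[match.group(1)], query)
```

Using the values from the example above, `replace_subject_ids("heterozygotes:(Alex && Cristina)", {"Alex": "24123", "Cristina": "34441"})` returns `heterozygotes:(24123 && 34441)`. Matching on word boundaries is a design choice made for this sketch, so that a subject ID embedded in a longer token is not rewritten.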

This pull request adds the option of using Sherman Woodbury matrix inverse methods (the Woodbury matrix identity) in our PPCA (Probabilistic Principal Component Analysis) implementation. These methods significantly reduce the cost of inverting high-dimensional matrices when the matrix can be expressed as the sum of a diagonal matrix and a low-rank matrix.

Key highlights of this pull request:

- Covariance module expansion: adds the functions in the covariance module needed to implement the Sherman Woodbury matrix inverse methods.
- Integration into the PPCA base class: the PPCA base class now uses these methods, improving the performance and efficiency of the PPCA implementation.
- Unit testing: unit tests verify that the Sherman Woodbury matrix inverse methods function as expected and that the PPCA framework remains robust.

---------

Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>
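For reference, the identity behind these methods can be written out as follows. The notation is assumed here rather than taken from the code: $D$ is a $p \times p$ diagonal matrix, $W$ is a $p \times k$ loading matrix with $k \ll p$, and $\sigma^2$ is the isotropic noise variance of PPCA.

```latex
% Woodbury identity for a diagonal-plus-low-rank matrix
(D + W W^{\top})^{-1}
  = D^{-1} - D^{-1} W \left( I_k + W^{\top} D^{-1} W \right)^{-1} W^{\top} D^{-1}

% Special case for the PPCA covariance \Sigma = W W^{\top} + \sigma^2 I_p
\Sigma^{-1}
  = \sigma^{-2} I_p - \sigma^{-2} W \left( \sigma^{2} I_k + W^{\top} W \right)^{-1} W^{\top}
```

Only a $k \times k$ matrix is inverted, so the cost is on the order of $p k^2 + k^3$ rather than the $p^3$ of a direct inverse.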

akotlar and others added 26 commits September 21, 2023 17:39

Create Ticket issue template (#270)

70d4228

Add caching to lint.yml (#275)

6450aef

* Updates lint.yml to cache venv Most of our lint.yml build time in CI is spent in dependency installation. This PR caches the virtual env, so that unless requirements change, the venv will be served from cache rather than installing from scratch

Add project-level config utilities (#278)

f5ef13e

Provide utility functions for loading the opensearch config and finding the bystro project root directory.

Batch API docs (#290)

b41e014

* Partially addresses issue #284 by adding API documentation --------- Co-authored-by: cristinaetrv <24943967+cristinaetrv@users.noreply.github.com>

Add web_api.md, which links to Postman docs (#291)

7cd4d97

* Add web_api.md, which links to Postman docs

Perl tidy (#286)

258d7a5

Update Perl Docker recipe (#292)

e1d1d0f

- change cpanm to cpm for faster building - minor tweaks to dist and cpanfile

Update Perl CI (#293)

201a420

Updates github workflows for Perl portions of the Bystro project. --------- Co-authored-by: Alex V. Kotlar <akotlar@bu.edu>

Tidy Perl code and docs (#308)

0c75097

- add automatic running of extra author tests - tidy code using xt/author tests - Add cpm installation instructions

[python/search/index] wait for search index refresh (#310)

1b7da3b

[chore] Update python dependencies (#309)

2b2e8e6

* Updates ray to 2.7.1, msgspec to 0.18.4, pyarrow to 13.0.0, Cython to 3.0.4, pandas to 2.1.1, maturin to 1.3.0, opensearch-py to 2.3.2. Tested end to end on dev server

[proteomics] Add functionality to query annotation file and return va…

041f051

…riant info (#283) Given an opensearch query and an index name, run the query against the index and return a dataframe of matching variants / samples.

[search/save] Run bystro-stats on saved annotations (#321)

9a7584a

* Add running bystro-stats on annotation output * Add tests for search/utils/annotation * Fix startup.yml and ensure proteomics server is defined in beanstalkd.yml Partially addresses #314

[WIP] join annotation query results to protein abundance matrix (#324)

e24f274

Implement the joining of annotation query results to tandem mass tags datasets on (sample_id, gene_name) pairs.
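A join on (sample_id, gene_name) pairs can be sketched in plain Python as below. This is an illustration only: the PR presumably operates on dataframes, and the function name, the dict-per-row representation, the inner-join semantics, and the assumption that each key appears at most once in the abundance data are all choices made for this sketch.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def join_on_sample_and_gene(
    annotation_rows: Iterable[Row], abundance_rows: Iterable[Row]
) -> List[Row]:
    """Inner-join two row collections on their (sample_id, gene_name) pairs."""
    # Index the abundance rows by key (assumes keys are unique on this side).
    abundance_by_key: Dict[Tuple[Any, Any], Row] = {
        (row["sample_id"], row["gene_name"]): row for row in abundance_rows
    }
    joined: List[Row] = []
    for row in annotation_rows:
        match = abundance_by_key.get((row["sample_id"], row["gene_name"]))
        if match is not None:
            joined.append({**match, **row})  # annotation columns win on name clashes
    return joined
```

Annotation rows with no matching abundance entry are dropped here; a left join that keeps them would be an equally plausible choice.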

[proteomics] Provide Uniprot ID / Hugo symbol crosswalk (#331)

ace2148

This PR provides: - a module for bi-directional lookup of gene name from uniprot id and vice versa - the data necessary for that module - a script for downloading that data
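A bi-directional lookup of this kind can be sketched with two dictionaries built from the same list of pairs. The class and method names are hypothetical, and the choice to return sorted lists (since one gene can map to several UniProt IDs and vice versa) is an assumption of this sketch rather than a description of the module in the PR.

```python
from typing import Dict, Iterable, List, Set, Tuple


class GeneUniprotCrosswalk:
    """Look up gene names from UniProt IDs and UniProt IDs from gene names."""

    def __init__(self, pairs: Iterable[Tuple[str, str]]) -> None:
        self._uniprot_to_gene: Dict[str, Set[str]] = {}
        self._gene_to_uniprot: Dict[str, Set[str]] = {}
        for uniprot_id, gene_name in pairs:
            self._uniprot_to_gene.setdefault(uniprot_id, set()).add(gene_name)
            self._gene_to_uniprot.setdefault(gene_name, set()).add(uniprot_id)

    def genes_for(self, uniprot_id: str) -> List[str]:
        """Return the gene names recorded for a UniProt ID (empty if unknown)."""
        return sorted(self._uniprot_to_gene.get(uniprot_id, ()))

    def uniprot_ids_for(self, gene_name: str) -> List[str]:
        """Return the UniProt IDs recorded for a gene name (empty if unknown)."""
        return sorted(self._gene_to_uniprot.get(gene_name, ()))
```

For instance, after `GeneUniprotCrosswalk([("P04637", "TP53")])`, `genes_for("P04637")` returns `["TP53"]` and `uniprot_ids_for("TP53")` returns `["P04637"]`; unknown keys return an empty list rather than raising.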

austinTalbot7241993 merged commit 5f18a05 into austinTalbot7241993:master Nov 5, 2023
4 checks passed
