Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add summary stats script #501

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Add summary stats script #501

wants to merge 2 commits into from

Conversation

jkgoodrich
Copy link
Contributor

I have not yet run or tested --generate-gene-lof-matrix or --summarize-gene-lof-matrix since it requires a densify

Copy link
Contributor

@matren395 matren395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There have been a few requests for summary stats for genomes - possible in a notebook but could be good to productionize/add an argument for it in this script ?

mane_select_only=True,
index=freq_index,
)
meta_ht = meta.ht()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
meta_ht = meta.ht()
meta_ht = meta(data_type=data_type).ht()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait it doesn't work like this in this branch - did meta change from main?

ht = ht.annotate_globals(num_release_samples=meta_ht.count())
ht.write(
release_summary_stats(
test=test, data_type="exomes", filter_name=filter_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test=test, data_type="exomes", filter_name=filter_name
test=test, data_type=data_type, filter_name=filter_name

],
)
mt.write(
release_lof(test=test, data_type="exomes", mt=True).path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
release_lof(test=test, data_type="exomes", mt=True).path,
release_lof(test=test, data_type=data_type, mt=True).path,

)

if args.summarize_gene_lof_matrix:
mt = release_lof(test=test, data_type="exomes", mt=True).mt()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mt = release_lof(test=test, data_type="exomes", mt=True).mt()
mt = release_lof(test=test, data_type=data_type, mt=True).mt()

)
ht = default_generate_gene_lof_summary(mt)
ht.write(
release_lof(test=test, data_type="exomes").path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
release_lof(test=test, data_type="exomes").path,
release_lof(test=test, data_type=data_type).path,

"--summarize-gene-lof-matrix",
help="Creates gene LoF matrix summary Table.",
action="store_true",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parser.add_argument(
    "--data-type",
    default="exomes",
    choices=["exomes","genomes"],
    help="Data type (exomes or genomes) to produce summary stats for."
)

@KoalaQin
Copy link
Contributor

KoalaQin commented May 1, 2024

Should we get this PR in too? I can help review if needed. Want to wrap up all the stats tickets.

@matren395
Copy link
Contributor

Hmm so this code was instead ran in a notebook and essentially this runs get_summary_stats() to get per-callset stats (not per sample) of just about exactly what we're doing.

Following this goal, I think we should maybe calculate the TOTALS of everything we're calculating per sample as sums instead from the intermediate file ? I can add the code for it - if we want per-callset instead of per-sample - stats

@matren395
Copy link
Contributor

Should we get this PR in too? I can help review if needed. Want to wrap up all the stats tickets.

so I took a run at this in #611 - just getting those stats from the existing aggregated table, while they're already there. LMK your thoughts.

@matren395
Copy link
Contributor

matren395 commented May 2, 2024

After another look, this actually looks like it's doing something a bit more complex and productionized and better versionized than the simple code I put into #611 . Running the existing summary stats method =/= calculating things from the intermediate per-sample table , though we can return 99% of the same results from it. @jkgoodrich is this worth pursuing, or is calculating plenty of per-callset stats via #611 good enough ?

I think this is code you/Julia wrote Oct2023, so it should honestly be very close to being mergeable if we'd want. However, running this AND our per-sample code would involve two (expensive) aggregations over the whole dataset which is.... suboptimal !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants