Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate Calyx-friendly JSON #25

Merged
merged 28 commits into from
May 2, 2023
Merged

Generate Calyx-friendly JSON #25

merged 28 commits into from
May 2, 2023

Conversation

anshumanmohan
Copy link
Contributor

@anshumanmohan anshumanmohan commented Mar 2, 2023

These JSON files are generated from gfas and not odgi graphs, and are done purely in Python, without the use of odgi commands or odgi's Python bindings.

Presently I match the .data files that Susan creates for odgi depth, but I will soon extend this to expose an interface that lets us generate more or less of the gfa. Before going too far, I'm wondering about minimizing these JSON files a little; see #24.

Some next steps:

  • After talking about the issue linked above, extend the testing suite to include all the GFAs we have.
  • I presently create the .out files using a shell script, because the oracle needs to run a different command (exine depth -d filename.og -o $fn.out) than the command we are testing (python3 to_json.py < filename.gfa). I'd like to clean this up after learning more about turnt.
  • I also had to make local copies of the gfa files (turnt ../path_to_gfas/*.gfa did not seem to work) and I'd like to tidy this up.
  • Extend the JSON generator to emit richer versions of the graph, e.g. enough data for crush or flip to operate.

Copy link
Collaborator

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neato!! Very cool that this works.

We briefly discussed Turnt's multiple environments to make differentials testing a little easier.

Also, it would be great to avoid committing most *.gfa files to the repo, since they can get so big. Currently, the only such files we have in the repo are the tiny ones in test/basic/:
https://github.com/cucapra/pollen/tree/main/test/basic

The remaining ones are downloaded from the web:

curl -Lo ./$@ $(GFA_URL)/$*.gfa

Instead of making a new test/depth_json directory for testing this functionality, it may make more sense to just put your new stuff in test/turnt.toml so you can use the files right there.

slow_odgi/to_json.py Outdated Show resolved Hide resolved
Comment on lines 94 to 97
print(json.dumps(graph.headers, indent=4))
print(json.dumps(graph.segments, indent=4, cls=SegmentEncoder))
print(json.dumps(graph.links, indent=4, cls=LinkEncoder))
print(json.dumps(graph.paths, indent=4, cls=PathEncoder))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might have a better time with these JSONEncoder subclasses if they were type-sensitive, so you would only need to use a single encoder class. Something like this:

 class AlignmentEncoder(JSONEncoder):
     def default(self, o):
         if isinstance(o, mygfa.Path):
             items = str(o).split("\t")
             return { ... }
         elif isinstance(o, mygfa.Link):
             ...

@sampsyo sampsyo mentioned this pull request Mar 2, 2023
@anshumanmohan anshumanmohan marked this pull request as draft March 21, 2023 23:57
@sampsyo
Copy link
Collaborator

sampsyo commented Apr 5, 2023

I see this is now a draft—care to comment about whether this is ready for a re-review or if there are other outstanding tasks?

@anshumanmohan anshumanmohan changed the base branch from main to cli April 26, 2023 17:13
@anshumanmohan
Copy link
Contributor Author

anshumanmohan commented Apr 26, 2023

Sorry about the silence! This fell off my radar what with all the other changes elsewhere. In the commits since your last review, I have

  1. Incorporated your comments.
  2. Brought the code up to speed with changes in mygfa, the package-import style, etc.
  3. Dovetailed testing into slow-odgi's testing and the interface into slow-odgi's CLI. These are temporary, and perhaps a mistake. This should stand alone and just borrow mygfa from slow-odgi. Better yet, mygfa should be lifted above these two separate packages. Issue forthcoming.

@anshumanmohan
Copy link
Contributor Author

anshumanmohan commented Apr 26, 2023

make test-mkjson will now run the depth-specific json-generator, with exine depth as its oracle.

I now have the ability to pass in the command-line flags n, e, and p that are used to determine the max nodes, max steps per node, and max paths respectively. There is work to be done, though:

  1. The exine json-generator adjusts the widths of fields as needed; I just stick to 4. Must fix.
  2. Fixing this will allow me to pass in larger parameters for these three flags and therefore compute JSONs for the four larger graphs. At present we don't do anything reasonable with them: exine, our oracle, complains that the parameters need to be bigger, and we ignore this, so the expect files are empty.
  3. Opening up testing to all the graphs may well reveal other issues that have not yet come up with the smaller four graphs.
  4. The final step will be to infer these parameters automatically and tightly, so I won't have to run the smaller graphs with parameters that the bigger graphs need.

@anshumanmohan anshumanmohan mentioned this pull request Apr 26, 2023
3 tasks
@anshumanmohan
Copy link
Contributor Author

anshumanmohan commented Apr 26, 2023

If interested in testing the "simple" JSON dump,

  1. Toggle the commented lines 158/159 of __main__ so that simple_json becomes the target function
  2. Test with slow_odgi mkjson test/k.gfa and the like, not with turnt. There isn't a reasonable oracle for the simple dump.

@anshumanmohan
Copy link
Contributor Author

anshumanmohan commented Apr 26, 2023

Done with the parameter-adjusting stuff! We now mimic exine exactly: if the user provides the -a flag, as I currently do in turnt, all the parameters are inferred automatically and tightly. However, the user is free to also supply some other value(s), and any user-supplied values always take precedence.

For example, here's how you can go hard on the number of paths for no reason.

[envs.mkjson_oracle]
binary = true
command = "exine depth -d {filename} -a {filename} -p 500"
output.json = "-"

[envs.mkjson_test]
binary = true
command = "slow_odgi mkjson {filename} -p 500"
output.json = "-"

Anyway, outlandish examples aside, we now diff out correctly against all the fetch-ed GFA files!

@anshumanmohan anshumanmohan marked this pull request as ready for review April 26, 2023 23:09
Copy link
Collaborator

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is REALLY great—it seems really really valuable that we can start testing proper Pollen hardware without using odgi at all. 🎉 🎉 🎉 I think this sets us up nicely for a future phase of work where we work more on the node-depth Calyx implementation and even generalize it to work on other stuff.

I agree with your notion (recorded in #69) about splitting the data translator into a separate thing from slow_odgi. I know we already agree on this, but just to put it into words, the point is that slow_odgi will be useful for anyone wanting our "PanoBench" code release, whereas the JSON translator is really only relevant to our Calyx stuff. So it's nice to keep them as separate things, both depending on mygfa.

Base automatically changed from cli to main May 2, 2023 16:24
@anshumanmohan anshumanmohan merged commit 9eeb67d into main May 2, 2023
@anshumanmohan anshumanmohan deleted the to-json branch May 2, 2023 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants