-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functionality for MR metadata reading from SAV #313
base: dev
Are you sure you want to change the base?
Add functionality for MR metadata reading from SAV #313
Conversation
See failing builds also
|
@evanmiller I think it's fixed now. |
@slobodan-ilic Thanks for addressing the build issues. However, it looks like CI Fuzzer uncovered a segfault. From a cursory read of the code, it appears that |
019f4a6
to
850f0df
Compare
Hi @evanmiller, thanks for all the input. We've done a couple of iterations and a bunch of testing on real-life survey data. All of the small bugs are taken care of, no more nasty segfaults, etc. Are you available to do one more round of review and provide some guidance? |
2ee8be0
to
7768183
Compare
7768183
to
0a83ade
Compare
Another short update: managed to run fuzzers locally (even though the documentation didn't work in a straightforward path). After managing to produce a crash locally - identified and fixed the bug (which seemed obvious once discovered). Should be in much better shape now. |
Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple). |
c074f51
to
ec778f5
Compare
I've just detected the other fuzz failure as you were writing... Should be fixed now. About the windows tho, are you referring to the errors with Maybe I should try opening the PR against update: Well I just tried with master too (as a separate commit which I later deleted). Was the same error about sav date, all red in the appveyor run, but it said the tests passed. Unknown land to me :) |
430b30f
to
ec778f5
Compare
Found another issue with fuzzer, this time it's an OOM. On it, will ping when done. |
For Windows I am referring to the failed CI
|
1b5fcb7
to
a7b36ea
Compare
815d40d
to
de0551c
Compare
I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it! |
de0551c
to
213a76a
Compare
I was just exploring that failure. Thanks for the message. I just pushed a version that previously had successful cfuzz build. I had a rogue commit somewhere in the middle, and due to the long-lived nature of the PR I had failed to notice it. Anyways, all should be good now, with both the builds and the functionality. I'll need to run a squash, so commits look nicer, and remove obvious commented out code. I can do that today, but the structure shouldn't change. So if you want to go over it and provide further guidance, I'm all ears. P.S. I implemented some tests in the pyreadstat version of this work, but was not able to implement them here. As far as I can see, the elaborate structure of the tests, filling the buffers and preparing parsers, mostly focuses on data (and not metadata which this PR focuses on). If you have a clear path forward for me to implement this, pls let me know which steps I should take. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments!
return metadata->multiple_response_sets_length; | ||
} | ||
|
||
const mr_set_t *readstat_get_mr_sets(readstat_metadata_t *metadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function should be called readstat_get_multiple_response_sets
src/spss/readstat_sav_read.c
Outdated
} | ||
|
||
fprintf(stderr, "\n\n\nDebug: MR string: '%s'\n", mr_string); | ||
char *token = strtok(mr_string, "$\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe strtok
is not thread-safe; I'd prefer a thread-safe implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any particular advice for this? I tried with strtok_r
and strtok_s
combination, but build system is affected... I guess one option would be to do the entire mr_string parser in ragel... But maybe there's a quicker way to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the build error?
} | ||
spss_varinfo_t *info = (spss_varinfo_t *)ck_str_hash_lookup(sv_name_upper, var_dict); | ||
if (info) { | ||
free(mr.subvariables[j]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe readstat_realloc
instead?
1d764ec
to
3a93e6b
Compare
3a93e6b
to
0a11d5c
Compare
8edcecf
to
55778b1
Compare
55778b1
to
07e323f
Compare
I've made the changes with Ragel @evanmiller I'm not sure how the CI is invoked, but sometimes it starts immediately after a commit, sometimes after an hour, and sometimes doesn't start. So getting the builds right is difficult. But I think I got it in the last commit. Had to add some vcproj files etc. I wasn't able to replace Anyways, if you can take a look at my latest changes, and provide feedback (at least if this is moving in the right direction) that would be really awesome. Learning Ragel was fun :) P.S. I'd love to add some tests, so any advice on how to move there would also be gr8. |
Looking much better and more maintainable with the Ragel code! As for tests, we currently use a table-driven suite that roundtrips files (writes then reads and checks for expected error values). This approach would require added a multiple-response write API in addition to the read API. Generally writing is easier than reading, at least for purposes of testing, though often there are snares around getting written files to open properly in SPSS. As for strtok problems I'd need to see the errors. Going full Ragel is one solution, but it's a matter of how you want to spend your time. |
This PR adds functionality for reading multiple response metadata from sav files. It's been tested on a simple file that we use for PoC in Crunch.io. It's a work in progress, I'm available for any updates and changes that need to be done to it.
UPDATE: Running this function with a test file succeeds a small portion of the tires. However it throws segfaults on most tries, I can't really track it down, so any guidance on that more than welcome as well.
Here's the example on how I tried testing it:
And here's the example file:
simple_alltypes.sav.zip
This is the output from when it succeeds:
and this one is when it fails (which happens more often):