Add functionality for MR metadata reading from SAV #313

Open
wants to merge 35 commits into
base: dev

@slobodan-ilic slobodan-ilic commented Apr 24, 2024

This PR adds functionality for reading multiple response (MR) metadata from SAV files. It has been tested on a simple file that we use for a PoC at Crunch.io. It's a work in progress, and I'm available for any updates and changes that need to be made.

UPDATE: Running this function with a test file succeeds on a small portion of the tries. However, it segfaults on most tries. I can't really track it down, so any guidance on that is more than welcome as well.

Here's an example of how I tried testing it:

#include <stdio.h>
#include <stdlib.h>
#include "readstat.h"

typedef struct
{
    const mr_set_t *sets;
    int count;
} mr_sets_context_t;

int handle_metadata(readstat_metadata_t *metadata, void *ctx)
{
    mr_sets_context_t *mr_ctx = (mr_sets_context_t *)ctx; // Recover the caller's context
    mr_ctx->count = readstat_get_multiple_response_sets_length(metadata);
    mr_ctx->sets = readstat_get_mr_sets(metadata);
    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);

    // Processing (calloc so count/sets are zeroed if the handler never runs)
    mr_sets_context_t *mr_ctx = calloc(1, sizeof(mr_sets_context_t));
    if (mr_ctx == NULL)
    {
        readstat_parser_free(parser);
        return 1;
    }
    error = readstat_parse_sav(parser, argv[1], mr_ctx);
    if (error != READSTAT_OK)
    {
        // Do not touch the context if parsing failed
        printf("Error processing %s: %d\n", argv[1], error);
        free(mr_ctx);
        readstat_parser_free(parser);
        return 1;
    }
    printf("Found %d records\n", mr_ctx->count);
    for (int i = 0; i < mr_ctx->count; i++)
    {
        printf("MR set %d name: %s\n", i + 1, mr_ctx->sets[i].name);
        printf("type: %c\n", mr_ctx->sets[i].type);
        printf("is dichotomy: %d\n", mr_ctx->sets[i].is_dichotomy);
    }

    // Cleanup
    free(mr_ctx);
    readstat_parser_free(parser);
    return 0;
}

And here's the example file:
simple_alltypes.sav.zip

This is the output from when it succeeds:

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
Found 2 records
MR set 1 name: categorical_array
type: C
is dichotomy: 0
MR set 2 name: mymrset
type: D
is dichotomy: 1

and this one is when it fails (which happens more often):

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
[1]    86961 segmentation fault  DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav

@evanmiller
Contributor

See failing builds also

src/spss/readstat_sav_read.c: In function ‘parse_mr_line’:
src/spss/readstat_sav_read.c:176:51: error: implicit declaration of function ‘isdigit’ [-Wimplicit-function-declaration]
  176 |             for (int i = 0; i < internal_count && isdigit(*next_part); i++) {
      |                                                   ^~~~~~~
src/spss/readstat_sav_read.c:26:1: note: include ‘<ctype.h>’ or provide a declaration of ‘isdigit’
   25 | #include "readstat_zsav_read.h"
  +++ |+#include <ctype.h>
   26 | #endif
src/spss/readstat_sav_read.c: In function ‘readstat_parse_sav’:
src/spss/readstat_sav_read.c:1882:40: error: implicit declaration of function ‘toupper’ [-Wimplicit-function-declaration]
 1882 |                     sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);
      |                                        ^~~~~~~
src/spss/readstat_sav_read.c:1882:40: note: include ‘<ctype.h>’ or provide a declaration of ‘toupper’
make: *** [Makefile:2419: src/spss/libreadstat_la-readstat_sav_read.lo] Error 1

@slobodan-ilic
Author

@evanmiller I think it's fixed now.

@evanmiller
Contributor

@slobodan-ilic Thanks for addressing the build issues. However, it looks like CI Fuzzer uncovered a segfault. From a cursory read of the code, it appears that strtol performs an unprotected memory read. There is also a Windows build issue that will need to be addressed.
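To illustrate the strtol concern: strtol stops only when it meets a character that cannot be part of a number, so it relies on the input being NUL-terminated. If the record buffer is not terminated, the scan can run past its end. A minimal sketch of a length-bounded alternative follows; `parse_bounded_int` is a hypothetical helper for illustration, not part of ReadStat or of this PR.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper (not ReadStat API): parse a non-negative decimal
 * integer from buf[0..len) without ever reading buf[len] or beyond.
 * Returns the number of bytes consumed, or 0 if there is no digit at
 * the start or the value would overflow int. *out is written only on
 * success. */
static size_t parse_bounded_int(const char *buf, size_t len, int *out) {
    size_t i = 0;
    int value = 0;
    while (i < len && isdigit((unsigned char)buf[i])) {
        int digit = buf[i] - '0';
        if (value > (INT_MAX - digit) / 10) {
            return 0; /* next step would overflow */
        }
        value = value * 10 + digit;
        i++;
    }
    if (i > 0) {
        *out = value;
    }
    return i;
}
```

The caller passes the number of bytes remaining in the record, so the loop condition `i < len` is what protects the read rather than a terminator that may not exist.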

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 019f4a6 to 850f0df Compare May 6, 2024 10:36
@slobodan-ilic
Author

Hi @evanmiller, thanks for all the input. We've done a couple of iterations and a bunch of testing on real-life survey data. All of the small bugs are taken care of, no more nasty segfaults, etc. Are you available to do one more round of review and provide some guidance?

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 7768183 to 0a83ade Compare June 12, 2024 14:47
@slobodan-ilic
Author

Another short update: I managed to run the fuzzers locally (even though following the documentation wasn't straightforward). After reproducing a crash locally, I identified and fixed the bug (which seemed obvious once discovered). It should be in much better shape now.

@evanmiller
Contributor

Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from c074f51 to ec778f5 Compare June 13, 2024 16:49
@slobodan-ilic
Author

slobodan-ilic commented Jun 13, 2024

> Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

I've just detected the other fuzz failure as you were writing... Should be fixed now. About Windows, though: are you referring to the errors with readstat_sav_date, the ones that mostly have VS17 in the paths of the files? If so, I've tried reverting my PR back to the dev branch, but these errors are still present in the CI. I thought I'd avoid trying to get VS up and running, since I'm on a Mac.

Maybe I should try opening the PR against master?

update: Well I just tried with master too (as a separate commit which I later deleted). Was the same error about sav date, all red in the appveyor run, but it said the tests passed. Unknown land to me :)

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 430b30f to ec778f5 Compare June 13, 2024 17:24
@slobodan-ilic
Author

Found another issue with fuzzer, this time it's an OOM. On it, will ping when done.

@evanmiller
Contributor

For Windows I am referring to the failed CI

In file included from src/spss/readstat_sav_read.c:11:
src/spss/readstat_sav_read.c: In function 'parse_mr_counted_value':
src/spss/readstat_sav_read.c:167:55: error: array subscript has type 'char' [-Werror=char-subscripts]
  167 |         for (int i = 0; i < internal_count && isdigit(*(*next_part)); i++) {
      |                                                       ^~~~~~~~~~~~~
src/spss/readstat_sav_read.c: In function 'parse_mr_line':
src/spss/readstat_sav_read.c:210:20: error: array subscript has type 'char' [-Werror=char-subscripts]
  210 |     while (isdigit(*next_part)) {
      |                    ^~~~~~~~~~
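The `char-subscripts` errors above suggest that `isdigit` is a table-indexing macro on that toolchain, while plain `char` may be signed. Passing a negative value other than `EOF` to the `<ctype.h>` functions is undefined behavior, and the usual remedy is to cast the argument to `unsigned char`. A small sketch follows; `count_leading_digits` is a hypothetical helper, not code from the PR.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper for illustration: count the decimal digits at the
 * start of a NUL-terminated string. */
static size_t count_leading_digits(const char *s) {
    size_t n = 0;
    /* The cast keeps the argument in the 0..UCHAR_MAX range that the
     * <ctype.h> functions require, even for bytes >= 0x80. */
    while (isdigit((unsigned char)s[n])) {
        n++;
    }
    return n;
}
```

The same cast applies to the `isdigit(*next_part)` and `isdigit(*(*next_part))` calls flagged in the log.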

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 1b5fcb7 to a7b36ea Compare June 15, 2024 15:54
@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 2 times, most recently from 815d40d to de0551c Compare June 16, 2024 09:17
@evanmiller
Contributor

I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from de0551c to 213a76a Compare June 16, 2024 10:15
@slobodan-ilic
Author

> I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

I was just exploring that failure. Thanks for the message. I just pushed a version that previously had successful cfuzz build. I had a rogue commit somewhere in the middle, and due to the long-lived nature of the PR I had failed to notice it. Anyways, all should be good now, with both the builds and the functionality. I'll need to run a squash, so commits look nicer, and remove obvious commented out code. I can do that today, but the structure shouldn't change. So if you want to go over it and provide further guidance, I'm all ears.

P.S. I implemented some tests in the pyreadstat version of this work, but was not able to implement them here. As far as I can see, the elaborate structure of the tests, filling the buffers and preparing parsers, mostly focuses on data (and not metadata which this PR focuses on). If you have a clear path forward for me to implement this, pls let me know which steps I should take.

Contributor

@evanmiller evanmiller left a comment

Left some comments!

return metadata->multiple_response_sets_length;
}

const mr_set_t *readstat_get_mr_sets(readstat_metadata_t *metadata) {
Contributor

I think this function should be called readstat_get_multiple_response_sets

}

fprintf(stderr, "\n\n\nDebug: MR string: '%s'\n", mr_string);
char *token = strtok(mr_string, "$\n");
Contributor

I believe strtok is not thread-safe; I'd prefer a thread-safe implementation.

Author

Do you have any particular advice for this? I tried with strtok_r and strtok_s combination, but build system is affected... I guess one option would be to do the entire mr_string parser in ragel... But maybe there's a quicker way to do it.

Contributor

What was the build error?
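One portable route that sidesteps both `strtok` and the platform-specific `strtok_r`/`strtok_s` pair is a small tokenizer built on `strspn` and `strcspn` from `<string.h>`, with the position held by the caller. The sketch below assumes a mutable, NUL-terminated input; `next_token` is a hypothetical name and not part of ReadStat.

```c
#include <string.h>

/* Hypothetical reentrant tokenizer (not part of ReadStat). All state
 * lives in *cursor, which the caller owns, so concurrent parsers do not
 * interfere with each other. Like strtok, it writes NUL bytes into the
 * input string. Returns NULL when no tokens remain. */
static char *next_token(char **cursor, const char *delims) {
    char *start;
    char *end;
    if (*cursor == NULL) {
        return NULL;
    }
    start = *cursor + strspn(*cursor, delims); /* skip leading delimiters */
    if (*start == '\0') {
        *cursor = NULL;
        return NULL;
    }
    end = start + strcspn(start, delims);      /* find the end of the token */
    if (*end == '\0') {
        *cursor = NULL;                        /* this was the last token */
    } else {
        *end = '\0';
        *cursor = end + 1;
    }
    return start;
}
```

Usage would mirror the existing call: set `char *cursor = mr_string;` once, then loop on `next_token(&cursor, "$\n")` until it returns NULL.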

}
spss_varinfo_t *info = (spss_varinfo_t *)ck_str_hash_lookup(sv_name_upper, var_dict);
if (info) {
free(mr.subvariables[j]);
Contributor

Maybe readstat_realloc instead?

@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 5 times, most recently from 1d764ec to 3a93e6b Compare June 20, 2024 13:37
@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 3a93e6b to 0a11d5c Compare June 20, 2024 13:42
@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 2 times, most recently from 8edcecf to 55778b1 Compare June 20, 2024 20:24
@slobodan-ilic slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 55778b1 to 07e323f Compare June 21, 2024 07:32
@slobodan-ilic
Author

I've made the changes with Ragel @evanmiller

I'm not sure how the CI is invoked, but sometimes it starts immediately after a commit, sometimes after an hour, and sometimes doesn't start. So getting the builds right is difficult. But I think I got it in the last commit. Had to add some vcproj files etc. I wasn't able to replace strtok though, I only found strtok_r and strtok_s but the build was messed up, and didn't want to get too deep into that before checking up with you.

Anyways, if you can take a look at my latest changes, and provide feedback (at least if this is moving in the right direction) that would be really awesome.

Learning Ragel was fun :)

P.S. I'd love to add some tests, so any advice on how to move there would also be gr8.

@evanmiller
Contributor

Looking much better and more maintainable with the Ragel code!

As for tests, we currently use a table-driven suite that roundtrips files (writes then reads and checks for expected error values). This approach would require adding a multiple-response write API in addition to the read API. Generally writing is easier than reading, at least for purposes of testing, though often there are snares around getting written files to open properly in SPSS.
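For readers unfamiliar with the pattern, here is a rough sketch of a table-driven roundtrip test in miniature. Everything in it is a toy stand-in: the `toy_*` names, the error codes, and the made-up text format are assumptions for illustration and do not reflect ReadStat's actual test suite, its API, or the SAV on-disk layout.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-ins only; none of this is ReadStat code. */
typedef enum { TOY_OK = 0, TOY_ERROR_BAD_TYPE, TOY_ERROR_IO } toy_error_t;

typedef struct {
    char name[32];
    char type;       /* 'C' or 'D' */
    char label[64];
} toy_mr_set_t;

/* "Write" side: serialize as $name=T <label length> <label>\n */
static toy_error_t toy_write(const toy_mr_set_t *set, char *buf, size_t cap) {
    if (set->type != 'C' && set->type != 'D') return TOY_ERROR_BAD_TYPE;
    int n = snprintf(buf, cap, "$%s=%c %zu %s\n", set->name, set->type,
                     strlen(set->label), set->label);
    return (n < 0 || (size_t)n >= cap) ? TOY_ERROR_IO : TOY_OK;
}

/* "Read" side: parse what toy_write produced. */
static toy_error_t toy_read(const char *buf, toy_mr_set_t *out) {
    size_t label_len = 0;
    int consumed = 0;
    if (sscanf(buf, "$%31[^=]=%c %zu %n", out->name, &out->type,
               &label_len, &consumed) != 3) return TOY_ERROR_IO;
    if (label_len >= sizeof(out->label) ||
        strlen(buf + consumed) < label_len) return TOY_ERROR_IO;
    memcpy(out->label, buf + consumed, label_len);
    out->label[label_len] = '\0';
    return TOY_OK;
}

/* The table: one row per case, with the error the write step should give. */
typedef struct {
    const char *desc;
    toy_mr_set_t input;
    toy_error_t expected;
} toy_case_t;

static const toy_case_t toy_cases[] = {
    { "category set",  { "categorical_array", 'C', "" }, TOY_OK },
    { "dichotomy set", { "mymrset", 'D', "My multiple response set" }, TOY_OK },
    { "unknown type",  { "bad", 'X', "" }, TOY_ERROR_BAD_TYPE },
};

/* Runs every row; returns the number of failing cases. */
static int toy_run_cases(void) {
    int failures = 0;
    for (size_t i = 0; i < sizeof(toy_cases) / sizeof(toy_cases[0]); i++) {
        const toy_case_t *c = &toy_cases[i];
        char buf[256];
        toy_mr_set_t back;
        toy_error_t err = toy_write(&c->input, buf, sizeof(buf));
        if (err != c->expected) { failures++; continue; }
        if (err != TOY_OK) continue; /* expected failure, nothing to read */
        if (toy_read(buf, &back) != TOY_OK ||
            strcmp(back.name, c->input.name) != 0 ||
            back.type != c->input.type ||
            strcmp(back.label, c->input.label) != 0) {
            failures++;
        }
    }
    return failures;
}
```

The point of the table is that a new scenario is one more row rather than a new test function, which is why this style needs a write path to exist alongside the read path.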

As for strtok problems I'd need to see the errors. Going full Ragel is one solution, but it's a matter of how you want to spend your time.
