Add functionality for MR metadata reading from SAV #313

slobodan-ilic · 2024-04-24T09:52:04Z

This PR adds functionality for reading multiple response (MR) metadata from SAV files. It's been tested on a simple file that we use for a PoC at Crunch.io. It's a work in progress, and I'm available for any updates and changes it needs.

UPDATE: Running this function on a test file succeeds on a small portion of the tries. However, it segfaults on most tries and I can't really track it down, so any guidance on that is more than welcome as well.

Here's an example of how I tried testing it:

#include <stdio.h>
#include <stdlib.h>
#include "readstat.h"

typedef struct
{
    const mr_set_t *sets;
    int count;
} mr_sets_context_t;

int handle_metadata(readstat_metadata_t *metadata, void *ctx)
{
    mr_sets_context_t *mr_ctx = (mr_sets_context_t *)ctx; // Recover the typed context
    mr_ctx->count = readstat_get_multiple_response_sets_length(metadata);
    mr_ctx->sets = readstat_get_mr_sets(metadata);
    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);

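    // Processing (calloc so count is 0 and sets is NULL if the handler never runs)
    mr_sets_context_t *mr_ctx = calloc(1, sizeof(mr_sets_context_t));
    if (mr_ctx == NULL)
    {
        readstat_parser_free(parser);
        return 1;
    }
    error = readstat_parse_sav(parser, argv[1], mr_ctx);
    printf("Found %d records\n", mr_ctx->count);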
    for (int i = 0; i < mr_ctx->count; i++)
    {
        printf("MR set %d name: %s\n", i + 1, mr_ctx->sets[i].name);
        printf("type: %c\n", mr_ctx->sets[i].type);
        printf("is dichotomy: %d\n", mr_ctx->sets[i].is_dichotomy);
    }

    // Cleanup
    readstat_parser_free(parser);
    free(mr_ctx);
    if (error != READSTAT_OK)
    {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    return 0;
}

And here's the example file:
simple_alltypes.sav.zip

This is the output from when it succeeds:

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
Found 2 records
MR set 1 name: categorical_array
type: C
is dichotomy: 0
MR set 2 name: mymrset
type: D
is dichotomy: 1

and this one is when it fails (which happens more often):

➜  ReadStat git:(ISS-229-add-mr-metadata-support-for-sav) ✗ DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav
count: 0
label: 

Final counted value is: 1
count: 24
label: My multiple response set
[1]    86961 segmentation fault  DYLD_LIBRARY_PATH=./.libs ./read_mr_metadata ./simple_alltypes.sav

src/spss/readstat_sav_read.c

evanmiller · 2024-05-04T12:08:32Z

See failing builds also

src/spss/readstat_sav_read.c: In function ‘parse_mr_line’:
src/spss/readstat_sav_read.c:176:51: error: implicit declaration of function ‘isdigit’ [-Wimplicit-function-declaration]
  176 |             for (int i = 0; i < internal_count && isdigit(*next_part); i++) {
      |                                                   ^~~~~~~
src/spss/readstat_sav_read.c:26:1: note: include ‘<ctype.h>’ or provide a declaration of ‘isdigit’
   25 | #include "readstat_zsav_read.h"
  +++ |+#include <ctype.h>
   26 | #endif
src/spss/readstat_sav_read.c: In function ‘readstat_parse_sav’:
src/spss/readstat_sav_read.c:1882:40: error: implicit declaration of function ‘toupper’ [-Wimplicit-function-declaration]
 1882 |                     sv_name_upper[c] = toupper((unsigned char) mr.subvariables[j][c]);
      |                                        ^~~~~~~
src/spss/readstat_sav_read.c:1882:40: note: include ‘<ctype.h>’ or provide a declaration of ‘toupper’
make: *** [Makefile:2419: src/spss/libreadstat_la-readstat_sav_read.lo] Error 1

slobodan-ilic · 2024-05-05T15:58:08Z

@evanmiller I think it's fixed now.

evanmiller · 2024-05-05T16:23:43Z

@slobodan-ilic Thanks for addressing the build issues. However, it looks like the CI fuzzer uncovered a segfault. From a cursory read of the code, it appears that strtol performs an unprotected memory read. There is also a Windows build issue that will need to be addressed.
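For illustration only, here is a minimal sketch of one way to avoid that kind of read, assuming the MR record sits in a buffer of known length that is not guaranteed to be null-terminated. The helper name parse_bounded_int is hypothetical and not part of ReadStat; unlike strtol, it never looks past the length it is given.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of ReadStat: parses a non-negative decimal
 * integer from buf[*pos .. len) without ever reading at or beyond buf[len].
 * On success returns 0, stores the value in *out and advances *pos past the
 * digits. Returns -1 if there is no digit at *pos or the value overflows int. */
static int parse_bounded_int(const char *buf, size_t len, size_t *pos, int *out) {
    assert(buf != NULL && pos != NULL && out != NULL);
    size_t i = *pos;
    int value = 0;

    if (i >= len || buf[i] < '0' || buf[i] > '9')
        return -1; /* nothing to parse */

    while (i < len && buf[i] >= '0' && buf[i] <= '9') {
        int digit = buf[i] - '0';
        if (value > (INT_MAX - digit) / 10)
            return -1; /* value * 10 + digit would overflow */
        value = value * 10 + digit;
        i++;
    }

    *pos = i;
    *out = value;
    return 0;
}
```

The caller keeps the cursor in pos, so the same function can be called repeatedly while walking through a record, and a failure leaves pos untouched.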

slobodan-ilic · 2024-06-11T15:57:23Z

Hi @evanmiller, thanks for all the input. We've done a couple of iterations and a bunch of testing on real-life survey data. All of the small bugs are taken care of, no more nasty segfaults, etc. Are you available to do one more round of review and provide some guidance?

slobodan-ilic · 2024-06-12T16:43:49Z

Another short update: I managed to run the fuzzers locally (although following the documentation was not straightforward). After reproducing a crash locally, I identified and fixed the bug (which seemed obvious once discovered). It should be in much better shape now.

evanmiller · 2024-06-13T11:44:25Z

Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

slobodan-ilic · 2024-06-13T17:07:56Z

> Hi, CI is still producing a fuzz failure. Also please see the Windows build failure (looks simple).

I had just detected the other fuzz failure as you were writing... It should be fixed now. As for Windows, are you referring to the errors with readstat_sav_date, the ones that mostly have VS17 in the file paths? If so, I've tried reverting my PR back to the dev branch, but these errors are still present in CI. I thought I'd avoid trying to get VS up and running, since I'm on a Mac.

Maybe I should try opening the PR against master?

Update: I just tried with master too (as a separate commit, which I later deleted). It gave the same error about sav date: all red in the AppVeyor run, yet it said the tests passed. Unknown land to me :)

slobodan-ilic · 2024-06-13T19:13:54Z

Found another issue with the fuzzer; this time it's an OOM. I'm on it and will ping when done.

evanmiller · 2024-06-14T11:23:00Z

For Windows, I am referring to the failed CI:

In file included from src/spss/readstat_sav_read.c:11:
src/spss/readstat_sav_read.c: In function 'parse_mr_counted_value':
src/spss/readstat_sav_read.c:167:55: error: array subscript has type 'char' [-Werror=char-subscripts]
  167 |         for (int i = 0; i < internal_count && isdigit(*(*next_part)); i++) {
      |                                                       ^~~~~~~~~~~~~
src/spss/readstat_sav_read.c: In function 'parse_mr_line':
src/spss/readstat_sav_read.c:210:20: error: array subscript has type 'char' [-Werror=char-subscripts]
  210 |     while (isdigit(*next_part)) {
      |                    ^~~~~~~~~~
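As a small illustration of the usual fix for this warning: the <ctype.h> functions expect a value representable as unsigned char (or EOF), so a plain char argument is typically cast first. The helper below is hypothetical and only shows the pattern, not the PR's actual code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the decimal digits at the start of s. */
static size_t count_leading_digits(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    /* The cast keeps the argument in the range isdigit() is defined for,
     * and silences -Wchar-subscripts on platforms where isdigit is a
     * table-lookup macro. The loop stops at the terminating '\0'. */
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}
```

The same cast applies to the toupper call reported in the earlier build log, which already uses it.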

evanmiller · 2024-06-16T10:10:37Z

I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

slobodan-ilic · 2024-06-16T10:19:15Z

> I think the spawnv/_spawnv Windows build issue is a problem in master so you don't need to worry about it!

I was just exploring that failure. Thanks for the message. I just pushed a version that previously had a successful cfuzz build. I had a rogue commit somewhere in the middle, and due to the long-lived nature of the PR I had failed to notice it. Anyway, all should be good now, with both the builds and the functionality. I'll need to squash so the commits look nicer, and remove obvious commented-out code. I can do that today, but the structure shouldn't change. So if you want to go over it and provide further guidance, I'm all ears.

P.S. I implemented some tests in the pyreadstat version of this work, but was not able to implement them here. As far as I can see, the elaborate structure of the tests (filling the buffers and preparing parsers) mostly focuses on data, not on metadata, which is what this PR is about. If you have a clear path forward for me to implement this, please let me know which steps I should take.

evanmiller

Left some comments!

evanmiller · 2024-06-16T10:33:43Z

src/readstat_metadata.c

+    return metadata->multiple_response_sets_length;
+}
+
+const mr_set_t *readstat_get_mr_sets(readstat_metadata_t *metadata) {


I think this function should be called readstat_get_multiple_response_sets

evanmiller · 2024-06-16T10:37:00Z

src/spss/readstat_sav_read.c

+    }
+
+    fprintf(stderr, "\n\n\nDebug: MR string: '%s'\n", mr_string);
+    char *token = strtok(mr_string, "$\n");


I believe strtok is not thread-safe; I'd prefer a thread-safe implementation.

slobodan-ilic (Jun 21, 2024): Do you have any particular advice for this? I tried a combination of strtok_r and strtok_s, but the build system is affected... I guess one option would be to do the entire mr_string parser in Ragel... But maybe there's a quicker way to do it.

evanmiller (Jun 22, 2024): What was the build error?
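One portable option, sketched here purely as an assumption about what could work, is to avoid the strtok family altogether and build a small tokenizer on strspn and strcspn, which are standard C and available on every platform. The function name next_token is hypothetical; all state lives in the caller's cursor, so there is no hidden static buffer, and the input string is not modified.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for strtok: finds the next token in *cursor, where
 * tokens are separated by any character in delims. Stores the start of the
 * token in *tok, advances *cursor past it, and returns the token length.
 * Returns 0 when no tokens remain. The token is NOT null-terminated. */
static size_t next_token(const char **cursor, const char *delims, const char **tok) {
    assert(cursor != NULL && *cursor != NULL && delims != NULL && tok != NULL);
    const char *start = *cursor + strspn(*cursor, delims); /* skip delimiters */
    size_t len = strcspn(start, delims);                   /* measure token */
    *tok = start;
    *cursor = start + len;
    return len;
}
```

Because tokens come back as pointer plus length, the caller has to copy them (for example with memcpy and an explicit terminator) before passing them to anything that expects a C string.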

evanmiller · 2024-06-16T10:47:13Z

src/spss/readstat_sav_read.c

+                }
+                spss_varinfo_t *info = (spss_varinfo_t *)ck_str_hash_lookup(sv_name_upper, var_dict);
+                if (info) {
+                    free(mr.subvariables[j]);


Maybe readstat_realloc instead?

slobodan-ilic · 2024-06-21T16:03:56Z

I've made the changes with Ragel @evanmiller

I'm not sure how the CI is invoked: sometimes it starts immediately after a commit, sometimes after an hour, and sometimes it doesn't start at all. So getting the builds right is difficult, but I think I got it in the last commit. I had to add some vcproj files etc. I wasn't able to replace strtok though; I only found strtok_r and strtok_s, but the build was messed up, and I didn't want to get too deep into that before checking with you.

Anyways, if you can take a look at my latest changes, and provide feedback (at least if this is moving in the right direction) that would be really awesome.

Learning Ragel was fun :)

P.S. I'd love to add some tests, so any advice on how to get there would also be great.

evanmiller · 2024-06-22T11:01:19Z

Looking much better and more maintainable with the Ragel code!

As for tests, we currently use a table-driven suite that roundtrips files (writes, then reads, and checks for expected error values). This approach would require adding a multiple-response write API in addition to the read API. Generally writing is easier than reading, at least for purposes of testing, though often there are snares around getting written files to open properly in SPSS.

As for strtok problems I'd need to see the errors. Going full Ragel is one solution, but it's a matter of how you want to spend your time.

Add functionality for MR metadata reading from SAV

b96798d

slobodan-ilic mentioned this pull request Apr 24, 2024

Feature Request (or Question): Support for Multiple Response Sets in SAV Files? #229

Open

slobodan-ilic commented May 3, 2024

src/spss/readstat_sav_read.c (outdated)

Try fixing build

850f0df

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 019f4a6 to 850f0df Compare May 6, 2024 10:36

slobodan-ilic added 6 commits June 3, 2024 16:15

Fix issues with null-termination of mr string

bae8721

Refactor of mr parsing

55af2f2

Try fixing fuzzifier

e471605

wip

789511a

fixup! Try fixing fuzzifier

622301c

fixup! wip

8b453bd

slobodan-ilic mentioned this pull request Jun 11, 2024

Add support for reading Multiple Response from sav (WIP, DO NOT MERGE) Roche/pyreadstat#259

Open

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 2ee8be0 to 7768183 Compare June 12, 2024 13:42

fixup! fixup! wip

0a83ade

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 7768183 to 0a83ade Compare June 12, 2024 14:47

Fix error found by fuzzifier

26e96c7

slobodan-ilic added 2 commits June 13, 2024 17:11

Fix another malloc issue found with fuzzer

481a7d1

Another malloc fix

ec778f5

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from c074f51 to ec778f5 Compare June 13, 2024 16:49

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 430b30f to ec778f5 Compare June 13, 2024 17:24

try fix oom found with fuzzer

d30f048

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 1b5fcb7 to a7b36ea Compare June 15, 2024 15:54

Fix actual logic

213a76a

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 2 times, most recently from 815d40d to de0551c Compare June 16, 2024 09:17

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from de0551c to 213a76a Compare June 16, 2024 10:15

evanmiller reviewed Jun 16, 2024

slobodan-ilic added 2 commits June 20, 2024 11:08

Rewrite parsing logic with Ragel

68b2ecb

try fixing appveyor build

12fa4b2

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 5 times, most recently from 1d764ec to 3a93e6b Compare June 20, 2024 13:37

Try fix build pt2

0a11d5c

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 3a93e6b to 0a11d5c Compare June 20, 2024 13:42

slobodan-ilic added 3 commits June 20, 2024 16:07

Try fix build pt3

8975ade

Fix attempt pt 4

1c92bd2

Try fix build pt5

db6164e

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch 2 times, most recently from 8edcecf to 55778b1 Compare June 20, 2024 20:24

Fix build pt6

07e323f

slobodan-ilic force-pushed the ISS-229-add-mr-metadata-support-for-sav branch from 55778b1 to 07e323f Compare June 21, 2024 07:32

slobodan-ilic added 3 commits June 21, 2024 17:06

Fix functionality

b0a99ef

try fix build

0fbca90

Try fix build

fc836e7

Change parser to full-ragel

6f500cb
