Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid data with DDS_NOT_ALIVE_DISPOSED_INSTANCE_STATE received #148

Closed
stillsimon opened this issue Apr 9, 2019 · 4 comments
Closed

Comments

@stillsimon
Copy link

stillsimon commented Apr 9, 2019

Hi, I've been testing with two writers writing out data to one reader. One writer writes out 10 data samples and deleted. After that, another writer with the same topic writes out 10 data samples too. However, the reader cannot receive the the data samples from the second writer. It shows the data samples are invalid and the instance state is DDS_NOT_ALIVE_DISPOSED_INSTANCE_STATE. Is it a bug or something with qos configuration? Thanks!
PS: If I don't delete the first writer, everything works fine!

Pub side code:

#include "ddsc/dds.h"
#include "HelloWorldData.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    dds_entity_t participant;
    dds_entity_t topic;
    dds_entity_t writer;
    dds_entity_t writer2;
    dds_return_t ret;
    HelloWorldData_Msg msg;
    (void) argc;
    (void) argv;

    /* Create a Participant. */
    participant = dds_create_participant(DDS_DOMAIN_DEFAULT, NULL, NULL);
    DDS_ERR_CHECK (participant, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    /* Create a Topic. */
    topic = dds_create_topic(participant, &HelloWorldData_Msg_desc, "HelloWorldData_Msg", NULL, NULL);
    DDS_ERR_CHECK (topic, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    /* Create a Writer. */
    dds_qos_t *writer_qos = dds_create_qos();
    dds_qset_writer_data_lifecycle(writer_qos, true);
    dds_qset_reliability(writer_qos, DDS_RELIABILITY_RELIABLE, DDS_MSECS(100));
    writer = dds_create_writer(participant, topic, writer_qos, NULL);
    DDS_ERR_CHECK (writer, DDS_CHECK_REPORT | DDS_CHECK_EXIT);
    writer2 = dds_create_writer(participant, topic, writer_qos, NULL);
    DDS_ERR_CHECK (writer2, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    printf("=== [Publisher]  Waiting for a reader to be discovered ...\n");

    ret = dds_set_status_mask(writer, DDS_PUBLICATION_MATCHED_STATUS);
    DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    while (true) {
        uint32_t status;
        ret = dds_get_status_changes(writer, &status);
        DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

        if (status == DDS_PUBLICATION_MATCHED_STATUS) {
            break;
        }
        /* Polling sleep. */
        dds_sleepfor(DDS_MSECS (20));
    }

    /* Create a message to write. */
    for (int i = 0; i < 10; i++) {
        msg.userID = i;
        msg.message = "Hello World";
        printf("=== [Publisher]  Writing : ");
        printf("Message (%d, %s)\n", msg.userID, msg.message);
        ret = dds_write(writer, &msg);
        DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);
    };

    /* Delete one data writer after it finishes writing. */
    ret = dds_delete(writer);
    DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    dds_sleepfor(DDS_SECS(1));

    /* Create a message to write. */
    for (int i = 0; i < 10; i++) {
        msg.userID = i + 10;
        msg.message = "Hello World2";
        printf("=== [Publisher2]  Writing : ");
        printf("Message (%d, %s)\n", msg.userID, msg.message);
        ret = dds_write(writer2, &msg);
        DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);
    };

    dds_sleepfor(DDS_SECS(1));

    /* Deleting the participant will delete all its children recursively as well. */
    ret = dds_delete(participant);
    DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    return EXIT_SUCCESS;
}

Sub side code:

#include "ddsc/dds.h"
#include "HelloWorldData.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* An array of one message (aka sample in dds terms) will be used. */
#define MAX_SAMPLES 1

void reader_on_data_available(dds_entity_t reader, void *arg)
{
    HelloWorldData_Msg *msg;
    void *samples[MAX_SAMPLES];
    dds_sample_info_t infos[MAX_SAMPLES];
    dds_return_t ret;

    ret = dds_take(reader, samples, infos, MAX_SAMPLES, MAX_SAMPLES);
    DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    /* Check if we read some data and it is valid. */
    if ((ret > 0) && (infos[0].valid_data)) {
        /* Print Message. */
        msg = (HelloWorldData_Msg *) samples[0];
        printf("Message (%d, %s) from %x Instance State: %d\n",
               msg->userID,
               msg->message,
               infos[0].publication_handle,
               infos[0].instance_state);
    }

    if ((ret > 0) && (!infos[0].valid_data)) {
        printf("Invalid Data from %x instance state: %d\n", infos[0].publication_handle, infos[0].instance_state);;
    }
}

int main(int argc, char **argv)
{
    dds_entity_t participant;
    dds_entity_t topic;
    dds_entity_t reader;
    HelloWorldData_Msg *msg;
    void *samples[MAX_SAMPLES];
    dds_sample_info_t infos[MAX_SAMPLES];
    dds_return_t ret;
    dds_qos_t *qos;
    (void) argc;
    (void) argv;

    /* Create a Participant. */
    participant = dds_create_participant(DDS_DOMAIN_DEFAULT, NULL, NULL);
    DDS_ERR_CHECK (participant, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    /* Create a Topic. */
    topic = dds_create_topic(participant, &HelloWorldData_Msg_desc, "HelloWorldData_Msg", NULL, NULL);
    DDS_ERR_CHECK (topic, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    /* Create a reliable Reader. */
    qos = dds_create_qos();
    dds_qset_reliability(qos, DDS_RELIABILITY_RELIABLE, DDS_SECS (100));
    dds_listener_t *reader_listener = dds_create_listener(reader);
    dds_lset_data_available(reader_listener, reader_on_data_available);
    reader = dds_create_reader(participant, topic, qos, reader_listener);
    DDS_ERR_CHECK (reader, DDS_CHECK_REPORT | DDS_CHECK_EXIT);
    dds_delete_qos(qos);

    printf("\n=== [Subscriber] Waiting for a sample ...\n");

    dds_sleepfor(DDS_SECS(10));

    /* Deleting the participant will delete all its children recursively as well. */
    ret = dds_delete(participant);
    DDS_ERR_CHECK (ret, DDS_CHECK_REPORT | DDS_CHECK_EXIT);

    return EXIT_SUCCESS;
}

image
image

@eboasson
Copy link
Contributor

eboasson commented Apr 9, 2019

There is a "writer data lifecycle" QoS setting that includes a field named "auto-dispose unregistered instances", and sadly, this one defaults to "true". What it means (today, in Cyclone anyway) is that whenever an instance becomes unregistered, for whatever reason, it is automatically disposed. Deleting the writer implies the instances it published are all automatically unregistered.

Just a fair warning: there are disputes about the meaning of this particular QoS. OpenSplice DDS uses the same interpretation as Cyclone currently does, but certainly RTI Connext and TwinOaks CoreDX have a different take on it: in those implementations, that automatic dispose is only performed if the writing application explicitly calls "unregister_instance".

To me, that interpretation makes zero sense. The distinction between "not-disposed" and "disposed" is very important in the programming model: the meaning attached to "dispose" is that the instance has reached the end of its life and so anything depending on it should be garbage collected. If I treat "auto dispose" as something only to be done when "unregister_instance" is called explicitly, then an application doing a dispose+unregister immediately before crashing, combined with a bit of packet loss, could result in some parts of the system having the data in "not-alive-disposed" state while other parts might have it in "not-alive-no-writers".

Now, of course with transient data in the picture, it is possible that you would restart some application that would recover and continue the cleaning up. But my point is that in the absence of such an application, the "Connext" model necessarily leaves the system in an inconsistent state, whereas the "Cyclone" model allows you to choose what you want to happen (because you can always call "dispose" yourself).

Besides, from a purely linguistic point of view, I believe the "Connext" model should be called "dispose-on-unregister-instance", rather than "dispose unregistered instances" ...

Note: in the presence of multiple writers all publishing updates of the same instance — in most systems I have seen, something that is frowned upon — I think it makes more sense to change Cyclone to do this auto-disposing only once all writers have unregistered the instance.

@stillsimon
Copy link
Author

stillsimon commented Apr 9, 2019

Thanks for explanation! But I do have a different understanding for the meaning of "instance". From RTI User Manual:

for Topics with keys, a unique value for the key identifies a unique instance of the Topic.

So in this case, when a second writer wants to send different data samples with different keys, these instances should be considered different from the instances sent by the first writer(thus should not be disposed). Is that right? Or does Cyclone has different interpretation for the meaning of "instance"?

User manual from RTI:
image

@eboasson
Copy link
Contributor

eboasson commented Apr 9, 2019

There is no different understanding of the meaning of "instance". You, me, RTI, Cyclone are all in agreement on that point.

So in this case, when a second writer wants to send different data samples with different keys, these instances should be considered different from the instances sent by the first writer(thus should not be disposed).

They are considered different. I must apologise for having interpreted your question with my mind still pondering transient-local data ... So now I have had a better look and actually tried it for good measure.

The deleting of the writer causes disposes/unregister to be generated and communicated to the application as "invalid samples". Those get returned by the take call as well, and it appears that the instances written by the first writer happen to always come first in the ordering in which take returns the data. (Not so strange, actually, there is just no guarantee.)

You're taking only 1 sample at a time, and it appears that if you're only getting 20 "data available" events in total, you end up taking the first 10 samples and then the 10 dispose+unregisters — never actually getting to the point where you would read the second 10 samples. If you take more samples at a time or you keep taking until there's nothing available, you'll see all the data.

That appears to not trigger on the dispose/unregister messages suggests a bug. When I fixed #87 I did notice some special-casing in the raising of that event that I didn't pursue at the time, so I find it quite plausible that it isn't raising the "data available" event correctly in this case.

Even with that addressed, you should allow for receiving multiple samples before the data available event is raised: generating these automatic disposes and unregisters will have a bit of a different profile from the normal ones and I don't think it makes any sense to invoke a listener for every single one.

eboasson added a commit to eboasson/cyclonedds that referenced this issue Apr 10, 2019
Deleting a writer causes unregisters (and possibly disposes) in the rest
of the network, and these updates to the instances should trigger
DATA_AVAILABLE.

Signed-off-by: Erik Boasson <eb@ilities.com>
eboasson added a commit that referenced this issue Apr 11, 2019
Deleting a writer causes unregisters (and possibly disposes) in the rest
of the network, and these updates to the instances should trigger
DATA_AVAILABLE.

Signed-off-by: Erik Boasson <eb@ilities.com>
@stillsimon
Copy link
Author

Thank you! I modified my code and it worked!

eboasson added a commit to eboasson/cyclonedds that referenced this issue Apr 12, 2019
Deleting a writer causes unregisters (and possibly disposes) in the rest
of the network, and these updates to the instances should trigger
DATA_AVAILABLE.

Signed-off-by: Erik Boasson <eb@ilities.com>
@eboasson eboasson closed this as completed May 1, 2019
eboasson added a commit that referenced this issue May 23, 2019
Deleting a writer causes unregisters (and possibly disposes) in the rest
of the network, and these updates to the instances should trigger
DATA_AVAILABLE.

Signed-off-by: Erik Boasson <eb@ilities.com>
kurtuluso pushed a commit to kurtuluso/cyclonedds that referenced this issue May 24, 2019
Deleting a writer causes unregisters (and possibly disposes) in the rest
of the network, and these updates to the instances should trigger
DATA_AVAILABLE.

Signed-off-by: Erik Boasson <eb@ilities.com>
hsgwa pushed a commit to hsgwa/cyclonedds that referenced this issue Feb 9, 2021
* Add support for taking a sequence of messages

Signed-off-by: Michael Carroll <michael@openrobotics.org>

* Reorder valid messages to front of sequence

Signed-off-by: Michael Carroll <michael@openrobotics.org>

* Initialize taken value

Signed-off-by: Michael Carroll <michael@openrobotics.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants