How to clean up invalid BinarySchema #10823

Open
asdfgh19 opened this issue Jul 5, 2023 · 4 comments

Comments


asdfgh19 commented Jul 5, 2023

I created a cache of type IgniteCache<String, BinaryObject>. To perform upserts, I used an EntryProcessor as follows.

EntryProcessor
import java.util.Map;

import javax.cache.processor.EntryProcessor;
import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.binary.BinaryObjectBuilder;
import org.apache.ignite.resources.IgniteInstanceResource;

public class UpsertEntryProcessorV1
    implements EntryProcessor<String, BinaryObject, Void> {

    /** Injected by Ignite when the processor is executed. */
    @IgniteInstanceResource
    private Ignite ignite;

    @SuppressWarnings("unchecked")
    @Override
    public Void process(MutableEntry<String, BinaryObject> entry,
        Object... arguments) throws EntryProcessorException {
        if (entry == null || entry.getKey() == null) {
            return null;
        }

        // arguments[0]: field updates keyed by cache key.
        Map<String, Map<String, Object>> params =
            (Map<String, Map<String, Object>>) arguments[0];
        Map<String, Object> param = params.get(entry.getKey());
        if (param == null || param.isEmpty()) {
            return null;
        }

        // Update the existing binary object, or create a new one
        // of the type name passed in arguments[1].
        BinaryObjectBuilder builder;
        BinaryObject entryVal = entry.getValue();
        if (entryVal != null) {
            builder = entryVal.toBuilder();
        } else {
            String valueType = (String) arguments[1];
            builder = ignite.binary().builder(valueType);
        }

        // Apply only the fields carried by this update.
        for (Map.Entry<String, Object> item : param.entrySet()) {
            if (item.getKey() == null) {
                continue;
            }
            builder.setField(item.getKey(), item.getValue());
        }

        entry.setValue(builder.build());
        return null;
    }
}
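For context, a hypothetical invocation of the processor might look like this (the cache name, value type name, and field values are illustrative, not from the original post):

import java.util.HashMap;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;

public class UpsertExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // withKeepBinary() so the processor sees BinaryObject values.
            IgniteCache<String, BinaryObject> cache =
                ignite.getOrCreateCache("deviceCache").withKeepBinary();

            Map<String, Map<String, Object>> params = new HashMap<>();
            params.put("device-1", Map.of("temperature", 21.5, "humidity", 40));

            // arguments[0]: per-key field updates; arguments[1]: value type name.
            cache.invoke("device-1", new UpsertEntryProcessorV1(), params, "DeviceState");
        }
    }
}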

In this cache I wrote 20,000 records, each with 1 to 1,000 fields. I found that the Ignite service performed frequent full GCs and sometimes crashed outright. Heap dump analysis showed that BinarySchema instances occupied a large amount of memory.

[heap dump screenshot showing BinarySchema memory usage omitted]

Eventually I figured out that the cause was writing a random subset of a record's fields on each update:

When I write a record for the first time, a BinarySchema is created.

The next time I update this record and write one more field, a new BinarySchema is created and written to ./work/db/binary_meta/; the old schema is never cleared.

In the end tens of thousands of BinarySchemas are created, but only dozens of them are actually referenced by the serialized objects in storage.

The following are the places I found that store BinarySchema. Is there a way to clean these up?

  1. BinaryContext#descByCls#schemaReg
  2. BinaryContext#schemas
  3. CacheObjectBinaryProcessorImpl#metadataLocCache#metadata#schemas
ptupitsyn (Contributor) commented:

Re "invalid BinarySchema": it is not invalid. If a schema was used once, it has to be stored so that the object can be deserialized later.

You can try to minimize the number of unique schemas, for example by ensuring a consistent field order. Currently, for (Map.Entry<String, Object> item : param.entrySet()) can produce items in arbitrary order when the underlying Map implementation is unordered (see the sketch below).
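One way to apply this suggestion (my sketch, not from the thread) is to copy the incoming map into a TreeMap so fields are always applied in sorted-key order, making the same field set map to the same schema regardless of the caller's Map implementation:

// Drop-in replacement for the field loop in process().
// Requires: import java.util.TreeMap;
// TreeMap rejects null keys, so skip them while copying
// (the original loop skipped them anyway).
Map<String, Object> ordered = new TreeMap<>();
for (Map.Entry<String, Object> item : param.entrySet()) {
    if (item.getKey() != null) {
        ordered.put(item.getKey(), item.getValue());
    }
}
// Fields are now applied in a deterministic (sorted) order.
for (Map.Entry<String, Object> item : ordered.entrySet()) {
    builder.setField(item.getKey(), item.getValue());
}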


asdfgh19 commented Jul 5, 2023

@ptupitsyn Thanks for your reply and suggestion! By analyzing BinaryObjectBuilderImpl#serializeTo, I found a pattern.

Writing fields out of order, or updating a subset of an existing record's fields, does not create a new BinarySchema; but whenever we add a new field to a record, a new schema is created.

Suppose we first create a new record with field A; at this point the record's schemaId is 1.
Next we update the record and add field B; the schemaId becomes 2. If we keep adding one new field per update until we reach the final goal of 1,000 fields, the schemaId becomes 1000.
If no other records reference the 999 BinarySchemas created along the way, then no object deserialization ever needs them.

I worked around this problem: when writing a record for the first time, I write null to every field that is not present, so only a single BinarySchema is ever created. But this solution wastes some memory, because each null value occupies 2 bytes after serialization, and most records probably only need around 200 of the 1,000 fields.
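Concretely, the first-write branch of the processor could look something like this (my sketch; allKnownFields is a hypothetical list of every field the value type can have):

// Workaround sketch for the first write of a record. allKnownFields
// is hypothetical metadata listing every field of the value type.
BinaryObjectBuilder builder = ignite.binary().builder(valueType);
for (String field : allKnownFields) {
    // val is null when this update does not carry the field; the
    // 3-arg overload is used because a bare null carries no type info.
    Object val = param.get(field);
    builder.setField(field, val, Object.class);
}
entry.setValue(builder.build());
// Every record of this type now shares one BinarySchema, at the
// cost of roughly 2 bytes per serialized null field.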

ptupitsyn (Contributor) commented:

Thanks for posting the solution. Yes, there is a trade-off: create more schemas, or waste some space on nulls.

Can you also describe your use case a little, please? Handling 1,000 fields dynamically is somewhat unusual.


asdfgh19 commented Jul 6, 2023

@ptupitsyn I agree with you. There is a trade-off here.

In IoT scenarios we store the latest values of device properties and telemetry. Each record is a device, and devices generally have dozens to hundreds of fields.
Some telemetry is uploaded every minute, other telemetry every 5 minutes.
Some devices upload only a few dozen telemetry fields, others upload hundreds.

Even for a cache with only three fields, there may be up to 7 BinarySchemas at first, one per non-empty field subset: 1, 2, 3, 12, 13, 23, 123. After updates, only one or two of them may still be in use.

I have two rough ideas.
The first is to delete any BinarySchema that is no longer referenced by any object at write or update time (a minimal sketch follows the list):

  1. Add a reference count to BinarySchema.
  2. When we write a new record, increment the reference count of the BinarySchema corresponding to that record.
  3. When we update an existing record and the update creates a new BinarySchema, fetch the old BinarySchema and decrement its reference count.
  4. When we delete a record, decrement the reference count of the corresponding BinarySchema.
  5. When an old BinarySchema's reference count reaches 0, remove it from memory.
  6. The reference count does not need to be serialized into the binary_metadata file, because it can be rebuilt from the cache on restart; that way the file does not have to be rewritten every time a count changes.
  7. On each restart, any BinarySchema that is not referenced by any object can be deleted from the binary_metadata file.
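A minimal sketch of the bookkeeping in steps 1 to 5 (illustrative only, not Ignite code; a real implementation would live inside Ignite's binary metadata handling and would need to deal with concurrency, persistence, and cluster-wide coordination):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative reference-counted schema registry.
public class SchemaRefCounter {
    private final ConcurrentMap<Integer, AtomicInteger> refCounts =
        new ConcurrentHashMap<>();

    /** Steps 1-2: a record starts using this schema. */
    public void acquire(int schemaId) {
        refCounts.computeIfAbsent(schemaId, id -> new AtomicInteger())
            .incrementAndGet();
    }

    /** Steps 3-5: a record stops using this schema; drop it at zero. */
    public void release(int schemaId) {
        AtomicInteger cnt = refCounts.get(schemaId);
        if (cnt != null && cnt.decrementAndGet() == 0) {
            // A real implementation must guard against a concurrent
            // acquire() racing this removal.
            refCounts.remove(schemaId, cnt);
            // Here Ignite could also evict the schema from the
            // in-memory registries listed earlier in this issue.
        }
    }
}

On restart (steps 6 and 7), the counts could be rebuilt by scanning the cache, and any schema left at zero dropped from the binary_metadata file.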

The second is to periodically scan each cache and delete the BinarySchemas that are not referenced by any object.

This is a bit complicated and only a proposal, but implementing it could give us better sparse storage and more flexibility, so we could support schemaless data better.
