Using elasticsearch generated ID's #97
Comments
@idarlington Currently the ID is controlled by the `key.ignore` setting: with `key.ignore=true` the connector generates the document ID as `topic+partition+offset`; otherwise it uses the Kafka record key as the ID.
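For context, a minimal sketch of the relevant sink configuration (the connector class and `key.ignore` are real settings; the name, topic list, and URL are illustrative):

```
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=gritServer,dailyData
connection.url=http://localhost:9200
# key.ignore=true: document IDs are generated as topic+partition+offset.
# key.ignore=false: the record key is used as the document ID.
key.ignore=true
```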
Thanks @ewencp. I am currently operating a single cluster. I noticed that the id is now […], and from the offset checker it seems the offset is not updating.
Also, if the server is restarted, does the offset value change? I currently have my […]. Finally, can you point me to examples of the regex router transform?
Interesting that the offset does not seem to be correct. I'm looking at the code where that ID is generated and it's definitely using the Kafka offset for the last part. That wouldn't rule out an issue in the framework, but I don't think anything has changed there. Is the […]? I think it's probably not your problem since you seem to have tracked the issue down to the offset, but although the docs don't have an example of the regex router, I can give a quick example here:
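A minimal sketch of such a transform config, assuming the stock `org.apache.kafka.connect.transforms.RegexRouter` shipped with Kafka (the alias `AddPrefix` and the `copy_of_` prefix are illustrative):

```
# Rename every topic "foo" to "copy_of_foo" before the sink sees it.
transforms=AddPrefix
transforms.AddPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.AddPrefix.regex=(.*)
transforms.AddPrefix.replacement=copy_of_$1
```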
That will prefix the topic names with a constant prefix before the connector processes them. Also, note that […]
Thanks @ewencp. Yes, the […]. I can't find any errors in the logs. These are mostly the contents: […]
I would be updating the […]
I'm also wondering if possibly you got into a bad state for the topic such that the consumer cannot make progress. If you increase the log level, you should see more messages indicating the sink task's progress. Increasing all the way to TRACE, even if only for `WorkerSinkTask`, would give log info about both the messages as they are consumed (including even logging the key and value) and the offsets that are actually being committed. You might also want to check that all the files we'd expect to be in the Kafka directory are there, i.e. that something didn't get deleted.
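As a sketch, assuming a stock Confluent layout, the TRACE override might look like this in `etc/kafka/connect-log4j.properties`:

```
# etc/kafka/connect-log4j.properties
log4j.rootLogger=INFO, stdout
# TRACE only for the sink task machinery, so the rest of the log stays readable.
log4j.logger.org.apache.kafka.connect.runtime.WorkerSinkTask=TRACE
```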
I have increased the log to TRACE, this is a snippet: […]

tree output:

```
etc/kafka
[…]
```
@idarlington Ok, so it looks like it's definitely making requests. We might want a larger snippet though, because the TRACE and DEBUG output is so verbose -- most of that is from the underlying library making the request, but doesn't include much output from Connect itself. We'd probably want more of the lines that include `WorkerSinkTask`.
@idarlington I'm not seeing any duplicates in that log? In fact it looks like requests are working as expected. I've grepped out a bit to help show what the connector is doing: […]
(It's missing one and has POST in there because the log is a bit mangled.) So for `gritServer-0`, we're seeing what we expect -- each message processed once, and in order. Looking at the offset commits: […]
You can see that the offset being committed for `gritServer-0` is definitely increasing. What's notable is that `dailyData-0` is not increasing, and none of the IDs reported above have that topic in their name, which indicates no messages are being seen for it. Is there actually any data flowing into that topic partition? The log clearly shows that fetch requests are being sent: […]
but it seems they are not receiving any data in response. It looks like things are running fine; there just isn't data in one of the topic partitions.
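One way to check that directly is to compare the beginning and end offsets of the partition; a sketch using the plain Java consumer (the bootstrap address is an assumption):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PartitionDataCheck {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("dailyData", 0);
      // If beginning and end offsets are equal, the partition holds no data.
      long begin = consumer.beginningOffsets(Collections.singletonList(tp)).get(tp);
      long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
      System.out.printf("dailyData-0 holds %d message(s)%n", end - begin);
    }
  }
}
```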
@ewencp I want to add another input to this thread. I understand the design, but for some cases like basic logs we don't need the `_id` validation and the performance degradation that comes with it; we just want to be able to index at a lower performance cost.
The way to do that is to send the documents without an ID (a null key), so Elasticsearch generates one itself.

In the logging use case it is usually okay to have some messages added twice (or even completely dropped), so IMO there should be an opt-in configuration which allows you to rely on ES IDs. This will result in much faster inserts to ES.
See the note I added on issue #139 for a very minor code change I made to support allowing null document keys and thus auto-generating the IDs in Elasticsearch.
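For illustration only, here is a minimal sketch of that idea; this is not the actual change from issue #139, and all names in it are hypothetical:

```java
/**
 * Hypothetical helper (not the code from issue #139): decide the Elasticsearch
 * document ID for a sink record. Returning null lets Elasticsearch
 * auto-generate a unique _id at index time.
 */
public final class DocumentIds {
  public static String documentId(String topic, int partition, long offset,
                                  Object key, boolean allowAutoGeneratedIds) {
    if (key == null && allowAutoGeneratedIds) {
      return null; // null => Elasticsearch assigns its own _id
    }
    if (key != null) {
      return key.toString(); // use the record key as the _id
    }
    // Default deterministic ID, matching the connector's documented scheme.
    return topic + "+" + partition + "+" + offset;
  }
}
```

Note that with a null ID, retried writes are no longer idempotent, which is why this only suits use cases (like logs) where occasional duplicates or drops are acceptable.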
I noticed that documents indexed in elasticsearch have their IDs in the following format: `topic+partition+offset`. I would prefer to use IDs generated by elasticsearch. It seems `topic+partition+offset` is not always unique, so I am losing data. How can I change that?