De-Duplicate documents in CrawlDB (Solr) #72
@karanjeets Please resolve this
@thammegowda This
Did you try this - https://cwiki.apache.org/confluence/display/solr/De-Duplication - and found that it doesn't suit our use case?
<field name="dedupe_id" type="string" stored="true" indexed="true" multiValued="false" />
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">dedupe_id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">crawl_id,url</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
...
</requestHandler>
Let me know if you have tried this and/or faced any issues with it.
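To make the chain above concrete: SignatureUpdateProcessorFactory concatenates the values of the configured `fields` (here `crawl_id,url`) and hashes them into `dedupe_id`. A minimal client-side illustration of that idea, assuming hypothetical documents and using `md5` as a stand-in (Solr's `Lookup3Signature` uses a different hash):

```python
import hashlib

def make_dedupe_id(doc, fields=("crawl_id", "url")):
    # Illustration only: concatenate the configured field values and hash
    # them, as SignatureUpdateProcessorFactory does on the server side.
    # md5 is a stand-in; Solr's Lookup3Signature is a different algorithm.
    raw = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# Two documents with the same crawl_id and url collide on dedupe_id;
# a different crawl_id produces a different id.
doc_a = {"crawl_id": "job-1", "url": "http://example.com/"}
doc_b = {"crawl_id": "job-1", "url": "http://example.com/"}
doc_c = {"crawl_id": "job-2", "url": "http://example.com/"}
```

With `overwriteDupes` set to true, Solr would drop the older of two colliding documents; with false (as configured above), both stay in the index and share a signature value.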
@thammegowda, are you suggesting dedup_id = crawl_id + url? And then you let Solr dedup on that id?
Yes. What I am saying (the TODO here) is: instead of doing it on the client side, we should let Solr handle it on the server side. That will be the efficient and right way of doing it.
I think crawl_id + url is not sufficient for de-duping large web crawls. Another scenario could be tracking activity on a website based on how frequently its content changes. In my opinion, we could set dedup_id = hash(page content). We might as well use the TextProfileSignature mentioned in the link you posted (https://cwiki.apache.org/confluence/display/solr/De-Duplication)
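A content-based signature can be sketched in a few lines. This is a simplified illustration of the idea, not Solr's TextProfileSignature (which is fuzzier and token-frequency based): normalize the text first so trivially reformatted pages map to the same id.

```python
import hashlib
import re

def content_signature(text):
    # Lowercase and collapse whitespace so pages that differ only in
    # formatting produce the same signature. Solr's TextProfileSignature
    # is more tolerant (near-duplicate detection); this is exact-match
    # after normalization, for illustration only.
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

The same content with different whitespace or casing then hashes to the same dedup_id, while any substantive change to the page produces a new one.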
Yes, I tried this. De-duplication is not supported with atomic updates, or there is a bug in Solr. Let me elaborate: at the time of adding a new document to Solr, it creates the signature from the combination of fields you specify; however, it updates the signature to a series of zeros when you do an atomic update on that document. Therefore, it doesn't help with de-duplication. If it is a bug, most likely the issue is with the application of
P.S. - If you look closely at the first comment, I referred to the same link.
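For reference, the kind of request that reportedly zeroes out the signature is a standard Solr atomic update through the same `/update` handler (document id and field name here are hypothetical; the `"set"` modifier is Solr's atomic-update JSON syntax):

```json
[
  {
    "id": "http://example.com/page",
    "status": { "set": "FETCHED" }
  }
]
```

Because the update passes through the `dedupe` chain but only carries a subset of the signature fields, the recomputed signature no longer matches the one produced at insert time.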
I was thinking along the lines of adding the fields that contribute to de-duplication to the Sparkler configuration, but that won't be a good choice looking at the future prospects. We can have de-duplication plugins if required.
I didn't get what you are trying to say here. What fields contribute to de-duplication? And what are you trying to deduplicate: url, content, or something else?
Regarding what would be better for the dedupe_id: let me elaborate on my point. There are two types of de-duplication.
I was thinking along the lines of generalization and giving control to the user, i.e. letting them define what schema field combination constitutes the de-duplication id. Let's push this back because it was just a random thought and is not helping the issue.
Agreed, you can dedup by
We need to flesh out more details on this and on how it will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.
Okay, let's first complete the work on dedupe of outlinks and defer dedupe based on content to later weeks. Getting back to the question of deduping outlinks on the server side:
Thanks @karanjeets. Solr seems to have many bugs with atomic updates. We need to file an issue for this bug and let them know about it if we are sure of it. Fixing that bug will take time, so we shall revert to our old way of handling this on the client side. Going to merge #73 now
@thammegowda So, I have investigated further on the Solr dedupe issue. The atomic update problem can be solved if we use However, the utility doesn't seems to be working as expected. Although this allows the To take this on server side, we have to:
Let's merge #73 while I work on the above plan to take this to the server side.
Thanks, @thammegowda 👍
@karanjeets Yes
Since we now have a more sophisticated definition of the id field (with timestamp included), we have to think about de-duplication of the documents. I am opening a discussion channel here to define de-duplication. Some of the suggestions are:
- the signature field (but this will enforce fetching of the duplicate document even though we are not storing it)
- the url field
We can refer here for the implementation.
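If duplicates are kept in the index (overwriteDupes=false), one way to surface them is a facet query on the signature field. A sketch using Solr's standard facet parameters; the collection name `crawldb` is an assumption:

```
GET /solr/crawldb/select?q=*:*&rows=0&facet=true&facet.field=signature&facet.mincount=2
```

Any signature value returned with a count of 2 or more identifies a group of duplicate documents, which can then be fetched or purged in a follow-up query.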