
Duplicates the datasets #15

Closed
montxo5 opened this issue Jun 10, 2014 · 7 comments
Comments

montxo5 commented Jun 10, 2014

When I try to harvest this XML-RDF: http://datos.madrid.es/egob/catalogo.rdf
the process inserts the datasets twice. Instead of 101, 202 datasets appear.

I've also tried with this one: http://datos.gijon.es/set.rdf
and in that case it works OK.

I think the problem is some kind of redirect in Madrid's case. Would it be possible to handle these cases?

Thanks in advance!

amercader added a commit that referenced this issue Jun 11, 2014
To check whether the remote server supported pagination we were only checking
whether the content of a request was the same as the previous one.
This is quite fragile, as some fields may be updated on each request,
e.g. the modified date for real-time data.

We are now checking if the guids from a request are the same as the
previous ones, which should be more reliable.
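The idea in the commit can be sketched roughly as follows. This is an illustrative simplification, not the actual ckanext-dcat code; the function and parameter names are hypothetical:

```python
# Sketch: detect the end of pagination by comparing the guids of each
# page with the previous page's guids, rather than comparing raw content.
# Raw content can differ between identical pages because fields like
# dct:modified are updated on every request.

def harvest_all_pages(fetch_page, extract_guids):
    """Fetch numbered pages until the server starts repeating itself.

    fetch_page(n)    -> raw content of page n, or None when exhausted
    extract_guids(c) -> ordered list of dataset guids found in content c
    """
    seen_pages = []  # guids of each page, in order
    page = 1
    while True:
        content = fetch_page(page)
        if content is None:
            break
        guids = extract_guids(content)
        if not guids:
            break
        # A server that ignores the page parameter returns the same
        # datasets again: same guids means we are done, even if the
        # raw bytes of the response changed.
        if seen_pages and guids == seen_pages[-1]:
            break
        seen_pages.append(guids)
        page += 1
    return [g for page_guids in seen_pages for g in page_guids]
```

With a server that ignores the page parameter (as in the Madrid case), the loop stops after the first repeated page, yielding each dataset once instead of twice.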
@amercader (Member)

@montxo5 That was caused by the harvesters not being careful when checking whether two requests had the same contents (the check used to decide if the remote server supported pagination).

In Madrid's case, there are some real-time datasets whose timestamp is updated on each request:

<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-06-11T03:05:31</dct:modified>

Can you update your sources and check if you only get 101 records?


montxo5 commented Jun 13, 2014

Thanks for the reply. Sorry, but I don't understand what you mean by updating my sources.
Do you mean reharvesting?

@amercader (Member)

I meant running git pull to update the ckanext-dcat source and then reharvesting.
Let me know how it goes.


montxo5 commented Jun 16, 2014

I've updated ckanext-dcat with git pull and it is still duplicating datasets.
I've also tried uninstalling and reinstalling ckanext-dcat and restarting, but it still fails.

@amercader (Member)

Did you restart the two harvester consumers? Press Ctrl+C and relaunch them if you run them directly in the terminal, or run sudo supervisorctl restart all if you use Supervisor in production.

montxo5 closed this as completed Jun 17, 2014

montxo5 commented Jun 17, 2014

You were right, I forgot to restart the consumers... Thanks! It works perfectly now!

@amercader (Member)

Glad you got it working! :)
