This repository has been archived by the owner on Feb 27, 2021. It is now read-only.

Reduce import/merging errors #557

Open
nemobis wants to merge 4 commits into master

Conversation

nemobis
Member

@nemobis nemobis commented Jan 17, 2019

Some papers were skipped or overzealous clusters were created.

See issue #512

We already imported a lot of papers with bogus author names such as
"&NA;", and there are some 17 million lines with a null z_author in
the latest Unpaywall dump (2018-09-24).
* Don't merge more than 10 papers together.
* Always consider the year in comparisons, full date if available.
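
To make the two rules above concrete, here is a minimal sketch, not the code in this PR. `Record`, `same_date`, `may_merge` and `MAX_CLUSTER_SIZE` are hypothetical names, and requiring an identical fingerprint is an assumption about how candidates are compared.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical constant mirroring "don't merge more than 10 papers together".
MAX_CLUSTER_SIZE = 10


@dataclass
class Record:
    """Hypothetical stand-in for an imported record, not dissemin's model."""
    fingerprint: str
    year: int
    pubdate: Optional[date] = None  # full date, when the source provides one


def same_date(a: Record, b: Record) -> bool:
    """Compare full dates if both records have one, otherwise only the years."""
    if a.pubdate is not None and b.pubdate is not None:
        return a.pubdate == b.pubdate
    return a.year == b.year


def may_merge(cluster: list, candidate: Record) -> bool:
    """Return True if candidate may join cluster under the two rules."""
    if len(cluster) >= MAX_CLUSTER_SIZE:
        return False  # the cluster is already at the cap
    return all(
        r.fingerprint == candidate.fingerprint and same_date(r, candidate)
        for r in cluster
    )
```

With this shape, a candidate with the same fingerprint but a different year is rejected, and so is any candidate once the cluster already holds 10 records.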

dissemin#512
@nemobis
Member Author

nemobis commented Jan 17, 2019

Didn't test yet!

@@ -405,7 +405,8 @@ def save_doi_metadata(self, metadata, extra_orcids=None):
    if metadata is None or type(metadata) != dict:
        raise ValueError('Invalid metadata format, expecting a dict')
    if not metadata.get('author'):
        raise ValueError('No author provided')
        # BareName has "last" as mandatory field
        metadata['author'] = {'family': 'N/A'}

This code is never going to be executed because it is just after an exception is raised.
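
A minimal sketch of one way to make the fallback reachable, assuming the intent was to accept records without authors rather than reject them. `normalize_author` is a hypothetical helper, not a function in the codebase; the placeholder value is copied from the hunk above.

```python
def normalize_author(metadata):
    """Hypothetical helper: validate metadata and fill a placeholder author."""
    if not isinstance(metadata, dict):
        raise ValueError('Invalid metadata format, expecting a dict')
    if not metadata.get('author'):
        # The raise is dropped here; statements placed after a raise in the
        # same block can never run, so the fallback has to replace it.
        metadata['author'] = {'family': 'N/A'}
    return metadata
```

If rejecting author-less records is still the desired behavior, the `raise` stays and the two added lines can simply be removed.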

@wetneb
Member

wetneb commented Jan 17, 2019

I think this would need to be motivated by a careful analysis of the existing conflicts (taking dissemin papers with a lot of OAI records and understanding how we got into this situation).

@wetneb
Member

wetneb commented Jan 17, 2019

To be more precise: first, thanks for the PR, it's a very important issue.
Second, we should have test cases for this (many test cases demonstrating which papers get merged and do not get merged). This will be easier once the tests are fixed.
Third, we should think about the workflow for un-merging papers. If a paper has not been added to any profile and has not been uploaded, we can just delete it and re-import all its records independently; otherwise it seems a bit complicated to me.
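
The easy case described here could be sketched as follows. This is only an illustration under stated assumptions: `MergedPaper`, its fields, `can_split_safely` and `split` are hypothetical and do not correspond to dissemin's models.

```python
from dataclasses import dataclass, field


@dataclass
class MergedPaper:
    """Hypothetical stand-in for a merged paper; field names are assumptions."""
    records: list = field(default_factory=list)  # source records merged in
    profiles: set = field(default_factory=set)   # profiles listing the paper
    uploaded: bool = False                       # whether it was uploaded


def can_split_safely(paper: MergedPaper) -> bool:
    """True only if the paper is on no profile and was never uploaded."""
    return not paper.profiles and not paper.uploaded


def split(paper: MergedPaper) -> list:
    """Return one single-record paper per original record."""
    if not can_split_safely(paper):
        raise ValueError('Paper is referenced elsewhere; handle it manually')
    return [MergedPaper(records=[r]) for r in paper.records]
```

Papers that fail the check are the complicated case, which this sketch deliberately leaves to manual handling.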

Also: I think there are cases where records were merged even though their fingerprints were completely different. I want to investigate this (maybe a shared incorrect DOI), but I don't have an example at hand at the moment.

@wetneb wetneb closed this Feb 3, 2019
@wetneb wetneb reopened this Feb 3, 2019
@wetneb
Member

wetneb commented Feb 3, 2019

@nemobis Travis is functional again, so the build log should help you find bugs in the PR :)

@nemobis
Member Author

nemobis commented Feb 5, 2019 via email
