Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional article properties #55

Merged

Conversation

takerukoushirou
Copy link
Contributor

Several new on-demand getters for fields that I encountered in articles. I guess there are many more to be added. A different approach could maybe use __getattr__ instead of explicitly defining each field.

One of the newly supported fields, link_data, is a list of strings that contain JSON. I added a value processor to automatically parse those accordingly during the Article constructor (and hence implicitly for the getter).
There are other fields for which additional parsing may be useful, notably all date-related fields that are RFC 3339 encoded and could easily be processed using dateutil.parse.

@vsudilov
Copy link
Collaborator

vsudilov commented May 9, 2016

iirc some of these fields were intentionally omitted due to containing either misleading, redundant, or useless info. @jonnybazookatone could you confirm that that's still the case? If it is, maybe another ticket for adsabs/adsws?

@jonnybazookatone
Copy link
Collaborator

Yes, some are confusing and for internal use, such as classic_factor and cite_read_boost, that without any context are completely misleading/not useful (imo) to the user.

I'll check which are relevant and get back to you.

@takerukoushirou
Copy link
Contributor Author

@vsudilov I'll gladly remove useless fields and instead leave a comment on why those were omitted for future reference.

@jonnybazookatone could you also check if there are additional fields not included yet that may be useful?

Regarding redundant fields, I noticed that the id and recid so far seem to represent the same value, with the difference that the latter is returned as an integer instead of a string.

@vsudilov, any specific reason that the reference property queries references(id:{}) instead of the field reference? I didn't notice any difference in the outcome. Similar with citation, but in my tests the field citation contains more citations than the query citations(id:{}). Querying a single field is most likely faster.

@vsudilov
Copy link
Collaborator

vsudilov commented May 9, 2016

Ideally any fields that shouldn't be public facing would be hidden upstream at adsabs/adsws.

From my recollection, citations() is a built-in operator that dynamically handles resolving the reference->citation relationship between documents that match the inner query -- in this case the inner query is a single document, but it could be anything in general. Querying the citation: field even for a single document shouldn't be done, since that would return the citations at a snapshot in time and not the real-time calculation.

That being said, I'm not sure how citation: should ever return more results than citations(). Actually, there would be no need at all for the citation: field if what I wrote is correct, so perhaps I'm missing something.

The document's reference field should be OK to query directly and probably reverted back in the code AFAIK.

All of that is probably worth confirming with the ADS directly.

@takerukoushirou
Copy link
Contributor Author

That also seems the most logic action to me and would also reduce bandwidth, as all those redundant/private fields are currently returned with a catch-all query for fields fl=*.

Regarding the citations, in further tests so far the list fetched via the citations() operator is a subset of the list retrieved via the citation field or the same in most cases; in at least one case so far the field yields more citations (article with identifier:2012Natur.486..502B).

@jonnybazookatone
Copy link
Collaborator

Sorry for the delay, was double checking on our side. The citation and reference fields should only include resolved (could compute a bibcode), and verified (bibcode is in the system) bibcodes (in our legacy system).

The citations() and references() operators should return similar (if not the same) lists as the fields (normalised on bibcode), but will return only the documents that exist in Solr (the search engine), and are resolvable (ie., if you do q="bibcode:BIBCODE" from the resulting list). It's possible a document may not enter Solr, and as a result it's not returned by citations()/references(), but would be returned via reference/citation. This type of document would then fail on any look up bibcode:BIBCODE, etc.

For self-consistency, and to remove some confusion for the user (as also the UI is using the operators), we're currently encouraging the use of the operators rather than the fields, and the fields may also be removed from the response in the future.

I'm working on the list of fields that may be relevant, and I'll post it here when it's complete.

@jonnybazookatone
Copy link
Collaborator

I would remove:

  • author_norm
  • body
  • cite_read_boost
  • classic_factor
  • date (date type of pubdate)
  • eid
  • email
  • first_author (there's already first_author_norm being used)
  • keyword_facet
  • keyword_norm
  • keyword_schema
  • links_data
  • pub_raw (duplicates pub)
  • reader
  • recid (int version of id)
  • simbad_object_facet_hier
  • simbid
  • simbtype

Then you can also remove the changes for _processor, as we shouldn't need it for the remaining fields. This leaves the fields you added:

  • alternate_bibcode
  • arxiv_class
  • bibcode
  • bibgroup
  • copyright
  • data
  • doi
  • doctype
  • indexstamp

Plus some others that might be useful:

  • ack (Acknowledgements)
  • alternate_title (Alternate titles, e.g., if the original is not English)
  • lang (Language of the title if it is not English)
  • vizier (Keywords given by VizieR)

There are several _facet and _facet_hier field parameters that can be used for client side applications. I'm not sure it's so relevant for them to be added here. We have a few others that we have started populating, which I'll leave out for now, but make a PR when they're ready.

@takerukoushirou
Copy link
Contributor Author

I updated the article properties accordingly and looked into the failing check.

The test fails for Python 3 only because of Python 2 and 3 differences in the hasattr() function used in the equality operator for articles. Since bibcode is now a deferred property, it will raise an APIResponseError as it cannot dynamically fetch the bibcode property of the LHS article in test_equals() as that article is empty and hence has no ID set. This exception is, in contrast to Python 2, not caught by Python 3's hasattr() and propagates. The test expects a TypeError raised by the equality operator though.

I think the equality operator for articles should be fixed to behave identical for both Python 2 and 3.

@takerukoushirou
Copy link
Contributor Author

I found another issue: In the article method __unicode__(), the first_author property is used. This will raise an exception if the field was not requested in the original query and a deferred first_author property doesn't exist.
This could be fixed by either using first_author_norm, adding the deferred property again (probably the most compatible fix, as first_author is also in the default list of fields used if fl isn't defined) or ensuring that it's always in the list of requested fields.

@takerukoushirou takerukoushirou changed the title Additional article properties & value processor Additional article properties May 17, 2016
@jonnybazookatone
Copy link
Collaborator

Yeah, I agree. I was having a think, how about something like:

def __eq__(self, other):
    try:
        this = getattr(self, 'bibcode', None)
        other = getattr(other, 'bibcode', None)
    except APIResponseError:
        raise TypeError("Cannot compare articles without ids")
    else:
        if this is None or other is None:
            raise TypeError("Cannot compare articles without bibcodes")
        return this == other

Or the getter could be modified, but that seemed overkill to me.

As per __unicode__, how about something like:

def __unicode__(self):
    try:
        first_author = getattr(self, 'first_author', 'Unknown author')
        author = getattr(self, 'author', None)

        if author is not None and len(author) > 1:
           first_author += " et al."

        return u"<{author} {year}, {bibcode}>".format(
            author=first_author,
            year=self.year,
            bibcode=self.bibcode,
        )
    except APIResponseError:
        return self.__repr__()

This is if we want to keep the current behaviour of raising an APIError when there is no id defined in the Article instance.

Feels a bit like celetaping the problem though, anyone else got better ideas?

@takerukoushirou
Copy link
Contributor Author

I'm in favour of the proposed equality operator modification, I also had something very similar in mind. Changing the getter would IMHO introduce an inconsistency with the other getters and is harder to maintain...

Regarding __unicode__(), I wonder if we generally should define an article without an id as a virtual/offline article and also probably ensure that id always exists as a property with a default value of None for consistency. If id is None, then _get_field() should return None instead of raising an exception. id is also AFAIK the only field that is always included in the search query fields list, regardless of what is requested by the user. That way we would get a more consistent behaviour regardless of whether an article is fetched or manually created (as in the test case).

Additionally, I think first_author should be added again as a deferred property. Even if first_author_norm is probably the field used by most users, __unicode__() is a good example of a method that relies on it, and it's also in the list of default fields being queried.

Finally, __unicode__() should arguably still be extended to handle exceptions, as author, bibcode and year might be fetched by a getter and an API error may occur. In that case, a string representation could still be returned, using placeholders for consistency though. We could even consider to ensure that those fields are also always queried via a search as i.e. the equality operator relies on bibcode, and those fields could be seen as "core attributes" (and hence reduce the number of additional queries done the user might not even be aware of).

@vsudilov
Copy link
Collaborator

Good discussions, here's my preference:

Fix the tests by explicitly defining bibcode.
Merge this pr
Open issue regarding Unicode, and bibcode, referencing this pr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants