Additional article properties #55

takerukoushirou · 2016-05-09T12:31:48Z

Several new on-demand getters for fields that I encountered in articles. I guess there are many more to be added. A different approach could maybe use __getattr__ instead of explicitly defining each field.

One of the newly supported fields, link_data, is a list of strings that contain JSON. I added a value processor to automatically parse those accordingly during the Article constructor (and hence implicitly for the getter).
There are other fields for which additional parsing may be useful, notably all date-related fields that are RFC 3339 encoded and could easily be processed using dateutil.parse.

…nd the document type

…r link data

vsudilov · 2016-05-09T12:35:47Z

iirc some of these fields were intentionally omitted due to containing either misleading, redundant, or useless info. @jonnybazookatone could you confirm that that's still the case? If it is, maybe another ticket for adsabs/adsws?

jonnybazookatone · 2016-05-09T12:56:04Z

Yes, some are confusing and for internal use, such as classic_factor and cite_read_boost, that without any context are completely misleading/not useful (imo) to the user.

I'll check which are relevant and get back to you.

takerukoushirou · 2016-05-09T13:38:48Z

@vsudilov I'll gladly remove useless fields and instead leave a comment on why those were omitted for future reference.

@jonnybazookatone could you also check if there are additional fields not included yet that may be useful?

Regarding redundant fields, I noticed that the id and recid so far seem to represent the same value, with the difference that the latter is returned as an integer instead of a string.

@vsudilov, any specific reason that the reference property queries references(id:{}) instead of the field reference? I didn't notice any difference in the outcome. Similar with citation, but in my tests the field citation contains more citations than the query citations(id:{}). Querying a single field is most likely faster.

vsudilov · 2016-05-09T13:57:47Z

Ideally any fields that shouldn't be public facing would be hidden upstream at adsabs/adsws.

From my recollection, citations() is a built-in operator that dynamically handles resolving the reference->citation relationship between documents that match the inner query -- in this case the inner query is a single document, but it could be anything in general. Querying the citation: field even for a single document shouldn't be done, since that would return the citations at a snapshot in time and not the real-time calculation.

That being said, I'm not sure how citation: should ever return more results than citations(). Actually, there would be no need at all for the citation: field if what I wrote is correct, so perhaps I'm missing something.

The document's reference field should be OK to query directly and probably reverted back in the code AFAIK.

All of that is probably worth confirming with the ADS directly.

takerukoushirou · 2016-05-09T14:22:39Z

That also seems the most logic action to me and would also reduce bandwidth, as all those redundant/private fields are currently returned with a catch-all query for fields fl=*.

Regarding the citations, in further tests so far the list fetched via the citations() operator is a subset of the list retrieved via the citation field or the same in most cases; in at least one case so far the field yields more citations (article with identifier:2012Natur.486..502B).

jonnybazookatone · 2016-05-10T11:44:30Z

Sorry for the delay, was double checking on our side. The citation and reference fields should only include resolved (could compute a bibcode), and verified (bibcode is in the system) bibcodes (in our legacy system).

The citations() and references() operators should return similar (if not the same) lists as the fields (normalised on bibcode), but will return only the documents that exist in Solr (the search engine), and are resolvable (ie., if you do q="bibcode:BIBCODE" from the resulting list). It's possible a document may not enter Solr, and as a result it's not returned by citations()/references(), but would be returned via reference/citation. This type of document would then fail on any look up bibcode:BIBCODE, etc.

For self-consistency, and to remove some confusion for the user (as also the UI is using the operators), we're currently encouraging the use of the operators rather than the fields, and the fields may also be removed from the response in the future.

I'm working on the list of fields that may be relevant, and I'll post it here when it's complete.

jonnybazookatone · 2016-05-12T13:37:43Z

I would remove:

author_norm
body
cite_read_boost
classic_factor
date (date type of pubdate)
eid
email
first_author (there's already first_author_norm being used)
keyword_facet
keyword_norm
keyword_schema
links_data
pub_raw (duplicates pub)
reader
recid (int version of id)
simbad_object_facet_hier
simbid
simbtype

Then you can also remove the changes for _processor, as we shouldn't need it for the remaining fields. This leaves the fields you added:

alternate_bibcode
arxiv_class
bibcode
bibgroup
copyright
data
doi
doctype
indexstamp

Plus some others that might be useful:

ack (Acknowledgements)
alternate_title (Alternate titles, e.g., if the original is not English)
lang (Language of the title if it is not English)
vizier (Keywords given by VizieR)

There are several _facet and _facet_hier field parameters that can be used for client side applications. I'm not sure it's so relevant for them to be added here. We have a few others that we have started populating, which I'll leave out for now, but make a PR when they're ready.

…lete value processors

takerukoushirou · 2016-05-17T13:13:55Z

I updated the article properties accordingly and looked into the failing check.

The test fails for Python 3 only because of Python 2 and 3 differences in the hasattr() function used in the equality operator for articles. Since bibcode is now a deferred property, it will raise an APIResponseError as it cannot dynamically fetch the bibcode property of the LHS article in test_equals() as that article is empty and hence has no ID set. This exception is, in contrast to Python 2, not caught by Python 3's hasattr() and propagates. The test expects a TypeError raised by the equality operator though.

I think the equality operator for articles should be fixed to behave identical for both Python 2 and 3.

takerukoushirou · 2016-05-17T13:40:41Z

I found another issue: In the article method __unicode__(), the first_author property is used. This will raise an exception if the field was not requested in the original query and a deferred first_author property doesn't exist.
This could be fixed by either using first_author_norm, adding the deferred property again (probably the most compatible fix, as first_author is also in the default list of fields used if fl isn't defined) or ensuring that it's always in the list of requested fields.

jonnybazookatone · 2016-05-17T16:16:43Z

Yeah, I agree. I was having a think, how about something like:

def __eq__(self, other):
    try:
        this = getattr(self, 'bibcode', None)
        other = getattr(other, 'bibcode', None)
    except APIResponseError:
        raise TypeError("Cannot compare articles without ids")
    else:
        if this is None or other is None:
            raise TypeError("Cannot compare articles without bibcodes")
        return this == other

Or the getter could be modified, but that seemed overkill to me.

As per __unicode__, how about something like:

def __unicode__(self):
    try:
        first_author = getattr(self, 'first_author', 'Unknown author')
        author = getattr(self, 'author', None)

        if author is not None and len(author) > 1:
           first_author += " et al."

        return u"<{author} {year}, {bibcode}>".format(
            author=first_author,
            year=self.year,
            bibcode=self.bibcode,
        )
    except APIResponseError:
        return self.__repr__()

This is if we want to keep the current behaviour of raising an APIError when there is no id defined in the Article instance.

Feels a bit like celetaping the problem though, anyone else got better ideas?

takerukoushirou · 2016-05-17T20:43:46Z

I'm in favour of the proposed equality operator modification, I also had something very similar in mind. Changing the getter would IMHO introduce an inconsistency with the other getters and is harder to maintain...

Regarding __unicode__(), I wonder if we generally should define an article without an id as a virtual/offline article and also probably ensure that id always exists as a property with a default value of None for consistency. If id is None, then _get_field() should return None instead of raising an exception. id is also AFAIK the only field that is always included in the search query fields list, regardless of what is requested by the user. That way we would get a more consistent behaviour regardless of whether an article is fetched or manually created (as in the test case).

Additionally, I think first_author should be added again as a deferred property. Even if first_author_norm is probably the field used by most users, __unicode__() is a good example of a method that relies on it, and it's also in the list of default fields being queried.

Finally, __unicode__() should arguably still be extended to handle exceptions, as author, bibcode and year might be fetched by a getter and an API error may occur. In that case, a string representation could still be returned, using placeholders for consistency though. We could even consider to ensure that those fields are also always queried via a search as i.e. the equality operator relies on bibcode, and those fields could be seen as "core attributes" (and hence reduce the number of additional queries done the user might not even be aware of).

vsudilov · 2016-05-17T21:25:38Z

Good discussions, here's my preference:

Fix the tests by explicitly defining bibcode.
Merge this pr
Open issue regarding Unicode, and bibcode, referencing this pr.

…e getter

takerukoushirou added 2 commits May 6, 2016 19:55

Added article properties for the bibgroup, data links and DOI lists a…

39cf9ef

…nd the document type

Added several missing article properties and a value pre-processor fo…

8375306

…r link data

takerukoushirou added 2 commits May 17, 2016 14:25

Removed internal or unused article properties, including the now obso…

7ef62dd

…lete value processors

Added ack, alternate_title, lang and vizier article properties

efa8668

takerukoushirou changed the title ~~Additional article properties & value processor~~ Additional article properties May 17, 2016

Fixed test by explicitly setting bibcode to None to avoid invoking th…

20a8bf6

…e getter

This was referenced May 18, 2016

Exception handling in Unicode representation of Articles #59

Closed

Article equality operator inconsistencies and bibcode getter #60

Closed

vsudilov merged commit 75a0d3c into andycasey:master May 18, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Additional article properties #55

Additional article properties #55

takerukoushirou commented May 9, 2016

vsudilov commented May 9, 2016

jonnybazookatone commented May 9, 2016

takerukoushirou commented May 9, 2016

vsudilov commented May 9, 2016

takerukoushirou commented May 9, 2016

jonnybazookatone commented May 10, 2016

jonnybazookatone commented May 12, 2016

takerukoushirou commented May 17, 2016

takerukoushirou commented May 17, 2016

jonnybazookatone commented May 17, 2016

takerukoushirou commented May 17, 2016

vsudilov commented May 17, 2016

Additional article properties #55

Additional article properties #55

Conversation

takerukoushirou commented May 9, 2016

vsudilov commented May 9, 2016

jonnybazookatone commented May 9, 2016

takerukoushirou commented May 9, 2016

vsudilov commented May 9, 2016

takerukoushirou commented May 9, 2016

jonnybazookatone commented May 10, 2016

jonnybazookatone commented May 12, 2016

takerukoushirou commented May 17, 2016

takerukoushirou commented May 17, 2016

jonnybazookatone commented May 17, 2016

takerukoushirou commented May 17, 2016

vsudilov commented May 17, 2016