Error on extracting papers from email \w 'Showing less relevant results' #76

Open
tombrainbox opened this issue Apr 11, 2022 · 1 comment
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@tombrainbox

I've noticed that any Scholar alert emails that have been configured with 'all results' rather than 'most relevant' result in an error when processed by this tool. This might be because each such email starts with:

"Showing less relevant results because there are no great results

Update alert to receive fewer, more relevant results"

Am I correct in this, and if so, would this be an easy fix to implement? Here is my command and its output (note this happens with JSON/HTML output as well as with just minimal flags):

go run main.go -l 'GScholar' -read -authors
2022/04/11 10:04:41 searching and fetching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 searching messages from Gmail: "label:GScholar is:unread"
2022/04/11 10:04:41 14 messages found (took 0 sec)
14 / 14 [-----------------------------------------------------] 100.00% ? p/s 1s
2022/04/11 10:04:42 14 messages fetched (took 0 sec)
2022/04/11 10:04:42 14 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 searching and fetching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 searching messages from Gmail: "label:GScholar is:read"
2022/04/11 10:04:42 1 messages found (took 0 sec)
1 / 1 [-------------------------------------------------------] 100.00% ? p/s 0s
2022/04/11 10:04:42 1 messages fetched (took 0 sec)
2022/04/11 10:04:42 1 messages found&fetched with (took 0 sec)
2022/04/11 10:04:42 rendering 2 papers
# Google Scholar Alert Digest

**Date**: 2022-04-11T10:04:42+01:00
**Unread emails**: 14
**Paper titles**: 2
**Uniq paper titles**: 2

## New papers

   
 - [Cerebellar Transcranial Magnetic Stimulation (TMS) Impairs Visual Working Memory](https://link.springer.com/article/10.1007/s12311-022-01396-2), <i>N Viñas</i> (1)
   <details>
     <summary>… As a precaution, the coil was positioned using the Brainsight navigator and the</summary>
     <div>experimenter monitored for potential deviation of the target, the “bullseye,” and maintained the coil position targeting the cerebellum targets if needed. Details of this …</div>
   </details>
   

   
 - [Short-term facilitation effects elicited by cortical priming through theta burst stimulation and functional electrical stimulation of upper-limb muscles](https://link.springer.com/article/10.1007/s00221-022-06353-3), <i>Update Alert To Receive Fewer, More Relevant Results</i> (1)
   <details>
     <summary>… The coil position and orientation were monitored throughout the experiment using a</summary>
     <div>neuronavigation system (Brainsight, Rogue Research, Montreal, Canada). Ten TMS stimuli, with approximately 5–7 s inter-stimulus intervals, were delivered for …</div>
   </details>
   

## Old papers

<details id="archive">
  <summary>Archive</summary>


</details>
2022/04/11 10:04:42 Errors: 13
@bzz bzz added the bug Something isn't working label Apr 15, 2022
@bzz bzz changed the title [bug] Scholar alerts 'all results' result in an error Scholar alerts 'all results' result in an error Apr 15, 2022
@bzz
Owner

bzz commented Apr 15, 2022

That seems like a bug, thank you for catching it, @tombrainbox!

This bug is caused by a different email HTML template, used in specific cases, that includes "Showing less relevant results because there are no great results". I was able to find such emails (only 7 out of ~2k 'all results' emails in my case) and reproduce the failure.

Such a template seems to include an extra "hidden" paper 🤯, a duplicate of the first one, which for some obscure reason our XPath library is not able to match //h3/a/@href against :/ This leads to an error

e := fmt.Errorf("titles %d != %d urls in %q", len(titles), len(urls), subj)

that results in skipping the whole email's content from the aggregation.

This is weird, since both the XPath browser extension and the default search in Chromium return the right number of titles and URLs for the same expressions!
So, most probably, this has to do with the logic in https://github.com/antchfx/htmlquery 😕 and a fix would require us to introduce some unit tests that would first reproduce it precisely without touching the Gmail API (example).

#79 has the instructions on localising this bug, and I will look into it more when time permits. Meanwhile, any attempt to dig deeper and report the results here, send a PR with a reproducing test, or share ideas on a possible heuristic for a workaround would be very much appreciated!
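
To make the "heuristic for a workaround" idea a bit more concrete, here is a minimal sketch of one possible approach. It assumes that the hidden duplicate shows up only as an extra entry in the list of titles (its URL is the part that fails to match), so that dropping repeated titles brings the two lists back in line. The names `paper`, `dedupTitles` and `pairPapers` are hypothetical and not part of the current code; only the error message is taken from the existing check. This has not been verified against the real emails.

```go
package main

import "fmt"

// paper pairs an extracted title with its URL.
// Hypothetical type, used only for this illustration.
type paper struct {
	Title string
	URL   string
}

// dedupTitles drops repeated titles, keeping the first occurrence
// of each one and preserving the original order.
func dedupTitles(titles []string) []string {
	seen := make(map[string]bool, len(titles))
	out := make([]string, 0, len(titles))
	for _, t := range titles {
		if seen[t] {
			continue
		}
		seen[t] = true
		out = append(out, t)
	}
	return out
}

// pairPapers zips titles with URLs. If the counts differ, it first
// retries with duplicate titles removed (assumption: the mismatch is
// caused by the hidden duplicate paper). If the counts still differ,
// it returns the same error as the existing check.
func pairPapers(titles, urls []string, subj string) ([]paper, error) {
	if len(titles) != len(urls) {
		titles = dedupTitles(titles)
	}
	if len(titles) != len(urls) {
		return nil, fmt.Errorf("titles %d != %d urls in %q", len(titles), len(urls), subj)
	}
	papers := make([]paper, len(titles))
	for i := range titles {
		papers[i] = paper{Title: titles[i], URL: urls[i]}
	}
	return papers, nil
}
```

The deduplication is applied only when the counts already disagree, so emails that parse fine today keep their current behaviour. One known limitation: if an email with mismatched counts also contains two genuinely different papers with identical titles, they would be merged.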

@bzz bzz added the help wanted Extra attention is needed label Apr 15, 2022
@bzz bzz changed the title Scholar alerts 'all results' result in an error Error on extracting papers from email \w 'Showing less relevant results' Apr 15, 2022