is it possible to get the entire article (as rendered by Pocket), not just the 'excerpt' ? #6

Closed
m040601 opened this issue Oct 9, 2015 · 10 comments

Comments

@m040601

m040601 commented Oct 9, 2015

I know that, for example,
to get the latest 5 items' links & excerpts and save them to a file,
pockyt get -n 5 -f '{link} - {excerpt}' -o readlater.txt
works.

Is it also possible to get the entire article, as it is displayed and rendered on the Pocket website?
I mean just the extracted text, as stored by Pocket.
I don't want to download from the original server and extract the text on my computer again.

@achembarpu
Owner

Article Content API - Unfortunately, Pocket does not provide extracted article content to API users without partner privileges.

I'm open to other ideas though. Maybe use a custom extraction method, via BeautifulSoup, or something?
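
For illustration, a minimal sketch of what such a custom extraction step could look like. It uses only Python's standard-library html.parser as a stand-in for BeautifulSoup, so nothing extra needs installing. The names TextExtractor and extract_text, and the list of skipped tags, are assumptions made up for this example; none of this is pockyt code, and real article extraction needs far better heuristics than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside the SKIP containers."""

    # Hypothetical choice of tags that rarely hold article text.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped container
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the non-boilerplate text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text(page_html) on a downloaded page returns the remaining text chunks joined by newlines. It does no scoring of "main content" the way Pocket's own extraction presumably does.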

@m040601
Author

m040601 commented Oct 12, 2015

Thanks for your attention to this detail!
I see what you mean about the API issue; it makes sense.

But I'm still confused about how there seem to be other ways to get the 'whole article' text directly from Pocket. For example, with calibre, http://calibre-ebook.com , and its Python 'news recipe' script called 'readitlater.recipe' (1)

I'm no Python expert; I can barely code some shell scripts and grasp a little bit of Python.
I was wondering, then,
how it is that, using that script and calibre's command-line tool 'ebook-convert', http://manual.calibre-ebook.com/cli/ebook-convert.html , I do get the entire text of my Pocket articles.

When I use it like this, for example,
ebook-convert ./readitlater.recipe outputfile.txt --username my-pocket@username.com --password my-pocket-account-password
or
ebook-convert ./readitlater.recipe output.OEB --username my-pocket@username.com --password my-pocket-account-password

I can get either a text file or a bunch of HTML files,
with all my articles exactly as they are rendered by Pocket.

(1)
a. The script as it is distributed when you install calibre (it works for me):
https://gist.github.com/m040601/a4258870759f9ad8a6ee
b. Another fork of the same script (it was not working for me):
https://github.com/tbunnyman/ReadItLater-Calibre-Plugin
Its description: "This is an updated & modified version of the official Calibre plugin for Pocket (Formerly ReadItLater)"

@achembarpu
Owner

Interesting. I'll check this out and think of a possible lightweight implementation.

Do you have the time to work on this, by any chance?

@m040601
Author

m040601 commented Oct 17, 2015

Cool! Thanks for your interest.

"Do you have the time to work on this, by any chance?"

Time, yes; unfortunately, not the skills to do it.
The only thing I can contribute is research and feedback, as I like to thoroughly investigate and
compare all the available solutions and implementations (Python and others) for this problem.

@achembarpu
Owner

Newspaper seems to provide Pocket-like functionality.

If this seems like a good enough alternative, I'm willing to integrate it. Thoughts?

EDIT: Actually, the PyPI distribution of newspaper is outdated and depends on a lot of heavy libraries - see requirements.

Instead, a better alternative seems to be readability-lxml. Significantly lighter and simpler to use.

@achembarpu
Owner

I'm hacking away on this right now. Let's see how it goes.

EDIT: See #7.

@achembarpu
Owner

Oops, almost forgot. The reasons I'm not considering the scripts you linked to are:

  • They scrape getpocket.com directly, which is forbidden by Pocket's ToS.
  • Since it's a scrape, it will fail the moment Pocket changes its HTML.

However, if this solution isn't good enough, I might reconsider.

@achembarpu achembarpu self-assigned this Oct 25, 2015
@achembarpu achembarpu added Epic and removed Epic labels Jun 22, 2017
@achembarpu achembarpu removed the Epic label Jul 18, 2017
@achembarpu achembarpu removed their assignment Nov 22, 2019
@billlyzhaoyh

Is there any hack for this? I am going back through a historic collection of articles, and what I have found is that the articles have been taken down by the news sites... I would think even saving the HTML response of the article at that time and storing it in a DB would help tremendously.
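
To make the idea concrete, here is a rough sketch of such a snapshot store, assuming SQLite via Python's standard-library sqlite3 module. The table name snapshots and the helpers open_archive, save_snapshot and load_snapshot are hypothetical names invented for this example and are not part of pockyt. Fetching the page itself is left out.

```python
import sqlite3
from datetime import datetime, timezone


def open_archive(path=":memory:"):
    """Open (or create) the archive database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " url TEXT PRIMARY KEY,"
        " fetched_at TEXT NOT NULL,"
        " html TEXT NOT NULL)"
    )
    return conn


def save_snapshot(conn, url, html):
    """Store the page once; a later save never overwrites the first copy."""
    fetched_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO snapshots (url, fetched_at, html)"
            " VALUES (?, ?, ?)",
            (url, fetched_at, html),
        )


def load_snapshot(conn, url):
    """Return the archived HTML for url, or None if it was never saved."""
    row = conn.execute(
        "SELECT html FROM snapshots WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```

Keeping the first copy (INSERT OR IGNORE) is a deliberate choice here: if the site later replaces the article with an error page, a re-save does not clobber the good snapshot.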

@achembarpu
Owner

achembarpu commented Mar 27, 2020

@billlyzhaoyh - Good use-case, I had a bit of time to hack on this today. Managed to get HTML archiving working in 1.4.0.

eg - Get all favorited items and save offline copies of them:
pockyt get -v 1 -a ./pocket

Let me know if it works for you.

@achembarpu
Owner

Closing as stale.

3 participants