Feature: Download notes (replies, likes, etc.) #169

Open
humanitiesclinic opened this issue Dec 11, 2018 · 4 comments

@humanitiesclinic

I refer to post #98. What is the status of this? Are Notes/Comments now downloadable by the backup script? I tried, but I don't see any Notes/Comments downloaded as of now.

Is this what the --likes command is for?

@cebtenzzre
Collaborator

cebtenzzre commented Dec 11, 2018

If you add notes_info to the API parameters, the response for every downloaded post includes a short list of:

  • People who liked the post
  • People who reblogged the post, with a post ID for each
  • People who replied to the post, with the message they replied with

This is currently only useful with -j, which dumps the API response for each post into a JSON file. (For now, you'll have to read the JSON to see this information.)

The following patch will add this functionality:

diff --git a/tumblr_backup.py b/tumblr_backup.py
index 338c7d3..220dd25 100755
--- a/tumblr_backup.py
+++ b/tumblr_backup.py
@@ -195,7 +195,7 @@ def set_period():
 
 
 def apiparse(base, count, start=0):
-    params = {'api_key': API_KEY, 'limit': count, 'reblog_info': 'true'}
+    params = {'api_key': API_KEY, 'limit': count, 'reblog_info': 'true', 'notes_info': 'true'}
     if start > 0:
         params['offset'] = start
     url = base + '?' + urllib.urlencode(params)
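
As a rough illustration of reading that information back out of a dumped post, here is a minimal sketch. It assumes each post dict carries a notes array whose entries have type, blog_name, post_id (for reblogs) and reply_text (for replies); those field names are assumptions based on the list above, and summarize_notes and sample_post are hypothetical names, not part of tumblr_backup.py.

```python
from collections import defaultdict


def summarize_notes(post):
    """Group the notes of one API post dict by note type.

    Assumed (not confirmed in this thread): each note has a 'type' and a
    'blog_name'; reblogs carry 'post_id' and replies carry 'reply_text'.
    """
    grouped = defaultdict(list)
    for note in post.get('notes', []):
        kind = note.get('type', 'unknown')
        entry = {'blog': note.get('blog_name')}
        if kind == 'reblog':
            entry['post_id'] = note.get('post_id')
        elif kind == 'reply':
            entry['text'] = note.get('reply_text')
        grouped[kind].append(entry)
    return dict(grouped)


# Hypothetical sample data standing in for one post from a -j JSON dump.
sample_post = {
    'id': 1,
    'notes': [
        {'type': 'like', 'blog_name': 'alice'},
        {'type': 'reblog', 'blog_name': 'bob', 'post_id': '123'},
        {'type': 'reply', 'blog_name': 'carol', 'reply_text': 'Nice!'},
    ],
}
summary = summarize_notes(sample_post)
```

To use it on a real dump, load the JSON file with json.load and pass the resulting dict in place of sample_post.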

@wertercatt

@cebtenzzre This patch does not download all the notes, because the Tumblr API returns at most 50 of them. https://stackoverflow.com/a/14428010 should help, but you'll need to scrape the /notes/ URL out of the rendered post HTML, and then scrape the paginated URLs out of the /notes/ pages to get each following page of notes.
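
One way to tell which posts are affected by that limit is to compare the number of notes returned with the post's reported total. This is only a sketch: it assumes the post dict has a note_count field alongside the notes array, and notes_truncated is a hypothetical helper name.

```python
def notes_truncated(post):
    """Return True if the API reported more notes than it returned.

    Assumes 'note_count' holds the post's total and 'notes' holds the
    (possibly capped) list returned when notes_info is requested.
    """
    reported = post.get('note_count', 0)
    returned = len(post.get('notes', []))
    return reported > returned


# Hypothetical posts: one fully covered, one cut off at 50 notes.
small_post = {'note_count': 2, 'notes': [{'type': 'like'}, {'type': 'like'}]}
big_post = {'note_count': 120, 'notes': [{'type': 'like'}] * 50}
```

Only posts for which this returns True would need the slower /notes/ scraping described above.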

@cebtenzzre
Collaborator

cebtenzzre commented Dec 14, 2018

import dryscrape
import re
from bs4 import BeautifulSoup

def get_more_link(sess, base, url):
    sess.visit(url)
    soup = BeautifulSoup(sess.body(), 'lxml')
    element = soup.find('a', class_='more_notes_link')
    if not element:
        return None
    onclick = element.get_attribute_list('onclick')[0]
    return base + re.search(r";tumblrReq\.open\('GET','([^']+)'", onclick).groups()[0]

base = 'https://uri-hyukkie.tumblr.com'
url = base + '/post/61181809095'
session = dryscrape.Session()

while True:
    url = get_more_link(session, base, url)
    if not url:
        break
    print url
    session.visit(url)
    soup = BeautifulSoup(session.body(), 'lxml')
    notes = soup.find('ol', class_='notes').find_all('li')[:-1]
    for n in notes:
        print n.prettify()

There's a proof-of-concept script to scrape the notes from a post that was linked in another StackOverflow answer by unor. Any remarks before I try to integrate it into tumblr-utils? (I'm technically still learning this language...)

EDIT: Yes, I realize that there are minor issues here, and that I'm doing duplicate work. I'm fixing that in the version I'm working on.
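
For reference, the duplicate visits can be avoided by fetching each page once and reusing its HTML both for the notes and for the next link. The following is a sketch only, not the version that ended up in the PR: it takes a fetch callable (hypothetical; it could wrap a dryscrape session) instead of visiting pages itself, uses a plain regex instead of BeautifulSoup, and runs against made-up pages under a placeholder example.tumblr.com base.

```python
import re

# Same idea as the regex in the script above: pull the URL out of the
# onclick handler of the "more notes" link.
MORE_LINK_RE = re.compile(r"tumblrReq\.open\('GET','([^']+)'")


def find_more_link(html, base):
    """Return the absolute URL of the next notes page, or None."""
    if 'more_notes_link' not in html:
        return None
    match = MORE_LINK_RE.search(html)
    return base + match.group(1) if match else None


def iter_notes_pages(fetch, base, post_url):
    """Yield (url, html) for each notes page, fetching every page once."""
    url = find_more_link(fetch(post_url), base)
    seen = set()  # guards against a page that links back to itself
    while url and url not in seen:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_more_link(html, base)


# Made-up pages standing in for real HTTP responses.
BASE = 'https://example.tumblr.com'
fake_site = {
    BASE + '/post/1': """<a class="more_notes_link" onclick="tumblrReq.open('GET','/notes/1/abc',true);">more</a>""",
    BASE + '/notes/1/abc': """<ol class="notes"><li>like</li></ol><a class="more_notes_link" onclick="tumblrReq.open('GET','/notes/1/abc?from_c=5',true);">more</a>""",
    BASE + '/notes/1/abc?from_c=5': """<ol class="notes"><li>reblog</li></ol>""",
}
pages = list(iter_notes_pages(fake_site.__getitem__, BASE, BASE + '/post/1'))
```

Passing the fetcher in as a parameter keeps the pagination logic testable without a browser session; parsing the individual notes out of each page's HTML is left out here.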

@cebtenzzre cebtenzzre self-assigned this Dec 14, 2018
@cebtenzzre cebtenzzre changed the title from "Do Notes/Comments Get Downloaded As Of The Latest Version?" to "Feature: Download notes (replies, likes, etc.)" Dec 14, 2018
@cebtenzzre
Collaborator

I've made a PR for this (#189).

cebtenzzre added a commit to cebtenzzre/tumblr-utils that referenced this issue Sep 24, 2020
Included revisions:
- Remove log_queue, better status and account logic
- Better tracking and synchronization on ThreadPool.queue.qsize
- Remove remaining_posts

Fixes bbolli#169
cebtenzzre added a commit to cebtenzzre/tumblr-utils that referenced this issue Oct 2, 2020
Included revisions:
- Remove log_queue, better status and account logic
- Better tracking and synchronization on ThreadPool.queue.qsize
- Remove remaining_posts
- Remove getting_tup
- Put back the account parameter
- Make typing optional

Fixes bbolli#169
cebtenzzre added a commit to cebtenzzre/tumblr-utils that referenced this issue Nov 25, 2020
Included revisions:
- Remove log_queue, better status and account logic
- Better tracking and synchronization on ThreadPool.queue.qsize
- Remove remaining_posts
- Remove getting_tup
- Put back the account parameter
- Make typing optional

Fixes bbolli#169
cebtenzzre added a commit to cebtenzzre/tumblr-utils that referenced this issue Jan 17, 2021
Included revisions:
- Remove log_queue, better status and account logic
- Better tracking and synchronization on ThreadPool.queue.qsize
- Remove remaining_posts
- Remove getting_tup
- Put back the account parameter

Fixes bbolli#169