# Tumblr archiver

How to back up the pictures, GIFs, and videos **you** liked



## Getting Started
This guide is for Windows, Mac, and Linux users.

### Step 1: Register as Developer
1. Go to your Tumblr **settings** (found by clicking on the pencil symbol at the top right)
2. On the right side click on **apps**
3. At the bottom of the white box you can register for the Tumblr API. Click on it.
4. Add an application.
5. Fill out the form. What you enter does not matter.
6. Go back to **settings** and then to **apps**
7. You will see the same white box as in step 3, now with your app listed. You will need the **OAuth Consumer Key** and **OAuth Consumer Secret** later.

### Step 2: Install Python
1. The program we are going to run is a **Python file** (indicated by the `.py` extension). Python is a programming language and you need to install it in order to execute the file.

2. Go to the [Anaconda download page](https://www.anaconda.com/download/). Anaconda installs Python together with the Anaconda Prompt used in Step 4. We are downloading Python **3.7**. This program should work the same way on all operating systems.
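
Once the installation has finished, a quick check like the one below can confirm which Python version is active. This is only a minimal sketch: the helper name `python_is_recent_enough` is hypothetical, and the minimum version of 3.6 is an assumption based on this guide recommending Python 3.7.

```python
import sys


def python_is_recent_enough(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the interpreter is at least `minimum` (an assumed threshold)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    if python_is_recent_enough():
        print("This Python version should be fine.")
    else:
        print("Please install a newer Python 3.")
```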


### Step 3: Download tumblr_archiver.py

1. Download this file: [tumblr-archiver.zip](https://github.com/aauss/tumblr_archiver/zipball/master)

2. Unzip the file somewhere easy to find, e.g. in your Downloads folder. 


### Step 4: Use the Command Line

1. The command line is the bit of the computer which makes you feel like you're a hacker. 

2. Open the command line:
- **Windows**: Go to your system search and type in "Anaconda Prompt". Click it.
- **Mac and Linux**: Open the program Terminal

3. Your next step is to navigate to the folder that contains the file archive.py. The most important commands are explained below, but feel free to look things up elsewhere.

4. On the left-hand side of the screen is part of your current path. On Windows it shows `C:\Users\yourusername>` or just `C:\>`, followed by a blinky cursor. On Mac it shows `NameOfYourMac:~ yourusername$`.

5. Type `cd Downloads` and then press Enter. Your screen now reads `C:\Users\yourusername\Downloads>` or on Mac `NameOfYourMac:Downloads yourusername$`. "cd" stands for "change directory". You have gone one directory down! This is equivalent to double-clicking on the Downloads folder. If you go wrong, typing `cd ..` will go up one directory again (back to `C:\Users\yourusername>`).
- **Windows**: `dir` will give you the content of the folder you are in.
- **Mac and Linux**: use `ls` instead
6. Once you're done pretending to be a hacker, navigate to the folder the file archive.py is in. So:
`cd C:\Users\yourusername\Downloads\tumblr_archiver` OR `cd yourusername\Downloads\tumblr_archiver`
7. Execute `pip install -r requirements.txt`. This installs the additional Python packages the script needs.
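
If you are unsure whether the prompt is in the right folder, the same information that `dir` or `ls` gives can also be checked from Python. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and the expected file names are taken from the steps above.

```python
import os


def missing_files(folder, expected=("archive.py", "requirements.txt")):
    """Return the names from `expected` that are not present in `folder`."""
    present = set(os.listdir(folder))  # same listing that `dir` / `ls` prints
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    here = os.getcwd()  # the path shown in your prompt
    print("Current folder:", here)
    print("Missing files:", missing_files(here) or "none")
```

An empty result means both files are in the current folder and the `pip install` command can find `requirements.txt`.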


### Step 5: Run!

1. Plug in your laptop charger, make sure you have a stable internet connection, and check that the laptop won't automatically shut down, go to sleep, or start the screensaver. This program will run for a while and it's a faff to restart.

2. Where the blinky cursor is, type `python archive.py`. The first bit tells your computer to run Python, the second bit tells Python to run the archiver.

3. Your command prompt will start spitting fancy sentences onto the screen. Read it to understand what is currently happening! You can do other stuff while you wait, just leave the black command prompt box open and running.

### In Case of an Error
Since I wrote this script not long ago, there might still be some things I did not think about. Open an issue on [GitHub](https://github.com/aauss/tumblr_archiver/issues), which is also where you got the script from. If the error is caused by a weak internet connection or your computer suddenly turning off, restart the script as in Step 5.
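
Errors caused by a weak connection can sometimes be ridden out by retrying a failed download a few times before giving up. The following is a minimal sketch of that idea, not part of the script itself: `retry` and its parameters are hypothetical names chosen for illustration.

```python
import time


def retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `func()` until it succeeds or `attempts` tries are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as error:  # network errors from requests/urllib derive from OSError
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error
```

A download could then be wrapped as `retry(lambda: requests.get(url).content)`, assuming `requests` is imported and `url` is defined.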

In [1]:
#!/usr/bin/python
import pytumblr
import yaml
import os
import requests
from requests_oauthlib import OAuth1Session  # used by new_oauth()
import urllib.request
import re
import pickle 
import time
from datetime import datetime
from tqdm import tqdm

In [1]:
def new_oauth(yaml_path):
    '''
    Return the consumer and oauth tokens with three-legged OAuth process and
    save in a yaml file in the user's home directory.
    '''

    print('Retrieve consumer key and consumer secret from http://www.tumblr.com/oauth/apps')
    consumer_key = input('Paste the consumer key here: ')
    consumer_secret = input('Paste the consumer secret here: ')

    request_token_url = 'http://www.tumblr.com/oauth/request_token'
    authorize_url = 'http://www.tumblr.com/oauth/authorize'
    access_token_url = 'http://www.tumblr.com/oauth/access_token'

    # STEP 1: Obtain request token
    oauth_session = OAuth1Session(consumer_key, client_secret=consumer_secret)
    fetch_response = oauth_session.fetch_request_token(request_token_url)
    resource_owner_key = fetch_response.get('oauth_token')
    resource_owner_secret = fetch_response.get('oauth_token_secret')

    # STEP 2: Authorize URL + Response
    full_authorize_url = oauth_session.authorization_url(authorize_url)

    # Redirect to authentication page
    print('\nPlease go here and authorize:\n{}'.format(full_authorize_url))
    redirect_response = input('Allow then paste the full redirect URL here:\n')

    # Retrieve oauth verifier
    oauth_response = oauth_session.parse_authorization_response(redirect_response)

    verifier = oauth_response.get('oauth_verifier')

    # STEP 3: Request final access token
    oauth_session = OAuth1Session(
        consumer_key,
        client_secret=consumer_secret,
        resource_owner_key=resource_owner_key,
        resource_owner_secret=resource_owner_secret,
        verifier=verifier
    )
    oauth_tokens = oauth_session.fetch_access_token(access_token_url)

    tokens = {
        'consumer_key': consumer_key,
        'consumer_secret': consumer_secret,
        'oauth_token': oauth_tokens.get('oauth_token'),
        'oauth_token_secret': oauth_tokens.get('oauth_token_secret')
    }

    # Save the tokens so that the OAuth process only has to run once
    with open(yaml_path, 'w') as yaml_file:
        yaml.dump(tokens, yaml_file, indent=2)

    return tokens

In [2]:
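def get_token():
    '''Return an authenticated Tumblr client, running OAuth first if needed.'''
    yaml_path = os.path.expanduser('~') + '/.tumblr'
    if not os.path.exists(yaml_path):
        # No saved tokens yet: run the OAuth process and store them
        tokens = new_oauth(yaml_path)
    else:
        with open(yaml_path, "r") as yaml_file:
            tokens = yaml.safe_load(yaml_file)
    # Use the tokens to create the client
    client = pytumblr.TumblrRestClient(
        tokens['consumer_key'],
        tokens['consumer_secret'],
        tokens['oauth_token'],
        tokens['oauth_token_secret'])
    return client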
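
Before the tokens are handed to the client, it can help to check that the saved file actually contains all four values, since a missing key otherwise only shows up as a `KeyError`. A minimal sketch, assuming the key names used in `new_oauth` above; `missing_token_keys` is a hypothetical helper.

```python
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "oauth_token", "oauth_token_secret")


def missing_token_keys(tokens):
    """Return the required keys that are absent or empty in `tokens`."""
    if not isinstance(tokens, dict):  # e.g. an empty YAML file loads as None
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not tokens.get(key)]
```

If the returned list is not empty, deleting the token file and running the OAuth process again is probably the simplest fix.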

In [3]:
client = get_token()

In [4]:
# Get the total number of likes
amount_likes = client.likes()["liked_count"]
assert amount_likes > 0, "You don't seem to have any content to download"
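
The like count can also be used to estimate how many API requests the download will need, rather than fixing the number in advance. The sketch below assumes you know how many posts one request actually returns; `calls_needed` and `batch_size` are hypothetical names.

```python
import math


def calls_needed(liked_count, batch_size):
    """Return how many requests are needed to fetch `liked_count` posts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(liked_count / batch_size)
```

For instance, 41 likes fetched in batches of 20 need `calls_needed(41, 20)`, which is 3 requests.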

In [105]:
def save(url, content_type, index, tags):
    '''Download the file at `url` into the videos or images folder.'''
    os.makedirs('videos', exist_ok=True)  # do not fail if the folder already exists
    os.makedirs('images', exist_ok=True)
    tags = tags[:150]  # Otherwise the file name gets too long
    if content_type == "video":
        path = os.path.join('videos', str(index) + tags + '.mp4')
    else:
        path = os.path.join('images', str(index) + tags + '.png')
    try:
        if content_type == "video":
            urllib.request.urlretrieve(url, path)
        else:
            img_data = requests.get(url).content
            with open(path, 'wb') as handler:
                handler.write(img_data)
    except Exception:
        # Log the failed URL so it can be retried later
        with open("failed_urls.txt", "a") as file:
            file.write(url + ' Index:[' + str(index) + ']' + "\n")
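
Tags can contain characters such as `/` or `:` that are not allowed in file names on every operating system, which would make the download fail. One possible way to guard against this is sketched below. It is an assumption on top of the script, not what `save` does: `safe_filename` is a hypothetical helper and it adds an underscore between index and tags.

```python
import re


def safe_filename(index, tags, extension, max_tag_length=150):
    """Build a file name from the post index and its tags."""
    # Replace whitespace and characters that Windows rejects in file names
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", tags)
    cleaned = cleaned.strip("_")[:max_tag_length]
    if cleaned:
        return "{}_{}{}".format(index, cleaned, extension)
    return "{}{}".format(index, extension)
```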

In [11]:
def find_first_post(client):
    '''Bisect over timestamps to find when the first post was liked.'''
    now = int(time.time())
    # Lower bound of the search: 1 February 2007
    past = time.mktime(datetime.strptime("01/02/2007", "%d/%m/%Y").timetuple())
    timestamp = now
    posts = client.likes(before=now, limit=51)['liked_posts']
    # 0 posts: the timestamp is too early; 51 posts: older likes remain
    while len(posts) in [0, 51]:
        timestamp = now - (now - past) / 2  # midpoint of the current interval
        posts = client.likes(before=int(timestamp), limit=51)['liked_posts']
        if len(posts) == 0:
            past = timestamp
        elif len(posts) == 51:
            now = timestamp
    # `posts` now holds the last, partially filled batch
    first_post_timestamp = min(post['liked_timestamp'] for post in posts)
    with open("first_timestamp.p", 'wb') as file:
        pickle.dump(first_post_timestamp, file)
    return first_post_timestamp
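
The bisection idea can be tried without any network access by replacing the API with a function over a plain list of timestamps. The sketch below is a variant written for illustration, not the function above: `make_fake_likes` and `find_first_like` are hypothetical names, and the fake assumes that a request returns the newest likes before the given timestamp, up to `page_size`.

```python
def make_fake_likes(liked_timestamps, page_size):
    """Return a stand-in for the API: newest likes before a timestamp."""
    ordered = sorted(liked_timestamps, reverse=True)

    def likes_before(before):
        return [ts for ts in ordered if ts < before][:page_size]

    return likes_before


def find_first_like(likes_before, earliest, latest, page_size):
    """Bisect between `earliest` and `latest` for the oldest like."""
    low, high = earliest, latest
    page = likes_before(high)
    if not page:
        return None  # nothing was liked before `latest`
    # A full page means older likes may exist, so keep narrowing the interval
    while len(page) == page_size and high - low > 1:
        middle = (low + high) // 2
        candidate = likes_before(middle)
        if candidate:
            page, high = candidate, middle  # likes exist before `middle`
        else:
            low = middle  # nothing before `middle`: the first like is later
    return min(page)
```

A partially filled page contains every like before the queried timestamp, so its minimum is the oldest like overall.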

In [16]:
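def sim(ts):
    '''Stand-in for the API: number of posts returned for timestamp ts.'''
    now = int(time.time())
    if (now - 1000) < ts < now:
        return 51  # a full batch: older posts remain
    elif ts < (now - 2000):
        return 0   # before the first like
    else:
        return 10  # a partial batch containing the first like


def find_first_post_simulated():
    '''Run the bisection from find_first_post against sim() instead of the API.'''
    now = int(time.time())
    past = time.mktime(datetime.strptime("01/02/2007", "%d/%m/%Y").timetuple())
    timestamp = now - (now - past)/2
    posts = 51
    while posts in [0, 51]:
        posts = sim(timestamp)
        if posts == 0:
            past = timestamp
            timestamp = now - (now - past)/2
        elif posts == 51:
            now = timestamp
            timestamp = now - (now - past)/2
    posts = sim(timestamp)
    return posts


find_first_post_simulated()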

10

In [17]:
datetime.utcfromtimestamp(1544203842).strftime('%Y-%m-%d %H:%M:%S')

'2018-12-07 17:30:42'

In [21]:
pickle.load(open('checkpoint.p','rb'))

0

In [None]:
checkpoint = {"caused_error_url": [],
              "not_found_contenttype": [],
              "offsets": [1415545092],
              "name_dict": {},
              "num_post": 0,
              "current_api_call": 0}
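
Because the script may be interrupted at any time, the checkpoint is only useful if it is written to disk in a way that survives a crash mid-write. A minimal sketch of one way to do that, assuming the checkpoint lives in a pickle file such as `checkpoint.p`; `save_checkpoint` and `load_checkpoint` are hypothetical helpers.

```python
import os
import pickle
import tempfile


def save_checkpoint(checkpoint, path):
    """Write `checkpoint` to `path` without leaving a half-written file."""
    folder = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("wb", dir=folder, delete=False) as tmp:
        pickle.dump(checkpoint, tmp)
        temp_name = tmp.name
    os.replace(temp_name, path)  # swap the finished file into place


def load_checkpoint(path, default):
    """Return the saved checkpoint, or `default` if none exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Writing to a temporary file first means an interrupted run leaves the previous checkpoint untouched.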

In [92]:
def api_calls_for_content():
    '''Collect liked posts batch by batch, starting at the last saved offset.'''
    posts = []
    for api_call in tqdm(range(160)):
        # Request large batches to make as few requests as possible
        response = client.likes(after=checkpoint["offsets"][-1], limit=51)
        liked_posts = response["liked_posts"]
        if not liked_posts:
            break  # no newer likes left
        new_offset = max(post['liked_timestamp'] for post in liked_posts)
        checkpoint["offsets"].append(new_offset)
        posts.extend(liked_posts)
    with open('posts.p', 'wb') as file:
        pickle.dump(posts, file)
    return posts
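
The paging logic can be exercised offline by swapping the client for a fake that serves timestamps from a list. This is a sketch under assumptions: `collect_likes` and `make_fake_fetch` are hypothetical names, and the fake returns the oldest likes after the offset first, which may differ from how the real API orders its results.

```python
def collect_likes(fetch_after, start_offset, max_calls):
    """Page forward through likes until a request returns nothing."""
    offsets = [start_offset]
    posts = []
    for _ in range(max_calls):
        batch = fetch_after(offsets[-1])
        if not batch:
            break  # no newer likes left
        offsets.append(max(post["liked_timestamp"] for post in batch))
        posts.extend(batch)
    return posts, offsets


def make_fake_fetch(liked_timestamps, page_size):
    """Return a stand-in for the API (assumed ordering: oldest first)."""
    ordered = sorted(liked_timestamps)

    def fetch_after(after):
        newer = [ts for ts in ordered if ts > after]
        return [{"liked_timestamp": ts} for ts in newer[:page_size]]

    return fetch_after
```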

In [None]:
# DELETE FAILED FIRST

In [2]:
# Messy draft
with open('posts.p', 'rb') as file:
    posts = pickle.load(file)  # saved by api_calls_for_content()
START = 421  # index of the first post to process
empty_posts = 0
for index, post in enumerate(tqdm(posts[START:]), start=START):
    if len(post) >= 1:
        content_type = post['type']
        tags = "_".join(post['tags'])
        index = str(index)
        if content_type == "photo":
            # If only one photo, download it; otherwise iterate over them and download each
            if len(post["photos"]) == 1:
                url = post["photos"][0]["original_size"]['url']
                save(url, content_type, index, tags)
            else:
                index += "_{}"
                for j, photo in enumerate(post["photos"]):
                    url = photo["original_size"]['url']
                    save(url, content_type, index.format(j), tags)
        elif content_type == "text":
            # The body is an HTML string; use a regex to extract photo URLs.
            # If only one photo, download it; otherwise iterate over them and download each
            content = post["body"]
            url_s = re.findall(r'src="(https?:[^"\s]*media\.tumblr\.com[^"\s]*)"', content)
            if len(url_s) == 1:
                save(url_s[0], content_type, index, tags)
            else:
                index += "_{}"
                for j, url in enumerate(url_s):
                    save(url, content_type, index.format(j), tags)
        elif content_type == "video":
            # Download the video file; skip posts without a direct video URL
            try:
                url_s = post["video_url"]
                save(url_s, content_type, index, tags)
            except KeyError:
                pass
    else:
        empty_posts += 1
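
The URL extraction used for text posts can be checked on its own with a small HTML snippet. The sketch below is illustrative: `extract_media_urls` is a hypothetical helper, and the URLs in the usage note are made up.

```python
import re

# Matches src="..." attributes that point at a media.tumblr.com host
MEDIA_URL = re.compile(r'src="(https?://[^"\s]*media\.tumblr\.com[^"\s]*)"')


def extract_media_urls(body_html):
    """Return all Tumblr media URLs found in the HTML body of a post."""
    return MEDIA_URL.findall(body_html)
```

Called on a body such as `<img src="https://66.media.tumblr.com/abc.png"/>`, it returns a list containing that one URL, while images hosted elsewhere are ignored.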