Some thoughts I have about creating thumbnails of common document types.

Adventures in Document Thumbnailing

A talk by @benjamincoe

The Problem

My company, Attachments.me, makes a visual representation of the attachments inside your inbox.

Recently I've been rebuilding the part of our system that creates these thumbnails.

  • I wanted to use open-source libraries to perform the thumbnailing.
  • I wanted to use a Linux server for hosting the service.
  • I wanted to be able to create thumbnails for the majority of documents that our system processes.
  • I needed things to work at a large scale (millions and millions of thumbnails being created).

Even though, for all intents and purposes, we live in the future, solving these problems was a bit of a hassle.

LibreOffice / PythonUNO

Up until quite recently, we were using LibreOffice as part of our document thumbnailing process. It converted various document formats into PDFs, at which point they could easily be converted into images.

Pros

  • Supported most major document formats.
  • PythonUNO provides a handy API for LibreOffice.

Why We Abandoned It

  • It's not thread safe.
  • It's slow and leaks memory.

The main problem we ran into with LibreOffice was its inability to handle concurrent access. We built middleware that limited processing to one document at a time, which fixed the concurrency problem but also created a huge bottleneck in our system.
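
To illustrate why that kind of middleware becomes a bottleneck, here is a minimal sketch of serializing access to a converter that is not thread safe. It assumes a single Python process with several worker threads; `convert_one_at_a_time` and the `convert` callable are hypothetical names, not the actual Attachments.me middleware.

```python
import threading

# One process-wide lock: only one conversion may be in flight at a time.
_conversion_lock = threading.Lock()


def convert_one_at_a_time(convert, path):
    """Run convert(path) while holding the lock.

    `convert` is a hypothetical callable wrapping the non-thread-safe
    converter. Every caller queues up behind the lock, which is exactly
    what turns the converter into a bottleneck under load.
    """
    with _conversion_lock:
        return convert(path)
```

However many workers call `convert_one_at_a_time`, throughput is capped at one document at a time.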

80/20 Rule to the Rescue

Taking a look at the millions of attachments that we had indexed, I noticed something: 80% of documents were either .doc, .docx, or .pdf.

This made me rethink the necessity of using LibreOffice for thumbnail creation.
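
A breakdown like this is cheap to compute from a list of attachment filenames. The sketch below is one way to do it; `extension_shares` is a hypothetical helper, and the filenames in the usage note are made up for illustration.

```python
import os
from collections import Counter


def extension_shares(filenames):
    """Return {extension: fraction of files} for a list of filenames.

    Extensions are lowercased so that "REPORT.DOC" and "report.doc"
    count as the same type.
    """
    counts = Counter(os.path.splitext(name)[1].lower() for name in filenames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ext: n / total for ext, n in counts.items()}
```

Calling `extension_shares(["a.doc", "b.DOC", "c.pdf", "d.xls"])` gives `.doc` a share of 0.5 and `.pdf` and `.xls` a share of 0.25 each.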

PIL / ImageMagick / AbiWord

The approach we now use for thumbnailing uses three open-source libraries:

  • AbiWord converts various document formats into PDFs.
  • ImageMagick converts the first page of the PDF produced by AbiWord into a JPG image.
  • The Python Imaging Library (PIL) resizes the images produced by ImageMagick (this could be done entirely using ImageMagick, but I like the PIL interface).
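
The three steps above can be sketched roughly as follows. This is not the tomthumb source: the function names are hypothetical, the sketch assumes `abiword` and ImageMagick's `convert` are on the PATH, that `abiword --to=pdf` writes the PDF next to the input file, and that PIL is available as the `PIL` package (for example via Pillow). The default size and the 60-second timeout are illustrative only.

```python
import os
import subprocess


def pdf_path_for(src):
    # Assumption: AbiWord writes foo.pdf alongside foo.doc.
    return os.path.splitext(src)[0] + ".pdf"


def abiword_command(src):
    # Step 1: convert a word-processor document to PDF.
    return ["abiword", "--to=pdf", src]


def first_page_command(pdf_path, jpg_path):
    # Step 2: "[0]" asks ImageMagick for the first page only.
    return ["convert", pdf_path + "[0]", jpg_path]


def resize_jpg(jpg_path, width, height):
    # Step 3: shrink in place with PIL; thumbnail() keeps the aspect ratio.
    from PIL import Image  # imported lazily so the helpers above work without PIL

    with Image.open(jpg_path) as img:
        thumb = img.convert("RGB")
    thumb.thumbnail((width, height))
    thumb.save(jpg_path, "JPEG")


def make_thumbnail(src, jpg_path, width=70, height=120, timeout=60):
    """Run all three steps for one document; raises if a tool fails or hangs."""
    pdf_path = src
    if not src.lower().endswith(".pdf"):
        subprocess.run(abiword_command(src), check=True, timeout=timeout)
        pdf_path = pdf_path_for(src)
    subprocess.run(first_page_command(pdf_path, jpg_path), check=True, timeout=timeout)
    resize_jpg(jpg_path, width, height)
    return jpg_path
```

Each step runs as its own short-lived process, so a crash or hang in one conversion does not take the others down with it.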

This new approach:

  • Supports 80% of the documents we observe.
  • Plays nicely in a multi-process environment.
  • Runs about 30% faster than the original approach.
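
Because each conversion happens in external processes, many documents can be handled side by side. The sketch below shows one way to drive that from Python, using a thread pool to fan out calls to a per-document function. `thumbnail_many` and the `make_thumbnail` callable are hypothetical names, and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def thumbnail_many(paths, make_thumbnail, workers=4):
    """Call make_thumbnail(path) for every path, several at a time.

    Returns {path: result}. A failure for one document is stored as the
    exception object instead of aborting the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(make_thumbnail, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                results[path] = exc
    return results
```

Threads are enough here on the assumption that the heavy lifting happens in the external tools; the Python side only waits for them to finish.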

Tomthumb

I've open-sourced a library called tomthumb (see the /tomthumb folder in this repo) that demonstrates the thumbnailing approach discussed above.

Dependencies

apt-get install python-imaging abiword imagemagick timelimit

Usage

tomthumb -i foo.doc -o out/ --width=70 --height=120

Or

tomthumb -d in/ -o out/

Hope people find this useful. If you have any comments, I'm available on the twitters @benjamincoe.

Copyright

Copyright (c) 2011 Attachments.me. See LICENSE.txt for further details.
