Storing the files

Part of the point of docstore is to abstract away the management of individual files. I don't want to worry about managing individual files and folders – I want the tool to do that for me.

This document explains a bit about how docstore manages my files.

Where the files are stored

I run docstore on my home computer, which shouldn't be accessible from the Internet. The files are stored on the local disk, not in cloud storage.

I use docstore to store files with private information: bank statements, medical letters, rental contracts, and more. If I uploaded them to a cloud storage service like S3, there's a risk I'd misconfigure the permissions and inadvertently make the files public. For me, the security of knowing they're not in the cloud outweighs the potential convenience of having remote access.

How the files are named / filename normalisation

Although my scanned documents have autogenerated filenames, sometimes I download documents that I want to save (e.g. electronic banking statements), which have … interesting filename choices.

These are real filenames I've received:

| Filename | Comments |
| --- | --- |
| VolcanoPattern.pdf | 10/10 great name. |
| Alex Chan_5312.pdf | Spaces in filenames cause nothing but trouble. |
| Statement.pdf | This is a bank statement with no context. I have dozens of files with identical names, covering different accounts and date ranges. |
| Alexander Chan›Payslip November 2014-2015.PDF | Special characters are annoying. |
| V5C:3 scrappage note.pdf | I have no idea how I created this file. This is the V5C/3 form, so at some point the slash has been converted to a colon – but both the colon and slash are used as path separators on macOS, and are best avoided. |

So I can't rely on the original filename: maybe it contains special characters, or I have different files with the same filename. The original filename is a useful piece of metadata that I want to keep, but I can't use it for saving files.

I save files under a normalised version of their original filename. I want to stay as close to the original filename as possible – so no UUIDs. Then I save the original filename as a bit of metadata in the database.

The normalisation process has two steps:

  • Creating an ASCII-safe filename using Dr Drang's slugify() function. This uses the Unidecode and re libraries to remove any non-ASCII characters and spaces.

  • Appending a random hex value before the filename extension if there are multiple files with the same name. This avoids saving two files with the same name. e.g. Statement.pdf, Statement_1c5e.pdf, Statement_3fc9.pdf, …
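As a rough sketch of those two steps (this is not the code in file_normalisation.py): the helper names below are hypothetical, and the standard library's unicodedata stands in for Unidecode so the example has no dependencies. Unidecode transliterates many more characters than this does, and the real slugify() may treat case and punctuation differently.

```python
import os
import re
import secrets
import unicodedata


def ascii_safe_name(original: str) -> str:
    """Return an ASCII-only, space-free version of a filename (hypothetical helper)."""
    stem, ext = os.path.splitext(original)
    # Decompose accented characters, then drop anything that is still non-ASCII.
    # Assumption: a stdlib stand-in for Unidecode, which transliterates instead.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    # Replace each run of spaces and punctuation with a single hyphen.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "-", stem).strip("-")
    # Fall back to a placeholder if nothing survives; the extension is kept as-is.
    return (stem or "file") + ext


def with_random_suffix(name: str) -> str:
    """Insert four random hex digits before the filename extension."""
    stem, ext = os.path.splitext(name)
    return f"{stem}_{secrets.token_hex(2)}{ext}"


print(ascii_safe_name("Alex Chan_5312.pdf"))        # Alex-Chan_5312.pdf
print(ascii_safe_name("V5C:3 scrappage note.pdf"))  # V5C-3-scrappage-note.pdf
print(with_random_suffix("Statement.pdf"))          # e.g. Statement_1c5e.pdf
```

The suffix is only needed when the plain name is already taken, so in practice with_random_suffix() would be called after a failed attempt to save under the plain name.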

For the exact implementation, see file_normalisation.py.

Ensuring I don't save two files with the same name / exclusive-open mode in Python

What if two processes try to save a file with the same name simultaneously? How do I ensure the normalisation kicks in and adds the random hex value to keep the files apart?

This is probably overkill: I'm the only person saving documents, and I can't do multiple things at once. But it was pretty easy to add, and it's a useful example of a less well-known feature in Python.

If you've used Python, you probably know how to read and write files:

>>> with open("greeting.txt", mode="w") as outfile:
...     outfile.write("Hello world!")
12

>>> with open("greeting.txt", mode="r") as infile:
...     print(infile.read())
Hello world!

The mode argument tells Python whether you're writing (w) or reading (r). These are by far the most commonly used values.

What if you want to write to a file, but only if it doesn't exist yet? You could check if it exists first:

>>> import os
>>> if not os.path.exists("important.txt"):
...     with open("important.txt", mode="w") as outfile:
...         outfile.write("Hello world!")
...
12

but this is risky – what if the file is created between the existence check and when you open it?

Better is to use mode x, which means exclusive open. You write as normal, but if the file already exists, open() raises a FileExistsError:

>>> with open("greeting.txt", mode="x") as outfile:
...     outfile.write("Hello world!")
12

>>> with open("greeting.txt", mode="x") as outfile:
...     outfile.write("Bonjour le monde!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileExistsError: [Errno 17] File exists: 'greeting.txt'

This is enforced at the OS level: the existence check and the file creation happen as a single atomic step, so there's no gap in which another process can create the file. I use this to ensure I don't save two files with the same name – one save will succeed, the other will raise a FileExistsError and get a random hex value inserted to distinguish it.
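Here's a minimal sketch of how exclusive open and the random hex value can fit together. The function name save_exclusively() and its arguments are made up for this example, and the real code in file_normalisation.py may be structured differently.

```python
import os
import secrets
import tempfile


def save_exclusively(directory: str, name: str, data: bytes) -> str:
    """Write data to a new file in directory without overwriting anything.

    Returns the path that was actually written (hypothetical helper).
    """
    stem, ext = os.path.splitext(name)
    candidate = name
    while True:
        path = os.path.join(directory, candidate)
        try:
            # Mode "x" fails if the path already exists, so checking and
            # creating the file happen in one step.
            with open(path, mode="xb") as outfile:
                outfile.write(data)
            return path
        except FileExistsError:
            # The name is taken: retry with random hex before the extension.
            candidate = f"{stem}_{secrets.token_hex(2)}{ext}"


with tempfile.TemporaryDirectory() as tmp:
    first = save_exclusively(tmp, "Statement.pdf", b"first statement")
    second = save_exclusively(tmp, "Statement.pdf", b"second statement")
    print(os.path.basename(first))   # Statement.pdf
    print(os.path.basename(second))  # e.g. Statement_3fc9.pdf
```

The loop means that even if the random suffix happens to collide with an existing file, the save just tries again with a new suffix.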

For the exact implementation, see file_normalisation.py.

Downloading files with their original filename / the Content-Disposition header

When I download a file from the web app, I want to download it with the original filename – not the normalised version.

For example, if I have an HTML link:

<a href="/files/beijing.pdf">

then clicking it would make my web browser download a file named beijing.pdf.

But you can use the Content-Disposition header to suggest to a browser that it should download a file with a different name. In particular, if the server returns the header:

Content-Disposition: attachment; filename="北京.pdf"

then the browser will download the file as 北京.pdf.
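HTTP header values have traditionally been limited to ASCII, so RFC 6266 defines an extra filename* parameter that carries the name as percent-encoded UTF-8. Below is a small sketch of building a header value that way. The content_disposition() helper is hypothetical, and using the normalised filename as the ASCII fallback is an assumption for this example, not necessarily what serve_file() does.

```python
from urllib.parse import quote


def content_disposition(original_filename: str, ascii_fallback: str) -> str:
    """Build a Content-Disposition header value (hypothetical helper).

    ascii_fallback is assumed to be ASCII already, e.g. the normalised name.
    """
    # Escape backslashes and double quotes inside the quoted fallback value.
    fallback = ascii_fallback.replace("\\", "\\\\").replace('"', '\\"')
    # RFC 6266 extended parameter: the charset, an empty language tag,
    # then the filename as percent-encoded UTF-8.
    encoded = quote(original_filename, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


print(content_disposition("北京.pdf", "beijing.pdf"))
# attachment; filename="beijing.pdf"; filename*=UTF-8''%E5%8C%97%E4%BA%AC.pdf
```

Clients that understand filename* should prefer it over the plain filename, and older ones fall back to the ASCII name.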

For the exact implementation, see serve_file() in server.py.