We have a sizeable digital archive of every copy of the FT going back to the 1880s. The archive contains a high-resolution black-and-white scan of every page of each edition, and an XML file containing the text content of that page (generated through OCR). It takes up terabytes of storage space and is currently stored in a form that doesn't allow us to easily access, index or expose its content. This repo contains the experiments, explorations and results of trying to make the archive more accessible.
Right now, the archive lives on a networked drive (only accessible from within the FT) at http://newspaperarchive.ftroot.com/default.aspx. To browse the issues by year/month/day, go to http://newspaperarchive.ftroot.com/html, which serves an Apache directory listing.
Generating map tiles from our image assets isn't too difficult. There are a number of open source projects that we can use to generate tiles, which can then be displayed in a web browser using software like Leaflet.js, OpenLayers or Google Maps. Our efforts so far have centered around the GDAL project and the software it makes available. Using the gdal2tiles.py program, we can generate a raster map. A raster map is a map where the coordinates point to x,y pixel values instead of latitude and longitude, and which doesn't try to factor in any projection (like the Mercator or Peters projection).
You can download the binaries for GDAL here
Once you've followed the installation instructions and linked the Python files, you should be able to run them from your Terminal without having to include the path.
For each page of the FT archive turned into a map this way, the storage for the tiles is roughly 2x the amount taken up by the static image.
When gdal2tiles.py is run with the following arguments
gdal2tiles.py -z "1-5" -v -w "all" -p 'raster' -a 0 [PATH TO IMAGE FILE]/[IMAGE FILE].JPG ./gdal_tiles
it will create a folder called gdal_tiles in the directory that the script was executed in. Using the image file that is passed to it, gdal2tiles.py will generate tiles of that image at zoom levels 1-5 (1 being the zoom level furthest away, with the least detail, and 5 being the closest, with the most detail and the greatest number of tiles) and place them into directories in the gdal_tiles folder. gdal2tiles.py will also generate an HTML file with the prerequisites necessary to view the tiles using the OpenLayers viewing library.
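To get a feel for how many tiles each zoom level produces, here is a rough sketch of the grid arithmetic as we understand the raster profile to work. It is not code lifted from gdal2tiles.py: the function names are ours, the 256-pixel tile size is an assumption, and the 4000x6000 pixel scan is a made-up example.

```python
import math

TILE_SIZE = 256  # assumed tile edge in pixels


def native_zoom(width, height, tile_size=TILE_SIZE):
    """Lowest zoom level at which the tiles show the scan at full resolution."""
    zoom = 0
    while tile_size * 2 ** zoom < max(width, height):
        zoom += 1
    return zoom


def tile_grid(width, height, zoom, tile_size=TILE_SIZE):
    """Return (columns, rows) of tiles needed to cover the image at `zoom`."""
    # Side length, in original image pixels, covered by one tile at this zoom.
    span = tile_size * 2.0 ** (native_zoom(width, height, tile_size) - zoom)
    return int(math.ceil(width / span)), int(math.ceil(height / span))


# Hypothetical page scan of 4000x6000 pixels, zoom levels 1-5 as in the command above.
for zoom in range(1, 6):
    columns, rows = tile_grid(4000, 6000, zoom)
    print("zoom %d: %d x %d = %d tiles" % (zoom, columns, rows, columns * rows))
```

Each extra zoom level roughly quadruples the tile count, so the deepest level accounts for most of the files written to disk.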
The tiles generated are placed in directories ordered in the following way.
The top-level directories can be considered to be the z (or zoom level) values for the map. Inside each Z folder are the X folders (the tile columns) for that zoom level, and inside each X folder are the image tiles themselves, one file per Y value (the tile row), named {y}.png. gdal2tiles.py follows the TMS convention, so rows are counted from the bottom of the image rather than the top. The tiles are stored in this way so they can easily be delivered with a static file server. If we wanted to access the tile 6 from the left and 2 from the bottom at zoom level 3, we would make a request like http://example.org/3/6/2.png or, to put it another way, http://example.org/{z}/{x}/{y}.png.
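A tile address is just a template filled in with three numbers, which a small helper makes explicit. This is only a sketch: tile_url is a hypothetical name, and the template is a parameter because the ordering of z, x and y depends on how the tiles were written to disk. The default below is the layout that stock gdal2tiles.py writes, to our knowledge.

```python
def tile_url(base_url, z, x, y, template="{z}/{x}/{y}.png"):
    """Fill a tile URL template with zoom (z), column (x) and row (y) values."""
    # The default template is an assumption; pass another one if your
    # tiles are laid out differently.
    return base_url.rstrip("/") + "/" + template.format(z=z, x=x, y=y)


# Column 6, row 2, zoom level 3 with the default template.
print(tile_url("http://example.org", z=3, x=6, y=2))

# The same tile with an alternative ordering of the path segments.
print(tile_url("http://example.org", z=3, x=6, y=2, template="{z}/{y}/{x}"))
```

The same template string, with the braces left in place, is the kind of value a viewer like Leaflet.js expects when it is told where to fetch tiles from.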
Once gdal2tiles.py has finished running, the gdal_tiles folder will contain the following:
- openlayers.html
- tilemapresource.xml
- Five directories labelled 1-5 (these are the folders that represent the zoom levels)
The openlayers.html file is generated by gdal2tiles.py. It contains all of the necessary JavaScript, CSS and HTML to display the tiles in a browser. To view the tiles, you can spin up a simple static file server from inside the gdal_tiles folder with a command like
python -m SimpleHTTPServer 8080
and browse to http://localhost:8080/openlayers.html. (SimpleHTTPServer is the Python 2 module; on Python 3 the equivalent command is python3 -m http.server 8080.) Once loaded, you will see something like the following.
gdal2tiles-mod.py is a file found in this repo. It does everything that gdal2tiles.py can do, but has an extra output option, -w "ft-leaflet", to use instead of -w "all". This will generate the HTML, CSS and JavaScript necessary to view the map tiles, but with Leaflet.js as the map viewer and some small design tweaks, like the background color. To use it, simply copy the file to the same location as your other GDAL Python files (on OS X, this should be /Library/Frameworks/GDAL.framework/Programs/) and then run gdal2tiles-mod.py with the same arguments as you would use for gdal2tiles.py. This will output the same tiles as before, but now there will be a leaflet.html file instead of an openlayers.html file.
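If you want to drive either script from Python rather than typing the command each time, something along these lines should work. It is a minimal sketch: build_tile_command and make_tiles are hypothetical helpers, the flags are the ones shown earlier in this README, and it assumes the script is executable and on your PATH.

```python
import subprocess


def build_tile_command(image_path, output_dir="./gdal_tiles", zoom="1-5",
                       viewer="ft-leaflet", script="gdal2tiles-mod.py"):
    """Return the argument list for a tiling run, without executing it."""
    return [script, "-z", zoom, "-v", "-w", viewer,
            "-p", "raster", "-a", "0", image_path, output_dir]


def make_tiles(image_path, **options):
    """Run the tiler and raise CalledProcessError if it exits with an error."""
    subprocess.check_call(build_tile_command(image_path, **options))


# Inspect the command for a made-up image path before running anything.
print(" ".join(build_tile_command("scans/page-0001.JPG")))
```

Passing viewer="all" and script="gdal2tiles.py" gives the stock OpenLayers output instead.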
tiler.py is a simple Python script which will scrape the digital archive and create a tile map for each individual page. It scrapes the archive using Beautiful Soup and the HTML5Lib parser, which you will need to install in order to run the scraper. Beautiful Soup requires Python 2.7 or later. You can install Beautiful Soup by following these instructions, and you can install the HTML5Lib parser by following these instructions. tiler.py will work through the entirety of the FT archive and generate a map for each page. This has significant implications for storage, so do use it wisely and don't run it without keeping an eye on it (or without writing the results to a huge hard drive).
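To illustrate the scraping step, here is a rough sketch of pulling folder and image links out of an Apache directory listing. It is not the code in tiler.py: it swaps Beautiful Soup and HTML5Lib for the standard library's html.parser so that it runs without extra installs (Python 3 as written), and the sample HTML and file names are made up.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the href of every <a> tag in an Apache directory listing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip column-sort links ("?C=N;O=D") and links back up the tree.
        if href and not href.startswith(("?", "/", "..")):
            self.links.append(href)


def listing_links(html):
    """Return the relative links found in a directory listing page."""
    parser = ListingParser()
    parser.feed(html)
    return parser.links


# Made-up listing, shaped like the pages Apache generates.
SAMPLE = """
<h1>Index of /html/1888</h1>
<a href="?C=N;O=D">Name</a>
<a href="/html/">Parent Directory</a>
<a href="02/">02/</a>
<a href="03/">03/</a>
<a href="page-0001.JPG">page-0001.JPG</a>
"""

links = listing_links(SAMPLE)
folders = [link for link in links if link.endswith("/")]
images = [link for link in links if link.upper().endswith(".JPG")]
print(folders)
print(images)
```

A scraper would follow each folder link in turn and hand each image it finds to the tiler.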
If you don't have Python 2.7 or greater:
brew install python
and then add the following to your .bash_profile file (found in your home directory)
alias python=/usr/bin/python2.7
and run the next instructions...
If you have Python 2.7 or greater:
pip install beautifulsoup4 && pip install html5lib
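To confirm the install worked before starting a long scrape, a quick import check can help. This is a small sketch with a hypothetical helper name; bs4 is the module name that the beautifulsoup4 package installs.

```python
import importlib


def missing_modules(names=("bs4", "html5lib")):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print(missing_modules())
```

An empty list means both libraries are importable by the Python interpreter you ran the check with.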
Next, we plan to explore the following:
- How we might present the whole of the FT archive in one giant tile system (the Enormo-map).
- How we might present the archives in the same format as the microfilm that they exist in.
- How we might make the content of our archives searchable/indexable.
- How we might go about redigitalising the archives from film (automated setup with a Raspberry Pi).