Easily crowdsource the analysis of your documents



CrowData is a tool to collaborate on the verification or release of data that otherwise would be hard or impossible to get via automatic tools. This is the software we used to create VozData.

In 2014, La Nacion in Argentina launched VozData, a website that crowdsourced Senate spending data by asking people to transcribe information from 6,500 scanned PDF documents from the Senate. This is the code behind that website, and it can be used with any document set and any data you need to extract from it.

VozData: collaborating to free data from PDFs: A really nice article about the process of creating VozData from La Nacion.

Install Locally

  1. Make sure Python 2.7.5 is installed.

  2. We recommend using virtualenv. Install it.

  3. Create a virtual environment and activate it:

    virtualenv ~/.python-envs/crowdata
    . ~/.python-envs/crowdata/bin/activate
  4. Get the source code:

    git clone https://github.com/crowdata/crowdata.git crowdata
    cd crowdata
  5. Install dependencies:

    Ubuntu users: before you can move forward, please make sure you have the following packages installed: python-dev, postgresql-9.3, postgresql-server-dev-9.3, postgresql-contrib, and libgeos-dev

    pip install -r requirements.txt
  6. Create PostgreSQL database

    $ createuser -s -h localhost crow_user
    $ createdb -O crow_user -h localhost crowdata_development
  7. Create extensions for doing trigram matching and removing accents in PostgreSQL

    $ psql -Ucrow_user crowdata_development
    crowdata_development=# CREATE EXTENSION pg_trgm;
    crowdata_development=# CREATE EXTENSION unaccent;
  8. We keep local settings out of Git. Copy local_settings.py.example to local_settings.py, then edit the database settings in the copy:

    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',  # Add 'postgresql_psycopg2', 'postgresql', 'mysql', 'sqlite3' or 'oracle'.
            'NAME': 'crowdata_development',  # Or path to database file if using sqlite3.
            'USER': 'crow_user',
            'PASSWORD': '',
            'HOST': '',
            'PORT': '',
        }
    }
  9. Install the GEOS library if you don't already have it installed.

  10. Initialize the database:

    python manage.py syncdb
    python manage.py migrate --all
  11. Ask a team member for a database backup and load it.

    pg_restore --dbname=crowdata_development --verbose ~/my_backup.backup --clean
  12. Create superuser

    python manage.py createsuperuser

    and follow the prompts.

  13. Start the development server

    python manage.py runserver_plus
  14. Navigate to http://localhost:8000/admin/ and log in with your superuser credentials.
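
Step 7 enables pg_trgm, the PostgreSQL extension for trigram matching. As a rough illustration of what that kind of matching measures, the sketch below splits lowercased words into three-character pieces and scores two strings by the share of pieces they have in common. It is not CrowData's code and only approximates the extension, which runs inside the database; the function names are made up for this example.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, padded roughly the way pg_trgm does."""
    result = set()
    # Words are runs of letters and digits; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces before the word, one after
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / float(len(ta | tb))
```

With these definitions, similarity("word", "two words") is 4/11, about 0.36: the two strings share four trigrams out of eleven distinct ones.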

Installing via Docker

  1. Set your environment variables

There are 6 required environment variables.

  • crowdata_NAME : your database name
  • crowdata_USER : the main database user (this will also be the Django superuser)
  • crowdata_HOST : usually localhost
  • crowdata_EMAIL : email for the Django superuser
  • crowdata_WITH_DB : the filename of a prepopulated backup for the database (or simply None)
  • crowdata_PASSWORD : the password you want

set each of them with:

export [var name]=[value you want] (e.g. export crowdata_USER="beyonce")
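
Because envsubst replaces an unset variable with an empty string without complaining, it can help to confirm that all six variables are set before building. A minimal sketch of such a check follows; the helper name missing_vars is hypothetical and this script is not part of CrowData.

```python
import os

# The six variables listed above.
REQUIRED_VARS = [
    "crowdata_NAME",
    "crowdata_USER",
    "crowdata_HOST",
    "crowdata_EMAIL",
    "crowdata_WITH_DB",
    "crowdata_PASSWORD",
]


def missing_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling print(missing_vars()) in the same shell session lists whatever still needs an export; an empty list means all six are set.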

  2. Build your image with

cat Dockerfile | envsubst | sudo docker build -t lanacion/crowdata -

  3. Once it's built, run the server with

sudo docker run -i -t -d lanacion/crowdata python /crowdata/manage.py runserver_plus && tail -f /dev/null

When creating a document set

If you are going to use DocumentCloud to load and view the PDF documents, you will have to set the 'head html' field in the admin when creating the document set:

<script src="http://s3.documentcloud.org/viewer/loader.js"></script>

and the template function:

// JavaScript function to insert the document into the DOM.
// Receives the URL of the document as its only parameter.
// Must be called insertDocument.
// jQuery is available.
// The resulting element should be inserted into div#document-viewer-container.

function insertDocument(document_url) {
  // Strip the trailing ".html"; the viewer loads the ".js" version of the URL.
  var url = document_url.match(/(.+)\.html$/)[1];
  DV.load(url + '.js', {
    container: 'div#document-viewer-container',
    width: 650,
    height: 835,
    sidebar: false
  });
}
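
The template function assumes the document URL ends in .html; for any other URL, match returns null and the function throws. To check URLs ahead of time, the same transformation can be sketched in Python. This is an illustration only, not part of CrowData, and the function name and the example URLs are hypothetical.

```python
import re


def viewer_js_url(document_url):
    """Mirror insertDocument: swap a trailing .html for .js.

    Returns None when the URL does not end in .html, which is the case
    where the JavaScript version would fail.
    """
    found = re.match(r"(.+)\.html$", document_url)
    if found is None:
        return None
    return found.group(1) + ".js"
```

For example, viewer_js_url("https://example.org/documents/123-sample.html") gives "https://example.org/documents/123-sample.js", while a URL ending in .pdf gives None.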

When importing documents to a 'document set' via CSV upload

The admin page for a document set has an option called 'Add Documents to this document set'. There you can upload a CSV with the columns document_title and document_url. Each row creates a document in the document set with that title, linked to that URL.
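
A CSV in that shape can be produced with Python's csv module, which also takes care of quoting titles that contain commas. The sketch below is written for Python 3 and assumes the importer expects a header row naming the two columns; the helper name and the sample rows are hypothetical.

```python
import csv
import io


def build_document_csv(documents):
    """Return CSV text: a header row, then one row per (title, url) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document_title", "document_url"])
    for title, url in documents:
        writer.writerow([title, url])
    return buffer.getvalue()


# Hypothetical sample rows, for illustration only.
sample = [
    ("Expense report 001", "https://example.org/documents/001-report.html"),
    ("Expense report, annex", "https://example.org/documents/002-annex.html"),
]
csv_text = build_document_csv(sample)
```

Writing csv_text to a file gives something ready to upload. The second sample title contains a comma, so the writer wraps it in double quotes.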

CrowData's copyright is © 2013 Manuel Aristarán jazzido@jazzido.com. CrowData was developed with Open News and La Nacion Argentina.

CrowData is an open source project that was born when Manuel Aristarán was an Open News fellow at La Nacion in 2013. It was released as free software when Gabriela Rodriguez continued it for VozData in 2014. Thanks to Cristian Bertelegni and La Nacion for contributing to the code.

Now it relies on contributions from people and organizations. Please use it, comment on it, and submit improvements as pull requests on GitHub: http://github.com/crowdata/crowdata


  • Fork the repo
  • Clone your fork
  • Create a branch for your changes
  • Make a pull request through GitHub, and clearly describe your changes