Copyright 2015 NPR. All rights reserved. No part of these materials may be reproduced, modified, stored in a retrieval system, or retransmitted, in any form or by any means, electronic, mechanical or otherwise, without prior written permission from NPR.
(Want to use this code? Send an email to nprapps@npr.org!)
- What is this?
- Assumptions
- What's in here?
- Bootstrap the project
- Hide project secrets
- Save media assets
- Add a page to the site
- Run the project
- COPY configuration
- COPY editing
- Arbitrary Google Docs
- Run Python tests
- Run Javascript tests
- Compile static assets
- Test the rendered app
- Deploy to S3
- Deploy to EC2
- Install cron jobs
- Install web services
- Run a remote fab command
- Report analytics
Current scraper started around 12:50pm EST on April 30, 2015
Graeae is a tool for aggregating data about NPR's content from: the NPR story API (Seamus), the NPR homepage, the NPR Facebook page, a spreadsheet of projects we've handled photos for, and a qualitative review by our team of photo quality.
The following things are assumed to be true in this documentation.
- You are running OSX.
- You are using Python 2.7. (Probably the version that came OSX.)
- You have virtualenv and virtualenvwrapper installed and working.
- You have NPR's AWS credentials stored as environment variables locally.
For more details on the technology stack used with the app-template, see our development environment blog post.
The project contains the following folders and important files:
confs
-- Server configuration files for nginx and uwsgi. Edit the templates thenfab <ENV> servers.render_confs
, don't edit anything inconfs/rendered
directly.data
-- Data files, such as those used to generate HTML.fabfile
-- Fabric commands for automating setup, deployment, data processing, etc.etc
-- Miscellaneous scripts and metadata for project bootstrapping.jst
-- Javascript (Underscore.js) templates.less
-- LESS files, will be compiled to CSS and concatenated for deployment.templates
-- HTML (Jinja2) templates, to be compiled locally.tests
-- Python unit tests.www
-- Static and compiled assets to be deployed. (a.k.a. "the output")www/assets
-- A symlink to an S3 bucket containing binary assets (images, audio).www/live-data
-- "Live" data deployed to S3 via cron jobs or other mechanisms. (Not deployed with the rest of the project.)www/test
-- Javascript tests and supporting files.app.py
-- A Flask app for rendering the project locally.app_config.py
-- Global project configuration for scripts, deployment, etc.copytext.py
-- Code supporting the Editing workflowcrontab
-- Cron jobs to be installed as part of the project.public_app.py
-- A Flask app for running server-side code.render_utils.py
-- Code supporting template rendering.requirements.txt
-- Python requirements.static.py
-- Static Flask views used in bothapp.py
andpublic_app.py
.
Node.js is required for the static asset pipeline. If you don't already have it, get it like this:
brew install node
curl https://npmjs.org/install.sh | sh
Then bootstrap the project:
cd graeae
mkvirtualenv graeae
pip install -r requirements.txt
npm install
fab data.local_bootstrap
fab update
Problems installing requirements? You may need to run the pip command as ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future pip install -r requirements.txt
to work around an issue with OSX.
This project currently contains three scrapers:
This scraper collects an event stream by scanning our Facebook feed every 15 minutes.
- run_time (of the scrape)
- headline
- post_type
- art_url
- link_url
- created_time
- updated_time
- message: the text above the link/photo
- description: the text below the link/photo
This additional data is knitted in from Facebook insights:
- shares
- likes
- comments
- link_clicks
- photo_view_clicks
- people_reached
This scraper collects an event stream by scanning our homepage every 15 minutes.
- run_time (of the scrape)
- story_id: Internal unique ID
- slot: Position on the home page (e.g. the top story is slot 0)
- headline
- url
- is_bullet: True if the item is a bullet link related to another story
- layout: One of 'bullet', 'video', 'big-image', 'small-image', 'slideshow'
- has_audio: Has audio player on the home page
- homepage_art_url: URL to image or video on homepage
This additional data is knitted in from the Seamus API:
- homepage_art_provider: Who provided the homepage art (e.g. AFP/Getty)
- has_story_art: True if any image is present in the story
- has_lead_art: True if there is an image at the top of the story
- lead_art_provider: Who provided the lead art (e.g. Reuters)
- lead_art_url
This scraper collects a canonical dataset of every NPR story in the API. (It is not an event stream.)
- run_time (of the scrape)
- id
- title: also known as the headline
- publication_date
- story_date
- last_modified_date
- canonical_url
- has_lead_art
- lead_art_provider
- lead_art_url
Project secrets should never be stored in app_config.py
or anywhere else in the repository. They will be leaked to the client if you do. Instead, always store passwords, keys, etc. in environment variables and document that they are needed here in the README.
Any environment variable that starts with $PROJECT_SLUG_
will be automatically loaded when app_config.get_secrets()
is called.
Large media assets (images, videos, audio) are synced with an Amazon S3 bucket specified in app_config.ASSETS_S3_BUCKET
in a folder with the name of the project. (This bucket should not be the same as any of your app_config.PRODUCTION_S3_BUCKETS
or app_config.STAGING_S3_BUCKETS
.) This allows everyone who works on the project to access these assets without storing them in the repo, giving us faster clone times and the ability to open source our work.
Syncing these assets requires running a couple different commands at the right times. When you create new assets or make changes to current assets that need to get uploaded to the server, run fab assets.sync
. This will do a few things:
- If there is an asset on S3 that does not exist on your local filesystem it will be downloaded.
- If there is an asset on that exists on your local filesystem but not on S3, you will be prompted to either upload (type "u") OR delete (type "d") your local copy.
- You can also upload all local files (type "la") or delete all local files (type "da"). Type "c" to cancel if you aren't sure what to do.
- If both you and the server have an asset and they are the same, it will be skipped.
- If both you and the server have an asset and they are different, you will be prompted to take either the remote version (type "r") or the local version (type "l").
- You can also take all remote versions (type "ra") or all local versions (type "la"). Type "c" to cancel if you aren't sure what to do.
Unfortunantely, there is no automatic way to know when a file has been intentionally deleted from the server or your local directory. When you want to simultaneously remove a file from the server and your local environment (i.e. it is not needed in the project any longer), run fab assets.rm:"www/assets/file_name_here.jpg"
A site can have any number of rendered pages, each with a corresponding template and view. To create a new one:
- Add a template to the
templates
directory. Ensure it extends_base.html
. - Add a corresponding view function to
app.py
. Decorate it with a route to the page name, i.e.@app.route('/filename.html')
- By convention only views that end with
.html
and do not start with_
will automatically be rendered when you callfab render
.
A flask app is used to run the project locally. It will automatically recompile templates and assets on demand.
workon $PROJECT_SLUG
fab app
Visit localhost:8000 in your browser.
This app uses a Google Spreadsheet for a simple key/value store that provides an editing workflow.
To access the Google doc, you'll need to create a Google API project via the Google developer console.
Enable the Drive API for your project and create a "web application" client ID.
For the redirect URIs use:
http://localhost:8000/authenticate/
http://127.0.0.1:8000/authenticate
http://localhost:8888/authenticate/
http://127.0.0.1:8888/authenticate
For the Javascript origins use:
http://localhost:8000
http://127.0.0.1:8000
http://localhost:8888
http://127.0.0.1:8888
You'll also need to set some environment variables:
export GOOGLE_OAUTH_CLIENT_ID="something-something.apps.googleusercontent.com"
export GOOGLE_OAUTH_CONSUMER_SECRET="bIgLonGStringOfCharacT3rs"
export AUTHOMATIC_SALT="jAmOnYourKeyBoaRd"
Note that AUTHOMATIC_SALT
can be set to any random string. It's just cryptographic salt for the authentication library we use.
Once set up, run fab app
and visit http://localhost:8000
in your browser. If authentication is not configured, you'll be asked to allow the application for read-only access to Google drive, the account profile, and offline access on behalf of one of your Google accounts. This should be a one-time operation across all app-template projects.
It is possible to grant access to other accounts on a per-project basis by changing GOOGLE_OAUTH_CREDENTIALS_PATH
in app_config.py
.
View the sample copy spreadsheet.
This document is specified in app_config
with the variable COPY_GOOGLE_DOC_KEY
. To use your own spreadsheet, change this value to reflect your document's key. (The long string of random looking characters in your Google Docs URL. For example: 1DiE0j6vcCm55Dyj_sV5OJYoNXRRhn_Pjsndba7dVljo
)
A few things to note:
- If there is a column called
key
, there is expected to be a column calledvalue
and rows will be accessed in templates as key/value pairs - Rows may also be accessed in templates by row index using iterators (see below)
- You may have any number of worksheets
- This document must be "published to the web" using Google Docs' interface
The app template is outfitted with a few fab
utility functions that make pulling changes and updating your local data easy.
To update the latest document, simply run:
fab text.update
Note: text.update
runs automatically whenever fab render
is called.
At the template level, Jinja maintains a COPY
object that you can use to access your values in the templates. Using our example sheet, to use the byline
key in templates/index.html
:
{{ COPY.attribution.byline }}
More generally, you can access anything defined in your Google Doc like so:
{{ COPY.sheet_name.key_name }}
You may also access rows using iterators. In this case, the column headers of the spreadsheet become keys and the row cells values. For example:
{% for row in COPY.sheet_name %}
{{ row.column_one_header }}
{{ row.column_two_header }}
{% endfor %}
When naming keys in the COPY document, please attempt to group them by common prefixes and order them by appearance on the page. For instance:
title
byline
about_header
about_body
about_url
download_label
download_url
Sometimes, our projects need to read data from a Google Doc that's not involved with the COPY rig. In this case, we've got a helper function for you to download an arbitrary Google spreadsheet.
This solution will download the uncached version of the document, unlike those methods which use the "publish to the Web" functionality baked into Google Docs. Published versions can take up to 15 minutes up update!
Make sure you're authenticated, then call oauth.get_document(key, file_path)
.
Here's an example of what you might do:
from copytext import Copy
from oauth import get_document
def read_my_google_doc():
file_path = 'data/extra_data.xlsx'
get_document('0Ak3IIavLYTovdGdpcXlFS1lBaVE5aEJqcG1nMUFTVWc', file_path)
data = Copy(file_path)
for row in data['example_list']:
print '%s: %s' % (row['term'], row['definition'])
read_my_google_doc()
Python unit tests are stored in the tests
directory. Run them with fab tests
.
With the project running, visit localhost:8000/test/SpecRunner.html.
Compile LESS to CSS, compile javascript templates to Javascript and minify all assets:
workon graeae
fab render
(This is done automatically whenever you deploy to S3.)
If you want to test the app once you've rendered it out, just use the Python webserver:
cd www
python -m SimpleHTTPServer
fab staging master deploy
You can deploy to EC2 for a variety of reasons. We cover two cases: Running a dynamic web application (public_app.py
) and executing cron jobs (crontab
).
Servers capable of running the app can be setup using our servers project.
For running a Web application:
- In
app_config.py
setDEPLOY_TO_SERVERS
toTrue
. - Also in
app_config.py
setDEPLOY_WEB_SERVICES
toTrue
. - Run
fab staging master servers.setup
to configure the server. - Run
fab staging master deploy
to deploy the app.
For running cron jobs:
- In
app_config.py
setDEPLOY_TO_SERVERS
toTrue
. - Also in
app_config.py
, setINSTALL_CRONTAB
toTrue
- Run
fab staging master servers.setup
to configure the server. - Run
fab staging master deploy
to deploy the app.
You can configure your EC2 instance to both run Web services and execute cron jobs; just set both environment variables in the fabfile.
Cron jobs are defined in the file crontab
. Each task should use the cron.sh
shim to ensure the project's virtualenv is properly activated prior to execution. For example:
* * * * * ubuntu bash /home/ubuntu/apps/graeae/repository/cron.sh fab $DEPLOYMENT_TARGET cron_jobs.test
To install your crontab set INSTALL_CRONTAB
to True
in app_config.py
. Cron jobs will be automatically installed each time you deploy to EC2.
The cron jobs themselves should be defined in fabfile/cron_jobs.py
whenever possible.
Web services are configured in the confs/
folder.
Running fab servers.setup
will deploy your confs if you have set DEPLOY_TO_SERVERS
and DEPLOY_WEB_SERVICES
both to True
at the top of app_config.py
.
To check that these files are being properly rendered, you can render them locally and see the results in the confs/rendered/
directory.
fab servers.render_confs
You can also deploy only configuration files by running (normally this is invoked by deploy
):
fab servers.deploy_confs
Sometimes it makes sense to run a fabric command on the server, for instance, when you need to render using a production database. You can do this with the fabcast
fabric command. For example:
fab staging master servers.fabcast:deploy
If any of the commands you run themselves require executing on the server, the server will SSH into itself to run them.
The Google Analytics events tracked in this application are:
Category | Action | Label | Value |
---|---|---|---|
graeae | tweet | location |
|
graeae | location |
||
graeae | location |
||
graeae | new-comment | ||
graeae | open-share-discuss | ||
graeae | close-share-discuss | ||
graeae | summary-copied | ||
graeae | featured-tweet-action | action |
|
graeae | featured-facebook-action | action |