Home

ceteri edited this page · 24 revisions

CMU Workshop on Cascading plus City of Palo Alto Open Data

We have built an example app in Cascading and Apache Hadoop, based on the City of Palo Alto open data provided via Junar: http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/

Students can extend the example workflow to build derivative apps, or use it as a starting point for other ways to leverage this data.

We will also draw some introductory material from these related talks:

There is also a chapter that describes this app in more detail in the O'Reilly book "Enterprise Data Workflows with Cascading" (June 2013).

Example App

We used some of the CoPA open data for parks, roads, trees, etc., and have shown how to use Cascading and Hadoop to clean up the raw, unstructured download. Based on that initial ETL workflow, we get geolocation + metadata for each item of interest:

  • trees w/ species
  • road pavement w/ traffic conditions
  • parks

One use case could be “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” In other words, we could determine estimates for albedo vs. relative shade. Perhaps as the starting point for a mobile killer app. Or something.

Additional data is included here, to be joined with the cleaned-up CoPA data about trees and roads. We will also use log data collected using GPS Tracks.

A conceptual diagram for this app in Cascading is shown as:

Conceptual Workflow Diagram

App Development Process

  1. Clean up the raw, unstructured data from the CoPA download… aka ETL
  2. Before modeling, perform visualization and summary statistics in RStudio
  3. Ideation and research for potential use cases
  4. Iterate on business process for the app workflow
  5. Apply best practices and TDD at scale
  6. Integrate with end use cases represented by the workflow endpoints
  7. PROFIT!

Some Caveats:

  • Data Quality: some species names have spelling errors or misclassifications -- could be cleaned up and provided back to CoPA
  • Assumptions have been made about missing data -- were these appropriate for the intended use case?
  • The resulting data still needs: common names for trees, photos, natives vs. invasives, toxicity, etc.
  • There are much better ways to handle the geospatial work, e.g., k-d trees
  • Arguably, this is not a large data set; however, it’s early days for the open data initiative, and besides, Palo Alto has a population of only about 65K.
  • This provides a good area for a POC, prior to deploying in other, larger metro areas.
  • This example helps illustrate how in terms of “Big Data”, complexity is more important to consider than bigness.
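
As a rough illustration of the k-d tree caveat above, here is a minimal sketch in Python (not part of the sample app, which is written in Cascading). It assumes points are plain (lat, lon) tuples and compares squared Euclidean distance on raw degrees, which is only a reasonable approximation over an area as small as Palo Alto. The names build_kdtree, nearest, and sq_dist, and the sample coordinates, are hypothetical.

```python
from collections import namedtuple

# One tree node: the point stored here, the axis it splits on, and two subtrees.
Node = namedtuple("Node", ["point", "axis", "left", "right"])


def build_kdtree(points, depth=0):
    """Build a 2-d tree by splitting on the median, alternating lat and lon."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(
        pts[mid],
        axis,
        build_kdtree(pts[:mid], depth + 1),
        build_kdtree(pts[mid + 1:], depth + 1),
    )


def sq_dist(a, b):
    """Squared Euclidean distance on raw degrees (an approximation)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest(node, target, best=None):
    """Return the stored point closest to target, or None for an empty tree."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Search the far side only if the splitting line is closer than the best hit.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


# Hypothetical tree locations and a hypothetical GPS reading.
trees = [(37.4419, -122.1430), (37.4275, -122.1697), (37.4530, -122.1817)]
closest_tree = nearest(build_kdtree(trees), (37.4440, -122.1600))
```

Unlike a geohash prefix match, this lookup has no cell boundaries, so two nearby points are never separated by an artifact of the index.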

Next Steps

Other data science techniques are relevant here, and some extensions could improve results:

  • Bayesian point estimates for identifying "most frequented" paths and locations from the GPS logs
  • Kriging to smooth the geo distribution of estimated metrics
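
One possible reading of the first bullet, sketched in Python: treat each GPS fix as a visit to a geohash cell and take the posterior mean of each cell's share of visits under a symmetric Dirichlet prior. This is only an assumption about what such an estimate might look like. The function name posterior_visit_share, the alpha parameter, and the sample cell ids are hypothetical and do not come from the sample app.

```python
from collections import Counter


def posterior_visit_share(visits, cells, alpha=1.0):
    """Posterior mean share of visits per cell under a Dirichlet(alpha) prior.

    visits: list of cell ids, one per GPS fix.
    cells:  every cell under consideration, including unvisited ones.
    """
    cells = list(cells)
    counts = Counter(visits)
    unknown = set(counts) - set(cells)
    if unknown:
        raise ValueError(f"visits outside the known cells: {sorted(unknown)}")
    total = len(visits) + alpha * len(cells)
    return {cell: (counts[cell] + alpha) / total for cell in cells}


# Hypothetical GPS log, already bucketed into geohash cells.
visits = ["9q9jh0", "9q9jh0", "9q9jh1"]
cells = ["9q9jh0", "9q9jh1", "9q9jh2"]
shares = posterior_visit_share(visits, cells)
most_frequented = max(shares, key=shares.get)
```

The prior keeps unvisited cells from getting a share of exactly zero, which matters when the GPS log is short.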

The use of geohash is arguably a hack, but it works fine for this case. Over a larger geographic area there might be discontinuities, so a more robust approach to geospatial indexing would be to use k-d trees.
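
To make the geohash discussion concrete, here is a minimal sketch of the standard geohash encoding in Python. It is not the code used in the sample app; the function name geohash_encode and the sample coordinates are hypothetical. The last two lines show the kind of discontinuity mentioned above: two points a few meters apart can fall in cells that share no prefix at all.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# A point near downtown Palo Alto (approximate, for illustration only).
palo_alto_cell = geohash_encode(37.44, -122.14)

# Two points just north and south of the equator: close together,
# yet their geohashes differ from the very first character.
north = geohash_encode(0.0001, 10.0)
south = geohash_encode(-0.0001, 10.0)
```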

Note that this example illustrates some key elements of a good data product:

  • ETL of unstructured data (CoPA GIS export)
  • curated metadata: tree species dataset, road albedo dataset
  • log files: iPhone personalized mobile coordinates
  • calibration and testing based on R
  • algorithms: geospatial indexing, replicated joins
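
For readers new to the term, a replicated join holds the smaller dataset in memory and streams the larger one past it, which avoids a shuffle; in Cascading, the HashJoin pipe is the usual way to express this. The idea can be sketched in plain Python as follows. The function name replicated_join and the field names (species, tree_id, shade) are hypothetical and are not taken from the sample app's data files.

```python
def replicated_join(big_rows, small_rows, key):
    """Inner-join two iterables of dicts on `key`, keeping small_rows in memory."""
    # Build an in-memory lookup from the small side; on a cluster,
    # this is the part that gets replicated to every task.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the big side and emit one merged row per match.
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical rows: cleaned-up tree records and curated species metadata.
tree_rows = [
    {"tree_id": 1, "species": "oak"},
    {"tree_id": 2, "species": "palm"},
    {"tree_id": 3, "species": "unknown"},
]
species_meta = [
    {"species": "oak", "shade": 0.8},
    {"species": "palm", "shade": 0.2},
]
joined = list(replicated_join(tree_rows, species_meta, "species"))
```

Rows on the large side with no match in the metadata (the "unknown" species here) are dropped, as in any inner join.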

Enriching The Data

We could combine this CoPA open data with access to external APIs:

Other Potential Use Cases

Trulia:

  • estimate allergy zones, for real estate preferences
  • optimize sales leads: target sites for conversion to residential solar
  • optimize sales leads: target sites for an urban agriculture venture

Calflora:

  • report observations of natives on endangered species list
  • report new observations of invasives / toxicology
  • infer regions of affinity for beneficial insects

City of Palo Alto:

  • premium payment / bid system for an open parking spot in the shade
  • welcome services for visitors (ecotourism, translated park info, etc.)
  • city planning: expected rates for tree replanting, natives vs. invasives, etc.
  • liabilities: e.g., oleander (common, highly toxic) near day care centers
  • epidemiology, e.g. there are outbreaks of disastrous tree diseases -- with big impact on property values

Community organizations:

  • volunteer events: harvest edibles to donate to shelters

Start-ups:

  • some of the invasive species are valuable in Chinese medicine while others can be converted to biodiesel -- potential win-win for targeted harvest services

Extending The Data

Looks like this data would be even more valuable if it included ambient noise levels. Somehow.

Question: How could your new business obtain data for ambient noise levels in Palo Alto?

  • infer from road data
  • infer from bus lines, rail schedule
  • sample/aggregate from mobile devices in exchange for micropayments
  • buy/aggregate data from home security networks
  • fly nano quadrotors, DIY "Street View" for audio
  • fly micro aerostats, with Arduino-based accelerometer and positioned parabolic mic
  • partner with City of Palo Alto to deploy a simple audio sensor grid

Build Instructions

To generate an IntelliJ project use:

gradle ideaModule

To build the sample app from the command line use:

gradle clean jar

Before running this sample app, be sure to set your HADOOP_HOME environment variable. Then clear the output directory. To run on a desktop/laptop with Apache Hadoop in standalone mode:

rm -rf output
hadoop jar ./build/libs/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv output/trap output/tsv output/tree output/road output/park output/shade output/reco

To view the results, for example the output recommendations in reco:

ls output
more output/reco/part-00000

An example log captured from a successful build and run is at https://gist.github.com/3660888

To run the R script, load src/scripts/copa.R into RStudio or from the command line run:

R --vanilla --slave < src/scripts/copa.R

...and then check output in the file Rplots.pdf

About Cascading

There is a tutorial about getting started with Cascading in the blog post series called Cascading for the Impatient. Other documentation is available at http://www.cascading.org/documentation/.

For more discussion, see the cascading-user email forum. We also have a meetup started.
