We have built an example app in Cascading and Apache Hadoop, based on the City of Palo Alto open data provided via Junar: http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/
Students can extend the example workflow to build derivative apps, or use it as a starting point for other ways to leverage this data.
We will also draw some introductory material from these related talks:
A chapter in the O'Reilly book "Enterprise Data Workflows with Cascading" (June 2013) describes this app in more detail.
We used some of the CoPA open data for parks, roads, trees, etc., and have shown how to use Cascading and Hadoop to clean up the raw, unstructured download. Based on that initial ETL workflow, we get geolocation + metadata for each item of interest:
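As an illustrative sketch only (in Python rather than Cascading, and assuming a hypothetical raw export format), the cleanup step boils down to parsing each raw record into geolocation + metadata and dropping rows that fail to parse:

```python
import csv
import io

# Hypothetical raw export rows: kind, description, "lat,lon" blob.
# The real CoPA download is messier; this is just the shape of the idea.
RAW = """tree,Live Oak site 1,"37.4449,-122.1610"
road,Ramona St block 500,"37.4440,-122.1615"
park,Heritage Park,"37.4452,-122.1623"
"""

def clean(raw_text):
    """Parse raw rows into (kind, metadata, lat, lon) tuples,
    skipping rows whose geolocation fails to parse."""
    items = []
    for kind, meta, geo in csv.reader(io.StringIO(raw_text)):
        try:
            lat, lon = (float(x) for x in geo.split(","))
        except ValueError:
            continue  # unstructured junk row; the app routes these to a trap
        items.append((kind, meta, lat, lon))
    return items

for item in clean(RAW):
    print(item)
```

In the actual app this filtering happens in the Cascading ETL flow, with bad records diverted to the output/trap tap rather than silently skipped.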
One use case could be “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” In other words, we could determine estimates for albedo vs. relative shade. Perhaps as the starting point for a mobile killer app. Or something.
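A toy version of that albedo-vs-shade estimate might look like the following (Python, with entirely hypothetical segment names and feature values, and a made-up scoring heuristic: taller trees add shade, higher road albedo subtracts from walking comfort):

```python
# Hypothetical per-road-segment features: estimated tree height (m)
# and road surface albedo (0 = dark asphalt, 1 = fully reflective).
segments = {
    "ramona_500":     {"tree_height_m": 12.0, "albedo": 0.12},
    "university_300": {"tree_height_m": 3.0,  "albedo": 0.25},
}

def shade_score(seg, max_height_m=20.0):
    """Toy heuristic: normalized tree height minus albedo.
    Higher score = shadier, more pleasant walk."""
    return min(seg["tree_height_m"] / max_height_m, 1.0) - seg["albedo"]

# Recommend the shadiest segment for that conference-call stroll.
best = max(segments, key=lambda k: shade_score(segments[k]))
print(best)
```

The app's real recommender works from the cleaned tree and road metadata; the point here is only that a simple per-segment score is enough to rank candidate walks.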
A conceptual diagram for this app in Cascading is shown as:
Other relevant data science aspects... some extensions could improve results:
The use of geohash is arguably a hack, but it works fine for this case. In a larger geographic area there might be discontinuities. A more robust approach to geospatial indexing would be, for example, to use k-d trees.
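To make the trade-off concrete, here is a minimal geohash encoder (a standard-algorithm sketch in Python, not the library the app uses). Nearby points usually share a hash prefix, which is what makes prefix joins work; but two points straddling a cell boundary can share no prefix at all, which is the discontinuity mentioned above:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=9):
    """Encode lat/lon by alternately bisecting the longitude and
    latitude ranges, packing the bisection bits 5 at a time into
    base-32 characters."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    hashed, bits, ch, even = [], 0, 0, True
    while len(hashed) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            hashed.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(hashed)

# Points near downtown Palo Alto share the "9q9" prefix...
print(geohash(37.4449, -122.1610))
# ...but two points a few meters apart across the equator
# already differ in their first character:
print(geohash(0.0001, 0.0, 6), geohash(-0.0001, 0.0, 6))
```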
Note that this example illustrates some key elements of a good data product:
We could combine this CoPA open data with access to external APIs:
Looks like this data would be even more valuable if it included ambient noise levels. Somehow.
Question: How could your new business obtain data for ambient noise levels in Palo Alto?
To generate an IntelliJ project use:
To build the sample app from the command line use:
gradle clean jar
Before running this sample app, be sure to set your HADOOP_HOME environment variable, then clear the output directory. To run on a desktop/laptop with Apache Hadoop in standalone mode:
rm -rf output
hadoop jar ./build/libs/copa.jar data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv output/trap output/tsv output/tree output/road output/park output/shade output/reco
To view the results, for example the output recommendations in output/reco:

ls output
more output/reco/part-00000
An example of log captured from a successful build+run is at https://gist.github.com/3660888
To run the R script, load src/scripts/copa.R into RStudio, or from the command line run:
R --vanilla --slave < src/scripts/copa.R
...and then check output in the file