Combining datasets with MapReduce on NFL play by play data.
.settings Initial commit. Jan 4, 2013
conf Initial commit. Jan 4, 2013
input Added the latest datasets. Oct 11, 2013
spark Added program that runs all HQL queries. Caches the playbyplay table … Jul 3, 2015
src Fixed regex bug for interceptions. Oct 13, 2013
.classpath Changed class paths to CDH4.3 to support the Cloudera Quickstart VM. … Jul 10, 2013
.gitignore Removed enums as AvroSaver doesn't work correctly with them. It error… Jul 3, 2015
.project Initial commit. Jan 4, 2013
173328.csv Recreated weather data because it was missing too much. Added more co… Jul 11, 2013
README.md Added setup script to make it easier to get started. Oct 11, 2013
adddriveresult.hql Added a new playid to uniquely identify a play and keep its order. Ch… Oct 7, 2013
adddrives.hql Added a new playid to uniquely identify a play and keep its order. Ch… Oct 7, 2013
arrests.csv Added arrests data. Jul 9, 2013
columns.txt Updated readme with URL to play by play. Jul 17, 2013
drivesresulttransform.py Added a new playid to uniquely identify a play and keep its order. Ch… Oct 7, 2013
drivestransform.py Added a new playid to uniquely identify a play and keep its order. Ch… Oct 7, 2013
license.md Create license.md Jul 24, 2014
playbyplay_join.hql Added drive and play numbers. Added number of plays per drive and the… Oct 6, 2013
playbyplay_tablecreate.hql Added a new playid to uniquely identify a play and keep its order. Ch… Oct 7, 2013
queries.hql Modified queries to work with Spark SQL. Jul 3, 2015
queries.pig Added cast to float for percentages. Oct 10, 2013
setup.sh Changed setup script to have a hard-coded directory. Feb 22, 2014
stadiums.csv Added elevation to stadium dataset. Jul 16, 2013

README.md

nfldata

There are two series of MapReduce programs. The first is a series of programs that extract and normalize the data. The second is a simple program that looks at incomplete passes.

The play by play dataset can be found at http://www.advancednflstats.com/2010/04/play-by-play-data.html.

ETL Series

This series takes the play by play dataset and merges it with other datasets such as arrests, stadiums and weather.

Set things up by running the setup.sh script or run the following steps manually:
Run the PlayByPlayDriver on the play by play data.
Run the ArrestJoinDriver on the output of PlayByPlayDriver. (Place the result in HDFS under joinedoutput.)
Put the stadiums.csv in HDFS in a directory called stadium.
Put the 173328.csv in HDFS in a directory called weather.
In Hive, run playbyplay_tablecreate.hql.
In Hive, run playbyplay_join.hql.
In Hive, run adddrives.hql.
In Hive, run adddriveresult.hql.
Query and have fun!
See the queries in queries.hql for some examples of how and what to query.
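
The HDFS and Hive steps above can be sketched as a small Python helper that prints the commands in order and only runs them when asked. This is a sketch, not a copy of setup.sh: it assumes the `hadoop` and `hive` command-line tools are on the PATH and that the files sit in the current directory. The two MapReduce driver steps are left out because the jar name and driver arguments are not given here.

```python
import shlex
import subprocess

# Local files to upload, paired with the HDFS directory named in the steps.
HDFS_UPLOADS = [
    ("stadiums.csv", "stadium"),
    ("173328.csv", "weather"),
]

# Hive scripts, in the order the steps list them.
HIVE_SCRIPTS = [
    "playbyplay_tablecreate.hql",
    "playbyplay_join.hql",
    "adddrives.hql",
    "adddriveresult.hql",
]


def build_commands():
    """Return the commands as argument lists, in execution order."""
    commands = []
    for local_file, hdfs_dir in HDFS_UPLOADS:
        commands.append(["hadoop", "fs", "-mkdir", hdfs_dir])
        commands.append(["hadoop", "fs", "-put", local_file, hdfs_dir])
    for script in HIVE_SCRIPTS:
        commands.append(["hive", "-f", script])
    return commands


def run(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(shlex.quote(part) for part in command))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run(dry_run=True)
```

Running it as-is only prints the commands, which makes it easy to compare them against what setup.sh does before executing anything.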

Incomplete Passes

Simple MapReduce on NFL play by play data. This program focuses on incomplete passes and which receiver they were thrown to. See http://www.jesse-anderson.com/2013/01/nfl-play-by-play-analysis/ for the resulting charts.
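
The idea can be illustrated outside Hadoop with a small map and reduce written in plain Python. This is only a sketch, not the repository's Java code: the regular expression and the description style it expects ("... pass incomplete short right to W.Welker.") are assumptions, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed description style: "... pass incomplete short right to W.Welker."
# The pattern is illustrative, not the one used by the Java mapper.
INCOMPLETE_PASS = re.compile(
    r"pass incomplete\b.*?\bto ([A-Z][A-Za-z]*\.[A-Za-z'-]+)"
)


def map_play(description):
    """Map step: emit (receiver, 1) for an incomplete pass with a target."""
    match = INCOMPLETE_PASS.search(description)
    if match:
        yield match.group(1), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each receiver."""
    totals = Counter()
    for receiver, count in pairs:
        totals[receiver] += count
    return totals


def count_incomplete_targets(descriptions):
    """Run the map step over every play, then reduce the emitted pairs."""
    pairs = (pair for text in descriptions for pair in map_play(text))
    return reduce_counts(pairs)
```

Plays with no named target, and completed passes, emit nothing in the map step, so they never reach the reducer.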