# Using open data with Jupyter notebooks

This notebook shows how to work with open data in a Jupyter notebook, compared with the more traditional approach of completing an open data assignment with standard desktop tools.

The goal of the exercise is to use current National Hockey League (NHL) results to determine whether a team is on pace for making the playoffs. 

## Traditional approach

### Tool 1
Traditionally, students would have to visit a particular website to access the data:
    http://www.hockey-reference.com/teams/CGY/2017_games.html

<img src="opendata_imgs/cgy_standings.png" width="500px" />

### Tool 2
From there, they would have to manually copy and paste the data into a tool such as Microsoft Excel. 

<img src="opendata_imgs/cgy_excel.png" width="100%"/>

...and create some graph...

![title](opendata_imgs/cgy_excel_graph.png)

### Tool 3
...that is then copied and pasted into Microsoft Word in order to write up a final report. 

![title](opendata_imgs/cgy_word.png)

In total, that means the students would need to use the following tools:
- a web browser
- Microsoft Excel or something like it
- Microsoft Word or something similar

The final product is usually a static snapshot in time. 

## Jupyter notebooks approach

With Jupyter notebooks, the entire analysis can be done in one tool that requires only a web browser. The end product is an interactive notebook that combines live code with the explanatory narrative of how the analysis was conducted (literate programming), so that anyone can read and follow it.

Start by installing the libraries needed for the analysis. They only need to be installed once, and installation takes a couple of minutes.

In [None]:
install.packages(c("RCurl", "XML", "plyr", "ggplot2"))

Load the libraries: 

In [None]:
library(RCurl)
library(XML)
library(plyr)
library(ggplot2)

Now, let's load the data directly into the Jupyter notebook from the site we visited manually earlier and take a look at it:

In [None]:
readHTMLTable("http://www.hockey-reference.com/teams/CGY/2017_games.html", header=TRUE)

Clean the data and calculate the number of points the team has accumulated.

In [None]:
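# Read every table on the page; the results table is the one named "games".
cgy_results <- readHTMLTable("http://www.hockey-reference.com/teams/CGY/2017_games.html", header=TRUE)
results.clean <- cgy_results$games

# Drop the header rows that the site repeats inside the table body.
results.clean <- results.clean[results.clean$Opponent != 'Opponent', ]
# Keep only the columns needed for the analysis (selected by position).
results.clean <- results.clean[, c(1, 2, 5, 6, 7, 10, 11, 12)]

# Convert the record columns from text to integers.
results.clean$W <- as.integer(as.character(results.clean$W))
results.clean$L <- as.integer(as.character(results.clean$L))
results.clean$OL <- as.integer(as.character(results.clean$OL))
results.clean$GP <- as.integer(as.character(results.clean$GP))

# Two points for a win, one for an overtime loss.
results.clean$Points <- results.clean$W * 2 + results.clean$OL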
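
The cleaning steps above can be tried without a network connection on a small made-up table. This is only a sketch: the data frame `toy` and its rows are invented for illustration and are not real results. The columns are stored as factors to mimic text columns read from a web page, which is why `as.character()` is applied before `as.integer()`.

```r
# Made-up rows imitating the scraped table: every column arrives as text,
# and the header row is repeated partway through the table body.
toy <- data.frame(
  GP       = c("1", "2", "GP", "3", "4"),
  Opponent = c("Edmonton Oilers", "Edmonton Oilers", "Opponent",
               "Vancouver Canucks", "Buffalo Sabres"),
  W        = c("0", "0", "W", "0", "1"),
  L        = c("1", "2", "L", "2", "2"),
  OL       = c("0", "0", "OL", "1", "1"),
  stringsAsFactors = TRUE
)

# Drop the repeated header row.
toy <- toy[toy$Opponent != "Opponent", ]

# Convert factor -> character -> integer. Calling as.integer() on a factor
# directly would return its internal level codes, not the numbers shown.
for (col in c("GP", "W", "L", "OL")) {
  toy[[col]] <- as.integer(as.character(toy[[col]]))
}

# Two points per win, one per overtime loss.
toy$Points <- toy$W * 2 + toy$OL
toy[, c("GP", "W", "L", "OL", "Points")]
```

Because `W` and `OL` are running totals for the season, `Points` is also a running total, which is what the plot in the next step relies on.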

Create a plot to see how the team has been accumulating points.

In [None]:
p <- ggplot(data=results.clean, aes(x=GP, y=Points)) + geom_jitter() 
p

Let's add a line showing how many points the team needs after each game to reach 96 points by the end of the 82-game season, a total that pretty much guarantees a playoff spot.

In [None]:
results.clean$PtsPace <- results.clean$GP * (96/82.0)
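
The arithmetic behind the pace line can be checked on its own. In this small sketch the names `target_points`, `games_in_season`, and `pace_at` are illustrative and not part of the notebook's data. A 96-point target over 82 games works out to roughly 1.17 points per game, or 48 points at the halfway mark.

```r
target_points   <- 96   # season target used in this notebook
games_in_season <- 82   # games in the regular season

# Points per game needed to finish exactly on the target.
pace_per_game <- target_points / games_in_season

# Points a team should have after `gp` games to stay on that pace.
pace_at <- function(gp) gp * pace_per_game

pace_at(41)   # halfway through the season
pace_at(82)   # end of the season
```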

In [None]:
p <- ggplot(data=results.clean, aes(x=GP, y=Points)) +
  geom_line() +
  geom_line(aes(x=GP, y=PtsPace), colour="red")
p

Mark up the figure to make it easier to understand.

In [None]:
p.title <- list(labs(title="How my team is doing against a 96 point pace"), xlab("Games Played"), ylab("Points"))
p + p.title

In conclusion, the Flames are on pace to make the playoffs.
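
One way to make "on pace" concrete, beyond reading it off the plot, is to extrapolate the current points rate over a full season and compare it with the 96-point target. The sketch below uses made-up numbers, and `projected_points` is a hypothetical helper, not a function from any of the loaded libraries.

```r
# Hypothetical helper: extrapolate the current points rate to a full season.
projected_points <- function(points, gp, season_games = 82) {
  points / gp * season_games
}

# Made-up example: 50 points after 41 games.
proj <- projected_points(points = 50, gp = 41)
proj          # projected season total
proj >= 96    # TRUE means the team is at or above the 96-point pace
```

With the notebook's own data, the inputs would presumably come from the most recent completed game, for example the last row of `results.clean` that has a result.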