# Demo 8 - Raleigh 2014 Incident Time Series Analysis

This demo performs basic time series analysis on our Raleigh 2014 incident dataset in R, with an emphasis on quick visualization.

In [None]:
install.packages('nlme', repos = "http://cran.us.r-project.org")
install.packages('tidyverse', repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages('RODBC', repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages('scales', repos = "http://cran.us.r-project.org", dependencies = TRUE)

In [None]:
library(tidyverse)
library(RODBC)
library(scales)

Connect to a local database and load the Raleigh incident data.  If you need help loading the data, open `DataLoad\RaleighIncidents2014\0 - Database Prep.sql` and follow the instructions there.

In [None]:
conn <- odbcDriverConnect("driver={SQL Server};server=LOCALHOST;database=OutlierDetection;trusted_connection=true")
raleigh2014 <- sqlQuery(conn,
  "SELECT
	i.BeatID,
	i.IncidentCode,
	ic.IncidentDescription,
	it.IncidentType,
	i.IncidentDate,
	i.IncidentNumber
FROM Raleigh2014.Incident i
	INNER JOIN Raleigh2014.IncidentCode ic
		ON i.IncidentCode = ic.IncidentCode
	INNER JOIN Raleigh2014.IncidentType it
		ON ic.IncidentTypeID = it.IncidentTypeID;"
)

The first step when analyzing a data set is to review the variables and basic summary information.

In [None]:
str(raleigh2014)

We'll want to do a bit of cleanup here.  We'll convert the text values (including the poorly named IncidentNumber, which we treat as a label rather than a quantity) to strings and split the date out into year, month, and day columns for easier analysis downstream.

In [None]:
raleigh2014$IncidentNumber <- as.character(raleigh2014$IncidentNumber)
raleigh2014$IncidentCode <- as.character(raleigh2014$IncidentCode)
raleigh2014$IncidentType <- as.character(raleigh2014$IncidentType)
raleigh2014$IncidentDescription <- as.character(raleigh2014$IncidentDescription)
raleigh2014$IncidentYear <- as.integer(format(raleigh2014$IncidentDate, format="%Y"))
raleigh2014$IncidentMonth <- as.integer(format(raleigh2014$IncidentDate, format="%m"))
raleigh2014$IncidentDay <- as.integer(format(raleigh2014$IncidentDate, format="%d"))
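
To see what the `format()` calls above produce, here is a minimal sketch on two made-up timestamps. The `sample_dates` vector and the `incident_*` names are hypothetical and are not part of the Raleigh data.

```r
# Hypothetical timestamps standing in for the IncidentDate column.
sample_dates <- as.POSIXct(c("2012-02-29 13:45:00", "2013-07-04 08:15:00"), tz = "UTC")

# format() returns character strings such as "02"; as.integer() turns them into numbers.
incident_year  <- as.integer(format(sample_dates, format = "%Y"))
incident_month <- as.integer(format(sample_dates, format = "%m"))
incident_day   <- as.integer(format(sample_dates, format = "%d"))

print(data.frame(incident_year, incident_month, incident_day))
```

Storing the parts as integers rather than strings makes later comparisons such as `IncidentYear < 2014` behave numerically.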

## Time Series Basics

We can look at incidents by month with a simple time-series analysis.  The code below keeps only incidents from before 2014, adds a new variable called IncidentCount (set to 1 for every row), and sums those counts by month.

In [None]:
raleigh2014.timeseries <- raleigh2014 %>% filter(IncidentYear < 2014)
raleigh2014.timeseries$IncidentCount <- 1
raleigh2014.timeseries$Month <- as.Date(cut(raleigh2014.timeseries$IncidentDate, breaks = "month"))
raleigh2014.timeseries$Date <- as.Date(cut(raleigh2014.timeseries$IncidentDate, breaks = "day"))
ggplot(data = raleigh2014.timeseries, aes(x = Month, y = IncidentCount)) +
  # ggplot2 3.3.0 and later use `fun`; the older `fun.y` argument is deprecated.
  stat_summary(fun = sum, geom = "line") +
  scale_x_date(labels = date_format("%Y-%m"), breaks = date_breaks("year"))
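
The plot sums the 1s in IncidentCount within each month. As a rough sketch of that aggregation outside ggplot, the base R example below does the same thing on a handful of invented dates. The `toy` data frame is hypothetical and only illustrates the mechanics.

```r
# Hypothetical incident dates standing in for IncidentDate.
toy <- data.frame(
  IncidentDate = as.Date(c("2013-01-05", "2013-01-20", "2013-02-03",
                           "2013-02-10", "2013-02-27"))
)
toy$IncidentCount <- 1

# cut() buckets each date into its month, labeled by the first day of that month.
toy$Month <- as.Date(cut(toy$IncidentDate, breaks = "month"))

# Summing the 1s within each month gives the monthly incident count.
monthly <- aggregate(IncidentCount ~ Month, data = toy, FUN = sum)
print(monthly)
```

Having the monthly totals in a data frame can be handy when you want to inspect the numbers behind the line rather than only the picture.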

It seems that the number of incidents drops sharply each year...but that drop coincides with February, so it could just be that February has 2-3 fewer days than the other months.  Instead of the raw incident count, let's look at incidents per day.
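
The day-count argument can be checked directly. This small sketch computes the number of days in each month of 2013, chosen here only as an arbitrary non-leap year; the variable names are illustrative.

```r
# First day of each month in 2013, plus the first day of January 2014.
month_starts <- seq(as.Date("2013-01-01"), by = "month", length.out = 13)

# The gap between consecutive month starts is the number of days in each month.
days_in_month <- as.integer(diff(month_starts))
names(days_in_month) <- format(month_starts[1:12], "%m")

print(days_in_month)
```

February comes out 2 to 3 days shorter than every other month, so a lower raw count in February is expected even if the daily rate were constant.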

In [None]:
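# max(IncidentDay) is the latest day of the month with a recorded incident.
# It serves as a proxy for the number of days in that month.
ipd <- raleigh2014.timeseries %>%
          group_by(Month) %>%
          summarize(IncidentsPerDay = sum(IncidentCount)/max(IncidentDay))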

ggplot(data = ipd, aes(x = Month, y = IncidentsPerDay)) +
  geom_point() +
  geom_line() +
  scale_x_date(labels = date_format("%Y-%m"), breaks = date_breaks("year"))

This shows that the number of incidents per day stays within the range of 114-134 (outside a few exceptional months).  Interestingly, February's incident rates per day are still well below the average.  This tells us that February's missing days aren't the entire difference here.
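
One way to put a number on "below the average" is to subtract the overall mean from each month's daily rate and look at the February rows. The sketch below uses a toy data frame shaped like `ipd`; the rates in it are invented for illustration and are not results from the Raleigh data.

```r
# Toy stand-in for the ipd data frame; these rates are invented for illustration.
toy_ipd <- data.frame(
  Month = seq(as.Date("2013-01-01"), by = "month", length.out = 6),
  IncidentsPerDay = c(125, 110, 128, 130, 127, 124)
)

overall_mean <- mean(toy_ipd$IncidentsPerDay)

# Difference between each month's daily rate and the overall mean.
toy_ipd$DiffFromMean <- toy_ipd$IncidentsPerDay - overall_mean

# Keep only the February rows to see how far below the mean they sit.
february <- toy_ipd[format(toy_ipd$Month, "%m") == "02", ]
print(february)
```

Running the same steps against the real `ipd` data frame would show whether every February falls below the mean or only some of them.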