Simple Crime Statistics
----------------------

**Updating the MCCA data** 

When we finished our last session, I left you with a challenge -- see if you can get the data into shape from the Major Cities Chiefs Association for 2016-2017, and update our analysis. The original report that generated the CSV we used for 2015-2016 [is posted here](https://www.majorcitieschiefs.com/pdf/news/mcca_violent_crime_data_midyear_20162015.pdf). The updated version for 2016-2017 [is posted here](https://www.majorcitieschiefs.com/pdf/news/mcca_violent_crime_report_2017_and_2016_midyear_07312017_update.pdf). I asked that you grab the new file and see what you could do with it. I mentioned that a program like [Tabula](http://tabula.technology/) can be your friend when pulling data from tables and converting it to more usable CSV files. How did you do? Let's review this exercise briefly, as you will have to do this kind of thing a lot. A lot.

First, before we try to do anything fancy, what can you tell me about the two reports? Just look at the PDFs for a second and jot down some conclusions. First think about the data and then think about its "structure" -- how is it organized? How are things labeled? If we want to combine the two data sets, this kind of observation is important.

**Put your observations here**



Now, let's be a little more systematic. I've gone to the trouble of creating CSVs from both PDFs. You can download
[MCCA 2015-2016 here](https://www.dropbox.com/s/ds39805jpc2gajy/mcca2016.csv?dl=0) and 
[MCCA 2016-2017 here](https://www.dropbox.com/s/8oogqsl86iijb9y/mcca2017.csv?dl=0). Download each and put them in the folder where you put this notebook file. We can then read them in and have a look.

In [None]:
m1 = read.csv("mcca2016.csv",as.is=TRUE)
m2 = read.csv("mcca2017.csv",as.is=TRUE)

head(m1)
head(m2)

Any new observations?

What we'd like to do next is "join" the two data sets. We are going to use the agency name as our "key" for this operation. That is, we want to take all the crime data for Austin, say, and create one long row that has both the 2016-2017 data as well as the 2015-2016 data. The library dplyr comes to our rescue again with another verb. There are several kinds of "joins" that we can do when bringing the data from two tables together.

We know, for example, that the two MCCA data sets include slightly different cities. So, how do we handle that? Do we view the current data set as our master, in some sense, and just add whatever data we can from the previous year, ignoring all the cities in 2015-2016 that don't appear in 2016-2017? Do we do it the other way? Do we only include cities that are part of both data sets? That are part of either data set? You can read about the different flavors of "join" by asking R for help.

In [None]:
library(dplyr)

In [None]:
?join
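To make the flavors of join concrete, here is a minimal sketch on two made-up tables. The agencies and counts below are invented for illustration and are not from the MCCA files; only the dplyr join verbs themselves are real.

```r
library(dplyr)

# Two tiny made-up tables that share the key column "agency"
y1 = data.frame(agency=c("A","B","C"), hom=c(10,20,30))
y2 = data.frame(agency=c("B","C","D"), hom=c(25,28,5))

inner_join(y1,y2,by="agency")   # only agencies found in both tables: B, C
left_join(y1,y2,by="agency")    # every agency in y1: A, B, C
right_join(y1,y2,by="agency")   # every agency in y2: B, C, D
full_join(y1,y2,by="agency")    # every agency in either table: A, B, C, D
```

Notice where the NA's land in each result. Agency A has no match in y2, so its y2 column is NA whenever A is kept.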

In this case, I think we want a "full join," meaning we keep all the data and fill in missing values (NA's) for cities that occur in one year's report but not the other. Here's what we get.

In [None]:
mcca = full_join(m1,m2,by="agency")
head(mcca)

Since we had columns named "hom16" in both data sets, R added a suffix to each to disambiguate them. I'm not wild about ".x" and ".y" so we can specify something different using the argument "suffix" to the full_join() command. What happens now?

In [None]:
mcca = full_join(m1,m2,by="agency",suffix=c("a","b"))
head(mcca)

We use the options() command to control the way R output appears in the notebook. We have used this already when resizing our histogram plots. The options() call below lets us show a few more rows and columns of our data set, replacing the "..." we see in the output above.

In [None]:
options(repr.matrix.max.cols=50, repr.matrix.max.rows=100)
head(mcca)

Mostly for practice, let's now just look at the homicide rate. When the MCCA data came out in 2016, it was during the election cycle and candidates used them to argue about whether crime was on the rise or not. Journalistic organizations fed into this. [Here is an article from Breitbart](http://www.breitbart.com/big-government/2016/07/26/survey-violent-crime-major-cities/) and [here is one from the New York Times](https://www.nytimes.com/2016/05/14/us/murder-rates-cities-fbi.html). Remember from last time that crime statistics can be extremely variable from year to year. 

What does the current data say about our crime rate? Going up? Down? Remember the command select() chooses which columns from a data frame to keep. Here we skinny things down just to homicides.

In [None]:
homicides = select(mcca,agency,hom17,hom16b,hom16a,hom15)
homicides

One thing you'll notice right away is that the counts from 2016 don't match between the data sets. Why? You also see which cities were in and out of the previous file. Why? 

What do you think about crime? Homicides on the rise? On the decline? Let's be more systematic. Let's filter() the data so that we keep the rows where the homicide count increased from 2015 to 2016 and from 2016 to 2017. How many cities are there? 

In [None]:
filter(homicides,hom17-hom16b>0 & hom16a-hom15>0)
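A note on counting. After a full join, cities missing from one report carry NA's, and filter() quietly drops any row where the condition comes out NA. Here is a small sketch of that behavior on a made-up table. The column names mirror ours, but the values and the name "toy_h" are invented.

```r
library(dplyr)

# Made-up counts: agency C is missing 2017, agency D is missing 2015
toy_h = data.frame(agency=c("A","B","C","D"),
                   hom17=c(12,8,NA,20), hom16b=c(10,9,15,18),
                   hom16a=c(10,9,15,17), hom15=c(7,11,14,NA))

up = filter(toy_h, hom17-hom16b>0 & hom16a-hom15>0)
nrow(up)    # counts the rows that passed; here only agency A

# rows with a missing count in some year, which can never pass the test above
filter(toy_h, is.na(hom17) | is.na(hom16b) | is.na(hom16a) | is.na(hom15))
```

So when we tally "up," "down" and "everything else," some of the "everything else" may simply be cities with missing data.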

In [None]:
filter(homicides,hom17-hom16b<=0 & hom16a-hom15<=0)

So that means there are 11 cities that saw year-on-year increases, four with decreases and the remaining 40 or so switched from up to down or down to up. So let's look back at what was written in, say, the Breitbart article.

>“The biggest take away is that even though there is an increase in several violent crimes a few cities (Chicago, Phoenix, Las Vegas, Dallas, LA County, Louisville, San Antonio) account for much of the overall increase,” Darrel Stephens, the executive director of the Major Cities Chiefs Association explained in an email to Breitbart News, “Of course the tragic mass shooting in Orlando accounts for 49 of the homicides.
<br><br>
Of the 51 major cities’ police departments’, 29 reported increases in the number of homicides, including: Arlington PD, Atlanta PD, Aurora PD, Austin PD, Baltimore County PD, Boston PD, Chicago PD, Dallas PD, Forth Worth PD, Jacksonville Sheriff’s Dept., Las Vegas Metropolitan PD, Long Beach PD, Los Angeles County Sheriff’s Dept., Los Angeles PD, Louisville Metro PD, Nashville PD, Newark PD, Oklahoma City PD, Orlando PD, Philadelphia PD, Phoenix PD, Pittsburgh PD, Prince George’s County PD, San Antonio PD, San Diego PD, San Jose PD, Seattle PD, Tulsa PD, Washington DC (Metro PD).

Chicago had an increase in the 2016-2017 report but a very small one compared to the 2015-2016 number. Phoenix dropped in the recent report, as did Las Vegas, Dallas, LA County, and San Antonio. That is nearly all of the cities cited by the MCCA executive director. Let's look at the percentage increase again and then compare 2016-2017 to 2015-2016. There are a lot of ways to do this, but let's keep it simple with a histogram. We'll use mutate() to create a new version of "homicides" that has the percentage change as columns.

What do you see?

In [None]:
options(repr.plot.width=8, repr.plot.height=6)

homicides = mutate(homicides,hom1716=(hom17-hom16b)/hom16b,hom1615=(hom16a-hom15)/hom15)
hist(homicides$hom1716-homicides$hom1615,main="Difference in percent change")

The big drop?

In [None]:
filter(homicides,hom1716-hom1615< -6)

This goes to explain also why criminologists prefer to look at longer time periods than a single year. One event can completely skew the statistics and a blind analysis might mistake it for a trend.

In [None]:
hist(homicides$hom1716-homicides$hom1615,xlim=c(-3,3),main="Difference in percent change")
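The arithmetic behind an outlier like that is worth spelling out. The counts below are invented (they are not the MCCA figures), but they show how a city with a small baseline can swing by hundreds of percent from a single event while a big city barely moves. Note that our change columns are proportions, so a value of 7 means 700 percent.

```r
# Invented counts for illustration: one big city, one small city with a one-time spike
before = c(big=300, small=7)
spike  = c(big=330, small=56)
after  = c(big=320, small=10)

change1 = (spike-before)/before   # relative change going into the spike year
change2 = (after-spike)/spike     # relative change coming out of it

change1
change2
change2 - change1                 # the quantity we put in the histogram
```

The small city lands far out in the left tail of the histogram even though nothing like a trend is going on.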

**Uniform Crime Reporting - The first data from the Trump Administration**

We finished the last session by looking at data from the [Uniform Crime Reporting program](https://ucr.fbi.gov/) from the FBI. This aggregates crime statistics from local police and law enforcement agencies across the country. [The first report under the Trump administration](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016) was released recently and was picked up by various news outlets. The Atlantic had [a long piece](https://www.theatlantic.com/politics/archive/2017/09/americas-uneven-crime-spike/541023/) examining where crime increases are happening. Let's start by verifying some of their data. 

For the moment, we'll focus on 2016 data and not try to compare it to 2015 as we did with the MCCA data. Toward that end, the Atlantic states

>Rural, urban, and suburban communities all saw increases in violent crimes in 2016. But they were of varying degrees. Some places, like Houston and Washington, D.C., saw the number of murders either stay roughly the same or slightly decline. Other communities fared worse. Chicago ended 2016 with 762 murders, a whopping 58 percent jump over 2015’s total. Baltimore experienced its second-deadliest year on record with 358 murders, surpassing the previous record set in 2015.

Let's check that out. First, we will read in the data. [I've loaded it here.](https://www.dropbox.com/s/cx3ngaser1w19f9/ucr2016.csv?dl=0)

In [None]:
crime = read.csv("ucr2016.csv",as.is=TRUE)
head(crime)

In [None]:
tail(crime)

One of the functions, or verbs, added in dplyr is a special summary function. It helps you tell what kind of variable each column represents. This boils down to qualitative versus quantitative, a fundamental distinction as we get into analysis. Why? The function is called glimpse().

In [None]:
glimpse(crime)

In [None]:
dim(crime)

So glimpse() tells us that we have 9579 cities in the US reporting data to the FBI. There are 14 columns, describing the name, state, population and crime statistics for each. Recall we can extract the contents of a single column using the dollar sign. Here we take just the state names...

In [None]:
crime$State

... or the city names (using head to give us the first handful)...

In [None]:
head(crime$City)

... and then summarize the state names with a simple tabulation. That is, how many times does each state appear in the data set, or, equivalently, how many cities are included in the data set for each state.

In [None]:
table(crime$State)
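A long table is easier to read once it is sorted. Here is a small sketch using a made-up vector of state abbreviations (the vector "states" is invented for illustration); the same calls should work on crime$State.

```r
# A made-up vector of state abbreviations, one entry per city
states = c("TX","CA","TX","NY","CA","TX")

counts = table(states)
counts                               # alphabetical by default
sort(counts, decreasing=TRUE)        # most cities first
names(counts)[counts==max(counts)]   # which state appears most often
```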

As we have done with the MCCA data, we can pull out columns and summarize them in various ways. Here we just look at the overall count of violent crime in the country for 2016.

In [None]:
sum(crime$Violent_crime,na.rm=TRUE)

While the dollar sign lets us extract a single variable, the dplyr function select() lets us pull a number of them. Here we take City, State, Population and Violent_crime. In the select() function we can refer to the names of variables without the dollar sign -- all the names are interpreted as coming from the data frame "crime".

In [None]:
head(select(crime,State, City, Population, Violent_crime),25)

The verb summarise() (Hadley is from NZ and hence the "ise") takes a data frame in, this time "crime" and returns another one, but consisting of summaries. So here we sum up all the violent crime, and in the next cell we sum crime and population across cities.

In [None]:
summarise(crime, Violent_total = sum(Violent_crime,na.rm=TRUE))

In [None]:
summarise(crime,Violent_total = sum(Violent_crime,na.rm=TRUE),Population_total = sum(Population,na.rm=TRUE))

This gets interesting when we break the data into parts. Here we group_by() the "State" variable and then compute the summaries. Here are the totals of violent crime and then violent crime and population by state.

In [None]:
summarise(group_by(crime,State),Violent_total = sum(Violent_crime,na.rm=TRUE))

In [None]:
summarise(group_by(crime,State),Violent_total = sum(Violent_crime,na.rm=TRUE),Population_total = sum(Population,na.rm=TRUE))

We can store the output of this operation in a new variable called "state" that we can then examine.

In [None]:
state = summarise(group_by(crime,State),Violent_total = sum(Violent_crime,na.rm=TRUE),Population_total = sum(Population,na.rm=TRUE))
head(state)

The obvious thing to do next is introduce a new column. Previously we used the dollar sign to make this happen. With dplyr we can use a new verb mutate() that takes a data set like "state" and then adds new variables. Here we create a variable called "Violent_per100" to represent the number of violent crimes per 100,000 people in the state. (While this is just a rescaling of the "per capita" number, it does seem a little awkward for some states.)

In [None]:
state = mutate(state,Violent_per100=100000*Violent_total/Population_total)
state
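One caution about na.rm=TRUE in these rates. If a city is missing its crime count but not its population, the sum treats its crimes as zero while still counting its people, which pulls the state rate down. Here is a sketch on a made-up table (the states, cities and numbers are invented; the column names mirror ours).

```r
library(dplyr)

# Made-up data: state Y has one city, and its crime count is missing
toy_crime = data.frame(State=c("X","X","Y"),
                       Population=c(1000000, 10000, 500000),
                       Violent_crime=c(5000, 10, NA))

toy_state = summarise(group_by(toy_crime,State),
                      Violent_total = sum(Violent_crime,na.rm=TRUE),
                      Population_total = sum(Population,na.rm=TRUE))
toy_state = mutate(toy_state,Violent_per100=100000*Violent_total/Population_total)
toy_state
```

State X works out to 5010 crimes over 1,010,000 people, or about 496 per 100,000. State Y reports a rate of 0, which is an artifact of the missing value and not a finding. It might be worth checking how many NA's each state has before trusting its rate.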

This is a little advanced but rather than worry about all the parentheses (passing tables as arguments to functions), we can use a "pipe" that takes the output of one command and uses it as input to the next. So here we take the crime data set, group it by state and then summarize the grouped data set with total population and total violent crime. Finally we take the result and add a new column to the table using mutate(). The whole thing is stored (using ->) in the data frame "state".

In [None]:
crime %>%
    group_by(State) %>%
    summarize(Violent_total = sum(Violent_crime,na.rm=TRUE),Population_total = sum(Population,na.rm=TRUE)) %>%
    mutate(Violent_per100=100000*Violent_total/Population_total) -> state
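If the pipe feels mysterious, it may help to see that it is only a different way of writing the nested calls. A minimal sketch on a made-up table (the names "toy_g", "nested" and "piped" are invented for illustration):

```r
library(dplyr)

toy_g = data.frame(g=c("a","a","b"), x=c(1,2,10))

# nested form: read from the inside out
nested = summarise(group_by(toy_g,g), total=sum(x))

# piped form: read from top to bottom, stored at the end with ->
toy_g %>%
    group_by(g) %>%
    summarise(total=sum(x)) -> piped

identical(nested$total, piped$total)
```

Both versions give the same totals. The pipe just lets you read the steps in the order they happen.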

**The most dangerous cities**

States are fine, but there are plenty of stories that come out each year about city-specific crime rates. [Here is one for this year](http://247wallst.com/special-report/2017/09/27/25-most-dangerous-cities-in-america-2/5/), although Forbes does it every year. It is almost irresistible. The typical figure used to compare cities is again the incidents of violent crime per 100,000 people.

Here we create the per 100,000 figure but for each city...

In [None]:
new_crime = mutate(select(crime,State,City,Population,Violent_crime),Violent_per100=100000*Violent_crime/Population)
head(new_crime)

... and then have a look.

In [None]:
options(repr.plot.width=8, repr.plot.height=6)

hist(new_crime$Violent_per100)

In [None]:
hist(new_crime$Violent_per100,breaks=100)

We can sort the table using the verb arrange(). Descending order makes sense because we want to get the most dangerous places, those with the highest incident rates at the top.

In [None]:
head(arrange(new_crime,desc(Violent_per100)),25)
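Why do tiny places tend to float to the top of a list like this? A per-100,000 rate in a small town is built from very few incidents, so each one counts for a lot. A quick sketch with invented populations and counts:

```r
# Invented places: a village, a small town and a mid-sized city
pop    = c(500, 5000, 500000)
crimes = c(5, 20, 2000)

rate = 100000*crimes/pop
rate

# how much does the rate move if each place has one more incident?
bump = 100000*(crimes+1)/pop - rate
bump
```

One extra incident moves the village's rate by about 200 per 100,000 and the city's by about 0.2.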

To remove the small cities, we can filter() (another verb in dplyr) the data, giving the same kind of condition we saw with the MCCA data. Here we look for cities with population bigger than 100,000.

In [None]:
new_crime = filter(new_crime,Population>100000)
head(new_crime)

In [None]:
head(arrange(new_crime,desc(Violent_per100)),25)

In [None]:
hist(new_crime$Violent_per100)

In [None]:
qqnorm(new_crime$Violent_per100)

**Comparisons.** For years now, certain cities have remained at the top of this list and many mayors reason that these kinds of rankings are fundamentally inaccurate because they don't compare "apples to apples". Essentially, city boundaries are determined on political grounds and not based on something more generalizable like population density. Some cities are essentially "locked in place," having a small older inner portion of the city that contains very few low-crime suburbs. It is in this older inner core that more crime tends to happen. 

Have a look, for example, at [Baltimore](https://www.trulia.com/real_estate/Baltimore-Maryland/crime/) and [Detroit](https://www.trulia.com/real_estate/Detroit-Michigan/crime/). Compare these to a city like [Jacksonville, Florida](https://www.trulia.com/real_estate/Jacksonville-Florida/crime/). In the first two cases you see that small, inner core, whereas Jacksonville city limits include relatively crime-free suburbs. It's worth looking at the historical city boundaries of Jacksonville.
<img src=https://photos.smugmug.com/photos/564021382_VSDFJ-M.jpg>

This is just one of several reasons why people try to avoid these kinds of city-to-city comparisons, even though they come out every year. [Even the FBI cautions against using rankings!](https://ucr.fbi.gov/ucr-statistics-their-proper-use)

**Data publication.** As this is the first report of the Trump administration, it's worth asking if things are "business as usual" in the FBI or if something has changed. The FBI itself notes that [things have changed in terms of their publication of statistics this year.](https://www.fbi.gov/news/pressrel/press-releases/fbi-releases-2016-crime-statistics)

>This publication is a statistical compilation of offense, arrest, and police employee data reported by law enforcement agencies voluntarily participating in the FBI’s Uniform Crime Reporting (UCR) Program. The UCR Program streamlined the 2016 edition by reducing the number of tables from 81 to 29, but still presented the major topics, such as offenses known, clearances, and persons arrested. Limited federal crime, human trafficking, and cargo theft data are also included.

You can see what tables have been removed by [consulting the UCR web site.](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-crosswalk-1) FiveThirtyEight had [a nice article on the changes to data publication.](https://fivethirtyeight.com/features/the-first-fbi-crime-report-issued-under-trump-is-missing-a-ton-of-info/) They report the following.

>In response to queries from FiveThirtyEight about whether the changes to the 2016 report had been made in consultation with the Advisory Policy Board, a spokesman for the UCR responded that the program had “worked with staff from the Office of Public Affairs to review the number of times a user actually viewed the tables on the internet.” When FiveThirtyEight informed a former FBI employee of the process, he said it was abnormal.
<br><br>
“To me it’s shocking that they made these decisions to publish that many fewer tables and they didn’t make the decision with the APB,” James Nolan, who worked at the UCR for five years and now teaches at West Virginia University, told FiveThirtyEight.

It might be worth writing to the FBI and obtaining a copy of the original data from which their tables are derived. Jeff Asher told me that the data collection remains "as is", it's just the tables that are being dropped. This will, he reckons, have a bigger impact on journalists who depend on the tables and are not able to work easily with the raw data (unlike, say, criminologists or academics).

**Other kinds of comparisons.** The UCR data has been used to make a number of other kinds of comparisons. For example, [Breitbart publishes a nearly annual story on rifle deaths](http://www.breitbart.com/big-government/2017/10/16/fbi-over-four-times-more-people-stabbed-to-death/), observing that more people are stabbed to death with knives than are shot to death with rifles. [Here is the raw table.](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-12) What do you think? ([Here is what Snopes thinks](https://www.snopes.com/four-times-more-stabbed-than-rifles-any-kind/))

**Filling gaps.** Before we return to more statistical work, one more comment on data that is or isn't there. For many years, the voluntary nature of the UCR meant that we had no good accounting of how many people were killed by police. Eventually, The Guardian and the Washington Post each mounted projects to try to come up with better numbers. [Here is The Guardian's piece "The Counted".](https://www.theguardian.com/us-news/series/counted-us-police-killings) It's worth looking at how they "crowdsourced" the data (partially) and how they relied on partnerships with other organizations that had been using media searches to come up with similar calculations.

**Back to our crime analysis and "clustering"**

Let's load up another data set. It can be downloaded from Courseworks or from [this link.](https://www.dropbox.com/s/3qmxqg7vri1ja8c/new_ucr.csv?dl=0) It contains information in our original data, but organized differently. Can you see what happened? How is this data set different from our original?

In [None]:
nucr = read.csv("new_ucr.csv",row.names=1)
head(nucr)

Explain the difference here. What computations are suddenly easier in this form and what's harder?


This form makes simple plots across years very easy to express. Here are histograms of violent crimes per 100K people in 1990 and 2015. Notice the ranges are quite different. 

In [None]:
options(repr.plot.width=7,repr.plot.height=6)

hist(nucr$y1990,main="1990 Violent Crimes",xlab="Violent crimes per 100K people")
hist(nucr$y2015,main="2015 Violent Crimes",xlab="Violent crimes per 100K people")

Recall the way we extract data from a table using "$". While dplyr takes old tables and makes new tables, our lessons from last week allow us to extract the data from a table and feed it to a histogram or some other display or summary. 

Functions like hist() want a "vector" of data points (the data from a column or row in a table, say) and not a table. A vector in R is simply a collection of values. They might be numeric or characters (names of things) or boolean (TRUE/FALSE). You can create one by subsetting a table...

In [None]:
nucr$y1990

... or by using the concatenate command c(). Here we make a vector of six numbers...

In [None]:
c(1,3,12,3,4,2)

... and here six cities.

In [None]:
c("Boston","Atlanta","Los Angeles","New York","Chicago","Seattle")

To combine our two lessons, we can use dplyr functions to create new data sets in a clean (readable) way and then make a histogram of what we've created by extracting data from the table. So here we look at relative differences in violent crimes from 1990 to 2015 and 2010 to 2015. In the Marshall Project article, the former is Obama's view of crime and the latter is Trump's.

In [None]:
drops = mutate(nucr,long=(y2015-y1990)/y1990,short=(y2015-y2010)/y2010)

hist(drops$long,main="Relative drops between 1990 and 2015",xlab="Percent change")
hist(drops$short,main="Relative drops between 2010 and 2015",xlab="Percent change")

Last time, we made use of the plot() command to create simple scatterplots and time series plots of our crime statistics. To overlay the data from many cities at once, we can use a command called matplot(), which is short for "matrix plot."

Here we supply a common x-axis (here the years 1975 through 2015) and a table of columns, where each column is to be plotted against the same x-axis. In the code below, our columns are, well, rows of nucr. That is, we take the transpose, t(), of nucr so that the columns are cities and the rows are years. We then create our plot. 

This is a very specialized plot so don't worry too much about t() and matplot(). The result is what we're after -- to look at all the years for the different cities at one time. 

In [None]:
# here is what t() does...
head(nucr)
head(t(nucr))
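If t() and matplot() still feel opaque, here they are on a tiny made-up table with two cities and three years. The names and numbers are invented, but the shape of the calls is the same as what we use next.

```r
# Made-up rates: rows are cities, columns are years (like nucr)
m = rbind(CityA=c(100,120,90), CityB=c(300,280,310))
colnames(m) = c("y2013","y2014","y2015")

dim(m)       # 2 rows (cities), 3 columns (years)
dim(t(m))    # 3 rows (years), 2 columns (cities)

# matplot() draws one line per column, so we hand it the transposed table
matplot(2013:2015,t(m),type="l",lty=1,col=c("grey","blue"),xlab="year",ylab="rate")
```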

In [None]:
matplot(1975:2015,t(nucr),type="l",lty=1,col='grey',xlab="year",ylab="Violent Crime per 100K residents")

Now, for our clustering, we want to identify groups of cities. Notice from the plot above, for example, that we have a number of cities with really low crime rates. Their curves are all grouped together visually. The top four or five curves also have basically the same shape and are near each other visually.

Clustering is a way to create groups of cities that have similar crime characteristics. Clustering often begins with a notion of **distance** between two rows in a data set. Clusters are then defined as groups of points that are closer to each other than to the rest of the data set. 

The curves above are good examples to start with because we can see each row of data as a curve. That's the reason we went to the new data format! When two curves are near each other, the distance should be small. For distance, we'll just use the square root of the sum of squared differences between curve values, across time.

That's a mouthful so here's a closeup, the distance between Atlanta and Boston. First extract the rows of data we need as vectors.

In [None]:
a = nucr["Atlanta",]
a

In [None]:
b = nucr["Boston",]
b

... and here is the distance. We take differences between a and b, square them, sum them up and then take a square root. Remember your geometry! Below we make the calculation and also unpack it in pieces.

In [None]:
sqrt(sum((a-b)^2))

In [None]:
a-b

In [None]:
(a-b)^2

In [None]:
sum((a-b)^2)

R has a function that will compute the distances between all pairs of rows in a data set. It's called, poetically enough, **dist()**. Here we make a small data set of three rows and compute the 3 pairwise distances.

In [None]:
small = nucr[c("Atlanta","Boston","Houston"),]
small

In [None]:
dist(small)
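To convince yourself that dist() is doing the same calculation we did by hand, here is a check on two short made-up vectors where the answer is easy to work out in your head (the differences are 3, 4 and 0, so the squares sum to 25).

```r
# Two made-up "curves" with three time points each
u = c(1,2,3)
v = c(4,6,3)

sqrt(sum((u-v)^2))    # by hand: sqrt(9 + 16 + 0)

dist(rbind(u,v))      # dist() treats each row as a point and gives the same number
```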

Now, let's do this for the entire data set of 61 cities. 

In [None]:
d = dist(nucr)
d
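The printout of a dist object is a triangle, which gets hard to read with this many cities. One option, sketched here on three made-up points (the object names are invented), is to convert it to an ordinary square matrix with as.matrix() and then look things up by name.

```r
# Three made-up points in the plane
pts = rbind(P=c(0,0), Q=c(3,4), R=c(3,5))

dm = as.matrix(dist(pts))
dm              # square, symmetric, zeros down the diagonal
dm["P","Q"]     # look up one pair by name

# which pair is closest? blank out the diagonal first
diag(dm) = NA
which(dm==min(dm,na.rm=TRUE),arr.ind=TRUE)
```

The closest pair shows up twice because the matrix is symmetric.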

Hierarchical clustering forms groups of cities, rows in our data set. It does so by progressively joining the nearest data points and organizes them into a tree. We went over this in class but [here is a short YouTube video from Jeff Leek](https://www.youtube.com/watch?v=nIsLDtXlalo) that explains the process nicely.

Here is how we fit the clustering tree. The magic of this procedure is how you compare two groups of points using their pairwise distances. We'll talk about this more later, but for now, there's a guy with the last name of Ward and he created a technique that tries to form really tight clusters.

In [None]:
fit = hclust(d,method="ward.D")

... and how we define k=4 clusters.

In [None]:
options(repr.plot.width=8,repr.plot.height=6)

plot(fit,cex=0.8)
rect.hclust(fit,k=4,border="red")

We can create a vector of 61 values (one for each city) that tells us the different groups defined by the clustering. 

In [None]:
cutree(fit,k=4)
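A long vector of group numbers is hard to read. Here is a small sketch, on five made-up points with obvious groupings, of two ways to summarize what cutree() returns. The points and object names are invented for illustration.

```r
# Five made-up points: two tight pairs and one loner
pts5 = rbind(a=c(0,0), b=c(0,1), c=c(10,10), d=c(10,11), e=c(50,50))

tfit = hclust(dist(pts5),method="ward.D")
g = cutree(tfit,k=3)

table(g)             # how many points in each group
split(names(g),g)    # which points are in each group
```

The same two calls applied to our city groups should list the cities cluster by cluster.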

Finally, let's look at the different groups.

In [None]:
groups = cutree(fit,k=4)
matplot(1975:2015,t(nucr[groups==1,]),type="l",lty=1,col='grey',xlab="year",ylab="Violent Crime per 100K residents",ylim=c(0,4400))
matplot(1975:2015,t(nucr[groups==2,]),type="l",lty=1,col='grey',xlab="year",ylab="Violent Crime per 100K residents",ylim=c(0,4400))
matplot(1975:2015,t(nucr[groups==3,]),type="l",lty=1,col='grey',xlab="year",ylab="Violent Crime per 100K residents",ylim=c(0,4400))
matplot(1975:2015,t(nucr[groups==4,]),type="l",lty=1,col='grey',xlab="year",ylab="Violent Crime per 100K residents",ylim=c(0,4400))

Notice that the data group roughly from low values to high. But the shapes look about right. From here we can dig in and explain what these cities have in common.
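As an aside on the plotting code above, four nearly identical matplot() calls can also be written as a loop. Here is a sketch on simulated data so it runs on its own; the matrix "sim", its values and the choice of two groups are all invented for illustration.

```r
set.seed(1)

# Simulated rates: three low-crime and three high-crime "cities" over five years
sim = rbind(matrix(rnorm(15,mean=100,sd=5),nrow=3),
            matrix(rnorm(15,mean=500,sd=5),nrow=3))
rownames(sim) = paste0("city",1:6)

tg = cutree(hclust(dist(sim),method="ward.D"),k=2)

par(mfrow=c(1,2))    # two plots side by side
for (k in sort(unique(tg))) {
    # drop=FALSE keeps a one-row group as a matrix so t() still works
    matplot(1:5,t(sim[tg==k,,drop=FALSE]),type="l",lty=1,col="grey",
            xlab="year",ylab="rate",ylim=range(sim),main=paste("Group",k))
}
par(mfrow=c(1,1))
```

Using the same ylim in every panel, as we did above, keeps the groups comparable by eye.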

**Distances, revisited**

Last time, we saw how distances could be used to identify clusters in data. A simple (to describe) recipe for finding clusters involved simply growing them from the ground up. That is, progressively join nearby points. The resulting structure is a cluster tree (often called a dendrogram). The whole process started with a **distance matrix** that records all pairwise distances between points in a data set (rows in the table). 

A number of useful statistical procedures begin with a distance matrix. One that sounds worse than it is, is **multidimensional scaling** (MDS). It was popular in the psychometric literature where it was used to evaluate the differences in perception for different stimuli -- color is a classic example. MDS is, like the hierarchical clustering technique we learned, easy to explain. We start with a distance matrix and then create a map of the data points such that if we measure the distance between the points on the map, they are as close as we can get to the entries in our distance matrix.

Here's the classic introductory example. Suppose you have measured the distances between 10 cities. So Seattle to Boston, Los Angeles to San Francisco and so on. Here's our distance matrix.

<img src=http://forrest.psych.unc.edu/teaching/p208a/mds/mdstable1.gif>

MDS positions the cities (data points) such that their distances match those in the table as closely as possible. Because these data start with distances between actual places on a map, the procedure gives you back something that looks like the correct placement of cities. That's encouraging to see. If MDS couldn't recover the original map, we might be a little concerned about applying it to data.

<img src=http://forrest.psych.unc.edu/teaching/p208a/mds/mdsfig1.gif>
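We can run the same sanity check ourselves. The sketch below makes up six points in the plane, throws away their coordinates, and asks cmdscale() (R's classical MDS function) to rebuild a map from the distances alone. The points are random and the object names are invented for illustration.

```r
set.seed(2)

# Six made-up "cities" with known positions
xy = cbind(x=runif(6),y=runif(6))
rownames(xy) = LETTERS[1:6]

dxy = dist(xy)             # keep only the pairwise distances
rec = cmdscale(dxy,k=2)    # rebuild a two-dimensional map from them

# the rebuilt map may be rotated or flipped, but its distances should match
max(abs(dist(rec)-dxy))
```

The largest discrepancy should be essentially zero. The map itself can come back rotated or mirrored, since distances alone say nothing about which way is north.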

Now, instead of relating data geographically, we can use the crime statistics to evaluate how far apart different cities are. This would create a map not arranged according to longitude and latitude but some other qualities of the data. Cities nearby each other on the map are close in distance (part of the same cluster). From this map we can "read" clustering a bit more directly than with the tree.

We also have narrative content in the dimensions of the map itself...

In [None]:
nucr = read.csv("new_ucr.csv",row.names=1)
head(nucr)

In [None]:
options(repr.plot.width=8,repr.plot.height=8)

d = dist(nucr)
fit = cmdscale(d)

par(xpd=TRUE)

plot(fit,type="n",xlab="Coord 1",ylab="Coord 2")
text(fit,labels=rownames(fit),cex=0.5)

Let's use our groups from the cluster tree to color our points and investigate the clustering job we did.

In [None]:
options(repr.plot.width=8,repr.plot.height=6)

hfit = hclust(d,method="ward.D")

plot(hfit,cex=0.8)
rect.hclust(hfit,k=4,border="red")

groups = cutree(hfit,k=4)

In [None]:
options(repr.plot.width=8,repr.plot.height=8)

plot(fit,type="n",xlab="Coord 1",ylab="Coord 2")
text(fit,labels=rownames(fit),cex=0.5,col=groups)

In [None]:
matplot(1975:2015,t(nucr[c("Miami","Atlanta","Fairfax County, Va.","Honolulu"),]),type="l",lty=1,col=c("grey","grey","blue","blue"))

So the first dimension (the x-axis) has to do with the overall crime rate. Low on the left, high on the right. What about the y-axis? Here are cities that are high, medium and low on the y-axis.

In [None]:
plot(1975:2015,nucr["New York City",],type="l",col="grey")

In [None]:
plot(1975:2015,nucr["Chicago",],type="l",col="grey")

In [None]:
plot(1975:2015,nucr["Memphis, Tenn.",],type="l",col="grey")

This dimension is basically a contrast between what happened before the 1990s and what happened after. Those cities with increasing crime rates like Memphis score large and positive. Those with decreasing rates like New York score large but negative. Those that are low across the board or have crime rates that follow low-high-low tend to be near 0 in the middle.

Plots like these can tell us a lot about the structures in data. What we need to do next is report on these findings. Why are these shapes as they are? What about these cities make them cluster?