# DC Crimebusters - Georgetown Data Analytics Certificate, Spring 2015

DC Crimebusters set out to help users make informed decisions about their personal safety when taking Washington Metropolitan Area Transit Authority (WMATA) trains to their destinations. Riders have developed perceptions about Metro stops within the District based on anecdotal evidence, and this analysis aims to support or disprove those notions. The final product will be a mobile application in which users enter their destination in Washington, DC and their time of travel, and the app reports the relative safety of the destination neighborhood at that time. It will tell them which crimes to be most aware of and recommend taking a taxi or private car hire if the probability of being victimized exceeds a specified threshold.
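
The recommendation rule described above can be sketched as a simple threshold check. This is only an illustration, not the app's implementation: the function name `recommend_travel_mode`, the default threshold of 0.05, and the example probabilities are all hypothetical, since the project description says only that a "specified threshold" triggers the advice.

```python
def recommend_travel_mode(victimization_probability, threshold=0.05):
    """Return a travel suggestion for one destination and time of travel.

    Both the function and the default threshold are hypothetical; the
    project text does not state the actual threshold value.
    """
    if not 0.0 <= victimization_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Recommend a car only when the probability strictly exceeds the threshold
    if victimization_probability > threshold:
        return "taxi or private car hire"
    return "Metro"


# Made-up probabilities, for illustration only
print(recommend_travel_mode(0.02))
print(recommend_travel_mode(0.12))
```

A strict comparison is used so that a probability exactly equal to the threshold does not trigger the recommendation, matching the word "exceeds" in the description.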

In [8]:
# Import the libraries used for the multiple regression analysis
%matplotlib inline
import pandas as pd
# scatter_matrix lives in pandas.plotting in current pandas (formerly pandas.tools.plotting)
from pandas.plotting import scatter_matrix
from bokeh import plotting
import numpy as np
plotting.output_notebook()

In [9]:
# Summary table used for the multiple regression
crime_summary = pd.read_csv("CrimeEventMultReg.csv")

# Crime events 2014 with calculated attributes
all_events = pd.read_csv('CrimeEvents_CalculatedAttr.csv')

# Summarize data
crime_summary.describe()

| | Avg_Dist_Metro_KM | Avg_Ridership | Avg_Percent_Public_Transit | Avg_Percent_Vacant_Houses | Avg_Percent_Below_Poverty | Avg_Median_Home_Value | AssaultRate | HomicideRate | RobberyRate | RapeRate |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 |
| mean | 0.675135 | 10424.75 | 0.35913 | 0.137428 | 0.044706 | 4.639665 | 0.022227 | 0.021429 | 0.022775 | 0.023366 |
| std | 0.353901 | 7948.993578 | 0.067318 | 0.056496 | 0.051432 | 1.72519 | 0.028809 | 0.034152 | 0.024425 | 0.023803 |
| min | 0.175926 | 1664.0 | 0.190672 | 0.039486 | 0.0 | 1.075 | 0.0 | 0.0 | 0.000303 | 0.0 |
| 25% | 0.374831 | 5310.25 | 0.33408 | 0.097404 | 0.006529 | 3.427251 | 0.002836 | 0.0 | 0.004162 | 0.005719 |
| 50% | 0.571915 | 7612.0 | 0.372541 | 0.121354 | 0.026295 | 4.449858 | 0.009034 | 0.0 | 0.010593 | 0.01634 |
| 75% | 0.955057 | 11372.0 | 0.397369 | 0.172988 | 0.060091 | 5.429757 | 0.029412 | 0.02381 | 0.031174 | 0.033497 |
| max | 1.616392 | 32465.0 | 0.469926 | 0.292654 | 0.213179 | 8.670287 | 0.127311 | 0.12381 | 0.087772 | 0.107843 |


Run the regression in R...

In [13]:
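# Make a histogram of crime counts by distance from Metro
hist, edges = np.histogram(all_events['Distance from metro KM'], bins=20)
# Current Bokeh releases name this property background_fill_color (formerly background_fill)
distance_histogram = plotting.figure(title="Crime and Distance From Metro Stations",
                                     background_fill_color='#E8DDCB')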
distance_histogram.yaxis.axis_label = "Number of Crimes"
distance_histogram.xaxis.axis_label = "Distance From Metro (KM)"
distance_histogram.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
                        fill_color="#036564", line_color="#033649")

plotting.show(distance_histogram)

AttributeError: 'str' object has no attribute 'remove'

In [9]:
# Make sure you have run "pip install rpy2" on the command line prior to running this code

%pylab inline
%load_ext rpy2.ipython

AssaultRate = crime_summary['AssaultRate']
HomicideRate = crime_summary['HomicideRate']
RobberyRate = crime_summary['RobberyRate']
RapeRate = crime_summary['RapeRate']

Avg_Dist_Metro_KM = crime_summary['Avg_Dist_Metro_KM']
Avg_Ridership = crime_summary['Avg_Ridership']
Avg_Percent_Public_Transit = crime_summary['Avg_Percent_Public_Transit']
Avg_Percent_Vacant_Houses = crime_summary['Avg_Percent_Vacant_Houses']
Avg_Percent_Below_Poverty = crime_summary['Avg_Percent_Below_Poverty']
Avg_Median_Home_Value = crime_summary['Avg_Median_Home_Value']

%Rpush AssaultRate HomicideRate RobberyRate RapeRate Avg_Dist_Metro_KM Avg_Ridership Avg_Percent_Public_Transit Avg_Percent_Vacant_Houses Avg_Percent_Below_Poverty Avg_Median_Home_Value 
%R assault_model <- lm(AssaultRate ~ Avg_Dist_Metro_KM + Avg_Ridership + Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + Avg_Percent_Below_Poverty + Avg_Median_Home_Value)
%R homicide_model <- lm(HomicideRate ~ Avg_Dist_Metro_KM + Avg_Ridership + Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + Avg_Percent_Below_Poverty + Avg_Median_Home_Value)
%R robbery_model <- lm(RobberyRate ~ Avg_Dist_Metro_KM + Avg_Ridership + Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + Avg_Percent_Below_Poverty + Avg_Median_Home_Value)
%R rape_model <- lm(RapeRate ~ Avg_Dist_Metro_KM + Avg_Ridership + Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + Avg_Percent_Below_Poverty + Avg_Median_Home_Value)

Populating the interactive namespace from numpy and matplotlib
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


<ListVector - Python:0x10a6e7908 / R:0x10890ebf0>
[Float..., Array, Array, ..., Vector, Formula, DataF...]
  coefficients: <class 'rpy2.robjects.vectors.FloatVector'>
  <FloatVector - Python:0x10a6f33b0 / R:0x108475950>
[-0.012405, 0.029163, 0.000000, ..., 0.015429, 0.247343, -0.000611]
  residuals: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x10a633fc8 / R:0x10a14eee0>
[-0.022922, -0.000004, 0.017131, ..., -0.012065, -0.013255, -0.010495]
  effects: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x10a633ef0 / R:0x10a14f330>
[-0.147780, 0.097167, 0.000879, ..., -0.007381, -0.008693, -0.008703]
  ...
  coefficients: <class 'rpy2.robjects.vectors.Vector'>
  <Vector - Python:0x10a633c68 / R:0x100bdfb08>
[RNULLType, Vector]
  residuals: <class 'rpy2.robjects.Formula'>
  <Formula - Python:0x10a633b00 / R:0x101bcfcb0>
<ListVector - Python:0x10a6e7908 / R:0x10890ebf0>
[Float..., Array, Array, ..., Vector, Formula, DataF...]

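The summary of the linear model for assaults shows that average distance to Metro (p < 0.001) and average percent below the poverty line (p < 0.01) are statistically significant predictors of the assault rate, both with positive coefficients. None of the other predictors is significant at the 0.05 level.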

In [10]:
%R print(summary(assault_model))


Call:
lm(formula = AssaultRate ~ Avg_Dist_Metro_KM + Avg_Ridership + 
    Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + 
    Avg_Percent_Below_Poverty + Avg_Median_Home_Value)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.028542 -0.009680 -0.002397  0.008812  0.029578 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -4.073e-02  3.448e-02  -1.181 0.245966    
Avg_Dist_Metro_KM           3.976e-02  1.050e-02   3.786 0.000616 ***
Avg_Ridership               2.869e-07  4.012e-07   0.715 0.479639    
Avg_Percent_Public_Transit  4.308e-02  5.441e-02   0.792 0.434092    
Avg_Percent_Vacant_Houses   9.884e-02  6.584e-02   1.501 0.142827    
Avg_Percent_Below_Poverty   2.449e-01  8.061e-02   3.039 0.004623 ** 
Avg_Median_Home_Value      -1.483e-03  2.333e-03  -0.636 0.529296    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0157 on 33 degrees of freedom
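
To make the `t value` and `Pr(>|t|)` columns concrete, the sketch below recomputes both for the `Avg_Dist_Metro_KM` row of the assault summary from its estimate, its standard error, and the 33 residual degrees of freedom. R computes these internally; the helper names `t_pdf` and `two_sided_p_value` and the numerical integration are illustrative choices made here to avoid extra dependencies, not R's method.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_value, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t_value)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * central_area


# Values copied from the Avg_Dist_Metro_KM row of the assault summary
estimate = 3.976e-02
std_error = 1.050e-02
residual_df = 33

t_value = estimate / std_error  # the t statistic is estimate / standard error
p_value = two_sided_p_value(t_value, residual_df)
print(round(t_value, 3), round(p_value, 6))
```

Because the inputs are the rounded figures printed by R, the result should agree with the table only to about the precision shown there.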

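The summary of the linear model for homicides shows that average percent below the poverty line is the only statistically significant predictor of the homicide rate (p < 0.001). Average distance to Metro is not significant in this model (p = 0.346).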

In [11]:
%R print(summary(homicide_model))


Call:
lm(formula = HomicideRate ~ Avg_Dist_Metro_KM + Avg_Ridership + 
    Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + 
    Avg_Percent_Below_Poverty + Avg_Median_Home_Value)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.025402 -0.010804 -0.000566  0.007486  0.041529 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -6.937e-02  3.605e-02  -1.924    0.063 .  
Avg_Dist_Metro_KM           1.051e-02  1.098e-02   0.957    0.346    
Avg_Ridership               4.098e-07  4.195e-07   0.977    0.336    
Avg_Percent_Public_Transit  6.638e-02  5.688e-02   1.167    0.252    
Avg_Percent_Vacant_Houses   1.106e-01  6.884e-02   1.606    0.118    
Avg_Percent_Below_Poverty   5.851e-01  8.428e-02   6.942 6.21e-08 ***
Avg_Median_Home_Value       3.071e-03  2.439e-03   1.259    0.217    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01641 on 33 degrees of freedom
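
One way to read the poverty coefficient in the homicide summary: in a linear model, the fitted response changes by the coefficient times the change in the predictor, with every other predictor held fixed. The sketch below applies that to a hypothetical 0.01 increase in `Avg_Percent_Below_Poverty`. It assumes the variable is stored as a fraction (the summary statistics range from 0 to about 0.21), so 0.01 would be one percentage point; the helper name `predicted_rate_change` is made up for this example.

```python
# Estimate copied from the Avg_Percent_Below_Poverty row of the homicide summary
poverty_coefficient = 5.851e-01


def predicted_rate_change(coefficient, predictor_change):
    """Change in the fitted response for a given change in one predictor,
    holding every other predictor fixed (property of a linear model)."""
    return coefficient * predictor_change


# Assumption: the poverty variable is a fraction, so 0.01 is one percentage point
change = predicted_rate_change(poverty_coefficient, 0.01)
print(f"{change:.6f}")
```

The result is expressed in whatever units `HomicideRate` uses in `CrimeEventMultReg.csv`, which the notebook does not state.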

The summary of the linear model for robberies shows that average distance to Metro and average percent below the poverty line are both statistically significant predictors of the robbery rate at the 0.05 level (p = 0.020 and p = 0.025, respectively).

In [12]:
%R print(summary(robbery_model))


Call:
lm(formula = RobberyRate ~ Avg_Dist_Metro_KM + Avg_Ridership + 
    Avg_Percent_Public_Transit + Avg_Percent_Vacant_Houses + 
    Avg_Percent_Below_Poverty + Avg_Median_Home_Value)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.025861 -0.012134 -0.001541  0.010073  0.038742 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)  
(Intercept)                -2.825e-02  3.635e-02  -0.777   0.4425  
Avg_Dist_Metro_KM           2.701e-02  1.107e-02   2.439   0.0203 *
Avg_Ridership               1.669e-07  4.230e-07   0.395   0.6956  
Avg_Percent_Public_Transit  5.304e-02  5.736e-02   0.925   0.3618  
Avg_Percent_Vacant_Houses   4.528e-02  6.941e-02   0.652   0.5187  
Avg_Percent_Below_Poverty   1.996e-01  8.498e-02   2.348   0.0250 *
Avg_Median_Home_Value      -6.760e-04  2.459e-03  -0.275   0.7851  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01655 on 33 degrees of freedom
Multiple R-sq
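
The significance of the distance coefficient in the robbery summary can also be viewed as a confidence interval that excludes zero. The sketch below builds an approximate 95% interval as estimate ± t × standard error. The critical value 2.0345 (the 97.5th percentile of Student's t with 33 degrees of freedom) is hard-coded as an assumption to keep the example dependency-free, and the helper name `confidence_interval` is illustrative.

```python
# Values copied from the Avg_Dist_Metro_KM row of the robbery summary
estimate = 2.701e-02
std_error = 1.107e-02

# Approximate 97.5th percentile of Student's t with 33 degrees of freedom
# (hard-coded assumption; a statistics library could supply it instead)
T_CRITICAL_33_DF = 2.0345


def confidence_interval(estimate, std_error, t_critical):
    """Symmetric interval: estimate plus or minus t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


lower, upper = confidence_interval(estimate, std_error, T_CRITICAL_33_DF)
print(f"approximate 95% CI: ({lower:.6f}, {upper:.6f})")
print("excludes zero:", lower > 0 or upper < 0)
```

An interval that excludes zero at the 95% level corresponds to a two-sided p-value below 0.05, which matches the single star on that row.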

The summary of the linear model for rape shows that average distance to Metro and average percent below the poverty line are both statistically significant predictors of the rape rate (p < 0.01 for each).

In [13]:
%R print(summary(rape_model))


Call:
lm(formula = RapeRate ~ Avg_Dist_Metro_KM + Avg_Ridership + Avg_Percent_Public_Transit + 
    Avg_Percent_Vacant_Houses + Avg_Percent_Below_Poverty + Avg_Median_Home_Value)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.022922 -0.011665 -0.000735  0.007573  0.034482 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)   
(Intercept)                -1.241e-02  3.320e-02  -0.374  0.71108   
Avg_Dist_Metro_KM           2.916e-02  1.011e-02   2.884  0.00687 **
Avg_Ridership               4.005e-07  3.864e-07   1.036  0.30751   
Avg_Percent_Public_Transit  4.362e-03  5.239e-02   0.083  0.93415   
Avg_Percent_Vacant_Houses   1.543e-02  6.340e-02   0.243  0.80925   
Avg_Percent_Below_Poverty   2.473e-01  7.763e-02   3.186  0.00314 **
Avg_Median_Home_Value      -6.114e-04  2.247e-03  -0.272  0.78721   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01512 on 33 degrees of freedom
Multiple R-sq
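
To compare the four models side by side, the sketch below collects the p-values printed above for the two predictors that reach significance in at least one model and lists which of them fall below 0.05 for each crime type. The cutoff `ALPHA` and the helper name `significant_predictors` are choices made for this illustration.

```python
# p-values copied from the four regression summaries above
p_values = {
    "assault": {"Avg_Dist_Metro_KM": 0.000616, "Avg_Percent_Below_Poverty": 0.004623},
    "homicide": {"Avg_Dist_Metro_KM": 0.346, "Avg_Percent_Below_Poverty": 6.21e-08},
    "robbery": {"Avg_Dist_Metro_KM": 0.0203, "Avg_Percent_Below_Poverty": 0.0250},
    "rape": {"Avg_Dist_Metro_KM": 0.00687, "Avg_Percent_Below_Poverty": 0.00314},
}

ALPHA = 0.05  # conventional cutoff, chosen here for illustration


def significant_predictors(model_p_values, alpha=ALPHA):
    """Return the predictor names whose p-value is below alpha, sorted by name."""
    return sorted(name for name, p in model_p_values.items() if p < alpha)


for crime, values in p_values.items():
    print(crime, significant_predictors(values))
```

Poverty is significant in all four models, while distance to Metro is significant in every model except homicide.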

In [1]:
%%html
<iframe width="940" height="600" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.arcgis.com/apps/Embed/index.html?webmap=c84cfb5536f540ac810cd7d7860d99e5&amp;extent=-77.2227,38.8097,-76.8182,38.9866&amp;home=true&amp;zoom=true&amp;scale=true&amp;legend=true&amp;theme=light"></iframe>