# MSDS 7331 Project

## Dataset
We initially looked at Kaggle to see what there might be of interest, and one dataset caught our eye:
https://www.kaggle.com/vinchinzu/dc-metro-crime-data
**DC Metro Crime Data**

The description on the Kaggle page contained a link to the raw data: http://crimemap.dc.gov/CrimeMapSearch.aspx#tabs-GeoOther

Links from this site showed summary statistics for 2015, with comparisons to 2014 and the 20-year averages.

<table>
    <tr><th>Offense</th><th>2014</th><th>2015</th><th>Percent Change</th></tr>
    <tr><td>Homicide</td><td style='text-align:right'>105</td><td style='text-align:right'>162</td><td style='background-color:red;text-align:center'>+54%</td></tr>
    <tr><td>Sex Abuse</td><td style='text-align:right'>319</td><td style='text-align:right'>297</td><td style='background-color:green;text-align:center'>-7%</td></tr>
    <tr><td>Assault w/ a Dangerous Weapon</td><td style='text-align:right'>2,490</td><td style='text-align:right'>2,426</td><td style='background-color:green;text-align:center'>-3%</td></tr>
    <tr><td>Robbery</td><td style='text-align:right'>3,296</td><td style='text-align:right'>3,446</td><td style='background-color:red;text-align:center'>+5%</td></tr>
    <tr style='background-color:silver'><td>Violent Crime - Total</td><td style='text-align:right'>6,210</td><td style='text-align:right'>6,331</td><td style='background-color:red;text-align:center'>+2%</td></tr>
    <tr><td>Burglary</td><td style='text-align:right'>3,182</td><td style='text-align:right'>2,543</td><td style='background-color:green;text-align:center'>-20%</td></tr>
    <tr><td>Motor Vehicle Theft</td><td style='text-align:right'>3,132</td><td style='text-align:right'>2,825</td><td style='background-color:green;text-align:center'>-10%</td></tr>
    <tr><td>Theft from Auto</td><td style='text-align:right'>11,406</td><td style='text-align:right'>11,160</td><td style='background-color:green;text-align:center'>-2%</td></tr>
    <tr><td>Theft (Other)</td><td style='text-align:right'>14,666</td><td style='text-align:right'>14,117</td><td style='background-color:green;text-align:center'>-4%</td></tr>
    <tr><td>Arson</td><td style='text-align:right'>26</td><td style='text-align:right'>18</td><td style='background-color:green;text-align:center'>-31%</td></tr>
    <tr style='background-color:silver'><td>Property Crime - Total</td><td style='text-align:right'>32,412</td><td style='text-align:right'>30,663</td><td style='background-color:green;text-align:center'>-5%</td></tr>
    <tr style='background-color:black;color:white'><td>All Crime - Total</td><td style='text-align:right'>38,622</td><td style='text-align:right'>36,994</td><td style='background-color:green;text-align:center'>-4%</td></tr>
</table>

Since we had this information about 2015, we decided to use the 2015 data as our dataset.

The site's help file (http://mpdc.dc.gov/node/200622) gave us the information we needed to query and download the data we wanted.  The site provided an interface through which we could request data for specific geo-political regions and time frames.  With some experimentation, we discovered that an easy way to download the data was to request data from a specific Ward for the custom time frame of 01/01/2015 to 12/31/2015 and export it as a CSV file.  That gave us 8 separate files (one per Ward), which were easy to combine into a single file for 2015.
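The combining step could be scripted along the following lines. This is only a minimal sketch using the standard library: the function name `combine_ward_files` and the per-Ward file names in the usage note are hypothetical, and it assumes every export starts with the same header row.

```python
def combine_ward_files(paths, out_path):
    """Concatenate CSV exports that share one header row into a single file.

    Hypothetical helper: the header is written once (taken from the first
    file) and every later file must start with the same header row.
    """
    header = None
    with open(out_path, 'w') as out:
        for path in paths:
            with open(path) as src:
                first = src.readline().rstrip('\r\n')
                if header is None:
                    header = first
                    out.write(header + '\n')
                elif first != header:
                    raise ValueError('%s has a different header row' % path)
                for line in src:
                    # Guard against a final row that lacks a trailing newline
                    out.write(line if line.endswith('\n') else line + '\n')
    return out_path
```

Assuming the eight exports were saved as `Ward1.csv` through `Ward8.csv` (hypothetical names), `combine_ward_files(['Ward%d.csv' % w for w in range(1, 9)], 'DC_Crime_2015.csv')` would produce the combined file.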

![Ward 1 Data Extraction](Ward1Data.png "Data Extraction for Ward 1")
<p style='text-align:center;font-weight:bold;'>(<em>Screen capture of Data Extraction page showing summary statistics for Ward 1</em>)</p>

The [Download Crime Data] link is where we were able to download the CSV file.  The [Crime Definitions] link took us to http://crimemap.dc.gov/CrimeDefinitions.aspx, where we were able to get the definitions of the codes used in the dataset.

## Business Understanding
Understanding the patterns in crimes can help allocate resources more efficiently; identify the need for specialized units, training, or task forces; and generate predictive models that apply to particular time periods, locations, or conditions.

## Data Understanding

In [1]:
#  Import the PANDAS library so we can work with dataframes
import pandas as pd

#  Read in the crime data from the combined CSV file
dc = pd.read_csv('DC_Crime_2015.csv')

#  Show info about the fields (info() prints its own report and returns None, so no print is needed)
dc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36493 entries, 0 to 36492
Data columns (total 18 columns):
REPORT_DAT              36493 non-null object
SHIFT                   36493 non-null object
OFFENSE                 36493 non-null object
METHOD                  36493 non-null object
BLOCK                   36493 non-null object
DISTRICT                36446 non-null float64
PSA                     36445 non-null float64
WARD                    36493 non-null int64
ANC                     36493 non-null object
NEIGHBORHOOD_CLUSTER    36076 non-null object
BLOCK_GROUP             36379 non-null object
CENSUS_TRACT            36379 non-null float64
VOTING_PRECINCT         36480 non-null object
CCN                     36493 non-null int64
XBLOCK                  36493 non-null float64
YBLOCK                  36493 non-null float64
START_DATE              36493 non-null object
END_DATE                36241 non-null object
dtypes: float64(5), int64(2), object(11)
memory usage: 5.0+ 

### Data Description

|Column|Data Type|Value Range|Description|
|:-----|:--------|:----------|:----------|
|REPORT_DAT|Date/Time|01/01/2015 00:00:00 - 12/31/2015 23:59:59|The date/time the offense was *reported*|
|SHIFT|Text|Day = 0700-1500, Evening = 1500-2300, Midnight = 2300-0700|The duty shift that responded to the call|
|OFFENSE|Text|Various|The category of crime committed (from the Crime Definitions link above)|
|METHOD|Text|"OTHERS", "GUN", "KNIFE"|A qualifier to the Offense that flags special considerations, such as the use of a gun|
|BLOCK|Text|Varies|The street and block identifier|
|DISTRICT|Integer|1-7|The police district|
|PSA|Integer|{1-7}{01-08}: 101-108,...,701-708|Police Service Area|
|WARD|Integer|1-8|The political Ward identifier|
|ANC|Text|{1-8}{A-G}|Advisory Neighborhood Commission|
|NEIGHBORHOOD_CLUSTER|Text|"Cluster "{1-39}|Neighborhood identifier|
|BLOCK_GROUP|Text|{CENSUS_TRACT}{space}{1-6}|Subdivision within a tract|
|CENSUS_TRACT|Integer|Discontinuous values between 100 and 11100|Land management tract identifier|
|VOTING_PRECINCT|Text|"Precinct "{1-143}|Political subdivision|
|CCN|Integer|Discontinuous values between 14151815 and 15403340|Criminal Complaint Number - unique to each report|
|XBLOCK|Float|min: 390,147; max: 407,806|Eastern coordinate of crime scene (meters)|
|YBLOCK|Float|min: 127,300; max: 147,292|Northern coordinate of crime scene (meters)|
|START_DATE|Date/Time|Varies|The earliest the crime *might* have been committed|
|END_DATE|Date/Time|Varies|The latest the crime *might* have been committed|

In [2]:
#  Show the first five rows
dc.head()

Unnamed: 0,REPORT_DAT,SHIFT,OFFENSE,METHOD,BLOCK,DISTRICT,PSA,WARD,ANC,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,CCN,XBLOCK,YBLOCK,START_DATE,END_DATE
0,03/04/2015 12:05,DAY,THEFT/OTHER,OTHERS,2100 - 2199 BLOCK OF 14TH STREET NW,3.0,305.0,1,1B,Cluster 3,004300 1,4300.0,Precinct 22,14151815,397229.0,138975.0,08/01/2014 12:00,09/01/2014 12:03
1,01/22/2015 09:00,DAY,THEFT F/AUTO,OTHERS,3400 - 3499 BLOCK OF MOUNT PLEASANT STREET NW,4.0,408.0,1,1D,Cluster 2,002701 3,2701.0,Precinct 40,14174448,396535.83,140772.03,11/08/2014 10:00,11/10/2014 11:08
2,01/03/2015 21:20,EVENING,THEFT/OTHER,OTHERS,3100 - 3299 BLOCK OF 14TH STREET NW,3.0,302.0,1,1A,Cluster 2,003000 1,3000.0,Precinct 39,15001508,397162.0,140182.0,01/03/2015 19:10,01/03/2015 19:20
3,01/05/2015 12:44,DAY,THEFT/OTHER,OTHERS,400 - 599 BLOCK OF HOWARD PLACE NW,3.0,306.0,1,1B,Cluster 3,003400 2,3400.0,Precinct 37,15002278,398290.0,139412.0,12/19/2014 17:00,01/05/2015 11:30
4,01/20/2015 07:01,DAY,THEFT F/AUTO,OTHERS,IRVING STREET NW AND 13TH STREET NW,3.0,302.0,1,1A,Cluster 2,003000 2,3000.0,Precinct 39,15009493,397424.06,140084.43,01/19/2015 11:00,01/20/2015 06:59
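The three date columns load as text (dtype `object`), so they need to be parsed before any time-based analysis. The sketch below is one hedged way to do that for a single value; the format string is an assumption based on the rows printed above, and `parse_timestamp` and `report_lag_days` are hypothetical helper names. Missing `END_DATE` values would have to be filtered out first.

```python
from datetime import datetime

# ASSUMPTION: the text format matches the rows shown by dc.head(), e.g. '03/04/2015 12:05'
DATE_FORMAT = '%m/%d/%Y %H:%M'


def parse_timestamp(text):
    """Convert one date/time string from the dataset into a datetime object."""
    return datetime.strptime(text, DATE_FORMAT)


def report_lag_days(report_dat, end_date):
    """Whole days between the latest possible crime time and the report time."""
    return (parse_timestamp(report_dat) - parse_timestamp(end_date)).days


# Row 0 above: reported 03/04/2015 12:05, END_DATE 09/01/2014 12:03
lag = report_lag_days('03/04/2015 12:05', '09/01/2014 12:03')  # 184 days
```

Row 0 illustrates why `REPORT_DAT` and `START_DATE`/`END_DATE` should not be treated as interchangeable: the offense was reported in 2015 even though it occurred in 2014.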


## Data Preparation
### Coordinates
One obvious visualization tool would be to plot the geo-spatial relationship of the data, and, fortunately, this dataset provides the *approximate* location of the crime (presumably to preserve the privacy of the victim(s)) in grid coordinates (XBLOCK = East offset from the "Origin"; YBLOCK = North offset from the "Origin").  The question is, where is that Origin?

![Identity of Coordinate Origin](Coordinates.png "Origin for Location Coordinates")
<p style='text-align:center'>(<em>Screen capture of description of coordinate origin</em>)</p>

On the download page for the datasets, next to the "Map Coordinates" field selector, there is a description of the origin, which states that the values are in the Maryland State Plane, NAD 83 map projection.  Further research led to a web page that defined the [Maryland coordinate system](http://www.mgs.md.gov/geology/maryland_coordinate_system.html "Maryland State Coordinate System").

The coordinate system is a Lambert conformal conic projection with two standard parallels (latitudes). Using two parallels limits the distortion that comes from mapping a curved surface onto a flat plane.  With the coordinate system defined, we can then reverse the projection and re-project to a different system that can be used with other mapping/GIS tools.  The transformation methodology came from the National Geospatial-Intelligence Agency (NGA), but a more concise explanation of the method was provided by this website: http://www.linz.govt.nz/data/geodetic-system/coordinate-conversion/projection-conversions/lambert-conformal-conic-geographic

In order to do the coordinate transformations, we need to get several parameters set up first.

|Parameter|Description|Value|
|:--------|:----------|:----|
|a|Semi-major axis of reference ellipsoid (meters)|6378137 (Maryland uses the GRS80 reference)|
|f|Ellipsoidal flattening|1/298.257222101 (GRS80)|
|&theta;<sub>1</sub>|Latitude of first standard parallel (degrees)|38.3 (38&deg; 18' from Maryland definition)|
|&theta;<sub>2</sub>|Latitude of second standard parallel (degrees)|39.45 (39&deg; 27' from Maryland definition)|
|&theta;<sub>0</sub>|Origin Latitude (degrees)|37.66667 (North 37&deg; 40' from Maryland definition)|
|&lambda;<sub>0</sub>|Origin Longitude (degrees)|-77.0 (West 77&deg; 00', the central meridian in the Maryland definition)|
|N<sub>0</sub>|False Northing (meters)|0.0 (from Maryland definition)|
|E<sub>0</sub>|False Easting (meters)|400,000 (from Maryland definition)|

From these, we can derive the projection constants

|Constant|Derivation|Value|
|-------|-----------|-----|
|e      |$\sqrt{2f - f^2}$ | 0.081819191|
|m<sub>i</sub>|$\frac{\cos \theta_i}{\sqrt{1-e^2\sin^2 \theta_i}}$|m<sub>1</sub>=0.785787341<br>m<sub>2</sub>=0.773225009|
|t<sub>i</sub>|$\frac{\tan \left[(\frac{\pi}{4})-(\frac{\theta_i}{2})\right]}{\left(\frac{1-e\sin\theta_i}{1+e\sin\theta_i}\right)^\frac{e}{2}}$|t<sub>0</sub>=0.493354296<br>t<sub>1</sub>=0.486512044<br>t<sub>2</sub>=0.474178631|
|n      |$\frac{\ln m_1 - \ln m_2}{\ln t_1 - \ln t_2}$|0.627634132|
|F      |$\frac{m_1}{n(t_1)^n}$|1.967837417|
|&rho;<sub>0</sub>  |$a F t_0^n$|8055622.737|

Now, for each point (i) in our dataset, we must perform the following steps:
1. Adjust the North offset using the false northing - our false northing is 0, so this step is skipped
2. Adjust the East offset using the false easting: $E_i' = E_i - E_0$
3. $\rho_i' = \sqrt{(E_i')^2 + (\rho_0-N_i)^2}$
4. $t_i'= \left(\frac{\rho_i'}{a F}\right)^\frac{1}{n}$
5. $\gamma_i' = \tan^{-1}\left(\frac{E_i'}{\rho_0-N_i}\right)$
6. $\lambda_i = \frac{\gamma_i'}{n}+\lambda_0$ (This is the longitude of the location)
7. The calculation for latitude is iterative.
    1. $\theta_{i,0} = \frac{\pi}{2}-2\tan^{-1}(t_i')$ (This is our initial estimate of latitude)
    2. $\theta_{i,j} = \frac{\pi}{2}-2\tan^{-1}\left[t_i'\left(\frac{1-e\sin\theta_{i,j-1}}{1+e\sin\theta_{i,j-1}}\right)^\frac{e}{2}\right]$ (We use the previous estimate to create a new estimate)
    3. Repeat the previous step until the difference in estimates is negligible (this typically takes three iterations)
8. The converged $\theta_i$ is our estimate of the latitude for the location


In [9]:
import math

#  Build a class that handles generic reference ellipsoid parameters in case we have multiple coordinate systems to deal with
class refEllipsoid:
    #  a = Equatorial radius (meters)
    #  f = Flattening (the degree to which the polar radius is compressed compared to the equatorial radius);
    #      the constructor takes the inverse flattening (1/f) as its second argument
    #  b = Polar radius (meters): f = (a-b)/a, so b = a(1-f)
    #  e2 = First eccentricity squared: 1 - b^2/a^2 = 2f - f^2
    #  e = First eccentricity
    #  p2 = Second eccentricity squared: a^2/b^2 - 1 = f(2 - f)/(1 - f)^2
    def __init__(self, equator, flattening):
        #  Provided
        self.a = float(equator)
        self.f = 1.0/float(flattening)
        
        #  Derived
        self.b = self.a * (1.0 - self.f)
        self.e2 = (2.0 * self.f) - self.f**2
        self.e = math.sqrt(self.e2)
        self.p2 = (self.a**2 / self.b**2) - 1.0

GRS80 = refEllipsoid(6378137.0,298.257222101)  #  Define the Geodetic Reference System 1980 (GRS80) ellipsoid

#  Function to convert individual angular components to floating-point degrees
def DMS(degrees, minutes, seconds):
    sign = 1.0
    if degrees < 0:
        sign = -1.0
    return sign * (math.fabs(float(degrees)) + (float(minutes) / 60.0) + (float(seconds) / 3600.0))

class coordOrigin:
    #  Keep the parameters on the instance; class-level dictionaries would be shared by every coordOrigin object
    def __init__(self,latOrigin,lonOrigin,parallel1,parallel2,northing,easting):
        #  Angles arrive in degrees and are stored in radians
        self.origin = {'lat': math.radians(latOrigin), 'lon': math.radians(lonOrigin)}
        self.parallel = {1: math.radians(parallel1), 2: math.radians(parallel2)}
        #  False northing/easting in meters
        self.false = {'n': float(northing), 'e': float(easting)}

#  Define the origin for the Maryland state coordinate system
MD = coordOrigin(DMS(37,40,0),DMS(-77,0,0),DMS(38,18,0),DMS(39,27,0),0,400000)

#  Define functions to compute the 'm' and 't' factors (would not work within the class for some reason)
def Lambert_m(parallel):
    return math.cos(float(parallel)) / math.sqrt(1.0 - (GRS80.e2 * math.sin(float(parallel))**2))

def Lambert_t(parallel):
    return math.tan((math.pi / 4.0) - (float(parallel) / 2.0)) / ((1 - (GRS80.e * math.sin(float(parallel)))) / (1 + (GRS80.e * math.sin(float(parallel)))))**(GRS80.e / 2.0)

class Lambert:
    def __init__(self):
        self.m1 = Lambert_m(MD.parallel[1])
        self.m2 = Lambert_m(MD.parallel[2])
        self.t0 = Lambert_t(MD.origin['lat'])
        self.t1 = Lambert_t(MD.parallel[1])
        self.t2 = Lambert_t(MD.parallel[2])
        self.n = (math.log(self.m1) - math.log(self.m2))/(math.log(self.t1)-math.log(self.t2))
        self.F = self.m1 / (self.n * self.t1**self.n)
        self.p0 = GRS80.a * self.F * self.t0**self.n

Projection = Lambert()
Projection.__dict__

{'F': 1.9678374170334183,
 'm1': 0.7857873413486276,
 'm2': 0.7732250089525023,
 'n': 0.6276341323554715,
 'p0': 8055622.737265018,
 't0': 0.49335429608783643,
 't1': 0.48651204380528484,
 't2': 0.4741786305194415}

In [1]:
# TODO: Create a python function to perform these calculations and add the resulting columns to the data frame
#  I did this in Excel to make sure the process worked.
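One way the TODO above might be approached is sketched below. It follows the numbered steps listed earlier and recomputes the projection constants so that it runs on its own. The names `to_lat_lon`, `add_lat_lon`, `LATITUDE`, and `LONGITUDE` are hypothetical. The central meridian is taken as 77&deg; W, the value in the EPSG:26985 definition of NAD83 / Maryland; that choice is an assumption and should be checked against the origin longitude used elsewhere in this notebook. The result has not been validated against the Excel calculation.

```python
import math

# GRS80 ellipsoid and Maryland State Plane (NAD 83) parameters
A = 6378137.0                               # semi-major axis (meters)
FLAT = 1.0 / 298.257222101                  # flattening
ECC = math.sqrt(2.0 * FLAT - FLAT ** 2)     # first eccentricity
LAT_0 = math.radians(37.0 + 40.0 / 60.0)    # origin latitude
LON_0 = math.radians(-77.0)                 # ASSUMPTION: central meridian 77 W (EPSG:26985)
LAT_1 = math.radians(38.3)                  # first standard parallel
LAT_2 = math.radians(39.45)                 # second standard parallel
FALSE_N = 0.0                               # false northing (meters)
FALSE_E = 400000.0                          # false easting (meters)


def _m(lat):
    return math.cos(lat) / math.sqrt(1.0 - (ECC * math.sin(lat)) ** 2)


def _t(lat):
    es = ECC * math.sin(lat)
    return math.tan(math.pi / 4.0 - lat / 2.0) / ((1.0 - es) / (1.0 + es)) ** (ECC / 2.0)


# Projection constants n, F, and rho_0 from the derivation table
N_CONE = (math.log(_m(LAT_1)) - math.log(_m(LAT_2))) / (math.log(_t(LAT_1)) - math.log(_t(LAT_2)))
F_CONE = _m(LAT_1) / (N_CONE * _t(LAT_1) ** N_CONE)
RHO_0 = A * F_CONE * _t(LAT_0) ** N_CONE


def to_lat_lon(easting, northing, tol=1e-12, max_iter=20):
    """Hypothetical helper: State Plane meters -> (latitude, longitude) in degrees."""
    e_adj = easting - FALSE_E
    n_adj = northing - FALSE_N
    rho = math.sqrt(e_adj ** 2 + (RHO_0 - n_adj) ** 2)
    t = (rho / (A * F_CONE)) ** (1.0 / N_CONE)
    gamma = math.atan2(e_adj, RHO_0 - n_adj)
    lon = gamma / N_CONE + LON_0
    lat = math.pi / 2.0 - 2.0 * math.atan(t)        # initial estimate
    for _ in range(max_iter):
        es = ECC * math.sin(lat)
        new_lat = math.pi / 2.0 - 2.0 * math.atan(t * ((1.0 - es) / (1.0 + es)) ** (ECC / 2.0))
        converged = abs(new_lat - lat) < tol
        lat = new_lat
        if converged:
            break
    return math.degrees(lat), math.degrees(lon)


def add_lat_lon(frame):
    """Add LATITUDE/LONGITUDE columns to anything indexable by column name."""
    lats, lons = [], []
    for x, y in zip(frame['XBLOCK'], frame['YBLOCK']):
        lat, lon = to_lat_lon(float(x), float(y))
        lats.append(lat)
        lons.append(lon)
    frame['LATITUDE'] = lats
    frame['LONGITUDE'] = lons
    return frame
```

With the notebook's data frame, `add_lat_lon(dc)` would add the two columns in place. The loop is written for clarity instead of speed; a vectorized version may be preferable for much larger datasets.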

### Visualization - Map
After transforming the coordinates to geodetic (latitude/longitude) values, we can plot the locations with a mapping/GIS tool.  In this example, we used a tool developed by Mercury Solutions, Inc. (Tom Elkins' company) that plots multiple tactical data sources.
![Crime data on map](Data_on_Map.png "Crime locations on DC Map")
