# MSDS 7331 - Lab One: Visualization and Data Processing

### Investigators
- [Matt Baldree](mailto:mbaldree@smu.edu?subject=lab1)
- [Tom Elkins](mailto:telkins@smu.edu?subject=lab1)
- [Autin Kelly](mailto:ajkelly@smu.edu?subject=lab1)
- [Murali Parthasarathy](mailto:mparthasarathy@smu.edu?subject=lab1)


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:5px;'>
    <h3>Lab Instructions</h3>
    <p>You are to perform analysis of a data set: exploring the statistical summaries of the features,
visualizing the attributes, and making conclusions from the visualizations and analysis. Follow the
CRISP-DM framework in your analysis (you are not performing all of the CRISP-DM outline, only
the portions relevant to understanding and visualization). This report is worth 20% of the final
grade. Please upload a report (one per team) with all code used, visualizations, and text in a single
document. The format of the document can be PDF, *.ipynb, or HTML. You can write the report in
whatever format you like, but it is easiest to turn in the rendered iPython notebook.</p>
</div>

<a id='business_understanding'></a>
## Business Understanding
<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Business Understanding (<b>10 points total</b>)</h3>
    <ol><li>Describe the purpose of the data set you selected (i.e., why was this data collected in
the first place?).</li>
    <li>Describe how you would define and measure the outcomes from the
dataset. That is, why is this data important and how do you know if you have mined
useful knowledge from the dataset?</li>
    <li>How would you measure the effectiveness of a
good prediction algorithm? Be specific.</li></ol>
</div>

### 1. Purpose of Data Set
The data set chosen for Lab 1 is the 2015 Washington DC Metro crime data, inspired by a Kaggle data set found at https://www.kaggle.com/vinchinzu/dc-metro-crime-data. We obtained the data by following the steps on the [Using the Crime Map Application](http://mpdc.dc.gov/node/200622) page, which allowed us to export one CSV file per ward for all eight wards, covering 01/01/2015 through 12/31/2015. We then merged the individual ward files into a single file. The merged data set contains 36,493 entries and 18 attributes, both continuous and discrete, which satisfies the lab requirement of at least 30,000 entries and 10 attributes of both kinds. The data set is described further in the [Data Understanding](#data_understanding) section.

![Ward Map](images/wards_small.png "Washington DC Wards") 
<p style='text-align: center;'>
Washington DC Metro Ward Map
</p>

The Washington DC Metro Police Department publishes the crime data daily (see the image below) to give residents a clear picture of crime trends as they happen. The data is shared with resident groups such as the Advisory Neighborhood Commissions to help the police determine how to keep neighborhoods safe. It is also analyzed to gauge the effectiveness of current investments such as putting more officers on the streets, buying more tools for the police, and launching community partnerships; see the [Washington DC Metro Police Department report](http://mpdc.dc.gov/publication/mpd-annual-report-2015) for more details.

![Year End Crime Data](images/dc_2015_crime.tiff "Washington DC Year End Crime Data") 
<p style='text-align: center;'>
Washington DC Metro 2015 Year End Crime Data
</p>

### 2. Importance of the Data Set
This data set could be used to predict the number of violent and property crimes in a police district given the time of day, day of week, and other factors. Such predictions would allow the police department to allocate adequate resources to each district to respond to, and possibly prevent, crimes.

### 3. Measurement of Importance
We would measure importance by validating a machine learning model trained on the data set to predict the number of crimes committed in a police district. The prediction error on data held out from training would be reported as the measure of effectiveness.

<a id="data_understanding"></a>
## Data Understanding

<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Data Understanding (<b>80 points total</b>)</h3>
    <ol><li>[<b>10 points</b>] Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file.</li>
    <li>[<b>15 points</b>] Verify data quality: Explain any missing values, duplicate data, and outliers.
Are those mistakes? How do you deal with these problems? Be specific.</li>
    <li>[<b>10 points</b>] Give simple, appropriate statistics (range, mode, mean, median, variance,
counts, etc.) for the most important attributes and describe what they mean or if you
found something interesting. Note: You can also use data from other sources for
comparison. Explain the significance of the statistics run and why they are meaningful.</li>
    <li>[<b>15 points</b>] Visualize the most important attributes appropriately (at least 5 attributes).
Important: Provide an interpretation for each chart. Explain for each attribute why the
chosen visualization is appropriate.</li>
    <li>[<b>15 points</b>] Explore relationships between attributes: Look at the attributes via scatter
plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain
any interesting relationships.</li>
    <li>[<b>10 points</b>] Identify and explain interesting relationships between features and the class
you are trying to predict (i.e., relationships with variables and the target classification).</li>
    <li>[<b>5 points</b>] Are there other features that could be added to the data or created from
existing features? Which ones?</li></ol>
</div>



### 1. Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

In [97]:
#  Import the PANDAS library so we can work with dataframes
import pandas as pd

#  Read in the crime data from the combined CSV file
dc = pd.read_csv('data/DC_Crime_2015.csv')

#### Information about Data Frame

In [93]:
# dataframe info
dc.info()  # info() prints its own report and returns None, so no print is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36493 entries, 0 to 36492
Data columns (total 18 columns):
REPORT_DAT              36493 non-null object
SHIFT                   36493 non-null object
OFFENSE                 36493 non-null object
METHOD                  36493 non-null object
BLOCK                   36493 non-null object
DISTRICT                36446 non-null float64
PSA                     36445 non-null float64
WARD                    36493 non-null int64
ANC                     36493 non-null object
NEIGHBORHOOD_CLUSTER    36076 non-null object
BLOCK_GROUP             36379 non-null object
CENSUS_TRACT            36379 non-null float64
VOTING_PRECINCT         36480 non-null object
CCN                     36493 non-null int64
XBLOCK                  36493 non-null float64
YBLOCK                  36493 non-null float64
START_DATE              36493 non-null object
END_DATE                36241 non-null object
dtypes: float64(5), int64(2), object(11)
memory usage: 5.0+ 

#### Feature Summary Statistics

In [94]:
# summary statistics
print(dc.describe())

           DISTRICT           PSA          WARD  CENSUS_TRACT           CCN  \
count  36446.000000  36445.000000  36493.000000  36379.000000  3.649300e+04   
mean       3.697196    374.298395      4.421259   6211.275791  1.511937e+07   
std        1.947438    194.524001      2.339270   3146.217537  1.087825e+05   
min        1.000000    101.000000      1.000000    100.000000  6.155556e+06   
25%        2.000000    206.000000      2.000000   3400.000000  1.505885e+07   
50%        4.000000    401.000000      5.000000   7000.000000  1.511063e+07   
75%        5.000000    506.000000      6.000000   8904.000000  1.516497e+07   
max        7.000000    708.000000      8.000000  11100.000000  1.619697e+07   

              XBLOCK         YBLOCK  
count   36493.000000   36493.000000  
mean   399301.346694  137698.576414  
std      3113.115343    3424.503748  
min    390147.000000  127300.000000  
25%    397228.000000  136027.000000  
50%    398878.000000  137622.530000  
75%    401257.000000  

#### Example Record from Data Set

In [95]:
# print an example 
print(dc.loc[1234])  # .loc replaces the deprecated .ix indexer

REPORT_DAT                                 06/24/2015 23:10
SHIFT                                              MIDNIGHT
OFFENSE                                         THEFT/OTHER
METHOD                                               OTHERS
BLOCK                   600 - 699 BLOCK OF MORTON STREET NW
DISTRICT                                                  3
PSA                                                     302
WARD                                                      1
ANC                                                      1A
NEIGHBORHOOD_CLUSTER                              Cluster 2
BLOCK_GROUP                                        003200 3
CENSUS_TRACT                                           3200
VOTING_PRECINCT                                 Precinct 38
CCN                                                15095285
XBLOCK                                               398044
YBLOCK                                               140473
START_DATE                              

#### Field Definitions
The [Crime Definitions](http://crimemap.dc.gov/CrimeDefinitions.aspx) page provides detailed definitions of the codes used in this data set.

|Column|Data Type|Value Range|Description|Missing|
|:-----|:--------|:----------|:----------|:-----:|
|REPORT_DAT|Date/Time|01/01/2015 00:00:00 - 12/31/2015 23:59:59|The date/time the offense was *reported*|0|
|SHIFT|Nominal|Day = 0700-1500, Evening = 1500-2300, Midnight = 2300-0700|The duty shift that responded to the call|0|
|OFFENSE|Nominal|Various|The category of crime committed (from the Crime Definitions link above)|0|
|METHOD|Nominal|"OTHERS", "GUN", "KNIFE"|A qualifier to the Offense that flags special considerations, such as the use of a gun|0|
|BLOCK|Nominal|Varies|The street and block identifier|0|
|DISTRICT|Integer|1-7|The police district|47 (0.13%)|
|PSA|Integer|{1-7}{01-08}: 101-108,...,701-708|Police Service Area|48 (0.13%)|
|WARD|Integer|1-8|The political Ward identifier|0|
|ANC|Nominal|{1-8}{A-G}|Advisory Neighborhood Commission|0|
|NEIGHBORHOOD_CLUSTER|Nominal|"Cluster "{1-39}|Neighborhood identifier|417 (1.14%)|
|BLOCK_GROUP|Nominal|{CENSUS_TRACT}{space}{1-6}|Subdivision within a tract|114 (0.31%)|
|CENSUS_TRACT|Integer|Discontinuous values between 100 and 11100|Land management tract identifier|114 (0.31%)|
|VOTING_PRECINCT|Nominal|"Precinct "{1-143}|Political subdivision|13 (0.04%)|
|CCN|Integer|Discontinuous values between 14151815 and 15403340|Criminal Complaint Number - unique to each report|0|
|XBLOCK|Ratio|min: 390,147; max: 407,806|Eastern coordinate of crime scene (meters)|0|
|YBLOCK|Ratio|min: 127,300; max: 147,292|Northern coordinate of crime scene (meters)|0|
|START_DATE|Date/Time|Varies|The earliest the crime *might* have been committed|0|
|END_DATE|Date/Time|Varies|The latest the crime *might* have been committed|252 (0.69%)|

Given that we have geo-physical coordinates, we believe we can impute some of the missing geo-political values (such as Police District).
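
The Missing column can be cross-checked against the non-null counts reported by `dc.info()`. The short sketch below does the arithmetic with the counts copied from that output; the helper name `missing_summary` is ours and is only illustrative.

```python
# Non-null counts copied from the dc.info() output for the columns with gaps.
TOTAL_ROWS = 36493
non_null = {
    'DISTRICT': 36446,
    'PSA': 36445,
    'NEIGHBORHOOD_CLUSTER': 36076,
    'BLOCK_GROUP': 36379,
    'CENSUS_TRACT': 36379,
    'VOTING_PRECINCT': 36480,
    'END_DATE': 36241,
}

def missing_summary(counts, total):
    """Return {column: (missing_count, missing_percent)}."""
    summary = {}
    for column, present in counts.items():
        missing = total - present
        summary[column] = (missing, round(100.0 * missing / total, 2))
    return summary

summary = missing_summary(non_null, TOTAL_ROWS)
for column in sorted(summary):
    print('%-22s %4d (%.2f%%)' % ((column,) + summary[column]))
```

With the data frame loaded, `dc.isnull().sum()` should report the same counts directly.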

### 2. Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific.

#### Convert Data Frame Columns to Correct Data Type

TODO: describe the conversions including the mappings

In [96]:
# convert REPORT_DAT to datetime
dc['REPORT_DAT'] = pd.to_datetime(dc['REPORT_DAT'])

# convert SHIFT to int
shift_mapping = {'day':0, 'evening':1, 'midnight':2}
dc['SHIFT'] = pd.to_numeric(dc['SHIFT'].str.lower().map(shift_mapping))

# convert OFFENSE to integer codes (offenses missing from the mapping become NaN)
offense_mapping = {'theft/other':0, 'theft f/auto':1, 'burglary':2, 'assault w/dangerous weapon':3, 'robbery':4}
dc['OFFENSE'] = pd.to_numeric(dc['OFFENSE'].str.lower().map(offense_mapping))

# convert METHOD to integer codes (methods missing from the mapping become NaN)
method_mapping = {'others':0, 'gun':1}
dc['METHOD'] = pd.to_numeric(dc['METHOD'].str.lower().map(method_mapping))

# convert BLOCK to integer codes (only the ten blocks listed here are mapped so far)
block_mapping = {'2100 - 2199 BLOCK OF 14TH STREET NW':0,
            '3400 - 3499 BLOCK OF MOUNT PLEASANT STREET NW':1,
            '3100 - 3299 BLOCK OF 14TH STREET NW':2,
            '400 - 599 BLOCK OF HOWARD PLACE NW':3,
            'IRVING STREET NW AND 13TH STREET NW':4,
            '2030 - 2199 BLOCK OF 9TH STREET NW':5,
            '1200 - 1300 BLOCK OF CLIFTON STREET NW':6,
            '3517 - 3648 BLOCK OF 16TH STREET NW':7,
            '944 - 960 BLOCK OF FLORIDA AVENUE NW':8,
            '2600 - 2798 BLOCK OF 15TH STREET NW':9}
dc['BLOCK'] = pd.to_numeric(dc['BLOCK'].str.upper().map(block_mapping))

# convert DISTRICT to numeric
dc['DISTRICT'] = pd.to_numeric(dc['DISTRICT'])

# convert PSA to numeric
dc['PSA'] = pd.to_numeric(dc['PSA'])

# convert WARD to numeric
dc['WARD'] = pd.to_numeric(dc['WARD'])

# convert ANC to integer codes (only a few ANCs are mapped so far)
anc_mapping = {'1a':0, '1b':2, '1d':3}
dc['ANC'] = pd.to_numeric(dc['ANC'].str.lower().map(anc_mapping))

# convert NEIGHBORHOOD_CLUSTER to integer codes (only two clusters are mapped so far)
neighborhood_cluster_mapping = {'cluster 2':0, 'cluster 3':1}
dc['NEIGHBORHOOD_CLUSTER'] = pd.to_numeric(dc['NEIGHBORHOOD_CLUSTER'].str.lower().map(neighborhood_cluster_mapping))

# convert BLOCK_GROUP to integer codes (only ten block groups are mapped so far)
block_group_mapping = {'004300 1':0, '002701 3':1, '003000 1':2, '003400 2':3, '003000 2':4, '003500 2':5,
 '003600 3':6, '002701 4':7, '004400 1':8, '003700 2':9}
dc['BLOCK_GROUP'] = pd.to_numeric(dc['BLOCK_GROUP'].str.lower().map(block_group_mapping))

# convert CENSUS_TRACT to numeric
dc['CENSUS_TRACT'] = pd.to_numeric(dc['CENSUS_TRACT'])

# convert VOTING_PRECINCT to integer codes (only six precincts are mapped so far)
voting_precinct_mapping = {'precinct 22':0, 'precinct 40':1, 'precinct 39':2, 'precinct 37':3, 'precinct 23':4,
 'precinct 36':5}
dc['VOTING_PRECINCT'] = pd.to_numeric(dc['VOTING_PRECINCT'].str.lower().map(voting_precinct_mapping))

# convert CCN to numeric
dc['CCN'] = pd.to_numeric(dc['CCN'])

# convert XBLOCK, YBLOCK to numeric
dc['XBLOCK'] = pd.to_numeric(dc['XBLOCK'])
dc['YBLOCK'] = pd.to_numeric(dc['YBLOCK'])

# convert START_DATE, END_DATE to datetime
dc['START_DATE'] = pd.to_datetime(dc['START_DATE'])
dc['END_DATE'] = pd.to_datetime(dc['END_DATE'])

dc.info()  # info() prints its own report and returns None
print(dc[:5])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36493 entries, 0 to 36492
Data columns (total 18 columns):
REPORT_DAT              36493 non-null datetime64[ns]
SHIFT                   36493 non-null int64
OFFENSE                 33246 non-null float64
METHOD                  35326 non-null float64
BLOCK                   434 non-null float64
DISTRICT                36446 non-null float64
PSA                     36445 non-null float64
WARD                    36493 non-null int64
ANC                     4384 non-null float64
NEIGHBORHOOD_CLUSTER    4537 non-null float64
BLOCK_GROUP             1672 non-null float64
CENSUS_TRACT            36379 non-null float64
VOTING_PRECINCT         2829 non-null float64
CCN                     36493 non-null int64
XBLOCK                  36493 non-null float64
YBLOCK                  36493 non-null float64
START_DATE              36493 non-null datetime64[ns]
END_DATE                36241 non-null datetime64[ns]
dtypes: datetime64[ns](3), float64(1
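
The drop in non-null counts after conversion (for example, OFFENSE falls from 36,493 to 33,246) happens because `map` returns NaN for any value that is not a key in the hand-written dictionary. One possible remedy, sketched here in plain Python, is to build each mapping from the distinct values actually present. The helper `build_mapping` and the sample list are hypothetical; the codes it assigns are arbitrary (alphabetical order) and carry no ordinal meaning.

```python
def build_mapping(values):
    """Map each distinct non-missing value to an integer code (alphabetical order)."""
    distinct = sorted(set(v.lower() for v in values if v is not None))
    return dict((value, code) for code, value in enumerate(distinct))

# Hypothetical sample standing in for the METHOD column (None marks a missing value).
methods = ['OTHERS', 'GUN', 'KNIFE', 'OTHERS', None]

mapping = build_mapping(methods)
codes = [mapping[m.lower()] if m is not None else None for m in methods]
print(mapping)
print(codes)
```

For a data frame column, the same idea could be applied by passing `dc['METHOD'].dropna().unique()` to the helper and then using the result with `map`.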

#### Missing Values Strategy

#### Duplicate Data

#### Unique Data

#### Outliers

#### New Features

In [25]:
# TODO: create two dummy columns that classify each crime as violent or property, for use in prediction

### 3. Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful.

#### guidance

### 4. Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.

#### guidance

### 5. Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.

#### guidance

### 6. Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

#### guidance

### 7. Are there other features that could be added to the data or created from existing features? Which ones?

#### New Features
1. Dummy variables are required to indicate whether each crime falls in the violent or the property category. These variables would then be used to train a machine learning regression model.
2. It would be useful to have the number of police officers deployed to each police district per shift, to see whether staffing levels are correlated with the number and type of crimes.
3. It would be useful to have the police improvement campaigns and their budgets by ward, to determine whether these investments are correlated with the number and type of crimes.
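
A minimal sketch of the first feature follows. The grouping of offenses into violent and property categories is our assumption and should be checked against the Crime Definitions page. Only five of the offense labels appear in the excerpts above; the remaining labels ('HOMICIDE', 'SEX ABUSE', 'MOTOR VEHICLE THEFT', 'ARSON') are assumed spellings. The function name `crime_flags` is hypothetical.

```python
# Assumed grouping of offense labels; verify against the Crime Definitions page.
VIOLENT = set(['HOMICIDE', 'SEX ABUSE', 'ASSAULT W/DANGEROUS WEAPON', 'ROBBERY'])
PROPERTY = set(['BURGLARY', 'THEFT/OTHER', 'THEFT F/AUTO', 'MOTOR VEHICLE THEFT', 'ARSON'])

def crime_flags(offense):
    """Return the two dummy variables for a single OFFENSE value."""
    key = offense.strip().upper()
    return {'IS_VIOLENT': int(key in VIOLENT), 'IS_PROPERTY': int(key in PROPERTY)}

print(crime_flags('ROBBERY'))
print(crime_flags('theft/other'))
```

An offense label that appears in neither set yields zeros in both columns, which makes unexpected categories easy to spot.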

# Parking lot

In [2]:
#  Show the column headers and the first five rows
dc.head()

| |REPORT_DAT|SHIFT|OFFENSE|METHOD|BLOCK|DISTRICT|PSA|WARD|ANC|NEIGHBORHOOD_CLUSTER|BLOCK_GROUP|CENSUS_TRACT|VOTING_PRECINCT|CCN|XBLOCK|YBLOCK|START_DATE|END_DATE|
|:-|:---------|:----|:------|:-----|:----|:-------|:--|:---|:--|:-------------------|:----------|:-----------|:--------------|:--|:-----|:-----|:---------|:-------|
|0|03/04/2015 12:05|DAY|THEFT/OTHER|OTHERS|2100 - 2199 BLOCK OF 14TH STREET NW|3.0|305.0|1|1B|Cluster 3|004300 1|4300.0|Precinct 22|14151815|397229.0|138975.0|08/01/2014 12:00|09/01/2014 12:03|
|1|01/22/2015 09:00|DAY|THEFT F/AUTO|OTHERS|3400 - 3499 BLOCK OF MOUNT PLEASANT STREET NW|4.0|408.0|1|1D|Cluster 2|002701 3|2701.0|Precinct 40|14174448|396535.83|140772.03|11/08/2014 10:00|11/10/2014 11:08|
|2|01/03/2015 21:20|EVENING|THEFT/OTHER|OTHERS|3100 - 3299 BLOCK OF 14TH STREET NW|3.0|302.0|1|1A|Cluster 2|003000 1|3000.0|Precinct 39|15001508|397162.0|140182.0|01/03/2015 19:10|01/03/2015 19:20|
|3|01/05/2015 12:44|DAY|THEFT/OTHER|OTHERS|400 - 599 BLOCK OF HOWARD PLACE NW|3.0|306.0|1|1B|Cluster 3|003400 2|3400.0|Precinct 37|15002278|398290.0|139412.0|12/19/2014 17:00|01/05/2015 11:30|
|4|01/20/2015 07:01|DAY|THEFT F/AUTO|OTHERS|IRVING STREET NW AND 13TH STREET NW|3.0|302.0|1|1A|Cluster 2|003000 2|3000.0|Precinct 39|15009493|397424.06|140084.43|01/19/2015 11:00|01/20/2015 06:59|


## Data Preparation
### Shift
##### TO DO:
* Frequency plot.  Plot the number of crimes reported for each shift.  Which shift has the most activity?
* Encode the shift data to follow the frequency plot: -1 = the shift before the most activity, 0 = the shift with the most activity, +1 = the shift after the most activity
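
The encoding described above could be sketched as follows. The sample list is made up for illustration, and the function name `encode_shifts` is hypothetical; the shifts are treated as a cycle (day, evening, midnight, then day again) so that "before" and "after" wrap around.

```python
from collections import Counter

SHIFT_ORDER = ['DAY', 'EVENING', 'MIDNIGHT']  # chronological order, wrapping around

def encode_shifts(shifts):
    """Count crimes per shift and code shifts relative to the busiest one."""
    counts = Counter(shifts)
    busiest = max(SHIFT_ORDER, key=lambda s: counts[s])
    i = SHIFT_ORDER.index(busiest)
    codes = {
        SHIFT_ORDER[(i - 1) % 3]: -1,  # shift before the busiest
        busiest: 0,                    # busiest shift
        SHIFT_ORDER[(i + 1) % 3]: 1,   # shift after the busiest
    }
    return counts, codes

# Made-up sample standing in for the SHIFT column.
sample = ['DAY', 'EVENING', 'EVENING', 'MIDNIGHT', 'EVENING', 'DAY']
counts, codes = encode_shifts(sample)
print(codes)
```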

### Offense
##### TO DO:
* Frequency plot.  Plot the number of instances of each offense and compare to the 2015 published numbers.  We *should* match!
* Encode the offense nominal values to numeric values.  May have to be an arbitrary coding scheme
* Calculate rate for each offense given the published estimated population value of 672,228 (use the 'per 100,000' criterion).
  * For example, homicide rate = (# homicides) / (population / 100,000) = 162 / 6.72228 = 24.10.  Compare to the published value (24).
* That gives us the probability of being murdered: 24.10/100,000 = 0.024%
* Calculate the same probability for each offense type.
* Now we have a continuous response variable we can use for regression/PCA/logistic/ etc.
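
The rate calculation above is simple enough to capture in two small helpers. This is only a sketch: the function names are ours, and the homicide count of 162 and the population of 672,228 are the figures quoted in the list.

```python
POPULATION = 672228.0  # published estimated population quoted above

def rate_per_100k(count, population=POPULATION):
    """Number of offenses per 100,000 residents."""
    return count / (population / 100000.0)

def probability_percent(count, population=POPULATION):
    """Offenses per resident, expressed as a percentage."""
    return 100.0 * count / population

homicide_rate = rate_per_100k(162)
homicide_pct = probability_percent(162)
print('rate per 100,000: %.2f' % homicide_rate)
print('probability: %.3f%%' % homicide_pct)
```

The same helpers could be applied to the count of each offense type once the frequency table is available.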

### Method
##### TO DO:
* Frequency plot.
* This one should be interesting because this leads to additional options of predicting if a gun or knife will be involved, etc.
* Compare to published values about gun-related crimes

### Block
##### TO DO:
* Harvest the street name from the value and put in its own column
* Is there a "most dangerous street"?
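
One way to harvest street names is sketched below. It assumes BLOCK values take one of the two forms seen in the sample rows: a numbered block (`N - M BLOCK OF <street>`) or an intersection (`<street> AND <street>`). The function name `street_names` is hypothetical, and other formats in the full file would need their own handling.

```python
import re
from collections import Counter

_BLOCK_RE = re.compile(r'^\s*\d+\s*-\s*\d+\s+BLOCK OF\s+(.+?)\s*$', re.IGNORECASE)

def street_names(block):
    """Return the street name(s) found in a BLOCK value."""
    match = _BLOCK_RE.match(block)
    if match:
        return [match.group(1).upper()]
    # Otherwise treat the value as an intersection of two or more streets.
    return [part.strip().upper()
            for part in re.split(r'\s+AND\s+', block, flags=re.IGNORECASE)]

blocks = [
    '600 - 699 BLOCK OF MORTON STREET NW',
    'IRVING STREET NW AND 13TH STREET NW',
    '2100 - 2199 BLOCK OF 14TH STREET NW',
    '3100 - 3299 BLOCK OF 14TH STREET NW',
]
street_counts = Counter(name for b in blocks for name in street_names(b))
print(street_counts.most_common(1))
```

Counting the harvested names, as in the last two lines, gives a first answer to the "most dangerous street" question, although longer streets will naturally accumulate more reports.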

### District
##### TO DO:
* There are missing values for District.  Since the districts have a geo-physical separation, we **should** be able to impute the missing district value given the coordinates of the crime.
  * Compute the mean XBlock and YBlock values for each district
  * For each record missing a District value, compare the record's XBlock and YBlock to each District mean and report the district that is closest.  Use that value to fill in the blank
* Once the missing values are imputed, perform a frequency plot (by offense?).
* Can we get the estimated number of police officers per District? Estimated population?
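
The nearest-mean imputation described above might look like the following sketch. Records are represented as `(district, xblock, yblock)` tuples with `None` for a missing district; the coordinates in the sample are made up, and the function names are hypothetical. Because districts are irregular polygons, the closest mean is only an approximation of the true district.

```python
import math

def district_centroids(records):
    """Mean XBLOCK/YBLOCK for each district that has a known value."""
    sums = {}
    for district, x, y in records:
        if district is None:
            continue
        sx, sy, n = sums.get(district, (0.0, 0.0, 0))
        sums[district] = (sx + x, sy + y, n + 1)
    return dict((d, (sx / n, sy / n)) for d, (sx, sy, n) in sums.items())

def nearest_district(x, y, centroids):
    """District whose mean location is closest to (x, y)."""
    return min(centroids,
               key=lambda d: math.hypot(x - centroids[d][0], y - centroids[d][1]))

def impute_districts(records):
    """Fill in missing districts using the closest district mean."""
    centroids = district_centroids(records)
    return [(d if d is not None else nearest_district(x, y, centroids), x, y)
            for d, x, y in records]

# Made-up sample: two districts and one record with a missing district.
sample = [(1, 399000.0, 136000.0), (1, 399200.0, 136200.0),
          (3, 397000.0, 140000.0), (3, 397200.0, 140400.0),
          (None, 397100.0, 139900.0)]
filled = impute_districts(sample)
print(filled[-1])
```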

### PSA (Police Service Area)
##### TO DO:
* Similar to District.
* Compute mean XBlock/YBlock for each PSA
* Estimate the PSA for the missing values based on proximity
* Frequency plot (by offense?)
* Can we get the estimated number of officers per PSA?  estimated population?

### Ward
##### TO DO:
* Frequency plot (by offense?)
* Which police districts are involved?

### ANC (Advisory Neighborhood Commission)
##### TO DO:
* Frequency plot (by offense?)
* Which police district(s) are involved?

### Neighborhood Cluster
* Not sure what to do with this.  Ideas?
* Separate the numeric value from the label and place in its own column

### Block Group
* This value appears to be based on the CENSUS_TRACT variable, but with higher resolution.
* The current value uses a space character to separate the block group from the CENSUS_TRACT value
##### TO DO:
* Recommend replacing the space with a "." and turn the value into a floating point number.  It still retains the information.
* Frequency plot (by offense?)
* What police district(s) are involved?
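
The recommended conversion can be sketched in a few lines. The function name is hypothetical, and the sketch assumes every value has the form `{tract}{space}{group}` shown in the sample rows.

```python
def block_group_to_float(block_group):
    """Turn a value such as '003200 3' into 3200.3 (tract.group)."""
    tract, group = block_group.split()
    return float('%d.%s' % (int(tract), group))

print(block_group_to_float('003200 3'))
print(block_group_to_float('002701 4'))
```

The integer part of the result then matches CENSUS_TRACT, and the fractional digit carries the block group.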

### Census Tract
* Appears to be related to BLOCK_GROUP.
* Not sure what to do with this

### Voting Precinct
##### TO DO:
* The field values have a common label in them.  Separate out the numeric precinct number into its own field.
* This field has missing values, but due to political gerrymandering, the shapes of the voting precincts may not be conducive to imputing missing values by proximity.
* Can we get an indication whether these precincts are Republican or Democrat?
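
Separating the number from the label could be done with a small helper like the one below, which would also serve for the "Cluster N" values in NEIGHBORHOOD_CLUSTER. The function name `label_number` is hypothetical; values that do not match the expected pattern are returned as `None` rather than guessed.

```python
import re

_LABEL_RE = re.compile(r'^\s*(?:Precinct|Cluster)\s+(\d+)\s*$', re.IGNORECASE)

def label_number(label):
    """Extract the integer from 'Precinct 38' or 'Cluster 2'; None if absent."""
    if label is None:
        return None
    match = _LABEL_RE.match(label)
    return int(match.group(1)) if match else None

print(label_number('Precinct 38'))
print(label_number('Cluster 2'))
```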

### CCN (Criminal Complaint Number)
##### TO DO:
* Since these are the actual "case" numbers, it would be nice to see if we could get publicly available data for the cases (perpetrator information, victim demographics, etc.)

### Start and End dates
* These values represent the span of time in which the crime *might* have been committed.
* The END_DATE field is missing for 252 records (0.69%).  Examine these to see if, perhaps, they should be the same as the START_DATE value.
  * This might be imputed if the reporting date coincides with the Start date.
* There are also incorrect values (I noticed a date of 1915, for example - is it truly a 100-year-old cold case, or did the person simply enter the wrong century?)
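
A simple screening function for these date problems is sketched below. It assumes the `MM/DD/YYYY HH:MM` format seen in the sample rows; the five-year threshold for a suspicious START_DATE is an arbitrary assumption, and the function names are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = '%m/%d/%Y %H:%M'  # format seen in the sample rows

def parse(value):
    """Parse a date string, returning None for a missing value."""
    return datetime.strptime(value, DATE_FORMAT) if value else None

def check_dates(report, start, end, max_years=5):
    """Return a list of problems found in one record's dates."""
    report_dt, start_dt, end_dt = parse(report), parse(start), parse(end)
    issues = []
    if end_dt is None:
        issues.append('END_DATE missing')
    if start_dt is not None and report_dt.year - start_dt.year > max_years:
        issues.append('START_DATE more than %d years before report' % max_years)
    if start_dt is not None and end_dt is not None and end_dt < start_dt:
        issues.append('END_DATE before START_DATE')
    return issues

print(check_dates('01/03/2015 21:20', '01/03/2015 19:10', '01/03/2015 19:20'))
print(check_dates('06/24/2015 23:10', '06/24/1915 22:00', None))
```

Records flagged this way can then be reviewed by hand before deciding whether to correct the century or impute END_DATE from START_DATE.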

### Coordinates (XBLOCK, YBLOCK)
One obvious visualization tool would be to plot the geo-spatial relationship of the data, and, fortunately, this dataset provides the *approximate* location of the crime (presumably to preserve the privacy of the victim(s)) in grid coordinates (XBLOCK = East offset from the "Origin"; YBLOCK = North offset from the "Origin").  The question is, where is that Origin?

![Identity of Coordinate Origin](images/Coordinates.png "Origin for Location Coordinates")
<p style='text-align:center'>(<em>Screen capture of description of coordinate origin</em>)</p>

On the download page for the datasets, next to the "Map Coordinates" field selector, there is a description of the origin, which states that the values are in the Maryland State Plane, NAD 83 map projection.  Further research led to a web page that defines the [Maryland coordinate system](http://www.mgs.md.gov/geology/maryland_coordinate_system.html "Maryland State Coordinate System").

The coordinate system is a Lambert conformal conic projection with two standard parallels (latitudes), a choice that reduces the distortion introduced when a curved surface is mapped onto a flat plane.  With the coordinate system defined, we can reverse the projection and re-project to a different system that can be used with other mapping/GIS tools.  The transformation methodology came from the National Geospatial-Intelligence Agency (NGA), but a more concise explanation of the method was provided by this website: http://www.linz.govt.nz/data/geodetic-system/coordinate-conversion/projection-conversions/lambert-conformal-conic-geographic 

In order to do the coordinate transformations, we need to get several parameters set up first.

|Parameter|Description|Value|
|:--------|:----------|:----|
|a|Semi-major axis of reference ellipsoid (meters)|6378137 (Maryland uses the GRS80 reference)|
|f|Ellipsoidal flattening|1/298.257222101 (GRS80)|
|&theta;<sub>1</sub>|Latitude of first standard parallel (degrees)|38.3 (38&deg; 18' from Maryland definition)|
|&theta;<sub>2</sub>|Latitude of second standard parallel (degrees)|39.45 (39&deg; 27' from Maryland definition)|
|&theta;<sub>0</sub>|Origin Latitude (degrees)|37.66667 (North 37&deg; 40' from Maryland definition)|
|&lambda;<sub>0</sub>|Origin Longitude (degrees)|-77.0 (West 77&deg; 00', the central meridian of the Maryland State Plane; with the 400,000 m false easting, the grid zero point falls near West 81&deg; 31' 45")|
|N<sub>0</sub>|False Northing (meters)|0.0 (from Maryland definition)|
|E<sub>0</sub>|False Easting (meters)|400,000 (from Maryland definition)|

From these, we can derive the projection constants

|Constant|Derivation|Value|
|-------|-----------|-----|
|e      |$\sqrt{2f - f^2}$ | 0.081819191|
|m<sub>i</sub>|$\frac{\cos \theta_i}{\sqrt{1-e^2\sin^2 \theta_i}}$|m<sub>1</sub>=0.785787341<br>m<sub>2</sub>=0.773225009|
|t<sub>i</sub>|$\frac{\tan \left[(\frac{\pi}{4})-(\frac{\theta_i}{2})\right]}{\left(\frac{1-e\sin\theta_i}{1+e\sin\theta_i}\right)^\frac{e}{2}}$|t<sub>0</sub>=0.493354296<br>t<sub>1</sub>=0.486512044<br>t<sub>2</sub>=0.474178631|
|n      |$\frac{\ln m_1 - \ln m_2}{\ln t_1 - \ln t_2}$|0.627634132|
|F      |$\frac{m_1}{n(t_1)^n}$|1.967837417|
|&rho;<sub>0</sub>  |$a F t_0^n$|8055622.737|

Now, for each point (i) in our dataset, we must perform the following steps:
1. Adjust the North offset using the false northing - our false northing is 0, so this step is skipped
2. Adjust the East offset using the false easting: $E_i' = E_i - E_0$
3. $\rho_i' = \sqrt{(E_i')^2 + (\rho_0-N_i)^2}$
4. $t_i'= \left(\frac{\rho_i'}{a F}\right)^\frac{1}{n}$
5. $\gamma_i' = \tan^{-1}\left(\frac{E_i'}{\rho_0-N_i}\right)$
6. $\lambda_i = \frac{\gamma_i'}{n}+\lambda_0$ (This is the longitude of the location)
7. The calculation for latitude is iterative.
   1. $\theta_{i,0} = \frac{\pi}{2}-2\tan^{-1}(t_i')$ (This is our initial estimate of latitude)
   2. $\theta_{i,j} = \frac{\pi}{2}-2\tan^{-1}\left[t_i'\left(\frac{1-e\sin\theta_{i,j-1}}{1+e\sin\theta_{i,j-1}}\right)^\frac{e}{2}\right]$ (We use the previous estimate to create a new estimate)
   3. Repeat the previous step until the difference in estimates is negligible (this typically takes three iterations)
8. $\theta_i$ is our estimate of the latitude for the location


In [8]:
import math

#  Build a class that handles generic reference ellipsoid parameters in case we have multiple coordinate systems to deal with
class refEllipsoid:
    #  a = Equatorial radius (meters)
    #  f = Flattening (the degree to which the polar radius is compressed compared to the equatorial radius)
    #  b = Polar radius (meters): f = (a-b)/a; af = a-b; af - a = -b; b = a - af = a(1-f)
    #  e2 = First eccentricity squared: 1 - b^2/a^2 = 2f - f^2
    #  e = First eccentricity
    #  p2 = Second eccentricity squared: a^2/b^2 - 1 = f(2 - f)/(1 - f)^2
    def __init__(self, equator, flattening):
        #  Provided
        self.a = float(equator)
        self.f = 1.0/float(flattening)
        
        #  Derived
        self.b = self.a * (1.0 - self.f)
        self.e2 = (2.0 * self.f) - self.f**2
        self.e = math.sqrt(self.e2)
        self.p2 = (self.a**2 / self.b**2) - 1.0

GRS80 = refEllipsoid(6378137.0,298.257222101)  #  Define the Geodetic Reference System 1980 (GRS80) ellipsoid

#  Function to convert individual angular components to floating-point degrees
def DMS(degrees, minutes, seconds):
    sign = 1.0
    if degrees < 0:
        sign = -1.0
    return sign * (math.fabs(float(degrees)) + (float(minutes) / 60.0) + (float(seconds) / 3600.0))

class coordOrigin:
    def __init__(self,latOrigin,lonOrigin,parallel1,parallel2,northing,easting):
        #  Create the dictionaries per instance so that separate coordinate systems do not share state
        self.origin = {'lat': math.radians(latOrigin), 'lon': math.radians(lonOrigin)}
        self.parallel = {1: math.radians(parallel1), 2: math.radians(parallel2)}
        self.false = {'n': float(northing), 'e': float(easting)}

#  Define the origin for the Maryland state coordinate system
#  (the origin longitude is the central meridian, 77 degrees West; the false easting shifts the grid zero point west of it)
MD = coordOrigin(DMS(37,40,0),DMS(-77,0,0),DMS(38,18,0),DMS(39,27,0),0,400000)

class Lambert:
    def __init__(self):
        self.m1 = Lambert._m(MD.parallel[1])
        self.m2 = Lambert._m(MD.parallel[2])
        self.t0 = Lambert._t(MD.origin['lat'])
        self.t1 = Lambert._t(MD.parallel[1])
        self.t2 = Lambert._t(MD.parallel[2])
        self.n = (math.log(self.m1) - math.log(self.m2))/(math.log(self.t1)-math.log(self.t2))
        self.F = self.m1 / (self.n * self.t1**self.n)
        self.p0 = GRS80.a * self.F * self.t0**self.n
        
    @staticmethod
    def _m(parallel):
        return math.cos(float(parallel)) / math.sqrt(1.0 - (GRS80.e2 * math.sin(float(parallel))**2))

    @staticmethod
    def _t(parallel):
        return math.tan((math.pi / 4.0) - (float(parallel) / 2.0)) / ((1 - (GRS80.e * math.sin(float(parallel)))) / (1 + (GRS80.e * math.sin(float(parallel)))))**(GRS80.e / 2.0)

Projection = Lambert()
Projection.__dict__

{'F': 1.9678374170334183,
 'm1': 0.7857873413486276,
 'm2': 0.7732250089525023,
 'n': 0.6276341323554715,
 'p0': 8055622.737265018,
 't0': 0.49335429608783643,
 't1': 0.48651204380528484,
 't2': 0.4741786305194415}

In [1]:
# TODO: Create a python function to perform these calculations and add the resulting columns to the data frame
#  I did this in Excel to make sure the process worked.
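
As a starting point for that TODO, the inverse projection steps listed earlier could be written as a single function. This is a sketch rather than a validated implementation. It is self-contained, so it recomputes the projection constants instead of reusing the classes above. It takes 77&deg; W as the origin longitude, because the formulas require the central meridian there. The names `lambert_to_geodetic`, `tol`, and `max_iter` are ours.

```python
import math

# GRS80 ellipsoid and Maryland State Plane (NAD 83) parameters.
A = 6378137.0                              # semi-major axis (meters)
F_INV = 298.257222101                      # inverse flattening
E2 = 2.0 / F_INV - (1.0 / F_INV) ** 2      # first eccentricity squared
E = math.sqrt(E2)
LAT0 = math.radians(37.0 + 40.0 / 60.0)    # origin latitude
LON0 = math.radians(-77.0)                 # central meridian
LAT1 = math.radians(38.0 + 18.0 / 60.0)    # first standard parallel
LAT2 = math.radians(39.0 + 27.0 / 60.0)    # second standard parallel
FALSE_E = 400000.0
FALSE_N = 0.0

def _m(lat):
    return math.cos(lat) / math.sqrt(1.0 - E2 * math.sin(lat) ** 2)

def _t(lat):
    es = E * math.sin(lat)
    return math.tan(math.pi / 4.0 - lat / 2.0) / ((1.0 - es) / (1.0 + es)) ** (E / 2.0)

N_CONE = (math.log(_m(LAT1)) - math.log(_m(LAT2))) / (math.log(_t(LAT1)) - math.log(_t(LAT2)))
F_CONE = _m(LAT1) / (N_CONE * _t(LAT1) ** N_CONE)
RHO0 = A * F_CONE * _t(LAT0) ** N_CONE

def lambert_to_geodetic(easting, northing, tol=1e-12, max_iter=20):
    """Convert grid coordinates (meters) to (latitude, longitude) in degrees."""
    e_adj = easting - FALSE_E
    n_adj = northing - FALSE_N
    rho = math.sqrt(e_adj ** 2 + (RHO0 - n_adj) ** 2)
    t = (rho / (A * F_CONE)) ** (1.0 / N_CONE)
    gamma = math.atan2(e_adj, RHO0 - n_adj)
    lon = gamma / N_CONE + LON0
    lat = math.pi / 2.0 - 2.0 * math.atan(t)          # initial estimate
    for _ in range(max_iter):                         # refine until stable
        es = E * math.sin(lat)
        new_lat = math.pi / 2.0 - 2.0 * math.atan(t * ((1.0 - es) / (1.0 + es)) ** (E / 2.0))
        converged = abs(new_lat - lat) < tol
        lat = new_lat
        if converged:
            break
    return math.degrees(lat), math.degrees(lon)

# Example record shown earlier (600 - 699 block of Morton Street NW).
lat, lon = lambert_to_geodetic(398044.0, 140473.0)
print('%.5f, %.5f' % (lat, lon))
```

As a sanity check, the grid point (400000, 0) should map back to the projection origin, and the example record should land inside the District.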

### Visualization - Map
After transforming the coordinates to Geodetic (Latitude/Longitude) we can plot the locations with a mapping/GIS tool.  In this example, we used a tool developed by Mercury Solutions, Inc. (Tom Elkins' company) that plots multiple tactical data sources.
![Crime data on map](images/Data_on_Map.png "Crime locations on DC Map")

### Data By Geo-Political Identifiers
#### Political Ward
![Crimes by Ward](images/CrimesByWard.png "Crimes by Ward")
#### Police District
![Crimes by Police District](images/DistrictBorderOverlay.png "Crimes by Police District")
<p style='text-align:center'>Police District Map from http://mpdc.dc.gov/sites/default/files/dc/sites/mpdc/page_content/images/districtmap_2012.jpg
Modified in PowerPoint to remove the background color, and resized to fit over the data plot</p>
#### Advisory Neighborhood Commission
![Crimes by ANC](images/CrimesByANC.png "Crimes by ANC")


<div style='margin-left:10%;margin-right:10%;margin-top:15px;background-color:#d3d3d3;padding:10px;'>
<h3>Exceptional Work (<b>10 points total</b>)</h3>
    <ul><li>You have free rein to provide additional analyses.</li>
    <li>One idea: implement dimensionality reduction, then visualize and interpret the results.</li></ul>
</div>