# Crime Data Exploration

We're going to explore the crime data, focusing on one aspect that has been problematic: the distribution of crimes over time. Recently, when attempting to predict the crime rate for March 3, 2016, I found that there were apparently no crimes within the preceding 31 days! Evidently, I don't know which dates our crime data actually covers. Time for some exploration.

## Preliminaries

Load the data as an RDD. 

In [8]:
import os
import sys

import pyspark.sql as sql

# From https://stackoverflow.com/a/36218558 .
def sparkImport(module_name, module_directory):
    """
    Convenience function.

    Tells the SparkContext sc (must already exist) to ship
    module module_name to every computational node so that
    tasks running there can import it.

    Args:
        module_name: the name of the module, without ".py".
        module_directory: the path, absolute or relative, to
                          the directory containing module
                          module_name.
    """
    # Build an absolute path to the module file.
    module_path = os.path.abspath(
        os.path.join(module_directory, module_name + ".py"))
    sc.addPyFile(module_path)

# Add all scripts from repository to local path. 
# From https://stackoverflow.com/a/35273613 .
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import preprocessing as pp

sparkImport("preprocessing", "..")

ss = sql.SparkSession(sc)
complaints_df = ss.read.csv("crime_complaints_with_header.csv", inferSchema=True, header=True)
complaints_rdd = complaints_df.rdd.map(list)

Row(CMPLNT_NUM=101109527, CMPLNT_FR_DT=u'12/31/2015', CMPLNT_FR_TM=u'23:45:00', CMPLNT_TO_DT=None, CMPLNT_TO_TM=None, RPT_DT=u'12/31/2015', KY_CD=113, OFNS_DESC=u'FORGERY', PD_CD=729, PD_DESC=u'FORGERY,ETC.,UNCLASSIFIED-FELO', CRM_ATPT_CPTD_CD=u'COMPLETED', LAW_CAT_CD=u'FELONY', JURIS_DESC=u'N.Y. POLICE DEPT', BORO_NM=u'BRONX', ADDR_PCT_CD=44, LOC_OF_OCCUR_DESC=u'INSIDE', PREM_TYP_DESC=u'BAR/NIGHT CLUB', PARKS_NM=None, HADEVELOPT=None, X_COORD_CD=1007314, Y_COORD_CD=241257, Latitude=40.828848333, Longitude=-73.916661142, Lat_Lon=u'(40.828848333, -73.916661142)')
[101109527, u'12/31/2015', u'23:45:00', None, None, u'12/31/2015', 113, u'FORGERY', 729, u'FORGERY,ETC.,UNCLASSIFIED-FELO', u'COMPLETED', u'FELONY', u'N.Y. POLICE DEPT', u'BRONX', 44, u'INSIDE', u'BAR/NIGHT CLUB', None, None, 1007314, 241257, 40.828848333, -73.916661142, u'(40.828848333, -73.916661142)']


## Basic Counting

We want to know how many crimes in total are in our dataset. We also want to know how many *valid* crimes (the ones we actually predict) are in our dataset. A crime is *valid* iff it starts and ends on the same calendar date. 
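As a rough illustration of that definition, a validity check might look like the sketch below. This is not the actual `pp.complaint_is_valid` from the preprocessing module, which is not shown here. The function name `is_same_day_complaint`, the column positions (1 for `CMPLNT_FR_DT`, 3 for `CMPLNT_TO_DT`, taken from the row printed above), and the choice to reject records with a missing or malformed date are all assumptions.

```python
from datetime import datetime

# Column positions taken from the sample row above (assumption: the
# list form of every record keeps this order).
FROM_DATE_INDEX = 1  # CMPLNT_FR_DT
TO_DATE_INDEX = 3    # CMPLNT_TO_DT
DATE_FORMAT = "%m/%d/%Y"


def is_same_day_complaint(record):
    """Hypothetical stand-in for pp.complaint_is_valid."""
    from_date = record[FROM_DATE_INDEX]
    to_date = record[TO_DATE_INDEX]
    # Assumption: a record with a missing start or end date is not valid.
    if from_date is None or to_date is None:
        return False
    try:
        start = datetime.strptime(from_date, DATE_FORMAT).date()
        end = datetime.strptime(to_date, DATE_FORMAT).date()
    except ValueError:
        # Assumption: malformed date strings are treated as invalid.
        return False
    return start == end
```

A function of this shape can be passed straight to `RDD.filter`, since it takes one record and returns a boolean.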

In [7]:
print("total number of crimes: " + str(complaints_rdd.count()))

total number of crimes: 838153


In [9]:
valid_complaints_rdd = complaints_rdd.filter(pp.complaint_is_valid)
print("number of valid crimes: " + str(valid_complaints_rdd.count()))

number of valid crimes: 710577


Good. We still have about $85\%$ of our crimes ($710577 / 838153 \approx 0.848$).

## Distribution Over Time And Space

For the model, it doesn't particularly matter what the distribution is over space and time. However, for evaluating the accuracy of our model's crime predictions, we want to be aware of when there is *absolutely no crime* during a particular day. If there is absolutely no crime on a given day, then--regardless of how the model predicts crime for each grid square on that day--choosing to catch crime in the top-rated squares will technically allow you to catch all crime for that day. In other words, *any* model will achieve perfect performance on a day in which no crime occurs.
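One way to surface such days directly is to compare the days that appear in a per-day count against every calendar day in the covered range. A dictionary of counts built from the records themselves only contains days on which at least one complaint occurred, so a crime-free day shows up as a missing key rather than as a zero. The sketch below is a minimal illustration under stated assumptions: the helper name `days_without_crime` and the sample counts are made up, and the keys are assumed to be `datetime.date` objects.

```python
from datetime import date, timedelta


def days_without_crime(counts_by_day):
    """Return every date between the first and last key that is missing
    from counts_by_day or mapped to a count of zero."""
    if not counts_by_day:
        return []
    first_day = min(counts_by_day)
    last_day = max(counts_by_day)
    missing = []
    current = first_day
    while current <= last_day:
        if counts_by_day.get(current, 0) == 0:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# Made-up counts, for illustration only.
sample_counts = {
    date(2015, 12, 29): 3,
    date(2015, 12, 31): 5,
    date(2016, 1, 2): 1,
}
empty_days = days_without_crime(sample_counts)
print(empty_days)
```

Note that this only finds gaps between the earliest and latest recorded day; dates outside that range have to be checked against the range itself.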

(Note that, when I say "model" here, I am talking about models which attempt to rank grid squares on a given day according to how much crime will occur in each square. This class of models includes our linear model.)
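To make the degenerate case concrete, here is a small sketch of one possible evaluation metric for such ranking models: the fraction of a day's crimes that fall in the `k` highest-ranked grid squares. The function name `capture_rate`, the dictionary layout, and the convention of returning 1.0 on a crime-free day are assumptions chosen to mirror the argument above, not our actual evaluation code.

```python
def capture_rate(predicted_scores, actual_counts, k):
    """Fraction of the day's crimes that occurred in the k squares with
    the highest predicted scores.

    predicted_scores: dict mapping grid square id -> predicted score.
    actual_counts: dict mapping grid square id -> observed crime count.
    """
    total_crimes = sum(actual_counts.values())
    if total_crimes == 0:
        # No crime to catch: every ranking vacuously catches all of it.
        return 1.0
    top_squares = sorted(
        predicted_scores, key=predicted_scores.get, reverse=True)[:k]
    caught = sum(actual_counts.get(square, 0) for square in top_squares)
    return float(caught) / total_crimes


# Made-up scores and counts, for illustration only.
scores = {"a": 0.9, "b": 0.5, "c": 0.1}
busy_day = {"a": 2, "b": 1, "c": 1}
quiet_day = {"a": 0, "b": 0, "c": 0}

busy_rate = capture_rate(scores, busy_day, k=1)    # 2 of 4 crimes caught
quiet_rate = capture_rate(scores, quiet_day, k=1)  # vacuous case
```

On the quiet day any set of scores gives the same result, which is why such days should be identified before averaging a metric like this over many days.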

So, let's plot this: the number of crimes per day. 

In [12]:
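import matplotlib
# Select the non-interactive Agg backend before importing pyplot.
matplotlib.use('Agg')
import matplotlib.pyplot as pyplot

# Count complaints per occurrence day. countByKey() returns a dict on the
# driver, so days with no complaints are simply absent from it.
complaints_per_day = complaints_rdd \
    .map(lambda record: (pp.get_complaint_occurrence_day(record), 1)) \
    .countByKey()
# Sort by day so the line plot runs in chronological order.
sorted_counts = sorted(complaints_per_day.items())
days, counts = zip(*sorted_counts)
pyplot.plot(days, counts)
pyplot.show()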

Name: org.apache.toree.interpreter.broker.BrokerException
Message: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 30.0 failed 1 times, most recent failure: Lost task 26.0 in stage 30.0 (TID 296, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/share/apps/spark/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/share/apps/spark/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/share/apps/spark/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/share/apps/spark/spark-2.1.0-bin-hadoop2.6/python/pyspark/rdd.