### <div align='center'> Constructing the Dataset </div>

----

Our goal is to estimate the likelihood of a crime being reported, or the number of crimes reported, within a predefined distance of a given location over a predetermined time window. For example, we would like to say how many crimes will occur within 0.5 miles of a given location over the next hour. To model these likelihoods, we need a dataset of locations, times, and the number of crimes that occurred within the specified distance and time window. The goal of this notebook is to construct such a dataset from raw crime-incident records.
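To make the target quantity concrete, here is a minimal sketch of how one label could be computed: count the incidents that fall within a radius of a point during a time window. The `(latitude, longitude, timestamp)` record layout, the function names, and the use of the haversine formula are assumptions made for illustration, not the pipeline used later in this project.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_crimes(incidents, lat, lon, start, radius_miles=0.5, window=timedelta(hours=1)):
    """Count incidents within radius_miles of (lat, lon) during [start, start + window)."""
    end = start + window
    return sum(
        1
        for inc_lat, inc_lon, inc_ts in incidents
        if start <= inc_ts < end
        and haversine_miles(lat, lon, inc_lat, inc_lon) <= radius_miles
    )


# Hypothetical incident records: (latitude, longitude, timestamp)
incidents = [
    (34.0522, -118.2437, datetime(2018, 6, 1, 12, 10)),  # same spot, inside the window
    (34.0550, -118.2437, datetime(2018, 6, 1, 12, 30)),  # about 0.2 miles north, inside the window
    (34.0522, -118.2437, datetime(2018, 6, 1, 14, 0)),   # same spot, after the window
    (34.1522, -118.2437, datetime(2018, 6, 1, 12, 15)),  # several miles north, too far
]
n = count_crimes(incidents, 34.0522, -118.2437, datetime(2018, 6, 1, 12, 0))
```

Each row of the dataset would then pair a sampled location and time with a count like `n`.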

#### Approach 
----

The approach is to randomly generate locations and times that fall within a valid domain, and then compute the number of crimes that occur around those points. Since we only have reports of crimes that occurred within the city of Los Angeles, we need a way to generate points only within the LA city limits. Luckily, I was able to find GeoJSON data describing the Los Angeles city limits, and shapely's point-in-polygon test does the rest.
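The idea of sampling only inside an irregular region can be shown in isolation with rejection sampling: draw points uniformly from the bounding box and discard those that fall outside the shape. The sketch below uses a toy triangle and a hand-rolled ray-casting test as a stand-in for shapely's `contains`; the function names and the triangle are illustrative assumptions only.

```python
import random


def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon given by vertices."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_in_polygon(vertices, n_samples, rng):
    """Rejection-sample n_samples points uniformly from inside the polygon."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    samples = []
    while len(samples) < n_samples:
        # Candidate drawn uniformly from the bounding box.
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        # Keep it only if it lands inside the polygon.
        if point_in_polygon(x, y, vertices):
            samples.append((x, y))
    return samples


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]  # toy domain
rng = random.Random(0)
points = sample_in_polygon(triangle, 1000, rng)
```

Because accepted points are uniform over the bounding box and filtered only by membership, they are also uniform over the polygon. The cost is wasted draws: the triangle here covers half of its bounding box, so roughly half of the candidates are rejected.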



In [23]:
import json
import numpy as np
import os
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon


geometries = [os.path.abspath(os.getcwd() + \
                              '/../CrymeClarity/crymeweb/regions/migrations/fixtures/city_of_los_angeles_geometry.json')]


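class GeometricDomain:
    """A set of polygons that together define the valid sampling region."""

    def __init__(self, geoms):
        # Keep the geometries per instance; a class-level dict would be
        # shared (and mutated) across every GeometricDomain object.
        self.included_geometries = {}

        for geom in geoms:
            with open(geom, 'r') as geom_file:
                # Assumed to be the outer ring of the city boundary.
                polygon_coords = json.load(geom_file)['geometries'][0]['coordinates'][0][0]

            # NOTE: GeoJSON positions are ordered [longitude, latitude], so
            # `lats_vect` appears to hold longitudes and `lons_vect` latitudes.
            # The swap is applied consistently in this notebook, so results are
            # unaffected; only the names are misleading.
            lats_vect = np.array([coord[0] for coord in polygon_coords])
            lons_vect = np.array([coord[1] for coord in polygon_coords])
            polygon = Polygon(np.column_stack((lons_vect, lats_vect)))

            self.included_geometries[geom] = {
                'polygon_coords': polygon_coords,
                'lats_vect': lats_vect,
                'lons_vect': lons_vect,
                'polygon': polygon,
            }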

    def get_bounding_box(self):
        # Smallest axis-aligned box that encloses every included geometry.
        geoms = self.included_geometries.values()
        max_lat = max(g['lats_vect'].max() for g in geoms)
        min_lat = min(g['lats_vect'].min() for g in geoms)
        max_long = max(g['lons_vect'].max() for g in geoms)
        min_long = min(g['lons_vect'].min() for g in geoms)

        return min_lat, max_lat, min_long, max_long

    def in_domain(self, y, x):
        # True if the point lies inside any of the included polygons.
        point = Point(x, y)
        return any(g['polygon'].contains(point)
                   for g in self.included_geometries.values())


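Now that the domain issue is solved, let's draw latitudes, longitudes, and timestamps from uniform distributions. Candidate points are drawn from the bounding box of the city and kept only if they fall inside the city limits (rejection sampling). For the timestamp domain, we use a window that starts on 1 January 2018 and spans `td_size` days.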

In [38]:
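import datetime
import random


def generate_location_times(min_lat, max_lat, min_long, max_long, td_size, domain, n_samples):
    """Rejection-sample `n_samples` points inside `domain`, each with a random timestamp."""
    # Note: with the call below, `lat` holds a longitude and `long` a latitude,
    # so each output row is [latitude, longitude, timestamp].
    samples = []
    start = datetime.datetime(year=2018, month=1, day=1)
    while len(samples) < n_samples:
        # Draw a candidate uniformly from the bounding box ...
        lat = random.uniform(min_lat, max_lat)
        long = random.uniform(min_long, max_long)
        # ... and keep it only if it falls inside the domain.
        if domain.in_domain(lat, long):
            # Timestamp drawn uniformly from [start, start + td_size days].
            ts = start + datetime.timedelta(days=random.uniform(0, td_size))
            samples.append([long, lat, ts])

    # Mixed floats and datetimes, so the resulting array has dtype=object.
    return np.array(samples)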


In [41]:
gd = GeometricDomain(geometries)
generate_location_times(-118.6681776, -118.1552948, 33.7036216, 34.337306, 444, gd, 10000)

array([[34.050648853899325, -118.55450142024317,
        datetime.datetime(2018, 2, 19, 23, 50, 58, 247868)],
       [34.265093314007885, -118.3830286792221,
        datetime.datetime(2018, 11, 26, 9, 16, 18, 124628)],
       [34.17513089828658, -118.64807166440373,
        datetime.datetime(2018, 1, 18, 22, 43, 37, 845688)],
       ...,
       [34.12677444909474, -118.21297303420394,
        datetime.datetime(2018, 11, 25, 17, 21, 55, 456266)],
       [34.09006294613667, -118.4065968498316,
        datetime.datetime(2018, 7, 10, 5, 25, 12, 224171)],
       [34.199823852608866, -118.62464589177222,
        datetime.datetime(2018, 12, 7, 6, 10, 28, 785824)]], dtype=object)

Looks good. Let's write this up as a management command for our Django app.