# Combined Data

This notebook was loaded with:

```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./dse/bin/dse pyspark --num-executors 5 --driver-memory 8g --executor-memory 8g
```

At this point, we have several data sets processed and cleansed. We have also identified the fields we can use for joining:

- license_id
- longitude, latitude

Longitude and latitude are great candidates for joining crime, sanitation, weather, and inspections. The problem is that records from different sets will rarely fall on exactly the same coordinate, so we can't join on the raw values.

Suppose we divided the city up into a grid and determined the coordinates for the center of each cell. Then we could determine which sanitation complaints and crimes occurred in each cell, and connect that to the inspections in the same cell.
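
To make the gridding idea concrete, here is a minimal sketch that maps a coordinate straight to a cell id with floor arithmetic. This is not the method used in this notebook (which fits a nearest-neighbor model further down); the function name `grid_cell_id` and the bounding box values are hypothetical and chosen only for illustration.

```python
import numpy as np

def grid_cell_id(lon, lat, min_lon, max_lon, min_lat, max_lat, n):
    """Return the row-major id of the n x n grid cell containing (lon, lat)."""
    # Fractional position of the point inside the bounding box, in [0, 1].
    fx = (lon - min_lon) / (max_lon - min_lon)
    fy = (lat - min_lat) / (max_lat - min_lat)
    # Floor to a column/row index; clip so points on the upper edge land in the last cell.
    col = int(np.clip(np.floor(fx * n), 0, n - 1))
    row = int(np.clip(np.floor(fy * n), 0, n - 1))
    return row * n + col

# Hypothetical bounding box (an assumption, not taken from the data), split into 10 x 10 cells.
bounds = dict(min_lon=-88.0, max_lon=-87.5, min_lat=41.6, max_lat=42.1)
a = grid_cell_id(-87.6261, 41.8736, n=10, **bounds)
b = grid_cell_id(-87.6262, 41.8737, n=10, **bounds)
print(a, b)  # two nearby points share one cell id
```

Two records that are close together but not identical end up with the same cell id, which is what makes the id usable as a join key.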

In [1]:
%pylab inline 

Populating the interactive namespace from numpy and matplotlib


In [2]:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import count, datediff, lag, sum, coalesce, rank, lit, when, col, udf, to_date, year, mean, month, date_format, array
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, DateType
from pyspark.ml.feature import StringIndexer
from datetime import datetime
from pyspark.sql.window import Window
import pyspark
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Creating the City Grid

Let's create a grid by finding the boundaries of our coordinates (using crime data because it's the largest set) and assigning each cell a grid identifier (`city_grid`). Then we'll add that id to all of our sets. We'll follow the usual Cassandra pattern of creating a table for each query we want to run, with the grid identifier as the partition key.

In [29]:
#--------  cartesian
# A function that creates the cartesian product/combination
# Input: 
#      x1    (numpy vector 1)
#      x2    (numpy vector 2)
# Returns: 
#      cartesian combination
def cartesian(x1, x2):
    return np.transpose([np.tile(x1, len(x2)), np.repeat(x2, len(x1))])

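#--------  create_risk_cells
# A function that creates "risk cells" out of the longitude/latitude combination. That means it
# separates the x, y plane into n_cells cells, with int(sqrt(n_cells)) cells per side.
# Input: 
#      longitude  (longitudes used to find the grid bounds)
#      latitude   (latitudes used to find the grid bounds)
#      n_cells    (total number of cells)
#      ward       (ward values, e.g. a pandas Series)
#      district   (police district values, e.g. a pandas Series)
# Returns: 
#      risk cells (pandas DataFrame with one row per cell)
def create_risk_cells(longitude, latitude, n_cells, ward, district):
    n = int(np.sqrt(n_cells))
    x1 = np.zeros(n)
    x2 = np.zeros(n)

    min_long = min(longitude)
    min_lat = min(latitude)
    step_long = (max(longitude) - min(longitude)) / n
    step_lat = (max(latitude) - min(latitude)) / n
    
    # Each grid coordinate is the lower edge of its cell along that axis: min + step * i
    for i in range(0, n):
        x1[i] = min_long + (step_long * i)
        x2[i] = min_lat + (step_lat * i) 
        
    df = pd.DataFrame(cartesian(x1, x2))
    # NOTE: when ward/district are pandas Series, these assignments align on the index, so grid
    # row i gets the value from input row i rather than a value derived from the cell's location.
    df["ward"] = ward
    df["district"] = district
    return df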
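
As a quick illustration of what `cartesian` produces, here is the same helper applied to two tiny vectors. The input values are made up for the example; the function body is copied from the cell above.

```python
import numpy as np

def cartesian(x1, x2):
    # Pair every element of x1 with every element of x2.
    return np.transpose([np.tile(x1, len(x2)), np.repeat(x2, len(x1))])

lons = np.array([0.0, 1.0])    # made-up "longitude" steps
lats = np.array([10.0, 20.0])  # made-up "latitude" steps
pairs = cartesian(lons, lats)
print(pairs)
# [[ 0. 10.]
#  [ 1. 10.]
#  [ 0. 20.]
#  [ 1. 20.]]
```

The first column cycles fastest, which is why consecutive rows of the grid step through longitude while the latitude stays fixed.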

Start by loading the crime data and mapping it to risk cells to create a master table.

In [4]:
df_test = sqlContext.read.format("org.apache.spark.sql.cassandra")\
               .load(keyspace="chicago_data", table="crime")\
               .toPandas()

Some districts and wards are missing (they show up as `'NaN'`). Let's convert those to 0 to indicate that we don't know.

In [24]:
df_test.ward.unique()

array([u'40.0', u'27.0', u'23.0', u'25.0', u'42.0', u'19.0', u'5.0',
       u'34.0', u'24.0', u'9.0', u'43.0', u'1.0', u'11.0', u'21.0',
       u'20.0', u'33.0', u'28.0', u'22.0', u'47.0', u'44.0', u'2.0',
       u'15.0', u'30.0', u'46.0', u'36.0', u'8.0', u'38.0', u'6.0',
       u'26.0', u'49.0', u'3.0', u'45.0', u'4.0', u'16.0', u'17.0',
       u'39.0', u'7.0', u'50.0', u'18.0', u'32.0', u'12.0', u'14.0',
       u'37.0', u'35.0', u'29.0', u'41.0', u'10.0', u'13.0', u'48.0',
       u'31.0', u'NaN'], dtype=object)

In [28]:
df_test.district.unique()

array([u'20.0', u'11.0', u'8.0', u'12.0', u'18.0', u'22.0', u'3.0', u'5.0',
       u'9.0', u'6.0', u'10.0', u'17.0', u'19.0', u'1.0', u'7.0', u'25.0',
       u'16.0', u'24.0', u'2.0', u'4.0', u'14.0', u'15.0', u'31.0', u'NaN',
       u'23.0'], dtype=object)

In [32]:
df_test['ward'] = pd.to_numeric(df_test.ward, errors='coerce').fillna(0).astype(int)
df_test['district'] = pd.to_numeric(df_test.district, errors='coerce').fillna(0).astype(int)
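
For reference, here is a small sketch of what that conversion does to string values like the ones above. The sample series is made up; `'unknown'` is included only to show that `errors='coerce'` also turns unparseable text into `NaN` before `fillna(0)` replaces it.

```python
import pandas as pd

# Made-up sample mimicking the string-typed ward column.
ward = pd.Series(["40.0", "27.0", "NaN", "unknown"])

# Parse to float (bad values become NaN), replace NaN with 0, then cast to int.
clean = pd.to_numeric(ward, errors="coerce").fillna(0).astype(int)
print(clean.tolist())  # [40, 27, 0, 0]
```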

In [33]:
df_grid = create_risk_cells(df_test.longitude, df_test.latitude, 100*100, df_test.ward, df_test.district)
df_grid.head()

| | 0 | 1 | ward | district |
|---|---|---|---|---|
| 0 | -91.686569 | 36.619446 | 40 | 20 |
| 1 | -91.644949 | 36.619446 | 27 | 11 |
| 2 | -91.603328 | 36.619446 | 23 | 8 |
| 3 | -91.561708 | 36.619446 | 25 | 12 |
| 4 | -91.520088 | 36.619446 | 42 | 18 |


In [34]:
df_grid.columns=["center_longitude", "center_latitude", "ward", "police_district"]
df_grid["id"] = df_grid.index
df_grid.head()

| | center_longitude | center_latitude | ward | police_district | id |
|---|---|---|---|---|---|
| 0 | -91.686569 | 36.619446 | 40 | 20 | 0 |
| 1 | -91.644949 | 36.619446 | 27 | 11 | 1 |
| 2 | -91.603328 | 36.619446 | 23 | 8 | 2 |
| 3 | -91.561708 | 36.619446 | 25 | 12 | 3 |
| 4 | -91.520088 | 36.619446 | 42 | 18 | 4 |


In [35]:
df_grid.shape

(10000, 5)

Now, we'll save that master table. It'll only be useful for human readability. We won't use it in analysis from here on out.

```cql
CREATE TABLE chicago_data.city_grid (
    id int,
    center_latitude float,
    center_longitude float,
    ward int,
    police_district int,
    PRIMARY KEY (id));
```

In [36]:
#save the grid cells
sqlContext.createDataFrame(df_grid).write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="city_grid", keyspace="chicago_data")\
    .save()

Now we'll use KNN to find out which grid cell each of our inspections is in. If you're not familiar with it, KNN is an ML algorithm that computes the distance (let's say the Euclidean distance) between the item in question and the training points, then predicts from its $k$ nearest neighbors. For $k=1$, a point $x$ is assigned to the grid point $c_j$ that minimizes that distance:

$$\hat{y} = \arg\min_{j} \sqrt{\sum_{i=1}^{d} (x_i - c_{j,i})^2}$$

Here $d=2$ (longitude and latitude). Because every prediction compares the point against the stored training set, KNN needs the entire set in memory and is computationally intensive. That's ok because we'll do this once during cleanup and save the results to a table to use when we run the various models.
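
As a rough illustration of the nearest-neighbor assignment for $k=1$, here is a brute-force NumPy sketch. It is not the code this notebook uses (that relies on sklearn); the function name `nearest_cell` and the toy coordinates are assumptions made for the example.

```python
import numpy as np

def nearest_cell(points, centers):
    """Return, for each point, the index of the closest center (Euclidean distance)."""
    # Broadcasting gives pairwise differences with shape (n_points, n_centers, 2).
    diff = points[:, None, :] - centers[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # The index of the smallest distance in each row is the assigned cell.
    return dist.argmin(axis=1)

# Toy grid of four points and three locations to assign.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
points = np.array([[0.1, 0.2], [0.9, 0.8], [0.8, 0.1]])
ids = nearest_cell(points, centers)
print(ids.tolist())  # [0, 3, 1]
```

The distance matrix has one entry per (point, center) pair, which is where the memory and compute cost comes from as either set grows.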

In [37]:
df_inspections = sqlContext.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="chicago_data", table="inspections")

In [38]:
# Use KNN to figure out which cell each point is in. MLlib doesn't seem to have KNN for classification.
# We could use https://github.com/saurfang/spark-knn, or we could just use sklearn, since we're already
# working with pandas dataframes.
from sklearn.neighbors import KNeighborsRegressor as KNN

# With n_neighbors=1 the regressor returns the target of the single nearest training point, so it works
# for our grid ids even though they are labels (the predictions come back as floats, e.g. 9894.0).
knn = KNN(n_neighbors=1)

Uh oh, we've got 117 inspections without the coordinates entered. We need to fix that.

In [39]:
df_inspections.filter(col("longitude").isNull()).count()

117

Let's see if the license records have the GPS coordinates entered.

In [40]:
df_license_coords = sqlContext.sql("select license_id, latitude as lat, longitude as long from chicago_data.licenses")

In [41]:
df_license_coords.head()

Row(license_id=u'2387978', lat=41.86563491821289, long=-87.7695541381836)

In [42]:
df_inspections_joined = df_inspections.join(df_license_coords, on="license_id", how="left")

In [43]:
df_inspections_joined.columns

['license_id',
 'inspection_dt',
 'canvass',
 'complaint',
 'cumulative_failures',
 'cumulative_inspections',
 'days_since_last_inspection',
 'ever_failed',
 'fire',
 'inspection_date_string',
 'inspection_type',
 'inspection_type_description',
 'latitude',
 'license_related',
 'liquor',
 'longitude',
 'month',
 'prev_fail',
 'proportion_past_failures',
 'recent_inspection',
 'reinspection',
 'risk',
 'risk_description',
 'special_event',
 'task_force',
 'weekday',
 'weekday_description',
 'y',
 'y_description',
 'y_fail',
 'zip',
 'lat',
 'long']

In [44]:
df_inspections2 = df_inspections_joined.select('license_id', 'inspection_dt', 'canvass', 'complaint', 'cumulative_failures', \
 'cumulative_inspections', 'days_since_last_inspection', 'ever_failed', 'fire', 'inspection_date_string', 'inspection_type', \
 'inspection_type_description', 'license_related', 'liquor', 'month', \
 'prev_fail', 'proportion_past_failures', 'recent_inspection', 'reinspection', 'risk', 'risk_description', \
 'special_event', 'task_force', 'weekday', 'weekday_description', 'y', 'y_description', 'y_fail', 'zip', \
 coalesce(df_inspections_joined["latitude"], df_inspections_joined["lat"]).alias("latitude"),
 coalesce(df_inspections_joined["longitude"], df_inspections_joined["long"]).alias("longitude"),                                             )

In [45]:
df_inspections2.filter(col("longitude").isNull()).count()

117

Hmm... that did not help at all: the count is still 117. Such is the life of a data scientist. Using `coalesce` to fill in the missing coordinates from the license set was worth a shot. Since we can't recover them, we'll drop those rows.
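
To show the fill-in logic on a small scale, here is a pandas sketch of the same idea, using `combine_first` as a stand-in for Spark's `coalesce` (both take the first non-null value). The data is made up. A row stays null when the fallback column is also null or there is no matching license row, which is one possible reason the count above did not move.

```python
import numpy as np
import pandas as pd

# Made-up inspections: licenses "b" and "c" are missing a longitude.
inspections = pd.DataFrame({"license_id": ["a", "b", "c"],
                            "longitude": [-87.6, np.nan, np.nan]})
# Made-up license records: "b" has a coordinate, "c" does not.
licenses = pd.DataFrame({"license_id": ["a", "b", "c"],
                         "long": [-87.6, -87.7, np.nan]})

joined = inspections.merge(licenses, on="license_id", how="left")
# Keep the inspection value when present, otherwise fall back to the license value.
joined["longitude"] = joined["longitude"].combine_first(joined["long"])
still_missing = int(joined["longitude"].isnull().sum())
print(still_missing)  # 1
```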

In [46]:
df_inspections2 = df_inspections2.filter(col("longitude").isNotNull())

We also have a similar issue with `days_since_last_inspection` since the first year in the set has no previous inspections. We'll just set that to zero to avoid nulls.

In [47]:
df_inspections2 = df_inspections2.withColumn("days_since_last_inspection", coalesce(col("days_since_last_inspection"), lit(0)))

Now, we'll compute the `city_grid` for each `license_id`.

In [48]:
y_train = pd.Series(range(0, df_grid.shape[0], 1))
fit_knn = knn.fit(df_grid[["center_longitude", "center_latitude"]].values, y_train.values)
x_test = df_inspections2.toPandas()[["longitude", "latitude", "license_id"]].values

inspections_gridspots = pd.DataFrame(fit_knn.predict(x_test[:,0:2])).values

In [49]:
len(inspections_gridspots)


78122

Concatenate the computed grid ids with the `license_id` and coordinates so that we can join them to our inspections on `license_id`.

In [50]:
np.concatenate((x_test,inspections_gridspots), axis=1)

array([[u'-87.76052657', u'41.93145721', u'2427620', 9894.0],
       [u'-87.56487511', u'41.76465891', u'2093694', 9599.0],
       [u'-87.56487511', u'41.76465891', u'2093694', 9599.0],
       ..., 
       [u'-87.62612449', u'41.87362569', u'2060210', 9798.0],
       [u'-87.62612449', u'41.87362569', u'2060210', 9798.0],
       [u'-87.62612449', u'41.87362569', u'2060210', 9798.0]], dtype=object)

In [58]:
df_inspections3 = pd.DataFrame(np.concatenate((x_test,inspections_gridspots), axis=1))
df_inspections3.columns=["longitude", "latitude", "license_id", "city_grid"]
df_inspections3.head()

| | longitude | latitude | license_id | city_grid |
|---|---|---|---|---|
| 0 | -87.76052657 | 41.93145721 | 2427620 | 9894 |
| 1 | -87.56487511 | 41.76465891 | 2093694 | 9599 |
| 2 | -87.56487511 | 41.76465891 | 2093694 | 9599 |
| 3 | -87.56487511 | 41.76465891 | 2093694 | 9599 |
| 4 | -87.63405907 | 41.85410541 | 1998584 | 9797 |


In [69]:
df_inspections4 = df_inspections3.drop_duplicates()

In [71]:
df_inspections4 = pd.merge(df_inspections4, df_grid, left_on="city_grid", right_on="id", how="left")

To do the join, we need to convert this pandas dataframe to a Spark dataframe. That gives us the added benefit of being able to write the result back to Cassandra.

In [72]:
df_inspections4 = sqlContext.createDataFrame(df_inspections4)

In [79]:
df_inspections5 = df_inspections2.join(df_inspections4.select("license_id", "city_grid", "ward", "police_district"), on="license_id", how="left_outer")

In [80]:
df_inspections5.head()

Row(license_id=u'1042702', inspection_dt=datetime.date(2016, 8, 26), canvass=1, complaint=0, cumulative_failures=2, cumulative_inspections=9, days_since_last_inspection=-8, ever_failed=1, fire=0, inspection_date_string=u'8/26/16', inspection_type=3, inspection_type_description=u'Canvass Re-Inspection', license_related=0, liquor=0, month=8, prev_fail=1, proportion_past_failures=0.2222222222222222, recent_inspection=0, reinspection=1, risk=1, risk_description=u'Risk 1 (High)', special_event=0, task_force=0, weekday=2, weekday_description=u'Fri', y=1, y_description=u'Pass', y_fail=0, zip=u'60614', latitude=u'41.93041201', longitude=u'-87.64388653', city_grid=9897.0, ward=4, police_district=2)

In [81]:
df_inspections5.count()

78274
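
The joined count (78274) is higher than the 78122 rows we predicted grid cells for. A left join only adds rows when the right-hand side has more than one row for a key, so one possible explanation is that some `license_id` values map to more than one distinct coordinate or grid cell. The pandas sketch below, on made-up data, shows the effect and a simple way to look for such keys.

```python
import pandas as pd

# Made-up inspections: two for license "a", one for license "b".
left = pd.DataFrame({"license_id": ["a", "a", "b"], "inspection": [1, 2, 3]})
# Made-up lookup where license "a" appears with two different grid cells.
right = pd.DataFrame({"license_id": ["a", "a", "b"], "city_grid": [10, 11, 20]})

# Rows whose key occurs more than once in the lookup.
dupes = right[right.duplicated("license_id", keep=False)]

joined = left.merge(right, on="license_id", how="left")
print(len(left), len(joined))  # 3 5
```

If that turns out to be the cause, the lookup would need exactly one row per key before joining; which row to keep is a modeling decision that this sketch does not make.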

In [82]:
df_inspections5.cache()

DataFrame[license_id: string, inspection_dt: date, canvass: int, complaint: int, cumulative_failures: int, cumulative_inspections: int, days_since_last_inspection: int, ever_failed: int, fire: int, inspection_date_string: string, inspection_type: int, inspection_type_description: string, license_related: int, liquor: int, month: int, prev_fail: int, proportion_past_failures: double, recent_inspection: int, reinspection: int, risk: int, risk_description: string, special_event: int, task_force: int, weekday: int, weekday_description: string, y: int, y_description: string, y_fail: int, zip: string, latitude: string, longitude: string, city_grid: double, ward: bigint, police_district: bigint]

In [83]:
df_inspections5.dtypes

[('license_id', 'string'),
 ('inspection_dt', 'date'),
 ('canvass', 'int'),
 ('complaint', 'int'),
 ('cumulative_failures', 'int'),
 ('cumulative_inspections', 'int'),
 ('days_since_last_inspection', 'int'),
 ('ever_failed', 'int'),
 ('fire', 'int'),
 ('inspection_date_string', 'string'),
 ('inspection_type', 'int'),
 ('inspection_type_description', 'string'),
 ('license_related', 'int'),
 ('liquor', 'int'),
 ('month', 'int'),
 ('prev_fail', 'int'),
 ('proportion_past_failures', 'double'),
 ('recent_inspection', 'int'),
 ('reinspection', 'int'),
 ('risk', 'int'),
 ('risk_description', 'string'),
 ('special_event', 'int'),
 ('task_force', 'int'),
 ('weekday', 'int'),
 ('weekday_description', 'string'),
 ('y', 'int'),
 ('y_description', 'string'),
 ('y_fail', 'int'),
 ('zip', 'string'),
 ('latitude', 'string'),
 ('longitude', 'string'),
 ('city_grid', 'double'),
 ('ward', 'bigint'),
 ('police_district', 'bigint')]

This is our new table.

```cql
CREATE TABLE chicago_data.inspections_by_city_grid (
    city_grid int,
    license_id text,
    risk_description text,
    zip text,
    inspection_date_string text,
    inspection_type_description text,
    y_description text,
    latitude text,
    longitude text,
    y int,
    y_fail int,
    reinspection int,
    recent_inspection int,
    task_force int,
    special_event int,
    canvass int,
    fire int,
    liquor int,
    complaint int,
    license_related int,
    inspection_type int,
    risk int,
    inspection_dt date,
    prev_fail int,
    cumulative_failures int,
    weekday_description text,
    month int,
    weekday int,
    ever_failed int,
    cumulative_inspections int,
    proportion_past_failures double,
    days_since_last_inspection int,
    ward int,
    police_district int,
    PRIMARY KEY (city_grid, license_id, inspection_dt));
```

In [85]:
df_inspections5.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="inspections_by_city_grid", keyspace="chicago_data")\
    .save()

The advantage is that we can now push some aggregation down to Cassandra.

## Crime

We'll do the same thing for crime.

In [30]:
df_crime = sqlContext.sql("select * from chicago_data.crime")

In [17]:
df_crime.filter(col("longitude").isNull()).count()

0

In [29]:
df_crime.columns

['longitude', 'latitude', 'id']

We don't need to refit the model, since the grid hasn't changed. We'll use the same model to "predict" the grid cell for each crime.

In [19]:
#x_test = df_crime.toPandas()[["longitude", "latitude", "id"]].values
x_test = df_crime.toPandas().values
crime_gridspots = pd.DataFrame(fit_knn.predict(x_test[:,0:2])).values

In [21]:
df_crime.count()

1822721

In [20]:
crime_gridspots.shape

(1822721, 1)

In [22]:
df_crime2 = pd.DataFrame(np.concatenate((x_test,crime_gridspots), axis=1))
df_crime2.columns=["longitude", "latitude", "id", "city_grid"]

In [23]:
df_crime2 = df_crime2.drop_duplicates()

In [24]:
df_crime2 = sqlContext.createDataFrame(df_crime2)

In [31]:
df_crime_final = df_crime.join(df_crime2.select("id", "city_grid"), on="id", how="left_outer")

After joining, we'll add this data to our new table.

```cql
CREATE TABLE chicago_data.crime_by_city_grid (
    city_grid int,
    id text,
    case_number text,
    date text,
    block text,
    iucr text,
    primary_type text,
    arrest boolean,
    beat text,
    district text,
    ward text,
    community_area text,
    fbi_code text,
    year text,
    latitude float,
    longitude float,
    PRIMARY KEY (city_grid, id));
```

In [32]:
df_crime_final.count()

1822721

In [89]:
df_crime_final.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="crime_by_city_grid", keyspace="chicago_data")\
    .save()

We're also going to store the data by type of crime (so that we can use that in aggregation later).

```cql
CREATE TABLE chicago_data.crime_by_type (
    primary_type text,
    city_grid int,
    id int,
    PRIMARY KEY (primary_type, city_grid, id));
```

In [33]:
df_crime_final.select("id", "city_grid", "primary_type").write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="crime_by_type", keyspace="chicago_data")\
    .save()

## Sanitation

We'll repeat the same process for the sanitation data.

In [90]:
df_sanitation = sqlContext.read.format("org.apache.spark.sql.cassandra").\
               load(keyspace="chicago_data", table="sanitation")

In [91]:
df_sanitation.filter(col("longitude").isNull()).count()

0

In [92]:
df_sanitation.columns

['service_request_number',
 'community_area',
 'completion_date',
 'creation_date',
 'latitude',
 'longitude',
 'police_district',
 'status',
 'street_address',
 'type_of_service_request',
 'ward',
 'what_is_the_nature_of_this_code_violation?',
 'zip_code']

In [95]:
x_test = df_sanitation.toPandas()[["longitude", "latitude", "service_request_number"]].values
sanitation_gridspots = pd.DataFrame(fit_knn.predict(x_test[:,0:2])).values

In [96]:
df_sanitation.count()

112086

In [97]:
df_sanitation2 = pd.DataFrame(np.concatenate((x_test,sanitation_gridspots), axis=1))
df_sanitation2.columns=["longitude", "latitude", "service_request_number", "city_grid"]

In [99]:
df_sanitation2 = sqlContext.createDataFrame(df_sanitation2)

In [101]:
df_sanitation_final = df_sanitation.join(df_sanitation2.select("service_request_number", "city_grid"), on="service_request_number", how="left_outer")

In [104]:
df_sanitation_final.columns

['service_request_number',
 'community_area',
 'completion_date',
 'creation_date',
 'latitude',
 'longitude',
 'police_district',
 'status',
 'street_address',
 'type_of_service_request',
 'ward',
 'what_is_the_nature_of_this_code_violation?',
 'zip_code',
 'city_grid']

```cql
CREATE TABLE chicago_data.sanitation_by_city_grid (
    city_grid int,
    creation_date text,
    status text,
    completion_date text,
    service_request_number text,
    type_of_service_request text,
    "what_is_the_nature_of_this_code_violation?" text,
    street_address text,
    zip_code text,
    ward text,
    police_district double,
    community_area double,
    latitude double,
    longitude double,
    PRIMARY KEY (city_grid, service_request_number));
```

In [105]:
df_sanitation_final.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="sanitation_by_city_grid", keyspace="chicago_data")\
    .save()