# Lesson 8: Cross-validation

So far, we've learned about splitting our data into training and testing sets to validate our models. This helps ensure that a model we fit on one sample also performs well on another sample whose outcomes we want to predict.

However, we don't have to use just TWO samples to train and test our models. Instead, we can split our data into MULTIPLE samples and train and test on several different segments of the data. This is called CROSS-VALIDATION.
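As a rough preview of the idea, here is a minimal sketch of 5-fold cross-validation with scikit-learn. The data are synthetic stand-ins (the arrays `X` and `y` are made up for illustration and are not the rodent data), and the choice of five folds and accuracy as the score is an assumption, not something this lesson prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, NOT the rodent data used later)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Split the rows into 5 segments; each segment takes one turn as the test set
# while the model is trained on the other four
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across the folds
```

Each row is used for testing exactly once, so the average score draws on all of the data rather than a single held-out sample.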

Let's begin by importing our packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import geopandas as gpd
from shapely.geometry import Point, Polygon

from sklearn.linear_model import LogisticRegression

In [2]:
import os

os.chdir('C:\\Users\\peter.casey\\Documents\\dspp')

Today we'll be looking at 311 service requests for rodent inspection and abatement, aggregated at the Census block level. The data sets are already prepared for you and available in the same folder as this assignment. Census blocks are a useful geographic level at which to analyze rodent infestations, because they are drawn along natural and human-made boundaries, like rivers and roads, that rats tend not to cross.

In [3]:
blocks = gpd.read_file('C:\\Users\\peter.casey\\Downloads\\Census_Blocks__2010\\Census_Blocks__2010.shp')
blocks = blocks[['GEOID', 'P0010001', 'SqMiles', 'geometry']]
blocks['tot_pop'] = blocks['P0010001']
blocks['pop_density'] = blocks['tot_pop']*1.0/blocks['SqMiles']
blocks = blocks.drop(['P0010001', 'SqMiles'], axis=1)
blocks.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| GEOID | 110010003003006 | 110010003003007 | 110010003003008 | 110010003003009 | 110010003003010 |
| geometry | POLYGON ((-77.07863030944546 38.9161507873491,... | POLYGON ((-77.07698430927661 38.91551178808869... | POLYGON ((-77.07674330851466 38.91301678672006... | POLYGON ((-77.07412130781374 38.91268078711014... | POLYGON ((-77.0750153069756 38.9144277868987, ... |
| tot_pop | 251 | 24 | 22 | 0 | 108 |
| pop_density | 6441.11 | 4827.2 | 12746 | 0 | 22601.7 |


In [14]:
requests = pd.read_csv('C:\\Users\\peter.casey\\Downloads\\City_Service_Requests_in_2016.csv')
rats = requests[requests['SERVICECODEDESCRIPTION']=='Rodent Inspection and Treatment']

In [15]:
bbls = pd.read_csv('C:\\Users\\peter.casey\\Downloads\\Basic_Business_License_in_2016.csv')
bbls = bbls[bbls.LICENSESTATUS=='ACTIVE']


In [19]:
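# Build a point geometry from each request's longitude and latitude
geometry = [Point(xy) for xy in zip(rats.LONGITUDE.apply(float), rats.LATITUDE.apply(float))]
crs = 'EPSG:4326'  # WGS 84; replaces the deprecated {'init': 'epsg:4326'} form
points = gpd.GeoDataFrame(rats, crs=crs, geometry=geometry)

# Attach each request to the block it falls in
# (recent geopandas releases use predicate=; older ones used op=)
geo_rats = gpd.sjoin(blocks, points, how='left', predicate='intersects')
cols = ['GEOID', 'tot_pop', 'pop_density', 'SERVICEORDERDATE', 'RESOLUTIONDATE', 'SERVICEORDERSTATUS', 'WARD']
# Save only the selected columns, matching the business license export
geo_rats[cols].to_csv('rats_to_blocks.csv')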

In [21]:
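# Build a point geometry from each license's longitude and latitude
geometry = [Point(xy) for xy in zip(bbls.LONGITUDE.apply(float), bbls.LATITUDE.apply(float))]
crs = 'EPSG:4326'  # WGS 84; replaces the deprecated {'init': 'epsg:4326'} form
points = gpd.GeoDataFrame(bbls, crs=crs, geometry=geometry)

# Attach each license to the block it falls in
# (recent geopandas releases use predicate=; older ones used op=)
geo_bbls = gpd.sjoin(blocks, points, how='left', predicate='intersects')
cols = ['GEOID', 'LICENSESTATUS', 'LICENSECATEGORY', 'LICENSE_START_DATE', 'LICENSE_EXPIRATION_DATE', 'LICENSE_ISSUE_DATE', 'WARD']
geo_bbls[cols].to_csv('bbls_to_blocks.csv')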

In [22]:
geo_mar = pd.read_csv('C:\\Users\\peter.casey\\Documents\\RodentAbatement\\address_units_to_blocks.csv')
geo_mar.to_csv('address_units_to_blocks.csv')

geo_cama = pd.read_csv('C:\\Users\\peter.casey\\Documents\\RodentAbatement\\cama_to_blocks.csv')
geo_cama.to_csv('cama_to_blocks.csv')

In [23]:
rats = pd.read_csv('rats_to_blocks.csv')
bbls = pd.read_csv('bbls_to_blocks.csv')
mar = pd.read_csv('address_units_to_blocks.csv')
cama = pd.read_csv('cama_to_blocks.csv')

Recall from last week that when we do predictive analysis, we are usually not interested in the relationship between two variables, as we are in traditional hypothesis testing. Instead, we're interested in training a model whose predictions best fit our target population.

When we do cross-validation, the most important decision we make is how we split the data. A key concern is that the subsamples of our data are INDEPENDENT of each other. That is, just like when we split our data into training and testing sets, we want to make sure we're not predicting outcomes for an observation with a model trained on data about that same observation. This gets complicated when the same unit appears in the data more than once, such as one that is observed repeatedly over time. We'll discuss this further as we go along.
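One way to keep repeated observations of the same unit together is a grouped split. The sketch below uses scikit-learn's `GroupKFold` on a made-up panel; the `block_ids` array, the six blocks, and the four periods per block are all hypothetical, and this is not necessarily the splitting approach the lesson goes on to use.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical panel: 6 blocks, each observed in 4 time periods (24 rows)
block_ids = np.repeat(np.arange(6), 4)
X = np.arange(len(block_ids)).reshape(-1, 1)  # placeholder feature
y = np.tile([0, 1], 12)                       # placeholder outcome

# GroupKFold puts all rows that share a group label in the same fold,
# so no block is ever in both the training and the testing data
gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=block_ids))

for train_idx, test_idx in folds:
    train_blocks = set(block_ids[train_idx])
    test_blocks = set(block_ids[test_idx])
    # The second value printed is True when the two sets share no block
    print(sorted(test_blocks), train_blocks.isdisjoint(test_blocks))
```

A plain row-level split could place one period of a block in training and another period of the same block in testing, which is exactly the leakage described above.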