# Lesson 8: Cross-validation

So far, we've learned about splitting our data into training and testing sets to validate our models. This helps ensure that the model we create on one sample performs well on another sample we want to predict. 

However, we don't have to use just TWO samples to train and test our models. Instead, we can split our data up into MULTIPLE samples to try train and test on multiple segments of the data. This is called CROSS-VALIDATION.

Let's begin by importing our packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import geopandas as gpd
from shapely.geometry import Point, Polygon

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

In [2]:
import os

os.chdir('C:\\Users\\peter.casey\\Documents\\dspp')

Today we'll be looking at 311 service requests for rodent inspection and abatement aggregated at the Census block level. The data sets are already prepared for you and available in the same folder as this assignment. Census blocks are a geographic level to analyze rodent infestations, because they are drawn along natural and human-made boundaries, like rivers and roads, that rats tend not to cross. 

In [17]:
data = pd.read_csv('rat_data_2016.csv')

In [18]:
cols = ['tot_pop', 'pop_density', 'month', 'activity', 'bbl_food', 'bbl_hotel',
       'bbl_multifamily_rental', 'bbl_restaurant', 'bbl_single_family_rental',
       'bbl_storage', 'bbl_two_family_rental', 'WARD']
data = data[cols]

In [19]:
data.isnull().sum()

tot_pop                     0
pop_density                 0
month                       0
activity                    0
bbl_food                    0
bbl_hotel                   0
bbl_multifamily_rental      0
bbl_restaurant              0
bbl_single_family_rental    0
bbl_storage                 0
bbl_two_family_rental       0
WARD                        0
dtype: int64

Recall from last week that, when we do predictive analysis, we usually are not interested in the relationship between two different variables as we are when we do traditional hypothesis testing. Instead, we're interested in training a model that generates predictions that best fit our target population. Therefore, when we are doing any kind of validation, including cross-validation, it is important for us to choose the metric by which we will evaluate the performance of our models. 

For this model, we will predict the locations of requests for rodent inspection and abatement in the District of Columbia. When we select a validation metric, it's important for us to think about what we want to optimize. For example, do we want to make sure that our top predictions accurately identify places with rodent infestations, so we don't send our inspectors on a wild goose chase? Then we may to look at the models precision, or what proportion of its positive predictions turn out to be positive. Or do we want to make sure we don't miss any infestations? If so, we may want to look at recall, or the proportion of positive cases that are correctly categorized by the model. If we care a lot about how the model ranks our observations, then we may want to look at the area under the ROC curve, or ROC-AUC, while if we care more about how well the model fits the data, or its "calibration," we may want to look at Brier score or logarithmic loss (log-loss).

In the case of rodent inspections, we most likely want to make sure that we send our inspectors to places where they are most likely to find rats, so we will optimize precision. 


The next important decision we need to make is how we split the data. Oftentimes, people cross-validate using random subsamples of their training data. 

In [20]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import precision_score

In [21]:
rs = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
for train_index, test_index in rs.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [5483  379 4900 ..., 1653 2607 2732] TEST: [4111  578 1363 4724 4046 1122 4442 2450 2124 5202 1294 4494 2384 3303 3042
 4385  495 2213 6511 5508 4063 3145  593 3487 2875 4829 4623  318 2229 4627
 3111  751 5511 3443 1778 6212 1712 1766 4376 1161  572 4565 3523 2179 4906
  502 4358 5248 3649 6580 1870  202 1290 4474  451 2200 2217 2443  170 4416
 6088 3744 5166 5029 2183 5141 2420 3099 2976 2131 5999 2542 3658 4390 2946
 1644 1533  683 4700 5132 4890  296 6586 5417 2623 1989 2137 5423 5809 4476
   49 3241 6502 4523 1484 5291 2455 2709 5899 3452 2990 5631  733 4197 3029
 3830 4665 4539 1607 2905 5123 2892 3930   48 1433 6355 1095 3736 2194 3220
 6003  991 3167 4346 5158 1314 1921 3731 5313 3746 4411 2793 6601 1855 2214
 2602 6343 5153 4504  311 4033 4162  875 1854  980 1399   33 2448 4663 4739
  444 1086 3784  543 3594 4911 4338 1066 4340 3789 2484  119 4286 5020 4362
 6081 3294  818 1705 6397 2615 5078 5749 2078 3343  812 4146 4473  154 4113
 3508 5354 2036 2948 2014 1022 5849 12

In [22]:
rs = ShuffleSplit(n_splits=10, random_state=0, test_size=0.1, train_size=None)
for train_index, test_index in rs.split(data):
    X_train = data.loc[train_index].drop(['activity', 'month'], axis=1)
    y_train = data.loc[train_index]['activity']
    X_test = data.loc[test_index].drop(['activity', 'month'], axis=1)
    y_test = data.loc[test_index]['activity']
    
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    scores = lr.predict(X_test)
    print('Precision: '+str(100 * round(precision_score(y_test, scores),3)))

Precision: 60.0
Precision: 57.7
Precision: 58.6
Precision: 58.6
Precision: 59.1
Precision: 56.8
Precision: 57.0
Precision: 60.7
Precision: 63.3
Precision: 65.0


A key concern is that the subsamples of our data are INDEPENDENT of each other. That is, just like when we split our data into training and testing sets, we want to make sure we're not predicting outcomes for observations with a model trained on data about that observation. This can be complicated with data about an observation that appears more than once, such as one that appears repeatedly over time. We'll discuss this further as we go along. 

## Cross-validate by Ward

In [28]:
data.WARD.value_counts().sort_index()

1     940
2    1159
3     283
4    1212
5     967
6    1197
7     431
8     413
Name: WARD, dtype: int64

In [29]:
for ward in np.sort(data.WARD.unique()):

    test = data[data.WARD == ward]
    train = data[data.WARD != ward]
    X_test = test.drop(['activity', 'month'], axis=1)
    y_test = test['activity']
    X_train = test.drop(['activity', 'month'], axis=1)
    y_train = test['activity']
    
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    scores = lr.predict(X_test)
    print('Precision for Ward '+str(ward)+': '+str(100*round(precision_score(y_test, scores),3)))

Precision for Ward 1: 58.1
Precision for Ward 2: 60.2
Precision for Ward 3: 57.4
Precision for Ward 4: 61.0
Precision for Ward 5: 63.1
Precision for Ward 6: 50.0
Precision for Ward 7: 0.0
Precision for Ward 8: 0.0


  'precision', 'predicted', average, warn_for)


## Cross-validate by Month

In [30]:
data.month.value_counts().sort_index()

1     319
2     402
3     504
4     572
5     592
6     615
7     645
8     707
9     689
10    601
11    489
12    467
Name: month, dtype: int64

In [31]:
months = np.sort(data.month.unique())

for month in months:

    test = data[data.month==month]
    train = data[data.month < month]
    X_test = test.drop(['activity', 'month'], axis=1)
    y_test = test['activity']
    X_train = test.drop(['activity', 'month'], axis=1)
    y_train = test['activity']
    
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    scores = lr.predict(X_test)
    print('Precision for Month '+str(month)+': '+str(100*round(precision_score(y_test, scores),3)))

Precision for Month 1: 59.8
Precision for Month 2: 57.6
Precision for Month 3: 58.1
Precision for Month 4: 63.9
Precision for Month 5: 67.1
Precision for Month 6: 65.8
Precision for Month 7: 62.6
Precision for Month 8: 60.9
Precision for Month 9: 58.4
Precision for Month 10: 59.9
Precision for Month 11: 55.5
Precision for Month 12: 58.3
