
# **Geotab Bootcamp: Intelligent Zoning**
**Hands-on Tutorial**

In this session, you'll learn how to use machine learning to automatically create geofenced zones from a vehicle's driving patterns and to predict the type of each location.  We'll use data from 12 trucks in our demo database, provided by Second Harvest.  Second Harvest has a fleet of primarily Kenworths and Freightliners that pick up food from donor locations (such as grocery stores and food distribution centres) and deliver it to entities in need with centres at schools, churches, community centres and more.  In this notebook, we'll use two forms of unsupervised machine learning (DBSCAN and HDBSCAN) to automatically cluster the stop locations and create zones based on their stop frequency.  We'll then use supervised learning to predict the type of zone (i.e. an agency or a donor).

## The key learning objectives for this tutorial are:

*   Learn how to use unsupervised learning to geographically cluster coordinates and create boundaries that represent a zone
*   Learn how to engineer features and create a supervised learning model to predict the type of stop location






In [None]:
#@title Package Installs
%%capture
!pip install hdbscan

In [None]:
#@title Imports
import numpy as np
import pandas as pd
from google.cloud import bigquery
from google.cloud import storage
from sklearn.cluster import DBSCAN
import folium
from folium import plugins
from folium import PolyLine
import itertools
import hdbscan
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from scipy.spatial import ConvexHull

In [None]:
#@title Client credentials and BQ client config
try:
  from google.colab import auth
  auth.authenticate_user()
except ImportError:
  pass
client = bigquery.Client(project='geotab-bootcamp')

## Retrieve Data from BigQuery
This will pull in all trip data for 2018 from the demo dataset.

In [None]:
#@title Query the data
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = False
allData = '''SELECT 
                t.*, 
                v.weightclass,
                StopDuration / 60 as StopDurationMinutes
             FROM 
                `geotab-bootcamp.DemoData.IntelligentZoningTrips` t 
             INNER JOIN 
                `geotab-bootcamp.DemoData.Vin` v on v.hardwareid=t.hardwareid
             WHERE 
                extract(year from t.starttime) = 2018 
             ORDER BY 
                t.UniqueId'''
allData = client.query(allData, job_config=job_config).to_dataframe()

## Data Examination
Let's filter the data to keep only stops longer than 3 minutes and examine the structure of the dataframe.

### Questions:


1.  How many records are we initially working with (from stops longer than 3 minutes)?
2.   What is the average maximum speed of trips (in km/h)?
3.   What is the maximum driving duration of trips (in minutes)?

In [None]:
# Let's only look at all stops over 3 minutes and examine the data structure
dfNormalStops = allData[allData["StopDurationMinutes"] > 3]
dfNormalStops.describe()

Unnamed: 0,HardwareId,StopLatitude,StopLongitude,HourStart,HourStop,DrivingDuration,StopDuration,IdlingDuration,MaxSpeed,Distance,Year,Month,Day,PreviousStopDuration,StopDurationMinutes
count,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22586.0,22583.0,22586.0
mean,550926100.0,43.708392,-79.470771,10.398565,10.628531,843.330567,11185.5,115.546179,49.030506,7.025791,2018.0,6.686487,15.598911,10966.68,186.424917
std,409885.1,0.18491,0.774639,2.230429,2.257964,955.001002,47104.54,301.657077,32.977205,14.556884,0.0,3.249843,8.652793,48476.1,785.075684
min,550602000.0,36.062668,-115.314331,0.0,0.0,0.0,180.063,0.0,0.0,0.0,2018.0,1.0,1.0,0.06,3.00105
25%,550606600.0,43.658264,-79.49453,9.0,9.0,162.937,463.063,6.0,22.0,0.287409,2018.0,4.0,8.0,408.127,7.717717
50%,550610500.0,43.726284,-79.468544,10.0,11.0,532.873,772.063,19.0,48.0,2.233668,2018.0,7.0,15.0,727.063,12.867717
75%,551021800.0,43.76123,-79.37779,12.0,12.0,1213.98425,1479.063,88.0,70.0,8.447706,2018.0,10.0,23.0,1409.063,24.65105
max,551928000.0,44.61061,-76.48407,23.0,23.0,15536.873,1981215.0,8595.0,149.0,366.9011,2018.0,12.0,31.0,1981215.0,33020.25


In [None]:
print('1. Number of Records:                 {:,}'.format(len(dfNormalStops)))
print('2. Average Maximum Speed of Trips:    {:,.0f} km/h'.format(dfNormalStops['MaxSpeed'].mean()))
print('3. Maximum Driving Duration of Trips: {:,.0f} minutes'.format(dfNormalStops['DrivingDuration'].max()/60))

1. Number of Records:                 22,586
2. Average Maximum Speed of Trips:    49 km/h
3. Maximum Driving Duration of Trips: 259 minutes


# 1.0 : Hands-On Exercise #1: Unsupervised Learning via Zone Identification

## 1.1 : Exploring HDBSCAN
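We'll now implement the HDBSCAN algorithm, requiring at least 5 samples in an area to form a cluster (`min_cluster_size=5`).  We'll also select haversine as our distance metric, which measures great-circle distance on a sphere and is therefore appropriate for latitude/longitude coordinates.  The haversine metric expects coordinates in radians, so we convert them with `np.radians` before fitting.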
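
To make the haversine metric concrete, here is a minimal standard-library sketch of the great-circle distance it computes. The function names `haversine_rad` and `haversine_km` and the two sample points are illustrative assumptions; the Earth radius matches the `kms_per_radian` constant used later in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; same value as kms_per_radian


def haversine_rad(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


def haversine_km(lat1, lon1, lat2, lon2):
    """Same distance expressed in kilometres."""
    return haversine_rad(lat1, lon1, lat2, lon2) * EARTH_RADIUS_KM


# Two illustrative stops 0.001 degrees of latitude apart near Toronto
d_km = haversine_km(43.7580, -79.4663, 43.7590, -79.4663)
print(round(d_km * 1000, 1), "metres")
```

A thousandth of a degree of latitude works out to roughly 111 metres, which gives a feel for the scale of the distances the clustering works with.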

When running HDBSCAN, we only want to keep points that have a strong probability of belonging to their cluster.  Let's consider only points that have more than an 80% probability of being in a cluster.
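
The probability filter can be sketched without any clustering library. The labels, probabilities, and point names below are made-up stand-ins for `db.labels_`, `db.probabilities_`, and the coordinates; a label of -1 is how HDBSCAN marks noise.

```python
# Hypothetical stand-ins for db.labels_ and db.probabilities_
labels = [0, 0, 1, -1, 1, 0]
probs = [0.95, 0.60, 1.00, 0.00, 0.85, 0.81]
points = ["a", "b", "c", "d", "e", "f"]

threshold = 0.8
kept = {}
for point, label, prob in zip(points, labels, probs):
    # Skip noise (-1) and members that are only weakly attached to their cluster
    if label != -1 and prob > threshold:
        kept.setdefault(label, []).append(point)

print(kept)
```

Point `b` is dropped even though it was assigned to cluster 0, because its membership probability falls below the threshold.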

### Questions:

1.   How many clusters were formed?
2.   Examine the structure of the clusters.
3.   What are the elements of cluster #3?



In [None]:
#@title Clustering Step 1: HDBSCAN
min_sample = 5

coords = dfNormalStops[['StopLatitude', 'StopLongitude']].values
db = hdbscan.HDBSCAN(algorithm='prims_balltree', metric='haversine', min_cluster_size=min_sample).fit(np.radians(coords))

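cluster_labels = db.labels_
probs = db.probabilities_
# Cluster labels run from 0 to max; -1 marks noise and is not a cluster
num_clusters = cluster_labels.max() + 1
# Keep only points whose membership probability is above 0.8
hdclusters = pd.Series([coords[(cluster_labels == n) & (probs > 0.8)] for n in range(num_clusters)])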

In [None]:
# Cluster Structure
hdclusters.head()

0    [[36.1560822, -115.314331], [36.16082, -115.28...
1    [[42.31862, -83.00375], [42.31862, -83.00378],...
2    [[44.10893, -77.66032], [44.1022453, -77.64027...
3    [[42.062542, -82.64591], [42.0625153, -82.6457...
4    [[43.12786, -80.33454], [43.1286, -80.3345947]...
dtype: object

## Plotting HDBSCAN Results
Let's plot the results from HDBSCAN.  Instead of displaying the map inline, we'll export it to an HTML file and download it (as the results will not plot in Colab).

### Questions:


1.   What does it do well?
2.   What doesn't it do well?



In [None]:
#@title Explore HDBSCAN.HTML
colors = itertools.cycle([ "#E88C07", "#FF005B", "#150CE8", "#00FFC2","#040403"])
map = folium.Map([43.6944122, -79.4920959], zoom_start=10,tiles='OpenStreetMap') 

for i in range(len(hdclusters)):
  color = next(colors)
  for j in range(len(hdclusters[i])):
    folium.CircleMarker(location=(hdclusters[i][j][0], hdclusters[i][j][1]), radius=10, color=color, fill_color=color).add_to(map)
 
map.save('hdbscan.html')
from google.colab import files
files.download('hdbscan.html') 


## 1.2 : Exploring DBSCAN
Let's have a look at what DBSCAN would have done and how that differs from HDBSCAN.
Experiment with the parameters: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

In [None]:
#@title What would that look like with DBSCAN?
kms_per_radian = 6371.0088
epsilon = 0.05 / kms_per_radian  # 50 metres
min_sample = 5

coords = dfNormalStops[['StopLatitude', 'StopLongitude']].values
db = DBSCAN(eps=epsilon, min_samples=min_sample, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

cluster_labels = db.labels_
# Cluster labels run from 0 to max; -1 marks noise and is not a cluster
num_clusters = cluster_labels.max() + 1
dbclusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])

In [None]:
#@title Explore DBSCAN.HTML
colors = itertools.cycle([ "#E88C07", "#FF005B", "#150CE8", "#00FFC2","#040403"])
map = folium.Map([43.6944122, -79.4920959], zoom_start=10,tiles='OpenStreetMap') 

for i in range(len(dbclusters)):
  color = next(colors)
  for j in range(len(dbclusters[i])):
    folium.CircleMarker(location=(dbclusters[i][j][0], dbclusters[i][j][1]), radius=10, color=color, fill_color=color).add_to(map)
 
map.save('dbscan.html')
from google.colab import files
files.download('dbscan.html') 


## 1.3 : Exploring the Combined Model
Now that we've used HDBSCAN to pull together a set of clusters, we have the problem that clusters that are nearby and should be grouped together are not.  We'll apply DBSCAN to the results to pull in all points that are within 50 metres of each other.  Unlike HDBSCAN, DBSCAN requires us to supply an epsilon (`eps`): the maximum distance within which points are clustered together.  Because the haversine metric works in radians, we must first convert both the coordinates and epsilon to radians.

Here, we are essentially running a clustering algorithm on the results of the previous clustering algorithm.
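
The unit conversion behind epsilon is worth seeing on its own. This is a small sketch using only the standard library; the helper name `km_to_radians` is an assumption, while the radius constant is the one used in the cells of this section.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km


def km_to_radians(km):
    """Convert a surface distance in km to the angle it subtends at the Earth's centre."""
    return km / KMS_PER_RADIAN


epsilon = km_to_radians(0.05)  # 50 metres expressed as an angle
print("epsilon in radians:", epsilon)
print("equivalent degrees of latitude:", math.degrees(epsilon))
```

Dividing by the Earth's radius turns a ground distance into an angle, which is the unit the haversine metric reports, so `eps` and the computed distances are directly comparable.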

In [None]:
#@title Clustering Step 2: DBSCAN
kms_per_radian = 6371.0088
epsilon = 0.05 / kms_per_radian  # 50 metres
min_sample = 5

#Flattens out the HDBSCAN clusters into a 1D list
flat_list = [item for sublist in hdclusters for item in sublist]
coords_dbscan = pd.DataFrame(flat_list).values
db = DBSCAN(eps=epsilon, min_samples=min_sample, algorithm='ball_tree', metric='haversine').fit(np.radians(coords_dbscan))

cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.Series([coords_dbscan[cluster_labels == n] for n in range(num_clusters)])

In [None]:
#@title Explore COMBINED.HTML
colors = itertools.cycle([ "#E88C07", "#FF005B", "#150CE8", "#00FFC2","#040403"])
map = folium.Map([43.6944122, -79.4920959], zoom_start=10,tiles='OpenStreetMap') 

for i in range(len(clusters)):
  color = next(colors)
  for j in range(len(clusters[i])):
    folium.CircleMarker(location=(clusters[i][j][0], clusters[i][j][1]), radius=10, color=color, fill_color=color).add_to(map)
 
map.save('combined.html')
from google.colab import files
files.download('combined.html') 


## 1.4: Mapping the Results in Polygons
Now that we've run clustering with HDBSCAN and DBSCAN, we have a mapping of points to clusters and can draw a polygon around each cluster using SciPy's `ConvexHull`.
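
For intuition about what a convex hull is, here is a small pure-Python sketch using Andrew's monotone chain algorithm. This is an illustrative substitute, not what the notebook runs: the notebook calls SciPy's `ConvexHull`. The function names and sample points are assumptions.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would create a clockwise or straight turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


stops = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]  # (1, 1) lies inside the square
hull = convex_hull(stops)
ring = hull + hull[:1]  # repeat the first vertex to close the outline
print(ring)
```

Repeating the first vertex at the end is the same trick the zone-drawing cell uses so that the polyline forms a closed boundary.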

In [None]:
#@title Automated Zone Creation
colors = itertools.cycle([ "#E88C07", "#FF005B", "#150CE8", "#00FFC2","#040403"])

map = folium.Map([43.6944122, -79.4920959], zoom_start=10,tiles='OpenStreetMap') 
 
for i in range(len(clusters)):
  # The last entry is empty when DBSCAN labelled any point as noise; skip it
  if len(clusters[i]) == 0:
    continue
  color = next(colors)
  try:
    # Convex hull of the cluster, with the first vertex repeated to close the outline
    hull = ConvexHull(clusters[i])
    points = clusters[i][hull.vertices].tolist()
    points.append(points[0])
  except Exception:
    # The hull cannot be built for degenerate clusters (e.g. identical or collinear
    # points), so fall back to a small square around the first point
    lat, lng = clusters[i][0]
    d = 0.0001
    points = [[lat-d, lng-d], [lat-d, lng+d], [lat+d, lng+d], [lat+d, lng-d], [lat-d, lng-d]]
  # add_children has been removed from folium; add_to is the supported call
  PolyLine(points, color=color, weight=5, opacity=1).add_to(map)

display(map)



## Associating the Cluster
All we're doing here is associating the correct cluster back to the original trips dataframe (dfNormalStops)

In [None]:
# Create a dataframe having a mapping of points to the clusters to which they belong
dfClusterPoint = pd.DataFrame(data=cluster_labels, columns=['Cluster'])
dfClusterPoint['Point'] = flat_list
dfClusterPoint['Lat'] = dfClusterPoint.apply(lambda x: x['Point'][0], axis=1)
dfClusterPoint['Lng'] = dfClusterPoint.apply(lambda x: x['Point'][1], axis=1)
dfClusterPoint = dfClusterPoint.groupby(['Lat','Lng']).agg({'Cluster': np.max}).reset_index()

# Maps the cluster back to the associated longitude/latitude in the NormalStops dataframe
dfNormalStopsWithCluster = pd.merge(dfNormalStops, dfClusterPoint, how='right', left_on=['StopLatitude','StopLongitude'], right_on = ['Lat','Lng'])

# 2.0: Supervised Learning: Zone Classification

## 2.1: Data Preparation
Now that we have the clusters associated with each trip, we can start to prepare the data for inclusion into our supervised model for predicting the type of stop location.

In [None]:
#@title One-hot encodes the weight class variable
dfOneHot = pd.get_dummies(dfNormalStopsWithCluster['weightclass'])
dfNormalStopsWithCluster = pd.concat([dfNormalStopsWithCluster, dfOneHot], axis=1)

In [None]:
#@title Inspect the one-hot encoding
dfNormalStopsWithCluster.head()

Unnamed: 0,UniqueId,HardwareId,StartTime,StopTime,StopLatitude,StopLongitude,StopGeohash,HourStart,HourStop,DrivingDuration,StopDuration,IdlingDuration,MaxSpeed,Distance,Year,Month,Day,PreviousStopDuration,weightclass,StopDurationMinutes,Lat,Lng,Cluster,Class 7 (26001 - 33000 lbs),Class 8 (33001 - Over lbs),Class C (4001 - 5000 lbs),Class G (8001 - 9000 lbs)
0,1905727_551021825_2018-01-02_13:07:22UTC,551021825,2018-01-02 13:07:22+00:00,2018-01-02 13:09:52+00:00,43.758003,-79.46634,dpz2ztyku,8,8,150.0,411.127,4.0,35.0,0.621264,2018,1,2,3267.0,Class 8 (33001 - Over lbs),6.852117,43.758003,-79.46634,353,0,1,0,0
1,1971788_551020463_2018-01-14_14:31:04UTC,551020463,2018-01-14 14:31:04.127000+00:00,2018-01-14 14:36:23+00:00,43.758003,-79.46634,dpz2ztyku,9,9,318.873,307.063,4.0,70.0,2.26726,2018,1,14,270.127,Class 7 (26001 - 33000 lbs),5.117717,43.758003,-79.46634,353,1,0,0,0
2,3157557_551020463_2018-08-22_11:54:35UTC,551020463,2018-08-22 11:54:35.063000+00:00,2018-08-22 12:03:38+00:00,43.758003,-79.46634,dpz2ztyku,7,8,542.937,730.127,6.0,29.0,0.586836,2018,8,22,64408.063,Class 7 (26001 - 33000 lbs),12.168783,43.758003,-79.46634,353,1,0,0,0
3,3366222_551400270_2018-09-28_12:15:09UTC,551400270,2018-09-28 12:15:09+00:00,2018-09-28 12:20:23+00:00,43.758003,-79.46634,dpz2ztyku,8,8,314.0,519.063,63.0,28.0,0.51176,2018,9,28,65056.0,Class 7 (26001 - 33000 lbs),8.65105,43.758003,-79.46634,353,1,0,0,0
4,3491002_550602017_2018-07-21_13:00:57UTC,550602017,2018-07-21 13:00:57.127000+00:00,2018-07-21 13:05:03+00:00,43.758003,-79.46634,dpz2ztyku,9,9,245.873,401.0,401.0,41.0,0.546758,2018,7,21,69615.127,Class 7 (26001 - 33000 lbs),6.683333,43.758003,-79.46634,353,1,0,0,0


In [None]:
# Ignore any stops that weren't in a cluster
dfClusters = dfNormalStopsWithCluster[dfNormalStopsWithCluster['Cluster'] != -1].groupby(['Cluster'])

## 2.1: Pulling together Features by Cluster
Previously, we had been looking at our data from a trip-centric angle (i.e. every row of data represented a trip).  Now, we're going to transform that data whereby every row represents a specific zone/cluster.  We will aggregate and bucketize all of the metrics associated with the zone:

*   Histogram of quantity of time stopping at the zone in 5 minute buckets up to 60 minutes
*   Histogram of distribution of visits to the zone by hour of day
*   Count of distinct vehicles
*   Percentage visit distribution by each of the types of vehicles
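
The bucketing described in the list above can be sketched in plain Python. The helper names `bucket_stop_minutes` and `hourly_percentages` and the sample values are assumptions for illustration; the notebook itself builds these histograms with `np.histogram`.

```python
def bucket_stop_minutes(durations, width=5, limit=60):
    """Count stops per `width`-minute bucket, with one final bucket for `limit`+ minutes."""
    n_buckets = limit // width + 1  # 12 regular buckets plus 1 overflow bucket
    counts = [0] * n_buckets
    for d in durations:
        idx = min(int(d // width), n_buckets - 1)
        counts[idx] += 1
    return counts


def hourly_percentages(hours):
    """Share of visits (in percent) falling in each hour of the day."""
    counts = [0] * 24
    for h in hours:
        counts[h] += 1
    total = len(hours)
    return [100.0 * c / total if total else 0.0 for c in counts]


stop_minutes = [3.5, 7.7, 12.9, 24.7, 61.0, 186.4]  # illustrative stop durations
print(bucket_stop_minutes(stop_minutes))

visit_hours = [9, 10, 10, 11]  # illustrative HourStop values
print(hourly_percentages(visit_hours)[9:12])
```

Dividing every hourly count by the same total keeps the percentages summing to 100, which makes zones with very different visit counts comparable.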

In [None]:
#@title Aggregation Functions
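def AggStopDuration(x):
  # 5-minute buckets from 0 to 60, plus a final bucket for longer stops
  hst = np.histogram(x, bins=[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65])
  # Stops over 65 minutes fall outside the bins; add them to the last bucket
  numOver60 = len(x) - hst[0].sum()
  hst[0][12] = hst[0][12] + numOver60
  return hst
def AggStopHour(x):
  # Percentage of visits in each hour of the day
  counts, edges = np.histogram(x, bins=np.arange(25))
  # Divide by the total once: updating the integer array in place would change
  # the running sum and truncate the percentages
  total = counts.sum()
  pct = counts / total * 100 if total > 0 else counts.astype(float)
  return pct, edges
def CountDistinctVehicles(x):
  return x.nunique()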

In [None]:
#@title Aggregates Trips and Groups by Cluster/Zone
dfFlattenedClusters = dfClusters.agg({'StopLatitude': np.mean,
                                   'StopLongitude': np.mean,
                                   'StopDurationMinutes': lambda x: AggStopDuration(x),
                                   'HourStop': lambda x: AggStopHour(x),
                                   'UniqueId': np.count_nonzero,
                                   'HardwareId' : lambda x: CountDistinctVehicles(x),
                                   'Class 7 (26001 - 33000 lbs)' : np.sum,
                                   'Class 8 (33001 - Over lbs)' : np.sum,
                                   'Class C (4001 - 5000 lbs)' : np.sum,
                                   'Class G (8001 - 9000 lbs)' : np.sum}
                                 ).reset_index()

In [None]:
#@title Pulls array data out into its own columns
# The histogram counts are element [0] of each (counts, bin_edges) tuple
stop_cols = ['Stop_{}_{}'.format(lo, lo + 5) for lo in range(0, 60, 5)] + ['Stop_60_plus']
for i, col in enumerate(stop_cols):
  dfFlattenedClusters[col] = dfFlattenedClusters['StopDurationMinutes'].apply(lambda h, i=i: h[0][i])

hour_cols = ['StopHour_{}_{}'.format(h, h + 1) for h in range(24)]
for i, col in enumerate(hour_cols):
  dfFlattenedClusters[col] = dfFlattenedClusters['HourStop'].apply(lambda h, i=i: h[0][i])

# Convert per-class visit counts into each class's share of the zone's visits
class_cols = ['Class 7 (26001 - 33000 lbs)', 'Class 8 (33001 - Over lbs)',
              'Class C (4001 - 5000 lbs)', 'Class G (8001 - 9000 lbs)']
for col in class_cols:
  dfFlattenedClusters[col] = dfFlattenedClusters[col] / dfFlattenedClusters['UniqueId']

In [None]:
#@title Inspect the flattened table of clusters
dfFlattenedClusters.head()

Unnamed: 0,Cluster,StopLatitude,StopLongitude,StopDurationMinutes,HourStop,UniqueId,HardwareId,Class 7 (26001 - 33000 lbs),Class 8 (33001 - Over lbs),Class C (4001 - 5000 lbs),Class G (8001 - 9000 lbs),Stop_0_5,Stop_5_10,Stop_10_15,Stop_15_20,Stop_20_25,Stop_25_30,Stop_30_35,Stop_35_40,Stop_40_45,Stop_45_50,Stop_50_55,Stop_55_60,Stop_60_plus,StopHour_0_1,StopHour_1_2,StopHour_2_3,StopHour_3_4,StopHour_4_5,StopHour_5_6,StopHour_6_7,StopHour_7_8,StopHour_8_9,StopHour_9_10,StopHour_10_11,StopHour_11_12,StopHour_12_13,StopHour_13_14,StopHour_14_15,StopHour_15_16,StopHour_16_17,StopHour_17_18,StopHour_18_19,StopHour_19_20,StopHour_20_21,StopHour_21_22,StopHour_22_23,StopHour_23_24
0,0,42.062801,-82.645672,"([4, 3, 2, 0, 1, 0, 1, 0, 0, 0, 1, 0, 4], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 33, 17, 0, 0, ...",16,1,0.0,1.0,0.0,0.0,4,3,2,0,1,0,1,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,6,33,17,0,0,0,0,0,0,0,0,0,0,0,0
1,1,43.127895,-80.334633,"([0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 20, 8, 0, 6, 0, 0, 0...",5,2,0.4,0.6,0.0,0.0,0,1,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,8,0,6,0,0,0,0,0,0,0,0,0,0,0,0
2,2,43.128893,-80.334521,"([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 6, 0, 0, 0, 0...",5,2,0.4,0.6,0.0,0.0,0,0,0,0,0,0,0,0,1,1,1,0,2,0,0,0,0,0,0,0,0,0,40,6,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,44.121954,-79.557962,"([1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 60, 3, 0, 0, 0, 0, 0, 0...",5,2,0.0,1.0,0.0,0.0,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,44.125615,-79.5561,"([0, 1, 6, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 8, 0, 26, 2, 0, 2, 9, 0...",12,2,0.0,1.0,0.0,0.0,0,1,6,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,26,2,0,2,9,0,0,0,0,0,0,0,0,0,0


## 2.2: Labeling Data through Visual Inspection
We leveraged the data on a map to label zones as either an Agency or Donor.

In [None]:
colors = itertools.cycle([ "#E88C07", "#FF005B", "#150CE8", "#00FFC2","#040403"])

map = folium.Map([43.6944122, -79.4920959], zoom_start=10,tiles='OpenStreetMap') 

for index, row in dfFlattenedClusters.iterrows():
  color = next(colors)
  folium.CircleMarker(location=(row['StopLatitude'], row['StopLongitude']), radius=5, color=color, fill_color=color, popup='Cluster: ' + str(row['Cluster'])).add_to(map)

display(map)

In [None]:
#@title Import Label Data
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = False
labelData = '''
    SELECT 
      Cluster, 
      (case when Type='Donor' then 0 else 1 end) as Type 
    FROM `geotab-bootcamp.DemoData.IntelligentZoningLabelData` 
    WHERE Type NOT IN ('Gas', 'Service Centre')
'''

dfLabelData = client.query(labelData, job_config=job_config).to_dataframe()

In [None]:
#@title Verifying Balance of Data
print('Number of Donors in Label Data: ' + str(len(dfLabelData[dfLabelData['Type']==0])))
print('Number of Agencies in Label Data: ' + str(len(dfLabelData[dfLabelData['Type']==1])))

Number of Donors in Label Data: 55
Number of Agencies in Label Data: 60


In [None]:
#@title Join label data with input data
dfTraining = pd.merge(dfLabelData, dfFlattenedClusters.copy(), on='Cluster', how='inner')

In [None]:
#@title Inspecting Training Data
dfTraining.head()

Unnamed: 0,Cluster,Type,StopLatitude,StopLongitude,StopDurationMinutes,HourStop,UniqueId,HardwareId,Class 7 (26001 - 33000 lbs),Class 8 (33001 - Over lbs),Class C (4001 - 5000 lbs),Class G (8001 - 9000 lbs),Stop_0_5,Stop_5_10,Stop_10_15,Stop_15_20,Stop_20_25,Stop_25_30,Stop_30_35,Stop_35_40,Stop_40_45,Stop_45_50,Stop_50_55,Stop_55_60,Stop_60_plus,StopHour_0_1,StopHour_1_2,StopHour_2_3,StopHour_3_4,StopHour_4_5,StopHour_5_6,StopHour_6_7,StopHour_7_8,StopHour_8_9,StopHour_9_10,StopHour_10_11,StopHour_11_12,StopHour_12_13,StopHour_13_14,StopHour_14_15,StopHour_15_16,StopHour_16_17,StopHour_17_18,StopHour_18_19,StopHour_19_20,StopHour_20_21,StopHour_21_22,StopHour_22_23,StopHour_23_24
0,3,0,44.121954,-79.557962,"([1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 60, 3, 0, 0, 0, 0, 0, 0...",5,2,0.0,1.0,0.0,0.0,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,4,0,44.125615,-79.5561,"([0, 1, 6, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 8, 0, 26, 2, 0, 2, 9, 0...",12,2,0.0,1.0,0.0,0.0,0,1,6,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,26,2,0,2,9,0,0,0,0,0,0,0,0,0,0
2,1,0,43.127895,-80.334633,"([0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 20, 8, 0, 6, 0, 0, 0...",5,2,0.4,0.6,0.0,0.0,0,1,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,8,0,6,0,0,0,0,0,0,0,0,0,0,0,0
3,2,0,43.128893,-80.334521,"([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 6, 0, 0, 0, 0...",5,2,0.4,0.6,0.0,0.0,0,0,0,0,0,0,0,0,1,1,1,0,2,0,0,0,0,0,0,0,0,0,40,6,0,0,0,0,0,0,0,0,0,0,0,0,0
4,14,0,43.536372,-79.907267,"([5, 4, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 6, 5, 12, 15, 4, 2, 2, ...",15,3,0.133333,0.866667,0.0,0.0,5,4,1,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,6,5,12,15,4,2,2,2,0,0,0,0,0,0,0,0,0


In [None]:
#@title Clean up Features
labels = np.array(dfTraining['Type'])
# Drop the identifier, label, coordinates, and raw histogram tuples; the rest are features
dfFeatures = dfTraining.drop(columns=['Cluster', 'Type', 'StopLatitude', 'StopLongitude',
                                      'StopDurationMinutes', 'HourStop'])
feature_list = list(dfFeatures.columns)
features = np.array(dfFeatures)

In [None]:
#@title Inspect Features Dataframe
dfFeatures.head()

Unnamed: 0,UniqueId,HardwareId,Class 7 (26001 - 33000 lbs),Class 8 (33001 - Over lbs),Class C (4001 - 5000 lbs),Class G (8001 - 9000 lbs),Stop_0_5,Stop_5_10,Stop_10_15,Stop_15_20,Stop_20_25,Stop_25_30,Stop_30_35,Stop_35_40,Stop_40_45,Stop_45_50,Stop_50_55,Stop_55_60,Stop_60_plus,StopHour_0_1,StopHour_1_2,StopHour_2_3,StopHour_3_4,StopHour_4_5,StopHour_5_6,StopHour_6_7,StopHour_7_8,StopHour_8_9,StopHour_9_10,StopHour_10_11,StopHour_11_12,StopHour_12_13,StopHour_13_14,StopHour_14_15,StopHour_15_16,StopHour_16_17,StopHour_17_18,StopHour_18_19,StopHour_19_20,StopHour_20_21,StopHour_21_22,StopHour_22_23,StopHour_23_24
0,5,2,0.0,1.0,0.0,0.0,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,12,2,0.0,1.0,0.0,0.0,0,1,6,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,26,2,0,2,9,0,0,0,0,0,0,0,0,0,0
2,5,2,0.4,0.6,0.0,0.0,0,1,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,8,0,6,0,0,0,0,0,0,0,0,0,0,0,0
3,5,2,0.4,0.6,0.0,0.0,0,0,0,0,0,0,0,0,1,1,1,0,2,0,0,0,0,0,0,0,0,0,40,6,0,0,0,0,0,0,0,0,0,0,0,0,0
4,15,3,0.133333,0.866667,0.0,0.0,5,4,1,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,6,5,12,15,4,2,2,2,0,0,0,0,0,0,0,0,0


## 2.3: Creating the Supervised ML Model
Here we will do a train/test split, train our Random Forest model, and look at the accuracy score.
Documentation:  https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
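
Accuracy is simply the fraction of predictions that match the labels. The sketch below computes it by hand on made-up labels; `accuracy` is a hypothetical helper that mirrors what `accuracy_score` reports in this binary setting.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels: 0 = Donor, 1 = Agency
y_true = [0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1]
print(accuracy(y_true, y_pred))  # 5 of 7 predictions are correct
```

Assuming all 115 labelled zones survive the join, a 70/30 split leaves about 35 zones for testing, so each misclassified zone moves the test accuracy by almost 3 percentage points. Small differences between models on this test set should be read with that in mind.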

In [None]:
#@title 2.3.1: Random Forest Implementation
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.3, random_state = 42)
rf = RandomForestClassifier(n_estimators = 3, random_state = 42)
rf.fit(train_features, train_labels);
predictions = rf.predict(test_features)

print("Train Accuracy :: " + str(accuracy_score(train_labels, rf.predict(train_features))))
print("Test Accuracy  :: " + str(accuracy_score(test_labels, predictions)))

Train Accuracy :: 0.9625
Test Accuracy  :: 0.7714285714285715


### Which Features are Important?

Let's identify the features the model believes to be important.

In [None]:
#@title Feature Importance

featureImportance = pd.concat([pd.Series(feature_list), pd.Series(rf.feature_importances_)], axis=1)
featureImportance

Unnamed: 0,0,1
0,UniqueId,0.025707
1,HardwareId,0.024202
2,Class 7 (26001 - 33000 lbs),0.04969
3,Class 8 (33001 - Over lbs),0.262417
4,Class C (4001 - 5000 lbs),0.0
5,Class G (8001 - 9000 lbs),0.055123
6,Stop_0_5,0.0
7,Stop_5_10,0.0
8,Stop_10_15,0.073319
9,Stop_15_20,0.012296
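
The rows shown above are easier to read once ranked. Here is a plain-Python sketch that sorts the ten displayed (feature, importance) pairs, copied from the output above; only these ten rows are used, so the ranking is partial.

```python
# (feature, importance) pairs copied from the ten rows displayed above
importances = [
    ("UniqueId", 0.025707),
    ("HardwareId", 0.024202),
    ("Class 7 (26001 - 33000 lbs)", 0.04969),
    ("Class 8 (33001 - Over lbs)", 0.262417),
    ("Class C (4001 - 5000 lbs)", 0.0),
    ("Class G (8001 - 9000 lbs)", 0.055123),
    ("Stop_0_5", 0.0),
    ("Stop_5_10", 0.0),
    ("Stop_10_15", 0.073319),
    ("Stop_15_20", 0.012296),
]

# Highest importance first
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:3]:
    print(f"{name:<30} {score:.3f}")

# Features the forest never split on
unused = [name for name, score in importances if score == 0.0]
print(unused)
```

Features with an importance of exactly zero were never used in a split by this forest, which makes them natural candidates to drop when revising the feature set.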


### 2.3.2: Experiment with Logistic Regression
Here is the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
lr = LogisticRegression(random_state = 42)
lr.fit(train_features, train_labels);
predictions = lr.predict(test_features)

print("Train Accuracy :: " + str(accuracy_score(train_labels, lr.predict(train_features))))
print("Test Accuracy  :: " + str(accuracy_score(test_labels, predictions)))

Train Accuracy :: 0.9375
Test Accuracy  :: 0.8285714285714286


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
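
The warning above suggests scaling the data. A minimal standard-library sketch of z-score standardization, the per-feature transformation that scikit-learn's `StandardScaler` applies, is shown below. The `standardize` helper is hypothetical, and the sample column reuses the first five `UniqueId` values from the features table.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


visit_counts = [5, 12, 5, 5, 15]  # UniqueId (visit count) for five zones
scaled = standardize(visit_counts)
print([round(v, 2) for v in scaled])
```

The features here mix raw counts, percentages, and fractions between 0 and 1, so putting them on a common scale is one plausible way to help the solver converge; raising `max_iter` is the other option the warning mentions.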


### 2.3.3: Experiment with Naive Bayes
Here is the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

In [None]:
nb = GaussianNB()
nb.fit(train_features, train_labels);
predictions = nb.predict(test_features)

print("Train Accuracy :: " + str(accuracy_score(train_labels, nb.predict(train_features))))
print("Test Accuracy  :: " + str(accuracy_score(test_labels, predictions)))

Train Accuracy :: 0.7125
Test Accuracy  :: 0.8285714285714286


### 2.3.4: Revise Feature Selection and Re-train Random Forest

In [None]:
labels = np.array(dfTraining['Type'])

# Identifier, label, coordinates, and raw histogram tuples are never used as features
base_drop = ['Cluster', 'Type', 'StopLatitude', 'StopLongitude', 'StopDurationMinutes', 'HourStop']

# Features removed in this revision
revised_drop = ['Stop_0_5', 'Stop_5_10', 'Stop_40_45', 'Stop_50_55', 'Stop_55_60', 'Stop_60_plus',
                'StopHour_0_1', 'StopHour_1_2', 'StopHour_2_3', 'StopHour_3_4',
                'StopHour_4_5', 'StopHour_5_6', 'StopHour_6_7', 'StopHour_7_8',
                'StopHour_15_16', 'StopHour_16_17', 'StopHour_17_18',
                'StopHour_20_21', 'StopHour_21_22', 'StopHour_22_23', 'StopHour_23_24',
                'Class C (4001 - 5000 lbs)', 'Class G (8001 - 9000 lbs)']

dfFeatures = dfTraining.drop(columns=base_drop + revised_drop)
feature_list = list(dfFeatures.columns)
features = np.array(dfFeatures)

#Re-train the model
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.3, random_state = 42)
rf = RandomForestClassifier(n_estimators = 3, random_state = 42)
rf.fit(train_features, train_labels);
predictions = rf.predict(test_features)

print("Train Accuracy :: " + str(accuracy_score(train_labels, rf.predict(train_features))))
print("Test Accuracy  :: " + str(accuracy_score(test_labels, predictions)))

Train Accuracy :: 0.9625
Test Accuracy  :: 0.8571428571428571
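
One common way to decide which columns to keep or drop is to look at the impurity-based importances a fitted Random Forest exposes. The notebook does not say how its revised feature list was chosen, so treat this as a hedged sketch: the data is synthetic and the names `rf_demo`, `feature_names` and `ranked` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are hypothetical placeholders.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X, y)

# Pair each column with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, rf_demo.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```

With the tutorial's objects, the same idea would pair `feature_list` with `rf.feature_importances_`; columns near the bottom of the ranking are candidates for removal.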


### 2.3.5: Hyper Parameter Tuning and Cross-Validation
We're going to do 5-fold cross-validation.  We have a train/test split of 70/30, and we need to validate the model's performance using K-fold cross-validation on the training set.  We'll also use GridSearchCV for hyperparameter tuning and have selected some Random Forest parameters to tune.

Question: Which parameters yield the best model?

In [None]:
#@title Grid Search and Cross-Validation
param_grid = {
    'bootstrap': [True],
    'max_depth': [50, 60, 80],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [3, 10, 100]
}
rf = RandomForestClassifier(random_state = 42)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 2)
grid_search.fit(train_features, train_labels)
grid_search.best_params_

Fitting 5 folds for each of 162 candidates, totalling 810 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 558 tasks      | elapsed:   25.1s
[Parallel(n_jobs=-1)]: Done 810 out of 810 | elapsed:   36.6s finished


{'bootstrap': True,
 'max_depth': 50,
 'max_features': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 8,
 'n_estimators': 100}
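
The "162 candidates, totalling 810 fits" line in the log follows directly from the grid: one candidate per combination of parameter values, each fitted once per fold. A small sketch that reproduces the arithmetic with the standard library only (the names `candidates`, `n_candidates` and `n_fits` are illustrative):

```python
from itertools import product

param_grid = {
    'bootstrap': [True],
    'max_depth': [50, 60, 80],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [3, 10, 100],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)   # 1 * 3 * 2 * 3 * 3 * 3
n_fits = n_candidates * cv_folds  # each candidate is fitted once per fold

print(n_candidates, "candidates,", n_fits, "fits")  # 162 candidates, 810 fits
```

Adding one more value to any list multiplies the work, so it pays to keep the grid small. With the default `refit=True`, GridSearchCV also fits the winning candidate once more on the full training set.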

In [None]:
#@title Model Accuracy Metrics
best_grid = grid_search.best_estimator_
predictions = best_grid.predict(test_features)

print("Train Accuracy :: " + str(accuracy_score(train_labels, best_grid.predict(train_features))))
print("Test Accuracy  :: " + str(accuracy_score(test_labels, predictions)))

Train Accuracy :: 0.95
Test Accuracy  :: 0.8571428571428571
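
Accuracy alone does not show which class the errors fall on. A confusion matrix splits them into donors predicted as agencies and agencies predicted as donors. The sketch below uses hypothetical labels (0 = donor, 1 = agency) purely for illustration; they are not the tutorial's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (0 = donor, 1 = agency).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print("Donors predicted as agencies:", fp)
print("Agencies predicted as donors:", fn)
print("Accuracy:", accuracy)
```

With the tutorial's objects, the equivalent call would be `confusion_matrix(test_labels, predictions)`.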


## 2.4: Model Evaluation over All Zones

In [None]:
#@title Making Prediction Over Entire Dataset

dfEvaluate = dfFlattenedClusters.copy()

# Drop the same non-feature columns as in training (there is no 'Type' column here),
# so the evaluation features line up with what the model was trained on.
eval_drop_columns = [
    'Cluster', 'StopLatitude', 'StopLongitude', 'StopDurationMinutes', 'HourStop',
    'Stop_0_5', 'Stop_5_10', 'Stop_40_45', 'Stop_50_55', 'Stop_55_60', 'Stop_60_plus',
    'StopHour_0_1', 'StopHour_1_2', 'StopHour_2_3', 'StopHour_3_4',
    'StopHour_4_5', 'StopHour_5_6', 'StopHour_6_7', 'StopHour_7_8',
    'StopHour_15_16', 'StopHour_16_17', 'StopHour_17_18',
    'StopHour_20_21', 'StopHour_21_22', 'StopHour_22_23', 'StopHour_23_24',
    'Class C (4001 - 5000 lbs)', 'Class G (8001 - 9000 lbs)',
]
evalFeatures = dfEvaluate.drop(columns = eval_drop_columns)

feature_list_eval = list(evalFeatures.columns)
evalFeatures = np.array(evalFeatures)

#Make Predictions
evalPredictions = best_grid.predict(evalFeatures)
dfEvaluate['PredictedType'] = evalPredictions
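
Because the model receives plain NumPy arrays, the evaluation columns must match the training columns in both name and order. One way to guard against drift is to keep a single drop list and compare the resulting columns before predicting. This is a sketch under assumptions: the miniature DataFrames and the helper `build_features` are hypothetical, and only the column names mirror the tutorial.

```python
import pandas as pd

# Hypothetical shared list; the tutorial's full list is much longer.
DROP_COLUMNS = ['Cluster', 'StopLatitude', 'StopLongitude']

def build_features(df, extra_drop=()):
    """Drop non-feature columns and return the remaining feature frame."""
    return df.drop(columns=list(DROP_COLUMNS) + list(extra_drop))

# Tiny made-up frames standing in for dfTraining and dfFlattenedClusters.
df_train_demo = pd.DataFrame({
    'Cluster': [0, 1],
    'StopLatitude': [43.1, 43.2],
    'StopLongitude': [-79.5, -79.6],
    'Stop_10_15': [2, 0],
    'StopHour_9_10': [33, 50],
    'Type': [0, 1],
})
df_eval_demo = df_train_demo.drop(columns='Type')

train_feature_frame = build_features(df_train_demo, extra_drop=['Type'])
eval_feature_frame = build_features(df_eval_demo)

# Fail early if the evaluation columns differ from the training columns.
assert list(train_feature_frame.columns) == list(eval_feature_frame.columns)
print(list(eval_feature_frame.columns))
```

In the notebook itself, the equivalent check would compare `feature_list` with `feature_list_eval`.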

## 2.5: Verify Classification of Agencies and Donors
Plug the latitude and longitude of each zone into Google Maps to verify whether the donors and agencies were classified correctly.  Keep in mind that test accuracy was about 86%, so some zones will be misclassified.

For agencies, look up the latitude/longitude of Clusters 68 and 72: neither was in the labeled dataset, but both were correctly identified.  For donors, look at Clusters 8 and 11.
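
To save copying coordinates by hand, a link can be built for each zone. This sketch assumes the Google Maps search URL form `https://www.google.com/maps/search/?api=1&query=<lat>,<lon>`; the helper `maps_url` is hypothetical, and the two coordinate pairs are copied from the output tables in this section.

```python
from urllib.parse import urlencode

def maps_url(latitude, longitude):
    """Build a Google Maps search URL for a coordinate pair."""
    query = urlencode({"api": 1, "query": f"{latitude},{longitude}"})
    return "https://www.google.com/maps/search/?" + query

# Coordinates copied from the tables in this section.
zones = {
    68: (43.744356, -79.222262),  # predicted agency
    8: (43.237682, -79.733815),   # predicted donor
}
urls = {cluster: maps_url(lat, lon) for cluster, (lat, lon) in zones.items()}
for cluster, url in urls.items():
    print(cluster, url)
```

The same helper could be applied row by row to `StopLatitude` and `StopLongitude` in `dfEvaluate` to add a clickable link column.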

In [None]:
#@title Agency Verification
dfEvaluate[dfEvaluate['PredictedType']==1].head(20)

Unnamed: 0,Cluster,StopLatitude,StopLongitude,StopDurationMinutes,HourStop,UniqueId,HardwareId,Class 7 (26001 - 33000 lbs),Class 8 (33001 - Over lbs),Class C (4001 - 5000 lbs),Class G (8001 - 9000 lbs),Stop_0_5,Stop_5_10,Stop_10_15,Stop_15_20,Stop_20_25,Stop_25_30,Stop_30_35,Stop_35_40,Stop_40_45,Stop_45_50,Stop_50_55,Stop_55_60,Stop_60_plus,StopHour_0_1,StopHour_1_2,StopHour_2_3,StopHour_3_4,StopHour_4_5,StopHour_5_6,StopHour_6_7,StopHour_7_8,StopHour_8_9,StopHour_9_10,StopHour_10_11,StopHour_11_12,StopHour_12_13,StopHour_13_14,StopHour_14_15,StopHour_15_16,StopHour_16_17,StopHour_17_18,StopHour_18_19,StopHour_19_20,StopHour_20_21,StopHour_21_22,StopHour_22_23,StopHour_23_24,PredictedType
13,13,43.389497,-79.789839,"([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5], [0, ...","([0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0...",5,1,1.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,20,0,0,0,0,0,0,0,0,4,0,0,0,0,3,0,0,0,0,0,3,3,0,1
36,36,43.846861,-79.542738,"([1, 1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16, 23, 0, ...",6,1,0.0,0.0,0.0,1.0,1,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,23,0,0,0,0,0,0,0,0,0,0,0,1
41,41,43.780682,-79.554936,"([3, 2, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 33, 12, 0, 2, 0, ...",9,2,0.888889,0.111111,0.0,0.0,3,2,2,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,12,0,2,0,0,0,0,0,0,0,0,0,0,0,1
49,49,43.742402,-79.65927,"([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 4, 0, 3, 0, 0...",5,1,1.0,0.0,0.0,0.0,1,1,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,20,0,4,0,3,0,0,0,6,0,0,0,0,0,0,0,1
57,57,43.746716,-79.214536,"([0, 0, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 80, 0, 1, 0, 0, 0...",5,1,1.0,0.0,0.0,0.0,0,0,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
59,59,43.737344,-79.218866,"([2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 50, 5, 0, 0, 0, 0...",6,1,0.0,1.0,0.0,0.0,2,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,50,5,0,0,0,0,0,0,0,0,0,0,0,0,0,1
60,60,43.757867,-79.223877,"([4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 28, 15, 0, 0, ...",7,1,1.0,0.0,0.0,0.0,4,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,28,15,0,0,0,0,0,0,0,0,0,0,0,0,1
66,66,43.686001,-79.597346,"([2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 1...",5,1,1.0,0.0,0.0,0.0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,16,0,0,0,0,0,0,0,0,0,1
68,68,43.744356,-79.222262,"([3, 7, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 23, 2, 0, 0, ...",12,2,0.833333,0.0,0.0,0.166667,3,7,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,23,2,0,0,0,0,0,0,0,0,0,0,0,0,1
70,70,43.759544,-79.217501,"([0, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 3, 0, 0, 0, 0...",5,2,0.8,0.0,0.0,0.2,0,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60,3,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [None]:
#@title Donor Verification
dfEvaluate[dfEvaluate['PredictedType']==0].head(20)

Unnamed: 0,Cluster,StopLatitude,StopLongitude,StopDurationMinutes,HourStop,UniqueId,HardwareId,Class 7 (26001 - 33000 lbs),Class 8 (33001 - Over lbs),Class C (4001 - 5000 lbs),Class G (8001 - 9000 lbs),Stop_0_5,Stop_5_10,Stop_10_15,Stop_15_20,Stop_20_25,Stop_25_30,Stop_30_35,Stop_35_40,Stop_40_45,Stop_45_50,Stop_50_55,Stop_55_60,Stop_60_plus,StopHour_0_1,StopHour_1_2,StopHour_2_3,StopHour_3_4,StopHour_4_5,StopHour_5_6,StopHour_6_7,StopHour_7_8,StopHour_8_9,StopHour_9_10,StopHour_10_11,StopHour_11_12,StopHour_12_13,StopHour_13_14,StopHour_14_15,StopHour_15_16,StopHour_16_17,StopHour_17_18,StopHour_18_19,StopHour_19_20,StopHour_20_21,StopHour_21_22,StopHour_22_23,StopHour_23_24,PredictedType
0,0,42.062801,-82.645672,"([4, 3, 2, 0, 1, 0, 1, 0, 0, 0, 1, 0, 4], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 33, 17, 0, 0, ...",16,1,0.0,1.0,0.0,0.0,4,3,2,0,1,0,1,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,6,33,17,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,43.127895,-80.334633,"([0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 20, 8, 0, 6, 0, 0, 0...",5,2,0.4,0.6,0.0,0.0,0,1,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,8,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,43.128893,-80.334521,"([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 2], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 6, 0, 0, 0, 0...",5,2,0.4,0.6,0.0,0.0,0,0,0,0,0,0,0,0,1,1,1,0,2,0,0,0,0,0,0,0,0,0,40,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3,44.121954,-79.557962,"([1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 60, 3, 0, 0, 0, 0, 0, 0...",5,2,0.0,1.0,0.0,0.0,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,44.125615,-79.5561,"([0, 1, 6, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 8, 0, 26, 2, 0, 2, 9, 0...",12,2,0.0,1.0,0.0,0.0,0,1,6,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,26,2,0,2,9,0,0,0,0,0,0,0,0,0,0,0
5,5,43.40116,-80.4628,"([0, 4, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 80, 1, 0, 0, 0, 0, 0...",5,2,0.0,1.0,0.0,0.0,0,4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,6,43.170743,-79.835954,"([0, 0, 1, 1, 0, 0, 0, 1, 2, 0, 1, 0, 2], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 25, 6, 2, 0, 5, 2, 0...",8,1,0.0,1.0,0.0,0.0,0,0,1,1,0,0,0,1,2,0,1,0,2,0,0,0,0,0,0,0,0,25,6,2,0,5,2,0,0,0,0,0,0,0,0,0,0,0
7,7,43.463848,-80.135276,"([2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 80, 0, 0, 0, 1, 0, 0...",5,1,0.0,1.0,0.0,0.0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
8,8,43.237682,-79.733815,"([4, 0, 1, 1, 0, 4, 0, 0, 0, 1, 1, 1, 1], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 21, 9, 5, 7, 0, 6, 0...",14,1,0.0,1.0,0.0,0.0,4,0,1,1,0,4,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,21,9,5,7,0,6,0,0,0,0,0,0,0,0,0,0,0
9,9,43.312094,-79.951381,"([0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 2], [0, ...","([0, 0, 0, 0, 0, 0, 0, 0, 33, 2, 0, 0, 0, 0, 0...",6,1,0.0,0.0,0.0,1.0,0,1,0,1,0,1,0,1,0,0,0,0,2,0,0,0,0,0,0,0,0,33,2,0,0,0,0,0,0,0,2,2,2,0,0,0,0,0
