### Spatial Cross-Validation
This notebook shows two ways to implement spatial cross-validation, which partitions data by geographic location instead of at random. It matters whenever observations are spatially autocorrelated: nearby points tend to resemble each other, so a random split puts close neighbours of the test points into the training set and makes the model look better than it would be in an unseen area. Holding out whole regions gives a more realistic estimate of how the model performs on new locations.
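
To make the effect of spatial autocorrelation concrete, the following sketch compares a random split with a grouped split on synthetic data. Everything here is an assumption made for illustration (the clustered coordinates, the per-cluster effect, the names `cluster_id`, `mae_random`, and `mae_spatial`); it does not use the accident data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold, KFold, cross_val_predict

rng = np.random.default_rng(0)
n_clusters, points_per_cluster = 20, 50

# Synthetic "regions": tight point clusters that each share one target offset
centers = rng.uniform(0, 10, size=(n_clusters, 2))
cluster_effect = rng.normal(0, 1, size=n_clusters)
cluster_id = np.repeat(np.arange(n_clusters), points_per_cluster)

# Features are the coordinates; the target depends on the cluster plus noise
X = centers[cluster_id] + rng.normal(0, 0.1, size=(cluster_id.size, 2))
y = cluster_effect[cluster_id] + rng.normal(0, 0.1, size=cluster_id.size)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Random split: neighbours of every test point are in the training set
random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
pred_random = cross_val_predict(model, X, y, cv=random_cv)

# Spatial split: whole clusters are held out together
pred_spatial = cross_val_predict(
    model, X, y, groups=cluster_id, cv=GroupKFold(n_splits=5)
)

mae_random = mean_absolute_error(y, pred_random)
mae_spatial = mean_absolute_error(y, pred_spatial)
print(f"Random KFold MAE:  {mae_random:.3f}")
print(f"Grouped KFold MAE: {mae_spatial:.3f}")
```

On data like this, the random split should report a much lower error than the grouped split, because the model can look up the target from neighbouring points of the same cluster. The grouped score is the one that reflects performance on regions the model has never seen.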

In [1]:
# Import ArcGIS libraries to get accident data
from arcgis import GeoAccessor, GeoSeriesAccessor  # registers the pandas `.spatial` accessor
from arcgis.features import FeatureLayer
from arcgis.gis import GIS

gis = GIS("home")

In [2]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupKFold, cross_val_predict

In [4]:
# Get the feature layer that holds the accident data
accidents_item = gis.content.get('c11963684aa2439aacbba8521b03c02a')
flayer = accidents_item.layers[0]

# Load the feature layer into a spatially enabled DataFrame
accident_sdf = pd.DataFrame.spatial.from_layer(flayer)

In [8]:
# Keep only the columns that are needed
accident_df = accident_sdf[['OBJECTID',
                            'UJAHR',
                            'UKATEGORIE',
                            'UMONAT',
                            'USTUNDE',
                            'UWOCHENTAG',
                            'UKREIS',
                            'IstRad',
                            'IstPKW',
                            'IstFuss',
                            'XGCSWGS84',
                            'YGCSWGS84']].copy()  # copy avoids SettingWithCopyWarning on later assignments

# Rename columns
accident_df = accident_df.rename(columns={'OBJECTID': 'id',
                                          'UJAHR': 'year',
                                          'UKATEGORIE': 'category',
                                          'UMONAT': 'month',
                                          'USTUNDE': 'hour',
                                          'UWOCHENTAG': 'day', 
                                          'UKREIS': 'district',
                                          'IstRad': 'bike',
                                          'IstPKW': 'car',
                                          'IstFuss': 'pedestrian',
                                          'XGCSWGS84': 'longitude',
                                          'YGCSWGS84': 'latitude'})

In [9]:
accident_df

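| | id | year | category | month | hour | day | district | bike | car | pedestrian | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022 | 3 | 02 | 19 | 6 | 54 | 1 | 1 | 0 | 9.093886 | 54.463396 |
| 1 | 2 | 2022 | 2 | 05 | 11 | 1 | 57 | 0 | 0 | 0 | 10.440636 | 54.268304 |
| 2 | 3 | 2022 | 3 | 05 | 12 | 1 | 59 | 0 | 1 | 0 | 9.624949 | 54.555986 |
| 3 | 4 | 2022 | 3 | 05 | 08 | 3 | 03 | 1 | 0 | 0 | 10.67249 | 53.870453 |
| 4 | 5 | 2022 | 3 | 04 | 19 | 3 | 61 | 1 | 0 | 0 | 9.509079 | 53.929809 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 256487 | 256488 | 2022 | 3 | 10 | 11 | 2 | 53 | 0 | 1 | 0 | 11.50844 | 50.959446 |
| 256488 | 256489 | 2022 | 3 | 12 | 16 | 4 | 55 | 1 | 1 | 0 | 11.323387 | 50.988056 |
| 256489 | 256490 | 2022 | 3 | 11 | 07 | 4 | 51 | 1 | 1 | 0 | 11.003714 | 50.984505 |
| 256490 | 256491 | 2022 | 3 | 11 | 06 | 3 | 63 | 0 | 1 | 1 | 10.375033 | 50.905188 |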


In [24]:
# Columns that are stored as strings but should be numeric
columns_to_convert = ['year', 'category', 'month', 'hour', 'day', 'district', 'bike', 'car', 'pedestrian']

for column in columns_to_convert:
    # errors='coerce' turns values that cannot be parsed into NaN instead of raising an error
    accident_df[column] = pd.to_numeric(accident_df[column], errors='coerce')
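
Because `errors='coerce'` silently replaces unparseable values with NaN, it can be worth counting the NaNs afterwards. The sketch below does this on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the accident data); the same check could be run on `accident_df[columns_to_convert]`.

```python
import pandas as pd

# Toy frame with string-typed numbers and one value that cannot be parsed
toy = pd.DataFrame({"month": ["02", "05", "x"], "hour": ["19", "11", "08"]})

# Coerce every column to numeric; unparseable entries become NaN
converted = toy.apply(pd.to_numeric, errors="coerce")

# Count how many values were lost per column
nan_counts = converted.isna().sum()
print(nan_counts)
```

A non-zero count points to rows that need cleaning or dropping before they reach the model.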

In [21]:
# Feature matrix X and target variable y
# For demonstration, a single feature (month) is used to predict car involvement
X = accident_df[['month']]  # Replace with your feature columns
y = accident_df['car']  # Replace with your target column

## Create spatial cross-validation by groups
Spatial grouping: cross-validation is performed by spatial groups, such as cities or districts. All observations from one group are kept in the same fold, so each fold is tested on districts the model has not seen during training.

In [22]:
district = accident_df['district'].values
group_kfold = GroupKFold(n_splits=5)

# Materialize the (train, test) index pairs for each fold
district_cv = list(group_kfold.split(X, y, groups=district))

# Initialize your model
model = RandomForestRegressor(random_state=42)

# Out-of-fold predictions: each row is predicted by a model that never saw its district
predictions = cross_val_predict(model, X, y, cv=district_cv)

# Calculate the Mean Absolute Error
mae = mean_absolute_error(y, predictions)
print(f'Mean Absolute Error: {mae}')

Mean Absolute Error: 0.36461193924005747
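
The single MAE above pools the predictions of all folds. As a rough sketch, the grouping can be verified and the error reported per fold with `cross_val_score`. The data below are random stand-ins (the arrays `X`, `y`, and `groups` are assumptions for illustration, not the accident data), so only the mechanics carry over.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)
n = 500
X = rng.integers(1, 13, size=(n, 1))    # stand-in for the 'month' feature
y = rng.integers(0, 2, size=n)          # stand-in for the binary 'car' target
groups = rng.integers(1, 16, size=n)    # stand-in for district codes

cv = GroupKFold(n_splits=5)

# Check that no group appears on both sides of any split
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

# scikit-learn returns negated MAE so that higher is better; flip the sign back
model = RandomForestRegressor(n_estimators=20, random_state=42)
scores = cross_val_score(
    model, X, y, groups=groups, cv=cv, scoring="neg_mean_absolute_error"
)
fold_mae = -scores
print("MAE per fold:", fold_mae.round(3))
print("Mean MAE:", fold_mae.mean().round(3))
```

A large spread between folds would suggest that some held-out districts are much harder to predict than others, which a pooled MAE hides.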


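## Create spatial cross-validation by a grid
Grid-based validation: the geographic area is divided into a regular grid, and the grid cells are used as groups. With `n_blocks = 5` per axis there are up to 25 cells, and `GroupKFold` assigns whole cells to the 5 folds, so no cell is split between training and test data. This works best when the data are spread fairly evenly over a large area; otherwise some cells hold far more points than others.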

In [23]:
# Define the number of blocks along each dimension
n_blocks = 5

# Create spatial blocks by binning latitude and longitude
accident_df['lon_block'] = pd.cut(accident_df['longitude'], bins=n_blocks, labels=False)
accident_df['lat_block'] = pd.cut(accident_df['latitude'], bins=n_blocks, labels=False)

# Combine latitude and longitude block labels to create a unique block ID for each area
accident_df['block_id'] = accident_df['lon_block'].astype(str) + "-" + accident_df['lat_block'].astype(str)

# Feature matrix X and target variable y
# For demonstration, a single feature (month) is used to predict car involvement
X = accident_df[['month']]  # Replace with your feature columns
y = accident_df['car']  # Replace with your target column

# Initialize GroupKFold with the number of splits
group_kfold = GroupKFold(n_splits=5)

# Use the unique block ID for grouping in cross-validation
groups = accident_df['block_id']

# Initialize your model
model = RandomForestRegressor(random_state=42)

# Perform cross-validation
predictions = cross_val_predict(model, X, y, groups=groups, cv=group_kfold)

# Calculate the Mean Absolute Error
mae = mean_absolute_error(y, predictions)
print(f'Mean Absolute Error: {mae}')

Mean Absolute Error: 0.3646883294652143
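
Since the grid approach depends on how evenly points fall into cells, a quick look at cell occupancy and at the cell-to-fold assignment can help. The sketch below uses uniformly sampled coordinates as an assumption (the `points` frame and the coordinate ranges are made up for illustration); on the real data, `accident_df['block_id'].value_counts()` gives the same occupancy view.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
points = pd.DataFrame({
    "longitude": rng.uniform(9.0, 11.5, size=1000),
    "latitude": rng.uniform(50.9, 54.6, size=1000),
})

# Bin both coordinates into equal-width intervals and combine them into a cell ID
n_blocks = 5
lon_block = pd.cut(points["longitude"], bins=n_blocks, labels=False)
lat_block = pd.cut(points["latitude"], bins=n_blocks, labels=False)
points["block_id"] = lon_block.astype(str) + "-" + lat_block.astype(str)

# Number of points per grid cell; very uneven counts hint at a poorly fitting grid
cell_counts = points["block_id"].value_counts()
print(cell_counts.describe())

# Record which test fold(s) each cell ends up in
fold_of_block = {}
splitter = GroupKFold(n_splits=5)
for fold, (_, test_idx) in enumerate(splitter.split(points, groups=points["block_id"])):
    for block in points["block_id"].iloc[test_idx].unique():
        fold_of_block.setdefault(block, set()).add(fold)

print(f"{len(fold_of_block)} cells distributed over 5 folds")
```

Each cell should appear in exactly one test fold. Note that cells sharing a fold are not necessarily adjacent, so a test fold can still border training cells.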
