# Model

In this notebook we’ll show some modeling examples with UrbanMapper.

In [1]:
import urban_mapper as um
from urban_mapper.pipeline import UrbanPipeline

# Start UrbanMapper
mapper = um.UrbanMapper()

## Preparing data

First, let’s grab some taxi data and set up a street intersections layer for Downtown Brooklyn.

Note that:

- Loader example can be seen in `examples/1-Per-Module/1-loader.ipynb`
- Urban Layer example can be seen in `examples/1-Per-Module/2-urban_layer.ipynb`
- Imputer example can be seen in `examples/1-Per-Module/3-imputer.ipynb`
- Filter example can be seen in `examples/1-Per-Module/4-filter.ipynb`
- Enricher example can be seen in `examples/1-Per-Module/5-enricher.ipynb`

In [3]:
# Layer for Downtown Brooklyn
urban_layer = (
    mapper.urban_layer.with_type("streets_intersections")
    .from_place("Downtown, Brooklyn, New York, USA", network_type="drive")
    .with_mapping(longitude_column="pickup_longitude", latitude_column="pickup_latitude", output_column="pickup_segment")
    .build()
)

#Taxi data
loader = (
  mapper
  .loader
  .from_huggingface("oscur/taxisvis1M", number_of_rows=25000)
  .with_columns("pickup_longitude", "pickup_latitude")
  .build()
)

#Imputer
imputer = (
  mapper
  .imputer
  .with_type("SimpleGeoImputer")
  .on_columns("pickup_longitude", "pickup_latitude")
  .build()
)

#Filter
filter = mapper.filter.with_type("BoundingBoxFilter").build()

#Some enrichments used as information for prediction
enricher1 = mapper.enricher.with_data(group_by="pickup_segment").count_by(output_column="pickup_count").build()
enricher2 = mapper.enricher.with_data(group_by="pickup_segment", values_from="total_amount").aggregate_by(method="sum", output_column="sum_amount").build()
enricher3 = mapper.enricher.with_data(group_by="pickup_segment", values_from="trip_distance").aggregate_by(method="mean", output_column="avg_distance").build()
enricher4 = mapper.enricher.with_data(group_by="pickup_segment", values_from="passenger_count").aggregate_by(method="mean", output_column="passenger_avg").build()
enricher5 = mapper.enricher.with_data(group_by="pickup_segment", values_from="payment_type").aggregate_by(method="mode", output_column="payment_type_mode").build()

pipe = UrbanPipeline([
  ("urban_layer", urban_layer),    
  ("loader", loader),
  ("imputer", imputer),
  ("filter", filter),            
  ("enrich1", enricher1),    
  ("enrich2", enricher2),
  ("enrich3", enricher3),    
  ("enrich4", enricher4),    
  ("enrich5", enricher5),    
])

_, enriched_layer = pipe.compose_transform()

🗺️ Successfully composed pipeline with 9 steps!


## Model configuration

### 1. Clustering

The user can create a very basic model without defining any model, target, or feature transformation.

In this scenario, `UrbanMapper` runs a default scikit-learn `KMeans` clustering.

The data features are scaled with a scikit-learn `StandardScaler` by default.


In [4]:
## Default clustering method: KMeans
model = (
    mapper
    .model
    .build()
)
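
For intuition, the default configuration is roughly comparable to chaining a `StandardScaler` and a `KMeans` by hand in scikit-learn. The sketch below is only an illustration under stated assumptions, not UrbanMapper's internal code: the toy feature matrix, `n_clusters=3`, `n_init=10`, and `random_state=0` are all made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for enriched layer columns
# (hypothetical values, e.g. a pickup count and a summed amount).
features = np.array([
    [0.0, 0.0],
    [1.0, 6.8],
    [0.0, 0.0],
    [5.0, 42.0],
    [4.0, 39.5],
    [1.0, 7.2],
])

# Scale first, then cluster. The KMeans arguments are illustrative
# choices, not UrbanMapper defaults.
clusterer = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
cluster_predicted = clusterer.fit_predict(features)
```

Scaling before clustering matters because `KMeans` relies on Euclidean distances, so an unscaled column with a large range would dominate the result.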

#### 1.2 Simple configurations

Some features can be mapped to another domain, for example a categorical variable to a one-hot encoding, as the following example does for `payment_type_mode`.

`UrbanMapper` uses the `.with_transform` method to set a `feature_mapper`, which is a scikit-learn transformer.

Features can also be excluded from scaling with `ignore_feature_on_scaler`, as the following example does for the one-hot columns derived from `payment_type_mode`.

The `fit_predict` method adds two data columns: `subset` (filled with `train`) and `cluster_predicted`.


In [5]:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MyFeatureMapping(TransformerMixin, BaseEstimator):
    def __init__(self, map_payment=True):
        # Flag that lets the same class serve several examples
        self.map_payment = map_payment

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Work on a copy so the caller's dataframe is not modified
        X = X.copy()

        # Map integer to float
        X["street_count"] = X["street_count"].astype(float)

        # Map categorical to one-hot encoding
        if self.map_payment and "payment_type_mode" in X:
            X = pd.get_dummies(X, columns=["payment_type_mode"], dtype=int)

        # Discard the id feature
        X = X.drop(["osmid"], axis=1)

        return X

    def __sklearn_tags__(self):
        # Declare that this transformer accepts NaN values in its input
        tags = super().__sklearn_tags__()
        tags.input_tags.allow_nan = True

        return tags

model = (
    mapper
    .model
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: KMeans


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,cluster_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,train,5
1,42464823,40.692170,-73.989126,traffic_signals,4,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,0
2,42464824,40.691802,-73.988213,traffic_signals,4,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,2
3,42464827,40.691455,-73.987339,traffic_signals,4,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,2
4,42464832,40.690663,-73.985353,traffic_signals,3,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,train,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,0
73,10001058440,40.694936,-73.988697,,3,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,0
74,11381945329,40.695257,-73.988681,,3,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,4
75,11381945330,40.696111,-73.988586,,4,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,0
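
The names passed to `ignore_feature_on_scaler` above are the columns that `pd.get_dummies` derives from `payment_type_mode` inside `MyFeatureMapping`. A small standalone sketch, using a toy frame whose values are invented for illustration, shows how those names arise:

```python
import pandas as pd

# Toy stand-in for the enriched layer (hypothetical values).
toy = pd.DataFrame({
    "pickup_count": [0.0, 1.0, 3.0],
    "payment_type_mode": [0, 2, 1],
})

# Same call as in MyFeatureMapping.transform: one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["payment_type_mode"], dtype=int)
print(list(encoded.columns))
# ['pickup_count', 'payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
```

Note that `pd.get_dummies` only creates a column for categories that actually occur in the data, so the list of generated names depends on the dataset.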


### 2. Classification

The user can define a target column, which tells `UrbanMapper` to perform a supervised task.

In this scenario, `UrbanMapper` runs a default scikit-learn `SVC` (SVM) classifier if the target column values are integers.

The `fit_predict` method adds two data columns: `subset` (filled with `train` or `test`) and `<target_column>_predicted`.

Note: see the `Split data subsets` section for how rows are assigned to subsets.

In [6]:
model = (
    mapper
    .model
    .with_columns(target_column="payment_type_mode")
    .with_transform(
      feature_mapper=MyFeatureMapping(False),
    )
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: SVC


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,payment_type_mode_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,0
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,0
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,0
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,0
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,0
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,0
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,2
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,0


### 3. Regression

`UrbanMapper` runs a default regressor if the `target column` values are `floats`.

#### 3.1 Regression with longitude/latitude information

The default regressor depends on the data and configuration. If a user defines `longitude` and `latitude` columns, `UrbanMapper` creates a `RegressionKriging` model from the `pykrige` library.

`UrbanMapper` also rescales features before fitting the model. However, some features, such as categorical ones, should not be rescaled. The user can pass a list of those feature names in the `ignore_feature_on_scaler` argument of `with_transform`, as seen in the next example.

Note: in the example, the columns listed in `ignore_feature_on_scaler` are created by `feature_mapper`.

In [7]:
model = (
    mapper
    .model
    .with_columns(target_column="pickup_count", longitude_column="x", latitude_column="y")
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )    
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: Kriging_Adapter
Finished learning regression model
Finished kriging residuals


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,1.133477e-01
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,5.551115e-17
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,5.551115e-17
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,5.551115e-17
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,-6.262990e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,5.551115e-17
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,5.551115e-17
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,1.000000e+00
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,5.551115e-17


#### 3.2 Regression with geometry information

If a user defines a `geometry` column, `UrbanMapper` creates an `ML_Lag` model from the `spreg` library (part of the `pysal` library).

Note: if the user defines no `geometry` column but the `GeoDataFrame` has an active geometry, `UrbanMapper` detects and uses it.

In [8]:
model = (
    mapper
    .model
    .with_columns(target_column="pickup_count", geometry_column="geometry")
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )    
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: ML_Lag_Adapter



Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,0.174323
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,0.200967
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,0.189335
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,0.219941
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,0.188698
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,0.164875
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,0.155067
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,1.349206
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,0.185912


#### 3.3 Simple regression

If no `longitude` and `latitude` or `geometry` columns are defined and none can be inferred, `UrbanMapper` creates an `SVR` regressor from `scikit-learn`.

In [9]:
model = (
    mapper
    .model
    .with_columns(target_column="pickup_count")
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )    
    .build()
)    

data_without_geometry = enriched_layer.layer.drop("geometry", axis = 1)

predicted_data = model.fit_predict(data_without_geometry)
predicted_data

Autoconfig model: SVR


Unnamed: 0,osmid,y,x,highway,street_count,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,0.0,0.00,0.00,0.0,0,test,0.045643
1,42464823,40.692170,-73.989126,traffic_signals,4.0,0.0,0.00,0.00,0.0,0,train,-0.041621
2,42464824,40.691802,-73.988213,traffic_signals,4.0,0.0,0.00,0.00,0.0,0,train,-0.050006
3,42464827,40.691455,-73.987339,traffic_signals,4.0,0.0,0.00,0.00,0.0,0,train,-0.049528
4,42464832,40.690663,-73.985353,traffic_signals,3.0,0.0,0.00,0.00,0.0,0,test,-0.045645
...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,0.0,0.00,0.00,0.0,0,train,0.014826
73,10001058440,40.694936,-73.988697,,3.0,0.0,0.00,0.00,0.0,0,train,0.009642
74,11381945329,40.695257,-73.988681,,3.0,1.0,6.80,0.84,1.0,2,train,0.939635
75,11381945330,40.696111,-73.988586,,4.0,0.0,0.00,0.00,0.0,0,train,0.039793
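
The default choices described in sections 1 to 3 can be summarised as a small decision function. This is a hypothetical sketch of the rules as stated in this notebook, not UrbanMapper's actual selection code: the function name, its arguments, and the string labels are invented, and giving longitude/latitude precedence over geometry when both are available is an assumption based on the kriging example, where the layer also had an active geometry.

```python
def pick_default_model(target_dtype=None, has_long_lat=False, has_geometry=False):
    """Return the default model this notebook reports for a configuration.

    target_dtype is None (no target), "int", or "float" (hypothetical labels).
    """
    if target_dtype is None:
        return "KMeans"  # no target column: clustering
    if target_dtype == "int":
        return "SVC"  # integer target: classification
    if target_dtype == "float":
        if has_long_lat:
            return "RegressionKriging"  # pykrige (section 3.1)
        if has_geometry:
            return "ML_Lag"  # spreg (section 3.2)
        return "SVR"  # scikit-learn (section 3.3)
    raise ValueError(f"unsupported target dtype: {target_dtype!r}")


print(pick_default_model())                              # KMeans
print(pick_default_model("int"))                         # SVC
print(pick_default_model("float", has_long_lat=True))    # RegressionKriging
print(pick_default_model("float", has_geometry=True))    # ML_Lag
print(pick_default_model("float"))                       # SVR
```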


### 4. Other transformations

The `UrbanMapper` model applies the transformations defined by `with_transform` in the following order:

1. Extract the target from the dataframe (only for classification and regression).
2. Extract longitude-latitude or geometry from the dataframe. If no column is configured, the active geometry name of the `GeoDataFrame` is used.
3. Remove data rows based on target outliers (user-defined, only for regression).
4. Map features to different domains, for example from categorical to one-hot encoding or from integer to float (user-defined, no default action).
5. Select features (user-defined; only numeric columns are selected by default). If `False`, no selection is applied.
6. Scale features (user-defined; `scikit-learn` `StandardScaler()` by default).
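
To make the order concrete, the six steps can be walked through with plain pandas on a toy frame. This is a hedged sketch, not UrbanMapper's implementation: the `prepare` function, the toy values, the fixed upper threshold, and the choice to drop the coordinate columns from the features are all assumptions made for illustration.

```python
import pandas as pd


def prepare(frame, target_column, upper=500.0):
    """Hypothetical walk-through of the six steps with plain pandas."""
    data = frame.copy()
    # 1. Extract the target.
    target = data.pop(target_column)
    # 2. Extract coordinates (this sketch also removes them from the features).
    long_lat = data[["x", "y"]]
    data = data.drop(columns=["x", "y"])
    # 3. Remove rows whose target exceeds the threshold.
    keep = target <= upper
    data, target, long_lat = data[keep], target[keep], long_lat[keep]
    # 4. Map features: one-hot encode the categorical column.
    data = pd.get_dummies(data, columns=["payment_type_mode"], dtype=int)
    # 5. Select features: keep numeric columns only.
    data = data.select_dtypes(include="number").copy()
    # 6. Scale features, skipping the one-hot columns.
    ignored = [c for c in data.columns if c.startswith("payment_type_mode_")]
    scaled = [c for c in data.columns if c not in ignored]
    spread = data[scaled].std(ddof=0).replace(0, 1)  # avoid dividing by zero
    data[scaled] = (data[scaled] - data[scaled].mean()) / spread
    return data, target, long_lat


toy = pd.DataFrame({
    "x": [-73.98, -73.99, -73.97, -73.96],
    "y": [40.69, 40.70, 40.68, 40.71],
    "highway": ["crossing", None, "traffic_signals", None],
    "street_count": [4, 3, 3, 4],
    "payment_type_mode": [0, 2, 0, 1],
    "pickup_count": [0.0, 1.0, 900.0, 2.0],
})
features, target, long_lat = prepare(toy, "pickup_count")
```

In this toy run the row with `pickup_count` of 900 is dropped in step 3, the text column `highway` is dropped in step 5, and only `street_count` is standardised in step 6.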

In [10]:
from sklearn.preprocessing import MinMaxScaler
from urban_mapper.modules.model import ThresholdOutlierDetector

## Defining transformations
model = (
    mapper
    .model
    .with_columns(target_column="pickup_count", geometry_column="geometry")
    .with_transform(feature_selector = None, ## default considers only numbers
                    feature_scaler = MinMaxScaler(),   ## default StandardScaler()
                    feature_mapper=MyFeatureMapping(),
                    ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2'],

                    ## applied only on REGRESSION tasks
                    target_filter = ThresholdOutlierDetector(upper = 500),  ## default utils.IQROutlierDetector()
                    target_scaler = MinMaxScaler(),  ## default StandardScaler()

                    ## applied only on CLASSIFICATION tasks
                    target_encoder = None, ## default LabelEncoder()
    ).build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: ML_Lag_Adapter


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,0.239459
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,0.390781
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,0.729762
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,0.407984
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,-0.076704
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,0.125015
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,0.232288
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,1.005123
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,0.215582


### 5. Split data subsets (classification and regression)

For supervised tasks, the `fit_predict` method splits the dataset into training, validation, and test subsets.

By default, it uses a 90:10 train-test split, but the user can define other proportions in the `with_data` method.

The `fit_predict` method adds a `subset` column to the input dataframe that records the subset of each data row.
Rows removed by the `target_filter` transformation have no `subset` value.
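
As a rough illustration of how a `subset` label could be derived from the proportions, the following standard-library sketch shuffles row indices and labels them. It is not UrbanMapper's splitting code: the `assign_subsets` helper, its rounding rule, and the seed are assumptions.

```python
import random


def assign_subsets(n_rows, test_size=0.1, validation_size=0.0, seed=0):
    """Label each row as train, validation, or test (illustrative only)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_test = round(n_rows * test_size)
    n_validation = round(n_rows * validation_size)

    # Every row starts as train; shuffled slices are relabelled.
    labels = ["train"] * n_rows
    for i in indices[:n_test]:
        labels[i] = "test"
    for i in indices[n_test:n_test + n_validation]:
        labels[i] = "validation"
    return labels


labels = assign_subsets(76, test_size=0.15)
print(labels.count("train"), labels.count("test"))  # 65 11
```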

In [11]:
model = (
    mapper
    .model
    .with_columns(target_column="pickup_count", geometry_column="geometry")
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )    
    .with_data(train_size = None, 
               validation_size = None,   
               test_size = 0.15,  ## default 0.1    
              )               
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: ML_Lag_Adapter



Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,0.678192
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,0.700739
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,0.686134
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,0.759735
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,0.699743
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,0.636697
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,0.615623
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,test,1.748098
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,0.713157


### 5. User-defined model

A user can tell `UrbanMapper` to build or use other models.

#### 5.1 Scikit-learn model

A user can pass a `scikit-learn` model name to `UrbanMapper` through `with_model`.

In [12]:
model = (
    mapper
    .model
    .with_model("RandomForestRegressor")
    .with_columns(target_column="pickup_count")
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )    
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: RandomForestRegressor


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,-2.775558e-16
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,-2.775558e-16
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,-2.775558e-16
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,-2.775558e-16
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,-2.775558e-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,train,-2.775558e-16
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,train,-2.775558e-16
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,9.900000e-01
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,-2.775558e-16


#### 5.2 Scikit-learn pipeline

The user can also pass a `scikit-learn` `Pipeline`. In this case, the `with_transform` and `with_data` configurations are ignored and only the pipeline steps are used.

In addition, only the `target_column` argument of `with_columns` is used.

In this setup, `mapper.model` works as an adapter between the pipeline and the `UrbanMapper` structure.

In [13]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer, make_column_selector

pipeline = Pipeline([
    ("feature_scaler", ColumnTransformer([
       ("selector", MinMaxScaler(), make_column_selector(dtype_include="number"))
    ])),
    ("model", TransformedTargetRegressor(regressor=SVR(), transformer=MinMaxScaler()))
  ])

model = (
    mapper
    .model
    .with_model(pipeline)
    .with_columns(target_column="pickup_count")
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: Pipeline


array([0.37467886, 0.32810396, 0.31686961, 0.31163368, 0.2329393 ,
       0.23454541, 0.24112805, 0.24625684, 0.34794565, 0.2915289 ,
       0.3987682 , 0.36242601, 1.20060403, 0.24343717, 0.32819167,
       0.25455861, 0.23643548, 0.22754328, 0.29193626, 0.35170453,
       2.79761015, 1.39850512, 0.37562345, 0.33930322, 1.39989569,
       1.59958445, 0.38777655, 0.21979631, 0.31385172, 0.32981985,
       0.35220256, 0.36293478, 0.39985744, 0.27428738, 0.27154486,
       0.40163331, 0.37056001, 0.34137689, 0.31135606, 1.40081604,
       0.40018571, 0.39991128, 0.32131885, 0.37798575, 1.40049508,
       0.35678073, 0.3954483 , 1.74087216, 0.28217062, 0.38175462,
       0.25950547, 0.3082262 , 1.37559435, 0.28694253, 0.3114434 ,
       0.29165038, 0.28848524, 2.26310401, 2.11663513, 1.40002671,
       0.29824376, 1.14845   , 0.20973427, 0.28024289, 2.59938559,
       0.35801441, 3.60005099, 3.00206506, 0.39939089, 1.81096922,
       0.35038853, 1.40163748, 0.31550449, 0.32499205, 1.40087

#### 5.3 Custom clusterer, classifier, or regressor

For more flexibility, a user can define a custom model by subclassing `CustomModel_Adapter` from `UrbanMapper` together with a `scikit-learn` `RegressorMixin`, `ClassifierMixin`, or `ClusterMixin`.

The `fit` method then receives additional information in `kwargs`, such as `long_lat` or `geometry`, depending on the configuration, as seen in the previous examples.

If the user defines a `validation_size` in the `with_data` method, `kwargs` also contains `validation`, `validation_target`, `validation_long_lat`, and `validation_geometry`.
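
Because these entries are optional, a custom `fit` may want to read them defensively. The helper below is a hypothetical sketch (the function name and the toy values are invented) that only relies on the key names listed above:

```python
SPATIAL_KEYS = ("long_lat", "geometry")
VALIDATION_KEYS = (
    "validation",
    "validation_target",
    "validation_long_lat",
    "validation_geometry",
)


def split_fit_kwargs(**kwargs):
    """Separate the optional spatial and validation inputs passed to fit."""
    spatial = {key: kwargs[key] for key in SPATIAL_KEYS if key in kwargs}
    validation = {key: kwargs[key] for key in VALIDATION_KEYS if key in kwargs}
    return spatial, validation


# Toy call mimicking a configuration with coordinates and a validation subset.
spatial, validation = split_fit_kwargs(
    long_lat=[(-73.98, 40.69)],
    validation=[[1.0, 0.5]],
    validation_target=[0.0],
)
print(sorted(spatial))     # ['long_lat']
print(sorted(validation))  # ['validation', 'validation_target']
```

Inside a custom `fit(self, X, y, **kwargs)`, the same pattern lets the model skip the validation pass when no `validation_size` was configured.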

In [14]:
## Testing a PyTorch MLP regressor

import torch
import torch.nn as nn
import numpy as np
import tqdm

from torch.utils.data import DataLoader

from urban_mapper.modules.model import CustomModel_Adapter
from sklearn.base import RegressorMixin

## An adapter that owns and trains a PyTorch MLP and calls it for predictions
class MyRegressor(RegressorMixin, CustomModel_Adapter):
  def __init__(self):
    self.model = None
    self.epochs = 100
    self.batch_size = 25
    self.loss_func = torch.nn.MSELoss() 
    self.optimizer = None

  def fit(self, X, y, **kwargs):
    if self.model is None:
      self.model = SimpleMLP(num_feat=X.shape[1], num_output=1)
      self.model.to("cpu")

    train_loader = DataLoader(MyDataset(X, y), batch_size=self.batch_size, shuffle= True, drop_last = False) 
    self.optimizer = torch.optim.Adam(self.model.parameters(), lr=0.0001, weight_decay=0.0)

    progress = tqdm.tqdm(total = self.epochs, desc = "Epoch ", bar_format='{desc}|{bar:50}| {n_fmt}/{total_fmt} [{elapsed}<{remaining} - {rate_fmt}]{postfix}')
    train_loss = []
    validation_loss = []

    for epoch in range(1, self.epochs + 1):
      self.model.train()

      batch_train_loss = 0

      ##Running model over the training subset 
      for X_train, y_train in train_loader:
        y_train_pred = self.model(X_train)

        loss = self.loss_func(y_train_pred, y_train.float().to(y_train_pred.device))
        batch_train_loss += loss.detach().cpu().item()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

      train_loss.append(batch_train_loss / len(train_loader))

      ##Running model over the validation subset
      with torch.no_grad():
        if "validation" in kwargs:
          self.model.eval()
          validation = MyDataset(kwargs["validation"], kwargs["validation_target"]) 

          y_val_pred = self.model(validation.data)

          loss = self.loss_func(y_val_pred, validation.label.float().to(y_val_pred.device))
          validation_loss.append(loss.detach().cpu().item())

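      progress.update(1)
      ## Report the validation loss only when a validation subset exists
      postfix = {"loss": np.mean(train_loss)}
      if validation_loss:
        postfix["val loss"] = np.mean(validation_loss)
      progress.set_postfix(postfix)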

    progress.close()

    return self

  def predict(self, X, **kwargs):
    if self.model is not None:
      X = torch.from_numpy(np.array(X)).to(torch.float32)
      return self.model(X).detach().cpu().numpy()

## A PyTorch MLP model
class SimpleMLP(nn.Module):
  def __init__(self, num_feat, num_output):
    super(SimpleMLP, self).__init__()

    self.projection_size = 128

    self._device = nn.Parameter(torch.empty(0))
    self.hidden_layers = nn.ModuleList([
      nn.Linear(num_feat, 128),
      nn.Linear(128, 256),
      nn.Linear(256, self.projection_size),
      nn.Dropout(0.5)
    ])      
    self.head = nn.Linear(self.projection_size, num_output)

  def forward(self, X):
    x = X.to(self._device.device)

    for layer in self.hidden_layers:
      x = layer(x)

      if isinstance(layer, nn.Linear):
        x = x.relu()

    out = self.head(x)
    return out.sigmoid()  

## A simple PyTorch dataset helper
class MyDataset(torch.utils.data.Dataset):
  def __init__(self, x, y):
    self.data = torch.from_numpy(np.array(x)).to(torch.float32)
    self.label = torch.from_numpy(np.array(y)).to(torch.float32).reshape(-1, 1)    

  def __getitem__(self, index):    
    return self.data[index], self.label[index]
  
  def __len__(self):
    return len(self.data)  

In [15]:
model = (
    mapper
    .model
    .with_model(MyRegressor())
    .with_columns(target_column="pickup_count", longitude_column="x", latitude_column="y")
    .with_transform(
      feature_mapper=MyFeatureMapping(),
      ignore_feature_on_scaler=['payment_type_mode_0', 'payment_type_mode_1', 'payment_type_mode_2']
    )
    .with_data(validation_size = 0.1)
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: MyRegressor


Epoch |██████████████████████████████████████████████████| 100/100 [00:02<00:00 - 40.49it/s], loss=0.697, val loss=0.0556


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode,subset,pickup_count_predicted
0,42464631,40.692056,-73.982623,traffic_signals,4.0,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0,test,0.332759
1,42464823,40.692170,-73.989126,traffic_signals,4.0,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0,train,0.332759
2,42464824,40.691802,-73.988213,traffic_signals,4.0,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0,train,0.332759
3,42464827,40.691455,-73.987339,traffic_signals,4.0,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0,train,0.332759
4,42464832,40.690663,-73.985353,traffic_signals,3.0,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0,test,0.333898
...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3.0,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0,validation,0.333898
73,10001058440,40.694936,-73.988697,,3.0,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0,validation,0.333898
74,11381945329,40.695257,-73.988681,,3.0,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2,train,0.884081
75,11381945330,40.696111,-73.988586,,4.0,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0,train,0.332759


#### 5.4 Beyond clustering, classification, or regression

A user can subclass `CustomModel_Adapter` without any `scikit-learn` mixin.

As with the `scikit-learn` `Pipeline`, the `with_transform` and `with_data` configurations are ignored.

From `with_columns`, only the `target_column` argument is used, if it is defined.

The implemented class must handle preprocessing, data splitting, and model fitting in its `fit` method, as well as prediction in its `predict` method.

In this setup, `mapper.model` works as an adapter between the custom class and the `UrbanMapper` structure.

In [16]:
from urban_mapper.modules.model import CustomModel_Adapter

class GeneralModel(CustomModel_Adapter):
  def __init__(self):
    pass

  def fit(self, X, y, **kwargs):
    return self

  def predict(self, X, **kwargs):
    return X

  def load(self, path):
    pass

In [17]:
model = (
    mapper
    .model
    .with_model(GeneralModel())
    .with_columns(target_column="pickup_count")
    .build()
)    

predicted_data = model.fit_predict(enriched_layer.layer)
predicted_data

Autoconfig model: GeneralModel


Unnamed: 0,osmid,y,x,highway,street_count,geometry,pickup_count,sum_amount,avg_distance,passenger_avg,payment_type_mode
0,42464631,40.692056,-73.982623,traffic_signals,4,POINT (-73.98262 40.69206),0.0,0.00,0.00,0.0,0
1,42464823,40.692170,-73.989126,traffic_signals,4,POINT (-73.98913 40.69217),0.0,0.00,0.00,0.0,0
2,42464824,40.691802,-73.988213,traffic_signals,4,POINT (-73.98821 40.6918),0.0,0.00,0.00,0.0,0
3,42464827,40.691455,-73.987339,traffic_signals,4,POINT (-73.98734 40.69145),0.0,0.00,0.00,0.0,0
4,42464832,40.690663,-73.985353,traffic_signals,3,POINT (-73.98535 40.69066),0.0,0.00,0.00,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...
72,9664385497,40.694956,-73.988947,crossing,3,POINT (-73.98895 40.69496),0.0,0.00,0.00,0.0,0
73,10001058440,40.694936,-73.988697,,3,POINT (-73.9887 40.69494),0.0,0.00,0.00,0.0,0
74,11381945329,40.695257,-73.988681,,3,POINT (-73.98868 40.69526),1.0,6.80,0.84,1.0,2
75,11381945330,40.696111,-73.988586,,4,POINT (-73.98859 40.69611),0.0,0.00,0.00,0.0,0
