A data scientist expects to be able to reuse a fitted model later, applying it to new data.

In [1]:
# import packages we are going to use
from typing import Iterable
import pandas as pd
import pickle
import sklearn.pipeline


In [2]:
# load model
with open('model.pkl', 'rb') as of:
    model = pickle.load(
        file=of
    )

vars(model)

{'fit_intercept': True,
 'normalize': 'deprecated',
 'copy_X': True,
 'n_jobs': None,
 'positive': False,
 'feature_names_in_': array(['Midparent'], dtype=object),
 'n_features_in_': 1,
 'coef_': array([0.63835181]),
 '_residues': 3543.175570052536,
 'rank_': 1,
 'singular_': array([48.75686489]),
 'intercept_': 24.526949000046912}

In [3]:
# example application data
d_app = pd.DataFrame({
    'Midparent': [60, 70, 65]
})

d_app

   Midparent
0         60
1         70
2         65


In [4]:
model.predict(d_app)


array([62.82805735, 69.21157541, 66.01981638])
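For a single-feature linear model, the prediction is just `intercept_ + coef_[0] * Midparent`. As a quick sanity check, the sketch below recomputes the predictions by hand from the values printed by `vars(model)` above. The variable names are illustrative, and because the printed coefficient is rounded, the results should agree with `model.predict` only to about six decimal places.

```python
# values copied from the vars(model) printout above (coef_ is rounded there)
coef = 0.63835181
intercept = 24.526949000046912

midparent = [60, 70, 65]

# linear model: prediction = intercept + coef * x
manual_predictions = [intercept + coef * x for x in midparent]

for x, p in zip(midparent, manual_predictions):
    print(f'Midparent={x} -> predicted {p:.4f}')
```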

sklearn doesn't currently use column names, but may be moving toward doing so. Most systems (SQL, Spark, Excel, R) *have* column names, so most data scientists expect them to be supported.

In [5]:
# example of the FutureWarning issued when column names are double-checked

d_wrong = pd.DataFrame({
    'MidparenzZZ': [60, 70, 65]
})

d_wrong

   MidparenzZZ
0           60
1           70
2           65


In [6]:
model.predict(d_wrong)

Feature names unseen at fit time:
- MidparenzZZ
Feature names seen at fit time, yet now missing:
- Midparent



array([62.82805735, 69.21157541, 66.01981638])
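Note that the prediction above still went through despite the wrong column name. If a hard failure is preferred over a warning, one option is to compare the incoming columns against the names recorded at fit time before calling `predict`. The following is a minimal sketch of that idea, not part of sklearn: `check_feature_names` is a hypothetical helper, and `expected_names` stands in for `model.feature_names_in_` so the example runs without the pickled model.

```python
import pandas as pd


def check_feature_names(X: pd.DataFrame, expected) -> None:
    """Raise ValueError unless X's columns match `expected` exactly, in order."""
    expected = [str(c) for c in expected]
    found = [str(c) for c in X.columns]
    if found != expected:
        unseen = [c for c in found if c not in expected]
        missing = [c for c in expected if c not in found]
        raise ValueError(f'column mismatch: unseen={unseen}, missing={missing}')


# assumption: stands in for model.feature_names_in_ from the fitted model
expected_names = ['Midparent']

d_wrong = pd.DataFrame({'MidparenzZZ': [60, 70, 65]})

try:
    check_feature_names(d_wrong, expected_names)
    check_result = 'ok'
except ValueError as ex:
    check_result = f'caught {ex}'

print(check_result)
```

With the real model, the call would be `check_feature_names(d_wrong, model.feature_names_in_)` just before `model.predict(d_wrong)`.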

In [7]:
# example of brittleness of model to column structure

d_very_wrong = pd.DataFrame({
    'Midparent': [60, 70, 65],
    'extra_column': [1, 2, 3]
})

d_very_wrong

   Midparent  extra_column
0         60             1
1         70             2
2         65             3


In [8]:

try:
    model.predict(d_very_wrong)
except Exception as ex:
    print(f'caught {ex}')

caught X has 2 features, but LinearRegression is expecting 1 features as input.


Feature names unseen at fit time:
- extra_column
Feature names must be in the same order as they were in fit.



Because of this operational brittleness, it can make sense to wrap some smarts into a model.

In [9]:
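class ColumnSelect:
    """Pipeline step that keeps only the named columns of a data frame."""
    # https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
    def __init__(self, *, cols: Iterable[str]):
        assert not isinstance(cols, str)  # common error: a single name instead of a list
        self.cols = list(cols)  # clean copy

    def fit(self, X, y=None):
        # nothing to learn; present to satisfy the Pipeline step interface
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        # select the expected columns, in the expected order, dropping any extras
        return X.loc[:, self.cols]

# wrap the already-fitted model so column selection happens before prediction
model_2 = sklearn.pipeline.Pipeline([
    ('restrict_columns', ColumnSelect(cols=['Midparent'])),
    ('underlying_model', model),
])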

In [10]:
# avoids both the warning and the error seen with the unwrapped model
model_2.predict(d_very_wrong)

array([62.82805735, 69.21157541, 66.01981638])

Many sklearn data scientists expect to use some form of pipelining. They may want to wrap the column names into the model interface at the training step.
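As a rough sketch of what wrapping the column names in at training time could look like, the example below fits the column selector and the regression together as one pipeline. Everything here is illustrative: `ColumnSelectSketch` is a hypothetical variant of the selector that inherits from sklearn's base classes, `model_3` is a new name, and the training data is made up (it is not the data behind `model.pkl`) so that the fitted line is exactly `Child = 30 + 0.5 * Midparent`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ColumnSelectSketch(TransformerMixin, BaseEstimator):
    """Keep only the named columns (hypothetical stand-in for ColumnSelect)."""

    def __init__(self, cols):
        self.cols = cols  # stored unmodified, following sklearn estimator conventions

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.loc[:, list(self.cols)]


# made-up training data (NOT the data behind model.pkl): Child = 30 + 0.5 * Midparent
d_train = pd.DataFrame({
    'Midparent': [64.0, 66.0, 68.0, 70.0],
    'extra_column': [1, 2, 3, 4],
    'Child': [62.0, 63.0, 64.0, 65.0],
})

model_3 = Pipeline([
    ('restrict_columns', ColumnSelectSketch(cols=['Midparent'])),
    ('underlying_model', LinearRegression()),
])

# the selector runs first, so the extra columns in d_train never reach the regression
model_3.fit(d_train, d_train['Child'])

# new data with an extra column, in a different column order
d_new = pd.DataFrame({
    'extra_column': [1, 2, 3],
    'Midparent': [60, 70, 65],
})
preds = model_3.predict(d_new)
print(preds)
```

Because the column list is part of the fitted pipeline, whoever applies `model_3` later does not need to know which columns the underlying model was trained on.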