## Use case: Separate pipelines for each input matrix

### Rationale

1. When multiple feature matrices are inputs to a single ML task, accuracy is sometimes better in practice if a separate model is trained on each feature matrix and inference is made by voting or model stacking (as opposed to concatenating all of the inputs into a single matrix).  This may be because overfitting errors from the individual models average out.  

2. When the computational cost of model training grows faster than linearly with the number of features (e.g. fully-connected neural nets), it can be much faster to train a model on each input matrix than to train one model on a single pooled feature matrix.   

3. The adage "garbage in, garbage out" is still true for machine learning models, even though they are good at minimizing noise.  Sometimes it's useful to winnow out the less predictive feature matrices in order to prevent overfitting.  This winnowing, called "feature matrix selection" here, is simpler when the matrices are kept separate.

4. Workflows that involve optimizing ML pipelines for a large number of different tasks at the same time can be faster if feature matrix selection is included in each pipeline rather than run as an independent screening step.  This might be considered semi-auto-ML.  
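
The first point can be illustrated with plain scikit-learn. The sketch below is only an assumption-laden toy, not pipecaster code: it cuts one synthetic matrix into column blocks that stand in for separate feature matrices, trains one KNN model per block, and combines them by soft voting (averaging class probabilities). A single model on the pooled matrix is fitted as a baseline. The block sizes, the choice of KNN, and all variable names are illustrative, and no particular ranking of the two accuracies is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n_matrices, n_features = 4, 10  # illustrative sizes, not taken from the notebook

# One wide synthetic matrix, cut into equally sized column blocks that stand in
# for separate feature matrices.
X, y = make_classification(n_samples=300, n_features=n_matrices * n_features,
                           n_informative=8, n_redundant=4, random_state=0)
Xs = [X[:, i * n_features:(i + 1) * n_features] for i in range(n_matrices)]

# Split row indices once so every matrix uses the same train/test samples.
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.25,
                                       stratify=y, random_state=0)

# Train one independent model per feature matrix.
models = [make_pipeline(StandardScaler(), KNeighborsClassifier())
          .fit(X_i[train_idx], y[train_idx]) for X_i in Xs]

# Soft voting: average the class probabilities of the per-matrix models.
probas = np.mean([m.predict_proba(X_i[test_idx]) for m, X_i in zip(models, Xs)],
                 axis=0)
classes = np.unique(y[train_idx])  # same sorted order as predict_proba columns
voted = classes[np.argmax(probas, axis=1)]
voting_accuracy = float(np.mean(voted == y[test_idx]))

# Baseline: a single model trained on the concatenated (pooled) matrix.
pooled = make_pipeline(StandardScaler(), KNeighborsClassifier())
pooled.fit(X[train_idx], y[train_idx])
pooled_accuracy = pooled.score(X[test_idx], y[test_idx])

print(f"soft voting: {voting_accuracy:.2f}, pooled: {pooled_accuracy:.2f}")
```

Which approach wins depends on the data, so the two numbers are best compared over several random seeds or with cross-validation.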

In [23]:
import pipecaster as pc
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

n_matrices = 100
n_features = 50

X, y = make_classification(n_classes = 2, 
                           n_samples = 200, 
                           n_features= n_features * n_matrices,
                           n_informative= 5 * n_matrices, 
                           n_redundant= 5 * n_matrices)

# split the wide matrix into n_matrices column blocks, one per input
Xs = []
for i in range(n_matrices):
    Xs.append(X[:, i*n_features : (i+1) * n_features])

X_trains, X_tests, y_train, y_test = pc.split(Xs, y, train_test_split)


In [1]:
import pipecaster as pc

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.metrics import roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

from sklearn.svm import SVC


# X_new = SelectKBest(chi2, k=2).fit_transform(X, y)

In [18]:
import numpy as np

X = np.random.rand(100,5)
y = np.random.choice(['a','b','c'], 100)

cls = KNeighborsClassifier()
cls.fit(X, y)
cls.predict_proba(X).shape

(100, 3)

In [20]:
cls.predict_proba(X).shape

(100, 3)

In [None]:

pl = pc.PipelineArray(inputs)
pl.add_layer([SimpleImputer() for i in range(5)] + [CountVectorizer()])
pl.add_layer([StandardScaler() for i in range(5)] + [TfidfTransformer()])
pl.add_layer(MatrixSelector(inputs = range(5))) # implicit: PassThrough(input = 6) 
pl.add_layer(SelectKBest(chi2))
pl.add_layer(Concatenator())
pl.split([(0, 3), (1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3)])
cls1 = CvEstimator(estimator = LogisticRegression(), cv = ShuffleSplit, scoring = roc_auc_score)
cls2 = CvEstimator(estimator = KNeighborsClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls3 = CvEstimator(estimator = RandomForestClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls4 = MultinomialNB()
cls5 = SGDClassifier()
pl.add_layer([cls1, cls2, cls3, cls1, cls2, cls3, cls1, cls2, cls3, 
              cls1, cls2, cls3, cls1, cls2, cls3, cls4, cls5])
pl.add_layer(PerformanceSelector(top_n = 2))
pl.add_layer()
cls6 = SVC()
pl.add_layer(cls6)



In [6]:
import numpy as np
x = np.array([[2],[2],None,None], dtype=object)

In [10]:
def f(x):
    x = np.arange(100,110)
    
x = np.arange(10)
f(x)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# architecture 1 (diverse)

In [None]:
pipe_arr = pc.PipelineArray(n_inputs=6)

layer1 = pipe_arr.next_layer() # get new layer array of length n_inputs, all initialized to PassThrough()
layer1[:6] = SimpleImputer() 
layer1[6] = CountVectorizer()

layer2 = pipe_arr.next_layer()
layer2[:6] = StandardScaler()
layer2[6] = TfidfTransformer()

layer3 = pipe_arr.next_layer()
layer3[:7] = pc.SelectKBestMatrices(scoring=roc_auc_score, aggregator='mean', k=3)

layer4 = pipe_arr.next_layer() # automatically add layer[:6] = PassThrough()
layer4[:] = SelectKBest(chi2, k = 100)

layer5 = pipe_arr.next_layer()
cls1 = CvEstimator(estimator = LogisticRegression(), cv = ShuffleSplit, scoring = roc_auc_score)
cls2 = CvEstimator(estimator = KNeighborsClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls3 = CvEstimator(estimator = RandomForestClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls4 = MultinomialNB()
cls5 = SGDClassifier()
layer5[:6] = [cls1, cls2, cls3] # create an ensemble
layer5[6] = [cls4, cls5] # create an ensemble

layer6 = pipe_arr.next_layer()
layer6[:6] = pc.SelectKBestPerformers(scoring=roc_auc_score, k=2)

layer7 = pipe_arr.next_layer()
layer7[:6] = pc.EnsembleConcatenator()

layer8 = pipe_arr.next_layer()
layer8[:7] = pc.PipelineConcatenator()

layer9 = pipe_arr.next_layer()
layer9[:] = SVC() # assign into the layer rather than rebinding the name

pipe_arr.fit(X_trains, y_train)
pipe_arr.predict(Xs)

# architecture 2 (nested)

In [None]:
from sklearn.decomposition import PCA
from sklearn.decomposition import FactorAnalysis

my_classifier = pc.PipelineArray(n_inputs=1)

my_classifier.add_layer(SimpleImputer())
my_classifier.add_layer(StandardScaler())
my_classifier.add_layer([PCA(n_components=10), 
                         FactorAnalysis(n_components=10),
                         SelectKBest(chi2, k = 100)])

# my_classifier.next_layer() =   (assignment left unfinished)

layer2 = my_classifier.next_layer()
layer2[:6] = StandardScaler()
layer2[6] = TfidfTransformer()

layer3 = my_classifier.next_layer()
layer3[:7] = pc.SelectKBestMatrices(scoring=roc_auc_score, aggregator='mean', k=3)

layer4 = pipe_arr.next_layer() # automatically add layer[:6] = PassThrough()
layer4[:] = SelectKBest(chi2, k = 100)

layer5 = pipe_arr.next_layer()
cls1 = CvEstimator(estimator = LogisticRegression(), cv = ShuffleSplit, scoring = roc_auc_score)
cls2 = CvEstimator(estimator = KNeighborsClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls3 = CvEstimator(estimator = RandomForestClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls4 = MultinomialNB()
cls5 = SGDClassifier()
layer5[:6] = [cls1, cls2, cls3] # create an ensemble
layer5[6] = [cls4, cls5] # create an ensemble

layer6 = pipe_arr.next_layer()
layer6[:6] = pc.SelectKBestPerformers(scoring=roc_auc_score, k=2)

layer7 = pipe_arr.next_layer()
layer7[:6] = pc.EnsembleConcatenator()

layer8 = pipe_arr.next_layer()
layer8[:7] = pc.PipelineConcatenator()

layer9 = pipe_arr.next_layer()
layer9[:] = SVC() # assign into the layer rather than rebinding the name

pipe_arr.fit(X_trains, y_train)
pipe_arr.predict(Xs)

# architecture 3 (diverse, no ensembling)

In [None]:
cls = pc.PipelineArray(n_inputs=6)

layer0 = cls.get_next_layer() # get new layer array of length n_inputs, all initialized to PassThrough()
layer0[:6] = SimpleImputer() 
layer0[6] = CountVectorizer()

layer1 = cls.get_next_layer()
layer1[:6] = StandardScaler()
layer1[6] = TfidfTransformer()

layer2 = cls.get_next_layer()
layer2[:7] = pc.SelectKBestMatrices(scoring=f_classif, aggregator='mean', k=3)

layer3 = cls.get_next_layer() # automatically add layer[:6] = PassThrough()
layer3[:] = SelectKBest(f_classif, k = 100)

layer4 = cls.get_next_layer()
cls1 = CvEstimator(estimator = LogisticRegression(), cv = ShuffleSplit, scoring = roc_auc_score)
cls2 = CvEstimator(estimator = KNeighborsClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls3 = CvEstimator(estimator = RandomForestClassifier(), cv = ShuffleSplit, scoring = roc_auc_score)
cls4 = MultinomialNB()
cls5 = SGDClassifier()
layer4[:6] = [cls1, cls2, cls3] # create an ensemble
layer4[6] = [cls4, cls5] # create an ensemble

layer5 = cls.get_next_layer()
layer5[:6] = pc.SelectKBestPerformers(scoring=roc_auc_score, k=2)

layer6 = cls.get_next_layer()
layer6[:6] = pc.EnsembleConcatenator()

layer7 = cls.get_next_layer()
layer7[:7] = pc.PipelineConcatenator()

layer8 = cls.get_next_layer()
layer8[:] = SVC() # assign into the layer rather than rebinding the name

cls.fit(X_trains, y_train)
cls.predict(Xs)

# architecture 4 (no storing of scores for future steps)

In this architecture, model accuracy scores are not cached or passed forward to allow selection at a subsequent layer, so model performance scoring and selection are done in a single step.
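
A minimal sketch of what "scoring and selection in a single step" could look like, written with plain scikit-learn. The helper name `select_k_best_predictors` and its parameters are hypothetical and are not pipecaster's API; the point is only that the cross-validation scores are computed and consumed inside one call, so nothing needs to be stored for a later layer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def select_k_best_predictors(predictors, X, y, k=2, cv=3, scoring="roc_auc"):
    """Hypothetical helper: score candidates by CV, refit and return the top k.

    The scores are used immediately for ranking and are only returned for
    inspection; no state is kept for a subsequent selection layer.
    """
    scores = [cross_val_score(clone(p), X, y, cv=cv, scoring=scoring).mean()
              for p in predictors]
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest mean scores
    fitted = [clone(predictors[i]).fit(X, y) for i in best]
    return fitted, [float(scores[i]) for i in best]


X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
candidates = [LogisticRegression(max_iter=1000),
              KNeighborsClassifier(),
              RandomForestClassifier(n_estimators=25, random_state=0)]

selected, selected_scores = select_k_best_predictors(candidates, X, y, k=2)
print([type(m).__name__ for m in selected], selected_scores)
```

The string scorer `"roc_auc"` is used here because it ranks on predicted probabilities or decision values rather than on hard labels.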

In [24]:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.metrics import make_scorer

from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.metrics import roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

from sklearn.svm import SVC

In [43]:
from sklearn.neighbors import KNeighborsRegressor


In [50]:
rgr = KNeighborsRegressor()
rgr.fit(np.arange(300).reshape(100,3), np.random.rand(100))
x = rgr.predict(np.arange(300).reshape(100,3))

In [53]:
len(x.shape)

1

In [54]:
y = x.reshape(-1,1)
len(y.shape)

2

In [37]:
clf.fit(np.arange(300).reshape(100,3), np.random.choice(['a','b','c'], size = 100))

KNeighborsClassifier()

In [38]:
clf.predict(np.arange(300).reshape(100,3)).shape

(100,)

In [42]:
clf.predict(np.arange(300).reshape(100,3))

array(['c', 'c', 'c', 'c', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b',
       'b', 'b', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'a', 'a',
       'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
       'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'b', 'c', 'c', 'c', 'b',
       'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'c', 'c', 'c', 'c',
       'c', 'c', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b',
       'b', 'b', 'b', 'b', 'b', 'c', 'b', 'a', 'c', 'a', 'a', 'a', 'b',
       'a', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='<U1')

In [39]:
clf.predict_proba(np.arange(300).reshape(100,3)).shape

(100, 3)

In [34]:
x = np.arange(100)
x.shape

(100,)

In [36]:
y = np.empty((100, 2))
y.shape

(100, 2)

In [48]:
predictions = clf.predict_proba(np.arange(9).reshape(3,3))

In [68]:
x = np.array([1,2,3])
y = list(x)

In [69]:
y

[1, 2, 3]

In [54]:
from sklearn.neighbors import KNeighborsRegressor
rgr = KNeighborsRegressor()

In [57]:
rgr.fit(np.arange(300).reshape(100,3), np.random.rand(100))

KNeighborsRegressor()

In [59]:
predictions = rgr.predict(np.arange(9).reshape(3,3))

In [60]:
predictions.shape

(3,)

In [19]:
x = set([1,2,4])
y = set([1,2,3,4])

In [24]:
x1 = np.arange(0,3)
x2 = np.arange(3,6)
Xs = np.array([x1,x2], dtype=object)

In [25]:
inputs = [0,1]

In [27]:
x = 1
len(x)

TypeError: object of type 'int' has no len()

In [None]:
Xs[inputs]

In [3]:
import importlib
importlib.reload(pc)

<module 'pipecaster' from '/Users/john/trading/src/pipecaster/pipecaster-repo/pipecaster/__init__.py'>

In [102]:
n_inputs = 6
set(range(n_inputs))

{0, 1, 2, 3, 4, 5}

In [12]:
def f(x):
    x[2] = 222

x = np.arange(9)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [13]:
f(x)
x

array([  0,   1, 222,   3,   4,   5,   6,   7,   8])

In [2]:
import pipecaster as pc

mcls = pc.MultiInputPipeline(n_inputs=6)

layer0 = mcls.get_next_layer() # get new layer array of length n_inputs, all initialized to PassThrough()
layer0[:5] = SimpleImputer() 
layer0[5] = CountVectorizer()

layer1 = mcls.get_next_layer()
layer1[:5] = StandardScaler()
layer1[5] = TfidfTransformer()

layer2 = mcls.get_next_layer() 
layer2[:] = SelectKBest(f_classif, k = 100)

layer3 = mcls.get_next_layer()
layer3[:5] = pc.SelectKBestInputs(score_func=f_classif, aggregator=np.sum, k=3)
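
For reference, one way the input selection step above could work is sketched here with scikit-learn only. This is an assumption about the idea (score every feature with a univariate test, aggregate the scores per matrix, keep the top `k` matrices) and not pipecaster's actual implementation; `select_k_best_inputs` is a hypothetical function name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def select_k_best_inputs(Xs, y, score_func=f_classif, aggregator=np.sum, k=3):
    """Hypothetical sketch: indices of the k matrices with the best aggregate score."""
    matrix_scores = []
    for X in Xs:
        feature_scores, _ = score_func(X, y)  # f_classif returns (F, p-values)
        matrix_scores.append(aggregator(feature_scores))
    ranked = np.argsort(matrix_scores)[::-1]  # best matrix first
    return sorted(ranked[:k].tolist())


# With shuffle=False the informative columns come first, so the first block
# carries the signal and the remaining blocks are noise.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=0)
Xs = [X[:, i * 10:(i + 1) * 10] for i in range(5)]

kept = select_k_best_inputs(Xs, y, k=3)
print(kept)
```

Swapping `aggregator` for `np.mean` or `np.max` changes how much a few strong features count relative to many weak ones.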



In [9]:
mcls.layers[3].pipe_list

[(<pipecaster.input_selection.SelectKBestInputs at 0x127553510>,
  array([0, 1, 2, 3, 4]))]

In [77]:
layer4 = mcls.get_next_layer()
predictors = [KNeighborsClassifier() for i in range(5)]
layer4[:5] = pc.SelectKBestPredictors(predictors=predictors, scoring=make_scorer(roc_auc_score), cv=3)
layer4[5] = MultinomialNB()

layer5 = mcls.get_next_layer()
layer5[:] = pc.MetaClassifier(SVC())

mcls.fit(X_trains, y_train)
mcls.predict(Xs)

AttributeError: module 'pipecaster' has no attribute 'SelectKBestPredictors'

# architecture 4 (all numerical for testing)

In [None]:
mclf = pc.MultiInputPipeline(n_inputs=6)

layer0 = mclf.get_next_layer() # get new layer array of length n_inputs, all initialized to PassThrough()
layer0[:5] = SimpleImputer() 
layer0[5] = CountVectorizer()

layer1 = mclf.get_next_layer()
layer1[:5] = StandardScaler()
layer1[5] = TfidfTransformer()

layer2 = mclf.get_next_layer() # automatically add layer[:6] = PassThrough()
layer2[:] = SelectKBest(f_classif, k = 100)

layer3 = mclf.get_next_layer()
layer3[:5] = pc.SelectKBestMatrices(scoring=f_classif, aggregator='sum', k=3)

layer4 = mclf.get_next_layer()
predictors = [KNeighborsClassifier() for i in range(5)]
layer4[:5] = pc.SelectKBestPredictors(predictors=predictors, scoring=make_scorer(roc_auc_score), cv=3)
layer4[5] = MultinomialNB()

layer5 = mclf.get_next_layer()
layer5[:] = pc.MetaClassifier(SVC())

mclf.fit(X_trains, y_train)
mclf.predict(Xs)

# sklearn column transformer

In [121]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.feature_selection import SelectKBest, f_classif, chi2

from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.metrics import roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier

# half-open slices so that no column is skipped between blocks
X_indices = [slice(0,100), slice(100,200), slice(200,300),
             slice(300,400), slice(400,500), slice(500,600)]

X_selectors = [ColumnTransformer([('X{}_selector'.format(i), 'passthrough', indices)])
               for i, indices in enumerate(X_indices)]

L0_classifiers = [Pipeline([('X{}_selector'.format(i), X_selector),
                      ('imputer', SimpleImputer()), 
                      ('scaler', StandardScaler()), 
                      ('kbest', SelectKBest(f_classif, k = 20)),
                      ('KNN', KNeighborsClassifier())])
              for i, X_selector in enumerate(X_selectors[:-1])]
    
cls = Pipeline([('X6_selector', X_selectors[5]),
                      ('count_vect', CountVectorizer()), 
                      ('tfid', TfidfTransformer()), 
                      ('kbest', SelectKBest(chi2, k = 20)),
                      ('naive_bayes', MultinomialNB())])

L0_classifiers.append(cls)

L0_classifier_list = [('cls_{}'.format(i), cls) for i, cls in enumerate(L0_classifiers)]

my_cls = StackingClassifier(
    estimators=L0_classifier_list, final_estimator=SVC())

Rules:

1. Transform/estimator/predictor units are referred to as pipes for brevity.

2. Each layer of the MultiPipeline has the same number of inputs, which are referenced by their array indices.

3. A pipe always outputs to the same set of indices that it uses as input.

4. Two pipes may not use the same input or the same output.
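
The index rules above can be made concrete with a small, purely illustrative data structure. The `Layer` class below is hypothetical and is not pipecaster's implementation: it resolves an int or slice key to input indices, rejects a pipe whose indices overlap an earlier pipe, and writes each pipe's outputs back to the same indices it read from. Pipes are modelled as plain callables that take a list of inputs and return a list of the same length.

```python
class Layer:
    """Hypothetical layer mapping disjoint sets of input indices to pipes."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.pipe_list = []  # (pipe, indices) pairs in assignment order

    def __setitem__(self, key, pipe):
        # Resolve an int or slice to the concrete input indices it covers.
        if isinstance(key, int):
            indices = [range(self.n_inputs)[key]]
        else:
            indices = list(range(self.n_inputs)[key])
        # Two pipes may not share an input (and therefore an output) index.
        used = {i for _, idx in self.pipe_list for i in idx}
        clash = used.intersection(indices)
        if clash:
            raise ValueError(f"inputs {sorted(clash)} already have a pipe")
        self.pipe_list.append((pipe, indices))

    def transform(self, Xs):
        out = list(Xs)  # indices without a pipe pass through unchanged
        for pipe, indices in self.pipe_list:
            results = pipe([Xs[i] for i in indices])
            for i, result in zip(indices, results):
                out[i] = result  # output goes to the index it came from
        return out


layer = Layer(n_inputs=3)
layer[:2] = lambda xs: [x * 10 for x in xs]
outputs = layer.transform([1, 2, 3])

try:
    layer[1:3] = lambda xs: xs  # index 1 is already taken
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

print(outputs, overlap_rejected)
```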


In [122]:
from xgboost.sklearn import XGBClassifier

cls = KNeighborsClassifier()
hasattr(cls, '__clone__')

False

In [126]:
import sklearn.base 
sklearn.base.clone(cls)

XGBRegressor()

In [70]:
n_inputs = 6

def get_slice_indices(slice_):
    return np.arange(6)[slice_]

In [71]:
slice_ = slice(1, 3, None)
get_slice_indices(slice_)


array([1, 2])

In [15]:
import numpy as np
mat1 = np.arange(9).reshape(3,3)
mat1

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [17]:
mat2 = 100 + mat1
mat2

array([[100, 101, 102],
       [103, 104, 105],
       [106, 107, 108]])

In [18]:
Xs = np.array([mat1, mat2], dtype = object)

In [19]:
Xs[0]

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]], dtype=object)

In [23]:
np.array(5)

array(5)

In [22]:
Xs.shape

(2, 3, 3)

In [100]:
import numpy as np
ls1 = [1,2,3]
ls2 = np.array(ls1)

(type(cls) == list or type(cls) == np.ndarray)

False

In [136]:
if not True and not False:
    print('yes')

In [102]:
if True and not False:
    print("here")

here


True

# sklearn test, numerical input only

In [72]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.metrics import make_scorer, roc_auc_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier

X_indices = [slice(0,100), slice(100,200), slice(200,300),
             slice(300,400), slice(400,500)]

X_selectors = [ColumnTransformer([('X{}_selector'.format(i), 'passthrough', indices)])
               for i, indices in enumerate(X_indices)]

L0_classifiers = [Pipeline([('X{}_selector'.format(i), X_selector),
                      ('imputer', SimpleImputer()), 
                      ('scaler', StandardScaler()), 
                      ('kbest', SelectKBest(f_classif, k = 20)),
                      ('KNN', KNeighborsClassifier())])
              for i, X_selector in enumerate(X_selectors)]
    
L0_classifier_list = [('cls_{}'.format(i), cls) for i, cls in enumerate(L0_classifiers)]
    

my_cls = StackingClassifier(
    estimators=L0_classifier_list, final_estimator=SVC())

In [78]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=500, n_informative=100, n_redundant=10)

ctl_cls = KNeighborsClassifier()
cross_val_score(ctl_cls, X, y, cv = 3, scoring = make_scorer(roc_auc_score))

array([0.64071856, 0.69365486, 0.62160017])

In [79]:
cross_val_score(my_cls, X, y, cv = 3, scoring = make_scorer(roc_auc_score))

array([0.68862275, 0.70267297, 0.69071496])

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [61]:
X_train

array([[-5.24655213e-01,  7.73561543e-03, -2.03583515e-01, ...,
         1.85729466e+01, -3.85624280e-01,  2.42433946e-01],
       [ 3.58305801e-01, -8.65509276e-01, -4.85829099e-01, ...,
        -1.53111323e+01,  2.21342723e-01, -1.87359820e-02],
       [ 1.69172439e+00, -4.42358621e-01,  1.09250570e+00, ...,
        -2.49917111e+01, -1.04821565e+00, -9.89248874e-01],
       ...,
       [ 6.31516738e-01, -9.30686358e-02, -1.86402307e-01, ...,
        -1.08200163e+01,  1.55595662e+00,  1.19938704e+00],
       [ 1.19997306e+00,  1.46976992e-01, -1.88428710e+00, ...,
        -5.03093792e+01,  7.14112674e-02, -1.36471996e+00],
       [-5.70524600e-02,  3.00984653e+00, -7.30875298e-01, ...,
        -2.35482832e+01,  5.76960915e-01,  8.55264803e-01]])

In [67]:
X_train = X_selectors[0].fit_transform(X_train, y_train)

In [70]:
L0_classifiers[0].fit_transform(X_train, y_train)

ValueError: Input X must be non-negative.

In [27]:
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
pipe.score(X_test, y_test)

0.88

## Use case: Matrix ensembles

## Use case: Matrix ensemble stacks

## Use case: Matrix selection

## Use case: Transformer Ensembles

## Use case: Transformer Selection

## Use case: Semi-supervised learning

## Use case: Pipeline Blueprints