### The problem

If we want to use a custom transformer in a ```sklearn.pipeline``` that is tuned with ```GridSearchCV```, the transformer needs to support ```get_params``` and ```set_params```. These are usually inherited from ```BaseEstimator```.
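As a minimal sketch of what these two methods do, the toy estimator below inherits them from ```BaseEstimator``` (```Scaler``` and its ```factor``` parameter are hypothetical names, used only for illustration). ```get_params``` reads the parameter names from the ```__init__``` signature, and ```clone```, which ```GridSearchCV``` relies on to build fresh copies of the estimator, rebuilds the object from them.

```python
from sklearn.base import BaseEstimator, clone


class Scaler(BaseEstimator):
    """Toy estimator (hypothetical) with one explicitly named parameter."""

    def __init__(self, factor=1.0):
        # Store the argument under the same name, unmodified.
        self.factor = factor


scaler = Scaler(factor=2.0)
params = scaler.get_params()      # names come from the __init__ signature
scaler.set_params(factor=3.0)     # what a parameter search does for each candidate
scaler_copy = clone(scaler)       # new, unfitted object with the same parameters

print(params)
print(scaler_copy.factor)
```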

For instance, here is our custom ```TextPipelineArrayFeaturizer``` which takes a list of functions and applies each function to some text data.

In [1]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder, LabelBinarizer
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate, KFold
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd  # needed for the dummy DataFrame built below

from nlpfunctions.utils import *
from nlpfunctions.basicnlp import *
from nlpfunctions.nlppipelineutils import *


class TextPipelineArrayFeaturizer(BaseEstimator, TransformerMixin):
    """
    
    A function Transformer that takes a list of functions, calls each function with 
    our text (X as list of strings), and returns the results of all functions as a feature vector as np.array

    INPUT: Takes a list of functions, calls each function with our text (X as list of strings)
    OUTPUT: np.array

    Ref: https://dreisbach.us/articles/building-scikit-learn-compatible-transformers/
    
    """

    def __init__(self, *featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """Given a list of original data, return a 2D array with one feature vector per datum."""
        fvs = []
        for datum in X:
            fv = [f(datum) for f in self.featurizers]
            fvs.append(fv)
        return np.array(fvs)




Some functions to pass to our custom text featurizer:

In [2]:
count_adj = combine_functions(sent_tokenise
                                       ,word_tokenise
                                       ,POS_tagging
                                       ,lambda x: count_pos(x, pos_to_cnt='J', normalise=True)
                                       )   

VDR_score = combine_functions(sent_tokenise
                                       ,get_sentiment_score_VDR
                                       ,np.nanmean
                                       )   

In [3]:
pipe1_text_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['text']))
        
        ,('transformer', Series2ListOfStrings())
        
        ,('text_featurizer', TextPipelineArrayFeaturizer(count_adj, VDR_score))
        
        ,('scaler', StandardScaler())
        
        ])

Dummy text data:

In [5]:
text_df = pd.DataFrame(['I hate cheese!!! But it is soft and gentle and that makes it hard to accept.', 
                         'You should regret all opportunities you let go...',
                         'They do not know where they are going. and it is weirdly comforting!',
                         'Women need men like fish needs a bicycle!!',
                         'Brilliant',
                         'This is a favourie habits of yours: being late.',
                         'I appreciate the comment but I will do what I thought instead',
                         'Not exactly what I had expected but it was brilliant!',
                         'There are several reasons. But I am going to tell her only one. Why? Because she deserves more.'], columns =['text'])

text_df['score'] = pd.Series([0,0,1,1,1,0,0,1,0])

text_df = text_df.reindex()

It seems to be working...

In [6]:
pipe1_text_features.fit_transform(text_df)

array([[ 1.68759859, -0.83011603],
       [-0.82703507, -0.92156291],
       [-0.82703507, -0.36312976],
       [-0.82703507,  0.04092928],
       [-0.82703507,  1.42440829],
       [ 1.38584255, -0.73167724],
       [-0.82703507,  0.05730507],
       [ 0.17881839,  2.05552055],
       [ 0.88291582, -0.73167724]])

But we get an error when our pipeline includes hyperparameter tuning via ```GridSearchCV```, because ```BaseEstimator``` does not support ```*args``` in an estimator's ```__init__``` (and the scikit-learn convention also excludes ```**kwargs```): parameters must be explicitly defined.

**RuntimeError**: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class '__main__.TextPipelineArrayFeaturizer'> with constructor (self, *featurizers) doesn't  follow this convention.
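
The check behind this error can be approximated with the standard library. The sketch below is a simplified stand-in, not scikit-learn's actual code: ```explicit_param_names``` and the two toy classes are hypothetical names. It inspects the ```__init__``` signature and refuses ```*args```, because positional varargs have no names that ```get_params``` could report.

```python
import inspect


def explicit_param_names(cls):
    """Return the named __init__ parameters of cls, rejecting *args.

    Simplified stand-in for the kind of introspection scikit-learn performs.
    """
    signature = inspect.signature(cls.__init__)
    params = [p for p in signature.parameters.values() if p.name != "self"]
    for p in params:
        if p.kind == inspect.Parameter.VAR_POSITIONAL:
            raise RuntimeError(
                f"{cls.__name__}.__init__ uses *{p.name}; parameters must be named"
            )
    # **kwargs is skipped here; only explicitly named parameters are reported.
    return sorted(p.name for p in params if p.kind != inspect.Parameter.VAR_KEYWORD)


class VarArgsFeaturizer:
    def __init__(self, *featurizers):
        self.featurizers = featurizers


class ExplicitFeaturizer:
    def __init__(self, func=None):
        self.func = func


print(explicit_param_names(ExplicitFeaturizer))

try:
    explicit_param_names(VarArgsFeaturizer)
except RuntimeError as err:
    print("rejected:", err)
```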

In [7]:
from sklearn.svm import SVC

svm = SVC(probability=True)


pipe1_text_clf = Pipeline([
    
    ('text_features', pipe1_text_features),
    
    ('classifier', svm)
    
])

In [8]:
# Define the parameter grid to search over
parameters = dict(classifier__C=[0.1, 1, 10])


# Initiate non_nested parameter search via GridSearch 
pipe1_text_clf_cv = GridSearchCV(estimator=pipe1_text_clf,
                              param_grid=parameters,
                              cv=3,
                              return_train_score=True,
                              scoring='accuracy'    
                              )

In [9]:
X = text_df[['text']]
y = text_df['score'].values

In [10]:
pipe1_text_clf_cv.fit(X, y)

RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class '__main__.TextPipelineArrayFeaturizer'> with constructor (self, *featurizers) doesn't  follow this convention.

**Solution 1**

We can create a ```TextPipelineArrayOneFeaturizer``` that only accepts one function featurizer as input. We then combine several featurizers using ```FeatureUnion```.

**The negative side of this approach is that it adds complexity to already-complex pipelines by including an additional FeatureUnion.** This makes it harder to follow what's going on.
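
The idea can be tried without the ```nlpfunctions``` helpers. In the rough sketch below, ```OneFeaturizer```, ```n_chars``` and ```n_exclamations``` are hypothetical stand-ins for the real transformer and featurizers. Each transformer returns a single column, and ```FeatureUnion``` stacks the columns side by side.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class OneFeaturizer(BaseEstimator, TransformerMixin):
    """Applies a single function to every text and returns one column."""

    def __init__(self, func=None):
        self.func = func

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        values = [self.func(datum) for datum in X]
        return np.array(values, dtype=float).reshape(-1, 1)


def n_chars(text):
    """Hypothetical featurizer: number of characters."""
    return len(text)


def n_exclamations(text):
    """Hypothetical featurizer: number of exclamation marks."""
    return text.count("!")


union = FeatureUnion([
    ("n_chars", OneFeaturizer(n_chars)),
    ("n_exclamations", OneFeaturizer(n_exclamations)),
])

features = union.fit_transform(["Brilliant", "I hate cheese!!!"])
print(features.shape)  # one row per text, one column per featurizer
```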

In [11]:
class TextPipelineArrayOneFeaturizer(BaseEstimator, TransformerMixin):
    """
    
    A function Transformer that takes a function and calls it with 
    our text (X as list of strings), and returns the results as a feature vector as np.array

    INPUT: a function to call with our text (X as list of strings)
    OUTPUT: np.array

    Ref: https://dreisbach.us/articles/building-scikit-learn-compatible-transformers/
    
    """

    def __init__(self, func=None):
        self.func = func
        
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """Given a list of original data, return an array of features."""
        
        fv = [self.func(datum) for datum in X]
        return np.array(fv).astype(float).reshape(-1, 1)


In [12]:
pipe2_text_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['text']))
        
        ,('transformer', Series2ListOfStrings())
        
        ,('text_features', FeatureUnion(
                
                transformer_list = [
                        ('1', TextPipelineArrayOneFeaturizer(count_adj)),
                        ('2', TextPipelineArrayOneFeaturizer(VDR_score))
                        ],
                     
                ))
        
        ,('scaler', StandardScaler())
        
        ])

In [13]:
pipe2_text_features.fit_transform(text_df)

array([[ 1.68759859, -0.83011603],
       [-0.82703507, -0.92156291],
       [-0.82703507, -0.36312976],
       [-0.82703507,  0.04092928],
       [-0.82703507,  1.42440829],
       [ 1.38584255, -0.73167724],
       [-0.82703507,  0.05730507],
       [ 0.17881839,  2.05552055],
       [ 0.88291582, -0.73167724]])

It works within ```GridSearch```:

In [14]:
pipe2_text_clf = Pipeline([
    
    ('text_features', pipe2_text_features),
    
    ('classifier', svm)
    
])

pipe2_text_clf_cv = GridSearchCV(estimator=pipe2_text_clf,
                              param_grid=parameters,
                              cv=3,
                              return_train_score=True,
                              scoring='accuracy'    
                              )

In [15]:
pipe2_text_clf_cv.fit(X, y)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('text_features', Pipeline(memory=None,
     steps=[('selector', ColumnSelector(columns=['text'])), ('transformer', Series2ListOfStrings()), ('text_features', FeatureUnion(n_jobs=1,
       transformer_list=[('1', TextPipelineArrayOneFeaturizer(func=<function combine_2fs.<locals>.<lambda> at 0...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'classifier__C': [0.1, 1, 10]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring='accuracy', verbose=0)

**Solution 2**

We can create a ```TextPipelineArrayMultiFeaturizer``` that accepts a limited number of functions as input, each as its own explicitly named parameter (here I chose 10, but we could increase it).

The positive aspect is that it keeps the pipeline code rather clean, as we do not need to add an additional ```FeatureUnion``` level.

The negative aspect is the fixed upper limit on the number of function featurizers that can be applied.
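
One possible variation, which is not one of the two solutions in this notebook and is offered only as an untested-in-this-pipeline assumption, is to pass all functions through a single explicitly named parameter holding a tuple. The names ```SequenceFeaturizer```, ```n_chars``` and ```n_exclamations``` below are hypothetical. The signature stays free of ```*args```, so ```get_params``` and ```clone``` should still work, and there is no fixed upper limit.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class SequenceFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: one named parameter holding a tuple of functions."""

    def __init__(self, featurizers=()):
        # Store the argument untouched: clone() expects __init__
        # not to modify or copy its parameters.
        self.featurizers = featurizers

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = [[f(datum) for f in self.featurizers] for datum in X]
        return np.array(rows, dtype=float)


def n_chars(text):
    """Hypothetical featurizer: number of characters."""
    return len(text)


def n_exclamations(text):
    """Hypothetical featurizer: number of exclamation marks."""
    return text.count("!")


featurizer = SequenceFeaturizer(featurizers=(n_chars, n_exclamations))
cloned = clone(featurizer)  # roughly what GridSearchCV does before fitting
features = cloned.transform(["Brilliant", "I hate cheese!!!"])
print(features)
```

The call site changes slightly: the functions are wrapped in a tuple and passed by keyword instead of being listed as separate positional arguments.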

In [16]:
class TextPipelineArrayMultiFeaturizer(BaseEstimator, TransformerMixin):
    """
    
    A function Transformer that takes a list of (maximum 10) functions, calls each function with 
    our text (X as list of strings), and returns the results of all functions as a feature vector as np.array

    INPUT: Takes a list of maximum 10 functions, calls each function with our text (X as list of strings)
    OUTPUT: np.array

    Ref: https://dreisbach.us/articles/building-scikit-learn-compatible-transformers/
    
    """

    def __init__(self, 
                 featurizer1 = None, 
                 featurizer2 = None,
                 featurizer3 = None,
                 featurizer4 = None,
                 featurizer5 = None,
                 featurizer6 = None,
                 featurizer7 = None,
                 featurizer8 = None,
                 featurizer9 = None,
                 featurizer10 = None):
        self.featurizer1 = featurizer1 
        self.featurizer2 = featurizer2
        self.featurizer3 = featurizer3 
        self.featurizer4 = featurizer4
        self.featurizer5 = featurizer5
        self.featurizer6 = featurizer6 
        self.featurizer7 = featurizer7
        self.featurizer8 = featurizer8
        self.featurizer9 = featurizer9
        self.featurizer10 = featurizer10
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """Given a list of original data, return a 2D array with one feature vector per datum."""
        featurizers = [self.featurizer1, self.featurizer2, self.featurizer3, self.featurizer4, self.featurizer5,
                      self.featurizer6, self.featurizer7, self.featurizer8, self.featurizer9, self.featurizer10]
        fvs = []
        for datum in X:
            fv = [f(datum) for f in featurizers if f is not None]
            fvs.append(fv)
        return np.array(fvs)

In [17]:
pipe3_text_features = Pipeline([
        
        ('selector', ColumnSelector(columns=['text']))
        
        ,('transformer', Series2ListOfStrings())
        
        ,('text_features', TextPipelineArrayMultiFeaturizer(count_adj, VDR_score))
               
        ,('scaler', StandardScaler())
        
        ])


pipe3_text_features.fit_transform(text_df)

array([[ 1.68759859, -0.83011603],
       [-0.82703507, -0.92156291],
       [-0.82703507, -0.36312976],
       [-0.82703507,  0.04092928],
       [-0.82703507,  1.42440829],
       [ 1.38584255, -0.73167724],
       [-0.82703507,  0.05730507],
       [ 0.17881839,  2.05552055],
       [ 0.88291582, -0.73167724]])

This one also works with ```GridSearch```:

In [18]:
pipe3_text_clf = Pipeline([
    
    ('text_features', pipe3_text_features),
    
    ('classifier', svm)
    
])

pipe3_text_clf_cv = GridSearchCV(estimator=pipe3_text_clf,
                              param_grid=parameters,
                              cv=3,
                              return_train_score=True,
                              scoring='accuracy'    
                              )

In [19]:
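# Note: this fits the plain pipeline; call pipe3_text_clf_cv.fit(X, y) to run the grid search itself.
pipe3_text_clf.fit(X, y)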

Pipeline(memory=None,
     steps=[('text_features', Pipeline(memory=None,
     steps=[('selector', ColumnSelector(columns=['text'])), ('transformer', Series2ListOfStrings()), ('text_features', TextPipelineArrayMultiFeaturizer(featurizer1=<function combine_2fs.<locals>.<lambda> at 0x1a0ff12d08>,
                 featurizer10=N...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])