# Question: How should I transform multiple key/value columns in a scikit-learn pipeline?

See http://stackoverflow.com/questions/31749812/how-should-i-transform-multiple-key-value-columns-in-a-scikit-learn-pipeline/

Input data:

In [1]:
import pandas as pd

D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])
print(D)

  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3


This is the type of output data that is required:

In [2]:
from sklearn.feature_extraction import DictVectorizer

row1 = {'a':1, 'b':2}
row2 = {'b':2, 'c':3}
data = [row1, row2]
print(data)

DictVectorizer( sparse=False ).fit_transform(data)

[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]


array([[ 1.,  2.,  0.],
       [ 0.,  2.,  3.]])

# Solution

Courtesy of [Mike](http://stackoverflow.com/users/2055368/mike) (http://stackoverflow.com/a/31752733/1185562), extended here into a general pipeline transformer.

Here is the transformer:

In [13]:
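from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion

class KVExtractor(TransformerMixin, BaseEstimator):
    """Turn each DataFrame row into a dict built from (key column, value column) pairs."""

    def __init__(self, kvpairs):
        # Stored under the same name as the argument so get_params() and clone() work.
        self.kvpairs = kvpairs

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        result = []
        for _, rowdata in X.iterrows():
            # Map the key found in each key column to the value in its paired column.
            rowdict = {rowdata[k]: rowdata[v] for k, v in self.kvpairs}
            result.append(rowdict)
        return result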

Let's try it out:

In [14]:
kvpairs = [ ['k1', 'v1'], ['k2', 'v2'] ]
KVExtractor( kvpairs ).transform(D)

[{'a': 1, 'b': 2}, {'b': 2, 'c': 3}]
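
One behaviour worth knowing about: each row becomes a plain dict, so if two key columns hold the same key in one row, the later pair overwrites the earlier one. The sketch below reproduces the row-to-dict step as a standalone helper to show this. The names `rows_to_dicts` and `dup` are made up for illustration and are not part of the original answer.

```python
import pandas as pd


def rows_to_dicts(frame, kvpairs):
    # Same row-to-dict logic as the transformer above, as a plain function.
    return [{row[k]: row[v] for k, v in kvpairs} for _, row in frame.iterrows()]


kvpairs = [['k1', 'v1'], ['k2', 'v2']]

# Hypothetical row in which both key columns contain 'a'.
dup = pd.DataFrame([['a', 1, 'a', 5]], columns=['k1', 'v1', 'k2', 'v2'])

dicts = rows_to_dicts(dup, kvpairs)
# Only one entry survives for 'a', holding the value from the last pair (v2).
print(dicts)
```

If repeated keys can occur in your data, you would probably want to decide explicitly how to combine them (for example by summing) before vectorizing.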

Now try it out in a pipeline with `DictVectorizer`:

In [15]:
pipeline = Pipeline([
    ('kv', KVExtractor(kvpairs)),
    ('dv', DictVectorizer(sparse=False)),
])
print(D)
A = pipeline.fit_transform(D)
print(A.shape)
print(A)

  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3
(2, 3)
[[ 1.  2.  0.]
 [ 0.  2.  3.]]


Try a new key without refitting:

In [16]:
D['k2'] = ['x', 'c']
print(D)
print(pipeline.transform(D))

  k1  v1 k2  v2
0  a   1  x   2
1  b   2  c   3
[[ 1.  0.  0.]
 [ 0.  2.  3.]]


Perfect! The unseen key `x` is silently ignored, and the output keeps the same three columns.
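
To check which output column belongs to which key, and to see how keys that were absent during fitting are handled, here is a small sketch that uses `DictVectorizer` on its own. It relies on the fitted `feature_names_` attribute. The variable names `dv` and `unseen` are only illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
dv.fit([{'a': 1, 'b': 2}, {'b': 2, 'c': 3}])

# Column order of the output matrix: one column per key seen during fit.
print(dv.feature_names_)

# 'x' was never seen during fit, so it is dropped instead of raising an error.
unseen = dv.transform([{'a': 1, 'x': 2}])
print(unseen)
```

This matches the first output row above: only the value for `a` is kept, and the columns for `b` and `c` are zero.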

In [17]:
pipeline.inverse_transform(A)

AttributeError: 'KVExtractor' object has no attribute 'inverse_transform'
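
The error arises because `Pipeline.inverse_transform` needs every step to provide its own `inverse_transform`. `DictVectorizer` has one, but the custom extractor does not. Below is a minimal sketch of one way to add it, written as a separate, hypothetical class `InvertibleKVExtractor` so that it runs on its own. The inverse is only a best effort: the original column placement of each key is not recoverable, so this sketch assumes keys are assigned to the column pairs in sorted order. Note also that `DictVectorizer.inverse_transform` omits zero entries and returns values as floats.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class InvertibleKVExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical variant of the extractor with a best-effort inverse."""

    def __init__(self, kvpairs):
        self.kvpairs = kvpairs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [
            {row[k]: row[v] for k, v in self.kvpairs}
            for _, row in X.iterrows()
        ]

    def inverse_transform(self, dicts):
        # Assumption: keys go back into the column pairs in sorted order,
        # because the original placement is lost in the forward transform.
        columns = [name for pair in self.kvpairs for name in pair]
        n_pairs = len(self.kvpairs)
        rows = []
        for d in dicts:
            items = sorted(d.items())[:n_pairs]
            # Pad rows that have fewer keys than column pairs.
            items += [(None, None)] * (n_pairs - len(items))
            rows.append([cell for item in items for cell in item])
        return pd.DataFrame(rows, columns=columns)


D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]],
                 columns=['k1', 'v1', 'k2', 'v2'])
kvpairs = [['k1', 'v1'], ['k2', 'v2']]

pipeline = Pipeline([
    ('kv', InvertibleKVExtractor(kvpairs)),
    ('dv', DictVectorizer(sparse=False)),
])
A = pipeline.fit_transform(D)
restored = pipeline.inverse_transform(A)
print(restored)
```

For the example data the round trip recovers the original keys and values, with the values as floats. That will not hold in general, for instance when a row's keys were not in sorted order across the columns or when a value was zero.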