
# Column transformers: `sklearn.compose.ColumnTransformer`

`ColumnTransformer` applies different transformations to different columns of the data. It can be placed inside a `Pipeline`, which guards against data leakage and lets the transformer parameters be tuned together with the rest of the model.

- `ColumnTransformer` works on **arrays**, **sparse matrices**, and **pandas DataFrames**.
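
As a minimal sketch of the `Pipeline` usage mentioned above, the snippet below wraps a `ColumnTransformer` and a classifier in one pipeline, so the preprocessing is only ever fitted on the data passed to `fit`. The toy data, the step names `prep` and `clf`, and the choice of `LogisticRegression` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'city': ['London', 'London', 'Paris', 'Paris'],
    'text': ['good book', 'bad book', 'good read', 'bad read'],
})
toy_y = [1, 0, 1, 0]

pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('city', OneHotEncoder(handle_unknown='ignore'), ['city']),
        ('text', CountVectorizer(), 'text'),
    ])),
    ('clf', LogisticRegression()),
])

# nested parameters are addressed as <step>__<transformer>__<parameter>
pipe.set_params(prep__text__ngram_range=(1, 2))

# the encoders and the classifier are fitted together on the same rows
pipe.fit(toy, toy_y)
preds = pipe.predict(toy)
```

The same double-underscore names can be used as keys of a parameter grid when the pipeline is passed to a search such as `GridSearchCV`.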



In [8]:
import pandas as pd
import numpy as np
import sklearn

In [9]:
df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

In [21]:
df

| | city | title | expert_rating | user_rating |
|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 |
| 1 | London | How Watson Learned the Trick | 3 | 5 |
| 2 | Paris | A Moveable Feast | 4 | 4 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 |


In this dataframe, the `city` column can be treated as a categorical variable.

The `title` column is a natural fit for a bag-of-words representation.

`sklearn.compose.ColumnTransformer` can process each of these columns differently and concatenate the results into a single output array.

In [53]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

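column_trans = ColumnTransformer(
    [# a list of column names selects 2D data, which OneHotEncoder expects
     ('city_category', OneHotEncoder(dtype='int'), ['city']),
     # a single column name selects 1D data, which CountVectorizer expects
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop',
    sparse_threshold=0.3)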

In [54]:
column_trans.transformers

[('city_category',
  OneHotEncoder(categorical_features=None, categories=None, drop=None,
                dtype='int', handle_unknown='error', n_values=None, sparse=True),
  ['city']),
 ('title_bow',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                  lowercase=True, max_df=1.0, max_features=None, min_df=1,
                  ngram_range=(1, 1), preprocessor=None, stop_words=None,
                  strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, vocabulary=None),
  'title')]

The `column_trans` object defined above has two parts.

- `city_category` is a `OneHotEncoder` that takes the `city` column and one-hot encodes it.


- `title_bow` is a `CountVectorizer` that takes the `title` column and builds a "bag of words" (BoW) vector from it.

Both transformations are applied when `column_trans.transform` is called on a fitted transformer.
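
To illustrate the split between fitting and transforming, here is a small sketch: `fit` learns the city categories and the title vocabulary, and `transform` then encodes any dataframe with the same columns using what was learned. The dataframe `new_df` is a hypothetical unseen row added for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')

# fit learns the categories of `city` and the vocabulary of `title`
column_trans.fit(df)

# hypothetical new row: a known city and a title built from known words
new_df = pd.DataFrame(
    {'city': ['Paris'],
     'title': ['The Last Feast'],
     'expert_rating': [4],
     'user_rating': [4]})

# transform reuses the fitted encoders, so the column layout is unchanged
X_new = column_trans.transform(new_df)
print(X_new.shape)
```

Note that with the default `handle_unknown='error'`, a city that was not seen during `fit` would raise an error at transform time, whereas `CountVectorizer` silently ignores unseen words.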



In [55]:
column_trans.fit(df)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('city_category',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype='int',
                                               handle_unknown='error',
                                               n_values=None, sparse=True),
                                 ['city']),
                                ('title_bow',
                                 CountVectorizer(analyzer='word', binary=False,
                                                 decode_error='strict',
                                                 dtype=<class 'numpy.int64'>,
                                                 encoding='utf-8',
                                                 input='content',
                          

Notice that the output of the transform is a sparse matrix:

In [56]:
X = column_trans.transform(df)
X

<4x16 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

This happens because `sparse_threshold` is set to `0.3` (also its default value), and the stacked output has a density of 18/64 ≈ 0.28, which is below that threshold.

```
sparse_threshold : float, default = 0.3
 |      If the output of the different transformers contains sparse matrices,
 |      these will be stacked as a sparse matrix if the overall density is
 |      lower than this value. Use ``sparse_threshold=0`` to always return
 |      dense.  When the transformed output consists of all dense data, the
 |      stacked result will be dense, and this keyword will be ignored
 ```

In [58]:
column_trans.sparse_threshold, column_trans.sparse_output_

(0.3, True)
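
The effect of the threshold can be checked with a short sketch that fits the same transformer twice, once with `sparse_threshold=0.3` and once with `sparse_threshold=0`. The helper `make_column_trans` is a hypothetical convenience function, not part of scikit-learn.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})


def make_column_trans(threshold):
    """Hypothetical helper: same transformers, different sparse_threshold."""
    return ColumnTransformer(
        [('city_category', OneHotEncoder(dtype='int'), ['city']),
         ('title_bow', CountVectorizer(), 'title')],
        remainder='drop',
        sparse_threshold=threshold)


X_sparse = make_column_trans(0.3).fit_transform(df)
X_dense = make_column_trans(0).fit_transform(df)

# density = stored elements / total number of cells
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])

print(sparse.issparse(X_sparse), density)  # density is below 0.3
print(type(X_dense))                       # dense NumPy array
```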

The fitted transformers can be accessed by name with `.named_transformers_`.

In [66]:
column_trans.named_transformers_

{'city_category': OneHotEncoder(categorical_features=None, categories=None, drop=None,
               dtype='int', handle_unknown='error', n_values=None, sparse=True),
 'title_bow': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=1.0, max_features=None, min_df=1,
                 ngram_range=(1, 1), preprocessor=None, stop_words=None,
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None),
 'remainder': 'drop'}
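
One use of `.named_transformers_` is to inspect what each transformer learned during `fit`. The sketch below reads the `categories_` attribute of the fitted `OneHotEncoder` and the `vocabulary_` attribute of the fitted `CountVectorizer`. The variable names `city_encoder` and `title_vectorizer` are introduced here for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(df)

# look up each fitted transformer by the name given in the constructor
city_encoder = column_trans.named_transformers_['city_category']
title_vectorizer = column_trans.named_transformers_['title_bow']

# categories learned for the `city` column
print(city_encoder.categories_)

# mapping from each word to its column index in the bag-of-words block
print(title_vectorizer.vocabulary_)
```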

The names of the features created by the transformers can be accessed with `.get_feature_names()`.

In [69]:
column_trans.get_feature_names()

['city_category__x0_London',
 'city_category__x0_Paris',
 'city_category__x0_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']
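
The output above comes from an older scikit-learn release. In scikit-learn 1.0, `get_feature_names()` was deprecated in favour of `get_feature_names_out()`, and it was later removed. The sketch below uses whichever method is available; the helper name `feature_names` is hypothetical. The one-hot feature names may also differ between versions (for example `x0_London` versus `city_London`).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
column_trans.fit(df)


def feature_names(fitted_transformer):
    """Hypothetical helper: return output feature names as a list."""
    if hasattr(fitted_transformer, 'get_feature_names_out'):
        # newer scikit-learn releases
        return list(fitted_transformer.get_feature_names_out())
    # older scikit-learn releases
    return list(fitted_transformer.get_feature_names())


names = feature_names(column_trans)
print(names)
```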

In [70]:
df

| | city | title | expert_rating | user_rating |
|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 |
| 1 | London | How Watson Learned the Trick | 3 | 5 |
| 2 | Paris | A Moveable Feast | 4 | 4 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 |


## The `remainder` argument

Notice that all feature names start with `city_category__` or `title_bow__`, the names of the two transformers, while the original dataframe had these columns:

```
    'city', 'title', 'expert_rating', 'user_rating'
```

The rating columns are missing because we set `remainder='drop'`. The docstring describes what happens if we change it:

```
 |  remainder : {'drop', 'passthrough'} or estimator, default 'drop'
 |      By default, only the specified columns in `transformers` are
 |      transformed and combined in the output, and the non-specified
 |      columns are dropped. (default of ``'drop'``).
 |      By specifying ``remainder='passthrough'``, all remaining columns that
 |      were not specified in `transformers` will be automatically passed
 |      through. This subset of columns is concatenated with the output of
 |      the transformers.
 |      By setting ``remainder`` to be an estimator, the remaining
 |      non-specified columns will use the ``remainder`` estimator. The
 |      estimator must support :term:`fit` and :term:`transform`.
 |      Note that using this feature requires that the DataFrame columns
 |      input at :term:`fit` and :term:`transform` have identical order.
 ```

In [96]:
column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough',
    sparse_threshold=0.2)

# features 'city' and 'title' are transformed, the others are just copied
column_trans.fit(df);

In [103]:
column_trans.transform(df)

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]])

Notice that the last two columns hold the values of the `expert_rating` and `user_rating` columns.

The output of `column_trans.transform(df)` now has two more columns than it had with `remainder='drop'`.

In [106]:
df[['expert_rating','user_rating']]

| | expert_rating | user_rating |
|---|---|---|
| 0 | 5 | 4 |
| 1 | 3 | 5 |
| 2 | 4 | 4 |
| 3 | 5 | 3 |
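
The docstring quoted earlier also allows `remainder` to be an estimator. As a sketch of that option, the snippet below passes a `MinMaxScaler` as `remainder`, so the two rating columns are rescaled to the range [0, 1] instead of being copied unchanged. The choice of `MinMaxScaler` and of `sparse_threshold=0` (to force a dense output) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

column_trans_scaled = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    # every column not listed above is passed through this estimator
    remainder=MinMaxScaler(),
    # always return a dense array
    sparse_threshold=0)

X_scaled = column_trans_scaled.fit_transform(df)

# the remainder columns are appended after the named transformers
ratings = X_scaled[:, -2:]
print(X_scaled.shape)
print(ratings)
```

Rescaling the remainder this way can be useful before a model that expects all inputs to lie between 0 and 1.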


In [125]:
from sklearn.neural_network import BernoulliRBM

# fit a restricted Boltzmann machine on the transformed features
X_all = column_trans.transform(df)
rbm = BernoulliRBM(n_components=10)
rbm.fit(X_all)
rbm.transform(X_all).shape

(4, 10)