
# Column transformers: `sklearn.compose.ColumnTransformer`

`ColumnTransformer` applies different transformations to different columns of the data. It can be used inside a `Pipeline`, which guards against data leakage and lets the parameters of each transformer be tuned.
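As a minimal sketch of the `Pipeline` usage mentioned above, the block below chains a `ColumnTransformer` with a model. The choice of `Ridge`, the use of `user_rating` as the target, and the step names `features` and `model` are assumptions made only for illustration; the toy data mirrors the DataFrame used in the rest of this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

books = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# Hypothetical task: predict user_rating from the other columns.
X = books.drop(columns='user_rating')
y = books['user_rating']

features = ColumnTransformer(
    [('city_category', OneHotEncoder(handle_unknown='ignore'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')

pipe = Pipeline([('features', features), ('model', Ridge())])

# Nested parameters are addressed as <step>__<transformer>__<parameter>.
pipe.set_params(features__title_bow__binary=True, model__alpha=0.5)

# Fitting the pipeline fits the transformers on the training data only.
pipe.fit(X, y)
preds = pipe.predict(X)
print(preds.shape)
```

Because the encoders are fitted inside the pipeline, tools such as cross-validation refit them on each training split, which is what keeps information from the held-out rows from leaking into the preprocessing.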

- `ColumnTransformer` works on **arrays**, **sparse matrices**, and **pandas DataFrames**.
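To illustrate the array case, here is a small sketch that selects columns of a NumPy array by integer position instead of by name. The array values and the transformer names `cat` and `num` are made up for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy array: column 0 holds a category code,
# columns 1 and 2 hold numeric values.
X_arr = np.array([[0, 5.0, 4.0],
                  [0, 3.0, 5.0],
                  [1, 4.0, 4.0],
                  [2, 5.0, 3.0]])

ct_arr = ColumnTransformer(
    [('cat', OneHotEncoder(), [0]),          # columns selected by position
     ('num', StandardScaler(), [1, 2])],
    sparse_threshold=0)                      # always return a dense array

Xt_arr = ct_arr.fit_transform(X_arr)
print(Xt_arr.shape)  # 3 one-hot columns + 2 scaled columns
```

With a DataFrame the same selection can be written with column names, as done in the rest of this notebook.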



In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
import sklearn

In [2]:
df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

In [3]:
df

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


Here we have a DataFrame in which the `city` column can be treated as a categorical variable.

The `title` column can reasonably be encoded as a bag-of-words vector.

`sklearn.compose.ColumnTransformer` can process different columns of a DataFrame in different ways and join the results into a single output array.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop',
    sparse_threshold=0.3)

In [5]:
column_trans.transformers

[('city_category', OneHotEncoder(dtype='int'), ['city']),
 ('title_bow', CountVectorizer(), 'title')]

The `column_trans` object defined above has two parts.

- `city_category` is a `OneHotEncoder` that takes the `city` column and one-hot encodes it.
- `title_bow` is a `CountVectorizer` that takes the `title` column and builds a bag-of-words (BoW) vector.

Note that `city` is given as a list (`['city']`), so the encoder receives a 2D selection, while `title` is given as a string, so the vectorizer receives a 1D column of text.

These transformations are performed when `column_trans.transform` is called, after the transformer has been fitted.



In [6]:
df

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


In [7]:
column_trans.fit(df)

ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')])

Notice that the output of the transform is a sparse matrix:

In [8]:
X = column_trans.transform(df)
X

<4x16 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [9]:
column_trans.get_feature_names_out()

array(['city_category__city_London', 'city_category__city_Paris',
       'city_category__city_Sallisaw', 'title_bow__bow',
       'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
       'title_bow__how', 'title_bow__last', 'title_bow__learned',
       'title_bow__moveable', 'title_bow__of', 'title_bow__the',
       'title_bow__trick', 'title_bow__watson', 'title_bow__wrath'],
      dtype=object)

Note that these 16 features come from joining the outputs of the two transformers (3 from `city_category` and 13 from `title_bow`):

In [10]:
display(column_trans.transformers_[0][1].transform(df[['city']]))
display(column_trans.transformers_[1][1].transform(df['title']))

<4x3 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

<4x13 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [11]:
sp.hstack((column_trans.transformers_[0][1].transform(df[['city']]),
           column_trans.transformers_[1][1].transform(df['title'])))

<4x16 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

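The stacked output is sparse because both transformers return sparse matrices and the overall density, 18 / (4 × 16) ≈ 0.28, is below `sparse_threshold`, which is set to `0.3` (also the default value) in `column_trans`.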

```
sparse_threshold : float, default = 0.3
 |      If the output of the different transformers contains sparse matrices,
 |      these will be stacked as a sparse matrix if the overall density is
 |      lower than this value. Use ``sparse_threshold=0`` to always return
 |      dense.  When the transformed output consists of all dense data, the
 |      stacked result will be dense, and this keyword will be ignored
 ```

In [12]:
column_trans.sparse_threshold, column_trans.sparse_output_

(0.3, True)
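The effect of `sparse_threshold` can be checked directly. The sketch below recomputes the density of the stacked output and then refits with `sparse_threshold=0` to force a dense result; the helper `make_ct` is a hypothetical convenience function, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

def make_ct(threshold):
    """Hypothetical helper: same transformers, different sparse_threshold."""
    return ColumnTransformer(
        [('city_category', OneHotEncoder(dtype='int'), ['city']),
         ('title_bow', CountVectorizer(), 'title')],
        remainder='drop',
        sparse_threshold=threshold)

# Density is 18 stored elements over 4 * 16 cells, below 0.3: sparse output.
X_sparse = make_ct(0.3).fit_transform(df)
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
print(density)

# sparse_threshold=0 always returns a dense NumPy array.
X_dense = make_ct(0).fit_transform(df)
print(type(X_dense), X_dense.shape)
```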

The transformers can be accessed with `.named_transformers_`

In [13]:
column_trans.named_transformers_

{'city_category': OneHotEncoder(dtype='int'),
 'title_bow': CountVectorizer(),
 'remainder': 'drop'}
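Once fitted, each entry of `.named_transformers_` is the fitted transformer itself, so its learned attributes can be inspected. A short sketch, rebuilding the same toy data so that it runs on its own:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

ct = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')
ct.fit(df)

# Look up the fitted transformers by the names given at construction.
encoder = ct.named_transformers_['city_category']
vectorizer = ct.named_transformers_['title_bow']

print(encoder.categories_)            # categories learned for `city`
print(sorted(vectorizer.vocabulary_)) # tokens learned from `title`
```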

The names of the features created by the transformers can be accessed with `.get_feature_names_out()`

In [14]:
column_trans.get_feature_names_out()

array(['city_category__city_London', 'city_category__city_Paris',
       'city_category__city_Sallisaw', 'title_bow__bow',
       'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
       'title_bow__how', 'title_bow__last', 'title_bow__learned',
       'title_bow__moveable', 'title_bow__of', 'title_bow__the',
       'title_bow__trick', 'title_bow__watson', 'title_bow__wrath'],
      dtype=object)

In [15]:
df

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


## The `remainder` argument

Notice that all feature names start with `city_category__` or `title_bow__`, the names given to the two transformers, even though the original DataFrame had these columns:

```
    'city', 'title', 'expert_rating', 'user_rating'
```

The rating columns are missing because we set `remainder='drop'`. The docstring describes the other options:

```
 |  remainder : {'drop', 'passthrough'} or estimator, default 'drop'
 |      By default, only the specified columns in `transformers` are
 |      transformed and combined in the output, and the non-specified
 |      columns are dropped. (default of ``'drop'``).
 |      By specifying ``remainder='passthrough'``, all remaining columns that
 |      were not specified in `transformers` will be automatically passed
 |      through. This subset of columns is concatenated with the output of
 |      the transformers.
 |      By setting ``remainder`` to be an estimator, the remaining
 |      non-specified columns will use the ``remainder`` estimator. The
 |      estimator must support :term:`fit` and :term:`transform`.
 |      Note that using this feature requires that the DataFrame columns
 |      input at :term:`fit` and :term:`transform` have identical order.
 ```
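Besides `'drop'` and `'passthrough'`, the docstring allows `remainder` to be an estimator. The following sketch uses `MinMaxScaler` for the unspecified columns; the choice of scaler is an assumption made for illustration, and `sparse_threshold=0` is used only so that the result is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

ct_scaled = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder=MinMaxScaler(),   # applied to expert_rating and user_rating
    sparse_threshold=0)         # dense output, for readability

X_scaled = ct_scaled.fit_transform(df)

# The remainder columns come last; both ratings range from 3 to 5,
# so they are rescaled to the interval [0, 1].
print(X_scaled[:, -2:])
```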

In [17]:
column_trans = ColumnTransformer(
                [('city_category', OneHotEncoder(dtype='int'),['city']),
                 ('title_bow', CountVectorizer(), 'title')],
                remainder='passthrough',
                sparse_threshold=1.)

column_trans.fit(df);

In [18]:
df

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


Note that now we have 18 features.

- Features 'city' and 'title' are transformed to 3 and 13 features respectively
- Features 'expert_rating' and 'user_rating' are kept intact.

In total we have 3 + 13 + 1 + 1 = 18 features

In [20]:
column_trans.transform(df)

<4x18 sparse matrix of type '<class 'numpy.int64'>'
	with 26 stored elements in Compressed Sparse Row format>

Now the output of `column_trans.transform(df)` has two more columns than it had with `remainder='drop'`.

Notice the last two columns correspond to the values of `expert_rating` and `user_rating` columns.

In [21]:
column_trans.transform(df).todense()[:,-2:]

matrix([[5, 4],
        [3, 5],
        [4, 4],
        [5, 3]])

In [22]:
df[['expert_rating','user_rating']]

Unnamed: 0,expert_rating,user_rating
0,5,4
1,3,5
2,4,4
3,5,3


### Generate features from columns in a df and stack them to the columns

In [34]:
column_trans = ColumnTransformer( 
                 [('city_category', OneHotEncoder(dtype='int'),['city']),
                 ('title_bow', CountVectorizer(), 'title')],
                 remainder='passthrough',
                 sparse_threshold=1.)
