Thanks to [Data School](https://www.youtube.com/watch?v=NGq8wnH5VSo&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=1).

# 1.Use ColumnTransformer to apply different preprocessing to different columns

ColumnTransformer also lets you:
- select from DataFrame columns by name
- passthrough or drop unspecified columns

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

In [4]:
df.head(1)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,NaN,S


In [5]:
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]

In [6]:
X

,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,NaN


In [8]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

In [9]:
ohe = OneHotEncoder()
imp = SimpleImputer()

In [10]:
# one-hot encode Embarked and Sex, impute Age, pass Fare through unchanged
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (imp, ['Age']),
    remainder='passthrough')

In [11]:
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 22.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    , 38.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 26.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 35.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 35.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    , 31.2   ,  8.4583]])

# 2.SEVEN ways to select columns using ColumnTransformer

1. column name
2. integer position
3. slice
4. boolean mask
5. regex pattern
6. dtypes to include
7. dtypes to exclude

In [13]:
X

,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,NaN


In [14]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

In [None]:
# ohe = OneHotEncoder()

In [15]:
# all SEVEN of these produce the same results
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))
# ct = make_column_transformer((ohe, [1, 2]))
# ct = make_column_transformer((ohe, slice(1, 3)))
# ct = make_column_transformer((ohe, [False, True, True, False]))
# ct = make_column_transformer((ohe, make_column_selector(pattern='E|S')))
# ct = make_column_transformer((ohe, make_column_selector(dtype_include=object)))
# ct = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))

In [16]:
# one-hot encode Embarked and Sex (and drop all other columns)
ct.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
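
To check that the seven selectors really agree, rather than uncommenting them one at a time, they can be looped over and compared. This is a minimal sketch and not part of the original notebook: `X_demo`, `selectors`, `results` and `all_equal` are illustrative names, and the data is retyped from the table above. The two text columns are cast to `object` explicitly as a precaution, on the assumption that newer pandas versions may infer a dedicated string dtype that `dtype_include=object` would not match.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

X_demo = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})
# Force object dtype so the dtype-based selectors behave the same everywhere
X_demo = X_demo.astype({'Embarked': object, 'Sex': object})

selectors = {
    'column name': ['Embarked', 'Sex'],
    'integer position': [1, 2],
    'slice': slice(1, 3),
    'boolean mask': [False, True, True, False],
    'regex pattern': make_column_selector(pattern='E|S'),
    'dtypes to include': make_column_selector(dtype_include=object),
    'dtypes to exclude': make_column_selector(dtype_exclude='number'),
}

results = {}
for label, columns in selectors.items():
    # sparse_threshold=0 asks ColumnTransformer to always return a dense array
    ct_demo = make_column_transformer((OneHotEncoder(), columns),
                                      sparse_threshold=0)
    results[label] = ct_demo.fit_transform(X_demo)

# Compare every result against the one selected by column name
reference = results['column name']
all_equal = all(np.array_equal(reference, r) for r in results.values())
print(all_equal)
```

The regex `'E|S'` is case-sensitive, so it matches `Embarked` and `Sex` but not `Fare` or `Age`, whose only "e" is lowercase.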

# 3.Encode categorical features using OneHotEncoder or OrdinalEncoder

Two common ways to encode categorical features:

- OneHotEncoder for unordered (nominal) data
- OrdinalEncoder for ordered (ordinal) data

In [2]:
import pandas as pd
X = pd.DataFrame({'Shape': ['square', 'square', 'oval', 'circle'],
                 'Class': ['third', 'first', 'second', 'first'],
                 'Size': ['S', 'S', 'L', 'XL']})

In [3]:
# "Shape" is unordered, "Class" and "Size" are ordered
X

,Shape,Class,Size
0,square,third,S
1,square,first,S
2,oval,second,L
3,circle,first,XL


In [4]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [6]:
# left-to-right column order is alphabetical (circle, oval, square)
# (sparse_output replaced the older sparse parameter in scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(X[['Shape']])

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
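
The alphabetical column order can be confirmed from the fitted encoder instead of being read off the array. A small sketch, assuming only the `Shape` column from above. `shapes`, `ohe_demo`, `encoded` and `shape_categories` are illustrative names, and `.toarray()` is used here in place of a dense-output parameter so the snippet does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

shapes = pd.DataFrame({'Shape': ['square', 'square', 'oval', 'circle']})

ohe_demo = OneHotEncoder()
# The default result is a sparse matrix; .toarray() makes it dense
encoded = ohe_demo.fit_transform(shapes).toarray()

# categories_ holds one array per input column, in output-column order
shape_categories = [str(c) for c in ohe_demo.categories_[0]]
print(shape_categories)
```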

In [8]:
# category ordering (within each feature) is defined by me
# categories must be passed by keyword in current scikit-learn
oe = OrdinalEncoder(categories=[['first', 'second', 'third'], ['S', 'M', 'L', 'XL']])
oe.fit_transform(X[['Class', 'Size']])

array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [0., 3.]])
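
The two encoders can also be applied in one step by combining them with the ColumnTransformer from the first tip. This is a sketch under that assumption and not something the original notebook does: `X_demo`, `ct_demo` and `encoded` are illustrative names, and `sparse_threshold=0` is passed only to guarantee a dense array.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X_demo = pd.DataFrame({'Shape': ['square', 'square', 'oval', 'circle'],
                       'Class': ['third', 'first', 'second', 'first'],
                       'Size': ['S', 'S', 'L', 'XL']})

ct_demo = make_column_transformer(
    # unordered feature: one column per category
    (OneHotEncoder(), ['Shape']),
    # ordered features: one integer code per category, in the order given
    (OrdinalEncoder(categories=[['first', 'second', 'third'],
                                ['S', 'M', 'L', 'XL']]),
     ['Class', 'Size']),
    sparse_threshold=0)

# Columns: Shape_circle, Shape_oval, Shape_square, Class, Size
encoded = ct_demo.fit_transform(X_demo)
print(encoded)
```

The first three columns match the OneHotEncoder output above and the last two match the OrdinalEncoder output, so each row is the two earlier results placed side by side.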