# Data Transforms

Today we'll work with the <a href="https://www.openml.org/d/40945">Titanic dataset</a>, fetched from OpenML.

In [1]:
from sklearn.datasets import fetch_openml
titanic = fetch_openml("titanic", version=1, as_frame=True)

Most realistic datasets mix numeric and categorical columns, so they are best stored in a pandas DataFrame, which keeps a separate data type for each column.
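As a small sketch of why mixed types matter, the made-up three-row frame below (illustrative values, not taken from the dataset) keeps one dtype per column, whereas converting it to a single NumPy array forces everything into a generic object dtype.

```python
import pandas as pd

# Toy data with one numeric and one text column (illustrative values only).
toy = pd.DataFrame({"age": [29.0, 2.0, 30.0], "sex": ["female", "female", "male"]})

# Each DataFrame column keeps its own dtype.
print(toy.dtypes)

# A single NumPy array needs one dtype for everything, so it falls back to object.
as_array = toy.to_numpy()
print(as_array.dtype)
```

This is why the preprocessing later in the notebook addresses columns by name rather than by position in an array.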

In [3]:
titanic.data.head()

| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2.0 | | St Louis, MO |
| 1 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11.0 | | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |


In [9]:
titanic.target

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: category
Categories (2, object): ['0', '1']

Not all of these features seem useful, and a few (boat and body, for example) are only known after the outcome. Let's begin by selecting the ones that do seem useful.

In [10]:
features = ["pclass", "sex", "age", "sibsp", "parch", "fare"]
titanic.data = titanic.data[features]

Two of these features, age and fare, have missing values.

In [11]:
titanic.data

| | pclass | sex | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| 0 | 1.0 | female | 29.0000 | 0.0 | 0.0 | 211.3375 |
| 1 | 1.0 | male | 0.9167 | 1.0 | 2.0 | 151.5500 |
| 2 | 1.0 | female | 2.0000 | 1.0 | 2.0 | 151.5500 |
| 3 | 1.0 | male | 30.0000 | 1.0 | 2.0 | 151.5500 |
| 4 | 1.0 | female | 25.0000 | 1.0 | 2.0 | 151.5500 |
| ... | ... | ... | ... | ... | ... | ... |
| 1304 | 3.0 | female | 14.5000 | 1.0 | 0.0 | 14.4542 |
| 1305 | 3.0 | female | | 1.0 | 0.0 | 14.4542 |
| 1306 | 3.0 | male | 26.5000 | 0.0 | 0.0 | 7.2250 |
| 1307 | 3.0 | male | 27.0000 | 0.0 | 0.0 | 7.2250 |


In [13]:
print(titanic.data.isnull().sum())

pclass      0
sex         0
age       263
sibsp       0
parch       0
fare        1
dtype: int64
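One common way to handle gaps like these is to fill each one with the mean of its column, which is what sklearn's SimpleImputer does by default. A minimal sketch on a made-up column (the values are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry (illustrative values only).
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# The default strategy is "mean": NaN is replaced by the mean of the observed values.
imputer = SimpleImputer()
filled = imputer.fit_transform(ages)
print(filled.ravel())
```

Here the observed values average to 30, so the missing entry becomes 30. Other strategies, such as `strategy="median"`, may suit skewed columns better.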


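The quantitative and categorical features need to be prepared differently: the categorical ones are one-hot encoded, while the quantitative ones have their missing values imputed and are then standardized. Let's set up a pipeline for each and combine them with a ColumnTransformer.

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
category_prep = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore"))
])

# Quantitative columns: fill missing values (mean by default), then standardize.
quantity_prep = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler())
])

# Route each group of columns through its own pipeline.
prep = ColumnTransformer([
    ("category", category_prep, ["sex", "pclass"]),
    ("quantity", quantity_prep, ["age", "sibsp", "parch", "fare"])
])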
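To see what a ColumnTransformer of this kind produces, here is a minimal sketch on a made-up four-row frame. The names `toy` and `toy_prep` and all values are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one categorical and one quantitative column (illustrative values only).
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "age": [20.0, 30.0, np.nan, 40.0],
})

toy_prep = ColumnTransformer([
    ("category", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("quantity", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), ["age"]),
])

transformed = toy_prep.fit_transform(toy)
# Two one-hot columns for sex plus one imputed and scaled column for age.
print(transformed.shape)
```

The transformed columns are concatenated in the order the transformers are listed, so the one-hot columns come first and the scaled quantities follow.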

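Now we can give this data to an sklearn classifier. Here we chain the preprocessing with a support vector classifier, tune its C parameter with a grid search, and estimate performance with an outer cross-validation loop.

In [34]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Preprocessing and classifier in one estimator, so the imputer, scaler,
# and encoder are fit only on the training folds.
pipeline = Pipeline([
    ("prep", prep),
    ("classify", SVC())
])

# Candidate values for the SVC regularization parameter C.
settings = {"classify__C": [0.01, 0.1, 1, 10, 100]}

# Inner 5-fold search picks C; outer 5-fold loop scores the tuned model.
grid = GridSearchCV(pipeline, settings, cv=5)
scores = cross_val_score(grid, titanic.data, titanic.target, cv=5)
print(scores.mean(), scores.std())

0.7104062472580503 0.09213851136557909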
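Because cross_val_score only reports the outer-loop scores, the value of C that the grid search selects is not visible. One way to inspect it is to fit the search directly. The sketch below does this on synthetic data from make_classification, a stand-in for the Titanic data so that it runs without a download; the `toy_` names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([("scale", StandardScaler()), ("classify", SVC())])
toy_settings = {"classify__C": [0.01, 0.1, 1, 10, 100]}

# Fitting the search directly exposes the selected parameters and their CV score.
toy_grid = GridSearchCV(toy_pipeline, toy_settings, cv=5)
toy_grid.fit(X, y)
print(toy_grid.best_params_, toy_grid.best_score_)
```

Note that `best_score_` comes from the same folds used to choose C, so it tends to be optimistic. Wrapping the search in cross_val_score gives a less biased estimate.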