# Introduction
Let's load and explore the example dataset we are going to use in this practical session.

In [None]:
import pandas as pd
df = pd.read_csv('../data/sample_lasagna.csv')
df.head()

Let's specify our attribute/predictor variables $X$ and the target variable $y$ that we wish to predict.

**Note**: To create $X$, we drop the target variable column (`Have Tried`) and the `Person` column, which holds the subject ID.

In [None]:
y = df['Have Tried']
X = df.drop(columns=['Person','Have Tried'])

# Mixed Types
Our input matrix $X$ contains columns of different data types, and each column needs to be processed in a way that depends on its data type. Let's first look at the data types in $X$.

In [None]:
X.dtypes
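
As a rough sketch of how a numeric/categorical split could be derived from the dtypes instead of being typed out by hand, `select_dtypes` picks columns by type. The toy frame `X_toy` and its values are invented for illustration (only the column names mimic the dataset), and the sketch assumes categorical columns are stored as strings.

```python
import pandas as pd

# Hypothetical toy data standing in for X (values are made up)
X_toy = pd.DataFrame({
    "Age": [48, 33, 51],
    "Income": [65500, 29100, 32200],
    "Pay Type": ["Hourly", "Hourly", "Salaried"],
    "Gender": ["Male", "Female", "Male"],
})

# Numeric columns: any integer or float dtype
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

A categorical variable stored as integer codes would land in the numeric list with this approach, so listing the columns explicitly can still be the safer choice.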

We can easily split our input matrix $X$ into numeric columns/features and categorical columns/features. Let's do this now, and look at each.

In [None]:
numeric_features = ["Age", "Weight","Income","Car Value","CC Debt","Mall Trips"]
categorical_features = ["Pay Type","Gender","Live Alone","Dwell Type","Nbhd"]

In [None]:
X_num = X[numeric_features]
X_num.head()

In [None]:
X_cat = X[categorical_features]
X_cat.head()

# Processing mixed types
Once we've split our input matrix $X$ into numeric and categorical features, we can process each split in a way that is suited to the data type.

For example, we can use a `OneHotEncoder` to process the categorical features.

In [None]:
from sklearn.preprocessing import OneHotEncoder
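# sparse_output=False returns a dense array; it replaces the older `sparse` argument (scikit-learn >= 1.2)
hot_encoder = OneHotEncoder(drop='first', handle_unknown="ignore", sparse_output=False)
hot_encoder.fit(X_cat)
# Keep the original row index so the result lines up with the numeric features when concatenated
X_cat_onehot = pd.DataFrame(hot_encoder.transform(X_cat),
                            columns=hot_encoder.get_feature_names_out(X_cat.columns),
                            index=X_cat.index)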
X_cat_onehot.head()

For the numeric features, we could process them using a `MinMaxScaler`.

In [None]:
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
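# Keep the original row index so the scaled values line up with the other features when concatenated
X_num_scaled = pd.DataFrame(minmax_scaler.fit_transform(X_num), columns=X_num.columns, index=X_num.index)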
X_num_scaled.head()
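
With its default settings, `MinMaxScaler` maps each column to the range $[0, 1]$ via $(x - \min) / (\max - \min)$. The sketch below checks this by hand on a small invented frame; the values are made up and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 60.0],
    "Income": [10000.0, 50000.0, 30000.0],
})

scaled = MinMaxScaler().fit_transform(toy)
# Same computation by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(pd.DataFrame(scaled, columns=toy.columns))
print(np.allclose(scaled, manual.to_numpy()))
```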

Finally, we combine the processed numeric and categorical features into a single processed feature matrix.

In [None]:
X_train_preprocessed = pd.concat([X_num_scaled, X_cat_onehot], axis=1)
X_train_preprocessed.head()

# Exercise
Re-run the above data processing steps using an input matrix $X$ with only columns `Age`, `Weight`, `Income`, `Pay Type`, and `Gender`; and using a `StandardScaler` to process the numeric variables. More precisely:
1. Specify the numeric and categorical features and split $X$
2. Process categorical features using `OneHotEncoder`
3. Process numeric features using `StandardScaler`
4. Combine results into one processed feature matrix and check it using `head`
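
As background for step 3, `StandardScaler` centres each column on its mean and divides by its standard deviation. The sketch below, on an invented single-column frame, shows that it uses the population standard deviation (`ddof=0`), whereas the pandas `std` method defaults to `ddof=1`. It is a check of the formula only, not a solution to the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({"Weight": [150.0, 180.0, 210.0]})

scaled = StandardScaler().fit_transform(toy)
# Same computation by hand: (x - mean) / population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```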

Below, there is some code to get you started and steps in the comments to follow.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

y = df['Have Tried']
X = df[['Age', 'Weight', 'Income', 'Pay Type', 'Gender']]

# (SOLUTION)
