# Preprocessing
This notebook walks through the concepts we saw today. Specifically, it presents:

- How to load data using pandas
- How to split the data into train and test
- How to use the different transformers
- A complete example on how to preprocess the data on a real dataset


## Data loading

First of all, we will see how to load data using **pandas**. We will see 2 examples: loading a local file and loading a dataset from a URL.

In [None]:
import pandas

titanic_from_file = pandas.read_csv("titanic.csv")
titanic_from_url = pandas.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

In [None]:
titanic_from_file.head()

In [None]:
titanic_from_url.head()

## Split the data into train and test


The next step once we have the data loaded into memory using **pandas** is to obtain our train and test datasets. Remember that this process uses random selection, so for reproducibility we will set a seed. This means that each time we call this method it will return the same selection.
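
As a minimal sketch of what the seed does, the snippet below splits a small toy dataset (invented here for illustration, it is not the Titanic data) several times. It uses `train_test_split` from sklearn with its `random_state` parameter as the seed; the variable names are only placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: ten samples with one feature each
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) % 2

# Two calls with the same random_state select the same rows
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A different seed will usually (not always) select different rows
_, X_test_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print("Seed 42, first call: ", X_test_a.ravel())
print("Seed 42, second call:", X_test_b.ravel())
print("Seed 0:              ", X_test_c.ravel())
```

The first two lines of the output should be identical, which is what makes an experiment repeatable.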

First, we will separate the data into features (X) and labels (y), as this dataset is usually used for classification (i.e. supervised learning), and contains a column with the label *Survived*.

In [None]:
# Use drop to get the dataset without a specific column. If the parameter
# inplace is not set to True, it will not modify the original data
X = titanic_from_file.drop("Survived", axis=1)
y = titanic_from_file["Survived"]

In [None]:
X.head()

In [None]:
y.head()

Now we are ready to split the data into train and test:

In [None]:
from sklearn.model_selection import train_test_split
test_size_pctg = 0.2
seed = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size_pctg, random_state=seed)

print(X.shape, X_train.shape, X_test.shape)
print(y.shape, y_train.shape, y_test.shape)

## Transforming data


Here we show a basic example of the transformers explained.

### Normalization

First, we see the normalization process using the `StandardScaler`. This will convert each column into a new column with standard deviation of 1 and mean of 0.

As the standard scaler will apply the function:
$$
    z_i = \frac{x_i - \mu}{\sigma}
$$
to scale the values, first it must learn this $\mu$ (mean) and $\sigma$ (std. deviation). To learn these values (parameters) we will **fit** the transformer.
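
To make the formula concrete, here is a small worked example on a single toy column (the values are invented for illustration). One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas `std()` and `describe()` use the sample standard deviation (`ddof=1`) by default, so the two numbers can differ slightly.

```python
import pandas

# Toy column, invented for illustration
values = pandas.Series([1.0, 2.0, 3.0, 4.0, 5.0])

mu = values.mean()            # 3.0
sigma = values.std(ddof=0)    # population std. deviation, sqrt(2)
z = (values - mu) / sigma     # the formula above, applied to every value

print("mu:", mu, "sigma:", sigma)
print("scaled:", z.to_numpy())
print("scaled mean:", z.mean(), "scaled std:", z.std(ddof=0))
```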

In [None]:
dataset = pandas.DataFrame({
    "column1": [1, 2, 4, 5, 6, 7, 2, 1, 3],
    "column2": [-1, 2, 1, 1, 3, 4, 5, 6, 7],
})

dataset.describe()

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(dataset)  # Learn std.dev and mean

new_X = scaler.transform(dataset)
# Note that the scaler will return a numpy array
new_X

In [None]:
print("Mean for each column:", new_X.mean(axis=0))
print("Standard deviation for each column:", new_X.std(axis=0))

Finally, let's see that this transformer allows us to recover the original data using `inverse_transform`:

In [None]:
scaler.inverse_transform(new_X)

As an exercise, try to implement the StandardScaler by yourself:

In [None]:
import pandas
from sklearn.base import BaseEstimator, TransformerMixin

class MyStandardScaler(BaseEstimator, TransformerMixin):
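    def fit(self, X, y=None):
        # y is unused; accepting it keeps the transformer usable in a Pipeline
        if isinstance(X, pandas.DataFrame):
            X = X.values

        # TODO get the mean and the standard deviation for each column,
        # store them in the attributes below and remove the raise
        self.means_ = None
        self.std_dev_ = None
        raise NotImplementedError

        return self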
    
    def transform(self, X):
        if isinstance(X, pandas.DataFrame):
            X = X.values
        
        # Use the mean and the std deviation of each column to scale
        # the data in X
        raise NotImplementedError
        
my_X = MyStandardScaler().fit_transform(dataset)

### Discretization

The next transformer we explained is the *bins discretizer*. Experiment with the different parameters available (see [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)).
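
As a starting point for experimenting, the sketch below compares the three values of the `strategy` parameter on one column. The values are the same as `column1` in the dataset defined earlier; the variable names are placeholders. Depending on the installed sklearn version, the `quantile` strategy may print a `FutureWarning` about changing defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Same values as "column1" in the dataset defined earlier, as one column
column = np.array([1, 2, 4, 5, 6, 7, 2, 1, 3], dtype=float).reshape(-1, 1)

edges = {}
binned = {}
for strategy in ("uniform", "quantile", "kmeans"):
    demo_discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned[strategy] = demo_discretizer.fit_transform(column).ravel()
    # bin_edges_ holds one array of edges per input column
    edges[strategy] = demo_discretizer.bin_edges_[0]
    print(strategy, "edges:", edges[strategy], "bins:", binned[strategy])
```

With `uniform` all bins have the same width, with `quantile` they hold roughly the same number of samples, and with `kmeans` the edges come from a 1-D clustering of the values.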

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal")
discretizer.fit_transform(dataset)

### Encoders

Finally, let's see the encoders:

In [None]:
from sklearn.preprocessing import OneHotEncoder

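# sparse_output=False returns a dense array (the parameter was called
# sparse before scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False)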
print(ohe.fit_transform(dataset))
print(ohe.categories_)

In [None]:
from sklearn.preprocessing import LabelEncoder

dataset_categorical = pandas.DataFrame({
    "col1": ["cat", "dog", "bird", "cat", "bird", "cat"],
})

label_enc = LabelEncoder()
# LabelEncoder expects a single 1-D column, so we pass the column itself
print(label_enc.fit_transform(dataset_categorical["col1"]))
print(label_enc.classes_)

## Feature selection


In some cases, we cannot tell by hand which features are worth using in a model. In those cases, we can use statistical tests to measure the relation between each feature and the label in our training set, and then select features based on that information.

An example of this technique is the SelectKBest method in sklearn, where we determine which are the best $K$ features based on univariate statistical tests, such as $\chi^2$ (chi squared).

In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

print(X.shape)

selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)
print(selector.scores_)
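
The scores alone do not say which columns were kept. A possible way to see it, sketched below, is the `get_support` method of the selector combined with the feature names that `load_iris` provides. Note that `chi2` only accepts non-negative feature values, which holds for the iris measurements.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
iris_selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# get_support returns a boolean mask with True for the selected columns
mask = iris_selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

for name, score in zip(iris.feature_names, iris_selector.scores_):
    print(f"{name}: {score:.1f}")
print("Selected:", selected)
```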

## Data scrubbing

Below we have a real example of how to preprocess a dataset. We will use the [Titanic dataset](https://hbiostat.org/data/repo/titanic.txt).

The problem we try to solve is to determine whether a Titanic passenger would have survived, given the age, passenger class, and sex.

The attributes are:
- idx (we use it as the dataframe index using index_col=0)
- Class
- Survived (1=True / 0=False)
- Name
- Age
- Embarked
- Destination
- Room
- Ticket
- Boat
- Sex

In [None]:
import pandas

titanic = pandas.read_csv("https://hbiostat.org/data/repo/titanic.txt", index_col=0)

titanic.head()

In [None]:
titanic_X = titanic.drop("survived", axis=1)
titanic_y = titanic["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    titanic_X, titanic_y, test_size=0.2, random_state=42)

Let's select some attributes we will use for learning:
- Passenger class (pclass)
- Age
- Sex

In [None]:
X_train = X_train[["pclass", "age", "sex"]]

print(X_train.shape)
X_train.isna().sum(axis=0)

We see that for the age attribute we have 448 rows of 1050 that have missing values. We can approach this using any of the techniques we explained:
- Replacing it with the mean value
- Replacing it using KNN
- Replacing it randomly
- Deleting the records

The example uses the SimpleImputer to show how this can be done, but as we know it is easy to replace one transformer with another using *sklearn*, so try to change it and experiment with the other methods. Remember that other methods might require numeric data only. In those cases, for now, you can apply the fit and transform to a specific column only, using something like:
```python
dest = titanic_X.copy()
columns_to_impute = ["age"]
dest[columns_to_impute] = imputer.fit_transform(titanic_X[columns_to_impute])
```
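
To give a feeling for how the techniques differ, here is a minimal sketch on a toy table (invented for illustration; `pclass_code` is a hypothetical numeric encoding of the passenger class, not a column of the dataset). It covers the mean, KNN and deletion options using `SimpleImputer`, `KNNImputer` and `dropna`; random replacement is left out.

```python
import numpy as np
import pandas
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data, invented for illustration; "pclass_code" is a hypothetical
# numeric encoding of the passenger class
toy = pandas.DataFrame({
    "pclass_code": [1, 1, 3, 3, 1, 3],
    "age": [40.0, 42.0, 20.0, 22.0, np.nan, np.nan],
})

# Mean: every missing age becomes the mean of the known ages
mean_ages = SimpleImputer(strategy="mean").fit_transform(toy[["age"]]).ravel()

# KNN: every missing age becomes the mean age of the 2 most similar rows,
# where similarity is measured on the columns that are not missing
knn_ages = KNNImputer(n_neighbors=2).fit_transform(toy)[:, 1]

# Deletion: drop the rows that have no age
toy_without_missing = toy.dropna(subset=["age"])

print("Mean imputation:", mean_ages)
print("KNN imputation: ", knn_ages)
print("Rows left after deletion:", len(toy_without_missing))
```

In this toy table the mean strategy gives both passengers the same age, while KNN gives each one an age close to the passengers of the same class.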

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")

# First learn the parameters (in the simple imputer this is the
# most frequent value) for each column 
imputer.fit(X_train)
# Then "impute" the values
X_train_imputed = imputer.transform(X_train)

# "Reconvert" it to a dataframe
X_train_imputed = pandas.DataFrame(X_train_imputed,
                                   columns=X_train.columns,
                                   index=X_train.index)

# Check we have no more nans
print(X_train_imputed.shape)
X_train_imputed.isna().sum(axis=0)

Once we have the values imputed, then we can prepare it for our model. We will use sklearn decision trees. This model expects as input a list of *real-valued* features, as the decisions will be **Feature <= value**.
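
To see what a **Feature <= value** decision looks like, the sketch below fits a tree with a single split on toy data (invented for illustration) and prints its rules with `export_text` from `sklearn.tree`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, invented for illustration: the two youngest survive
toy_ages = [[5], [10], [30], [40]]
toy_survived = [1, 1, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(toy_ages, toy_survived)

# export_text prints one "feature <= threshold" test per internal node
rules = export_text(toy_tree, feature_names=["age"])
print(rules)
```

The threshold is a real number placed between two observed values, which is why the features must be numeric.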

As we have categorical data, we have to convert it to integer or real values. We could use either a *LabelEncoder* or a *OneHotEncoder*.

The example shows a *LabelEncoder* for the feature *pclass*. Remember that the *LabelEncoder* maps each category to an integer in $[0, K-1]$, where $K$ is the number of categories in the column, and thus introduces an ordering between categories. When the categories have no natural ordering, a *OneHotEncoder* might be a better choice.
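
The difference between the two encoders is easier to see on a tiny example. The sketch below uses a toy column invented for illustration. It calls `OneHotEncoder` with its defaults and converts the sparse result with `toarray()`, which avoids depending on the name of the sparse parameter.

```python
import pandas
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical column, invented for illustration
animals = pandas.DataFrame({"animal": ["cat", "dog", "bird", "cat"]})

# LabelEncoder: one integer per category (sorted alphabetically),
# which implies bird < cat < dog
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(animals["animal"])
print(label_encoder.classes_, codes)

# OneHotEncoder: one 0/1 column per category, no implied ordering.
# The default output is a sparse matrix, so convert it to a dense array.
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(animals[["animal"]]).toarray()
print(one_hot_encoder.categories_[0])
print(one_hot)
```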

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X_train_encoded = X_train_imputed.copy()

pclass_encoder = LabelEncoder()
pclass_encoder = pclass_encoder.fit(X_train_imputed["pclass"])
X_train_encoded["pclass"] = pclass_encoder.transform(X_train_imputed["pclass"])

# [...] Exercise: encode the sex using a OneHotEncoder (set the parameter sparse_output
# to False; in scikit-learn versions older than 1.2 the parameter is called sparse)
# Remember that the OHE works with multiple columns as input, while the LabelEncoder
# works with a single column input, so the calls to fit and transform might be a bit
# different.
# Hint: as the OneHotEncoder will return multiple columns for each column
# you can drop the original column in X_train_encoded and use something like:
# >>> new_columns = list(map(lambda s: "sex_" + s, list(encoder.categories_[0])))
# >>> X_train_encoded[new_columns] = encoder.transform(...)

X_train_encoded.head()

We are now ready to fit our first model with clean data. The data is already separated into train and test, so we fit the classifier on the train set. Finally, we check the score of the model in both train and test, to see if our model overfits.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


# NOTE: Once you have encoded the column sex using OHE, you can comment out the following line
X_train_encoded = X_train_encoded[["pclass", "age"]]


model = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=5)
model.fit(X_train_encoded, y_train)

print("Score in train:", model.score(X_train_encoded, y_train))

In order to get the score for the test dataset, we have to apply the same preprocessing as we did to the train dataset. **Remember that the test dataset cannot modify any parameter of the entire process. This means that we cannot call any method that fits anything.**
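
The pattern can be sketched on toy data (invented for illustration, not the Titanic columns): the transformer is fitted on the train set only, and the test set only goes through `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration
ages_train = np.array([[20.0], [np.nan], [40.0]])
ages_test = np.array([[np.nan], [50.0]])

age_imputer = SimpleImputer(strategy="mean")
age_imputer.fit(ages_train)  # parameters are learned from train only

ages_train_imputed = age_imputer.transform(ages_train)
ages_test_imputed = age_imputer.transform(ages_test)  # transform only, never fit

print("Learned mean:", age_imputer.statistics_)
print("Imputed test:", ages_test_imputed.ravel())
```

The missing test value is filled with the mean learned from the train set (30), not with anything computed from the test set.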

In [None]:
# TODO preprocess the dataset
X_test_prepared = X_test

print("Score in test:", model.score(X_test_prepared, y_test))

As a bonus, we will inspect the built tree:

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
# Use the names of the columns the model was actually trained on
tree.plot_tree(model, feature_names=list(X_train_encoded.columns))
plt.show()