# Building Pipelines: Exercise
## Instructions
### Objective
- Create a machine learning pipeline using the tools you've been introduced to (Pipelines, ColumnTransformer, ML Models).
- Fit this Pipeline on a dataset and get predictions.

### Requirements
- You must perform a **train-test split** on your data.
- Your final pipeline must use the following tools:

    1. Simple Imputer
    2. Standard Scaler and/or Min-Max Scaler
    3. One Hot Encoder
    4. Label Encoder
    5. Pipeline(s)
    6. Column Transformer
    7. One of the classifier models you've been introduced to.

---

### Tips (if you're stuck)
<a href='#hints'>Step-by-Step Outline</a>

# Don't touch this part!

In [None]:
# Nothing to see here...keep scrolling.
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn import set_config

set_config(display='diagram')

# This function is left uncommented on purpose so it doesn't give away the exercise.
def make_me_a_classification_dataset():
    """Don't touch this code - that's cheating! :)"""
    
    np.random.seed(51)
    X, y = make_blobs(
        n_samples=10_000,
        n_features=12,
        cluster_std=2,
        centers=3,
        random_state=51
        )
    
    df = pd.DataFrame(
        np.concatenate([X, y.reshape(-1, 1)], axis=1),
        columns=[f'f{n}' for n in range(1, 13)] + ['target']
        )
    
    ranges_lst = [(np.random.randint(67), np.random.randint(100)) 
                  for _ in range(9)]
    random_cols = np.random.choice(df.columns[:-1], size=9, replace=False)
    for rng, col in zip(ranges_lst, random_cols):
        m, n = sorted(rng)
        scaler = MinMaxScaler(feature_range=(m, n))
        new_vals = scaler.fit_transform(df[col].values.reshape(-1, 1))
        df[col] = np.round(new_vals, 2)
        del scaler
        
    labels_lst = [
        ['very low', 'low', 'high', 'very high'],
        ['slow', 'med', 'fast', 'extreme'],
        ['low', 'high']
    ]
    for labels, col in zip(labels_lst, 
                           [c for c in df.columns if c not in random_cols]):
        df[col] = pd.qcut(
            df[col], len(labels), labels=labels, duplicates='drop')
    
    df['target'] = df['target'].map(
        {0.0: 'Basset Hound', 1.0: 'Pit-bull', 2.0: 'Chihuahua'})
    
    missing = [
        (np.random.randint(1000), 
         np.random.choice(df.columns[:-1])) 
        for _ in range(750)
    ]
    for a, b in missing:
        df.loc[a, b] = np.nan
    
    return df

# Your Work Here!

In [None]:
# Imports.


In [None]:
# Loading data.
df = make_me_a_classification_dataset()

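Once `df` is loaded, a first look at it might be sketched as follows. This is only an illustration on a hypothetical toy frame (`toy`, with made-up values), not the exercise dataset; the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df` (values are made up).
toy = pd.DataFrame({
    "f1": [1.5, np.nan, 3.2, 4.8],
    "f2": pd.Categorical(["low", "high", None, "low"]),
    "target": ["Pit-bull", "Chihuahua", "Pit-bull", "Basset Hound"],
})

# Data type of each column: tells you which columns are numerical
# and which are categorical.
dtypes = toy.dtypes

# Number of missing values per column: tells you where imputing is needed.
missing_per_column = toy.isna().sum()

# Class balance of the target.
target_counts = toy["target"].value_counts()

print(dtypes)
print(missing_per_column)
print(target_counts)
```
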
<a id='hints'></a>
# *Hints*

1. Get a high-level picture of your data.
    - *Columns, data-types, missing values, etc.*
    
    
2. Perform a train-test split.


3. Create and fit a **Label Encoder** on the `target` (it is fine to do this step outside the Pipelines).


4. Create Pipelines for your preprocessing steps.
    - *There should be two pipelines: one for your numerical columns and one for your categorical columns.*
        - **Imputing**
        - **Scaling**
        - **One-Hot-Encoding**


5. Combine Pipelines in a ColumnTransformer.


6. Create a final Pipeline which uses your ColumnTransformer and finishes with a classifier.


7. Fit your pipeline on your training data.


8. Check your model's performance with at least one metric.
    - Compare the model's performance predicting both the `train` and `test` data.
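
If you are still stuck, the outline above can be sketched end to end as below. This is one possible shape, not the expected solution. It runs on a small hypothetical stand-in frame (`demo`) rather than the exercise dataset, and the column names, imputation strategies, `StandardScaler`, and `LogisticRegression` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical stand-in data (NOT the exercise dataset): two numerical
# columns, one categorical column, a string target, some missing values.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(loc=50, scale=10, size=n),
    "cat_a": rng.choice(["low", "high"], size=n),
})
demo["target"] = np.where(demo["num_a"] > 0, "yes", "no")
demo.loc[::25, "num_a"] = np.nan
demo.loc[::40, "cat_a"] = np.nan

X = demo.drop(columns="target")
y = demo["target"]

# Train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Label-encode the target outside the pipelines.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

# One preprocessing pipeline per column type.
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(exclude="number").columns.tolist()

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own pipeline.
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

# Final pipeline: preprocessing followed by a classifier.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train_enc)

# Compare performance on the train and test data.
train_acc = accuracy_score(y_train_enc, model.predict(X_train))
test_acc = accuracy_score(y_test_enc, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Because the imputers, scaler, and encoder live inside the pipeline, they are fit on the training data only when `model.fit` is called, and the test data is only ever transformed. `le.inverse_transform(model.predict(X_test))` would map the predictions back to the original string labels.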