<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Practice-Session-1-(Sess-#6)" data-toc-modified-id="Practice-Session-1-(Sess-#6)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Practice Session 1 (Sess #6)</a></span><ul class="toc-item"><li><span><a href="#Steps-to-be-followed" data-toc-modified-id="Steps-to-be-followed-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Steps to be followed</a></span></li></ul></li><li><span><a href="#Read-the-dataset" data-toc-modified-id="Read-the-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read the dataset</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>EDA</a></span></li><li><span><a href="#Baseline" data-toc-modified-id="Baseline-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Baseline</a></span></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Feature Engineering</a></span></li><li><span><a href="#Evaluation-and-Validation" data-toc-modified-id="Evaluation-and-Validation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Evaluation and Validation</a></span></li></ul></div>

# Practice Session 1 (Sess #6)

We're putting into practice what we need to know about feature engineering and model validation.

The goal this week is to practice most of the topics, so I will experiment with a new (at least, for me) dataset that can be found [here](http://archive.ics.uci.edu/ml/datasets/Cylinder+Bands). It contains $p=39$ attributes or features, a mix of categorical, integer and real types, with a total of $m=512$ samples. The target variable is the one called `band type`, taking the values `band` or `not band`.

## Steps to be followed

1. Read the dataset
2. Data preparation: Check variable types, NAs, value imputation, column names, scaling, encoding, etc.
3. EDA (Exploratory Data Analysis): Try to extract some insights from the data, such as features that are completely uncorrelated or under-represented. Can we merge values from any of the categorical features? Would it be beneficial to discretize numerical features? Do we have outliers?
4. Baseline: Simply take the simplest possible model (logistic regression) and set a base score that we'll try to improve along the process. To run logistic regression **you need** to have **numerical features**, so the fastest way of preparing your data to be used in `lr` is to perform _one hot encoding_. Perform this encoding so that you don't destroy the original prepared data. Consider including one-hot encoding as a step performed right before evaluation, on a copy of your prepared data.
5. Feature Engineering: We will try
    - categorical encoding: compare techniques like onehot and target encoding
    - feature selection: compare the results from filtering, wrappers and regularization
    - feature construction: compare GPLearn with _ad hoc_ methods, or Deep Feature Synthesis.
6. Evaluation: the goal here is to fine-tune our models, and for that we need a new model: **decision trees**, in this case. We will experiment with:
    - Comparison of cross-validation and bootstrapping.
    - Fine-tuning tree parameters: pruning and parameter optimization.

# Read the dataset

Uncomment the following line to install the `dataset` package if you want to go faster; otherwise, use pandas to read the file directly.

```python
# !pip install --upgrade git+http://github.com/renero/dataset
```
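
If you would rather skip the `dataset` package, the file can be read with pandas alone. The sketch below is only an illustration: `summarize` is a hypothetical helper that mimics part of the `data.describe()` summary, the toy CSV stands in for `bands.csv`, and treating `"?"` as the missing-value marker is an assumption that should be checked against the actual file.

```python
import io

import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Count feature types and missing values (hypothetical helper)."""
    numerical = df.select_dtypes(include="number").columns
    categorical = df.select_dtypes(exclude="number").columns
    has_na = df.isna().any()  # one boolean per column
    return {
        "features": df.shape[1],
        "samples": df.shape[0],
        "categorical": len(categorical),
        "numerical": len(numerical),
        "categorical_with_na": int(has_na[categorical].sum()),
        "numerical_with_na": int(has_na[numerical].sum()),
        "complete": int((~has_na).sum()),
    }


# Toy CSV standing in for 'bands.csv'; the "?" NA marker is an assumption.
toy_csv = io.StringIO(
    "press_speed,viscosity,solvent_type,band_type\n"
    "1800,46,line,band\n"
    "?,40,?,noband\n"
    "1750,?,line,band\n"
)
toy = pd.read_csv(toy_csv, na_values=["?"])
print(summarize(toy))

# With the real file, something along these lines should work:
# df = pd.read_csv("bands.csv", na_values=["?"])
```

Declaring the NA marker at read time matters: if it is left as plain text, numerical columns that contain it are loaded as strings and end up counted as categorical.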

```python
from dataset import Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```python
url = 'bands.csv'
data = Dataset(url) # 'data' is now holding the dataset object
df = data.features  # 'df' is now holding the pandas DataFrame
data.describe()
```

```
40 Features. 539 Samples
Available types: [dtype('float64') dtype('O')]
  · 16 categorical features
  · 24 numerical features
  · 8 categorical features with NAs
  · 20 numerical features with NAs
  · 12 Complete features
--
Target: Not set
```


# Data Preparation

There are a lot of NAs, so you can explore imputation techniques here. Once that's done, check for outliers and scale the numerical variables to a common range, and you should be done.
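
For the outlier check, one common option (not the only one) is the interquartile-range rule. The sketch below assumes a pandas DataFrame; `iqr_outliers` is a hypothetical helper, the factor `k=1.5` is the conventional default rather than something prescribed here, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numerical column, values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    below = num.lt(q1 - k * iqr, axis="columns")
    above = num.gt(q3 + k * iqr, axis="columns")
    # Comparisons with NaN are False, so missing values are never counted.
    return (below | above).sum()


# Made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "press_speed": [1800.0, 1750.0, 1820.0, 1790.0, 5000.0, np.nan],
    "viscosity": [40.0, 42.0, 41.0, 43.0, 44.0, 42.0],
})
counts = iqr_outliers(toy)
print(counts)
```

Because quantiles skip NAs by default, this check can be run either before or after imputation; running it before avoids imputed values shifting the quartiles.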

To find out which features contain NA values, you can use the `Dataset` property `data.incomplete_features`, or look them up in pandas with `df.isna().any()` over the columns.

```python
data.incomplete_features
```

```
['grain_screened',
 'proof_on_ctd',
 'blade_mfg',
 'direct_steam',
 'solvent_type',
 'type_on_cylinder',
 'cylinder_size',
 'paper_mill_location',
 'plating_tank',
 'proof_cut',
 'viscosity',
 'caliper',
 'ink_temperature',
 'humifity',
 'roughness',
 'blade_pressure',
 'varnish_pct',
 'press_speed',
 'ink_pct',
 'solvent_pct',
 'ESA_Voltage',
 'ESA_Amperage',
 'wax',
 'hardener',
 'roller_durometer',
 'current_density',
 'anode_space_ratio',
 'chrome_content']
```
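
A comparable list can be built with pandas alone. This is a minimal sketch on a toy frame (the values are invented; with the real data you would use `df` in place of `toy`):

```python
import numpy as np
import pandas as pd

# Toy frame; with the real data, replace `toy` with `df`.
toy = pd.DataFrame({
    "viscosity": [46.0, np.nan, 40.0],
    "solvent_type": ["line", np.nan, "line"],
    "press_speed": [1800.0, 1750.0, 1820.0],
})

na_per_column = toy.isna().sum()                      # NA count per column
incomplete = na_per_column[na_per_column > 0].index.tolist()
na_fraction = toy.isna().mean()                       # share of NAs per column

print(incomplete)
print(na_fraction.round(2))
```

The fraction of NAs per column is often more useful than the bare list, since it helps decide between imputing a feature and dropping it.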

Your first decision is what to do with the NAs. Then scale the numerical variables, and finally check that all categorical variables are correctly encoded.
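
One possible way to handle the first two decisions is sketched below. The choices are assumptions made for illustration, not the recommended answer: median imputation for numerical columns, most-frequent-value imputation for categorical ones, and min-max scaling to $[0, 1]$. `impute_and_scale` is a hypothetical helper and the toy values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def impute_and_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Return an imputed, scaled copy; the input frame is left untouched."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.select_dtypes(exclude="number").columns

    if len(num_cols) > 0:
        # Median imputation, then rescale every numerical column to [0, 1].
        out[num_cols] = out[num_cols].fillna(out[num_cols].median())
        out[num_cols] = MinMaxScaler().fit_transform(out[num_cols])

    for col in cat_cols:
        mode = out[col].mode()
        if not mode.empty:  # a column with only NAs has no mode
            out[col] = out[col].fillna(mode.iloc[0])
    return out


# Invented values; only the column names come from the dataset.
toy = pd.DataFrame({
    "viscosity": [40.0, np.nan, 50.0, 60.0],
    "solvent_type": ["line", np.nan, "line", "xylol"],
})
clean = impute_and_scale(toy)
print(clean)
```

Note that fitting the imputer and scaler on the full dataset leaks information into any later validation split; when scoring models, it is safer to fit these steps on the training folds only.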

# EDA

# Baseline
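
As described in the steps at the top, the baseline is a logistic regression fitted on a one-hot encoded copy of the prepared data. The following is a rough sketch under several assumptions: the target column is called `band_type`, the data has already been imputed and scaled, and accuracy with 5-fold cross-validation is an acceptable base score. `baseline_score` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def baseline_score(prepared: pd.DataFrame, target: str, cv: int = 5) -> float:
    """Mean cross-validated accuracy of logistic regression on a one-hot copy."""
    # get_dummies returns a new frame, so `prepared` itself is not modified.
    X = pd.get_dummies(prepared.drop(columns=[target]), dtype=float)
    y = prepared[target]
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()


# Synthetic stand-in for the prepared data (values are not from bands.csv).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "press_speed": rng.uniform(0, 1, n),
    "solvent_type": rng.choice(["line", "xylol"], n),
})
noisy = toy["press_speed"] + rng.normal(0, 0.1, n)
toy["band_type"] = np.where(noisy > 0.5, "band", "noband")

score = baseline_score(toy, "band_type")
print(f"baseline accuracy: {score:.3f}")
```

Keeping the encoding inside the scoring function means the prepared frame keeps its categorical columns, so other encoders can later be compared against the same starting point.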

# Feature Engineering

# Evaluation and Validation
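
To compare cross-validation with bootstrapping, one option is to score the same decision tree both ways. The sketch below uses synthetic data from `make_classification` instead of the bands dataset, and a hypothetical helper, `bootstrap_scores`, that trains on each bootstrap resample and scores on the rows left out of it (out-of-bag). The tree depth and the number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def bootstrap_scores(model, X, y, n_rounds=50, seed=0):
    """Out-of-bag accuracy over `n_rounds` bootstrap resamples."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(y))
    scores = []
    for _ in range(n_rounds):
        train = rng.choice(idx, size=len(idx), replace=True)
        test = np.setdiff1d(idx, train)  # rows never drawn in this round
        if len(test) == 0:
            continue
        model.fit(X[train], y[train])
        scores.append(model.score(X[test], y[test]))
    return np.array(scores)


# Synthetic data standing in for the encoded bands features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

cv_scores = cross_val_score(tree, X, y, cv=10)
bs_scores = bootstrap_scores(tree, X, y)

print(f"10-fold CV: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"bootstrap : {bs_scores.mean():.3f} +/- {bs_scores.std():.3f}")
```

Each bootstrap round trains on roughly 63% of the distinct rows, so its estimate tends to be somewhat pessimistic compared with 10-fold cross-validation; comparing the two spreads is part of the exercise.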