<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dataset-Class" data-toc-modified-id="Dataset-Class-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dataset Class</a></span><ul class="toc-item"><li><span><a href="#Basic-data-manipulation-functions" data-toc-modified-id="Basic-data-manipulation-functions-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Basic data manipulation functions</a></span><ul class="toc-item"><li><span><a href="#Load-data-from-dataframe" data-toc-modified-id="Load-data-from-dataframe-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Load data from dataframe</a></span></li><li><span><a href="#Access-feature-(column)-names" data-toc-modified-id="Access-feature-(column)-names-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Access feature (column) names</a></span></li></ul></li><li><span><a href="#Basic-Data-Preparation-methods" data-toc-modified-id="Basic-Data-Preparation-methods-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Basic Data Preparation methods</a></span><ul class="toc-item"><li><span><a href="#Replace-NA" data-toc-modified-id="Replace-NA-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Replace NA</a></span></li><li><span><a href="#Fix-numerical-features" data-toc-modified-id="Fix-numerical-features-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Fix numerical features</a></span></li></ul></li><li><span><a href="#Basic-Feature-Selection-methods" data-toc-modified-id="Basic-Feature-Selection-methods-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Basic Feature Selection methods</a></span><ul class="toc-item"><li><span><a href="#Under-represented-features-and-Correlation" data-toc-modified-id="Under-represented-features-and-Correlation-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Under represented features and Correlation</a></span></li><li><span><a href="#Wrapper-method:-stepwise-feature-selection" 
data-toc-modified-id="Wrapper-method:-stepwise-feature-selection-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Wrapper method: stepwise feature selection</a></span></li></ul></li><li><span><a href="#Baseline-a-simple-linear-model" data-toc-modified-id="Baseline-a-simple-linear-model-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Baseline a simple linear model</a></span></li></ul></li></ul></div>

# Dataset Class

This class collects the helper methods used throughout the lessons, specifically for data preparation and basic feature engineering.

To start using it, import the class:

    from dataset import Dataset

In [2]:
# imports
import numpy as np
import pandas as pd
import statsmodels.api as sm
import copy
import warnings

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline, make_pipeline

from dataset import Dataset

warnings.simplefilter(action='ignore')

In [3]:
houses = Dataset('./data/houseprices_prepared.csv.gz')
houses.set_target('SalePrice')
houses.describe()

79 Features. 1460 Samples
Available types: [dtype('float64') dtype('O')]
  · 43 categorical features
  · 36 numerical features
  · 16 categorical features with NAs
  · 0 numerical features with NAs
  · 63 Complete features
--
Target: SalePrice (float64)
'SalePrice'
  · Min.: 34900.0000
  · 1stQ: 129975.0000
  · Med.: 163000.0000
  · Mean: 180921.1959
  · 3rdQ: 214000.0000
  · Max.: 755000.0000


## Basic data manipulation functions

### Load data from dataframe

To load data from an existing dataframe into this class, use:

In [3]:
my_existing_dataframe = pd.read_csv('./data/houseprices_prepared.csv.gz')
del houses  # discard the instance loaded from file above

houses = Dataset.from_dataframe(my_existing_dataframe)
houses.set_target('SalePrice')
houses.describe()

79 Features. 1460 Samples
Available types: [dtype('float64') dtype('O')]
  · 43 categorical features
  · 36 numerical features
  · 16 categorical features with NAs
  · 0 numerical features with NAs
  · 63 Complete features
--
Target: SalePrice (float64)
'SalePrice'
  · Min.: 34900.0000
  · 1stQ: 129975.0000
  · Med.: 163000.0000
  · Mean: 180921.1959
  · 3rdQ: 214000.0000
  · Max.: 755000.0000


### Access feature (column) names

`houses.table(kind)` prints a table with the names of the features of a given kind. The example below uses `categorical_na`, that is, the categorical features that contain NAs. Other options are:

  - all (default)
  - features
  - target
  - complete
  - numerical
  - numerical_na
  - categorical

The table format is more convenient than a plain list when there are many features. To display the features of any kind as a table, use:

In [4]:
houses.table('categorical_na')

-----------------------------------------------------------------------------
Alley        MasVnrType   BsmtQual     BsmtCond     BsmtExposure BsmtFinType1 
BsmtFinType2 Electrical   FireplaceQu  GarageType   GarageFinish GarageQual   
GarageCond   PoolQC       Fence        MiscFeature  
-----------------------------------------------------------------------------
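
The kinds listed above map naturally onto dtype and missing-value checks in pandas. The following is a minimal sketch of that idea on a plain DataFrame. `feature_names` is a hypothetical helper written for illustration, and the toy columns are made up; how `Dataset` builds these lists internally is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def feature_names(df, kind="all"):
    """Return the column names of `df` matching `kind` (hypothetical helper)."""
    na_mask = df.isna().any().to_numpy()          # True where a column has NAs
    has_na = df.columns[na_mask]
    numerical = df.select_dtypes(include="number").columns
    categorical = df.select_dtypes(exclude="number").columns
    kinds = {
        "all": df.columns,
        "complete": df.columns[~na_mask],
        "numerical": numerical,
        "numerical_na": numerical.intersection(has_na),
        "categorical": categorical,
        "categorical_na": categorical.intersection(has_na),
    }
    return list(kinds[kind])


# Toy data, not the house prices dataset
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0],
    "Alley": ["Grvl", np.nan, np.nan],
    "Street": ["Pave", "Pave", "Grvl"],
})
print(feature_names(toy, "categorical_na"))
print(feature_names(toy, "complete"))
```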


## Basic Data Preparation methods

### Replace NA

Replace the NAs with new values in all 'categorical_na' features. 'Electrical' is a special case: its NAs are replaced by 'Unknown', while those in the remaining features are replaced by 'None'. As the code shows, `replace_na` accepts either a single column name or a list of column names.

To obtain the list of names for each kind of feature in the dataset, we use `dataset.names(kind)`.

Then describe the dataset to check that no NAs remain among the categorical variables.

In [5]:
houses.replace_na(column='Electrical', value='Unknown')
houses.replace_na(column=houses.names('categorical_na'), value='None')
houses.table('categorical_na')

houses.describe()

79 Features. 1460 Samples
Available types: [dtype('float64') dtype('O')]
  · 43 categorical features
  · 36 numerical features
  · 0 categorical features with NAs
  · 0 numerical features with NAs
  · 79 Complete features
--
Target: SalePrice (float64)
'SalePrice'
  · Min.: 34900.0000
  · 1stQ: 129975.0000
  · Med.: 163000.0000
  · Mean: 180921.1959
  · 3rdQ: 214000.0000
  · Max.: 755000.0000
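
For readers who want to see the same operation on a plain DataFrame, here is a minimal sketch based on pandas `fillna`. The `replace_na` function below is a hypothetical stand-in that only mimics the call signature used above, and the toy values are made up; it is not the `Dataset` implementation.

```python
import numpy as np
import pandas as pd


def replace_na(df, column, value):
    """Fill NAs in one column or in a list of columns (hypothetical helper)."""
    columns = [column] if isinstance(column, str) else list(column)
    df[columns] = df[columns].fillna(value)
    return df


# Toy data, not the house prices dataset
toy = pd.DataFrame({
    "Electrical": ["SBrkr", np.nan, "FuseA"],
    "Alley": [np.nan, "Grvl", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan],
})

# Special case first, then every other column that still has NAs
replace_na(toy, "Electrical", "Unknown")
remaining = list(toy.columns[toy.isna().any().to_numpy()])
replace_na(toy, remaining, "None")

print(toy)
print("NAs left:", int(toy.isna().sum().sum()))
```

Filling 'Electrical' first matters: once it has no NAs, it is no longer picked up by the second call, so its 'Unknown' values are kept.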


### Fix numerical features

We scale the numerical features to a common range, so that each has mean 0 and standard deviation 1. After that, we fix skewness where it is present.

In [6]:
houses.scale()
houses.fix_skewness()
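
The two ideas can be illustrated on toy data with pandas and NumPy alone. This is a sketch under stated assumptions, not what `scale()` and `fix_skewness()` do internally: the skewness cut-off of 0.75 and the choice of `log1p` are hypothetical, and the log transform is applied before standardizing because `log1p` needs values greater than -1, whereas the cell above scales first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy data, not the house prices dataset
toy = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=500),    # right-skewed
    "YearBuilt": rng.normal(loc=1970.0, scale=30.0, size=500),  # roughly symmetric
})
print("skewness before:", toy.skew().round(2).to_dict())

# Assumed rule: transform only the columns whose skewness exceeds 0.75
SKEW_LIMIT = 0.75
skewed = [c for c in toy.columns if abs(toy[c].skew()) > SKEW_LIMIT]
toy[skewed] = np.log1p(toy[skewed])

# Standardize every column to mean 0 and standard deviation 1 (z-score)
toy = (toy - toy.mean()) / toy.std(ddof=0)

print("transformed columns:", skewed)
print("skewness after:", toy.skew().round(2).to_dict())
```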

## Basic Feature Selection methods

### Under represented features and Correlation

It's time to check whether the dataset contains under-represented features, or features that are highly correlated with others. Both kinds add little information, so we can drop them.

In [7]:
under_represented_features = houses.under_represented_features()
houses.drop_columns(under_represented_features)
print('Dropping {} under represented features'.format(
    len(under_represented_features)))

redundant_features = houses.correlated(threshold=0.7)
houses.drop_columns(redundant_features)
print('Dropping {} highly correlated features'.format(
    len(redundant_features)))

houses.describe()

Dropping 5 under represented features
Dropping 5 highly correlated features
69 Features. 1460 Samples
Available types: [dtype('float64') dtype('O')]
  · 36 categorical features
  · 33 numerical features
  · 0 categorical features with NAs
  · 0 numerical features with NAs
  · 69 Complete features
--
Target: SalePrice (float64)
'SalePrice'
  · Min.: 34900.0000
  · 1stQ: 129975.0000
  · Med.: 163000.0000
  · Mean: 180921.1959
  · 3rdQ: 214000.0000
  · Max.: 755000.0000
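
The sketch below shows one possible way to implement both checks with pandas. The definitions are assumptions made for illustration: a feature is treated as under-represented when a single value covers more than 98% of the rows, and for each pair of numerical features whose absolute Pearson correlation exceeds the threshold, the second one is flagged. The criteria actually used by `Dataset` are not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd


def under_represented(df, threshold=0.98):
    """Columns whose most frequent value covers more than `threshold` of rows."""
    return [c for c in df.columns
            if df[c].value_counts(normalize=True).iloc[0] > threshold]


def correlated(df, threshold=0.7):
    """One column from each pair with absolute correlation above `threshold`."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so that each pair is checked only once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]


# Toy data, not the house prices dataset
rng = np.random.default_rng(0)
area = rng.normal(size=200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area + rng.normal(scale=0.1, size=200),  # near-duplicate
    "LotFrontage": rng.normal(size=200),                     # independent
    "Utilities": ["AllPub"] * 199 + ["NoSeWa"],              # almost constant
})
print(under_represented(toy))
print(correlated(toy, threshold=0.7))
```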


### Wrapper method: stepwise feature selection

Scikit-learn does not implement p-value-based stepwise selection, because it deliberately avoids an inferential approach to model learning. Feature selection based on a significance test such as the $p$-value is therefore strongly discouraged, at least in this package.

The `Dataset` method implements forward and backward feature selection based on the p-values from an Ordinary Least Squares fit of a linear regression. It requires numerical features, which are needed to fit the linear model and to check the importance of each variable.

The algorithm:

    1. set columns available = all_columns
    2. while a new column is added or removed
       3.  find the minimum p-value among all the columns available
       4.  if the minimum p-value found < MIN_P_VALUE
           5. add that column to the list of features selected
       6.  find the maximum p-value among all the features selected
       7.  if the maximum p-value found > MAX_P_VALUE
           8. drop that column from the list of features selected
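
The steps above can be sketched with NumPy alone. This is an illustration, not the `Dataset` implementation: the function names, the thresholds 0.01 and 0.05, and the toy data are all assumptions. The p-values here come from a normal approximation to the t distribution, which is only reasonable for large samples; the `pvalues` attribute of a fitted `statsmodels` OLS model gives the exact values.

```python
import math
import numpy as np


def ols_pvalues(X, y):
    """Approximate two-sided p-values of the OLS coefficients (no intercept)."""
    A = np.column_stack([np.ones(len(y)), X])          # prepend the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (A.shape[0] - A.shape[1])  # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t = beta / se
    # Normal approximation (assumption): p = erfc(|t| / sqrt(2))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in t[1:]])


def stepwise_selection(X, y, names, min_p=0.01, max_p=0.05):
    """Forward and backward selection driven by p-values (sketch)."""
    selected = []
    changed = True
    while changed:
        changed = False
        # Forward step: test each remaining column next to the selected ones
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if candidates:
            pvals = [ols_pvalues(X[:, selected + [j]], y)[-1]
                     for j in candidates]
            best = int(np.argmin(pvals))
            if pvals[best] < min_p:
                selected.append(candidates[best])
                changed = True
        # Backward step: drop the least significant selected column
        if selected:
            pvals = ols_pvalues(X[:, selected], y)
            worst = int(np.argmax(pvals))
            if pvals[worst] > max_p:
                selected.pop(worst)
                changed = True
    return [names[j] for j in selected]


# Toy data: only columns "a" and "c" influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=500)
result = stepwise_selection(X, y, ["a", "b", "c", "d"])
print(result)
```

Keeping the entry threshold below the exit threshold (`min_p < max_p`) prevents a column from being added and dropped again in the same pass.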

In [8]:
houses.onehot_encode()
best_features = houses.stepwise_selection(verbose=False)
print('Selected {} features, from original {} set'.format(
    len(best_features), len(houses.names('features'))))

houses.keep_columns(best_features)
houses.describe()

Selected 38 features, from original 254 set
38 Features. 1460 Samples
Available types: [dtype('float64')]
  · 0 categorical features
  · 38 numerical features
  · 0 categorical features with NAs
  · 0 numerical features with NAs
  · 38 Complete features
--
Target: SalePrice (float64)
'SalePrice'
  · Min.: 34900.0000
  · 1stQ: 129975.0000
  · Med.: 163000.0000
  · Mean: 180921.1959
  · 3rdQ: 214000.0000
  · Max.: 755000.0000


## Baseline a simple linear model

In [9]:
# Split the data, then cross-validate on the training part only
X, y = houses.split()
model = LinearRegression()
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=666)
scores = cross_val_score(model, 
                         X.train, y.train, 
                         cv=cv, 
                         scoring='r2')
# Report only the splits that produced a positive R2
positive_scores = scores[scores > 0.0]
print('Obtained {} positive R2 scores'.format(len(positive_scores)))
print('Avg. CV R2: {:.2f} +/- {:.02}'.format(
    np.mean(positive_scores),
    np.std(positive_scores)))

# Refit on the whole training part and score on the hold-out part
model.fit(X.train, y.train)
print('R2 in hold-out dataset: {:.2f}'.format(
    model.score(X.test, y.test)))

Obtained 100 positive R2 scores
Avg. CV R2: 0.85 +/- 0.032
R2 in hold-out dataset: 0.90


In [10]:
houses.table('features')

--------------------------------------------------------------
BsmtFinSF1           Fireplaces           FullBath             
GarageCars           GrLivArea            LotArea              
LotFrontage          MSSubClass           OverallCond          
OverallQual          YearBuilt            MSZoning_C (all)     
LandContour_Bnk      LotConfig_FR2        Neighborhood_BrkSide 
Neighborhood_Crawfor Neighborhood_Edwards Neighborhood_NoRidge 
Neighborhood_NridgHt Neighborhood_Somerst Neighborhood_StoneBr 
Condition1_Norm      BldgType_Duplex      Exterior1st_BrkFace  
ExterQual_Ex         ExterQual_TA         BsmtQual_Ex          
BsmtExposure_Av      BsmtExposure_Gd      KitchenQual_Ex       
Functional_Typ       FireplaceQu_None     GarageType_2Types    
GarageType_None      GarageFinish_None    GarageQual_Ex        
GarageQual_None      SaleType_New         
--------------------------------------------------------------
