In [19]:
import matplotlib.pyplot as plt
import numpy
import pandas

import seaborn
seaborn.set_context('talk')

In [20]:
melb_df = pandas.read_csv(
    'https://cs.famaf.unc.edu.ar/~mteruel/datasets/diplodatos/melb_data.csv')
melb_df[:3]

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |


## StandardScaler

`Sklearn` provides a set of classes called `transformers`. A transformer is a mapping (projection) that receives an input matrix of shape $N \times M$ and projects it to a new space of shape $N \times D$. $D$ can be lower than $M$ (dimensionality reduction), equal to $M$ (preprocessing/scaling), or higher than $M$ (encodings).

`StandardScaler` is a transformer that standardizes a set of input columns by subtracting each column's mean and dividing by its standard deviation.
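As a minimal sketch of that definition, the toy column below (made-up values, not taken from the dataset) is standardized once by hand and once with `StandardScaler`; the two results should agree. The names `x`, `scaler_toy` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values, not taken from melb_df).
x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler_toy = StandardScaler().fit(x)

# Manual standardization: subtract the mean, divide by the population std.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The fitted statistics are stored in mean_ and scale_.
print(scaler_toy.mean_, scaler_toy.scale_)
print(np.allclose(scaler_toy.transform(x), manual))
```

Fitting and transforming are separate steps, so the statistics learned on one dataset can be reused to transform another.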

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(melb_df[["Price"]])

In [22]:
melb_df["scaled_Price"] = scaler.transform(melb_df[["Price"]])

In [23]:
melb_df[["Price", "scaled_Price"]]

| | Price | scaled_Price |
|---|---|---|
| 0 | 1480000.0 | 0.632448 |
| 1 | 1035000.0 | -0.063640 |
| 2 | 1465000.0 | 0.608984 |
| 3 | 850000.0 | -0.353025 |
| 4 | 1600000.0 | 0.820157 |
| ... | ... | ... |
| 13575 | 1245000.0 | 0.264851 |
| 13576 | 1031000.0 | -0.069897 |
| 13577 | 1170000.0 | 0.147533 |
| 13578 | 2500000.0 | 2.227975 |


In [24]:
melb_df["scaled_Price"].mean(axis=0), melb_df["scaled_Price"].std(axis=0)

(1.4441074747407044e-16, 1.0000368208848183)
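The standard deviation printed above is slightly larger than 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`). With $N = 13580$ rows the ratio between the two is $\sqrt{N/(N-1)} \approx 1.0000368$, which matches the printed value. The sketch below reproduces the effect on made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for a numeric column.
s = pd.Series([3.0, 7.0, 8.0, 12.0, 20.0])
scaled = pd.Series(StandardScaler().fit_transform(s.to_frame()).ravel())

n = len(s)
print(scaled.std(ddof=0))    # population std of the scaled data
print(scaled.std())          # pandas default, ddof=1
print(np.sqrt(n / (n - 1)))  # expected ratio between the two
```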

## Ordinal Encoding

`OrdinalEncoder` receives an ordinal categorical variable with `N` categories and enumerates them with the integers $0, \dots, N-1$.

If the variable has no natural order, it is better to use `OneHotEncoder`. As a demonstration only, we are going to apply `OrdinalEncoder` to `SellerG`, even though this variable is not ordinal.
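To make the distinction concrete, here is a small sketch on two made-up variables. For the ordinal one, the intended order is passed through the `categories` parameter; without it, the categories are sorted, which for strings means alphabetically. For the nominal one, `OneHotEncoder` creates one indicator column per category. All variable names and values here are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ordinal variable: the order low < medium < high is given explicitly.
sizes = np.array([["low"], ["high"], ["medium"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
sizes_encoded = ordinal.fit_transform(sizes)
print(sizes_encoded.ravel())

# Hypothetical nominal variable: one-hot encoding avoids implying an order.
colors = np.array([["red"], ["green"], ["red"]])
onehot = OneHotEncoder()
colors_encoded = onehot.fit_transform(colors).toarray()
print(onehot.categories_)
print(colors_encoded)
```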

In [25]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
oe.fit(melb_df[["SellerG"]])

After fitting, you can access the enumerated categories through the `categories_` attribute.

In [26]:
oe.categories_

[array(['@Realty', 'ASL', "Abercromby's", 'Ace', 'Alexkarbon', 'Allens',
        'Anderson', 'Appleby', 'Aquire', 'Area', 'Ascend', 'Ash', 'Asset',
        'Assisi', 'Australian', 'Barlow', 'Barry', 'Bayside', 'Bekdon',
        'Beller', 'Bells', 'Besser', 'Better', 'Biggin', 'Blue',
        'Boutique', 'Bowman', 'Brace', 'Brad', 'Buckingham', 'Bullen',
        'Burnham', 'Buxton', 'Buxton/Advantage', 'Buxton/Find', 'C21',
        'CASTRAN', 'Caine', 'Calder', 'Carter', 'Castran', 'Cayzer',
        'Century', 'Chambers', 'Changing', 'Charlton', 'Chisholm',
        'Christopher', 'Clairmont', 'Collins', 'Community', 'Compton',
        'Conquest', 'Considine', 'Coventry', 'Craig', 'Crane', "D'Aprano",
        'Daniel', 'Darras', 'Darren', 'David', 'Del', 'Dingle', 'Direct',
        'Dixon', 'Domain', 'Douglas', 'Edward', 'Elite', 'Eview', 'FN',
        'First', 'Fletchers', 'Fletchers/One', 'Follett', 'Frank', 'Free',
        'GL', 'Galldon', 'Gardiner', 'Garvey', 'Gary', 'Geoff', 'Grant

In [27]:
melb_df["encoded_SellerG"] = oe.transform(melb_df[["SellerG"]])

In [28]:
melb_df["encoded_SellerG"].value_counts()

155.0    1565
106.0    1316
260.0    1167
16.0     1011
196.0     701
         ... 
187.0       1
129.0       1
258.0       1
96.0        1
180.0       1
Name: encoded_SellerG, Length: 268, dtype: int64

You can also obtain the categories given the enumeration with `inverse_transform`.

In [29]:
oe.inverse_transform(melb_df[["encoded_SellerG"]])

array([['Biggin'],
       ['Biggin'],
       ['Biggin'],
       ...,
       ['Raine'],
       ['Sweeney'],
       ['Village']], dtype=object)

## Discretizers

Discretization provides a way to partition continuous features into discrete values. Discretized features can make a model more expressive, while maintaining interpretability.

Sklearn implements this through the class `KBinsDiscretizer`; by default (`encode='onehot'`), the output is one-hot encoded into a sparse matrix.
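The sketch below illustrates that one-hot, sparse output on a made-up feature. `strategy='uniform'` is used here only to keep the bin edges easy to read; it is not the default.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

# Made-up single feature with values 0..5; equal-width bins over [0, 5].
x = np.arange(6, dtype=float).reshape(-1, 1)

disc = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform")
binned = disc.fit_transform(x)

print(sparse.issparse(binned))  # the one-hot output is a sparse matrix
print(binned.toarray())         # densify it to inspect the indicator columns
```

Each row has a single 1 in the column of the bin that contains the value.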

In [30]:
from sklearn.preprocessing import KBinsDiscretizer

Let's suppose that we want to discretize the columns `Price` and `Rooms`, creating 3 segments for the first column and only 2 for the second. To do so, we instantiate `KBinsDiscretizer` with the parameter `n_bins`, which takes a list of integers where each element is the number of segments for the corresponding column; in this case `n_bins=[3, 2]`. The parameter `encode='ordinal'` tells the discretizer to label each segment with an integer identifier.

In [31]:
kbe = KBinsDiscretizer(n_bins=[3, 2], encode='ordinal')

In [32]:
kbe.fit(melb_df[["Price", "Rooms"]])

The new segments are:

`Price`: `[85000, 730000), [730000, 1180000), [1180000, 9000000)`

`Rooms`: `[1, 3), [3, 10)`

Then, each segment is enumerated with an identifier.

`Price`: `[85000, 730000) -> 0, [730000, 1180000) -> 1, [1180000, 9000000) -> 2`

`Rooms`: `[1, 3) -> 0, [3, 10) -> 1`

Finally, we use the identifiers as the encoding.


In [33]:
kbe.bin_edges_

array([array([  85000.,  730000., 1180000., 9000000.]),
       array([ 1.,  3., 10.])], dtype=object)
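The discretizer above was created without a `strategy` argument, so it uses the default, `strategy='quantile'`, which places the edges so that each bin holds roughly the same number of rows. The sketch below contrasts it with `strategy='uniform'` (equal-width bins) on a made-up skewed feature. Newer sklearn versions may emit warnings about quantile defaults, which can be ignored for this illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up skewed feature with one large outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float).reshape(-1, 1)

edges = {}
for strategy in ("quantile", "uniform"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    disc.fit(x)
    edges[strategy] = disc.bin_edges_[0]
    print(strategy, edges[strategy])
```

With the quantile strategy the inner edges stay among the bulk of the data, whereas the uniform strategy lets the outlier stretch the bins.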

In [34]:
melb_df[["discretized_Price", "discretized_Rooms"]] = kbe.transform(melb_df[["Price", "Rooms"]])

In [35]:
melb_df[["discretized_Price", "discretized_Rooms"]]

| | discretized_Price | discretized_Rooms |
|---|---|---|
| 0 | 2.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 2.0 | 1.0 |
| 3 | 1.0 | 1.0 |
| 4 | 2.0 | 1.0 |
| ... | ... | ... |
| 13575 | 2.0 | 1.0 |
| 13576 | 1.0 | 1.0 |
| 13577 | 1.0 | 1.0 |
| 13578 | 2.0 | 1.0 |


## Polynomial Features

In [36]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(4)
poly.fit(melb_df[["Price", "Distance"]])

In [37]:
poly_features = poly.transform(melb_df[["Price", "Distance"]])

In [38]:
poly_features

array([[1.00000000e+00, 1.48000000e+06, 2.50000000e+00, ...,
        1.36900000e+13, 2.31250000e+07, 3.90625000e+01],
       [1.00000000e+00, 1.03500000e+06, 2.50000000e+00, ...,
        6.69515625e+12, 1.61718750e+07, 3.90625000e+01],
       [1.00000000e+00, 1.46500000e+06, 2.50000000e+00, ...,
        1.34139062e+13, 2.28906250e+07, 3.90625000e+01],
       ...,
       [1.00000000e+00, 1.17000000e+06, 6.80000000e+00, ...,
        6.32979360e+13, 3.67885440e+08, 2.13813760e+03],
       [1.00000000e+00, 2.50000000e+06, 6.80000000e+00, ...,
        2.89000000e+14, 7.86080000e+08, 2.13813760e+03],
       [1.00000000e+00, 1.28500000e+06, 6.30000000e+00, ...,
        6.55371202e+13, 3.21310395e+08, 1.57529610e+03]])
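`PolynomialFeatures(d)` generates every monomial of the input columns up to total degree `d`, including a constant column of ones. In the output above, the last three columns of the first row are consistent with $a^2 b^2$, $a b^3$ and $b^4$ for $a$ = `Price` and $b$ = `Distance`. A small sketch with degree 2 on one made-up row makes the column layout easier to see:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row with two features, a = 2 and b = 3.
x = np.array([[2.0, 3.0]])

poly2 = PolynomialFeatures(degree=2)
expanded = poly2.fit_transform(x)

print(poly2.powers_)  # exponent of each input feature in each output column
print(expanded)       # columns: 1, a, b, a^2, a*b, b^2

# With two inputs, degree 4 produces 15 output columns.
n_degree4 = PolynomialFeatures(degree=4).fit(x).n_output_features_
print(n_degree4)
```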

## Pipelines

In [39]:
# Binary target: 1 if the price is above 1,000,000, otherwise 0.
melb_df["label"] = (melb_df["Price"] > 1000000).astype(int)

In [40]:
melb_df["label"].value_counts()

0    7837
1    5743
Name: label, dtype: int64

In [41]:
melb_df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount', 'scaled_Price',
       'encoded_SellerG', 'discretized_Price', 'discretized_Rooms', 'label'],
      dtype='object')

In [42]:
X, y = melb_df[["Rooms", "Type", "Method", "Distance"]], melb_df["label"]

In [43]:
X

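| | Rooms | Type | Method | Distance |
|---|---|---|---|---|
| 0 | 2 | h | S | 2.5 |
| 1 | 2 | h | S | 2.5 |
| 2 | 3 | h | SP | 2.5 |
| 3 | 3 | h | PI | 2.5 |
| 4 | 4 | h | VB | 2.5 |
| ... | ... | ... | ... | ... |
| 13575 | 4 | h | S | 16.7 |
| 13576 | 3 | h | SP | 6.8 |
| 13577 | 3 | h | S | 6.8 |
| 13578 | 4 | h | PI | 6.8 |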


In [44]:
y

0        1
1        1
2        1
3        0
4        1
        ..
13575    1
13576    1
13577    1
13578    1
13579    1
Name: label, Length: 13580, dtype: int64

In [45]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [46]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder


col_transformer = ColumnTransformer([
    ("categ", OneHotEncoder(), ["Type", "Method"]),
    ("scale", StandardScaler(), ["Rooms", "Distance"]),
    ("poly", PolynomialFeatures(2), ["Rooms", "Distance"])
])

col_transformer.fit(X_train)

In [47]:
X_train

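| | Rooms | Type | Method | Distance |
|---|---|---|---|---|
| 12167 | 1 | u | S | 5.0 |
| 6524 | 2 | h | SA | 8.0 |
| 8413 | 3 | h | S | 12.6 |
| 2919 | 3 | u | SP | 13.0 |
| 6043 | 3 | h | S | 13.3 |
| ... | ... | ... | ... | ... |
| 13123 | 3 | h | SP | 5.2 |
| 3264 | 3 | h | S | 10.5 |
| 9845 | 4 | h | PI | 6.7 |
| 10799 | 3 | h | S | 12.0 |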


In [48]:
col_transformer.transform(X_train)


array([[  0.  ,   0.  ,   1.  , ...,   1.  ,   5.  ,  25.  ],
       [  1.  ,   0.  ,   0.  , ...,   4.  ,  16.  ,  64.  ],
       [  1.  ,   0.  ,   0.  , ...,   9.  ,  37.8 , 158.76],
       ...,
       [  1.  ,   0.  ,   0.  , ...,  16.  ,  26.8 ,  44.89],
       [  1.  ,   0.  ,   0.  , ...,   9.  ,  36.  , 144.  ],
       [  1.  ,   0.  ,   0.  , ...,  16.  ,  25.6 ,  40.96]])

In [49]:
col_transformer.transform(X_test).shape

(2716, 16)
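`ColumnTransformer` applies each transformer to its own subset of columns and concatenates the results side by side. The 16 output columns above would be consistent with 8 one-hot columns (assuming 3 distinct `Type` values and 5 distinct `Method` values in the training split, which is not shown directly), plus 2 scaled columns and 6 polynomial columns. The sketch below repeats the same setup on a made-up frame where the counts are easy to check by hand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Made-up frame with the same column names as X (values are illustrative).
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4],
    "Type": ["h", "u", "t", "h"],
    "Method": ["S", "SP", "S", "PI"],
    "Distance": [2.5, 5.0, 7.5, 10.0],
})

ct = ColumnTransformer([
    ("categ", OneHotEncoder(), ["Type", "Method"]),
    ("scale", StandardScaler(), ["Rooms", "Distance"]),
    ("poly", PolynomialFeatures(2), ["Rooms", "Distance"]),
])
out = ct.fit_transform(toy)

# 3 Type levels + 3 Method levels + 2 scaled + 6 polynomial = 14 columns.
print(out.shape)
```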

In [50]:
from sklearn.base import BaseEstimator
import pandas as pd
from random import random

class RandomClassifier(BaseEstimator):
    def fit(self, X, y):
        # Nothing to learn here: predictions are random, so fit is a no-op.
        return self

    def predict(self, X):
        y_pred = [random() > 0.5 for _ in range(len(X))]
        y_pred = pd.Series(y_pred)
        return y_pred.astype(int)
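`RandomClassifier.fit` learns nothing. As a hedged sketch of what a `fit` that stores state could look like, the hypothetical `MajorityClassifier` below remembers the most frequent training label and predicts it for every row. Attributes ending in an underscore follow the sklearn convention for fitted state, and `ClassifierMixin` contributes a `score` method that reports accuracy.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.classes_ = labels
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        # Same prediction for every row, regardless of the features.
        return np.full(len(X), self.majority_)


# Made-up data: label 0 is the majority class.
X_toy = np.zeros((5, 2))
y_toy = np.array([0, 0, 0, 1, 1])

clf = MajorityClassifier().fit(X_toy, y_toy)
toy_pred = clf.predict(X_toy)
toy_score = clf.score(X_toy, y_toy)
print(toy_pred, toy_score)
```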

In [51]:
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("preprocessor", col_transformer),
    ("pca", PCA(n_components=3)),
    ("classifier", RandomClassifier())
])
pipe

In [52]:
pipe.fit(X_train, y_train)

In [53]:
y_pred = pipe.predict(X_test)

In [54]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.59      0.50      0.54      1585
           1       0.42      0.52      0.47      1131

    accuracy                           0.51      2716
   macro avg       0.51      0.51      0.50      2716
weighted avg       0.52      0.51      0.51      2716
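An accuracy close to 0.5 is what one would expect from `RandomClassifier`, which ignores its input, so the report mainly serves as a baseline. Any estimator with `fit` and `predict` can take the classifier's place as the last pipeline step. The sketch below uses `LogisticRegression` on synthetic data (not the Melbourne dataset), where the label depends linearly on the features; the names `toy_pipe`, `X_syn` and `y_syn` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label is 1 when the two features sum to a positive value.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 2))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])
toy_pipe.fit(X_tr, y_tr)

# score() runs the preprocessing steps and then reports the classifier's accuracy.
accuracy = toy_pipe.score(X_te, y_te)
print(accuracy)
```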



In [55]:
rclf = RandomClassifier()
rclf.fit(X_train, y_train)
rclf.predict(X_test)

0       1
1       0
2       1
3       0
4       1
       ..
2711    1
2712    1
2713    1
2714    1
2715    0
Length: 2716, dtype: int64

In [56]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(2, 5))
min_max_scaler.fit(melb_df[["Price"]])
min_max_scaler.transform(melb_df[["Price"]])

array([[2.46943354],
       [2.31968592],
       [2.46438587],
       ...,
       [2.36511497],
       [2.81267527],
       [2.4038138 ]])
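`MinMaxScaler` rescales each feature linearly so that its minimum maps to the lower end of `feature_range` and its maximum to the upper end: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} (hi - lo) + lo$. A short sketch on made-up prices, compared against the manual formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices; the target range (2, 5) mirrors the cell above.
prices = np.array([[100.0], [150.0], [300.0]])
lo, hi = 2, 5

mms = MinMaxScaler(feature_range=(lo, hi))
scaled_prices = mms.fit_transform(prices)

# Manual version of the same rescaling.
manual = (prices - prices.min()) / (prices.max() - prices.min()) * (hi - lo) + lo

print(scaled_prices.ravel())
print(np.allclose(scaled_prices, manual))
```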

In [57]:
from sklearn.preprocessing import MaxAbsScaler

In [58]:
max_abs_scaler = MaxAbsScaler()
max_abs_scaler.fit(-melb_df[["Price"]])
scaled_max_abs = max_abs_scaler.transform(-melb_df[["Price"]])
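`MaxAbsScaler` divides each feature by its largest absolute value, so the result lies in $[-1, 1]$ and the signs are preserved. Negating `Price` above presumably illustrates this: all-negative input should end up in $[-1, 0)$. A sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up negative values, standing in for the negated prices.
values = np.array([[-100.0], [-250.0], [-500.0]])

mas = MaxAbsScaler()
scaled_values = mas.fit_transform(values)

print(mas.max_abs_)           # largest absolute value seen during fit
print(scaled_values.ravel())  # values divided by max_abs_, signs preserved
```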