# LogTransformer
This notebook demonstrates the LogTransformer class, which applies a log transform to numeric columns.

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.numeric import LogTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note: the `prepare_boston_df` helper used below modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()

In [5]:
boston_df.shape

(506, 17)

In [6]:
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,ZN_cat,CHAS_cat,RAD_cat
0,0.00632,18.0,2.31,0.0,0.538,6.575,,4.09,,296.0,15.3,396.9,4.98,24.0,18.0,0.0,
1,0.02731,,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.0,2.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,,17.8,392.83,4.03,34.7,0.0,0.0,2.0
3,,,2.18,0.0,0.458,,45.8,6.0622,3.0,222.0,18.7,,,33.4,,0.0,3.0
4,0.06905,0.0,2.18,0.0,0.458,,,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,0.0,0.0,3.0


In [7]:
boston_df.isnull().sum()

CRIM        55
ZN          62
INDUS        0
CHAS         0
NOX         44
RM          56
AGE         42
DIS         51
RAD         62
TAX         52
PTRATIO     56
B           50
LSTAT       49
target       0
ZN_cat      62
CHAS_cat     0
RAD_cat     62
dtype: int64

## Simple usage

### Initialising LogTransformer

All arguments to this transformer are optional. The user can specify:
- `columns` the columns to apply the log transform to
- `add_1` whether to add 1 to the column before applying the log transform, useful if you have 0s in the column
- `drop` to drop the original columns
- `suffix` to specify the suffix to add onto the original column names for the logged versions of these columns

In [8]:
log_1 = LogTransformer(
    columns = ['CRIM', 'NOX', 'RM'], 
    add_1 = False,
    drop = True,
    suffix = 'log'
)

### LogTransformer fit
There is no fit method for the LogTransformer, as the transformer only applies the log function.

### LogTransformer transform
Several columns were specified when creating `log_1`, so each of them will be logged and the original columns then dropped (because `drop = True`). <br>
Notice that nulls are preserved when logging. The transformer uses `np.log`.

In [9]:
boston_df_2 = log_1.transform(boston_df)

In [10]:
boston_df_2[['CRIM_log', 'NOX_log', 'RM_log']].head()

Unnamed: 0,CRIM_log,NOX_log,RM_log
0,-5.064036,-0.619897,1.883275
1,-3.600502,-0.757153,1.859574
2,-3.601235,-0.757153,1.971996
3,,-0.780886,
4,-2.672924,-0.780886,


In [11]:
[x in boston_df_2.columns for x in ['CRIM', 'NOX', 'RM']]

[False, False, False]
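
For readers who want to see the mechanics without tubular, the behaviour shown in this section can be approximated in plain pandas. This is only a rough sketch under assumptions: the toy dataframe and the variable names are hypothetical, and it mirrors what the outputs above show (suffixed log columns, originals dropped, nulls preserved) rather than the library's actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Boston dataset), with one null in column "a"
df = pd.DataFrame({"a": [1.0, np.nan, np.e], "b": [10.0, 100.0, 1000.0]})

columns = ["a"]   # columns to log
suffix = "log"    # appended to the original name as "<name>_<suffix>"

out = df.copy()
for col in columns:
    # np.log propagates NaN, so nulls stay null in the logged column
    out[f"{col}_{suffix}"] = np.log(out[col])

# Mimic drop = True by removing the original columns
out = out.drop(columns=columns)
```

Here `out` ends up with the columns `b` and `a_log`, and the null in `a` is carried through to `a_log`.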

## Adding 1 to columns before transform
To demonstrate this feature we impute nulls with 0 in the `AGE` column. <br>
Setting the `add_1` argument to `True` adds a constant value of 1 to the column before the log transform is applied. <br>
This is useful if the column contains 0 values: the log of 0 is undefined, so logging 0s raises a `RuntimeWarning` and produces `-inf` values in the output.
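
The problem with zeros can be reproduced directly with `np.log`, independently of tubular. The snippet below is a minimal sketch on a hypothetical array: it records the warning raised when logging a 0 and shows that adding 1 first gives a finite result.

```python
import warnings

import numpy as np

# Hypothetical values containing a zero
values = np.array([0.0, 1.0, 78.9])

# Record warnings instead of printing them, so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    plain = np.log(values)      # first element is -inf, with a RuntimeWarning

# Adding 1 before logging maps 0 to log(1) = 0 and keeps everything finite
shifted = np.log(values + 1)
```

Inspecting `caught` shows the `RuntimeWarning`, `plain[0]` is `-inf`, and `shifted[0]` is `0.0`.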

In [12]:
# Assign the result back rather than using inplace on a column selection
boston_df['AGE'] = boston_df['AGE'].fillna(0)

In [13]:
log_2 = LogTransformer(
    columns = ['AGE'], 
    add_1 = True,
    drop = False,
    suffix = 'log_plus_1'
)

In [14]:
boston_df_3 = log_2.transform(boston_df)

In [15]:
boston_df_3[['AGE', 'AGE_log_plus_1']].head()

Unnamed: 0,AGE,AGE_log_plus_1
0,0.0,0.0
1,78.9,4.380776
2,61.1,4.128746
3,45.8,3.845883
4,0.0,0.0
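
As a quick sanity check, the values in the output above can be recomputed by hand as `log(AGE + 1)`. The sketch below uses only the standard library; the two lists are copied from the displayed rows, and the tolerance allows for the six-decimal rounding of the display.

```python
import math

# Values copied from the displayed output above
age = [0.0, 78.9, 61.1, 45.8, 0.0]
shown = [0.0, 4.380776, 4.128746, 3.845883, 0.0]

# Recompute log(AGE + 1) with the natural logarithm
recomputed = [math.log(a + 1) for a in age]

# Compare against the displayed values, allowing for rounding to 6 decimals
matches = [math.isclose(r, s, abs_tol=1e-6) for r, s in zip(recomputed, shown)]
```

Every entry of `matches` should be `True` if the displayed column is the natural log of `AGE + 1`.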
