In [1]:
import numpy as np
import pandas as pd

### Missing Values

Handling missing values is an essential preprocessing step. Done carelessly, it can drastically degrade your model.

In [5]:
# creating a dataset

X = pd.DataFrame(
    np.array([5, 7, 8, np.nan, np.nan, np.nan, -5, 0, 25, 999, 1, -1, np.nan, 0, np.nan]).reshape((5, 3))
)
X.columns = ['f1', 'f2', 'f3']

In [6]:
X

|   | f1 | f2 | f3 |
|---|---|---|---|
| 0 | 5.0 | 7.0 | 8.0 |
| 1 | NaN | NaN | NaN |
| 2 | -5.0 | 0.0 | 25.0 |
| 3 | 999.0 | 1.0 | -1.0 |
| 4 | NaN | 0.0 | NaN |


Rows and columns with too many non-meaningful missing values can be deleted from your data with pandas's *dropna* function.

Function parameters:

- *axis*: 0 for rows, 1 for columns
- *thresh*: the minimum number of non-NaN values a row or column needs in order to be kept
- *inplace*: if True, modify the frame in place instead of returning a new one
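
To make the meaning of *thresh* concrete, here is a minimal sketch on a small, made-up frame (the name `demo` and its values are illustrative assumptions, not part of the dataset used in this notebook). It applies the same threshold once along rows and once along columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate `thresh`
demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [np.nan, np.nan, 9.0],
})

# axis=0: keep only rows that have at least 2 non-NaN values
rows_kept = demo.dropna(axis=0, thresh=2)

# axis=1: keep only columns that have at least 2 non-NaN values
cols_kept = demo.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```

Only the last row has two or more observed values, and only column `a` does, so those are the parts that survive.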

We update the dataset by deleting all rows (*axis*=0) that contain *only* missing values. With *thresh*=1, a row is kept as soon as it has at least one non-NaN value.

In [9]:
# thresh=1: keep every row that has at least one non-NaN value
X.dropna(axis=0, thresh=1, inplace=True)

In [10]:
X

|   | f1 | f2 | f3 |
|---|---|---|---|
| 0 | 5.0 | 7.0 | 8.0 |
| 2 | -5.0 | 0.0 | 25.0 |
| 3 | 999.0 | 1.0 | -1.0 |
| 4 | NaN | 0.0 | NaN |


In [11]:
# renumber the rows; the old index is kept as a new 'index' column
X.reset_index(inplace=True)
X

|   | index | f1 | f2 | f3 |
|---|---|---|---|---|
| 0 | 0 | 5.0 | 7.0 | 8.0 |
| 1 | 2 | -5.0 | 0.0 | 25.0 |
| 2 | 3 | 999.0 | 1.0 | -1.0 |
| 3 | 4 | NaN | 0.0 | NaN |


In [12]:
# remove the leftover 'index' column
X.drop(['index'], axis=1, inplace=True)
X

|   | f1 | f2 | f3 |
|---|---|---|---|
| 0 | 5.0 | 7.0 | 8.0 |
| 1 | -5.0 | 0.0 | 25.0 |
| 2 | 999.0 | 1.0 | -1.0 |
| 3 | NaN | 0.0 | NaN |
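
The reset-then-drop sequence can also be written in one step. The sketch below uses a small stand-in frame (`demo` is an illustrative assumption) and passes `drop=True` to `reset_index`, which discards the old index instead of storing it as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with one all-NaN row in the middle
demo = pd.DataFrame({
    'f1': [5.0, np.nan, -5.0],
    'f2': [7.0, np.nan, 0.0],
})

# Drop rows without any observed value, then renumber the rows;
# drop=True avoids creating an extra 'index' column
demo = demo.dropna(axis=0, thresh=1).reset_index(drop=True)

print(demo)
```

This variant returns a new frame rather than modifying the original in place, which can be easier to follow when several cleaning steps are chained.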


Let's also create some extra boolean features that tell us whether a sample has a missing value for a certain feature.

In [14]:
from sklearn.impute import MissingIndicator

In [15]:
# 999 is a placeholder for "missing" here, so turn it into a real NaN
X.replace({999.0: np.nan}, inplace=True)
X

|   | f1 | f2 | f3 |
|---|---|---|---|
| 0 | 5.0 | 7.0 | 8.0 |
| 1 | -5.0 | 0.0 | 25.0 |
| 2 | NaN | 1.0 | -1.0 |
| 3 | NaN | 0.0 | NaN |


In [17]:
# By default, only features that contain missing values get an indicator column (here f1 and f3)
indicator = MissingIndicator(missing_values=np.nan)
indicator = indicator.fit_transform(X)
indicator = pd.DataFrame(indicator, columns=['m1', 'm3'])
indicator

|   | m1 | m3 |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | True | False |
| 3 | True | True |
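
Hard-coding the indicator column names only works while we know which features contain missing values. As a sketch, the fitted `MissingIndicator` exposes the positions of those features in its `features_` attribute, so the names can be derived instead. The frame `demo` mirrors the values of X above, and the `_missing` suffix is an illustrative naming choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Same values as X at this point in the notebook
demo = pd.DataFrame({
    'f1': [5.0, -5.0, np.nan, np.nan],
    'f2': [7.0, 0.0, 1.0, 0.0],
    'f3': [8.0, 25.0, -1.0, np.nan],
})

ind = MissingIndicator(missing_values=np.nan)
mask = ind.fit_transform(demo)

# features_ holds the column positions that received an indicator
names = [f"{demo.columns[i]}_missing" for i in ind.features_]
flags = pd.DataFrame(mask, columns=names, index=demo.index)

# Append the boolean flags to the original features
augmented = pd.concat([demo, flags], axis=1)
print(augmented)
```

Adding the flags before imputing preserves the information that a value was originally missing, which the imputed frame alone no longer shows.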


### Imputing values

For filling in missing values with common strategies, sklearn provides a *SimpleImputer*. The four main strategies are *mean*, *most_frequent*, *median* and *constant*.

We impute missing values for our dataframe X with the feature's mean.

In [18]:
from sklearn.impute import SimpleImputer

In [19]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(X)

array([[ 5.        ,  7.        ,  8.        ],
       [-5.        ,  0.        , 25.        ],
       [ 0.        ,  1.        , -1.        ],
       [ 0.        ,  0.        , 10.66666667]])
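
The other strategies are used the same way. The following sketch applies *median* and *constant* to a frame with the same values as X (`demo` and the fill value of -1 are illustrative assumptions).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as X before imputation
demo = pd.DataFrame({
    'f1': [5.0, -5.0, np.nan, np.nan],
    'f2': [7.0, 0.0, 1.0, 0.0],
    'f3': [8.0, 25.0, -1.0, np.nan],
})

# median: each NaN is replaced by the median of its column
median_imp = SimpleImputer(missing_values=np.nan, strategy='median')
median_filled = median_imp.fit_transform(demo)

# constant: each NaN is replaced by the value given in fill_value
const_imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
const_filled = const_imp.fit_transform(demo)

print(median_filled)
print(const_filled)
```

The median is less sensitive to extreme values than the mean, while a constant outside the feature's normal range keeps imputed entries recognizable.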

In [20]:
# the same mean imputation done directly in pandas, this time updating X
X.fillna(X.mean(), inplace=True)

In [21]:
X

|   | f1 | f2 | f3 |
|---|---|---|---|
| 0 | 5.0 | 7.0 | 8.0 |
| 1 | -5.0 | 0.0 | 25.0 |
| 2 | 0.0 | 1.0 | -1.0 |
| 3 | 0.0 | 0.0 | 10.666667 |
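
As a quick sanity check, the sketch below compares mean imputation with *SimpleImputer* against `fillna` with the column means on a frame holding the same values as X before imputation (`demo` is an illustrative copy). It also reads the learned column means from the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as X before imputation
demo = pd.DataFrame({
    'f1': [5.0, -5.0, np.nan, np.nan],
    'f2': [7.0, 0.0, 1.0, 0.0],
    'f3': [8.0, 25.0, -1.0, np.nan],
})

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
sk_filled = imp.fit_transform(demo)       # NumPy array
pd_filled = demo.fillna(demo.mean())      # DataFrame

# Both routes should produce the same numbers
same = np.allclose(sk_filled, pd_filled.to_numpy())

# Column means learned during fit
learned_means = imp.statistics_
print(same, learned_means)
```

One practical difference is that the fitted imputer stores the means, so the same values can be reused on new data with `imp.transform`.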


### Polynomial features

Polynomial features are often created when we want to capture a nonlinear relationship between the features and the target. They are mostly used to add complexity to linear models with few features, or when we suspect that the effect of one feature depends on another feature.

In [23]:
from sklearn.preprocessing import PolynomialFeatures

In [24]:
poly = PolynomialFeatures(degree=3, interaction_only=True)
poly

In [28]:
# Columns '0'-'3' are the bias term and the original features;
# keep only the interaction terms
polynomials = pd.DataFrame(
    poly.fit_transform(X),
    columns=['0', '1', '2', '3', 'p1', 'p2', 'p3', 'p4'],
)[['p1', 'p2', 'p3', 'p4']]
polynomials

|   | p1 | p2 | p3 | p4 |
|---|---|---|---|---|
| 0 | 35.0 | 40.0 | 56.0 | 280.0 |
| 1 | -0.0 | -125.0 | 0.0 | -0.0 |
| 2 | 0.0 | -0.0 | -1.0 | -0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 |
