# Data Preprocessing

- [Dealing with missing data](#Dealing-with-missing-data)
  - [Identifying missing values in tabular data](#Identifying-missing-values-in-tabular-data)
  - [Eliminating training examples or features with missing values](#Eliminating-training-examples-or-features-with-missing-values)
  - [Imputing missing values](#Imputing-missing-values)
  - [Understanding the scikit-learn estimator API](#Understanding-the-scikit-learn-estimator-API)
- [Handling categorical data](#Handling-categorical-data)
  - [Nominal and ordinal features](#Nominal-and-ordinal-features)
  - [Mapping ordinal features](#Mapping-ordinal-features)
  - [Encoding class labels](#Encoding-class-labels)
  - [Performing one-hot encoding on nominal features](#Performing-one-hot-encoding-on-nominal-features)
- [Partitioning a dataset into a separate training and test set](#Partitioning-a-dataset-into-a-separate-training-and-test-set)
- [Bringing features onto the same scale](#Bringing-features-onto-the-same-scale)

<br>
<br>

In [None]:
from IPython.display import Image
%matplotlib inline

Line 1 imports `Image`, which is used to display images in a Jupyter notebook.

Line 2 is a Jupyter magic command. Setting `inline` renders Matplotlib plots directly below the code cell that produces them.

# Dealing with missing data

## Identifying missing values in tabular data

In [None]:
import pandas as pd
from io import StringIO
import sys

csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:

if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df

In [None]:
df.isnull().sum()

In [None]:
# access the underlying NumPy array
# via the `values` attribute
df.values

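`df.values` converts the DataFrame into a NumPy array.

The column names and the index are dropped; only the raw values remain.

This is the input format that most numerical libraries expect.

It is useful when preparing input for machine learning algorithms, for high-performance numerical computation, and for interoperating with the NumPy/SciPy ecosystem.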
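As a hedged aside that is not part of the original notes: the pandas documentation recommends `DataFrame.to_numpy()` over the `values` attribute for this conversion. The sketch below rebuilds the small example frame (under the illustrative name `df_demo`) so that it runs on its own, and shows that only the raw values survive the conversion.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Rebuild the example frame from this section (two missing cells).
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df_demo = pd.read_csv(StringIO(csv_data))

# to_numpy() returns the same data as the `values` attribute.
arr = df_demo.to_numpy()

print(type(arr))                  # a plain NumPy ndarray: no column names, no index
print(arr.shape)                  # (rows, columns)
print(np.isnan(arr).sum(axis=0))  # NaN count per column, like df.isnull().sum()
```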

## Eliminating training examples or features with missing values

`dropna` handles missing values by dropping the rows or columns that contain them.

In [None]:
# remove rows that contain missing values

df.dropna(axis=0)

In [None]:
# remove columns that contain missing values

df.dropna(axis=1)

`axis` selects what gets dropped: `0` drops rows (the default) and `1` drops columns.

In [None]:
# only drop rows where all columns are NaN

df.dropna(how='all')  

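`how` sets the drop condition: `'any'` (the default) drops a row as soon as it contains one NaN, while `'all'` drops it only when every value is NaN.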

In [None]:
# drop rows that have fewer than 4 real values

df.dropna(thresh=4)

`thresh` filters by a threshold: with `thresh=4`, only rows (or columns, with `axis=1`) that have at least 4 non-NaN values are kept.

In [None]:
# only drop rows where NaN appear in specific columns (here: 'C')

df.dropna(subset=['C'])

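The `subset` parameter restricts the NaN check to the listed labels: columns when dropping rows, rows when dropping columns.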
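The `dropna` options above can be compared side by side. This is a minimal sketch that rebuilds the example frame as `df_demo` (an illustrative name) and prints what each call keeps; the dictionary labels are only for display.

```python
from io import StringIO

import pandas as pd

# Same data as above: NaN in row 1 / column C and in row 2 / column D.
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df_demo = pd.read_csv(StringIO(csv_data))

results = {
    "axis=0": df_demo.dropna(axis=0),              # drop rows with any NaN
    "axis=1": df_demo.dropna(axis=1),              # drop columns with any NaN
    "how='all'": df_demo.dropna(how='all'),        # drop rows that are entirely NaN
    "thresh=4": df_demo.dropna(thresh=4),          # keep rows with >= 4 real values
    "subset=['C']": df_demo.dropna(subset=['C']),  # only look for NaN in column C
}

for label, result in results.items():
    print(f"{label:>13}: rows kept={list(result.index)}, "
          f"columns kept={list(result.columns)}")
```

None of these calls modifies `df_demo`; each returns a new DataFrame.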

<br>
<br>

## Imputing missing values

In [None]:
# again: our original array
df.values

In [None]:
# impute missing values via the column mean

from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data


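Explanation:

- `from sklearn.impute import SimpleImputer` imports scikit-learn's missing-value imputation class (for ML preprocessing).
- `numpy` is the numerical computing library (efficient array operations plus NaN handling).
- The `SimpleImputer` class takes the parameters `missing_values`, `strategy`, and `fill_value` (the constant used when `strategy='constant'`); after fitting, it stores the learned fill values in `statistics_`.
- Line 6 of the cell (`imr = SimpleImputer(...)`) creates the `SimpleImputer` instance.
- `np.nan`: "not a number" in NumPy.
- `strategy`: how missing values are filled. `mean` uses the column mean, `median` the column median, `most_frequent` the mode, and `constant` a fixed value.
- `fit()` analyzes the data and computes the fill parameters (`statistics_`); here, the mean of each column.
- `transform()` fills in the missing values.
- `fit_transform()` combines `fit` and `transform`.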
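To make the role of `fit()` visible, the following sketch inspects the fitted `statistics_` attribute and contrasts the `mean` strategy with `constant`. The array `X_demo` and the names `mean_imputer` and `const_imputer` are illustrative; the values mirror the example data above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as the example DataFrame, written as a NumPy array.
X_demo = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [10.0, 11.0, 12.0, np.nan]])

# strategy='mean': fit() learns one mean per column, ignoring NaN.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_mean = mean_imputer.fit_transform(X_demo)
print(mean_imputer.statistics_)  # the per-column fill values learned by fit()
print(X_mean)

# strategy='constant': every NaN is replaced by fill_value.
const_imputer = SimpleImputer(strategy='constant', fill_value=0.0)
X_const = const_imputer.fit_transform(X_demo)
print(X_const)
```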

In [None]:
# show the documentation for SimpleImputer
help(SimpleImputer)

In [None]:
df.fillna(df.mean())

This is the same column-mean imputation, done with `pandas`.

## Understanding the scikit-learn estimator API

In the previous section, we used the SimpleImputer class from scikit-learn to impute missing values in our dataset. The SimpleImputer class belongs to the so-called transformer classes in scikit-learn, which are used for data transformation. The two essential methods of those estimators are fit and transform. The fit method is used to learn the parameters from the training data, and the transform method uses those parameters to transform the data. **Any data array that is to be transformed needs to have the same number of *features* as the data array that was used to fit the model.**

The following figure illustrates how a transformer, fitted on the training data, is used to transform a training dataset as well as a new test dataset:

The classifiers that we used for classification belong to the so-called estimators in scikit-learn, with an API that is conceptually very similar to the transformer class. Estimators have a predict method but can also have a transform method, as you will see later in this chapter. As you may recall, we also used the fit method to learn the parameters of a model when we trained those estimators for classification. However, in supervised learning tasks, we additionally provide the class labels for fitting the model, which can then be used to make predictions about new, unlabeled data examples via the predict method, as illustrated in the following figure:



The data preprocessing steps and the model training can be wrapped behind one unified API, which simplifies later training and deployment.
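The fit/transform/predict pattern described in this section can be sketched as follows. The tiny arrays are made up for illustration, and the use of `LogisticRegression` and `make_pipeline` is an assumption of this sketch rather than something the notes prescribe. The key point is that the imputer learns its column means from the training data only and reuses them on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training and test data with the same number of features (2).
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 8.0], [9.0, np.nan]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[np.nan, 2.5], [8.0, np.nan]])

# Transformer: fit() learns parameters, transform() applies them.
imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                        # column means of the training data
X_test_imputed = imputer.transform(X_test)  # test NaNs get the *training* means
print(imputer.statistics_)
print(X_test_imputed)

# Estimator: fit() also receives the class labels, predict() labels new data.
# A pipeline chains both steps behind a single fit/predict interface.
pipe = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(y_pred)
```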

<br>
<br>

# Handling categorical data

Categorical data: values that denote categories or labels.

So far, we have only been working with numerical values. However, it is not uncommon for real-world datasets to contain one or more categorical feature columns. In this section, we will make use of simple yet effective examples to see how to deal with this type of data in numerical computing libraries.

When we are talking about categorical data, we have to further distinguish between ordinal and nominal features. `Ordinal features` can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order: XL > L > M. In contrast, `nominal features` don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.
 

## Nominal and ordinal features

In [16]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                ['red', 'L', 13.5, 'class1'],
                ['blue', 'XL', 15.3, 'class2']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

|   | color | size | price | classlabel |
|---|-------|------|-------|------------|
| 0 | green | M    | 10.1  | class2     |
| 1 | red   | L    | 13.5  | class1     |
| 2 | blue  | XL   | 15.3  | class2     |


Assigning to the `columns` attribute of a `pandas` DataFrame sets its column names.

## Mapping ordinal features

To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2:

Manually define the mapping for the ordinal feature:

In [17]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

|   | color | size | price | classlabel |
|---|-------|------|-------|------------|
| 0 | green | 1    | 10.1  | class2     |
| 1 | red   | 2    | 13.5  | class1     |
| 2 | blue  | 3    | 15.3  | class2     |


The categorical string values have been converted to integers.

In [18]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object

This reverses the mapping and recovers the original string labels.

<br>
<br>

## Encoding class labels
Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0:

In [None]:
import numpy as np

# create a mapping dict
# to convert class labels from strings to integers
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

In [None]:
# to convert class labels from strings to integers
df['classlabel'] = df['classlabel'].map(class_mapping)
df

In [None]:
# reverse the class label mapping
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

 Alternatively, there is a convenient __LabelEncoder__ class directly implemented in scikit-learn to achieve this:

In [None]:
from sklearn.preprocessing import LabelEncoder

# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

In [None]:
# reverse mapping
class_le.inverse_transform(y)
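To see which integer a fitted `LabelEncoder` assigns to which label, one can inspect its `classes_` attribute: the position of a label in that array is its integer code. A minimal sketch, using the class labels of the example DataFrame under the illustrative names `labels`, `le`, and `y_demo`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class2', 'class1', 'class2'])

le = LabelEncoder()
y_demo = le.fit_transform(labels)

print(le.classes_)                   # sorted unique labels; index = integer code
print(y_demo)                        # the encoded labels
print(le.transform(['class1']))      # encode new data with the fitted mapping
print(le.inverse_transform(y_demo))  # back to the original strings
```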

<br>
<br>

## Performing one-hot encoding on nominal features
In the previous section, Mapping ordinal features, we used a simple dictionary mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows:

In [None]:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X


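After running this code, the first column of `X` holds the integer codes assigned by the LabelEncoder: blue = 0, green = 1, red = 2. If we fed this array to a classifier as is, the learning algorithm would assume that green is larger than blue and that red is larger than green, although colors have no such order.

A common workaround for this problem is to use a technique called __*one-hot encoding*__. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module: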

In [None]:
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

Note that we applied the OneHotEncoder to only a single column, `X[:, 0].reshape(-1, 1)`, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, column(s)) tuples as follows:

In [None]:
from sklearn.compose import ColumnTransformer

X = df[['color', 'size', 'price']].values
c_transf = ColumnTransformer([ ('onehot', OneHotEncoder(), [0]),
                               ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)

In the preceding code example, we specified that we want to modify only the first column and leave the other two columns untouched via the 'passthrough' argument.
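The ColumnTransformer can also select columns by name when it is given a DataFrame instead of a NumPy array. The sketch below is a variation on the cell above, not the notes' own code: `df_demo` and `ct` are illustrative names, and `sparse_threshold=0` is set here only to force a dense result. The fitted encoder's `categories_` attribute shows the order of the generated dummy columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The example data after the size mapping (M=1, L=2, XL=3).
df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [1, 2, 3],
                        'price': [10.1, 13.5, 15.3]})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color']),         # encode by column name
     ('nothing', 'passthrough', ['size', 'price'])],  # keep these as they are
    sparse_threshold=0,                               # always return a dense array
)
X_demo = ct.fit_transform(df_demo)

# Order of the dummy columns produced for 'color'.
print(ct.named_transformers_['onehot'].categories_[0])
print(X_demo)
```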

An even more convenient way to create those dummy features via one-hot encoding is to use the __*get_dummies*__ function implemented in pandas. Applied to a DataFrame, __*get_dummies*__ will only convert __*string columns*__ and leave all other columns unchanged:

In [None]:
# one-hot encoding via pandas

pd.get_dummies(df[['price', 'color', 'size']])

In [None]:
# multicollinearity guard in get_dummies

pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)

In [None]:
# multicollinearity guard for the OneHotEncoder

color_ohe = OneHotEncoder(categories='auto', drop='first')
c_transf = ColumnTransformer([ ('onehot', color_ohe, [0]),
                               ('nothing', 'passthrough', [1, 2])])
c_transf.fit_transform(X).astype(float)
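Why dropping one dummy column loses no information: in a full one-hot encoding, the dummy columns of a feature always sum to 1, so any one of them can be derived from the others. That redundancy is what introduces multicollinearity. A minimal sketch with an illustrative `colors` frame; `dtype=int` is passed only so that the output prints as 0/1:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['green', 'red', 'blue']})

full = pd.get_dummies(colors, dtype=int)
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(full)
print(full.sum(axis=1).tolist())  # every row sums to 1: one column is redundant

# With the first category (blue) dropped, a blue example is the row
# in which all remaining dummy columns are 0.
print(reduced)
```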

<br>
<br>

# Partitioning a dataset into a separate training and test set

In [None]:
#df_wine = pd.read_csv('https://archive.ics.uci.edu/'
#                      'ml/machine-learning-databases/wine/wine.data',
#                      header=None)

# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

df_wine = pd.read_csv('wine.data', header=None)


df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()

In [None]:
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, 
                     test_size=0.3, 
                     random_state=0, 
                     stratify=y)
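The `stratify=y` argument makes the training and test subsets keep the class proportions of the full dataset. The following sketch demonstrates this on made-up data with an 80/20 class imbalance rather than on the Wine data; all names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 examples, 80 of class 0 and 20 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,    # 30% of the examples go to the test set
    random_state=0,   # fixed seed for a reproducible shuffle
    stratify=y_demo)  # preserve the 80/20 class ratio in both subsets

print('train class counts:', np.bincount(y_tr))
print('test class counts: ', np.bincount(y_te))
```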

<br>
<br>

# Bringing features onto the same scale

In [None]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
X_train_norm

In [None]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
X_train_std

A small numerical example:

In [None]:
ex = np.array([0, 1, 2, 3, 4, 5])

print('standardized:', (ex - ex.mean()) / ex.std())

# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and the StandardScaler
# use ddof=0 (population standard deviation)

# normalize
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))
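The manual formulas above can be checked against the scikit-learn scalers. This sketch also shows the `ddof` difference mentioned in the comment: the pandas default gives a slightly different standardized result. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ex = np.array([0, 1, 2, 3, 4, 5], dtype=float)
col = ex.reshape(-1, 1)  # scalers expect a 2D array: (n_samples, n_features)

# Standardization: (x - mean) / std
manual_std = (ex - ex.mean()) / ex.std()             # NumPy std: ddof=0
sklearn_std = StandardScaler().fit_transform(col).ravel()
pandas_std = (ex - ex.mean()) / pd.Series(ex).std()  # pandas std: ddof=1

# Min-max normalization: (x - min) / (max - min)
manual_norm = (ex - ex.min()) / (ex.max() - ex.min())
sklearn_norm = MinMaxScaler().fit_transform(col).ravel()

print('StandardScaler matches NumPy formula:', np.allclose(manual_std, sklearn_std))
print('pandas default std gives the same:   ', np.allclose(manual_std, pandas_std))
print('MinMaxScaler matches manual formula: ', np.allclose(manual_norm, sklearn_norm))
```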