# sklearn-pandas Usage Guide


This module bridges scikit-learn's machine learning methods and pandas-style DataFrames.
Its main feature:
* A way to map DataFrame columns to sklearn transformations, which are later recombined into features.


## Usage


### Imports

Two things can be imported from sklearn-pandas:
* DataFrameMapper, a class for mapping pandas data frame columns to different sklearn transformations
* cross_val_score, which has the same interface as sklearn.cross_validation.cross_val_score but works on DataFrame data







In [1]:
from sklearn_pandas import DataFrameMapper, cross_val_score

sklearn-pandas alone is not enough; we also import pandas, numpy, and sklearn.

In [3]:
import pandas as pd
import numpy as np
import sklearn.preprocessing, sklearn.decomposition
import sklearn.linear_model, sklearn.pipeline, sklearn.metrics
from sklearn.feature_extraction.text import CountVectorizer

### Loading the data

In [5]:
data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

In [19]:
databk = data  # note: a second name for the same DataFrame, not a copy
print(data)

   children   pet  salary
0         4   cat      90
1         6   dog      24
2         3   dog      44
3         3  fish      27
4         2   cat      32
5         3   dog      59
6         5   cat      36
7         4  fish      27


## Transformation Mapping (core operations)

### Map the columns to transformations

The mapper takes a list of pairs. The first element of each pair is a DataFrame column name or a list of column names; the second element is the transformation to apply to those columns, that is, an object which will perform the transformation on them.

DataFrameMapper([(), ()]) takes a single list containing one or more pairs; each pair states which transformation to apply to one or more columns.

In [9]:
mapper = DataFrameMapper([
    ('pet', sklearn.preprocessing.LabelBinarizer()),
    (['children'], sklearn.preprocessing.StandardScaler())
])


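Note that 'children' above is passed as a list. The only difference between ['columnname'] and 'columnname' is the shape of what gets selected: 'columnname' selects a one-dimensional array, while ['columnname'] selects a two-dimensional array. This relies on how pandas DataFrame indexing determines the shape of its result.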


In [12]:
print(data['children'].shape)

print(data[['children']].shape)

(8,)
(8, 1)


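Be careful, though: some sklearn transformers only work on one-dimensional input, while others such as OneHotEncoder or Imputer expect two-dimensional input with the shape [n_samples, n_features].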
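
The shape difference can be checked directly against sklearn, without the mapper. This is a minimal sketch on a shortened copy of the example data; the names toy, pet_1d and children_2d are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# Shortened, illustrative copy of the example data (not the full table).
toy = pd.DataFrame({'pet': ['cat', 'dog', 'fish', 'cat'],
                    'children': [4., 6, 3, 2]})

pet_1d = toy['pet']              # a Series: shape (4,)
children_2d = toy[['children']]  # a one-column DataFrame: shape (4, 1)

# LabelBinarizer accepts the 1-D selection and yields one column per class.
pet_encoded = LabelBinarizer().fit_transform(pet_1d)

# StandardScaler is given the 2-D selection, shape [n_samples, n_features].
children_scaled = StandardScaler().fit_transform(children_2d)

print(pet_1d.shape, children_2d.shape)
print(pet_encoded.shape, children_scaled.shape)
```

This is why the mapper definition writes 'pet' as a plain string and 'children' as a one-element list.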

### Testing the transformations

In [13]:
np.round(mapper.fit_transform(data.copy()), 2)


array([[ 1.  ,  0.  ,  0.  ,  0.21],
       [ 0.  ,  1.  ,  0.  ,  1.88],
       [ 0.  ,  1.  ,  0.  , -0.63],
       [ 0.  ,  0.  ,  1.  , -0.63],
       [ 1.  ,  0.  ,  0.  , -1.46],
       [ 0.  ,  1.  ,  0.  , -0.63],
       [ 1.  ,  0.  ,  0.  ,  1.04],
       [ 0.  ,  0.  ,  1.  ,  0.21]])

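The first three columns of the output come from LabelBinarizer (corresponding to cat, dog and fish respectively); the fourth column comes from StandardScaler.

The output columns normally follow the order in which the columns were listed when the DataFrameMapper was constructed.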
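
To make the column order concrete, the same array can be assembled by hand with plain sklearn and numpy. This is only a sketch of the idea (transform each selection, then stack the results left to right), not sklearn-pandas's actual implementation; pet_part, children_part and manual are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

# First pair of the mapper: 'pet' (1-D) through LabelBinarizer -> 3 columns.
pet_part = LabelBinarizer().fit_transform(data['pet'])

# Second pair: ['children'] (2-D) through StandardScaler -> 1 column.
children_part = StandardScaler().fit_transform(data[['children']])

# Stack the pieces in the order the pairs were listed.
manual = np.hstack([pet_part, children_part])
print(np.round(manual, 2))
```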


The mapper is now fitted and can be used to transform new data.

In [15]:
sample = pd.DataFrame({'pet': ['cat'], 'children': [5.]})
np.round(mapper.transform(sample), 2)


array([[ 1.  ,  0.  ,  0.  ,  1.04]])
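
The value 1.04 comes from the statistics learned during fitting, not from the new sample itself. A minimal sketch with a bare StandardScaler shows the same fit-then-reuse pattern; train, new and scaler are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'children': [4., 6, 3, 3, 2, 3, 5, 4]})
new = pd.DataFrame({'children': [5.]})

# fit() learns the mean and standard deviation from the training column only.
scaler = StandardScaler().fit(train)

# transform() reuses those learned statistics on unseen data.
scaled_new = scaler.transform(new)

print(scaler.mean_, np.round(scaled_new, 2))
```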

### Transforming multiple columns

In [16]:
mapper2 = DataFrameMapper([(['children', 'salary'], sklearn.decomposition.PCA(1))])  # several column names go in one list

In [17]:
np.round(mapper2.fit_transform(data.copy()), 1)


array([[ 47.6],
       [-18.4],
       [  1.6],
       [-15.4],
       [-10.4],
       [ 16.6],
       [ -6.4],
       [-15.4]])

The code above runs PCA on the children and salary columns together and returns the first principal component.
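
For comparison, here is a sketch that applies PCA to the two-column selection directly, assuming the mapper simply hands that selection to the transformer. The sign of a principal component is arbitrary and may differ between scikit-learn versions, so only magnitudes should be compared with the output above.

```python
import pandas as pd
from sklearn.decomposition import PCA

data = pd.DataFrame({'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

pca = PCA(n_components=1)
# The 2-D selection has shape (8, 2); PCA reduces it to shape (8, 1).
first_pc = pca.fit_transform(data[['children', 'salary']])

print(first_pc.shape)
# Share of the total variance captured by the single component kept.
print(pca.explained_variance_ratio_)
```

Because salary varies far more than children and the columns are not scaled first, the component is dominated by salary: each value is close to the row's salary minus the mean salary.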

In [18]:
print(data)

   children   pet  salary
0         4   cat      90
1         6   dog      24
2         3   dog      44
3         3  fish      27
4         2   cat      32
5         3   dog      59
6         5   cat      36
7         4  fish      27


In [20]:
print(databk)


   children   pet  salary
0         4   cat      90
1         6   dog      24
2         3   dog      44
3         3  fish      27
4         2   cat      32
5         3   dog      59
6         5   cat      36
7         4  fish      27
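
The two printouts are identical, which is expected because every fit_transform call above received data.copy(). Note, though, that databk = data only creates a second name for the same object, so databk could not reveal a change to data anyway. A small sketch with hypothetical names (frame, alias, snapshot) shows the difference:

```python
import pandas as pd

frame = pd.DataFrame({'children': [4., 6, 3]})

alias = frame            # same object under a second name
snapshot = frame.copy()  # independent copy of the data

# Modify the original in place.
frame.loc[0, 'children'] = 99.

print(alias is frame)                # the alias is the very same object
print(alias.loc[0, 'children'])      # the alias sees the modification
print(snapshot.loc[0, 'children'])   # the copy keeps the old value
```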


### Multiple transformations on the same column: just pass them as a list

In [23]:
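# Imputer fills the NaN with the column mean, then StandardScaler standardizes the result.
# (sklearn.preprocessing.Imputer exists only in older scikit-learn; later releases provide sklearn.impute.SimpleImputer instead.)
mapper3 = DataFrameMapper([(['age'], [sklearn.preprocessing.Imputer(), sklearn.preprocessing.StandardScaler()])])
data_3 = pd.DataFrame({'age': [1, np.nan, 3]})
mapper3.fit_transform(data_3)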


array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487]])
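
The chained result can be reproduced with a plain sklearn pipeline. This sketch swaps in SimpleImputer from sklearn.impute, on the assumption of a newer scikit-learn where sklearn.preprocessing.Imputer is no longer available; ages and chain are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ages = pd.DataFrame({'age': [1, np.nan, 3]})

# Step 1 replaces NaN with the column mean (2.0); step 2 standardizes.
chain = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
result = chain.fit_transform(ages)

print(result)
```

The imputed value equals the column mean, so it maps to exactly 0 after standardization, matching the middle row of the output above.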

### Leaving a column untransformed


Use None as the transformer for any column that should pass through unchanged.


In [24]:
mapper3 = DataFrameMapper([
    ('pet', sklearn.preprocessing.LabelBinarizer()),
    ('children', None)
])

np.round(mapper3.fit_transform(data.copy()))


array([[ 1.,  0.,  0.,  4.],
       [ 0.,  1.,  0.,  6.],
       [ 0.,  1.,  0.,  3.],
       [ 0.,  0.,  1.,  3.],
       [ 1.,  0.,  0.,  2.],
       [ 0.,  1.,  0.,  3.],
       [ 1.,  0.,  0.,  5.],
       [ 0.,  0.,  1.,  4.]])

### Working with sparse features


DataFrameMapper returns a dense feature array by default. Setting sparse=True in the mapper returns a sparse array whenever any of the extracted features is sparse.

In [26]:
mapper4 = DataFrameMapper([
    ('pet', CountVectorizer())
], sparse=True)

print(type(mapper4.fit_transform(data)))


<class 'scipy.sparse.csr.csr_matrix'>
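
The sparse output originates in the transformer: CountVectorizer itself returns a SciPy sparse matrix. A short sketch using the vectorizer alone (pets, vectorizer, counts and dense are illustrative names):

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

pets = ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(pets)  # one row per string, one column per token

print(sparse.issparse(counts), counts.shape)
print(sorted(vectorizer.vocabulary_))    # the learned tokens

# Convert to a dense numpy array when a downstream step needs one.
dense = counts.toarray()
print(dense[0])
```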


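## Cross-validation

Cross-validation in scikit-learn versions below 0.16.0 did not support DataFrames, but 0.17.0 appears to support them... so this wrapper is of little use now.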
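
As a rough check of that claim, the sketch below passes DataFrame selections straight to scikit-learn's own cross_val_score. It assumes a release that provides sklearn.model_selection (the successor of the sklearn.cross_validation module mentioned earlier); the choice of LinearRegression and cv=2 is arbitrary and only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

data = pd.DataFrame({'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

# Features as a 2-D DataFrame, target as a 1-D Series; no conversion to numpy needed.
scores = cross_val_score(LinearRegression(), data[['children']], data['salary'], cv=2)

print(scores)  # one score per fold
```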


