# Using sklearn datasets

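## 1. Loading datasets
All loader and fetcher functions return a dictionary-like object (a `Bunch`) that contains at least two items:
- an array of shape `(n_samples, n_features)`, stored under the key `data` (except for the 20 newsgroups dataset)
- a NumPy array of length `n_samples` holding the target values, stored under the key `target`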
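
The two items described above can be reached either by key or by attribute. The following is a minimal sketch, assuming scikit-learn is installed; it uses the bundled iris dataset and the `return_X_y` parameter of `load_iris`, which returns only the `(data, target)` pair.

```python
from sklearn.datasets import load_iris

# A Bunch behaves like a dict, and its keys are also exposed as attributes.
iris = load_iris()
same_object = iris["data"] is iris.data
print(same_object)  # True

# return_X_y=True skips the Bunch and returns only (data, target).
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```

Key access is convenient when looping over `iris.keys()`, while attribute access reads better in short scripts.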

### 1.1 load_* loads small standard datasets; no download is required

In [1]:
from sklearn.datasets import load_iris

In [2]:
iris = load_iris()
print('Iris keys: \n', iris.keys())
print('Dataset description: \n', iris['DESCR'])
print('Feature names: \n', iris['feature_names'])
print('Feature data shape: \n', iris.data.shape)
print('Target shape: \n', iris.target.shape)

Iris keys: 
 dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
Dataset description: 
 .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribu

### 1.2 fetch_* loads larger real-world datasets, which must be downloaded

In [3]:
from sklearn.datasets import fetch_olivetti_faces

In [4]:
olivetti_faces = fetch_olivetti_faces(data_home='~/Code/machine_learning/data/')

In [5]:
print(olivetti_faces.keys())
print(olivetti_faces.data.shape)
print(olivetti_faces.images.shape)
print(olivetti_faces.target.shape)
print(olivetti_faces.DESCR)

dict_keys(['data', 'images', 'target', 'DESCR'])
(400, 4096)
(400, 64, 64)
(400,)
.. _olivetti_faces_dataset:

The Olivetti faces dataset
--------------------------

`This dataset contains a set of face images`_ taken between April 1992 and 
April 1994 at AT&T Laboratories Cambridge. The
:func:`sklearn.datasets.fetch_olivetti_faces` function is the data
fetching / caching function that downloads the data
archive from AT&T.

.. _This dataset contains a set of face images: https://cam-orl.co.uk/facedatabase.html

As described on the original website:

    There are ten different images of each of 40 distinct subjects. For some
    subjects, the images were taken at different times, varying the lighting,
    facial expressions (open / closed eyes, smiling / not smiling) and facial
    details (glasses / no glasses). All the images were taken against a dark
    homogeneous background with the subjects in an upright, frontal position 
    (with tolerance for some side movement).

**Data Set
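
In the output above, `data` has shape `(400, 4096)` and `images` has shape `(400, 64, 64)`. Since 4096 = 64 × 64, each row of `data` looks like one image flattened into a vector. The sketch below checks the same relationship offline with `load_digits`, a bundled dataset substituted here purely for illustration (it is not the Olivetti data) because it also exposes `data` and `images` keys and needs no download.

```python
import numpy as np
from sklearn.datasets import load_digits

# Stand-in dataset: 8x8 digit images instead of 64x64 faces.
digits = load_digits()
print(digits.data.shape, digits.images.shape)  # (1797, 64) (1797, 8, 8)

# Flatten each 2-D image into a 1-D row and compare with `data`.
flattened = digits.images.reshape(len(digits.images), -1)
rows_match = np.array_equal(flattened, digits.data)
print(rows_match)  # True
```

If the Olivetti faces follow the same layout, `olivetti_faces.images.reshape(400, -1)` would be expected to match `olivetti_faces.data`, though that is not verified here.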

## 2. Splitting the dataset

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(120, 4) (30, 4) (120,) (30,)
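
The split above is purely random, so the class proportions in the training and test sets may differ slightly from the full dataset. As a hedged sketch of one common variation, the `stratify` parameter of `train_test_split` keeps the class ratios the same in both parts; the variable names below are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the 50/50/50 class balance of iris in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Count how many samples of each class ended up in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)  # [40 40 40] [10 10 10]
```

Stratification tends to matter most for small or imbalanced datasets, where a random split could leave a class under-represented in the test set.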
