&emsp;&emsp;"留出法"(hold-out)直接将数据集$ D$划分为两个互斥的集合,其中一个集合作为训练集$S $,另一个作为测试集$ T$,即
$ D=S \bigcup T, S \bigcap T = \varnothing $.在$ S$上训练出模型后,用$ T $来评估其测试误差,作为对泛化误差的估计.     
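
As a minimal sketch of the definition above, a random hold-out split could be written directly with NumPy. The helper `holdout_split` and its parameters `test_fraction` and `seed` are hypothetical names introduced only for illustration, and samples are assumed to be addressed by integer index.

```python
import numpy as np


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Return disjoint (train_idx, test_idx) arrays that together cover range(n_samples).

    Hypothetical helper for illustration; not part of any library.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)       # random order of all sample indices
    n_test = int(round(test_fraction * n_samples))
    test_idx = shuffled[:n_test]                # T: the first n_test shuffled indices
    train_idx = shuffled[n_test:]               # S: everything else
    return train_idx, test_idx


train_idx, test_idx = holdout_split(10, test_fraction=0.3)

# S and T are mutually exclusive, and their union is the whole of D
print(set(train_idx.tolist()).isdisjoint(test_idx.tolist()))
print(sorted(train_idx.tolist() + test_idx.tolist()) == list(range(10)))
```

Both checks should print `True`, mirroring $S \cap T = \varnothing$ and $D = S \cup T$.
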
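&emsp;&emsp;Note that the training/test split should preserve the data distribution as far as possible, so that the splitting process itself does not introduce extra bias into the final result. In a
classification task, for example, the class proportions should stay similar in both sets. Viewed from the perspective of sampling, a splitting procedure that preserves class proportions is usually called "stratified sampling".
For example, suppose stratified sampling on $D$ yields a training set $S$ containing 70% of the samples and a test set $T$ containing 30%. If $D$ contains 500 positive and 500 negative examples, then $S$ should
contain 350 positive and 350 negative examples, while $T$ contains 150 positive and 150 negative examples. If the class proportions in $S$ and $T$ differ greatly, the error estimate will be biased by the difference between the training and test data distributions.  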
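
The 70%/30% example above can be reproduced with `train_test_split` and its `stratify` parameter. This is a small sketch: the feature matrix is a placeholder made up for illustration, and only the label counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# D: 500 positive (1) and 500 negative (0) examples
y = np.array([1] * 500 + [0] * 500)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features (assumption, for illustration only)

# Stratified 70/30 split: class proportions in y are preserved in S and T
X_S, X_T, y_S, y_T = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(y_S))  # counts of [negative, positive] in S
print(np.bincount(y_T))  # counts of [negative, positive] in T
```

With stratification, $S$ receives 350 examples of each class and $T$ receives 150 of each, matching the numbers in the text.
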
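&emsp;&emsp;A single hold-out split often gives an estimate that is not sufficiently stable or reliable. In practice the hold-out method is therefore usually applied with several random splits: the experiment is repeated for each split, and the average of the results is reported as the hold-out evaluation.  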
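
One way to sketch this repeated hold-out procedure is with scikit-learn's `StratifiedShuffleSplit`, which generates several random stratified splits. The synthetic dataset, the choice of `LogisticRegression`, and the number of repetitions (10) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 10 independent random 70/30 splits, each preserving the class proportions
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

errors = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # test error = 1 - accuracy on the held-out set
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

mean_error = float(np.mean(errors))
print("per-split test errors:", np.round(errors, 3))
print("hold-out estimate (mean test error):", round(mean_error, 3))
```

The spread of the per-split errors gives a rough idea of how unstable a single split would have been.
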
&emsp;&emsp;Typically, about $2/3 \sim 4/5$ of the samples are used for training and the remaining samples for testing.  
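
To make these ratios concrete, the following sketch splits a toy dataset of 1000 samples using training fractions of 2/3, 3/4 and 4/5 and records the resulting set sizes. The dataset size and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1000                       # toy dataset size (assumption)
data = np.arange(n_samples).reshape(-1, 1)

sizes = {}
for train_fraction in (2 / 3, 3 / 4, 4 / 5):
    # train_size is given as a float, so the test set receives the remaining samples
    train, test = train_test_split(
        data, train_size=train_fraction, random_state=0)
    sizes[train_fraction] = (len(train), len(test))
    print(f"train fraction {train_fraction:.3f}: "
          f"{len(train)} training samples, {len(test)} test samples")
```
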

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [9]:
# test_size defaults to 0.25 when neither test_size nor train_size is given
X_train0, X_test0 = train_test_split(
    X, test_size=0.33, random_state=1)  # split X only
'''
*arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples.
        
random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
'''
print(X_train0)
print(X_test0)

[[8 9]
 [0 1]
 [6 7]]
[[4 5]
 [2 3]]


In [10]:
y_train0, y_test0 = train_test_split(
    y, test_size=0.33, random_state=1)  # split y only
print(y_train0)
print(y_test0)

[4, 0, 3]
[2, 1]


In [11]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X, y, test_size=0.33, random_state=1)  # split X and y together

print(X_train1)
print(X_test1)
print(y_train1)
print(y_test1)

[[8 9]
 [0 1]
 [6 7]]
[[4 5]
 [2 3]]
[4, 0, 3]
[2, 1]


In [15]:
X, y = np.arange(50).reshape((10, 5)), [1, 2, 1, 2, 3, 3, 2, 1, 2, 1]
print(pd.Series(y).value_counts())

'''
stratify : array-like, default=None
    If not None, data is split in a stratified fashion, using this as
    the class labels.
'''
X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X, y, test_size=0.33, random_state=1,
    stratify=y)  # every class in y must have at least 2 samples

print(X_train1)
print(X_test1)
print(y_train1)
print(y_test1)

1    4
2    4
3    2
dtype: int64
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [30 31 32 33 34]
 [35 36 37 38 39]
 [20 21 22 23 24]
 [10 11 12 13 14]]
[[25 26 27 28 29]
 [40 41 42 43 44]
 [45 46 47 48 49]
 [15 16 17 18 19]]
[1, 2, 2, 1, 3, 1]
[3, 2, 1, 2]


In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame.index = ['one', 'two', 'three', 'four', 'five', 'six']
frame

In [None]:
frame1, frame2 = train_test_split(frame)  # split the DataFrame directly, row by row
frame1

In [None]:
frame2
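
A DataFrame can also be split in a stratified fashion by passing one of its columns to `stratify`. The sketch below rebuilds the same `frame` so that it runs on its own and stratifies on the `state` column. The choice of `test_size=2` and the names `frame_train` and `frame_test` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six'])

# Hold out 2 rows for testing, keeping the Ohio/Nevada ratio equal in both parts
frame_train, frame_test = train_test_split(
    frame, test_size=2, stratify=frame['state'], random_state=0)

print(frame_train['state'].value_counts())
print(frame_test['state'].value_counts())
```

Because each state has three rows, the training part keeps two rows per state and the test part one row per state. The row index labels are carried over unchanged.
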