# Data split
Data splitting divides a dataset into subsets before a model is trained. In scikit-learn it is handled by the sklearn.model_selection module, which provides several tools:
* model_selection.train_test_split
* model_selection.KFold
* model_selection.RepeatedKFold
* model_selection.StratifiedKFold
* model_selection.RepeatedStratifiedKFold
* model_selection.LeaveOneOut
* model_selection.LeavePOut
* model_selection.ShuffleSplit
* model_selection.StratifiedShuffleSplit
* model_selection.TimeSeriesSplit
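
Apart from train_test_split, which is a plain function, the tools listed above are splitter classes that share one interface: split(X, y) yields pairs of train and test index arrays, and get_n_splits reports how many pairs there will be. A minimal sketch with five of them, assuming a made-up six-sample array and illustrative parameter values:

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    ShuffleSplit,
    StratifiedKFold,
    TimeSeriesSplit,
)

# Illustrative toy data (an assumption, not a real dataset)
X = np.arange(12).reshape((6, 2))   # 6 samples, 2 features
y = np.array([0, 0, 0, 1, 1, 1])    # two balanced classes

splitters = {
    "KFold": KFold(n_splits=3),
    "StratifiedKFold": StratifiedKFold(n_splits=3),
    "ShuffleSplit": ShuffleSplit(n_splits=3, test_size=2, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=3),
}

split_counts = {}
for name, splitter in splitters.items():
    # get_n_splits reports how many train/test pairs split() will yield
    split_counts[name] = splitter.get_n_splits(X, y)
    # split() yields index arrays; take the first pair as a sample
    first_train, first_test = next(splitter.split(X, y))
    print(name, split_counts[name], first_train, first_test)
```

LeaveOneOut takes no n_splits argument because the number of splits always equals the number of samples.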


#### train_test_split
train_test_split splits a dataset into two subsets: a training set and a test set. The training set is used to fit a machine learning model, and the test set is used to evaluate the model on data it has not seen.

Its main parameters are:

* arrays: One or more datasets to split, all with the same number of samples (for example X and y). Each can be a list, a NumPy array, a scipy.sparse matrix, or a pandas DataFrame.

* test_size: The proportion of the dataset to include in the test split. This can be a float between 0 and 1, or an integer representing the absolute number of test samples.

* random_state: A seed for the random number generator that shuffles the data before splitting. Passing an integer makes the split reproducible across runs.

* stratify: An array of class labels (typically y). When it is provided, the data is split so that each class has the same proportion in the training and test sets as in the original dataset. This matters for classification problems, especially with imbalanced classes, and it requires shuffle=True.
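
To illustrate the stratify parameter, here is a minimal sketch on made-up imbalanced labels (the 15/5 class counts and the variable names are assumptions chosen for the example). With 20 samples and test_size=0.2, the 3:1 class ratio can be preserved exactly in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data: 15 samples of class 0, 5 of class 1
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Count how many samples of each class landed in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print('train class counts:', train_counts)  # [12  4]
print('test class counts: ', test_counts)   # [3 1]
```

Without stratify=y, a small test set like this one could easily end up with no class 1 samples at all.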


In [4]:
# Import libraries
from sklearn.model_selection import train_test_split
import numpy as np

# Toy data: 5 samples with 2 features each, and 5 targets
X = np.arange(10).reshape((5, 2))
y = list(range(5))

# Split the data: test_size=0.33 of 5 samples rounds up to 2 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=44, shuffle=True
)

# Shapes of the resulting splits
print('X_train shape is ' , X_train.shape)
print('X_test shape is ' , X_test.shape)


X_train shape is  (3, 2)
X_test shape is  (2, 2)


#### KFold
KFold is a cross-validation method that splits the data into k folds. The model is then trained k times; each time one fold is held out as the test set and the remaining k-1 folds are used for training, so every sample is used for testing exactly once. Averaging the k scores gives a more reliable estimate of how the model generalizes than a single train/test split, which makes overfitting easier to detect.

The KFold class in scikit-learn takes the following parameters:

* n_splits: The number of folds to use. This must be at least 2.

* shuffle: Whether to shuffle the data before splitting it into folds (default False). Shuffling is usually a good idea when the rows are stored in a meaningful order, such as sorted by class, because it makes each fold more representative of the entire dataset.

* random_state: A seed for the random number generator used for shuffling. It only has an effect when shuffle=True, where passing an integer makes the folds reproducible across runs.
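
The effect of shuffle and random_state can be checked directly. The sketch below assumes a made-up six-sample array and a hypothetical helper, fold_test_indices, that collects the test indices of every fold:

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative toy data: 6 samples with 2 features each
X = np.arange(12).reshape((6, 2))

def fold_test_indices(kf):
    """Hypothetical helper: return the test indices of each fold."""
    return [test.tolist() for _, test in kf.split(X)]

# Without shuffling, folds are consecutive blocks of rows
unshuffled = fold_test_indices(KFold(n_splits=3))

# With shuffling, the same seed reproduces the same folds
run_a = fold_test_indices(KFold(n_splits=3, shuffle=True, random_state=44))
run_b = fold_test_indices(KFold(n_splits=3, shuffle=True, random_state=44))

print(unshuffled)      # [[0, 1], [2, 3], [4, 5]]
print(run_a == run_b)  # True
```

In both cases every sample index appears in exactly one test fold; shuffling only changes which samples are grouped together.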

In [6]:
# Import libraries
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 4 samples with 2 features each
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([11, 22, 33, 44])

# 4 folds over 4 samples, so each fold holds out exactly one sample
kf = KFold(n_splits=4, random_state=44, shuffle=True)

# kf.split yields arrays of row indices, not the rows themselves
for train_index, test_index in kf.split(X):
    print('Train Data is : \n', train_index)
    print('Test Data is  : \n', test_index)
    print('-------------------------------')
    # Use the indices to select the rows for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print('X_train Shape is  ' , X_train.shape)
    print('X_test Shape is  ' , X_test.shape)
    print('y_train Shape is  ' , y_train.shape)
    print('y_test Shape is  ' , y_test.shape)
    print('========================================')

Train Data is : 
 [0 1 2]
Test Data is  : 
 [3]
-------------------------------
X_train Shape is   (3, 2)
X_test Shape is   (1, 2)
y_train Shape is   (3,)
y_test Shape is   (1,)
========================================
Train Data is : 
 [0 1 3]
Test Data is  : 
 [2]
-------------------------------
X_train Shape is   (3, 2)
X_test Shape is   (1, 2)
y_train Shape is   (3,)
y_test Shape is   (1,)
========================================
Train Data is : 
 [0 2 3]
Test Data is  : 
 [1]
-------------------------------
X_train Shape is   (3, 2)
X_test Shape is   (1, 2)
y_train Shape is   (3,)
y_test Shape is   (1,)
========================================
Train Data is : 
 [1 2 3]
Test Data is  : 
 [0]
-------------------------------
X_train Shape is   (3, 2)
X_test Shape is   (1, 2)
y_train Shape is   (3,)
y_test Shape is   (1,)
========================================
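
The loop above only slices the data for each fold. To actually train and score a model k times, a KFold object can be passed as the cv argument of cross_val_score. The following is a rough sketch rather than a recommended setup: the iris dataset, the LogisticRegression estimator, and the choice of 5 folds are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed example data and model, chosen only for illustration
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 shuffled folds; the seed keeps the folds reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=44)

# Fits a fresh copy of the model on each training split and
# scores it (accuracy, for a classifier) on the held-out fold
scores = cross_val_score(model, X, y, cv=kf)

print('Score per fold:', scores)
print('Mean score    :', scores.mean())
```

The mean of the per-fold scores is the usual summary, and a large spread between folds suggests the estimate depends heavily on which samples are held out.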
