# Lesson 7 - Session 1 - example 1

**Split data**
1. Train set, test set
2. Train set, test set, validation set
3. Cross validation

https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset

In [2]:
from sklearn.datasets import load_diabetes

print(load_diabetes().DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

## Split data into train set / test set

In [3]:
# Split data into train set and test set
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load scikit-learn's diabetes toy dataset (10 features, target is the disease progression)
diabetes = load_diabetes()
print('Features', diabetes.feature_names)
X, y = diabetes.data, diabetes.target

# split into train set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f'Train samples = {len(X_train)}')
print(f'Test samples = {len(X_test)}')

Features ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Train samples = 309
Test samples = 133
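
The split above only counts samples. As a minimal sketch of how the two sets are typically used, the cell below fits a model on the train set and scores it on the held-out test set. The choice of `LinearRegression` and the names `train_r2` / `test_r2` are assumptions for illustration, not part of the original example.

```python
# Sketch: fit on the train set, evaluate on the untouched test set
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# the model only ever sees the training samples during fitting
model = LinearRegression().fit(X_train, y_train)

# score() returns R^2 for regressors; the test score estimates performance on unseen data
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f'Train R^2 = {train_r2:.4f}')
print(f'Test R^2 = {test_r2:.4f}')
```

A test score noticeably lower than the train score is the usual sign of overfitting, which is why the test set is kept out of fitting entirely.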


## Split data into train set / validation set / test set

In [4]:
# split data into train set, validation set, test set
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load scikit-learn's diabetes toy dataset (10 features, target is the disease progression)
X, y = load_diabetes(return_X_y=True)

# split into train set and test set (test 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# split train set into train set and validation set (train 80% * 75% = 60%, validation 80% * 25% = 20%)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(f'Train samples = {len(X_train)}')
print(f'Validation samples = {len(X_val)}')
print(f'Test samples = {len(X_test)}')

Train samples = 264
Validation samples = 89
Test samples = 89
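
One common use of the validation set is choosing a hyperparameter without touching the test set. The sketch below assumes a `Ridge` model and a small hand-picked list of `alpha` values; both are hypothetical choices for illustration and do not come from the original lesson.

```python
# Sketch: pick a hyperparameter on the validation set, then report once on the test set
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# candidate regularization strengths (assumed values, for illustration only)
alphas = [0.01, 0.1, 1.0, 10.0]

best_alpha, best_val_r2, best_model = None, float('-inf'), None
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_r2 = model.score(X_val, y_val)  # R^2 on the validation set
    if val_r2 > best_val_r2:
        best_alpha, best_val_r2, best_model = alpha, val_r2, model

# the test set is used only once, after the choice has been made
test_r2 = best_model.score(X_test, y_test)
print(f'Best alpha = {best_alpha}')
print(f'Validation R^2 = {best_val_r2:.4f}')
print(f'Test R^2 = {test_r2:.4f}')
```

Because the validation score was used to make the selection, it tends to be optimistic; the test score is the fairer estimate.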


## k-fold cross validation

In [5]:
# k-fold cross-validation
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# load the data in this cell so it does not depend on variables from earlier cells
X, y = load_diabetes(return_X_y=True)

k = 5
# since k=5, each run uses 4/5=80% for training and 1/5=20% for validation
# for a regressor, the default score is R^2 (coefficient of determination)
scores = cross_val_score(estimator=LinearRegression(), X=X, y=y, cv=k)
print('Scores ', scores)
print(f'Mean={np.mean(scores):.4f}, Std={np.std(scores):.4f}')

Scores  [0.42955643 0.52259828 0.4826784  0.42650827 0.55024923]
Mean=0.4823, Std=0.0493
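
To make the procedure behind `cross_val_score` more concrete, the sketch below writes the loop out by hand with `KFold`. It assumes the default behaviour for a regressor with an integer `cv`, namely unshuffled `KFold` splits scored with R^2; the names `manual_scores` and `reference` are introduced here for illustration.

```python
# Sketch: k-fold cross-validation written as an explicit loop
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5 consecutive folds, no shuffling
kf = KFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    # train on 4 folds, validate on the remaining fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# compare against the one-line helper
reference = cross_val_score(LinearRegression(), X, y, cv=5)
print('Manual scores   ', manual_scores)
print('cross_val_score ', reference)
print('Match:', np.allclose(manual_scores, reference))
```

Every sample is used for validation exactly once, so the mean of the fold scores depends less on one particular split than a single train/test score does.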
