## 6.3 Data splitting
In this notebook, we will focus on the techniques for **data splitting**.

In [None]:
# import necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [3]:
# Open the dataset

import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("yasserh/housing-prices-dataset")
#path = kagglehub.dataset_download("ignacioazua/world-gdp-population-and-co2-emissions-dataset")

print("Path to dataset files:", path)  # Path to the downloaded folder
filename = os.listdir(path)
print(filename)  # Shows content of the folder
#filepath=os.path.join(path, "World_GDP_Population_CO2_Emissions_Dataset.csv")
filepath = os.path.join(path, "Housing.csv")
print(filepath)

Path to dataset files: /home/cgraiff/.cache/kagglehub/datasets/yasserh/housing-prices-dataset/versions/1
['Housing.csv']
/home/cgraiff/.cache/kagglehub/datasets/yasserh/housing-prices-dataset/versions/1/Housing.csv


In [5]:
df = pd.read_csv(filepath)
df.head()

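| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |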


In [None]:
# For the purpose of our analysis, let's say we want to focus on predicting price (our y variable) based on area and number of bedrooms (our predictors X)

X = df[["area", "bedrooms"]]
y = df["price"]

In [None]:
# Split into train and test sets (80% train, 20% test): sklearn provides a function to do it automatically
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train:", X_train)
print("X_test:", X_test)


X_train:      area  bedrooms
46   6000         3
93   7200         3
335  3816         2
412  2610         3
471  3750         3
..    ...       ...
71   6000         4
106  5450         4
270  4500         3
435  4040         2
102  5500         3

[436 rows x 2 columns]
X_test:      area  bedrooms
316  5900         4
77   6500         3
360  4040         2
90   5000         3
493  3960         3
..    ...       ...
15   6000         4
357  6930         4
39   6000         4
54   6000         3
155  6100         3

[109 rows x 2 columns]
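
The role of `test_size` and `random_state` can be checked with a minimal sketch. It uses synthetic data as a stand-in for the housing dataset (the arrays `X_demo` and `y_demo` are assumptions made for illustration, with the same number of rows, 545), so it runs without downloading anything:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 545 rows, 2 features (values are arbitrary)
rng = np.random.default_rng(0)
X_demo = rng.integers(1000, 10000, size=(545, 2))
y_demo = rng.integers(1_000_000, 10_000_000, size=545)

# Same random_state: the rows end up in the same sets on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_seed_identical = np.array_equal(split_a[1], split_b[1])

# test_size=0.2 keeps 20% of the rows for testing and the rest for training
n_train, n_test = len(split_a[0]), len(split_a[1])

print("train rows:", n_train, "test rows:", n_test)
print("same seed gives the same test set:", same_seed_identical)
```

Fixing `random_state` makes the split reproducible, which matters when you want to compare models on exactly the same test rows.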


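However, as we have seen, a single train/test split is often not enough: if you use the test set both to tune the model and to report its final performance, the estimate becomes optimistically biased, because the model has effectively been fitted to the test data. Therefore, in many cases you need a separate **validation set** (also called a development or evaluation set) to tune and adjust the model, keeping the test set for the final evaluation only. `sklearn` does not include a built-in function for splitting a dataset into three parts, so we will first split our dataset into two parts, `train` and `temp`, and then split the latter again into `val` and `test`.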

In [8]:
X = df[["area", "bedrooms"]]
y = df["price"]

In [None]:
# Split into a training set (60%) and a temporary set (40%)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

In [None]:
# Split the temporary set in half: validation (20% of the data) and test (20% of the data)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)


In [11]:
print(len(X_train), len(X_val), len(X_test))

327 109 109
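
The two-step procedure can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not an `sklearn` API: the name `train_val_test_split` and its parameters are hypothetical, and the data is synthetic (545 rows, like the housing dataset). The key detail is that the second call needs the test fraction *relative to the temporary set*, not to the whole dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2,
                         random_state=None, stratify=False):
    """Hypothetical helper: split X, y into train, validation and test parts.

    val_size and test_size are fractions of the whole dataset.
    If stratify is True, both splits are stratified on y.
    """
    temp_size = val_size + test_size
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y,
        test_size=temp_size,
        random_state=random_state,
        stratify=y if stratify else None,
    )
    # Fraction of the temporary set that becomes the test set
    relative_test_size = test_size / temp_size
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp,
        test_size=relative_test_size,
        random_state=random_state,
        stratify=y_temp if stratify else None,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (assumption): 545 rows, 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(545, 2))
y_demo = rng.normal(size=545)

X_tr, X_va, X_te, y_tr, y_va, y_te = train_val_test_split(
    X_demo, y_demo, val_size=0.2, test_size=0.2, random_state=42
)
sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)
```

With 545 rows and a 60/20/20 split, the sizes should match the 327, 109 and 109 rows obtained above.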


We discussed the importance of **stratified splitting**, which keeps the class proportions of the target roughly equal across the splits and is especially useful for classification problems. Here is a simple way to implement it, again using `sklearn`:

In [17]:
path = kagglehub.dataset_download("uciml/pima-indians-diabetes-database")

print("Path to dataset files:", path)
filename = os.listdir(path)
print(filename) # Shows content of the folder
filepath=os.path.join(path, "diabetes.csv")
print(filepath)

df=pd.read_csv(filepath)
df.head()

Path to dataset files: /home/cgraiff/.cache/kagglehub/datasets/uciml/pima-indians-diabetes-database/versions/1
['diabetes.csv']
/home/cgraiff/.cache/kagglehub/datasets/uciml/pima-indians-diabetes-database/versions/1/diabetes.csv


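| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |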


In [None]:
# In this case, we want to predict the outcome based on all the other factors, so we select the columns accordingly

X = df.iloc[:, :-1]  # all columns except the last one (Outcome)
y = df["Outcome"]

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.4,        # 40% in temp
    random_state=42,
    stratify=y # we stratify based on y
)

In [None]:
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42,
    stratify=y_temp
)

In [21]:
from collections import Counter

print("Train:", Counter(y_train))
print("Validation:", Counter(y_val))
print("Test:", Counter(y_test))

Train: Counter({0: 299, 1: 161})
Validation: Counter({0: 100, 1: 54})
Test: Counter({0: 101, 1: 53})
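
To see what `stratify` changes, it can help to compare the share of positive cases in each split against the share in the full dataset. The sketch below is an illustration on synthetic labels, not the notebook's data: `y_demo` is built by hand with 500 zeros and 268 ones (the class counts implied by the output above), and `X_demo` is a placeholder feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 500 negatives and 268 positives
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature


def positive_rate(labels):
    """Share of samples with label 1."""
    return float(np.mean(labels))


# Plain random split: class shares can drift between the two sets
_, _, y_train_plain, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# Stratified split: class shares are preserved as closely as possible
_, _, y_train_strat, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

overall_rate = positive_rate(y_demo)
print(f"overall:    {overall_rate:.3f}")
print(f"plain:      train {positive_rate(y_train_plain):.3f}, "
      f"test {positive_rate(y_test_plain):.3f}")
print(f"stratified: train {positive_rate(y_train_strat):.3f}, "
      f"test {positive_rate(y_test_strat):.3f}")
```

In the stratified split, both rates should stay very close to the overall rate, while the plain split is free to deviate. The gap tends to matter more for small or strongly imbalanced datasets.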
