### In this notebook, we will run an extract-transform-load (ETL) pipeline that performs the following steps:

- Import the required libraries.
- Ingest the two splits (red, white) of the project data `codesignal/wine-quality` from Hugging Face Datasets.
- Combine the `red` and `white` data splits into a single HF dataset, adding a `wine_type` categorical variable with values (red, white).
- Convert the HF dataset to a pandas DataFrame.
- Set up a global random state for reproducibility.
- Shuffle the dataset.
- Split the dataset into train (60%), validate (20%), and test (20%) splits.
- Persist the data splits as CSV files into the file system.
- Load the CSV files back into pandas DataFrames for testing.
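
A 60/20/20 split done with a two-way splitter takes two passes, so the second pass needs a rescaled fraction: 20% of the whole is a larger share of what remains after the test set is removed. A minimal sketch of that arithmetic, using only the standard library; the helper name `second_stage_fraction` is hypothetical and not part of any library:

```python
from fractions import Fraction


def second_stage_fraction(val_share, test_share):
    """Share of the post-test remainder to hold out so that validation
    ends up as `val_share` of the whole dataset."""
    remainder = 1 - test_share  # share left once the test split is removed
    return val_share / remainder


# Target shares of the full dataset (exact rationals avoid float noise)
val_share, test_share = Fraction(1, 5), Fraction(1, 5)

frac = second_stage_fraction(val_share, test_share)
print(float(frac))  # 0.25
```

This is why a second split of 25% of the remaining 80% yields a validation set of about 20% of the original data.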

In [12]:
# Import the required libraries
import os

import pandas as pd
from datasets import load_dataset, concatenate_datasets
from sklearn.model_selection import train_test_split

# Data destination path (created if it does not already exist)
out = '../data/'
os.makedirs(out, exist_ok=True)

# Ingest the two splits (red, white) of the project data `codesignal/wine-quality` from Hugging Face Datasets
dataset = load_dataset('codesignal/wine-quality', split=['red', 'white'])

# Combine the `red` and `white` data splits into a single HF dataset, adding a `wine_type` categorical variable with values (red, white)
dataset_combined = concatenate_datasets([dataset[0], dataset[1]])
dataset_combined = dataset_combined.add_column('wine_type', ['red'] * len(dataset[0]) + ['white'] * len(dataset[1]))

# Convert the HF dataset to a pandas DataFrame
df = dataset_combined.to_pandas()

# Set up a global random state for reproducibility
random_state = 42

# Shuffle the dataset
df = df.sample(frac=1, random_state=random_state).reset_index(drop=True)

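# Split the dataset into train (60%), validate (20%), and test (20%) splits
# First hold out 20% of the rows for the test set
train_df, test_df = train_test_split(df, test_size=0.2, random_state=random_state)
# Then hold out 25% of the remaining 80% (20% of the total) for validation
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=random_state)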

# Persist the data splits as CSV files into the file system
train_df.to_csv(f'{out}train.csv', index=False)
val_df.to_csv(f'{out}validate.csv', index=False)
test_df.to_csv(f'{out}test.csv', index=False)

# Load the CSV files back into pandas DataFrames for testing
train_df = pd.read_csv(f'{out}train.csv')
val_df = pd.read_csv(f'{out}validate.csv')
test_df = pd.read_csv(f'{out}test.csv')

# Print each split's share of the original df
print(f"Proportion of train set: {len(train_df)/len(df)}")
print(f"Proportion of validation set: {len(val_df)/len(df)}")
print(f"Proportion of test set: {len(test_df)/len(df)}")

Proportion of train set: 0.5998152993689395
Proportion of validation set: 0.20009235031553024
Proportion of test set: 0.20009235031553024
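
The printed proportions are close to, but not exactly, 0.6/0.2/0.2 because each split must contain a whole number of rows. The sketch below reproduces the sizes with plain arithmetic. It assumes that `train_test_split` rounds a fractional `test_size` up to the next whole row and leaves the remainder on the training side, and it assumes a total of 6,497 rows (a value inferred from the printed ratios, not stated in the notebook). The function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.25):
    """Row counts for a two-pass train/validate/test split.

    Assumes each held-out part is rounded up to a whole row and the
    remainder stays on the training side.
    """
    n_test = math.ceil(test_size * n_rows)   # first pass: test set
    n_rest = n_rows - n_test                 # rows left for train + validate
    n_val = math.ceil(val_size * n_rest)     # second pass: validation set
    n_train = n_rest - n_val
    return n_train, n_val, n_test


n_rows = 6497  # assumed total, inferred from the printed proportions
n_train, n_val, n_test = split_sizes(n_rows)
print(n_train, n_val, n_test)  # 3897 1300 1300
print(n_train / n_rows, n_val / n_rows, n_test / n_rows)
```

Under these assumptions the validation and test sets each get 1,300 rows and the training set gets 3,897, which matches the slightly uneven ratios shown above.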
