# Iris Classification

## Train-Test Split

This notebook loads the data generated in the previous notebook and creates a train-test split.

The resulting splits are saved as CSV files in the data folder.

Best practice is to run analysis only on the training data to avoid data leakage. Leakage from the test set can result in overfitting and in models that generalize less well than their test scores suggest.
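
As a rough illustration of what avoiding leakage looks like in practice, the sketch below fits a preprocessing step on the training rows only and then reuses the fitted statistics on the test rows. It is not part of this notebook's pipeline: `load_iris` stands in for the data from the previous notebook, `StandardScaler` is just one example of a fitted transform, and the `_demo`, `_tr`, and `_te` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: sklearn's bundled iris data stands in for ../data/data.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
X_tr_scaled = scaler.fit_transform(X_tr)
# Apply the training statistics to the test rows; never call fit on test data
X_te_scaled = scaler.transform(X_te)
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence the training features, which is the kind of leakage described above.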

In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Check location to ensure we load data from the correct path
os.getcwd()

In [None]:
# Define path to input file
data_file_path = os.path.join('..', 'data', 'data.csv')

In [None]:
# Load csv into pandas df
df = pd.read_csv(data_file_path)

In [None]:
# Check df looks correct
df.head()

In [None]:
# Verify data shape
df.shape

In [None]:
# y holds the target column only
# X holds the feature columns (everything except the target)
# NOTE: by convention, y is lowercase and X is uppercase
y = df['target']
X = df.drop(columns=['target'])

In [None]:
# Check X
X.head()

In [None]:
# Check y
y.head()

In [None]:
# Print shapes to ensure matching dimension
print(X.shape, y.shape, sep='\n')

In [None]:
# Perform split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
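
One way to check that `stratify=y` did what was intended is to compare class proportions across the two splits. The sketch below does this on a small synthetic frame so it runs on its own. The `demo` frame and the `_d` names are hypothetical stand-ins, and the 150-row, three-class shape is an assumption about the data rather than something read from `data.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 rows, three balanced classes
demo = pd.DataFrame({'feature': range(150), 'target': [0, 1, 2] * 50})
X_d = demo.drop(columns=['target'])
y_d = demo['target']

X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_d, y_d, test_size=0.33, random_state=42, stratify=y_d
)

# Class proportions in each split; with stratification they should be close
train_props = y_d_train.value_counts(normalize=True).sort_index()
test_props = y_d_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_props, 'test': test_props}))
```

The same two `value_counts(normalize=True)` calls can be run on `y_train` and `y_test` from the cell above.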

In [None]:
# Create list of dfs and output names
dfs = [X_train, X_test, y_train, y_test]
csv_names = ['X_train.csv', 'X_test.csv', 'y_train.csv', 'y_test.csv']

In [None]:
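# Iterate over lists, saving each split to its matching path
# (the loop variable is named split_df so it does not overwrite the df loaded above)
data_path = os.path.join('..', 'data')
for split_df, name in zip(dfs, csv_names):
    out_path = os.path.join(data_path, name)
    split_df.to_csv(out_path, index=False)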
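
A quick round-trip check can confirm that files written with `index=False` load back as expected. The sketch below uses a temporary directory and tiny hypothetical frames (`demo_X`, `demo_y`) rather than the real `../data` folder, so it can be run without touching the saved splits.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for one feature split and one target split
demo_X = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
demo_y = pd.Series([0, 1, 0], name='target')

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_demo.csv')
    y_path = os.path.join(tmp_dir, 'y_demo.csv')
    demo_X.to_csv(x_path, index=False)
    demo_y.to_csv(y_path, index=False)

    # A Series is written as a single named column, so select it again on load
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)['target']

print(X_back.shape, y_back.shape, sep='\n')
```

Because `index=False` drops the original row index, features and targets stay aligned only by row order, so the matching `X_*.csv` and `y_*.csv` files should be loaded together and not reordered independently.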