# Iris Dataset Preprocessing

The Iris dataset contains three classes (species), and each sample has four features. The "Iris-virginica" and "Iris-versicolor" classes are _not_ linearly separable, meaning no single linear decision boundary, such as the one learned by a perceptron, can classify them perfectly. These two classes will be used to train multiple network designs, both quantum and classical.
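
As a rough illustration of the separability claim, the sketch below fits a perceptron to two pairs of classes and reports training accuracy. It assumes scikit-learn's bundled copy of the dataset (`load_iris`, with integer targets 0 = setosa, 1 = versicolor, 2 = virginica) rather than the `IRIS.csv` file used in this notebook, and the helper name `perceptron_train_accuracy` is made up for this example. A perceptron that falls short of perfect training accuracy is consistent with non-separability but does not prove it on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

# Assumption: scikit-learn's bundled Iris data stands in for IRIS.csv.
iris = load_iris()
X, y = iris.data, iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica


def perceptron_train_accuracy(class_a, class_b):
    """Fit a perceptron on two classes and return its training accuracy."""
    mask = np.isin(y, [class_a, class_b])
    # tol=None disables early stopping, so all max_iter epochs are run.
    clf = Perceptron(max_iter=1000, tol=None, random_state=0)
    clf.fit(X[mask], y[mask])
    return clf.score(X[mask], y[mask])


acc_setosa_versicolor = perceptron_train_accuracy(0, 1)
acc_versicolor_virginica = perceptron_train_accuracy(1, 2)

print(f"setosa vs versicolor:    {acc_setosa_versicolor:.3f}")
print(f"versicolor vs virginica: {acc_versicolor_virginica:.3f}")
```

Because versicolor and virginica overlap in feature space, the second accuracy stays below 1.0 no matter how long the perceptron trains, while the setosa pair is linearly separable and would typically be classified perfectly.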

Here, I drop the Iris-setosa samples, encode the two remaining labels as 0 and 1, and split the data into stratified training and testing sets (80% / 20%).


In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv("./IRIS.csv")

# Remove all instances of Iris-setosa, keeping Iris-versicolor and Iris-virginica
df = df[~df['species'].str.contains('Iris-setosa')]

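# Convert labels to 0 or 1 (sorted so the mapping is identical on every run;
# iterating over a set of strings can change order between Python sessions)
labels = df['species']
unique_labels = {name: i for i, name in enumerate(sorted(labels.unique()))}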
labels = labels.map(unique_labels)
labels = labels.to_numpy()

inputs = df.drop('species', axis=1).to_numpy()

# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    inputs, 
    labels, 
    test_size=0.2, 
    random_state=42,
    stratify=labels)

np.save("./iris_train_x.npy", x_train)
np.save("./iris_test_x.npy", x_test)
np.save("./iris_train_y.npy", y_train)
np.save("./iris_test_y.npy", y_test)
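
The saved arrays can be reloaded in later notebooks. The following is a minimal sketch of such a loader, assuming the four file names written above; the function name `load_iris_splits` and its consistency checks are illustrative additions, not part of the original workflow.

```python
from pathlib import Path

import numpy as np


def load_iris_splits(directory="."):
    """Load the four .npy files written by the preprocessing cell.

    Hypothetical helper: checks only basic shape consistency.
    """
    directory = Path(directory)
    x_train = np.load(directory / "iris_train_x.npy")
    x_test = np.load(directory / "iris_test_x.npy")
    y_train = np.load(directory / "iris_train_y.npy")
    y_test = np.load(directory / "iris_test_y.npy")

    # Each input row must have exactly one label.
    if x_train.shape[0] != y_train.shape[0]:
        raise ValueError("training inputs and labels differ in length")
    if x_test.shape[0] != y_test.shape[0]:
        raise ValueError("testing inputs and labels differ in length")
    # Both splits must share the same number of features.
    if x_train.shape[1] != x_test.shape[1]:
        raise ValueError("train and test feature counts differ")

    return x_train, x_test, y_train, y_test
```

Calling `x_train, x_test, y_train, y_test = load_iris_splits(".")` from the directory containing the files returns the splits; `np.unique(y_train, return_counts=True)` can then be used to confirm that stratification kept the two classes balanced.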