# Required assignment 8.1 Predicting plant types

In this activity, you will work with the iris data set, which was first used by Sir Ronald Fisher and is perhaps the best-known data set in the pattern recognition literature. It contains three classes of 50 instances each, where each class refers to a type of iris plant. There are four features: sepal length, sepal width, petal length and petal width.

This data set is commonly used for classification tasks, where the goal is to predict the species of an iris flower based on these measurements.

In this assignment, you will:

1. Load the iris data set and select only the sepal length and sepal width features

2. Compute the mean and standard deviation from all of the data to normalise these features using z-score normalisation

3. Train a KNN classifier with k = 3 on the normalised data using Euclidean distance

4. Predict the classes of the original data set samples using the trained KNN model

5. Define and use a function to normalise and predict the classes of new, unseen samples with the same scaling parameters
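
To make step 3 concrete, here is a minimal sketch of what a KNN classifier with Euclidean distance does for a single query point. The function name `knn_predict_one` and the toy arrays are made up for illustration; they are not part of the assignment, and scikit-learn's `KNeighborsClassifier` may break ties differently.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, query, k=3):
    """Hypothetical helper: classify one point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up data set: two well-separated clusters in 2-D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3]))
label_near_three = knn_predict_one(X_toy, y_toy, np.array([2.8, 3.0]))
print(label_near_origin, label_near_three)
```

Because the vote depends on raw distances, features measured on larger scales would dominate the result, which is why the features are normalised before fitting.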

In [1]:
#Import the necessary libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


### Question 1:

- Load the iris data set using the `load_iris()` function. 

- Slice the data array to extract only the first two features: sepal length and sepal width.

- Store the sliced features in `X` and the target in `y`.

In [2]:
### GRADED CELL
iris = None
X = None
y = None
target_names = None

# YOUR CODE HERE
#raise NotImplementedError()

# Load iris dataset
iris = load_iris()

# Select features: first two columns (sepal length, sepal width)
X = iris.data[:, :2]

# Target labels (species)
y = iris.target

# Target names (for readability)
target_names = iris.target_names



### Question 2:

Compute the mean and standard deviation from the entire data set.

Hint: Use the `.mean_` and `.scale_` attributes of a fitted `StandardScaler`.

In [3]:
###GRADED CELL
mean = None
std = None
scaler = StandardScaler()
X_normalised = scaler.fit_transform(X)

# YOUR CODE HERE
#raise NotImplementedError()

# Fit the scaler and transform the data
scaler = StandardScaler()
X_normalised = scaler.fit_transform(X)

# Get the mean and standard deviation used
mean = scaler.mean_
std = scaler.scale_

print("Mean (µ) used for normalisation:", mean)
print("Standard deviation (σ) used for normalisation:", std)


Mean (µ) used for normalisation: [5.84333333 3.05733333]
Standard deviation (σ) used for normalisation: [0.82530129 0.43441097]
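
As a cross-check, the same statistics can be reproduced with plain NumPy. The sketch below assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` stores in `scale_`; the names `X_check` and `scaler_check` are illustrative and chosen so that they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_check = load_iris().data[:, :2]  # sepal length, sepal width
scaler_check = StandardScaler().fit(X_check)

# Manual z-score: subtract the column mean, divide by the population std
manual_mean = X_check.mean(axis=0)
manual_std = X_check.std(axis=0)  # NumPy's default is ddof=0
X_manual = (X_check - manual_mean) / manual_std

# Each comparison should print True if the manual version matches the scaler
print(np.allclose(manual_mean, scaler_check.mean_))
print(np.allclose(manual_std, scaler_check.scale_))
print(np.allclose(X_manual, scaler_check.transform(X_check)))
```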


### Question 3:

- Use `KNeighborsClassifier` with three neighbours. 

- Use `.fit()` to train the model using the normalised input `X_normalised`.

- Then use the `.predict()` method to find the predicted output.

Store the predicted output in `y_pred`. 

In [4]:
### GRADED CELL
knn = None
y_pred = None

# YOUR CODE HERE
#raise NotImplementedError()

# k-NN with k=3 using Euclidean distance (default)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_normalised, y) # fit on the normalised data
y_pred = knn.predict(X_normalised) # predict on the same (original) samples

# training accuracy just to sanity-check
train_acc = (y_pred == y).mean()
print("Training accuracy:", round(train_acc, 3))


Training accuracy: 0.833


Let's review the predictions of five random samples:

In [5]:
random_indices = np.random.choice(len(X), size=5, replace=False)

print("\nSample predictions (5 random instances):")
for i in random_indices:
    print(f"Sepal length: {X[i,0]:.2f}, Sepal width: {X[i,1]:.2f} | "
          f"True class: {target_names[y[i]]} | Predicted class: {target_names[y_pred[i]]}")

# Not requested in the exercise, but it is good practice to review accuracy
train_acc = (y_pred == y).mean() # sepal features alone give only moderate accuracy
print("\n Training accuracy:\n", round(train_acc, 3))



Sample predictions (5 random instances):
Sepal length: 6.00, Sepal width: 3.40 | True class: versicolor | Predicted class: virginica
Sepal length: 6.00, Sepal width: 2.90 | True class: versicolor | Predicted class: versicolor
Sepal length: 6.30, Sepal width: 2.70 | True class: virginica | Predicted class: virginica
Sepal length: 4.40, Sepal width: 2.90 | True class: setosa | Predicted class: setosa
Sepal length: 5.70, Sepal width: 2.60 | True class: versicolor | Predicted class: versicolor

 Training accuracy:
 0.833
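
Training accuracy is measured on the same samples the model was fitted on, so it tends to be optimistic. One possible way to get a less biased estimate, sketched here as an optional extra rather than part of the assignment, is 5-fold cross-validation with the scaler and classifier wrapped in a pipeline, so that the scaling parameters are recomputed on each training fold. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris_cv = load_iris()
X_sepal = iris_cv.data[:, :2]  # sepal length, sepal width
y_sepal = iris_cv.target

# Scaling happens inside the pipeline, so each fold is normalised
# using only its own training portion
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
cv_scores = cross_val_score(model, X_sepal, y_sepal, cv=5)

print("Fold accuracies:", cv_scores.round(3))
print("Mean CV accuracy:", round(cv_scores.mean(), 3))
```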


### Question 4:

- Create a function called `predict_new_samples(new_samples)` that returns the class names (the matching entries of `target_names`) for the predictions.

- Normalise the input using the same mean and standard deviation computed earlier.

- Use `knn.predict()` to compute the predictions.

In [6]:
###GRADED CELL
def predict_new_samples(new_samples):
    """
    new_samples: numpy array of shape (n_samples, 2) with sepal length and width
    new_samples_norm: to be computed from scaler.mean_ and scaler.scale_
    predictions: numpy array of shape (n_samples,) with predicted class indices
    returns: numpy array of shape (n_samples,) with the predicted class names
    """
    return None
# YOUR CODE HERE
#raise NotImplementedError()

def predict_new_samples(new_samples):

    # normalise with the same scaler used earlier
    new_samples_norm = (new_samples - scaler.mean_) / scaler.scale_
    
    # predict with trained KNN
    predictions = knn.predict(new_samples_norm)
    
    # map indices to target names
    return target_names[predictions]


The predictions for new samples can be computed using the `predict_new_samples()` function.

In [7]:
new_data = np.array([[5.0, 3.5], [6.5, 3.0]])
predicted_classes = predict_new_samples(new_data)
print("\nPredictions for new samples:")
for sample, pred in zip(new_data, predicted_classes):
    print(f"Sepal length: {sample[0]}, Sepal width: {sample[1]} => Predicted class: {pred}")



Predictions for new samples:
Sepal length: 5.0, Sepal width: 3.5 => Predicted class: setosa
Sepal length: 6.5, Sepal width: 3.0 => Predicted class: virginica
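
`knn.predict()` expects a 2-D array of shape `(n_samples, n_features)`, so a single flower passed as a flat list would raise an error. The sketch below shows one possible way to accept both forms using `np.atleast_2d`. The function `predict_flexible` and the `sepal_*` names are hypothetical and rebuilt here only so that the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_sepal, y_sepal = data.data[:, :2], data.target

# Same setup as earlier in the notebook: z-score scaling, then KNN with k=3
sepal_scaler = StandardScaler().fit(X_sepal)
sepal_knn = KNeighborsClassifier(n_neighbors=3)
sepal_knn.fit(sepal_scaler.transform(X_sepal), y_sepal)

def predict_flexible(samples):
    """Hypothetical variant: accepts one sample (shape (2,)) or many (shape (n, 2))."""
    # Promote a single 1-D sample to a 2-D array with one row
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    # scaler.transform applies (x - mean_) / scale_ column by column
    indices = sepal_knn.predict(sepal_scaler.transform(samples))
    return data.target_names[indices]

single = predict_flexible([5.0, 3.5])
batch = predict_flexible([[5.0, 3.5], [6.5, 3.0]])
print(single)
print(batch)
```

Using `scaler.transform()` here is equivalent to the manual `(new_samples - scaler.mean_) / scaler.scale_` expression, since both apply the parameters learned from the training data.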


### Question 5:

Instead of only using `sepal_length` and `sepal_width`, use all four features as input. Use a KNN classifier with five nearest neighbours. Fit the model on the normalised data, and use the built-in `.predict()` method to make a prediction for the sample input.

Here are the steps you need to follow:

1. Load the iris data set.

2. Split the features and target and store them in `X_full` and `y_full`.

3. Use `StandardScaler` to compute `X_full_normalised`.

4. Create a `KNeighborsClassifier` with five nearest neighbours.

5. Fit the model using `knn.fit()`.

6. Use the sample `unknown_sample = np.array([[5.8,2.7,5.1,1.9]])`.

7. Normalise the sample and use it to predict the output class.

In [11]:
### GRADED CELL
iris_full = ...
X_full = ...
y_full = ...
target_names = ...
scaler = ...
X_full_normalised = ...
knn = ...

#Sample to classify (sepal_len, sepal_wid, petal_len, petal_wid)
unknown_sample = np.array([[5.8, 2.7, 5.1, 1.9]]) 

unknown_sample_normalised = ...
predicted_class_index = ...
predicted_class_name = ...

# YOUR CODE HERE
#raise NotImplementedError()

# 1) Load the iris dataset
iris_full = load_iris()

# 2) Features (all 4) and target
X_full = iris_full.data # shape (150, 4)
y_full = iris_full.target # 0/1/2
target_names = iris_full.target_names # ['setosa','versicolor','virginica']

# 3) Standardise (z-score) using the full feature set
scaler = StandardScaler()
X_full_normalised = scaler.fit_transform(X_full)

# 4) KNN with k=5 (Euclidean is default)
knn = KNeighborsClassifier(n_neighbors=5)

# 5) Fit the model
knn.fit(X_full_normalised, y_full)

# 7) Normalise the sample with the SAME scaler and predict
unknown_sample_normalised = scaler.transform(unknown_sample)
predicted_class_index = knn.predict(unknown_sample_normalised)[0]
predicted_class_name = target_names[predicted_class_index]

print(f"Predicted class for the unknown sample {unknown_sample[0]} is: {predicted_class_name}")


# Optional: also report the predicted probability for each class
probs = knn.predict_proba(unknown_sample_normalised)[0]
print("\nClass probabilities:")
for name, prob in zip(target_names, probs):
    print(f"{name}: {prob:.3f}")


Predicted class for the unknown sample [5.8 2.7 5.1 1.9] is: virginica

Class probabilities:
setosa: 0.000
versicolor: 0.200
virginica: 0.800
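
With the default uniform weights, the probabilities reported by `predict_proba()` are the fractions of the k nearest neighbours that belong to each class. The sketch below recovers them from `kneighbors()`, and should reproduce the 0.200 / 0.800 split shown above if the setup matches. The names `full_scaler` and `full_knn` are illustrative and rebuilt here so that the example is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_iris()
full_scaler = StandardScaler().fit(data.data)
full_knn = KNeighborsClassifier(n_neighbors=5)
full_knn.fit(full_scaler.transform(data.data), data.target)

sample = full_scaler.transform(np.array([[5.8, 2.7, 5.1, 1.9]]))

# Distances to, and row indices of, the 5 nearest training points
distances, indices = full_knn.kneighbors(sample)
neighbour_labels = data.target[indices[0]]

# Fraction of neighbours in each of the three classes
vote_fractions = np.bincount(neighbour_labels, minlength=3) / len(neighbour_labels)

print("Neighbour labels:", neighbour_labels)
print("Vote fractions:  ", vote_fractions)
print("predict_proba:   ", full_knn.predict_proba(sample)[0])
```

These values describe how the neighbours voted rather than calibrated probabilities; with k = 5 they can only take multiples of 0.2.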
