# Practicing a tiny supervised learning flow (B2 level)


I am following the first supervised learning steps from the "Machine Learning 1" notes. My goal is to keep the code simple and to explain every move like a beginner who is still connecting the dots.

I will:
1. Load the small Iris CSV file that was shared with the class.
2. Explore basic statistics to imitate the descriptive analysis suggested in the theory.
3. Split the data into train and test parts so I can evaluate honestly.
4. Build a tiny k-nearest neighbors (k-NN) classifier from scratch using only Python basics.
5. Measure the accuracy and try a manual prediction.

In [None]:
import csv
import random
from pathlib import Path

possible_paths = [
    Path("iris_sample.csv"),
    Path("B2/iris_sample.csv"),
]

for option in possible_paths:
    if option.exists():
        data_path = option
        break
else:
    raise FileNotFoundError("Could not find iris_sample.csv in the notebook folder or in B2/.")

print(f"Using data file: {data_path}")


## Loading a friendly mini data set
The notes talk about the Iris flowers, so I read the provided CSV file (four features + species).
Having the values on disk keeps the project self-contained and still feels like a real workflow.

In [None]:
with data_path.open(newline="") as csv_file:
    reader = csv.DictReader(csv_file)
    iris_rows = list(reader)
    header = reader.fieldnames or []

if not iris_rows:
    raise ValueError("The CSV file is empty.")

print(f"Columns found: {header}")
print(f"Loaded {len(iris_rows)} rows")


## Parsing the rows into numbers and labels
The theory says we work with feature vectors and targets.
I convert each row into a tuple of the form ([features], label), with the feature values cast to floats.

In [None]:
feature_columns = [name for name in header if name != "species"]

def parse_rows(rows):
    dataset = []
    for row in rows:
        features = [float(row[column]) for column in feature_columns]
        label = row["species"]
        dataset.append((features, label))
    return dataset

full_dataset = parse_rows(iris_rows)
print(full_dataset[0])
print(f"Total samples: {len(full_dataset)}")


## Exploring the basic statistics
To keep it simple, I calculate the minimum, maximum, and average of each feature.
This mimics the descriptive analysis chapter.

In [None]:
def describe_feature(dataset, index):
    values = [row[0][index] for row in dataset]
    minimum = min(values)
    maximum = max(values)
    average = sum(values) / len(values)
    return minimum, maximum, average

friendly_feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

for idx, friendly_name in enumerate(friendly_feature_names):
    min_val, max_val, avg_val = describe_feature(full_dataset, idx)
    print(f"{friendly_name}: min={min_val:.1f}, max={max_val:.1f}, avg={avg_val:.2f}")
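
A class count fits next to the min/max/average summary, because a very uneven mix of species would make the accuracy harder to read later. This is only a small sketch: `toy_dataset` holds a few illustrative rows in the same ([features], label) shape as `full_dataset`, and `count_labels` is a helper name I made up.

```python
from collections import Counter

# Illustrative rows only, shaped like full_dataset: ([features], label).
toy_dataset = [
    ([5.1, 3.5, 1.4, 0.2], "setosa"),
    ([4.9, 3.0, 1.4, 0.2], "setosa"),
    ([6.4, 3.2, 4.5, 1.5], "versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "virginica"),
]

def count_labels(dataset):
    """Return how many samples each label has."""
    return Counter(label for _, label in dataset)

label_counts = count_labels(toy_dataset)
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count}")
```

In the notebook itself I would call `count_labels(full_dataset)` instead of using the toy rows.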


## Train-test split

I shuffle the data with the fixed seed that our teacher mentioned (90) so the results stay repeatable.


In [None]:
def train_test_split(dataset, test_ratio=0.25, seed=90):
    random.seed(seed)
    shuffled = dataset[:]
    random.shuffle(shuffled)
    test_size = max(1, int(len(shuffled) * test_ratio))
    test_data = shuffled[:test_size]
    train_data = shuffled[test_size:]
    return train_data, test_data

train_data, test_data = train_test_split(full_dataset)
print(f"Train size: {len(train_data)}")
print(f"Test size: {len(test_data)}")
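
One way to check that the seed really makes the split repeatable is to split twice and compare. The sketch below is an assumption of mine, not something from the notes: `split_with_local_rng` is a hypothetical variant that uses its own `random.Random(seed)` generator, so it does not reseed the global `random` module, and the data is a made-up list of 12 samples.

```python
import random

def split_with_local_rng(dataset, test_ratio=0.25, seed=90):
    """Shuffle a copy with a private generator, then cut off the test part."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    test_size = max(1, int(len(shuffled) * test_ratio))
    return shuffled[test_size:], shuffled[:test_size]

# Made-up samples, only to exercise the function.
toy_dataset = [([float(i)], f"label_{i % 3}") for i in range(12)]

first_train, first_test = split_with_local_rng(toy_dataset)
second_train, second_test = split_with_local_rng(toy_dataset)

print(first_test == second_test)          # same seed, same test part
print(len(first_train), len(first_test))  # 12 samples at 25% leaves 9 and 3
```

Every sample lands in exactly one of the two parts, so the test set never overlaps with the training set.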


## Building a pocket k-NN classifier
The theory explains that k-NN looks at the closest examples.
I use the Euclidean distance and keep k = 3 neighbors because it is a small data set.

In [None]:
def euclidean_distance(a, b):
    total = 0.0
    for value_a, value_b in zip(a, b):
        total += (value_a - value_b) ** 2
    return total ** 0.5


def knn_predict(train_set, new_sample, k=3):
    distances = []
    for features, label in train_set:
        distance = euclidean_distance(features, new_sample)
        distances.append((distance, label))
    distances.sort(key=lambda item: item[0])
    neighbors = distances[:k]
    votes = {}
    for _, neighbor_label in neighbors:
        votes[neighbor_label] = votes.get(neighbor_label, 0) + 1
    best_label = max(votes.items(), key=lambda item: item[1])[0]
    return best_label

print(knn_predict(train_data, test_data[0][0]))
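
With k = 3 and three species, each neighbor can vote for a different class. The sketch below replays only the voting part of `knn_predict` on invented distances to show what happens in that case. It assumes Python 3.7 or newer, where dictionaries keep insertion order.

```python
# Neighbors already sorted by distance; the numbers are invented for illustration.
neighbors = [(0.2, "versicolor"), (0.3, "virginica"), (0.5, "setosa")]

votes = {}
for _, neighbor_label in neighbors:
    votes[neighbor_label] = votes.get(neighbor_label, 0) + 1

# max() returns the first item that reaches the highest count, and the dict
# was filled from nearest to farthest, so a tie goes to the closest neighbor.
best_label = max(votes.items(), key=lambda item: item[1])[0]
print(best_label)  # versicolor
```

So a three-way tie is not random here: the label of the nearest neighbor wins.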


## Evaluating accuracy
I check how many test samples the classifier gets right and compute the proportion.

In [None]:
def accuracy_score(model_data, test_set, k=3):
    correct = 0
    evaluation_pairs = []
    for features, expected_label in test_set:
        predicted_label = knn_predict(model_data, features, k=k)
        evaluation_pairs.append((expected_label, predicted_label))
        if predicted_label == expected_label:
            correct += 1
    accuracy_value = correct / len(test_set)
    return accuracy_value, evaluation_pairs

accuracy, evaluation_pairs = accuracy_score(train_data, test_data)
print(f"Accuracy on the test set: {accuracy:.2f}")
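
Since `accuracy_score` takes `k` as a parameter, looping over a few values is one way to see how sensitive the result is. The sketch below is self-contained and uses stand-ins that are my own assumptions: `knn_label` and `accuracy_for_k` are hypothetical compact helpers (built on `math.dist`, which needs Python 3.8+, and `collections.Counter`), and the two-feature points are invented.

```python
import math
from collections import Counter

def knn_label(train_set, sample, k):
    """Return the majority label among the k training points closest to sample."""
    ranked = sorted(train_set, key=lambda pair: math.dist(pair[0], sample))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy_for_k(train_set, test_set, k):
    """Share of test samples whose predicted label matches the expected one."""
    hits = sum(
        1 for features, expected in test_set
        if knn_label(train_set, features, k) == expected
    )
    return hits / len(test_set)

# Invented two-feature points forming two well-separated groups.
toy_train = [
    ([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([0.8, 1.1], "A"),
    ([5.0, 5.0], "B"), ([5.2, 4.9], "B"), ([4.8, 5.1], "B"),
]
toy_test = [([1.1, 1.0], "A"), ([5.1, 5.0], "B")]

results = {k: accuracy_for_k(toy_train, toy_test, k) for k in (1, 3, 5)}
for k, value in results.items():
    print(f"k={k}: accuracy={value:.2f}")
```

With the real data the same loop would call `accuracy_score(train_data, test_data, k=k)`. Picking the best k from the test set makes the final number optimistic, so a separate validation part would be the more honest choice.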


## Confusion matrix with the classroom orientation

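Our professor warned us that Python tools often show the confusion matrix the other way around (scikit-learn's `confusion_matrix`, for example, puts actual labels on the rows and predicted labels on the columns). To match the class notes, I build it with predicted labels on the rows and actual labels on the columns.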


In [None]:
def build_confusion_matrix(pairs):
    labels = sorted({expected for expected, _ in pairs} | {predicted for _, predicted in pairs})
    label_index = {label: position for position, label in enumerate(labels)}
    matrix = [[0 for _ in labels] for _ in labels]
    for expected, predicted in pairs:
        row = label_index[predicted]
        column = label_index[expected]
        matrix[row][column] += 1
    return labels, matrix

matrix_labels, confusion = build_confusion_matrix(evaluation_pairs)
print("Labels (in order):", matrix_labels)
print("Confusion matrix (rows = predicted, columns = actual):")
for row in confusion:
    print(row)
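
If I ever need the other orientation (rows = actual, columns = predicted), transposing the matrix is enough. The counts below are invented for illustration, and the variable names are my own.

```python
# Invented counts in the classroom orientation: rows = predicted, columns = actual.
labels = ["setosa", "versicolor", "virginica"]
classroom_matrix = [
    [5, 0, 0],
    [0, 4, 1],
    [0, 2, 6],
]

# zip(*matrix) regroups the values column by column, which swaps rows and
# columns and gives rows = actual, columns = predicted.
actual_rows_matrix = [list(column) for column in zip(*classroom_matrix)]

for label, row in zip(labels, actual_rows_matrix):
    print(label, row)

# The diagonal (correct predictions) is the same in both orientations.
correct = sum(classroom_matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in classroom_matrix)
print(f"Accuracy from the matrix: {correct / total:.2f}")
```

Only the off-diagonal cells change places, so the accuracy read from the diagonal stays the same.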


## Trying a manual prediction
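To see the algorithm in action, I pick a flower with a petal length of 4.7 cm and check the predicted species. The four values follow the feature order used above: sepal length, sepal width, petal length, petal width.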

In [None]:
mystery_flower = [6.1, 2.8, 4.7, 1.4]
prediction = knn_predict(train_data, mystery_flower)
print(f"Predicted species: {prediction}")


## Small conclusions
- Reading the CSV myself made me notice how the raw numbers map onto the features and targets described in the theory.
- The evaluation step shows whether my simple intuition is on the right track.
- With more data I could tune *k* or try other algorithms from the course notes.