# Practicing a tiny supervised learning flow (B2 level)


I am following the first supervised learning steps from the "Machine Learning 1" notes. My goal is to keep the code simple and to explain every move like a beginner who is still connecting the dots.

I will:
1. Write down a very small Iris-style data set by hand.
2. Explore basic statistics to imitate the descriptive analysis suggested in the theory.
3. Split the data into train and test parts so I can evaluate honestly.
4. Build a tiny k-nearest neighbors (k-NN) classifier from scratch using only Python basics.
5. Measure the accuracy and try a manual prediction.

In [None]:
import random


## Creating a friendly mini data set
The notes talk about the Iris flowers, so I copy twelve rows (four features + species).
Having the values here keeps the project self-contained and avoids extra installations.

In [None]:
iris_rows = [
    "5.1,3.5,1.4,0.2,setosa",
    "4.9,3.0,1.4,0.2,setosa",
    "5.0,3.6,1.4,0.3,setosa",
    "5.4,3.9,1.7,0.4,setosa",
    "5.8,2.7,4.1,1.0,versicolor",
    "6.0,2.2,4.0,1.0,versicolor",
    "6.4,3.2,4.5,1.5,versicolor",
    "6.6,2.9,4.6,1.3,versicolor",
    "6.3,3.3,6.0,2.5,virginica",
    "5.8,2.7,5.1,1.9,virginica",
    "7.1,3.0,5.9,2.1,virginica",
    "6.5,3.0,5.2,2.0,virginica"
]
print(f"Loaded {len(iris_rows)} rows")


## Parsing the rows into numbers and labels
The theory says we work with feature vectors and targets.
I convert each line into a tuple: ([features], label).

In [None]:
def parse_rows(rows):
    dataset = []
    for row in rows:
        parts = row.split(",")
        features = [float(value) for value in parts[:4]]
        label = parts[4]
        dataset.append((features, label))
    return dataset

full_dataset = parse_rows(iris_rows)
print(full_dataset[0])
print(f"Total samples: {len(full_dataset)}")


## Exploring the basic statistics
To keep it simple I calculate minimum, maximum, and average for each feature.
This mimics the descriptive analysis chapter.

In [None]:
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)"
]

def describe_feature(dataset, index):
    values = [row[0][index] for row in dataset]
    minimum = min(values)
    maximum = max(values)
    average = sum(values) / len(values)
    return minimum, maximum, average

for idx, name in enumerate(feature_names):
    min_val, max_val, avg_val = describe_feature(full_dataset, idx)
    print(f"{name}: min={min_val:.1f}, max={max_val:.1f}, avg={avg_val:.2f}")
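
The overall minimum and maximum mix the three species together. As a small complement, the sketch below averages one feature per species. `mean_by_species` is a helper name I made up for this note, and the twelve rows are repeated only so the cell runs on its own.

```python
from collections import defaultdict

# The same twelve rows as above, repeated so this cell is self-contained.
rows = [
    "5.1,3.5,1.4,0.2,setosa",
    "4.9,3.0,1.4,0.2,setosa",
    "5.0,3.6,1.4,0.3,setosa",
    "5.4,3.9,1.7,0.4,setosa",
    "5.8,2.7,4.1,1.0,versicolor",
    "6.0,2.2,4.0,1.0,versicolor",
    "6.4,3.2,4.5,1.5,versicolor",
    "6.6,2.9,4.6,1.3,versicolor",
    "6.3,3.3,6.0,2.5,virginica",
    "5.8,2.7,5.1,1.9,virginica",
    "7.1,3.0,5.9,2.1,virginica",
    "6.5,3.0,5.2,2.0,virginica",
]


def mean_by_species(rows, index):
    """Average the feature at `index` separately for each species."""
    grouped = defaultdict(list)
    for row in rows:
        parts = row.split(",")
        grouped[parts[4]].append(float(parts[index]))
    return {species: sum(values) / len(values) for species, values in grouped.items()}


petal_length_means = mean_by_species(rows, 2)  # index 2 = petal length
for species, mean in petal_length_means.items():
    print(f"{species}: mean petal length = {mean:.2f} cm")
```

If the per-species averages are far apart, that feature alone already separates the classes reasonably well, which is useful to know before trusting a distance-based method.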


## Train-test split
I shuffle the data with a fixed seed (so results are repeatable) and keep 25% for testing. With twelve rows that means 3 test samples and 9 training samples.

In [None]:
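def train_test_split(dataset, test_ratio=0.25, seed=42):
    # A local generator keeps the shuffle repeatable without
    # resetting the global random state for the rest of the notebook.
    rng = random.Random(seed)
    shuffled = dataset[:]  # copy, so the original order is untouched
    rng.shuffle(shuffled)
    # Keep at least one test sample, even for very small data sets.
    test_size = max(1, int(len(shuffled) * test_ratio))
    test_data = shuffled[:test_size]
    train_data = shuffled[test_size:]
    return train_data, test_data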

train_data, test_data = train_test_split(full_dataset)
print(f"Train size: {len(train_data)}")
print(f"Test size: {len(test_data)}")
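
With only three test samples, a plain shuffle can easily leave one species out of the test part. One possible remedy, which I did not take from the notes, is to split each species separately (a stratified split). The sketch below assumes the same `([features], label)` tuples; `stratified_split` and the one-feature placeholder samples are my own inventions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(dataset, test_ratio=0.25, seed=42):
    """Split each label group separately so every class reaches the test part."""
    rng = random.Random(seed)  # local generator, global state untouched
    by_label = defaultdict(list)
    for features, label in dataset:
        by_label[label].append((features, label))

    train_part, test_part = [], []
    for label in sorted(by_label):  # sorted for a repeatable order
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = max(1, int(len(group) * test_ratio))
        test_part.extend(group[:n_test])
        train_part.extend(group[n_test:])
    return train_part, test_part


# Placeholder data: four made-up one-feature samples per species.
toy_dataset = [
    ([float(i)], label)
    for label in ("setosa", "versicolor", "virginica")
    for i in range(4)
]

train_part, test_part = stratified_split(toy_dataset)
print(f"Train size: {len(train_part)}, test size: {len(test_part)}")
print(sorted(label for _, label in test_part))
```

Note that `max(1, ...)` sends at least one sample per class to the test part, so a class with a single example would have nothing left for training.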


## Building a pocket k-NN classifier
The theory explains that k-NN looks at the closest examples.
I use the Euclidean distance and keep k = 3 neighbors because it is a small data set.

In [None]:
def euclidean_distance(a, b):
    total = 0.0
    for value_a, value_b in zip(a, b):
        total += (value_a - value_b) ** 2
    return total ** 0.5


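def knn_predict(train_set, new_sample, k=3):
    # Distance from the new sample to every training sample.
    distances = []
    for features, label in train_set:
        distance = euclidean_distance(features, new_sample)
        distances.append((distance, label))
    # Closest first, then keep the k nearest neighbors.
    distances.sort(key=lambda item: item[0])
    neighbors = distances[:k]
    # Each neighbor votes for its own label.
    votes = {}
    for _, neighbor_label in neighbors:
        votes[neighbor_label] = votes.get(neighbor_label, 0) + 1
    # On a tied vote, max() keeps the tied label that entered the dict first,
    # which is the one whose neighbor is closest.
    best_label = max(votes.items(), key=lambda item: item[1])[0]
    return best_label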

print(knn_predict(train_data, test_data[0][0]))
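
To convince myself that the distance function does what the formula says, here is a worked example on two rows of the data set (the first setosa and the first versicolor). As a cross-check I compare the manual result with `math.dist` from the standard library, which assumes Python 3.8 or newer.

```python
import math

setosa_row = [5.1, 3.5, 1.4, 0.2]
versicolor_row = [5.8, 2.7, 4.1, 1.0]

# Step 1: squared difference for each feature.
squared_diffs = [(a - b) ** 2 for a, b in zip(setosa_row, versicolor_row)]

# Step 2: add them up and take the square root.
manual_distance = sum(squared_diffs) ** 0.5

# Cross-check with the standard library implementation.
library_distance = math.dist(setosa_row, versicolor_row)

print(f"Manual distance:  {manual_distance:.4f}")
print(f"Library distance: {library_distance:.4f}")
```

Most of the distance comes from the petal length (a difference of 2.7 cm), so the feature with the largest spread dominates the result when nothing is rescaled.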


## Evaluating accuracy
I check how many test samples the classifier gets right and compute the proportion.

In [None]:
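def accuracy_score(model_data, test_set, k=3):
    # Proportion of test samples whose predicted label matches the true one.
    if not test_set:
        raise ValueError("test_set is empty, so accuracy is undefined")
    correct = 0
    for features, expected_label in test_set:
        predicted_label = knn_predict(model_data, features, k=k)
        if predicted_label == expected_label:
            correct += 1
    return correct / len(test_set)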

accuracy = accuracy_score(train_data, test_data)
print(f"Accuracy on the test set: {accuracy:.2f}")
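
An accuracy measured on three samples is very coarse. A common alternative for tiny data sets, which I add here as my own experiment rather than something from the notes, is leave-one-out evaluation: each sample is predicted once using all the other samples as training data. The sketch below is self-contained and uses a compact k-NN built on `math.dist` and `Counter`; `predict` and `leave_one_out_accuracy` are hypothetical helper names.

```python
import math
from collections import Counter

# The twelve samples from above as ([features], label) tuples.
samples = [
    ([5.1, 3.5, 1.4, 0.2], "setosa"),
    ([4.9, 3.0, 1.4, 0.2], "setosa"),
    ([5.0, 3.6, 1.4, 0.3], "setosa"),
    ([5.4, 3.9, 1.7, 0.4], "setosa"),
    ([5.8, 2.7, 4.1, 1.0], "versicolor"),
    ([6.0, 2.2, 4.0, 1.0], "versicolor"),
    ([6.4, 3.2, 4.5, 1.5], "versicolor"),
    ([6.6, 2.9, 4.6, 1.3], "versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "virginica"),
    ([5.8, 2.7, 5.1, 1.9], "virginica"),
    ([7.1, 3.0, 5.9, 2.1], "virginica"),
    ([6.5, 3.0, 5.2, 2.0], "virginica"),
]


def predict(train_set, new_sample, k=3):
    """Majority vote among the k training samples closest to new_sample."""
    nearest = sorted(train_set, key=lambda item: math.dist(item[0], new_sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(dataset, k=3):
    """Predict every sample from all the others and return the hit rate."""
    correct = 0
    for i, (features, expected) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]  # everything except sample i
        if predict(rest, features, k) == expected:
            correct += 1
    return correct / len(dataset)


loo_accuracy = leave_one_out_accuracy(samples)
print(f"Leave-one-out accuracy: {loo_accuracy:.2f}")
```

This uses every row for testing exactly once, so the score moves in steps of 1/12 instead of 1/3. It is still a very small sample, so I read the number as a rough hint only.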


## Trying a manual prediction
To see the algorithm in action I pick a flower with medium-length petals (4.7 cm, between the versicolor and virginica rows above) and check the predicted species.

In [None]:
mystery_flower = [6.1, 2.8, 4.7, 1.4]
prediction = knn_predict(train_data, mystery_flower)
print(f"Predicted species: {prediction}")
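
A single predicted label hides how close the vote was. The sketch below lists the three nearest neighbors of the same flower together with their distances. To keep it self-contained it searches all twelve rows instead of `train_data`, so the neighbors may differ from the ones used in the cell above; `nearest_neighbors` is a helper name I introduce here.

```python
import math

# The twelve samples from above as ([features], label) tuples.
samples = [
    ([5.1, 3.5, 1.4, 0.2], "setosa"),
    ([4.9, 3.0, 1.4, 0.2], "setosa"),
    ([5.0, 3.6, 1.4, 0.3], "setosa"),
    ([5.4, 3.9, 1.7, 0.4], "setosa"),
    ([5.8, 2.7, 4.1, 1.0], "versicolor"),
    ([6.0, 2.2, 4.0, 1.0], "versicolor"),
    ([6.4, 3.2, 4.5, 1.5], "versicolor"),
    ([6.6, 2.9, 4.6, 1.3], "versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "virginica"),
    ([5.8, 2.7, 5.1, 1.9], "virginica"),
    ([7.1, 3.0, 5.9, 2.1], "virginica"),
    ([6.5, 3.0, 5.2, 2.0], "virginica"),
]

mystery_flower = [6.1, 2.8, 4.7, 1.4]


def nearest_neighbors(train_set, new_sample, k=3):
    """Return the k closest samples as (distance, label), nearest first."""
    scored = [(math.dist(features, new_sample), label) for features, label in train_set]
    return sorted(scored, key=lambda item: item[0])[:k]


neighbors = nearest_neighbors(samples, mystery_flower)
for distance, label in neighbors:
    print(f"{label}: distance = {distance:.2f}")
```

Seeing a neighbor from a second species in the list is a reminder that this flower sits near the border between two classes.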


## Small conclusions
- Preparing the data by hand helped me understand every column.
- The evaluation step shows whether my simple intuition is on the right track, although with only three test samples the accuracy can only be 0.00, 0.33, 0.67, or 1.00.
- With more data I could tune *k* or try other algorithms from the course notes.