# B2 Python practice – novice walk-through

I follow the `B2_python.pdf` bulletin using the Titanic `train.csv`. I keep every step tiny and simple so a beginner can re-run it without guessing hidden tricks.

## 1. Load the Titanic `train.csv`

The professor insisted on using the `train.csv` sheet. I copy it into the `Machine Learning 1/b2` folder, and the notebook looks for that relative path first in the current working directory and then in each parent directory.
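
The upward search can be tried in isolation. This is only a minimal sketch: `find_upwards` is a hypothetical helper name, and the `course/b2` folder is a throwaway layout created in a temporary directory, not the real course folder.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_upwards(start: Path, relative: str) -> Optional[Path]:
    """Look for `relative` in `start`, then in each parent of `start`."""
    for directory in [start, *start.parents]:
        candidate = directory / relative
        if candidate.exists():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    # Throwaway layout: the CSV lives two levels above the "notebook" folder.
    target = root / "course" / "b2" / "train.csv"
    target.parent.mkdir(parents=True)
    target.write_text("PassengerId\n1\n", encoding="utf-8")
    deep = root / "notebooks" / "week2"
    deep.mkdir(parents=True)

    found = find_upwards(deep, "course/b2/train.csv")
    missing = find_upwards(deep, "course/b2/test.csv")
    found_ok = found == target

print(found_ok, missing)
```

The search returns the first match while walking upwards, so a copy of the file closer to the working directory wins over one further up.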

In [None]:
import csv
import math
import random
import statistics
from collections import Counter, defaultdict
from pathlib import Path
from typing import Iterable

random.seed(90)  # class seed

search_dirs = [Path.cwd()] + list(Path.cwd().parents)
csv_path = None
for directory in search_dirs:
    candidate = directory / "Machine Learning 1" / "b2" / "train.csv"
    if candidate.exists():
        csv_path = candidate
        break

if csv_path is None:
    raise FileNotFoundError("Could not find Machine Learning 1/b2/train.csv near this notebook.")

rows: list[dict[str, str]] = []
with csv_path.open(encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    for row in reader:
        rows.append(row)

(len(rows), list(rows[0].keys()))

I peek at a couple of rows to make sure the characters look fine and that the file really matches the Titanic training sheet.

In [None]:
for sample in rows[:5]:
    print(sample["PassengerId"], sample["Name"], sample["Sex"], sample["Survived"])

## 1.1 Quick dataset checks

A quick size and missing-value summary keeps me aligned with the theory section about data understanding.
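
To see why `not row[column]` counts missing cells, here is a small sketch on a made-up three-passenger sample (the values are invented, not copied from `train.csv`). `csv.DictReader` returns every value as a string, and an empty cell arrives as the empty string `""`.

```python
import csv
import io

# Made-up sample with one missing age and one missing port.
sample_text = (
    "PassengerId,Age,Embarked\n"
    "1,22,S\n"
    "2,,C\n"
    "3,35,\n"
)

sample_rows = list(csv.DictReader(io.StringIO(sample_text)))

# An empty string is falsy, so `not row[column]` is True exactly for blank cells.
sample_missing = {
    column: sum(1 for row in sample_rows if not row[column])
    for column in sample_rows[0].keys()
}
print(sample_missing)
```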

In [None]:
passenger_count = len(rows)
columns = list(rows[0].keys())
print(f"Passengers: {passenger_count}")
print(f"Columns: {columns}")

missing_by_column = {
    column: sum(1 for row in rows if not row[column]) for column in columns
}
missing_by_column

## 1.2 Survival balance

Before training any model I check how many people survived. In the `Survived` column, `0` means the passenger did not survive and `1` means the passenger survived.

In [None]:
label_counts = Counter(int(r["Survived"]) for r in rows)
survival_rate = label_counts[1] / passenger_count
print("Label counts:", label_counts)
print(f"Survival rate: {survival_rate:.3f}")

## 2. Prepare helper functions

The bulletin requests tiny implementations of ZeroR, OneR, and k-NN. I implement them here step by step so the code stays visible.

### 2.1 Basic parsing and defaults

Some values are missing, especially ages and the port of embarkation. I fill missing ages with the mean age, missing fares with the median fare, and missing ports with the most common port, so no row disappears.
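
The fill-in idea can be sketched on a handful of made-up ages. `to_float` is a hypothetical stand-in for the notebook's parser, and the values are invented. The point is that the mean is computed from the known values only and then written into the gaps.

```python
import math
import statistics


def to_float(text: str) -> float:
    """Return a float, or NaN when the text is empty or not a number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return math.nan


raw_ages = ["22", "", "38", "n/a", "30"]  # made-up values
parsed = [to_float(text) for text in raw_ages]

# Keep NaN out of the average, otherwise the mean itself would be unusable.
known = [age for age in parsed if not math.isnan(age)]
fill_value = statistics.mean(known)  # (22 + 38 + 30) / 3

filled = [fill_value if math.isnan(age) else age for age in parsed]
print(filled)
```

Filtering on the parsed value (rather than on the raw text being non-empty) also protects the average from text such as `"n/a"`.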

In [None]:
def parse_float(value: str) -> float:
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

age_values = [parse_float(row["Age"]) for row in rows if row["Age"]]
fare_values = [parse_float(row["Fare"]) for row in rows if row["Fare"]]
mean_age = statistics.mean(age_values)
median_fare = statistics.median(fare_values)
# Count only the ports that are filled in, so blank cells cannot bias the winner.
most_common_embarked = Counter(row["Embarked"] for row in rows if row["Embarked"]).most_common(1)[0][0]

mean_age, median_fare, most_common_embarked

### 2.2 Turn rows into numeric features

k-NN needs numbers, so I turn each passenger into a simple feature vector. I keep the raw values too so OneR can look at the original columns.
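
One side effect of coding the port as a single number is worth seeing once. The sketch below uses the same `S`/`C`/`Q` codes as this notebook and compares them with a one-hot coding. The one-hot coding is an alternative I add only for comparison; the notebook does not use it.

```python
import math

# Same codes as the notebook's port mapping.
ordinal = {"S": [0.0], "C": [1.0], "Q": [2.0]}
# Alternative for comparison only: one column per port.
one_hot = {"S": [1.0, 0.0, 0.0], "C": [0.0, 1.0, 0.0], "Q": [0.0, 0.0, 1.0]}


def distance(a: list[float], b: list[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


for name, coding in (("ordinal", ordinal), ("one-hot", one_hot)):
    s_to_c = distance(coding["S"], coding["C"])
    s_to_q = distance(coding["S"], coding["Q"])
    print(f"{name}: S-C = {s_to_c:.3f}, S-Q = {s_to_q:.3f}")
```

With the single-number coding, `Q` ends up twice as far from `S` as `C` is, although the ports have no natural order. With one column per port all pairs are equally far apart. For a small exercise the simple coding is acceptable, but it is good to know that k-NN sees this artificial ordering.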

In [None]:
def encode_embarked(value: str) -> float:
    mapping = {"S": 0.0, "C": 1.0, "Q": 2.0}
    clean = (value or most_common_embarked).strip().upper()
    return mapping.get(clean, mapping[most_common_embarked])


def preprocess_row(row: dict[str, str]) -> dict[str, object]:
    age = parse_float(row["Age"])
    if math.isnan(age):
        age = mean_age
    fare = parse_float(row["Fare"])
    if math.isnan(fare):
        fare = median_fare
    sibsp = int(row["SibSp"])
    parch = int(row["Parch"])
    pclass = int(row["Pclass"])
    sex_clean = row["Sex"].strip().lower()
    sex_flag = 1.0 if sex_clean == "female" else 0.0
    embarked_clean = (row["Embarked"] or most_common_embarked).strip().upper()
    embarked_flag = encode_embarked(embarked_clean)
    features_raw = [
        float(age),
        float(sibsp),
        float(parch),
        float(fare),
        sex_flag,
        float(pclass),
        embarked_flag,
    ]
    return {
        "label": int(row["Survived"]),
        "age": float(age),
        "sibsp": sibsp,
        "parch": parch,
        "fare": float(fare),
        "pclass": pclass,
        "sex": sex_clean,
        "embarked": embarked_clean,
        "sex_flag": sex_flag,
        "embarked_flag": embarked_flag,
        "features_raw": features_raw,
        "passenger_id": int(row["PassengerId"]),
        "name": row["Name"],
    }

processed_rows = [preprocess_row(row) for row in rows]
processed_rows[0]

### 2.3 Standardise numeric features

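I compute the mean and the population standard deviation of each feature on the training rows only and reuse them for the test rows, so every feature sits on a similar scale before running k-NN and no test information leaks into the scaler.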
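
A tiny worked example of the z-score, $z = (x - \mu) / \sigma$, on one made-up column. The numbers are invented and the variable names are mine. It only shows that the test values are scaled with the training mean and standard deviation.

```python
import statistics

train_values = [1.0, 2.0, 3.0]  # made-up training column
test_values = [2.0, 5.0]        # made-up test column

mu = statistics.mean(train_values)       # 2.0
sigma = statistics.pstdev(train_values)  # population std, sqrt(2/3)

train_scaled = [(v - mu) / sigma for v in train_values]
# The test column reuses the training mu and sigma; it never gets its own.
test_scaled = [(v - mu) / sigma for v in test_values]

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

After scaling, the training column has mean 0 and population standard deviation 1. The test column generally does not, and that is expected.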

In [None]:
def fit_standard_scaler(dataset: list[dict[str, object]]) -> tuple[list[float], list[float]]:
    columns = len(dataset[0]["features_raw"])
    means: list[float] = []
    stds: list[float] = []
    for col in range(columns):
        values = [row["features_raw"][col] for row in dataset]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)
        if std == 0:
            std = 1.0
        means.append(mean)
        stds.append(std)
    return means, stds


def scale_row(row: dict[str, object], means: list[float], stds: list[float]) -> dict[str, object]:
    scaled = []
    for value, mean, std in zip(row["features_raw"], means, stds):
        scaled.append((value - mean) / std if std else 0.0)
    new_row = dict(row)
    new_row["features"] = scaled
    return new_row


def scale_dataset(dataset: list[dict[str, object]], means: list[float], stds: list[float]) -> list[dict[str, object]]:
    return [scale_row(row, means, stds) for row in dataset]

### 2.4 Train/test split helper

The bulletin mentions a classic hold-out evaluation. I shuffle once with the shared seed (90), keep 70% of the rows for training and 30% for testing, and reuse these helpers in later sections.
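
The shuffle-and-cut step can be checked on plain indexes. In this sketch `split_indexes` is a hypothetical name, and ten rows stand in for the real dataset. Using a private `random.Random(seed)` means the split is repeatable and does not depend on what the global generator did before.

```python
import random


def split_indexes(n: int, test_ratio: float = 0.3, seed: int = 90) -> tuple[list[int], list[int]]:
    """Shuffle 0..n-1 with a private generator and cut it into train and test."""
    indexes = list(range(n))
    random.Random(seed).shuffle(indexes)
    split_point = int(n * (1 - test_ratio))
    return indexes[:split_point], indexes[split_point:]


train_a, test_a = split_indexes(10)
random.random()  # disturb the global generator on purpose
train_b, test_b = split_indexes(10)

print(len(train_a), len(test_a), train_a == train_b)
```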

In [None]:
def scale_with_training(
    train_rows: list[dict[str, object]],
    test_rows: list[dict[str, object]],
) -> tuple[list[dict[str, object]], list[dict[str, object]], tuple[list[float], list[float]]]:
    means, stds = fit_standard_scaler(train_rows)
    return scale_dataset(train_rows, means, stds), scale_dataset(test_rows, means, stds), (means, stds)


def prepare_split(
    dataset: list[dict[str, object]],
    test_ratio: float = 0.3,
    seed: int = 90,
) -> tuple[list[dict[str, object]], list[dict[str, object]], tuple[list[float], list[float]]]:
    indexes = list(range(len(dataset)))
    random.Random(seed).shuffle(indexes)
    split_point = int(len(indexes) * (1 - test_ratio))
    train_idx = indexes[:split_point]
    test_idx = indexes[split_point:]
    train_rows = [dataset[i] for i in train_idx]
    test_rows = [dataset[i] for i in test_idx]
    return scale_with_training(train_rows, test_rows)


train_data, test_data, scaler = prepare_split(processed_rows)
len(train_data), len(test_data)

### 2.5 Metrics and evaluation helpers

Accuracy and a confusion matrix (rows = predictions, columns = actual labels) are enough for this bulletin.
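
To make the orientation concrete, here is a sketch on six made-up labels. Each cell is keyed as `(predicted, actual)`, so a printed row is one predicted class and a column is one actual class. Accuracy is the diagonal divided by the total.

```python
from collections import Counter

actual = [0, 0, 0, 1, 1, 1]     # made-up true labels
predicted = [0, 0, 1, 1, 1, 0]  # made-up predictions

# Key order (predicted, actual) matches "rows = predictions, columns = actual".
cells = Counter(zip(predicted, actual))
labels = [0, 1]

toy_matrix = [[cells[(pred, act)] for act in labels] for pred in labels]
for pred, row in zip(labels, toy_matrix):
    print(pred, row)

correct = sum(cells[(label, label)] for label in labels)  # diagonal cells
toy_accuracy = correct / len(actual)
print(toy_accuracy)
```

Many libraries print the transposed layout (rows = actual), so it is worth stating the orientation whenever the table is shown.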

In [None]:
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


def confusion_matrix(y_true: list[int], y_pred: list[int]) -> dict[int, dict[int, int]]:
    labels = sorted(set(y_true) | set(y_pred))
    table: dict[int, dict[int, int]] = {pred: {actual: 0 for actual in labels} for pred in labels}
    for actual, predicted in zip(y_true, y_pred):
        table[predicted][actual] += 1
    return table


def print_confusion(table: dict[int, dict[int, int]]) -> None:
    labels = sorted(next(iter(table.values())).keys())
    corner = "pred \\ actual"  # rows are predictions, columns are actual labels
    width = len(corner) + 2    # one shared width so header and rows line up
    header = corner.ljust(width) + "".join(str(label).center(8) for label in labels)
    print(header)
    for pred_label in sorted(table.keys()):
        row = str(pred_label).center(width)
        for actual_label in labels:
            row += str(table[pred_label][actual_label]).center(8)
        print(row)


def evaluate_models(
    train_split: list[dict[str, object]],
    test_split: list[dict[str, object]],
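) -> dict[str, float]:
    # The train_* and predict_* helpers are defined in sections 3 to 5,
    # so call this function only after those cells have run.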
    y_true = [row["label"] for row in test_split]
    zero_r_model = train_zero_r(train_split)
    zero_r_predictions = predict_zero_r(zero_r_model, test_split)
    one_r_model = train_one_r(train_split)
    one_r_predictions = predict_one_r(one_r_model, test_split)
    knn_predictions = predict_knn(train_split, test_split, k=3)
    return {
        "ZeroR": accuracy(y_true, zero_r_predictions),
        "OneR": accuracy(y_true, one_r_predictions),
        "kNN (k=3)": accuracy(y_true, knn_predictions),
    }


## 3. ZeroR baseline

ZeroR always predicts the most common class in the training data. It gives me a baseline accuracy that any useful model should beat.
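
A short sketch on made-up labels shows what the baseline measures. The majority class is learned from the training labels, and the score is simply the share of that class among the test labels.

```python
from collections import Counter

toy_train_labels = [0, 0, 0, 1, 1]  # made-up: class 0 is the majority (3 of 5)
toy_test_labels = [0, 1, 1, 0]      # made-up: class 0 is only half of the test set

majority = Counter(toy_train_labels).most_common(1)[0][0]
toy_zero_r_predictions = [majority for _ in toy_test_labels]

hits = sum(1 for y, p in zip(toy_test_labels, toy_zero_r_predictions) if y == p)
toy_zero_r_acc = hits / len(toy_test_labels)
print(majority, toy_zero_r_acc)
```

The training share (0.6) and the test score (0.5) differ here, which is why the baseline has to be measured on the same test rows as the other models.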

In [None]:
def train_zero_r(dataset: list[dict[str, object]]) -> int:
    labels = Counter(row["label"] for row in dataset)
    return labels.most_common(1)[0][0]


def predict_zero_r(model: int, dataset: list[dict[str, object]]) -> list[int]:
    return [model for _ in dataset]


zero_r_model = train_zero_r(train_data)
zero_r_predictions = predict_zero_r(zero_r_model, test_data)
zero_r_acc = accuracy([row["label"] for row in test_data], zero_r_predictions)
zero_r_model, zero_r_acc

## 4. OneR rule

OneR searches for the single column that delivers the smallest training error. Numeric values are discretised into at most four roughly equal-frequency bins (the cut points are read off the sorted values at every quarter) to keep the explanations easy.
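
The binning helper in this section takes its cut points from the sorted values, which gives bins with roughly the same number of passengers rather than bins of the same width. The sketch below compares both ideas on made-up fares with one large outlier. `quantile_cuts` mirrors the sorted-values approach, while `equal_width_cuts` is a hypothetical helper added only for contrast.

```python
def quantile_cuts(values: list[float], bins: int = 4) -> list[float]:
    """Cut points read off the sorted values (roughly equal-frequency bins)."""
    ordered = sorted(values)
    step = max(1, len(ordered) // bins)
    return sorted({ordered[min(i * step, len(ordered) - 1)] for i in range(1, bins)})


def equal_width_cuts(values: list[float], bins: int = 4) -> list[float]:
    """Cut points spaced evenly between the minimum and the maximum."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + i * width for i in range(1, bins)]


toy_fares = [1, 2, 3, 4, 5, 6, 7, 100]  # made-up, one large outlier

q_cuts = quantile_cuts(toy_fares)
w_cuts = equal_width_cuts(toy_fares)

# How many values land in the first bin under each scheme?
first_bin_quantile = sum(1 for v in toy_fares if v <= q_cuts[0])
first_bin_width = sum(1 for v in toy_fares if v <= w_cuts[0])
print(q_cuts, first_bin_quantile)
print(w_cuts, first_bin_width)
```

With equal-width cuts the outlier pushes seven of the eight values into the first bin, while the sorted-values cuts keep the bins balanced. For a skewed column such as the fare, that is a reasonable argument for the frequency-based version.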

In [None]:
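def make_bins(values: list[float], bins: int = 4) -> list[float]:
    # Equal-frequency cut points: sort the values and read off every (n // bins)-th one.
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return []
    step = max(1, len(clean) // bins)
    cuts = []
    for i in range(1, bins):
        cuts.append(clean[min(i * step, len(clean) - 1)])
    # Repeated values can yield duplicate cuts, so fewer than bins - 1 may remain.
    return sorted(set(cuts))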


def assign_bin(value: float, cuts: list[float]) -> str:
    if not cuts:
        return "all"
    if math.isnan(value):
        return "missing"
    for cut in cuts:
        if value <= cut:
            return f"<= {cut:.2f}"
    return f"> {cuts[-1]:.2f}"


def train_one_r(dataset: list[dict[str, object]]) -> dict[str, object]:
    candidate_features = [
        "sex",
        "pclass",
        "embarked",
        "age",
        "sibsp",
        "parch",
        "fare",
    ]
    best_feature = None
    best_rules: dict[str, int] | dict[int, int] | None = None
    best_cuts: list[float] | None = None
    best_error = float("inf")
    default_label = train_zero_r(dataset)

    for feature in candidate_features:
        if feature in {"sex", "embarked"}:
            groups: dict[str, Counter[int]] = defaultdict(Counter)
            for row in dataset:
                groups[str(row[feature])][row["label"]] += 1
            rules = {value: counter.most_common(1)[0][0] for value, counter in groups.items()}
            errors = sum(1 for row in dataset if rules.get(str(row[feature]), default_label) != row["label"])
            if errors < best_error:
                best_feature = feature
                best_rules = rules
                best_cuts = None
                best_error = errors
        else:
            values = [float(row[feature]) for row in dataset]
            cuts = make_bins(values)
            groups: dict[str, Counter[int]] = defaultdict(Counter)
            for row in dataset:
                bucket = assign_bin(float(row[feature]), cuts)
                groups[bucket][row["label"]] += 1
            rules = {bucket: counter.most_common(1)[0][0] for bucket, counter in groups.items()}
            errors = sum(1 for row in dataset if rules.get(assign_bin(float(row[feature]), cuts), default_label) != row["label"])
            if errors < best_error:
                best_feature = feature
                best_rules = rules
                best_cuts = cuts
                best_error = errors

    return {
        "feature": best_feature,
        "rules": best_rules or {},
        "cuts": best_cuts,
        "default": default_label,
    }


def predict_one_r(model: dict[str, object], dataset: list[dict[str, object]]) -> list[int]:
    feature = model["feature"]
    rules = model["rules"]
    cuts = model["cuts"]
    default_label = model["default"]
    predictions: list[int] = []
    for row in dataset:
        if cuts is None:
            key = str(row[feature])
        else:
            key = assign_bin(float(row[feature]), cuts)
        predictions.append(int(rules.get(key, default_label)))
    return predictions


one_r_model = train_one_r(train_data)
one_r_predictions = predict_one_r(one_r_model, test_data)
one_r_acc = accuracy([row["label"] for row in test_data], one_r_predictions)
one_r_model, one_r_acc

## 5. Tiny k-NN (k = 3)

With scaled numeric features I can compute Euclidean distances by hand. Three neighbours keep the voting rule short.
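
Before running it on passengers, the distance-and-vote idea can be checked by hand on four made-up points in two dimensions. The distance is $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$; the points, labels and the name `dist` are invented for this sketch.

```python
import math
from collections import Counter

toy_train = [
    ([0.0, 0.0], 0),
    ([1.0, 0.0], 0),
    ([0.0, 2.0], 1),
    ([3.0, 3.0], 1),
]  # made-up (features, label) pairs
query = [0.5, 0.5]


def dist(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Sort all training points by distance to the query and keep the three closest.
ranked = sorted(toy_train, key=lambda item: dist(query, item[0]))
nearest_labels = [label for _, label in ranked[:3]]
toy_vote = Counter(nearest_labels).most_common(1)[0][0]
print(nearest_labels, toy_vote)
```

With two classes and an odd `k`, the vote can never end in a tie, which is one practical reason to pick `k = 3`.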

In [None]:
def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_knn(
    train_set: list[dict[str, object]],
    test_set: list[dict[str, object]],
    k: int = 3,
) -> list[int]:
    predictions: list[int] = []
    for test_item in test_set:
        distances: list[tuple[float, int]] = []
        for train_item in train_set:
            dist = euclidean_distance(test_item["features"], train_item["features"])
            distances.append((dist, train_item["label"]))
        distances.sort(key=lambda pair: pair[0])
        neighbours = distances[:k]
        vote = Counter(label for _, label in neighbours).most_common(1)[0][0]
        predictions.append(vote)
    return predictions


knn_predictions = predict_knn(train_data, test_data, k=3)
knn_acc = accuracy([row["label"] for row in test_data], knn_predictions)
knn_acc

## 6. Hold-out summary and confusion matrix

I compare the three algorithms on the same split and then print the confusion matrix of the best one (rows = predictions, columns = actual labels).

In [None]:
holdout_scores = {
    "ZeroR": zero_r_acc,
    "OneR": one_r_acc,
    "kNN (k=3)": knn_acc,
}
holdout_scores

In [None]:
best_model_name = max(holdout_scores, key=holdout_scores.get)
print("Best model:", best_model_name)
if best_model_name == "ZeroR":
    best_predictions = zero_r_predictions
elif best_model_name == "OneR":
    best_predictions = one_r_predictions
else:
    best_predictions = knn_predictions

matrix = confusion_matrix([row["label"] for row in test_data], best_predictions)
print_confusion(matrix)

## 7. Repeated hold-out (5 runs)

To reduce the influence of one lucky or unlucky shuffle, I repeat the hold-out evaluation five times with seeds 90 to 94, always building the scaler from the training portion.
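
A sketch of the two ingredients: one seed per run, and a summary over the runs. The accuracies below are placeholders invented only to show the summary step. They are not results from this notebook. Reporting the standard deviation next to the mean is my own addition, since it shows how much the runs disagree.

```python
import random
import statistics


def shuffled(n: int, seed: int) -> list[int]:
    """Return 0..n-1 shuffled with a private, seeded generator."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


seeds = [90 + offset for offset in range(5)]
print(seeds)

# The same seed always gives the same shuffle, so every run is reproducible.
same_again = shuffled(8, seeds[0]) == shuffled(8, seeds[0])

# Placeholder accuracies (invented), one per seed.
placeholder_scores = [0.80, 0.78, 0.82, 0.79, 0.81]
score_mean = statistics.mean(placeholder_scores)
score_spread = statistics.stdev(placeholder_scores)
print(f"{score_mean:.3f} +/- {score_spread:.3f}")
```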

In [None]:
def repeated_holdout(
    dataset: list[dict[str, object]],
    repeats: int = 5,
    test_ratio: float = 0.3,
    start_seed: int = 90,
) -> dict[str, list[float]]:
    scores = {"ZeroR": [], "OneR": [], "kNN (k=3)": []}
    for offset in range(repeats):
        seed = start_seed + offset
        train_split, test_split, _ = prepare_split(dataset, test_ratio=test_ratio, seed=seed)
        result = evaluate_models(train_split, test_split)
        for name, score in result.items():
            scores[name].append(score)
    return scores


repeated_scores = repeated_holdout(processed_rows, repeats=5)
repeated_scores

In [None]:
def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values)


repeated_summary = {name: mean(scores) for name, scores in repeated_scores.items()}
repeated_summary

## 8. Manual 3-fold cross-validation

Following the class slides, I rotate three folds by hand. Each fold rebuilds the scaler from the current training data before predicting the validation fold.
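
The fold bookkeeping is easy to verify on ten made-up positions. `fold_slices` is a hypothetical helper name. The first `n % k` folds get one extra item, and every position is held out exactly once.

```python
def fold_slices(n: int, k: int = 3) -> list[range]:
    """Split positions 0..n-1 into k consecutive folds of near-equal size."""
    sizes = [n // k] * k
    for i in range(n % k):
        sizes[i] += 1  # spread the remainder over the first folds
    slices, start = [], 0
    for size in sizes:
        slices.append(range(start, start + size))
        start += size
    return slices


toy_folds = fold_slices(10, 3)
for fold in toy_folds:
    held_out = set(fold)
    training = [i for i in range(10) if i not in held_out]
    print("validate on", list(fold), "train on", training)
```

In the real run the positions index into a shuffled list, so each fold is a random sample rather than a block of neighbouring passengers.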

In [None]:
def k_fold_validation(
    dataset: list[dict[str, object]],
    k: int = 3,
    seed: int = 90,
) -> dict[str, list[float]]:
    indexes = list(range(len(dataset)))
    random.Random(seed).shuffle(indexes)
    fold_sizes = [len(dataset) // k for _ in range(k)]
    for i in range(len(dataset) % k):
        fold_sizes[i] += 1
    folds: list[list[int]] = []
    start = 0
    for size in fold_sizes:
        folds.append(indexes[start:start + size])
        start += size

    scores = {"ZeroR": [], "OneR": [], "kNN (k=3)": []}
    for fold_index in range(k):
        test_idx = set(folds[fold_index])
        train_rows = [dataset[i] for i in indexes if i not in test_idx]
        test_rows = [dataset[i] for i in indexes if i in test_idx]
        train_split, test_split, _ = scale_with_training(train_rows, test_rows)
        result = evaluate_models(train_split, test_split)
        for name, score in result.items():
            scores[name].append(score)
    return scores


cv_scores = k_fold_validation(processed_rows, k=3)
cv_scores

In [None]:
cv_summary = {name: mean(scores) for name, scores in cv_scores.items()}
cv_summary

## 9. OneR rule on the full dataset

After validating the ideas I retrain OneR on every passenger to keep a final rule ready for the exam.
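
To read the final rule, it helps to see how one is built for a single categorical column. This sketch uses six made-up rows (not real passengers) and a hypothetical helper `one_rule`. Each value of the column predicts its most frequent label, and the error count is what OneR compares across columns.

```python
from collections import Counter, defaultdict

toy_rows = [
    {"sex": "female", "label": 1},
    {"sex": "female", "label": 1},
    {"sex": "female", "label": 0},
    {"sex": "male", "label": 0},
    {"sex": "male", "label": 0},
    {"sex": "male", "label": 1},
]  # made-up rows


def one_rule(rows: list[dict], feature: str) -> tuple[dict, int]:
    """Map each feature value to its majority label and count training errors."""
    groups: dict = defaultdict(Counter)
    for row in rows:
        groups[row[feature]][row["label"]] += 1
    rule = {value: counts.most_common(1)[0][0] for value, counts in groups.items()}
    errors = sum(1 for row in rows if rule[row[feature]] != row["label"])
    return rule, errors


toy_rule, toy_errors = one_rule(toy_rows, "sex")
print(toy_rule, toy_errors)
```

Applying the rule is then a dictionary lookup, with the ZeroR label as a fallback for values that never appeared in training.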

In [None]:
full_one_r_model = train_one_r(processed_rows)
full_one_r_model

## 10. Manual k-NN prediction example

The bulletin also asks for a manual k-NN check. I create a pretend passenger and show the three closest neighbours together with the final vote.

In [None]:
all_means, all_stds = fit_standard_scaler(processed_rows)
full_scaled = scale_dataset(processed_rows, all_means, all_stds)

manual_raw = {
    "PassengerId": "0",
    "Survived": "0",  # placeholder
    "Pclass": "2",
    "Name": "Manual Passenger",
    "Sex": "female",
    "Age": "29",
    "SibSp": "0",
    "Parch": "0",
    "Ticket": "",
    "Fare": "23.45",
    "Cabin": "",
    "Embarked": "S",
}
manual_processed = preprocess_row(manual_raw)
manual_scaled = scale_row(manual_processed, all_means, all_stds)

neighbour_distances = []
for row in full_scaled:
    dist = euclidean_distance(manual_scaled["features"], row["features"])
    neighbour_distances.append((dist, row))

neighbour_distances.sort(key=lambda pair: pair[0])
closest_three = neighbour_distances[:3]
for distance, neighbour in closest_three:
    print(
        f"Distance: {distance:.3f} | Survived: {neighbour['label']} | "
        f"Sex: {neighbour['sex']} | Age: {neighbour['age']:.1f} | Fare: {neighbour['fare']:.2f}"
    )

manual_prediction = predict_knn(full_scaled, [manual_scaled], k=3)[0]
manual_prediction

## 11. Final thoughts

* ZeroR provides the reference accuracy and reminds me that most passengers did not survive.
* OneR usually picks the `sex` column, matching the class notes about simple yet strong rules.
* k-NN benefits from scaling and tends to achieve the best accuracy on both hold-out and cross-validation.
* The manual example confirms how to run the neighbour vote without external libraries.

Everything stays within the standard library, and every shuffle starts from the shared random seed of 90 (the repeated hold-out steps through seeds 90 to 94).