# B2 Python practice – novice walk-through

I am following the bulletin that sits next to the `B2_python.pdf` file. The idea is to keep every step tiny and clear so that even someone who is opening Jupyter Notebook for the first time can copy it.


## 1. Load the Titanic `train.csv`

The teacher asked us to work with the `train.csv` file. I put it inside the **Machine Learning 1** folder so the notebook can open it with a short relative path.


In [None]:
import csv
import math
import random
import statistics
from collections import Counter, defaultdict

random.seed(90)  # the class rule

csv_path = 'Machine Learning 1/train.csv'
rows = []
with open(csv_path, newline='', encoding='utf-8') as handle:
    reader = csv.DictReader(handle)
    for row in reader:
        rows.append(row)

len(rows)
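
A quick aside on what `csv.DictReader` hands back: every cell arrives as a string, and an empty cell arrives as `''`. The sketch below shows this on a made-up two-row CSV kept in memory. The `sample_text` content and the `sample_rows` name are invented for illustration and do not come from `train.csv`.

```python
import csv
import io

# Made-up two-row CSV; the second passenger has no age recorded.
sample_text = "PassengerId,Age,Fare\n1,22,7.25\n2,,8.05\n"

sample_rows = list(csv.DictReader(io.StringIO(sample_text)))

# Every value is a string, so numbers must be converted by hand later.
print(sample_rows[0])
# An empty cell is the empty string, not None and not NaN.
print(repr(sample_rows[1]['Age']))
```

This is why the later cells convert values with `int(...)` and `float(...)` before doing any math.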


I like to peek at the first rows to make sure the file loaded correctly and the accented characters in the names display properly.


In [None]:
for sample in rows[:5]:
    print(sample['PassengerId'], sample['Name'], sample['Sex'], sample['Survived'])


The target column we need to learn is **Survived**. Zero means the passenger did not make it. One means the passenger survived.

I also counted the labels to see the class balance.


In [None]:
label_counts = Counter(int(r['Survived']) for r in rows)
label_counts
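
Raw counts are easier to compare as shares. Here is a minimal sketch on made-up labels (the `toy_` names and the ten labels are assumptions for the example, not the real Titanic counts).

```python
from collections import Counter

# Made-up labels: six passengers who died (0) and four who survived (1).
toy_labels = [0, 0, 0, 1, 0, 1, 1, 0, 0, 1]

toy_counts = Counter(toy_labels)
toy_total = sum(toy_counts.values())

# Share of each class; the largest share is the accuracy that
# "always predict the majority" would reach on these same labels.
toy_shares = {label: count / toy_total for label, count in toy_counts.items()}
print(toy_shares)
```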


## 2. Prepare very small helper functions

The bulletin asks for simple baselines (ZeroR, OneR) and a little k-NN. I am keeping all helpers inside the notebook so I can read them line by line.

### 2.1. Feature preparation

I am only going to use the most basic numeric columns: `Age`, `SibSp`, `Parch`, and `Fare`. Sex is important, so I convert it to `0` for male and `1` for female.

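Some ages are empty. I replace them with the mean age so I do not throw away data, and the code applies the same fallback to `Fare` in case a fare is ever missing. Both means are computed over the whole file, before the train/test split.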


In [None]:
def parse_float(value):
    return float(value) if value else math.nan

ages = [parse_float(r['Age']) for r in rows if r['Age']]
fares = [parse_float(r['Fare']) for r in rows if r['Fare']]
mean_age = statistics.mean(ages)
mean_fare = statistics.mean(fares)


def preprocess_row(row):
    age = parse_float(row['Age'])
    fare = parse_float(row['Fare'])
    if math.isnan(age):
        age = mean_age
    if math.isnan(fare):
        fare = mean_fare
    features = [
        age,
        float(row['SibSp']),
        float(row['Parch']),
        fare,
        1.0 if row['Sex'] == 'female' else 0.0,
    ]
    return {
        'features': features,
        'label': int(row['Survived']),
        'raw': row,
    }

processed = [preprocess_row(r) for r in rows]
processed[0]
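
To see the mean replacement in isolation, here is a small sketch on made-up ages (the `toy_` names and values are assumptions for the example). It also shows why the check uses `math.isnan`: a NaN never compares equal to anything, not even to itself.

```python
import math
import statistics

# Made-up ages; math.nan stands for an empty cell.
toy_ages = [20.0, math.nan, 40.0, math.nan, 30.0]

# The mean is taken over the known values only.
toy_known = [a for a in toy_ages if not math.isnan(a)]
toy_mean = statistics.mean(toy_known)

# `a == math.nan` would be False for every item, so isnan is required.
toy_filled = [toy_mean if math.isnan(a) else a for a in toy_ages]
print(toy_filled)
```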


### 2.2. Train / test split helper

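We follow the class rule and shuffle with seed 90 before slicing the data. The helper builds its own `random.Random(90)` generator, so the shuffle comes out the same on every call; by default 70% of the rows go to training and 30% to testing.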


In [None]:
def train_test_split(items, test_ratio=0.3):
    indexes = list(range(len(items)))
    random.Random(90).shuffle(indexes)
    split = int(len(items) * (1 - test_ratio))
    train_idx = indexes[:split]
    test_idx = indexes[split:]
    train = [items[i] for i in train_idx]
    test = [items[i] for i in test_idx]
    return train, test

train_data, test_data = train_test_split(processed)
len(train_data), len(test_data)
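
A small check of the fixed-seed idea, written as a sketch: `shuffled_indexes` is a hypothetical helper that only repeats the shuffle step of the split. A private `random.Random(seed)` object gives the same order every time, even if other code uses the global generator in between.

```python
import random

def shuffled_indexes(n, seed=90):
    """Return 0..n-1 in an order that depends only on the seed."""
    indexes = list(range(n))
    random.Random(seed).shuffle(indexes)  # private generator, not the global one
    return indexes

first = shuffled_indexes(10)
random.random()  # touch the global generator on purpose
second = shuffled_indexes(10)

print(first == second)
```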


### 2.3. Metrics and confusion matrix

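Accuracy is enough for now. I also coded the confusion matrix by hand so that the rows represent the predicted label and the columns the actual label (teacher reminder: the usual library output, for example scikit-learn's `confusion_matrix`, is flipped, with the actual labels on the rows).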


In [None]:
def accuracy(y_true, y_pred):
    correct = sum(int(a == b) for a, b in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    table = {pred: {actual: 0 for actual in labels} for pred in labels}
    for actual, predicted in zip(y_true, y_pred):
        table[predicted][actual] += 1
    return table
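
A worked toy example helps me read the layout. The five labels below are made up, and the `toy_` names are only for this sketch; the counting loop is the same idea as in the helper.

```python
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

toy_labels = sorted(set(toy_true) | set(toy_pred))
# Outer key = predicted label (row), inner key = actual label (column).
toy_table = {pred: {actual: 0 for actual in toy_labels} for pred in toy_labels}
for actual, predicted in zip(toy_true, toy_pred):
    toy_table[predicted][actual] += 1

toy_accuracy = sum(a == b for a, b in zip(toy_true, toy_pred)) / len(toy_true)

print(toy_table)
print(toy_accuracy)
```

Reading `toy_table[1][0]` gives the number of passengers predicted as 1 who were actually 0.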


## 3. ZeroR baseline

ZeroR just predicts the most frequent class.


In [None]:
def train_zero_r(dataset):
    counter = Counter(item['label'] for item in dataset)
    majority_label, _ = counter.most_common(1)[0]
    return majority_label


def predict_zero_r(model, dataset):
    return [model for _ in dataset]

zero_r_model = train_zero_r(train_data)
zero_r_predictions = predict_zero_r(zero_r_model, test_data)
zero_r_acc = accuracy([d['label'] for d in test_data], zero_r_predictions)
zero_r_acc
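
One thing worth keeping in mind: ZeroR learns the majority from the training part only, so its test accuracy is just the share of that label in the test part. The sketch below uses made-up labels (all `toy_` names are assumptions) where the two parts disagree on purpose.

```python
from collections import Counter

# Made-up labels: 0 is the majority in training but the minority in testing.
toy_train_labels = [0, 0, 0, 1, 1]
toy_test_labels = [1, 1, 0, 1]

toy_majority = Counter(toy_train_labels).most_common(1)[0][0]
toy_predictions = [toy_majority] * len(toy_test_labels)

hits = sum(p == t for p, t in zip(toy_predictions, toy_test_labels))
toy_zero_r_acc = hits / len(toy_test_labels)
print(toy_majority, toy_zero_r_acc)
```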


## 4. OneR rule

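I try each candidate column in turn and pick the one with the fewest training errors. For `Age` and `Fare` I create four buckets that hold roughly the same number of passengers (equal-frequency cuts, not equal-width) so the rule stays readable; the other columns are used as categories as they are.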


In [None]:
def make_bins(values, bins=4):
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return []
    step = len(clean) // bins
    if step == 0:
        return sorted(set(clean))
    cuts = []
    for i in range(1, bins):
        cuts.append(clean[min(i * step, len(clean) - 1)])
    return sorted(set(cuts))


# parse_float already maps empty cells to NaN, so no extra check is needed.
numeric_columns = {
    'Age': [parse_float(r['Age']) for r in rows],
    'Fare': [parse_float(r['Fare']) for r in rows],
}

bin_rules = {name: make_bins(vals) for name, vals in numeric_columns.items()}


def bucketize(value, cuts):
    if math.isnan(value):
        return 'missing'
    for threshold in cuts:
        if value <= threshold:
            return f"<= {threshold:.2f}"
    return '> last'


def train_one_r(dataset):
    candidates = ['Sex', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
    best_feature = None
    best_error = float('inf')
    best_rule = None

    for feature in candidates:
        table = defaultdict(Counter)
        for item in dataset:
            row = item['raw']
            label = item['label']
            if feature in ('Age', 'Fare'):
                value = bucketize(parse_float(row[feature]), bin_rules[feature])
            else:
                value = row[feature]
            table[value][label] += 1

        rules = {}
        error = 0
        for value, counts in table.items():
            chosen_label, chosen_count = counts.most_common(1)[0]
            rules[value] = chosen_label
            error += sum(counts.values()) - chosen_count

        if error < best_error:
            best_error = error
            best_feature = feature
            best_rule = dict(rules)

    return {'feature': best_feature, 'rules': best_rule}


def predict_one_r(model, dataset):
    feature = model['feature']
    rules = model['rules']
    predictions = []
    for item in dataset:
        row = item['raw']
        if feature in ('Age', 'Fare'):
            value = bucketize(parse_float(row[feature]), bin_rules[feature])
        else:
            value = row[feature]
        # Values never seen in training fall back to label 0 (did not survive).
        predictions.append(rules.get(value, 0))
    return predictions

one_r_model = train_one_r(train_data)
one_r_model


The rule tells me which single column worked best. Now I check the accuracy.


In [None]:
one_r_predictions = predict_one_r(one_r_model, test_data)
one_r_acc = accuracy([d['label'] for d in test_data], one_r_predictions)
one_r_acc
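
To make the bucket idea concrete, here is a simplified sketch of the cut-point step on made-up fares. `toy_cut_points` and `toy_bucket` are hypothetical stand-ins for this example; they skip the NaN filtering and the duplicate removal that the real helpers do.

```python
def toy_cut_points(values, bins=4):
    """Pick cut points by position in the sorted list, not by value range."""
    ordered = sorted(values)
    step = len(ordered) // bins
    return [ordered[i * step] for i in range(1, bins)]

def toy_bucket(value, cuts):
    """Name the first bucket whose upper threshold covers the value."""
    for threshold in cuts:
        if value <= threshold:
            return f"<= {threshold:.2f}"
    return '> last'

# Made-up fares with one very large value.
toy_fares = [7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 500.0]
toy_cuts = toy_cut_points(toy_fares)

print(toy_cuts)
print(toy_bucket(13.0, toy_cuts), toy_bucket(500.0, toy_cuts))
```

Because the cuts follow the sorted positions, the single fare of 500 does not stretch the buckets. With equal-width cuts over 7 to 500, seven of the eight fares would land in the first bucket.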


## 5. Tiny k-NN (k = 3)

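I only use the five features prepared in section 2.1: `Age`, `SibSp`, `Parch`, `Fare`, and the sex flag. They are not rescaled, so the columns with large values (mostly `Fare`) weigh the most in the distance.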


In [None]:
def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_knn(train_set, test_set, k=3):
    predictions = []
    for test_item in test_set:
        distances = []
        for train_item in train_set:
            dist = euclidean_distance(test_item['features'], train_item['features'])
            distances.append((dist, train_item['label']))
        distances.sort(key=lambda x: x[0])
        nearest = distances[:k]
        vote = Counter(label for _, label in nearest).most_common(1)[0][0]
        predictions.append(vote)
    return predictions

knn_predictions = predict_knn(train_data, test_data, k=3)
knn_acc = accuracy([d['label'] for d in test_data], knn_predictions)
knn_acc
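
The notebook feeds the raw values to the distance. As a side experiment, not part of the bulletin, the sketch below shows how min-max scaling changes which passengers look close. `min_max_scale`, `toy_distance`, and the three toy vectors (age, fare, sex flag) are assumptions made for this example. In a real run the minimum and maximum would be taken from the training rows only.

```python
import math

def min_max_scale(vectors):
    """Rescale every column to the 0..1 range (constant columns become 0)."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

def toy_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up passengers: [age, fare, sex flag].
toy = [[22.0, 7.25, 0.0], [38.0, 71.28, 1.0], [26.0, 7.92, 1.0]]
scaled = min_max_scale(toy)

# Distances from the third passenger to the first and to the second.
raw_pair = (toy_distance(toy[0], toy[2]), toy_distance(toy[1], toy[2]))
scaled_pair = (toy_distance(scaled[0], scaled[2]), toy_distance(scaled[1], scaled[2]))
print(raw_pair)
print(scaled_pair)
```

On the raw values the fare gap makes the second passenger look far away; after scaling, the two distances are of similar size because the sex flag now counts as much as the fare.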


## 6. Confusion matrix for the best model

k-NN was slightly better on my split, so I print its confusion matrix with the row = predicted, column = actual format the professor mentioned.


In [None]:
knn_confusion = confusion_matrix([d['label'] for d in test_data], knn_predictions)
knn_confusion
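
The nested dictionary is hard to read at a glance, so here is a sketch of a tiny text layout. `format_confusion` and `accuracy_from_table` are hypothetical helpers, and `toy_table` holds made-up counts in the same row = predicted, column = actual shape.

```python
toy_table = {0: {0: 1, 1: 1}, 1: {0: 1, 1: 2}}

def format_confusion(table):
    """Build a small text table: one row per predicted label."""
    labels = sorted(table)
    lines = ["predicted | " + " ".join(f"actual={a}" for a in labels)]
    for pred in labels:
        cells = " ".join(f"{table[pred][a]:>8}" for a in labels)
        lines.append(f"{pred:>9} | {cells}")
    return "\n".join(lines)

def accuracy_from_table(table):
    """Correct predictions sit on the diagonal (predicted == actual)."""
    total = sum(sum(row.values()) for row in table.values())
    correct = sum(table[label][label] for label in table)
    return correct / total

print(format_confusion(toy_table))
print(accuracy_from_table(toy_table))
```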


## 7. Simple 3-fold cross-validation

I still stay in pure Python. I manually rotate three folds so we get a tiny taste of validation without fancy libraries.


In [None]:
def k_fold_split(dataset, k=3):
    indexes = list(range(len(dataset)))
    random.Random(90).shuffle(indexes)
    fold_size = len(dataset) // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size
        folds.append(indexes[start:end])
    leftovers = indexes[k * fold_size:]
    for idx, extra in enumerate(leftovers):
        folds[idx % k].append(extra)
    return folds


def cross_validate(dataset, predict_func, k=3):
    folds = k_fold_split(dataset, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        train_set = [dataset[idx] for idx in train_idx]
        test_set = [dataset[idx] for idx in test_idx]
        predictions = predict_func(train_set, test_set)
        score = accuracy([item['label'] for item in test_set], predictions)
        scores.append(score)
    return scores

zero_r_cv = cross_validate(processed, lambda train, test: predict_zero_r(train_zero_r(train), test))
one_r_cv = cross_validate(processed, lambda train, test: predict_one_r(train_one_r(train), test))
knn_cv = cross_validate(processed, lambda train, test: predict_knn(train, test, k=3))

zero_r_cv, one_r_cv, knn_cv
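
To convince myself that the folds are a clean partition, I wrote a sketch that repeats the fold logic on index numbers only. `toy_folds` is a hypothetical name for this check; it follows the same steps as the helper above but takes a plain item count.

```python
import random

def toy_folds(n_items, k=3, seed=90):
    """Shuffle 0..n_items-1 and deal the indexes into k folds."""
    indexes = list(range(n_items))
    random.Random(seed).shuffle(indexes)
    fold_size = n_items // k
    folds = [indexes[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Leftover indexes are handed out one by one, starting with fold 0.
    for position, extra in enumerate(indexes[k * fold_size:]):
        folds[position % k].append(extra)
    return folds

folds_for_ten = toy_folds(10)
fold_sizes = [len(fold) for fold in folds_for_ten]
all_indexes = sorted(idx for fold in folds_for_ten for idx in fold)

print(fold_sizes)
print(all_indexes == list(range(10)))
```

With 10 items and 3 folds, the one leftover index goes to the first fold, and every index shows up exactly once, so each row is tested exactly one time.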


I also compute the mean accuracy of each list to summarize the results.


In [None]:
cv_summary = {
    'ZeroR': statistics.mean(zero_r_cv),
    'OneR': statistics.mean(one_r_cv),
    'kNN (k=3)': statistics.mean(knn_cv),
}
cv_summary
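
A mean alone hides how much the folds disagree. As an optional extra, this sketch adds the sample standard deviation with `statistics.stdev`. The scores and the model names are made up for the example and are not results from this notebook.

```python
import statistics

# Made-up fold scores: same mean, different spread.
toy_scores = {
    'model A': [0.75, 0.80, 0.85],
    'model B': [0.79, 0.80, 0.81],
}

toy_summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in toy_scores.items()
}

for name, (mean_score, spread) in toy_summary.items():
    print(f"{name}: {mean_score:.3f} +/- {spread:.3f}")
```

With only three folds the spread is a rough guide, but it still shows when two models with the same mean are not equally stable.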


## 8. Final thoughts

* ZeroR is the baseline. It helped me see that the dataset is imbalanced.
* OneR quickly showed that `Sex` is the strongest single feature.
* k-NN gave the best accuracy when I used four numeric columns plus the simple sex flag.

This matches what we talked about in class: start simple, keep the math readable, and stick to the fixed random seed so everyone gets the same answers.
