# Pen Digit Dataset

In this notebook, we download and save the [Pen Digit](https://archive.ics.uci.edu/dataset/81/pen+based+recognition+of+handwritten+digits) dataset. The anomaly detection algorithms are applied to this dataset in [pen_digit_anomalies.ipynb](pen_digit_anomalies.ipynb).

In [1]:
import os
import numpy as np
import pandas as pd

import Data

Download and extract the dataset if it does not already exist:

In [2]:
DATASET_URLS = [
    "https://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits-orig.tes.Z",
    "https://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/pendigits-orig.tra.Z",
]

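for source_url in DATASET_URLS:
    target_filename = f"{Data.DATA_DIR}/{source_url.split('/')[-1]}"
    # The extracted file has the same name without the trailing ".Z".
    if not os.path.exists(target_filename[:-2]):
        try:
            Data.download(source_url, target_filename)
            # IPython shell escape; a failing command does not raise in Python.
            !uncompress {target_filename}
        except BaseException:
            # Remove a partially downloaded archive before re-raising.
            if os.path.exists(target_filename):
                os.remove(target_filename)
            raise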
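
Because the `!uncompress` shell escape does not raise a Python exception when the command fails, the `try`/`except` above only guards the download itself. A minimal sketch of one alternative runs the command through `subprocess.run(..., check=True)` so that a failed extraction also triggers the cleanup. It assumes the `uncompress` utility is on the `PATH`. The `fetch_and_extract` helper and its `download` and `extract_cmd` parameters are hypothetical names; `download` stands in for `Data.download`, whose implementation is not shown in this notebook.

```python
import os
import subprocess


def fetch_and_extract(source_url, target_filename, download, extract_cmd=("uncompress",)):
    """Download a .Z archive and extract it, cleaning up on any failure.

    `download` is any callable taking (url, filename); it stands in for
    Data.download. `extract_cmd` is the extraction command as a tuple.
    Returns True if work was done, False if the extracted file already existed.
    """
    extracted_filename = target_filename[:-2]  # strip the trailing ".Z"
    if os.path.exists(extracted_filename):
        return False
    try:
        download(source_url, target_filename)
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run([*extract_cmd, target_filename], check=True)
    except BaseException:
        # Remove the (possibly partial) archive before re-raising.
        if os.path.exists(target_filename):
            os.remove(target_filename)
        raise
    return True
```

In the loop above, the call would look like `fetch_and_extract(source_url, target_filename, Data.download)`.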

Load the dataset and create a data frame:

In [3]:
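def load_pendigits_dataset(filename):
    """Parse a pendigits-orig file into per-digit point arrays and labels."""
    data = []
    data_labels = []
    current_digit = None
    digit_label = None

    with open(filename, "r") as f:
        for line in f:
            # Skip blank lines.
            if not line.strip():
                continue

            if line.startswith("."):
                # Only ".SEGMENT DIGIT" lines start a new digit; other
                # dot-prefixed lines are ignored.
                if "SEGMENT DIGIT" in line[1:]:
                    if current_digit is not None:
                        data.append(np.array(current_digit))
                        data_labels.append(digit_label)

                    current_digit = []
                    # The label is the quoted value on the segment line.
                    digit_label = int(line.split('"')[1])
            else:
                x, y = map(float, line.split())
                current_digit.append([x, y])

    # Flush the last digit, which has no following segment line.
    if current_digit is not None:
        data.append(np.array(current_digit))
        data_labels.append(digit_label)

    return data, data_labels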
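
To illustrate the file layout the parser expects, here is a condensed, standard-library-only variant of the same logic applied to a small made-up sample. The sample text and the `parse_pendigits_lines` name are illustrative assumptions, not part of the dataset or this notebook. The sample only mimics the features the code above relies on: dot-prefixed directive lines, a quoted label on the `.SEGMENT DIGIT` line, and whitespace-separated coordinate pairs.

```python
# Made-up sample in the layout the parser relies on (not real dataset content).
SAMPLE = """\
.SEGMENT DIGIT "8"
.PEN_DOWN
 267 333
 267 336
.PEN_UP

.SEGMENT DIGIT "2"
.PEN_DOWN
 249 234
.PEN_UP
"""


def parse_pendigits_lines(lines):
    """Group coordinate lines by the preceding .SEGMENT DIGIT directive."""
    streams, labels = [], []
    for line in lines:
        if not line.strip():
            continue  # blank line
        if line.startswith("."):
            if "SEGMENT DIGIT" in line:
                labels.append(int(line.split('"')[1]))
                streams.append([])  # start collecting points for a new digit
            continue  # other directives are ignored
        x, y = map(float, line.split())
        streams[-1].append([x, y])
    return streams, labels


streams, labels = parse_pendigits_lines(SAMPLE.splitlines())
print(labels)   # [8, 2]
print(streams)  # [[[267.0, 333.0], [267.0, 336.0]], [[249.0, 234.0]]]
```

Unlike `load_pendigits_dataset`, this variant returns plain nested lists rather than NumPy arrays, which keeps the example free of third-party imports.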

In [4]:
data = {
    "train": load_pendigits_dataset(f"{Data.DATA_DIR}/pendigits-orig.tra"),
    "test": load_pendigits_dataset(f"{Data.DATA_DIR}/pendigits-orig.tes"),
}

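dataframes = []
# Use distinct loop names so the `data` dict is not overwritten.
for subset, subset_data in data.items():
    subset_df = pd.DataFrame(subset_data).T
    subset_df.columns = ["Stream", "Digit"]
    subset_df["Subset"] = subset
    dataframes.append(subset_df)
df = pd.concat(dataframes)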

In [5]:
df

| | Stream | Digit | Subset |
|---|---|---|---|
| 0 | [[267.0, 333.0], [267.0, 336.0], [267.0, 339.0... | 8 | train |
| 1 | [[249.0, 234.0], [249.0, 235.0], [251.0, 238.0... | 2 | train |
| 2 | [[196.0, 228.0], [193.0, 222.0], [191.0, 218.0... | 1 | train |
| 3 | [[231.0, 309.0], [232.0, 314.0], [232.0, 318.0... | 4 | train |
| 4 | [[200.0, 273.0], [200.0, 273.0], [199.0, 271.0... | 1 | train |
| ... | ... | ... | ... |
| 3493 | [[274.0, 336.0], [276.0, 337.0], [276.0, 337.0... | 4 | test |
| 3494 | [[245.0, 320.0], [244.0, 324.0], [244.0, 327.0... | 2 | test |
| 3495 | [[299.0, 375.0], [300.0, 377.0], [305.0, 381.0... | 0 | test |
| 3496 | [[234.0, 296.0], [231.0, 291.0], [228.0, 280.0... | 0 | test |


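Obtain summary statistics for the dataset: the mean number of training samples per digit, the size of the test subset, and the test subset size minus the mean number of test samples per digit: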

In [6]:
mean_corpus_size = df[df["Subset"] == "train"]["Digit"].value_counts().mean()
testing_data_size = len(df[df["Subset"] == "test"])
mean_outlier_size = (
    testing_data_size - df[df["Subset"] == "test"]["Digit"].value_counts().mean()
)

print(f"Mean corpus size: {mean_corpus_size}")
print(f"Testing subset size: {testing_data_size}")
print(f"Mean testing outlier subset size: {mean_outlier_size}")

Mean corpus size: 749.4
Testing subset size: 3498
Mean testing outlier subset size: 3148.2
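
The last figure follows from simple arithmetic: it is the test subset size minus the mean per-digit test count. Reading the code, this appears to treat the samples of one digit as the normal class and all other test samples as outliers, averaged over digits. The sketch below reproduces that calculation with the standard library on a made-up label list, then checks the notebook's figures under the assumption that all ten digits occur in the test subset. The `mean_outlier_size` helper is a hypothetical name.

```python
import math
from collections import Counter


def mean_outlier_size(labels):
    """Total sample count minus the mean class size."""
    counts = Counter(labels)
    mean_class_size = sum(counts.values()) / len(counts)
    return len(labels) - mean_class_size


toy_labels = [0, 0, 0, 1, 1, 2]  # made-up labels with class sizes 3, 2, 1
print(mean_outlier_size(toy_labels))  # 6 - (6 / 3) = 4.0

# With the notebook's figures, assuming all 10 digits occur in the test subset:
test_size = 3498
assert math.isclose(test_size - test_size / 10, 3148.2)
```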


In [7]:
# pickle the data into train and test sets
df[df["Subset"] == "train"].to_pickle(f"{Data.DATA_DIR}/pen_digit_train.pkl")
df[df["Subset"] == "test"].to_pickle(f"{Data.DATA_DIR}/pen_digit_test.pkl")