The **One Rule** (OneR) algorithm is a simple classifier that predicts the class of a sample from the value of a single feature: for each value of that feature, it predicts the class seen most often with that value. Only one feature is used for this classification, chosen as the feature with the best performance (the fewest prediction errors). If two or more attributes produce the same score, OneR chooses among them at random.

The algorithm is as follows:

for each attribute `A`:

    for each value `v` of that attribute, create a rule:
        1. count how often each class appears
        2. find the most frequent class, `c`
        3. make the rule: if `A = v`, then predict class `c`
    sum the errors of these rules to get the error rate for `A`

find the attribute that produces the lowest error rate
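
To make the steps above concrete, here is a minimal sketch of them in plain Python on a tiny made-up dataset. The function name `one_r`, the toy attributes (outlook, windy) and the labels are all assumptions for illustration, not part of the Iris walkthrough. As a simplification, ties between attributes are resolved by keeping the first one found rather than choosing at random.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (best_attribute_index, rules, errors) for categorical rows."""
    best = None
    n_attributes = len(rows[0])
    for a in range(n_attributes):
        # Count how often each class appears for each value of attribute a
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        # One rule per value: predict the most frequent class
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: samples outside the majority class, summed over all values
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        # Simplification (assumption): on a tie, keep the first attribute found
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

# Toy data (made up): each row is (outlook, windy)
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("sunny", "no"), ("rainy", "yes")]
labels = ["play", "play", "stay", "stay", "play", "play"]

attribute, rules, errors = one_r(rows, labels)
print(attribute, rules, errors)
```

Here the first attribute wins with a single error: "sunny" always maps to "play", while "rainy" maps to "stay" and misclassifies one sample. The second attribute makes two errors.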

In [30]:
from sklearn.datasets import load_iris
import numpy as np

dataset = load_iris()
print(dataset.DESCR)

X = dataset.data
y = dataset.target
n_samples, n_features = X.shape

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [34]:
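# Compute the mean of each attribute (one mean per feature column)
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)

# Discretize: 1 if the value is at or above its attribute's mean, else 0
X_d = np.array(X >= attribute_means, dtype='int')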
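
The thresholding step relies on NumPy broadcasting the row of means across every sample. The small sketch below shows what that comparison produces. The array `X_toy` and its values are made up for illustration and are not taken from the Iris data.

```python
import numpy as np

# Made-up measurements: 4 samples, 2 features
X_toy = np.array([[5.0, 1.0],
                  [6.0, 3.0],
                  [7.0, 5.0],
                  [6.0, 3.0]])

# One mean per column; these values give 6.0 and 3.0
means = X_toy.mean(axis=0)

# The (2,) vector of means is broadcast against each row of the (4, 2) array
X_toy_d = (X_toy >= means).astype(int)

print(means)
print(X_toy_d)
```

Note that a value exactly equal to its column mean maps to 1, because the comparison is `>=`. Only the first row, which is below the mean in both columns, becomes `[0, 0]`.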

In [39]:
# Split into training and test set
from sklearn.model_selection import train_test_split

# Seed our random state so that we will get reproducible results
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)

print("There are {} training samples".format(y_train.shape[0]))
print("There are {} test samples".format(y_test.shape[0]))

There are 112 training samples
There are 38 test samples


In [20]:
from collections import defaultdict
from operator import itemgetter

def train_feature_value(X, y_true, feature_index, value):
    """Return the most frequent class, and its error count, for one feature value."""
    # Count how often each class occurs among samples with this feature value
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # Sort classes by count, most frequent first, and take the top one
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # Every sample with this value that belongs to another class is an error
    error = sum(class_count for class_value, class_count in class_counts.items()
                if class_value != most_frequent_class)
    return most_frequent_class, error
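
As a possible alternative, the same counting could be written with `collections.Counter`. The sketch below is only an illustration: the name `train_feature_value_counter` and the toy arrays are hypothetical. The explicit check for a feature value that no sample has is an added assumption about how that case should be handled.

```python
from collections import Counter

def train_feature_value_counter(X, y_true, feature_index, value):
    """Counter-based sketch: most frequent class and error count for one value."""
    # Count classes among the samples whose feature equals the given value
    class_counts = Counter(
        y for sample, y in zip(X, y_true) if sample[feature_index] == value
    )
    if not class_counts:
        # Assumption: treat an unseen feature value as an error
        raise ValueError("no sample has this feature value")
    most_frequent_class, hits = class_counts.most_common(1)[0]
    # Errors are all matching samples that are not in the majority class
    error = sum(class_counts.values()) - hits
    return most_frequent_class, error

# Made-up binary features and class labels
X_toy = [[0, 1], [0, 0], [1, 1], [0, 1], [1, 0]]
y_toy = [0, 0, 1, 2, 1]

print(train_feature_value_counter(X_toy, y_toy, 0, 0))
print(train_feature_value_counter(X_toy, y_toy, 0, 1))
```

For feature 0, the value 0 occurs with classes 0, 0 and 2, so the rule predicts class 0 with one error. The value 1 occurs only with class 1, so that rule makes no errors.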

