**One Rule** algorithm is a simple simple algorithm that simply predicts the class of a sample by finding the most frequent class for the feature values. We only use a single rule for this classification by choose the feature with the best performance (fewest prediction errors). If both attributes produces the same score, then OneR chooses the attributes at random.

The algorithm is as follow:

for each attribute `A`:

    for each value `v` of that attribute, create a rule:
        1. count how often each class appears
        2. find the most frequent class, `C`
        3. make a rule, if `A=v`, then `C=c`
calculate the error rate of this value
find the attribute that produces the lowest error rate

In [13]:
from sklearn.datasets import load_iris
import numpy as np

dataset = load_iris()
print(dataset.DESCR)

X = dataset.data
y = dataset.target
n_samples, n_features = X.shape

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [14]:
# Compute the mean for each attributes
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)

X_d = np.array(X >= attribute_means, dtype='int')

In [15]:
# Split into training and test set
from sklearn.model_selection import train_test_split

# Seed our random state so that we will get reproducible results
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)

print("There are {} training samples".format(y_train.shape))
print("There are {} test samples".format(y_test.shape))

There are (112,) training samples
There are (38,) test samples


In [33]:
from collections import defaultdict
from operator import itemgetter

def train_feature_value(X, y_true, feature, value):
    class_counts = defaultdict(int)
    
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    
    most_frequent_class = sorted_class_counts[0][0]
    
    incorrect_predictions = [class_count for class_value, class_count in class_counts.items() 
                             if class_value != most_frequent_class]
    
    error = sum(incorrect_predictions)
    
    return most_frequent_class, error

In [34]:
def train(X, y_true, feature):
    """Compute the predictors and error for a given feature using the OneR algorithm
    
    Parameters
    ----------
    X: array [n_samples, n_features]
        The two dimensional array that holds the dataset. Each row is a sample, each column
        is a feature.
        
    y_true: array[n_samples,]
        The one dimensional array that holds the class values. Corresponds to X, such that
        y_true[i] is the class value for sample X[i]
    
    feature: int
        An integer corresponding to the index of the variables we wish to test.
        0 <= variable < n_features
    
    Returns
    -------
    predictors: dictionary of tuples: (value, prediction)
        For each item in the array, if the variable has a given value, make the given prediction.
    
    error: float
        The ratio of training data that this rule incorrectly predicts.
    """

    values = set(X[:, feature])
    predictors = {}
    errors = []
    
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    
    total_error = sum(errors)
    return predictors, total_error

In [35]:
all_predictors = {}
errors = {}

for feature in range(X_train.shape[1]):
    predictors, total_error = train(X_train, y_train, feature_index)
    all_predictors[feature] = (predictors, total_error)
    errors[feature] = total_error

best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]

model = {'feature': best_feature,
         'predictor': all_predictors[best_feature][0]}
print(model)

{'feature': 0, 'predictor': {0: 0, 1: 2}}


In [36]:
def predict(X_test, model):
    """
    Predict the best category given an array of test features
    
    Arguments:
    - X_test (ndarray) : an array of test variables
    - model (object)   : an object that holds the best feature and all predictors 
    
    Returns:
    - y_predicted (ndarray) : an array of predicted output
    """

    feature = model['feature']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[feature])] for sample in X_test])
    return y_predicted

In [38]:
y_predicted = predict(X_test, model)
print(y_predicted)

[0 0 0 0 0 2 0 0 0 2 0 0 2 2 0 0 0 2 2 0 0 0 0 2 0 2 0 0 0 0 0 0 2 0 0 0 2
 0]


In [42]:
accuracy = np.mean(y_predicted == y_test) * 100
print('The test accuracy is {:.1f}%'.format(accuracy))

The test accuracy is 60.5%
