# OneR Classification Algorithm

- Applied to the UCI Breast Cancer dataset
- 67.1% accuracy on the held-out test set

data source: <a href='http://archive.ics.uci.edu/ml/datasets/Breast+Cancer'>archive.ics.uci.edu</a>

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from collections import defaultdict
from operator import itemgetter
import numpy as np
import pandas as pd

# Dataset Characteristics

In [58]:
with open('breast-cancer.names') as fhand:
    doc = fhand.read()
    print(doc)

Citation Request:
   This breast cancer domain was obtained from the University Medical Centre,
   Institute of Oncology, Ljubljana, Yugoslavia.  Thanks go to M. Zwitter and 
   M. Soklic for providing the data.  Please include this citation if you plan
   to use this database.

1. Title: Breast cancer data (Michalski has used this)

2. Sources: 
   -- Matjaz Zwitter & Milan Soklic (physicians)
      Institute of Oncology 
      University Medical Center
      Ljubljana, Yugoslavia
   -- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
   -- Date: 11 July 1988

3. Past Usage: (Several: here are some)
     -- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The 
        Multi-Purpose Incremental Learning System AQ15 and its Testing 
        Application to Three Medical Domains.  In Proceedings of the 
        Fifth National Conference on Artificial Intelligence, 1041-1045,
        Philadelphia, PA: Morgan Kaufmann.
        -- accuracy range: 66%-72%
     -

In [59]:
names = ['class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 
         'deg-malig', 'breast', 'breast-quad', 'irradiat']
dataset = pd.read_csv('breast-cancer.data', header=None, names=names)

# Data Cleaning

Nine observations contain missing values, recorded as `?`, so those rows are dropped.

In [60]:
# Flag every row that contains the '?' placeholder in any column, then drop it.
has_missing = dataset.isin(['?']).any(axis=1)
dataset = dataset[~has_missing]
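As an alternative sketch (not the notebook's method), pandas can treat `?` as missing while parsing via the `na_values` argument of `read_csv`, after which `dropna` removes the incomplete rows. The three-row sample below is illustrative only; with the real file, the same arguments would be passed to the `pd.read_csv('breast-cancer.data', ...)` call.

```python
import io
import pandas as pd

names = ['class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps',
         'deg-malig', 'breast', 'breast-quad', 'irradiat']

# Illustrative three-row sample in the same comma-separated layout.
sample_csv = io.StringIO(
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "recurrence-events,40-49,premeno,20-24,0-2,?,2,right,right_up,no\n"
    "no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,?,no\n"
)

# na_values='?' converts every '?' cell to NaN while parsing.
df = pd.read_csv(sample_csv, header=None, names=names, na_values='?')

# dropna() keeps only the rows without any NaN.
clean = df.dropna()
print(len(df), len(clean))
```

Only the first sample row survives, because each of the other two has a `?` in one column.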

# Data Discretization

In [61]:
X = dataset.drop(['class'], axis=1)
y = dataset['class']


##### Discretize all features in X #####

# Range labels such as '30-39' are reduced to their first character ('3').
# Labels that share a first character therefore receive the same code.
X['age'] = X['age'].apply(lambda n: n[0])

mp = {'lt40': 0, 'premeno': 1, 'ge40': 2}
X['menopause'] = X['menopause'].apply(lambda k: mp[k])

X['tumor-size'] = X['tumor-size'].apply(lambda n: n[0])

X['inv-nodes'] = X['inv-nodes'].apply(lambda n: n[0])

X['node-caps'] = X['node-caps'].apply(lambda n: 1 if n == 'yes' else 0)

# 'deg-malig' is already numeric, so it is left unchanged.

X['breast'] = X['breast'].apply(lambda n: 1 if n == 'right' else 0)

quad = {'right_up': 0, 'left_up': 1, 'left_low': 2, 'right_low': 3, 'central': 4}
X['breast-quad'] = X['breast-quad'].apply(lambda k: quad[k])

X['irradiat'] = X['irradiat'].apply(lambda n: 1 if n == 'yes' else 0)

########################################

######## Discretize classes in y #######
classes = {'recurrence-events': 1, 'no-recurrence-events': 0}
y = y.apply(lambda k: classes[k])

X.head()

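|   | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|-----|-----------|------------|-----------|-----------|-----------|--------|-------------|----------|
| 0 | 3 | 1 | 3 | 0 | 0 | 3 | 0 | 2 | 0 |
| 1 | 4 | 1 | 2 | 0 | 0 | 2 | 1 | 0 | 0 |
| 2 | 4 | 1 | 2 | 0 | 0 | 2 | 0 | 2 | 0 |
| 3 | 6 | 2 | 1 | 0 | 0 | 2 | 1 | 1 | 0 |
| 4 | 4 | 1 | 0 | 0 | 0 | 2 | 1 | 3 | 0 |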
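Taking only the first character of a range label means that labels such as `'5-9'` and `'50-54'` would both be encoded as `'5'`. One possible alternative, shown here as a sketch and not as the notebook's approach, is to encode each range by its integer lower bound. The helper name `lower_bound` is hypothetical, and switching encodings could change the per-feature error counts reported later.

```python
def lower_bound(label):
    """Return the integer lower bound of a range label such as '30-39'."""
    return int(label.split('-')[0])

# Illustrative labels in the dataset's 'low-high' format.
labels = ['0-4', '5-9', '10-14', '50-54']

# First-character encoding: '5-9' and '50-54' collide on '5'.
first_char = [label[0] for label in labels]

# Lower-bound encoding: every label keeps a distinct integer code.
bounds = [lower_bound(label) for label in labels]

print(first_char)
print(bounds)
```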


In [62]:
# Convert to NumPy arrays for OneR (X is 2-D, y is 1-D)
X, y = X.values, y.values

# OneR Algorithm

In [63]:
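def train_feature_value(X, y, feature_index, value):
    """Find the most frequent class among samples whose feature equals `value`.

    Returns that class and the number of samples it misclassifies.
    """
    # Count how often each class occurs for this feature value.
    class_counts = defaultdict(int)
    for sample, y_val in zip(X, y):
        if sample[feature_index] == value:
            class_counts[y_val] += 1
    # The rule for this value predicts the most frequent class.
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # Every sample belonging to another class counts as an error.
    incorrect_predictions = [
        class_count for class_value, class_count in class_counts.items()
        if class_value != most_frequent_class
    ]
    error = sum(incorrect_predictions)
    return most_frequent_class, error

def train_on_feature(X, y, feature_index):
    """Build a value -> class rule for one feature and total its errors."""
    values = set(X[:,feature_index])
    predictors = {}
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y, feature_index, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # The feature's error is the sum of the errors over all of its values.
    total_error = sum(errors)
    return predictors, total_error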
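To make the error counting concrete, here is a compact, self-contained sketch of the same idea on a made-up toy dataset. The function name `one_r` and the data are illustrative assumptions; it is a condensed restatement, not a replacement for the functions above.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (best_feature_index, rule, error) for categorical rows."""
    best = None
    for feature in range(len(rows[0])):
        # Count class labels separately for each value of this feature.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        # Each feature value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the samples outside the majority class of their value.
        error = sum(sum(c.values()) - c.most_common(1)[0][1]
                    for c in counts.values())
        # Keep the feature with the lowest total error (first one wins ties).
        if best is None or error < best[2]:
            best = (feature, rule, error)
    return best

# Toy data (made up): feature 0 separates the classes better than feature 1.
rows = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1, 1, 0]

feature, rule, error = one_r(rows, labels)
print(feature, rule, error)  # 0 {0: 0, 1: 1} 1
```

Feature 0 makes one mistake (the last sample), while feature 1 makes two, so feature 0 is selected.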

# Train and Test

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

In [65]:
all_predictors = {}
errors = {}

for feature_index in range(X_train.shape[1]):
    predictors, total_error = train_on_feature(X_train, y_train, feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error
    
best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]

print('The best model is based on feature {0} and has error {1:.2f}'.format(
        best_feature, best_error))

The best model is based on feature 4 and has error 52.00


In [72]:
model = {'feature': best_feature, 'predictor': all_predictors[best_feature]}
print(model)

{'feature': 4, 'predictor': {0: 0, 1: 1}}


In [67]:
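def predict(X_test, model):
    # Look up each sample's value of the chosen feature in the learned rule.
    # The value is used as-is (no int() cast) so that it matches the rule's
    # keys, which are strings for features such as 'age'.
    var = model['feature']
    predictor = model['predictor']
    predicted = np.array([
        predictor[sample[var]] for sample in X_test
    ])
    return predicted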
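The `predict` function above looks up every test value in the rule dictionary, so a value that never occurred in the training split would raise a `KeyError`. A minimal sketch of one way to guard against that follows; the function name `predict_with_default` and its `default_class` parameter are hypothetical additions, and the demo rows are made up.

```python
import numpy as np

def predict_with_default(X_test, model, default_class):
    """Apply a OneR rule, falling back to default_class for unseen values."""
    feature = model['feature']
    rule = model['predictor']
    return np.array([
        rule.get(sample[feature], default_class) for sample in X_test
    ])

# Rule printed earlier in the notebook: feature 4 with mapping {0: 0, 1: 1}.
model = {'feature': 4, 'predictor': {0: 0, 1: 1}}

# Made-up rows; the last one carries a value (7) that the rule has never seen.
X_demo = np.array([
    [3, 1, 3, 0, 0, 3, 0, 2, 0],
    [4, 1, 2, 0, 1, 2, 1, 0, 0],
    [4, 1, 2, 0, 7, 2, 0, 2, 0],
])

preds = predict_with_default(X_demo, model, default_class=0)
print(preds)  # [0 1 0]
```

A natural choice for `default_class` would be the most frequent class in the training labels, though that is a design decision the notebook does not make.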

In [68]:
predicted = predict(X_test, model)

In [69]:
accuracy = np.mean(predicted == y_test) * 100
print('The test accuracy is {:.1f}%'.format(accuracy))

The test accuracy is 67.1%


In [70]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.74      0.82      0.78        49
           1       0.44      0.33      0.38        21

   micro avg       0.67      0.67      0.67        70
   macro avg       0.59      0.57      0.58        70
weighted avg       0.65      0.67      0.66        70
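The figures in the report can be traced back to four confusion-matrix counts. The counts below are inferred from the rounded report values and the support column; they are an assumption, not notebook output. The sketch also computes the accuracy of always predicting class 0 as a reference point.

```python
# Counts inferred from the rounded report (assumption, not notebook output).
tn, fp = 40, 9    # true class 0: predicted 0, predicted 1
fn, tp = 14, 7    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Reference point: always predicting the majority class (0).
majority_baseline = (tn + fp) / total

print('accuracy={:.3f}, baseline={:.3f}'.format(accuracy, majority_baseline))
print('class 0: precision={:.2f}, recall={:.2f}'.format(precision_0, recall_0))
print('class 1: precision={:.2f}, recall={:.2f}'.format(precision_1, recall_1))
```

Under these inferred counts the accuracy is 47/70, which matches the reported 67.1%, while always predicting class 0 would score 49/70.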

