# EPOS Data Set composition

## Terms used for columns in the data

- Barcode: https://en.wikipedia.org/wiki/Barcode
- Description: product description
- UnitRRP: Product's recommended retail price (selling price)
- CategoryID: Surrogate key for Category https://en.wikipedia.org/wiki/Surrogate_key
- Category: Human-readable product categorisation
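For orientation, here is a minimal sketch of how a tab-separated file with these columns could be loaded and inspected. The two rows are invented for illustration and are not taken from the data set. The tab-delimited layout is an assumption based on the use of `pd.read_table` later in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows (not from the real data set), tab-separated.
sample = (
    "Barcode\tDescription\tUnitRRP\tCategoryID\tCategory\n"
    "5000000000017\tWatercolour paint set\t12.99\t528\tArt\n"
    "5000000000024\tTwo-person tent\t89.5\t531\tCamping /Walking /Climbing\n"
)

# pd.read_table defaults to a tab delimiter and already returns a DataFrame.
df_sample = pd.read_table(io.StringIO(sample))

# Description and Category hold text; the other columns are numeric.
print(df_sample.dtypes)
print(df_sample["CategoryID"].value_counts())
```

Checking `dtypes` early shows which columns can be passed to a classifier as they are and which need encoding or dropping.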

## Data files

### training_data.dat

Training data: 526 items across 6 categories.

### test_data.dat

Test data: 191 items across 6 categories.

In [None]:
# Imports
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

In [None]:
# Training data (pd.read_table already returns a DataFrame)
df_training = pd.read_table("../data/training_data.dat")
df_training.head()

In [None]:
# Test data (pd.read_table already returns a DataFrame)
df_test = pd.read_table("../data/test_data.dat")
df_test.head()

In [None]:
# Target category names and their CategoryID values
target_categories = ['Unclassified','Art','Aviation','Boating','Camping /Walking /Climbing','Collecting']
target_values = ['1','528','529','530','531','532']

In [None]:
# features
feature_names = ['Barcode','Description','UnitRRP']

In [None]:
# Extract features from the pandas DataFrame
training_data = df_training[feature_names].values
training_data[:3]

In [None]:
# Extract target values from the pandas DataFrame
target = df_training["CategoryID"].values

In [None]:
# Create classifier instance
model_dtc = DecisionTreeClassifier()

In [None]:
# train model
model_dtc.fit(training_data, target)

The fit fails here because the Description column contains strings, which the classifier cannot convert to numbers.
Let's try again without the description.
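The failure can be reproduced on a tiny scale. The sketch below uses invented rows (not the real data) to show the `ValueError` raised when a string column is passed to `DecisionTreeClassifier.fit`, and that the fit succeeds once that column is dropped.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows: Barcode, Description, UnitRRP (values invented).
X_mixed = np.array(
    [
        [5000000000017, "Watercolour paint set", 12.99],
        [5000000000024, "Two-person tent", 89.50],
    ],
    dtype=object,
)
y = np.array([528, 531])

try:
    DecisionTreeClassifier().fit(X_mixed, y)
    fit_failed = False
except ValueError as exc:
    # scikit-learn cannot convert the Description strings to floats.
    fit_failed = True
    print("fit failed:", exc)

# Dropping the string column (index 1) leaves purely numeric features.
X_numeric = X_mixed[:, [0, 2]].astype(float)
model = DecisionTreeClassifier(random_state=0).fit(X_numeric, y)
print(model.predict(X_numeric))
```

Dropping the column is the simplest fix and the one taken below. Encoding the text as numeric features would be an alternative, but this notebook does not do that.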

In [None]:
# features
feature_names_integers = ['Barcode','UnitRRP']

In [None]:
# Extract features from the pandas DataFrame (without description)
training_data_integers = df_training[feature_names_integers].values
training_data_integers[:3]

In [None]:
# train model again
model_dtc.fit(training_data_integers, target)

In [None]:
# Extract test data and test the model
test_data_integers = df_test[feature_names_integers].values
test_target = df_test["CategoryID"].values
expected = test_target
predicted_dtc = model_dtc.predict(test_data_integers)

In [None]:
print(metrics.classification_report(expected, predicted_dtc, target_names=target_categories))

In [None]:
print(metrics.confusion_matrix(expected, predicted_dtc))

In [None]:
metrics.accuracy_score(expected, predicted_dtc, normalize=True, sample_weight=None)

In [None]:
predicted_dtc[:5]
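The confusion matrix and the accuracy score are closely related: accuracy is the sum of the matrix diagonal (correct predictions) divided by the total number of samples. A small sketch with invented labels, not real model output, illustrates this.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels, invented for illustration (not real model output).
expected_demo = [1, 528, 528, 531, 531, 531]
predicted_demo = [1, 528, 531, 531, 531, 1]

# Rows are true classes, columns are predicted classes, in sorted label order.
cm = metrics.confusion_matrix(expected_demo, predicted_demo)

# Correct predictions sit on the diagonal.
accuracy_from_cm = np.trace(cm) / cm.sum()
accuracy = metrics.accuracy_score(expected_demo, predicted_demo)

print(cm)
print(accuracy_from_cm, accuracy)
```

Off-diagonal entries show which categories are being confused with each other, which a single accuracy figure hides.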

Let's try a different classifier.

Linear classifiers (SVM, logistic regression, among others) trained with stochastic gradient descent (SGD).
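SGD-based linear models are sensitive to feature scale, and Barcode values are many orders of magnitude larger than UnitRRP. The sketch below, on invented toy data, standardises the features before `SGDClassifier`. The pipeline is an optional addition and not part of the original notebook flow. With the default `loss="hinge"` the model is a linear SVM.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: [Barcode, UnitRRP] (values invented).
X_demo = np.array(
    [
        [5000000000017, 12.99],
        [5000000000024, 89.50],
        [5000000000031, 9.99],
        [5000000000048, 120.00],
    ],
    dtype=float,
)
y_demo = np.array([528, 531, 528, 531])

# Standardise each column to zero mean and unit variance,
# then fit a linear SVM (hinge loss) with SGD.
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", random_state=0),
)
scaled_sgd.fit(X_demo, y_demo)

predicted_demo_sgd = scaled_sgd.predict(X_demo)
print(predicted_demo_sgd)
```

Putting the scaler inside the pipeline means the same scaling learned from the training data is applied automatically when predicting on test data.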

In [None]:
from sklearn.linear_model import SGDClassifier

In [None]:
# Create classifier instance
model_sgd = SGDClassifier()

In [None]:
# Train the SGD model
model_sgd.fit(training_data_integers, target)

In [None]:
predicted_sgd = model_sgd.predict(test_data_integers)

In [None]:
print(metrics.classification_report(expected, predicted_sgd, target_names=target_categories))

In [None]:
print(metrics.confusion_matrix(expected, predicted_sgd))

In [None]:
metrics.accuracy_score(expected, predicted_sgd, normalize=True, sample_weight=None)