# Evaluation

In this notebook I evaluate the classification obtained from the ensemble in notebook 4 and draw final conclusions for the project. 

## Load predictions

Load the labels for the test data and the test set predictions computed in the previous notebook:

In [3]:
import numpy as np
import os
from sklearn.preprocessing import LabelEncoder

# load the labels for the test data
with np.load(os.path.join('production', 'features_point_clouds_test.npz'), allow_pickle=True) as data:
    data_dict = dict(data.items())
    metadata_test = data_dict['metadata']
# encode labels as ints
encoder = LabelEncoder().fit(metadata_test[:, 3])
labels_test = encoder.transform(metadata_test[:, 3])

# load predictions for test set computed in previous notebook
with np.load(os.path.join('production', 'predictions.npz'), allow_pickle=True) as data:
    data_dict = dict(data.items())
    pred_knn = data_dict['pred_knn']
    pred_logreg = data_dict['pred_logreg']


## Confusion matrices

Looking at the confusion matrices for the k-NN and logistic regression classifiers, both classified Bronze Age, Greek and Neolithic LBK artifacts mostly correctly. However, Roman artifacts were often confused with Iron Age artifacts, and Neolithic SBK with Neolithic LBK.

As discussed in the overview notebook, Roman industry is a form of Iron Age industry, so this particular confusion was to be expected. 

Neolithic artifacts are mostly characterized by their exterior decorations: lines for LBK, strokes for SBK, hence the abbreviations (notebook 0, fig. 2). As the 3D models for these two classes were provided without textures, this important information was lost, which could explain why SBK artifacts were so often classified as LBK.

### Confusion matrix for k-NN

In [4]:
from sklearn.metrics import confusion_matrix
import pandas as pd

# Compute confusion matrix
matrix = confusion_matrix(
    y_true=labels_test, # array with true labels
    y_pred=pred_knn # array with predicted labels
)

# Format as a DataFrame
class_names = encoder.inverse_transform([0, 1, 2, 3, 4, 5])
matrix_df = pd.DataFrame(data=matrix, columns=class_names, index=class_names)
matrix_df.columns.name = 'Predictions'
matrix_df.index.name = 'True class'
matrix_df

| True class \ Predictions | Bronze Age | Greek | Iron Age | Neolithic Linear Pottery Culture (LBK) | Neolithic Stroked Pottery culture (SBK) | Roman |
|---|---|---|---|---|---|---|
| Bronze Age | 33 | 0 | 0 | 1 | 0 | 0 |
| Greek | 0 | 7 | 1 | 0 | 0 | 0 |
| Iron Age | 2 | 0 | 10 | 0 | 0 | 0 |
| Neolithic Linear Pottery Culture (LBK) | 0 | 0 | 0 | 30 | 0 | 0 |
| Neolithic Stroked Pottery culture (SBK) | 1 | 0 | 0 | 3 | 4 | 0 |
| Roman | 0 | 1 | 3 | 0 | 0 | 3 |
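
To read the confusions as rates rather than counts, each row can be divided by its total. The following is a minimal sketch that does this for the k-NN counts shown above. The shortened class names and the variable names `matrix_knn` and `matrix_norm` are only illustrative and do not appear elsewhere in the notebooks.

```python
import numpy as np

# k-NN confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
class_names = ['Bronze Age', 'Greek', 'Iron Age',
               'Neolithic LBK', 'Neolithic SBK', 'Roman']
matrix_knn = np.array([
    [33, 0,  0,  1, 0, 0],
    [ 0, 7,  1,  0, 0, 0],
    [ 2, 0, 10,  0, 0, 0],
    [ 0, 0,  0, 30, 0, 0],
    [ 1, 0,  0,  3, 4, 0],
    [ 0, 1,  3,  0, 0, 3],
])

# Divide each row by its total so that every row sums to 1
row_totals = matrix_knn.sum(axis=1, keepdims=True)
matrix_norm = matrix_knn / row_totals

# Print one normalized row per true class
for name, row in zip(class_names, matrix_norm):
    print(f'{name:<15}', np.round(row, 2))
```

In the normalized matrix the diagonal holds the per-class recall. The Roman row shows that 3 of 7 Roman artifacts (about 43%) were predicted as Iron Age, and the SBK row shows that 3 of 8 SBK artifacts (about 38%) were predicted as LBK.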


### Confusion matrix for logistic regression

In [6]:
# Compute confusion matrix
matrix = confusion_matrix(
    y_true=labels_test, # array with true labels
    y_pred=pred_logreg # array with predicted labels
)

# Format as a DataFrame
class_names = encoder.inverse_transform([0, 1, 2, 3, 4, 5])
matrix_df = pd.DataFrame(data=matrix, columns=class_names, index=class_names)
matrix_df.columns.name = 'Predictions'
matrix_df.index.name = 'True class'
matrix_df

| True class \ Predictions | Bronze Age | Greek | Iron Age | Neolithic Linear Pottery Culture (LBK) | Neolithic Stroked Pottery culture (SBK) | Roman |
|---|---|---|---|---|---|---|
| Bronze Age | 33 | 0 | 0 | 1 | 0 | 0 |
| Greek | 0 | 7 | 1 | 0 | 0 | 0 |
| Iron Age | 2 | 0 | 9 | 0 | 0 | 1 |
| Neolithic Linear Pottery Culture (LBK) | 0 | 0 | 0 | 29 | 1 | 0 |
| Neolithic Stroked Pottery culture (SBK) | 1 | 0 | 0 | 4 | 3 | 0 |
| Roman | 0 | 1 | 3 | 0 | 0 | 3 |


## Classification reports
In the previous notebook, I used only accuracy and loss to determine the best classifier. However, scikit-learn offers alternative metrics that can give a more realistic evaluation of a classifier's performance.

(From the [scikit-learn documentation](https://scikit-learn.org/0.20/modules/model_evaluation.html#model-evaluation) and [Datacourses](https://www.datacourses.com/classification-model-evaluation-metrics-in-scikit-learn-924/))

**Accuracy**: The ratio of correct predictions to all predictions, as used in the previous notebook. It is not a particularly good measure when the data set is imbalanced, because the large classes dominate the score.

**Recall**: The ratio of correct positive predictions to all actual positives. Especially interesting when a false negative is costlier than a false positive, which is not the case for a simple image classifier like this one.

**Precision**: The ratio of correct positive predictions to all positive predictions. Especially interesting when a false positive is costlier than a false negative, which is the opposite use case to recall.

**F-score**: The harmonic mean (the appropriate average for rates) of recall and precision. Useful when false negatives and false positives have an equivalent and low impact, as is the case for a simple image classifier. It is therefore a good measure for the present use case.
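
As a worked example of these definitions, the sketch below computes the measures by hand for the Roman class, using counts read off the k-NN confusion matrix above. The variable names are only illustrative.

```python
# Counts for the Roman class, read off the k-NN confusion matrix
tp = 3  # Roman artifacts predicted as Roman
fn = 4  # Roman artifacts predicted as another class (1 Greek, 3 Iron Age)
fp = 0  # artifacts of other classes predicted as Roman

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean

# Accuracy over all classes: correct predictions (diagonal) / all predictions
correct = 33 + 7 + 10 + 30 + 4 + 3
total = 99
accuracy = correct / total

print(f'precision={precision:.2f} recall={recall:.2f} f-score={f_score:.2f}')
print(f'accuracy={accuracy:.2f}')
```

The overall accuracy of 0.88 hides the fact that fewer than half of the Roman artifacts were found (recall 0.43), which illustrates why accuracy alone is misleading for an imbalanced data set.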



### Report for k-NN

Based on the F-score, the k-NN classifier performed worst for the Neolithic SBK (class 4) and Roman (class 5) artifacts. These two classes have the highest precision (1.00) but also the lowest recall (0.50 and 0.43): every artifact assigned to them was correct, but many of their artifacts were assigned to other classes.

Based on the F-score, the best predictions were achieved for the Bronze Age (class 0) and Neolithic LBK (class 3), which are also the classes with the most artifacts.

In [5]:
from sklearn.metrics import classification_report
print(classification_report(labels_test, pred_knn))

              precision    recall  f1-score   support

           0       0.92      0.97      0.94        34
           1       0.88      0.88      0.88         8
           2       0.71      0.83      0.77        12
           3       0.88      1.00      0.94        30
           4       1.00      0.50      0.67         8
           5       1.00      0.43      0.60         7

   micro avg       0.88      0.88      0.88        99
   macro avg       0.90      0.77      0.80        99
weighted avg       0.89      0.88      0.87        99



### Report for logistic regression
As with the k-NN classifier, the logistic regression classifier performed worst for Neolithic SBK (class 4) and Roman (class 5) artifacts, with even lower F-scores (0.50 and 0.55, compared with 0.67 and 0.60).

Again, the best predictions were achieved for the Bronze Age (class 0) and Neolithic LBK (class 3), although the F-score for Neolithic LBK is lower (0.91 compared with 0.94).

In [7]:
print(classification_report(labels_test, pred_logreg))

              precision    recall  f1-score   support

           0       0.92      0.97      0.94        34
           1       0.88      0.88      0.88         8
           2       0.69      0.75      0.72        12
           3       0.85      0.97      0.91        30
           4       0.75      0.38      0.50         8
           5       0.75      0.43      0.55         7

   micro avg       0.85      0.85      0.85        99
   macro avg       0.81      0.73      0.75        99
weighted avg       0.84      0.85      0.84        99



# Conclusions

**Downloading data**: Suitable data sets for researching this problem are rather scarce. Sketchfab proved to be a useful resource. Downloading the data set took much longer than expected, due to bandwidth limitations of the API. This particular data set comprises 21 GB of data.

**Data processing**: The data could be downloaded and processed using Python libraries, specifically the Graphics Language Transmission Format (glTF) library pygltf, the imaging library PIL and the scipy.spatial library for spatial transformations. I handled the class imbalance by stratifying the data set splits with scikit-learn.

**Extracting features**: High-level features could be extracted by applying transfer learning from the Inception V2 model. This model proved capable of extracting useful features from this particular data set, although it was trained on completely different images. As entire classes had missing texture images, I handled missing data using the imputer object provided by scikit-learn.

**Classification**: If the F-score can be considered the most reliable metric for this particular use case, then the final scores are:

* __0.87__ weighted average F-score for the k-NN classifier ensemble
* 0.84 weighted average F-score for the logistic regression classifier ensemble
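
These two scores can be cross-checked directly from the confusion matrices in the evaluation section. The sketch below is a rough check under the assumption that the counts are copied correctly. The helper names `weighted_f_score` and `error_rate` are hypothetical and not part of scikit-learn.

```python
import numpy as np

def weighted_f_score(matrix):
    """Weighted average F-score from a confusion matrix
    (rows: true class, columns: predicted class)."""
    tp = np.diag(matrix)
    support = matrix.sum(axis=1)    # artifacts per true class (TP + FN)
    predicted = matrix.sum(axis=0)  # artifacts per predicted class (TP + FP)
    # 2TP / (2TP + FP + FN), i.e. the harmonic mean of precision and recall
    f_scores = 2 * tp / (support + predicted)
    return np.average(f_scores, weights=support)

def error_rate(matrix):
    """Share of artifacts put in the wrong class."""
    return 1 - np.trace(matrix) / matrix.sum()

matrix_knn = np.array([
    [33, 0, 0, 1, 0, 0], [0, 7, 1, 0, 0, 0], [2, 0, 10, 0, 0, 0],
    [0, 0, 0, 30, 0, 0], [1, 0, 0, 3, 4, 0], [0, 1, 3, 0, 0, 3],
])
matrix_logreg = np.array([
    [33, 0, 0, 1, 0, 0], [0, 7, 1, 0, 0, 0], [2, 0, 9, 0, 0, 1],
    [0, 0, 0, 29, 1, 0], [1, 0, 0, 4, 3, 0], [0, 1, 3, 0, 0, 3],
])

for name, matrix in [('k-NN', matrix_knn), ('logistic regression', matrix_logreg)]:
    print(f'{name}: F-score {weighted_f_score(matrix):.2f}, '
          f'misclassified {error_rate(matrix):.1%}')
```

This reproduces the weighted F-scores of 0.87 and 0.84 and shows that the k-NN ensemble misclassified 12 of 99 test artifacts, against 15 of 99 for logistic regression.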

The original research question was whether ML can help in the process of mass-digitization by automatically tagging scanned artifacts. This is possible; however, even the better setup (the k-NN ensemble) put 12 of the 99 test artifacts (about 12%) in the wrong class. This error rate would need to be reduced for a production environment. Some suggestions:
* Around 1000 3D models do not seem to be enough for supervised learning.
* While point clouds with tens of thousands of points have enough resolution, missing textures certainly are a problem. 3D scans should therefore include a texture if the scanning process allows for it.
* Newer versions of scikit-learn probably offer better imputer objects for missing data.
* More powerful hardware (or running Keras on the GPU) would make it possible to run cross-validation on top of grid search, which might optimize the parameters more accurately.
* Training the whole setup in one go instead of each classifier in parallel could potentially improve the results, but would almost certainly also require more powerful hardware.
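
As an illustration of the cross-validated grid search suggested in the list above, here is a minimal sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification`, not on the project's features, and the parameter grid is hypothetical rather than the one used in the previous notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the extracted features
# (assumption: this is NOT the project data)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)

# Hypothetical parameter grid for a k-NN classifier
param_grid = {'n_neighbors': [3, 5, 9], 'weights': ['uniform', 'distance']}

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Every parameter combination is scored on all five folds
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    scoring='f1_weighted',  # same metric as the final scores above
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 2))
```

With six parameter combinations and five folds, this fits 30 models plus one final refit, which is why this approach becomes expensive once feature extraction or neural networks are part of the pipeline.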

Sketchfab also has several other data sets (animal skulls, prehistoric stone tools) which could be used to deepen the understanding of this particular research question.