# Swiss SMS Classifier Optimisation

Optimising real-world AI applications is a difficult process that requires a structured approach.

For ready-made, out-of-the-box machine learning tasks such as MNIST or the Iris classification task, optimisation usually focuses on algorithm choice and hyperparameter tuning.

For real-world use cases where one starts from scratch, optimisation also has to consider how the training collection is created and managed, and how the system is evaluated (especially with regard to validity).

As described in the slides "Gears of Machine Learning", this means looking at the following three areas for improvement:

* Training Data Curation
* ML Algorithmic Optimisations
* Evaluation (Validity)

The last point in particular is often neglected, yet it is crucial for the success of AI/ML solutions in production.

# `Exercise 1: Algorithmic Optimisation`

Algorithmic optimisation is often the first thing people turn to when trying to increase observed performance.

As a first exercise, try to increase the observed performance of the SMS classifier for `content_type`.

1. Take a structured approach: define beforehand which scenarios you want to test (e.g. which algorithms you want to compare).
2. Consider the certainty of your observations. How sure are we that the observed ranking of our classifiers is valid? If we choose the classifier for the production system based on these scores, how high is our confidence in that choice?
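
One possible way to approach the second point is a paired bootstrap over the test set. The sketch below is not part of the exercise material. It assumes you have, for two classifiers, a 0/1 flag per test sample saying whether the prediction was correct; the function name `paired_bootstrap`, the number of resamples and the toy flags are all hypothetical.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Resample the test set with replacement and count how often A beats B.

    correct_a / correct_b: lists of 0/1 flags, one per test sample, same order.
    Returns (observed accuracy difference, fraction of resamples where A wins).
    """
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    wins = 0
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in indices)
        if diff > 0:
            wins += 1
    return observed, wins / n_resamples

# Hypothetical per-sample results for two classifiers on the same 20 test samples.
correct_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
correct_b = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]

observed, win_rate = paired_bootstrap(correct_a, correct_b)
print(f"observed accuracy difference: {observed:.2f}")
print(f"A beats B in {win_rate:.0%} of resamples")
```

The flags can be derived from real predictions with something like `[int(p == y) for p, y in zip(predicted, gold)]`. If A wins in only a modest share of the resamples, the observed ranking is weak evidence for preferring A in production.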

Report your observations to the class on the following sheet:

https://docs.google.com/spreadsheets/d/1zWnt6mwQY9KoWFXZpSQ6rsdOn56YL7-PYVdvu-_IwDo/edit?usp=sharing



# `Exercise 2: Training Set Curation`

Revisiting the training set is a second route to optimisation. In the real world this happens frequently over the lifetime of a machine learning project ("machine trainer" is a job title that has gained traction of late).

Some practical considerations when revisiting the training material:
* Estimate the necessity for multiple assessors: Not having 2+ assessors go over the same samples speeds up the creation of labelled samples. This optimisation is often applied only partially at first, e.g. by having a selected subset of the samples annotated by multiple people and the majority by a single person.
* Management of iteratively created sets: Each new addition to the training set is usually created under different circumstances. It is therefore good practice to also test performance on the individual subsets. This helps to identify mistakes in the curation and also provides valuable input for the validity optimisation.
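
Testing on the individual subsets can be sketched as follows. This is a minimal illustration, assuming each evaluated sample is tagged with the name of the batch it came from; the batch names, the label values and the helper `accuracy_per_subset` are hypothetical.

```python
from collections import defaultdict

def accuracy_per_subset(records):
    """records: iterable of (subset_name, gold_label, predicted_label) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subset, gold, predicted in records:
        totals[subset] += 1
        hits[subset] += int(gold == predicted)
    return {name: hits[name] / totals[name] for name in totals}

# Hypothetical evaluation records from two labelling rounds.
records = [
    ("batch_1", "promo", "promo"),
    ("batch_1", "billing", "billing"),
    ("batch_1", "promo", "billing"),
    ("batch_1", "delivery", "delivery"),
    ("batch_2", "promo", "delivery"),
    ("batch_2", "billing", "billing"),
]

for name, accuracy in accuracy_per_subset(records).items():
    print(f"{name}: {accuracy:.2f}")
```

A subset that scores much lower than the others is a candidate for a closer look: it may contain labelling mistakes, or it may have been collected under different conditions.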


# `Exercise 3: Evaluation & Validity`

Evaluation & Validity is often the hardest and most crucial part of AI application development. 

Validity itself is a very hard topic to approach: how do we know that a measurement instrument is accurate and measures what it is supposed to measure? Think about a thermometer: how do you know that it is accurate?

There are a couple of main approaches towards this:

* Calibration approach: Measuring with multiple sets
* Fine-grained analysis: Evaluating the individual label scores. How close are the scores between some labels? How distinctive is the scoring? This can also go to the level of analysing the learned weights (a.k.a. parameters) of a model. What is the model really classifying?
* Qualitative Assessments: Conduct further tests with synthetic input to assess the behaviour of the classifier.

Some technical approaches allow us to 

## `Exercise 3a: Calibration`

Make use of the additional sets created under Exercise 2 as calibration tools. 

One way to do this is to rank multiple classifiers on each of these sets.
The intuition is that a high overlap between these rankings indicates that the sets are measuring the same thing.
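
One way to quantify the overlap between two rankings is Spearman's rank correlation. The sketch below is an illustration under assumptions: the classifier names and scores are invented, the helper names are hypothetical, and the simple formula used here assumes that no two classifiers have exactly the same score on a set.

```python
def ranks(scores):
    """Map each classifier name to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

def spearman(scores_x, scores_y):
    """Spearman rank correlation between two score tables (no ties assumed)."""
    rank_x, rank_y = ranks(scores_x), ranks(scores_y)
    n = len(rank_x)
    squared_diffs = sum((rank_x[name] - rank_y[name]) ** 2 for name in rank_x)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))

# Hypothetical accuracy of four classifiers on two evaluation sets.
set_a = {"naive_bayes": 0.81, "logreg": 0.86, "svm": 0.84, "tree": 0.70}
set_b = {"naive_bayes": 0.78, "logreg": 0.80, "svm": 0.83, "tree": 0.65}

rho = spearman(set_a, set_b)
print(f"rank correlation between the two sets: {rho:.2f}")
```

A value close to 1 means both sets order the classifiers almost identically; values near 0 or below suggest the sets are measuring different things and deserve a closer look.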

## `Exercise 3b: Drill Down Analysis`

Drill-down analysis means looking at the score distribution:
* How close are the scores?
* Do they make sense in terms of the task?

Use the tooling that scikit-learn classifiers provide to look at the scores per sample.

Drill-down analysis can also mean inspecting the learned parameters. scikit-learn classifiers have varying support for this.


## `Exercise 3c: Qualitative Assessment & Semi-Supervision`

Lastly, nothing beats taking a close manual look at unseen examples.
This can often be combined with the drill-down analysis to automate part of this process. 

We can base our assessment on samples where the classifier had very high confidence, in order to check whether we are really training for the correct classes.

* Make use of the per-sample scores to create subsets of high-confidence samples and subsets of low-confidence samples, and use those for the manual analysis. This approach is one example of semi-supervision.
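
The split into high-confidence and low-confidence subsets can be sketched as below. This is a minimal illustration: it assumes the per-sample scores are already available as a mapping from label to score, and the thresholds 0.9 and 0.6, the helper name and the sample data are hypothetical choices to be tuned on the real score distribution.

```python
def split_by_confidence(scored_samples, high=0.9, low=0.6):
    """scored_samples: list of (text, {label: score}) pairs.

    Returns two lists of (text, predicted_label, top_score): samples whose top
    score is at least `high`, and samples whose top score is at most `low`.
    """
    confident, uncertain = [], []
    for text, scores in scored_samples:
        label = max(scores, key=scores.get)
        top_score = scores[label]
        if top_score >= high:
            confident.append((text, label, top_score))
        elif top_score <= low:
            uncertain.append((text, label, top_score))
    return confident, uncertain

# Hypothetical classifier output for four unseen messages.
scored_samples = [
    ("Win a free phone, reply now", {"promo": 0.97, "billing": 0.02, "delivery": 0.01}),
    ("Your invoice is ready", {"promo": 0.05, "billing": 0.93, "delivery": 0.02}),
    ("Please call me back", {"promo": 0.40, "billing": 0.35, "delivery": 0.25}),
    ("Your order has shipped", {"promo": 0.20, "billing": 0.10, "delivery": 0.70}),
]

confident, uncertain = split_by_confidence(scored_samples)
print("review as likely correct:", [text for text, _, _ in confident])
print("review as uncertain:", [text for text, _, _ in uncertain])
```

Samples that fall between the two thresholds end up in neither subset, which keeps the manual review focused on the two extremes.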


# Exercise 4: Round II of Algorithm Optimisation

Run another round of algorithm optimisations. Report the new observations on:

https://docs.google.com/spreadsheets/d/1zWnt6mwQY9KoWFXZpSQ6rsdOn56YL7-PYVdvu-_IwDo/edit?usp=sharing

