# Data Science - Assignment 3: Know your Results

In the first two skills classes we focussed on the very basics of interpreting data, running models, and comparing the (unexpected) output to our intuitions. This week, we will focus on improving and interpreting scores for the models in a meaningful way.

## 0.1 - Prepare your Data

We are using the same data as before, in particular the IMDB set that was provided during the first test. Make sure that you have removed any string-valued features, as well as 'polluting' features. Again, it's important to compare results across datasets.

## 0.2 - Refresher on Evaluation

Any machine learning task can be formally evaluated by taking the labels of the feature that was selected as the prediction target and comparing them to the labels predicted by an algorithm. Previously we did this by simply looking at the accuracy score (the percentage of correctly predicted instances). In the previous class, however, we trained on all of our data and afterwards made predictions about exactly those same instances. As such, the results shown were mere reproduction: they show how well a classifier trained with perfect information can reproduce labels it has already seen.

What you might have seen as a result is that intuitively uninformative features (unique for each instance, containing no information about the task we are trying to predict) can yield incredibly high scores, while intuitively informative features might not. Because this is counter-intuitive, it is correct to regard this behaviour as wrong. Uninformative features should NOT yield high scores (unless they have some surprising characteristic, which is unlikely), and high scores in general should not be trusted at face value.

In a realistic setting, we would want to leave some of our data 'unseen' by the algorithm. For those instances we provide only the feature vectors, and hope that the algorithm will guess the label correctly. If we obtain a high score in that setting, we might conclude that our algorithm is performing well, and we can interpret the _actual_ contribution of features (for some classifiers) to this score. Still, even the performance on the test set should be interpreted in the light of (i) the size and quality of your data, and (ii) the parameter settings of your classifier.
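
The difference between reproducing seen labels and predicting unseen ones can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data (a unique row ID as the only feature, coin-flip labels) is hypothetical, and the 1-nearest-neighbour function is a simplified stand-in, not WEKA's `IBk`.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training instance whose feature value is closest to x."""
    nearest = min(train, key=lambda inst: abs(inst[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of instances in `data` that 1-NN (fitted on `train`) labels correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in data)
    return correct / len(data)

rng = random.Random(0)

# Hypothetical toy data: the only feature is a unique row ID and the labels
# are coin flips, so the feature carries no information about the label.
ids = list(range(200))
rng.shuffle(ids)
data = [(i, rng.choice(["good", "bad"])) for i in ids]
train, test = data[:160], data[160:]

train_acc = accuracy(train, train)  # evaluated on the memorised instances
test_acc = accuracy(train, test)    # evaluated on unseen instances

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {test_acc:.2f}")
```

On the training data every instance is its own nearest neighbour, so the score is perfect even though the feature is meaningless. On the unseen part the score should hover around chance level.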

One of the more common ways of making sure that the classifier is evaluated on unseen data is to simply chop off a part of your data and set it aside for testing. As such, you have A) a big training set, and B) a small test set. The actual proportions of these sets depend on the amount of data that you have, but in general it's good to leave at least 20% for testing. Again, **these proportions are not set in stone**, and therefore you cannot just say "always leave 20% for testing". Treat it as a starting point.
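
As a rough sketch of such a split (not WEKA's own implementation; the function name `percentage_split` and its `seed` parameter are made up for this example):

```python
import random

def percentage_split(instances, train_fraction=0.8, seed=1):
    """Shuffle a copy of `instances` and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

instances = list(range(50))  # stand-in for 50 data rows
train, test = percentage_split(instances, train_fraction=0.8)

print(len(train), len(test))  # 40 10
```

Shuffling before cutting matters when the file is sorted in some way (for example by label), because otherwise the test set would not resemble the training set.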

Because we now have some guarantee that the classifier is tested fairly, we can meaningfully compare its performance to, for example, a majority baseline (which is also evaluated on the test set, with the majority class determined on the training set). So what can we conclude afterwards? Given the data that we selected, the proportion of train and test, and the parameters that were selected, classifier X might perform better than the baseline (meaning it has learnt something in this specific setting). Our final goal is to maximize classifier performance while remaining as critical of the results as possible.
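
A majority baseline evaluated this way could look like the following sketch. The label lists are hypothetical and the function names are chosen for this example only.

```python
from collections import Counter

def majority_class(train_labels):
    """Most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the training-set majority class on the test set."""
    majority = majority_class(train_labels)
    return sum(label == majority for label in test_labels) / len(test_labels)

# Hypothetical label distributions for a train/test split
train_labels = ["no"] * 70 + ["yes"] * 30
test_labels = ["no"] * 15 + ["yes"] * 5

print(majority_class(train_labels))                  # no
print(baseline_accuracy(train_labels, test_labels))  # 0.75
```

The baseline score is simply the share of the majority class in the test set, which is why a skewed label distribution produces a high baseline.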

## 1 - Train and Test

For this assignment, we will use WEKA to split the dataset and hopefully get some more interpretable results.

- Open at least the IMDB set, and optionally other sets of your choice, in WEKA and follow the usual steps (also see 0.1).
- For the set you've selected, recall which features you personally found most informative.
- On the Classify tab, select `Percentage split` and set it to 80% (the proportion of the training set).
- Don't forget to select the label you want to predict.

**Performance will be measured by accuracy here.**

### Tasks


1. How well does majority baseline do? Can you explain its performance given the label distribution?
2. What does a high majority baseline tell you? And a low one? Does it tell you anything about the complexity of the prediction task?
3. How well does IBk do now? How does that compare to before (on just the training set)? How does it compare to the majority baseline? Did it learn anything? What does this tell you about the complexity of the task?
4. How well does J48 do? Does it do better or worse compared to KNN? And the majority baseline?
5. Can you now say something about the data given the performance of these classifiers and their decision boundaries?
7. Did J48 agree with the features you thought were most informative?
8. Can you spot any features that you think would be better to remove (IMDB specific)?
9. Would you remove features to improve the performance of your classifier? Would you remove the above features even if doing so hurts the performance of your classifier?

## 2 - Improving Performance (Trial)

As was mentioned in the lectures, classifiers have parameters, or options, that you can change. WEKA by default sets these to some value, but you can change them yourself. They can be found by clicking on the classifier name, next to the `Choose` button. Try to tweak the number of neighbours (the KNN parameter) for `IBk` to achieve better performance.
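
To make the role of this parameter concrete, here is a minimal sketch of a k-nearest-neighbour classifier evaluated for several values of k. It is a simplified stand-in for `IBk`, and the two-cluster toy data is hypothetical, so the scores say nothing about the IMDB or Titanic sets.

```python
import math
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training instances closest to x (Euclidean distance)."""
    neighbours = sorted(train, key=lambda inst: math.dist(inst[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test instances that k-NN (fitted on `train`) labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

rng = random.Random(42)

def make_instance():
    # Hypothetical toy data: two overlapping 2-D Gaussian clusters
    label = rng.choice(["a", "b"])
    centre = 0.0 if label == "a" else 1.5
    return ((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label)

data = [make_instance() for _ in range(250)]
train, test = data[:200], data[200:]

# Odd values of k avoid tied votes when there are two classes
scores = {k: accuracy(train, test, k) for k in (1, 3, 5, 11, 19)}
for k, score in scores.items():
    print(f"k={k:2d}  accuracy={score:.2f}")
```

With noisy, overlapping classes a larger k tends to smooth out the influence of individual outliers. Whether that helps on a given dataset is exactly what the tasks below ask you to find out.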

### Tasks

10. At what value do you see an increase in performance?
11. Are you able to achieve a much higher score than K=1? What does this tell you about the data?
12. Open up the Titanic set and repeat this task. Try to find a value between K=1 and K=20. Do you manage to improve it this time?


- Write down the KNN value that improved performance most on the Titanic set. If you forgot it, you can find it in the output just above the scores, under:

```
=== Classifier model (full training set) ===

IB1 instance-based classifier
using ... nearest neighbour(s) for classification
```

- Run majority baseline.


13. Does the default value of K=1 improve upon the majority baseline for this test? What about the one you selected?
14. Do you think that the classifier will perform as well on new data (if we were to have any)?

## 3 - Interpreting Performance (Error)

Surprise! We have new data! For the Titanic set, it is available as `missing.arff` on [GitHub](https://github.com/ericpostma/ep/tree/master/chris/Week%201%20-%20Introduction/Titanic/). This is what you could treat as 'future' or 'wild' data: data that you have not trained on or tested on, and that you know nothing about. In general, you don't have access to such data, but you hope that under the constraints you have set up, the classifier still performs well. So, let's put it to the test.

- In WEKA, on the Classify tab, select IBk and the KNN value you previously determined yielded the highest score.
- Select `Supplied test set` and open the `missing.arff`.
- WEKA will tell you that train and test don't match (that's because we removed some string variables). Luckily, WEKA can map the features automatically with the `InputMappedClassifier`; a pop-up will ask whether you want to do this, click `Yes`.
- After running KNN, also run ZeroR.
- Now run KNN with K=3.
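
The idea behind mapping a mismatched test set onto the training features can be sketched as matching columns by name. This is only an illustration of the concept, not WEKA's implementation, and the attribute names below are hypothetical rather than taken from the actual ARFF files.

```python
def map_to_training_features(train_features, test_features, test_row):
    """Reorder one test row to the training feature order.

    Columns that only exist in the test set are dropped; features the
    test set lacks are filled with None (a missing value).
    """
    by_name = dict(zip(test_features, test_row))
    return [by_name.get(name) for name in train_features]

# Hypothetical attribute names: the test file still has a string column
train_features = ["pclass", "sex", "age", "survived"]
test_features = ["pclass", "name", "sex", "age", "survived"]
test_row = [3, "Doe, Mr. John", "male", 22, "no"]

mapped = map_to_training_features(train_features, test_features, test_row)
print(mapped)  # [3, 'male', 22, 'no']
```

After mapping, the test row has the same layout as the rows the classifier was trained on, so it can be classified without retraining.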


15. How does the performance on this new data compare to your earlier scores (for your K, the majority baseline, and K=3)? Why do you think this is the case?



