# Social Data Mining 2016 - Know your Performance

In the previous skills class you applied the skills you have learned up until now to attempt to solve the MNIST data challenge. As we've held back information during the weeks, this practical is going to be the first session where you are actually going to be able to say something about how well your algorithms work.

## 0.1 - Prepare your Data

Today we are going to work on the IMDB set. **However**, I have made special splits for you, so please DON'T use the `.arff` we provided you before, because it contains the whole dataset. Instead, use `imdb_train.arff` and `imdb_test.arff`.

## 0.2 - Refresher on Evaluation

As we've seen up until now, any Data Mining routine starts with data; as we're using pre-made datasets the first step is determining the goal of the dataset, informing yourself about the features, spotting any pollution / contamination in your featureset and setting up a classification task for yourself. To know how well you are doing while running your classifiers we can discern *up to* four types of splits in a given dataset:

- Training set
- Validation set
- Testing set
- Held-out set

The size of these with respect to the whole dataset usually differs, and if you are using data for an existing classification task, usually you are already provided with a test set. In general, popular splits are (train / validation / test) 80/10/10 or 50/25/25 (given enough data). As we've seen, however, the validation set can also be replaced with $k$-fold Cross Validation (CV). 
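As a rough illustration of such a split, here is a minimal Python sketch of an 80/10/10 division. The function name `train_val_test_split`, its parameters, and the stand-in data are assumptions made for this example; WEKA does the equivalent for you through its test options.

```python
import random

def train_val_test_split(instances, val_pct=10, test_pct=10, seed=42):
    """Shuffle a copy of `instances` and cut it into train / validation / test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n = len(shuffled)
    n_test = n * test_pct // 100
    n_val = n * val_pct // 100
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything that is left over
    return train, val, test

data = list(range(100))  # stand-in for 100 instances (hypothetical data)
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before cutting matters when the file is sorted (for example by label), because otherwise one split may contain only a single class.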

The procedure of testing performance would then be as follows: we train our model on the training set, and evaluate how we are doing using either our validation set or CV. We can choose to tune the parameters of our classifiers (like $k$ for $k$-NN) to maximize the score we are looking at. Once you're confident about your set-up, you test it on the test set (training on the training set, and evaluating on the test set rather than the validation set). **Important**: after you know your performance on the test set, you have an indication of how well your model will do on unseen data. If you go back to the drawing board again and try to optimize for test set performance, you **lose** this information and can afterwards not say anything about how well your classifier performs on unseen data.
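When CV takes the place of a fixed validation set, the training data is cut into $k$ folds and each fold serves as the validation set once. A minimal sketch of how those folds could be generated is given below; `k_fold_indices` is a hypothetical helper, and it assumes the instances were already shuffled.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_idx, val_idx) pairs, one pair per fold."""
    indices = list(range(n_instances))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print(val_idx, "held out; training on", len(train_idx), "instances")
```

You would train and score the classifier once per fold and average the $k$ scores. The test set plays no part in any of this.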

The held-out set is a special case. For many data challenges (like MNIST), the organizers purposefully withhold a part of the data (to make the evaluation as fair as possible). You could consider this as actual *new* data you're provided with. Once the challenge has ended and everyone has submitted their best performing model, the models are tested on this piece of data and scored (as we did after the challenge).

----

So how do we meaningfully interpret the performance of our model? So far, we have only looked at accuracy (percentage of correctly classified instances) and error (percentage of incorrectly classified instances). As you might have noticed in WEKA, there are a bunch of other metrics that capture a bit more information about certain parts of the confusion matrix. What we have looked at in the previous lecture was *precision*, *recall*, and *$F_1$* score.

For this example, let's assume that we have a dataset with a lot of dogs (5000), and some cats (200). We want to see if we can distinguish a `cat` from a `dog`. As such, we have two classes: `cat`, and `dog`. Cat is the minority class here, so we are actually interested to see how well we will do on classifying cats amongst a lot of dogs. As such, `cat` will be our **Positive** instance (that which we are interested in doing well at), and `dog` **Negative**.

Recall: will tell you something about how many of the instances that you **wanted to classify** you actually classified correctly. So in our example, if we managed to classify all cats as actual cats (even if we classified some dogs as a cat) our recall is 100%. This distinguishes itself from accuracy, because accuracy says something about how well we classify ALL our classes. Say that we classified 100 cats as cat, but 100 **cats as dog**, our recall is 50%.

$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive + False Negative}}$ 
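To make the difference with accuracy concrete, here is a small Python sketch using the numbers from the example. The function names are made up for illustration, and the "classify everything as dog" case is an assumed extreme scenario, not something from the lecture.

```python
def recall(tp, fn):
    """Fraction of actual Positives that were classified as Positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Scenario from the text: 100 cats classified as cat, 100 cats classified as dog.
print(recall(tp=100, fn=100))  # 0.5

# Assumed extreme case: every one of the 5000 dogs and 200 cats is labelled dog.
print(round(accuracy(tp=0, tn=5000, fp=0, fn=200), 3))  # 0.962
print(recall(tp=0, fn=200))  # 0.0
```

The last two lines show why accuracy alone can mislead on unbalanced data: the classifier looks good overall while finding no cats at all.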

Precision: will tell you something about how many of the instances you **have classified** as being Positive (i.e. cat here) were actually a cat. In our example, if we managed to not classify any `dog`s as `cat`s (even if we only classified two instances as being a cat), our precision is 100%. Say that we classified 100 cats as cat, but also 100 **dogs as cat**, our precision is 50%. So:

$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive + False Positive}}$ 
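The two precision scenarios from the paragraph above can be checked with a few lines of Python. This is only a sketch, and the function name `precision` is chosen for this example.

```python
def precision(tp, fp):
    """Fraction of instances classified as Positive that really are Positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# 100 cats classified as cat, but also 100 dogs classified as cat.
print(precision(tp=100, fp=100))  # 0.5

# Only two instances classified as cat, and both really are cats.
print(precision(tp=2, fp=0))  # 1.0
```

Note that the second case says nothing about the 198 cats that were missed; that is what recall measures.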

Recall that the confusion matrix looks like:

                    classified as
                  | Cat  | Dog  |
    actual | Cat  | tp   | fn   |
           | Dog  | fp   | tn   |
    
    
Way to remember: first determine the thing you want to classify - that is Positive, the other class is Negative. If we classified something as Positive, and it was Positive, our Positive prediction was correct (true), so True Positive (tp). If we classified something as Negative, and it was Positive, our Negative prediction was incorrect (false), so False Negative (fn), etc.
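The same bookkeeping can be written as a short Python sketch. The helper `confusion_counts` and the six example labels are invented for illustration; they are not taken from any of the course datasets.

```python
def confusion_counts(actual, predicted, positive):
    """Count tp, fp, fn, tn for a binary task, given the Positive label."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted Positive, and it was Positive
        elif p == positive:
            fp += 1   # predicted Positive, but it was Negative
        elif a == positive:
            fn += 1   # predicted Negative, but it was Positive
        else:
            tn += 1   # predicted Negative, and it was Negative
    return tp, fp, fn, tn

# Made-up labels, with cat as the Positive class.
actual    = ["cat", "cat", "dog", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "cat", "dog", "cat"]
tp, fp, fn, tn = confusion_counts(actual, predicted, positive="cat")
print(tp, fp, fn, tn)  # 2 1 1 2
```

Swapping the Positive class to `dog` swaps tp with tn and fp with fn, which is why you have to fix that choice before computing any metric.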


**NOTE:** We currently only explain this in a binary classification task (with only 2 labels, so either yes or no, dog or cat). In reality you probably have more labels, but we'll get to that another week. Just make sure that you understand it in the binary scenario.


These two metrics therefore trade off two properties of our classification process: how many of the instances that we want to detect as Positive do we actually detect, and how many of the instances that we detect are actually Positive. In some scenarios we want to optimize precision, in some we want to optimize recall, but this is only when we __can't do both__ (because of lack of data, unbalanced instances, poor choice of classifiers, etc). In general, we actually want both high recall and high precision. To squeeze this into one metric, we use the harmonic mean of precision and recall (a kind of average that is pulled towards the lower of the two values), called the $F$ score or $F_1$ score:

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
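A quick Python sketch shows how the harmonic mean behaves compared to a plain average. The function name `f1_score` and the input values are chosen purely for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced case: the harmonic mean equals the ordinary average.
print(f1_score(0.5, 0.5))  # 0.5

# Unbalanced case: perfect precision cannot compensate for poor recall.
print(round(f1_score(1.0, 0.1), 3))  # 0.182
```

The ordinary average of 1.0 and 0.1 would be 0.55, so the $F_1$ score punishes a classifier that does well on only one of the two metrics.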



## 1 - How to Evaluate


| nr | (true) label | prediction |
| -- | ----- | ---------- |
| 1  | dog   | cat        |
| 2  | cat   | dog        |
| 3  | dog   | dog        |
| 4  | cat   | cat        |
| 5  | dog   | dog        |
| 6  | dog   | cat        |
| 7  | dog   | cat        |
| 8  | cat   | cat        |
| 9  | cat   | cat        |
| 10 | cat   | cat        |

### Tasks

Given the results above:

- Pick a Positive class (either dog or cat, the choice here is yours).
- Make a confusion matrix.
- Calculate precision, recall and f1 score.
- How well does this classifier do on your prediction task?
- Can you think of another task (so not classifying cats and dogs) where (given lack of overall performance) you would need to optimize precision? And one for recall?

## 2 - Apply your Knowledge

You should now be able to set up a fair evaluation of your classifier. We are going to assume that we are doing the IMDB *classification* from the viewing point of the **director**. Therefore, apart from polluting features, also remove features that are not accessible to the director at the point of making a movie (for more details, read the `.arff` header).

* Load `imdb_train.arff`, remove features as described above.
* Choose either percentage split or cross-validation to make a validation set.
* Pick a classifier (or multiple) of your choice and tune its parameters to get a better result on the validation set.
* When you're satisfied with your result, test your set-up on `imdb_test.arff`, under `Classify -> Supplied test set`.

### Tasks

1. Did you manage to do well on the IMDB set?
2. Does your choice of classifier and parameter setting make sense if you compare it to your result?
3. Would you feel comfortable delivering this result to a director?
4. Do you think informative features might help the director?

## 3 - Regression

Refresh the IMDB set by opening it up again. Now, rather than looking at classification for `quality` we are going to do regression over the `imdb_score`. We again assume that we do this from the view of the director, so remove any unnecessary features.

- Under `Classify -> Choose -> functions` select `LinearRegression`.
- Run the model.
- Run ZeroR as a baseline (for regression, it always predicts the mean of the target).
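To see what the two models do when they predict, here is a minimal Python sketch. The coefficients, intercept, feature names (`x1`, `x2`) and target values are all placeholders invented for this example; the real ones come from your own WEKA output.

```python
def linear_predict(intercept, coefficients, instance):
    """Apply a linear regression formula: intercept + sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * instance[name]
                           for name in coefficients)

def zero_r_predict(train_targets):
    """ZeroR-style baseline for regression: ignore all features, predict the mean."""
    return sum(train_targets) / len(train_targets)

# Placeholder model and instance (hypothetical values, not from the IMDB set).
coefficients = {"x1": 2.0, "x2": -0.5}
instance = {"x1": 3.0, "x2": 4.0}
print(linear_predict(1.0, coefficients, instance))  # 5.0

# Placeholder training targets for the baseline.
print(zero_r_predict([6.0, 7.0, 8.0]))  # 7.0
```

A regression model is only worth something if its error is clearly lower than that of the mean-only baseline.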

### Tasks

5. Do you understand the coefficients and the formula in general?
6. Can you interpret the Root Mean Squared Error on basis of the sheets in the lecture? 

### Housing

Now open up the `housing.arff`. The example was also shown in the lecture, but if you're still unsure about the content, open the file in your text editor: it includes a small explanation.

- Apply the same method again.

### Tasks

| crime-rate | zoned | industry | charles | nitric-oxide | rooms | age | employment-center | radial-highways | property-tax | pupil-teach-ratio | proportion-black-families | poor-people |
| ---      | ---  | ---    | - | ---    | ---    | ---    | ---    | ---  | ---   | ---   | ---    | ---   |
| 25.04610 | 0.00 | 18.100 | 0 | 0.6930 | 5.9870 | 100.00 | 1.5888 | 24   | 666.0 | 20.20 | 396.90 | 26.77 | 
| 14.23620 | 0.00 | 18.100 | 0 | 0.6930 | 6.3430 | 100.00 | 1.5741 | 24   | 666.0 | 20.20 | 396.90 | 20.32 | 
| 9.59571  | 0.00 | 18.100 | 0 | 0.6930 | 6.4040 | 100.00 | 1.6390 | 24   | 666.0 | 20.20 | 376.11 | 20.31 | 

7. Use the formula WEKA outputs to predict the `median-value` (by hand) for the feature vectors above.
8. Use the predicted `median-value` to calculate the Root Mean Squared Error against the **actual** median-value in the table below. You do this by (for each of the feature vectors) subtracting the `actual` from the `predicted` value and squaring the result. Afterwards, you take the sum over all these values, divide it by the number of predictions (so 3), and take the square root of this number. <br><br> Or: <br><br> $\text{RMSE} = \sqrt{ \frac{ \sum^n_{t=1}(\hat{y}_t - y_t)^2}{n}}$
9. Try comparing the difference between ZeroR and the model.
10. Try comparing the error between the housing set and the IMDB set.


| median-value |
|------------- |
| 5.60         |
| 7.20         |
| 12.10        |
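
If you want to check your hand calculation for task 8, the RMSE formula can be sketched in a few lines of Python. The `actual` values are the ones from the table above; the `predicted` values are **placeholders** invented for this example, so replace them with the predictions you computed from your own WEKA formula.

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error: square root of the mean of the squared differences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [5.60, 7.20, 12.10]      # median-value column from the table
predicted = [6.60, 6.20, 14.10]   # placeholder predictions, NOT the real answers

print(round(rmse(predicted, actual), 3))  # 1.414
```

Because the errors are squared before averaging, one large miss weighs more heavily than several small ones.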