# Social Data Mining 2016 - Know your Process

In the previous skills class we looked at different metrics for interpreting the performance of your classifiers, a fair set-up for determining the performance of your model (both on known data and on new data), and regression. This week we are going to dive into WEKA again with your new knowledge, and look at how to make your process more convenient.

## 0.1 - Prepare your Data

For this practical, you can work on any set we have used before. The goal will be to maximize performance on these tasks and see which classifiers will perform best. 

## 0.2 - Refresher on Work Flow

To recap our general work flow in WEKA: one of the first important steps is understanding our data. Up until now, you've worked with provided datasets, for which (as with any dataset) it is important to understand their potential. A provided dataset generally has a purpose, or a classification / regression task associated with it. When you collect data yourself, you probably already have such a task in mind. It's therefore always a good idea to start by opening the `.arff` files, determining the task at hand, and noting any other prediction tasks we might be able to do. For example, as we've seen with the IMDB set, we could use all available information in the dataset to see if we can replace IMDB as a website, just by looking at movie and actor information and the related social media information. However, we can also omit post-release information about the movie to set up a prediction task for a director who is making a movie.

We are also interested in the features and their meaning: if we don't know what we're working with, we cannot interpret the results we're getting, and we will fail to spot issues such as meaningless or contaminating (polluting) features. A good example of a meaningless feature is `budget` in the IMDB set. Because it is a mix of different currencies, there are huge outliers, and identical numbers (say 1000 and 1000) won't necessarily have the same meaning: one can be in baht (100 baht = 2.56 euros) and the other in euros. Contaminating features are those that are directly related to the feature we are using as a label. These might occur in your data when a numeric value is converted to a nominal one (like `imdb_score`), or when two features describe different aspects of the same concept (like `time_of_death` and `is_alive`).
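
To make the `budget` problem concrete, here is a minimal sketch using only the rate quoted above. It assumes we somehow knew the currency of each row, which is exactly the information the dataset does not give us; the two rows and the `to_euros` helper are made up for illustration.

```python
# Rate taken from the text above: 100 baht = 2.56 euros.
EUR_PER_UNIT = {"EUR": 1.0, "THB": 2.56 / 100}


def to_euros(amount, currency):
    """Convert an amount to euros using the fixed rates above."""
    return amount * EUR_PER_UNIT[currency]


# Two hypothetical rows that share the same raw `budget` value.
rows = [(1000, "EUR"), (1000, "THB")]
for amount, currency in rows:
    print(amount, currency, "=", round(to_euros(amount, currency), 2), "EUR")
```

A classifier only sees the raw number, so it would treat both rows as having the same budget even though their real values differ by a factor of about 39.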

Once we've prepared our features and know what our prediction task will be, we can run classifiers. As you've seen before, you can run them manually. However, today we're going to elaborate on how to quickly compare different settings.


## 1 - The Experimenter

The Experimenter allows us to run batches of experiments and see if their performance differs enough (between classifiers) to be statistically significant. However, it doesn't allow you to edit your dataset, so you need to do that before going into this mode.

- Open WEKA.
- Click `Explorer` and open the dataset of your choice.

Given that we can only deal with numeric and nominal features, we want to filter out string features before opening the dataset in the Experimenter. WEKA can use `Filters` to quickly remove these for you (we did this manually before).

- Under `Filter` click `Choose`.
- Go to `Filters -> unsupervised -> attribute -> RemoveType` (by default this selects string).
- Click `Apply`.
- Remove any contaminating / polluting features.
- Click `save` and save the `.arff` file under a new name.

Your data is now prepared.
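
If you want to check which attributes are declared as `string` before (or after) filtering, you can read the header of the `.arff` file. The sketch below is a simplified reader, not WEKA's own parser: it only looks at `@attribute` lines, and the sample header is hypothetical (made up to resemble the IMDB set).

```python
import shlex

# Hypothetical ARFF header, made up for illustration.
SAMPLE_ARFF = """\
@relation movies
@attribute movie_title string
@attribute director_name string
@attribute budget numeric
@attribute color {Color,BW}
@attribute imdb_score numeric
@data
"""


def attribute_types(arff_text):
    """Return (name, declared type) pairs from the @attribute lines of an ARFF header."""
    attributes = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("@data"):
            break  # the header ends where the data section starts
        if not stripped.lower().startswith("@attribute"):
            continue  # skip @relation, comments and blank lines
        tokens = shlex.split(stripped)  # copes with quoted attribute names
        attributes.append((tokens[1], " ".join(tokens[2:])))
    return attributes


def string_attributes(arff_text):
    """Names of the attributes declared as string."""
    return [name for name, declared in attribute_types(arff_text)
            if declared.lower() == "string"]


print(string_attributes(SAMPLE_ARFF))
```

After applying `RemoveType` and saving, running the same check on the new file should give an empty list.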

- Close this window and open up the `Experimenter`.
- Click `New` to start a new set-up.
- Cross-validation is selected by default (but can be changed), and you can indicate if you want to do `Classification` or `Regression`.
- You can also change the number of repetitions of each set-up (more is better).
- Add a dataset of your choice under `Add new...`.

The Experimenter works a bit differently than the Explorer. You have to manually indicate which feature you want to treat as class / label.

- Click the dataset name to open up a new window.
- Scroll to the feature (column) you want to treat as the prediction class, right click and click `Attribute as class`.
- WEKA asks if you want to save at this point. If this doesn't work (which can happen on Windows), follow these steps:
    - Go back into `Explorer` and open up your dataset.
    - Under `Filter`, click `Choose` and go to `Filters -> unsupervised -> attribute -> Reorder`.
    - Each feature has a number; with the `Reorder` filter you can change the order of these features. The Experimenter can only use your LAST feature as the prediction label, so you want to move the feature you want to use as a label to the back. Say you want feature nr 26 (out of 29) as the prediction label: click on `Reorder` and change `attributeIndices` to `1-25,27,28,29,26` (keep 1 to 25 in the same order, then 27, 28, 29, and finally 26). Make sure not to include spaces!
    - Don't forget to click `Apply`.
    - Save your datafile **TO A NEW FILE**, do not overwrite!
    - Open the `Experimenter` again.
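
Typing the index string by hand is error-prone for larger datasets. As a small sketch, the hypothetical helper below (`reorder_indices` is not part of WEKA) builds the string for any number of features and any 1-based label position, in the same style as the example above.

```python
def reorder_indices(n_attributes, class_index):
    """Build a 1-based index string that moves `class_index` to the last position."""
    if not 1 <= class_index <= n_attributes:
        raise ValueError("class_index must lie between 1 and n_attributes")
    parts = []
    before = class_index - 1  # number of features in front of the label
    if before == 1:
        parts.append("1")
    elif before > 1:
        parts.append(f"1-{before}")
    # Features after the label keep their order and are listed one by one.
    parts.extend(str(i) for i in range(class_index + 1, n_attributes + 1))
    parts.append(str(class_index))  # the label goes last
    return ",".join(parts)  # no spaces


print(reorder_indices(29, 26))  # 1-25,27,28,29,26
```

You can paste the printed string into the `Reorder` filter settings.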

Now we can add some classifiers:

- Under the algorithm part, click `Add new`.
- The first classifier you select will be compared to all others (or vice versa), so make sure you select a baseline classifier or some other one you're interested in (a certain parameter setting for example, to compare to other parameter values).
- After you have selected a few classifiers, click the `Run` tab.
- Press `Start`.
- Wait until they are finished.
- Click the `Analyse` tab.

Now we want to import the experiment we just ran, so that we can run statistical tests on it.

- Import your last results by clicking `Experiment`.
- Click `Perform test`.
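
To get a feeling for what such a test computes, here is a minimal sketch of a corrected resampled paired t-statistic, which (to the best of our knowledge) is the idea behind the Experimenter's default corrected paired t-tester. It is an illustration under that assumption, not WEKA's code, and the accuracy values are made up.

```python
import math
from statistics import mean, variance


def corrected_paired_t(scores_a, scores_b, test_train_ratio):
    """t-statistic on paired score differences, with a correction for overlapping training sets.

    test_train_ratio is (test set size / training set size), e.g. 1/9 for 10-fold
    cross-validation. Setting it to 0 gives the ordinary paired t-statistic.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    return mean(diffs) / math.sqrt((1.0 / k + test_train_ratio) * variance(diffs))


# Hypothetical accuracies per run, made up for illustration only.
model = [0.60, 0.63, 0.58, 0.62, 0.57]
baseline = [0.50, 0.52, 0.48, 0.51, 0.49]

t_corrected = corrected_paired_t(model, baseline, test_train_ratio=1 / 9)
t_plain = corrected_paired_t(model, baseline, test_train_ratio=0.0)
print(round(t_corrected, 2), round(t_plain, 2))
```

The corrected statistic is smaller than the plain one, so it is more cautious about calling a difference significant. The larger the statistic, the less likely it is that the difference between the two classifiers is due to chance.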

### Tasks

1. Interpret the results, do your classifiers perform better than the baseline?
2. Can you say something about the (non-baseline) classifiers compared to each other?

## 2 - GridSearch

Tuning parameters by hand gives us a lot of insight into, and feeling for, which values perform well and whether we might be making the model too specific or too general, but it costs a lot of time. One way to batch run a lot of different parameter settings and make WEKA (or any other data mining program) figure out the optimal scores is called `GridSearch`. In WEKA, `GridSearch` is only implemented in the developer version, but our version has a similar implementation called `CVParameterSelection`. We will try to apply that to our problem now.

- Close the Experimenter and open the Explorer again.
- Prepare your dataset like before.
- Under the `Classify` tab, click `Choose`.
- Navigate to `classifiers -> meta -> CVParameterSelection`.
- Click on the classifier, and notice that within these options you can change the classifier you want to do parameter selection on.
- Let's take KNN as an example. To change the parameters you want to tune, you need to give their string representation, the range (lower and upper bound), and the number of steps in the `CVParameters` field. Say for example that we want to tune the value of K between 1 and 10: we type `K 1 10 10` (10 values from 1 to 10). If we want to take steps of 2 (1, 3, 5, 7, 9), we type `K 1 9 5` (5 values from 1 to 9), etc. Click `Add`, and close the window when you're done (you can provide multiple settings).
- Run the classifier.
- Somewhere in the output it should say `using X nearest neighbour(s) for classification`; this is the best value of K within the range you indicated.
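
If you are unsure which values a `CVParameters` entry will try, the sketch below expands an entry of the form `NAME LOWER UPPER STEPS` into evenly spaced values. It mirrors the two examples above; `expand_cv_parameter` is a hypothetical helper and not WEKA's implementation, which may differ in details such as rounding.

```python
def expand_cv_parameter(spec):
    """Expand 'NAME LOWER UPPER STEPS' into the option name and the values to try."""
    name, lower, upper, steps = spec.split()
    lower, upper, steps = float(lower), float(upper), int(steps)
    if steps < 2:
        return name, [lower]
    increment = (upper - lower) / (steps - 1)  # distance between two values
    return name, [lower + i * increment for i in range(steps)]


print(expand_cv_parameter("K 1 10 10"))
print(expand_cv_parameter("K 1 9 5"))
```

Note that the last number is the number of values, not the distance between them: the distance follows from the range and the number of steps.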

### Tasks

1. Try to fiddle around with certain set-ups, and different classifiers / parameters.
2. Do you find that running the parameters in batch gives you an advantage? What are the downsides?
3. Repeat the same procedure in the Experimenter. Did you lose any information compared to adding them by hand?

## 3 - Extra

Under filters, there are several options to change string variables into something meaningful. Two examples are `StringToNominal` and `StringToWordVector`.

### Tasks
1. Try to select the latter for the IMDB set (make sure there are still string variables in the dataset) and see what it does (make sure to apply the `RemoveType` filter afterwards).
2. Run J48 and interpret the trees.
3. Did this improve your performance?
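
As a rough idea of what a word-vector representation looks like, the sketch below turns a few short texts into 0/1 vectors marking which words occur in them. It is a simplified illustration and not WEKA's `StringToWordVector`, which has many more options (tokenizers, stop words, word counts); the texts are made up.

```python
import re


def word_vectors(texts):
    """Turn each text into a 0/1 vector marking which vocabulary words it contains."""
    tokenised = [set(re.findall(r"[a-z0-9]+", text.lower())) for text in texts]
    vocabulary = sorted(set().union(*tokenised))  # one feature per distinct word
    vectors = [[1 if word in words else 0 for word in vocabulary]
               for words in tokenised]
    return vocabulary, vectors


# Hypothetical plot keywords, made up for illustration.
texts = ["space alien war", "alien love story", "war story"]
vocabulary, vectors = word_vectors(texts)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Every distinct word becomes its own feature, so a single string column can turn into a very large number of new columns.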