# Social Data Mining 2016 - Assignment 1: Know your Data

In this session you will choose **several** of the datasets that we have prepared. Be sure that you find them interesting enough to motivate your scientific curiosity. A bit of information on them to help you choose:

- **Titanic**: Passenger information of the most famous shipwreck in history.

- **UCI HAR**: Participants were asked to carry out several actions while their mobile phones recorded gyroscope and accelerometer information.

- **Game of Thrones**: These are **two** datasets that capture information from Game of Thrones (TV) / A Song of Ice and Fire (book). One contains information on characters, the other on battles. Needless to say, they might contain spoilers if you're not caught up.

- **Pokémon**: Shamelessly riding the Pokémon Go hype, we prepared a dataset with Pokémon up until the current generation, listing their stats, type, and colour.

---

> Q: Why do I need to see more than one dataset? 

> A: Different sets require different interpretations. Some are big, some small, some have a lot of features, some have very few, some require expert knowledge, some are very intuitive. It's good to compare the sets on these points across all tasks.

----


To download, follow the steps below **!!!!!! carefully !!!!!!**:


## 0 - Download Data & Rename


- Make sure you have WEKA installed from http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
- For Windows, make sure you also install Java with it.

#### Before Anything - Windows 8 / 10

- First go to your File Explorer (the thing below).

![](https://upload.wikimedia.org/wikipedia/en/8/83/Screenshot_of_Windows_10_File_Explorer.png)

- Click on the `View` (`Beeld` in Dutch) tab at the top side of the File Explorer screen.
- Turn ON `View File Extensions` or `Toon Bestandsextenties` (or something like that).

#### Doesn't work for your Windows version?

Follow this: http://www.howtohaven.com/system/show-file-extensions-in-windows-explorer.shtml


#### For Everyone:

All of the above datasets can be found at https://github.com/ericpostma/ep/tree/master/chris/Week%201%20-%20Introduction. Download them as follows:

- Open up the link in Firefox / Chrome / Safari (IE or Edge might cause issues).
- Click on the `.arff` file.
- You should have a view of your data.
- Right corner, click `raw`. <-- **IMPORTANT**
    - For UCI HAR just click `download` instead of `raw`.
- Right click `save page as`.


### --> If file is saved as .arff.txt

#### Windows

- Simply rename to `.arff`.
- Still get errors after opening in WEKA? Make sure you have the extensions enabled (see above)!

#### Mac

- Right click, or double finger click / tap / whatever the file, click `Get Info` (or press Command (⌘)–I).
- Click the triangle next to `Name & Extension` to expand the section.
- To show the filename extension, deselect “Hide extension.”
- Remove the `.txt` part.

---

> Q: Won't changing the extension change the file somehow?

> A: No, extensions are arbitrary and add nothing to the file. They are just there as markers for programs to interpret them in some way, and for Windows / OSX / Linux to know what programs to open them with by default.

---
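
If you want to convince yourself of the answer above, here is a minimal Python sketch (not part of the assignment). It writes a throwaway file called `example.arff.txt`, renames it to `example.arff`, and compares a checksum of the bytes before and after. The file name, its content, and the helper names are made up for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of the bytes stored in a file."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def rename_keeps_content(content):
    """Write bytes to example.arff.txt, rename to example.arff, compare digests."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, "example.arff.txt")
        new_path = os.path.join(folder, "example.arff")
        with open(old_path, "wb") as handle:
            handle.write(content)
        before = sha256_of(old_path)
        os.rename(old_path, new_path)  # only the name changes
        after = sha256_of(new_path)
        return before == after


same = rename_keeps_content(b"@relation demo\n@data\n1,2\n")
print(same)
```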

## 1 - Interpreting Raw Data

Your first step as a data scientist is (as frequently repeated in the lecture) to *know* your data. Everyone can learn to fire up a program and click a few buttons. It is *your* task, however, to understand this data and generate creative insights and be able to communicate these (in the form of a scientific paper or business presentation). It is therefore important to know the possibilities and limitations of your data.

Open the `.arff` found in any of your selected datasets' directories in a plain text editor (e.g. `notepad` for Windows, `textedit` for Mac, and `gedit` for Linux). WEKA has its own format to record datasets and their structure, information on which can be found at http://www.cs.waikato.ac.nz/ml/weka/arff.html. It is roughly split into three blocks: the documentation (prepended with `%`), the features (prepended with `@`), and the data (separated by `,`). Make sure that you read the documentation, which provides (a link to) the required information to interpret and reason about your data without doing anything related to Data Mining yet.
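
To make the three blocks concrete, here is a small Python sketch that sorts the lines of a toy ARFF text into documentation, header, and data. It is a simplification under stated assumptions: the sample text is invented (it is not one of the course datasets), the function name is hypothetical, and values are split naively on `,`, so quoted values that contain commas would not be handled correctly. WEKA's own reader is more thorough.

```python
def split_arff(text):
    """Sort ARFF lines into documentation, header (@...), and data rows."""
    documentation, header, data = [], [], []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith("%"):
            # Documentation: drop the leading '%' and spaces.
            documentation.append(line.lstrip("% ").rstrip())
        elif in_data:
            # Naive split: does not handle quoted commas.
            data.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@data"):
            in_data = True  # everything after this is data
        elif line.startswith("@"):
            header.append(line)  # @relation and @attribute lines
    return documentation, header, data


# Invented toy example, only to show the structure.
SAMPLE = """% Toy example, not one of the course datasets.
@relation toy
@attribute price numeric
@attribute gender {male,female}
@data
4,female
8,male
"""

docs, header, rows = split_arff(SAMPLE)
print(docs)
print(header)
print(rows)
```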

### Quick Refresher

Each line of data you see in the `.arff` file is called a **feature vector**. Here, we treat a vector as a row of values or attributes (**features**) that describe each instance. An instance can be a passenger (Titanic), person (UCI HAR), or character (GoT). Their features can be `numeric` (some number) or `nominal` (one of a restricted set of labels). So for Titanic, the ticket price paid by the passengers is numeric, whilst their gender is nominal. Nominal values can in turn be transformed to a number (0 for male, 1 for female, for example). The beauty of this is that each instance can be represented as a point in some graph. So say that we have a female who bought a 4 dollar ticket, and a male who bought an 8 dollar ticket, we would expect this graph to look like:

             1|   x (female, 4)
              |
       gender |
              |
              |__________x_ (male, 8)
             0     price   10

As we will see a bit further into this assignment, WEKA does a way better job at visualizing these feature vectors.
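
The same two passengers can be turned into points with a few lines of Python. This is only a sketch of the idea from the refresher: the 0/1 coding for gender follows the example above, and the names `GENDER_CODES` and `to_point` are made up for illustration.

```python
# Nominal labels mapped to numbers, as in the example above.
GENDER_CODES = {"male": 0, "female": 1}


def to_point(price, gender):
    """Turn one (price, gender) instance into an (x, y) point."""
    return (float(price), GENDER_CODES[gender])


passengers = [("4", "female"), ("8", "male")]
points = [to_point(price, gender) for price, gender in passengers]
print(points)  # [(4.0, 1), (8.0, 0)]
```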

### !!! Important Note !!!
If you cannot answer a task question fully, or feel unsure about your answer, please ask. It is very important at this stage that you have the correct intuitions for each of the points we discuss here. Sometimes they just don't 'click' by themselves, so please let us help you. Moreover, this stage is exploratory **only**; we're not going to do actual classification just yet. Hence, at this point there is no concrete way to validate your answers and hypotheses (yet) other than asking us!

### Tasks

1. Determine the main classification task of the sets you have selected by reading the file's documentation.
2. Try to reason which features you think are informative for this classification problem.
2. Can you come up with more than just this classification task?
3. Was it easy to interpret the documentation? Does the data require expert knowledge? How do different sets compare in this regard? 
4. Can you find any features in these datasets that would allow for regression rather than the classification of the class label?

## 2 - Interpreting Data using WEKA

Now load the datasets you selected in WEKA `-->` Explorer `-->` Open file. We're not going into any tabs or other parts of WEKA yet, just purely looking at the data.

### Tasks

5. Above the histogram, select the class you determined in Task 1.1 is indicative of the classification task.
6. In the left pane, click some of the features and explain for yourself what the colours indicate for this feature and try to interpret the `min`, `max`, `mean`, and `st. dev` and their relation to these features.
7. Deduce from the histogram if there are already some features that are informative for the classification task (i.e. there's a clear proportion difference between classes).
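
If the summary numbers in WEKA feel abstract, it can help to compute them once by hand. The Python sketch below does this for an invented list of ticket prices. It assumes the sample standard deviation (dividing by n - 1); check for yourself whether that matches what WEKA reports. The function name `summarize` is hypothetical.

```python
import statistics


def summarize(values):
    """Return min, max, mean, and sample standard deviation of a list."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "st_dev": statistics.stdev(values),  # sample st. dev (n - 1)
    }


# Invented prices, not taken from the Titanic dataset.
prices = [4.0, 8.0, 6.0, 10.0]
summary = summarize(prices)
print(summary)
```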

## 3 - Preliminary Manual Cleaning

Now select **one** of the datasets. If this is **NOT** the UCI HAR dataset (only numeric and a LOT of features to go through): click each of the features; if WEKA shows that an 'Attribute is neither numeric nor nominal', select it and click `Remove` (down below). Repeat until you only have numeric or nominal features left.

8. Are there any features left that might 'pollute' your classification (i.e. already give away information that directly relates to the label)? How can you judge from the histogram? Do some have no added value (like student number)? If so, remove those.
9. What do you foresee will happen if we would keep those features?
10. Why don't these features belong in our classification problem?

---

> Q: My set of choice doesn't have any 'pollution', do I skip this?

> A: Open the Game of Thrones `character` set and try to answer these questions with it.

---


## 4 - Visualizing Features using WEKA

Now go to the `Visualize` tab (up top). If the plots and dots are too small on your screen, you can increase the PlotSize and / or PointSize and click `update`. You can change `jitter` to slightly move the points from their original location, so if there are 1000 points bunched up on coordinate (1,1) they show up as a cloud of dots rather than just one.
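
A rough Python sketch of the idea behind jitter: add a little random noise to each coordinate so that stacked points become distinguishable. How WEKA implements this internally is not covered here; the function name, the `amount` parameter, and the uniform noise are assumptions for illustration.

```python
import random


def jitter(points, amount, seed=0):
    """Shift every (x, y) point by a small random offset in both directions."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# 1000 points that all sit on exactly the same coordinate.
stacked = [(1.0, 1.0)] * 1000
spread = jitter(stacked, amount=0.05)

# Count how many distinct locations are visible before and after.
print(len(set(stacked)), len(set(spread)))
```

Note that jitter only changes what you see in the plot; the values in your dataset stay the same.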

This screen gives you visual information about groups of features, where the features on the left are on the Y-axis, and features on top are on the X-axis (like the example in the quick refresher). 

### Tasks

11. Can you find any feature combinations where you see that there are very clear clouds of one colour (i.e. this feature combination very well describes the classification task)?
12. If not, why do you think this is the case?
13. Do any of the graphs show a (visual) correlation? How can you tell if this is, or is not the case? 
14. If there is --- or would be --- a correlation between two features, what would this tell you about these two features?
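
A visual impression of correlation can be backed up with a number. The sketch below computes the Pearson correlation coefficient by hand in Python for two invented feature columns; values close to 1 or -1 indicate a strong linear relation, values near 0 indicate none. This is one common measure, not necessarily the one you will use later in the course, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two features vary together.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How each feature varies on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented values: the second feature rises with the first,
# the third falls as the first rises.
feature_a = [1, 2, 3, 4]
rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]

print(pearson(feature_a, rising))
print(pearson(feature_a, falling))
```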

## 5 - Extra: Back to the Drawing Board

Now that you've gained some extra insight into these datasets, open their `.arff` files again in your text editor.

### Tasks

15. Can you split or combine current features to create new features?
16. Imagine you have other data available on the subject of one of your datasets, what features would you add to help classification?
