# The elimination algorithm

<img src="x-factor-elimination-beatrice.jpg">

### Implement two versions of the algorithm

One with sets and one with pandas `DataFrame`s.

In class, we have seen that classification is a way to find a function that maps any input feature vector to an output label. Naturally, what we put in our feature vector is important for the quality of the output. If we encode whether we saw the example on a Thursday or not, for example, we should not expect that feature to tell us much about the label.

The Elimination Algorithm is a simple classification algorithm that assumes that the examples in a label class share important features, and that each label class can be expressed as a conjunction of those features.

In this exercise, you are supposed to implement the Elimination algorithm in two different ways. 

You are provided with a file `animals.csv` that contains a list of different types of animals, along with a set of traits that characterize each animal, as well as a label. Each line in the file represents an animal. For example, the first line,

````
aardvark,backbone,breathes,catsize,hair,legs4,milk,predator,toothed,mammal
````

is the aardvark, which is characterized by having a backbone, breathing, being about the same size as a cat, having a hairy body and four legs, producing milk for its offspring, hunting other animals for food, and having teeth. The final value on the line tells us that the aardvark is a mammal, which is the label of the example.
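The line format described above can be split into its three parts with plain string operations. The following is only a minimal sketch: the helper name `parse_animal_line` is hypothetical, and it assumes that the first field is always the animal name and the last field is always the label.

```python
def parse_animal_line(line):
    """Split one line into (name, set of traits, label).

    Assumes: first field = animal name, last field = label,
    everything in between = traits.
    """
    fields = line.strip().split(",")
    name, label = fields[0], fields[-1]
    traits = set(fields[1:-1])
    return name, traits, label


# The first line of the file, as quoted above.
name, traits, label = parse_animal_line(
    "aardvark,backbone,breathes,catsize,hair,legs4,milk,predator,toothed,mammal\n"
)
print(name, label, sorted(traits))
```

Storing the traits as a `set` makes the intersection and union operations used by the set-based variant directly available.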

Your task is to read in the file, and then run the algorithm for each label class. 

### Algorithm variant 1: set-based

The goal of the Elimination Algorithm is to find the set of traits that are common to all of the animals in a given class. That is, it computes the intersection of the trait sets of all the animals in the class:

$$H_{\text{mammal}} = A_{\text{aardvark}} \cap A_{\text{antelope}} \cap \ldots \cap A_{\text{wolf}}$$

It's called the elimination algorithm because this procedure can be performed step-wise (online), examining one animal at a time. You start out with an initial hypothesis $H_{\text{mammal}}^{(0)}$, which we here define as the set of all possible traits:

$$H_{\text{mammal}}^{(0)} = A_{\text{aardvark}} \cup A_{\text{antelope}} \cup \ldots \cup A_{\text{wolf}}$$

The hypothesis is then continuously updated, one animal at a time, e.g.

$$H_{\text{mammal}}^{(1)} = H_{\text{mammal}}^{(0)} \cap A_{\text{aardvark}}$$

and

$$H_{\text{mammal}}^{(2)} = H_{\text{mammal}}^{(1)} \cap A_{\text{antelope}}$$
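The update rule can be tried out on a handful of in-memory examples before touching the file. This is a sketch under stated assumptions: only the aardvark traits come from the file excerpt above, while the trait sets for the antelope and the wolf are invented for illustration, and the name `toy_mammals` is hypothetical.

```python
toy_mammals = {
    # Taken from the first line of animals.csv.
    "aardvark": {"backbone", "breathes", "catsize", "hair", "legs4",
                 "milk", "predator", "toothed"},
    # Invented trait sets, for illustration only.
    "antelope": {"backbone", "breathes", "hair", "legs4", "milk", "toothed"},
    "wolf": {"backbone", "breathes", "hair", "legs4", "milk",
             "predator", "toothed"},
}

# H^(0): the union of all traits seen in the class.
hypothesis = set().union(*toy_mammals.values())

# H^(i) = H^(i-1) intersected with the traits of the i-th animal.
for animal, animal_traits in toy_mammals.items():
    hypothesis &= animal_traits
    print(animal, sorted(hypothesis))

# The step-wise result should equal the intersection computed in one go.
direct = set.intersection(*toy_mammals.values())
```

Each step can only remove traits from the hypothesis, never add them, which is where the name "elimination" comes from.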

First, you need to read in the file. Unlike the CSV files you have been exposed to so far, the `animals.csv` file cannot be read into a DataFrame by the `read_csv` function in the Pandas library. Instead, you'll need to process the file line by line. Why do you think this is?

Compute the hypothesis for each of the animal classes in the data file.


In [3]:
animal_file = open("animals.csv")
# Maybe some of your code here as well?
for line in animal_file:
    # Your code here: Process the data
    pass


### Algorithm variant 2: `DataFrame`-based

In this variation of the algorithm, you'll be working with a different representation of the data. Instead of associating each animal with a set of traits, we'll represent the animal by a feature vector. When there are $k$ possible animal traits, this will be a $k$-dimensional binary feature vector. In `DataFrame` terms, this type of representation could be initialized with the following code, given that `animal_names` is a list of the animal names, and `traits` is a list of possible traits.

```python
M = pd.DataFrame(False, index=animal_names, columns=traits)
```
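One possible way to fill such an all-`False` frame is to switch on the traits of each animal with `.loc`. The sketch below uses a small hypothetical dictionary `animal_traits` in place of the parsed file; the antelope's traits are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the parsed file: animal name -> set of traits.
animal_traits = {
    "aardvark": {"backbone", "breathes", "catsize", "hair", "legs4",
                 "milk", "predator", "toothed"},
    "antelope": {"backbone", "breathes", "hair", "legs4", "milk", "toothed"},
}
animal_names = list(animal_traits)
traits = sorted(set().union(*animal_traits.values()))

# Start with every cell set to False ...
M = pd.DataFrame(False, index=animal_names, columns=traits)

# ... then mark the traits each animal actually has.
for name, present in animal_traits.items():
    M.loc[name, list(present)] = True

print(M)
```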

Note that `M` does not contain any data yet. An alternative way of creating the `DataFrame` and filling it with values at the same time is to build a list of dictionaries `animal_dicts`, each structured as below:

````python
{'aardvark': True, 'backbone': True, 'breathes': True, 'catsize': True,
 'hair': True, 'legs4': True, 'milk': True, 'predator': True, 'toothed': True, 'mammal': True}

````

Given such a list, the `DataFrame` can be created in the following manner:

```python
M = pd.DataFrame(animal_dicts, index=animal_names)
```
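When the dictionaries do not all share the same keys, pandas fills the missing cells with `NaN` rather than `False`. One way to sidestep this, sketched below, is to give every dictionary an explicit `True`/`False` entry for every trait. Note the assumptions: `animal_traits` is a hypothetical stand-in for the parsed file, the antelope's traits are invented, and the dictionaries here contain only trait keys, with names and labels kept separately.

```python
import pandas as pd

# Hypothetical stand-in for the parsed file: animal name -> set of traits.
animal_traits = {
    "aardvark": {"backbone", "breathes", "catsize", "hair", "legs4",
                 "milk", "predator", "toothed"},
    "antelope": {"backbone", "breathes", "hair", "legs4", "milk", "toothed"},
}
animal_names = list(animal_traits)
all_traits = sorted(set().union(*animal_traits.values()))

# One dictionary per animal, with an explicit value for every trait.
animal_dicts = [
    {trait: trait in animal_traits[name] for trait in all_traits}
    for name in animal_names
]

M = pd.DataFrame(animal_dicts, index=animal_names)
print(M)
```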

Create and fill the `DataFrame` in the way you prefer.

In [None]:
# Your code here

With the matrix-based data structure in hand, we can make the following observation. If you sum the feature vectors of all the animals in a particular class, the aggregated vector counts how many times each trait occurs in that class. Consequently, if a trait has an aggregated count less than $n$ in a class with $n$ instances, it cannot be part of the definition of the animal class, since it does not occur in every instance.

Compute the aggregated sums for each animal class, printing out the traits with a value of $n$.
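The counting argument can be sketched on a tiny hand-made frame. Everything in this example is an assumption for illustration: the four animals, the `feathers` trait, the `bird` class, and the separate `labels` series are made up and are not taken from `animals.csv`.

```python
import pandas as pd

# Toy binary feature matrix: rows are animals, columns are traits.
M = pd.DataFrame(
    {
        "backbone": [True, True, True, True],
        "hair": [True, True, False, False],
        "feathers": [False, False, True, True],
        "predator": [True, False, True, False],
    },
    index=["aardvark", "antelope", "hawk", "dove"],
)
# Class label of each animal, aligned with the rows of M.
labels = pd.Series(["mammal", "mammal", "bird", "bird"], index=M.index)

# Per-class trait counts and per-class number of instances n.
counts = M.astype(int).groupby(labels).sum()
class_sizes = labels.value_counts()

# Keep only the traits whose count equals n, i.e. traits present everywhere.
shared = {}
for label, n in class_sizes.items():
    row = counts.loc[label]
    shared[label] = sorted(row[row == n].index)
    print(label, shared[label])
```

On real data, the traits that survive this filter for a class should match the hypothesis produced by the set-based variant for the same class.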

In [4]:
# Your code here