# Feature Representation

The format of real-world data varies greatly, and the data often needs to be modified before it can be used in machine learning models. It must often be transformed to create a **representation** of the data that a model can use.

Representing features optimally is the most important technical task in most machine learning problems. It is where practitioners of machine learning spend most of their time and it often has _the largest impact on performance_ of anything you can do.


Data comes in many types:

* Real-valued numbers: petal length
* Categorical data: species name
* Constrained data: ratios or compositional data


## Real-world data is messy

For example, let's think about a classifier that identifies genes involved in resisting a plant disease.

If the gene is involved, it has label=1. If it is not involved, it has label=0.

Each gene has the following features:
 
feature | type | example
--------|------|--------
functional_category | string | "K00680"
gc_content | float between 0 and 1 | 0.56
length | positive int | 1901
identified_promoter | Boolean | True
intron_length | positive int | 300
 
 ```
 0:{functional_category: "K00680",
    gc_content: 0.56,
    length: 1901,
    identified_promoter: True,
    intron_length: 300
    }
 ```
 
### Encoding strings with one hot encoding

In the first example, the functional category is essentially a level in a factor variable. A common way to encode this in machine learning is **_one hot encoding_**.

Each unique string in a dataset is given a position in a feature vector and assigned a 1 if it is present.

Possible functional categories:
```
categories = ["K00001", "K00456", "K00680", ...]
example    = [    0   ,    0    ,    1    , ...]
```

With the functional category encoded, this example becomes:

 ```
 0:{functional_category: [0,0,1],
    gc_content: 0.56,
    length: 1901,
    identified_promoter: True,
    intron_length: 300
    }
```

A one hot vector can get really long, but don't worry, we can encode that long vector as a sparse vector that doesn't take up too much memory.
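
As a minimal sketch of the idea above, the following pure-Python code builds a one hot vector and then a sparse version that stores only the non-zero positions. The three-category vocabulary and the helper names `one_hot` and `to_sparse` are illustrative assumptions; in practice a library encoder would usually do this work.

```python
# Illustrative vocabulary: in a real dataset this would be every unique
# functional category seen in the training data.
categories = ["K00001", "K00456", "K00680"]

# Map each category string to its position in the feature vector.
index = {category: position for position, category in enumerate(categories)}


def one_hot(value):
    """Return a dense one hot vector for a single category string."""
    vector = [0] * len(categories)
    vector[index[value]] = 1
    return vector


def to_sparse(vector):
    """Keep only the non-zero entries as {position: value}."""
    return {position: value for position, value in enumerate(vector) if value != 0}


dense = one_hot("K00680")
sparse = to_sparse(dense)
print(dense)   # [0, 0, 1]
print(sparse)  # {2: 1}
```

However long the dense vector gets, the sparse form for a single category still holds one entry.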

### Numeric values

* Numeric values, both integer and real, are already in a form that can be used as features in an ML model.

* Be careful, though: sometimes a number is really just a label. Suppose we had trimmed the "K" prefix off our functional category feature. Now the value would be `00680`, an integer that corresponds to an N-acyltransferase enzyme in the KEGG Orthology database. That has nothing to do with `00681`, which is a glutathione hydrolase enzyme. In this case we still need to use one hot encoding.

**Does it matter that these numeric values are in different number spaces (real, integer, constrained)?**

Yes, it does. Often data needs to be standardized or binned to make it easier to learn from.
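
To make standardization and binning concrete, here is a small sketch using only the Python standard library. The gene lengths, the bin edges, and the helper names `standardize` and `bin_value` are made up for illustration; they are not part of the dataset described above.

```python
from statistics import mean, pstdev

# Hypothetical gene lengths, used only for illustration.
lengths = [1901, 850, 1200, 3100, 640]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def bin_value(value, edges):
    """Return the index of the bin a value falls into, given sorted bin edges."""
    return sum(value >= edge for edge in edges)


scaled = standardize(lengths)

# Assumed bin edges: short (< 1000), medium (1000 to 1999), long (>= 2000).
bins = [bin_value(v, [1000, 2000]) for v in lengths]
print(bins)  # [1, 0, 1, 2, 0]
```

Standardizing puts features measured on very different scales, such as `gc_content` and `length`, into a comparable range, while binning turns a continuous value into a small set of ordered categories.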

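### Putting it together: encoding this example

```{python}
# Example 0 as a flat feature vector, in this order:
# one hot functional_category (3 values), gc_content, length,
# identified_promoter (True -> 1), intron_length
features = {0: [0.0, 0.0, 1.0, 0.56, 1901, 1, 300]}
```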
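
The same flattening can be written as a function. This is a sketch under the assumption that the vocabulary contains only the three categories listed earlier; the name `encode` is hypothetical.

```python
# Illustrative vocabulary for the one hot encoded functional category.
categories = ["K00001", "K00456", "K00680"]

record = {
    "functional_category": "K00680",
    "gc_content": 0.56,
    "length": 1901,
    "identified_promoter": True,
    "intron_length": 300,
}


def encode(record):
    """Flatten one gene record into a list of numbers."""
    one_hot = [1.0 if record["functional_category"] == c else 0.0 for c in categories]
    return one_hot + [
        record["gc_content"],
        record["length"],
        int(record["identified_promoter"]),  # True -> 1, False -> 0
        record["intron_length"],
    ]


vector = encode(record)
print(vector)  # [0.0, 0.0, 1.0, 0.56, 1901, 1, 300]
```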


## What makes a good feature?

1. It should be non-zero more than a few times. If there are not many examples of it in the training data, it is of little value for learning.

2. Represent the values in a way that makes sense to humans if possible. This makes troubleshooting easier.

3. Don't use magic values as flags. For example, don't use -1 to indicate that there were no introns in the sample above.
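
One common alternative to a magic value is to add a separate indicator feature. This is an assumption here, not something the list above prescribes. The sketch below uses a hypothetical helper, `encode_intron_length`, that treats `None` as "no introns".

```python
def encode_intron_length(intron_length):
    """Encode an optional intron length as [has_introns, length].

    None means the gene has no introns. A separate 0/1 flag carries that
    information, so the length column never holds a magic value like -1.
    """
    if intron_length is None:
        return [0, 0]
    return [1, intron_length]


print(encode_intron_length(300))   # [1, 300]
print(encode_intron_length(None))  # [0, 0]
```

The model can then learn from the flag directly instead of having to discover that -1 is not a real length.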

# Cleaning up data 


The old adage "Garbage in, garbage out" very much applies to machine learning. Much of the work of the data scientist is data cleaning.

Most data cleaning starts with data visualization. If the feature space is not too large, the distribution of each variable can show outliers. Reduced representations of the data, such as PCA, can show outliers in high-dimensional data sets.
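
Plots are usually the first step, but a simple numerical check can complement them. The sketch below flags values more than 1.5 times the interquartile range outside the quartiles. This is a common rule of thumb swapped in for illustration, not a method prescribed by this lesson. The `gc_content` values are made up, with one entry that looks like a misplaced decimal point.

```python
from statistics import quantiles

# Made-up gc_content values; 5.6 is probably a data entry error for 0.56.
gc_content = [0.41, 0.44, 0.47, 0.52, 0.56, 0.58, 0.61, 5.6]

# quantiles(..., n=4) returns the three quartile cut points.
q1, _median, q3 = quantiles(gc_content, n=4)
iqr = q3 - q1

# Rule of thumb: anything beyond 1.5 * IQR from the quartiles is suspicious.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [v for v in gc_content if v < lower or v > upper]
print(outliers)  # [5.6]
```

A flagged value is only a candidate for removal. Here the value is impossible for a proportion, but in other cases an extreme value may be a real observation.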

There are a number of tools for cleaning data, including:

* [OpenRefine](http://openrefine.org/); see [this tutorial](https://datacarpentry.org/OpenRefine-ecology-lesson/00-getting-started/)


Once outliers have been removed you need to evaluate the need for data transformation and standardization.

Some methods, like random forest classifiers (covered later), do not need data to fit a particular distribution. Other methods, like logistic regression, work better if the data is transformed and standardized.

