# Omigami Tutorial
Omigami's main class, `FeatureSelector`, implements the minimally biased variable selection algorithm described in _Shi L, Westerhuis JA, Rosén J, Landberg R, Brunius C. Variable selection and validation in multivariate modelling. Bioinformatics. 2019 Mar 15;35(6):972-980. doi: 10.1093/bioinformatics/bty710. PMID: 30165467; PMCID: PMC6419897._ For more details regarding the algorithm, please refer to the original paper. 

This notebook walks through a simple application of the tool.

## Imports and client setup
We start by importing `FeatureSelector` from `omigami` and setting up a Dask cluster to enable parallel computation.
The cluster runs on our local machine, and a dashboard showing the state of the computation will be available at http://localhost:8787/status.

In [1]:
from omigami.omigami import FeatureSelector
import dask

# Spin up a local cluster using dask 
from dask.distributed import Client
client = Client()

## Load the data
For this tutorial, we'll use the "mosquito" dataset (_Buck M. et al. (2016) Bacterial associations reveal spatial population dynamics in Anopheles gambiae mosquitoes. Sci. Rep., 6, 22806._). The dataset contains 29 measurements of operational taxonomic units (OTUs) performed on mosquitoes. The target variable ("Yotu") is the village where the mosquitoes were collected.

In [2]:
import pandas as pd

df = (
    pd.read_csv("./mosquito.csv")
    .rename(columns={"Unnamed: 0": "sample_id"})
    .set_index("sample_id")
    .sample(frac=1)  # to shuffle
)
df.head()

| sample_id | OTU_0 | OTU_1 | OTU_2 | OTU_3 | OTU_4 | OTU_5 | OTU_6 | OTU_7 | OTU_8 | OTU_9 | ... | OTU_6670 | OTU_6675 | OTU_6681 | OTU_6685 | OTU_6686 | OTU_6691 | OTU_6693 | OTU_6702 | OTU_6709 | Yotu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VK5_9 | 1861 | 0 | 7 | 4 | 993 | 0 | 202 | 0 | 0 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | VK5 |
| VK7_30 | 24 | 6 | 72 | 58 | 0 | 0 | 1 | 0 | 0 | 97 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | VK7 |
| VK7_31.35 | 253 | 0 | 2 | 29 | 0 | 1 | 857 | 2 | 0 | 8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | VK7 |
| VK5_5 | 317 | 0 | 3 | 0 | 3259 | 0 | 179 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | VK5 |
| VK3_47 | 396 | 0 | 14 | 19 | 36 | 0 | 0 | 0 | 0 | 123 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | VK3 |


We have to separate the predictor variables `X` (OTUs) from the target variable `y` ("Yotu").

In [3]:
X = df.drop(columns=["Yotu"]).values
y = df.Yotu.values
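
To make the loading and splitting steps easy to try without the full file, here is a minimal sketch on a three-sample stand-in for `mosquito.csv`. The inline CSV text and the names `toy`, `X_toy`, and `y_toy` are assumptions made for illustration; only the `rename`/`set_index`/`drop` pattern mirrors the cells above.

```python
import io

import pandas as pd

# Hypothetical three-sample excerpt standing in for mosquito.csv.
# The empty first header cell is what pandas reads as "Unnamed: 0".
csv_text = """,OTU_0,OTU_1,Yotu
VK5_9,1861,0,VK5
VK7_30,24,6,VK7
VK3_47,396,0,VK3
"""

toy = (
    pd.read_csv(io.StringIO(csv_text))
    .rename(columns={"Unnamed: 0": "sample_id"})
    .set_index("sample_id")
)

# Predictors: every column except the target, as a 2-D NumPy array.
X_toy = toy.drop(columns=["Yotu"]).values
# Target: the village label of each sample, as a 1-D NumPy array.
y_toy = toy["Yotu"].values

print(X_toy.shape, y_toy.shape)
```

`X_toy` has one row per sample and one column per OTU, while `y_toy` has one label per sample, which is the layout passed to `fit` later in this notebook.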

## Feature selection
Now we are ready to perform the feature selection. First, we instantiate the feature selector class. To keep the runtime short, we specify few repetitions, few CV splits, and a high feature dropout rate (the fraction of features dropped at every step of the recursive feature elimination). In a real-world scenario, a higher number of repetitions and a lower dropout rate would be appropriate.

In [4]:
fs = FeatureSelector(n_outer=4, repetitions=3, metric="MISS", estimator="RFC", features_dropout_rate=0.2)
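
To get a feeling for why a higher dropout rate shortens the runtime, the following sketch counts how many feature subsets a recursive elimination loop visits for a given rate. It is an illustration only: the helper `elimination_schedule`, the rounding rule (round down, drop at least one feature), and the starting count of 1000 features are assumptions, not a description of Omigami's internals.

```python
import math


def elimination_schedule(n_features, dropout_rate, min_features=1):
    """Return the feature counts visited when a fixed fraction is dropped per step."""
    schedule = [n_features]
    while schedule[-1] > min_features:
        current = schedule[-1]
        # Drop at least one feature per step so the loop always terminates.
        n_drop = max(1, math.floor(current * dropout_rate))
        schedule.append(max(min_features, current - n_drop))
    return schedule


# Compare a low and a high dropout rate on a hypothetical 1000-feature problem.
for rate in (0.05, 0.2):
    steps = elimination_schedule(1000, rate)
    print(f"dropout rate {rate}: {len(steps)} feature subsets evaluated")
```

Under these assumptions, a rate of 0.2 visits noticeably fewer subsets than 0.05, at the cost of a coarser search over the number of features.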



Now we can call the `fit` method, which computes the selected features. After fitting, the selected features are also available as the `selected_features` attribute.

In [5]:
%%time 
selected_features = fs.fit(X, y)

CPU times: user 8.75 s, sys: 1.35 s, total: 10.1 s
Wall time: 3min 6s


### Sample correlation
The `fit` method assumes that every sample is independent of the others. If the samples are correlated, e.g. because several of them belong to the same patient, an additional vector `groups` should be passed to the `fit` method:
```python
>>> groups
numpy.array([1, 1, 1, 2, 2, ..., 3, 1, 2])
>>> fs.fit(X, y, groups=groups)
```
This vector gives, for each sample, the integer index of the group it belongs to.
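
If the grouping information is stored as labels rather than integers, one way to build such a vector is sketched below. The `patient_ids` array is hypothetical; the mosquito dataset used in this tutorial does not come with patient identifiers.

```python
import numpy as np

# Hypothetical patient label for each sample, in the same order as the rows of X.
patient_ids = np.array(["p1", "p1", "p2", "p3", "p2"])

# return_inverse gives, for every sample, the index of its label among the
# sorted unique labels, i.e. an integer group index per sample.
unique_patients, groups = np.unique(patient_ids, return_inverse=True)

print(unique_patients)
print(groups)
```

The resulting `groups` array has one entry per sample and could then be passed as `fs.fit(X, y, groups=groups)`.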

### Results
Now we can print the selected features:

In [6]:
feature_names = df.drop(columns=["Yotu"]).columns
selected_features_min_model = list(fs.selected_features["min"])
print(feature_names[selected_features_min_model])

Index(['OTU_4', 'OTU_39', 'OTU_243', 'OTU_16', 'OTU_3454', 'OTU_1208',
       'OTU_400', 'OTU_28'],
      dtype='object')
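
A common next step is to restrict the data to the selected columns before training a final model. The sketch below assumes, as the cell above suggests, that the selection is a collection of integer column indices; the arrays `X_demo` and `feature_names_demo` and the list `selected_idx` are made up for illustration and are not Omigami output.

```python
import numpy as np

# Hypothetical data: 5 samples, 6 features named OTU_0 ... OTU_5.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(5, 6))
feature_names_demo = np.array([f"OTU_{i}" for i in range(6)])

# Hypothetical selection, standing in for list(fs.selected_features["min"]).
selected_idx = [4, 1]

# Integer-list indexing keeps only the selected columns, in the given order.
X_reduced = X_demo[:, selected_idx]

print(feature_names_demo[selected_idx])
print(X_reduced.shape)
```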


## Cleanup
Finally, we can close the cluster client.

In [7]:
client.close()