&nbsp;
&nbsp;

# Welcome to Feature Factory for Mutagenesis

Feature Factory is an online infrastructure that lets you quickly prototype and test features for different machine learning problems.

Before you begin using Feature Factory, we highly recommend that you familiarize yourself with IPython Notebook. IPython Notebook provides an interactive Python kernel that lets you run code in separate cells. Variables created by the code live in the notebook's Python kernel and can be accessed at any time, by any cell. More information can be found at http://ipython.org/notebook.html

# Creating your own IPython Notebook

To get started with Feature Factory, please clone the Template notebook. To do this, click "File"->"Make a Copy". This should spawn a new tab within your browser with the copied notebook. Rename the notebook to your liking and make all edits on that notebook.

&nbsp;
&nbsp;


# Mutagenesis Machine Learning Competition

## Problem Statement

Mutagenesis is a well-known benchmark machine learning dataset. It comprises 230 molecules trialled for mutagenicity on Salmonella typhimurium. Debnath et al. [2] showed that a subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining 42 molecules are named the "regression unfriendly" subset.

Despite its low intrinsic interest from a machine learning perspective, authors have historically concentrated on the set of 188 molecules (hereafter referred to as the mutagenesis dataset). Of the 188 molecules, 125 have positive log mutagenicity, whereas 63 have zero or negative log mutagenicity. The molecules with positive log mutagenicity are labelled active and the remaining ones are labelled inactive. Predicting mutagenicity for these molecules is therefore a binary classification problem.
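
The labelling rule described above can be sketched in a few lines of pandas. The log mutagenicity values below are made up for illustration and are not taken from the dataset; the column names `log_mutagenicity` and `label` are also hypothetical.

```python
import pandas as pd

# Hypothetical log mutagenicity values, for illustration only.
molecules = pd.DataFrame({
    "molecule_id": ["m1", "m2", "m3", "m4"],
    "log_mutagenicity": [2.1, 0.0, -1.3, 0.4],
})

# Positive values are labelled active; zero or negative values are inactive.
molecules["label"] = molecules["log_mutagenicity"].gt(0).map(
    {True: "active", False: "inactive"}
)

print(molecules)
```

Note that a value of exactly zero falls on the inactive side of the threshold.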

Debnath et al. [2] identified two chemical features and two indicator features for effective learning. The chemical (C) features are the energy of the lowest unoccupied molecular orbital (LUMO) and the water/octanol partition coefficient (LOGP); the other two features are pre-coded structural (PS) attributes.

Srinivasan et al. [1] encoded structural features of the molecules using first-order logic. The key information is given in the form of atom and bond (AB) descriptions. These descriptions are used to define functional groups (FG), including methyl groups, nitro groups, aromatic rings, hetero-aromatic rings, connected rings, ring length and benzene rings.

In this competition, you are challenged to work with this dataset and attempt to identify and derive or generate the features which would help the most in predicting whether a particular molecule will be active or inactive.

1. Srinivasan, A., Muggleton, S.H., King, R., Sternberg, M.: Theories for mutagenicity: a study of first-order and feature based induction. Artificial Intelligence 85 (1996) 277–299

2. Debnath, A.K., de Compadre, R.L.L., Debnath, G., Schusterman, A.J., Hansch, C.: Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34 (1991) 786–797


## Data

The dataset is in a relational format, split among multiple files. When using **commands.get_sample_dataset()** to retrieve the dataset, the files are provided as a list of *pandas.DataFrame* objects with the following columns:

![Mutagenic Data Model](https://relational.fit.cvut.cz/assets/img/datasets-generated/mutagenesis.svg)
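
Because the data is relational, most features require joining tables on their key columns. The sketch below assumes the column names that appear in the sample output later in this notebook (`atom_id`, `molecule_id`, `atom1_id`, `atom2_id`) and uses a handful of rows copied from that output. It shows one way to attach a `molecule_id` to each bond by looking it up through the bond's first atom.

```python
import pandas as pd

# A few atom and bond rows for molecule d100, copied from the sample output.
atoms = pd.DataFrame({
    "atom_id": ["d100_1", "d100_10", "d100_11", "d100_12", "d100_13"],
    "molecule_id": ["d100"] * 5,
    "element": ["c", "h", "c", "c", "c"],
})
bonds = pd.DataFrame({
    "atom1_id": ["d100_1", "d100_1", "d100_11"],
    "atom2_id": ["d100_2", "d100_7", "d100_12"],
    "type": [7, 1, 7],
})

# Bonds carry no molecule_id of their own; look it up through the first atom.
bonds_with_molecule = bonds.merge(
    atoms[["atom_id", "molecule_id"]],
    how="left",
    left_on="atom1_id",
    right_on="atom_id",
).drop(columns="atom_id")

print(bonds_with_molecule)
```

A left join keeps every bond even if its first atom is missing from the atom table; such bonds would end up with a missing `molecule_id`.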

## Step-by-Step Example

Step 1: Import the feature factory infrastructure

In [1]:
from problems.mutagenesis import commands

&nbsp;
&nbsp;

Step 2: Create a username and password, or log in to an existing account. If you create an account successfully, you don't need to log in; you are logged in automatically.

In [2]:
commands.create_user('stair_2', '')

username already exists


In [4]:
commands.login('stair_2', '')

user successfully logged in


&nbsp;
&nbsp;

Step 3: To ensure that this notebook is mapped to your username, it is required that you execute the command below. 

In [5]:
commands.add_notebook('Mutagenesis_stair_2')

Notebook already registered


&nbsp;
&nbsp;

Step 4: Get a sample dataset. This will allow you to test your feature before running it on the full data in the server. Remember that the dataset is a list of [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [6]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the molecule data
# dataset[1] <- this refers to the atom data
# dataset[2] <- this refers to the bond data

In [7]:
dataset[0][:5]

Unnamed: 0,molecule_id,ind1,inda,logp,lumo
0,d100,0,0,2.68,-1.034
1,d101,1,0,6.26,-1.598
2,d103,1,0,4.69,-1.487
3,d105,1,0,1.84,-1.749
4,d106,1,0,4.34,-1.607


In [8]:
dataset[1][:5]

Unnamed: 0,atom_id,molecule_id,element,type,charge
0,d100_1,d100,c,22,-0.128
1,d100_10,d100,h,3,0.132
2,d100_11,d100,c,29,0.002
3,d100_12,d100,c,22,-0.128
4,d100_13,d100,c,22,-0.128


In [9]:
dataset[2][:5]

Unnamed: 0,atom1_id,atom2_id,type
0,d100_1,d100_2,7
1,d100_1,d100_7,1
2,d100_11,d100_12,7
3,d100_12,d100_13,7
4,d100_12,d100_17,1
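
Before writing a feature, it can help to summarise the child tables per molecule. The following sketch rebuilds the five atom rows shown above by hand (so it runs without the Feature Factory server) and computes an atom count and a mean charge for each molecule. The output column names `n_atoms` and `mean_charge` are illustrative choices.

```python
import pandas as pd

# The five atom rows shown in the sample output above.
atoms = pd.DataFrame({
    "atom_id": ["d100_1", "d100_10", "d100_11", "d100_12", "d100_13"],
    "molecule_id": ["d100"] * 5,
    "element": ["c", "h", "c", "c", "c"],
    "type": [22, 3, 29, 22, 22],
    "charge": [-0.128, 0.132, 0.002, -0.128, -0.128],
})

# One row per molecule: how many atoms it has and their average charge.
summary = atoms.groupby("molecule_id").agg(
    n_atoms=("atom_id", "count"),
    mean_charge=("charge", "mean"),
)

print(summary)
```

With the real sample, `atoms` would simply be `dataset[1]`.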


&nbsp;
&nbsp;

Step 5: Define your feature extraction function.

The name you give to the function is the name that will be used later on to register your feature extraction function and the score it obtains.

Your function should simply take the dataset list as a parameter and output an N x M numpy matrix or pandas DataFrame, where N is the number of molecules (one row per molecule) and M is the number of features that will be used for the prediction.
Bear in mind that sorting is important and that, in order to properly evaluate your function score, the extracted features should preserve the order of the molecules table.
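
The ordering requirement is easy to violate, because `groupby` returns its result sorted by key rather than in the order of the molecule table. The sketch below shows one way to re-align an aggregated feature with `reindex`. The molecule IDs come from the sample output, but their shuffled order and the atom rows are made up for illustration.

```python
import pandas as pd

# Molecule table in its own row order (deliberately not sorted by ID).
molecules = pd.DataFrame({"molecule_id": ["d103", "d100", "d101"]})

# Hypothetical atom rows: d103 has 3 atoms, d100 has 2, d101 has 1.
atoms = pd.DataFrame({
    "atom_id": ["d100_1", "d100_10", "d101_1", "d103_1", "d103_2", "d103_3"],
    "molecule_id": ["d100", "d100", "d101", "d103", "d103", "d103"],
})

# groupby sorts by key, so this comes back ordered d100, d101, d103.
atom_counts = atoms.groupby("molecule_id").size()

# Re-align to the molecule table's row order; molecules with no atoms get 0.
features = pd.DataFrame({
    "n_atoms": atom_counts.reindex(molecules["molecule_id"], fill_value=0).to_numpy()
})

print(features)
```

Row `i` of `features` now describes row `i` of the molecule table, which is what the evaluation expects.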

Also note that, even though the system allows you to do so, any feature extraction function which makes use of the outcome column will be disqualified.

**WARNING:** Your functions have to be self contained!

This means that you can use helper functions or import external modules but that any import or variable definition needs to be made within the functions which use them.

Cross validation is (intentionally) run in a separate process in order to make sure that this scope pattern is preserved, and will fail if the function uses anything defined somewhere else in the notebook.

You might be wondering why we require this. The reason is that the code of your function might be executed and further evaluated in different environments where the variables and modules defined in your notebook will not be available.
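
To see what "self contained" means in practice, the sketch below executes the source of two hypothetical feature functions in a fresh, empty namespace. This is only a rough local stand-in for running the code in a different environment; it is not how Feature Factory itself performs the check. The function names and the two-row dataset are illustrative.

```python
import pandas as pd

# Imports numpy inside the function that uses it.
SELF_CONTAINED = '''
def good_feature(dataset):
    import numpy as np
    return np.log(-dataset[0][["lumo"]])
'''

# Relies on an "np" that only exists at notebook level.
NOT_SELF_CONTAINED = '''
def bad_feature(dataset):
    return np.log(-dataset[0][["lumo"]])
'''

def run_in_fresh_namespace(source, name, dataset):
    """Define the function in an empty namespace, then call it."""
    namespace = {}
    exec(source, namespace)
    return namespace[name](dataset)

# Two lumo values copied from the sample output.
dataset = [pd.DataFrame({"lumo": [-1.034, -1.598]})]

good = run_in_fresh_namespace(SELF_CONTAINED, "good_feature", dataset)

try:
    run_in_fresh_namespace(NOT_SELF_CONTAINED, "bad_feature", dataset)
    bad_error = None
except NameError as exc:
    # The function cannot see any "np" outside its own body.
    bad_error = exc

print(good.shape, repr(bad_error))
```

The second function fails with a `NameError` even if `import numpy as np` was run earlier in the notebook, because that import is not part of the function's own code.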

In [17]:
def Mutagenesis_Feature_Extractor_stair_2(dataset):
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder

    # mar = Molecule-Atom Relations
    mar = pd.concat([dataset[1]['molecule_id'], 
                    dataset[1].drop(['molecule_id'], axis=1)], axis=1)
    mar = mar.rename(index=str, columns={'type': 'mtype', 'charge': 'mcharge'})
    mar = mar.merge(dataset[2], how='outer', left_on='atom_id', right_on='atom1_id')\
            .drop(['atom1_id'], axis=1)
    mar['atom2_id'] = mar['atom2_id'].fillna('Nan')  
    mar['type'] = mar['type'].fillna(0)  

    le = LabelEncoder()
    mar['atom_id'] = le.fit_transform(mar['atom_id'])
    mar['element'] = le.fit_transform(mar['element'])
    mar['atom2_id'] = le.fit_transform(mar['atom2_id'])

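    # One-hot encode every remaining column except the atom charge.
    # Note: scikit-learn 1.2+ names this argument sparse_output instead of sparse.
    enc = OneHotEncoder(sparse=False)
    mar = pd.concat([mar['molecule_id'], pd.DataFrame(
                enc.fit_transform(mar.drop(['molecule_id', 'mcharge'], axis=1).values))],
              axis=1)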
        
    mar = mar.groupby(['molecule_id'], as_index=False).sum()
    features = dataset[0].merge(mar, how='inner', on='molecule_id')
    X = features.drop(['molecule_id'], axis=1)

    return X
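
The core pattern in the function above, one-hot encoding categorical atom columns and then summing per molecule, amounts to counting how often each category occurs in each molecule. The sketch below shows that idea on its own for the `element` column, using `pandas.get_dummies` in place of the scikit-learn encoders. It is a simplified illustration, not a replacement for the function above; the name `element_counts` and the atom rows for `d101` are hypothetical.

```python
def element_counts(dataset):
    # Imports live inside the function so that it stays self contained.
    import pandas as pd

    molecules, atoms = dataset[0], dataset[1]

    # One indicator column per element, one row per atom.
    one_hot = pd.get_dummies(atoms["element"], prefix="element", dtype=int)

    # Summing the indicators per molecule yields per-element atom counts.
    counts = one_hot.groupby(atoms["molecule_id"]).sum()

    # Keep the row order of the molecule table; missing molecules get zeros.
    counts = counts.reindex(molecules["molecule_id"], fill_value=0)
    return counts.reset_index(drop=True)


import pandas as pd

sample = [
    pd.DataFrame({"molecule_id": ["d100", "d101"]}),
    pd.DataFrame({
        "molecule_id": ["d100", "d100", "d100", "d101", "d101"],
        "element": ["c", "h", "c", "c", "o"],
    }),
]
result = element_counts(sample)
print(result)
```

Each output row lines up with the corresponding row of the molecule table, and each column counts one element.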

&nbsp;
&nbsp;

Step 6: Evaluate the score of your feature extraction function before submitting it.

You can use the cross_validate command as many times as needed to preview the score your function will obtain.

In [19]:
#X = Mutagenesis_Feature_Extractor_stair_2(dataset)
#X.shape
commands.cross_validate(Mutagenesis_Feature_Extractor_stair_2)

Obtaining dataset
Extracting features
Cross validating


0.87599206349206338

&nbsp;
&nbsp;

Step 7: Register your function in the system

Once you are satisfied with the results, you can call the add_feature command passing your function as an argument.
This will cross_validate the function again and store your code and your score for future analysis.

Again, remember that your function code must be self contained and import or define everything it needs to be run successfully.

In [20]:
commands.add_feature(Mutagenesis_Feature_Extractor_stair_2)

Obtaining dataset
Extracting features
Cross validating
Your feature Mutagenesis_Feature_Extractor_stair_2 scored 0.8759920634920634
Feature Mutagenesis_Feature_Extractor_stair_2 successfully registered


&nbsp;
&nbsp;

Step 8: (Optional) Modify and update your function code.

If you discover that your function can be improved, you can add it to the system again, as many times as required, under the same function name.

However, for clarity, we recommend using this option only to fix problems or make small improvements within a similar approach.

If you want to start a different feature extraction strategy, we strongly recommend that you register it under a new name.

In [10]:
def imports():    # We need to import numpy within our functions
    global np
    import numpy as np

def compute_log(feature):
    """Compute log of the given column."""
    return np.log(feature)
    
def example_feature(dataset):
    imports()
    
    df = dataset[0][['lumo', 'ind1']].copy()
    df['log'] = compute_log(-df['lumo'])
    return df

commands.add_feature(example_feature)

Obtaining dataset
Extracting features
Cross validating
Your feature example_feature scored 0.8895975056689341
Feature example_feature successfully registered
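
The example above relies on `lumo` being negative, so that `-lumo` is positive before the logarithm is taken; `np.log` returns NaN for negative inputs and `-inf` for zero. The variant below is a sketch of the same feature with the helper nested inside the function and non-positive inputs mapped explicitly to NaN. The name `example_feature_nested` and the positive `lumo` value in the test data are hypothetical, and whether the full dataset contains non-negative `lumo` values is not stated here.

```python
def example_feature_nested(dataset):
    # Import and helper are both defined inside the function that uses them.
    import numpy as np

    def compute_log(feature):
        """Log of the column; non-positive inputs become NaN."""
        return np.log(feature.where(feature > 0))

    df = dataset[0][["lumo", "ind1"]].copy()
    df["log"] = compute_log(-df["lumo"])
    return df


import pandas as pd

# First value copied from the sample output; the second is made up.
sample = [pd.DataFrame({"lumo": [-1.034, 0.5], "ind1": [0, 1]})]
out = example_feature_nested(sample)
print(out)
```

How missing values should be handled afterwards (dropping, filling, or leaving them) depends on what the evaluation accepts, which this notebook does not specify.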
