# Challenge on stable compounds: A simplified multilabel symmetric problem
__Author: Dario Rocca__

The labels to be predicted are given in the form [1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0], with one such vector for each pair of elements formulaA and formulaB. Specifically, the labels correspond to the 1D binary phase diagram [100% element A, (90% element A-10% element B), (80% A-20% B), (70% A-30% B), (60% A-40% B), (50% A-50% B), (40% A-60% B), (30% A-70% B), (20% A-80% B), (10% A-90% B), 100% B], where a 1 indicates stability and a 0 indicates instability of the corresponding compound. This can be seen as a multilabel problem. As all the elements are stable in their pure form, we will drop 100% A and 100% B (first and last label). The specific list depends on the order of the elements. For example, the pair Ac-Tl corresponds to [1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0] while Tl-Ac would correspond to [1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0], which is the previous diagram in reverse order. I found that it is difficult for machine learning algorithms (random forest, neural network, one-vs-rest SVC, etc.) to learn this property of the data.

In this notebook I will start by simplifying the problem. Specifically, I will "symmetrize" the labels to create a list (or array) with only 5 entries [(90% A-10% B) or (10% A-90% B), (80% A-20% B) or (20% A-80% B), (70% A-30% B) or (30% A-70% B), (60% A-40% B) or (40% A-60% B), (50% A-50% B)]. For example, within this formulation we transform [1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0] into [0.0,1.0,1.0,1.0,1.0]. These labels cannot be used for the final prediction as they contain less information: for example, if a 1 occurs in the first position it is no longer possible to tell whether the stable mixture is (90% A-10% B), (10% A-90% B), or both. On the other hand, these labels are not affected by the order of the elements, as formulaA-formulaB and formulaB-formulaA correspond to exactly the same labels. This simplified problem is meant to let us estimate the level of subset accuracy (https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics) that can be achieved by using the features provided by the challenge and by treating the problem as multilabel. Here I will use a random forest, which is a quite powerful algorithm and can be trained on multilabel problems. When I go back to the full problem, which has more degrees of freedom (more labels), it will be challenging to improve on the "accuracy" of the simplified problem of this notebook.
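
The symmetrization described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used later in the notebook: the helper name `symmetrize` is hypothetical, and it assumes that every entry of the stability vector is exactly 0 or 1, so that taking the maximum of a mirrored pair acts as a logical "or". The two input vectors are the Ac-Tl and Tl-Ac examples from the introduction.

```python
import numpy as np

def symmetrize(stability_vec):
    """Collapse an 11-entry stability vector into 5 order-independent labels.

    Hypothetical helper; assumes every entry is exactly 0 or 1.
    """
    v = np.asarray(stability_vec, dtype=float)
    inner = v[1:10]  # drop the pure elements (100% A and 100% B)
    # Pair each composition with its mirror image (x% A <-> x% B); the pair
    # is labeled 1 if at least one of the two compositions is stable.
    mirrored = np.maximum(inner[:4], inner[::-1][:4])
    # The 50%-50% composition is its own mirror image.
    return np.append(mirrored, inner[4])

ac_tl = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]
tl_ac = ac_tl[::-1]  # swapping the two elements reverses the vector

sym_ac_tl = symmetrize(ac_tl)
sym_tl_ac = symmetrize(tl_ac)
print(sym_ac_tl, sym_tl_ac)
```

Both orderings give the same five labels, which is the property that makes the simplified problem independent of the order of the elements.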

## Loading libraries and data

In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


In [2]:
# Loading the training set with Pandas

train=pd.read_csv("training_data.csv") # creating the train dataframe

## Preparation of "symmetric" labels to be predicted

As explained in the introduction I will transform the full set of labels [100% element A, (90% element A-10% element B), (80% A-20% B), (70% A-30% B), (60% A-40% B), (50% A-50% B), (40% A-60% B), (30% A-70% B), (20% A-80% B), (10% A-90% B), 100% B] into a "symmetric" set [(90% A-10% B) or (10% A-90% B), (80% A-20% B) or (20% A-80% B), (70% A-30% B) or (30% A-70% B), (60% A-40% B) or (40% A-60% B), (50% A-50% B)].

In [3]:
# The labels to predict are strings of the form "[1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0]"
# Below I will extract the corresponding list

import ast

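# Parse each string of the column 'stabilityVec' into a Python list.
# A list comprehension is used instead of map(), which returns an iterator
# in Python 3 and would not be converted correctly by np.asarray.
list_label = [ast.literal_eval(s) for s in train['stabilityVec'].tolist()]
list_label_np = np.asarray(list_label)         # Now we have an array containing all the 1D arrays of the labels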


#Let's drop the stability index of the pure element, as we do not need to predict it
#and it might bias the model
list_label_compound = list_label_np[:,1:10]

#Creating the numpy array that will contain the simplified symmetric labels
# (the number of rows is taken from the data instead of being hard-coded)
list_label_compound_sym = np.zeros((len(list_label_compound), 5), dtype='float32')

#Creating the simplified symmetrized labels (see introduction)
for i in range(len(list_label_compound)):
    list_label_compound_sym[i][4]=list_label_compound[i][4]
    if (list_label_compound[i][0]==1 or list_label_compound[i][8]==1):
        list_label_compound_sym[i][0]=1.0
    if (list_label_compound[i][1]==1 or list_label_compound[i][7]==1):
        list_label_compound_sym[i][1]=1.0
    if (list_label_compound[i][2]==1 or list_label_compound[i][6]==1):
        list_label_compound_sym[i][2]=1.0
    if (list_label_compound[i][3]==1 or list_label_compound[i][5]==1):
        list_label_compound_sym[i][3]=1.0

## Preparing the data to train the model

In [4]:
# Importing sklearn
from sklearn.model_selection import train_test_split

## I'm dropping the column of labels 'stabilityVec'
## I'm dropping also the element name columns 'formulaA' and 'formulaB'; previously I tried
## to include their one-hot encoding: It did not help the model and adds several features

X_tot = train.drop(['formulaA','formulaB','stabilityVec'], axis=1) 
Y_tot = list_label_compound_sym # Symmetrized target values


# Holding out 10% of the data to test the model 
X_train, X_test, Y_train, Y_test = train_test_split(X_tot, Y_tot, test_size=0.1, random_state=0)
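
The split above is a plain random shuffle, with no stratification on the labels. One possible way to check that each of the 5 labels is represented comparably in the two sets is to compare the fraction of positive samples per label. The snippet below is a minimal sketch on made-up arrays: `label_frequencies`, `Y_train_demo` and `Y_test_demo` are hypothetical names and values, not part of the notebook; in practice one would pass `Y_train` and `Y_test`.

```python
import numpy as np

def label_frequencies(Y):
    """Fraction of samples that have a 1 for each label (column) of Y."""
    return np.asarray(Y, dtype=float).mean(axis=0)

# Toy label matrices standing in for Y_train and Y_test (made-up values)
Y_train_demo = np.array([[0, 1, 1, 0, 1],
                         [0, 0, 1, 0, 0],
                         [1, 1, 1, 0, 1],
                         [0, 0, 0, 1, 0]])
Y_test_demo = np.array([[0, 1, 1, 0, 0],
                        [0, 0, 1, 1, 1]])

train_freq = label_frequencies(Y_train_demo)
test_freq = label_frequencies(Y_test_demo)
gap = np.abs(train_freq - test_freq)  # large gaps would hint at an unbalanced split
print(train_freq, test_freq, gap)
```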

## Training a multilabel model and testing it

In the next cell I will train a multilabel random forest model and use it to predict the labels. I used a grid search with 10-fold cross-validation to find the optimal parameters for the model. As grid search with cross-validation takes time, the corresponding lines of code have been commented out.

In [5]:
# Importing sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import coverage_error 
from sklearn.metrics import f1_score, accuracy_score
from sklearn.metrics import make_scorer

################################################################################

####### The following commented lines have been used to optimize the RandomForestClassifier parameters
####### This takes some time 

## Important parameters to optimize: n_estimators, max_features, and max_depth
# search_grid_rf = [{'n_estimators': [10, 20, 40, 80, 100, 150, 200, 300, 400],
#                 'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
#                 'max_depth': [2, 4, 8, 12, 14, 16, 18, 24, 30]}] 

# I'm using subset accuracy as score function
# subsetacc = make_scorer(accuracy_score)
# multilabel_rf = GridSearchCV(RandomForestClassifier(random_state=0), search_grid_rf, cv=10,
#                        scoring=subsetacc)

# multilabel_rf.fit(X_train, Y_train)

# print("Best parameters set found on training set:")
# print("")
# print(multilabel_rf.best_params_)
# print("")
# print("Grid scores on training set:")
# print("")

# means = multilabel_rf.cv_results_['mean_test_score']
# stds = multilabel_rf.cv_results_['std_test_score']
# for mean, std, params in zip(means, stds, multilabel_rf.cv_results_['params']):
#    print("%0.3f (+/-%0.03f) for %r"
#            % (mean, std * 2, params))
# print("")

## Best parameters set found on training set:

## {'max_features': 0.1, 'n_estimators': 100, 'max_depth': 16}

## Grid scores on training set:

## 0.642 (+/-0.057) for {'max_features': 0.1, 'n_estimators': 100, 'max_depth': 16}

################################################################################

# The multilabel classifier
multilabel_rf = RandomForestClassifier(max_features = 0.1, n_estimators = 100, max_depth = 16, random_state=0)
# With the parameters above I got 0.642 (+/-0.057) as best score (subset accuracy) in the grid search
# random_state=0 allows for reproducibility; alternatively we could release this constraint 
# and average over different final results

###########################

multilabel_rf.fit(X_train, Y_train)
predictions_train = multilabel_rf.predict(X_train)

# print() is called with a single argument so that the cell runs identically
# under Python 2 and Python 3
print("Detailed classification report:")
print("")
print("The model is trained on the training set.")
print("")
print("These scores are computed on the training set.")
print("")
print(classification_report(Y_train, predictions_train))
print("")

print("Subset accuracy %s" % accuracy_score(Y_train, predictions_train))


###########################

predictions_test = multilabel_rf.predict(X_test)


print("")
print("These scores are computed on the hold out test set.")
print("")
print(classification_report(Y_test, predictions_test))
print("")

print("Subset accuracy %s" % accuracy_score(Y_test, predictions_test))

Detailed classification report:

The model is trained on the training set.

These scores are computed on the training set.

             precision    recall  f1-score   support

          0       1.00      0.96      0.98        98
          1       1.00      1.00      1.00       421
          2       1.00      1.00      1.00       753
          3       1.00      0.99      0.99       290
          4       1.00      1.00      1.00       522

avg / total       1.00      1.00      1.00      2084


Subset accuracy 0.996110630942

These scores are computed on the hold out test set.

             precision    recall  f1-score   support

          0       0.67      0.17      0.27        12
          1       0.79      0.48      0.60        48
          2       0.82      0.76      0.79        83
          3       0.70      0.45      0.55        31
          4       0.79      0.68      0.73        62

avg / total       0.78      0.61      0.68       236


Subset accuracy 0.670542635659
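
For multilabel targets, `accuracy_score` computes the subset accuracy: a sample counts as correct only if all of its labels are predicted correctly. The sketch below reimplements this definition in NumPy on made-up arrays, together with the trivial baseline that predicts 0 for every label. The function name `subset_accuracy` and the toy values are hypothetical and only illustrate the metric; they do not reproduce the numbers reported above.

```python
import numpy as np

def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples whose labels are all predicted correctly."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    return np.mean(np.all(Y_true == Y_pred, axis=1))

# Made-up labels and predictions for 4 samples and 5 labels
Y_true_demo = np.array([[0, 1, 1, 0, 1],
                        [0, 0, 0, 0, 0],
                        [0, 0, 1, 0, 0],
                        [0, 0, 0, 0, 0]])
Y_pred_demo = np.array([[0, 1, 1, 0, 0],   # one wrong label: the sample counts as wrong
                        [0, 0, 0, 0, 0],
                        [0, 0, 1, 0, 0],
                        [0, 0, 0, 0, 0]])

model_score = subset_accuracy(Y_true_demo, Y_pred_demo)
# Baseline: predict "no stable compound" for every label of every sample
baseline_score = subset_accuracy(Y_true_demo, np.zeros_like(Y_true_demo))
print(model_score, baseline_score)
```

In this toy case a single wrong label out of 20 lowers the subset accuracy to 0.75, which shows how strict the metric is.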


## Conclusions

- I tuned the hyperparameters of the model with 10-fold cross-validation, using subset accuracy as the scoring function. __On the test set I obtained a 67% subset accuracy.__ For reference, predicting 0 for every label would give a subset accuracy of 52%. Subset accuracy is a very strict metric, as it indicates the percentage of samples that have all their labels classified correctly. As shown above, certain classes have a sizeably lower f1-score and this has implications for the overall subset accuracy.
- Class 2 has the highest f1-score, followed by class 4. Conversely, classes 0 and 3 have the lowest f1-scores. This can be easily understood from the fact that classes 2 (30%-70% and vice versa) and 4 (50%-50%) have the largest number of training examples, while classes 0 (10%-90% and vice versa) and 3 (40%-60% and vice versa) are rarer. I also verified that, even though I used a random shuffle, there are no problems with the stratification of the data: all 5 labels are represented in a balanced way between the training and test sets.
- Precision is sizeably larger than recall. In practice this means that when my model labels a compound as stable, the prediction is "likely" to be correct; on the other hand, the model still misses some stable compounds. It would be possible to change the precision-recall balance by changing the threshold used for the prediction (0.5 by default).
- Although it is not shown in the notebook, I tried to introduce new features as weighted averages of the properties of the elements. Specifically, I introduced features of the type 0.9\*featureA+0.1\*featureB, 0.1\*featureA+0.9\*featureB, 0.8\*featureA+0.2\*featureB, etc. I considered several choices for the specific featureA and featureB. I noticed that the derived average features helped to improve the accuracy of certain labels and deteriorated the accuracy of others. Indeed, in a multilabel problem the algorithm might not understand that a certain feature is meant to be used only to improve the prediction of a specific label. For example, the derived feature 0.9\*featureA+0.1\*featureB can influence not only the mixture 90%A-10%B but also the mixture 50%A-50%B. To solve this problem I thought about training a different model, with different features, for each label. I quickly understood that this was not a suitable idea, as I would have lost the correlation between labels and risked creating a series of completely disconnected models. In order to exploit compound-specific features, such as the weighted averages, I developed the idea presented in the notebook main.ipynb. Soon after, I realized that this approach is the "standard" in the scientific literature of the field.
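
The threshold adjustment mentioned in the point on precision and recall could be sketched as follows. For a multilabel problem, `predict_proba` of a `RandomForestClassifier` returns a list with one array per label, each of shape (n_samples, n_classes). The sketch assumes that both classes (0 and 1) were seen during training for every label, so that each array has two columns and column 1 holds the probability of "stable". The helper name `predict_with_threshold` and the probabilities are made up for illustration; with the fitted model one would pass `multilabel_rf.predict_proba(X_test)`.

```python
import numpy as np

def predict_with_threshold(proba_per_label, threshold=0.5):
    """Turn per-label class probabilities into 0/1 predictions.

    proba_per_label: list with one (n_samples, 2) array per label,
    where column 1 is the probability of class 1 (assumption, see text).
    """
    positive = np.column_stack([np.asarray(p)[:, 1] for p in proba_per_label])
    # A label is predicted as 1 only if its probability exceeds the threshold
    return (positive > threshold).astype(int)

# Made-up probabilities for 3 samples and 2 labels
proba_demo = [
    np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]),     # label 0
    np.array([[0.3, 0.7], [0.65, 0.35], [0.9, 0.1]]),   # label 1
]

default_pred = predict_with_threshold(proba_demo)        # threshold 0.5
lenient_pred = predict_with_threshold(proba_demo, 0.3)   # lower threshold
print(default_pred)
print(lenient_pred)
```

Lowering the threshold marks more compounds as stable, which should increase recall at the expense of precision; the appropriate value would have to be chosen on validation data rather than on the test set.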