# `chemsci` demonstration
* A simple use case in which a scikit-learn pipeline is built to predict whether a given compound can inhibit HIV
* The data used in this demonstration were taken from the [AIDS Antiviral Screen Data hosted by the NCI](https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data)
___

In [1]:
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

from chemsci.factory import FeatureFactory

In [2]:
# load SMILES strings and target data

df = pd.read_csv('HIV.csv')
X = df['smiles']
y = df['HIV_active']

In [3]:
# split data into training and testing sets, stratified on the target so both keep the same class balance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
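
To illustrate what `stratify=y` buys in the split above, here is a minimal pure-Python sketch of stratified splitting. It is not scikit-learn's implementation; the helper `stratified_split` and the 5% active rate are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at roughly test_size."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 50 actives among 1000 compounds.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"active fraction: train={train_rate:.3f}, test={test_rate:.3f}")
```

Because each class is split separately, the rare class appears in both sets in the same proportion, which a purely random split does not guarantee.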

In [4]:
# create a pipeline containing the FeatureFactory

pipe = Pipeline([
    ('featuriser', FeatureFactory(converter='smiles', featuriser='ecfp_4_1024')),
    ('model', RandomForestClassifier())
])
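
The featuriser name `'ecfp_4_1024'` suggests a hashed circular fingerprint folded into 1024 bits; that reading is an assumption, not something confirmed by the `chemsci` source. The toy sketch below shows only the folding idea. The function `fold_features` and the placeholder feature strings are hypothetical, and real ECFP identifiers are derived from atom neighbourhoods rather than from raw strings.

```python
import zlib


def fold_features(feature_ids, n_bits=1024):
    """Fold arbitrary feature identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for feature in feature_ids:
        # A deterministic hash maps each feature to one bit position.
        position = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[position] = 1
    return bits


# Placeholder strings standing in for substructure identifiers (assumption).
toy_features = ["C", "CC", "CCO", "O", "CO"]
vector = fold_features(toy_features)
print(len(vector), sum(vector))
```

Every compound yields a vector of the same length regardless of its size, which is what lets the output feed directly into a model such as `RandomForestClassifier`. Distinct features can collide on the same bit, so the number of set bits may be lower than the number of features.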

In [5]:
# fit pipeline model to training data

pipe.fit(X_train, y_train)

100%|██████████| 32901/32901 [00:10<00:00, 3109.91it/s]


Pipeline(steps=[('featuriser',
                 Factory: Converter=<Boost.Python.function object at 0x5565ac12f880>, Featuriser=<chemsci.featurisers.MorganFingerprint object at 0x7f166fb648d0>.),
                ('model', RandomForestClassifier())])

In [6]:
# assess the model performance

y_pred = pipe.predict(X_test)
score = accuracy_score(y_test, y_pred)

print(f'Accuracy score of trained RandomForest is {score:.3f}')

100%|██████████| 8226/8226 [00:02<00:00, 3034.35it/s]


Accuracy score of trained RandomForest is 0.973
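
Accuracy on its own can flatter a classifier when one class is rare, as is common in screening data. The sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall) for a trivial predictor that always answers "inactive". The labels are hypothetical, with a 3% active rate chosen for illustration, and are not measured from `HIV.csv`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 30 actives among 1000 compounds.
y_true = [1] * 30 + [0] * 970
# A trivial baseline that always predicts the majority class.
y_majority = [0] * len(y_true)

baseline_acc = accuracy(y_true, y_majority)
baseline_bal = balanced_accuracy(y_true, y_majority)
print(f"majority baseline: accuracy={baseline_acc:.3f}, balanced accuracy={baseline_bal:.3f}")
```

A baseline that never finds an active compound still scores 0.970 accuracy here, so it is worth comparing the pipeline's score against such a baseline. In scikit-learn, `sklearn.dummy.DummyClassifier` and `sklearn.metrics.balanced_accuracy_score` cover the same ground.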


___