<div class="alert alert-block alert-info">
<b>NB:</b> This notebook is intended for tgBoost users. It covers the main commands and model pretraining.</div>

# INTRODUCTION

1) **tgBoost** is a Python package composed of a **data pipeline** and a pretrained **QSPR model** for predicting the *T*$_g$ of simple organic molecules.

2) The tgBoost model is built on the Extreme Gradient Boosting framework (XGBoost), chosen for its fast computation and accuracy.

3) Before first use, the model must be retrained locally, because the xgboost package has to train the model against the C++ build used on your machine.
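
Because the retraining requirement depends on the local xgboost installation, it can be useful to record which version is installed before calling `run_training()`. The snippet below is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, not part of tgBoost, and it assumes the package is installed under the distribution name `xgboost`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical pre-flight check: report the xgboost version found locally.
xgb_version = installed_version("xgboost")
if xgb_version is None:
    print("xgboost is not installed in this environment")
else:
    print(f"xgboost version: {xgb_version}")
```

If the reported version changes later (for example after an upgrade), rerunning the training step is the safe choice.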

## 1. Train *T*$_g$ model of tgBoost 

Training within the notebook

In [1]:
from tgboost.train_pipeline import run_training

In [2]:
run_training()

DATASET:  /Users/tommaso/Desktop/tgApp_dev/tgboost/datasets/Koop_dataset.csv 

*** EXTRACTION step
n_input SMILES:  415 

*** TRANSFORMING step
n_output SMILES:  298 

~~ DATA info
Xtrain:  298 ytrain:  298 Xtest:  0 ytest:  0 

*** REGRESSION step

PIPELINE completed:
_ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ ^ ~ _ ~ ^ ~ _
  __       ___                __ 
 / /____ _/ _ )___  ___  ___ / /_
/ __/ _ `/ _  / _ \/ _ \(_-</ __/
\__/\_, /____/\___/\___/___/\__/ 
   /___/                         
_ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ ^ ~ _ ~ ^ ~ _


## 2. Working with tgBoost

In [3]:
import numpy as np
import pandas as pd

Import main modules and functions

In [4]:
import tgboost.processing.smiles_manager as sm
from tgboost.predict import make_prediction

tgModel_output_v0.0.1.pkl


### 2.1 First scenario
- Pure list of SMILES, no header
- Source could be a .csv, .txt, .xlsx file
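
To make the input format concrete, here is a rough sketch of what "a headerless list labelled under a SMILES column" means. It uses only the standard library and an in-memory string as a stand-in for the file; `read_headerless_smiles` is a hypothetical helper for illustration and is not how `DatabaseExtractor` is implemented.

```python
import csv
import io

# Stand-in for a headerless file: one SMILES string per line, no title row.
raw_text = "Oc1ccccc1O\nCCC=O\nCC(C)=O\nCCO\nCCO\n"


def read_headerless_smiles(handle):
    """Read one SMILES string per row and label each under a 'SMILES' key."""
    rows = []
    for record in csv.reader(handle):
        if record and record[0].strip():  # skip blank lines
            rows.append({"SMILES": record[0].strip()})
    return rows


records = read_headerless_smiles(io.StringIO(raw_text))
print(records)
```

Each record now carries an explicit `SMILES` key, which mirrors the titled column that the later steps operate on.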


#### a) Extract the SMILES, collect them in a "SMILES" column, and build a readable DataFrame

In [5]:
filename = 'first_mols.csv'

In [6]:
df_smiles = sm.DatabaseExtractor().extract(file = filename)
df_smiles

Unnamed: 0,SMILES
0,Oc1ccccc1O
1,CCC=O
2,CC(C)=O
3,CCO
4,CCO


<b>NOTE:</b> there are duplicate SMILES in the list.

#### b) Eliminate duplicates via *SmilesWrapper* and make embeddings via *SmilesEmbedder*

The wrapper eliminates duplicate entries, while also listing the species in alphabetical order.
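
As an illustration of the deduplication idea only, the sketch below removes exact duplicate strings from a list and optionally sorts the result. `drop_duplicate_smiles` is a hypothetical helper, not the `SmilesWrapper` implementation. It compares raw strings, so two different spellings of the same molecule would both be kept. Whether `SmilesWrapper` canonicalizes SMILES before comparing is not covered here.

```python
def drop_duplicate_smiles(smiles_list, sort=False):
    """Remove exact duplicate strings, keeping the first occurrence of each."""
    unique = list(dict.fromkeys(smiles_list))  # dicts preserve insertion order
    return sorted(unique) if sort else unique


smiles = ["Oc1ccccc1O", "CCC=O", "CC(C)=O", "CCO", "CCO"]
unique_smiles = drop_duplicate_smiles(smiles)
print(unique_smiles)
# → ['Oc1ccccc1O', 'CCC=O', 'CC(C)=O', 'CCO']
```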

In [7]:
wrapper = sm.SmilesWrapper(variables= ['SMILES'], param_scaler = False)
embedder = sm.SmilesEmbedder(variables= ['SMILES'])

In [8]:
df_embedded_unique_smiles = embedder.fit_transform(wrapper.fit_transform(df_smiles))
df_embedded_unique_smiles

Unnamed: 0,SMILES,embeddings
0,Oc1ccccc1O,"[0.4051229, 0.38696152, -0.595621, 2.5130584, ..."
1,CCC=O,"[-0.24485978, 0.15912962, -1.3363688, 0.327706..."
2,CC(C)=O,"[-0.8803653, -0.2705594, -0.5115221, -0.745279..."
3,CCO,"[0.5830543, 0.105680496, 0.14494535, -0.064349..."


#### c) Make prediction dictionary

In [9]:
prediction_dict = make_prediction(input_data = df_embedded_unique_smiles['SMILES'])
prediction_dict

[235.9176   86.87503 102.7809   96.35538]


{'predictions': [235.9176, 86.87503, 102.7809, 96.35538],
 'version': '0.0.1',
 'errors': None}
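
The returned dictionary carries an `errors` field alongside the predictions, so it is worth checking it, and the number of predictions, before attaching the values to a table. The following is a minimal sketch under the assumption that the dictionary always has the three keys shown above; `pair_predictions` is a hypothetical helper and not part of tgBoost.

```python
def pair_predictions(smiles, prediction_dict):
    """Pair each SMILES with its predicted Tg, refusing failed or mismatched results."""
    if prediction_dict.get("errors") is not None:
        raise ValueError(f"prediction failed: {prediction_dict['errors']}")
    predictions = list(prediction_dict["predictions"])
    if len(predictions) != len(smiles):
        raise ValueError(
            f"got {len(predictions)} predictions for {len(smiles)} SMILES"
        )
    return dict(zip(smiles, predictions))


# Values copied from the output above.
example = {
    "predictions": [235.9176, 86.87503, 102.7809, 96.35538],
    "version": "0.0.1",
    "errors": None,
}
paired = pair_predictions(["Oc1ccccc1O", "CCC=O", "CC(C)=O", "CCO"], example)
print(paired["CCO"])
# → 96.35538
```

A length check of this kind catches the case where the number of predictions does not match the number of rows they are assigned to.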

In [10]:
df_embedded_unique_smiles['Tg_pred (K)'] = prediction_dict['predictions']
df_embedded_unique_smiles

Unnamed: 0,SMILES,embeddings,Tg_pred (K)
0,Oc1ccccc1O,"[0.4051229, 0.38696152, -0.595621, 2.5130584, ...",235.917603
1,CCC=O,"[-0.24485978, 0.15912962, -1.3363688, 0.327706...",86.875031
2,CC(C)=O,"[-0.8803653, -0.2705594, -0.5115221, -0.745279...",102.780899
3,CCO,"[0.5830543, 0.105680496, 0.14494535, -0.064349...",96.355377


### 2.2 Second scenario
- File with a column titled SMILES (it could be a .csv, .txt, .xlsx file)

In [11]:
filename_doubles = 'doubles_mols.csv'
df_doubles = sm.DatabaseExtractor().extract(file = filename_doubles)

In [12]:
#df_doubles = pd.read_csv('doubles_mols.csv')
df_doubles

Unnamed: 0,SMILES
0,C#C
1,CC(Cl)Cl
2,CC(CO)(CO)[N+](=O)[O-]
3,c1cc(sc1)Cl
4,CCCOC(=O)CC
5,CC(=C)OC(=O)C
6,CCCOC=O
7,Cc1c(cc(cc1[N+](=O)[O-])[N+]([O-])=O)[N+]([O-])=O
8,CCc1cccc(CC)c1
9,CC(=CCCC(=O)C)C


In [13]:
wrapper_2 = sm.SmilesWrapper(variables=['SMILES'], param_scaler = False)
embedder_2 = sm.SmilesEmbedder(variables=['SMILES'])

In [14]:
df2_embedded_unique_smiles = embedder_2.fit_transform(wrapper_2.fit_transform(df_doubles))
df2_embedded_unique_smiles

Unnamed: 0,SMILES,embeddings
0,CC(C)CCCC(C)C,"[-1.7930809, -0.3430113, -1.5896883, -0.296719..."
1,CCC(C)(CC)CC,"[-1.3616825, 2.9518492, -0.77216995, -0.294032..."
2,C=C(C)C(C)(C)C,"[-1.016378, 2.0707304, -0.94561285, -1.0383787..."
3,CCC(CC)C(C)C,"[-1.4465501, 1.365694, -1.0023474, -0.15867056..."
4,CCCCSCC,"[0.30263892, -0.23683022, 0.17580251, 0.476828..."
5,C#C,"[0.5479703, 0.29214346, 0.57423025, 1.0772303,..."
6,CC(C)CC(C)C(C)C,"[-1.2361975, 1.0591083, -2.0571063, -0.5048241..."
7,CCCOC=O,"[-0.47865582, -1.3420168, -1.5128081, 0.043614..."
8,CCc1ccc(C)c(C)c1,"[0.6111146, -0.2882797, -1.550622, 3.2725656, ..."
9,CCCCC(C)(C)C,"[-0.9479669, 1.2812092, 0.07547889, -0.7764258..."


In [15]:
df2_embedded = embedder_2.fit_transform(df_doubles)
df2_embedded

Unnamed: 0,SMILES,embeddings
0,C#C,"[0.5479703, 0.29214346, 0.57423025, 1.0772303,..."
1,CC(Cl)Cl,"[0.53650045, 0.26291013, -0.77642334, 0.184161..."
2,CC(CO)(CO)[N+](=O)[O-],"[-0.49202204, 1.4901065, 0.23652013, -2.321251..."
3,c1cc(sc1)Cl,"[-1.1649153, -1.0850924, 0.122385025, 2.356581..."
4,CCCOC(=O)CC,"[-0.66946936, -0.16097133, -1.4185302, -0.4906..."
5,CC(=C)OC(=O)C,"[-1.3073444, 0.05759584, -1.8734527, -0.626749..."
6,CCCOC=O,"[-0.47865582, -1.3420168, -1.5128081, 0.043614..."
7,Cc1c(cc(cc1[N+](=O)[O-])[N+]([O-])=O)[N+]([O-])=O,"[1.4889959, 1.7657193, -3.0121918, 0.31898266,..."
8,CCc1cccc(CC)c1,"[-0.10988998, -0.19000617, -1.431224, 3.074103..."
9,CC(=CCCC(=O)C)C,"[-1.8196306, -0.6502626, -2.26432, 0.49115005,..."


In [16]:
prediction_dict = make_prediction(input_data = df2_embedded['SMILES'])

[141.78735  140.29222  138.94965  114.62498  104.06488  119.42798
 154.83214  111.232155 140.75172  155.4983   228.26994  148.82986
 122.4085   153.65085  125.11103  123.26633  126.92718  104.57098
 198.95     312.27933 ]


In [17]:
df2_embedded['Tg_pred (K)'] = prediction_dict['predictions']
df2_embedded

Unnamed: 0,SMILES,embeddings,Tg_pred (K)
0,C#C,"[0.5479703, 0.29214346, 0.57423025, 1.0772303,...",141.787354
1,CC(Cl)Cl,"[0.53650045, 0.26291013, -0.77642334, 0.184161...",140.292221
2,CC(CO)(CO)[N+](=O)[O-],"[-0.49202204, 1.4901065, 0.23652013, -2.321251...",138.949646
3,c1cc(sc1)Cl,"[-1.1649153, -1.0850924, 0.122385025, 2.356581...",114.624977
4,CCCOC(=O)CC,"[-0.66946936, -0.16097133, -1.4185302, -0.4906...",104.06488
5,CC(=C)OC(=O)C,"[-1.3073444, 0.05759584, -1.8734527, -0.626749...",119.427979
6,CCCOC=O,"[-0.47865582, -1.3420168, -1.5128081, 0.043614...",154.832138
7,Cc1c(cc(cc1[N+](=O)[O-])[N+]([O-])=O)[N+]([O-])=O,"[1.4889959, 1.7657193, -3.0121918, 0.31898266,...",111.232155
8,CCc1cccc(CC)c1,"[-0.10988998, -0.19000617, -1.431224, 3.074103...",140.751724
9,CC(=CCCC(=O)C)C,"[-1.8196306, -0.6502626, -2.26432, 0.49115005,...",155.498306
