# INTRODUCTION

<div class="alert alert-block alert-info">
<b><b>
1) **tgBoost** is a Python package composed of a **data pipeline** and a pretrained **QSPR model** for predicting the *T*$_g$ of organic monomer molecules.

2) The tgBoost model is built on the Extreme Gradient Boosting framework (XGBoost), chosen for its fast computation and accuracy.

3) Before first use, the model must be retrained on your machine, because the xgboost package needs to train the model with the C++ version available on your system.
</div>
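To give a feel for the boosting idea behind point 2, here is a toy sketch in plain Python: each round fits a one-split tree (a stump) to the current residuals and adds a damped copy of it to the ensemble. It is illustrative only. XGBoost itself adds regularisation, second-order gradient information and many optimisations, and the data, function names and hyperparameters below are made up for the example.

```python
# Toy gradient boosting for squared-error regression with one-split trees.
# Not the tgBoost or XGBoost implementation; all values are illustrative.

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    # Skip the largest x so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def fit_boosting(x, y, n_rounds=50, learning_rate=0.3):
    base = sum(y) / len(y)              # start from the mean target
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)  # fit the remaining error
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, learning_rate

def predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)

# Made-up one-dimensional data, only to exercise the code.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [100.0, 110.0, 150.0, 160.0, 200.0, 210.0]

model = fit_boosting(x, y)
base_value = sum(y) / len(y)
mse_base = sum((yi - base_value) ** 2 for yi in y) / len(y)
mse_model = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print("MSE of the mean predictor:", mse_base)
print("MSE after boosting:       ", mse_model)
```

The training error of the boosted ensemble is lower than that of the constant mean predictor, which is the basic mechanism that tree-boosting frameworks refine.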

The following notebook is aimed at tgBoost users in atmospheric chemistry; it covers the initial model training and some usage guidelines.

## 1. Train *T*$_g$ model of tgBoost

In [11]:
from tgboost.train_pipeline import run_training

In [12]:
run_training()

DATASET:  /Users/tommaso/Desktop/tgApp_dev/tgboost/datasets/Koop_dataset.csv

*** EXTRACTION step
n_input SMILES:  415

*** TRANSFORMING step
n_output SMILES:  298

~~ DATA info

Xtrain:  298 ytrain:  298 Xtest:  0 ytest:  0
*** REGRESSION step

~ _ ~ ^ ~ _ ~ PIPELINE completed: trained model


In [None]:
import numpy as np
import pandas as pd

In [2]:
import tgboost.processing.smiles_manager as sm
from tgboost.predict import make_prediction

tgModel_output_v0.0.1.pkl

## 2. Working with tgBoost

### 2.1 First scenario
 - File with a single list of SMILES (it can be a .csv, .txt, or .xlsx file)
 - Plain list of SMILES, no header

a) Extract the SMILES and build a readable DataFrame

In [3]:
filename = 'first_mols.csv'

In [4]:
df_smiles = sm.DatabaseExtractor().extract(file = filename)
df_smiles

Unnamed: 0,SMILES
0,Oc1ccccc1O
1,CCC=O
2,CC(C)=O
3,CCO
4,CCO


<b>NOTE:</b> the list contains duplicate SMILES (`CCO` appears twice).

b) Remove duplicates and compute embeddings

The wrapper removes duplicate entries and also reorders the species, so the output rows do not follow the input order.

In [5]:
wrapper = sm.SmilesWrapper(variables= ['SMILES'], param_scaler = False)
embedder = sm.SmilesEmbedder(variables= ['SMILES'])

In [6]:
df_embedded_unique_smiles = embedder.fit_transform(wrapper.fit_transform(df_smiles))
df_embedded_unique_smiles

Unnamed: 0,SMILES,embeddings
0,CC(C)=O,"[-0.8803653, -0.2705594, -0.5115221, -0.745279..."
1,Oc1ccccc1O,"[0.4051229, 0.38696152, -0.595621, 2.5130584, ..."
2,CCC=O,"[-0.24485978, 0.15912962, -1.3363688, 0.327706..."
3,CCO,"[0.5830543, 0.105680496, 0.14494535, -0.064349..."
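The duplicate check and removal can be sketched in plain Python on the five SMILES extracted above. This is only an illustration of the idea, not the `SmilesWrapper` implementation. It compares strings literally, so two different spellings of the same molecule would not be recognised as duplicates; catching those would need canonicalisation with a cheminformatics toolkit, which is outside this sketch.

```python
from collections import Counter

# SMILES as extracted from first_mols.csv (see the table in step a)
smiles = ["Oc1ccccc1O", "CCC=O", "CC(C)=O", "CCO", "CCO"]

# Which strings occur more than once?
counts = Counter(smiles)
duplicates = [s for s, n in counts.items() if n > 1]

# Drop repeats while keeping the first occurrence of each string
unique_in_order = list(dict.fromkeys(smiles))

print("duplicates:", duplicates)
print("unique:    ", unique_in_order)
```

Unlike this sketch, the wrapper output shown above comes back in a different row order, so results should be matched on the `SMILES` column rather than on row position.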


c) Make prediction dictionary

In [7]:
prediction_dict = make_prediction(input_data = df_embedded_unique_smiles['SMILES'])

[102.7809   86.87503 235.9176   96.35538]


In [8]:
df_embedded_unique_smiles['Tg_pred (K)'] = prediction_dict['predictions']
df_embedded_unique_smiles

Unnamed: 0,SMILES,embeddings,Tg_pred (K)
0,CC(C)=O,"[-0.8803653, -0.2705594, -0.5115221, -0.745279...",102.780899
1,Oc1ccccc1O,"[0.4051229, 0.38696152, -0.595621, 2.5130584, ...",86.875031
2,CCC=O,"[-0.24485978, 0.15912962, -1.3363688, 0.327706...",235.917603
3,CCO,"[0.5830543, 0.105680496, 0.14494535, -0.064349...",96.355377
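The predictions are temperatures in kelvin (section 2.2 labels the same output `Tg_pred (K)`). As a small worked example, the snippet below pairs each SMILES from the table above with its predicted value and converts it to degrees Celsius. The helper name `kelvin_to_celsius` is introduced here for illustration only.

```python
# SMILES and predicted Tg values copied from the table above (kelvin)
smiles = ["CC(C)=O", "Oc1ccccc1O", "CCC=O", "CCO"]
tg_kelvin = [102.780899, 86.875031, 235.917603, 96.355377]

def kelvin_to_celsius(t_kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return t_kelvin - 273.15

# Look up predictions by structure instead of by row position
tg_by_smiles = dict(zip(smiles, tg_kelvin))

for s, tg in tg_by_smiles.items():
    print(f"{s:12s} {tg:8.2f} K  {kelvin_to_celsius(tg):8.2f} °C")
```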


### 2.2 Second scenario
- File with a column titled SMILES (it can be a .csv, .txt, or .xlsx file)
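The two input layouts (a bare list in the first scenario, a titled `SMILES` column here) can be told apart with a simple header check. The sketch below uses only the standard library and in-memory text; `read_smiles` is a hypothetical helper written for this illustration and says nothing about how `DatabaseExtractor` works internally.

```python
import csv
import io

def read_smiles(text, column="SMILES"):
    """Return the SMILES strings from CSV text, with or without a header row."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0]]
    if column in header:
        # Titled column: skip the header and read that column
        idx = header.index(column)
        return [row[idx].strip() for row in rows[1:]]
    # No header: treat the first column as the SMILES list
    return [row[0].strip() for row in rows]

headerless = "Oc1ccccc1O\nCCC=O\nCC(C)=O\nCCO\nCCO\n"
titled = "SMILES\nC#C\nCC(Cl)Cl\n"

print(read_smiles(headerless))
print(read_smiles(titled))
```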

In [9]:
filename_doubles = 'doubles_mols.csv'
df_doubles = sm.DatabaseExtractor().extract(file = filename_doubles)

In [10]:
#df_doubles = pd.read_csv('doubles_mols.csv')
df_doubles

Unnamed: 0,SMILES
0,C#C
1,CC(Cl)Cl
2,CC(CO)(CO)[N+](=O)[O-]
3,c1cc(sc1)Cl
4,CCCOC(=O)CC
5,CC(=C)OC(=O)C
6,CCCOC=O
7,Cc1c(cc(cc1[N+](=O)[O-])[N+]([O-])=O)[N+]([O-])=O
8,CCc1cccc(CC)c1
9,CC(=CCCC(=O)C)C


In [11]:
wrapper_2 = sm.SmilesWrapper(variables=['SMILES'], param_scaler = False)
embedder_2 = sm.SmilesEmbedder(variables=['SMILES'])

In [12]:
df2_embedded_unique_smiles = embedder_2.fit_transform(wrapper_2.fit_transform(df_doubles))
df2_embedded_unique_smiles

Unnamed: 0,SMILES,embeddings
0,C#C,"[0.5479703, 0.29214346, 0.57423025, 1.0772303,..."
1,Cc1c([N+](=O)[O-])cc([N+](=O)[O-])cc1[N+](=O)[O-],"[1.4889961, 1.7657193, -3.0121915, 0.31898278,..."
2,C=C(C)C(C)(C)C,"[-1.016378, 2.0707304, -0.94561285, -1.0383787..."
3,CCCCC(C)(C)C,"[-0.9479669, 1.2812092, 0.07547889, -0.7764258..."
4,CC(=O)CCC=C(C)C,"[-1.8196307, -0.6502626, -2.26432, 0.49115, 3...."
5,CCc1ccc(C)c(C)c1,"[0.6111146, -0.2882797, -1.550622, 3.2725656, ..."
6,CCc1cccc(CC)c1,"[-0.10988998, -0.19000617, -1.431224, 3.074103..."
7,CCC(C)(CC)CC,"[-1.3616825, 2.9518492, -0.77216995, -0.294032..."
8,CC(CO)(CO)[N+](=O)[O-],"[-0.49202204, 1.4901065, 0.23652013, -2.321251..."
9,CC(C)CC(C)C(C)C,"[-1.2361975, 1.0591083, -2.0571063, -0.5048241..."


In [13]:
# Embed the raw table directly (no SmilesWrapper step, so no de-duplication)
df2_embedded = embedder_2.fit_transform(df_doubles)
df2_embedded

Unnamed: 0,SMILES,embeddings
0,C#C,"[0.5479703, 0.29214346, 0.57423025, 1.0772303,..."
1,CC(Cl)Cl,"[0.53650045, 0.26291013, -0.77642334, 0.184161..."
2,CC(CO)(CO)[N+](=O)[O-],"[-0.49202204, 1.4901065, 0.23652013, -2.321251..."
3,c1cc(sc1)Cl,"[-1.1649153, -1.0850924, 0.122385025, 2.356581..."
4,CCCOC(=O)CC,"[-0.66946936, -0.16097133, -1.4185302, -0.4906..."
5,CC(=C)OC(=O)C,"[-1.3073444, 0.05759584, -1.8734527, -0.626749..."
6,CCCOC=O,"[-0.47865582, -1.3420168, -1.5128081, 0.043614..."
7,Cc1c(cc(cc1[N+](=O)[O-])[N+]([O-])=O)[N+]([O-])=O,"[1.4889959, 1.7657193, -3.0121918, 0.31898266,..."
8,CCc1cccc(CC)c1,"[-0.10988998, -0.19000617, -1.431224, 3.074103..."
9,CC(=CCCC(=O)C)C,"[-1.8196306, -0.6502626, -2.26432, 0.49115005,..."


In [14]:
prediction_dict = make_prediction(input_data = df2_embedded['SMILES'])

[119.42798  312.27933  138.94965  155.4983   198.95     140.75172
 126.92718  140.29222  228.26994  154.83214  104.06488  122.4085
 148.82986  114.62498  104.57098  111.232155 153.65085  123.26633
 141.78735  125.11103 ]


In [15]:
df2_embedded['Tg_pred (K)'] = prediction_dict['predictions']
df2_embedded

Unnamed: 0,SMILES,embeddings,Tg_pred (K)
0,C#C,"[0.5479703, 0.29214346, 0.57423025, 1.0772303,...",119.427979
1,CC(Cl)Cl,"[0.53650045, 0.26291013, -0.77642334, 0.184161...",312.279327
2,CC(CO)(CO)[N+](=O)[O-],"[-0.49202204, 1.4901065, 0.23652013, -2.321251...",138.949646
3,c1cc(sc1)Cl,"[-1.1649153, -1.0850924, 0.122385025, 2.356581...",155.498306
4,CCCOC(=O)CC,"[-0.66946936, -0.16097133, -1.4185302, -0.4906...",198.949997
5,CC(=C)OC(=O)C,"[-1.3073444, 0.05759584, -1.8734527, -0.626749...",140.751724
6,CCCOC=O,"[-0.47865582, -1.3420168, -1.5128081, 0.043614...",126.927177
7,Cc1c(cc(cc1[N+](=O)[O-])[N+]([O-])=O)[N+]([O-])=O,"[1.4889959, 1.7657193, -3.0121918, 0.31898266,...",140.292221
8,CCc1cccc(CC)c1,"[-0.10988998, -0.19000617, -1.431224, 3.074103...",228.269943
9,CC(=CCCC(=O)C)C,"[-1.8196306, -0.6502626, -2.26432, 0.49115005,...",154.832138
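Assigning predictions as a new column only makes sense when there is exactly one prediction per row and both are in the same order. A minimal plain-Python guard for that is sketched below; `attach_predictions` is a hypothetical helper and the two rows are copied from the table above.

```python
def attach_predictions(records, predictions, column="Tg_pred (K)"):
    """Return copies of the records with one prediction attached to each."""
    if len(records) != len(predictions):
        raise ValueError(
            f"{len(records)} rows but {len(predictions)} predictions"
        )
    return [{**row, column: float(p)} for row, p in zip(records, predictions)]

rows = [{"SMILES": "C#C"}, {"SMILES": "CC(Cl)Cl"}]
labelled = attach_predictions(rows, [119.427979, 312.279327])
print(labelled)
```

A mismatch in length raises an error immediately instead of silently misaligning structures and temperatures.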


## 3. Side test: opening a large `.sdf` file with `DatabaseExtractor`

In [16]:
tetko_filename = '/Users/tommaso/Desktop/tgApp_dev/tgboost/datasets/Tetko_melting_point.sdf'

In [17]:
df_tetko = sm.DatabaseExtractor().extract(file = tetko_filename)
df_tetko

Unnamed: 0,FromLiterature,OriginalText,Paragraph,Patent,QuantityType,SMILES,StdInChIKey,SuspiciousValue,Value,ID
0,false,melting point 69° - 69.5°C.,,US03930837,MeltingPoint,C(CCC)C1=NC=CC2=C(C=CC=C12)[N+](=O)[O-],XQGQGNFPBOZIFL-UHFFFAOYSA-N,false,69 to 69.5,1-n-butyl-5-nitro-isoquinoline
1,false,m.p. 176° - 177°C,,US03930837,MeltingPoint,ClC=1N=CC2=CC=CC(=C2C1)N,ZITUQCQEDBPHMJ-UHFFFAOYSA-N,false,176 to 177,3-chloro-5-amino-isoquinoline
2,false,melting at 112°C.,,US03930837,MeltingPoint,ClC1=NC(=CC2=C(C=CC=C12)[N+](=O)[O-])C,HIFFKFQOHFVGIM-UHFFFAOYSA-N,false,112,1-chloro-3-methyl-5-nitro-isoquinoline
3,false,Melting point: 131 - 132°C.,,US03930863,MeltingPoint,C(CCCCCCCCCCC)C=1C(C=CC(C1)=O)=S,IVQQPXBWXQGCLR-UHFFFAOYSA-N,false,131 to 132,2-dodecylthio-p-benzoquinone
4,false,m.p. 162°-4°,,US03930970,MeltingPoint,[N+](=O)([O-])OC[C@@]12CC([C@@H]3[C@]4(C=CC(C=...,HNLHWVUHASDCCN-WHDRTAJDSA-N,false,162 to 164,"11,18-dihydroxy-pregna-1,4-diene-3,20-dione 18..."
...,...,...,...,...,...,...,...,...,...,...
228169,false,M.Pt. 151-152° C.,0226,US20090197926A1,MeltingPoint,OC1=CC=C(C=C1)C1CCC(CC1)=CC(=O)OC,JJGBMDHQWXLWSN-UHFFFAOYSA-N,false,151 to 152,Methyl 2-[4-(4-hydroxyphenyl)cyclohexylidene]a...
228170,false,m.p.: 132.4° C.,0769,US20130090340A1,MeltingPoint,CC(C)(ON=C1CCC(CC1)NC(OC1=CC=C(C=C1)[N+](=O)[O...,YEEIAMNUEIYAQX-UHFFFAOYSA-N,false,132.4,"4-nitrophenyl [4-[(1,1-dimethylethoxy)imino]cy..."
228171,false,mp 114-116° C.,0152,US20130203592A1,MeltingPoint,C(C)(C)C1=C(C=CC=C1)N1\C(\SCC1(O)C)=N/N=C\C1=C...,NLAJMCOTGDLBBE-ACHHUDACSA-N,false,114 to 116,(Z)-3-(2-isopropylphenyl)-4-methyl-2-((E)-(4-(...
228172,false,mp 185-187° C.,0160,US20130203592A1,MeltingPoint,C(C)(C)C1=C(C=CC=C1)N1\C(\SCC1(O)C(F)(F)F)=N/N...,YXDSBVCXNNJOMS-NGWISOIBSA-N,false,185 to 187,(Z)-3-(2-isopropylphenyl)-2-((E)-(4-(1-(4-(tri...
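The `Value` column is stored as text, either a single number (`112`) or a range (`69 to 69.5`). Before any numerical use it has to be parsed. The sketch below is one possible way to do that, assuming every entry follows one of these two patterns; `parse_value` is a hypothetical helper and the sample strings are taken from the rows shown above.

```python
def parse_value(text):
    """Return (low, high, midpoint) for strings like '69 to 69.5' or '112'."""
    parts = [float(p) for p in text.split(" to ")]
    low, high = parts[0], parts[-1]
    return low, high, (low + high) / 2

for raw in ["69 to 69.5", "112", "162 to 164"]:
    low, high, mid = parse_value(raw)
    print(f"{raw!r:14} low={low} high={high} midpoint={mid}")
```

Entries that do not match either pattern would raise a `ValueError` from `float`, which is a reasonable signal to inspect them by hand.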
