# Chemotherapeutical Agents Dataset, Data Analysis

This notebook provides a first look at the Chemotherapeutical Agents Dataset (the "chemo dataset"), with some exploratory analysis to get a better view of the data.

### Data Description

In [1]:
!pip install rdkit-pypi pandas seaborn mols2grid requests scikit-learn



In [21]:
# Data handling
import pandas as pd
import numpy as np
import math

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Properties of molecules
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, Draw, AllChem
from rdkit.Chem.Draw import IPythonConsole #RDKit drawing 
from rdkit.Chem import rdDepictor # A few settings to improve the quality of structures
IPythonConsole.ipython_useSVG = True
rdDepictor.SetPreferCoordGen(True)
from rdkit.Chem import PandasTools #Add the ability to add a molecule to a dataframe
import mols2grid #The mols2grid library provides a convenient way of displaying molecules in a grid
import requests

# Machine learning
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import set_config
from builtins import list
set_config(display='diagram')

In [22]:
data = pd.read_csv("Desktop/ML_RUCOMPLEXES/CHEMO_dataset.csv")

In [23]:
metals = data.copy()
metals['Ligands Set'] = metals.apply( lambda row: set(row[['L1', 'L2', 'L3']]), axis=1)

In [24]:
metals

Unnamed: 0,L1,L2,L3,Charge,DOI,Partition Coef logP,Cell Lines,Localisation,Incubation Time (hours),IC50 (μM),Ligands Set
0,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...,8,10.1002/anie.201507800,-3.54,HeLa,lysosom,4.0,470.00,{C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC...
1,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...,8,10.1002/anie.201507800,-3.05,HeLa,lysosom,4.0,425.00,{CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC...
2,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...,8,10.1002/anie.201507800,-1.55,HeLa,lysosom,4.0,301.00,{CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)...
3,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...,CC1=CC=NC(C2=NC=CC(C(NCCCCCCN)=O)=C2)=C1,2,10.1002/anie.201916400,,HeLa,golgi apparatus,4.0,13.57,{C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)...
4,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...,CC1=CC=NC(C2=NC=CC(C(NCCCCCCN)=O)=C2)=C1,2,10.1002/anie.201916400,,RPE-1,golgi apparatus,4.0,4.22,{C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)...
...,...,...,...,...,...,...,...,...,...,...,...
892,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,2,10.1002/cmdc.201700240,1.40,MCF.7,,,1.50,{C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC...
893,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,2,10.1002/cmdc.201700240,1.40,CCL228,,,4.0,{C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC...
894,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,2,10.1002/cmdc.201700240,1.40,H358,,,1.70,{C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC...
895,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=...,2,10.1002/cmdc.201700240,1.40,MCF.10,,,1.50,{C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC...


### Compounds Available and a quick overview of available ligands

We will count the number of compounds available and how many of them report each piece of information. The **diff_compounds** DataFrame keeps one row per complex *(the first appearance of each complex in the dataset, so the values in the other columns belong to that single entry and are not to be interpreted)*.

In [25]:
diff_compounds = metals.copy()
diff_compounds.drop_duplicates(subset=['L1', 'L2', 'L3'], keep='first', inplace=True)
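
Note that `drop_duplicates(subset=['L1', 'L2', 'L3'])` compares the three columns position by position, so two rows listing the same ligands in a different order are kept as two different complexes. If an order-insensitive definition of "same complex" is preferred, one option is to deduplicate on a sorted tuple of the ligands. The sketch below is only an illustration: the toy table, the labels `A` and `B`, and the `ligand_key` column are hypothetical, and whether ligand order actually varies in the chemo dataset is an assumption to check against the data.

```python
import pandas as pd

# Hypothetical toy table: rows 0 and 1 hold the same ligands in a different order.
toy = pd.DataFrame({
    "L1": ["A", "B", "A"],
    "L2": ["A", "A", "A"],
    "L3": ["B", "A", None],
})

# Sorted tuple of the non-missing ligands: hashable and order-insensitive.
toy["ligand_key"] = [
    tuple(sorted(lig for lig in ligs if pd.notna(lig)))
    for ligs in zip(toy["L1"], toy["L2"], toy["L3"])
]

# Position-wise comparison keeps all three rows,
by_position = toy.drop_duplicates(subset=["L1", "L2", "L3"], keep="first")
# while the sorted key merges rows 0 and 1.
by_key = toy.drop_duplicates(subset="ligand_key", keep="first")

print(len(by_position), len(by_key))
```

The `Ligands Set` column built earlier cannot be used for this directly, because Python sets are not hashable; a sorted tuple plays the same role.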

In [26]:
diff_compounds

Unnamed: 0,L1,L2,L3,Charge,DOI,Partition Coef logP,Cell Lines,Localisation,Incubation Time (hours),IC50 (μM),Ligands Set
0,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...,8,10.1002/anie.201507800,-3.54,HeLa,lysosom,4.0,470.00,{C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC...
1,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...,8,10.1002/anie.201507800,-3.05,HeLa,lysosom,4.0,425.00,{CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC...
2,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...,8,10.1002/anie.201507800,-1.55,HeLa,lysosom,4.0,301.00,{CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)...
3,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...,CC1=CC=NC(C2=NC=CC(C(NCCCCCCN)=O)=C2)=C1,2,10.1002/anie.201916400,,HeLa,golgi apparatus,4.0,13.57,{C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)...
5,C1(C2=NC=CC=C2C=C3)=C3C=CC=N1,C1(C2=NC=CC=C2C=C3)=C3C=CC=N1,O=CC1=CC=NC(C2=NC=CC(C=O)=C2)=C1,2,10.1016/j.bmc.2019.05.011,,HeLa,,,>100,"{O=CC1=CC=NC(C2=NC=CC(C=O)=C2)=C1, C1(C2=NC=CC..."
...,...,...,...,...,...,...,...,...,...,...,...
872,C12=NC=CC=C1C=CC3=CC=CN=C23,C12=NC=CC=C1C=CC3=CC=CN=C23,C12=NC=CC=C1C3=C(N=C(C=CC=C4)C4=N3)C5=CC=CN=C25,2,10.1002/cmdc.201700240,-1.30,MCF.7,,,50.0,"{C12=NC=CC=C1C=CC3=CC=CN=C23, C12=NC=CC=C1C3=C..."
877,C12=NC=CC=C1C3=C(N=C(C=C(N=C(C(C=CC=N4)=C4C5=C...,C12=NC=CC(C3=CC=CC=C3)=C1C=CC4=C(C5=CC=CC=C5)C...,C12=NC=CC(C3=CC=CC=C3)=C1C=CC4=C(C5=CC=CC=C5)C...,2,10.1002/cmdc.201700240,1.60,MCF.7,,,1.40,{C12=NC=CC=C1C3=C(N=C(C=C(N=C(C(C=CC=N4)=C4C5=...
882,CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1,CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1,C12=CC=CN=C1C3=NC=CC=C3C4=C2N=C5C(C=C(N=C(C(C=...,2,10.1002/cmdc.201700240,-0.60,MCF.7,,,18.0,"{CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1, C12=..."
887,CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1,CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1,CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1,2,10.1002/cmdc.201700240,-0.90,MCF.7,,,23.0,{CC1=C(C)C2=C(C(N=CC(C)=C3C)=C3C=C2)N=C1}


In [32]:
rows = diff_compounds.shape[0]
rows

196

**The number of rows is the number of different compounds available in the dataset**. Each compound can have several rows in the full dataset (several cell lines tested, several possible localisations for the same compound).

Now, let's analyse the prevalence of the different ligands. For each ligand, we will count how many times it appears across the compounds; a ligand can appear 1, 2 or 3 times in the same compound.

In [33]:
all_ligands = pd.concat([diff_compounds['L1'], diff_compounds['L2'], diff_compounds['L3']])
ligands = pd.DataFrame(all_ligands.unique(), columns=['SMILES'])

The **ligands** DataFrame contains every distinct ligand SMILES, with no redundancy. Since several compounds have only 2 ligands, their L3 entry is NaN, and we have to drop the NaN value this leaves among the unique entries.
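
`Series.unique()` keeps a missing value as one of the unique entries, which is why one NaN row remains until `dropna` removes it. A small sketch on hypothetical data (the labels `A` and `B` stand in for SMILES strings):

```python
import pandas as pd

# Hypothetical L3 column where two compounds have no third ligand.
l3 = pd.Series(["A", None, "B", None])

uniques = pd.DataFrame(l3.unique(), columns=["SMILES"])
cleaned = uniques.dropna(subset=["SMILES"])

print(len(uniques))  # 3: "A", the missing value, "B"
print(len(cleaned))  # 2
```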

In [34]:
ligands.dropna(subset=['SMILES'], inplace=True)
ligands

Unnamed: 0,SMILES
0,C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=...
1,CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)...
2,CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)C...
3,C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C...
4,C1(C2=NC=CC=C2C=C3)=C3C=CC=N1
...,...
154,C12=NC=CC=C1C3=C(C4=CC=CN=C24)N=C5C=CC6=C(N=C(...
155,FC(C=C(C=CC=C1)C1=C2)=C2OC(C=CC3=N4)=CC3=NC5=C...
156,C12=NC=CC=C1C3=C(C4=CC=CN=C24)N=C5C=C(OC6=CC=C...
157,C12=NC=CC=C1C3=C(N=C(C=CC=C4)C4=N3)C5=CC=CN=C25


In [35]:
rows_ligands = ligands.shape[0]  # count the rows of ligands, not of diff_compounds
rows_ligands

**The number of rows in `ligands` is the number of different ligands** used in total in our complexes. Let's now look at the prevalence of individual ligands.

In [36]:
ligands['Occurences'] = 0  # numerical column used to count the occurrences of each ligand

### Ligands stored in a pandas DataFrame

The following code (commented out here) treats the ligands list as a pandas DataFrame.

In [11]:
#ligands_count = ligands.copy()

#for l in ligands_count['SMILES']:
#    for l1 in diff_compounds['L1']:
#        if l1 == l : 
#            ligands_count.loc[ligands_count['SMILES'] == l, 'Occurences'] += 1
#    for l2 in diff_compounds['L2']:
#        if l2 == l : 
#            ligands_count.loc[ligands_count['SMILES'] == l, 'Occurences'] += 1
#    for l3 in diff_compounds['L3']:
#        if l3 == l : 
#            ligands_count.loc[ligands_count['SMILES'] == l, 'Occurences'] += 1

#ligands_sorted = ligands_count.sort_values(by='Occurences', ascending=False) #we organize the ligands by prevalence
#ligands_prevalence = ligands_sorted.reset_index(drop=True, ) # we rename indexes
#ligands_prevalence['Molecule'] = ligands_prevalence['SMILES'].apply(lambda x: Chem.MolFromSmiles(x))
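
The nested loops in the commented-out cell scan every compound once per ligand. A shorter way to obtain the same per-ligand totals is to stack the three ligand columns and call `value_counts`, which drops NaN by default and sorts by decreasing count. This is a minimal sketch on a hypothetical toy table (the labels `A`, `B`, `C` stand in for SMILES strings), not the notebook's own method:

```python
import pandas as pd

# Hypothetical deduplicated compounds table.
toy_compounds = pd.DataFrame({
    "L1": ["A", "B", "C"],
    "L2": ["A", "B", "A"],
    "L3": ["B", None, "A"],
})

# Stack L1, L2 and L3 into one long Series, then count each ligand.
stacked = pd.concat([toy_compounds["L1"], toy_compounds["L2"], toy_compounds["L3"]])
occurrences = stacked.value_counts()

# Same column names as the ligands table in this notebook.
ligand_counts = occurrences.rename_axis("SMILES").reset_index(name="Occurences")
print(ligand_counts)
```

As in the loops, a ligand that appears three times in one compound contributes three to its total.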

Let's now visualize the ligands, in order of prevalence in our compounds. It is important to note that prevalence in the compounds studied is not the same as prevalence in the dataset. In the dataset, ligands appearing in compounds that have many entries (for example, tested on various cell lines with various cellular localisations) will be more prevalent, whereas prevalence in compounds simply reflects which ligands are most used in compound synthesis and studies.
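
To make the distinction concrete, here is a small sketch with made-up records (the ligand labels and cell-line names are hypothetical placeholders, not entries from the chemo dataset). The complex (A, A, B) is measured on three cell lines, so ligand A dominates the raw rows, while ligand C is the most used once each complex is counted once.

```python
from collections import Counter

# Each record: (L1, L2, L3, cell line). Made-up values for illustration.
rows = [
    ("A", "A", "B", "line_1"),
    ("A", "A", "B", "line_2"),
    ("A", "A", "B", "line_3"),
    ("C", "C", "C", "line_1"),
]

# Prevalence in the dataset: every row contributes.
dataset_counts = Counter(lig for l1, l2, l3, _ in rows for lig in (l1, l2, l3))

# Prevalence in compounds: keep the first row of each (L1, L2, L3) triple.
compounds = list(dict.fromkeys((l1, l2, l3) for l1, l2, l3, _ in rows))
compound_counts = Counter(lig for triple in compounds for lig in triple)

print(dataset_counts)
print(compound_counts)
```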

In [12]:
#ligands_prevalence['Occurences'].to_list()
#mols2grid.display(ligands_prevalence, subset =["img","Occurences"], substruct_highlight=True)

### Ligands stored in a dictionary

If we want the ligand counts in a dictionary rather than a DataFrame, we execute the following:

In [13]:
# Ligand counts stored in a dictionary

ligands_dict = ligands.to_dict(orient='list')

for l1 in diff_compounds['L1']:
    position = ligands_dict['SMILES'].index(l1)
    ligands_dict['Occurences'][position] += 1
for l2 in diff_compounds['L2']:
    position = ligands_dict['SMILES'].index(l2)
    ligands_dict["Occurences"][position] += 1
for l3 in diff_compounds['L3']:
    if pd.notna(l3): 
        position = ligands_dict['SMILES'].index(l3)
        ligands_dict['Occurences'][position] += 1

ligands_dict

{'SMILES': ['C[N+](C)(CC1=CC(C2=CC(C[N+](C)(C)C)=CC=N2)=NC=C1)C',
  'CC[N+](CC)(CC1=CC=NC(C2=NC=CC(C[N+](CC)(CC)CC)=C2)=C1)CC',
  'CCCC[N+](CCCC)(CC1=CC(C2=CC(C[N+](CCCC)(CCCC)CCCC)=CC=N2)=NC=C1)CCCC',
  'C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C(C5=CC=CC=C5)C=CN=C34',
  'C1(C2=NC=CC=C2C=C3)=C3C=CC=N1',
  'C1(C2=CC=CC=C2)=CC=NC3=C1C=CC4=C3N=CC=C4C5=CC=CC=C5',
  'C1(C2=CC=CC=N2)=CC=CC=N1',
  'C12=CC=CN=C1C3=C(C=CC=N3)C=C2',
  'C12=C(C3=CC=CC=C3)C=CN=C1C4=C(C(C5=CC=CC=C5)=CC=N4)C=C2',
  'C1(C(N=CC=C2)=C2C3=C4N=CC=N3)=C4C=CC=N1',
  'C1(C2=CC=CC=N2)=NC=CC=C1',
  'C1(C2=NC=CC=C2)=CC=CC=N1',
  'C1(C(N=CC=C2C3=CC=CC=C3)=C2C=C4)=C4C(C5=CC=CC=C5)=CC=N1',
  'C12=CC=CN=C1C3=NC=CC=C3C4=C2N=C5C(C=CC=C5)=N4',
  'C1(C2=NC=CN=C2)=CN=CC=N1',
  'C1(NC2=NC=CC=C2)=NC=CC=C1',
  'CCN(CC)C1=CC=NC(C2=CC(N(CC)CC)=CC=N2)=C1',
  'C1(C=C2)=C(C3=C2C=CC=N3)N=CC=C1',
  'C1(C2=NC(C3=NC=CC=C3)=CC=C2)=CC=CC=N1',
  'C1(C2=CC=CC=C2)=CC(C3=NC=CC=C3)=NC(C4=CC=CC=N4)=C1',
  'CC1=CC(C2=CC(C)=CC=N2)=NC=C1',
  'C1(C2=C(C3=C4N=C5C(C=CC=

Let's now visualize the ligands, in order of prevalence in our compounds. The difficulty with a dictionary of lists is that the lists are related only by position: the first SMILES corresponds to the first 'Occurences' value, but sorting one list on its own would break that correspondence. We therefore pair the values up before sorting.

In [14]:
# First, we 'link' the SMILES and Occurences together
linkingvalues = list(zip(ligands_dict['SMILES'], ligands_dict['Occurences']))

# We sort the linked values based on the 'Occurences' key
sorted_list = sorted(linkingvalues, key=lambda x: x[1], reverse=True) 

ligands_dict['SMILES'] = [x[0] for x in sorted_list]  # put the sorted SMILES back into the dictionary
ligands_dict['Occurences'] = [x[1] for x in sorted_list]  # put the sorted occurrences back into the dictionary
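
The same pair, sort, unpack pattern on a tiny hypothetical dictionary (the labels `A`, `B`, `C` stand in for SMILES strings) shows that each SMILES stays attached to its count after sorting:

```python
# Hypothetical dictionary of two parallel lists.
toy_dict = {"SMILES": ["A", "B", "C"], "Occurences": [1, 5, 3]}

# Pair each SMILES with its count, then sort the pairs by count, largest first.
pairs = list(zip(toy_dict["SMILES"], toy_dict["Occurences"]))
pairs_sorted = sorted(pairs, key=lambda pair: pair[1], reverse=True)

# zip(*...) splits the sorted pairs back into two parallel sequences.
smiles_sorted, counts_sorted = (list(col) for col in zip(*pairs_sorted))
toy_dict["SMILES"], toy_dict["Occurences"] = smiles_sorted, counts_sorted

print(toy_dict)
# {'SMILES': ['B', 'C', 'A'], 'Occurences': [5, 3, 1]}
```

Because `sorted` is stable, ligands with equal counts keep their original relative order.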

We now add the RDKit representation of the molecules under a third key in the dictionary.

In [15]:
smiles_list = [x[0] for x in sorted_list]
ligands_dict['Molecules'] =[Chem.MolFromSmiles(smiles) for smiles in smiles_list]

In [16]:
mols2grid.display(ligands_dict, subset =["img","Occurences"], substruct_highlight=True)

MolGridWidget()