# Bioassay analysis
In this first section, you will work with a CSV file containing the results of a bioassay. Again, some exercises are easier and some are more complicated. Do not worry if you do not manage to finish them all.

*The file is made up and all data is fake*

### Import the file into a dataframe
Import the file named `bioassay1.csv` and:
* Check its first 5 rows
* Examine how many datapoints we have

In [1]:
import pandas as pd
df = pd.read_csv("bioassay1.csv")
df.head()

Unnamed: 0,SMILES,Date,Series,exp_values,units,Lab
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-03-12,Series1,0.242221,uM,Lab A
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2019-12-23,Series1,0.658967,uM,Lab A
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A
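
`head()` shows the first rows but does not say how many datapoints there are. A minimal sketch of two common ways to count them, using a tiny invented dataframe (`toy` and its values are assumptions for illustration, not the contents of `bioassay1.csv`):

```python
import pandas as pd

# Tiny stand-in for the bioassay table; the values are invented.
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCN", "CCC"],
    "exp_values": [0.24, 0.62, 0.66],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy))             # len() of a dataframe is its number of rows
```

With the real data, the same calls would be applied to `df`.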


### Find the maximum and minimum values of the bioassay result

In [2]:
df["exp_values"].max()

3.925381879271256

In [3]:
df["exp_values"].min()

0.0002400519751935
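
Both extremes can also be requested in one call, and `idxmin()` / `idxmax()` report which row holds them. A small sketch on invented values (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

# Invented values standing in for the exp_values column.
toy = pd.DataFrame({"exp_values": [0.24, 0.62, 0.0002, 3.9]})

# agg() with a list of function names returns a Series indexed by those names.
extremes = toy["exp_values"].agg(["min", "max"])
print(extremes)

# Index label of the row holding the smallest value.
row_of_min = toy["exp_values"].idxmin()
print(row_of_min)
```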

### How many molecules correspond to each chemical series?

In [4]:
df.groupby("Series").nunique()

Series,SMILES,Date,exp_values,units,Lab
Series1,855,714,855,1,1
Series2,1424,1081,1424,1,1
Series3,2848,1665,2848,1,1
Series4,570,509,570,1,1
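
Note that `nunique()` counts distinct values per column, which equals the number of rows only when no value repeats within a series. A hedged sketch contrasting it with `value_counts()`, on an invented dataframe in which one SMILES deliberately appears twice (all names and values here are assumptions):

```python
import pandas as pd

# Invented data: "CCO" is listed twice in Series1 on purpose.
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "CCN", "CCC"],
    "Series": ["Series1", "Series1", "Series1", "Series2"],
})

# Number of rows (measurements) per series.
rows_per_series = toy["Series"].value_counts()

# Number of distinct SMILES per series.
unique_per_series = toy.groupby("Series")["SMILES"].nunique()

print(rows_per_series)
print(unique_per_series)
```

Which of the two answers "how many molecules" depends on whether repeated entries should count once or several times.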


### Find specific values:
Create a list of SMILES for which the experimental values are **less than** 0.01

*tip: first, create a new dataframe containing only values < 0.01 and from there create the smiles list*

In [89]:
subset = df[df["exp_values"]<0.01]
smilist = subset["SMILES"].tolist()
len(smilist)

40
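
The filter and the column selection can also be combined in a single `.loc` call. A minimal sketch on invented data (the `toy` dataframe and the `low_smiles` name are assumptions for illustration):

```python
import pandas as pd

# Invented values; two of the three rows are below the 0.01 threshold.
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCN", "CCC"],
    "exp_values": [0.005, 0.5, 0.009],
})

# .loc[row_mask, column] filters the rows and picks the column in one step.
low_smiles = toy.loc[toy["exp_values"] < 0.01, "SMILES"].tolist()
print(low_smiles)
```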

### Create a new column in the dataframe with only the years
Think of Session 1 exercises and the split function

*tip: the easiest is to create a list of dates and work from there*

In [12]:
fulldate = df["Date"].tolist()
years = []
for date in fulldate:
    # dates look like "2019-03-12"; the year is the part before the first "-"
    year = date.split("-")[0]
    years.append(year)

df["Year"] = years
df

Unnamed: 0,SMILES,Date,Series,exp_values,units,Lab,Year
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-03-12,Series1,0.242221,uM,Lab A,2019
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A,2016
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2019-12-23,Series1,0.658967,uM,Lab A,2019
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A,2019
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A,2017
...,...,...,...,...,...,...,...
5692,O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4,2019-10-18,Series4,1.777240,uM,Lab A,2019
5693,O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3,2018-01-13,Series4,0.408828,uM,Lab A,2018
5694,CC1=CC(N)=C2C=CC=CC2=[N+]1CCCCCCCCCC[N+]3=C(C)...,2015-12-19,Series4,1.925698,uM,Lab A,2015
5695,C[N+]1=CC=C(C=CC2=CNC3=CC=CC=C23)C=C1,2016-02-12,Series4,1.991048,uM,Lab A,2016
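
The loop can also be replaced by column-wise operations. A sketch of two options on invented dates (the `toy` dataframe and the `Year_str` / `Year_int` column names are assumptions): the `.str` accessor keeps the year as text, while `pd.to_datetime` yields integers.

```python
import pandas as pd

# Invented dates in the same YYYY-MM-DD layout as the exercise file.
toy = pd.DataFrame({"Date": ["2019-03-12", "2016-12-21"]})

# Option A: split every string on "-" and keep the first piece (still a string).
toy["Year_str"] = toy["Date"].str.split("-").str[0]

# Option B: parse the column as dates and read the year attribute (an integer).
toy["Year_int"] = pd.to_datetime(toy["Date"]).dt.year

print(toy)
```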


In [13]:
# drop() returns a new dataframe; df itself keeps its "Date" column
df.drop("Date", axis=1)

Unnamed: 0,SMILES,Series,exp_values,units,Lab,Year
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,Series1,0.242221,uM,Lab A,2019
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,Series1,0.624911,uM,Lab A,2016
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,Series1,0.658967,uM,Lab A,2019
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,Series1,0.979338,uM,Lab A,2019
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,Series1,0.161337,uM,Lab A,2017
...,...,...,...,...,...,...
5692,O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4,Series4,1.777240,uM,Lab A,2019
5693,O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3,Series4,0.408828,uM,Lab A,2018
5694,CC1=CC(N)=C2C=CC=CC2=[N+]1CCCCCCCCCC[N+]3=C(C)...,Series4,1.925698,uM,Lab A,2015
5695,C[N+]1=CC=C(C=CC2=CNC3=CC=CC=C23)C=C1,Series4,1.991048,uM,Lab A,2016


# Merging Results
You have received a second dataset, `bioassay_new.csv`, produced by Lab B for the same bioassay.
Merge the datasets, averaging the bioassay result values between labs.

*tip: you can first merge the data and then create a new column with the averages*

In [7]:
df2 = pd.read_csv("bioassay_new.csv")
df2

Unnamed: 0,SMILES,Date,Series,exp_values,units,Lab
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-09-04,Series1,0.120925,uM,Lab B
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2017-12-25,Series1,0.246136,uM,Lab B
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2016-01-11,Series1,1.363062,uM,Lab B
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2020-10-14,Series1,0.935486,uM,Lab B
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-11-05,Series1,1.447256,uM,Lab B
...,...,...,...,...,...,...
5692,O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4,2020-07-11,Series4,0.134999,uM,Lab B
5693,O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3,2021-04-27,Series4,2.575272,uM,Lab B
5694,CC1=CC(N)=C2C=CC=CC2=[N+]1CCCCCCCCCC[N+]3=C(C)...,2017-01-31,Series4,2.109766,uM,Lab B
5695,C[N+]1=CC=C(C=CC2=CNC3=CC=CC=C23)C=C1,2016-02-18,Series4,0.613645,uM,Lab B


In [8]:
# option 1: inner merge on SMILES, then average the two labs' value columns
merged = pd.merge(df, df2, how="inner", on="SMILES")
merged["avg"] = (merged["exp_values_x"] + merged["exp_values_y"]) / 2
merged

Unnamed: 0,SMILES,Date_x,Series_x,exp_values_x,units_x,Lab_x,Year,Date_y,Series_y,exp_values_y,units_y,Lab_y,avg
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-03-12,Series1,0.242221,uM,Lab A,2019,2019-09-04,Series1,0.120925,uM,Lab B,0.181573
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A,2016,2017-12-25,Series1,0.246136,uM,Lab B,0.435524
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2019-12-23,Series1,0.658967,uM,Lab A,2019,2016-01-11,Series1,1.363062,uM,Lab B,1.011015
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A,2019,2020-10-14,Series1,0.935486,uM,Lab B,0.957412
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A,2017,2017-11-05,Series1,1.447256,uM,Lab B,0.804296
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5692,O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4,2019-10-18,Series4,1.777240,uM,Lab A,2019,2020-07-11,Series4,0.134999,uM,Lab B,0.956120
5693,O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3,2018-01-13,Series4,0.408828,uM,Lab A,2018,2021-04-27,Series4,2.575272,uM,Lab B,1.492050
5694,CC1=CC(N)=C2C=CC=CC2=[N+]1CCCCCCCCCC[N+]3=C(C)...,2015-12-19,Series4,1.925698,uM,Lab A,2015,2017-01-31,Series4,2.109766,uM,Lab B,2.017732
5695,C[N+]1=CC=C(C=CC2=CNC3=CC=CC=C23)C=C1,2016-02-12,Series4,1.991048,uM,Lab A,2016,2016-02-18,Series4,0.613645,uM,Lab B,1.302346


In [9]:
# option 2: row-wise mean over the two value columns
cols = ["exp_values_x", "exp_values_y"]
merged["mean_vals"] = merged[cols].mean(axis=1)
merged

Unnamed: 0,SMILES,Date_x,Series_x,exp_values_x,units_x,Lab_x,Year,Date_y,Series_y,exp_values_y,units_y,Lab_y,avg,mean_vals
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-03-12,Series1,0.242221,uM,Lab A,2019,2019-09-04,Series1,0.120925,uM,Lab B,0.181573,0.181573
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A,2016,2017-12-25,Series1,0.246136,uM,Lab B,0.435524,0.435524
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2019-12-23,Series1,0.658967,uM,Lab A,2019,2016-01-11,Series1,1.363062,uM,Lab B,1.011015,1.011015
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A,2019,2020-10-14,Series1,0.935486,uM,Lab B,0.957412,0.957412
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A,2017,2017-11-05,Series1,1.447256,uM,Lab B,0.804296,0.804296
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5692,O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4,2019-10-18,Series4,1.777240,uM,Lab A,2019,2020-07-11,Series4,0.134999,uM,Lab B,0.956120,0.956120
5693,O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3,2018-01-13,Series4,0.408828,uM,Lab A,2018,2021-04-27,Series4,2.575272,uM,Lab B,1.492050,1.492050
5694,CC1=CC(N)=C2C=CC=CC2=[N+]1CCCCCCCCCC[N+]3=C(C)...,2015-12-19,Series4,1.925698,uM,Lab A,2015,2017-01-31,Series4,2.109766,uM,Lab B,2.017732,2.017732
5695,C[N+]1=CC=C(C=CC2=CNC3=CC=CC=C23)C=C1,2016-02-12,Series4,1.991048,uM,Lab A,2016,2016-02-18,Series4,0.613645,uM,Lab B,1.302346,1.302346
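
The default `_x` / `_y` suffixes do not say which lab a column came from, and an inner merge silently multiplies rows if a SMILES occurs more than once in either table. A hedged sketch of the `suffixes` and `validate` arguments of `pd.merge`, on two invented tables (`lab_a`, `lab_b`, the suffix names and all values are assumptions):

```python
import pandas as pd

# Invented measurements for the same two molecules from two labs.
lab_a = pd.DataFrame({"SMILES": ["CCO", "CCN"], "exp_values": [0.2, 0.6]})
lab_b = pd.DataFrame({"SMILES": ["CCO", "CCN"], "exp_values": [0.4, 0.2]})

merged_toy = pd.merge(
    lab_a,
    lab_b,
    how="inner",
    on="SMILES",
    suffixes=("_labA", "_labB"),  # replaces the default _x / _y
    validate="one_to_one",        # raises MergeError if a SMILES is duplicated
)

# Row-wise mean of the two labs' values.
value_cols = ["exp_values_labA", "exp_values_labB"]
merged_toy["avg"] = merged_toy[value_cols].mean(axis=1)
print(merged_toy)
```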


In [11]:
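# option 3: stack both tables and average exp_values per SMILES
# (selecting the column avoids averaging non-numeric columns, which newer pandas rejects)
aggregate = pd.concat([df, df2]).groupby("SMILES", as_index=False)["exp_values"].mean()
aggregate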

Unnamed: 0,SMILES,exp_values
0,BrC(Br)(Br)C1=NC=C(C=N1)C2=CC=CC=C2,0.986167
1,BrC1=C2C(=NN(C3=CC=CC=C3)C(C4=CC=CC=C4)=C2C(Br...,1.327337
2,BrC1=CC2=C(N(CC2)C(=O)C3CC3)C(=C1)S(=O)(=O)N4C...,0.831327
3,BrC1=CC2=C(N(CC2)C(=O)C3CC3)C(=C1)S(=O)(=O)N4C...,0.664682
4,BrC1=CC2=C(N(CC2)C(=O)C3CC3)C(=C1)S(=O)(=O)N4C...,0.954176
...,...,...
5692,[O-][N+](=O)C1=CC=CC=C1NC(=O)C2CC2(C3=CC=CC=C3...,0.678798
5693,[O-][N+](=O)C1=CC=CC=C1S(=O)(=O)NNC(=O)C2=CC=C...,0.778174
5694,[O-][N+](=O)C1=CC=CN(CC2=CC=C(F)C=C2Cl)C1=O,0.412277
5695,[O-][N+](=O)C1=CN(CC(=O)N2CCN(CC2)C3=CC=C(F)C=...,0.998154
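
Unlike the inner merge, the concat-and-groupby approach keeps molecules that were measured by only one lab; their "average" is then a single value. A sketch that reports the number of measurements next to the mean, on invented tables (`lab_a`, `lab_b` and the output column names are assumptions):

```python
import pandas as pd

# Invented data: "CCN" was measured by one lab only.
lab_a = pd.DataFrame({"SMILES": ["CCO", "CCN"], "exp_values": [0.2, 0.6]})
lab_b = pd.DataFrame({"SMILES": ["CCO"], "exp_values": [0.4]})

summary = (
    pd.concat([lab_a, lab_b])
    .groupby("SMILES", as_index=False)
    # Named aggregation: new_column=(source_column, function)
    .agg(mean_value=("exp_values", "mean"), n_measurements=("exp_values", "count"))
)
print(summary)
```

Rows with `n_measurements` equal to 1 are the ones for which no cross-lab average exists.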


# Bonus
If you have successfully completed the above exercises, explore the rdkit package for chemoinformatics:

You first need to pip install rdkit in your environment; check the PyPI repository of Python packages for instructions:
https://pypi.org/project/rdkit-pypi/

Now, select one molecule from the exercise and try to calculate its molecular weight (https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html).

If you managed that, try to automate the exercise for the whole list of SMILES in Bioassay 1 and store the calculated molecular weights in a list.

*hint: smiles have to be converted to molecules for rdkit (https://www.rdkit.org/docs/GettingStartedInPython.html)*
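
One possible shape for the bonus, offered as a sketch rather than the reference solution. It uses `Chem.MolFromSmiles` and `Descriptors.MolWt` from rdkit; the helper name `molecular_weights`, the import guard, and the example SMILES list are assumptions made for illustration.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import Descriptors
except ImportError:
    # rdkit is not installed in this environment (see the install note above).
    Chem = Descriptors = None


def molecular_weights(smiles_list):
    """Return one molecular weight per SMILES, or None where parsing fails."""
    if Chem is None:
        raise RuntimeError("rdkit is required for this example")
    weights = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # returns None for an invalid SMILES
        weights.append(Descriptors.MolWt(mol) if mol is not None else None)
    return weights


if Chem is not None:
    # Ethanol, one molecule from the exercise table, and a deliberately bad string.
    example = ["CCO", "C=CCC1(CCCCC1)NC2=CC=CC=C2", "not_a_smiles"]
    print(molecular_weights(example))
```

For the full exercise, the list passed in would be `df["SMILES"].tolist()`; checking for `None` entries afterwards shows which SMILES rdkit could not parse.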