# Bioassay analysis
In this first section, you will work with a CSV file containing the results of a bioassay. As before, some exercises are easier and some are more complicated. Do not worry if you do not manage to finish them all.

*The file is made up and all data is fake*

### Import the file into a dataframe
Import the file named `bioassay1.csv` and:
* Check its first 5 rows
* Examine how many datapoints we have

*tip: before starting remember to import the pandas package*

In [3]:
import pandas as pd
df = pd.read_csv("bioassay1.csv")
df.head(5)

,SMILES,Date,Series,exp_values,units,Lab
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-03-12,Series1,0.242221,uM,Lab A
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2019-12-23,Series1,0.658967,uM,Lab A
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A


In [5]:
len(df)

5697
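As a complement to `len(df)`, `df.shape` reports rows and columns together. The sketch below is only illustrative: it loads three rows copied from the output above through `io.StringIO`, standing in for `bioassay1.csv`, and the name `sample` is an assumption chosen so it does not overwrite `df`.

```python
import io
import pandas as pd

# Three complete rows copied from the output above, standing in for bioassay1.csv
csv_text = """SMILES,Date,Series,exp_values,units,Lab
NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A
C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A
FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A
"""
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple; len() gives only the number of rows
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# dtypes shows how read_csv interpreted each column
print(sample.dtypes)
```

On the real file, the same calls apply directly to `df`.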

### Find the maximum and minimum values of the bioassay result

In [11]:
df["exp_values"].max()

3.925381879271256

In [12]:
df["exp_values"].min()

0.0002400519751935
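Both extremes can also be obtained in one call, and `idxmin`/`idxmax` point to the rows that hold them. This is a minimal sketch on three rows copied from the first output of this notebook; `sample`, `extremes` and `lowest_row` are illustrative names, not part of the exercise.

```python
import pandas as pd

# Three rows copied from the df.head() output shown earlier
sample = pd.DataFrame({
    "SMILES": [
        "NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3",
        "C=CCC1(CCCCC1)NC2=CC=CC=C2",
        "FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2",
    ],
    "exp_values": [0.624911, 0.979338, 0.161337],
})

# agg with a list of function names returns a Series indexed by those names
extremes = sample["exp_values"].agg(["min", "max"])
print(extremes)

# idxmin returns the index label of the smallest value, so .loc fetches that row
lowest_row = sample.loc[sample["exp_values"].idxmin()]
print(lowest_row["SMILES"])
```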

### How many molecules correspond to each chemical series?

In [13]:
df.groupby("Series").nunique()

Series,SMILES,Date,exp_values,units,Lab
Series1,855,714,855,1,1
Series2,1424,1081,1424,1,1
Series3,2848,1665,2848,1,1
Series4,570,509,570,1,1
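The table above reports the number of unique values in every column. If only the molecule count is of interest, one option is to select the `SMILES` column before calling `nunique`. The sketch below uses five rows copied from the notebook outputs; the variable names are illustrative.

```python
import pandas as pd

# Five rows copied from the notebook outputs (three from Series1, two from Series4)
sample = pd.DataFrame({
    "SMILES": [
        "NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3",
        "C=CCC1(CCCCC1)NC2=CC=CC=C2",
        "FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2",
        "O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4",
        "O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3",
    ],
    "Series": ["Series1", "Series1", "Series1", "Series4", "Series4"],
})

# Distinct SMILES per series: duplicates of the same molecule count once
molecules_per_series = sample.groupby("Series")["SMILES"].nunique()
print(molecules_per_series)

# Rows per series: counts every measurement, including repeated molecules
rows_per_series = sample["Series"].value_counts()
print(rows_per_series)
```

The two counts agree whenever each molecule appears only once per series, which is what the output above suggests for this dataset.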


### Find specific values:
Create a list of SMILES for which the experimental values are **less than** 0.01

*tip: first, create a new dataframe containing only values < 0.01 and from there create the smiles list*

In [24]:
smi = df[df["exp_values"]<0.01]["SMILES"].tolist()

In [25]:
smi

['CCOC(=O)C1=CN=C2C=CC(C)=CC2=C1NCCCN3CCOCC3',
 'C[C@@H](O)[C@@H]1OCC[C@@H](C)[C@H](O)C(=O)OC[C@]23CCC(C)=C[C@H]2O[C@@H]4C[C@@H](OC(=O)C=CC=C1)[C@@]3(C)[C@@]45CO5',
 'CCOC(=O)C1=C(N)n2c(=O)c(=CC3=CC=CN=C3)sc2=C(C1C4=CC=CN=C4)C(=O)OC',
 'COC1=CC=C(\\C=C2\\SC(NC3=CC=C(F)C=C3)=NC2=O)C(OC)=C1',
 'CCCN(CCC)[P+](CC)(N(CCC)CCC)N(CCC)CCC',
 'CCN1C(C)=CC(\\C=C2/C(=O)NC(=O)N(C2=O)C3=CC=CC(Cl)=C3)=C1C',
 'CN(CC(=O)NC1=CC=C(C=C1)C(C)=O)S(=O)(=O)C2=CC=C3NC4=C(CCCC4)C3=C2',
 'CCN1C(\\C=CC2=CC(C)=CC=C12)=C\\C3=[N+](CC)C4=CC(OC)=CC=C4S3',
 'CN(C)C1=CC=C(C=C1)\\C=C2/SC(NC2=O)=NC3=CC=C(F)C=C3',
 'CCCCOC(=O)CC1=CC=C(N)C=C1',
 'COC1=CC(NC(C)CCCN(CC2=CC=C(OC)C(OC)=C2)C(=O)C3=CC=C4C=CC=CC4=C3)=C5N=CC=CC5=C1',
 'COC1=CC=C(C=C2SC(NC2=O)=NC3=CC=C(F)C=C3)C(OC)=C1',
 'CCCCN(CC)S(=O)(=O)C1=CC(Br)=CC2=C1N(CC2)C(C)=O',
 'CC(CC(=O)OC1=CC=CC=C1C2=C(NC(C)(C)CC(C)(C)C)N3C=CC=CC3=N2)CC(C)(C)C',
 '[O-][N+](=O)C1=CC=C(O1)\\C=C\\C=C2/SC(=N)NC2=O',
 'CCCOC1=CC=C2C3=CC=C(OCCC)C=C3C(=NNC(N)=N)C2=C1',
 'FC(F)(F)C1=CC=CC(=C1)N2
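The one-line filter above can be split into the two steps the tip describes, which makes the boolean mask easier to inspect. In this sketch the SMILES come from the notebook, but the `exp_values` paired with them are invented for illustration, and `threshold` and `below_threshold` are hypothetical names.

```python
import pandas as pd

# SMILES taken from the notebook; two of the exp_values are made up for this example
sample = pd.DataFrame({
    "SMILES": [
        "CCCCOC(=O)CC1=CC=C(N)C=C1",
        "C=CCC1(CCCCC1)NC2=CC=CC=C2",
        "CCCN(CCC)[P+](CC)(N(CCC)CCC)N(CCC)CCC",
    ],
    "exp_values": [0.004, 0.979338, 0.009],
})

threshold = 0.01

# Step 1: a boolean Series, True where the value is strictly below the threshold
mask = sample["exp_values"] < threshold

# Step 2: keep only the matching rows, then pull the SMILES column out as a list
below_threshold = sample.loc[mask]
smi_list = below_threshold["SMILES"].tolist()
print(smi_list)
```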

### Create a new column in the dataframe with only the years
Think back to the Session 1 exercises and the `split` function.

*tip: the easiest approach is to create a list of dates and work from there*

In [26]:
years = []
fulldate = df["Date"].tolist()

In [30]:
# Each date is a string of the form "YYYY-MM-DD", so the year is the first piece
for item in fulldate:
    spl = item.split("-")
    y = spl[0]
    years.append(y)

In [32]:
df["Year"] = years

In [33]:
df

,SMILES,Date,Series,exp_values,units,Lab,Year
0,COC1=C2OC3=C(OC)C(OC)=CC4=C3C(CC5=CC=C(O)C(OC6...,2019-03-12,Series1,0.242221,uM,Lab A,2019
1,NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3,2016-12-21,Series1,0.624911,uM,Lab A,2016
2,COC1=CC2=C(C=C1OC)C(=O)N(NC(=O)C3=CC=CC(F)=C3)...,2019-12-23,Series1,0.658967,uM,Lab A,2019
3,C=CCC1(CCCCC1)NC2=CC=CC=C2,2019-10-14,Series1,0.979338,uM,Lab A,2019
4,FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2,2017-03-20,Series1,0.161337,uM,Lab A,2017
...,...,...,...,...,...,...,...
5692,O=C(N1CCN(CC1)CC2=CC=C3OCOC3=C2)C4=CC=NC=C4,2019-10-18,Series4,1.777240,uM,Lab A,2019
5693,O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3,2018-01-13,Series4,0.408828,uM,Lab A,2018
5694,CC1=CC(N)=C2C=CC=CC2=[N+]1CCCCCCCCCC[N+]3=C(C)...,2015-12-19,Series4,1.925698,uM,Lab A,2015
5695,C[N+]1=CC=C(C=CC2=CNC3=CC=CC=C23)C=C1,2016-02-12,Series4,1.991048,uM,Lab A,2016
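The loop stores each year as a string. pandas can do the same split on the whole column at once, or parse the dates and return integer years. The sketch below shows both on three dates copied from the output; the column name `Year_int` is an illustrative assumption.

```python
import pandas as pd

# Three dates copied from the output above
sample = pd.DataFrame({"Date": ["2016-12-21", "2019-10-14", "2017-03-20"]})

# Vectorised version of the loop: split every string on "-" and keep the first piece
sample["Year"] = sample["Date"].str.split("-").str[0]

# Alternative: parse the strings as dates, then read the year as an integer
sample["Year_int"] = pd.to_datetime(sample["Date"]).dt.year

print(sample)
```

The string version matches what the loop produces; the integer version is more convenient for sorting or numeric comparisons.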


# Merging Results
You have received a second dataset, `bioassay_new`, produced by Lab B for the same bioassay.
Merge the two datasets, averaging the bioassay result values between labs.

*tip: you can first merge the data and then create a new column with the averages*
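One possible way to follow the tip is sketched below, under several assumptions: molecules are matched on the `SMILES` column, the SMILES strings are written identically in both files, and the Lab B values shown here are invented because the contents of `bioassay_new` are not shown. The names `lab_a`, `lab_b` and `exp_values_mean` are hypothetical.

```python
import pandas as pd

# Lab A: three rows copied from the first dataset
lab_a = pd.DataFrame({
    "SMILES": [
        "NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3",
        "C=CCC1(CCCCC1)NC2=CC=CC=C2",
        "FC1=CC=CC=C1NC2=NC(NCC3=CC=CC=C3)=C4C=CC=CC4=N2",
    ],
    "exp_values": [0.624911, 0.979338, 0.161337],
})

# Lab B: made-up values for illustration only
lab_b = pd.DataFrame({
    "SMILES": [
        "NC1=NC(N)=C2C=C(C=CC2=N1)S(=O)(=O)N3CCCCC3",
        "C=CCC1(CCCCC1)NC2=CC=CC=C2",
        "O=C(C=CC1=CC=CN=C1)C23CC4CC(CC(C4)C2)C3",
    ],
    "exp_values": [0.60, 1.00, 0.40],
})

# Outer merge keeps molecules measured by only one lab;
# suffixes tell the two exp_values columns apart
merged = pd.merge(lab_a, lab_b, on="SMILES", how="outer",
                  suffixes=("_labA", "_labB"))

# Row-wise mean; missing values are skipped, so a molecule measured
# by a single lab keeps that lab's value
merged["exp_values_mean"] = merged[["exp_values_labA", "exp_values_labB"]].mean(axis=1)
print(merged)
```

Using `how="inner"` instead would keep only the molecules measured by both labs; which behaviour is wanted depends on how the averaged data will be used.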

# Bonus
If you have successfully completed the above exercises, explore the rdkit package for chemoinformatics:

You first need to install rdkit in your environment with pip; check the PyPI page of the package for instructions:
https://pypi.org/project/rdkit-pypi/

Now, select one molecule from the exercise and try to calculate its molecular weight (https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html).

If you managed, try to automate the exercise for the whole list of SMILES in Bioassay 1 and store the calculated molecular weights in a list.

*hint: smiles have to be converted to molecules for rdkit (https://www.rdkit.org/docs/GettingStartedInPython.html)*
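A minimal sketch of the bonus exercise follows, assuming rdkit is installed. It uses `Chem.MolFromSmiles` to build a molecule and `Descriptors.MolWt` to compute its molecular weight. The helper name `molecular_weights` is hypothetical, and storing `None` for SMILES that fail to parse is one possible choice, not the only one.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def molecular_weights(smiles_list):
    """Return one molecular weight per SMILES, or None where parsing fails."""
    weights = []
    for smiles in smiles_list:
        # MolFromSmiles returns None when the string is not a valid SMILES
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            weights.append(None)
        else:
            weights.append(Descriptors.MolWt(mol))
    return weights


# One SMILES copied from the dataset plus a deliberately invalid string
example_smiles = ["C=CCC1(CCCCC1)NC2=CC=CC=C2", "not_a_smiles"]
weights = molecular_weights(example_smiles)
print(weights)
```

For the full dataset, the same helper could be called as `molecular_weights(df["SMILES"].tolist())`. Keeping a `None` placeholder preserves the alignment between the list of weights and the dataframe rows.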