#**Breakout Sessions:**

Next we'll move into a series of breakout sessions. In each one, discuss the given topics as a group, note any key points on a flipchart, and present these points during the feedback session afterwards.
________________________

#**Computational Tools Discussion** (30 mins)

Take ~10 minutes to get to know each other:

*   Give each person a chance to introduce themselves and their field of expertise.
*   Decide on a group name. 

Now take ~15 minutes to discuss the following questions as a group, taking notes on your flipchart. You'll be asked for some feedback on these discussion points afterwards. 
*   What current computational tools and/or skills, if any, do you use in your research?
*   What challenges/limitations do you face that prevent you from making further use of data science tools?
*   How could some of these gaps be addressed?



#**Chemical Space Discussion** (30 mins)

20 mins discussion:

Discuss the following questions and take notes for the feedback session:
*   What do you understand by the term 'chemical space'?
*   How does this differ from the concept of 'drug-like molecules'?
*   Why should the data we use to train models be similar to the types of compounds we aim to obtain predictions for?
*   How might our training data requirements vary between a virtual screening campaign for novel chemical hits and a model to score analogues of a particular chemical series that is further along the drug discovery pipeline? 
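
As a prompt for the discussion above, here is a minimal sketch of one common way to ask "how close is this compound to my training data?": the highest Tanimoto similarity between a query fingerprint and any training fingerprint. The fingerprints below are invented toy sets of "on" bits and the function names are hypothetical; in practice the fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
# Toy fingerprints: each molecule is represented by the set of its "on" bit indices.
# All values are made up for illustration only.

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(query, training_set):
    """Highest Tanimoto similarity between a query and any training fingerprint."""
    return max(tanimoto(query, fp) for fp in training_set)


training_fps = [
    {1, 2, 3, 4},
    {2, 3, 4, 5},
    {1, 3, 5, 7},
]

close_analogue = {1, 2, 3, 4, 6}      # shares most bits with the first training compound
distant_compound = {3, 10, 11, 12}    # shares a single bit with each training compound

sim_close = nearest_training_similarity(close_analogue, training_fps)
sim_distant = nearest_training_similarity(distant_compound, training_fps)

print(f"Close analogue:   {sim_close:.2f}")
print(f"Distant compound: {sim_distant:.2f}")
```

A query with a low nearest-neighbour similarity sits further from the training data in this representation, so predictions for it deserve more caution. What counts as "low" depends on the fingerprint and the task, so no cut-off is suggested here.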





10 mins activity:

Reconnect to Google Drive with the first cell below, then reinstall the plotting libraries with the second cell.

In [None]:
#Run this again if you need to re-connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
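%%capture
# Install the cheminformatics and plotting dependencies (output is hidden by %%capture)
!pip install rdkit
!pip install umap-learn
# Make the course helper module on Google Drive importable
import sys
sys.path.append("drive/MyDrive/DataScience_Workshop/data/day1/")
from courseFunctions import *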

Now load the images with the image cell below and answer the questions that follow. (In the meantime, you can run the remaining code cells in this section to start the processing.)


In these images, we have a number of chemical series in colour plotted against some general training data in black.

*   Which of these series are outside the chemical space of our training data and why?

*   Does this make them more likely or less likely to be accurately predicted by a model trained on this data (the data in black)?

In [None]:
from IPython.display import Image, display

# Load the PCA and UMAP chemical space images from Google Drive
img0 = Image("/content/drive/MyDrive/DataScience_Workshop/data/day1/PCA_chem_space.png")
img1 = Image("/content/drive/MyDrive/DataScience_Workshop/data/day1/UMAP_chem_space.png")

# Show the two images one after the other
display(img0, img1)

The next six cells plot a set of inhibitors of the *P. falciparum* PI3K enzyme against both the GSK and St Jude datasets we plotted earlier.

*   Which dataset would you rather use to train a model to predict PI3K inhibition and why?
*   Are there any concerns with your chosen dataset?


In [None]:
# Folder containing the day 1 data files
path = "drive/MyDrive/DataScience_Workshop/data/day1/"
# GSK dataset, its actives, and the PI3K inhibitors
file_list = ["gsk_3d7.csv", "gsk_3d7_actives.csv", "pi3k_actives.csv"]
my_plots3 = plots(path, file_list)

In [None]:
my_plots3.plot_pca()

In [None]:
my_plots3.plot_umap()

In [None]:
# Folder containing the day 1 data files
path = "drive/MyDrive/DataScience_Workshop/data/day1/"
# St Jude dataset, its actives, and the PI3K inhibitors
file_list = ["st_jude_3d7.csv", "st_jude_3d7_actives.csv", "pi3k_actives.csv"]
my_plots4 = plots(path, file_list)

In [None]:
my_plots4.plot_pca()

In [None]:
my_plots4.plot_umap()

#**Data Cleaning** (20 mins)

*Acinetobacter baumannii* is one of the ESKAPE pathogens, a group of bacteria showing a worrying increase in the prevalence of antibiotic resistance. ChEMBL contains a number of compounds from different sources that have been screened for *A. baumannii* inhibition.

Open the "ChEMBL_A_Baumannii.csv" data file in your Google Drive for ideas for the next section (you're not limited to what you find there).

In your groups, brainstorm the types of inconsistencies you might need to address in drug discovery data in order to clean and standardise a dataset for machine learning. You'll present these points during the feedback session.

Remember: We want our dataset to be complete, consistent, accurate and reliable, and it should follow one common format.
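
To make the idea of "one common format" concrete, here is a minimal sketch of a few cleaning steps: trimming stray whitespace, dropping records with missing or unconvertible measurements, converting activities to a single unit, and merging duplicate measurements. The records, the column names (`smiles`, `value`, `units`) and the `clean` function are all hypothetical and are not taken from "ChEMBL_A_Baumannii.csv". Taking the median of duplicates is just one possible choice.

```python
from statistics import median

# Hypothetical records; column names and values are invented for illustration.
raw_records = [
    {"smiles": "CCO", "value": "10", "units": "uM"},
    {"smiles": " CCO ", "value": "12000", "units": "nM"},  # duplicate with stray whitespace
    {"smiles": "c1ccccc1", "value": "", "units": "nM"},    # missing measurement
    {"smiles": "CCN", "value": "0.5", "units": "mM"},
    {"smiles": "CCC", "value": "3", "units": "ug/mL"},     # cannot convert without a molecular weight
]

# Conversion factors from each molar unit to nanomolar
TO_NANOMOLAR = {"nM": 1.0, "uM": 1e3, "mM": 1e6}


def clean(records):
    """Return {smiles: median activity in nM}, skipping unusable records."""
    grouped = {}
    for rec in records:
        smiles = rec["smiles"].strip()
        value = rec["value"].strip()
        factor = TO_NANOMOLAR.get(rec["units"].strip())
        # Skip records with no structure, no value, or a unit we cannot convert
        if not smiles or not value or factor is None:
            continue
        grouped.setdefault(smiles, []).append(float(value) * factor)
    # Merge repeated measurements of the same structure with the median
    return {smiles: median(values) for smiles, values in grouped.items()}


cleaned = clean(raw_records)
print(cleaned)
```

This sketch matches duplicates by comparing SMILES strings exactly, which would miss the same molecule written in two different ways. A real pipeline would first standardise the structures with a cheminformatics toolkit.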
