# **Python Primer and Google Colab**



### A note on Notebooks
Notebooks are organized in cells. Each code cell contains a portion of code that is run together and produces an output. Aside from code cells, notebooks also have text cells such as this one.

To add plain text inside a code cell, mark it as a comment with a hash (#). Comments are not run as code, and they make the code much easier to follow.

In [None]:
# <--This '#' symbol means that this line is a 'comment'. Comments do not run or affect the script.

In [None]:
#print("Hello There") <- this is not processed
print("Hello World")  # <- but this is

## Data types

### String
Text is passed as a string (str) in Python, always inside single ('...') or double ("...") quotes.

### Numbers
- Integers (int): whole numbers (positive or negative)
- Floating Point Numbers (float): real numbers with decimal positions

### Booleans
A boolean (bool) value is either True or False.

### Lists
A list is an ordered collection of items, which can be of different data types. Each item in a list is identified by its position. Keep in mind that list positions start at 0 in Python, not 1.
A list is defined with square brackets, for example [1,2,2,3].

In [None]:
# Strings
type("This is a string")

In [None]:
type("42") # <- this is also a string

In [None]:
# Numbers
type(42)

In [None]:
type(42.0)

In [None]:
# Boolean
type(False)

In [None]:
# List
type([2, 4, 6, 8])

In [None]:
my_list = [2, 4, 6, 8]
print("List at first position (element 0):", my_list[0])
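
As a small illustrative sketch (the list contents are made up for this example), a list can mix data types, and positions are counted from 0:

```python
# A made-up list mixing a string, an integer, a float and a boolean
mixed_list = ["aspirin", 42, 3.14, True]

first_item = mixed_list[0]   # position 0 is the first item
last_item = mixed_list[3]    # position 3 is the fourth (last) item here
item_types = [type(item).__name__ for item in mixed_list]

print("First item:", first_item)
print("Last item:", last_item)
print("Types:", item_types)
```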

## Variables
Before continuing, we need to understand the concept of a "variable". A variable is a name that stores a data value, and each name must be UNIQUE.

Variables are assigned with the = operator and can be reassigned at any time.

In [None]:
my_variable = 24 * 60 * 60
print("Seconds in a day:", my_variable)

In [None]:
my_variable = 1234
print("The new value:", my_variable)

In [None]:
a = input("Type anything you want: ")
print("The reply was:", a)
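
Note that `input()` always returns a string, even when digits are typed. The sketch below uses a fixed string as a stand-in for a typed reply, so that it runs without waiting for input; the variable names are only illustrative.

```python
# Stand-in for what input() would return if the user typed 42
reply = "42"

print(type(reply))  # the reply is text, not a number

# Convert the text to a number before doing arithmetic with it
reply_as_int = int(reply)
doubled = reply_as_int * 2
print("Doubled:", doubled)
```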

## Conditionals

If/elif/else statements create branches in code. The code indented beneath an `if` or `elif` runs only if its condition is true; the code beneath `else` runs when none of the conditions are true.

In [None]:
num = 33
answer = "I don't know yet"
print(answer)

if num > 20:
  answer = "Bigger"
else:
  answer = "Smaller"
print(answer)
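
The cell above only uses `if` and `else`. As a minimal sketch of how `elif` adds a third branch (the value 20 and the "Equal" label are chosen just for illustration):

```python
num = 20

# Conditions are checked from top to bottom; only the first true branch runs
if num > 20:
    answer = "Bigger"
elif num == 20:
    answer = "Equal"
else:
    answer = "Smaller"

print(answer)
```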

## Loops

Loops allow us to repeat blocks of code. They come in two varieties: 'for' and 'while'.

In [None]:
my_list = [2, 4, 6, 8]

# For loops run once for each item in a sequence, such as a list
for value in my_list:
  print("Value:", value, "Squared:", value*value)

In [None]:
# While loops repeat as long as the condition is true.
# Watch out for INFINITE LOOPS: if the condition never becomes false, the loop won't stop

x = 1
while x < 10:
    print("run number " + str(x))
    x = x + 1
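
When a block simply needs to be repeated a fixed number of times, a `for` loop is often combined with the built-in `range()`. A short sketch follows; the `squares` list is just an illustrative name.

```python
# range(5) yields the numbers 0, 1, 2, 3, 4
squares = []
for i in range(5):
    squares.append(i * i)

print("Squares:", squares)
```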

## Functions
Functions let us call a block of code by name without having to rewrite it each time.

Use the `def` keyword to define a new function.

In [None]:
def name_funct(name):
    print("My name is " + name)

name_funct("Jason")
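
The function above prints its result. A function can also hand a value back to the caller with `return`, so that it can be stored in a variable. A minimal sketch, where `seconds_in_days` is a hypothetical function written for this example:

```python
def seconds_in_days(days):
    """Return the number of seconds in the given number of days."""
    return days * 24 * 60 * 60

one_day = seconds_in_days(1)
one_week = seconds_in_days(7)
print("Seconds in a day:", one_day)
print("Seconds in a week:", one_week)
```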

We can also use functions and variables written by others.

In [None]:
from math import pi

print(round(pi, 2))

In [None]:
my_list = [3, 5, 2, 1, 4]

print("List length:", len(my_list))
print("Sorted list:", sorted(my_list))

Note: After a Google Colab runtime is closed, all variables will lose their values and you'll need to re-run the relevant cells.



# Connecting to Google Drive

Let's connect this Colab notebook to our Google Drive so that we can save files when we need to.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Google Colab runs in a Linux environment in the cloud.

To run a line as a Linux command instead of a Python statement, start the line with a '!'.

Let's make a new folder in our Google Drive.

In [None]:
!mkdir /content/drive/MyDrive/DataScience_Workshop
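
Note that `mkdir` reports an error if the folder already exists, for example when the cell is re-run. One possible alternative is Python's standard `os.makedirs`, sketched below. The example uses a temporary directory as a stand-in for Google Drive so that it runs anywhere; only the folder name is taken from the cell above.

```python
import os
import tempfile

# A temporary directory stands in for /content/drive/MyDrive in this sketch
base_dir = tempfile.mkdtemp()
workshop_dir = os.path.join(base_dir, "DataScience_Workshop")

# exist_ok=True makes the call safe to repeat: no error if the folder exists
os.makedirs(workshop_dir, exist_ok=True)
os.makedirs(workshop_dir, exist_ok=True)

print("Folder exists:", os.path.isdir(workshop_dir))
```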

Now we'll copy some additional course materials to this folder.

In [None]:
!git clone https://github.com/ersilia-os/event-fund-ai-drug-discovery.git /content/drive/MyDrive/DataScience_Workshop

Lastly, let's check that we can read the test file in our new folder.

In [None]:
with open("drive/MyDrive/DataScience_Workshop/data/Day1/workshop_test_file.txt", "r") as f:
  print(f.read())

# **Intro to ChEMBL**

ChEMBL data files have fields separated by a ';'. Let's load one into a pandas dataframe and save a new file that will be comma-separated.

In [None]:
import pandas as pd

dataframe = pd.read_csv("drive/MyDrive/DataScience_Workshop/data/Day1/gsk_3d7.csv", sep=";")
dataframe
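
To see what the `sep` argument does without needing the workshop files, here is a small sketch that reads semicolon-separated text from memory and writes it back out comma-separated. The two rows and their IDs are made up for illustration.

```python
import io
import pandas as pd

# Made-up semicolon-separated text standing in for a ChEMBL download
raw_text = "Molecule ChEMBL ID;Standard Value\nTOY1;50\nTOY2;500\n"

# sep=";" tells pandas that fields are split on semicolons
toy = pd.read_csv(io.StringIO(raw_text), sep=";")

# to_csv uses commas by default; index=False leaves out the row numbers
comma_text = toy.to_csv(index=False)
print(comma_text)
```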

Now let's get the subset of compounds with < 100 nM activity.

In [None]:
actives = dataframe.loc[dataframe['Standard Value'] < 100]
actives.to_csv("drive/MyDrive/DataScience_Workshop/data/Day1/gsk_3d7_actives.csv")
actives
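
The filter above compares `Standard Value` against 100 without looking at the units, so it only selects "< 100 nM" compounds if every row is reported in nM. A hedged sketch of a stricter filter follows, using a small made-up table. The compound IDs and values are invented; only the column names are taken from the ChEMBL export.

```python
import pandas as pd

# Made-up rows for illustration only (not real ChEMBL data)
toy = pd.DataFrame({
    "Molecule ChEMBL ID": ["TOY1", "TOY2", "TOY3", "TOY4"],
    "Standard Value": [50.0, 500.0, 20.0, 80.0],
    "Standard Units": ["nM", "nM", "uM", "nM"],
})

# Keep rows that are reported in nM AND have a value below 100
is_nm = toy["Standard Units"] == "nM"
is_potent = toy["Standard Value"] < 100
toy_actives = toy.loc[is_nm & is_potent]

print(toy_actives)
```

Here TOY3 is excluded even though its value is below 100, because it is reported in a different unit.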

# **Visualising Compound Similarity**

In [None]:
%%capture
!pip install rdkit-pypi
!pip install umap-learn
import sys
sys.path.append("drive/MyDrive/DataScience_Workshop/data/Day1/")
from courseFunctions import *

In [None]:
path = "drive/MyDrive/DataScience_Workshop/data/Day1/"
file_list = ["gsk_3d7.csv", "gsk_3d7_actives.csv"]
my_plots = plots(path, file_list)

In [None]:
my_plots.plot_pca()

In [None]:
my_plots.plot_umap()

### Now it's your turn.

Let's get a new *Plasmodium* dataset from ChEMBL and visualise it and its actives:


1.   Search in ChEMBL for 'plasmodium falciparum' and select 'Assays'.
2.   Sort by most-to-least number of compounds.
3.   Look for the 'St Jude Malaria Screening' dataset on the first page (CHEMBL730080).
4.   Download the molecules for the assay and unzip the file.
5.   Rename the file to 'st_jude_3d7.csv'.
6.   Drag and drop the file into the 'DataScience_Workshop/data/Day1/' folder on Google Drive.
7.   Save a new dataset of molecules with < 100 nM activity by running the code block below.
8.   Plot the visualisations of the St Jude data and 'actives' subset by running the next 3 cells.




In [None]:
st_jude_dataframe = pd.read_csv("drive/MyDrive/DataScience_Workshop/data/Day1/st_jude_3d7.csv", sep=";")
actives = st_jude_dataframe.loc[st_jude_dataframe['Standard Value'] < 100]
actives.to_csv("drive/MyDrive/DataScience_Workshop/data/Day1/st_jude_3d7_actives.csv")

In [None]:
path = "drive/MyDrive/DataScience_Workshop/data/Day1/"
file_list = ["st_jude_3d7.csv", "st_jude_3d7_actives.csv"]
my_plots2 = plots(path, file_list)

In [None]:
my_plots2.plot_pca()

In [None]:
my_plots2.plot_umap()

# **Breakout Session:** (3 sessions of 20-30 mins)
________________________

# **Computational Tools Discussion**



*   Take turns to introduce yourself, your project/field of expertise and any computational skills you have acquired. (5-10 mins)
*   What current computational tools, if any, do you use for your research? (5 mins)
*   What challenges/limitations hinder you from making further use of data science tools? (10 mins)
*   How could these issues be addressed? (5 mins)



# **Chemical Space Discussion**

(10 mins)

Take 10 minutes to discuss:
*   What is meant by chemical space?
*   How is this different to drug-like molecules?
*   To build reliable models, should our training data be similar to what we're aiming to predict?
*   Could this vary depending on the phase of the drug discovery pipeline we're applying these tools to?





(10 mins)

If you've closed the notebook, reconnect to Google Drive with the first cell below and reinstall the plotting libraries with the second cell.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%%capture
!pip install rdkit-pypi
!pip install umap-learn
import sys
sys.path.append("drive/MyDrive/DataScience_Workshop/data/Day1/")
from courseFunctions import *

Then load the images in the next cell.

*   Which of these series fall outside the chemical space?

*   Does this make them more likely or less likely to be predicted?

In [None]:
from IPython.display import Image, display

# Load the saved chemical space plots from Google Drive
img0 = Image("/content/drive/MyDrive/DataScience_Workshop/data/Day1/PCA_chem_space.png")
img1 = Image("/content/drive/MyDrive/DataScience_Workshop/data/Day1/UMAP_chem_space.png")

display(img0, img1)

Now run the next 6 cells to plot a set of inhibitors for the *P. falciparum* PI3K enzyme against both the GSK and St Jude datasets.



*   Which dataset is more suitable for training a model to predict PI3K inhibition and why?



In [None]:
path = "drive/MyDrive/DataScience_Workshop/data/Day1/"
file_list = ["gsk_3d7.csv", "gsk_3d7_actives.csv", "pi3k_actives.csv"]
my_plots3 = plots(path, file_list)

In [None]:
my_plots3.plot_pca()

In [None]:
my_plots3.plot_umap()

In [None]:
path = "drive/MyDrive/DataScience_Workshop/data/Day1/"
file_list = ["st_jude_3d7.csv", "st_jude_3d7_actives.csv", "pi3k_actives.csv"]
my_plots4 = plots(path, file_list)

In [None]:
my_plots4.plot_pca()

In [None]:
my_plots4.plot_umap()

# **Data Cleaning**

Discussion: What inconsistencies in drug discovery data might one need to address in order to standardise a dataset?

Look at the raw ChEMBL_A_Baumannii data for ideas (not limited to this).



*   Multiple data points per SMILES
*   Highly variable data (data discrepancies)
*   Different units
*   Non-standardised SMILES
*   Unnecessary columns
*   Points beyond assay limits of detection
*   Effect of different assay conditions




### Extension exercise for the brave:

Let's see some examples for basic cleaning of a dataset downloaded from ChEMBL.

**Clean a ChEMBL Dataset** (Extension exercise)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
import pandas as pd

PATH = "/content/drive/MyDrive/DataScience_Workshop/data/Day1"
FILENAME = "gsk_3d7"  # base name without the extension, which is added when reading and saving

In [None]:
my_dataframe = pd.read_csv(os.path.join(PATH, FILENAME + ".csv"), sep=";")
my_dataframe

Let's investigate which columns are present and keep only those that are of interest.

In [None]:
my_dataframe.columns

In [None]:
# .copy() makes an independent dataframe so new columns can be added safely later
my_dataframe = my_dataframe[['Molecule ChEMBL ID', 'Smiles', 'Standard Value', 'Standard Units']].copy()
my_dataframe

Now let's double check how many unique values are within each column. 

If there is more than one "standard unit", we'll need to standardise the values.

In [None]:
my_dataframe.nunique()
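
If several units do turn up, one possible way to standardise them is to convert everything to nM using a lookup table of conversion factors. The sketch below uses made-up rows. The unit labels `uM` and `nM`, the `to_nm` dictionary and the `Value_nM` column are assumptions for illustration, so check which labels actually occur in your file first.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Standard Value": [0.5, 250.0, 2.0],
    "Standard Units": ["uM", "nM", "uM"],
})

# Multiplication factors to convert each unit to nM (1 uM = 1000 nM)
to_nm = {"nM": 1.0, "uM": 1000.0}

# map() looks up the factor for each row; units missing from to_nm become NaN
factors = toy["Standard Units"].map(to_nm)
toy["Value_nM"] = toy["Standard Value"] * factors

print(toy)
```

Rows whose unit is not in the lookup table end up with a missing `Value_nM`, which makes them easy to spot and handle separately.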

We can use the rdkit package to standardise the SMILES strings to make duplicate detection easier.

In [None]:
!pip install rdkit
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

In [None]:
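# MolFromSmiles returns None for SMILES it cannot parse (including missing values)
mols = [Chem.MolFromSmiles(str(smi)) for smi in my_dataframe['Smiles'].tolist()]
# Leave unparsable entries as None so they can be dropped in the next step
canonical_smiles = [Chem.MolToSmiles(m) if m is not None else None for m in mols]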
my_dataframe["CAN_Smiles"] = canonical_smiles
my_dataframe

Let's check for multiple entries for the same compound.

Here we delete the duplicates but we could also, for example, take the average of the values.

In [None]:
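# Drop rows whose SMILES could not be parsed
my_dataframe = my_dataframe.dropna(subset=["CAN_Smiles"])
# Keep only the first entry for each compound (drop_duplicates returns a new dataframe)
my_dataframe = my_dataframe.drop_duplicates(subset=["CAN_Smiles"])
my_dataframe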
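
As an alternative to deleting duplicates, the values can be averaged per compound. A minimal sketch with made-up data, assuming the canonical SMILES column is named `CAN_Smiles` as above:

```python
import pandas as pd

# Made-up rows: ethanol ("CCO") appears twice with different values
toy = pd.DataFrame({
    "CAN_Smiles": ["CCO", "CCO", "c1ccccc1"],
    "Standard Value": [10.0, 30.0, 5.0],
})

# Group rows by compound and take the mean value within each group
averaged = toy.groupby("CAN_Smiles", as_index=False)["Standard Value"].mean()

print(averaged)
```

Whether to average, take the median, or discard compounds with highly variable measurements is a judgment call that depends on the assay.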

Lastly, we can save our clean dataset.

In [None]:
my_dataframe.to_csv(os.path.join(PATH, FILENAME + "_processed.csv"), index=False)