#**Python Primer and Google Colab**



### A note on Notebooks
Notebooks are organized as a set of cells. Some cells contain code, and all the code in a cell is run together; others contain plain text (such as this one).

Plain text can also be added within a code cell: this is called a 'comment' and starts with a hash (#). Comments are not run as code, and they help others follow the logic of your code.

In [None]:
# <--This '#' symbol means that this line is a 'comment'. Comments do not run or affect the script.

In [None]:
# print("Hello There")  <- this is not processed
print("Hello World")  # <- but this is

In [None]:
4+3

## Variables
A variable stores a data value under a name. Each name must be unique: assigning to a name that already exists replaces its previous value.

Variables are assigned by the '=' operator and can be reassigned.

In [None]:
my_variable = 1234
print("The value:", my_variable)

In [None]:
my_variable = 24 * 60 * 60
print("Seconds in a day:", my_variable)

## Data types

All data in Python has an associated 'type'. One very useful data type is the 'list':

### Lists
A list is an ordered collection of items, which can be of different data types. Each item in a list is identified by its position. Keep in mind that positions start at 0 in Python, not 1.
A list is defined with square brackets, for example [1, 2, 2, 3].

In [None]:
my_list = [2, 4, 6, 8]
print("List at first position (element 0):", my_list[0])

In [None]:
my_list[2] = 100
print(my_list)
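
As a small illustration of the points above, the sketch below builds a list holding several data types, checks types with the built-in `type()`, and reads and extends a list by position. The names `mixed_list` and `numbers` are made up for this example.

```python
# A list can mix data types: an integer, a decimal number, text and a boolean
mixed_list = [7, 2.5, "aspirin", True]

print(type(mixed_list))     # the type of the list itself
print(type(mixed_list[2]))  # the type of the item at position 2

numbers = [2, 4, 6, 8]
first = numbers[0]   # positions start at 0
last = numbers[-1]   # negative positions count from the end of the list
numbers.append(10)   # add a new item to the end of the list

print(first, last, len(numbers))
```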

## Functions
Functions allow us to reuse a block of code without having to rewrite it.

A function is called using its name, followed by parentheses.

Python has a number of functions built-in.

In [None]:
my_list = [3, 5, 2, 1, 4]

print("List length:", len(my_list))
print("Sorted list:", sorted(my_list))
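
Besides the built-in functions, we can define our own with `def`. The sketch below is only an illustration: `seconds_in_days` is a hypothetical name and is not part of the course materials.

```python
def seconds_in_days(days):
    """Return the number of seconds in the given number of days."""
    return days * 24 * 60 * 60


# The same block of code is reused with different inputs
print("Seconds in a day:", seconds_in_days(1))
print("Seconds in a week:", seconds_in_days(7))
```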

We can use functions and variables written by others by importing them into the current workspace. This lets us build more sophisticated programs without having to code everything from scratch.

In [None]:
from math import pi

print(round(pi, 2))

In [None]:
import random

print(random.randint(1, 10))
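
The two cells above show two import styles: pulling a single name out of a module, and importing a whole module. A third style, sketched below, gives the module a shorter alias. The alias `st` is an arbitrary choice for this example.

```python
import math              # whole module: names are used as math.<name>
import statistics as st  # whole module under a shorter alias

root = math.sqrt(16)
average = st.mean([3, 5, 2, 1, 4])

print("Square root of 16:", root)
print("Mean of the list:", average)
```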

Note: After a Google Colab runtime is closed, all variables will lose their values and you'll need to re-run the relevant cells.



#Connecting to Google Drive

Let's connect this Colab notebook to our Google Drive so that we can save files when we need to.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
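
One way to confirm that the mount worked is to check that the Drive folder now exists. This is a minimal sketch; it assumes the mount point used above, so that the Drive contents appear under `/content/drive/MyDrive`.

```python
import os

# Assumption: Drive was mounted at /content/drive, as in the cell above
drive_root = "/content/drive/MyDrive"

drive_is_mounted = os.path.isdir(drive_root)
if drive_is_mounted:
    print("Google Drive is available at", drive_root)
else:
    print("Google Drive not found; run the drive.mount() cell first")
```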

Google Colab runs in a Linux environment in the cloud.

To run an instruction as a Linux command instead of a Python statement, we start the line with a '!'.

Let's make a new folder in our Google Drive.

In [None]:
!mkdir /content/drive/MyDrive/DataScience_Workshop
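
For comparison, a folder can also be created from Python instead of a Linux command. The sketch below uses the standard-library `os.makedirs`. So that it can be run anywhere, it works inside a temporary directory: `base_dir` is a placeholder standing in for `/content/drive/MyDrive`.

```python
import os
import tempfile

# Placeholder for /content/drive/MyDrive so the example runs anywhere
base_dir = tempfile.mkdtemp()

folder = os.path.join(base_dir, "DataScience_Workshop")
os.makedirs(folder, exist_ok=True)  # create the folder
os.makedirs(folder, exist_ok=True)  # repeating the call does not raise an error

print("Folder exists:", os.path.isdir(folder))
```

With `exist_ok=True`, re-running the cell is harmless even when the folder is already there.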

Now we'll copy some additional course materials to this folder.

In [None]:
!git clone https://github.com/ersilia-os/event-fund-ai-drug-discovery.git /content/drive/MyDrive/DataScience_Workshop

Lastly, let's check that we can read the test file in our new folder.

In [None]:
with open("drive/MyDrive/DataScience_Workshop/data/day1/workshop_test_file.txt", "r") as f:
  print(f.read())

#**Intro to ChEMBL**

ChEMBL is an online database of bioactivity data. 

We can load a pre-downloaded malaria screening dataset with the popular 'pandas' library. Note: ChEMBL data files have fields separated by a semi-colon ';'.

In [None]:
import pandas as pd

dataframe = pd.read_csv("drive/MyDrive/DataScience_Workshop/data/day1/gsk_3d7.csv", sep=";")
dataframe
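
To see what the `sep=";"` argument does, here is a minimal sketch that reads a few lines of made-up, semicolon-separated text. The compound names and values are invented for illustration and do not come from the dataset.

```python
import io

import pandas as pd

# Made-up data laid out like a semicolon-separated file
text = "Compound;Standard Value\ncompound_a;50\ncompound_b;250\ncompound_c;1200\n"

# io.StringIO lets pandas read the text as if it were a file
example = pd.read_csv(io.StringIO(text), sep=";")

print(example.shape)             # (number of rows, number of columns)
print(example.columns.tolist())  # the column names
print(example.head())            # the first rows of the table
```

Without `sep=";"`, pandas would look for commas and read each line as a single column.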

Now let's get the subset of compounds with an activity below 100 nM and save it separately.

In [None]:
actives = dataframe.loc[dataframe['Standard Value'] < 100]
actives.to_csv("drive/MyDrive/DataScience_Workshop/data/day1/gsk_3d7_actives.csv")
actives
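
The filter above works by building one True/False value per row and keeping the rows marked True. The sketch below shows this on made-up data. It also checks the units, since a comparison against 100 only means "below 100 nM" if the values are in nM; the 'Standard Units' column name is an assumption about the export and should be checked against your file.

```python
import pandas as pd

# Made-up data; 'Standard Units' is an assumed column name
toy = pd.DataFrame({
    "Compound": ["a", "b", "c", "d"],
    "Standard Value": [50.0, 250.0, 80.0, 0.5],
    "Standard Units": ["nM", "nM", "nM", "uM"],
})

# Each comparison gives one True/False value per row
is_potent = toy["Standard Value"] < 100
is_nanomolar = toy["Standard Units"] == "nM"

# Keep only the rows where both conditions are True
toy_actives = toy.loc[is_potent & is_nanomolar]
print(toy_actives)

# With no file name, to_csv returns the text it would have written;
# index=False leaves the row numbers out
csv_text = toy_actives.to_csv(index=False)
print(csv_text)
```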

#**Visualising Compound Similarity**

Now that we have some data, let's visualise the distribution of compounds in our example library. First we need to install some additional Python packages.

In [None]:
%%capture
!pip install rdkit-pypi
!pip install umap-learn
import sys
sys.path.append("drive/MyDrive/DataScience_Workshop/data/day1/")
from courseFunctions import *

Each of our compounds is represented by hundreds or even thousands of chemical descriptors. We can use 'dimensionality reduction' algorithms to reduce our data to two dimensions for visualisation. Two such methods are 'Principal Component Analysis' (PCA) and 'UMAP'.

In [None]:
path = "drive/MyDrive/DataScience_Workshop/data/day1/"
file_list = ["gsk_3d7.csv", "gsk_3d7_actives.csv"]
my_plots = plots(path, file_list)

In [None]:
my_plots.plot_pca()

In [None]:
my_plots.plot_umap()
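
To make the idea of dimensionality reduction more concrete, here is a minimal sketch of PCA written with NumPy on random, made-up descriptors. It is not necessarily how the course `plots` class computes its figures, whose internals are not shown here; it only illustrates how many descriptors per compound can be reduced to two coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 50 compounds, each described by 200 random 0/1 descriptors
descriptors = rng.integers(0, 2, size=(50, 200)).astype(float)

# Step 1: centre each descriptor (column) on zero
centred = descriptors - descriptors.mean(axis=0)

# Step 2: singular value decomposition; the rows of vt are the principal axes,
# ordered from the most to the least variance explained
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# Step 3: project every compound onto the first two principal axes
coords_2d = centred @ vt[:2].T

print(coords_2d.shape)  # one (x, y) pair per compound
```

Each row of `coords_2d` could then be drawn as one point in a scatter plot.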

###Now it's your turn:

The St Jude Children's Research Hospital (USA) conducted a high-throughput screen for novel anti-malarials and made public the dose-response data for the ~1300 most active compounds. Let's get this Plasmodium dataset from ChEMBL and visualise it along with its actives.


1.   Search in ChEMBL for 'plasmodium falciparum' and select 'Assays'.
2.   Sort by number of compounds, from most to least.
3.   Look for the 'St Jude Malaria Screening' dataset on the first page **(ID: CHEMBL730080)**.
4.   Download the molecules for the assay and unzip the file.
5.   Rename the file to 'st_jude_3d7.csv'.
6.   Drag and drop the file into the 'DataScience_Workshop/data/day1/' folder on Google Drive.
7.   Save a new dataset of molecules with < 100 nM activity by running the first code cell below.
8.   Plot the visualisations of the St Jude data and its 'actives' subset by running the three cells after it.


In [None]:
st_jude_dataframe = pd.read_csv("drive/MyDrive/DataScience_Workshop/data/day1/st_jude_3d7.csv", sep=";")
actives = st_jude_dataframe.loc[st_jude_dataframe['Standard Value'] < 100]
actives.to_csv("drive/MyDrive/DataScience_Workshop/data/day1/st_jude_3d7_actives.csv")

In [None]:
path = "drive/MyDrive/DataScience_Workshop/data/day1/"
file_list = ["st_jude_3d7.csv", "st_jude_3d7_actives.csv"]
my_plots2 = plots(path, file_list)

In [None]:
my_plots2.plot_pca()

In [None]:
my_plots2.plot_umap()