# **Python Primer and Google Colab**



### A note on Notebooks
Notebooks are organized in a set of cells, where each cell contains a portion of code that will be run together. Some cells contain code while others can contain plain text (such as this one). 

Plain text can also be added within a code cell - this is called a 'comment' and is denoted by using a hash (#). Comments will not be run as code and they are really helpful to allow others to follow the logic of your code.

In [None]:
# <--This '#' symbol means that this line is a 'comment'. Comments do not run or affect the script.

In [None]:
# print("Hello There") <- this line is commented out, so it is not processed
print("Hello World")  # <- but this is

## Data types

All data in Python has an associated 'type'.

### String
Text is passed as a string (str) in Python, always inside single ('...') or double ("...") quotes.

### Numbers
- Integers (int): whole numbers (positive or negative)
- Floating-point numbers (float): numbers with a decimal point

### Booleans
A boolean (bool) value is either True or False.

### Lists
A list is an ordered collection of items, which can be of different data types. Each item in a list is identified by its position (its index). Keep in mind that positions start at 0 in Python, not 1.
A list is written inside square brackets, for example [1, 2, 2, 3].

In [None]:
# Strings
type("This is a string")

In [None]:
type("42") # <- this is also a string

In [None]:
# Numbers
type(42)

In [None]:
type(42.0)
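
Since "42" and 42 have different types, it can help to see how one is converted into the other. The cell below is a minimal sketch using Python's built-in int(), float() and str(); the variable names are only illustrative.

```python
text_number = "42"                  # a string that happens to contain digits

as_int = int(text_number)           # convert the string to an integer
as_float = float(text_number)       # convert the string to a floating-point number
back_to_text = str(as_int)          # convert the integer back to a string

print(type(as_int), type(as_float), type(back_to_text))
```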

In [None]:
# Boolean
type(False)

In [None]:
# List
type([2, 4, 6, 8])

In [None]:
my_list = [2, 4, 6, 8]
print("List at first position (element 0):", my_list[0])
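
As a further sketch of how positions work, the cell below builds a list with mixed data types and reads items from both ends. The list contents are made up for illustration.

```python
# A list can mix data types; positions are counted from 0.
mixed_list = ["aspirin", 180.16, True, 42]

first_item = mixed_list[0]     # position 0 is the first item
last_item = mixed_list[-1]     # negative positions count from the end
first_two = mixed_list[0:2]    # a slice: positions 0 and 1 (position 2 is excluded)

print("First item:", first_item)
print("Last item:", last_item)
print("First two items:", first_two)
print("Types:", [type(item).__name__ for item in mixed_list])
```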

## Variables
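A variable stores a data value under a name. Each name must be UNIQUE: a name refers to only one value at a time, so reusing a name replaces the value it held before.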

Variables are assigned by the '=' operator and can be reassigned at any moment.

In [None]:
my_variable = 24 * 60 * 60
print("Seconds in a day:", my_variable)

In [None]:
my_variable = 1234
print("The new value:", my_variable)

In [None]:
a = input("Type anything you want: ")
print("The reply was:", a)

## Conditionals

If/elif/else statements create branches in code. The block indented beneath an if or elif runs only when its condition is true; the block beneath else runs when none of the conditions above it are true.

In [None]:
num = 33
answer = "I don't know yet"
print(answer)

if num > 20:
  answer = "Bigger"
else:
  answer = "Smaller"
print(answer)
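
The cell above uses only if and else. A minimal sketch of how elif adds a third branch might look like this; the "Equal" case is an assumption added for illustration.

```python
num = 20

if num > 20:
    answer = "Bigger"
elif num == 20:        # checked only when the first condition is false
    answer = "Equal"
else:                  # runs when none of the conditions above are true
    answer = "Smaller"

print(answer)
```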

## Loops

Loops allow us to repeat blocks of code. They come in two varieties: 'for' and 'while'.

In [None]:
my_list = [2, 4, 6, 8]

# For loops run once for each item in a sequence
for value in my_list:
  print("Value:", value, "Squared:", value*value)
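
When there is no list to loop over, the built-in range() can supply the counter. The sketch below repeats a block five times and collects the results; the variable names are illustrative.

```python
squares = []

# range(5) produces the numbers 0, 1, 2, 3, 4
for i in range(5):
    squares.append(i * i)

print("Squares:", squares)
```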

In [None]:
# While loops repeat as long as the condition is true.
# Watch out for INFINITE LOOPS: if the condition never becomes false, the loop won't stop.

x = 1
while x < 10:
    print("run number " + str(x))
    x = x + 1
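
One possible way to guard against an infinite loop is a safety counter combined with break. This is only a sketch: max_runs is a hypothetical limit chosen for illustration.

```python
x = 1
runs = 0
max_runs = 100  # hypothetical safety limit

while x < 10:
    x = x * 2          # x changes on every pass, so the condition eventually turns false
    runs = runs + 1
    if runs >= max_runs:
        break          # leave the loop even if the condition is still true

print("Final x:", x, "after", runs, "runs")
```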

## Functions
Functions let us call a block of code by name without having to rewrite it.

Use the def keyword to define a new function.

In [None]:
def name_funct(name):
    print("My name is " + name)

name_funct("Jason")

In [None]:
name_funct("Peter")
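
The function above only prints. A function can also hand a value back with return, so the result can be stored in a variable. A minimal sketch, with a hypothetical function name:

```python
def seconds_in_days(days):
    # Return the number of seconds in the given number of days
    return days * 24 * 60 * 60

one_week = seconds_in_days(7)
print("Seconds in a week:", one_week)
```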

A number of functions are already built into Python.

In [None]:
my_list = [3, 5, 2, 1, 4]

print("List length:", len(my_list))
print("Sorted list:", sorted(my_list))

We can also use functions and variables written by others by importing them into the current workspace.

In [None]:
from math import pi

print(round(pi, 2))

In [None]:
import random

print(random.randint(1, 10))

Note: After a Google Colab runtime is closed, all variables will lose their values and you'll need to re-run the relevant cells.



# **Data Cleaning**

### Extension exercise (for the brave):

Let's look at some examples of basic cleaning of a dataset downloaded from ChEMBL.

**Clean a ChEMBL Dataset** (Extension exercise)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
import pandas as pd

PATH = "/content/drive/MyDrive/DataScience_Workshop/data/day1"
FILENAME = "gsk_3d7.csv"

In [None]:
my_dataframe = pd.read_csv(os.path.join(PATH, FILENAME), sep=";")
my_dataframe

Let's investigate which columns are present and keep only those that are of interest.

In [None]:
my_dataframe.columns

In [None]:
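# .copy() gives an independent DataFrame, so adding columns later does not raise a SettingWithCopyWarning
my_dataframe = my_dataframe[['Molecule ChEMBL ID', 'Smiles', 'Standard Value', 'Standard Units']].copy()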
my_dataframe

Now let's double check how many unique values are within each column. 

If there is more than one "standard unit", we'll need to standardise the values.

In [None]:
my_dataframe.nunique()
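
If more than one unit does turn up, the values could be converted to a common unit along these lines. This is a sketch on made-up toy data, not the real ChEMBL file: the TO_NM table and the Value_nM column are assumptions for illustration, and any unit missing from the table would come out as NaN.

```python
import pandas as pd

# Toy data standing in for the ChEMBL export (values are made up)
toy = pd.DataFrame({
    "Standard Value": [250.0, 0.5, 12.0],
    "Standard Units": ["nM", "uM", "nM"],
})

# Hypothetical conversion factors from each unit to nM
TO_NM = {"nM": 1.0, "uM": 1000.0, "mM": 1000000.0}

# Look up the factor for each row, then multiply to get every value in nM
factors = toy["Standard Units"].map(TO_NM)
toy["Value_nM"] = toy["Standard Value"] * factors

print(toy)
```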

We can use the rdkit package to standardise the SMILES strings to make duplicate detection easier.

In [None]:
!pip install rdkit
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

In [None]:
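# Parse each SMILES string; RDKit returns None when a string cannot be parsed (e.g. a missing value)
mols = [Chem.MolFromSmiles(str(smi)) for smi in my_dataframe['Smiles'].tolist()]
# Keep None for unparsable entries so that they can be dropped as missing data later
canonical_smiles = [Chem.MolToSmiles(m) if m is not None else None for m in mols]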
my_dataframe["CAN_Smiles"] = canonical_smiles
my_dataframe

Let's remove rows that are missing data - either because they have no experimental data or because a canonical SMILES could not be generated.

In [None]:
import numpy as np

#First we change all empty strings to NaN (not a number) so that they can be detected by the next command
my_dataframe.replace('', np.nan, inplace=True)

#Drop all rows with any empty/Null/NaN values 
my_dataframe.dropna(inplace=True)

Let's check for multiple entries for the same compound.

Here we delete the duplicates but we could also, for example, take the average of the values.

In [None]:
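my_dataframe.dropna(subset=["CAN_Smiles"],inplace=True)
# drop_duplicates returns a new DataFrame, so the result must be assigned back.
# Rows with the same canonical SMILES are treated as the same compound; the first entry is kept.
my_dataframe = my_dataframe.drop_duplicates(subset=["CAN_Smiles"])
my_dataframe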
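
The alternative mentioned above, averaging the values of duplicate entries, could be sketched with a pandas groupby. The toy data below is made up for illustration, and the sketch assumes all values are already in the same unit.

```python
import pandas as pd

# Toy data: the first two rows describe the same compound (values are made up)
toy = pd.DataFrame({
    "CAN_Smiles": ["CCO", "CCO", "c1ccccc1"],
    "Standard Value": [10.0, 30.0, 5.0],
})

# Group rows by canonical SMILES and take the mean value for each compound
averaged = toy.groupby("CAN_Smiles", as_index=False)["Standard Value"].mean()

print(averaged)
```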

Lastly, we can save our clean dataset.

In [None]:
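# Strip the ".csv" extension before adding the suffix, so the file is saved as "gsk_3d7_processed.csv"
output_name = os.path.splitext(FILENAME)[0] + "_processed.csv"
my_dataframe.to_csv(os.path.join(PATH, output_name), index=False)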