# Introduction
This project proposed by [Data Professor](https://www.youtube.com/channel/UCV8e2g4IWQqK71bbzGDEI4Q) aims to build models predicting molecules binding to the Beta-Lactamase protein. For the complete project description, see this [video](https://www.youtube.com/watch?v=_GtEgiWWyK4) on the [Data Professor](https://www.youtube.com/channel/UCV8e2g4IWQqK71bbzGDEI4Q) channel.
The EDA and data pre-processing were performed in the notebook [Beta-Lactamase_001_Data_Wrangling_and_EDA](https://www.kaggle.com/wguesdon/beta-lactamase-001-data-wrangling-and-eda/edit/run/83400192).  
This notebook contains the baseline classification model.

# I Initialization

In [1]:
###################
# I Initialization
###################

#+++++++++++++++
# Load libraries
#+++++++++++++++

## Load libraries
import os
from pathlib import Path # for path in Windows and Unix
import zipfile
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

#+++++++++++++++++++++++++++
# Define the working folders
#+++++++++++++++++++++++++++

# see https://careerkarma.com/blog/python-list-files-in-directory/
# See https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f

project_data_folder = Path('/kaggle/input/beta-lactamase-001-data-wrangling-and-eda/')

#++++++++++++++++++++++++
# Load the processed data
#++++++++++++++++++++++++

file_path = project_data_folder / 'beta-lactamase_filtered_dataset.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_relation,standard_value,standard_units,standard_type,pchembl_value,target_pref_name,bao_label,Name,...,SubFP298,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307
0,CHEMBL1401836,COc1ccc(CCNC(=O)CSCc2ccc(F)cc2)cc1OC,=,79432.8,nM,Potency,4.1,Beta-lactamase AmpC,assay format,CHEMBL1401836,...,0,0,1,1,1,0,0,0,0,1
1,CHEMBL554891,Cl.c1ccc(C2CN3CCSC3=N2)cc1,=,631.0,nM,Potency,6.2,Beta-lactamase AmpC,assay format,CHEMBL554891,...,0,0,1,1,1,0,0,0,0,1
2,CHEMBL1519543,CCOc1ccc(CCNC(=O)Cn2ncn3c(cc4ccccc43)c2=O)cc1OCC,=,631.0,nM,Potency,6.2,Beta-lactamase AmpC,assay format,CHEMBL1519543,...,0,0,1,1,1,0,0,0,0,1
3,CHEMBL1401837,O=C(Nc1ccc2c(c1)OCO2)c1cc(C2CC2)on1,=,5623.4,nM,Potency,5.25,Beta-lactamase AmpC,assay format,CHEMBL1401837,...,0,0,1,1,1,0,0,0,0,1
4,CHEMBL2369239,CCCCCCOc1ccc(N2C(=O)CC(SC(=N)N/N=C(\C)c3cccs3)...,=,63095.7,nM,Potency,4.2,Beta-lactamase AmpC,assay format,CHEMBL2369239,...,0,0,1,1,1,0,0,0,0,1


# II Baseline Model

In [2]:
###################
# II Baseline Model
###################

#+++++++++++++++++++
# Feature enginering
#+++++++++++++++++++

# Drop content no useful for prediction
# canonical_smiles, standard_relation, standard_value, standard_units, 
# standard_type, target_pref_name, bao_label

columns_to_drop = ['canonical_smiles', 'standard_relation', 'standard_value', 'standard_units', 
                   'standard_type', 'target_pref_name', 'bao_label', 'Name', 'molecule_chembl_id']
df1 = df.drop(columns_to_drop, axis = 1)

# Recode pchembl_value	
# pChEMBL values <5 == 'Inactive' pChEMBL values > 6 == 'Active' pChEMBL values 5-6 == 'Intermediate'
# Inactive = 0
# Intermediate = 1
# Active = 2

# pchembl_code
# Use a lambda function

def pchembl_value_encoding(pchembl_value):
    if pchembl_value < 5:
        return 'Inactive'
    elif pchembl_value > 5 and pchembl_value < 6:
        return 'Intermediate'
    else:
        return 'Active'

df1['pchembl_value_code'] = df1.apply(lambda row : pchembl_value_encoding(row['pchembl_value']), axis = 1)

df1 = df1.drop(['pchembl_value'], axis = 1)

#+++++++++++++++++++
# Random Forest model
#+++++++++++++++++++

# see Basic Code Stencil (PRACTICAL) Data Science infinity

# Split data into input and output objects
X = df1.drop(['pchembl_value_code'], axis = 1)
y = df1['pchembl_value_code']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42, stratify=y)

# Instanciate the model object
clf = RandomForestClassifier(random_state=42)

# Train our model
clf.fit(X_train, y_train)

# Assess the model accuracy
# see https://stackoverflow.com/questions/31421413/how-to-compute-precision-recall-accuracy-and-f1-score-for-the-multiclass-case

y_pred = clf.predict(X_test)
print('accuracy score:', accuracy_score(y_test, y_pred))
print('precision score:', precision_score(y_test, y_pred, average="macro"))
print('recall score:', recall_score(y_test, y_pred, average="macro"))

accuracy score: 0.42568390291419755
precision score: 0.3597568770906383
recall score: 0.3550948261574362
