# Introduction

Let's investigate how we can use a LangChain SQL agent to explore the ChEMBL database.

## What is ChEMBL?

Excerpt from the ChEMBL site (https://www.ebi.ac.uk/chembl/): ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.



# Package installation and imports


In [1]:
!pip install -q -U langchain
!pip install -q -U openai

In [2]:
# Agent factory, toolkit, and database wrapper
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
# Agent type enum and chat model wrapper
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI
# Imported but not used in the cells below
from langchain.llms.openai import OpenAI
from langchain.agents import AgentExecutor

# Set up the database connection

In [3]:
sqlite_path = "/kaggle/input/chembl-database/chembl_33.db"
sqlite_uri = f"sqlite:///{sqlite_path}"
db = SQLDatabase.from_uri(sqlite_uri)
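
Before handing the database to an agent, it can help to confirm that the file opens and to see which tables it exposes. The sketch below uses Python's built-in `sqlite3` module and roughly mirrors what the agent's `sql_db_list_tables` tool reports. The helper name `list_tables` and the two-table in-memory database are illustrative assumptions, not part of the original notebook.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all user tables in a SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on a throwaway in-memory database (a stand-in for the real file).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE activities (activity_id INTEGER PRIMARY KEY)")
demo.execute("CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY)")
demo_tables = list_tables(demo)
print(demo_tables)

# Against the real file, one would connect with the notebook's path instead:
# conn = sqlite3.connect(sqlite_path)
```

Running the same helper on a connection to `sqlite_path` should list the ChEMBL tables that the agent will later choose from.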

# Configure ChatOpenAI   

We initialize the LLM that will power the SQL agent. For this experiment, we use ChatGPT (`gpt-3.5-turbo`).

In [4]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
llm = ChatOpenAI(temperature=0, 
                 openai_api_key=user_secrets.get_secret("OPENAI_API_KEY"), 
                 model_name='gpt-3.5-turbo')
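
The cell above reads the key through `kaggle_secrets`, which is only available inside a Kaggle session. For running elsewhere, one possible fallback is an environment variable. The helper below is a minimal sketch under that assumption; the name `get_openai_api_key` is hypothetical and not part of the original notebook.

```python
import os


def get_openai_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fetch the key from Kaggle secrets if available, else from the environment."""
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except ImportError:
        pass
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


# Possible usage with the notebook's own constructor call:
# llm = ChatOpenAI(temperature=0,
#                  openai_api_key=get_openai_api_key(),
#                  model_name='gpt-3.5-turbo')
```

Failing early with a clear message avoids a less obvious authentication error on the first agent call.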

# Prepare the LangChain SQL agent

The agent executor is created with LangChain's `create_sql_agent`, which takes the LLM, the toolkit, and the agent type. To monitor the entire process, we set the `verbose` parameter to `True`.

In [5]:
toolkit = SQLDatabaseToolkit(db=db, 
                             llm=llm)

In [6]:
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
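
With `verbose=True`, a ReAct-style agent prints each step as an `Action` line naming a tool, an `Action Input` line with its argument, and then an `Observation` holding the tool's result. The toy sketch below illustrates that dispatch idea with plain Python and `sqlite3`. It is not LangChain's implementation: `parse_action`, `TOOLS`, and the two stand-in tool functions are hypothetical, and only the tool names are taken from the agent's trace.

```python
import re
import sqlite3

# Matches one "Action: <tool>" line followed by its "Action Input: <argument>" line.
ACTION_RE = re.compile(r"Action:\s*(?P<tool>\S+)\s*\nAction Input:\s*(?P<arg>.*)")


def parse_action(llm_text: str) -> tuple[str, str]:
    """Extract the tool name and its input from one ReAct-style step."""
    match = ACTION_RE.search(llm_text)
    if match is None:
        raise ValueError("no Action / Action Input pair found")
    return match.group("tool"), match.group("arg").strip().strip('"')


# Throwaway in-memory database standing in for ChEMBL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activities (activity_id INTEGER PRIMARY KEY)")


def list_tables(_arg: str) -> str:
    """Stand-in tool: comma-separated table names."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return ", ".join(name for (name,) in rows)


def table_schema(table: str) -> str:
    """Stand-in tool: the CREATE TABLE statement of one table."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    return row[0] if row else f"Error: table {table!r} not found"


# Hypothetical stand-ins, keyed by the tool names that appear in the agent's trace.
TOOLS = {"sql_db_list_tables": list_tables, "sql_db_schema": table_schema}

step = 'Action: sql_db_list_tables\nAction Input: ""'
tool, arg = parse_action(step)
observation = TOOLS[tool](arg)
print(f"Observation: {observation}")
```

In the real agent, the observation is fed back to the LLM, which then decides on the next action or a final answer.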

# Test the agent  


Now we will test the agent with a few simple tasks.

In [7]:
question = "How many activities are?"
agent_executor.run(question)



> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Thought: There are many tables in the database. I should query the schema of the "activities" table to see what columns it has.
Action: sql_db_schema
Action Input: "activities"
Observation:
CREATE TABLE activities (
	activity_id BIGINT NOT NULL, 
	assay_id BIGINT NOT NULL, 
	doc_id BIGINT, 
	record_id BIGINT NOT NULL, 
	molregno BIGINT, 
	standard_relation VARCHAR(50), 
	standard_value NUMERIC, 
	standard_units VARCHAR(100), 
	standard_flag SMALLINT, 
	standard_type VARCHAR(250), 
	activity_comment VARCHAR(4000), 
	data_validity_comment VARCHAR(30), 
	potential_duplicate SMALLINT, 
	pchembl_value NUMERIC(4, 2), 
	bao_endpoint VARCHAR(11), 
	uo_units VARCHAR(10), 
	qudt_units VARCHAR(70), 
	toid INTEGER, 
	upper_value NUMERIC, 
	standard_upper_value NUMERIC, 
	src_id INTEGER, 
	type VARCHAR(250) NOT NULL, 
	relation VARCHAR(50), 
	value NUMERIC, 
	un

'There are 20,334,684 activities.'
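
A figure reported by the agent can be cross-checked by running the count directly against the database. The sketch below shows one way to do that with `sqlite3`; the helper `count_rows` and the three-row demo table are illustrative assumptions, not part of the original notebook.

```python
import sqlite3


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in `table`, after checking that the table exists."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        raise ValueError(f"unknown table: {table!r}")
    # Table names cannot be bound as parameters, so quote the validated identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
    return count


# Demo on a throwaway in-memory table (a stand-in for the real database).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE activities (activity_id INTEGER PRIMARY KEY)")
demo.executemany("INSERT INTO activities VALUES (?)", [(1,), (2,), (3,)])
demo_count = count_rows(demo, "activities")
print(f"{demo_count:,}")

# Against the real file: count_rows(sqlite3.connect(sqlite_path), "activities")
```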

In [8]:
question = "How many SMILES codes are?"
agent_executor.run(question)



> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Thought: I can query the "molecule_dictionary" table to get the SMILES codes.
Action: sql_db_schema
Action Input: "molecule_dictionary"
Observation:
CREATE TABLE molecule_dictionary (
	molregno BIGINT NOT NULL, 
	pref_name VARCHAR(255), 
	chembl_id VARCHAR(20) NOT NULL, 
	max_phase NUMERIC(2, 1), 
	therapeutic_flag SMALLINT NOT NULL, 
	dosed_ingredient SMALLINT NOT NULL, 
	structure_type VARCHAR(10) NOT NULL, 
	chebi_par_id BIGINT, 
	molecule_type VARCHAR(30), 
	first_approval INTEGER, 
	oral SMALLINT NOT NULL, 
	parenteral SMALLINT NOT NULL, 
	topical SMALLINT NOT NULL, 
	first_in_class SMALLINT NOT NULL, 
	chirality SMALLINT NOT NULL, 
	prodrug SMALLINT NOT NULL, 
	inorganic_flag SMALLINT NOT NULL, 
	usan_year INTEGER, 
	availability_type SMALLINT, 
	usan_stem VARCHAR(50), 
	polymer_flag SMALLINT, 
	usan_substem VARCHAR(50), 
	usan_stem_definition

'2,399,743'
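
None of the `molecule_dictionary` columns visible in the truncated schema above holds SMILES strings, so the reported figure may be a molecule count rather than a SMILES count. In ChEMBL's published schema, SMILES are stored in the `canonical_smiles` column of the `compound_structures` table; that table is not shown in this notebook, so treat it as an assumption here. The sketch below uses made-up demo rows to show how the two counts can differ.

```python
import sqlite3

# Throwaway in-memory tables; the rows are invented for illustration only.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY);
    CREATE TABLE compound_structures (
        molregno INTEGER PRIMARY KEY,
        canonical_smiles TEXT
    );
    INSERT INTO molecule_dictionary VALUES (1), (2), (3);
    INSERT INTO compound_structures VALUES (1, 'CCO'), (2, NULL);
    """
)

(n_molecules,) = demo.execute("SELECT COUNT(*) FROM molecule_dictionary").fetchone()
# COUNT(column) skips NULLs, so only rows that carry a SMILES string are counted.
(n_smiles,) = demo.execute(
    "SELECT COUNT(canonical_smiles) FROM compound_structures"
).fetchone()
print(n_molecules, n_smiles)
```

Asking the agent explicitly for the number of non-null `canonical_smiles` values might steer it toward the intended table.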

In [9]:
question = "List 3 molecules actives for Parkinson"
agent_executor.run(question)



> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Thought: There are many tables in the database, but the most relevant ones for this question seem to be "activities" and "molecule_dictionary". I should query the schema of these tables to see what columns I can use to find the active molecules for Parkinson.

Action: sql_db_schema
Action Input: "activities, molecule_dictionary"
Observation:
CREATE TABLE activities (
	activity_id BIGINT NOT NULL, 
	assay_id BIGINT NOT NULL, 
	doc_id BIGINT, 
	record_id BIGINT NOT NULL, 
	molregno BIGINT, 
	standard_relation VARCHAR(50), 
	standard_value NUMERIC, 
	standard_units VARCHAR(100), 
	standard_flag SMALLINT, 
	standard_type VARCHAR(250), 
	activity_comment VARCHAR(4000), 
	data_validity_comment VARCHAR(30), 
	potential_duplicate SMALLINT, 
	pchembl_value NUMERIC(4, 2), 
	bao_endpoint VARCHAR(11), 
	uo_units VARCHAR(10), 
	qudt_units VARCHAR(70), 
	toid INT

'The 3 molecules that are active for Parkinson are ROPINIROLE with standard values of 7.2, 19, and 1706.'
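
Note that the answer above names a single molecule, ROPINIROLE, with three values rather than three distinct molecules. One way to cross-check such an answer is a direct query that joins `drug_indication` (its `molregno` and `mesh_heading` columns appear in a schema printed later in this notebook) with `molecule_dictionary` and applies `DISTINCT`. The sketch below assumes that a recorded indication is an acceptable proxy for "active for". The helper `molecules_for_indication` and all demo rows are hypothetical, and only the names ROPINIROLE and "Parkinson Disease" come from the notebook.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, pref_name TEXT);
    CREATE TABLE drug_indication (
        drugind_id INTEGER PRIMARY KEY,
        molregno INTEGER,
        mesh_heading TEXT NOT NULL
    );
    -- Made-up demo rows for illustration only.
    INSERT INTO molecule_dictionary VALUES
        (1, 'ROPINIROLE'), (2, 'DEMO_MOLECULE_A'), (3, 'DEMO_MOLECULE_B');
    INSERT INTO drug_indication VALUES
        (10, 1, 'Parkinson Disease'),
        (11, 1, 'Restless Legs Syndrome'),
        (12, 2, 'Parkinson Disease'),
        (13, 3, 'Fibromyalgia');
    """
)


def molecules_for_indication(conn, term, limit=3):
    """Return up to `limit` distinct molecule names whose indication matches `term`."""
    rows = conn.execute(
        """
        SELECT DISTINCT md.pref_name
        FROM drug_indication AS di
        JOIN molecule_dictionary AS md ON md.molregno = di.molregno
        WHERE di.mesh_heading LIKE ? AND md.pref_name IS NOT NULL
        ORDER BY md.pref_name
        LIMIT ?
        """,
        (f"%{term}%", limit),
    ).fetchall()
    return [name for (name,) in rows]


demo_hits = molecules_for_indication(demo, "Parkinson")
print(demo_hits)
```

The `DISTINCT` keyword is what prevents the same molecule from filling several of the requested slots.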

In [10]:
question = "For what disease is active ROPINIROLE?"
agent_executor.run(question)



> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Thought: There are many tables in the database. I should query the schema of the most relevant tables to find the information I need.
Action: sql_db_schema
Action Input: "drug_indication, molecule_dictionary"
Observation:
CREATE TABLE drug_indication (
	drugind_id BIGINT NOT NULL, 
	record_id BIGINT NOT NULL, 
	molregno BIGINT, 
	max_phase_for_ind NUMERIC(2, 1), 
	mesh_id VARCHAR(20) NOT NULL, 
	mesh_heading VARCHAR(200) NOT NULL, 
	efo_id VARCHAR(20), 
	efo_term VARCHAR(200), 
	CONSTRAINT drugind_pk PRIMARY KEY (drugind_id), 
	CONSTRAINT drugind_molregno_fk FOREIGN KEY(molregno) REFERENCES molecule_dictionary (molregno) ON DELETE CASCADE, 
	CONSTRAINT drugind_rec_fk FOREIGN KEY(record_id) REFERENCES compound_records (record_id) ON DELETE CASCADE, 
	CONSTRAINT drugind_uk UNIQUE (record_id, mesh_id, efo_id)
)

/*
3 rows from drug_indication table:
dr

'Restless Legs Syndrome, Stroke, Cocaine-Related Disorders, Fibromyalgia, Hyperprolactinemia, Parkinson Disease, and Unverricht-Lundborg Syndrome.'