# Curve data actives

Create a summary table of distinct active compounds for each target.

In [1]:
# ChEMBL connection...

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(open('database.txt').read().strip())

## Active compounds

Create a table of distinct active compounds for each target.


* Compounds are defined in terms of USMILES.
* Targets are defined as symbol/species pairs.
* Filtering on size (heavy atom count) is performed at this stage.
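
The three rules above can be sketched in plain Python to make the grouping logic concrete. This is only an illustration under stated assumptions, not the SQL that builds the table: the function name `summarise_actives`, the input row layout, the size limits `MIN_HEAVY_ATOMS` and `MAX_HEAVY_ATOMS`, and all sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical size limits, chosen only for this sketch;
# the real thresholds are defined in the SQL file.
MIN_HEAVY_ATOMS = 10
MAX_HEAVY_ATOMS = 50


def summarise_actives(rows, min_heavy=MIN_HEAVY_ATOMS, max_heavy=MAX_HEAVY_ATOMS):
    """Collect distinct compound IDs per (symbol, species, usmiles) key.

    Each input row is a (symbol, species, cmpd, usmiles, heavy_atoms) tuple.
    Rows outside the heavy atom range are dropped before grouping.
    """
    groups = defaultdict(set)
    for symbol, species, cmpd, usmiles, heavy_atoms in rows:
        if min_heavy <= heavy_atoms <= max_heavy:
            groups[(symbol, species, usmiles)].add(cmpd)
    return [
        {
            "symbol": symbol,
            "species": species,
            "usmiles": usmiles,
            "cmpds": ",".join(sorted(ids)),  # sorted, de-duplicated IDs
            "count": len(ids),
        }
        for (symbol, species, usmiles), ids in sorted(groups.items())
    ]


# Made-up rows: placeholder identifiers, not real ChEMBL data.
rows = [
    ("GENE1", "Human", "CMPD2", "SMILES_X", 20),
    ("GENE1", "Human", "CMPD1", "SMILES_X", 20),
    ("GENE1", "Human", "CMPD1", "SMILES_X", 20),  # duplicate record
    ("GENE1", "Human", "CMPD3", "SMILES_Y", 5),   # below the size limit
    ("GENE2", "Rat", "CMPD1", "SMILES_X", 20),
]

summary = summarise_actives(rows)
for record in summary:
    print(record)
```

The same structure appears once per target, so a USMILES shared by two targets yields two rows, each with its own list of compound IDs.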

See SQL file [tt_curve_data_actives.sql](SQL/tt_curve_data_actives.sql) for details of how table '`tt_curve_data_actives`' is created.

See notebook [02_Map_ChEMBL_targets_and_get_curve_data](02_Map_ChEMBL_targets_and_get_curve_data.ipynb#get_curve_data) for details on how data table '`tt_curve_data_v1`' was created.

See notebook [03_Target_Fixes](03_Target_Fixes.ipynb#exclude) for details of the '`exclude`' flag.

In [2]:
print(open("SQL/tt_curve_data_actives.sql").read())

----------------------------------------------------------------------------------------------------
-- 
-- tt_curve_data_actives.sql
-- 
-- Create summary table of distinct active compounds for each target.
-- 
-- * Compounds are defined by USMILES.
-- 
-- * Targets are defined as symbol/species pairs.
-- 
-- * Size filtering (based on minimum and maximum heavy atom counts) is applied at this stage.
-- 
-- * Note the use of source table 'tt_curve_data_v1', as only actives are of interest here.
-- 
----------------------------------------------------------------------------------------------------

-- drop table tt_curve_data_actives;

--

create table tt_curve_data_actives as
select
    symbol
  , species
  , usmiles
  , wm_concat(distinct cmpd) as cmpds -- NB distinct clause here orders concatenated values
  , count(cmpd) as count
from (
  select distinct
      a.symbol
    , a.species
    , a.parent_cmpd_chemblid as cmpd
    , b.usmiles
  from
    tt_curve_data_v1 a
    join tt_stru

In [3]:
actives = pd.read_sql_table('tt_curve_data_actives', engine)

actives.shape

(194402, 5)
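
The five columns correspond to the select list in the SQL above (`symbol`, `species`, `usmiles`, `cmpds`, `count`), with one row per distinct structure per target. A quick way to inspect such a table is to count structures per target. The sketch below uses a small stand-in frame, `actives_demo`, with made-up placeholder values so that it runs without a database connection; the names `actives_demo`, `per_target`, `n_structures` and `n_cmpds` are illustrative only.

```python
import pandas as pd

# Stand-in for the real 'actives' table: same column names, invented values.
actives_demo = pd.DataFrame(
    {
        "symbol": ["GENE1", "GENE1", "GENE2"],
        "species": ["Human", "Human", "Rat"],
        "usmiles": ["SMILES_X", "SMILES_Y", "SMILES_X"],
        "cmpds": ["CMPD1,CMPD2", "CMPD3", "CMPD1"],
        "count": [2, 1, 1],
    }
)

# One row per target: number of distinct structures and the
# sum of the per-structure compound counts.
per_target = (
    actives_demo
    .groupby(["symbol", "species"])
    .agg(n_structures=("usmiles", "nunique"), n_cmpds=("count", "sum"))
    .reset_index()
    .sort_values("n_structures", ascending=False)
)

print(per_target)
```

Replacing `actives_demo` with the `actives` frame loaded above should give the same kind of per-target summary for the full table.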