# Mechanisms of Action (MoA) Prediction
Can you improve the algorithm that classifies drugs based on their biological activity?

The [Connectivity Map](https://clue.io), a project within the Broad Institute of MIT and Harvard, together with the [Laboratory for Innovation Science at Harvard (LISH)](http://lish.harvard.edu) and the [NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS)](http://lincsproject.org), presents this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.

**What is the Mechanism of Action (MoA) of a drug? And why is it important?**

In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

**How do we determine the MoAs of a new drug?**

One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

In this competition, you will have access to a unique dataset that combines gene expression and cell viability data. The data is based on a new technology that simultaneously measures, within the same samples, human cells’ responses to drugs in a pool of 100 different cell types, which removes the need to identify ex ante which cell types are better suited to a given drug. In addition, you will have access to MoA annotations for more than 5,000 drugs in this dataset.

As is customary, the dataset has been split into training and testing subsets. Your task is to use the training dataset to develop an algorithm that automatically assigns one or more MoA classes to each case in the test set. Because drugs can have multiple MoA annotations, the task is formally a multi-label classification problem.

**How is the accuracy of a solution evaluated?**

Solutions are evaluated against the MoA annotations using the average value of the [logarithmic loss function](https://www.kaggle.com/c/lish-moa/overview/evaluation) applied to each drug-MoA annotation pair.
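
As a rough illustration of this metric, the sketch below averages the binary log loss over every sample-target pair. The function name `mean_columnwise_log_loss`, the clipping constant `eps`, and the toy arrays are assumptions made for the example; the competition's evaluation page remains the authoritative definition.

```python
import numpy as np


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, target) pair."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs; eps is an assumed value.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    pair_losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Every target column has the same number of rows, so the mean over all
    # pairs equals the mean of the per-column means.
    return pair_losses.mean()


# Toy example: 3 samples, 2 targets (values are made up).
y_true = np.array([[0, 1], [0, 0], [1, 0]])
y_half = np.full(y_true.shape, 0.5)

score_half = mean_columnwise_log_loss(y_true, y_half)
score_perfect = mean_columnwise_log_loss(y_true, y_true)
```

Under this definition, a constant prediction of 0.5 scores ln 2 (about 0.693) regardless of the labels, which gives a simple reference point for any model.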

If successful, you’ll help to develop an algorithm to predict a compound’s MoA given its cellular signature, thus helping scientists advance the drug discovery process.

> **This is a Code Competition. Refer to [Code Requirements](https://www.kaggle.com/c/lish-moa/overview/code-requirements) for details.**

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4688294%2Fa7c39a710116cc60ab0e0707020df4f5%2FUnknown-31?generation=1601643378654178&alt=media)

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4688294%2F7b66ae0c294d9ca67272209d4756e0e9%2Flogo_largetext_preview-4.png?generation=1601643409931253&alt=media)

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4688294%2F18004b33573a510867eac50ae6c68ec2%2FUnknown-32?generation=1601643476632984&alt=media)

Dataset Description
-------------------

In this competition, you will predict multiple Mechanism of Action (MoA) targets for each sample (`sig_id`), given inputs such as gene expression data and cell viability data.

Two notes:

*   The training data has an additional (optional) set of MoA labels that are _not_ included in the test data and are not used for scoring.
*   The re-run dataset has approximately 4x the number of examples seen in the public test set.

Files
-----

*   `train_features.csv` - Features for the training set. Features prefixed `g-` are gene expression data, and features prefixed `c-` are cell viability data. `cp_type` indicates samples treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`); control perturbations have no MoAs. `cp_time` and `cp_dose` indicate treatment duration (24, 48, or 72 hours) and dose (high or low).
*   `train_drug.csv` - An anonymized `drug_id` for each sample, provided for the training set only.
*   `train_targets_scored.csv` - The binary MoA targets that are scored.
*   `train_targets_nonscored.csv` - Additional (optional) binary MoA responses for the training data. These are neither predicted nor scored.
*   `test_features.csv` - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
*   `sample_submission.csv` - A submission file in the correct format.
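
The naming scheme described for the feature files makes it straightforward to separate the feature groups by prefix. The helper below is a hypothetical sketch: `split_feature_columns` and the one-row toy frame are not part of the competition material and only mimic the column names of `train_features.csv`.

```python
import pandas as pd


def split_feature_columns(columns):
    """Group column names into gene, cell viability, and remaining columns."""
    gene_cols = [c for c in columns if c.startswith("g-")]
    cell_cols = [c for c in columns if c.startswith("c-")]
    grouped = set(gene_cols) | set(cell_cols)
    meta_cols = [c for c in columns if c not in grouped]
    return gene_cols, cell_cols, meta_cols


# Hypothetical miniature frame that follows the same naming scheme.
toy_df = pd.DataFrame(
    {
        "cp_type": ["trt_cp"],
        "cp_time": [24],
        "cp_dose": ["D1"],
        "g-0": [1.062],
        "g-1": [0.5577],
        "c-0": [0.2862],
    }
)

gene_cols, cell_cols, meta_cols = split_feature_columns(toy_df.columns)
```

Keeping the groups separate can be handy when a preprocessing step should apply to only one kind of measurement.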

Link: https://www.kaggle.com/competitions/lish-moa

In [1]:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

In [2]:
%load_ext nb_black

In [3]:
le = preprocessing.LabelEncoder()

In [4]:
!ls ../../data/lish-moa

sample_submission.csv  train_drug.csv		    train_targets_scored.csv
submission.csv	       train_features.csv
test_features.csv      train_targets_nonscored.csv


In [5]:
sample_submission_df = pd.read_csv(
    "../../data/lish-moa/sample_submission.csv"
).set_index("sig_id")
sample_submission_df

sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
id_0004d9e33,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_001897cda,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_002429b5b,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_00276f245,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_0027f1083,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
id_ff7004b87,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_ff925dd0d,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_ffb710450,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
id_ffbb869f2,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5


In [6]:
test_features_df = pd.read_csv("../../data/lish-moa/test_features.csv").set_index(
    "sig_id"
)
test_features_df

sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,g-6,...,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
id_0004d9e33,trt_cp,24,D1,-0.5458,0.1306,-0.5135,0.4408,1.5500,-0.1644,-0.2140,...,0.0981,0.7978,-0.1430,-0.2067,-0.2303,-0.1193,0.0210,-0.0502,0.1510,-0.7750
id_001897cda,trt_cp,72,D1,-0.1829,0.2320,1.2080,-0.4522,-0.3652,-0.3319,-1.8820,...,-0.1190,-0.1852,-1.0310,-1.3670,-0.3690,-0.5382,0.0359,-0.4764,-1.3810,-0.7300
id_002429b5b,ctl_vehicle,24,D1,0.1852,-0.1404,-0.3911,0.1310,-1.4380,0.2455,-0.3390,...,-0.2261,0.3370,-1.3840,0.8604,-1.9530,-1.0140,0.8662,1.0160,0.4924,-0.1942
id_00276f245,trt_cp,24,D2,0.4828,0.1955,0.3825,0.4244,-0.5855,-1.2020,0.5998,...,0.1260,0.1570,-0.1784,-1.1200,-0.4325,-0.9005,0.8131,-0.1305,0.5645,-0.5809
id_0027f1083,trt_cp,48,D1,-0.3979,-1.2680,1.9130,0.2057,-0.5864,-0.0166,0.5128,...,0.4965,0.7578,-0.1580,1.0510,0.5742,1.0900,-0.2962,-0.5313,0.9931,1.8380
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
id_ff7004b87,trt_cp,24,D1,0.4571,-0.5743,3.3930,-0.6202,0.8557,1.6240,0.0640,...,-1.1790,-0.6422,-0.4367,0.0159,-0.6539,-0.4791,-1.2680,-1.1280,-0.4167,-0.6600
id_ff925dd0d,trt_cp,24,D1,-0.5885,-0.2548,2.5850,0.3456,0.4401,0.3107,-0.7437,...,0.0210,0.5780,-0.5888,0.8057,0.9312,1.2730,0.2614,-0.2790,-0.0131,-0.0934
id_ffb710450,trt_cp,72,D1,-0.3985,-0.1554,0.2677,-0.6813,0.0152,0.4791,-0.0166,...,0.4418,0.9153,-0.1862,0.4049,0.9568,0.4666,0.0461,0.5888,-0.4205,-0.1504
id_ffbb869f2,trt_cp,48,D2,-1.0960,-1.7750,-0.3977,1.0160,-1.3350,-0.2207,-0.3611,...,0.3079,-0.4473,-0.8192,0.7785,0.3133,0.1286,-0.2618,0.5074,0.7430,-0.0484


In [7]:
train_drug_df = pd.read_csv("../../data/lish-moa/train_drug.csv").set_index("sig_id")
train_drug_df

sig_id,drug_id
id_000644bb2,b68db1d53
id_000779bfc,df89a8e5a
id_000a6266a,18bb41b2c
id_0015fd391,8c7f86626
id_001626bd3,7cbed3131
...,...
id_fffb1ceed,df1d0a5a1
id_fffb70c0c,ecf3b6b74
id_fffc1c3f4,cacb2b860
id_fffcb9e7c,8b87a7a83


In [8]:
train_features_df = pd.read_csv("../../data/lish-moa/train_features.csv").set_index(
    "sig_id"
)
train_features_df

sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,g-6,...,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
id_000644bb2,trt_cp,24,D1,1.0620,0.5577,-0.2479,-0.6208,-0.1944,-1.0120,-1.0220,...,0.2862,0.2584,0.8076,0.5523,-0.1912,0.6584,-0.3981,0.2139,0.3801,0.4176
id_000779bfc,trt_cp,72,D1,0.0743,0.4087,0.2991,0.0604,1.0190,0.5207,0.2341,...,-0.4265,0.7543,0.4708,0.0230,0.2957,0.4899,0.1522,0.1241,0.6077,0.7371
id_000a6266a,trt_cp,48,D1,0.6280,0.5817,1.5540,-0.0764,-0.0323,1.2390,0.1715,...,-0.7250,-0.6297,0.6103,0.0223,-1.3240,-0.3174,-0.6417,-0.2187,-1.4080,0.6931
id_0015fd391,trt_cp,48,D1,-0.5138,-0.2491,-0.2656,0.5288,4.0620,-0.8095,-1.9590,...,-2.0990,-0.6441,-5.6300,-1.3780,-0.8632,-1.2880,-1.6210,-0.8784,-0.3876,-0.8154
id_001626bd3,trt_cp,72,D2,-0.3254,-0.4009,0.9700,0.6919,1.4180,-0.8244,-0.2800,...,0.0042,0.0048,0.6670,1.0690,0.5523,-0.3031,0.1094,0.2885,-0.3786,0.7125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
id_fffb1ceed,trt_cp,24,D2,0.1394,-0.0636,-0.1112,-0.5080,-0.4713,0.7201,0.5773,...,0.1969,0.0262,-0.8121,0.3434,0.5372,-0.3246,0.0631,0.9171,0.5258,0.4680
id_fffb70c0c,trt_cp,24,D2,-1.3260,0.3478,-0.3743,0.9905,-0.7178,0.6621,-0.2252,...,0.4286,0.4426,0.0423,-0.3195,-0.8086,-0.9798,-0.2084,-0.1224,-0.2715,0.3689
id_fffc1c3f4,ctl_vehicle,48,D2,0.3942,0.3756,0.3109,-0.7389,0.5505,-0.0159,-0.2541,...,0.5409,0.3755,0.7343,0.2807,0.4116,0.6422,0.2256,0.7592,0.6656,0.3808
id_fffcb9e7c,trt_cp,24,D1,0.6660,0.2324,0.4392,0.2044,0.8531,-0.0343,0.0323,...,-0.1105,0.4258,-0.2012,0.1506,1.5230,0.7101,0.1732,0.7015,-0.6290,0.0740


In [9]:
train_targets_nonscored_df = pd.read_csv(
    "../../data/lish-moa/train_targets_nonscored.csv"
).set_index("sig_id")
train_targets_nonscored_df

sig_id,abc_transporter_expression_enhancer,abl_inhibitor,ace_inhibitor,acetylcholine_release_enhancer,adenosine_deaminase_inhibitor,adenosine_kinase_inhibitor,adenylyl_cyclase_inhibitor,age_inhibitor,alcohol_dehydrogenase_inhibitor,aldehyde_dehydrogenase_activator,...,ve-cadherin_antagonist,vesicular_monoamine_transporter_inhibitor,vitamin_k_antagonist,voltage-gated_calcium_channel_ligand,voltage-gated_potassium_channel_activator,voltage-gated_sodium_channel_blocker,wdr5_mll_interaction_inhibitor,wnt_agonist,xanthine_oxidase_inhibitor,xiap_inhibitor
id_000644bb2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_000779bfc,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_000a6266a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_0015fd391,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_001626bd3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
id_fffb1ceed,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_fffb70c0c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_fffc1c3f4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_fffcb9e7c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
train_targets_scored_df = pd.read_csv(
    "../../data/lish-moa/train_targets_scored.csv"
).set_index("sig_id")
train_targets_scored_df

sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
id_000644bb2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_000779bfc,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_000a6266a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_0015fd391,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_001626bd3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
id_fffb1ceed,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_fffb70c0c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_fffc1c3f4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_fffcb9e7c,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Understanding target

In [11]:
targets_counts_df = train_targets_scored_df.T.apply(
    pd.Series.value_counts, axis=1, normalize=True
).fillna(0)
targets_counts_df

Unnamed: 0,0,1
5-alpha_reductase_inhibitor,0.999286,0.000714
11-beta-hsd1_inhibitor,0.999244,0.000756
acat_inhibitor,0.998992,0.001008
acetylcholine_receptor_agonist,0.992021,0.007979
acetylcholine_receptor_antagonist,0.987360,0.012640
...,...,...
ubiquitin_specific_protease_inhibitor,0.999748,0.000252
vegfr_inhibitor,0.992861,0.007139
vitamin_b,0.998908,0.001092
vitamin_d_receptor_agonist,0.998362,0.001638


In [12]:
targets_counts_df.mean()

0    0.996566
1    0.003434
dtype: float64

In [13]:
train_targets_scored_df.sum(axis=1).describe()

count    23814.000000
mean         0.707315
std          0.679532
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          7.000000
dtype: float64

In [14]:
cnt = train_targets_scored_df.loc["id_000a6266a"]
cnt[cnt > 0]

bcr-abl_inhibitor    1
kit_inhibitor        1
pdgfr_inhibitor      1
Name: id_000a6266a, dtype: int64

In [15]:
train_targets_scored_df.drop_duplicates()

sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
id_000644bb2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_000779bfc,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_000a6266a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_001626bd3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_0020d0484,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
id_8caf0dc28,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_8eff3528a,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_b6eb21420,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
id_ba5edffaf,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Prepare

In [16]:
# Non-numeric columns, found through the public select_dtypes API
cat_columns = set(train_features_df.select_dtypes(exclude="number").columns)
cat_columns

{'cp_dose', 'cp_type'}

In [17]:
# Fit the encoder on each training column, then reuse that same mapping for
# the test column so identical categories always receive identical codes.
for col in ["cp_dose", "cp_type"]:
    train_features_df[col] = le.fit_transform(train_features_df[col])
    test_features_df[col] = le.transform(test_features_df[col])

In [18]:
X_test = test_features_df
X_train = train_features_df

X_test.shape, X_train.shape

((3982, 875), (23814, 875))

# Train

## Test sampling

In [19]:
y_train = train_targets_scored_df.iloc[:, 0]
y_train.value_counts(normalize=True)

0    0.999286
1    0.000714
Name: 5-alpha_reductase_inhibitor, dtype: float64
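
With a positive rate this low, a plain random split can leave one part with very few or no positives. One possible remedy, sketched here on synthetic data rather than on the competition files, is to pass `stratify=` to `train_test_split` so both parts keep the same class ratio. The array sizes and the 2% positive rate are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 5 features, 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# stratify=y keeps the class proportions equal in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

n_pos_train = int(y_tr.sum())
n_pos_val = int(y_val.sum())
```

Stratification needs at least two samples of each class, so the rarest targets may call for a different validation scheme.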

In [20]:
X_sub_train, X_sub_true, y_sub_train, y_sub_true = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_sub_train.shape, X_sub_true.shape, y_sub_train.shape, y_sub_true.shape

((21432, 875), (2382, 875), (21432,), (2382,))

In [21]:
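# Caution: these are the same arguments as in the previous cell, so X_sub_val
# holds exactly the same rows as X_sub_true and is not an independent hold-out.
X_sub_train, X_sub_val, y_sub_train, y_sub_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_sub_train.shape, X_sub_val.shape, y_sub_train.shape, y_sub_val.shape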

((21432, 875), (2382, 875), (21432,), (2382,))

In [22]:

model = CatBoostClassifier(logging_level="Silent")

model.fit(
    Pool(X_sub_train, y_sub_train),
    eval_set=Pool(X_sub_val, y_sub_val),
    verbose=False,
    plot=True,
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7fc60d5b3880>

In [23]:
(model.predict(X_sub_true) == y_sub_true).sum() / len(y_sub_true)

0.9995801847187238
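
An accuracy this close to 1 is less informative than it looks: about 99.93% of the rows for this target are negatives, so a model that always predicts 0 would score nearly as well. Since the competition is scored with log loss on probabilities, a comparison along the following lines may be more telling. This is a sketch on synthetic labels; `binary_log_loss`, the sample count, and the positive rate are assumptions for the example.

```python
import numpy as np


def binary_log_loss(y, p, eps=1e-15):
    """Mean binary log loss; eps is an assumed clipping constant."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())


# Synthetic labels with 0.07% positives, similar in spirit to a rare target.
n_samples = 10_000
y_true = np.zeros(n_samples, dtype=int)
y_true[:7] = 1

# A classifier that always answers "negative" still looks excellent.
y_hard = np.zeros(n_samples, dtype=int)
accuracy = float((y_hard == y_true).mean())

# Log loss of two constant probability predictions.
prior = y_true.mean()
loss_prior = binary_log_loss(y_true, np.full(n_samples, prior))
loss_half = binary_log_loss(y_true, np.full(n_samples, 0.5))
```

Here the all-negative predictor reaches an accuracy of 0.9993, while the log loss separates the two constant predictions clearly: predicting the class prior gives a loss below 0.01, whereas predicting 0.5 gives ln 2.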

## Loop

In [24]:
predictions = {}

for col_name in tqdm(train_targets_scored_df.columns):
    y_col_train = train_targets_scored_df[col_name]

    X_sub_train, X_sub_val, y_sub_train, y_sub_val = train_test_split(
        X_train, y_col_train, test_size=0.1, random_state=42
    )

    # One independent binary classifier per scored target
    model = CatBoostClassifier(logging_level="Silent")
    model.fit(
        Pool(X_sub_train, y_sub_train),
        eval_set=Pool(X_sub_val, y_sub_val),
        verbose=False,
    )

    # Probability of the positive class for every test row
    predictions[col_name] = model.predict_proba(X_test)[:, 1]

# Build the frame in one step instead of inserting 206 columns one at a time
submission_df = pd.DataFrame(predictions)


In [26]:
submission_df

Unnamed: 0,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
0,0.000399,0.000114,0.001254,0.013330,0.017206,0.004033,0.000487,0.004548,0.000009,0.015735,...,0.000007,0.001086,0.001400,0.000259,0.002052,0.000012,0.001299,0.000973,0.000123,0.000886
1,0.000042,0.000091,0.001185,0.007036,0.011002,0.002282,0.000184,0.004763,0.000007,0.008311,...,0.000009,0.000544,0.001613,0.000036,0.007362,0.000017,0.005759,0.000072,0.000071,0.000708
2,0.000063,0.000045,0.001093,0.006566,0.006104,0.002004,0.001083,0.003234,0.000006,0.007524,...,0.000030,0.000477,0.002587,0.000039,0.003101,0.000010,0.001458,0.000432,0.000093,0.001302
3,0.000070,0.000056,0.001580,0.008998,0.005660,0.002878,0.000365,0.001013,0.000005,0.008037,...,0.000012,0.000398,0.000906,0.000501,0.003969,0.000011,0.000552,0.000108,0.000011,0.000624
4,0.000483,0.000082,0.001161,0.013505,0.017153,0.001164,0.001193,0.002698,0.000004,0.009571,...,0.000012,0.000406,0.001733,0.000141,0.001411,0.000007,0.000494,0.000551,0.000029,0.000423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3977,0.000064,0.000103,0.001601,0.005926,0.011938,0.002207,0.000438,0.002402,0.000016,0.004471,...,0.000008,0.000293,0.002129,0.013586,0.003649,0.000013,0.006320,0.000411,0.000044,0.001121
3978,0.000440,0.000385,0.001963,0.008966,0.038547,0.003649,0.001772,0.008255,0.000005,0.007837,...,0.000009,0.000552,0.003091,0.000099,0.002773,0.000013,0.001453,0.000333,0.000030,0.000770
3979,0.000250,0.000079,0.001043,0.008979,0.014069,0.003795,0.001591,0.002849,0.000007,0.011246,...,0.000019,0.000291,0.001683,0.000031,0.001451,0.000007,0.000879,0.000148,0.000012,0.001138
3980,0.000143,0.000029,0.000802,0.012261,0.016453,0.000882,0.000344,0.003108,0.000016,0.004973,...,0.000005,0.000202,0.001758,0.000201,0.002356,0.000009,0.000982,0.000372,0.000008,0.000666


In [25]:
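# Restore sig_id as the index so the file has the same layout as sample_submission.csv
submission_df.set_index(X_test.index).to_csv("../../data/lish-moa/submission.csv")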
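
The dataset description states that control perturbations have no MoAs, so one optional post-processing step is to set every prediction for a `ctl_vehicle` row to zero before writing the file. The sketch below uses a hypothetical three-row frame with the raw string values of `cp_type` (in this notebook the column has already been label-encoded at this point), and the target names `moa_1` and `moa_2` are placeholders.

```python
import pandas as pd

# Hypothetical miniature test features, indexed by sig_id.
test_features = pd.DataFrame(
    {"cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"]},
    index=pd.Index(["id_a", "id_b", "id_c"], name="sig_id"),
)

# Hypothetical model output with a default integer index.
preds = pd.DataFrame({"moa_1": [0.02, 0.01, 0.30], "moa_2": [0.10, 0.05, 0.001]})

# Re-attach sig_id so rows can be matched to their features.
submission = preds.set_index(test_features.index)

# Control rows have no MoAs by definition, so zero their predictions.
is_control = test_features["cp_type"] == "ctl_vehicle"
submission.loc[is_control] = 0.0
```

Whether this step helps the score depends on how confident the model already is on control rows, so it is worth checking on held-out data first.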
