# Basic info about this Competition (Directly copied from the page)
**What is the Mechanism of Action (MoA) of a drug? And why is it important?**

In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

**How do we determine the MoAs of a new drug?**

One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

In this competition, you will have access to a unique dataset that combines gene expression and cell viability data. The data is based on a new technology that measures simultaneously (within the same samples) human cells’ responses to drugs in a pool of 100 different cell types (thus solving the problem of identifying ex-ante, which cell types are better suited for a given drug). In addition, you will have access to MoA annotations for more than 5,000 drugs in this dataset.

As is customary, the dataset has been split into testing and training subsets. Hence, your task is to use the training dataset to develop an algorithm that automatically labels each case in the test set as one or more MoA classes. Note that since drugs can have multiple MoA annotations, the task is formally a multi-label classification problem.
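
To make the multi-label framing concrete, here is a minimal sketch of how such targets are usually laid out: one row per sample and one 0/1 column per MoA. The drug names and MoA labels below are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy annotations: each drug maps to zero, one, or several MoA labels.
annotations = {
    "drug_a": ["moa_x"],
    "drug_b": ["moa_x", "moa_y"],
    "drug_c": [],  # e.g. a sample with no MoA annotation
}

# Collect every distinct label, in a stable order.
labels = sorted({moa for moas in annotations.values() for moa in moas})

# One row per drug, one binary indicator column per MoA label.
targets = pd.DataFrame(
    [[int(label in moas) for label in labels] for moas in annotations.values()],
    index=list(annotations),
    columns=labels,
)
print(targets)
```

A row can contain several 1s (or none), which is what separates a multi-label problem from an ordinary multi-class one.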

In [None]:
import plotly.offline
plotly.offline.init_notebook_mode(connected=True)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import seaborn as sns 
import matplotlib.pyplot as plt 
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
path = '../input/lish-moa'
os.listdir(path) 

In [None]:
test_features = pd.read_csv('/kaggle/input/lish-moa/test_features.csv')
train_features = pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
train_targets_scored = pd.read_csv('/kaggle/input/lish-moa/train_targets_scored.csv')
train_targets_nonscored = pd.read_csv('/kaggle/input/lish-moa/train_targets_nonscored.csv')


# Let us explore the files

### Features for the training set
- **g-** features hold gene expression data.
- **c-** features hold cell viability data.
- **cp_type** indicates whether a sample was treated with a compound (`trt_cp` in the data; the competition page calls it cp_vehicle) or with a control perturbation (`ctl_vehicle` in the data; ctrl_vehicle on the page). Control perturbations have no MoAs.
- **cp_time** and **cp_dose** indicate treatment duration (24, 48, or 72 hours) and dose (high or low).

In [None]:
train_features.head()
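
The column families described above can be separated by prefix. The following is a small sketch on a toy frame standing in for `train_features`; the toy values and the variable names `gene_cols`, `cell_cols`, and `meta_cols` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for train_features (the real file has far more columns).
toy = pd.DataFrame({
    "sig_id": ["id_1", "id_2"],
    "cp_time": [24, 48],
    "g-0": [0.1, -0.3],
    "g-1": [1.2, 0.4],
    "c-0": [-0.5, 0.9],
})

# Split the columns by their prefix.
gene_cols = [c for c in toy.columns if c.startswith("g-")]
cell_cols = [c for c in toy.columns if c.startswith("c-")]
meta_cols = [c for c in toy.columns if c not in gene_cols + cell_cols]

print(len(gene_cols), "gene expression columns")
print(len(cell_cols), "cell viability columns")
print("other columns:", meta_cols)
```

The same three list comprehensions should work unchanged on the real `train_features` frame.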

### train_targets_scored.csv - The binary MoA targets that are scored.

In [None]:
train_targets_scored.head()

### train_targets_nonscored.csv - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.

In [None]:
train_targets_nonscored.head()

# Information about all the CSV files

In [None]:
print('______________train_features________________')
train_features.info()
print('______________________________')
print('______________test_features________________')
test_features.info()
print('______________________________')
print('_____________train_targets_scored_________________')
train_targets_scored.info()
print('______________________________')
print('________________train_targets_nonscored______________')
train_targets_nonscored.info()

# EDA of training and testing set: 


In [None]:

print("Shape of the training set: ", train_features.shape)
print('unique ids: ', len(train_features.sig_id.unique()))

print("Shape of the testing set: ", test_features.shape)
print('unique ids: ', len(test_features.sig_id.unique()))

### For the training file:
- There are 23814 rows and 876 columns in the CSV file.
- There are 23814 unique ids.

### For the testing file:
- There are 3982 rows and 876 columns in the CSV file.
- There are 3982 unique ids.
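
Since the row count equals the number of unique ids in each file, every `sig_id` appears exactly once. A natural follow-up check, sketched here on hypothetical toy ids rather than the real files, is whether any id is shared between the training and testing sets.

```python
import pandas as pd

# Hypothetical toy id columns standing in for train_features.sig_id and test_features.sig_id.
train_ids = pd.Series(["id_a", "id_b", "id_c"], name="sig_id")
test_ids = pd.Series(["id_d", "id_e"], name="sig_id")

# is_unique is True when no value is repeated within the Series.
ids_unique = train_ids.is_unique and test_ids.is_unique

# Ids that occur in both sets (an empty set means no overlap).
overlap = set(train_ids) & set(test_ids)

print("all ids unique:", ids_unique)
print("ids shared by train and test:", len(overlap))
```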

In [None]:
# seaborn is already imported as sns; pass the column as x= so the call works on current seaborn versions
fig, ax = plt.subplots(1, 2)
sns.countplot(x=train_features['cp_time'], ax=ax[0]).set_title('For Training Set')
sns.countplot(x=test_features['cp_time'], ax=ax[1]).set_title('For Testing Set')
plt.tight_layout()

In [None]:
# Same comparison for the perturbation type
fig, ax = plt.subplots(1, 2)
sns.countplot(x=train_features['cp_type'], ax=ax[0]).set_title('For Training Set')
sns.countplot(x=test_features['cp_type'], ax=ax[1]).set_title('For Testing Set')
plt.tight_layout()

In [None]:
# Same comparison for the dose
fig, ax = plt.subplots(1, 2)
sns.countplot(x=train_features['cp_dose'], ax=ax[0]).set_title('For Training Set')
sns.countplot(x=test_features['cp_dose'], ax=ax[1]).set_title('For Testing Set')
plt.tight_layout()
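
Count plots show absolute numbers, which are hard to compare when the two sets differ in size. As a sketch of a numeric alternative, the shares of each category can be placed side by side. The toy frames below are hypothetical stand-ins for `train_features` and `test_features`.

```python
import pandas as pd

# Hypothetical toy frames; only the cp_time column is needed here.
toy_train = pd.DataFrame({"cp_time": [24, 24, 48, 72]})
toy_test = pd.DataFrame({"cp_time": [24, 48, 48, 72]})

# Fraction of rows per category in each set, aligned on the category value.
shares = pd.concat(
    {
        "train": toy_train["cp_time"].value_counts(normalize=True),
        "test": toy_test["cp_time"].value_counts(normalize=True),
    },
    axis=1,
).sort_index()
print(shares)
```

Similar shares in both columns would suggest the sets were split evenly on that feature.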

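## Let us look at the correlations between features

In [None]:
# Keep only the numeric columns: cp_type and cp_dose are strings, which corr() rejects in pandas 2.x
x = train_features.drop(['sig_id'], axis=1).select_dtypes(include='number')
corr = x.corr()
corr.style.background_gradient(cmap='coolwarm')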
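
A styled correlation matrix with hundreds of columns is hard to read. One possible complement, sketched here on synthetic data with hypothetical column names, is to list the most strongly correlated pairs directly.

```python
import numpy as np
import pandas as pd

# Synthetic toy data: g-1 is nearly a copy of g-0, c-0 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "g-0": base,
    "g-1": base + rng.normal(scale=0.1, size=200),
    "c-0": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs.head())
```

On the real data, the `toy` frame would be replaced by the numeric feature columns.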

# Let us explore the Target of training set 

# Let us count the positive cases 

In [None]:
# drop the id column ('sig_id') and count the positive samples in each target column
df = train_targets_scored.drop(['sig_id'], axis=1).sum(axis=0).sort_values(ascending=False).reset_index()


# positive cases 

In [None]:
df.columns = ['column', 'nonzero_records']
df

In [None]:
# plot the bar 

fig = px.bar(
    df.head(50), 
    x='nonzero_records', 
    y='column', 
    orientation='h', 
    title='Target columns with the most positive samples (top 50)', 
    height=1000, 
    width=800
)
fig.show()

In [None]:
# drop the id column ('sig_id') and compute the percentage of positive samples in each target column
df1 = train_targets_scored.drop(['sig_id'], axis=1).sum(axis=0).sort_values(ascending=False).reset_index()
df1.columns = ['column', '% nonzero_records']
df1['% nonzero_records'] = (df1['% nonzero_records']/len(train_targets_scored))*100
# plot the bar 

fig = px.bar(
    df1.head(50), 
    x='% nonzero_records', 
    y='column', 
    orientation='h', 
    title='Target columns with the highest % of positive samples (top 50)', 
    height=1000, 
    width=800
)
fig.show()
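
The bar charts above summarise positives per target column. A complementary view, sketched here on a hypothetical toy target table, is the number of positive labels per sample, which shows how often a sample carries zero, one, or several MoAs.

```python
import pandas as pd

# Hypothetical toy stand-in for train_targets_scored.
toy_targets = pd.DataFrame({
    "sig_id": ["id_1", "id_2", "id_3", "id_4"],
    "moa_x": [1, 0, 0, 1],
    "moa_y": [1, 0, 0, 0],
    "moa_z": [0, 0, 1, 0],
})

# Sum across the target columns to get the label count for each sample.
labels_per_sample = toy_targets.drop(columns="sig_id").sum(axis=1)

# How many samples have 0, 1, 2, ... positive labels.
distribution = labels_per_sample.value_counts().sort_index()
print(distribution)
```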
