
In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Genomics of Drug Sensitivity in Cancer (GDSC) dataset**

The Genomics of Drug Sensitivity in Cancer (GDSC) dataset is a resource for therapeutic biomarker discovery in cancer research. It contains drug response data, specifically half-maximal inhibitory concentration (IC50) values, for a wide range of anti-cancer drugs tested on over a thousand human cancer cell lines. The features in this dataset include genomic profiles such as gene expression levels, mutation statuses, and copy number variations, alongside the corresponding drug identifiers and cancer types. The primary task is to predict drug sensitivity from these features, which makes it a regression problem whose target variable is the natural log of the IC50 value (LN_IC50).
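
Because the target is stored on a natural-log scale, it can help to see how an LN_IC50 value maps back to a concentration. The following is a minimal sketch using made-up values (they are not taken from the dataset); the recovered IC50 is in whatever concentration unit the original measurements used.

```python
import numpy as np

# Hypothetical LN_IC50 values, for illustration only
ln_ic50 = np.array([-2.0, 0.0, 3.5])

# LN_IC50 is the natural log of IC50, so exponentiating recovers the concentration
ic50 = np.exp(ln_ic50)

# A lower LN_IC50 means a lower concentration achieves 50% inhibition,
# i.e. the cell line is more sensitive to the drug
most_sensitive = int(np.argmin(ln_ic50))
print(ic50, most_sensitive)
```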

In [None]:
!pip install openpyxl

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
from scipy.stats import skew, boxcox
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from category_encoders import TargetEncoder
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings("ignore")

# **1. Data Collection and Consolidation**

In [None]:
# Load the datasets
gdsc_data = pd.read_csv('/kaggle/input/genomics-of-drug-sensitivity-in-cancer-gdsc/GDSC2-dataset.csv')
cell_line_data = pd.read_excel('/kaggle/input/genomics-of-drug-sensitivity-in-cancer-gdsc/Cell_Lines_Details.xlsx', sheet_name='Cell line details')
compound_data = pd.read_csv('/kaggle/input/genomics-of-drug-sensitivity-in-cancer-gdsc/Compounds-annotation.csv')

# Display column names of each dataset to understand their structure
print(gdsc_data.columns)
print(cell_line_data.columns)
print(compound_data.columns)

**Columns of GDSC dataset:**
1. **DATASET:** Identifier for the specific GDSC dataset version.
2. **NLME_RESULT_ID:** Unique identifier for the non-linear mixed effects model result.
3. **NLME_CURVE_ID:** Identifier for the dose-response curve fitted by NLME.
4. **COSMIC_ID:** Unique identifier for the cell line from the COSMIC database.
5. **CELL_LINE_NAME:** Name of the cancer cell line used in the experiment.
6. **SANGER_MODEL_ID:** Identifier used by the Sanger Institute for the cell line model.
7. **TCGA_DESC:** Description of the cancer type according to The Cancer Genome Atlas.
8. **DRUG_ID:** Unique identifier for the drug used in the experiment.
9. **DRUG_NAME:** Name of the drug used in the experiment.
10. **PUTATIVE_TARGET:** The presumed molecular target of the drug.
11. **PATHWAY_NAME:** The biological pathway affected by the drug.
12. **COMPANY_ID:** Identifier for the company that provided the drug.
13. **WEBRELEASE:** Date or version of web release for this data.
14. **MIN_CONC:** Minimum concentration of the drug used in the experiment.
15. **MAX_CONC:** Maximum concentration of the drug used in the experiment.
16. **LN_IC50:** Natural log of the half-maximal inhibitory concentration (IC50).
17. **AUC:** Area under the fitted dose-response curve; lower values indicate greater drug sensitivity.
18. **RMSE:** Root Mean Square Error, indicating the fit quality of the dose-response curve.
19. **Z_SCORE:** Standardized score of the drug response, allowing comparison across different drugs and cell lines.

These columns provide a comprehensive set of information about the drug sensitivity experiments, including identifiers for cell lines and drugs, experimental conditions, and various measures of drug response.

**Columns of Cell Line Details:**
1. **Sample Name:** Unique identifier for the cell line sample.
2. **COSMIC identifier:** Unique ID from the COSMIC database for the cell line.
3. **Whole Exome Sequencing (WES):** Flag indicating whether whole exome sequencing (mutation) data are available for the cell line.
4. **Copy Number Alterations (CNA):** Flag indicating whether copy number data are available for the cell line.
5. **Gene Expression:** Flag indicating whether gene expression data are available for the cell line.
6. **Methylation:** Flag indicating whether DNA methylation data are available for the cell line.
7. **Drug Response:** Flag indicating whether drug response data are available for the cell line.
8. **GDSC Tissue descriptor 1:** Primary tissue type classification.
9. **GDSC Tissue descriptor 2:** Secondary tissue type classification.
10. **Cancer Type (matching TCGA label):** Cancer type according to TCGA classification.
11. **Microsatellite instability Status (MSI):** Indicates the cell line's MSI status.
12. **Screen Medium:** The growth medium used for culturing the cell line.
13. **Growth Properties:** Characteristics of how the cell line grows in culture.

**Columns of Compounds Annotation:**
1. **DRUG_ID:** Unique identifier for the drug.
2. **SCREENING_SITE:** Location where the drug screening was performed.
3. **DRUG_NAME:** Name of the drug compound.
4. **SYNONYMS:** Alternative names for the drug.
5. **TARGET:** The molecular target(s) of the drug.
6. **TARGET_PATHWAY:** The biological pathway(s) targeted by the drug.

These columns provide detailed information about the cell lines used in the experiments and the drugs tested, including genomic characteristics, growth conditions, and drug properties.

In [None]:
# For GDSC2-dataset, keep these columns:
gdsc_columns = ['COSMIC_ID', 'CELL_LINE_NAME', 'TCGA_DESC', 'DRUG_ID', 'DRUG_NAME',
                'PUTATIVE_TARGET', 'PATHWAY_NAME', 'LN_IC50', 'AUC', 'Z_SCORE']

# For Cell-line-data, we'll join on 'COSMIC identifier' and keep:
cell_columns = ['COSMIC identifier', 'Sample Name', 'GDSC\nTissue descriptor 1',
                'GDSC\nTissue\ndescriptor 2', 'Cancer Type\n(matching TCGA label)',
                'Microsatellite \ninstability Status (MSI)', 'Screen Medium', 'Growth Properties',
                'Whole Exome Sequencing (WES)','Copy Number Alterations (CNA)', 'Gene Expression', 'Methylation']

# For Compounds-annotation, we'll join on 'DRUG_ID' and keep:
compound_columns = ['DRUG_ID', 'TARGET', 'TARGET_PATHWAY']

# Select relevant columns
gdsc_data = gdsc_data[gdsc_columns]
cell_line_data = cell_line_data[cell_columns]
compound_data = compound_data[compound_columns]

# Rename columns for consistency
cell_line_data = cell_line_data.rename(columns={'COSMIC identifier': 'COSMIC_ID', 'Sample Name': 'CELL_LINE_NAME' ,
                                                'GDSC\nTissue descriptor 1':'GDSC Tissue descriptor 1',
                                                'GDSC\nTissue\ndescriptor 2':'GDSC Tissue descriptor 2',
                                                'Cancer Type\n(matching TCGA label)':'Cancer Type (matching TCGA label)',
                                                'Microsatellite \ninstability Status (MSI)':'Microsatellite instability Status (MSI)',
                                                'Whole Exome Sequencing (WES)': 'WES',
                                                'Copy Number Alterations (CNA)': 'CNA'
                                               })

## 1.1. Merging Tables
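
Both merges in this section are left joins, so every row of the GDSC table is kept even when no matching cell line or compound exists; unmatched rows simply receive missing values. As a rough sketch of how the match rate could be checked, the example below uses the `indicator` option of `pd.merge` on two toy frames whose values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
responses = pd.DataFrame({"COSMIC_ID": [1, 2, 3], "DRUG_ID": [10, 10, 11]})
cell_lines = pd.DataFrame({"COSMIC_ID": [1, 2], "Screen Medium": ["R", "D/F12"]})

# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for rows with no partner in the right-hand table
checked = pd.merge(responses, cell_lines, on="COSMIC_ID", how="left", indicator=True)

n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{n_unmatched} of {len(checked)} rows had no cell-line match")
```

Passing `validate="many_to_one"` to `pd.merge` would additionally raise an error if the right-hand table contained duplicate keys, which would otherwise silently multiply rows.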

In [None]:
# Merge GDSC2-dataset with Cell-line-annotation
merged_data = pd.merge(gdsc_data, cell_line_data, on=['COSMIC_ID', 'CELL_LINE_NAME'], how='left')

# Merge with Compounds-annotation
final_data = pd.merge(merged_data, compound_data, on='DRUG_ID', how='left')


In [None]:
# Check the shape of the final dataset
print(final_data.shape)

## 1.2. Check for Duplication

In [None]:
duplicated_rows = final_data.duplicated()
sum(duplicated_rows)

In [None]:
# Display the first few rows to verify the merge
final_data.head(2).T

## 1.3. Remove Redundant Features

In [None]:
# Drop the redundant columns only (rows are left untouched)
GDSC_DATASET = final_data.drop(columns=['PUTATIVE_TARGET', 'PATHWAY_NAME', 'WES'])

In [None]:
GDSC_DATASET.shape

In [None]:
GDSC_DATASET.to_csv('GDSC_DATASET.csv',index=False)

# 2. Dataset Overview

In [None]:
# Load the consolidated dataset
Data = pd.read_csv('/kaggle/input/genomics-of-drug-sensitivity-in-cancer-gdsc/GDSC_DATASET.csv')

# Display the column names to understand the structure
print(Data.columns)

In [None]:
# Check the shape of the GDSC dataset
print(Data.shape)

In [None]:
# Display the first few rows of Dataset
Data.head().T

In [None]:
#some information about the attributes(datatypes & null values)
Data.info()

In [None]:
# Check statistical information of Numeric Features

numeric_features = Data.select_dtypes(include=[np.number])
Data.describe(include=[np.number]).transpose()

In [None]:
# Check statistical information of Categorical Features

categorical_features = Data.select_dtypes(include=object)
Data.describe(include=object).transpose()

# 3. Data Preprocessing

## 3.1. Duplication

In [None]:
duplicated_rows = Data.duplicated()
sum(duplicated_rows)

## 3.2. Unique Values

In [None]:
# Get the number of unique values for each column
unique_counts = Data.nunique()
print(unique_counts)

In [None]:
# Set a threshold for the maximum number of unique values to display frequencies
threshold = 1000

# Dictionary to hold value frequencies
value_frequencies = {}

# Iterate over columns to compute value frequencies
for col in Data.columns:
    if unique_counts[col] <= threshold:
        value_counts = Data[col].value_counts()
        value_frequencies[col] = value_counts

# Print the value frequencies for columns with fewer unique values
for col, frequencies in value_frequencies.items():
    print(f"Column '{col}':")
    print(f"Number of unique values: {unique_counts[col]}")
    print("Value frequencies:")
    print(frequencies)
    print()

In [None]:
def get_unique_counts_by_drug(df):
    unique_counts = {}

    for drug in df['DRUG_NAME'].unique():
        drug_data = df[df['DRUG_NAME'] == drug]
        unique_counts[drug] = drug_data.nunique()

    return unique_counts

# Get the unique counts for each drug
drug_unique_counts = get_unique_counts_by_drug(Data)

# Print the results
for drug, counts in drug_unique_counts.items():
    print(f"\nUnique counts for {drug}:")
    print(counts)
    print("-" * 50)  # Separator for readability


### 3.2.1. DRUG_ID and DRUG_NAME discrepancy:
The difference in unique counts between DRUG_ID (295) and DRUG_NAME (286) suggests that some drugs might have multiple IDs or there might be some inconsistencies in the data.

In [None]:
drug_mapping = Data[['DRUG_ID', 'DRUG_NAME']].drop_duplicates()
duplicates = drug_mapping[drug_mapping.duplicated('DRUG_NAME', keep=False)]
print(duplicates)

During data preparation, we found that some drugs in the GDSC dataset have more than one DRUG_ID for the same DRUG_NAME. Despite the different IDs, these entries share the same TARGET and TARGET_PATHWAY information. We therefore use DRUG_NAME as the primary drug identifier in this analysis, and keep the mapping from the original DRUG_IDs to DRUG_NAMEs for reference.
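
The observation that entries with different IDs share the same annotations can be checked directly. The sketch below uses a toy frame with hypothetical drug names and annotations; on the real data the same `groupby` would be applied to `Data`.

```python
import pandas as pd

# Hypothetical mapping: DrugB appears under two different IDs
toy = pd.DataFrame({
    "DRUG_ID": [2002, 2003, 3004],
    "DRUG_NAME": ["DrugB", "DrugB", "DrugC"],
    "TARGET": ["EGFR", "EGFR", "MEK1"],
    "TARGET_PATHWAY": ["EGFR signaling", "EGFR signaling", "ERK MAPK signaling"],
})

# Number of distinct annotation values per drug name; 1 everywhere means consistent
annotation_counts = toy.groupby("DRUG_NAME")[["TARGET", "TARGET_PATHWAY"]].nunique()

# Drug names whose TARGET or TARGET_PATHWAY differs between IDs
inconsistent = annotation_counts[(annotation_counts > 1).any(axis=1)]
print(inconsistent)
```

An empty result supports using DRUG_NAME as the identifier; any rows listed would need a closer look before IDs are collapsed.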

## 3.3. Missing Values

In [None]:
# Check for missing values
print(Data.isnull().sum())

In [None]:
def check_missing_values_by_drug(df):
    missing_values = {}

    for drug in df['DRUG_NAME'].unique():
        drug_data = df[df['DRUG_NAME'] == drug]
        missing_values[drug] = drug_data.isnull().sum()

    return missing_values

# Count missing values for each drug
drug_missing_values = check_missing_values_by_drug(Data)

# Print the results
for drug, missing_counts in drug_missing_values.items():
    print(f"\nMissing values for {drug}:")
    print(missing_counts)
    print(f"Total missing values: {missing_counts.sum()}")
    print("-" * 50)  # Separator for readability


## 3.4. **Missing Value Handling in GDSC Dataset: A Drug-by-Drug Approach**

Our method for handling missing values in the GDSC dataset is implemented on a drug-by-drug basis. This approach is crucial because:

1. Different drugs may have unique patterns of missing data.
2. Drug-specific biological mechanisms can influence how missing values should be imputed.
3. It allows for more precise imputation by considering drug-specific relationships between features.

Let's break down each step of our approach:

### 1. Tissue Descriptors and Cancer Type Handling

**Why**: Tissue and cancer type information is fundamental to understanding drug response variability across different biological contexts.

**How**: We impute missing values using related tissue information within each drug subset. This preserves the biological relevance of the imputed values and maintains consistency across related tissue descriptors.

### 2. TARGET and TARGET_PATHWAY Handling

**Why**: These features are crucial for understanding a drug's mechanism of action, which is typically consistent across samples for a given drug.

**How**: If all values are missing for a drug, we label it as 'Unknown for this drug'. Otherwise, we use the known value for that specific drug, ensuring consistency in the drug's molecular target information.

### 3. Other Categorical Variables

**Why**: Features like MSI status, screen medium, and growth properties can significantly influence drug response and are often related to tissue type.

**How**: We impute based on the most common value within the same primary tissue type for each drug. This maintains the biological relationship between these properties and tissue types.
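
A minimal sketch of this group-wise mode imputation on a toy frame (hypothetical values), kept separate from the full implementation; the helper name `fill_with_group_mode` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GDSC Tissue descriptor 1": ["lung", "lung", "lung", "skin", "skin"],
    "Screen Medium": ["R", "R", np.nan, "D/F12", np.nan],
})

def fill_with_group_mode(s):
    # mode() ignores NaN; fall back to 'Unknown' if the group has no known value
    modes = s.mode()
    return s.fillna(modes.iloc[0] if not modes.empty else "Unknown")

# Fill each missing value with the most common value of its tissue group
toy["Screen Medium"] = (
    toy.groupby("GDSC Tissue descriptor 1")["Screen Medium"].transform(fill_with_group_mode)
)
print(toy)
```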

### 4. Genomic Features Handling

**Why**: Genomic features (CNA, Gene Expression, Methylation) are key determinants of drug response and can vary significantly across tissue types.

**How**: We first attempt to impute based on tissue type within each drug subset. If missing values persist, we use KNN imputation, which can capture more complex relationships in the genomic data.
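
For reference, `KNNImputer` works on numeric matrices and fills each missing entry with the mean of that feature over the nearest rows, where distance is computed on the features both rows have observed. A small sketch on a made-up numeric array (not GDSC data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 2.2],
    [8.0, 9.0],
])

# With 2 neighbours, the missing entry is the mean of the second feature
# over the two rows closest in the first feature (rows 0 and 2)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

One practical caveat: the imputer only acts on entries that are actually NaN. `pd.get_dummies` encodes a missing category as an all-zero row by default, so a one-hot matrix built that way contains nothing for the imputer to fill.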

### 5. Numeric Variables Handling

**Why**: Variables like LN_IC50, AUC, and Z_SCORE directly measure drug response and are critical for downstream analyses.

**How**: We use Random Forest imputation when sufficient data is available, leveraging the complex relationships between genomic features, tissue types, and drug response. For drugs with limited data, we fall back to median imputation grouped by tissue type.
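
The decision between the two strategies can be sketched as a small helper. This is an illustrative simplification under stated assumptions, not the full routine: the function name `impute_numeric`, its parameters, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_numeric(df, col, feature_cols, group_col, min_rows=10):
    """Fill NaNs in `col`: random forest if enough observed rows, else group median."""
    df = df.copy()
    missing = df[col].isna()
    if not missing.any():
        return df
    # One-hot encode the categorical predictors
    X = pd.get_dummies(df[feature_cols])
    if (~missing).sum() > min_rows:
        # Enough observed values: learn the column from the encoded features
        model = RandomForestRegressor(n_estimators=50, random_state=42)
        model.fit(X[~missing], df.loc[~missing, col])
        df.loc[missing, col] = model.predict(X[missing])
    else:
        # Too few observed values: fall back to the median within each group
        df[col] = df.groupby(group_col)[col].transform(lambda s: s.fillna(s.median()))
    return df

toy = pd.DataFrame({
    "tissue": ["lung", "lung", "lung", "skin", "skin", "skin"],
    "LN_IC50": [1.0, 3.0, np.nan, -2.0, np.nan, -4.0],
})

# Only 4 observed values, so the median fallback is used here
filled = impute_numeric(toy, "LN_IC50", ["tissue"], "tissue")
print(filled)
```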

This drug-by-drug approach ensures that we:
1. Preserve drug-specific patterns and relationships in the data.
2. Account for the unique biological context of each drug's action.
3. Maintain consistency in drug-related information across samples.
4. Leverage the most appropriate imputation method based on data availability for each drug.

By combining biological knowledge with advanced statistical techniques, this method provides a robust and biologically relevant solution to missing data in the GDSC dataset, setting a strong foundation for subsequent analyses and drug response predictions.

In [None]:
def handle_missing_values_by_drug(df):
    knn_imputer = KNNImputer(n_neighbors=5)
    numeric_imputers = {}  # one RandomForestRegressor per numeric column, refit for each drug

    for drug in df['DRUG_NAME'].unique():
        drug_data = df[df['DRUG_NAME'] == drug].copy()

        # 1. Tissue Descriptors and Cancer Type Handling
        tissue_cols = ['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2', 'Cancer Type (matching TCGA label)', 'TCGA_DESC']
        for col in tissue_cols:
            if drug_data[col].isnull().any():
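                # Impute based on other tissue information
                for other_col in [c for c in tissue_cols if c != col]:
                    # dropna=False keeps rows whose grouping key is missing as their own group,
                    # so their existing values in `col` are preserved
                    drug_data[col] = drug_data.groupby(other_col, dropna=False)[col].transform(
                        lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown')
                    )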
                # If still null, use overall mode
                drug_data[col] = drug_data[col].fillna(drug_data[col].mode()[0] if not drug_data[col].mode().empty else 'Unknown')

        # 2. TARGET Handling
        if drug_data['TARGET'].isnull().all():
            drug_data['TARGET'] = 'Unknown for this drug'
        else:
            known_target = drug_data['TARGET'].dropna().iloc[0]
            drug_data['TARGET'] = drug_data['TARGET'].fillna(known_target)

        # 2. TARGET_PATHWAY Handling
        if drug_data['TARGET_PATHWAY'].isnull().all():
            drug_data['TARGET_PATHWAY'] = 'Unknown for this drug'
        else:
            known_pathway = drug_data['TARGET_PATHWAY'].dropna().iloc[0]
            drug_data['TARGET_PATHWAY'] = drug_data['TARGET_PATHWAY'].fillna(known_pathway)

        # 3. Other Categorical Variables
        other_categorical_cols = ['Microsatellite instability Status (MSI)', 'Screen Medium', 'Growth Properties']
        for col in other_categorical_cols:
            drug_data[col] = drug_data.groupby('GDSC Tissue descriptor 1')[col].transform(
                lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown')
            )

        # 4. Genomic Features Handling
        genomic_features = ['CNA', 'Gene Expression', 'Methylation']
        for feature in genomic_features:
            if drug_data[feature].isnull().any():
                # First, try to impute based on tissue type
                drug_data[feature] = drug_data.groupby(['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2'])[feature].transform(
                    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else np.nan)
                )
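                # If still null, use KNN imputation
                # Note: pd.get_dummies encodes NaN as an all-zero row by default, so the
                # imputer sees no missing entries here and idxmax then selects the first
                # category for the rows that were still missing
                if drug_data[feature].isnull().any():
                    feature_data = pd.get_dummies(drug_data[feature], prefix=feature)
                    imputed_data = knn_imputer.fit_transform(feature_data)
                    imputed_df = pd.DataFrame(imputed_data, columns=feature_data.columns, index=feature_data.index)
                    drug_data[feature] = imputed_df.idxmax(axis=1).str.split('_').str[1]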

        # 5. Numeric Variables Handling
        numeric_cols = ['LN_IC50', 'AUC', 'Z_SCORE']

        # Prepare features for imputation
        features_for_imputation = pd.get_dummies(drug_data[genomic_features + ['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2']])

        for col in numeric_cols:
            if drug_data[col].isnull().any():
                if col not in numeric_imputers:
                    numeric_imputers[col] = RandomForestRegressor(n_estimators=100, random_state=42)

                available_data = drug_data.dropna(subset=[col])
                if len(available_data) > 10:
                    X_train = features_for_imputation.loc[available_data.index]
                    y_train = available_data[col]
                    numeric_imputers[col].fit(X_train, y_train)

                    missing_data = drug_data[drug_data[col].isnull()]
                    X_missing = features_for_imputation.loc[missing_data.index]
                    drug_data.loc[drug_data[col].isnull(), col] = numeric_imputers[col].predict(X_missing)
                else:
                    # If not enough data, use median grouped by tissue type
                    drug_data[col] = drug_data.groupby(['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2'])[col].transform(
                        lambda x: x.fillna(x.median())
                    )

        df.loc[df['DRUG_NAME'] == drug] = drug_data

    return df

Data = handle_missing_values_by_drug(Data)


In [None]:
# Check for missing values
print(Data.isnull().sum())

## 3.5. Encoding Categorical Features:

Given the high cardinality of some features, we'll use a combination of encoding techniques:

* Simple binary encoding for features with 2 unique values
* One-hot encoding for low-cardinality features (3 unique values)
* Target encoding for high-cardinality features
* Label encoding for identifier features (DRUG_ID, COSMIC_ID, CELL_LINE_NAME); the resulting integers carry no ordinal meaning
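
To make the target-encoding idea concrete, here is a simplified sketch in which each category is replaced by the mean of the target within that category. It is a plain, unsmoothed version written with pandas for illustration; `category_encoders.TargetEncoder` additionally applies smoothing toward the global mean, so its outputs will generally differ. The category labels and values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "TARGET_PATHWAY": ["A", "A", "B", "B", "B"],
    "LN_IC50": [1.0, 3.0, -1.0, 0.0, 1.0],
})

# Mean of the target within each category
category_means = toy.groupby("TARGET_PATHWAY")["LN_IC50"].mean()

# Replace each category label with its mean target value
toy["TARGET_PATHWAY_encoded"] = toy["TARGET_PATHWAY"].map(category_means)
print(toy)
```

Because the encoding is derived from the target itself, fitting it on the training split only and then applying it to held-out data avoids leaking target information into evaluation.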

In [None]:
def encode_features(df, target_column='LN_IC50'):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Identify features with only two unique values
    binary_features = [col for col in df.columns if df[col].nunique() == 2]

    # Binary encoding for features with two unique values
    for feature in binary_features:
        df[feature] = (df[feature] == df[feature].unique()[0]).astype(int)

    # One-hot encoding for low-cardinality features (3 unique values)
    onehot_features = ['Growth Properties']
    # `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
    onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    onehot_encoded = onehot_encoder.fit_transform(df[onehot_features])

    # Use get_feature_names_out to get the names of the one-hot encoded features
    onehot_columns = onehot_encoder.get_feature_names_out(onehot_features)
    df_onehot = pd.DataFrame(onehot_encoded, columns=onehot_columns, index=df.index)

    # Target encoding for high-cardinality features
    target_features = ['TCGA_DESC', 'DRUG_NAME', 'GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2',
                       'Cancer Type (matching TCGA label)', 'TARGET', 'TARGET_PATHWAY']
    target_encoder = TargetEncoder()
    df_target_encoded = target_encoder.fit_transform(df[target_features], df[target_column])

    # Label encoding for identifier columns (DRUG_ID, COSMIC_ID, CELL_LINE_NAME)
    label_features = ['DRUG_ID', 'COSMIC_ID', 'CELL_LINE_NAME']
    label_encoder = LabelEncoder()
    df_label_encoded = df[label_features].apply(label_encoder.fit_transform)

    # Combine all encoded features
    df_encoded = pd.concat([df[binary_features], df_onehot, df_target_encoded, df_label_encoded], axis=1)

    return df_encoded


encoded_data = encode_features(Data)

In [None]:
print(encoded_data.columns)

In [None]:
print(Data.columns)

In [None]:
encoded_data['LN_IC50']=Data['LN_IC50']
encoded_data['AUC']=Data['AUC']
encoded_data['Z_SCORE']=Data['Z_SCORE']


# 4. Visualization Gallery

## 4.1. Distribution of Numeric Features

This interactive plot shows the distribution of the key numeric features in our dataset: LN_IC50 (drug sensitivity), AUC (area under the curve), and Z_SCORE. The histograms give an overview of each distribution, while the tick marks along the x-axis of each histogram (a rug plot) mark the individual data points, which helps in examining the spread of the data and spotting potential outliers.

In [None]:
def plot_numeric_features(df, numeric_cols):
    fig = make_subplots(rows=1, cols=3, subplot_titles=numeric_cols)

    for i, col in enumerate(numeric_cols, 1):
        fig.add_trace(
            go.Histogram(x=df[col], name=col, marker_color='#4169E1', opacity=0.7),
            row=1, col=i
        )
        fig.add_trace(
            go.Scatter(x=df[col], y=[0]*len(df), mode='markers',
                       marker=dict(color='#4169E1', symbol='line-ns-open'), name='Data points'),
            row=1, col=i
        )

    fig.update_layout(
        title_text="Distribution of Key Numeric Features",
        height=500, width=1200,
        showlegend=False
    )

    fig.show()

numeric_cols = ['LN_IC50', 'AUC', 'Z_SCORE']
plot_numeric_features(Data, numeric_cols)

![Screenshot 2024-08-25 at 11.53.13 PM.png](attachment:512235f7-fe3c-46fc-a41c-949c9787f56f.png)


**Note on Visualization:** The images in this section are static screenshots of the original interactive Plotly plots.
Static versions are used to improve loading times.

If you want to see interactive plots, please check [version 3 of this notebook](https://www.kaggle.com/code/samiraalipour/genomics-of-drug-sensitivity-in-cancer?scriptVersionId=193125569).
The interactive version provides more detailed exploration but may take longer to load.

## 4.2. GDSC Tissue Distribution

This horizontal bar chart displays the distribution of the top 20 tissue types in our GDSC dataset. It provides a clear visualization of the most common cancer tissues studied, which is crucial for understanding the focus areas of the drug sensitivity experiments.

In [None]:
def plot_gdsc_tissue_distribution(data):
    tissue_counts = data['GDSC Tissue descriptor 1'].value_counts().nlargest(20)

    fig = go.Figure(go.Bar(
        x=tissue_counts.values,
        y=tissue_counts.index,
        orientation='h',
        marker_color='#4169E1'
    ))

    fig.update_layout(
        title='Top 20 GDSC Tissue Types',
        xaxis_title='Count',
        yaxis_title='Tissue Type',
        height=600,
        width=1000
    )

    fig.show()

plot_gdsc_tissue_distribution(Data)

![Screenshot 2024-08-25 at 11.53.44 PM.png](attachment:07ed8833-0aa5-41f4-81f7-bc17cb8bd791.png)



## 4.3. Pairplot of Numeric Features

This pairplot matrix shows the relationships between our key numeric features: LN_IC50, AUC, and Z_SCORE. The diagonal plots show the distribution of each feature, while the off-diagonal plots show the relationships between pairs of features. This visualization helps in identifying correlations and patterns among these important drug sensitivity metrics.
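
As a numeric complement to the pairplot, pairwise rank correlations can be computed with `DataFrame.corr`. The sketch below uses synthetic data (not GDSC values) in which AUC is constructed as a strictly increasing function of LN_IC50, so their Spearman correlation is 1 by construction; on the real data the same call would be applied to the three columns of `Data`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ln_ic50 = rng.normal(size=200)

toy = pd.DataFrame({
    "LN_IC50": ln_ic50,
    # Synthetic AUC: a strictly increasing (sigmoid) transform of LN_IC50
    "AUC": 1.0 / (1.0 + np.exp(-ln_ic50)),
    # Synthetic Z_SCORE: independent noise
    "Z_SCORE": rng.normal(size=200),
})

# Spearman correlation captures monotonic (not only linear) relationships
corr = toy.corr(method="spearman")
print(corr.round(2))
```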


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.offline as pyo

pyo.init_notebook_mode(connected=True)

def plot_numeric_pairplot(data, numeric_cols, sample_size=1000):
    # Sample the data to reduce plot size
    if len(data) > sample_size:
        data = data.sample(sample_size, random_state=42)

    n = len(numeric_cols)

    # Rows correspond to the y-variable and columns to the x-variable,
    # so the axis labels set below match the plotted data
    fig = make_subplots(rows=n, cols=n,
                        subplot_titles=[y if x == y else f"{y} vs {x}"
                                        for y in numeric_cols for x in numeric_cols])

    for i, y in enumerate(numeric_cols, 1):
        for j, x in enumerate(numeric_cols, 1):
            if x == y:
                trace = go.Histogram(x=data[x], name=x, marker_color='#4169E1', opacity=0.7)
            else:
                trace = go.Scatter(x=data[x], y=data[y], mode='markers',
                                   marker=dict(color='#4169E1', size=3, opacity=0.5),
                                   name=f"{y} vs {x}")
            fig.add_trace(trace, row=i, col=j)

    fig.update_layout(height=900, width=900, title_text="Pairplot of Numeric Features")

    # Label x-axes along the bottom row and y-axes along the first column
    for i, col in enumerate(numeric_cols):
        fig.update_xaxes(title_text=col, row=n, col=i+1)
        fig.update_yaxes(title_text=col, row=i+1, col=1)

    # Use iplot for inline plotting
    pyo.iplot(fig)


plot_numeric_pairplot(Data, ['LN_IC50', 'AUC', 'Z_SCORE'])

![Screenshot 2024-08-25 at 11.54.46 PM.png](attachment:a63ced12-02db-41a0-a62d-38e4b48252dc.png)



## 4.4. Drug Sensitivity Across Cancer Types

This boxplot illustrates the variation in drug sensitivity (LN_IC50) across cancer types; lower LN_IC50 values indicate greater sensitivity. The cancer types are ordered by median LN_IC50 from highest to lowest, which makes it easy to compare responsiveness across cancers and to identify cancer types that are more or less responsive to the drugs in our dataset.


In [None]:
def plot_drug_sensitivity_by_cancer(data):
    cancer_types = data.groupby('Cancer Type (matching TCGA label)')['LN_IC50'].median().sort_values(ascending=False)

    fig = go.Figure()

    fig.add_trace(go.Box(
        y=data['LN_IC50'],
        x=data['Cancer Type (matching TCGA label)'],
        name='LN_IC50',
        marker_color='#4169E1'
    ))

    fig.update_layout(
        title='Distribution of Drug Sensitivity Across Cancer Types',
        xaxis_title='Cancer Type',
        yaxis_title='LN_IC50',
        height=600,
        width=1200,
        xaxis={'categoryorder':'array', 'categoryarray':cancer_types.index}
    )

    fig.show()

plot_drug_sensitivity_by_cancer(Data)

![Screenshot 2024-08-25 at 11.55.31 PM.png](attachment:d5264af7-8407-40be-a4f6-e80524fbc836.png)




## 4.5. Drug Efficacy Across Tissue Types

This boxplot shows how drug efficacy, measured by AUC, varies across tissue types; lower AUC values indicate a stronger response. The tissue types are ordered by median AUC from highest to lowest, which shows which tissue types tend to be more responsive to the drugs in our dataset.

In [None]:
def plot_drug_efficacy_by_tissue(data):
    tissue_types = data.groupby('GDSC Tissue descriptor 1')['AUC'].median().sort_values(ascending=False)

    fig = go.Figure()

    fig.add_trace(go.Box(
        y=data['AUC'],
        x=data['GDSC Tissue descriptor 1'],
        name='AUC',
        marker_color='#4169E1'
    ))

    fig.update_layout(
        title='Drug Efficacy Across Different Tissue Types',
        xaxis_title='Tissue Type',
        yaxis_title='AUC',
        height=600,
        width=1200,
        xaxis={'categoryorder':'array', 'categoryarray':tissue_types.index}
    )

    fig.show()

plot_drug_efficacy_by_tissue(Data)

![Screenshot 2024-08-25 at 11.56.00 PM.png](attachment:e629493f-5ee8-43b7-8945-88883a8bec01.png)




## 4.6. Top Drug Targets

This horizontal bar chart highlights the 10 most frequent drug targets in our dataset. Frequencies are counted over drug-cell line measurements (rows) rather than over distinct drugs, so targets of widely screened drugs rank higher. The chart gives a sense of which molecular targets dominate the screening data.


In [None]:
def plot_top_drug_targets(data):
    top_targets = data['TARGET'].value_counts().nlargest(10)

    fig = go.Figure(go.Bar(
        x=top_targets.values,
        y=top_targets.index,
        orientation='h',
        marker_color='#4169E1'
    ))

    fig.update_layout(
        title='Top 10 Drug Targets',
        xaxis_title='Count',
        yaxis_title='Target',
        height=500,
        width=900
    )

    fig.show()

plot_top_drug_targets(Data)

![Screenshot 2024-08-25 at 11.56.23 PM.png](attachment:75f88070-f962-4be2-89a6-b0e98350c108.png)



## 4.7. Distribution of Target Pathways

This horizontal bar chart shows how many records in our dataset fall under each target pathway. It makes clear which cellular pathways are most frequently targeted by the drugs in our study and therefore which mechanisms of action are best represented.


In [None]:
def plot_target_pathways(data):
    pathway_counts = data['TARGET_PATHWAY'].value_counts()

    fig = go.Figure(go.Bar(
        x=pathway_counts.values,
        y=pathway_counts.index,
        orientation='h',
        marker_color='#4169E1'
    ))

    fig.update_layout(
        title='Distribution of Target Pathways',
        xaxis_title='Count',
        yaxis_title='Target Pathway',
        height=600,
        width=1000
    )

    fig.show()

plot_target_pathways(Data)

![Screenshot 2024-08-25 at 11.56.42 PM.png](attachment:3e5fdfe0-259d-48b9-b62c-52894fbb955a.png)



## 4.8. Impact of Microsatellite Instability on Drug Response

This boxplot compares drug response (LN_IC50) between microsatellite instability (MSI) groups. The Plotly version draws only outlier points on a random sample of rows, while the seaborn version overlays every sampled point as a strip plot to show the full spread. Comparing the two groups indicates whether MSI status is associated with drug sensitivity, which matters for personalized cancer treatment strategies.

In [None]:
import plotly.graph_objs as go
import plotly.offline as pyo
import pandas as pd

pyo.init_notebook_mode(connected=True)

def plot_msi_impact(data, sample_size=10000):
    # Sample the data if it's too large
    if len(data) > sample_size:
        data = data.sample(sample_size, random_state=42)

    fig = go.Figure()

    # Create a mapping for MSI status
    msi_mapping = {0: 'MSS/MSI-L', 1: 'MSI-H'}

    for msi_status in data['Microsatellite instability Status (MSI)'].unique():
        msi_label = msi_mapping.get(msi_status, str(msi_status))
        msi_data = data[data['Microsatellite instability Status (MSI)'] == msi_status]['LN_IC50']

        # Calculate summary statistics
        q1, median, q3 = msi_data.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        whisker_low, whisker_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

        fig.add_trace(go.Box(
            y=msi_data,
            name=msi_label,
            boxpoints='outliers',  # Only show outliers
            jitter=0.3,
            pointpos=-1.8,
            lowerfence=[whisker_low],  # Wrap in list
            upperfence=[whisker_high],  # Wrap in list
            q1=[q1],
            median=[median],
            q3=[q3]
        ))

    fig.update_layout(
        title='Impact of Microsatellite Instability Status on Drug Response',
        xaxis_title='MSI Status',
        yaxis_title='LN_IC50',
        height=500,
        width=800
    )

    # Use iplot for inline plotting
    pyo.iplot(fig)

plot_msi_impact(encoded_data)

![Screenshot 2024-08-25 at 11.57.24 PM.png](attachment:ad7f21cd-ccb2-4398-943a-ded2ec7d5297.png)





In [None]:
def plot_msi_impact(data, sample_size=50000):
    # Sample the data if it's too large
    if len(data) > sample_size:
        data = data.sample(sample_size, random_state=42)

    # Create a mapping for MSI status
    msi_mapping = {0: 'MSS/MSI-L', 1: 'MSI-H'}

    # Add the label column on a new DataFrame so the caller's data is not modified
    data = data.assign(MSI_Status=data['Microsatellite instability Status (MSI)'].map(msi_mapping))

    # Set up the plot
    plt.figure(figsize=(10, 6))

    # Create the boxplot
    sns.boxplot(x='MSI_Status', y='LN_IC50', data=data, palette='Set3')

    # Add strip plot for individual points
    sns.stripplot(x='MSI_Status', y='LN_IC50', data=data, color='black', alpha=0.1, size=2)

    # Customize the plot
    plt.title('Impact of Microsatellite Instability Status on Drug Response', fontsize=16)
    plt.xlabel('MSI Status', fontsize=12)
    plt.ylabel('LN_IC50', fontsize=12)

    # Show the plot
    plt.tight_layout()
    plt.show()

# Run the function
plot_msi_impact(encoded_data)

## 4.9. Treemap of GDSC Tissue Descriptors

This interactive treemap visualizes the hierarchical relationship between GDSC Tissue descriptor 1 (main tissues) and GDSC Tissue descriptor 2 (sub-tissues). Both the size and the color intensity of each box encode the number of records in that category. Hovering over a box shows its label, its parent tissue, and its count. The chart gives an overview of the tissue distribution in our dataset, covering both the main tissue types and their subtypes.
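If the share of each sub-tissue within its main tissue is needed alongside the raw counts, it can be derived from the same grouped table. The following is a minimal sketch on a toy table; the tissue values are placeholders and the `pct_of_parent` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy records with placeholder tissue labels
toy = pd.DataFrame({
    "GDSC Tissue descriptor 1": ["lung", "lung", "lung", "breast"],
    "GDSC Tissue descriptor 2": ["lung_NSCLC", "lung_NSCLC", "lung_SCLC", "breast"],
})

# Count records per (main tissue, sub-tissue) pair, as the treemap does
counts = (
    toy.groupby(["GDSC Tissue descriptor 1", "GDSC Tissue descriptor 2"])
    .size()
    .reset_index(name="count")
)

# Share of each sub-tissue within its main tissue
parent_total = counts.groupby("GDSC Tissue descriptor 1")["count"].transform("sum")
counts["pct_of_parent"] = 100 * counts["count"] / parent_total

print(counts)
```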


In [None]:
# Prepare data for treemap
df_grouped = Data.groupby(['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2']).size().reset_index(name='count')

# Inspect the grouped DataFrame
print("Grouped DataFrame:")
print(df_grouped)

# Check if the grouped DataFrame is empty
if df_grouped.empty:
    print("Error: Grouping resulted in an empty DataFrame.")
else:
    print("Grouped DataFrame is not empty. Rows:", len(df_grouped))

In [None]:
import plotly.express as px

def plot_tissue_treemap(data):
    df_grouped = data.groupby(['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2']).size().reset_index(name='count')

    fig = px.treemap(df_grouped,
                     path=['GDSC Tissue descriptor 1', 'GDSC Tissue descriptor 2'],
                     values='count',
                     color='count',
                     color_continuous_scale='Blues',
                     title='Hierarchical View of GDSC Tissue Descriptors')

    fig.update_layout(width=1000, height=800)
    fig.show()

plot_tissue_treemap(Data)

![Screenshot 2024-08-25 at 11.57.46 PM.png](attachment:288183fd-49b9-4af9-831b-6a41c3619eaa.png)




## 4.10. Sunburst Chart for Cancer Types and Growth Properties

This sunburst chart illustrates the relationship between cancer types and their growth properties. Each ring represents a level in the hierarchy: the inner ring shows cancer types, while the outer ring displays the growth properties for each cancer type. The size and color of each segment both represent the number of records. The chart shows how growth properties are distributed across cancer types, which is useful context when interpreting cancer behavior and drug responses.


In [None]:
def plot_cancer_growth_sunburst(data):
    # Prepare data for sunburst chart
    df_grouped = data.groupby(['Cancer Type (matching TCGA label)', 'Growth Properties']).size().reset_index(name='count')

    # Create sunburst chart using plotly express
    fig = px.sunburst(
        df_grouped,
        path=['Cancer Type (matching TCGA label)', 'Growth Properties'],
        values='count',
        color='count',
        color_continuous_scale='Viridis',
        title='Cancer Types and Their Growth Properties'
    )

    fig.update_layout(width=1000, height=1000)
    fig.show()

plot_cancer_growth_sunburst(Data)

![Screenshot 2024-08-25 at 11.58.15 PM.png](attachment:2b05c85c-acb0-4296-b26f-d6aee11011d0.png)





## 4.11. Sankey Diagram for Drug-Target-Pathway Relationship

This Sankey diagram visualizes the relationships between drugs, their targets, and the associated pathways for the 50 most frequent drug-target-pathway combinations. The width of each flow represents how often that relationship occurs in our dataset. The diagram helps in understanding how drugs are connected to their molecular targets and the broader cellular pathways they affect. It is particularly useful for identifying drugs that share targets or pathways, which could suggest similar mechanisms of action or potential for drug repurposing.
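A Sankey trace addresses nodes by their position in a single label list, so each layer (drugs, targets, pathways) needs an index offset. The sketch below shows that bookkeeping on a toy table with hypothetical names; the variables `link_source`, `link_target`, and `link_value` are introduced here for illustration and correspond to the `source`, `target`, and `value` lists a Sankey link expects.

```python
import pandas as pd

# Hypothetical drug/target/pathway records
rows = pd.DataFrame({
    "DRUG_NAME": ["DrugA", "DrugA", "DrugB", "DrugC"],
    "TARGET": ["T1", "T1", "T1", "T2"],
    "TARGET_PATHWAY": ["P1", "P1", "P1", "P2"],
})
grouped = (
    rows.groupby(["DRUG_NAME", "TARGET", "TARGET_PATHWAY"])
    .size()
    .reset_index(name="count")
)

drugs = grouped["DRUG_NAME"].unique().tolist()
targets = grouped["TARGET"].unique().tolist()
pathways = grouped["TARGET_PATHWAY"].unique().tolist()
node_labels = drugs + targets + pathways

# Map every label to its position in node_labels, offset by layer
drug_idx = {name: i for i, name in enumerate(drugs)}
target_idx = {name: len(drugs) + i for i, name in enumerate(targets)}
pathway_idx = {name: len(drugs) + len(targets) + i for i, name in enumerate(pathways)}

# First half: drug -> target links; second half: target -> pathway links
link_source = [drug_idx[d] for d in grouped["DRUG_NAME"]] + \
              [target_idx[t] for t in grouped["TARGET"]]
link_target = [target_idx[t] for t in grouped["TARGET"]] + \
              [pathway_idx[p] for p in grouped["TARGET_PATHWAY"]]
link_value = grouped["count"].tolist() * 2

print(node_labels)
print(list(zip(link_source, link_target, link_value)))
```

Dictionary lookups avoid repeated `list.index()` scans, which matters little for 50 combinations but keeps the mapping explicit.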

In [None]:
def plot_drug_target_pathway_sankey(data):
    # Prepare data for Sankey diagram
    df_grouped = data.groupby(['DRUG_NAME', 'TARGET', 'TARGET_PATHWAY']).size().reset_index(name='count')
    df_grouped = df_grouped.sort_values('count', ascending=False).head(50)  # Top 50 combinations

    # Create node lists
    drugs = df_grouped['DRUG_NAME'].unique().tolist()
    targets = df_grouped['TARGET'].unique().tolist()
    pathways = df_grouped['TARGET_PATHWAY'].unique().tolist()

    # Create node labels and colors
    node_labels = drugs + targets + pathways
    node_colors = ['#1f77b4'] * len(drugs) + ['#ff7f0e'] * len(targets) + ['#2ca02c'] * len(pathways)

    # Create links
    source = [drugs.index(drug) for drug in df_grouped['DRUG_NAME']] + \
             [len(drugs) + targets.index(target) for target in df_grouped['TARGET']]
    target = [len(drugs) + targets.index(target) for target in df_grouped['TARGET']] + \
             [len(drugs) + len(targets) + pathways.index(pathway) for pathway in df_grouped['TARGET_PATHWAY']]
    value = df_grouped['count'].tolist() + df_grouped['count'].tolist()

    # Create Sankey diagram
    fig = go.Figure(data=[go.Sankey(
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(color = "black", width = 0.5),
          label = node_labels,
          color = node_colors
        ),
        link = dict(
          source = source,
          target = target,
          value = value
    ))])

    fig.update_layout(
        title_text="Drug-Target-Pathway Relationships",
        font_size=10,
        width=1200,
        height=800
    )

    fig.show()

plot_drug_target_pathway_sankey(Data)

![Screenshot 2024-08-25 at 11.58.42 PM.png](attachment:3d9670ab-8e40-4bc2-9d2f-e87e8bc501d7.png)




# 5. Advanced Data Analysis

## 5.1. Correlation Analysis
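The cells in this subsection flag features whose absolute correlation with another feature is at least 0.9 by scanning only the upper triangle of the correlation matrix, so each pair is considered once and the diagonal is ignored. A minimal sketch of that technique on synthetic data follows; the column names `a`, `a_copy`, and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic features: "a_copy" is nearly collinear with "a", "b" is independent noise
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

corr_matrix = toy.corr().abs()

# Keep only entries strictly above the diagonal; everything else becomes NaN
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# A column is flagged if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
print(to_drop)
```

Only the later column of each correlated pair is flagged, so dropping `to_drop` keeps one representative per pair.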

In [None]:
# Correlation heatmap of encoded features
plt.figure(figsize=(20, 16))
correlation_matrix = encoded_data.corr()
sns.heatmap(correlation_matrix, cmap='Blues', annot=True)
plt.title('Correlation Heatmap of Encoded Features')
plt.tight_layout()
plt.show()


In [None]:
# Check the correlations between all of the features
corr_matrix = encoded_data.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape),k=1).astype(bool))
# Find feature columns whose absolute correlation with an earlier column is at least 0.90
to_drop = [column for column in upper.columns if any(upper[column] >= 0.9)]
to_drop

In [None]:
# Show each flagged feature together with the features it is highly correlated with
# (to_markdown() requires the optional `tabulate` package)
pd.set_option('display.width', 1000)
for col in to_drop:
    mask = corr_matrix[col] >= 0.9  # corr_matrix already holds absolute values
    print(corr_matrix.loc[mask, mask].to_markdown(), '\n\n')


## 5.2. Handling Outliers and Skewness in GDSC Dataset

Our approach to handling outliers and skewness in the GDSC dataset is tailored to the specific characteristics of drug response data. We use a three-step process: first, we count outliers for each drug; then we inspect outliers and skewness in detail for the first two drugs; finally, we cap outliers and transform heavily skewed variables for every drug.

### 1. Finding Outliers for Each Drug

**Function:** `find_outliers_by_drug()`

This function performs the following tasks for each drug and numeric variable:

- Identifies outliers using the Interquartile Range (IQR) method.
- Calculates the percentage and count of outliers for each drug and variable.
- Prints a summary of outliers for each drug and variable.

**Why this approach?**
- **Drug-specific analysis:** Different drugs may have different distributions of response variables.
- **IQR method:** Robust to extreme outliers and doesn't assume a normal distribution.
- **Comprehensive overview:** Provides a summary of outliers across all drugs and variables.
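
The IQR rule described above can be illustrated on a handful of numbers. This is a minimal sketch with made-up values standing in for one drug's response column; the 1.5 multiplier matches the one used in the functions below.

```python
import pandas as pd

# Made-up response values for one hypothetical drug; 9.0 is deliberately extreme
values = pd.Series([1.0, 1.2, 1.1, 0.9, 1.3, 1.0, 9.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (values < lower_bound) | (values > upper_bound)

print(values[is_outlier].tolist())
print(f"{is_outlier.mean() * 100:.2f}% outliers ({is_outlier.sum()} out of {len(values)})")
```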

### 2. Checking Outliers and Skewness

**Function:** `check_outliers_and_skewness()`

This function performs the following tasks for the first two drugs and each numeric variable:

- Calculates skewness.
- Identifies outliers using the Interquartile Range (IQR) method.
- Visualizes the distribution and box plot of each variable.

**Why this approach?**
- **Focused analysis:** Examines the first two drugs for a detailed view without overwhelming output.
- **Dual visualization:** Provides both histogram and box plot for a comprehensive view of the data distribution.
- **Skewness calculation:** Quantifies the asymmetry of the distribution.
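
To make the skewness figure concrete, the sketch below compares a symmetric sample with a right-tailed one using `scipy.stats.skew`, which by default returns the biased Fisher-Pearson coefficient. The arrays are made up for illustration; the threshold of 1 on absolute skewness is the one applied in the handling step.

```python
import numpy as np
from scipy.stats import skew

# Made-up samples: one symmetric, one with a single large value in the right tail
symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = np.array([1.0, 1.0, 2.0, 2.0, 10.0])

for name, sample in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    s = skew(sample)
    verdict = "transform" if abs(s) > 1 else "leave as is"
    print(f"{name}: skewness = {s:.2f} -> {verdict}")
```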

### 3. Handling Outliers and Skewness

**Function:** `handle_outliers_and_skewness()`

This function applies the following treatments:

1. **Outlier Handling:**
   - Uses IQR capping to limit extreme values.
   - **Why?** Preserves the data while reducing the impact of extreme outliers.

2. **Skewness Handling:**
   - Applies no transformation if absolute skewness ≤ 1.
   - Attempts Yeo-Johnson transformation if absolute skewness > 1.
   - Falls back to log transformation if Yeo-Johnson fails.
   - **Why?** Adapts to different levels of skewness and handles both positive and negative values.

3. **Visualization:**
   - Shows before and after distributions for one example of each transformation type.
   - **Why?** Allows for easy assessment of the transformation's effectiveness.

**Why this approach?**
- **Drug-specific treatment:** Ensures that the unique characteristics of each drug's data are preserved.
- **Flexible transformations:** Adapts to different types of skewness in the data.
- **Preservation of data:** Uses capping instead of removal for outliers, maintaining sample size.
- **Visual confirmation:** Provides immediate feedback on the effectiveness of the transformations.

This method addresses outliers and skewness while preserving the drug-specific patterns in the GDSC dataset. Handling these issues makes subsequent analyses more reliable and brings the data closer to the distributional assumptions of many statistical methods.
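
The capping and transformation steps can be sketched on synthetic data before applying them per drug. The example below assumes a right-skewed sample drawn from a shifted log-normal distribution (an arbitrary stand-in, not GDSC data) and uses `scipy.stats.yeojohnson`, which estimates its own lambda when none is given.

```python
import numpy as np
from scipy.stats import skew, yeojohnson

rng = np.random.default_rng(0)

# Synthetic right-skewed values with some negatives, standing in for one drug's column
values = rng.lognormal(mean=0.0, sigma=1.0, size=500) - 0.5
initial_skewness = skew(values)

# Step 1: cap values at the IQR fences instead of removing them
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = np.clip(values, lower_bound, upper_bound)

# Step 2: transform only when the original skewness is large
if abs(initial_skewness) > 1:
    transformed, fitted_lambda = yeojohnson(capped)
else:
    transformed = capped

print(f"initial skewness: {initial_skewness:.2f}")
print(f"final skewness:   {skew(transformed):.2f}")
```

Capping keeps the sample size unchanged, and Yeo-Johnson accepts negative inputs, which is why no shift is needed before the transformation.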

**Note:** Due to the high number of drugs in the dataset, we only visualize the distributions and transformations for select examples of each transformation type. This approach allows for a clearer and more concise examination of outliers and skewness while still processing all data.

In [None]:
def find_outliers_by_drug(df, numeric_cols=('LN_IC50', 'AUC', 'Z_SCORE')):  # tuple avoids a mutable default
    outliers = {}

    for drug in df['DRUG_NAME'].unique():
        drug_data = df[df['DRUG_NAME'] == drug]
        drug_outliers = {}

        for col in numeric_cols:
            v = drug_data[col]
            q1 = v.quantile(0.25)
            q3 = v.quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            outliers_count = ((v < lower_bound) | (v > upper_bound)).sum()
            perc = outliers_count * 100.0 / len(drug_data)
            drug_outliers[col] = (perc, outliers_count)
            print(f"Drug: {drug}, Column: {col} outliers = {perc:.2f}% ({outliers_count} out of {len(drug_data)})")

        outliers[drug] = drug_outliers

    return outliers

# Find outliers in the DataFrame for each drug
outliers = find_outliers_by_drug(Data)

In [None]:
def check_outliers_and_skewness(df, numeric_cols):
    skewness_info = {}
    outlier_info = {}

    for drug in df['DRUG_NAME'].unique()[:2]:  # Only first two drugs
        drug_data = df[df['DRUG_NAME'] == drug]
        skewness_info[drug] = {}
        outlier_info[drug] = {}

        for col in numeric_cols:
            # Calculate skewness
            col_skewness = skew(drug_data[col].dropna())
            skewness_info[drug][col] = col_skewness

            # Identify outliers using IQR method
            Q1 = drug_data[col].quantile(0.25)
            Q3 = drug_data[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = drug_data[(drug_data[col] < lower_bound) | (drug_data[col] > upper_bound)][col]
            outlier_info[drug][col] = outliers

            # Plot distribution and box plot
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
            fig.suptitle(f'Distribution and Box Plot of {col} for {drug}')

            # Histogram
            sns.histplot(drug_data[col], kde=True, ax=ax1)
            ax1.axvline(lower_bound, color='r', linestyle='--', label='IQR bounds')
            ax1.axvline(upper_bound, color='r', linestyle='--')
            ax1.set_title(f'Histogram (Skewness: {col_skewness:.2f})')
            ax1.legend()

            # Box plot
            sns.boxplot(x=drug_data[col], ax=ax2)
            ax2.set_title('Box Plot')

            plt.tight_layout()
            plt.show()

            print(f"Drug: {drug}, Column: {col}")
            print(f"Skewness: {col_skewness:.2f}")
            print(f"Number of outliers: {len(outliers)}")
            print("-" * 50)

    return skewness_info, outlier_info


numeric_cols = ['LN_IC50', 'AUC', 'Z_SCORE']
skewness_info, outlier_info = check_outliers_and_skewness(Data, numeric_cols)

In [None]:
def check_outliers_and_skewness(df, numeric_cols):
    skewness_info = {}
    outlier_info = {}

    for drug in df['DRUG_NAME'].unique()[:2]:  # Only first two drugs
        drug_data = df[df['DRUG_NAME'] == drug]
        skewness_info[drug] = {}
        outlier_info[drug] = {}

        for col in numeric_cols:
            # Calculate skewness
            col_skewness = skew(drug_data[col].dropna())
            skewness_info[drug][col] = col_skewness

            # Identify outliers using IQR method
            Q1 = drug_data[col].quantile(0.25)
            Q3 = drug_data[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = drug_data[(drug_data[col] < lower_bound) | (drug_data[col] > upper_bound)][col]
            outlier_info[drug][col] = outliers

            # Create subplot
            fig = make_subplots(rows=1, cols=2, subplot_titles=['Histogram', 'Box Plot'])

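            # Histogram, normalized to a density so it shares a scale with the KDE curve
            hist_data = go.Histogram(x=drug_data[col], name='Distribution', opacity=0.7,
                                     histnorm='probability density')
            fig.add_trace(hist_data, row=1, col=1)

            # Add KDE to histogram, evaluated on the same grid that is plotted
            from scipy.stats import gaussian_kde  # local import keeps this cell self-contained
            values = drug_data[col].dropna()
            kde_x = np.linspace(values.min(), values.max(), 100)
            kde_y = gaussian_kde(values, bw_method=0.5)(kde_x)
            kde_line = go.Scatter(x=kde_x, y=kde_y, mode='lines', name='KDE', line=dict(color='red'))
            fig.add_trace(kde_line, row=1, col=1)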

            # Add IQR bounds to histogram
            fig.add_vline(x=lower_bound, line_dash="dash", line_color="green", row=1, col=1)
            fig.add_vline(x=upper_bound, line_dash="dash", line_color="green", row=1, col=1)

            # Box plot
            box_data = go.Box(y=drug_data[col], name='Box Plot', boxpoints='outliers')
            fig.add_trace(box_data, row=1, col=2)

            # Update layout
            fig.update_layout(
                title_text=f'Distribution and Box Plot of {col} for {drug}',
                height=500, width=1000,
                annotations=[
                    dict(
                        x=0.25, y=1.05,
                        xref='paper', yref='paper',
                        text=f'Skewness: {col_skewness:.2f}',
                        showarrow=False
                    ),
                    dict(
                        x=0.75, y=1.05,
                        xref='paper', yref='paper',
                        text=f'Outliers: {len(outliers)}',
                        showarrow=False
                    )
                ]
            )

            # Show plot
            fig.show()

            print(f"Drug: {drug}, Column: {col}")
            print(f"Skewness: {col_skewness:.2f}")
            print(f"Number of outliers: {len(outliers)}")
            print("-" * 50)

    return skewness_info, outlier_info


numeric_cols = ['LN_IC50', 'AUC', 'Z_SCORE']
skewness_info, outlier_info = check_outliers_and_skewness(Data, numeric_cols)

In [None]:
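def handle_outliers_and_skewness(df, numeric_cols):
    # Work on a copy so the caller's DataFrame is not modified in place
    # and re-running the cell does not transform the data twice
    df = df.copy()
    transformation_examples = {'None': None, 'Yeo-Johnson': None, 'Log': None}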

    for drug in df['DRUG_NAME'].unique():
        drug_data = df[df['DRUG_NAME'] == drug].copy()

        for col in numeric_cols:
            # Calculate initial skewness
            initial_skewness = skew(drug_data[col].dropna())

            # Handle Outliers: Using IQR capping
            Q1 = drug_data[col].quantile(0.25)
            Q3 = drug_data[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            # Cap the outliers
            drug_data[col] = np.clip(drug_data[col], lower_bound, upper_bound)

            # Determine and apply transformation
            if abs(initial_skewness) <= 1:
                transformation = "None"
                if transformation_examples['None'] is None:
                    transformation_examples['None'] = (drug, col)
            elif abs(initial_skewness) > 1:
                try:
                    # Use Yeo-Johnson transformation (works for negative values)
                    drug_data[col], _ = yeojohnson(drug_data[col])
                    transformation = "Yeo-Johnson"
                    if transformation_examples['Yeo-Johnson'] is None:
                        transformation_examples['Yeo-Johnson'] = (drug, col)
                except Exception:
                    # If Yeo-Johnson fails, fall back to a log transform of the values,
                    # shifted so that the minimum becomes 1 before log1p is applied
                    drug_data[col] = np.log1p(drug_data[col] - drug_data[col].min() + 1)
                    transformation = "Log"
                    if transformation_examples['Log'] is None:
                        transformation_examples['Log'] = (drug, col)

            # If this is one of our example transformations, visualize and print info
            if (drug, col) in transformation_examples.values():
                # Calculate final skewness
                final_skewness = skew(drug_data[col].dropna())

                # Visualization of before and after
                fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

                sns.histplot(df.loc[df['DRUG_NAME'] == drug, col], kde=True, ax=ax1)
                ax1.set_title(f'Original Distribution of {col} for {drug}\nSkewness: {initial_skewness:.2f}')

                sns.histplot(drug_data[col], kde=True, ax=ax2)
                ax2.set_title(f'Transformed Distribution of {col} for {drug}\nTransformation: {transformation}\nSkewness: {final_skewness:.2f}')

                plt.tight_layout()
                plt.show()

                print(f"Drug: {drug}, Column: {col}")
                print(f"Initial Skewness: {initial_skewness:.2f}")
                print(f"Final Skewness: {final_skewness:.2f}")
                print(f"Transformation: {transformation}")
                print("-" * 50)

        # Replace the original data with the transformed data
        df.loc[df['DRUG_NAME'] == drug, numeric_cols] = drug_data[numeric_cols]

    return df, transformation_examples

numeric_cols = ['LN_IC50', 'AUC', 'Z_SCORE']
gdsc_data, transformation_examples = handle_outliers_and_skewness(Data, numeric_cols)

# Print summary of transformation examples
print("\nTransformation Examples:")
for transform, example in transformation_examples.items():
    if example:
        print(f"{transform}: Drug - {example[0]}, Column - {example[1]}")
    else:
        print(f"{transform}: No example found")