<a href="https://colab.research.google.com/github/VicDc/VIC_/blob/main/drug_repositioning_randomforest_v4_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Description**
**Unlocking New Cures: A Journey into AI-Powered Drug Repositioning**
In this notebook, we embark on a fascinating exploration into the field of drug repositioning. We will harness the power of Machine Learning to analyze complex data and predict new, unexpected relationships between existing drugs and diseases. The goal? To accelerate the discovery of new therapies, reducing both time and cost.

Tools of the Trade: We'll be using Python with the data scientist's essential toolkit: `pandas` and `numpy` for data manipulation, and `scikit-learn` to build and validate our predictive model.

In [None]:
# --- Step 1: Importing Libraries ---
# Description: This cell prepares the working environment by loading all the necessary tools.

print("--- Step 1: Importing Libraries ---")
# Library for handling tabular data (DataFrame)
import pandas as pd
# Library for efficient numerical operations
import numpy as np
# Modules from Scikit-learn for Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer      # Converts text into numerical vectors
from sklearn.model_selection import train_test_split           # Splits the dataset into training and test sets
from sklearn.ensemble import RandomForestClassifier            # The chosen classification algorithm
from sklearn.metrics import classification_report, accuracy_score # Tools to evaluate the model's performance
# Colab/Jupyter function for better DataFrame visualization
from IPython.display import display

print("Libraries imported successfully.\n")

--- Step 1: Importing Libraries ---
Libraries imported successfully.



# **Data Loading**

In this phase, we load our foundational datasets: a catalog of drugs, an encyclopedia of diseases, and a map of their known connections. These CSV files, which you have previously uploaded to the Colab session, are the bedrock upon which our artificial intelligence model will learn to "reason."

In [None]:
# --- Step 2: Loading Base Data ---
# Description: We load the three CSV files that you uploaded into the Colab environment.

print("--- Step 2: Loading Base Data ---")
try:
    # Colab will look for the files in the main session directory.
    drugs_df = pd.read_csv('drugsInfo.csv')
    diseases_df = pd.read_csv('diseasesInfo.csv')
    mapping_df = pd.read_csv('mapping.csv')
    print("CSV files loaded successfully.\n")
except FileNotFoundError as e:
    # If a file is not found, report which one is missing and stop the cell by re-raising.
    print(f"\nERROR: File not found: {e.filename}. Ensure you have uploaded the three CSV files to the Colab session.\n")
    raise

--- Step 2: Loading Base Data ---
CSV files loaded successfully.
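
The later cells rely on specific column names in each file, so a quick check right after loading can catch a mismatched CSV early. The following is a minimal sketch, not part of the original notebook: `missing_columns` is a hypothetical helper, the toy DataFrames stand in for the real files, and the assignment of columns to files is an assumption inferred from the names used later in this notebook.

```python
import pandas as pd

# Assumption: which file holds which column is inferred from the names used
# later in this notebook; adjust to match the real CSV headers.
EXPECTED_COLUMNS = {
    "drugs": ["DrugID", "DrugName", "DrugDescription", "DrugMechanism",
              "DrugConditions", "DrugCategories"],
    "diseases": ["DiseaseID", "DiseaseName", "DiseaseDescription",
                 "SlimMapping", "PathwayNames"],
    "mapping": ["DrugID", "DiseaseID"],
}

def missing_columns(df, expected):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Toy frames standing in for the real CSVs; the IDs are made up for illustration.
toy_drugs = pd.DataFrame({"DrugID": ["D1"], "DrugName": ["Aspirin"]})
toy_mapping = pd.DataFrame({"DrugID": ["D1"], "DiseaseID": ["X1"]})

print("Missing in drugs:", missing_columns(toy_drugs, EXPECTED_COLUMNS["drugs"]))
print("Missing in mapping:", missing_columns(toy_mapping, EXPECTED_COLUMNS["mapping"]))
```

In the notebook itself, the same helper could be called on `drugs_df`, `diseases_df`, and `mapping_df` before the merge step.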



# **Dataset Creation**

**Teaching the Machine to Recognize Patterns**
An AI model learns from examples. Here, we build its digital textbook. We create:

**Positive Samples (label=1):** Drug-disease pairs whose effectiveness is already scientifically proven.

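**Negative Samples (label=0):** Randomly generated drug-disease pairs with no known association. We treat them as examples of what doesn't work, although some of them may be true associations that simply have not been recorded yet.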

By merging and shuffling these examples, we create a balanced and robust dataset, ready for training.

In [None]:
# --- Step 3: Creating the Supervised Dataset ---
# Description: We build the labeled dataset for model training,
# combining positive samples (known associations) and negative samples (randomly generated associations).

print("--- Step 3: Creating the Dataset for Classification ---")
# We define the text columns that will make up our textual 'profile'.
text_cols = ['DrugDescription', 'DrugMechanism', 'DrugConditions', 'DrugCategories', 'DiseaseDescription', 'SlimMapping', 'PathwayNames']

# 1. Preparing positive samples (label = 1)
# We merge the data to enrich each association with textual descriptions.
positive_samples = pd.merge(mapping_df, drugs_df, on='DrugID')
positive_samples = pd.merge(positive_samples, diseases_df, on='DiseaseID')
positive_samples['label'] = 1
# We remove rows with missing textual data.
positive_samples.dropna(subset=text_cols, inplace=True)
print(f"Number of positive samples (known associations): {len(positive_samples)}")

# 2. Generating negative samples (label = 0)
# We get a list of all drug and disease IDs.
all_drugs_ids = drugs_df['DrugID'].unique()
all_diseases_ids = diseases_df['DiseaseID'].unique()
# We create a set of known associations for a quick check.
known_positives_set = set(zip(mapping_df['DrugID'], mapping_df['DiseaseID']))
negative_list = []
# We balance the dataset to avoid bias.
num_negative_samples = len(positive_samples)

# We generate random pairs until we reach the desired number.
while len(negative_list) < num_negative_samples:
    drug_id = np.random.choice(all_drugs_ids)
    disease_id = np.random.choice(all_diseases_ids)
    # We add the pair only if it is not among the known ones.
    if (drug_id, disease_id) not in known_positives_set:
        negative_list.append({'DrugID': drug_id, 'DiseaseID': disease_id})

# We create the DataFrame of negative samples.
negative_samples_df = pd.DataFrame(negative_list)
negative_samples = pd.merge(negative_samples_df, drugs_df, on='DrugID')
negative_samples = pd.merge(negative_samples, diseases_df, on='DiseaseID')
negative_samples['label'] = 0
negative_samples.dropna(subset=text_cols, inplace=True)
print(f"Number of negative samples generated: {len(negative_samples)}")

# 3. Merging, shuffling, and creating the 'CombinedText' feature
dataset = pd.concat([positive_samples, negative_samples], ignore_index=True)
# We shuffle the data to ensure the model sees the examples in random order.
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)

# We define a function to join the content of the text columns.
def combine_text_features(row):
    return ' '.join([str(row[col]) for col in text_cols])
# We apply the function to create the combined text column.
dataset['CombinedText'] = dataset.apply(combine_text_features, axis=1)

print(f"Final dataset ready, with {len(dataset)} total samples.\n")
print("Example of the training dataset:")
display(dataset[['DrugName', 'DiseaseName', 'label']].head())

--- Step 3: Creating the Dataset for Classification ---
Number of positive samples (known associations): 42200
Number of negative samples generated: 42200
Final dataset ready, with 84400 total samples.

Example of the training dataset:


         DrugName               DiseaseName  label
0           Rhein  Bone Diseases, Metabolic      0
1   Amitriptyline              Hypertension      1
2      Sildenafil           Kidney Diseases      1
3     Gemfibrozil        Respiratory Sounds      0
4  Cyclopentolate     Psychomotor Agitation      1
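
The sampling loop in the cell above draws from NumPy's global random state without a seed and does not check whether a pair was already drawn, so the negatives can differ between runs and may contain repeats. The following is a minimal sketch of one possible variant that is seeded and keeps pairs distinct. `sample_negative_pairs` is a hypothetical helper, the toy IDs are made up, and the size check assumes that the IDs are unique and that every known pair uses IDs from the two lists.

```python
import numpy as np

def sample_negative_pairs(drug_ids, disease_ids, known_pairs, n_samples, seed=42):
    """Draw n_samples distinct (drug, disease) pairs that are not in known_pairs.

    Hypothetical helper; assumes unique IDs and that known_pairs only
    contains IDs present in drug_ids and disease_ids.
    """
    rng = np.random.default_rng(seed)  # seeded generator for reproducible draws
    max_possible = len(drug_ids) * len(disease_ids) - len(known_pairs)
    if n_samples > max_possible:
        raise ValueError("Not enough unlabeled pairs to sample from.")
    chosen = set()
    while len(chosen) < n_samples:
        # Index into the lists so the original ID objects (and types) are kept.
        drug = drug_ids[rng.integers(len(drug_ids))]
        disease = disease_ids[rng.integers(len(disease_ids))]
        if (drug, disease) not in known_pairs:
            chosen.add((drug, disease))  # the set silently drops repeated draws
    return sorted(chosen)

# Toy example: 3 drugs x 2 diseases, 2 known associations, 4 pairs remain.
toy_drugs = ["D1", "D2", "D3"]
toy_diseases = ["X1", "X2"]
toy_known = {("D1", "X1"), ("D2", "X2")}
toy_negatives = sample_negative_pairs(toy_drugs, toy_diseases, toy_known, n_samples=4)
print(toy_negatives)
```

In the notebook, the returned pairs could be turned into a DataFrame with columns `DrugID` and `DiseaseID` and merged with `drugs_df` and `diseases_df` exactly as the cell above does.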





# **Training and Evaluation**

**Artificial Intelligence at Work**
This is where the magic happens. In this cell, the textual descriptions of drugs and diseases are first converted into numerical vectors with the TF-IDF technique (`TfidfVectorizer`). Our `RandomForestClassifier` model then springs into action and trains on those vectors to distinguish known associations from randomly generated pairs.

Finally, we will test it on a held-out 20% of the data that it never saw during training to measure its accuracy. The result? A predictive tool that reaches about 79% accuracy on the test set and is ready to generate new scientific hypotheses.

In [None]:
# --- Step 4: Training and Evaluating the Classifier ---
# Description: This cell executes the machine learning process: vectorization,
# data splitting, model training, and performance evaluation.

print("\n--- Step 4: Training and Evaluating the Model ---")
# We separate the features (X, the text) from the target (y, the label 0 or 1).
X = dataset['CombinedText']
y = dataset['label']

# We vectorize the text with TF-IDF, learning the vocabulary from our dataset.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, min_df=5, max_df=0.7)
X_tfidf = vectorizer.fit_transform(X)

# We split the data into a training set (80%) and a test set (20%).
# 'stratify=y' ensures that the proportion of 0s and 1s is the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42, stratify=y)
print(f"Data split into {X_train.shape[0]} training samples and {X_test.shape[0]} test samples.")

print("\nTraining the Random Forest model...")
# We initialize the classifier.
# n_estimators: number of "trees" in the forest.
# n_jobs=-1: use all CPU cores to speed up training.
model = RandomForestClassifier(n_estimators=150, random_state=42, n_jobs=-1)
# We train the model using the training data.
model.fit(X_train, y_train)
print("Model trained successfully.")

print("\n--- Performance Evaluation ---")
# We use the trained model to make predictions on the test set.
y_pred = model.predict(X_test)
# We print the report with the main metrics: precision, recall, f1-score.
print("Classification Report on the Test Set:")
print(classification_report(y_test, y_pred))
print(f"Total accuracy on the test set: {accuracy_score(y_test, y_pred):.4f}\n")


--- Step 4: Training and Evaluating the Model ---
Data split into 67520 training samples and 16880 test samples.

Training the Random Forest model...
Model trained successfully.

--- Performance Evaluation ---
Classification Report on the Test Set:
              precision    recall  f1-score   support

           0       0.80      0.77      0.78      8440
           1       0.78      0.81      0.79      8440

    accuracy                           0.79     16880
   macro avg       0.79      0.79      0.79     16880
weighted avg       0.79      0.79      0.79     16880

Total accuracy on the test set: 0.7877
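
One caveat about the cell above: the vectorizer is fitted on the full dataset before the split, so the vocabulary and document frequencies also reflect the test texts. A common alternative is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that TF-IDF is learned from the training texts only. The following is a minimal sketch of that pattern on a made-up toy corpus. The texts, labels, and `toy_*` names are assumptions for illustration, and the toy score says nothing about the real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up corpus standing in for dataset['CombinedText'] and dataset['label'].
toy_texts = (["anti inflammatory pain relief joint pain"] * 10
             + ["antipsychotic dopamine blocker kidney stones"] * 10)
toy_labels = [1] * 10 + [0] * 10

# Split the raw text first, so the test texts stay unseen by the vectorizer.
toy_text_train, toy_text_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # fitted on training text only
    ("forest", RandomForestClassifier(n_estimators=20, random_state=42)),
])
toy_pipeline.fit(toy_text_train, toy_y_train)

# score() vectorizes the test text with the training vocabulary, then predicts.
toy_accuracy = toy_pipeline.score(toy_text_test, toy_y_test)
print(f"Toy accuracy: {toy_accuracy:.2f}")
```

A fitted pipeline also accepts raw strings in `predict_proba`, which removes the need to call the vectorizer separately when scoring a new drug-disease text.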



# **Hypothesis Testing**

**Querying the Model & Generating New Insights**
Here, our work comes to life. We have trained a model capable of estimating the probability of an association between a drug and a disease. Now, we use it to test specific repositioning hypotheses with the `test_hypothesis` function.

Each "verdict" from the model is not a certainty but a quantified probability that can guide researchers, helping them decide which experimental paths are worth exploring in the lab.

In [None]:
# --- Step 5: Testing Repositioning Hypotheses ---
# Description: We use the trained model to predict the probability of association
# for new drug-disease pairs and collect the results for visualization.

import json # Import the library for handling the JSON format
from IPython.display import HTML, display # Import functions to display HTML in Colab

print("--- Step 5: Testing Our Repositioning Hypotheses ---")

# Initialize a list to hold the data for visualization
hypotheses_for_viz = []

def test_hypothesis(drug_name, disease_name):
    """
    Function to test a single hypothesis.
    It takes a drug and disease name, constructs their textual profile,
    and uses the model to predict the probability of their association.
    It also saves the results for the graph if the probability is high.
    """
    global hypotheses_for_viz # Declare that we want to modify the global list

    print(f"\nHypothesis: Can the drug '{drug_name}' be used for the condition '{disease_name}'?")

    try:
        drug_info = drugs_df[drugs_df['DrugName'] == drug_name].iloc[0]
        disease_info = diseases_df[diseases_df['DiseaseName'] == disease_name].iloc[0]
    except IndexError:
        print("ERROR: Drug or disease not found in the original dataset.")
        return

    hypothesis_text_data = {**drug_info, **disease_info}
    hypothesis_text = ' '.join([str(hypothesis_text_data.get(col, '')) for col in text_cols])
    hypothesis_vector = vectorizer.transform([hypothesis_text])
    prediction_proba = model.predict_proba(hypothesis_vector)
    probability_of_association = prediction_proba[0][1]

    print(f"  Estimated Probability of Association: {probability_of_association:.2%}")
    if probability_of_association > 0.5:
        print("  Model Verdict: PROBABLE Association")
        # If the probability is over 50%, we add the data to our list for the graph.
        hypotheses_for_viz.append({
            "source": drug_name,
            "target": disease_name,
            "probability": round(probability_of_association, 4)
        })
    else:
        print("  Model Verdict: UNLIKELY Association")
    print("-" * 50)

# Run tests on the original hypotheses
print("--- Testing Initial Hypotheses ---")
test_hypothesis("Clozapine", "Nausea")
test_hypothesis("Valproic acid", "Seizures") # Control hypothesis
test_hypothesis("Indomethacin", "Alzheimer Disease")

# --- NEW HYPOTHESES TO TEST ---
print("\n--- Testing Newly Generated Hypotheses ---")
test_hypothesis("Indomethacin", "Hypertension")
test_hypothesis("Propranolol", "Nausea")
test_hypothesis("Lithium cation", "Depressive Disorder")
test_hypothesis("Valproic acid", "Hypotension")

# --- ADDITIONAL NEW HYPOTHESES ---
print("\n--- Testing Additional New Hypotheses ---")
test_hypothesis("Dexamethasone", "Kidney Diseases")
test_hypothesis("Morphine", "Hypotension")
test_hypothesis("Aspirin", "Fever")

--- Step 5: Testing Our Repositioning Hypotheses ---
--- Testing Initial Hypotheses ---

Hypothesis: Can the drug 'Clozapine' be used for the condition 'Nausea'?
  Estimated Probability of Association: 99.33%
  Model Verdict: PROBABLE Association
--------------------------------------------------

Hypothesis: Can the drug 'Valproic acid' be used for the condition 'Seizures'?
  Estimated Probability of Association: 98.00%
  Model Verdict: PROBABLE Association
--------------------------------------------------

Hypothesis: Can the drug 'Indomethacin' be used for the condition 'Alzheimer Disease'?
  Estimated Probability of Association: 94.67%
  Model Verdict: PROBABLE Association
--------------------------------------------------

--- Testing Newly Generated Hypotheses ---

Hypothesis: Can the drug 'Indomethacin' be used for the condition 'Hypertension'?
  Estimated Probability of Association: 99.33%
  Model Verdict: PROBABLE Association
--------------------------------------------------
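
A note on reading these percentages: scikit-learn's `RandomForestClassifier.predict_proba` averages the class probabilities predicted by the individual trees. With 150 trees, the values printed above are consistent with simple vote fractions (149/150 ≈ 99.33%, 147/150 = 98.00%, 142/150 ≈ 94.67%), so they are better read as "share of trees that agree" than as calibrated clinical probabilities. The following is a minimal sketch of that averaging on synthetic data. The dataset from `make_classification` and the `toy_*` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the TF-IDF matrix and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
toy_forest = RandomForestClassifier(n_estimators=150, random_state=42).fit(X_toy, y_toy)

# Probabilities reported by the forest for the first five samples.
forest_proba = toy_forest.predict_proba(X_toy[:5])

# The same quantity computed by hand: the mean over the individual trees.
tree_mean_proba = np.mean(
    [tree.predict_proba(X_toy[:5]) for tree in toy_forest.estimators_], axis=0)

print("Forest equals mean of trees:", np.allclose(forest_proba, tree_mean_proba))
# With fully grown trees, each tree typically votes 0 or 1, so the
# probability times the number of trees is (close to) a whole number of votes.
print("Approximate votes for class 1:", forest_proba[:, 1] * 150)
```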

# **Graph Visualization**

**The Map of Potential Discoveries**
Complex data is easier to interpret when visualized. In this final phase, we transform our most promising hypotheses (those with a probability greater than 50%) into an interactive force-directed graph.

Blue nodes represent drugs.
Red nodes represent diseases.
The thickness of the lines indicates the strength of the probability calculated by the model.
This isn't just a chart: it's a visual map of future therapeutic possibilities, uncovered by artificial intelligence.

In [None]:
# --- Generating and Displaying the Interactive Graph ---

# 1. We convert the collected results into a JSON formatted text string.
json_data_string = json.dumps(hypotheses_for_viz, indent=2)

# 2. We create an HTML template that will contain our D3.js graph.
#    We use a Python f-string to dynamically "inject" the JSON data
#    into the JavaScript script.
#    NOTE: The double curly braces {{...}} are used to escape
#    braces that should be interpreted by JavaScript and not by Python.
html_template = f"""
<!DOCTYPE html>
<meta charset="utf-8">
<style>
    /* Styles for the SVG container, nodes, links, and graph text */
    .graph-container {{
      background-color: #f9f9f9; /* Light gray background for readability */
      border: 1px solid #ccc;
      border-radius: 8px;
      width: 900px;
      height: 600px;
      cursor: move; /* Cursor indicates it can be dragged */
    }}
    .links line {{
        stroke: #999;
        stroke-opacity: 0.6;
    }}
    .nodes circle {{
        stroke: #fff;
        stroke-width: 1.5px;
    }}
    text {{
        font-family: sans-serif;
        font-size: 10px;
        pointer-events: none;
    }}
    .tooltip {{
      position: absolute;
      text-align: center;
      padding: 8px;
      font: 12px sans-serif;
      background: lightsteelblue;
      border-radius: 8px;
      pointer-events: none;
      opacity: 0;
    }}
    .legend-text {{
        font-family: sans-serif;
        font-size: 12px;
    }}
    .graph-title {{
        font-family: sans-serif;
        font-size: 16px;
        font-weight: bold;
    }}
</style>

<svg class="graph-container" width="900" height="600"></svg>
<div class="tooltip"></div>

<script src="https://d3js.org/d3.v7.min.js"></script>
<script>
    // The JSON data is inserted here directly from Python.
    const jsonData = {json_data_string};

    // Prepare the data for D3, creating a list of nodes and a list of links.
    const nodesMap = new Map();
    const links = jsonData.map(d => {{
        if (!nodesMap.has(d.source)) nodesMap.set(d.source, {{ id: d.source, type: 'drug' }});
        if (!nodesMap.has(d.target)) nodesMap.set(d.target, {{ id: d.target, type: 'disease' }});
        return {{ source: d.source, target: d.target, probability: d.probability }};
    }});

    const graphData = {{
        nodes: Array.from(nodesMap.values()),
        links: links
    }};

    // Graph creation
    const svg = d3.select("svg");
    const width = +svg.attr("width");
    const height = +svg.attr("height");

    // A <g> group that will contain all graph elements
    const container = svg.append("g");

    const color = d3.scaleOrdinal().domain(['drug', 'disease']).range(['#1f77b4', '#d62728']); // Blue for drugs, Red for diseases

    // Initialize the D3 force simulation
    const simulation = d3.forceSimulation(graphData.nodes)
        .force("link", d3.forceLink(graphData.links).id(d => d.id).distance(150))
        .force("charge", d3.forceManyBody().strength(-250))
        .force("center", d3.forceCenter(width / 2, height / 2))
        .force("collide", d3.forceCollide().radius(12));

    const tooltip = d3.select(".tooltip");

    // Draw the links (lines) INSIDE THE CONTAINER
    const link = container.append("g")
        .attr("class", "links")
        .selectAll("line")
        .data(graphData.links)
        .enter().append("line")
        .attr("stroke-width", d => Math.max(1, d.probability * 8));

    // Draw the nodes (circles) INSIDE THE CONTAINER
    const node = container.append("g")
        .attr("class", "nodes")
        .selectAll("circle")
        .data(graphData.nodes)
        .enter().append("circle")
        .attr("r", 10)
        .attr("fill", d => color(d.type))
        .call(d3.drag().on("start", dragstarted).on("drag", dragged).on("end", dragended))
        .on("mouseover", (event, d) => {{
            tooltip.transition().duration(200).style("opacity", .9);
            tooltip.html(`<strong>${{d.type.toUpperCase()}}</strong><br>${{d.id}}`)
                   .style("left", (event.pageX + 15) + "px")
                   .style("top", (event.pageY - 28) + "px");
        }})
        .on("mouseout", d => {{
            tooltip.transition().duration(500).style("opacity", 0);
        }});

    // Add text labels to the nodes INSIDE THE CONTAINER
    const label = container.append("g")
        .selectAll("text")
        .data(graphData.nodes)
        .enter().append("text")
        .text(d => d.id)
        .attr("x", 14)
        .attr("y", 4);

    // Function that updates the position of elements at each "tick" of the simulation
    simulation.on("tick", () => {{
        link
            .attr("x1", d => d.source.x)
            .attr("y1", d => d.source.y)
            .attr("x2", d => d.target.x)
            .attr("y2", d => d.target.y);

        node
            .attr("cx", d => d.x)
            .attr("cy", d => d.y);

        label
            .attr("transform", d => `translate(${{d.x}},${{d.y}})`);
    }});

    // Logic for Zoom and Pan
    const zoom_handler = d3.zoom()
        .scaleExtent([0.2, 5])
        .on("zoom", (event) => {{
            container.attr("transform", event.transform);
        }});
    svg.call(zoom_handler);

    // Title and legend
    // Add a title to the graph
    svg.append("text")
        .attr("class", "graph-title")
        .attr("x", width / 2)
        .attr("y", 30)
        .attr("text-anchor", "middle")
        .text("Drug Repositioning Hypotheses (Probability > 50%)");

    // Create a group for the legend
    const legend = svg.append("g")
        .attr("transform", `translate(20, 40)`); // Position of the legend

    // Legend for drugs
    legend.append("circle").attr("cx",0).attr("cy",0).attr("r", 6).style("fill", "#1f77b4");
    legend.append("text").attr("class", "legend-text").attr("x", 10).attr("y", 0).text("Drug").style("alignment-baseline","middle");

    // Legend for diseases
    legend.append("circle").attr("cx",0).attr("cy",20).attr("r", 6).style("fill", "#d62728");
    legend.append("text").attr("class", "legend-text").attr("x", 10).attr("y", 20).text("Disease").style("alignment-baseline","middle");

    // Legend for line thickness
    legend.append("line").attr("x1", -5).attr("y1", 40).attr("x2", 5).attr("y2", 40).style("stroke", "#999").style("stroke-width", 4);
    legend.append("text").attr("class", "legend-text").attr("x", 10).attr("y", 40).text("Higher Probability").style("alignment-baseline","middle");


    // Functions to handle dragging of individual nodes
    function dragstarted(event, d) {{
        if (!event.active) simulation.alphaTarget(0.3).restart();
        d.fx = d.x;
        d.fy = d.y;
    }}
    function dragged(event, d) {{
        d.fx = event.x;
        d.fy = event.y;
    }}
    function dragended(event, d) {{
        if (!event.active) simulation.alphaTarget(0);
        d.fx = null;
        d.fy = null;
    }}
</script>
"""

# 3. Finally, we use the display(HTML(...)) function to render the graph
#    directly in the output of the Colab cell.
print("\n\n" + "="*60)
print("FORCE LAYOUT GRAPH VISUALIZATION")
print("Results with probability > 50% are shown below.")
print("Use the mouse WHEEL to ZOOM and DRAG the background to PAN the graph.")
print("="*60 + "\n")

display(HTML(html_template))



FORCE LAYOUT GRAPH VISUALIZATION
Results with probability > 50% are shown below.
Use the mouse WHEEL to ZOOM and DRAG the background to PAN the graph.
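
The JavaScript in the cell above turns the flat list of hypotheses into the two structures a D3 force layout expects: a list of unique nodes and a list of links. The same transformation can be sketched in Python, which makes it easy to inspect the graph data before rendering. `build_graph_data` is a hypothetical helper, not part of the original notebook, and the sample records reuse three of the probabilities printed in Step 5.

```python
import json

def build_graph_data(hypotheses):
    """Split {source, target, probability} records into D3-style nodes and links.

    Hypothetical helper mirroring the nodesMap/links logic of the JavaScript.
    """
    nodes = {}
    for record in hypotheses:
        # setdefault keeps the first entry, so each name becomes one node only.
        nodes.setdefault(record["source"], {"id": record["source"], "type": "drug"})
        nodes.setdefault(record["target"], {"id": record["target"], "type": "disease"})
    links = [dict(record) for record in hypotheses]  # shallow copies of the records
    return {"nodes": list(nodes.values()), "links": links}

sample_hypotheses = [
    {"source": "Clozapine", "target": "Nausea", "probability": 0.9933},
    {"source": "Indomethacin", "target": "Alzheimer Disease", "probability": 0.9467},
    {"source": "Indomethacin", "target": "Hypertension", "probability": 0.9933},
]
sample_graph = build_graph_data(sample_hypotheses)
print(json.dumps(sample_graph, indent=2))
```

Because "Indomethacin" appears in two hypotheses, it becomes a single drug node with two outgoing links, which is what makes the force layout cluster related diseases around a shared drug.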

