<a href="https://colab.research.google.com/github/fatemehmsh90/Business-Data-Analytics-Project/blob/main/notebooks/Drug_Substitute_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Drug Substitute Identification and Risk Analysis in the Pharmaceutical Supply Chain Using Data-Driven Similarity and Exploratory Analytics
**Programme:** MSc IT – Business Data Analytics

**Author:** Fatemeh Mashayekhiahangarani
  
**Repo:** Business-Data-Analytics-Project  

> This notebook contains the analysis, figures, and notes used for the dissertation.  
> The final report will be a separate PDF (uploaded to Moodle); the code and figures are saved here.

---


# Table of Contents



## 1. Setup
(Installing and importing required packages, setting the random seed, and defining output directories)

In [None]:
# === Setup: basic configuration and reproducibility ===

import os
import random
import numpy as np
import pandas as pd

# Set random seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Define a directory for saving figures (used later for visualizations)
FIG_DIR = "/content/figures"
os.makedirs(FIG_DIR, exist_ok=True)

print("Setup completed successfully.")
print("Figures will be saved to:", FIG_DIR)

## 2. Data
(Introduction to the MID dataset and data loading)

## 3. EDA – Exploratory Data Analysis
(Initial data exploration and basic visualizations)

## 4. Text Embeddings
(Creating text embeddings using an LLM or a sentence-transformer model)

## 5. Structured Features
(Encoding structured attributes: Therapeutic, Chemical, and Action classes)
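
One common way to encode categorical attributes like these is one-hot encoding. The sketch below is illustrative only: the column names (`therapeutic_class`, `chemical_class`, `action_class`), the drug labels, and the three toy rows are assumptions made for the example, not the actual MID schema or data.

```python
import pandas as pd

# Toy rows with hypothetical column names; the real MID columns may differ.
toy = pd.DataFrame(
    {
        "drug": ["drug_a", "drug_b", "drug_c"],
        "therapeutic_class": ["analgesic", "analgesic", "antibiotic"],
        "chemical_class": ["nsaid", "anilide", "penicillin"],
        "action_class": ["cox_inhibitor", "cox_inhibitor", "cell_wall_inhibitor"],
    }
).set_index("drug")

# One-hot encode every categorical attribute into 0/1 indicator columns.
structured = pd.get_dummies(toy, dtype=float)

print(structured.shape)
print(structured.columns.tolist())
```

Each drug ends up with exactly one active indicator per attribute, so two drugs that share a class overlap in that column. Rare classes may need grouping before encoding; that choice is left open here.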

## 6. Similarity & Integration
(Merging text and structured embeddings and calculating cosine similarity)
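
As a minimal sketch of this step, the code below normalises each feature block, concatenates the blocks with a weight, and ranks neighbours by cosine similarity. The function names, the `text_weight` parameter, and the toy arrays are assumptions for illustration; the weighting scheme used in the dissertation may differ.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def combined_similarity(text_emb, struct_feat, text_weight=0.5):
    """Cosine similarity on weighted, concatenated feature blocks."""
    # Normalising each block first keeps one block from dominating by scale.
    # With square-root weights, the result equals
    # text_weight * cos(text) + (1 - text_weight) * cos(structured)
    # whenever both blocks are non-zero.
    blocks = np.hstack([
        np.sqrt(text_weight) * l2_normalize(text_emb),
        np.sqrt(1.0 - text_weight) * l2_normalize(struct_feat),
    ])
    unit = l2_normalize(blocks)
    return unit @ unit.T


def top_k(sim, k):
    """Indices of the k most similar items per row, excluding the item itself."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)
    return np.argsort(-masked, axis=1)[:, :k]


# Toy inputs: 4 drugs, 3-dim text embeddings, 2-dim one-hot structured features.
text_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
struct_feat = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])

sim = combined_similarity(text_emb, struct_feat)
neighbours = top_k(sim, k=2)
print(neighbours)
```

Setting `text_weight` to 1.0 or 0.0 recovers a text-only or structured-only baseline, which is one way to compare the combined representation against its parts.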

## 7. Clustering & Network
(Clustering drugs and building substitution networks)

## 8. Validation & Evaluation
(Evaluating results with external datasets using metrics like Hit@k and Precision@k)
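
The two metrics named above can be sketched as follows, using their usual definitions: Hit@k checks whether at least one known substitute appears in the top k, and Precision@k is the share of the top k that are known substitutes. The function names, drug labels, and reference sets are hypothetical; the exact definitions used in the evaluation chapter may differ.

```python
def hit_at_k(ranked, relevant, k):
    """Return 1.0 if any of the top-k ranked items is a known substitute, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))


def precision_at_k(ranked, relevant, k):
    """Return the fraction of the top-k ranked items that are known substitutes."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(item in relevant for item in ranked[:k])
    return hits / k


# Toy example: predicted rankings and reference substitutes (both hypothetical).
predictions = {
    "drug_a": ["drug_b", "drug_c", "drug_d"],
    "drug_x": ["drug_y", "drug_z", "drug_w"],
}
reference = {
    "drug_a": {"drug_c"},
    "drug_x": {"drug_q"},
}

k = 3
n = len(predictions)
mean_hit = sum(hit_at_k(predictions[d], reference[d], k) for d in predictions) / n
mean_prec = sum(precision_at_k(predictions[d], reference[d], k) for d in predictions) / n

print(f"Hit@{k}: {mean_hit:.3f}")
print(f"Precision@{k}: {mean_prec:.3f}")
```

Averaging over query drugs gives one score per value of k, so the same loop can be repeated for several values of k to compare models.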

## 9. Results & Business Insights
(Top-5 substitute lists, the Substitutability Index (SI) and Shortage Risk Index (SRI), and managerial implications)

## 10. Export & Figures
(Saving figures and outputs for the final report in PDF format)

---


# Chapter 1 – Introduction  *(~900–1000 words)*

### 1.1 Background of the Study *(~250 words)*
[Write here the general background of the problem: drug shortages, the need for substitutes, slow manual searching, limitations of focusing only on the active ingredient, and the importance of considering differences in chemical composition, mechanism of action, and therapeutic class. Explain how these factors affect treatment continuity and healthcare costs.]

### 1.2 Problem Statement *(~150 words)*
[Clearly define the main problem: the absence of a data-driven system for identifying substitute drugs based on both textual and structured attributes. Highlight the need for a faster and more transparent analytical approach.]

### 1.3 Research Aim & Objectives *(~120 words)*
**Aim:**  
To develop a data-driven approach for identifying substitute drugs and analyzing risk in the pharmaceutical supply chain.  

**Objectives:**  
- Generate text embeddings and integrate them with structured features  
- Compute similarity and extract Top-k substitute drugs  
- Perform clustering and build substitution networks  
- Design and calculate SI (Substitutability Index) and SRI (Shortage Risk Index)  
- Validate results using an external reference dataset  

### 1.4 Research Questions *(~100 words)*
- RQ1: Does combining textual and structured features improve the accuracy of drug substitution?  
- RQ2: Which therapeutic classes show the highest and lowest internal substitutability?  
- RQ3: How do the SI and SRI represent the level of risk in different drug groups?  
- RQ4: How can the analytical results support procurement and supply chain decision-making?

### 1.5 Scope & Significance *(~150 words)*
[Scope: use of the Mendeley MID dataset and an external validation dataset.  
Significance: support data-driven decision-making to reduce delays, improve supply chain efficiency, and assist healthcare professionals and procurement officers in choosing alternative medicines.]

### 1.6 Overview of Analytical Approach *(~150 words)*
[General workflow: EDA → Text Embedding → Structured Feature Encoding → Integration → Similarity Computation → Clustering/Network → Evaluation → Business Insight Generation.]

### 1.7 Structure of the Dissertation *(~80 words)*
[Provide a short overview of chapters 2 to 9, summarizing how each chapter contributes to achieving the research objectives and how the analysis progresses from theory to implementation and business interpretation.]
