<a href="https://colab.research.google.com/github/fatemehmsh90/Business-Data-Analytics-Project/blob/main/notebooks/Drug_Substitute_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Drug Substitute Identification and Risk Analysis in the Pharmaceutical Supply Chain Using Data-Driven Similarity and Exploratory Analytics
**Programme:** MSc IT – Business Data Analytics

**Author:** Fatemeh Mashayekhiahangarani
  
**Repo:** Business-Data-Analytics-Project  

> This notebook contains the analysis, figures, and notes used for the dissertation.  
> Final report will be a separate PDF (uploaded to Moodle). Code and figures are saved here.

---


# Table of Contents



# CHAPTER 1 – INTRODUCTION  

### 1.1 BACKGROUND OF THE STUDY
Drug shortages are a significant problem worldwide.
When a medicine is unavailable, patients and doctors need to find a substitute.
This process is often slow and manual, and people usually consider only the active ingredient.
However, this is not enough, because medicines can differ in chemical structure, mechanism of action, and therapeutic class.
If these differences are ignored, the chosen substitute may not work well and can cause treatment delays or higher costs (Aronson et al., 2023a; Aronson et al., 2023b).
Drug shortages have many causes, such as production problems, dependence on a single manufacturer, and weak distribution systems.
These problems harm patients and put pressure on the healthcare system (Andy and Andy, 2023).
In recent years, data-driven approaches have increasingly been used to support decision-making.
By combining data science with artificial intelligence, these methods can find patterns faster and help select better substitutes (Iyer, 2025).
For textual drug information, embedding methods can capture meaning and measure similarity between medicines (Kauffman et al., 2025).
This study aims to build a data-driven approach for identifying substitute medicines. It combines textual and structured features to calculate similarity and to provide results that help managers in pharmaceutical supply companies make better and faster decisions (Aronson et al., 2023b; Iyer, 2025).


### 1.2 PROBLEM STATEMENT
At present, there is no simple, data-driven system that automatically finds substitute medicines.
Pharmacists and doctors often decide only by reading the drug information or by relying on their own experience.
This process takes time and sometimes gives wrong results, because it considers only the active ingredient and ignores other important details such as chemical structure, mechanism of action, and therapeutic class (Aronson et al., 2023a; Aronson et al., 2023b).
As drug shortages become more common, fast and accurate decisions become critical.
However, many countries and companies still use basic or manual tools.
Recent studies show that machine learning and data analytics can reveal hidden patterns and candidate substitutes, supporting better decision-making in the pharmaceutical supply chain (Iyer, 2025; Kauffman et al., 2025).
Still, to the author's knowledge, no existing research brings textual and structured features together for this purpose.
Therefore, the main problem addressed by this study is how to combine textual data (such as product descriptions) and structured data (such as therapeutic class, chemical structure, and mechanism of action) to predict possible substitute medicines, and how to use these results to support better business and supply chain decisions (Iyer, 2025; Aronson et al., 2023b).


### 1.3 RESEARCH AIM AND OBJECTIVES
**Research Aim:**  
The main aim of this study is to build a data-driven method that identifies substitute medicines more accurately.
To do this, the study combines two types of information:

1. Text information from drug descriptions, and
2. Structured information such as therapeutic class, chemical structure, and mechanism of action.

By using both types of data together, the study seeks to determine which medicines are similar and which can serve as substitutes during a drug shortage.
  

**Research Objectives:**  
- Create text embeddings for the drug descriptions to understand meaning and similarity (Kauffman et al., 2025).
- Encode the structured features of each drug (Therapeutic / Chemical / Action classes).
- Combine text and structured features to calculate similarity between medicines.
- Identify the Top-k substitute drugs for each medicine.
- Apply clustering and build a substitution network.
- Create two indicators: SI (Substitutability Index) and SRI (Shortage Risk Index) for business analysis.
- Validate the results using an external dataset (Iyer, 2025).

These objectives help the study create a method that is both technically sound and useful for real decision-making in the pharmaceutical supply chain.


### 1.4 Research Questions *(~100 words)*
- RQ1: Does combining textual and structured features improve the accuracy of drug substitution?  
- RQ2: Which therapeutic classes show the highest and lowest internal substitutability?  
- RQ3: How do the SI and SRI indices represent the risk level in different drug groups?  
- RQ4: How can the analytical results support procurement and supply chain decision-making?

### 1.5 Scope & Significance *(~150 words)*
[Scope: use of the Mendeley MID dataset and external validation dataset.  
Significance: support data-driven decision-making to reduce delays, improve supply chain efficiency, and assist healthcare professionals and procurement officers in choosing alternative medicines.]

### 1.6 Overview of Analytical Approach *(~150 words)*
[General workflow: EDA → Text Embedding → Structured Feature Encoding → Integration → Similarity Computation → Clustering/Network → Evaluation → Business Insight Generation.]

### 1.7 Structure of the Dissertation *(~80 words)*
[Provide a short overview of chapters 2 to 9, summarizing how each chapter contributes to achieving the research objectives and how the analysis progresses from theory to implementation and business interpretation.]


## 1. Setup
(Installing required packages, setting random seed, defining directories)

```python
# === Setup: basic configuration and reproducibility ===

import os
import random

import numpy as np
import pandas as pd

# Set random seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Define a directory for saving figures (used later for visualizations)
FIG_DIR = "/content/figures"
os.makedirs(FIG_DIR, exist_ok=True)

print("Setup completed successfully.")
print("Figures will be saved to:", FIG_DIR)
```

```
Setup completed successfully.
Figures will be saved to: /content/figures
```


## 2. Data
(Introduction to the MID dataset and data loading)

## 3. EDA – Exploratory Data Analysis
(Initial data exploration and basic visualizations)

## 4. Text Embeddings
(Creating text embeddings using LLM or sentence-transformer models)

## 5. Structured Features
(Encoding structured attributes: Therapeutic, Chemical, and Action classes)
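
As a minimal sketch of what this encoding step could look like, the cell below one-hot encodes three categorical attributes per drug. The records, the field names (`therapeutic`, `chemical`, `action`), and the helper functions are hypothetical placeholders, not the actual MID dataset columns. It is written in plain Python to keep the logic visible; in the notebook itself a library encoder could be used instead.

```python
# Hypothetical records: field names and values are illustrative only,
# not taken from the MID dataset.
drugs = [
    {"name": "drug_a", "therapeutic": "analgesic", "chemical": "nsaid", "action": "cox_inhibitor"},
    {"name": "drug_b", "therapeutic": "analgesic", "chemical": "nsaid", "action": "cox_inhibitor"},
    {"name": "drug_c", "therapeutic": "antibiotic", "chemical": "penicillin", "action": "cell_wall_inhibitor"},
]
FEATURES = ["therapeutic", "chemical", "action"]


def build_vocabulary(records, features):
    """Map every observed (feature, value) pair to a fixed column index."""
    pairs = sorted({(feature, record[feature]) for record in records for feature in features})
    return {pair: index for index, pair in enumerate(pairs)}


def one_hot(record, vocabulary, features):
    """Return a 0/1 vector with a 1 for each category the record belongs to."""
    vector = [0] * len(vocabulary)
    for feature in features:
        vector[vocabulary[(feature, record[feature])]] = 1
    return vector


vocabulary = build_vocabulary(drugs, FEATURES)
encoded = {record["name"]: one_hot(record, vocabulary, FEATURES) for record in drugs}

for name, vector in encoded.items():
    print(name, vector)
```

Each drug receives exactly one active column per attribute, so two drugs that share all three classes get identical structured vectors.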

## 6. Similarity & Integration
(Merging text and structured embeddings and calculating cosine similarity)
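
One possible way to merge the two feature blocks is sketched below, under stated assumptions. Each block is L2-normalized and scaled by the square root of its weight before concatenation, so the cosine similarity of the combined vectors is a weighted average of the text and structured similarities. The `text_weight` parameter, the function names, and the toy vectors are assumptions for illustration. The text vectors stand in for real embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def combine(text_vec, struct_vec, text_weight=0.5):
    """Concatenate normalized blocks, weighting text vs. structured features."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    w_text = math.sqrt(text_weight)
    w_struct = math.sqrt(1.0 - text_weight)
    return ([w_text * x for x in l2_normalize(text_vec)]
            + [w_struct * x for x in l2_normalize(struct_vec)])


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def top_k_substitutes(query, vectors, k=5):
    """Rank all other drugs by similarity to the query and keep the best k."""
    scores = [(name, cosine_similarity(vectors[query], vec))
              for name, vec in vectors.items() if name != query]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Toy inputs (hypothetical): stand-ins for text embeddings and one-hot classes.
text = {"drug_a": [0.9, 0.1, 0.0], "drug_b": [0.8, 0.2, 0.1], "drug_c": [0.0, 0.1, 0.9]}
struct = {"drug_a": [1, 0, 1, 0], "drug_b": [1, 0, 1, 0], "drug_c": [0, 1, 0, 1]}

combined = {name: combine(text[name], struct[name], text_weight=0.5) for name in text}
result = top_k_substitutes("drug_a", combined, k=2)
print(result)
```

Changing `text_weight` shifts the ranking toward the description text or toward the class labels, which gives a simple way to compare text-only, structured-only, and combined rankings.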

## 7. Clustering & Network
(Clustering drugs and building substitution networks)
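
The following sketch shows one simple way a substitution network could be derived from Top-k lists. Two drugs are linked when their similarity reaches a threshold, and connected components are used as a crude grouping. The `top_k` dictionary, the 0.5 threshold, and the use of connected components as a stand-in for a clustering algorithm are all assumptions. The notebook does not yet specify which clustering method will be applied.

```python
# Hypothetical Top-k output: drug -> [(candidate, similarity), ...]
top_k = {
    "drug_a": [("drug_b", 0.99), ("drug_c", 0.01)],
    "drug_b": [("drug_a", 0.99), ("drug_c", 0.02)],
    "drug_c": [("drug_b", 0.02), ("drug_a", 0.01)],
    "drug_d": [("drug_e", 0.85), ("drug_a", 0.10)],
    "drug_e": [("drug_d", 0.85), ("drug_c", 0.05)],
}


def build_network(top_k, threshold=0.5):
    """Build an undirected graph with an edge for each pair above the threshold."""
    graph = {drug: set() for drug in top_k}
    for drug, candidates in top_k.items():
        for candidate, score in candidates:
            if score >= threshold:
                graph[drug].add(candidate)
                graph.setdefault(candidate, set()).add(drug)
    return graph


def connected_components(graph):
    """Group nodes that are reachable from one another (depth-first search)."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(sorted(component))
    return components


network = build_network(top_k)
groups = connected_components(network)
print(groups)
```

In this toy example `drug_c` ends up with no neighbours, meaning no candidate substitute passed the threshold for it.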

## 8. Validation & Evaluation
(Evaluating results with external datasets using metrics like Hit@k and Precision@k)
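
As a hedged illustration of the two metrics named above, the cell below computes Hit@k and Precision@k using their usual definitions. The `predictions` and `ground_truth` dictionaries are hypothetical. In the real evaluation, the reference substitutes would come from the external validation dataset.

```python
def hit_at_k(ranked, relevant, k):
    """Return 1.0 if at least one of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Return the fraction of the top-k positions occupied by relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


# Hypothetical ranked predictions and reference substitutes per query drug.
predictions = {
    "drug_a": ["drug_b", "drug_c", "drug_d"],
    "drug_d": ["drug_a", "drug_c", "drug_b"],
}
ground_truth = {
    "drug_a": {"drug_b", "drug_d"},
    "drug_d": {"drug_e"},
}

K = 3
mean_hit = sum(hit_at_k(predictions[q], ground_truth[q], K) for q in predictions) / len(predictions)
mean_precision = sum(precision_at_k(predictions[q], ground_truth[q], K) for q in predictions) / len(predictions)

print(f"Hit@{K}: {mean_hit:.3f}")
print(f"Precision@{K}: {mean_precision:.3f}")
```

Hit@k only asks whether any correct substitute appears in the list, while Precision@k rewards lists where more of the k slots are correct. Reporting both gives a fuller picture of ranking quality.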

## 9. Results & Business Insights
(Top-5 substitute lists, SI and SRI indices, and managerial implications)

## 10. Export & Figures
(Saving figures and outputs for the final report in PDF format)

---


# REFERENCES
1. Aronson, J.K., Ferner, R.E., & Heneghan, C. (2023a). Drug shortages. Part 1: Definitions and harms. British Journal of Clinical Pharmacology, 89(10), pp. 2950–2956. [Online] Available at: https://bpspubs.onlinelibrary.wiley.com/doi/10.1111/bcp.15842 [Accepted 20 Jun 2023].
2. Aronson, J.K., Ferner, R.E., & Heneghan, C. (2023b). Drug shortages. Part 2: Trends, causes and solutions. British Journal of Clinical Pharmacology, 89(10), pp. 2957–2963. [Online] Available at: https://doi.org/10.1111/bcp.15853 [Accepted 20 Jun 2023].
3. Andy, A., & Andy, D. (2023). Drug Shortages in Pharmacies: Root Causes, Consequences and the Role of the FDA in Mitigation Strategies. Progress in Medical Sciences Journal, 7(5), pp. 1–7. [Online] Available at: https://doi.org/10.47363/PMS/2023(7)E129 [Accepted 20 Oct 2023].
4. Iyer, S.S. (2025). Data-Driven Decision Making: The Key to Future Health Care Business Success. RA Journal of Applied Research, 11(3), pp. 115–136. [Online] Available at: https://doi.org/10.47191/rajar/v11i3.06 [Accepted 03 Mar 2025].
5. Kauffman, J., Miotto, R., Klang, E., Costa, A., Norgeot, B., Zitnik, M., Khader, S., Wang, F., Nadkarni, G.N., & Glicksberg, B.S. (2025). Embedding Methods for Electronic Health Record Research. Annual Review of Biomedical Data Science, 8, pp. 563–590. [Online] Available at: https://doi.org/10.1146/annurev-biodatasci-103123-094729 [Accepted 01 May 2025].
