
# Drug Substitute Identification and Risk Analysis in the Pharmaceutical Supply Chain Using Data-Driven Similarity and Exploratory Analytics
**Programme:** MSc IT – Business Data Analytics

**Author:** Fatemeh Mashayekhiahangarani
  
**Repo:** Business-Data-Analytics-Project  

> This notebook contains the analysis, figures, and notes used for the dissertation.  
> Final report will be a separate PDF (uploaded to Moodle). Code and figures are saved here.

---



# CHAPTER 1 – INTRODUCTION  

## 1.1 BACKGROUND OF THE STUDY
Drug shortages are a significant problem worldwide.

When a medicine is not available in the market, patients and doctors need to find a substitute drug.

This process is often slow and manual, and people usually look only at the active ingredient.

However, this is not enough because medicines can be different in their chemical structure, mechanism of action, and therapeutic class.

If these differences are not considered, the chosen substitute may not work well and can cause treatment delays or higher costs (Aronson et al., 2023a; Aronson et al., 2023b).

There are many reasons for drug shortages, such as production problems, dependency on one manufacturer, and weak distribution systems. These problems hurt patients and put pressure on the healthcare system (Andy and Andy, 2023).

In recent years, data-driven approaches have been used more often to support decision-making. By combining data science with artificial intelligence, these methods can find patterns faster and help select better substitutes (Iyer, 2025).

For text information about drugs, embedding methods can also be used to understand meaning and measure similarity between medicines (Kauffman et al., 2025).

This study aims to build a data-driven approach for identifying substitute medicines. It combines textual and structural features to calculate similarities and provide results that can help managers in pharmaceutical supply companies make better and faster decisions (Aronson et al., 2023b; Iyer, 2025).


## 1.2 PROBLEM STATEMENT
Today, there is no simple and data-based system that can automatically find substitute medicines. Pharmacists and doctors often decide only by reading the drug information or by using their own experience.

This process takes time and sometimes gives wrong results, because it only looks at the active ingredient and ignores other important details like chemical structure, mechanism of action, and therapeutic class (Aronson et al., 2023a; Aronson et al., 2023b).

When drug shortages become more common, quick and correct decisions are very important. However, many countries and companies still use basic or manual tools.

Recent studies show that machine learning and data analytics can help to find hidden patterns and possible drug substitutes that support better decision-making in the pharmaceutical supply chain (Iyer, 2025; Kauffman et al., 2025).

Still, to the best of the author's knowledge, no research brings both textual and structured features together for this purpose.

Therefore, the main problem of this study is:

how to combine textual data (like product descriptions) and structured data (like therapeutic class, chemical structure, and mechanism of action) to predict possible substitute medicines, and how to use these results for better business and supply chain decisions (Iyer, 2025; Aronson et al., 2023b).


## 1.3 RESEARCH AIM AND OBJECTIVES
**Research Aim:**  
The main aim of this study is to build a data-driven method that can find substitute medicines more accurately.
To do this, the study combines two types of information:

1.	Text information from drug descriptions, and
2.	Structured information such as therapeutic class, chemical structure, and mechanism of action.

By using both types of data together, the study tries to understand which medicines are similar and which ones can be used as substitutes during a drug shortage.
  

**Research Objectives:**  
- Create text embeddings for the drug descriptions to understand meaning and similarity (Kauffman et al., 2025).
- Encode the structured features of each drug (Therapeutic / Chemical / Action classes).
- Combine text and structured features to calculate similarity between medicines.
- Identify the Top-k substitute drugs for each medicine.
- Apply clustering and build a substitution network.
- Create two indicators: SI (Substitutability Index) and SRI (Shortage Risk Index) for business analysis.
- Validate the results using an external dataset (Iyer, 2025).

These objectives help the study create a method that is both technically strong and useful for real decision-making in the pharmaceutical supply chain.


## 1.4 RESEARCH QUESTIONS
This study is guided by several research questions that help to give a clear direction to the project and show what the analysis aims to answer.

**RQ1:**

Does combining text data (such as drug descriptions) and structured data (such as therapeutic class, chemical structure, and mechanism of action) improve the accuracy of finding substitute medicines?

This question is motivated by studies that show how different drug properties can influence substitution decisions (Aronson et al., 2023a; Aronson et al., 2023b).

**RQ2:**

Which therapeutic classes have the highest substitutability, and which ones have the lowest substitutability?

This question matters for pharmaceutical supply chains because some drug groups are more sensitive to shortages than others (Andy and Andy, 2023).

**RQ3:**

Can the substitution index (SI) and the shortage risk index (SRI) help describe the risk level of different drug groups in a simple and useful way?

This question is supported by the idea of data-driven decision-making in the supply chain (Iyer, 2025).

**RQ4:**

How can the results of this model support real decisions in procurement and the pharmaceutical supply chain?

This question focuses on the practical value of the analysis.


## 1.5 SCOPE & SIGNIFICANCE
**Scope:**

In this study, I use the MID dataset as the main data source.
This dataset includes information such as drug name, text description, therapeutic class, chemical class, and mechanism of action.

The analysis is limited to these fields. I do not use price data, sales data, or patient-level data. The main focus is to see if these text and structured fields are enough to suggest possible substitute medicines.

To make the results more reliable, I also plan to use an external dataset for checking the model, for example a public medicine substitute dataset from Kaggle or a similar source.

The project does not go deep into company finance or detailed cost modelling, but the results can still be useful for pharmacy managers and supply chain planners (Iyer, 2025).

**Significance:**

Drug shortages are a real and growing problem in many countries.
They can delay treatment and create stress for patients, doctors and pharmacists (Aronson et al., 2023a). In many places, the current tools for finding substitutes are slow, manual, or not updated.

A data-driven method that can suggest substitutes faster and in a more structured way may help to reduce delays and improve access to medicines (Iyer, 2025).

By using text embeddings and structured features together, the model does not rely only on the active ingredient or the drug name, but also on therapeutic class, chemical properties and mechanism of action (Kauffman et al., 2025).

Overall, this study can be a small but useful step towards smarter tools for managing the pharmaceutical supply chain.


## 1.6 OVERVIEW OF ANALYTICAL APPROACH
This study follows a clear and step-by-step analytical process.

The goal is to combine text information and structured drug features to build a method that can suggest substitute medicines and help understand shortage risks.

**Step 1: Exploratory Data Analysis (EDA)**

The first step is to look at the data and understand its basic shape.
In this first review, I examine the number of drugs, the missing values, and the distribution of therapeutic classes in the dataset.

This helps me see if the data needs cleaning and what patterns appear at the start.

**Step 2: Text Embeddings**

The text descriptions of each drug are turned into numerical vectors using embedding methods. These vectors help the model capture the meaning of the text (Kauffman et al., 2025).

**Step 3: Structured Features**

In this step, therapeutic classes, chemical classes, and mechanisms of action are converted into numerical values. These features are important because they show how drugs are related in terms of therapeutic use and chemical properties (Aronson et al., 2023b).

**Step 4: Combining Text and Structure**

The text embeddings and structured features are merged to create a full representation of each drug.

**Step 5: Similarity Calculation**

For every drug, similarity to other drugs is calculated using cosine similarity.
This helps identify possible substitute medicines.

**Step 6: Clustering and Substitution Network**

Similar drugs are grouped into clusters, and pairs of drugs with high similarity are linked to form a substitution network.
This network is then displayed as a graph to show how the drugs are related and which groups have stronger substitution links with each other.

**Step 7: Model Evaluation**

The results are checked using an external dataset.
Metrics such as Hit@k and Precision@k are used to see how well the method finds correct substitutes.

**Step 8: Business Insights**

Finally, two indicators are created:

- **SI (Substitutability Index)**
- **SRI (Shortage Risk Index)**

These two indicators help decision-makers in the pharmaceutical supply chain to better understand which drug groups have good alternatives and which groups face a higher shortage risk (Iyer, 2025).

This approach makes the analysis technically strong and also useful for real decision-making in the pharmaceutical supply chain.


## 1.7 STRUCTURE OF THE DISSERTATION
This dissertation is divided into several parts.
Chapter One introduces the topic. It explains the problem, the aim of the study, the research questions, and the general analytical approach.

Chapter Two presents the literature review.
This chapter describes drug shortages, substitution patterns, data analytics methods, similarity techniques, text embeddings, and how these tools can be used in the pharmaceutical supply chain.

Chapter Three explains the research methodology.
It introduces the MID dataset and describes the steps for data cleaning, text embedding, encoding structured features, and calculating similarity.

Chapter Four shows the results of the exploratory data analysis (EDA).

Chapter Five presents the similarity model output, including substitute drug lists, clustering results, and the substitution network.

Chapter Six focuses on model evaluation and shows how well the method performs using metrics such as Hit@k and Precision@k.

Chapter Seven provides business insights.
It explains the SI (Substitutability Index) and SRI (Shortage Risk Index) and shows how the results can support decisions in procurement and supply chain planning.

Finally, Chapter Eight includes the conclusion, limitations of the study, and suggestions for future work.


# CHAPTER 2 – LITERATURE REVIEW
## 2.1 DRUG SHORTAGES: DEFINITIONS, CAUSES, IMPACTS
Drug shortages are a serious problem in many countries. According to Aronson et al. (2023a), a drug shortage happens when a medicine is not available for patients at the time they need it. Shortages can be short or long, and they can affect the quality of treatment and patient safety.

There are many reasons why drug shortages occur.

Aronson et al. (2023b) explain that problems in manufacturing, lack of raw materials, quality control issues, dependence on a single supplier, strict import rules, and distribution problems are some of the main reasons.

Sometimes companies reduce production because the profit of a drug is low. In other cases, transportation delays or global crises create disruptions in the supply chain.

A review article by Adak (2024) shows that drug shortages create not only technical problems but also serious human problems.
Shortages can increase treatment costs, delay therapy, and create stress for patients, doctors, and pharmacists.

Adak (2024) also notes that shortages reduce trust in the healthcare system and make daily work in pharmacies more difficult. Another study by Andy and Andy (2023) explains that the current systems used by pharmacies to find substitute medicines are slow and limited.

In many places, pharmacists decide based only on the active ingredient. This can be risky because two drugs with the same active ingredient can still have very different chemical structures, mechanisms of action, or therapeutic classes.

Because of these challenges, recent research suggests that healthcare systems should use more data-driven tools and smarter methods to support substitution decisions. These tools can help make faster and more accurate choices and reduce the negative impacts of shortages.

## 2.2 MEDICINE SUBSTITUTION: CONCEPTS & CHALLENGES
Medicine substitution means using another medicine when the original one is not available. According to Aronson et al. (2023b), substitution can happen for many reasons, such as a drug shortage, high price, production problems, or a change in the treatment plan.

There are two common types of substitution.

**1) Generic substitution**

In this case, the original medicine is replaced with a generic version that has the same active ingredient. This method is widely used, but it is not always enough.

Even if two medicines have the same active ingredient, they may still have different chemical structures or mechanisms of action (Aronson et al., 2023a).

**2) Therapeutic substitution**

Here, the medicine is replaced with another medicine that has a different active ingredient but a similar therapeutic effect. This type of substitution is more complex and requires deeper pharmaceutical knowledge.

Studies show that finding the right substitute is not always easy. One major challenge is that information sources are often old or incomplete (Andy and Andy, 2023).

In many countries, pharmacists must search manually through different websites or books to find similar medicines. This takes time and may lead to mistakes.

Another problem is that similarity between medicines is not only about the active ingredient.

A safe and correct substitution should consider:

- chemical structure
- mechanism of action
- therapeutic class
- side effects
- drug interactions

Because of these challenges, recent research recommends using data-driven tools and machine learning methods to support substitution decisions (Iyer, 2025).

These tools can combine text and structured data and suggest substitute medicines faster and more accurately.


## 1. Setup
(Installing required packages, setting random seed, defining directories)

```python
# === Setup: basic configuration and reproducibility ===

import os
import random

import numpy as np
import pandas as pd

# Set random seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Define a directory for saving figures (used later for visualizations)
FIG_DIR = "/content/figures"
os.makedirs(FIG_DIR, exist_ok=True)

print("Setup completed successfully.")
print("Figures will be saved to:", FIG_DIR)
```

```
Setup completed successfully.
Figures will be saved to: /content/figures
```


## 2. Data
(Introduction to the MID dataset and data loading)

## 3. EDA – Exploratory Data Analysis
(Initial data exploration and basic visualizations)
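
As a minimal sketch of this step, the checks described in Section 1.6 (number of drugs, missing values, distribution of therapeutic classes) could look like the cell below. The toy rows, the column names, and the helper `basic_eda` are hypothetical placeholders, not the real MID schema.

```python
import pandas as pd

# Toy stand-in for the MID dataset; column names are assumed, not the real schema.
toy_df = pd.DataFrame(
    {
        "name": ["Drug A", "Drug B", "Drug C", "Drug D"],
        "description": ["pain relief tablet", None, "antibiotic capsule", "pain relief gel"],
        "therapeutic_class": ["PAIN ANALGESICS", "ANTI INFECTIVES", "ANTI INFECTIVES", "PAIN ANALGESICS"],
    }
)


def basic_eda(df: pd.DataFrame, class_col: str) -> dict:
    """Return the row count, missing values per column, and class counts."""
    return {
        "n_rows": len(df),
        "missing": df.isna().sum().to_dict(),
        "class_counts": df[class_col].value_counts().to_dict(),
    }


summary = basic_eda(toy_df, "therapeutic_class")
print(summary)
```

The same function can be pointed at the real DataFrame once it is loaded, with `class_col` set to whichever column holds the therapeutic class.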

## 4. Text Embeddings
(Creating text embeddings using LLM or sentence-transformer models)
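
The cell below is only a stand-in that shows the shape of this step: text goes in, a fixed-length unit vector comes out. It uses a hashed bag-of-words, not an LLM or sentence-transformer model, so it does not capture meaning the way a real embedding model would. The function name `hashed_bow_embedding`, the dimension, and the example descriptions are assumptions for illustration.

```python
import zlib

import numpy as np


def hashed_bow_embedding(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hashed bag-of-words counts, L2-normalised."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Hypothetical drug descriptions, one row per drug.
descriptions = ["relieves mild to moderate pain", "treats bacterial infections"]
text_matrix = np.vstack([hashed_bow_embedding(d) for d in descriptions])
print(text_matrix.shape)  # (2, 64)
```

In the real pipeline, the call to `hashed_bow_embedding` would be replaced by the chosen embedding model, while the rest of the notebook can keep working with a matrix of one row per drug.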

## 5. Structured Features
(Encoding structured attributes: Therapeutic, Chemical, and Action classes)
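
One simple way to encode the three class fields is one-hot encoding, sketched below with `pandas.get_dummies`. This is an assumption about the encoding, not a fixed design choice, and the column names and class labels are hypothetical.

```python
import pandas as pd

# Toy structured attributes; column names and labels are placeholders.
toy_df = pd.DataFrame(
    {
        "therapeutic_class": ["PAIN ANALGESICS", "ANTI INFECTIVES", "PAIN ANALGESICS"],
        "chemical_class": ["class_x", "class_y", "class_x"],
        "action_class": ["action_1", "action_2", "action_3"],
    }
)

class_cols = ["therapeutic_class", "chemical_class", "action_class"]

# One binary column per distinct label in each field.
structured = pd.get_dummies(toy_df, columns=class_cols, dtype=float)
print(structured.shape)  # (3, 7)
```

Each row has exactly one active column per field, so drugs that share a class share a non-zero position in the encoded vector.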

## 6. Similarity & Integration
(Merging text and structured embeddings and calculating cosine similarity)
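
The sketch below illustrates one possible way to merge the two feature blocks and rank candidates: each block is L2-normalised, the blocks are concatenated with a weight on the text part, and cosine similarity is computed between all pairs. The weight `text_weight`, the helper names, and the toy vectors are assumptions, not values taken from the study.

```python
import numpy as np


def l2_normalise(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)


def combine_features(text_vecs, struct_vecs, text_weight=0.5):
    """Concatenate normalised text and structured blocks with a text weight."""
    return np.hstack(
        [text_weight * l2_normalise(text_vecs), (1.0 - text_weight) * l2_normalise(struct_vecs)]
    )


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between all rows."""
    unit = l2_normalise(features)
    return unit @ unit.T


def top_k_substitutes(sim: np.ndarray, k: int) -> np.ndarray:
    """For each drug, indices of the k most similar other drugs (self excluded)."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)
    return np.argsort(-masked, axis=1)[:, :k]


# Toy inputs: three drugs, 2-d text vectors and 2-d one-hot class vectors.
text_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
struct_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

features = combine_features(text_vecs, struct_vecs, text_weight=0.5)
similarity = cosine_similarity_matrix(features)
top1 = top_k_substitutes(similarity, k=1)
print(top1.ravel())
```

Normalising each block before weighting keeps one block from dominating only because it has more dimensions or larger values.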

## 7. Clustering & Network
(Clustering drugs and building substitution networks)
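
As a hedged illustration of this step, the cell below links two drugs when their similarity reaches a threshold and then reads the connected components of that graph as groups. Connected components are used here as a simple stand-in for a clustering algorithm. The threshold value, the function names, and the toy similarity matrix are assumptions.

```python
from collections import deque


def build_substitution_graph(sim, threshold):
    """Undirected graph: an edge joins two drugs with similarity >= threshold."""
    n = len(sim)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def connected_components(graph):
    """Group nodes that can reach each other, using breadth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Toy similarity matrix for four drugs (symmetric, values in [0, 1]).
toy_sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
graph = build_substitution_graph(toy_sim, threshold=0.7)
clusters = connected_components(graph)
print(clusters)  # [[0, 1], [2, 3]]
```

A drug with no edges forms its own single-member group, which is a useful signal that it has no close substitute at the chosen threshold.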

## 8. Validation & Evaluation
(Evaluating results with external datasets using metrics like Hit@k and Precision@k)
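
The two metrics named here can be sketched as follows. Hit@k checks whether at least one known substitute appears in the top k suggestions, and Precision@k measures the share of the top k suggestions that are known substitutes. The drug names and the ground-truth sets are invented for illustration, and dividing by k even when fewer than k suggestions exist is one common convention, not necessarily the one used in the final evaluation.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if at least one of the top-k items is a known substitute, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by known substitutes."""
    if k <= 0:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / k


# Hypothetical model output and external ground truth.
predictions = {
    "drug_a": ["drug_b", "drug_c", "drug_d"],
    "drug_e": ["drug_f", "drug_g", "drug_h"],
}
ground_truth = {
    "drug_a": {"drug_c", "drug_d"},
    "drug_e": {"drug_z"},
}

k = 3
mean_hit = sum(hit_at_k(predictions[d], ground_truth[d], k) for d in predictions) / len(predictions)
mean_precision = sum(precision_at_k(predictions[d], ground_truth[d], k) for d in predictions) / len(predictions)
print(mean_hit, mean_precision)
```

Averaging over all query drugs gives one Hit@k and one Precision@k value per choice of k, which makes different feature combinations easy to compare.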

## 9. Results & Business Insights
(Top-5 substitute lists, SI and SRI indices, and managerial implications)
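
The formulas for SI and SRI are not defined at this point in the notebook, so the cell below is purely a hypothetical sketch of how such indicators could be computed. It assumes that SI for a drug is the mean similarity of its top-k candidates, that the class-level SI is the mean over drugs in the class, and that SRI is simply 1 − SI with similarities in [0, 1]. All names and numbers are illustrative.

```python
from collections import defaultdict


def substitutability_index(top_k_sims):
    """Assumed SI: mean similarity of a drug's top-k candidates (0.0 if none)."""
    return sum(top_k_sims) / len(top_k_sims) if top_k_sims else 0.0


def class_level_indices(drug_classes, drug_top_k_sims):
    """Average the assumed SI per class and derive SRI as 1 - SI (assumption)."""
    per_class = defaultdict(list)
    for drug, cls in drug_classes.items():
        per_class[cls].append(substitutability_index(drug_top_k_sims.get(drug, [])))
    result = {}
    for cls, values in per_class.items():
        si = sum(values) / len(values)
        result[cls] = {"SI": si, "SRI": 1.0 - si}
    return result


# Hypothetical classes and top-2 similarity scores per drug.
drug_classes = {"d1": "CLASS_A", "d2": "CLASS_A", "d3": "CLASS_B"}
drug_top_k_sims = {"d1": [0.9, 0.7], "d2": [0.8, 0.8], "d3": [0.2, 0.4]}

indices = class_level_indices(drug_classes, drug_top_k_sims)
for cls, values in indices.items():
    print(cls, round(values["SI"], 2), round(values["SRI"], 2))
```

Under these assumptions, a class with a low SI has few close alternatives and therefore a high SRI. A fuller SRI could also include supply-side factors, but those are outside the data used in this study.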

## 10. Export & Figures
(Saving figures and outputs for the final report in PDF format)

---


# REFERENCES
1. Aronson, J.K., Ferner, R.E., & Heneghan, C. (2023a). Drug shortages. Part 1: Definitions and harms. British Journal of Clinical Pharmacology, 89(10), pp. 2950–2956. [Online] Available at: https://bpspubs.onlinelibrary.wiley.com/doi/10.1111/bcp.15842 [Accepted 20 Jun 2023].
2. Aronson, J.K., Ferner, R.E., & Heneghan, C. (2023b). Drug shortages. Part 2: Trends, causes and solutions. British Journal of Clinical Pharmacology, 89(10), pp. 2957–2963. [Online] Available at: https://doi.org/10.1111/bcp.15853 [Accepted 20 Jun 2023].
3. Andy, A., & Andy, D. (2023). Drug Shortages in Pharmacies: Root Causes, Consequences and the Role of the FDA in Mitigation Strategies. Progress in Medical Sciences Journal, 7(5), pp. 1–7. [Online] Available at: https://doi.org/10.47363/PMS/2023(7)E129 [Accepted 20 Oct 2023].
4. Iyer, S.S. (2025). Data-Driven Decision Making: The Key to Future Health Care Business Success. RA Journal of Applied Research, 11(3), pp. 115–136. [Online] Available at: https://doi.org/10.47191/rajar/v11i3.06 [Accepted 03 Mar 2025].
5. Kauffman, J., Miotto, R., Klang, E., Costa, A., Norgeot, B., Zitnik, M., Khader, S., Wang, F., Nadkarni, G.N., & Glicksberg, B.S. (2025). Embedding Methods for Electronic Health Record Research. Annual Review of Biomedical Data Science, 8, pp. 563–590. [Online] Available at: https://doi.org/10.1146/annurev-biodatasci-103123-094729 [Accepted 01 May 2025].
