# Detecting Adverse Drug Events From Conversational Texts
Adverse Drug Events (ADEs) are potentially very dangerous to patients and are top causes of morbidity and mortality. Many ADEs are hard to discover as they happen to certain groups of people in certain conditions and they may take a long time to expose. Healthcare providers conduct clinical trials to discover ADEs before selling the products but normally are limited in numbers. Thus, post-market drug safety monitoring is required to help discover ADEs after the drugs are sold on the market.

Less than 5% of ADEs are reported via official channels and the vast majority is described in free-text channels: emails & phone calls to patient support centers, social media posts, sales conversations between clinicians and pharma sales reps, online patient forums, and so on. This requires pharmaceuticals and drug safety groups to monitor and analyze unstructured medical text from a variety of jargons, formats, channels, and languages - with needs for timeliness and scale that require automation.

#### Use cases:

* Conversational Texts ADE Classification
* Detecting ADE and Drug Entities From Texts
* Analysis of Drug and ADE Entities
* Finding Drugs and ADEs Have Been Talked Most
* Detecting Most Common Drug-ADE Pairs
* Checking Assertion Status of ADEs
* Relations Between ADEs and Drugs

In [1]:
# initial config
%run "./00_setup.ipynb"

### Import Libraries

In [8]:
# initializing Spark
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

from delta import *

### Starting PySpark Session

In [2]:
# os.environ["HADOOP_HOME"] = "C:\\Users\\yraj\\Work\\Spark\\spark-3.2.4-bin-hadoop2.7"

In [6]:
import findspark
findspark.init()

In [9]:
findspark.find()

'C:\\Users\\yraj\\Work\\Spark\\spark-3.2.4-bin-hadoop2.7'

In [11]:
spark = SparkSession.builder.getOrCreate()

In [12]:
spark

We have created a script containing a class which consists of methods for creating paths and downloading data

In [14]:
util = Util('Drugs & Adverse Events')

C:\Users\yraj\Work\POCs\Drugs & Adverse Events\data is already present
C:\Users\yraj\Work\POCs\Drugs & Adverse Events\delta is already present


2023/07/04 10:40:07 INFO mlflow.tracking.fluent: Experiment with name 'C:\Users\yraj\Work\POCs\Drugs & Adverse Events\ade-llm' does not exist. Creating a new experiment.


In [15]:
util.print_paths()

root folder           : C:\Users\yraj\Work\POCs\Drugs & Adverse Events
raw data location     : C:\Users\yraj\Work\POCs\Drugs & Adverse Events\data
delta sables location : C:\Users\yraj\Work\POCs\Drugs & Adverse Events\delta


In [16]:
util.display_data()

****************************************************************************************************
data available in C:\Users\yraj\Work\POCs\Drugs & Adverse Events\data are:
****************************************************************************************************
ADE-NEG.txt
DRUG-AE.rel


Storing Configuration File as '.json'

In [17]:
if 'config' not in locals():
  config = {}

In [29]:
config['base_path'] = 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events'
config['data_path'] = 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events\\data'
config['delta_path'] = 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events\\delta'
config['vector_store_path'] = 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events\\data\\vector_store'
config['registered_model_name'] = 'ade-llm'
config['embedding_model_name'] = 'all-MiniLM-L12-v2'
config['openai_chat_model'] = 'gpt-3.5-turbo'
config['system_message_template'] = """You are a helpful assistant built by Yash, you are good at helping classification of drug and it's affect based on the context provided, the context is a document. If the context does not provide enough relevant information to determine the answer, just say I don't know. If the context is irrelevant to the question, just say I don't know. If you did not find a good answer from the context, just say I don't know. If the query doesn't form a complete question, just say I don't know. If there is a good answer from the context, try to summarize the context to answer the question."""
config['human_message_template'] = """Given the context: {context}. Classify the drug and it's affect {statement}."""

In [32]:
import json
with open('./include/config.json', 'w') as file:
    json.dump(config, file)

In [33]:
with open('./include/config.json') as file:
    config = json.load(file)

In [34]:
config

{'base_path': 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events',
 'data_path': 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events\\data',
 'delta_path': 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events\\delta',
 'vector_store_path': 'C:\\Users\\yraj\\Work\\POCs\\Drugs & Adverse Events\\data\\vector_store',
 'registered_model_name': 'ade-llm',
 'embedding_model_name': 'all-MiniLM-L12-v2',
 'openai_chat_model': 'gpt-3.5-turbo',
 'system_message_template': "You are a helpful assistant built by Yash, you are good at helping classification of drug and it's affect based on the context provided, the context is a document. If the context does not provide enough relevant information to determine the answer, just say I don't know. If the context is irrelevant to the question, just say I don't know. If you did not find a good answer from the context, just say I don't know. If the query doesn't form a complete question, just say I don't know. If there is a good answer from the context, t

## Downloading Dataset
We will use a slightly modified version of some conversational ADE texts which are downloaded from https://sites.google.com/site/adecorpus/home/document. See
>[Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports](https://www.sciencedirect.com/science/article/pii/S1532046412000615)
for more information about this dataset.

**We will work with two main files in the dataset:**

- DRUG-AE.rel : Conversations with ADE.
- ADE-NEG.txt : Conversations with no ADE.

Lets get started with downloading these files.

In [35]:
for file in ['DRUG-AE.rel', 'ADE-NEG.txt']:
    try:
        os.listdir(f'{util.data_path}\\{file}')
        print(f'{file} is already downloaded')
    except:
        util.load_remote_data(f'https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/{file}')

****************************************************************************************************
downloading file DRUG-AE.rel to C:\Users\yraj\Work\POCs\Drugs & Adverse Events\data
****************************************************************************************************
****************************************************************************************************
downloading file ADE-NEG.txt to C:\Users\yraj\Work\POCs\Drugs & Adverse Events\data
****************************************************************************************************


In [36]:
# display the files downloaded
util.display_data()

****************************************************************************************************
data available in C:\Users\yraj\Work\POCs\Drugs & Adverse Events\data are:
****************************************************************************************************
ADE-NEG.txt
DRUG-AE.rel


### dataframe for negative ADE texts

In [16]:
neg_df = (
  spark.read.text(f"{util.data_path}\\ADE-NEG.txt")
  .selectExpr("split(value,'NEG')[1] as text","1!=1 as is_ADE")
  .drop_duplicates()
)

In [17]:
# display(neg_df.limit(20))
(neg_df.limit(20).collect())

[Row(text=' The patient was extubated 1 week later.', is_ADE=False),
 Row(text=' No abnormalities were identified on review of collection and processing records.', is_ADE=False),
 Row(text=' Hereditary angio-oedema is rare and potentially life-threatening, being characterised by recurrent episodes of perioral or laryngeal oedema.', is_ADE=False),
 Row(text=' Intrathecal chemotherapy with methotrexate or cytosine arabinoside is the standard approach to prophylaxis and treatment of central nervous system leukemia in children.', is_ADE=False),
 Row(text=" Infliximab, a chimeric monoclonal antibody targeting tumor necrosis factor alpha (TNF-alpha), is efficacious in the treatment of rheumatoid arthritis and Crohn's disease.", is_ADE=False),
 Row(text=' Neutropenic colitis has been thought to be a serious gastrointestinal complication associated with chemotherapy for hematological malignancy.', is_ADE=False),
 Row(text=' These reports suggest the possibility that the risk of developing hype

### dataframe for positive ADE texts

In [18]:
pos_df = (
  spark.read.csv(f"{util.data_path}\\DRUG-AE.rel", sep="|", header=None)
  .selectExpr("_c1 as text", "1==1 as is_ADE")
  .drop_duplicates()
)

In [19]:
pos_df.limit(10).show()

+--------------------+------+
|                text|is_ADE|
+--------------------+------+
|Vancomycin is the...|  true|
|Successful desens...|  true|
|The case concerns...|  true|
|Four days after i...|  true|
|Nineteen cases of...|  true|
|Ten days after it...|  true|
|Two cases of siro...|  true|
|DISCUSSION: Ampho...|  true|
|Acute coronary ev...|  true|
|Patients from end...|  true|
+--------------------+------+



### dataframe for all conversational texts with labels

In [20]:
raw_data_df=neg_df.union(pos_df).selectExpr('uuid() as id','*').orderBy('id')

In [21]:
raw_data_df.show()

+--------------------+--------------------+------+
|                  id|                text|is_ADE|
+--------------------+--------------------+------+
|00039966-6c72-40a...| Emergent operati...| false|
|0007fb36-1578-48b...| A patient is des...| false|
|0008e34c-b4bf-499...| BACKGROUND: The ...| false|
|0009ee0b-0306-4a3...| Although isoprot...| false|
|000f5c29-1b5f-41b...| Although it is u...| false|
|000fb368-c91a-462...| In each case, th...| false|
|0011d2f8-5638-44c...| Awareness of thi...| false|
|00191977-7f4b-468...| CONCLUSION: It i...| false|
|001de9a2-64ef-476...| Two probands had...| false|
|0026faff-1311-4b2...| Phenytoin and ca...| false|
|00275fcb-e619-456...| The patient requ...| false|
|00282700-21f8-47b...|Ulcerating enteri...|  true|
|002b86dc-f796-4af...| Generalized argy...| false|
|002e2af0-3041-486...| To our knowledge...| false|
|002f26d1-8b5e-499...| Historically, co...| false|
|00300159-f099-456...| Treatment was in...| false|
|00337c69-c481-47b...| The evid

In [30]:
raw_data_df.count()

20896

## write ade_events to Delta
We will combine the two dataframe and store the dat ain bronze delta layer

In [22]:
raw_data_df.repartition(12).write.format('delta').mode('overwrite').save(f'{util.delta_path}/bronze/ade_events')

### reading the data

In [24]:
df=spark.read.load(f'{util.delta_path}/bronze/ade_events').orderBy(F.rand(seed=42)).repartition(64).cache()

In [29]:
(df.limit(20).show())

+--------------------+--------------------+------+
|                  id|                text|is_ADE|
+--------------------+--------------------+------+
|9e30ac5d-8f3d-456...| Its duration of ...| false|
|e1b28265-fd24-4d7...| An angiogram sho...| false|
|2bed409e-2f8f-488...| We report a 14-y...| false|
|cde4ccbb-53c7-43c...| The objective of...| false|
|75d85cf6-c386-48d...| Perforated appen...| false|
|0f52fbd0-564d-476...| CONCLUSIONS: Thi...| false|
|d8b04629-3ddb-439...| Treatment was st...| false|
|3ef074bf-bfe3-4ca...| At 26 weeks' ges...| false|
|188654af-33e5-4b7...| The mother had u...| false|
|52188914-747c-4b3...| Detailed immunol...| false|
|e1a43c5e-58f9-4b1...| Early detection ...| false|
|6784d899-bb4d-421...| To the best of o...| false|
|57a5d6e8-1182-45d...| PURPOSE: We repo...| false|
|949ed5b8-1df2-493...| Neuroimaging sho...| false|
|5583f0c7-0cad-45a...| The liver other ...| false|
|6d7d0461-0382-407...|Neuroleptic malig...|  true|
|f25a72b3-176a-425...| This pat

In [31]:
df.count()

20896

In [37]:
spark.stop()