<h1>NLP Homework Number 3</h1>

<h2>Predict Drug Title</h2>
<p>In this homework we predict the name of a drug from its description. We compare two language models for embedding the descriptions, using the DrugBank data as instructed. Before applying either model, we preprocess the text: we lowercase and lemmatize words and remove stopwords, punctuation, tokens containing digits, and anything between parentheses or brackets. We also remove the 7 most frequent words across all descriptions, because they carry little useful information. For the first model, we use a pre-trained Fasttext model and embed a sentence as the average of the embeddings of all its words; by "sentence" we mean the whole preprocessed description of a drug. For the second, we use MPNET rather than a BERT-family model, since the checkpoint we use is designed for sentence-similarity tasks. Finally, we visualize the embeddings and predictions and compare the top 3 drug names suggested by Fasttext and MPNET.</p>

# Installing Packages

In [None]:
!pip install xmltodict -q
!apt -qq install p7zip-full
!pip install fasttext -q
!pip install -U sentence-transformers -q

p7zip-full is already the newest version (16.02+dfsg-8).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


# Importing Packages

In [None]:
from gdown import download
import xmltodict
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re
import fasttext.util
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import plotly.express as px
import plotly.graph_objects as go
from sklearn.decomposition import PCA

# Downloading and Decompressing

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
download('https://drive.google.com/uc?id=1JgKileUJNIBxop3Ekod-6FMj67w85KR8', 'drugbank_full_database.zip')
!7za x -p"nlp1402" drugbank_full_database.zip

In [None]:
!cp "/content/drive/MyDrive/CurrentData/Copy of drugbank_full_database.zip" "/content/drugbank_full_database.zip"
!apt install p7zip-full
!7za x -p"nlp1402" drugbank_full_database.zip

# Reading Data

In [None]:
with open("full database.xml") as db:
    doc = xmltodict.parse(db.read())
print('Num of Drugs:', len(doc['drugbank']['drug']))

data = []
empty_des = []
fields = ('description', 'indication', 'pharmacodynamics')
for d in doc['drugbank']['drug']:
  # Join whichever of the three text fields are present, in a fixed order
  parts = [d[f] for f in fields if d[f] is not None]
  if parts:
    data.append({'description': '. '.join(parts), 'label': d['name']})
  else:
    empty_des.append(d['name'])

print('Num of Drugs without Description', len(empty_des))

df = pd.DataFrame(data)
df.to_csv('/content/drive/MyDrive/a/drug_df.csv', index=False)
df.head()

6259


Unnamed: 0,description,label
0,Lepirudin is a recombinant hirudin formed by 6...,Lepirudin
1,Cetuximab is a recombinant chimeric human/mous...,Cetuximab
2,Dornase alfa is a biosynthetic form of human d...,Dornase alfa
3,A recombinant DNA-derived cytotoxic protein co...,Denileukin diftitox
4,Dimeric fusion protein consisting of the extra...,Etanercept


# Preprocessing

In [None]:

df = pd.read_csv('/content/drive/MyDrive/CurrentData/drug_df.csv')
print(df.shape)
df.head()

(8976, 2)


Unnamed: 0,description,label
0,Lepirudin is a recombinant hirudin formed by 6...,Lepirudin
1,Cetuximab is a recombinant chimeric human/mous...,Cetuximab
2,Dornase alfa is a biosynthetic form of human d...,Dornase alfa
3,A recombinant DNA-derived cytotoxic protein co...,Denileukin diftitox
4,Dimeric fusion protein consisting of the extra...,Etanercept


In [None]:
class Preprocessor:

    def preprocess(self, df):
        label = df['label']
        label = label.lower()
        description = df['description']
        description = self.remove_between_parentheses_and_brackets(description)
        description = self.remove_digits(description)
        words = self.word_tokenizes(description)
        words = self.remove_stopwords(words)
        words = self.remove_punctuations(words)
        words = self.normalize(words)
        words = self.lemmatize(words)
        while(label in words):
            words.remove(label)
        txt = ' '.join(words)
        return txt

    def remove_between_parentheses_and_brackets(self, text):
        pattern = r'\([^)]*\)|\[[^]]*\]'
        text = re.sub(pattern, '', text)
        return text

    def remove_digits(self, text):
        pattern = r'\b\w*\d\w*\b'
        text = re.sub(pattern, '', text)
        return text

    def word_tokenizes(self, text):
        tokens = word_tokenize(text)
        return tokens

    def remove_stopwords(self, words):
        stop_words = set(stopwords.words('english'))
        stop_words.add('also')
        stop_words.add('s')
        words = [word for word in words if word.lower() not in stop_words]
        return words

    def remove_punctuations(self, words):
        # Strip punctuation characters and drop tokens that become empty
        table = str.maketrans('', '', string.punctuation)
        stripped = (word.translate(table) for word in words)
        return [word for word in stripped if word]

    def normalize(self, words):
        normalized_words = [word.lower() for word in words]
        return normalized_words

    def lemmatize(self, words):
        lemmatizer = WordNetLemmatizer()
        lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
        return lemmatized_words

preprocess = Preprocessor()

In [None]:
df_copy = df.copy()
df_copy['description'] = df.apply(preprocess.preprocess, axis=1)
df_copy.head()
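
To make the two regex steps of the preprocessing concrete, here is a minimal standalone sketch that reuses the same patterns as `remove_between_parentheses_and_brackets` and `remove_digits`. The helper name `strip_noise` and the sample sentence are made up for illustration; they are not part of the notebook or of DrugBank.

```python
import re

# Same patterns as in the Preprocessor class above
BRACKETED = r'\([^)]*\)|\[[^]]*\]'   # text inside (...) or [...]
HAS_DIGIT = r'\b\w*\d\w*\b'          # any token that contains a digit

def strip_noise(text):
    # Hypothetical helper: apply both removals, then tidy the whitespace
    text = re.sub(BRACKETED, '', text)
    text = re.sub(HAS_DIGIT, '', text)
    return ' '.join(text.split())

# Made-up sample sentence, for illustration only
sample = "Lepirudin (rDNA) is formed by 65 amino acids [L41539] and binds factor IIa."
cleaned = strip_noise(sample)
print(cleaned)
```

Note that the digit pattern removes whole tokens such as `65`, not just the digit characters, while `IIa` survives because it contains no digit.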

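Removing the most **frequent** words from the descriptions keeps near-universal terms such as "treatment" or "drug" from dominating the embeddings, which should make individual descriptions easier to tell apart.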

In [None]:
def remove_frequent(data):
  pattern = r'\btreatment\b|\bused\b|\bpatient\b|\beffect\b|\bdrug\b|\btrial\b|\bcell\b' # based on nltk.FreqDist having frequency more than 2500
  return re.sub(pattern, '', data)

all_data = []
for row in df_copy['description'].apply(str.split):
  all_data += row
freq = nltk.FreqDist(all_data).most_common(7)
print(freq)

In [None]:
df_copy['description'] = df_copy['description'].apply(remove_frequent)
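
The frequent-word list above is hard-coded from a `FreqDist` run. As a rough sketch of how the same idea could be derived directly from the corpus, the snippet below counts tokens with `collections.Counter` and builds the removal pattern from the top-k words. The toy `docs` list and the choice of `k = 2` are assumptions for illustration only.

```python
from collections import Counter
import re

# Hypothetical, already-preprocessed descriptions (illustrative only)
docs = [
    "drug used treatment infection",
    "drug treatment hypertension",
    "antibody treatment lymphoma",
]

# Count tokens over the whole corpus and treat the top-k as "too frequent"
counts = Counter(token for doc in docs for token in doc.split())
top_k = [word for word, _ in counts.most_common(2)]

# Build one word-boundary pattern from the selected words
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, top_k)) + r')\b')

def drop_frequent(text):
    # Remove the frequent words, then normalize the leftover whitespace
    return ' '.join(pattern.sub('', text).split())

filtered = [drop_frequent(doc) for doc in docs]
print(top_k)
print(filtered)
```

Deriving the list this way keeps the pattern in sync with the data if the preprocessing changes, at the cost of one extra pass over the corpus.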

# Fasttext Model

In [None]:
# Generate a sentence embedding as the average of the embeddings of all its words
def get_fasttext_embedding(data, model):
  tokens = data.split()
  if not tokens:
    # Empty description after preprocessing: return a zero vector instead of NaN
    return np.zeros(model.get_dimension(), dtype=np.float32)
  embeddings = np.array([model.get_word_vector(token) for token in tokens])
  return np.mean(embeddings, axis=0)

# Calculate Cosine Similarity between two given vectors
def calc_similarity(vec1, vec2):
  vec1, vec2 = vec1.reshape(1, -1), vec2.reshape(1, -1)
  return cosine_similarity(vec1, vec2)

ft = fasttext.load_model('/content/drive/MyDrive/CurrentData/cc.en.300.bin') # Loading Pre-trained fasttext model



In [None]:
df_copy['fasttext_embedding'] = df_copy['description'].apply(get_fasttext_embedding, model=ft) # Add embeddings as a new column
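
The averaging step can be illustrated without loading the 300-dimensional model. The sketch below uses a made-up dictionary of 3-dimensional vectors in plain Python; `toy_vectors` and `mean_embedding` are hypothetical names. Unlike this toy lookup, a real Fasttext model also produces vectors for unseen words from subword information, so skipping unknown tokens here is a simplification.

```python
# Toy 3-dimensional word vectors (made up); the real model returns 300 dimensions
toy_vectors = {
    "cardiac": [1.0, 0.0, 2.0],
    "depressant": [3.0, 2.0, 0.0],
}

def mean_embedding(text, vectors, dim=3):
    tokens = [t for t in text.split() if t in vectors]
    if not tokens:
        # No usable tokens: fall back to a zero vector instead of dividing by zero
        return [0.0] * dim
    # Average each coordinate across the token vectors
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]

print(mean_embedding("cardiac depressant", toy_vectors))  # [2.0, 1.0, 1.0]
print(mean_embedding("depressant cardiac", toy_vectors))  # same result: averaging ignores word order
print(mean_embedding("", toy_vectors))                    # [0.0, 0.0, 0.0]
```

Because the mean does not depend on token order, this kind of sentence embedding is a bag-of-words representation: two descriptions with the same words in a different order get identical vectors.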

# MPNET Model

Here we use `MPNET` through the `all-mpnet-base-v2` Sentence-Transformers checkpoint, a model trained for sentence similarity. We treat each drug description as a single sentence and expect that reordering the words of a description still yields a similar embedding.

In [None]:
# Generate Embeddings based on given Descriptions
def transformers_embeddings(description, model):
    return model.encode(description)

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2') # Loading Sentence-Transformers

In [None]:
df_copy['mpnet_embedding'] = df_copy['description'].apply(transformers_embeddings, model=model) # Add embeddings as a new column

# Evaluation

In [None]:
def string_to_list(string_list):
  # Parse the string form of a NumPy array (e.g. "[0.1 0.2 ...]") back into a float32 vector
  values = string_list.strip().strip('[]').split()
  return np.array(values, dtype=np.float32)

df_copy = pd.read_csv('/content/drive/MyDrive/CurrentData/drug_df_final.csv') # Embeddings for both models (MPNET and Fasttext) are loaded from disk, because regenerating them on every run takes too long.
df_copy.fasttext_embedding = df_copy.fasttext_embedding.apply(string_to_list)
df_copy.mpnet_embedding = df_copy.mpnet_embedding.apply(string_to_list)
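
The parsing is needed because a CSV stores each embedding as the text form of an array, brackets and line breaks included. The following plain-Python sketch shows the round trip on a made-up string; `parse_embedding` is a hypothetical helper that returns a list of floats rather than a NumPy array.

```python
def parse_embedding(text):
    # Drop the surrounding brackets, then split on any whitespace (including newlines)
    return [float(x) for x in text.strip().strip('[]').split()]

# Shape of what a stored array string can look like after a CSV round trip (values are made up)
stored = "[ 0.25 -1.5\n  3.0 ]"
vector = parse_embedding(stored)
print(vector)  # [0.25, -1.5, 3.0]
```

An alternative worth considering is to keep the embeddings out of the CSV altogether, for example by saving the stacked matrix with `np.save`, which would avoid text parsing and preserve the values exactly.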

In [None]:
# Return top 3 drugs with highest cosine similarity to embeddings of given input
def search_over_embeds(model, data, input, embedding='fasttext_embedding'):
  # Creating single row DataFrame for input with no label.
  temp_df = pd.DataFrame([{"description" : input, 'label': '[no label]'}])
  temp_df['description'] = temp_df.apply(preprocess.preprocess, axis=1) # Preprocessing
  if embedding == 'fasttext_embedding': # Generate embeddings for input
    embed = temp_df['description'].apply(get_fasttext_embedding, model=model)
  else:
    embed = temp_df['description'].apply(transformers_embeddings, model=model)
  temp_df[embedding] = embed

  # Cosine similarity between the input embedding and every stored drug embedding
  query = np.asarray(embed.iloc[0]).reshape(1, -1)
  similarity_scores = cosine_similarity(np.stack(data[embedding].to_list()), query).ravel().tolist()

  # Indices of the three most similar drugs, best match first
  top_3_indices = sorted(range(len(similarity_scores)), key=lambda i: similarity_scores[i], reverse=True)[:3]
  top_3_labels = [data['label'].iloc[i] for i in top_3_indices]

  return top_3_labels, similarity_scores, embed
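
The search itself is a nearest-neighbour lookup under cosine similarity. As a dependency-free sketch of that logic, the snippet below scores a query against a handful of labelled vectors and keeps the best `k`. The labels and 2-D vectors are invented for illustration, and `cosine` and `top_k` are hypothetical helpers, not functions from the notebook.

```python
import heapq
import math

def cosine(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, labelled_vectors, k=3):
    # Score every candidate, then keep the k highest-scoring labels
    scored = [(cosine(query, vec), label) for label, vec in labelled_vectors]
    return [label for _, label in heapq.nlargest(k, scored)]

# Hypothetical 2-D embeddings, for illustration only
candidates = [
    ("drug_a", [1.0, 0.0]),
    ("drug_b", [0.0, 1.0]),
    ("drug_c", [1.0, 1.0]),
    ("drug_d", [-1.0, 0.0]),
]
best = top_k([1.0, 0.1], candidates, k=2)
print(best)  # ['drug_a', 'drug_c']
```

`heapq.nlargest` avoids sorting the full score list, which matters little for a few thousand drugs but keeps the intent (only the top few are needed) explicit.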

In [None]:
df[df['description'].astype(str).map(len) < 100] # For validation, we only use descriptions shorter than 100 characters

Unnamed: 0,description,label
1290,A semi-synthetic cephalosporin antibiotic.,Cefradine
1318,A metallic element that has the atomic number ...,Aluminium
1363,Aprindine is a cardiac depressant used in arrh...,Aprindine
1389,alpha-methylthiofentanyl is an opioid analgesi...,alpha-methylthiofentanyl
1398,Aminorex is an amphetamine-like anorectic agen...,Aminorex
...,...,...
8958,Obecabtagene autoleucel is an investigational ...,Obecabtagene autoleucel
8959,AMG-119 are CAR-T cells targeting delta-like l...,AMG-119
8967,P-BCMA-101 is an autologous CAR-T therapy deve...,P-BCMA-101
8969,Zanolimumab is a fully human monoclonal antibo...,Zanolimumab


In [None]:
# Top 3 based on Fasttext
top3_fasttext, scores_fasttext, embed_fasttext = search_over_embeds(ft, df_copy, 'extractor of allergenic', embedding='fasttext_embedding')
print('Fasttext Drug Suggestion:', top3_fasttext)

# Top 3 based on MPNET
top3_mpnet, scores_mpnet, embed_mpnet = search_over_embeds(model, df_copy, 'extractor of allergenic', embedding='mpnet_embedding')
print('MPNET Drug Suggestion:', top3_mpnet)

Fasttext Drug Suggestion: ['Ginger', 'Orris', 'Rabbit']
MPNET Drug Suggestion: ['Ginger', 'Orris', 'Rabbit']


## Plotting
For a better understanding, we plot the embedding of the user's description alongside the embeddings of 2000 randomly chosen drugs, after projecting both to 2D with PCA. Points are colored by their cosine similarity to the given description.

### Fasttext Plotting

In [None]:
pca = PCA(n_components=2)
embedding_2d = pca.fit_transform(np.stack(df_copy.fasttext_embedding))
drug_embedding = pca.transform(np.stack(embed_fasttext))
embedding_2d = pd.DataFrame(embedding_2d, columns=['X', 'Y'])
embedding_2d['label'] = df_copy['label']
embedding_2d['score'] = scores_fasttext
random_drugs = np.random.choice(embedding_2d.shape[0], 2000, replace=False) # Sample 2000 distinct rows
embedding_2d = embedding_2d.iloc[random_drugs]
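
When drawing a random subset of rows for a plot, it matters whether indices are drawn with or without replacement: drawing with replacement (as `np.random.randint` does) can select the same drug more than once, so fewer than 2000 distinct points may appear. The sketch below contrasts the two approaches using only the standard library; the sizes `n_rows` and `n_sample` are arbitrary small values chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n_rows, n_sample = 10, 8

# With replacement: the same row index can be drawn more than once
with_repl = [random.randrange(n_rows) for _ in range(n_sample)]

# Without replacement: every drawn index is distinct
without_repl = random.sample(range(n_rows), n_sample)

print(len(set(with_repl)), len(set(without_repl)))
```

In pandas, `DataFrame.sample(n=2000)` also samples without replacement by default and would be a compact alternative.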

In [None]:
fig = px.scatter(embedding_2d, x="X", y="Y", color='score')
fig.update_layout(
    height=1000,
    title_text='Drugs Fasttext Embedding Chart'
)
fig.add_traces(
    px.scatter(x=[drug_embedding[0, 0]], y=[drug_embedding[0, 1]]).update_traces(marker_size=10, marker_color="black", marker_symbol='cross').data
)
fig.show()

### MPNET Plotting

In [None]:
pca = PCA(n_components=2)
embedding_2d = pca.fit_transform(np.stack(df_copy.mpnet_embedding))
drug_embedding = pca.transform(np.stack(embed_mpnet))
embedding_2d = pd.DataFrame(embedding_2d, columns=['X', 'Y'])
embedding_2d['label'] = df_copy['label']
embedding_2d['score'] = scores_mpnet
random_drugs = np.random.choice(embedding_2d.shape[0], 2000, replace=False) # Sample 2000 distinct rows
embedding_2d = embedding_2d.iloc[random_drugs]

In [None]:
fig = px.scatter(embedding_2d, x="X", y="Y", color='score')
fig.update_layout(
    height=1000,
    title_text='Drugs MPNET Embedding Chart'
)
fig.add_traces(
    px.scatter(x=[drug_embedding[0, 0]], y=[drug_embedding[0, 1]]).update_traces(marker_size=10, marker_color="black", marker_symbol='cross').data
)
fig.show()

In [None]:
df[(df['label'] == top3_fasttext[0]) | (df['label'] == top3_fasttext[1]) | (df['label'] == top3_fasttext[2])]

Unnamed: 0,description,label
4030,Ginger allergenic extract is used in allergeni...,Ginger
4093,Orris allergenic extract is used in allergenic...,Orris
4094,Rabbit allergenic extract is used in allergeni...,Rabbit


## Comparing the Drugs Suggested by Each Model

In [None]:
# Top 3 based on Fasttext
top3_fasttext, scores_fasttext, embed_fasttext = search_over_embeds(ft, df_copy, 'arrhythmias cardiac depressant', embedding='fasttext_embedding')
print('Fasttext Drug Suggestion:', top3_fasttext)

# Top 3 based on MPNET
top3_mpnet, scores_mpnet, embed_mpnet = search_over_embeds(model, df_copy, 'arrhythmias cardiac depressant', embedding='mpnet_embedding')
print('MPNET Drug Suggestion:', top3_mpnet)

Fasttext Drug Suggestion: ['Aprindine', 'ACY001', 'Procainamide']
MPNET Drug Suggestion: ['Aprindine', 'Disopyramide', 'Flecainide']


In [None]:
df[(df['label'] == top3_fasttext[0]) | (df['label'] == top3_fasttext[1]) | (df['label'] == top3_fasttext[2])]

Unnamed: 0,description,label
1016,A derivative of procaine with less CNS action....,Procainamide
1363,Aprindine is a cardiac depressant used in arrh...,Aprindine
2929,Investigated for use/treatment in cardiac isch...,ACY001


In [None]:
df[(df['label'] == top3_mpnet[0]) | (df['label'] == top3_mpnet[1]) | (df['label'] == top3_mpnet[2])]

Unnamed: 0,description,label
269,A class I anti-arrhythmic agent (one that inte...,Disopyramide
1174,Flecainide is a Class I anti-arrhythmic agent ...,Flecainide
1363,Aprindine is a cardiac depressant used in arrh...,Aprindine


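As we can see above, for this query `MPNET` suggests more relevant drugs than `Fasttext`: besides Aprindine, `MPNET` returns two Class I anti-arrhythmic agents (Disopyramide and Flecainide), whereas `Fasttext` also returns ACY001, an investigational compound.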
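
The comparison above is qualitative and based on a single query. One way it could be made quantitative, sketched here under the assumption that each short validation description is used as a query with its own drug name as the true label, is a top-k hit rate. The helper `hit_rate_at_k` and the rankings in `demo` are invented for illustration and are not results from this notebook.

```python
def hit_rate_at_k(results, k=3):
    # results: list of (true_label, ranked_suggestions) pairs
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results) if results else 0.0

# Made-up rankings for illustration; not results from this notebook
demo = [
    ("drug_a", ["drug_a", "drug_x", "drug_y"]),
    ("drug_b", ["drug_x", "drug_y", "drug_b"]),
    ("drug_c", ["drug_x", "drug_y", "drug_z"]),
    ("drug_d", ["drug_y", "drug_d", "drug_x"]),
]
print(hit_rate_at_k(demo, k=3))  # 0.75
print(hit_rate_at_k(demo, k=1))  # 0.25
```

Computing this number once per model over the same set of queries would give a single figure to compare `MPNET` and `Fasttext` on.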