<a href="https://colab.research.google.com/github/amitchug/ALMlops/blob/main/M3_NB_MiniProject_2_Keywords_Extraction_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Keywords Extraction using Transformer

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA, and feature extraction on the Medical Transcription dataset
* build transformer components: positional embedding, encoder, decoder, etc.
* train a transformer model for keywords extraction
* create a function to perform inference using the trained transformer
* use the Gradio library to generate a customizable UI for displaying the extracted keywords

## Dataset description

The dataset used in this project is the **Medical transcription** dataset. It contains sample medical transcriptions for various medical specialties.

The data is in CSV format with the following features:

- **description**

- **medical_specialty**

- **sample_name**

- **transcription**

- **keywords**

## Grading = 10 Points

## Information

Medical transcriptions are textual records of patient-doctor interactions, medical procedures, clinical findings, and more. Extracting keywords from these transcriptions can provide valuable insights into a patient's health status, medical history, and treatment plans.

* Significance:

  - Data Summarization: Keyword extraction helps in summarizing lengthy medical transcriptions, making it easier for healthcare professionals to quickly understand the patient's medical history.

  - Search and Retrieval: Extracted keywords can be used to index medical records, facilitating faster search and retrieval of relevant documents.

  - Trend Analysis: By analyzing frequently occurring keywords, healthcare institutions can identify common ailments, treatment outcomes, and more.

* Applications:

  - Clinical Decision Support: Extracted keywords can be used to develop clinical decision support systems that provide real-time suggestions to healthcare professionals.
  - Patient Monitoring: By continuously analyzing the keywords from a patient's medical transcriptions, healthcare systems can monitor the patient's health and predict potential health risks.
  - Research: Medical researchers can use extracted keywords to identify trends, study disease outbreaks, and understand treatment efficacies.
  - Billing and Insurance: Keywords can help in automating the medical coding process, which is essential for billing and insurance claims.

### Problem Statement

Build a transformer model that performs keywords extraction on the Medical Transcription dataset.

**Note:**
> For some steps, such as creating a positional embedding layer or the transformer encoder and decoder blocks, you may need to refer to the ***M3 Assignment-5 on Transformer_Decoder***.

### Import required packages

In [None]:
import numpy as np
import pandas as pd
import re
import random
import string
from string import digits
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences

import warnings
pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")

In [None]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/Datasets/Medical_transcription_dataset.csv
!ls | grep ".csv"

**Exercise 1: Read the Medical_transcription_dataset.csv dataset**

**Hint:** pd.read_csv()

In [None]:
# Load the dataset
# YOUR CODE HERE

### Pre-processing and EDA

**Exercise 2: Perform the following operations on the dataset [0.5 Mark]**

- Remove unnecessary columns - 'Unnamed: 0'
- Handle missing values
- Remove rows from data where `keywords` is only a single space `' '`
- Remove duplicates from data considering `transcription` and `keywords` columns
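
The four cleaning steps above could look roughly like the sketch below. It runs on a tiny made-up frame; the DataFrame `df` and its rows are illustrative assumptions that only reuse the dataset's column names, not the real data.

```python
import pandas as pd

# Toy stand-in for the real CSV (assumption: rows are invented for illustration)
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "transcription": ["chest pain noted", "chest pain noted", "knee surgery", None, "mri of brain"],
    "keywords": ["cardiology, chest pain", "cardiology, chest pain", " ", "neurology", "radiology, mri"],
})

df = df.drop(columns=["Unnamed: 0"])      # remove the index-like column
df = df.dropna()                          # drop rows with missing values
n_blank = (df["keywords"] == " ").sum()   # count rows whose keywords are a single space
df = df[df["keywords"] != " "]            # remove those rows
# Keep the first occurrence of each (transcription, keywords) pair
df = df.drop_duplicates(subset=["transcription", "keywords"]).reset_index(drop=True)
```

Dropping missing values before filtering the blank keywords keeps the comparison on string values only; the order of the last two steps does not affect the result.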


- **Remove unnecessary columns - 'Unnamed: 0'**

In [None]:
# YOUR CODE HERE

- **Handle missing values**

In [None]:
# Drop missing values
# YOUR CODE HERE

- **Remove rows from data where `keywords` is only a single space ' '**

In [None]:
# Count of rows where keywords are ' '
# YOUR CODE HERE

In [None]:
# Remove rows where keywords are ' '
# YOUR CODE HERE

- **Remove duplicates from data considering `transcription` and `keywords` columns**

In [None]:
# Check duplicates
# YOUR CODE HERE

**Exercise 3: Display all the categories of `medical_specialty` and their counts in the dataset [0.5 Mark]**
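
As a rough illustration of the pandas calls that fit this exercise, the sketch below works on a made-up `medical_specialty` column; the labels are assumptions, not values taken from the dataset. The percentage series is the kind of input a pie chart would use.

```python
import pandas as pd

# Toy column (assumption: labels are invented for illustration)
specialty = pd.Series(
    ["Surgery", "Radiology", "Surgery", "Neurology", "Surgery"],
    name="medical_specialty",
)

categories = specialty.unique()       # distinct categories
n_categories = specialty.nunique()    # total number of categories
counts = specialty.value_counts()     # records per category, sorted descending
percentages = (counts / counts.sum() * 100).round(1)  # share of each category in percent
```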



In [None]:
# Displaying the distinct categories of medical specialty
# YOUR CODE HERE

In [None]:
# Total categories
# YOUR CODE HERE

In [None]:
# Displaying the distinct categories of medical specialty and the number of records belonging to each category
# YOUR CODE HERE

**Exercise 4: Create a pie plot depicting the percentage of `medical_specialty` distributions category-wise. [0.5 mark]**

**Hint:** Use [plt.pie()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html) and [plt.get_cmap](https://matplotlib.org/stable/tutorials/colors/colormaps.html) for color mapping the pie chart.

In [None]:
# YOUR CODE HERE

### Pre-process `transcription` and `keywords` text

**Exercise 5: Create functions to perform the following tasks: [0.5 Mark]**

- Convert transcription and keywords text to lowercase
- Remove quotes from transcription and keywords text
- Remove all special characters and punctuation
- Remove digits from transcription and keywords text
- Remove extra spaces
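
One possible way to chain the five steps above into a single helper is sketched below. The function name `clean_text` is hypothetical, and replacing punctuation with a space (rather than deleting it) is a design assumption so that comma-separated keywords do not get glued together.

```python
import re
import string

# Translation table mapping every punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Lowercase, then strip quotes, punctuation, digits and extra spaces."""
    text = text.lower()
    text = re.sub(r"[\"']", "", text)          # remove quotes without splitting words like "patient's"
    text = text.translate(_PUNCT_TO_SPACE)     # punctuation / special characters -> space
    text = re.sub(r"\d+", "", text)            # remove digits
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra spaces
    return text
```

With a DataFrame, such a helper could be applied to both text columns, for example through `Series.apply`.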

- **Convert `transcription` and `keywords` text to lowercase**

In [None]:
# Convert transcription and keywords text to lowercase
# YOUR CODE HERE

- **Remove quotes from `transcription` and `keywords` text**

In [None]:
# Remove quotes from transcription and keywords text
# YOUR CODE HERE

- **Remove punctuations**

In [None]:
# Remove punctuations
# YOUR CODE HERE

- **Remove digits from `transcription` and `keywords` text**

In [None]:
# Remove digits from transcription and keywords sentences
# YOUR CODE HERE

- **Remove extra spaces**

In [None]:
# Remove extra spaces
# YOUR CODE HERE

**Exercise 6: Remove the stopwords from `transcription` text [0.5 Mark]**

- **Remove stopwords**
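
The idea can be sketched without any downloads as shown below. The small `STOP_WORDS` set and the function name are assumptions made so the example runs offline; in the notebook, the set would more likely come from `stopwords.words('english')`.

```python
# Assumption: a tiny hand-picked set stands in for NLTK's English stopword list
STOP_WORDS = {"the", "was", "of", "and", "to", "a", "in", "with"}

def remove_stopwords_sketch(text, stop_words=STOP_WORDS):
    """Return `text` with whitespace-separated stopwords dropped."""
    return " ".join(word for word in text.split() if word not in stop_words)
```

Using a `set` keeps each membership test cheap, which matters when the function is applied to thousands of long transcriptions.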

In [None]:
# Function to remove the stopwords

def remove_stopwords(text):

    # YOUR CODE HERE

In [None]:
# Remove stopwords from transcriptions
# YOUR CODE HERE

**[OPTIONAL]** Visualize the distribution of word counts in both `transcription` and `keywords` text.

**Hint:**
- Get the text length of each sample
- pd.DataFrame().hist() OR sns.histplot()

In [None]:
# Visualize the distribution of word counts
# YOUR CODE HERE

### Select the maximum sequence length for both `transcription` and `keywords`

In [None]:
# Fix the maximum length of the transcript
# Fix the maximum keywords length

max_len_transcript = 250
max_len_keywords = 30

**Exercise 7: Add `'start'` and `'end'` to `keywords` text at the beginning and end respectively [0.5 Mark]**

- 'start' will represent the beginning of output sequence
- 'end' will represent the end of output sequence
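
A minimal sketch of this wrapping step, assuming the keywords have already been cleaned into a space-separated string (the helper name `add_start_end` is hypothetical):

```python
def add_start_end(keywords):
    """Wrap a cleaned keywords string in 'start' ... 'end' markers."""
    return "start " + keywords.strip() + " end"
```

Because the markers are ordinary words, the keywords tokenizer will assign them regular indices, which is what later steps look up via `word_index['start']` and `word_index['end']`.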

In [None]:
# Add 'start' and 'end' to keywords text
# YOUR CODE HERE

### Split data into training and testing set

- test_size=0.1
- random_state=0
- shuffle=True

In [None]:
# YOUR CODE HERE

### Tokenization and padding

**Exercise 8: Convert the `transcription` and `keywords` text to sequences of integers, and make them of uniform length [0.5 Mark]**

- Use two tokenizers to tokenize transcription and keywords separately
  
  **Hint:** [Tokenizer()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer), `.fit_on_texts()`, `.texts_to_sequences()`

- Pad/truncate both sequences to the maximum sequence lengths specified above
    - use padding='post', truncating='post'
    - for transcription, (use maxlen= max_len_transcript)
    - for keywords, (use maxlen= max_len_keywords + 1)

  **Hint:** [`pad_sequences()`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences), called as `pad_sequences(sequences= , maxlen= , padding='post', truncating='post')`

- For long keywords sequences, the 'end' token might get truncated
    - replace the last token with the token index of 'end'

- save the vocab size for both sequences
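
To make the padding, truncation, and 'end'-token repair concrete, here is a NumPy-only sketch. `pad_post` is a simplified stand-in for `pad_sequences(..., padding='post', truncating='post')`, and the token ids (1 for 'start', 2 for 'end') are hypothetical.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Simplified stand-in for pad_sequences with padding='post', truncating='post'."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]            # 'post' truncation keeps the first maxlen tokens
        out[i, :len(trimmed)] = trimmed   # 'post' padding leaves zeros at the end
    return out

def restore_end_token(padded, end_id):
    """If the last position holds a real token other than 'end', overwrite it with 'end'."""
    fixed = padded.copy()
    cut_off = (fixed[:, -1] != 0) & (fixed[:, -1] != end_id)
    fixed[cut_off, -1] = end_id
    return fixed

# Hypothetical ids: 1 = 'start', 2 = 'end'
seqs = [[1, 5, 6, 2], [1, 7, 8, 9, 10, 11, 2]]
padded = restore_end_token(pad_post(seqs, maxlen=5), end_id=2)
```

Only rows whose last position is non-zero were truncated (or exactly filled), so the check on the final column is enough to find the sequences that lost their 'end' token.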

In [None]:
# Instantiate tokenizer for transcripts
x_tokenizer = Tokenizer()

# Fit on training data
# YOUR CODE HERE

# Convert transcript sequences into integer sequences for both train and val set
# YOUR CODE HERE

# Add zero padding up to maximum length
# YOUR CODE HERE

# x vocab size
x_voc_size = len(x_tokenizer.word_index) +1
x_voc_size

In [None]:
# Instantiate tokenizer for keywords
y_tokenizer = Tokenizer()

# Fit on training data
# YOUR CODE HERE

# Convert keywords sequences into integer sequences for train and val set
# YOUR CODE HERE

# Add zero padding up to maximum length
# YOUR CODE HERE

# y vocab size
y_voc_size = len(y_tokenizer.word_index) +1
y_voc_size

- **For long keywords sequences, replace the last token with the token index of 'end'**

In [None]:
print(y_tokenizer.word_index['end'])

In [None]:
# Replace the last token with the token index of 'end' for long sequences

# Apply on Train keywords set
# YOUR CODE HERE

# Apply on Validation keywords set
# YOUR CODE HERE


### Positional Embedding

**Exercise 9: Create a class, `PositionalEmbedding` [1 Mark]**

- Use `mask_zero=True`, while defining token embeddings layer

- Make sure to make this layer a mask-generating layer by adding a method `compute_mask()`
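
The computation this layer performs can be illustrated in plain NumPy, as sketched below. The random tables stand in for the weights of two Keras `Embedding` layers, and all sizes are hypothetical; this shows the arithmetic only and is not the Keras layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, dim = 10, 6, 4                  # hypothetical sizes
token_table = rng.normal(size=(vocab_size, dim))     # stands in for the token embedding weights
position_table = rng.normal(size=(seq_len, dim))     # stands in for the position embedding weights

def positional_embedding(inputs):
    """Return token + position embeddings and the padding mask for integer `inputs`."""
    length = inputs.shape[-1]
    positions = np.arange(length)
    # Position rows broadcast across the batch dimension
    embedded = token_table[inputs] + position_table[positions]
    mask = inputs != 0        # the boolean mask that compute_mask() is expected to report
    return embedded, mask
```

Padded positions still receive an embedding; it is the mask, passed on to later layers, that tells them which positions to ignore.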

In [None]:
class PositionalEmbedding(layers.Layer):

    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        # YOUR CODE HERE ...

    def call(self, inputs):
        # YOUR CODE HERE ...

    def compute_mask(self, inputs, mask=None):
        # YOUR CODE HERE

    def get_config(self):
        # YOUR CODE HERE


### Encoder Block

**Exercise 10: Create a class, `TransformerEncoder` [1 Mark]**

- While calling `attention` layer, do not use `attention_mask` parameter

- In Feed forward network, add `Dropout(0.1)` layer after 2 dense layers

- For skip connections, use `tf.keras.layers.Add()` instead of `'+'`
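
As a rough picture of the feed-forward sublayer with its skip connection, the NumPy sketch below assumes a ReLU activation on the first dense layer and layer normalization applied after the addition; neither choice is stated in this notebook, and dropout is left out because it is inactive at inference time.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned scale/offset)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward_sublayer(x, w1, b1, w2, b2):
    """Two dense layers, then a skip connection and layer normalization."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # first dense layer (assumed ReLU), width dense_dim
    projected = hidden @ w2 + b2            # second dense layer, back to embed_dim
    return layer_norm(x + projected)        # x + projected plays the role of layers.Add()
```

The skip connection requires the second dense layer to project back to the input width, which is why the output shape always matches the input shape.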

In [None]:
class TransformerEncoder(layers.Layer):

    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        # YOUR CODE HERE ...

    def call(self, inputs, mask=None):
        # YOUR CODE HERE ...

    def get_config(self):
        # YOUR CODE HERE


### Decoder Block

**Exercise 11: Create a class, `TransformerDecoder` [1 Mark]**

- Do not create any separate function to get causal attention mask, just pass `use_causal_mask = True` parameter while calling `attention_1` layer

- While calling `attention_2` layer, do not use `attention_mask` parameter

- In Feed forward network, add `Dropout(0.1)` layer after 2 dense layers

- For skip connections, use `tf.keras.layers.Add()` instead of `'+'`
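
What a causal mask does to attention weights can be seen in the small NumPy sketch below. It illustrates the concept only, under the assumption of all-zero attention scores; it is not how Keras implements the mask internally.

```python
import numpy as np

def causal_mask(length):
    """Lower-triangular mask: query position i may attend to key positions 0..i."""
    return np.tril(np.ones((length, length), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis with disallowed positions forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# With equal scores, each position spreads its attention evenly over itself and the past
weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
```

Every row keeps at least its diagonal entry, so the softmax never divides by zero.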

In [None]:
class TransformerDecoder(layers.Layer):

    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        # YOUR CODE HERE ...

    def get_config(self):
        # YOUR CODE HERE ...

    def call(self, inputs, encoder_outputs, mask=None): # two inputs: decoder i/p and encoder o/p

        # YOUR CODE HERE ...


### Build Transformer model

**Exercise 12: Create a transformer model with the following points: [1 Mark]**

- Use the respective vocabulary size for the PositionalEmbedding of the encoder and decoder inputs

- Add `Dropout(0.1)` layers after both encoder and decoder PositionalEmbedding layers

- Do not use `activation="softmax"` for the last dense classification layer (You will be required to create a custom loss, and metric in the next stage.)

- Add a stack of 4 encoder blocks and 4 decoder blocks to your transformer

In [None]:
# Create transformer model

embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,))

# YOUR CODE HERE ...


decoder_inputs = keras.Input(shape=(None,))

# YOUR CODE HERE ...


decoder_outputs = layers.Dense(y_voc_size)(x)

transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
transformer.summary()

## Model Compilation and Training [1 Mark]

**Exercise 13: Set up the optimizer**

Refer to the TensorFlow transformer tutorial [here](https://www.tensorflow.org/text/tutorials/transformer#set_up_the_optimizer) for the following steps:

- Use the Adam optimizer with a custom learning rate scheduler

- Instantiate the Adam optimizer with custom learning rate
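
The warmup schedule from the original Transformer paper, which the linked tutorial also uses, can be written as a plain function to see its shape. The sketch below is framework-free; the default `d_model=256` mirrors `embed_dim` in this notebook and `warmup_steps=4000` is an assumed default.

```python
def transformer_lr(step, d_model=256, warmup_steps=4000):
    """lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)"""
    step = max(float(step), 1.0)   # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate grows linearly during the first `warmup_steps` steps and then decays with the inverse square root of the step number. In the notebook, the same formula would live in the `__call__` method of the `LearningRateSchedule` subclass.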

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

    # YOUR CODE HERE


In [None]:
# Instantiate the Adam optimizer with custom learning rate
# YOUR CODE HERE

**Exercise 14: Set up the loss and metrics**

- Apply a padding mask while calculating the loss with cross-entropy loss function as demonstrated [here](https://www.tensorflow.org/text/tutorials/transformer#set_up_the_loss_and_metrics).  
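
The effect of the padding mask can be checked on small arrays with the NumPy sketch below. It assumes that label `0` marks padding and that the model outputs raw logits; the function names are hypothetical, and the notebook versions would use TensorFlow ops instead.

```python
import numpy as np

def masked_loss_np(labels, logits):
    """Mean sparse cross-entropy over non-padding positions (label != 0)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_loss = -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    mask = (labels != 0).astype(log_probs.dtype)
    return (token_loss * mask).sum() / mask.sum()

def masked_accuracy_np(labels, logits):
    """Fraction of non-padding positions where argmax(logits) equals the label."""
    mask = labels != 0
    match = (logits.argmax(axis=-1) == labels) & mask
    return match.sum() / mask.sum()
```

Dividing by the number of real tokens rather than by the full sequence length keeps heavily padded batches from reporting an artificially low loss.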

In [None]:
def masked_loss(label, pred):
    # YOUR CODE HERE


def masked_accuracy(label, pred):
    # YOUR CODE HERE


**Exercise 15: Compile transformer model with custom optimizer, loss, and metric & perform training [0.5 Mark]**

- Use [*transcription sequences*, and *keywords sequences(shifted right)*] as input to transformer

- Train the model using Colab's GPU runtime with batch_size=32 and epochs=30. (With a GPU, an epoch might take about one minute.)

**Hint:** First check that the training code runs without errors on the CPU runtime, then switch to the GPU runtime for faster training. Once trained, save the model weights and download them to your system for later use.
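
The "shifted right" input is usually obtained by slicing the padded keyword sequences, which is one reason they are padded to `max_len_keywords + 1`. The sketch below uses hypothetical token ids, and pairing the decoder input with a target shifted by one position is the common convention, assumed here rather than stated in this notebook.

```python
import numpy as np

# Hypothetical padded keyword ids of length max_len_keywords + 1 (here 4 + 1);
# 1 = 'start', 2 = 'end', 0 = padding
y = np.array([[1, 7, 8, 2, 0],
              [1, 9, 2, 0, 0]])

decoder_input = y[:, :-1]    # what the decoder sees: begins with 'start'
decoder_target = y[:, 1:]    # what it should predict: the same sequence one step ahead
```

Both slices have length `max_len_keywords`, so at every position the model is trained to predict the token that follows the ones it has seen so far.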

In [None]:
# Compile

# YOUR CODE HERE

In [None]:
# Train

# YOUR CODE HERE

### Save model weights

In [None]:
!mkdir my_model_weights

In [None]:
# Save model weights
# This creates a '.weights.h5' file, which can be downloaded to your system from Colab

transformer.save_weights('my_model_weights/my_weights.weights.h5')

In [None]:
# OR
# Make a zip file, which can also be downloaded to your system from Colab

!zip -r 'my_model_weights.zip' 'my_model_weights'

  adding: my_model_weights/ (stored 0%)
  adding: my_model_weights/my_weights.weights.h5 (deflated 9%)


### Load model weights

Whenever you need to use these trained model weights:
* use the model architecture to create the exact same model
* then load the trained weights directly using the code below

In [None]:
# To load model weights
#transformer.load_weights('my_model_weights/my_weights.weights.h5')

## Run Inference

**Exercise 16: Create a function to extract keywords, given transcription text as input [1 Mark]**

- Encode the input sentence using the Transcription tokenizer. This is the encoder input
- Initialize decoder input with the 'start' token
- The decoder then outputs the predictions by looking at the encoder output and its own output (self-attention).
- Concatenate the predicted token to the decoder input and pass it to the decoder repeatedly
- Make the decoder predict the next token based on the previous tokens it has predicted
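
The loop described above is a greedy decoder. The sketch below shows its control flow with a toy model in place of the trained transformer; `toy_model`, `SCRIPT`, the token ids, and the vocabulary are all hypothetical and exist only so the loop can be run.

```python
import numpy as np

START, END = 1, 2                                        # hypothetical token ids
index_word = {3: "cardiology", 4: "chest", 5: "pain"}    # hypothetical vocabulary
SCRIPT = [3, 4, 5, END]                                  # tokens the toy model emits in order

def toy_model(encoder_tokens, decoder_tokens):
    """Stand-in for the transformer: returns logits of shape (1, len(decoder_tokens), vocab)."""
    vocab = 6
    logits = np.zeros((1, len(decoder_tokens), vocab))
    for pos in range(len(decoder_tokens)):
        logits[0, pos, SCRIPT[min(pos, len(SCRIPT) - 1)]] = 1.0
    return logits

def greedy_decode(encoder_tokens, model, max_len=10):
    """Repeatedly pick the most likely next token until 'end' or max_len is reached."""
    decoder_tokens = [START]
    words = []
    for _ in range(max_len):
        logits = model(encoder_tokens, decoder_tokens)
        token = int(np.argmax(logits[0, -1, :]))   # prediction at the last decoder position
        if token == END:
            break
        decoder_tokens.append(token)               # feed the prediction back into the decoder
        words.append(index_word[token])
    return " ".join(words)
```

Only the logits at the last position are used at each step, because earlier positions correspond to tokens that have already been chosen.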

In [None]:
def extract_keywords(sentence, transformer=transformer):

    """ Takes an input sentence, and transformer. Returns extracted keywords. """

    # Convert input sentence into integer sequence (note that tokenizer.texts_to_sequences() takes a list of texts as input)
    ip_tokens = # YOUR CODE HERE

    # Add zero padding up to the maximum transcription length
    ip_tok_seq = # YOUR CODE HERE

    # Create a decoder sequence with 'start' token index
    dec_tok_seq = np.array([y_tokenizer.word_index['start']])

    # Variable to store the output text string
    keyword_sentence = ''

    for i in range(max_len_keywords):

        # Get output logits from transformer
        pred = transformer([ip_tok_seq.reshape(1,-1), dec_tok_seq.reshape(1, -1)], training=False)
        pred = pred[:, -1:, :]

        # Select the index with max value from 'pred' to get the output token index
        token = # YOUR CODE HERE

        # Convert output token to word
        word = y_tokenizer.index_word[token]

        # End the loop if word is 'end'
        # YOUR CODE HERE

        # Append 'token' to dec_tok_seq
        dec_tok_seq = np.append(dec_tok_seq, token)

        # Append 'word' to keyword sentence
        # YOUR CODE HERE

    return keyword_sentence.strip()


In [None]:
# Predict keywords for a sample input

# YOUR CODE HERE

## Gradio Implementation [OPTIONAL]

Gradio is an open-source Python library that lets us quickly create easy-to-use, customizable UI components for an ML model, an API, or any arbitrary function in just a few lines of code. We can embed the GUI directly in the Python notebook or share the link with anyone.

In [None]:
!pip -qq install gradio

In [None]:
import gradio

In [None]:
# Input from user
in_transcript = gradio.Textbox(lines=10, placeholder=None, value="transcription", label='Enter Transcription Text')

# Output prediction
out_keywords = gradio.Textbox(type="text", label='Extracted Keywords')


# Gradio interface to generate UI
iface = gradio.Interface(fn = extract_keywords,
                         inputs = [in_transcript],
                         outputs = [out_keywords],
                         title = "Keywords Extraction",
                         description = "Using transformer model, trained from scratch",
                         allow_flagging = 'never')

iface.launch(share = True)

Click on the link generated above to see the UI.