# Challenge - Clinical Text Classification

#### You now know the structure of a Machine Learning pipeline, and how to implement it from the pre-processing of data to the evaluation of results. Good job! But can you do it with text data? In this challenge, you will have the opportunity to implement text processing techniques, including a simple strategy to represent text data in a structure that is compatible with Machine Learning algorithms: the Bag of Words model. Leveraging this strategy, and combining it with what you have learned by implementing the previous pipeline, you will be able to classify clinical notes according to the medical specialty where they were produced. Good luck!

## Import Packages

In [1]:
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

!pip install scikit-learn
!pip install nltk
!pip install pandas



In [2]:
import pandas as pd
from sklearn import feature_extraction
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from collections import Counter
from random import randint
import nltk
nltk.download('punkt')

from sklearn.model_selection import StratifiedShuffleSplit

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bruno.ribeiro\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Load Data

In [3]:
data = pd.read_csv("nlp_challenge.csv")

How does the data look? Here are a few examples:

In [4]:
data.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,12,Cerebral Angiogram - moyamoya disease.,Neurology,Moyamoya Disease,"CC:, Confusion and slurred speech.,HX , (prima...",
1,2755,EEG during wakefulness and light sleep is abn...,Neurology,Video EEG - 3,"PROCEDURE: , EEG during wakefulness demonstrat...","neurology, epileptogenic, wakefulness, eeg, fr..."
2,2756,A pleasant gentleman with a history of Wilson...,Neurology,Wilson's Disease - Letter,"Doctor's Address,Dear Doctor:,This letter is a...","neurology, atrial enlargement, wilson's diseas..."
3,2757,This is a 43-year-old female with a history o...,Neurology,Video EEG,"TIME SEEN: , 0734 hours and 1034 hours.,TOTAL ...","neurology, electroencephalography, eeg monitor..."
4,2758,The patient has a history of epilepsy and has...,Neurology,Video EEG - 1,"DATE OF EXAMINATION: , Start: 12/29/2008 at 1...","neurology, non-epileptic events, temporal spik..."


The "description" column contains the samples you will use to train your model. The target is the "medical_specialty" column. As you can see, this data is very different from time series! Start by isolating the information you need (i.e., create the variables X_data and y_data, referring to the description and medical_specialty columns, respectively).

In [5]:
X_data, y_data = data["description"], data["medical_specialty"]

## Split Data

Now, notice that your data is not split into train and test sets. Splitting data is one of the most defining steps in the Machine Learning pipeline. A bad split could result in data leakage, or in a test set that does not accurately represent the true data distribution, which could in turn lead to over- or underestimation of results.

To avoid falling into these pitfalls, you must know the data you are working with. Let us check the class distribution. To do so, we will use the Counter class from the collections library, which counts the occurrences of each unique value in a list.

In [6]:
Counter(y_data)

Counter({' Neurology': 223,
         ' Radiology': 273,
         ' Cardiovascular / Pulmonary': 372,
         ' Gastroenterology': 230})

We want to make sure this distribution is similar in both the training and test sets. Apply a stratified method to do the split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html).

In [7]:
train_idxs, test_idxs = next(StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=32).split(X_data, y_data))

In [8]:
# The split returns positional indices, so select rows with .iloc
X_train, y_train = X_data.iloc[train_idxs], y_data.iloc[train_idxs]
X_test, y_test = X_data.iloc[test_idxs], y_data.iloc[test_idxs]

Verify the class distributions of both resulting sets.

In [9]:
Counter(y_train)

Counter({' Gastroenterology': 161,
         ' Cardiovascular / Pulmonary': 260,
         ' Radiology': 191,
         ' Neurology': 156})

In [10]:
Counter(y_test)

Counter({' Cardiovascular / Pulmonary': 112,
         ' Neurology': 67,
         ' Radiology': 82,
         ' Gastroenterology': 69})
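
Raw counts are hard to compare across sets of different sizes, so it can help to look at proportions instead. The following is a minimal sketch, not part of the original notebook: it hard-codes the counts printed above (with the leading space in each label stripped) and uses a hypothetical helper named `proportions`.

```python
# Class counts copied from the Counter outputs above (leading spaces in labels stripped)
train_counts = {
    "Gastroenterology": 161,
    "Cardiovascular / Pulmonary": 260,
    "Radiology": 191,
    "Neurology": 156,
}
test_counts = {
    "Gastroenterology": 69,
    "Cardiovascular / Pulmonary": 112,
    "Radiology": 82,
    "Neurology": 67,
}


def proportions(counts):
    """Convert absolute class counts into fractions of the set size."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_props = proportions(train_counts)
test_props = proportions(test_counts)

for label in train_counts:
    print(f"{label:<28} train={train_props[label]:.3f}  test={test_props[label]:.3f}")

# Largest difference in class share between the two sets
max_gap = max(abs(train_props[label] - test_props[label]) for label in train_counts)
print(f"Largest gap: {max_gap:.4f}")
```

A small gap for every class suggests that the stratified split preserved the class distribution.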

## Text Preprocessing

Let's get some words rolling! Here's a tutorial on text processing techniques: https://towardsdatascience.com/text-preprocessing-in-natural-language-processing-using-python-6113ff5decd8
You should complete the function "preprocess_text" so that it receives a chunk of text and returns a list of clean tokens. Note that the output of this function must be tokenized to be compatible with the following steps of the notebook.

In [11]:
def preprocess_text(text):
    """Takes a text and returns the processed version of it.

    Args:
        text (str): raw text

    Returns:
        list: clean tokens extracted from text
    """

    # Normalize Text (lowercasing)
    processed_text = text.lower()

    # Add more preprocessing steps below:

    # Try a few preprocessing techniques:
    # - Tokenization
    # - Stopwords Removal
    # - Stemming/Lemmatization
    processed_text = nltk.word_tokenize(processed_text)

    # You can also implement other preprocessing you may find useful

    return processed_text

Test the function on a randomly sampled description from the train dataset (give it a couple of tries to really see the impact).

In [12]:
# randint includes both endpoints, so the upper bound must be the last valid position
random_description = X_train.values[randint(0, X_train.shape[0] - 1)]
print(f"Original Note: {random_description}")
print(f"Processed Note: {preprocess_text(random_description)}")

Original Note:  Neurologic examination sample.  
Processed Note: ['neurologic', 'examination', 'sample', '.']
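
Notice that the punctuation token '.' survives the current preprocessing. As one possible extension, here is a minimal sketch that drops punctuation and stopwords using only the standard library. The function name `preprocess_text_sketch` and the tiny `STOPWORDS` set are illustrative assumptions, not part of the original notebook; `nltk.corpus.stopwords` offers a fuller list, and stemming or lemmatization is left out here.

```python
import re

# Illustrative stopword list (an assumption); a real pipeline would use a fuller one
STOPWORDS = {"a", "an", "and", "the", "of", "with", "is", "in", "to", "for", "on", "this"}


def preprocess_text_sketch(text):
    """Lowercase, keep alphanumeric tokens (allowing inner hyphens/apostrophes), drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


print(preprocess_text_sketch(" Neurologic examination sample.  "))
print(preprocess_text_sketch("This is a 43-year-old female with a history of epilepsy."))
```

Whether removing stopwords actually helps depends on the classifier and the data, so it is worth comparing results with and without this step.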


Now that you can clean every entry in the dataset, you can proceed to extract features. There are many ways to do this, but we will focus on two simple techniques: the bag-of-words model and TF-IDF. In the following cells, you will try both ways of vectorizing the data.

## Feature Extraction

Machine Learning models do not accept strings as input. So how can we train one to perform text classification? We need to represent each text sample as a numeric vector. This numeric vector is our feature vector. In this challenge, you will implement two simple methods for text vectorization: bag-of-words and TF-IDF. Here's a quick introduction to the bag-of-words model: https://machinelearningmastery.com/gentle-introduction-bag-words-model/

### Bag-of-Words

To implement a bag-of-words vectorization, we will use the CountVectorizer class from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

###### Note: This vectorizer expects the preprocessing function to return a list of tokens. If you have not implemented a tokenizer yet, you must rethink your preprocessing methodology.

In [13]:
def get_bow_representations(train_samples, test_samples, tokenizer):
    """Returns a bag-of-words based representation of both the train and test samples.

    Args:
        train_samples (list): List of training samples.
        test_samples (list): List of test samples.
        tokenizer (object): A preprocessing function that outputs a list of tokens.

    Returns:
        train_vectors, test_vectors: vectorized representations of the train and test sets, according to the BOW method.
    """

    count_vectorizer = feature_extraction.text.CountVectorizer(tokenizer=tokenizer)

    train_vectors = count_vectorizer.fit_transform(train_samples)

    test_vectors = count_vectorizer.transform(test_samples)

    return train_vectors, test_vectors
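
To make the idea concrete, here is a hand-rolled sketch of what a bag-of-words representation contains, using only the standard library. The toy documents and the helper `to_counts` are illustrative assumptions; CountVectorizer does the same job but returns a sparse matrix. Note how the vocabulary is built from the training documents only, so unseen test tokens are ignored.

```python
from collections import Counter

# Toy, already-tokenized documents (illustrative, not from the dataset)
train_docs = [["chest", "pain", "and", "chest", "tightness"], ["abdominal", "pain"]]
test_docs = [["chest", "x-ray", "normal"]]

# The vocabulary is learned from the training documents only
vocabulary = sorted({token for doc in train_docs for token in doc})
index = {token: position for position, token in enumerate(vocabulary)}


def to_counts(tokens):
    """Return one count per vocabulary entry; tokens outside the vocabulary are skipped."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokens).items():
        if token in index:
            vector[index[token]] = count
    return vector


train_vectors_sketch = [to_counts(doc) for doc in train_docs]
test_vectors_sketch = [to_counts(doc) for doc in test_docs]

print(vocabulary)
print(train_vectors_sketch)
print(test_vectors_sketch)
```

Every vector has one entry per vocabulary term, which is why all vectors share the same length regardless of how long each document is.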

Test this function and check the dimension of the resultant vectors:

In [14]:
train_vectors, test_vectors = get_bow_representations(X_train, X_test, preprocess_text)
print(train_vectors[0].shape)



(1, 2683)


#### Considering the result of the previous cell, what is the number of unique words in the entire preprocessed train dataset?

Answer here: 

### TF-IDF

The TF-IDF implementation is similar to the Bag-of-Words one, but it uses the TfidfVectorizer instead (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [15]:
def get_tfidf_representations(train_samples, test_samples, tokenizer):
    """Returns a tf-idf based representation of both the train and test samples.

    Args:
        train_samples (list): List of training samples.
        test_samples (list): List of test samples.
        tokenizer (object): A preprocessing function that outputs a list of tokens.

    Returns:
        train_vectors, test_vectors: vectorized representations of the train and test sets, according to the TF-IDF method.
    """

    tfidf_vectorizer = feature_extraction.text.TfidfVectorizer(tokenizer=tokenizer)

    train_vectors = tfidf_vectorizer.fit_transform(train_samples)

    test_vectors = tfidf_vectorizer.transform(test_samples)

    return train_vectors, test_vectors

Test this function, similarly to what you did earlier

In [16]:
train_vectors, test_vectors = get_tfidf_representations(X_train, X_test, preprocess_text)
train_vectors

<768x2683 sparse matrix of type '<class 'numpy.float64'>'
	with 12946 stored elements in Compressed Sparse Row format>
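
The matrix has the same shape as the bag-of-words one, but the entries are now weights rather than raw counts. The sketch below computes such weights by hand on toy documents. It assumes the default TfidfVectorizer settings (smoothed idf, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector); the toy data and the helper `tfidf` are illustrative and not part of the original notebook.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative, not from the dataset)
docs = [["chest", "pain"], ["abdominal", "pain"], ["chest", "x-ray"]]
n_docs = len(docs)

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Smoothed inverse document frequency: rarer tokens get larger values
idf = {token: math.log((1 + n_docs) / (1 + df)) + 1 for token, df in doc_freq.items()}


def tfidf(tokens):
    """Return L2-normalized tf * idf weights for the known tokens of one document."""
    term_freq = Counter(tokens)
    weights = {token: count * idf[token] for token, count in term_freq.items() if token in idf}
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {token: weight / norm for token, weight in weights.items()} if norm else weights


vector = tfidf(docs[1])
print(vector)
```

In the second document, "abdominal" appears in fewer documents than "pain", so it receives the larger weight even though both occur once.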

## Models

Now you need a model to perform the main task: text classification of clinical descriptions. To that end, we give you a prediction function that accepts a model as input (you can choose your own classifier), trains it on the training samples, and returns its predictions for the test set.

#### Training Implementation

In [17]:
def get_predictions(model, train_samples, train_labels, test_samples):
    """Trains the given classifier and returns its predictions on the test set.

    Args:
        model (object): Unfitted scikit-learn style classifier exposing fit and predict.
        train_samples (array-like or sparse matrix): Vectorized training descriptions.
        train_labels (array-like): Training labels.
        test_samples (array-like or sparse matrix): Vectorized test descriptions.

    Returns:
        preds: Predictions against the test set.
    """

    model.fit(train_samples, train_labels)

    return model.predict(test_samples)

## Build your Pipeline!

Time to bring it all together! You have your preprocessing function, your feature extractors, and your training implementation. Now you can combine them according to the structure of the traditional NLP pipeline you learned about in this first class.

1. Start by getting your data in the correct format, i.e. a set of training descriptions, a set of training labels, a set of test descriptions, and a set of test labels.

In [18]:
# This step was already performed when we split the data!

2. Preprocess the descriptions and get a numerical representation of them. You should perform these two steps at once, since the vectorizers accept a preprocessing function as input.

In [19]:
# Vectorize your text
# Notice that you don't have to apply pre-processing first, because the vectorizers apply it themselves.
# You must pass our preprocessing function to the feature extraction function, though.
x_train_vectorized, x_test_vectorized = get_bow_representations(X_train, X_test, preprocess_text)

3. Get predictions against the test set, using a classifier of your choice

In [20]:
# Choose your fighter
model = MultinomialNB()
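
To get a feel for what a multinomial Naive Bayes classifier does with count features, here is a minimal pure-Python sketch on toy data. It uses Laplace (add-one) smoothing, which corresponds to the default `alpha=1.0` of MultinomialNB. The functions `train_nb` and `predict_nb` and the toy documents are illustrative assumptions; this is not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocabulary = {token for doc in docs for token in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(token_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                token: math.log((token_counts[label][token] + alpha) / total)
                for token in vocabulary
            },
        }
    return model


def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][token]
            for token in tokens
            if token in params["log_likelihood"]
        )
    return max(model, key=score)


# Toy, already-tokenized training data (illustrative, not from the dataset)
toy_docs = [["chest", "pain"], ["chest", "x-ray"], ["abdominal", "pain"], ["abdominal", "ultrasound"]]
toy_labels = ["Cardio", "Cardio", "Gastro", "Gastro"]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["chest", "tightness"]))
print(predict_nb(toy_model, ["abdominal", "pain"]))
```

Smoothing matters here: without it, a token never seen in a class would give that class a probability of zero, no matter what the other tokens say.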

In [21]:
# Train your model and get predictions against the test samples
preds = get_predictions(model, x_train_vectorized, y_train, x_test_vectorized)

4. Check the performance you achieve through the selected pipeline.

In [22]:
# Evaluate your model in terms of balanced accuracy and macro-averaged F1-score

print(f"Balanced accuracy: {round(balanced_accuracy_score(y_test, preds), 5)}")
print(f"F1-score: {round(f1_score(y_test, preds, average='macro'), 5)}")

Balanced accuracy: 0.6719
F1-score: 0.67102
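
Both metrics average over classes rather than over samples, so each specialty counts equally regardless of its size. The sketch below recomputes them by hand on toy labels, as a rough illustration of the definitions: balanced accuracy as the mean per-class recall, and macro F1 as the unweighted mean of per-class F1. The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    return tp, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    classes = sorted(set(y_true))
    return sum(tp[c] / (tp[c] + fn[c]) for c in classes) / len(classes)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    return sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes) / len(classes)


# Toy labels (illustrative): four Neurology notes and two Radiology notes
toy_true = ["N", "N", "N", "N", "R", "R"]
toy_pred = ["N", "N", "N", "R", "R", "N"]

plain_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"Plain accuracy:    {plain_accuracy:.3f}")
print(f"Balanced accuracy: {balanced_accuracy(toy_true, toy_pred):.3f}")
print(f"Macro F1:          {macro_f1(toy_true, toy_pred):.3f}")
```

On imbalanced data, plain accuracy can look better than balanced accuracy because the larger class dominates the average.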


If you managed to reach this cell without any issues, you probably saw that you can reach quite reasonable results with this simple pipeline (in a 4-class problem, a classifier that guesses uniformly at random would score about 25% balanced accuracy). Congratulations, you just built a decent text classifier capable of assigning a description to a medical specialty!

#### Challenge: Note that this is a very simple task, since clinical note descriptions are already a summary of the most relevant traits of each clinical note! To achieve these results with the notes themselves would be much harder! Try to reproduce the pipeline you created, but use the column "transcription" of the original dataset as the X_data, instead.