# NLP RoBERTa Medicine
- Dataset: [Medical Transcriptions](https://www.kaggle.com/tboyle10/medicaltranscriptions)

## Context

Medical data is extremely hard to find because of HIPAA privacy regulations. This dataset offers a solution by providing medical transcription samples.

## Content

This dataset contains sample medical transcriptions for various medical specialties.

## Acknowledgements

This data was scraped from mtsamples.com.

## Inspiration

Can we correctly classify medical specialties from the transcription text?

## Import libraries

In [0]:
# Transformers (https://github.com/huggingface/transformers)
!pip install transformers -q

In [0]:
import pandas as pd
import numpy as np

import torch
import torch.nn as nn
import transformers
import torch.utils.data as tdata
import torch.optim as optim

# Import the submodule explicitly: tqdm.notebook.tqdm is used for the progress bar later
import tqdm.notebook

## Import the data

In [4]:
data = pd.read_csv("mtsamples.csv", index_col=0)
data.head()

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


## Preprocessing

In [5]:
data['medical_specialty'].value_counts()

 Surgery                          1103
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        372
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  230
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Obstetrics / Gynecology           160
 Urology                           158
 Discharge Summary                 108
 ENT - Otolaryngology               98
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    62
 Psychiatry / Psychology            53
 Office Notes                       51
 Podiatry                           47
 Dermatology                        29
 Cosmetic / Plastic Surgery         27
 Dentistry               

In [6]:
(data['medical_specialty'].value_counts() > 10).mean()

0.825
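
The expression above relies on the mean of a boolean Series being the fraction of `True` values, so 0.825 is the share of specialties with more than 10 transcriptions. A minimal sketch of the same idiom on toy labels (the `toy` Series is made up for illustration):

```python
import pandas as pd

# Hypothetical toy labels: class "a" appears 3 times, "b" twice, "c" once.
toy = pd.Series(["a", "a", "a", "b", "b", "c"])

counts = toy.value_counts()            # a: 3, b: 2, c: 1
share_frequent = (counts > 1).mean()   # fraction of classes seen more than once

print(share_frequent)
```

Here two of the three classes pass the cutoff, so the result is two thirds.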

### Remove NaNs

In [0]:
data = data[['transcription', 'medical_specialty']]
data = data.drop(data[data['transcription'].isna()].index)

### Text processing

In [8]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
import string

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a single space-joined string
    """
    # A set makes the membership test below fast
    STOPWORDS = set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'])
    # Keep only the characters that are not punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])


In [10]:
data.transcription = data.transcription.apply(text_process)
data.transcription

0       SUBJECTIVE 23yearold white female presents com...
1       PAST MEDICAL HISTORY difficulty climbing stair...
2       HISTORY PRESENT ILLNESS seen ABC today pleasan...
3       2D MMODE 1 Left atrial enlargement left atrial...
4       1 left ventricular cavity size wall thickness ...
                              ...                        
4994    HISTORY pleasure meeting evaluating patient re...
4995    ADMITTING DIAGNOSIS Kawasaki diseaseDISCHARGE ...
4996    SUBJECTIVE 42yearold white female comes today ...
4997    CHIEF COMPLAINT 5yearold male presents Childre...
4998    HISTORY 34yearold male presents today selfrefe...
Name: transcription, Length: 4966, dtype: object
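
Tokens such as `23yearold` and `diseaseDISCHARGE` in the output above come from deleting punctuation characters outright, which fuses the words on either side. The sketch below reproduces the effect and shows one possible alternative that replaces punctuation with a space instead. The `TOY_STOPWORDS` set, the sample sentence, and the helper names are assumptions made for illustration; they are not part of the notebook.

```python
import string

# Hypothetical, tiny stopword set standing in for NLTK's English list.
TOY_STOPWORDS = {"this", "with", "a"}

def strip_punctuation(text):
    """Delete punctuation characters, as text_process does."""
    return "".join(ch for ch in text if ch not in string.punctuation)

def punctuation_to_space(text):
    """Alternative: replace each punctuation character with a space."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(table)

def drop_stopwords(text):
    """Remove stopwords and collapse repeated whitespace."""
    return " ".join(w for w in text.split() if w.lower() not in TOY_STOPWORDS)

# Made-up sentence in the style of the transcriptions.
sample = "SUBJECTIVE:, This 23-year-old white female presents with a cough."
fused = drop_stopwords(strip_punctuation(sample))
split = drop_stopwords(punctuation_to_space(sample))
print(fused)
print(split)
```

Whether fused or split tokens work better for the downstream tokenizer is an empirical question; the sketch only makes the difference visible.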

In [11]:
data.head()

Unnamed: 0,transcription,medical_specialty
0,SUBJECTIVE 23yearold white female presents com...,Allergy / Immunology
1,PAST MEDICAL HISTORY difficulty climbing stair...,Bariatrics
2,HISTORY PRESENT ILLNESS seen ABC today pleasan...,Bariatrics
3,2D MMODE 1 Left atrial enlargement left atrial...,Cardiovascular / Pulmonary
4,1 left ventricular cavity size wall thickness ...,Cardiovascular / Pulmonary


### Encoder

In [0]:
from sklearn.preprocessing import LabelEncoder
cats_encoder = LabelEncoder()
data['encoded_labels'] = cats_encoder.fit_transform(data['medical_specialty'])

In [13]:
data.head()

Unnamed: 0,transcription,medical_specialty,encoded_labels
0,SUBJECTIVE 23yearold white female presents com...,Allergy / Immunology,0
1,PAST MEDICAL HISTORY difficulty climbing stair...,Bariatrics,2
2,HISTORY PRESENT ILLNESS seen ABC today pleasan...,Bariatrics,2
3,2D MMODE 1 Left atrial enlargement left atrial...,Cardiovascular / Pulmonary,3
4,1 left ventricular cavity size wall thickness ...,Cardiovascular / Pulmonary,3
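
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why `Bariatrics` gets code 2 rather than 1 here: another specialty sorts between it and `Allergy / Immunology`. A rough NumPy equivalent, on toy labels where `Middle` is a made-up placeholder class:

```python
import numpy as np

# Toy labels for illustration; "Middle" is a made-up placeholder name.
labels = np.array(["Bariatrics", "Allergy", "Middle", "Bariatrics"])

# classes is sorted; codes[i] is the position of labels[i] within classes.
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # sorted class names
print(codes)    # one integer code per row

# Decoding is an index lookup, analogous to LabelEncoder.inverse_transform.
decoded = classes[codes]
```

In the notebook itself, `cats_encoder.inverse_transform` maps predicted codes back to specialty names.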


### Tokenizer

In [14]:
# Use the first GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# RoBERTa: Facebook's improved version of Google's BERT
roberta_weights = 'roberta-base'

# Load the pretrained RoBERTa weights into PyTorch
roberta_model = transformers.RobertaModel.from_pretrained(roberta_weights).to(device)

# Tokenizer that maps each piece of text to integer token IDs
roberta_token = transformers.RobertaTokenizer.from_pretrained(roberta_weights)

# Encode each transcription and turn it into a 2-D PyTorch tensor of shape (1, N).
# No truncation is applied, so texts longer than the model's 512-token limit raise warnings.
tokenized = [torch.tensor(roberta_token
                          .encode(x)).unsqueeze(0).to(device) for x in data['transcription']]


Token indices sequence length is longer than the specified maximum sequence length for this model (622 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (616 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (512 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (541 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (802 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for thi
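
The warnings above mean that some transcriptions encode to more tokens than the 512 that `roberta-base` accepts. Recent versions of `transformers` accept `truncation=True` and `max_length=512` in `encode`. The sketch below shows the same idea by hand on a plain list of token IDs. The helper name `truncate_ids` is hypothetical, and it assumes the sequence ends with RoBERTa's end-of-sequence ID 2 (the start-of-sequence ID 0 is visible at the start of `tokenized[0]`).

```python
def truncate_ids(ids, max_len=512, eos_id=2):
    """Cut a token-ID list to max_len, keeping a final end-of-sequence ID."""
    if len(ids) <= max_len:
        return list(ids)
    # Keep the first max_len - 1 IDs and close the sequence with </s>.
    return list(ids[: max_len - 1]) + [eos_id]

# Toy sequence: <s> (0), 620 made-up content IDs, </s> (2), 622 tokens in total.
long_ids = [0] + [100] * 620 + [2]
short_ids = truncate_ids(long_ids)
print(len(long_ids), len(short_ids))
```

Truncation discards the tail of long transcriptions, so whatever appears only near the end of a note no longer reaches the model.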

In [15]:
# Token IDs representing the first tokenized text
tokenized[0]

tensor([[    0, 25272, 33302, 10002,   883,   180,   279,  1104,  2182,  6822,
          3674, 26331,   341, 26331,  3033,  3417,  4265,  3007,   375,  1381,
         18407,   405,   179,   525,  4503, 26838,  1006,   765,    86,  2551,
          2217, 12833,   341, 12447,   763,    67,   341,    94,  1035,   880,
           634,    80,   688,   536,  2082,   447,   157,   341,    81,   627,
         24774, 11085,  4113,  9243, 35937, 11085,  4113, 18881,   109,   990,
          2703,  1230,  8456,   206,  2342,  5867,    62, 32653,  2371, 14939,
          8456,   855, 31270,   139,  6892, 25826,  3998,   225, 12447,   763,
          7981, 39042,  7536,   684,  6150, 26331,  7912, 33302, 10002,   846,
         19196, 17515,  8325,  2697,  1925,  1164,   316, 37664, 17779,  5382,
         14599, 33019,  1437,  4270,   627,  9244,  1827,   396,  1931,  1906,
           877,  4417,   337, 38791,  5166,  1437,  4270,   627,  9244,  1827,
         27722,   699, 20971,   450,   255, 13123,  

### Embedding

In [16]:
# List of embeddings that maps each text to a numeric vector
embeddings = []

# No training here: switch to eval mode and only use the pretrained weights
roberta_model.eval()

# Pass each tensor through the pretrained model; index [1] is the pooled output
with torch.no_grad():
  for x in tqdm.notebook.tqdm(tokenized):
    embeddings.append(roberta_model(x)[1].cpu().numpy())

In [17]:
embeddings[0].shape

(1, 768)

## NumPy and arrays

In [0]:
# Convert to a NumPy array
embeddings_numpy = np.array(embeddings).squeeze()

In [19]:
embeddings_numpy.shape

(4966, 768)
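
Each embedding has shape `(1, 768)`, so `np.array(embeddings)` has shape `(N, 1, 768)` and `squeeze()` drops the singleton axis. A small sketch with zero-filled stand-ins (the `fake_embeddings` list is made up); `np.concatenate` is shown as an alternative that keeps a 2-D result even when the list holds a single element:

```python
import numpy as np

# Stand-ins for the per-text pooled outputs, each of shape (1, 768).
fake_embeddings = [np.zeros((1, 768)) for _ in range(4)]

stacked = np.array(fake_embeddings)                # shape (4, 1, 768)
squeezed = stacked.squeeze()                       # shape (4, 768)
joined = np.concatenate(fake_embeddings, axis=0)   # shape (4, 768)

# With a single text, squeeze() also removes the sample axis.
single = np.array(fake_embeddings[:1]).squeeze()             # shape (768,)
single_joined = np.concatenate(fake_embeddings[:1], axis=0)  # shape (1, 768)

print(squeezed.shape, joined.shape, single.shape, single_joined.shape)
```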

In [20]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

Xtrain, Xval, ytrain, yval = train_test_split(embeddings_numpy, data['encoded_labels'] , train_size=0.5, random_state=0)

print(Xtrain.shape, Xval.shape, ytrain.shape, yval.shape)

(2483, 768) (2483, 768) (2483,) (2483,)


In [21]:
ytrain.unique().shape

(40,)

In [0]:
from sklearn.metrics import f1_score

## Training the model

In [23]:
# Array that stores the predictions, one column per class
n_classes = data['encoded_labels'].nunique()
predictions = np.zeros((Xval.shape[0], n_classes))

for class_ in sorted(data['encoded_labels'].unique()):

  # One-vs-rest training
  mask_train = ytrain==class_
  mask_val = yval==class_

  # Skip classes with fewer than 10 examples across train and validation
  if mask_train.sum() + mask_val.sum() < 10:
    # Too few examples to validate
    continue

  # Convert the boolean masks to integers
  ytrain_ = mask_train.astype(int)
  yval_ = mask_val.astype(int)

  # Initialize, train and predict
  model = RandomForestClassifier(class_weight='balanced')  # weight the rarer class more heavily
  model.fit(Xtrain, ytrain_)

  p = model.predict_proba(Xval)[:,1]

  predictions[:, class_] = p

  # Take the 99th percentile of the predictions (a value higher than 99% of the examples)
  threshold = np.percentile(p, 99)
  p_cut = (p > threshold).astype(int)

  print("Class = {} | Num positive examples | train = {} | val = {} | F1 = {} | p-avg = {}\n".format(class_, ytrain_.sum(), yval_.sum(), f1_score(yval_, p_cut), p.mean()))

Class = 2 | Num positive examples | train = 8 | val = 10 | F1 = 0.17142857142857143 | p-avg = 0.003821687973516302

Class = 3 | Num positive examples | train = 189 | val = 182 | F1 = 0.009661835748792272 | p-avg = 0.07996723731487626

Class = 4 | Num positive examples | train = 8 | val = 6 | F1 = 0.06666666666666667 | p-avg = 0.005069621661987939

Class = 5 | Num positive examples | train = 257 | val = 259 | F1 = 0.0 | p-avg = 0.11515516146424777

Class = 6 | Num positive examples | train = 14 | val = 13 | F1 = 0.2105263157894737 | p-avg = 0.006545386078075091

Class = 7 | Num positive examples | train = 9 | val = 18 | F1 = 0.16216216216216214 | p-avg = 0.004368929646200132

Class = 8 | Num positive examples | train = 19 | val = 10 | F1 = 0.05714285714285714 | p-avg = 0.0069866987319937065

Class = 9 | Num positive examples | train = 7 | val = 3 | F1 = 0.07142857142857142 | p-avg = 0.004330367938385416

Class = 10 | Num positive examples | train = 51 | val = 57 | F1 = 0.0 | p-
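
Cutting at the 99th percentile flags roughly 1% of the validation rows whatever the class size. With 2483 validation rows that is about 25 rows, so for a class with 259 positives the F1 could not exceed 2·25 / (25 + 259) ≈ 0.18 even if every flagged row were correct. The sketch below checks this arithmetic on made-up random scores; `max_f1_at_top_k` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(2483)                  # made-up scores, one per validation row

threshold = np.percentile(p, 99)
n_flagged = int((p > threshold).sum())

def max_f1_at_top_k(k, n_pos):
    """Best F1 reachable when exactly k rows are flagged and n_pos are positive."""
    tp = min(k, n_pos)                # best case: every flagged row is a true positive
    # F1 = 2*TP / (2*TP + FP + FN), and k = TP + FP, n_pos = TP + FN
    return 2 * tp / (k + n_pos)

print(n_flagged, max_f1_at_top_k(n_flagged, 259))
```

With tied probabilities, which a random forest often produces, even fewer rows may pass `p > threshold`, so this bound is optimistic for the real model.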

## Model evaluation

In [24]:
# Per-class probabilities, used to select the best class
predictions

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.00998386, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [0]:
# Return the column with the highest probability
p_argmax = predictions.argmax(axis=1)

In [26]:
f1_score(yval, p_argmax, average='micro')

0.18767619814740233
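
When every sample has exactly one true and one predicted label, micro-averaged F1 equals plain accuracy: each wrong prediction counts once as a false positive and once as a false negative. The score above therefore means that about 19% of the validation transcriptions received the correct specialty. A minimal check on made-up labels, where `micro_f1` is a hypothetical helper written out by hand:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    tp = fp = fn = 0
    for c in np.union1d(y_true, y_pred):
        tp += int(((y_pred == c) & (y_true == c)).sum())
        fp += int(((y_pred == c) & (y_true != c)).sum())
        fn += int(((y_pred != c) & (y_true == c)).sum())
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels: three of five predictions are correct.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 1, 1])
accuracy = float((y_true == y_pred).mean())
print(micro_f1(y_true, y_pred), accuracy)
```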

In [0]:
# Rank the model's probabilities
from scipy.stats import rankdata

# Ranks go from lowest to highest
p_top_rank = np.zeros(predictions.shape)

# For each class, take its predictions and rank them across the samples
for class_ in range(predictions.shape[1]):
  p_top_rank[:,class_] = rankdata(predictions[:, class_])

# Take the index of the best-ranked column
p_top_rank = p_top_rank.argmax(axis=1)

In [28]:
p_top_rank

array([24, 21, 13, ..., 13, 27, 10])

In [29]:
# Does it improve? Which approach wins?
f1_score(yval,p_top_rank, average='micro')

0.1059202577527185
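
One effect that may contribute to the lower score: `rankdata` gives every row of an all-zero column the same tied rank, roughly the middle of the range. Columns for classes that were skipped during training can therefore win the argmax for any sample that ranks low in every trained column. A small sketch on made-up scores, with 3 samples and 3 classes where class 2 was never trained:

```python
import numpy as np
from scipy.stats import rankdata

# Made-up scores: 3 samples x 3 classes; class 2 was never trained (all zeros).
preds = np.array([
    [0.90, 0.10, 0.0],
    [0.20, 0.80, 0.0],
    [0.10, 0.05, 0.0],
])

# Rank each column across samples, as the notebook does.
ranks = np.column_stack([rankdata(preds[:, c]) for c in range(preds.shape[1])])

by_score = preds.argmax(axis=1)   # pick the class with the highest probability
by_rank = ranks.argmax(axis=1)    # pick the class with the highest rank
print(by_score, by_rank)
```

The last sample has the lowest score in both trained columns, so the rank-based rule assigns it to the untrained class, while the plain argmax still picks class 0.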

In [0]:
# Ideas
# Per-class threshold optimization
# Compare against bag of words / TF-IDF; try an ensemble
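
The first idea, per-class threshold optimization, could be sketched as a search over candidate cutoffs that keeps the one with the best F1 for that class. The helper names, the candidate list, and the toy data below are assumptions for illustration. Thresholds tuned on the validation set would need a separate held-out set to be evaluated fairly.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for 0/1 arrays; returns 0.0 when there are no true positives."""
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, p, candidates):
    """Return (threshold, f1) with the highest F1 among the candidates."""
    scores = [f1_binary(y_true, (p > t).astype(int)) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Made-up one-vs-rest labels and probabilities for a single class.
y_true = np.array([0, 0, 1, 1])
p = np.array([0.10, 0.40, 0.35, 0.80])
best_t, best_f1 = best_threshold(y_true, p, [0.05, 0.2, 0.5, 0.9])
print(best_t, best_f1)
```

In the training loop, `yval_` and `p` would play the roles of `y_true` and `p`, with one chosen threshold stored per class.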