<a href="https://colab.research.google.com/github/fvillena/dcc-ia-nlp/blob/master/5-embeddings-classification-sol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings and classification

In [1]:
!wget https://raw.githubusercontent.com/fvillena/workshopEmbeddingsAndClassifiers/master/corpus.txt
!wget https://raw.githubusercontent.com/fvillena/workshopEmbeddingsAndClassifiers/master/data.csv


In [2]:
import nltk
import gensim
import re
import pandas as pd
import sklearn
import numpy as np

We load our corpus, which contains many free-text diagnoses.

In [3]:
corpus = []
with open("corpus.txt", encoding="utf-8") as f:
  for line in f:
    corpus.append(line.rstrip())

We load our dataset, which contains diagnoses along with the specialties to which they were referred.

In [4]:
data = pd.read_csv("data.csv")
data = data.sample(len(data),random_state=11)

In [5]:
data

Unnamed: 0,diagnostic,specialty
13599,"Consulta, no especificada",TRAUMATOLOGIA
14183,"Coxartrosis (artrosis de la cadera) , (artrosis)",TRAUMATOLOGIA
6169,Consulta no Especificada,TRAUMATOLOGIA
40788,Retinopatia de la prematuridad,OFTALMOLOGIA
18139,Consulta no Especificada,TRAUMATOLOGIA
...,...,...
32081,Gingivoestomatitis y faringoamigdalitis herpética,OFTALMOLOGIA
7259,Luxacion de la rodilla,TRAUMATOLOGIA
21584,"Catarata senil, no especificada",OFTALMOLOGIA
36543,"Diabetes mellitus, no especificada",OFTALMOLOGIA


In [6]:
def normalizer(text): # lowercases a string, replaces non-letter characters with spaces, and strips accents from vowels
    text = text.lower() # lowercase the string
    text = re.sub(r'[^A-Za-zñáéíóú]', ' ', text) # replace every character outside this letter set (digits, punctuation, 'ü') with a space
    text = re.sub('á', 'a', text) # replace accented vowels with their base forms
    text = re.sub('é', 'e', text)
    text = re.sub('í', 'i', text)
    text = re.sub('ó', 'o', text)
    text = re.sub('ú', 'u', text)
    return text
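
To make the effect of the normalizer above concrete, here is a minimal sketch of a compact variant together with a sample input. `normalize_compact` and `ACCENT_MAP` are hypothetical names introduced for illustration; the sketch assumes the same character set as `normalizer` and is not the notebook's own code.

```python
import re

# Hypothetical compact variant of the normalizer (illustration only).
ACCENT_MAP = str.maketrans("áéíóú", "aeiou")

def normalize_compact(text):
    text = text.lower()                        # lowercase first
    text = re.sub(r"[^a-zñáéíóú]", " ", text)  # non-letters become spaces
    return text.translate(ACCENT_MAP)          # strip accents from vowels

print(normalize_compact("Luxación de la rodilla, no especificada"))
# luxacion de la rodilla  no especificada
```

Note the double space where the comma used to be: punctuation is replaced rather than deleted, so the tokenizer still sees a word boundary there.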

In [7]:
def preprocessor(text):
  text = normalizer(text)
  tokens = nltk.tokenize.casual_tokenize(text)
  return tokens

In [8]:
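def vectorizer(text, model): # returns the mean embedding of a list of tokens under the given model
    vectors = []
    for token in text:
        try:
            vectors.append(model.wv[token])
        except KeyError: # token is not in the model vocabulary
            pass
    if not vectors: # no known tokens: return zeros instead of averaging an empty list
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)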
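
The averaging step can be illustrated with a toy vocabulary. This is a minimal sketch: `toy_wv`, its 3-dimensional vectors, and `mean_pool` are hypothetical stand-ins for the trained model, and mapping an all-unknown document to zeros is an assumption of the sketch.

```python
import numpy as np

# Hypothetical toy embeddings standing in for model.wv (3 dimensions for readability).
toy_wv = {
    "fractura": np.array([1.0, 0.0, 2.0]),
    "cadera": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, wv, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

print(mean_pool(["fractura", "de", "cadera"], toy_wv))  # [2. 1. 1.]
print(mean_pool(["xyz"], toy_wv))                       # [0. 0. 0.]
```

The out-of-vocabulary token "de" is skipped, so the result is the average of the two known vectors only.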

## Activity 1: Computing the embeddings

Compute a word embedding using word2vec on `corpus`. Remember that you must preprocess the text before computing the embeddings.

In [9]:
corpus_preprocessed = list(map(preprocessor,corpus))

model = gensim.models.word2vec.Word2Vec(sentences = corpus_preprocessed)
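
Word2Vec learns from words that co-occur inside a sliding window over each tokenized sentence, which is why the corpus is passed as a list of token lists. The sketch below only illustrates that windowing idea; `context_pairs` is a hypothetical helper and not gensim's implementation, which adds further details.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["luxacion", "de", "la", "rodilla"], window=1)
print(len(pairs))  # 6
print(pairs[:3])   # [('luxacion', 'de'), ('de', 'luxacion'), ('de', 'la')]
```

Words that appear in similar contexts across many such pairs end up with similar vectors.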

## Activity 2: Classification

For each document stored in `data.diagnostic`, build a feature vector from the embeddings computed above. You must decide how to combine the embeddings of the individual words in a document into a single vector that represents the document.

Once you have computed the feature matrix, train a model that predicts the specialty stored in `data.specialty` using any algorithm you are familiar with, and compute the model's accuracy.

In [10]:
features = np.zeros(shape=(len(data),model.wv.vectors.shape[1]))
for i,diagnostic in enumerate(data.diagnostic):
  features[i,:] = vectorizer(preprocessor(diagnostic),model)
  
import sklearn.linear_model
import sklearn.model_selection

cv_results = sklearn.model_selection.cross_validate(sklearn.linear_model.LogisticRegression(max_iter=10000),features[:10000],data.specialty[:10000])
cv_results["test_score"].mean()


0.8427000000000001
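
The mean cross-validated accuracy above is easier to interpret next to a trivial baseline: the accuracy of always predicting the most frequent class. A minimal sketch, where `majority_baseline_accuracy` and the toy labels are hypothetical and not taken from the notebook's data:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical toy labels, not the notebook's data.
toy_labels = ["TRAUMATOLOGIA"] * 6 + ["OFTALMOLOGIA"] * 4
print(majority_baseline_accuracy(toy_labels))  # 0.6
```

In the notebook, the same function could be called on `data.specialty[:10000]` to see how far the classifier improves on that baseline.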

In [11]:
clf = sklearn.linear_model.LogisticRegression(max_iter=10000)
clf.fit(features[:10000],data.specialty[:10000])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Activity 3: Specialty predictor

Build a function that, given a diagnosis, returns the most appropriate specialty for referral using the model you trained above.

In [12]:
def predict_specialty(diagnostic):
  tokens = preprocessor(diagnostic)
  vector = vectorizer(tokens,model).reshape(1, -1)
  return clf.predict(vector)

In [13]:
predict_specialty("fractura de cadera")

array(['TRAUMATOLOGIA'], dtype=object)

## Activity 4: ELMo embeddings

In the following lines, we download an ELMo model trained on Spanish and convert our documents to embeddings using that model.

1.   How many dimensions does each word's embedding have?
2.   Find two different diagnoses that share a word. Is the vector for the shared word the same in each diagnosis?
3.   Compute a single vector for each diagnosis, train a classification model that solves the same task as before, and compare the ELMo and Word2Vec results.

In [14]:
!pip install elmoformanylangs


In [15]:
!wget http://vectors.nlpl.eu/repository/11/145.zip
!unzip 145.zip

Archive:  145.zip
  inflating: char.dic                
  inflating: config.json             
  inflating: encoder.pkl             
  inflating: meta.json               
  inflating: README                  
  inflating: token_embedder.pkl      
  inflating: word.dic                


In [16]:
import elmoformanylangs

In [17]:
e = elmoformanylangs.Embedder('')

2020-11-10 22:39:45,237 INFO: char embedding size: 2637
2020-11-10 22:39:47,006 INFO: word embedding size: 185214
2020-11-10 22:39:53,669 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(185214, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(2637, 50, padding_idx=2634)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_fe

In [18]:
sentences = list(map(preprocessor,data.diagnostic))

In [19]:
features_raw = e.sents2elmo(sentences[:10000])

2020-11-10 22:40:00,625 INFO: 157 batches, avg len: 6.8
2020-11-10 22:41:03,171 INFO: Finished 1000 sentences.
2020-11-10 22:42:04,859 INFO: Finished 2000 sentences.
2020-11-10 22:43:02,324 INFO: Finished 3000 sentences.
2020-11-10 22:44:05,181 INFO: Finished 4000 sentences.
2020-11-10 22:45:05,905 INFO: Finished 5000 sentences.
2020-11-10 22:46:18,890 INFO: Finished 6000 sentences.
2020-11-10 22:47:24,900 INFO: Finished 7000 sentences.
2020-11-10 22:48:24,405 INFO: Finished 8000 sentences.
2020-11-10 22:49:19,949 INFO: Finished 9000 sentences.
2020-11-10 22:50:23,406 INFO: Finished 10000 sentences.


In [20]:
sentences[7]

['trastornos', 'de', 'disco', 'lumbar', 'y', 'otros', 'con', 'radiculopatia']

In [21]:
sentences[13]

['otros', 'trastornos', 'de', 'los', 'meniscos']

In [22]:
features_raw[7][0]

array([ 0.63260055, -2.6295576 , -1.4965271 , ..., -1.6663014 ,
       -0.3976533 , -1.0974367 ], dtype=float32)

In [23]:
features_raw[13][1]

array([ 0.95266294, -2.3787699 , -1.0691546 , ..., -1.9494299 ,
       -0.4950823 , -1.503986  ], dtype=float32)
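
The two arrays above are the vectors for "trastornos" in two different diagnoses, and they clearly differ. One way to quantify how much is cosine similarity. A minimal sketch, where `cosine_similarity` is a hypothetical helper and the 3-dimensional vectors are toy values rather than ELMo output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors, not real embeddings.
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([1.0, 1.0, 0.0])
print(round(cosine_similarity(v1, v1), 4))  # 1.0
print(round(cosine_similarity(v1, v2), 4))  # 0.7071
```

In the notebook, `cosine_similarity(features_raw[7][0], features_raw[13][1])` would compare the two contextual vectors; a value below 1 confirms that ELMo assigns the same word different vectors depending on its context.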

In [24]:
features = []
for doc in features_raw:
  features.append(doc.mean(0))
features = np.vstack(features)
cv_results = sklearn.model_selection.cross_validate(sklearn.linear_model.LogisticRegression(max_iter=10000),features,data.specialty[:10000])
cv_results["test_score"].mean()

0.8714999999999999