## Word Embedding

The data we start with are hundreds of thousands of comments related to some UK hospitals. Each is associated with a label indicating whether the comment is useful (labeled 1) or not (labeled 0) for identifying an effective treatment. I have already cleaned the comments (e.g. lowercased, with punctuation and stop words removed), but I will not share them since the comment/label associations come from private communication.

In this notebook I show how I perform word embedding using the vocabulary defined in data_prep.ipynb. This process converts the comments into vectors, which are then ingested by the NLP algorithm.

We first import the vocabulary defined in data_prep.ipynb.

In [5]:
import json

with open('../vocabulary.txt', 'r') as f:
    word_in_vocab = json.loads(f.read())

We also load the dataframe that contains the cleaned comments.

In [6]:
import pandas as pd

df = pd.read_csv('../data/efftreat_clean_label.csv')
comments = df['comments_clean'].values

We use the binary bag-of-words embedding: each comment becomes a vector with one entry per vocabulary word, set to one if the word appears in the comment and to zero otherwise.

Here is an example for the sentence "jon has a red car". Suppose our vocabulary is ['car', 'bird', 'blue', 'red', 'pigeon']. The embedding of the sentence is then [1,0,0,1,0], which has the size of our vocabulary, and where ones mark the vocabulary words that appear in the sentence.
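The toy example above can be reproduced in a few lines. This is a minimal sketch using only the sentence and vocabulary from the text; the variable names are illustrative and are not taken from the notebook's pipeline.

```python
# Toy vocabulary and sentence taken from the example above
vocabulary = ['car', 'bird', 'blue', 'red', 'pigeon']
sentence = 'jon has a red car'

# Unique words in the sentence, for fast membership tests
tokens = set(sentence.split(' '))

# One entry per vocabulary word: 1 if present in the sentence, 0 otherwise
embedding = [1 if word in tokens else 0 for word in vocabulary]

print(embedding)  # [1, 0, 0, 1, 0]
```

Words that are not in the vocabulary ("jon", "has", "a") simply leave no trace in the vector.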

Because our vocabulary contains 1405 individual words, each comment is replaced by a sparse vector of size 1405. Recall that we also cut comments at a maximum length of 50 words, as explained in data_prep.ipynb.
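To illustrate the effect of the length cut, here is a small sketch on the toy vocabulary. The helper `embed_comment` is hypothetical (it is not defined in the notebook), and the cut is set to 4 words instead of 50 only so that the effect is visible on a short sentence.

```python
def embed_comment(comment, vocabulary, max_length=50):
    """Binary bag-of-words vector for one comment (hypothetical helper)."""
    # Keep only the first max_length words, then the unique tokens among them
    tokens = set(comment.split(' ')[:max_length])
    return [1 if word in tokens else 0 for word in vocabulary]

vocabulary = ['car', 'bird', 'blue', 'red', 'pigeon']
comment = 'jon has a red car'

full = embed_comment(comment, vocabulary)               # all 5 words kept
cut = embed_comment(comment, vocabulary, max_length=4)  # 'car' falls past the cut

print(full)  # [1, 0, 0, 1, 0]
print(cut)   # [0, 0, 0, 1, 0]
```

Whatever the cut, the output vector always has the size of the vocabulary; only the words considered for the lookup change.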

In [3]:
import numpy as np
import scipy
from tqdm import tqdm

max_length = 50
embedded_array = []
with tqdm(total=len(comments)) as pbar:
    for comment_i in comments:
        # Keep only the first max_length words, split once, and store the
        # unique tokens in a set for fast membership tests
        tokens = set(comment_i.split(' ')[:max_length])

        # 1 if the vocabulary word appears in the comment, 0 otherwise
        embedded_sentence = [1 if word_i in tokens else 0
                             for word_i in word_in_vocab]

        embedded_array.append(embedded_sentence)
        pbar.update()

embedded_array = np.array(embedded_array)

100%|██████████████████████████████████| 114909/114909 [04:09<00:00, 460.32it/s]


In [10]:
print(embedded_array.shape)

(114909, 1405)


The matrix embedded_array has shape (number of comments, size of the vocabulary). This embedding is used to pass the comments to the NLP algorithm, presented in the nlp_model.ipynb notebook.

### Concluding remarks

Throughout this notebook I have shown how I transform cleaned comments written in English into a sparse matrix that will be passed to the NLP algorithm. Note that the binary bag of words is not the most efficient way to embed text: it produces large sparse vectors populated by ones and zeros, and it carries no information about the meaning of the sentence. However, at this early stage it provides one of the simplest ways to embed text. Future improvements will include testing other embeddings, such as n-grams or word2vec.
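As a rough idea of what an n-gram variant could look like, the sketch below builds word bigrams and applies the same binary encoding to them. Both the `ngrams` helper and the bigram vocabulary are hypothetical and serve only as an illustration; nothing here has been tested on the actual comments.

```python
def ngrams(comment, n=2):
    """Return the word n-grams of a comment, in order of appearance."""
    words = comment.split(' ')
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical bigram vocabulary, for illustration only
bigram_vocabulary = ['red car', 'blue bird', 'has a']
comment = 'jon has a red car'

bigrams = ngrams(comment, n=2)
present = set(bigrams)

# Same binary encoding as before, but over bigrams instead of single words
embedding = [1 if gram in present else 0 for gram in bigram_vocabulary]

print(bigrams)    # ['jon has', 'has a', 'a red', 'red car']
print(embedding)  # [1, 0, 1]
```

Compared with single words, bigrams retain a little local word order, at the cost of a much larger vocabulary.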