# Exercise 4. Text Representation (2)
### Text, Web and Social Media Analytics

In this exercise, we will apply the following models to the data from the previous exercise: 

- Word2Vec
- Doc2Vec
- BERT (lemmatized)

At the end of the exercise, we will derive a corpus with each of them which can be used in later tasks such as classification and clustering. 

## Part 0. Preparation

As a first step, we install the transformers package, which includes state-of-the-art NLP models for TensorFlow and PyTorch.

In [None]:
!pip install transformers

Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


We now import all the necessary packages and libraries that we will be using.

In [None]:
import pickle
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import tensorflow as tf
import torch
from transformers import BertTokenizer, BertModel
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

We load our lemmatized data from the pickle file that we have saved in one of the previous exercises. We also print the first row to see that the data was loaded correctly.

In [None]:
lemmatized_data = pickle.load(open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/lemmatized_data.p', 'rb'))

print(lemmatized_data.iloc[0])

content         car wonder enlighten car see day door sport ca...
target                                                          7
target_names                                            rec.autos
Name: 0, dtype: object


After loading the data, we take each document and tokenize it, so we end up with a list of all the words in the document. We print the first record to see what the data looks like.

In [None]:
corpus_gen = lemmatized_data['content'].apply(lambda text: text.split())

print(corpus_gen[0])

['car', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'small', 'addition', 'bumper', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'engine', 'specs', 'year', 'production', 'car', 'history', 'info', 'funky', 'looking', 'car', 'mail', 'thank']


## Part 1. Word2Vec

Here we define a new Word2Vec model and train it on the corpus we created in the previous step. We set the 'size' parameter to 100 to define the dimensionality of the word vectors; we also set the 'min_count' parameter to 566, which tells the model to ignore all words whose total frequency in the corpus is lower than 566. After that, we save the model.

In [None]:
model = Word2Vec(corpus_gen, size=100, min_count=566)
model.save('word2vec.model')
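
To make the effect of 'min_count' concrete, the following is a minimal sketch of frequency-based vocabulary pruning. It is not gensim's actual implementation; the toy corpus, the threshold of 2, and the helper name `build_vocab` are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(corpus, min_count):
    """Return the words whose total corpus frequency is at least min_count."""
    # Count every occurrence of every word across all documents
    counts = Counter(word for document in corpus for word in document)
    return {word for word, count in counts.items() if count >= min_count}

# Hypothetical toy corpus: 'car' occurs twice, but only in one document
toy_corpus = [
    ['car', 'door', 'car'],
    ['bike', 'door'],
    ['engine'],
]

vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

In this toy example 'car' is kept even though it appears in a single document, because the threshold applies to the total number of occurrences rather than to the number of documents.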

In order to get a better idea about certain characteristics of the model, we print the features and the number of features.


In [None]:
print('Features:\n{}\n'.format(sorted(model.wv.vocab.keys())))
print('Num. of Features:\n{}'.format(len(model.wv.vocab.keys())))

Features:
['able', 'accept', 'access', 'act', 'action', 'actually', 'add', 'address', 'advance', 'ago', 'agree', 'allow', 'american', 'answer', 'anybody', 'appear', 'apple', 'application', 'apply', 'appreciate', 'apr', 'april', 'area', 'argument', 'armenian', 'armenians', 'article', 'ask', 'assume', 'atheist', 'attack', 'available', 'away', 'bad', 'base', 'begin', 'believe', 'better', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'break', 'bring', 'build', 'buy', 'call', 'car', 'card', 'care', 'carry', 'case', 'cause', 'center', 'certain', 'certainly', 'change', 'check', 'child', 'chip', 'christian', 'christians', 'church', 'city', 'claim', 'clear', 'clinton', 'clipper', 'close', 'code', 'color', 'com', 'come', 'comment', 'company', 'consider', 'contact', 'contain', 'continue', 'control', 'copy', 'correct', 'cost', 'country', 'couple', 'course', 'cover', 'create', 'crime', 'current', 'data', 'date', 'datum', 'david', 'day', 'deal', 'death', 'decide', 'design',

Here we show the vector representation of the word 'car', which has 100 dimensions, as we defined with the 'size' parameter when creating the model.

In [None]:
model.wv['car']

array([ 0.016478  ,  0.47469488, -0.0735962 , -1.5947578 , -0.71565026,
       -0.13002563,  0.92253655, -0.56696934,  0.4373428 , -0.16656286,
       -1.2158672 ,  0.14024894, -0.48905385, -0.10485542, -0.293814  ,
       -1.8627863 ,  0.1057101 , -0.46848246, -0.10593803, -0.5915618 ,
       -0.79466665,  1.5016022 ,  0.9591846 , -0.9669569 , -0.7194357 ,
        0.5617788 ,  1.0216058 , -1.7020634 , -0.47521907, -0.37625855,
        0.6199494 ,  1.5019242 ,  0.42780042,  0.44220775, -0.34955272,
       -1.3595812 , -0.6390408 ,  1.490715  , -1.0162692 ,  0.7565516 ,
       -0.32053503, -0.9799328 , -0.05697813,  0.1426159 ,  1.4504987 ,
       -0.21290368,  0.3151061 ,  0.6335179 , -0.22196509,  0.5143563 ,
        0.9986693 , -0.81887585,  1.2004273 , -0.50767195,  0.44180468,
        0.04145134,  1.1315997 , -1.2109666 , -0.810632  , -1.1241968 ,
        0.96450996,  0.6558029 ,  0.94009155, -0.6905603 ,  0.890224  ,
        1.3099073 ,  0.78042006, -1.3137709 , -0.1449007 , -0.17

We can also check which words are similar or close in the vector space to a given one, in this case, 'car'.

In [None]:
model.wv.most_similar('car')

[('bike', 0.6044567227363586),
 ('buy', 0.5782100558280945),
 ('friend', 0.5530164241790771),
 ('get', 0.49781185388565063),
 ('pay', 0.4936054050922394),
 ('price', 0.4751429557800293),
 ('speed', 0.4499210715293884),
 ('hit', 0.4419363737106323),
 ('drive', 0.44047486782073975),
 ('light', 0.4325913190841675)]

We can also query with more than one word at a time, in this case 'bike' and 'machine'. Since we only want the single most similar word, we set 'topn' to 1.

In [None]:
model.wv.most_similar(positive=['bike', 'machine'], topn=1)

[('fast', 0.654640793800354)]
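
As a rough illustration of how a query with several positive words can be resolved, the sketch below averages the unit-normalised query vectors and ranks the remaining words by cosine similarity to that average. This reflects our understanding of what `most_similar` does; the 2-dimensional toy vectors and the helper names `unit` and `most_similar` are made up for the example.

```python
import math

def unit(vector):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def most_similar(vectors, positive, topn=1):
    """Rank words by cosine similarity to the mean of the query vectors."""
    normed = [unit(vectors[word]) for word in positive]
    mean = unit([sum(column) / len(normed) for column in zip(*normed)])
    scores = []
    for word, vector in vectors.items():
        if word in positive:
            continue  # the query words themselves are excluded
        similarity = sum(a * b for a, b in zip(unit(vector), mean))
        scores.append((word, similarity))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical 2-dimensional vectors, chosen so that 'fast' lies between the two query words
toy_vectors = {
    'bike': [1.0, 0.0],
    'machine': [0.0, 1.0],
    'fast': [1.0, 1.0],
    'bible': [-1.0, 0.2],
}

best = most_similar(toy_vectors, positive=['bike', 'machine'], topn=1)
print(best[0][0])
```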

After seeing that each word is represented by a vector, we would also like to convert each document to a vector that is the average of all its word vectors. To do this, we go through each word of each document and get its vector representation; words that are not in the model's vocabulary are skipped. Once we have the vectors for all the known words, we average them to get a vector representation for the document. Documents that contain no known word at all get no embedding.

In [None]:
embedded_corpus = []

for document in corpus_gen:
  word_embeddings = []
  for word in document:
    try:
      word_embeddings.append(model.wv[word])
    except KeyError:
      continue
  if len(word_embeddings) > 0:
    document_embedding = np.mean(word_embeddings, axis=0)
    embedded_corpus.append(document_embedding)
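
Because documents without any in-vocabulary word are skipped, the list of embeddings can end up shorter than the corpus. One way to keep track of which document each embedding belongs to is sketched below. The toy vectors and the names `embed_documents` and `kept_indices` are assumptions for illustration, and plain Python lists stand in for the NumPy arrays used above.

```python
def embed_documents(corpus, word_vectors):
    """Average the word vectors of each document, remembering which documents were kept."""
    embeddings, kept_indices = [], []
    for index, document in enumerate(corpus):
        vectors = [word_vectors[word] for word in document if word in word_vectors]
        if not vectors:
            continue  # no known word, so no embedding for this document
        # Average dimension by dimension
        embeddings.append([sum(column) / len(vectors) for column in zip(*vectors)])
        kept_indices.append(index)
    return embeddings, kept_indices

# Hypothetical vocabulary of two words and a corpus whose second document has no known word
toy_vectors = {'car': [1.0, 3.0], 'door': [3.0, 1.0]}
toy_corpus = [['car', 'door'], ['bricklin'], ['car', 'car', 'unknown']]

embeddings, kept_indices = embed_documents(toy_corpus, toy_vectors)
print(embeddings)
print(kept_indices)
```

With `kept_indices`, position `i` in the list of embeddings maps back to row `kept_indices[i]` of the original data.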

We then convert these document embeddings to a dataframe, where each row is a document and each column is one dimension of the averaged word vectors. We then print the head to see what our dataframe looks like, and we also print its shape to understand its structure.

In [None]:
embeddings_df = pd.DataFrame(embedded_corpus)

print(embeddings_df.head())
print('\nShape: {}'.format(embeddings_df.shape))

         0         1         2   ...        97        98        99
0  0.145009  0.074307  0.062820  ... -0.002585 -0.446552 -0.051636
1 -0.328259 -0.023729  0.118994  ...  0.053510 -0.314329 -0.025866
2 -0.236099  0.099917  0.135177  ... -0.062771 -0.287006 -0.233269
3 -0.143364 -0.172010  0.220363  ...  0.060539 -0.210669 -0.076263
4 -0.043743  0.041052 -0.155883  ... -0.056644 -0.083022 -0.262626

[5 rows x 100 columns]

Shape: (11297, 100)


We now save the document embeddings derived from our Word2Vec model as a pickle file so we can access them later.

In [None]:
pickle.dump(embeddings_df, open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/WordtoVecModel.pkl', 'wb'))
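
A small sketch of the save-and-reload round trip, using a context manager so that the file handle is closed automatically. The temporary folder and the toy dictionary are placeholders for the Drive path and the dataframe used above.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the embeddings dataframe
toy_embeddings = {'doc_0': [0.1, 0.2], 'doc_1': [0.3, 0.4]}

# A temporary directory stands in for the storage folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'embeddings.pkl')
    with open(path, 'wb') as handle:
        pickle.dump(toy_embeddings, handle)
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

print(restored == toy_embeddings)
```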

We know that the document embeddings we computed before have the same dimension as the word embeddings, since they were made by averaging word vectors. This means we can also look for words whose representation is similar to that of a document; to do this, we take the first document embedding and look up similar words by vector.

In [None]:
first_document = embedded_corpus[0]

model.wv.similar_by_vector(first_document)

[('car', 0.872447669506073),
 ('friend', 0.6493353843688965),
 ('bike', 0.5698528289794922),
 ('get', 0.5553348064422607),
 ('buy', 0.5460785627365112),
 ('month', 0.5265671610832214),
 ('price', 0.513207733631134),
 ('see', 0.49368739128112793),
 ('go', 0.47965988516807556),
 ('lot', 0.4707129895687103)]

We can also look for documents that are similar to each other by calculating the cosine distance. To do this, we take one document, in this case the first one, calculate its cosine distance to all the other documents, and keep the document with the smallest distance.

In [None]:
min_distance = 1.0
most_similar_index = 0
counter = 0

for document in embedded_corpus:
  distance = cosine(first_document, document)
  if (distance < min_distance) and (distance != 0.0):
    min_distance = distance
    most_similar_index = counter
  counter += 1

most_similar_index, min_distance

(1044, 0.10285025835037231)
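
The cosine distance used here is one minus the cosine similarity, so it is 0 for vectors pointing in the same direction and grows as the angle between them widens. Below is a dependency-free sketch of the same nearest-neighbour search, with made-up 2-dimensional vectors and a hypothetical helper `nearest_document`.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_document(query_index, embeddings):
    """Index and distance of the closest document other than the query itself."""
    query = embeddings[query_index]
    candidates = [
        (cosine_distance(query, other), index)
        for index, other in enumerate(embeddings)
        if index != query_index
    ]
    distance, index = min(candidates)
    return index, distance

# Hypothetical document embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5], [-1.0, 0.0]]

index, distance = nearest_document(0, toy_embeddings)
print(index, round(distance, 4))
```

Skipping the query by its index, rather than by testing `distance != 0.0`, keeps exact duplicates of the query eligible as matches; which behaviour is preferable depends on the task.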

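However, when we print the document that is supposed to be similar, we see that the two are actually quite different. This suggests that averaging all the word embeddings might not produce a reliable document representation. Note also that documents without any in-vocabulary word were skipped when building 'embedded_corpus', so it has fewer entries than 'lemmatized_data' (11297 versus 11314), and an index into one does not necessarily point to the same document in the other.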

In [None]:
print(lemmatized_data['content'].iloc[0])
print(lemmatized_data['content'].iloc[most_similar_index])

car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank
old simms article apr csx cciw csx cciw stewart beal write article netnews upenn edu jhaines eniac seas upenn edu jason haines write wonder people good use old simms bunch apple mac know lot people try sell get inovative use want buy simms interested hearing guy work take use cyano acrylate glue wide panel constructs box use pencil holder get entreprenuerial spirit cheapy clear plastic box mount simm inside sell pet simm sure plenty sucker aaron


## Part 2. Doc2Vec

We now define a new Doc2Vec model and train it, setting 'vector_size' to 100 and 'min_count' to 566, just like we did with the previous model. The difference is that, instead of giving the model the raw corpus, we have to create tagged documents for it to work properly.

In [None]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus_gen)]
model = Doc2Vec(documents, vector_size=100, min_count=566)
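
A tagged document pairs a list of words with a list of tags, and here the tag is simply the document's position in the corpus. The sketch below uses a `namedtuple` as a stand-in for gensim's `TaggedDocument` so that it runs without gensim; the toy corpus is made up.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

# Hypothetical tokenized corpus
toy_corpus = [['car', 'door', 'sport'], ['car', 'alarm', 'price']]

documents = [TaggedDocument(words, [i]) for i, words in enumerate(toy_corpus)]

for document in documents:
    print(document.tags, document.words)
```

Using the position as the tag is what later allows a document vector to be retrieved by its integer index.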

This model creates an embedding for the whole document, instead of for each word in the document. We print the embedding of the first document to see what it looks like.

In [None]:
first_document_embedding = model.docvecs[0]

print(first_document_embedding)

[ 0.0276915  -0.00424333 -0.03035437  0.03665333 -0.03213951 -0.01816239
 -0.07778365 -0.03633784 -0.02787157  0.03769598 -0.06721046  0.04989615
  0.03432883  0.01098605  0.01431738 -0.01388856  0.03482941  0.01128545
 -0.01953517 -0.02494158 -0.04504299  0.03002725  0.0299719   0.03134677
  0.0103391  -0.0370337   0.04888672 -0.06606048 -0.01375777  0.05624491
  0.05905925 -0.00221229  0.0647403   0.04874262 -0.0057706   0.01769597
  0.00062595  0.04845971  0.0154444  -0.00515356  0.0259825  -0.03847988
  0.03785087 -0.04696894  0.06054603 -0.01189699  0.04639918  0.01431327
  0.02038807 -0.03153087 -0.01505555 -0.02887575 -0.01999183  0.05477902
  0.05484637  0.03216729 -0.01537522  0.00627428 -0.00896889  0.01775825
 -0.04210246  0.03084033 -0.01253119 -0.07095121 -0.04444231  0.01873594
  0.07175679 -0.02109042  0.04794594  0.01523552  0.03240154  0.03453762
 -0.01690733 -0.04336994  0.04585202 -0.03524435  0.03358722 -0.05555914
 -0.04826734 -0.01969656  0.05696304 -0.00813756 -0

Since we now have a vector representation for each document, we can look for the most similar document by calculating the cosine distance again. Using the same method as with the previous model, we find the document with the smallest distance to the first document.

In [None]:
min_distance = 1.0
most_similar_index = 0
counter = 0

for document in model.docvecs.vectors_docs:
  distance = cosine(first_document_embedding, document)
  if (distance < min_distance) and (distance != 0.0):
    min_distance = distance
    most_similar_index = counter
  counter += 1

most_similar_index, min_distance

(596, 0.09099602699279785)

We then print the first document as well as its closest in the vector space. We can clearly see that both documents are similar, unlike with the previous approach. 

In [None]:
print(lemmatized_data['content'].iloc[0])
print(lemmatized_data['content'].iloc[most_similar_index])

car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank
car alarm info ungo box want car alarm think get ungo box knowledge experience alarm price range different model good car alarm email responce cak lehigh edu chad chad


We now convert our embeddings into a dataframe like we did before, where each row is a different document and the columns are the vector representation. 

In [None]:
embeddings_df = pd.DataFrame(model.docvecs.vectors_docs)

print(embeddings_df.head())
print('\nShape: {}'.format(embeddings_df.shape))

         0         1         2         3         4         5         6   \
0  0.027692 -0.004243 -0.030354  0.036653 -0.032140 -0.018162 -0.077784   
1  0.027123  0.003297  0.003503  0.029629  0.005422  0.024020 -0.027331   
2  0.077605  0.061941  0.048014  0.027540  0.095746  0.014163  0.103565   
3  0.018683 -0.007956  0.014993 -0.002906 -0.051899  0.023039 -0.008452   
4  0.002189  0.001715 -0.009429  0.041566  0.018541  0.015386  0.005003   

         7         8         9   ...        90        91        92        93  \
0 -0.036338 -0.027872  0.037696  ... -0.036934  0.043027 -0.018570  0.012485   
1 -0.027808 -0.014980 -0.042095  ...  0.016356 -0.002130  0.012637 -0.001656   
2 -0.039304  0.006564 -0.047993  ... -0.028965  0.023155  0.000817  0.045025   
3 -0.023287 -0.020055 -0.002552  ...  0.033085 -0.024785  0.001221  0.006971   
4 -0.049475 -0.001632  0.014421  ...  0.015579  0.050467  0.037013  0.014271   

         94        95        96        97        98        99  
0 -0

We then save the document embeddings produced by our Doc2Vec model to a pickle file, which we can use later.

In [None]:
pickle.dump(embeddings_df, open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/DoctoVecModel.pkl', 'wb'))

## Part 3. BERT

We check whether our GPU is available and recognized by TensorFlow.

In [None]:
device_name = tf.test.gpu_device_name()

if device_name != '':
  print('Found GPU at: {}'.format(device_name))
else:
  raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


We then check if our GPU is also available and recognized by PyTorch. 

In [None]:
if torch.cuda.is_available():
  device = torch.device('cuda')
  print('There are %d GPU(s) available.' %torch.cuda.device_count())
  print('We will use the GPU: ', torch.cuda.get_device_name(0))
else:
  print('No GPU available, using the CPU instead.')
  device = torch.device('cpu')

There are 1 GPU(s) available.
We will use the GPU:  Tesla K80


We add the special tokens '[CLS]' and '[SEP]' at the beginning and at the end of each document respectively, because this is the input format that the BERT model was pre-trained on: '[CLS]' marks the start of the input and '[SEP]' marks the end of a segment.

In [None]:
sentences = ['[CLS] ' + query + ' [SEP]' for query in lemmatized_data['content']]

print(sentences[0])

[CLS] car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank [SEP]


We define a new BertTokenizer object from a pre-trained model and then tokenize each document. We print the first one to see how the BertTokenizer splits the words.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = [tokenizer.tokenize(sentence) for sentence in sentences]

print(tokenized_texts[0])

['[CLS]', 'car', 'wonder', 'en', '##light', '##en', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'brick', '##lin', 'door', 'small', 'addition', 'bumper', 'separate', 'rest', 'body', 'know', 'tell', '##me', 'model', 'engine', 'spec', '##s', 'year', 'production', 'car', 'history', 'info', 'funky', 'looking', 'car', 'mail', 'thank', '[SEP]']
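
The '##' prefix marks a piece that continues the previous one: words that are missing from the vocabulary are split into known sub-words. The following is a simplified sketch of greedy longest-match-first splitting, which is how WordPiece tokenization is usually described. The tiny vocabulary and the helper name `wordpiece` are assumptions; the real tokenizer has a vocabulary of 30,522 entries and performs additional normalisation steps.

```python
def wordpiece(word, vocab, unknown='[UNK]'):
    """Split a word into the longest vocabulary pieces, scanning from left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink it until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation of the previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical miniature vocabulary
toy_vocab = {'car', 'en', '##light', '##en', 'brick', '##lin'}

for word in ['car', 'enlighten', 'bricklin']:
    print(word, wordpiece(word, toy_vocab))
```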


Here we want to see the length of the longest tokenized document, and we can see that its number of tokens is quite large.

In [None]:
max_length = max([len(document) for document in tokenized_texts])

print(max_length)

8232


The number of tokens generated for each document varies a lot, so here we pad the sequences to give them all the same length. Since our sequences contain strings rather than numbers, we set the parameter 'dtype' to 'object', as recommended. We then set the parameter 'maxlen' to 512, which means that every sequence is padded or truncated to exactly this length; we chose 512 because it is the maximum sequence length that the BERT model accepts. The 'value' parameter is set to '[PAD]', which means that all the positions we fill will contain this string. Finally, both the 'truncating' and 'padding' parameters are set to 'post', which means that padding and truncation are applied at the end (the right side) of each sequence.

In [None]:
sentences_padded = pad_sequences(tokenized_texts, dtype=object, maxlen=512, 
                                 value='[PAD]', truncating='post', padding='post')
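
The effect of 'post' padding and 'post' truncation can be sketched in plain Python as follows. The helper `pad_or_truncate` and the short example sequences are assumptions for illustration, and a length of 6 is used instead of 512 to keep the output readable.

```python
def pad_or_truncate(tokens, maxlen, pad_token='[PAD]'):
    """Cut a sequence at maxlen or fill it up on the right with pad_token."""
    trimmed = tokens[:maxlen]  # 'post' truncation drops tokens from the end
    return trimmed + [pad_token] * (maxlen - len(trimmed))  # 'post' padding appends

# Hypothetical tokenized documents: one shorter and one longer than the target length
short_doc = ['[CLS]', 'car', 'mail', '[SEP]']
long_doc = ['[CLS]', 'car', 'wonder', 'en', '##light', '##en', '[SEP]']

padded_short = pad_or_truncate(short_doc, 6)
padded_long = pad_or_truncate(long_doc, 6)
print(padded_short)
print(padded_long)
```

Note that the truncated sequence loses its closing '[SEP]'. The same happens to every document longer than 512 tokens in the cell above.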

We now convert each of the tokens to a numerical id based on the vocabulary of the pre-trained tokenizer.

In [None]:
sentences_converted = [tokenizer.convert_tokens_to_ids(token) for token in sentences_padded]

We create a mask for each of the sequences we created in the previous step. The '[PAD]' token has the id 0 and every other token has a positive id, so we write a one if the id is greater than zero and a zero otherwise; in other words, the mask is one for real tokens and zero for padding.

In [None]:
masks = []

for seq in sentences_converted:
  seq_mask = [int(i>0) for i in seq]
  masks.append(seq_mask)
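
A toy illustration of the two steps, converting tokens to ids and deriving the mask from them. The vocabulary below is hypothetical apart from '[PAD]' being mapped to 0, which matches the padding index of the BERT vocabulary.

```python
# Hypothetical miniature vocabulary; only the id of '[PAD]' mirrors the real one
toy_vocab = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, 'car': 3, 'mail': 4}

tokens = ['[CLS]', 'car', 'mail', '[SEP]', '[PAD]', '[PAD]']

# Look up the id of every token, then mark real tokens with 1 and padding with 0
ids = [toy_vocab[token] for token in tokens]
mask = [int(token_id > 0) for token_id in ids]

print(ids)
print(mask)
```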

We now create two tensors: one from the sentences converted to ids, which we will use as inputs, and another from the masks, which we will use as attention masks.

In [None]:
inputs = torch.LongTensor(sentences_converted)
masks = torch.LongTensor(masks)

We define a new BertModel from its pre-trained setup 'bert-base-uncased'.

In [None]:
model = BertModel.from_pretrained('bert-base-uncased')

We then move the model to the device on which we will run it, which in our case is the GPU. The output also shows the architecture of the neural network that the model uses.

In [None]:
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

We now create a TensorDataset using the inputs and masks that we created before. We also define a SequentialSampler, which will take samples from our dataset. Finally, we create a DataLoader with our dataset, sampler, and using a batch size of 16. 

In [None]:
batch_size = 16
prediction_data = TensorDataset(inputs, masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

We now loop through the dataloader, which feeds the data to the model batch by batch. No training takes place here: the pre-trained model is only used to compute embeddings, which we append to our list of results.

In [None]:
result = []

for batch in prediction_dataloader:
    # Move the batch tensors to the same device as the model
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask = batch
    with torch.no_grad():
        # Pass the attention mask so that the [PAD] positions are ignored
        outputs = model(b_input_ids, attention_mask=b_input_mask)
    embeddings = outputs.pooler_output # pooled [CLS] representation for the batch
    embeddings = embeddings.detach().cpu().numpy()
    result.append(embeddings)
print('DONE')

DONE
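
As far as we understand the Hugging Face implementation, `pooler_output` is not the raw hidden state of '[CLS]' but that state passed through a dense layer and a tanh activation, which is why all of its values lie between -1 and 1. A toy sketch with made-up weights and a 2-dimensional hidden state:

```python
import math

def pooler(cls_hidden_state, weights, bias):
    """Dense layer followed by tanh, applied to the hidden state of the first token."""
    pooled = []
    for row, b in zip(weights, bias):
        activation = sum(w * h for w, h in zip(row, cls_hidden_state)) + b
        pooled.append(math.tanh(activation))
    return pooled

# Hypothetical values; the real model uses 768 dimensions and learned weights
cls_hidden_state = [2.0, -3.0]
weights = [[1.0, 0.0], [0.5, 0.5]]
bias = [0.0, 0.1]

pooled = pooler(cls_hidden_state, weights, bias)
print(pooled)
```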


Here we print the length of our result and the shape of the first batch to see what it looks like. We had 11314 documents and fed them in batches of 16, which results in 708 batches in total (707 full batches and a final batch of 2 documents). This is also why the first value of the batch shape is 16; the second value is 768 because that is the size of the embedding that the BERT model outputs.

In [None]:
print('Length of Result: {}'.format(len(result)))
print('Batch Shape: {}'.format(result[0].shape))

Length of Result: 708
Batch Shape: (16, 768)
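
The number of batches can be checked with a few lines of plain Python. The generator `batches` below is a hypothetical stand-in for sequential batching and does not use PyTorch.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items, in order."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

num_documents = 11314
batch_size = 16

# Integers stand in for the documents; only the batch sizes matter here
sizes = [len(batch) for batch in batches(list(range(num_documents)), batch_size)]

print(len(sizes))                             # number of batches
print(math.ceil(num_documents / batch_size))  # the same number, computed directly
print(sizes[-1])                              # size of the final batch
```

The final batch is smaller than the others, so only the first 707 entries of the result have the shape (16, 768).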


We loop through each batch extracting each embedding to then create a dataframe with all of them. We then print the head of the dataframe to understand how it looks. 

In [None]:
final = []

for batch in result:
   for embedding in batch:
      final.append(embedding)

final_df = pd.DataFrame(final)

print(final_df.head())

        0         1         2    ...       765       766       767
0 -0.127447 -0.267551 -0.952460  ... -0.817432 -0.201459 -0.155924
1 -0.128594 -0.297495 -0.960266  ... -0.858595 -0.153644 -0.225011
2 -0.364291 -0.418737 -0.972758  ... -0.922895 -0.341115 -0.155387
3 -0.208075 -0.404881 -0.984882  ... -0.915685 -0.226392 -0.209980
4 -0.134173 -0.371447 -0.981081  ... -0.921013 -0.209249 -0.286874

[5 rows x 768 columns]


Finally, we save our results to a pickle file, so we can use this data later on.

In [None]:
pickle.dump(final_df, open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/BertModel.pkl', 'wb'))