## BERT Embeddings
This notebook uses a pretrained BERT model to embed each project description as a fixed-length vector.  Even with the smallest model, embedding the New York dataset takes about 30 minutes.  The notebook writes the embedding for each project to a CSV file.

The notebook only needs to be run when a new embedding has to be calculated, for example after changing the model.

As a baseline, the smallest BERT model is used.  It produces a 1D vector of 512 elements for each text passed to it.

In [1]:
# install keras-bert if necessary

# !pip install keras-bert

In [2]:
import os
import csv
import numpy as np

import pandas as pd
from tqdm.notebook import tqdm

# https://github.com/CyberZHG/keras-bert
from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX

Using TensorFlow backend.


In [3]:
BERT_BASE_DIR = os.path.join(os.getcwd(), 'pretrained')
os.path.isdir(BERT_BASE_DIR)

True

### Select Pretrained BERT encoder
Many pretrained BERT encoders are available.  The more dimensions an encoder has, the longer it takes to embed a sentence and the more space the embedding occupies.

For the purpose of predicting project success, we only need an encoded space that represents the project description.  We will not use the embeddings for translation or for predictions based solely on the embedding.

In [4]:
# https://github.com/google-research/bert
# layers (L): 2, 4, 6, 8, 10, 12
# hidden sizes (H): 128, 256, 512, 768
# increasing either one increases the embedding size and the embedding time
# uncased_L-2_H-128_A-2     1.77s  512 elements (bert tiny) 64.2 *
# uncased_L-12_H-128_A-2    8.92s  1024 elements
# uncased_L-4_H-256_A-4     3.4s   2048 elements (bert mini) 65.8
# uncased_L-4_H-512_A-8     4.06s  4096 elements  (bert small) 71.2
# uncased_L-8_H-512_A-8     7.61s  4096 elements (bert medium) 73.5
# uncased_L-12_H-768_A-12   12.9s  6144 elements (bert base)
bert_model = 'uncased_L-2_H-128_A-2' 
model_path = os.path.join(BERT_BASE_DIR, bert_model)
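
The element counts listed in the comments above are consistent with a simple rule: hidden size × number of concatenated layers × number of poolings, where the embedding call later in this notebook requests 4 layers and 2 poolings (`POOL_NSP`, `POOL_MAX`).  The sketch below reproduces the listed sizes under that assumption.  `expected_embedding_size` is a hypothetical helper, and the cap at the model's own layer count is inferred from the 512-element figure for the 2-layer model, not taken from the keras-bert documentation.

```python
def expected_embedding_size(layers, hidden, output_layer_num=4, n_poolings=2):
    """Estimate the pooled embedding length (assumed rule, not a keras-bert API)."""
    # Assumption: a model cannot contribute more layers than it has.
    used_layers = min(layers, output_layer_num)
    return hidden * used_layers * n_poolings


# (layers, hidden size) pairs read from the model names listed above
models = {
    'uncased_L-2_H-128_A-2': (2, 128),
    'uncased_L-12_H-128_A-2': (12, 128),
    'uncased_L-4_H-256_A-4': (4, 256),
    'uncased_L-4_H-512_A-8': (4, 512),
    'uncased_L-8_H-512_A-8': (8, 512),
    'uncased_L-12_H-768_A-12': (12, 768),
}

for name, (layers, hidden) in models.items():
    print(name, expected_embedding_size(layers, hidden))
```

If the rule holds, it also gives a quick estimate of how wide the CSV rows will be before committing to a long embedding run.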


### Define output file
The calculated embeddings are written to a CSV file that another process can read.  Because embedding can take up to an hour, saving the result to a file is the most effective way to share it.

In [5]:
file_path = '../../data/Capital_Projects.csv'
if os.path.isfile(file_path):
    print("OK - path points to file.")
else:
    print("ERROR - check the 'file_path' and ensure it points to the source file.")

OK - path points to file.


### Read the Project Descriptions

In [6]:
data = pd.read_csv(file_path)
projects = data[['PID', 'Description']].drop_duplicates()

### Create Embedding CSV File
Create a CSV file that contains the PID and the embedded description.  To ensure that every embedding has the same length, the whole description is embedded as a single text rather than word by word.  Each embedding is stored as a comma-separated string so that it is easy to parse when the CSV file is read back.
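
As a minimal sketch of the preprocessing step described above, the helper below collapses a description into a single-element list, which is the shape passed to the embedding call.  `to_single_text` is a hypothetical name used only for illustration.  Missing descriptions are replaced by an underscore, as in the notebook.

```python
def to_single_text(description):
    """Return a one-element list holding the whole description as one text."""
    # Missing descriptions arrive from pandas as NaN (a float), not a string.
    if not isinstance(description, str):
        return ['_']
    # Split on periods, drop empty pieces, and rejoin with single spaces.
    sentences = [s.strip() for s in description.split('.') if s.strip()]
    return [' '.join(sentences)]


print(to_single_text('Replace roof. Upgrade boiler.'))
print(to_single_text(float('nan')))
```

Passing one text per project means the pooled output has the same length for every row, regardless of how many sentences the description contains.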

In [14]:
bert_model

'uncased_L-2_H-128_A-2'

In [16]:
output_file = '../data/processed/embeddings_' + bert_model + '.csv'
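
Since the embedding only has to be recomputed when the model changes, one option is to guard the slow cell with a check on the output file.  This is a sketch and not part of the original notebook: `needs_embedding` is a hypothetical helper, and treating an empty file as missing is an assumption.

```python
import os


def needs_embedding(path):
    """Return True when the embeddings file is missing or empty."""
    return not os.path.isfile(path) or os.path.getsize(path) == 0


# Example with a path that is assumed not to exist
print(needs_embedding('no_such_embeddings_file.csv'))
```

In the notebook, the embedding loop could then be wrapped in `if needs_embedding(output_file):`.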



In [220]:
# NOTE - This will take 30 minutes to execute
# If the file exists, you don't need to run this unless you are changing the model

with open(output_file, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=",")
    csv_writer.writerow(['PID', 'embedding'])

    for row in tqdm(projects.itertuples(), total=len(projects), desc="Creating embeddings"):
        
        # if the project description is missing (NaN), use an underscore placeholder
        if not isinstance(row.Description, str):
            desc = ['_']
        else:
            # Join all sentences into a list of one element.
            # This ensures that the output has the same length for each description.
            desc = [x.strip() for x in row.Description.split('.') if x.strip()]
            desc = [' '.join(desc)]
        
        # calculate embedding and format to store in csv file
        emb = extract_embeddings(model_path, desc, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])[0]
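        # format each value with str() so the stored text is plain numbers,
        # independent of how NumPy prints a list of scalars
        emb = ', '.join(str(v) for v in emb)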
        
        csv_writer.writerow([row.PID, emb])
            





### Done Creating Embeddings!

### Reading Embeddings
To read the embeddings, use Pandas to import the file and format the stored embedded values into a list of float values.

In [20]:
if os.path.isfile(output_file):
    print("OK - path points to file.")
else:
    print("ERROR - check the 'output_file' and ensure it points to the embeddings file.")
    print(output_file)

OK - path points to file.


In [21]:
embedding = pd.read_csv(output_file)

def convert(s):
    # parse the comma-separated string back into a list of floats
    return [float(x) for x in s.split(',')]

embedding['embedding'] = embedding['embedding'].apply(convert)

In [22]:
embedding.head()

Unnamed: 0,PID,embedding
0,3,"[-0.13848002, 1.4585834, -6.7887063, 0.0612462..."
1,7,"[-0.1312232, 1.1953796, -6.7208276, 0.06136747..."
2,18,"[0.0988148, 1.6704051, -6.5728025, 0.068978384..."
3,25,"[-0.2662505, 1.1822503, -6.7361383, 0.06858564..."
4,34,"[-0.35441703, 1.6325995, -6.6924543, 0.1016369..."
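
The storage format used above can be checked without running BERT.  The sketch below writes one made-up vector in the same `PID,embedding` layout to an in-memory buffer and parses it back with the standard `csv` module.  The PID and values are illustrative only, and the notebook itself uses pandas for this step.

```python
import csv
import io

# Illustrative values only; real embeddings come from the BERT model.
vector = [-0.125, 1.5, -6.75]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',')
writer.writerow(['PID', 'embedding'])
# The vector is stored as one comma-separated string; csv quotes it automatically.
writer.writerow([3, ', '.join(str(v) for v in vector)])

# Read the buffer back and convert the stored string into floats.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
parsed = [float(x) for x in rows[0]['embedding'].split(',')]
print(rows[0]['PID'], parsed)
```

Because the embedding field contains commas, it must stay quoted in the file; both `csv.writer` and pandas handle that quoting without extra options.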


In [None]:
# test cosine distance between two similarly described projects
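
One way to carry out the test noted above is to compute the cosine distance between two embedding vectors; similar descriptions should give a value close to 0.  The sketch below uses only the standard library, and the two vectors are made-up stand-ins for rows of the `embedding` column.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine distance is undefined for a zero vector')
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors standing in for two project embeddings
emb_a = [-0.14, 1.46, -6.79, 0.06]
emb_b = [-0.13, 1.20, -6.72, 0.06]
print(cosine_distance(emb_a, emb_b))
```

With the parsed DataFrame, the same function could be applied to two entries of `embedding['embedding']`, since each entry is a list of floats.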