# Skip-gram addr2vec

In this notebook, I'll convert addresses to vectors with TensorFlow and explore how well the resulting representations can correct for mistyped addresses.


## Word embeddings

When you're dealing with words in text, you end up with tens of thousands of classes to predict, one for each word. One-hot encoding these words is massively inefficient: each vector has one element set to 1 and the other 50,000 set to 0. In the matrix multiplication going into the first hidden layer, almost all of the products are zero. This is a huge waste of computation.

![one-hot encodings](assets/one_hot_encoding.png)

To solve this problem and greatly increase the efficiency of our networks, we use what are called embeddings. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding to the index of the "on" input unit.

![lookup](assets/lookup_matrix.png)

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.
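
As a quick check of the claim above, here is a minimal NumPy sketch showing that multiplying a one-hot vector by a weight matrix gives the same result as indexing the matching row. The sizes and the `word_id` value are made up for illustration and are not taken from the address data.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embedding_dim))  # the weight matrix

word_id = 3  # hypothetical integer code for some word

# Route 1: explicit one-hot vector times the weight matrix.
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
via_matmul = one_hot @ embedding

# Route 2: treat the weight matrix as a lookup table.
via_lookup = embedding[word_id]

same = np.allclose(via_matmul, via_lookup)
print(same)
```

The lookup touches a single row, while the matrix product multiplies every row by a value that is almost always zero.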

<img src='assets/tokenize_lookup.png' width=500>
 
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix as well.

Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.



In [1]:
import time

import numpy as np
import tensorflow as tf

import utils



Load the [OpenAddresses US Northeast dataset](https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-us_northeast.zip) and extract it into the `openaddr` directory if it is not already there. Then read the CSV files and build the address dictionaries.

In [15]:
import glob
import os
import pandas as pd


from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile

dataset_folder_path = 'openaddr'
dataset_filename = 'openaddr-collected-us_northeast.zip'
dataset_name = 'Openaddress Dataset'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isfile(dataset_filename):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc=dataset_name) as pbar:
        urlretrieve(
            'https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-us_northeast.zip',
            dataset_filename,
            pbar.hook)

if not isdir(dataset_folder_path):
    with zipfile.ZipFile(dataset_filename) as zip_ref:
        zip_ref.extractall(dataset_folder_path)


id_to_address = {}
address_to_id = {}
i = 0
for state in os.listdir('./openaddr/us'):
    
    for filename in glob.glob('./openaddr/us/{}/*.csv'.format(state)):
        csv = pd.read_csv(filename)
        # Print the column names once, after the first file has been read.
        if i == 0:
            print("Column names available: {}".format(csv.columns))
        stack = np.stack((csv['STREET'], csv['UNIT'],
                          csv['CITY'], csv['DISTRICT'], csv['REGION']), axis=-1)
        for j in stack:
            addr = " ".join([str(k).lower()
                             for k in j if not isinstance(k, type(np.nan))])
            addr +=  ' ' + state.lower()
            if addr not in address_to_id:
                id_to_address[i] = addr
                address_to_id[addr] = i
                i += 1
            
if '' in address_to_id:
    i = address_to_id['']
    del address_to_id['']
    del id_to_address[i]


Column names available: Index(['LON', 'LAT', 'NUMBER', 'STREET', 'UNIT', 'CITY', 'DISTRICT', 'REGION',
       'POSTCODE', 'ID', 'HASH'],
      dtype='object')




## Preprocessing

Here I'm fixing up the text to make training easier. This comes from the `utils` module I wrote. The `preprocess` function converts any punctuation into tokens, so a period is changed to ` <PERIOD> `. In this data set, there aren't any periods, but it will help in other NLP problems. I'm also removing all words that show up five or fewer times in the dataset. This reduces noise in the data and improves the quality of the vector representations. If you want to write your own functions for this stuff, go for it.
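
The `utils` module itself isn't shown here, so the following is only a rough sketch of what a function like `preprocess` might do under the description above. The name `preprocess_sketch`, the `PUNCT_TOKENS` map, and the `min_count` parameter are all hypothetical.

```python
from collections import Counter

# Hypothetical punctuation-to-token map; the real `utils.preprocess` may differ.
PUNCT_TOKENS = {'.': ' <PERIOD> ', ',': ' <COMMA> '}

def preprocess_sketch(text, min_count=6):
    """Tokenize punctuation, then keep words seen at least `min_count` times.

    min_count=6 drops every word that shows up five or fewer times.
    """
    text = text.lower()
    for symbol, token in PUNCT_TOKENS.items():
        text = text.replace(symbol, token)
    words = text.split()
    counts = Counter(words)
    return [w for w in words if counts[w] >= min_count]

# "main" and "st" appear six times each; "rare" and "ln" appear once.
sample = "main st. " * 6 + "rare ln."
tokens = preprocess_sketch(sample)
print(tokens[:3])
```

Dropping rare words after tokenizing means the punctuation tokens are counted like any other word.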

In [16]:
print(list(address_to_id)[:30])

['e pond rd somerset me me', 'quaker ln somerset me me', 'templinville ln somerset me me', 'redmond ln somerset me me', 'rockwood ln somerset me me', 'rocky ridge ln somerset me me', 'rome rd somerset me me', 'rose ln somerset me me', 'rosenthal ln somerset me me', 'ross hill rd somerset me me', 'sand hill rd somerset me me', 'smithfield rd somerset me me', 'stevens rd somerset me me', 'sunset meadow ln somerset me me', 'thomas ln somerset me me', 'valley ln somerset me me', 'village rd somerset me me', 'thompson rd york me me', 'timber ridge dr york me me', 'trout brook rd york me me', 'tuckers way york me me', 'walkers ln york me me', 'walkers ln a york me me', 'welch ln york me me', 'west ln york me me', 'windward ln york me me', 'abby ln york me me', 'adams st york me me', 'alden dr york me me', 'alexander dr york me me']


In [17]:
print("Total addresses: {}".format(len(address_to_id)))

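# Build the vocabulary from the overlapping character trigrams of every address.
# The "words" counted below are therefore 3-character windows, not whole words.
vocab_to_id = {}
int_to_vocab = {}
idx = 0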
for i, address in id_to_address.items():
    for j in range(len(address) - 2):
        if address[j:j+3] not in vocab_to_id:
            vocab_to_id[address[j:j+3]] = idx
            int_to_vocab[idx] = address[j:j+3]
            idx += 1

vocab_size = len(vocab_to_id)
print("Total unique words: {}".format(vocab_size))

Total addresses: 2329546
Total unique words: 29696
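
To see why character trigrams are a reasonable unit for catching typos, here is a small sketch comparing trigram sets. The Jaccard similarity is my own choice for illustration and is not used elsewhere in this notebook. The misspelled address is made up.

```python
def char_trigrams(text):
    """Return the set of overlapping 3-character windows in `text`."""
    return {text[j:j + 3] for j in range(len(text) - 2)}

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

clean = 'rocky ridge ln somerset me me'
typo = 'rocky ridge ln somerest me me'   # two letters transposed
other = 'adams st york me me'            # a different address

sim_typo = jaccard(char_trigrams(clean), char_trigrams(typo))
sim_other = jaccard(char_trigrams(clean), char_trigrams(other))
print(sim_typo, sim_other)
```

A single transposition only disturbs the few windows that overlap it, so the mistyped address still shares most of its trigrams with the clean one.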


In [None]:
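# Multi-hot encode each address: one slot per trigram in the vocabulary,
# set to 1.0 if that trigram occurs anywhere in the address.
features = []
labels = []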
for i, address in id_to_address.items():
    X = np.zeros(vocab_size)
    for j in range(len(address) - 2):
        X[vocab_to_id[address[j:j+3]]] = 1.0
    features.append(X)
    labels.append(address)
    
print("Example feature vector and label:")
print(features[500])
print(labels[500])
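
One caveat with the cell above: holding a dense vector for every address at once is expensive. The sketch below estimates the memory from the counts printed earlier and shows one possible alternative, building the multi-hot matrix one batch at a time. The helper names `dense_bytes` and `multi_hot_batches` are hypothetical, and the demo vocabulary is a tiny stand-in.

```python
import numpy as np

def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize

def multi_hot_batches(addresses, vocab_to_id, batch_size):
    """Yield (batch_matrix, batch_addresses) one batch at a time."""
    for start in range(0, len(addresses), batch_size):
        chunk = addresses[start:start + batch_size]
        X = np.zeros((len(chunk), len(vocab_to_id)), dtype=np.float32)
        for row, address in enumerate(chunk):
            for j in range(len(address) - 2):
                # Trigrams missing from the vocabulary are skipped.
                col = vocab_to_id.get(address[j:j + 3])
                if col is not None:
                    X[row, col] = 1.0
        yield X, chunk

# Address and vocabulary counts taken from the cell output above.
total_gb = dense_bytes(2329546, 29696) / 1e9
print("Dense float64 features: about {:.0f} GB".format(total_gb))

# Tiny stand-in data, for demonstration only.
demo_addresses = ['abc', 'bcd', 'abcd']
demo_vocab = {'abc': 0, 'bcd': 1}
batches = list(multi_hot_batches(demo_addresses, demo_vocab, batch_size=2))
```

With a generator like this, only one batch of features lives in memory at a time, at the cost of re-encoding addresses on every pass.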

In [None]:
# Size of the encoding layer (the hidden layer)
encoding_dim = 32

# Input and target placeholders
inp_shape = vocab_size

inputs_ = tf.placeholder(tf.float32, (None, inp_shape), name='inputs')
targets_ = tf.placeholder(tf.float32, (None, inp_shape), name='targets')

# Output of hidden layer, single fully connected layer here with ReLU activation
encoded = tf.layers.dense(inputs_, encoding_dim, activation=tf.nn.relu)

# Output layer logits, fully connected layer with no activation
logits = tf.layers.dense(encoded, inp_shape, activation=None)
# Sigmoid output from logits
decoded = tf.nn.sigmoid(logits, name='outputs')

# Sigmoid cross-entropy loss
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)
# Mean of the loss
cost = tf.reduce_mean(loss)

# Adam optimizer
opt = tf.train.AdamOptimizer(0.001).minimize(cost)
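
To make the loss in this cell concrete, here is a NumPy sketch of the numerically stable formula that the TensorFlow documentation gives for `tf.nn.sigmoid_cross_entropy_with_logits`, with logits `x` and targets `z`. The function name `sigmoid_cross_entropy` and the example values are mine. This is an illustration of the math, not a replacement for the TensorFlow op.

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Element-wise loss: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(targets, dtype=np.float64)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

# Made-up logits and multi-hot targets for a 3-slot vocabulary.
logits = np.array([[2.0, -1.0, 0.0]])
targets = np.array([[1.0, 0.0, 1.0]])

loss = sigmoid_cross_entropy(logits, targets)
cost = loss.mean()  # mirrors tf.reduce_mean over every element

# The direct definition, which is less stable for large |x|.
p = 1.0 / (1.0 + np.exp(-logits))
naive = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
print(np.allclose(loss, naive))
```

Each output slot is treated as an independent yes/no prediction, which fits a multi-hot target where several trigrams are "on" at once.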
