# Extracting and Storing Addresses

This tutorial demonstrates how to extract addresses from text and store the results in Postgres using the `PostgresStorage` class.

In [1]:
from estnltk import Text
from estnltk.taggers import AddressPartTagger, AddressGrammarTagger
from estnltk.core import rel_path
from estnltk.storage.postgres import PostgresStorage, JsonbTextQuery, JsonbLayerQuery, RowMapperRecord
from estnltk.storage.postgres import create_schema, delete_schema

In this tutorial we are going to use the following small toy dataset:

In [2]:
text_corpus = [
    'Kontor asub aadressil Rävala 5, Tallinn.',
    'Salong asub uuel aadressil, üle tee asuvas Rävala pst 7 hoones',
    'Korterite müük: Gonsiori tn 36, Tallinn'
]

First, let's save our dataset to the database:

In [3]:
storage = PostgresStorage(pgpass_file='~/.pgpass',
                          schema="grammarextractor")
create_schema(storage)

collection = storage.get_collection("texts_with_addresses")
collection.create()

with collection.insert() as collection_insert:
    for key, text in enumerate(text_corpus):
        collection_insert(Text(text).tag_layer(['words']), key=key)

INFO:postgres_storage.py:97: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'pault'
INFO:postgres_storage.py:109: role: 'pault'
INFO:db.py:84: new empty collection 'texts_with_addresses' created


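Next, we extract addresses in two steps and save each result as a separate layer: first the `address_tokens` layer, then the `addresses` layer that is built on top of it: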

In [4]:
address_token_tagger = AddressPartTagger(output_layer='address_tokens')

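def row_mapper_1(row):
    # Each selected row is (text_id, text); only the Text object is used here.
    text_id, text = row[0], row[1]
    # Tag the address parts and hand the resulting layer back for storage.
    layer = address_token_tagger.tag(text)["address_tokens"]
    return [RowMapperRecord(layer=layer, meta=None)]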

collection.create_layer('address_tokens',
                        data_iterator=collection.select(),
                        row_mapper=row_mapper_1)


address_tagger = AddressGrammarTagger(output_layer='addresses', input_layer='address_tokens')

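def row_mapper_2(row):
    # The selected Text already carries the 'address_tokens' layer,
    # which the grammar tagger uses as its input.
    text_id, text = row[0], row[1]
    layer = address_tagger.tag(text)['addresses']
    return [RowMapperRecord(layer=layer, meta=None)]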


collection.create_layer('addresses',
                        data_iterator=collection.select(layers=["address_tokens"]),
                        row_mapper=row_mapper_2)

INFO:db.py:823: collection: 'texts_with_addresses'
INFO:db.py:842: preparing to create a new layer: 'address_tokens'
INFO:db.py:908: layer created: 'address_tokens'
INFO:db.py:823: collection: 'texts_with_addresses'
INFO:db.py:842: preparing to create a new layer: 'addresses'
INFO:db.py:908: layer created: 'addresses'
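The row-mapper pattern used above can be illustrated without a database. The following is a plain-Python sketch under stated assumptions, not EstNLTK's implementation: `build_layer` and `toy_row_mapper` are hypothetical names, a dictionary stands in for `RowMapperRecord`, and the toy "layer" is just the list of purely numeric tokens in each text.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[int, str]  # (text_id, text), mirroring row[0] and row[1] above

text_corpus = [
    'Kontor asub aadressil Rävala 5, Tallinn.',
    'Salong asub uuel aadressil, üle tee asuvas Rävala pst 7 hoones',
    'Korterite müük: Gonsiori tn 36, Tallinn',
]


def toy_row_mapper(row: Row) -> List[dict]:
    """Return one record per row; the dict is a stand-in for RowMapperRecord."""
    text_id, text = row
    # Toy "tagger": keep tokens that are digits once punctuation is stripped.
    tokens = [token.strip(".,:") for token in text.split()]
    layer = [token for token in tokens if token.isdigit()]
    return [{"layer": layer, "meta": None}]


def build_layer(rows: Iterable[Row],
                row_mapper: Callable[[Row], List[dict]]) -> List[Tuple[int, dict]]:
    """Apply row_mapper to every row and collect (text_id, record) pairs."""
    stored = []
    for row in rows:
        for record in row_mapper(row):
            stored.append((row[0], record))
    return stored


stored = build_layer(enumerate(text_corpus), toy_row_mapper)
for text_id, record in stored:
    print(text_id, record["layer"])
```

The point of the pattern is that the iteration over rows and the storage of results stay generic, while the per-text work lives entirely in the mapper function.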


Let's now load one text object and see what's inside:

In [5]:
key, text = next(collection.select(layers=['addresses']))
text

text
"Kontor asub aadressil Rävala 5, Tallinn."

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| tokens | | | | False | 8 |
| compound_tokens | type, normalized | | tokens | False | 0 |
| words | normalized_form | | | False | 8 |
| address_tokens | grammar_symbol, type | | | True | 4 |
| addresses | grammar_symbol, TÄNAV, MAJA, ASULA, MAAKOND, INDEKS | | address_tokens | True | 1 |


As we can see, the `addresses` layer has the attributes TÄNAV (street), MAJA (house), ASULA (settlement), MAAKOND (county) and INDEKS (postal code), which can be used in searches. For example, we can search for records containing the street name 'Rävala' and the house number '5':

In [6]:
q = JsonbLayerQuery(layer_table=collection.layer_name_to_table_name("addresses"),
                    TÄNAV='Rävala', MAJA='5', ambiguous=True)
for key, text in collection.select(layer_query={'addresses': q}):
    print(text)

Text(text='Kontor asub aadressil Rävala 5, Tallinn.')
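To make the matching logic of such an attribute query concrete, here is a plain-Python sketch. It assumes that an ambiguous layer is a list of spans, that each span holds one or more annotations, and that a text matches when at least one annotation carries all requested attribute values. This is only an illustration of the assumed semantics: the real `JsonbLayerQuery` is evaluated inside Postgres, and the helper names and hand-written annotations below are hypothetical (chosen to be consistent with the search results shown in this tutorial).

```python
from typing import Dict, List

Annotation = Dict[str, str]
Span = List[Annotation]  # ambiguous layer: a span may carry several annotations


def span_matches(span: Span, wanted: Dict[str, str]) -> bool:
    """True if any annotation of the span has all the wanted attribute values."""
    return any(
        all(annotation.get(name) == value for name, value in wanted.items())
        for annotation in span
    )


def text_matches(layer: List[Span], wanted: Dict[str, str]) -> bool:
    """True if at least one span of the layer matches."""
    return any(span_matches(span, wanted) for span in layer)


# Hand-written toy 'addresses' layers keyed by text id (not real tagger output).
layers = {
    0: [[{"TÄNAV": "Rävala", "MAJA": "5"}]],
    2: [[{"TÄNAV": "Gonsiori tn", "MAJA": "36"}]],
}

wanted = {"TÄNAV": "Rävala", "MAJA": "5"}
matching = [key for key, layer in layers.items() if text_matches(layer, wanted)]
print(matching)
```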


Similarly, we can search with the `find_fingerprint` method, here for texts whose street name is 'Gonsiori tn':

In [7]:
q = {"field": "TÄNAV", "query": ["Gonsiori tn"], "ambiguous": True}
for key, text in collection.find_fingerprint(layer_query={'addresses': q}):
    print(text)

Text(text='Korterite müük: Gonsiori tn 36, Tallinn')


Finally, we clean up by deleting the schema created for this tutorial:

In [8]:
delete_schema(storage)