# pglex--a 'pretty good' lexical service

The pglex project seeks to create an API for publishing and retrieving lexical documents, i.e. dictionary entries. The goal is to provide a service that can be used to build dictionary websites and other applications for endangered languages.

Other participants: Andrew Garrett, Dmetri Hayes, Edwin Ko

# The big picture

## How things began

We wanted to replace the interactive dictionary and text features of the Karuk and Yurok websites created by Andrew Garrett circa 2005. The existing sites have several shortcomings:

- Performance is mediocre
- Certain desired queries are difficult to implement
- They are monolithic: data files, code, and display are tightly coupled, which makes them difficult to modify and reuse for other language projects

## Our goals

Replace the Karuk and Yurok dictionaries in a way that 1) improves the existing dictionaries; and 2) could benefit other language documentation and research projects.

- Existing dictionary functions will continue to work
- Create a generic solution that will have pretty good results for a wide variety of languages with minimal technical knowledge required of the researcher
- Create a service for purposes other than a dictionary website, e.g. language learning tools
- Queries
  - Allow matches that ignore diacritics (e.g. Máíhĩ̵̀kì)
  - Allow partial matches
  - Morphology-aware searches (for the contact language)
    - A search for 'dog' or 'dogs' returns the same results
- Provide a basic structure that is pretty good for many languages without restricting what can be in a lexical entry
  - Just bring your data!
- Faster performance

# The pglex API

An API (Application Programming Interface) defines interactions between software programs and allows them to communicate in predictable and meaningful ways. For our purposes you can think of the `pglex` API as a set of functions with internet addresses.

```
# pglex address
{base_url}/{project}/{function}

base_url = https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi
project = karuk
function = lex

# The function address
https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/lex
```
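The address pattern above can be sketched as a small helper. This is only an illustration: `pglex_url` and `BASE_URL` are hypothetical names, not part of `pglex` itself.

```python
# Base address taken from the example above; it may change as the service evolves.
BASE_URL = "https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi"


def pglex_url(project, function, base_url=BASE_URL):
    """Join base URL, project, and function into one function address."""
    # Strip stray slashes so the three parts can be joined cleanly.
    parts = [base_url.rstrip("/"), project.strip("/"), function.strip("/")]
    return "/".join(parts)


print(pglex_url("karuk", "lex"))
```

Only the `project` and `function` parts change from one call to the next, so the same helper produces the address of any function for any project.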

## The `lex` function

The `lex` function returns a lexical entry based on its identifier. It requires one parameter, `lexid`. One way to pass it is to append it to the URL. Try pasting the following into a web browser's address bar:

```
https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/lex/4783
```

Any web browser will do. Firefox is known to format the result nicely.

You can also call the function from a programming language. Here we use the Python [requests library](https://requests.readthedocs.io/en/master/) to access the function and provide the `lexid` parameter in a JSON payload:

```python
import requests
r = requests.post(
    'https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/lex',
    json={'lexid': ['4783']}
)
lexes = r.json()['hits']
```

```python
import requests
r = requests.post(
    'https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/lex',
    json={'lexid': ['4783', '4784']}
)
lexes = r.json()['hits']
lexes
r.json()
```
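For anything beyond interactive exploration it may help to wrap the call in a function that sets a timeout and checks for HTTP errors. The sketch below makes some assumptions: `fetch_lex` is a hypothetical helper name, and the `post` argument exists only so that a stand-in can replace the network call when testing.

```python
LEX_URL = "https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/lex"


def fetch_lex(lexids, url=LEX_URL, post=None, timeout=10):
    """Return the list of entries (the 'hits') for one or more identifiers."""
    if post is None:
        import requests  # imported lazily so a stand-in can be supplied instead
        post = requests.post
    # The examples above pass identifiers as a list of strings.
    payload = {"lexid": [str(lexid) for lexid in lexids]}
    response = post(url, json=payload, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of failing later
    return response.json().get("hits", [])
```

A call such as `fetch_lex(['4783', '4784'])` would then return the same list as `r.json()['hits']` in the example above.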

## The `q` function

The `q` function runs a query: it searches the fields of lexical entries for a match. It accepts a `q` parameter, whose value is matched against a combination of Karuk-language fields or English-language fields.

```python
r = requests.post(
    'https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/q',
    json={'q': 'dog'}
)
lexes = r.json()['hits']
```

```python
r = requests.post(
    'https://q3r0mu6cll.execute-api.us-west-1.amazonaws.com/devapi/karuk/q',
    json={'q': 'dog', 'pf': 10, 'explain': 'true'}
)
print(r.json()['total'])
r.json()
```

# Other possible queries

The `q` parameter is not required, and several other parameters can be used to construct a query. A few are illustrated here.

```
json={'q': '-a', 'flds': 'lex.lo'}  # Search in one specific field
json={'q': '-a', 'flds': 'lex.lo^20'}  # Search with a boost
json={'sdomain': 'mammal'}             # Filter by semantic domain
json={'q': 'dog', 'pos': 'verb'}       # Search for string and filter by part of speech
json={'sdomain': 'mammal', 'from': 10, 'size': 2}  # Page through the results
json={'sdomain': 'mammal', 'pf': 10, 'from': 0, 'size': 2}   # Popularity factor
json={'q': 'dog', 'explain': 'true'}  # The gory details
```
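Payloads like these can be assembled programmatically. The following is a minimal sketch, assuming only the parameter names shown above; `build_query` and `page_payloads` are hypothetical helpers, not part of `pglex`.

```python
def build_query(q=None, flds=None, sdomain=None, pos=None,
                start=None, size=None, pf=None, explain=False):
    """Assemble a JSON payload for the `q` function, omitting unset parameters."""
    candidates = {
        "q": q, "flds": flds, "sdomain": sdomain, "pos": pos,
        "from": start, "size": size, "pf": pf,
    }
    payload = {key: value for key, value in candidates.items() if value is not None}
    if explain:
        payload["explain"] = "true"  # the examples above pass the string 'true'
    return payload


def page_payloads(total, size, **params):
    """Yield one payload per page of `size` results, covering `total` results."""
    for start in range(0, total, size):
        yield build_query(start=start, size=size, **params)


for payload in page_payloads(5, 2, sdomain="mammal"):
    print(payload)
```

Each payload can be passed as the `json=` argument of `requests.post`, and the `total` value in a response could supply the first argument to `page_payloads`.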

# Using pglex

Some sample applications that use `pglex` as a data source.

## Karuk dictionary website (Edwin Ko)

Karuk dictionary, including example sentences and links to audio recordings where available:

`http://linguistics.berkeley.edu/scoil_dev_pglex/karuk`

## Karuk texts website (Dmetri Hayes)

A language-teaching application that contains a set of Karuk texts with audio and that provides detailed lexical information on request:

`https://linguistics.berkeley.edu/~dmetri/klamath`



# How it works

## Elasticsearch

Elasticsearch is a document-based database with these features:

- Sophisticated indexing for fast queries
- Scalable for very large corpora
- Flexible JSON documents
  - Can define 'pretty good' fields to be maximally useful in searches
    - Can ignore diacritics
    - Can simplify spellings, e.g. i/ɨ
    - Can ignore punctuation
  - Multiple values for a field usually okay
  - Missing fields in a document are not a problem
  - Extra undefined fields are not a problem
- Morphological analysis of English (or Spanish or...) fields
- Flexible query language
  - Calculates 'goodness' of a search result based on relevance
  - Can specify fields to search
  - Can weight the result by field matched
  - Can scale the 'goodness' rating by a field value, e.g. `popcnt`
  - Can filter by field
  - Partial matches with wildcards
  - Regular expression searches (not used by `pglex`)
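
To make a few of these features concrete, here is a rough sketch of what the corresponding Elasticsearch request bodies can look like, written as Python dictionaries. It is not the actual `pglex` configuration: the analyzer name `folded`, the `pos` field name, and the helper `scored_query` are assumptions made for illustration, while `lex.lo` and `popcnt` are borrowed from the examples in this document.

```python
import json

# Hypothetical index settings: an analyzer that lowercases and folds diacritics.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}


def scored_query(text, fields, popularity_field="popcnt", pos=None):
    """Build a relevance query whose score is adjusted by a numeric field."""
    bool_query = {"must": [{"multi_match": {"query": text, "fields": fields}}]}
    if pos is not None:
        # Filters restrict the result set without affecting the score.
        bool_query["filter"] = [{"term": {"pos": pos}}]  # assumed field name
    return {
        "query": {
            "function_score": {
                "query": {"bool": bool_query},
                "field_value_factor": {
                    "field": popularity_field,
                    "modifier": "log1p",
                    "missing": 0,
                },
                # Add the popularity term to the relevance score
                # rather than multiplying by it.
                "boost_mode": "sum",
            }
        }
    }


print(json.dumps(scored_query("dog", ["lex.lo^20"], pos="verb"), indent=2))
```

The `^20` suffix weights matches in that field more heavily, and the `field_value_factor` block is one way to scale the 'goodness' rating by a field value.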

## AWS API Gateway

Provides a name (URL) for incoming queries and hands them off to AWS Lambda.

## AWS Lambda

A service for running a 'serverless' function, in our case written in Python. The `q` function:

1. Accepts a request sent by the API Gateway.
1. Creates an Elasticsearch query based on the request parameters.
1. Submits the query to our Elasticsearch instance.
1. Receives the Elasticsearch query result.
1. Packages the result and sends a response to the requester.
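
The five steps can be sketched as a Lambda handler. This is a simplified illustration rather than the real `pglex` code: it assumes the API Gateway uses Lambda proxy integration (so the request JSON arrives as a string in `event['body']`), and `build_es_query`, `make_handler`, and the injected `search` callable are hypothetical names.

```python
import json


def build_es_query(params):
    """Translate request parameters into a (much simplified) Elasticsearch query."""
    if "q" in params:
        match = {"query": params["q"]}
        if "flds" in params:
            match["fields"] = [params["flds"]]
        query = {"query": {"multi_match": match}}
    else:
        query = {"query": {"match_all": {}}}
    # Paging parameters are passed through unchanged.
    for key in ("from", "size"):
        if key in params:
            query[key] = params[key]
    return query


def make_handler(search):
    """Return a Lambda handler that runs queries with `search(query) -> dict`."""
    def handler(event, context):
        # 1. Accept the request forwarded by the API Gateway.
        params = json.loads(event.get("body") or "{}")
        # 2. Create the Elasticsearch query.
        query = build_es_query(params)
        # 3-4. Submit the query and receive the result.
        result = search(query)
        # 5. Package the result as an HTTP response.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(result),
        }
    return handler
```

Passing the search function in as an argument keeps the handler independent of any particular Elasticsearch client, which also makes it easy to exercise locally with a stand-in.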

# Fun with string matching

Matching strings can be surprisingly difficult. Why don't the following strings match?

```python
s0 = 'á'
s1 = 'á'
s0 == s1
```

If we print the strings in a specific encoding we can see that they contain different bytes. This is because `s0` contains the precomposed character ['LATIN SMALL LETTER A WITH ACUTE'](http://www.fileformat.info/info/unicode/char/00e1/index.htm) and `s1` is the decomposed form with two characters, ['LATIN SMALL LETTER A'](http://www.fileformat.info/info/unicode/char/0061/index.htm) and ['COMBINING ACUTE ACCENT'](http://www.fileformat.info/info/unicode/char/0301/index.htm).

```python
print(s0.encode('utf8'))
print(s1.encode('utf8'))
```
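
One common remedy is to normalize both strings to the same Unicode form before comparing them. The sketch below uses Python's standard `unicodedata` module and writes the two strings with escape sequences so that the difference survives copy and paste. Whether `pglex` normalizes its data this way is not stated here, so treat it as an illustration of the technique only.

```python
import unicodedata

s0 = "\u00e1"   # precomposed: LATIN SMALL LETTER A WITH ACUTE
s1 = "a\u0301"  # decomposed: LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

print(s0 == s1)                                 # False
print([unicodedata.name(c) for c in s1])        # names of the two code points
print(unicodedata.normalize("NFC", s1) == s0)   # True: composed forms match
print(unicodedata.normalize("NFD", s0) == s1)   # True: decomposed forms match
```

Either form works, provided that the stored data and the incoming query are normalized the same way.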

```python
s = 'M á í h ĩ̵̀ k ì'
print(s.encode('utf8'))
```
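
Diacritic-insensitive matching, one of the goals listed earlier, can be approximated in plain Python by decomposing a string and dropping its combining marks. This is a minimal sketch of the idea, not how Elasticsearch or `pglex` implements it; `fold_diacritics` and `loose_equal` are hypothetical helper names.

```python
import unicodedata


def fold_diacritics(text):
    """Decompose text and drop combining marks (Unicode categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )


def loose_equal(a, b):
    """Compare two strings, ignoring diacritics and letter case."""
    return fold_diacritics(a).casefold() == fold_diacritics(b).casefold()


print(fold_diacritics("\u00e1"), fold_diacritics("a\u0301"))  # both print as 'a'
print(loose_equal("M\u00e1\u00ed", "mai"))
```

Letters that have no Unicode decomposition, such as ɨ (U+0268), pass through unchanged, so spelling simplifications like i/ɨ need an explicit mapping in addition to this kind of folding.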