Merge branch 'master' into advanced-doc
jdesboeufs committed Apr 9, 2023
2 parents 4463430 + d0e81b2 commit 3968fa9
Showing 23 changed files with 203 additions and 114 deletions.
45 changes: 45 additions & 0 deletions .github/workflows/python.yml
@@ -0,0 +1,45 @@
name: Python package
on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version:
          - "3.8"
          - "3.9"
          - "3.10"
          - "3.11"
    services:
      redis:
        image: redis
        options:
          --health-cmd "redis-cli ping" --health-interval 10s --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          make develop
      - name: Test with pytest
        run: |
          make testcoverage
      - name: Coveralls
        uses: coverallsapp/github-action@v1
        with:
          path-to-lcov: coverage.lcov
1 change: 1 addition & 0 deletions .gitignore
@@ -27,3 +27,4 @@ local*.py
*.db
*.tar.bz2
.pytest_cache/
coverage.lcov
33 changes: 0 additions & 33 deletions .travis.yml

This file was deleted.

59 changes: 33 additions & 26 deletions CHANGELOG.md
@@ -1,21 +1,28 @@
## dev
## 1.1.0

- Added `/health` endpoint to monitor Addok (#[750](https://github.com/addok/addok/issues/750))

## 1.1.0-rc2

- Added `load_csv_file` batch loader
- Fixed `type=housenumber` also returning other results in some cases (#478)
- Fixed ordering of housenumbers with non alpha-num chars (#656)
- Fixed `type=housenumber` also returning other results in some cases (#[478](https://github.com/addok/addok/issues/478))
- Fixed ordering of housenumbers with non alpha-num chars (#[656](https://github.com/addok/addok/issues/656))
- Added `ID_FIELD` to control which field is used as document `_id`
- `config.SYNONYMS_PATH` is now `config.SYNONYMS_PATHS` and is a list to allow
multiple files
- Fixed non unique id accross multiple docker sharing same Redis instance (#607)
- Added more variants for `lat` and `lon` params and better control their values (#592)
- Better ordering of candidates in case of autocomplete (#494)
- Fixed non unique id across multiple docker sharing same Redis instance (#[607](https://github.com/addok/addok/issues/607))
- Added more variants for `lat` and `lon` params and better control their values (#[592](https://github.com/addok/addok/issues/592))
- Better ordering of candidates in case of autocomplete (#[494](https://github.com/addok/addok/issues/494))
- By default, use more common chars when building fuzzy variants
- Added python >= 3.8 compat
- Restore legacy scoring algorithm (#[746](https://github.com/addok/addok/issues/746)): the new experimental scoring must be
activated manually, replacing `addok.helpers.results.score_by_ngram_distance` with
`addok.helpers.results.score_by_str_distance` in `SEARCH_RESULT_PROCESSORS_PYPATHS`


## 1.1.0-rc1

- Faster new scoring algorithm (#431)
- Faster new scoring algorithm (#[431](https://github.com/addok/addok/issues/431))
- Upgraded Falcon to 1.4.1
- `autocomplete` and `fuzzy` are not adding any more their collectors automagically,
instead they are now hard coded in the default config; if you haven't changed
@@ -90,46 +97,46 @@ in the documentation.

## 0.5.0
- Expose housenumber parent name in result geojson
- add support for housenumber payload ([#134](https://github.com/etalab/addok/issues/134))
- Fix clean_query being too much greedy for "cs" ([#125](https://github.com/etalab/addok/issues/125)
- add support for housenumber payload ([#134](https://github.com/addok/addok/issues/134))
- Fix clean_query being too much greedy for "cs" ([#125](https://github.com/addok/addok/issues/125)
- also accept long for longitude
- replace "s/s" in French preprocessing
- fix autocomplete querystring casting to boolean
- Always add housenumber in label candidates if set ([#120](https://github.com/etalab/addok/issues/120))
- make CSVView more hackable by plugins ([#116][https://github.com/etalab/addok/issues/116))
- Always add housenumber in label candidates if set ([#120](https://github.com/addok/addok/issues/120))
- make CSVView more hackable by plugins ([#116][https://github.com/addok/addok/issues/116))


## 0.4.0
- fix filters not taken into account in manual scan ([#105](https://github.com/etalab/addok/issues/105))
- fix filters not taken into account in manual scan ([#105](https://github.com/addok/addok/issues/105))
- added experimental list support for document values
- Added MIN_EDGE_NGRAMS and MAX_EDGE_NGRAMS settings ([#102](https://github.com/etalab/addok/issues/102))
- Added MIN_EDGE_NGRAMS and MAX_EDGE_NGRAMS settings ([#102](https://github.com/addok/addok/issues/102))
- documented MAKE_LABELS setting
- Allow to pass functions as PROCESSORS, instead of path
- remove raw housenumbers returned in result properties
- do not consider filter if column is empty, in csv ([#109](https://github.com/etalab/addok/issues/109))
- allow to pass lat and lon to define columns to be used for geo preference, in csv ([#110](https://github.com/etalab/addok/issues/110))
- replace "s/" by "sur" in French preprocessing ([#107](https://github.com/etalab/addok/issues/107))
- do not consider filter if column is empty, in csv ([#109](https://github.com/addok/addok/issues/109))
- allow to pass lat and lon to define columns to be used for geo preference, in csv ([#110](https://github.com/addok/addok/issues/110))
- replace "s/" by "sur" in French preprocessing ([#107](https://github.com/addok/addok/issues/107))
- fix server failing when document was missing `importance` value
- refuse to load if `ADDOK_CONFIG_MODULE` is given but not found
- allow to set ADDOK_CONFIG_MODULE with command line parameter `--config`
- mention request parameters in geojson ([#113](https://github.com/etalab/addok/issues/113))
- mention request parameters in geojson ([#113](https://github.com/addok/addok/issues/113))


## 0.3.1

- fix single character wrongly glued to housenumber ([#99](https://github.com/etalab/addok/issues/99))
- fix single character wrongly glued to housenumber ([#99](https://github.com/addok/addok/issues/99))

## 0.3.0

- use housenumber id as result id, when given ([#38](https://github.com/etalab/addok/issues/38))
- shell: warn when requested id does not exist ([#75](https://github.com/etalab/addok/issues/75))
- use housenumber id as result id, when given ([#38](https://github.com/addok/addok/issues/38))
- shell: warn when requested id does not exist ([#75](https://github.com/addok/addok/issues/75))
- print filters in debug mode
- added filters to CSV endpoint ([#67](https://github.com/etalab/addok/issues/67))
- also accept `lng` as parameter ([#88](https://github.com/etalab/addok/issues/88))
- add `/get/` endpoint ([#87](https://github.com/etalab/addok/issues/87))
- added filters to CSV endpoint ([#67](https://github.com/addok/addok/issues/67))
- also accept `lng` as parameter ([#88](https://github.com/addok/addok/issues/88))
- add `/get/` endpoint ([#87](https://github.com/addok/addok/issues/87))
- display distance in meters (not kilometers)
- add distance in single `/reverse/` call
- workaround python badly sniffing csv file with only one column ([#90](https://github.com/etalab/addok/issues/90))
- add housenumber in csv results ([#91](https://github.com/etalab/addok/issues/91))
- CSV: renamed "result_address" to "result_label" ([#92](https://github.com/etalab/addok/issues/92))
- workaround python badly sniffing csv file with only one column ([#90](https://github.com/addok/addok/issues/90))
- add housenumber in csv results ([#91](https://github.com/addok/addok/issues/91))
- CSV: renamed "result_address" to "result_label" ([#92](https://github.com/addok/addok/issues/92))
- no BOM by default in UTF-8
6 changes: 3 additions & 3 deletions Makefile
@@ -4,16 +4,16 @@ develop:
test:
py.test
testcoverage:
py.test --cov=addok/
py.test --cov-report lcov --cov=addok/
testall:
py.test --quiet
cd ../addok-france && py.test --quiet
cd ../addok-fr && py.test --quiet
cd ../addok-csv && py.test --quiet
cd ../addok-sqlite-store && py.test --quiet
clean:
rm -rf dist/* build/*
dist:
rm -rf dist/ build/
dist: test
python setup.py sdist bdist_wheel
upload:
twine upload dist/*
4 changes: 1 addition & 3 deletions README.md
@@ -14,11 +14,9 @@ to around 2000 searches per second.
Check the [documentation](http://addok.readthedocs.org/en/latest/) and a
[demo](http://adresse.data.gouv.fr/map) with French data.

For discussions, please use the [discourse Geocommun forum](https://forum.geocommuns.fr/c/adresses/addok-le-geocodeur/17). Discussions are mostly French, but English is very welcome.
For discussions, please use the [discourse Geocommun forum](https://forum.geocommuns.fr/c/adresses/addok-le-geocodeur/17). Discussions are mostly French, but English is very welcome.

Powered by Python and Redis.

[![Build Status](https://travis-ci.org/addok/addok.svg?branch=master)](https://travis-ci.org/addok/addok)
[![Requirements Status](https://requires.io/github/addok/addok/requirements.svg?branch=master)](https://requires.io/github/addok/addok/requirements/?branch=master)
[![PyPi version](https://img.shields.io/pypi/v/addok.svg)](https://pypi.python.org/pypi/addok/)
[![Coverage Status](https://coveralls.io/repos/addok/addok/badge.svg?branch=master&service=github)](https://coveralls.io/github/addok/addok?branch=master)
8 changes: 4 additions & 4 deletions addok/config/default.py
@@ -17,10 +17,10 @@
BUCKET_MIN = 10
BUCKET_MAX = 100

# Above this treshold, terms are considered commons.
# Above this threshold, terms are considered commons.
COMMON_THRESHOLD = 10000

# Above this treshold, we avoid intersecting sets.
# Above this threshold, we avoid intersecting sets.
INTERSECT_LIMIT = 100000

# Min score considered matching the query.
@@ -86,7 +86,7 @@
"addok.helpers.results.make_labels",
"addok.helpers.results.score_by_importance",
"addok.helpers.results.score_by_autocomplete_distance",
"addok.helpers.results.score_by_str_distance",
"addok.helpers.results.score_by_ngram_distance",
"addok.helpers.results.score_by_geo_distance",
"addok.helpers.results.adjust_scores",
]
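
The changelog entry for #746 describes opting back in to the experimental scoring by swapping one processor path. A minimal sketch of that swap for a local config, assuming only the entries visible in this hunk (the full default list is not shown here, and `use_experimental_scoring` is a hypothetical helper, not part of addok):

```python
LEGACY = "addok.helpers.results.score_by_ngram_distance"
EXPERIMENTAL = "addok.helpers.results.score_by_str_distance"


def use_experimental_scoring(processors):
    """Return a copy of the processor paths with the n-gram scorer swapped out."""
    return [EXPERIMENTAL if path == LEGACY else path for path in processors]


# Only the entries visible in the hunk above; the real default list may be longer.
visible_defaults = [
    "addok.helpers.results.make_labels",
    "addok.helpers.results.score_by_importance",
    "addok.helpers.results.score_by_autocomplete_distance",
    "addok.helpers.results.score_by_ngram_distance",
    "addok.helpers.results.score_by_geo_distance",
    "addok.helpers.results.adjust_scores",
]

SEARCH_RESULT_PROCESSORS_PYPATHS = use_experimental_scoring(visible_defaults)
```

Building the list from the defaults rather than retyping it keeps the processor order intact, which matters because each processor runs in sequence.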
@@ -160,7 +160,7 @@

INDEX_EDGE_NGRAMS = True

# surrouding letters on a standard keyboard (default french azerty)
# surrounding letters on a standard keyboard (default french azerty)
FUZZY_KEY_MAP = {
"a": "ezqop",
"z": "aqse",
18 changes: 15 additions & 3 deletions addok/helpers/__init__.py
@@ -1,5 +1,6 @@
import csv
import sys
from itertools import islice
from functools import wraps
from importlib import import_module
from math import asin, cos, exp, radians, sin, sqrt
@@ -174,7 +175,18 @@ def imap_unordered(self, func, iterable, chunksize):

def parallelize(func, iterable, chunk_size, **bar_kwargs):
bar = Bar(prefix="Processing…", **bar_kwargs)
with ChunkedPool(processes=config.BATCH_WORKERS) as pool:
for chunk in pool.imap_unordered(func, iterable, chunk_size):

if sys.platform == "darwin":
while True:
chunk = list(islice(iterable, chunk_size))
if not chunk:
bar.finish()
break

func(*chunk)
bar(step=len(chunk))
bar.finish()
else:
with ChunkedPool(processes=config.BATCH_WORKERS) as pool:
for chunk in pool.imap_unordered(func, iterable, chunk_size):
bar(step=len(chunk))
bar.finish()
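
A sketch of the sequential fallback taken on `darwin`, assuming `func` accepts the items of a chunk as positional arguments, as `func(*chunk)` suggests. `iter_chunks` and `run_sequentially` are hypothetical names, and the progress bar is left out. The sketch wraps the input in `iter()` so that repeated `islice` calls advance through it even when a list is passed in:

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(iterable)  # ensures repeated islice calls advance
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def run_sequentially(func, iterable, chunk_size):
    """Call func(*chunk) for each chunk and return the number of items processed."""
    processed = 0
    for chunk in iter_chunks(iterable, chunk_size):
        func(*chunk)
        processed += len(chunk)
    return processed
```

Without the `iter()` call, `islice` applied to a list would return the same leading items on every pass.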
25 changes: 22 additions & 3 deletions addok/helpers/results.py
@@ -1,6 +1,13 @@
from addok.config import config
from addok.helpers import haversine_distance, km_to_score
from addok.helpers.text import ascii, compare_str, contains, equals, startswith
from addok.helpers.text import (
    ascii,
    compare_ngrams,
    compare_str,
    contains,
    equals,
    startswith,
)


def make_labels(helper, result):
@@ -67,7 +74,7 @@ def score_by_autocomplete_distance(helper, result):
if score:
result.add_score("str_distance", score, ceiling=1.0)
if not score:
_score_by_str_distance(helper, result, scale=0.9)
_score_by_ngram_distance(helper, result, scale=0.9)


def _score_by_str_distance(helper, result, scale=1.0):
@@ -82,7 +89,19 @@ def score_by_str_distance(helper, result):
_score_by_str_distance(helper, result)


score_by_ngram_distance = score_by_str_distance # Retrocompat.
def _score_by_ngram_distance(helper, result, scale=1.0):
    for label in result.labels:
        label = ascii(label)
        score = compare_ngrams(label, helper.query) * scale
        result.add_score("str_distance", score, ceiling=1.0)
        if score >= config.MATCH_THRESHOLD:
            break


def score_by_ngram_distance(helper, result):
    if helper.autocomplete:
        return
    _score_by_ngram_distance(helper, result)


def score_by_geo_distance(helper, result):
13 changes: 12 additions & 1 deletion addok/helpers/text.py
@@ -2,14 +2,15 @@
from functools import lru_cache
from pathlib import Path

import editdistance
from ngram import NGram
from unidecode import unidecode

from addok.config import config
from addok.db import DB
from addok.helpers import keys, yielder
from addok.helpers.index import token_frequency

import editdistance

PATTERN = re.compile(r"[\w]+", re.U | re.X)

@@ -161,6 +162,16 @@ def ngrams(text, n=3):
return set([text[i : i + n] for i in range(0, len(text) - (n - 1))])


def compare_ngrams(left, right, N=2, pad_len=0):
    left = ascii(left)
    right = ascii(right)
    if len(left) == 1 and len(right) == 1:
        # NGram.compare returns 0.0 for 1 letter comparison, even if letters
        # are equal.
        return 1.0 if left == right else 0.0
    return NGram.compare(left, right, N=N, pad_len=pad_len)


def compare_str(left, right):
left = ascii(left)
right = ascii(right)
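
The single-character guard in `compare_ngrams` can be illustrated without the `ngram` package. This sketch uses a plain Jaccard ratio over character bigrams as a stand-in; it is not claimed to be the formula `NGram.compare` uses, only a similar measure that shows the same edge case. `bigrams`, `jaccard`, and `compare_with_guard` are hypothetical names:

```python
def bigrams(text):
    """Return the set of 2-character substrings of text, without padding."""
    return {text[i : i + 2] for i in range(len(text) - 1)}


def jaccard(left, right):
    """Shared bigrams divided by all distinct bigrams; 0.0 if either side has none."""
    a, b = bigrams(left), bigrams(right)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def compare_with_guard(left, right):
    """Same measure, but one-letter inputs are compared directly."""
    if len(left) == 1 and len(right) == 1:
        return 1.0 if left == right else 0.0
    return jaccard(left, right)
```

A one-letter string yields no bigrams when no padding is applied, so the unguarded measure scores two identical letters as 0.0; the guard returns 1.0 for that case instead.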
11 changes: 11 additions & 0 deletions addok/http/base.py
@@ -8,6 +8,7 @@

from addok.config import config
from addok.core import reverse, search
from addok.db import DB
from addok.helpers.text import EntityTooLarge

notfound_logger = None
@@ -191,9 +192,19 @@ def on_get(self, req, resp, **kwargs):
self.render(req, resp, results, filters=filters, limit=limit)


class Health(View):
    def on_get(self, req, resp):
        return self.json(
            req,
            resp,
            {"status": "HEALTHY", "redis_version": DB.info().get("redis_version")},
        )


def register_http_endpoint(api):
    api.add_route("/search", Search())
    api.add_route("/reverse", Reverse())
    api.add_route("/health", Health())


def register_command(subparsers):
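
A monitoring client for the new `/health` route might look like the sketch below. It relies only on the two keys the handler returns (`status` and `redis_version`). The base URL is an assumption about where the server listens, and `parse_health` and `check_health` are hypothetical names:

```python
import json
from urllib.request import urlopen


def parse_health(body):
    """Return (is_healthy, redis_version) from a /health JSON body."""
    payload = json.loads(body)
    return payload.get("status") == "HEALTHY", payload.get("redis_version")


def check_health(base_url="http://127.0.0.1:7878", timeout=5):
    """Fetch /health from a running server; base_url is an assumed address."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return parse_health(response.read().decode("utf-8"))
```

Keeping the parsing separate from the HTTP call lets the payload handling be checked without a running server.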
2 changes: 1 addition & 1 deletion docs/api.md
@@ -93,4 +93,4 @@ Parameters:
- every filter that has been declared in the [config](config.md) is available as
parameters

Same response format as the `/search/` enpoint.
Same response format as the `/search/` endpoint.
