New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Austria monuments #64
Conversation
Add an option in the report generator to also include the P-id to which the value should be matched. It's optional so that it can be omitted, e.g. if it's data for a label/description. Example of report: https://tools.wmflabs.org/coh/_total_se-ship_new.json Task: https://phabricator.wikimedia.org/T166648
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a bad address example to the preview?
@@ -64,6 +64,12 @@ def test_last_char_is_vowel_pass(self): | |||
def test_last_char_is_vowel_fail(self): | |||
self.assertFalse(utils.last_char_is_vowel("bar")) | |||
|
|||
def test_first_char_is_number_true(self): | |||
self.assertTrue(utils.first_char_is_number("2 foo")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to 2foo
to make it closer to the failing example and ensure it is not related to "2" being a single word.
Get the longest string(s) in a list. | ||
|
||
:param in_list: list of strings | ||
:return: single string if there's only one with the max length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be easier to always return a list with either zero, one or many entries? That way you could always make iterate over the results (or use len()
).
contains both a simple word ("bro") and its compound ("järnvägsbro"), | ||
we only get the more specific one: | ||
* "götaälvsbron" -> "bro" | ||
* "en stor järnvägsbro" -> "järnvägsbro" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define the parameters
importer/importer.py
Outdated
"monuments_za_(en)": {"class": ZaEn, "data_files": {}}, | ||
"monuments_cm_(fr)": {"class": CmFr, "data_files": {}}, | ||
"monuments_at_(de)": {"class": AtDe, "data_files": {"municipalities": | ||
"austria_municipalities.json", | ||
"types": "at_(de)_types.json"}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The at_(de)_types.json
should be added to the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or changed to use "lookup_downloads"
raw_name = self.name.lower() | ||
types = self.data_files["types"]["mappings"] | ||
keywords = list(types.keys()) | ||
keywords = [x.lower() for x in keywords] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to convert keys() to a list since you can iterate over them anyway and the end result will give a list.
keywords = [x.lower() for x in types.keys()]
""" | ||
types = self.data_files["types"]["mappings"] | ||
type_match = self.get_type_keyword() | ||
if type_match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if get_longest_match
was changed to always return a list the logic could here be simplified to:
if len(type_match) == 1:
do stuff
else:
self.add_to_report("type", self.name, "is")
since you are not making a reporting distinction between none and many matches
importer/AtDe.py
Outdated
""" | ||
type_match = self.get_type_keyword() | ||
municipality_name = self.get_municipality_name() | ||
base_desc_german_english = "{} in {}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe do these as small dicts instead
base_desc = {
"en": "{} in {}",
"de": "{} in {}",
"sv": "{} i {}"
}
place_desc = {
"en": "Austria",
"de": "Österreich",
"sv": "Österike"
}
type_desc = {
"en": "cultural property",
"de": "Denkmalgeschütztes Objekt",
"sv": "kulturarv"
}
That way you should quite easily be able to loop through them in the end.
importer/AtDe.py
Outdated
place_english = "Austria" | ||
place_swedish = "Österike" | ||
type_german = "Denkmalgeschütztes Objekt" | ||
type_english = "cultural property" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do the English and Swedish names come from? Wondering since I would expect the English one to be "cultural heritage property"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: rm english and swedish.
* 76 | ||
Filtering by containing only digits. | ||
""" | ||
bad_address = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be clearer to use bad_address = False
as you don't make use of the later contents of bad_address
?
elif self.adresse.isdigit(): | ||
bad_address = self.adresse | ||
else: | ||
address = self.adresse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be clearer to have this as the else
for if bad_address
? Separating the finding-bad-ones logic from the report-or-add logic.
Also not sure address = self.adresse
is needed since you only use it on the line below.
importer/AtDe.py
Outdated
municip_code = str(self.gemeindekennzahl) | ||
match = [x["item"] | ||
for x in all_codes if x["value"] == municip_code] | ||
if len(match) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it safer to do
if len(match) == 1:
add stuff
else:
report
Base class for processing Austria in German.
Preview: https://www.wikidata.org/wiki/Wikidata:WikiProject_WLM/Mapping_tables/at_(de)/preview
Task: https://phabricator.wikimedia.org/T166771