Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ability to recognize more cities #4

Merged
merged 1 commit into from
Mar 12, 2017

Conversation

joseluizcoe
Copy link
Contributor

Problem: Brazilian cities like "São Paulo", "Carapicuíba" and other was not passing on old "city_regex".

Example using last version:

>>> text = "São Paulo é a capital do estado de São Paulo. As cidades de Baru
erí e Carapicuíba fazem parte da Grade São Paulo. O Rio de Janeiro continua
lindo. No carnaval eu vou para Salvador. No reveillon eu quero ir para Santo
s."
>>> old_result = GeoText(text)
>>> old_result.cities
['Salvador', 'Santos']
>>>

Considering the same text, using this "new" regex we have:

>>> result.cities
['S\xc3\xa3o Paulo', 'S\xc3\xa3o Paulo', 'As', 'Baruer\xc3\xad', 'Carapicu\x
c3\xadba', 'Grade S\xc3\xa3o Paulo', 'O Rio', 'Janeiro', 'No', 'Salvador', '
No', 'Santos']

I think this change can be useful to recognize other cities in the world.

Problem: Brazilian cities like "São Paulo", "Carapicuíba" and other was not passing on old "city_regex".

Example using last version:

```
>>> text = "São Paulo é a capital do estado de São Paulo. As cidades de Baru
erí e Carapicuíba fazem parte da Grade São Paulo. O Rio de Janeiro continua
lindo. No carnaval eu vou para Salvador. No reveillon eu quero ir para Santo
s."
>>> old_result = GeoText(text)
>>> old_result.cities
['Salvador', 'Santos']
>>>
```

Considering the same text, using this "new" regex we have:

```
>>> result.cities
['S\xc3\xa3o Paulo', 'S\xc3\xa3o Paulo', 'As', 'Baruer\xc3\xad', 'Carapicu\x
c3\xadba', 'Grade S\xc3\xa3o Paulo', 'O Rio', 'Janeiro', 'No', 'Salvador', '
No', 'Santos']
```

I think this change can be useful to recognize other cities in the world.
@elyase
Copy link
Owner

elyase commented Mar 9, 2017

Hi José, muito obrigado, :-) Looks great, I will add some tests and if everything passes will merge right away. Thanks again

@caiofralmeida
Copy link

caiofralmeida commented Mar 12, 2017

👍

@elyase elyase merged commit 69ab3bc into elyase:master Mar 12, 2017
@elyase
Copy link
Owner

elyase commented Mar 12, 2017

Added to the 0.3.0 release in pypi. Thanks a lot for making geotext better.

@@ -103,7 +103,7 @@ class GeoText(object):
index = build_index()

def __init__(self, text):
city_regex = r"[A-Z]+[a-z]*(?:[ '-][A-Z]+[a-z]*)*"
city_regex = r"[A-Z]+[a-zà-ú]*(?:[ '-][A-Z]+[a-zà-ú]*)*"
Copy link

@felipecwb felipecwb Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joseluizcoe I see that it cannot recognize cities with the first accented char like "Água Clara" or "Érico Cardoso"
Maybe [A-Z\À-\Ú] for these first chars can deal with it

@joseluizcoe joseluizcoe deleted the brazillian-cities-regex-patch branch March 14, 2017 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants