In [1]:
with open("headlines.txt") as file:
    data = [headline.strip() for headline in file]
    
data[:4]

['Zika Outbreak Hits Miami',
 'Could Zika Reach New York City?',
 'First Case of Zika in Miami Beach',
 'Mystery Virus Spreads in Recife, Brazil']

In [2]:
import geonamescache

gc = geonamescache.GeonamesCache()
countries = [country["name"] for country in gc.get_countries().values()]
countries[:4]

['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda']

In [3]:
cities = [city['name'] for city in gc.get_cities().values()]
cities[:4]

['Andorra la Vella', 'Umm Al Quwain City', 'Ras Al Khaimah City', 'Zayed City']

In [4]:
from collections import Counter

city_counts = Counter(cities)
city_counts.most_common(10)

[('San Fernando', 8),
 ('Springfield', 8),
 ('San Pedro', 7),
 ('Richmond', 7),
 ('Mercedes', 6),
 ('La Paz', 6),
 ('Victoria', 6),
 ('Santa Rosa', 6),
 ('San Juan', 6),
 ('San Francisco', 6)]

## Removing Accent Marks

To match place names regardless of accent marks, we strip the accents from the geonamescache country and city names, and later from the headlines as well. For this we use the `unidecode` library (method from this [Stack Overflow answer](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string)). For each country and city, we build a dictionary that maps the unaccented name back to the original accented name, so the original spelling can be recovered after matching.

In [5]:
import unidecode

country_accent_mapping = {
    unidecode.unidecode(country): country for country in countries
}

city_accent_mapping = {
    unidecode.unidecode(city): city for city in cities
}
city_accent_mapping["Asmar"]

'Āsmār'
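
The idea behind the unaccented-to-accented mapping can also be sketched with the standard library alone. This is a stand-in, not what `unidecode` does internally: it only drops combining marks after Unicode decomposition. The helper name `strip_accents` and the sample names are made up for illustration. The last line shows one caveat of the dictionary approach: two names that reduce to the same key collide, and the later one wins.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Sample names chosen for illustration, not read from geonamescache.
sample_cities = ["Āsmār", "Genève", "São Paulo", "Miami"]

# Same shape as city_accent_mapping: unaccented key -> original name.
sample_mapping = {strip_accents(city): city for city in sample_cities}

# Colliding keys: both names reduce to "Merida", so the later one is kept.
collisions = {strip_accents(city): city for city in ["Mérida", "Merida"]}
```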

In [6]:
data = [unidecode.unidecode(headline) for headline in data]
data[-4:]

['More Zika patients reported in Indang',
 'Suva authorities confirmed the spread of Rotavirus',
 'More Zika patients reported in Bella Vista',
 'Zika Outbreak in Wichita Falls']

## Searching for Cities and Countries

Next, we'll search each headline for any cities and/or countries. To do this, we use regular expressions created from the unaccented cities and countries.

In [7]:
# Create list of cities and countries
unaccented_cities = list(city_accent_mapping.keys())
unaccented_countries = set(country_accent_mapping.keys())

print(f"There are {len(unaccented_cities)} cities to look through.")
print(f"There are {len(unaccented_countries)} countries to look through.")

There are 23151 cities to look through.
There are 252 countries to look through.


In [8]:
import re

problem_city = 'San Jose'
re.search('\\bSan\\b|\\bSan Jose\\b', problem_city)

<re.Match object; span=(0, 3), match='San'>

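Here we see a problem with alternation: the regex engine tries the alternatives from left to right and stops at the first one that matches, so we've matched only `San` instead of the entire city name. To correct this, we reorder the regular expression so that the longer name comes first.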

In [9]:
re.search('\\bSan Jose\\b|\\bSan\\b', problem_city)

<re.Match object; span=(0, 8), match='San Jose'>

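To apply the same ordering to the full lists, we sort the names from longest to shortest, so that a longer name always comes before any shorter name it contains.

In [10]: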
unaccented_cities = sorted(unaccented_cities, key=len, reverse=True)
unaccented_cities[:2]

['Chak Two Hundred Forty-nine Thal Development Authority',
 'Dolores Hidalgo Cuna de la Independencia Nacional']

In [11]:
unaccented_countries = sorted(unaccented_countries, key=len, reverse=True)
unaccented_countries[:2]

['South Georgia and the South Sandwich Islands',
 'United States Minor Outlying Islands']

In [12]:
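# Caveat: join() adds \b only between names (none before the first or after
# the last), and regex metacharacters such as "(" in names are not escaped.
city_regex = r'\b|\b'.join(unaccented_cities)
city_regex[1500:1800]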

'h (Kreis 4) / Aussersihl\\b|\\bZurich (Kreis 10) / Wipkingen\\b|\\bZurich (Kreis 11) / Affoltern\\b|\\bZurich (Kreis 9) / Altstetten\\b|\\bHeppenheim an der Bergstrasse\\b|\\bVilapicina i la Torre Llobeta\\b|\\bSaint-Maximin-la-Sainte-Baume\\b|\\bTamuning-Tumon-Harmon Village\\b|\\bTultitlan de Mariano Escobedo\\b|\\'
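
The joined pattern above leaves the names unescaped and places `\b` only between alternatives. A slightly more defensive way to build the same kind of pattern is sketched below, assuming a small made-up list of names; `build_name_regex` is a hypothetical helper, not part of this notebook. It escapes each name with `re.escape`, sorts the names longest first, and wraps the whole alternation in one non-capturing group between two word boundaries.

```python
import re

def build_name_regex(names):
    """Compile one alternation of literal names, longest first (hypothetical helper)."""
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # A single pair of word boundaries covers every alternative.
    return re.compile(rf"\b(?:{alternation})\b")

# Made-up sample, including a name with regex metacharacters.
sample_regex = build_name_regex(["San", "San Jose", "Zurich (Kreis 4) / Aussersihl"])

match = sample_regex.search("Zika reaches San Jose")
print(match.group(0))  # San Jose
```

Because the boundaries sit outside the group, a headline word such as `Sandy` does not match the shorter name `San`.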

In [13]:
import numpy as np

np.random.seed(50)

test_headlines = np.random.choice(data, 10)

for test_headline in test_headlines:
    print(test_headline)
    match = re.search(city_regex, test_headline)
    if match:
        print(match.group(0), "\n")

More Zika patients reported in Custodia
Custodia 

Tokyo Encounters Severe Symptoms of Meningitis
Tokyo 

Zika Troubles come to Kampong Cham
Kampong Cham 

19 new Zika Cases in Sengkang
Sengkang 

Mumbai's Health Minister warns of more Zika cases
Mumbai 

Varicella re-emerges in Lagos
Lagos 

Mumbai's Health Minister warns of more Zika cases
Mumbai 

Milwaukee authorities confirmed the spread of Rhinovirus
Milwaukee 

Zika cases concern Charlotte residents
Charlotte 

Four cases of Zika in Hidalgo County
Hidalgo 



In [14]:
country_regex = r"\b|\b".join(unaccented_countries)
country_regex[:100]

'South Georgia and the South Sandwich Islands\\b|\\bUnited States Minor Outlying Islands\\b|\\bBonaire, S'

In [15]:
np.random.seed(100)
test_headlines = np.random.choice(data, 10)

for test_headline in test_headlines:
    print(test_headline)
    match = re.search(country_regex, test_headline)
    if match:
        print(match.group(0), "\n")

Longwood volunteers spreading Zika awareness
More Zika cases in Soyapango
Spike of Dengue Cases in Stockholm
Case of Measles Reported in Vancouver
Zika arrives in Belmopan
Outbreak of Zika in Colombo
Zika symptoms spotted in Arlington
Malaria re-emerges in Boise
Southampton Patient in Critical Condition after Contracting Tuberculosis
Manassas Encounters Severe Symptoms of Measles


In [16]:
test_headline = data[3]
print(test_headline)
print(re.search(city_regex, test_headline).group(0))
print(re.search(country_regex, test_headline).group(0))

Mystery Virus Spreads in Recife, Brazil
Recife
Brazil


In [17]:
print(city_accent_mapping["Recife"])
print(country_accent_mapping["Brazil"])

Recife
Brazil


Neither of these names has an accent, so both mappings return the name unchanged.

### City and Country Regular Expression Function

Let's encapsulate the logic to find city and country names into a function.

In [18]:
def find_city_and_country_in_headline(headline):
    """
    Find the first city and the first country mentioned in a text headline.
    
    :param headline: string for headline
    
    :return dict: a dictionary with the headline and the matched country and city (None when there is no match).
    """
    city_match = re.search(city_regex, headline)
    country_match = re.search(country_regex, headline)
    cities = None if not city_match else city_match.group(0)
    countries = None if not country_match else country_match.group(0)
    return dict(headline=headline, countries=countries, cities=cities)

In [19]:
find_city_and_country_in_headline(data[3])

{'headline': 'Mystery Virus Spreads in Recife, Brazil',
 'countries': 'Brazil',
 'cities': 'Recife'}

In [20]:
find_city_and_country_in_headline(data[1])

{'headline': 'Could Zika Reach New York City?',
 'countries': None,
 'cities': 'New York City'}
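
`re.search` returns only the first match, so the function above reports at most one city and one country per headline. If every mention is needed, `re.findall` is one option. The sketch below assumes two tiny made-up patterns in place of `city_regex` and `country_regex`; `find_all_locations` is a hypothetical name.

```python
import re

# Tiny stand-ins for city_regex and country_regex (made up for illustration).
sample_city_regex = re.compile(r"\b(?:Miami Beach|Recife|Miami)\b")
sample_country_regex = re.compile(r"\b(?:Brazil|Fiji)\b")

def find_all_locations(headline):
    """Return every city and country match in a headline (hypothetical variant)."""
    return dict(
        headline=headline,
        countries=sample_country_regex.findall(headline),
        cities=sample_city_regex.findall(headline),
    )

result = find_all_locations("Zika spreads from Recife, Brazil to Miami Beach")
print(result["cities"])     # ['Recife', 'Miami Beach']
print(result["countries"])  # ['Brazil']
```

Note that this variant returns an empty list, not `None`, when nothing is found.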

In [21]:
headline_cities_and_countries = [
    find_city_and_country_in_headline(headline) for headline in data
]
headline_cities_and_countries[-10:]

[{'headline': 'Authorities are Worried about the Spread of Varicella in Clovis',
  'countries': None,
  'cities': 'Clovis'},
 {'headline': 'More Zika patients reported in Fort Worth',
  'countries': None,
  'cities': 'Fort Worth'},
 {'headline': 'Zika symptoms spotted in Boynton Beach',
  'countries': None,
  'cities': 'Boynton Beach'},
 {'headline': 'Outbreak of Zika in Portoviejo',
  'countries': None,
  'cities': 'Portoviejo'},
 {'headline': 'Influenza Exposure in Muscat',
  'countries': None,
  'cities': 'Muscat'},
 {'headline': 'Rumors about Rabies spreading in Jerusalem have been refuted',
  'countries': None,
  'cities': 'Jerusalem'},
 {'headline': 'More Zika patients reported in Indang',
  'countries': None,
  'cities': 'Indang'},
 {'headline': 'Suva authorities confirmed the spread of Rotavirus',
  'countries': None,
  'cities': 'Suva'},
 {'headline': 'More Zika patients reported in Bella Vista',
  'countries': None,
  'cities': 'Bella Vista'},
 {'headline': 'Zika Outbreak in 

In [23]:
import json

save_file = "headline_cities_and_countries.json"
with open(save_file, "w") as fout:
    json.dump(headline_cities_and_countries, fout)

In [24]:
with open(save_file, "r") as fin:
    check_data = json.load(fin)

In [25]:
check_data[-10:]

[{'headline': 'Authorities are Worried about the Spread of Varicella in Clovis',
  'countries': None,
  'cities': 'Clovis'},
 {'headline': 'More Zika patients reported in Fort Worth',
  'countries': None,
  'cities': 'Fort Worth'},
 {'headline': 'Zika symptoms spotted in Boynton Beach',
  'countries': None,
  'cities': 'Boynton Beach'},
 {'headline': 'Outbreak of Zika in Portoviejo',
  'countries': None,
  'cities': 'Portoviejo'},
 {'headline': 'Influenza Exposure in Muscat',
  'countries': None,
  'cities': 'Muscat'},
 {'headline': 'Rumors about Rabies spreading in Jerusalem have been refuted',
  'countries': None,
  'cities': 'Jerusalem'},
 {'headline': 'More Zika patients reported in Indang',
  'countries': None,
  'cities': 'Indang'},
 {'headline': 'Suva authorities confirmed the spread of Rotavirus',
  'countries': None,
  'cities': 'Suva'},
 {'headline': 'More Zika patients reported in Bella Vista',
  'countries': None,
  'cities': 'Bella Vista'},
 {'headline': 'Zika Outbreak in 

In [26]:
check_data[:5]

[{'headline': 'Zika Outbreak Hits Miami',
  'countries': None,
  'cities': 'Miami'},
 {'headline': 'Could Zika Reach New York City?',
  'countries': None,
  'cities': 'New York City'},
 {'headline': 'First Case of Zika in Miami Beach',
  'countries': None,
  'cities': 'Miami Beach'},
 {'headline': 'Mystery Virus Spreads in Recife, Brazil',
  'countries': 'Brazil',
  'cities': 'Recife'},
 {'headline': 'Dallas man comes down with case of Zika',
  'countries': None,
  'cities': 'Dallas'}]

In [28]:
with open("city_accent_mapping.json", "w") as fout:
    json.dump(city_accent_mapping, fout)

In [29]:
with open("country_accent_mapping.json", "w") as fout:
    json.dump(country_accent_mapping, fout)

In [31]:
import pandas as pd

data = pd.read_json("headline_cities_and_countries.json")
data = data.replace({None: np.nan})

data.head(10)

| | headline | countries | cities |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | | Miami |
| 1 | Could Zika Reach New York City? | | New York City |
| 2 | First Case of Zika in Miami Beach | | Miami Beach |
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
| 4 | Dallas man comes down with case of Zika | | Dallas |
| 5 | Trinidad confirms first Zika case | | Trinidad |
| 6 | Zika Concerns are Spreading in Houston | | Houston |
| 7 | Geneve Scientists Battle to Find Cure | | Geneve |
| 8 | The CDC in Atlanta is Growing Worried | | Atlanta |
| 9 | Zika Infested Monkeys in Sao Paulo | | Sao Paulo |


In [39]:
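# Map each city name to its (latitude, longitude). The keys are the accented
# names, and cities that share a name overwrite one another (last one wins).
locations = {}
for value in gc.get_cities().values():
    locations[value["name"]] = (value["latitude"], value["longitude"])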

data['latitude'] = data['cities'].apply(lambda x: locations.get(x, (None, None))[0])
data['longitude'] = data['cities'].apply(lambda x: locations.get(x, (None, None))[1])

In [40]:
data.head(10)

| | headline | countries | cities | latitude | longitude |
|---|---|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | | Miami | 25.77427 | -80.19366 |
| 1 | Could Zika Reach New York City? | | New York City | 40.71427 | -74.00597 |
| 2 | First Case of Zika in Miami Beach | | Miami Beach | 25.79065 | -80.13005 |
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife | -8.05389 | -34.88111 |
| 4 | Dallas man comes down with case of Zika | | Dallas | 44.91928 | -123.31705 |
| 5 | Trinidad confirms first Zika case | | Trinidad | -33.5165 | -56.89957 |
| 6 | Zika Concerns are Spreading in Houston | | Houston | 29.76328 | -95.36327 |
| 7 | Geneve Scientists Battle to Find Cure | | Geneve | | |
| 8 | The CDC in Atlanta is Growing Worried | | Atlanta | 33.749 | -84.38798 |
| 9 | Zika Infested Monkeys in Sao Paulo | | Sao Paulo | | |
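
Two things stand out in this output: `Geneve` and `Sao Paulo` get no coordinates, and `Dallas` resolves to latitude 44.9, far north of Dallas, Texas. One possible remedy, sketched below under assumptions, is to key the lookup by unaccented name and keep the most populous city for each name. The records are hand-written stand-ins that mimic the `name`, `latitude`, `longitude`, and `population` fields assumed to be present in geonamescache city records. The values are approximate and for illustration only, and `build_locations` is a hypothetical helper.

```python
import unicodedata

def strip_accents(text):
    """Stand-in for unidecode.unidecode: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hand-written records with approximate values (not real geonamescache output).
sample_city_records = [
    {"name": "Dallas", "latitude": 32.78, "longitude": -96.81, "population": 1300000},
    {"name": "Dallas", "latitude": 44.92, "longitude": -123.32, "population": 15000},
    {"name": "São Paulo", "latitude": -23.55, "longitude": -46.64, "population": 10000000},
]

def build_locations(city_records):
    """Map unaccented name -> (lat, lon) of the most populous city with that name."""
    best = {}
    for record in city_records:
        key = strip_accents(record["name"])
        if key not in best or record["population"] > best[key]["population"]:
            best[key] = record
    return {key: (rec["latitude"], rec["longitude"]) for key, rec in best.items()}

sample_locations = build_locations(sample_city_records)
print(sample_locations["Sao Paulo"])  # (-23.55, -46.64)
print(sample_locations["Dallas"])     # (32.78, -96.81)
```

Choosing the largest city is only a heuristic; a headline may still refer to a smaller city with the same name.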


In [42]:
# Map each country name (as spelled in geonamescache) to its ISO code.
country_codes = {}
for value in gc.get_countries().values():
    country_codes[value["name"]] = value["iso"]

data['countrycode'] = data['countries'].apply(lambda x: country_codes.get(x, None))

In [43]:
data.head(10)

| | headline | countries | cities | latitude | longitude | countrycode |
|---|---|---|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | | Miami | 25.77427 | -80.19366 | |
| 1 | Could Zika Reach New York City? | | New York City | 40.71427 | -74.00597 | |
| 2 | First Case of Zika in Miami Beach | | Miami Beach | 25.79065 | -80.13005 | |
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife | -8.05389 | -34.88111 | BR |
| 4 | Dallas man comes down with case of Zika | | Dallas | 44.91928 | -123.31705 | |
| 5 | Trinidad confirms first Zika case | | Trinidad | -33.5165 | -56.89957 | |
| 6 | Zika Concerns are Spreading in Houston | | Houston | 29.76328 | -95.36327 | |
| 7 | Geneve Scientists Battle to Find Cure | | Geneve | | | |
| 8 | The CDC in Atlanta is Growing Worried | | Atlanta | 33.749 | -84.38798 | |
| 9 | Zika Infested Monkeys in Sao Paulo | | Sao Paulo | | | |
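
Most rows have no country match, so `countrycode` stays empty even when a city was found. One option is to fall back to the country code stored with the matched city. The sketch below assumes that city records carry a `countrycode` field; the lookup tables and rows are small made-up stand-ins written as plain dictionaries, and `resolve_country_code` is a hypothetical helper.

```python
# Made-up lookup tables standing in for the geonamescache data.
sample_country_codes = {"Brazil": "BR"}
sample_city_country_codes = {"Recife": "BR", "Miami": "US"}  # assumed city -> countrycode

def resolve_country_code(country, city):
    """Prefer the matched country; otherwise fall back to the city's country."""
    if country in sample_country_codes:
        return sample_country_codes[country]
    return sample_city_country_codes.get(city)

sample_rows = [
    {"countries": "Brazil", "cities": "Recife"},
    {"countries": None, "cities": "Miami"},
    {"countries": None, "cities": None},
]
codes = [resolve_country_code(row["countries"], row["cities"]) for row in sample_rows]
print(codes)  # ['BR', 'US', None]
```

The fallback inherits the ambiguity of duplicate city names, so it is only as reliable as the city lookup behind it.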
