In [1]:
import pandas as pd
import geonamescache
import unidecode
import re
gc = geonamescache.GeonamesCache()

Load the dataframe of city/country matches, then drop any headline with no match at all and any row whose matched city is 'of'.
As the output below shows, this leaves 612 matched headlines.

In [23]:
df = pd.read_csv('data/df1.csv', sep=';', header=0, usecols=['headline', 'city', 'country'], index_col=False)

In [24]:
# & binds tighter than |, so the grouping is written out explicitly
df2 = df[((df['city'] != 'of') & df['city'].notnull()) | df['country'].notnull()]

In [29]:
# reassign rather than modify in place, since df2 is a filtered selection of df
df2 = df2.reset_index(drop=True)
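
The effect of the filter above is easier to see on a few hand-written rows. The sketch below uses a toy dataframe (not rows from df1.csv) and the hypothetical names `valid_city`, `has_country`, `kept` and `kept_strict`. It shows that a row whose city is 'of' still survives when it has a country match, and what a stricter city-only filter would keep instead.

```python
import pandas as pd

# Toy rows (not from the real dataset) covering each case the filter must handle.
toy = pd.DataFrame({
    "headline": ["h0", "h1", "h2", "h3"],
    "city": ["Miami", "of", None, "of"],
    "country": [None, None, None, "Brazil"],
})

valid_city = toy["city"].notnull() & (toy["city"] != "of")
has_country = toy["country"].notnull()

# Same logic as the notebook filter: keep a row if either match is usable.
kept = toy[valid_city | has_country].reset_index(drop=True)

# Stricter variant (an assumption, not the notebook's choice): require a usable city.
kept_strict = toy[valid_city].reset_index(drop=True)

print(kept["headline"].tolist())
print(kept_strict["headline"].tolist())
```

Which variant is appropriate depends on whether country-only matches are wanted downstream. Since the country column is dropped later, the stricter form may be the closer fit.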

In [30]:
df2.style

Unnamed: 0,headline,city,country
0,Zika Outbreak Hits Miami,Miami,
1,Could Zika Reach New York City?,New York City,
2,First Case of Zika in Miami Beach,Miami Beach,
3,"Mystery Virus Spreads in Recife, Brazil",Recife,Brazil
4,Dallas man comes down with case of Zika,Dallas,
5,Trinidad confirms first Zika case,Trinidad,
6,Zika Concerns are Spreading in Houston,Houston,
7,Geneve Scientists Battle to Find Cure,Geneve,
8,The CDC in Atlanta is Growing Worried,Atlanta,
9,Zika Infested Monkeys in Sao Paulo,Sao Paulo,


On closer inspection, very few rows have a country match, and those that do also have a city match, so the country column is redundant for analysis purposes. Later we will drop it and add three columns for latitude, longitude and country code, all of which are available in the dictionary returned by gc.get_cities().

Another stumbling block is city names with diacritics.
Passing the unidecode (ASCII) version of such a name to the get_cities_by_name() method returns no match, because the cache stores the accented form.

To get around this, we build a dictionary from gc.get_cities() that maps each unidecode name back to its original accented name.

In [6]:
diacritics = {}
cities = gc.get_cities()

In [7]:
for city in cities:
    original = cities[city]['name'] 
    decode = unidecode.unidecode(cities[city]['name'])
    if original != decode:
        diacritics[decode]=original

In [8]:
# inspect the unidecode -> original mappings we have found
print(diacritics)

{'Khawr Fakkan': 'Khawr Fakkān', 'Shindand': 'Shīnḏanḏ', 'Shibirghan': 'Shibirghān', 'Sang-e Charak': 'Sang-e Chārak', 'Aibak': 'Aībak', 'Rustaq': 'Rustāq', 'Qarqin': 'Qarqīn', 'Qarawul': 'Qarāwul', 'Pul-e Khumri': 'Pul-e Khumrī', 'Paghman': 'Paghmān', 'Nahrin': 'Nahrīn', 'Mehtar Lam': 'Mehtar Lām', 'Mazar-e Sharif': 'Mazār-e Sharīf', 'Lashkar Gah': 'Lashkar Gāh', 'Khost': 'Khōst', 'Khash': 'Khāsh', 'Kandahar': 'Kandahār', 'Jalalabad': 'Jalālābād', 'Herat': 'Herāt', 'Bamyan': 'Bāmyān', 'Baghlan': 'Baghlān', 'Art Khwajah': 'Ārt Khwājah', 'Asmar': 'Āsmār', 'Asadabad': 'Asadābād', 'Andkhoy': 'Andkhōy', 'Bazarak': 'Bāzārak', 'Markaz-e Woluswali-ye Achin': 'Markaz-e Woluswalī-ye Āchīn', "Saint John's": 'Saint John’s', 'Sarande': 'Sarandë', 'Kukes': 'Kukës', 'Korce': 'Korçë', 'Gjirokaster': 'Gjirokastër', 'Vlore': 'Vlorë', 'Shkoder': 'Shkodër', 'Lushnje': 'Lushnjë', 'Lezhe': 'Lezhë', 'Lac': 'Laç', 'Kucove': 'Kuçovë', 'Kruje': 'Krujë', 'Kavaje': 'Kavajë', 'Fier-Cifci': 'Fier-Çifçi', 'Durres':

In [9]:
print(diacritics['Durres'])

Durrës
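
The reverse lookup can also be tried out without geonamescache or unidecode. The sketch below is a stand-in, not the notebook's method: it folds names to ASCII with the standard library's `unicodedata` (the helper `ascii_fold` and the short `names` list are hypothetical) and builds the same kind of folded-to-original mapping. Characters with no decomposition, such as the curly apostrophe in 'Saint John’s', pass through unchanged here, whereas unidecode transliterates them.

```python
import unicodedata


def ascii_fold(name):
    """Drop combining marks after NFKD decomposition (stdlib stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hand-picked names that appear in this notebook's output, plus one plain-ASCII name.
names = ["Genève", "São Paulo", "Durrës", "Miami"]

reverse_lookup = {}
for original in names:
    folded = ascii_fold(original)
    if folded != original:
        # setdefault keeps the first original seen if two names fold to the same key
        reverse_lookup.setdefault(folded, original)

print(reverse_lookup)
```

One design point worth noting: with plain assignment, as in the loop above, a later city overwrites an earlier one when both fold to the same ASCII name, so the mapping holds only one original per key either way.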


In [10]:
geoLocationData = {'latitude': [], 'longitude': [], 'countrycode': []}

Now that we have taken care of the accented names, we are in a good position to gather the geolocation data.

With a placeholder dictionary defined for each piece of city information, a loop is required to go through each city and extract the required data.

Lookups use gc.get_cities_by_name(), but some city names are common and may appear more than once in its results.

We therefore assume that a headline refers to the most populous city of that name and use the match with the largest population.

To keep the loop clean, the function that picks the largest match is defined outside it.


In [11]:
def gatherCityInfo(name):
    """Return [latitude, longitude, countrycode] for the most populous city called `name`."""
    try:
        # each match is a single-entry dict wrapping the city record; unwrap it once
        matches = [list(match.values())[0] for match in gc.get_cities_by_name(name)]
        bestCity = max(matches, key=lambda city: city['population'])
        return [bestCity['latitude'], bestCity['longitude'], bestCity['countrycode']]
    except Exception:
        # no match (max() of an empty list) or an unusable name such as a missing value
        return ["NaN", "NaN", "NaN"]
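
To make the selection step concrete, here is a minimal sketch on hand-written records shaped like the matches the function above indexes into (a list of single-entry dicts). The ids, coordinates and populations are invented placeholders, and `pick_largest` is a hypothetical helper rather than part of geonamescache.

```python
def pick_largest(candidates):
    """Return the record with the highest population, or None if there are no candidates."""
    records = [next(iter(entry.values())) for entry in candidates]
    return max(records, key=lambda rec: rec["population"], default=None)


# Placeholder records: ids, coordinates and populations are invented for illustration.
candidates = [
    {"1001": {"name": "Exampleville", "latitude": 10.0, "longitude": 20.0,
              "countrycode": "AA", "population": 12000}},
    {"1002": {"name": "Exampleville", "latitude": -30.0, "longitude": 40.0,
              "countrycode": "BB", "population": 450000}},
]

best = pick_largest(candidates)
print(best["countrycode"], best["population"])
```

Passing `default=None` to `max()` handles the no-match case without a try/except. The population heuristic itself is still only an assumption about which city a headline means.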

In [12]:
for n, city in enumerate(df2['city']):
    info = gatherCityInfo(city)
    if info == ["NaN", "NaN", "NaN"]:
        if city in diacritics:
            # retry the lookup with the original accented name
            info = gatherCityInfo(diacritics[city])
            print(diacritics[city], info)
        # otherwise keep the NaN placeholders so rows stay aligned with df2
    else:
        print(n, city, info)
    geoLocationData['latitude'].append(info[0])
    geoLocationData['longitude'].append(info[1])
    geoLocationData['countrycode'].append(info[2])

0 Miami [25.77427, -80.19366, 'US']
1 New York City [40.71427, -74.00597, 'US']
2 Miami Beach [25.79065, -80.13005, 'US']
3 Recife [-8.05389, -34.88111, 'BR']
4 Dallas [32.78306, -96.80667, 'US']
5 Trinidad [-14.83333, -64.9, 'BO']
6 Houston [29.76328, -95.36327, 'US']
Genève [46.20222, 6.14569, 'CH']
8 Atlanta [33.749, -84.38798, 'US']
São Paulo [-23.5475, -46.63611, 'BR']
10 Brownsville [25.90175, -97.49748, 'US']
11 St. Louis [38.62727, -90.19789, 'US']
12 San Juan [-31.5375, -68.53639, 'AR']
13 Galveston [29.30135, -94.7977, 'US']
14 Manila [14.6042, 120.9822, 'PH']
15 Iloilo [10.69694, 122.56444, 'PH']
16 Los Angeles [34.05223, -118.24368, 'US']
17 Orlando [28.53834, -81.37924, 'US']
18 Chicago [41.85003, -87.65005, 'US']
19 Tampa [27.94752, -82.45843, 'US']
20 Flint [43.01253, -83.68746, 'US']
21 Baltimore [39.29038, -76.61219, 'US']
22 London [51.50853, -0.12574, 'GB']
23 Ho Chi Minh City [10.82302, 106.62965, 'VN']
24 Philadelphia [39.95233, -75.16379, 'US']
25 Boston [42.35843

214 Farmington [36.72806, -108.21869, 'US']
215 Union [40.6976, -74.2632, 'US']
216 Albany [42.65258, -73.75623, 'US']
217 Bello [6.33732, -75.55795, 'CO']
218 Hamburg [53.57532, 10.01534, 'DE']
219 Madera [36.96134, -120.06072, 'US']
220 Lubbock [33.57786, -101.85517, 'US']
221 Boise [43.6135, -116.20345, 'US']
222 Lagos [6.45407, 3.39467, 'NG']
223 Ibadan [7.37756, 3.90591, 'NG']
224 Birmingham [52.48142, -1.89983, 'GB']
225 Waldorf [38.62456, -76.93914, 'US']
226 McLean [38.93428, -77.17748, 'US']
227 Newark [40.73566, -74.17237, 'US']
228 Sparks [39.53491, -119.75269, 'US']
229 Berlin [52.52437, 13.41053, 'DE']
230 Ardmore [34.17426, -97.14363, 'US']
231 Florida [21.52536, -78.22579, 'CU']
232 Fontainebleau [48.40908, 2.70177, 'FR']
233 Frisco [33.15067, -96.82361, 'US']
234 Dubai [25.07725, 55.30927, 'AE']
235 Benton [34.56454, -92.58683, 'US']
236 Calgary [51.05011, -114.08529, 'CA']
237 Pinewood [25.86898, -80.21699, 'US']
238 Ljubljana [46.05108, 14.50513, 'SI']
239 Tehran [35.

466 Yakima [46.60207, -120.5059, 'US']
467 Luanda [-8.83682, 13.23432, 'AO']
468 Dumai [1.66711, 101.44316, 'ID']
469 Redmond [47.67399, -122.12151, 'US']
470 Concord [37.97798, -122.03107, 'US']
471 Rockland [42.13066, -70.91616, 'US']
472 Mankato [44.15906, -94.00915, 'US']
473 Toms River [39.95373, -74.19792, 'US']
474 Zanzibar [-6.16394, 39.19793, 'TZ']
475 Zanzibar [-6.16394, 39.19793, 'TZ']
476 Arusha [-3.36667, 36.68333, 'TZ']
477 New Kingston [18.00747, -76.78319, 'JM']
478 Yokohama [35.43333, 139.65, 'JP']
479 Kitwe [-12.80243, 28.21323, 'ZM']
480 Bismarck [46.80833, -100.78374, 'US']
481 Minot [48.23251, -101.29627, 'US']
482 Terrebonne [45.70004, -73.64732, 'CA']
483 North Vancouver [49.31636, -123.06934, 'CA']
484 Hemet [33.74761, -116.97307, 'US']
485 Darien [41.75198, -87.97395, 'US']
486 Fairfield [38.24936, -122.03997, 'US']
487 Princeton [40.34872, -74.65905, 'US']
488 Copenhagen [55.67594, 12.56553, 'DK']
489 Wuhan [30.58333, 114.26667, 'CN']
San Luis Potosí [22.14982

Once we have extracted geolocation data for as many cities as possible, we create a pandas dataframe from the geoLocationData dictionary.

In [18]:
dfLocation = pd.DataFrame(geoLocationData)

In [31]:
dfLocation.style

Unnamed: 0,latitude,longitude,countrycode
0,25.77427,-80.19366,US
1,40.71427,-74.00597,US
2,25.79065,-80.13005,US
3,-8.05389,-34.88111,BR
4,32.78306,-96.80667,US
5,-14.83333,-64.9,BO
6,29.76328,-95.36327,US
7,46.20222,6.14569,CH
8,33.749,-84.38798,US
9,-23.5475,-46.63611,BR
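
Because unmatched cities are stored as the string "NaN", the coordinate columns can end up as text rather than numbers. The sketch below, on a two-row toy dictionary (the `toy` and `frame` names are hypothetical), shows one way to turn those placeholders into real missing values and count the unmatched rows.

```python
import pandas as pd

# Toy dictionary shaped like geoLocationData: one matched city, one unmatched.
toy = {
    "latitude": [25.77427, "NaN"],
    "longitude": [-80.19366, "NaN"],
    "countrycode": ["US", "NaN"],
}
frame = pd.DataFrame(toy)

# Coerce coordinates to floats; anything unparseable becomes a real NaN.
for col in ("latitude", "longitude"):
    frame[col] = pd.to_numeric(frame[col], errors="coerce")

# Blank out the placeholder string in the country code column.
frame["countrycode"] = frame["countrycode"].mask(frame["countrycode"] == "NaN")

unmatched = int(frame["latitude"].isna().sum())
print(unmatched)
```

With real missing values in place, `isna()` and `dropna()` behave as expected on the location columns.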


In [32]:
df3 = df2.drop(['country'], axis=1)

Joining the two dataframes on their shared index gives us...

In [33]:
df3.join(dfLocation)

Unnamed: 0,headline,city,latitude,longitude,countrycode
0,Zika Outbreak Hits Miami,Miami,25.7743,-80.1937,US
1,Could Zika Reach New York City?,New York City,40.7143,-74.006,US
2,First Case of Zika in Miami Beach,Miami Beach,25.7906,-80.13,US
3,"Mystery Virus Spreads in Recife, Brazil",Recife,-8.05389,-34.8811,BR
4,Dallas man comes down with case of Zika,Dallas,32.7831,-96.8067,US
...,...,...,...,...,...
608,Rumors about Rabies spreading in Jerusalem hav...,Jerusalem,31.769,35.2163,IL
609,More Zika patients reported in Indang,Indang,14.1953,120.877,PH
610,Suva authorities confirmed the spread of Rotav...,Suva,-18.1416,178.441,FJ
611,More Zika patients reported in Bella Vista,Bella Vista,18.4554,-69.9454,DO
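
The join above aligns rows by index label, which is why the earlier index reset matters. The following sketch uses two rows copied from the output above and deliberately leaves a gap in the left index to show the misalignment that would otherwise occur. The names `headlines`, `locations`, `misaligned` and `aligned` are hypothetical.

```python
import pandas as pd

headlines = pd.DataFrame(
    {"headline": ["Zika Outbreak Hits Miami", "Could Zika Reach New York City?"],
     "city": ["Miami", "New York City"]},
    index=[0, 5],  # gaps like this are what filtering leaves behind
)
locations = pd.DataFrame(
    {"latitude": [25.77427, 40.71427],
     "longitude": [-80.19366, -74.00597],
     "countrycode": ["US", "US"]}
)

# Index label 5 has no partner in locations, so its coordinates come back missing.
misaligned = headlines.join(locations)

# After resetting the index, both frames share labels 0..n-1 and line up row by row.
aligned = headlines.reset_index(drop=True).join(locations)
print(aligned)
```

Note that `join` returns a new dataframe, so the result needs to be assigned to a name (as with `aligned` here) if it is to be reused later.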
