In [16]:
import pandas as pd
import geonamescache
import unidecode
import re
gc = geonamescache.GeonamesCache()

Load the dataframe of city/country matches, then remove the headlines with no match and discard the spurious city match 'of'.
This leaves us with 612 matched headlines.

In [17]:
df = pd.read_csv('data/df1.csv', sep=';', header=0, usecols=['headline', 'city', 'country'], index_col=False)

In [18]:
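# Keep a row if it has a real city match (not null and not the spurious 'of') or a country match.
# `&` binds tighter than `|`; the extra parentheses only make the existing grouping explicit.
df2 = df[((df['city'] != 'of') & df['city'].notnull()) | df['country'].notnull()]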

In [19]:
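# Reassign rather than modify in place: df2 is a filtered selection of df,
# and in-place changes on it can trigger pandas' SettingWithCopyWarning.
df2 = df2.reset_index(drop=True)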

After further inspection of the country column, very few headlines have a country match, and those that do also have a city match, so this column is redundant for analysis purposes. Later we will remove it and add three further columns for latitude, longitude and country code, all of which are available in the dictionary returned by gc.get_cities().

Another stumbling block is city names with diacritics.
Passing the unidecode version of such a name to the get_cities_by_name() method will fail, because geonamescache stores the name with its accents.

To get around this, we build a dictionary of all city names with diacritics from gc.get_cities(), keyed by their unidecode version, so that any unidecode name can be mapped back to the original name.
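The idea can be sketched on a tiny hand-made city list. This is only an illustration under stated assumptions: `strip_accents` is a hypothetical stand-in for `unidecode.unidecode()` built on the standard library (it only drops combining accent marks, whereas unidecode transliterates far more characters), and `city_names` replaces the full `gc.get_cities()` data.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical stand-in for unidecode.unidecode(): drop combining accent marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical mini city list; the notebook takes the names from gc.get_cities()
city_names = ['Durrës', 'São Paulo', 'Miami']

# Map each accent-free spelling back to the original spelling
lookup = {}
for original in city_names:
    plain = strip_accents(original)
    if plain != original:  # only names that actually contain diacritics
        lookup[plain] = original

print(lookup)
```

Names without accents, such as Miami, never enter the dictionary, so a failed lookup simply means there is no accented spelling to fall back on.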

In [20]:
diacritics = {}
cities = gc.get_cities()

In [21]:
for city in cities.values():
    original = city['name']
    decode = unidecode.unidecode(original)
    if original != decode:
        diacritics[decode] = original

In [22]:
# to confirm how many diacritics we have found !
#print(diacritics)

In [23]:
print(diacritics['Durres'])

Durrës


In [24]:
geoLocationData = {'latitude': [], 'longitude': [], 'countrycode': []}

Now that we have taken care of the accented names, we are in a good position to gather the geolocation data.

With a new dictionary defined as a placeholder for the city info, a loop is required to go through each city and extract the required data.

The lookup uses gc.get_cities_by_name(), but some city names are common and may appear more than once in the results.

We therefore assume that a headline refers to the most populated city of that name, and pick the match with the greatest population.

To keep the loop clean, the function that picks the most populated match is defined outside it.
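As a rough illustration of the selection step, the sketch below applies `max()` with a population key to hand-made records. The records imitate the shape that `gc.get_cities_by_name()` returns (a list of single-entry dicts mapping a geonames id to a city record), but every id, population and coordinate here is invented, and `record` is a hypothetical helper.

```python
# Invented records in the shape returned by gc.get_cities_by_name():
# a list of single-entry dicts, each mapping a geonames id to a city record.
matches = [
    {'1001': {'name': 'Springfield', 'population': 60000,
              'latitude': 1.0, 'longitude': 2.0, 'countrycode': 'AA'}},
    {'1002': {'name': 'Springfield', 'population': 150000,
              'latitude': 3.0, 'longitude': 4.0, 'countrycode': 'BB'}},
]

def record(match):
    """Hypothetical helper: return the city record inside a single-entry dict."""
    return next(iter(match.values()))

# Pick the match whose record has the largest population
best = record(max(matches, key=lambda m: record(m)['population']))
print(best['countrycode'], best['population'])

# max() raises ValueError on an empty list, so the no-match case needs
# handling; here a default is supplied instead of catching the exception.
no_match = max([], key=lambda m: record(m)['population'], default=None)
print(no_match)
```

The population heuristic is only a guess at what the headline means: a smaller city of the same name will always lose to a larger one.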


In [25]:
def gatherCityInfo(name):
    """Return [latitude, longitude, countrycode] of the most populated city called `name`."""
    try:
        # get_cities_by_name() returns a list of {geonameid: city_record} dicts
        matches = [list(match.values())[0] for match in gc.get_cities_by_name(name)]
        bestCity = max(matches, key=lambda c: c['population'])
        return [bestCity['latitude'], bestCity['longitude'], bestCity['countrycode']]
    except Exception:
        # no match found (or the city name is missing): return placeholder values
        return ["NaN", "NaN", "NaN"]

In [26]:
for city in df2['city']:
    info = gatherCityInfo(city)
    if info == ["NaN", "NaN", "NaN"] and city in diacritics:
        # no match for the plain spelling: retry with the original accented name
        info = gatherCityInfo(diacritics[city])
    # append exactly one entry per row so the lists stay aligned with df2
    geoLocationData['latitude'].append(info[0])
    geoLocationData['longitude'].append(info[1])
    geoLocationData['countrycode'].append(info[2])

Once we have extracted as much city geolocation data as possible, we create a pandas dataframe from the geoLocationData dictionary.

In [27]:
dfLocation = pd.DataFrame(geoLocationData)

In [28]:
df3 = df2.drop(['country'], axis=1)

Joining the two dataframes on their index gives us...

In [29]:
df4 = df3.join(dfLocation)
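Since `DataFrame.join` aligns rows by index label, the earlier index reset matters here. The sketch below, on hypothetical toy frames rather than the notebook's data, shows what happens when a filtered frame with gaps in its index is joined to a frame built from plain lists.

```python
import pandas as pd

# Hypothetical filtered frame: rows 1 and 3 were dropped, leaving gaps in the index
headlines = pd.DataFrame({'city': ['Miami', 'Recife', 'Dallas']}, index=[0, 2, 4])

# A frame built from plain lists gets a fresh 0..n-1 index
locations = pd.DataFrame({'countrycode': ['US', 'BR', 'US']})

# Joining without a reset pairs rows by label, not by position
misaligned = headlines.join(locations)

# Resetting the index first makes labels and positions agree
aligned = headlines.reset_index(drop=True).join(locations)

print(misaligned)
print(aligned)
```

In the misaligned result the row labelled 2 picks up the third location and the row labelled 4 gets no location at all, which is the kind of silent error the reset avoids.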

In [30]:
df4.style

Unnamed: 0,headline,city,latitude,longitude,countrycode
0,Zika Outbreak Hits Miami,Miami,25.77427,-80.19366,US
1,Could Zika Reach New York City?,New York City,40.71427,-74.00597,US
2,First Case of Zika in Miami Beach,Miami Beach,25.79065,-80.13005,US
3,"Mystery Virus Spreads in Recife, Brazil",Recife,-8.05389,-34.88111,BR
4,Dallas man comes down with case of Zika,Dallas,32.78306,-96.80667,US
5,Trinidad confirms first Zika case,Trinidad,-14.83333,-64.9,BO
6,Zika Concerns are Spreading in Houston,Houston,29.76328,-95.36327,US
7,Geneve Scientists Battle to Find Cure,Geneve,46.20222,6.14569,CH
8,The CDC in Atlanta is Growing Worried,Atlanta,33.749,-84.38798,US
9,Zika Infested Monkeys in Sao Paulo,Sao Paulo,-23.5475,-46.63611,BR


In [33]:
df4.to_csv('data/df4.csv', sep=';', index=False)