## Finding place names in headlines
First, a few caveats: I'm not an experienced Python coder, so the code you'll find here is far from what is considered "good Python".

Caveats aside, the code takes what could be described as a rather brute-force approach to finding places in the headlines. I can't say that it is performant either, but I don't think the reference **liveProject** requires that.

The way this works is:
1. First, it builds a list of words found in the headlines. It does some cleaning by removing a few short words that were creating a lot of noise later on.
2. The words found are turned into patterns and concatenated into one giant regular expression.
3. The geonamescache module is used to build a list of places. This list is built in such a way that if a word extracted from the headlines appears in a particular place name, the whole place name is matched.
4. Then the headlines are searched again, line by line, for the places that were found. This makes sure every matched place is checked against every headline.
5. A final list of headlines, places and countries is built.
6. A pandas DataFrame is built from the locations found.
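
As a rough illustration of steps 1 to 4, here is a minimal sketch on toy data. The three headlines and the short `places` list are made up for the example (the notebook itself uses `headlines.txt` and geonamescache), and `find_places` is a hypothetical helper name, not something the notebook defines.

```python
import re

# Toy inputs (assumptions, for illustration only)
headlines = [
    "Zika Outbreak Hits Miami",
    "Could Zika Reach New York City?",
    "Mystery Virus Spreads in Recife, Brazil",
]
places = ["Miami", "Miami Beach", "York", "New York", "New York City",
          "Recife", "Brazil", "Lima"]


def find_places(headlines, places):
    # Steps 1-2: collect words from the headlines and build one big regex
    words = set(re.findall(r"[A-Z][a-zA-Z]+|[a-zA-Z]{3,}", " ".join(headlines)))
    pattern = "|".join(
        "[^#]*" + re.escape(w.capitalize()) + "[^#]*" for w in sorted(words)
    )

    # Step 3: join the place names into '#'-delimited text and keep every
    # full name that contains one of the headline words
    place_text = "".join(" #" + p + "#" for p in places)
    candidates = set(re.findall(pattern, place_text))

    # Step 4: search each headline for the candidate names, whole words only
    result = {}
    for line in headlines:
        low = line.lower()
        result[line] = sorted(
            c for c in candidates
            if re.search(r"\b" + re.escape(c.lower()) + r"\b", low)
        )
    return result


for line, found in find_places(headlines, places).items():
    print(line, "->", found)
```

Note that 'Lima' never becomes a candidate because no headline word appears in it, while the New York headline yields three overlapping names. Choosing among those is what the later cells deal with.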

In [1]:
# Parts of this notebook are executed (or not) based on this flag
debug = False

## Imports
from unidecode import unidecode # To remove accents and the like from texts
import re                       # Regular expression library
import geonamescache            # For a database of places on earth
from hashlib import md5         # Used to generate unique dictionary keys for the lines
import numpy as np              # For the NaN placeholder value
import pandas                   # The final result needs to be in this format

In [2]:
# Open the file with the data
data_file_path = '../data/headlines.txt'
# Read the whole file into a single string; it is split into lines later on
with open(data_file_path) as headlines_file:
    allLines = headlines_file.read()

In [3]:
# Transliterate accented characters in the text to plain ASCII
lines = unidecode(allLines)
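
To show what "removing accents" means without needing the package, here is a rough standard-library approximation. This is not how `unidecode` works internally and it covers fewer characters (it only drops combining marks); `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text):
    # Decompose characters such as 'ã' into 'a' plus a combining mark,
    # then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("São Paulo"))
```

The point of this step is that both the headlines and the place names end up in the same plain form, so a simple string comparison has a chance of matching.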

In [4]:
# Find all words that start with a capital letter (2+ letters), _or_ any word with 3 or more letters
wordsFound = re.findall(r"[A-Z][a-zA-Z]+|[a-zA-Z]{3,}", lines)

# Hand-made list of short words that just add noise, picked by inspecting the short matches
removeTheseWords = ['san', 'hit', 'can', 'man']

# Build the list of probable place-name patterns, prepared for the regex search
wordsFound = [('[^#]*' + word.capitalize() + '[^#]*') for word in wordsFound if word.lower() not in removeTheseWords]
wordsFound = list(set(wordsFound))
print("Here's a short list of the auto-generated regular expressions built from words in the headlines:")
'|'.join(wordsFound[:5])

Here's a short list of the auto-generated regular expressions built from words in the headlines:


'[^#]*Exposure[^#]*|[^#]*Tests[^#]*|[^#]*Cholera[^#]*|[^#]*Chi[^#]*|[^#]*Rockville[^#]*'
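
To make the extraction step concrete, here is a small sketch that applies the same regular expression and noise-word list to a single made-up headline. The `headline` string and the `stop_words` name are illustrative only.

```python
import re

headline = "Zika hit San Juan and can spread to NY"  # made-up example
stop_words = {"san", "hit", "can", "man"}

# Capitalized words of 2+ letters, or any word of 3+ letters
words = re.findall(r"[A-Z][a-zA-Z]+|[a-zA-Z]{3,}", headline)

# Drop the noise words, normalize the case and remove duplicates
kept = sorted({w.capitalize() for w in words if w.lower() not in stop_words})
patterns = ["[^#]*" + w + "[^#]*" for w in kept]

print(kept)
print("|".join(patterns))
```

Two side effects are worth noticing: `capitalize()` turns 'NY' into 'Ny', and ordinary words such as 'and' survive the filter, which is part of why the approach is brute force.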

#### A word of caution about geonamescache

I think there are two things to keep in mind about this module.

First, I couldn't find any real reference documentation for it on the internet.
While looking for it I stumbled on a page that uses it in an example. Using that as a guide, I resorted to `dir(geonamescache.GeonamesCache)` to find out, by trial and error, which methods were available.

Second, this module is (or at least seems to be) heavily biased towards US names, and judging by the matches in the headlines, the same seems to apply to the headline list.
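
The `dir()` exploration can be made a bit more readable by filtering out private names and printing the first line of each docstring. The sketch below uses a hypothetical helper, `public_methods`, and runs it on a stand-in class (`Demo`) rather than on `geonamescache.GeonamesCache`, so it works without the package installed.

```python
import inspect


def public_methods(cls):
    """Return {name: first docstring line} for the public callables of cls."""
    summary = {}
    for name, member in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue  # skip private and dunder names
        doc = inspect.getdoc(member) or ""
        summary[name] = doc.splitlines()[0] if doc else ""
    return summary


class Demo:  # stand-in for the class being explored
    def get_cities(self):
        """Return a dict of cities."""

    def _private(self):
        pass


print(public_methods(Demo))
```

In the notebook, the equivalent call would be `public_methods(geonamescache.GeonamesCache)`.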

In [5]:
gc = geonamescache.GeonamesCache()
# Get the list of places from GeonamesCache

# I couldn't find a real reference for the geonamescache library. I used the page below as the main reference:
#
# https://galeascience.wordpress.com/2016/03/23/us-city-to-state-python-dictionary/
#
# Then I used dir(geonamescache.GeonamesCache) to find out which methods the library had available

geodata = {}
# Build a dictionary of places from geonamescache grouped by place type
geodata['cities'] = [city['name'] for city in list(gc.get_cities().values())]
geodata['counties'] = [county['name'] for county in list(gc.get_us_counties())]
geodata['states'] = [state['name'] for state in list(gc.get_us_states().values())]
geodata['countries'] = [country['name'] for country in list(gc.get_countries().values())]

# This list will be used to detect country names in headlines
countries = geodata['countries']
allLocations = ""
example = ''
for geotype in geodata:
    example = example + geotype + ': ' + unidecode(' #' + '# #'.join(geodata[geotype][:5])) + "\n"
    geodata[geotype] = unidecode(' #' + '# #'.join(geodata[geotype]))
    allLocations = allLocations + geodata[geotype]

countries = [country.lower() for country in countries]
print("Countries get 'normalized' by making names all lowercase:\n\t", countries[:5])
#countries[:5]
print("This is an example of how the text is generated from places so they can be extracted later:\n")
print(example)

Countries get 'normalized' by making names all lowercase:
	 ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla']
This is an example of how the text is generated from places so they can be extracted later:

cities:  #Andorra la Vella# #Umm Al Quwain City# #Ras Al Khaimah City# #Zayed City# #Khawr Fakkan
counties:  #Baldwin County# #Barbour County# #Bibb County# #Blount County# #Bullock County
states:  #Alaska# #Alabama# #Arkansas# #Arizona# #California
countries:  #Andorra# #United Arab Emirates# #Afghanistan# #Antigua and Barbuda# #Anguilla



In [6]:
# Find all words extracted from the headlines in the text of concatenated names
# The way this works, a word will match a full name:
# E.g. if the word 'York' exists in the headlines word list, it will match all of 'York', 'New York' and
# 'New York City' from the list of places
matches = re.findall('|'.join(wordsFound), allLocations)
print('Found ', len(matches), 'word matches in the list of places')

Found  8134 word matches in the list of places
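
Here is a small sketch of how a single word pattern pulls whole names out of the '#'-delimited text. The `locations` string is toy data in the same spirit as `allLocations`, not the real geonamescache list.

```python
import re

# '#'-delimited place text (toy data)
locations = " #York# #New York# #New York City# #Yorkshire# #Boston#"

# '[^#]*' expands the match in both directions until it hits a '#' delimiter
pattern = "[^#]*York[^#]*"

matches = re.findall(pattern, locations)
print(matches)
```

Because the word is matched as a plain substring, 'Yorkshire' is returned as well. That over-matching is harmless at this stage, since every candidate is checked against the headlines again with word boundaries.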


In [7]:
lows = allLines.lower()
found = []
for place in matches:
    placeLow = place.lower()
    if placeLow in lows:
        found.append((place, placeLow))
allMyLines = allLines.split("\n")
linesDict = {}
# Dictionary of regular expressions
regexpDict = {}
for line in allMyLines:
    if not line:
        continue
    # Generate an MD5 digest from the line. This is kind of a pet habit of mine; it also ensures there are
    # no duplicated headlines
    emmo = md5(line.lower().strip().encode('utf-8')).hexdigest()[:8]
    #Brute force, come to my aid.
    lineDict = {}
    lineDict['line'] = line
    lineDict['place'] = ''
    lineDict['country'] = np.nan
    lineDict['places'] = []
    lowLine = line.lower()
    # Build list of headlines, places and countries.
    for place, lowPlace in found:
        if lowPlace not in regexpDict:
            # Escape the name so characters such as '.' or '(' are matched literally
            regexpDict[lowPlace] = re.compile(r"\b" + re.escape(lowPlace) + r"\b")
        # Search for the place in the headline. If found, add it to the list of places found in the line
        if regexpDict[lowPlace].search(lowLine):
            # Place is a country?
            if place.lower() in countries:
                lineDict['country'] = place
            else:
                lineDict['places'].append(place)
    # If we have no places in the headlines except the country, use the country as place
    if lineDict['country'] is not np.nan and len(lineDict['places']) < 1:
        lineDict['place'] = lineDict['country']
        lineDict['places'] = [lineDict['country']]
        # For debugging purposes: Only 2 occurrences
        print("Found no place in", line, "country: (", lineDict['country'], ") place: (", lineDict['places'], ")")
        print("Defaulting place name to country")
    else:
        # Remove duplicated words from list of places if they exist
        lineDict['places'] = list(set(lineDict['places'])) # Did I say brute force? Brute force
        # Give precedence to the longest name: 'New York City' wins over both
        # 'New York' and 'York'. A single-element list yields that element,
        # and an empty list leaves the place blank
        lineDict['place'] = max(lineDict['places'], key=len, default='')
    if emmo not in linesDict: #We don't expect repeated lines, do we? Add headline to list
        linesDict[emmo]=lineDict

Found no place in Norovirus Exposure in Hong Kong country: ( Hong Kong ) place: ( ['Hong Kong'] )
Defaulting place name to country
Found no place in Zika cases in Singapore reach 393 country: ( Singapore ) place: ( ['Singapore'] )
Defaulting place name to country
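
The selection rules used in the cell above (whole-word match, longest name wins, country as a fallback) can be sketched as one small function. `pick_place` is a hypothetical name, the inputs are toy data, and `None` is used instead of `np.nan` to keep the example free of dependencies.

```python
import re


def pick_place(headline, candidates, countries):
    """Return (place, country) for one headline; country is None if absent."""
    low = headline.lower()
    places, country = [], None
    for name in candidates:
        # \b keeps 'york' from matching inside 'yorkshire'
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", low):
            if name.lower() in countries:
                country = name
            else:
                places.append(name)
    if not places and country is not None:
        return country, country  # no other place found: use the country name
    # The longest name wins; an empty list gives an empty place
    return max(places, key=len, default=""), country


candidates = ["York", "New York", "New York City", "Recife", "Brazil", "Singapore"]
countries = {"brazil", "singapore"}
print(pick_place("Could Zika Reach New York City?", candidates, countries))
print(pick_place("Zika cases in Singapore reach 393", candidates, countries))
```

The second call shows the same fallback that produced the two "Defaulting place name to country" messages above.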


In [8]:
# This cell was used to look at the data and spot patterns, in order to know which place to use
if debug:
    counter = 0
    subcounter = 0
    for place in linesDict:
        lineData = linesDict[place]
        foundItems = len(lineData['places'])
        if foundItems < 1:
            counter = counter + 1
            print(counter, lineData['line'])
        if foundItems > 1:
            subcounter = subcounter + 1
            print(subcounter, lineData['line'], lineData['places'])

In [9]:
#Generate list of short words to see which to strip out
if debug:
    allPlaces = []
    allSmallPlaces = []
    for place in linesDict:
        #print(linesDict[place])
        allPlaces = allPlaces + linesDict[place]['places']
    allPlaces = set(allPlaces)
    for place in allPlaces:
        if len(place) < 5:
            allSmallPlaces.append(place)
    print("This is the list of short words I found:\n", ", ".join(allSmallPlaces))
    #removeTheseWords = ['san', 'hit', 'can']

In [10]:
# Convert headlines to pandas dataframe
headlinesArray = []
for line in linesDict:
    headline = linesDict[line]
    headlinesArray.append([headline['line'], headline['country'], headline['place']])
# The column order must match the row order above (headline, country, place);
# a list is used because a set has no guaranteed order
df = pandas.DataFrame(headlinesArray, columns=['headline', 'countries', 'cities'])#.reset_index(drop=True)
blankIndex=[''] * len(df)
df.index=blankIndex
df[:4]


| headline | countries | cities |
|---|---|---|
| Zika Outbreak Hits Miami | | Miami |
| Could Zika Reach New York City? | | New York City |
| First Case of Zika in Miami Beach | | Miami Beach |
| Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
