# Workflow

1. Load the headline data and examine it for data quality issues.
    - Use any library/data structure to read in the headlines
    - Read through some of the headlines and identify potential problems
    
2. Using regular expressions and the city and country names in the geonamescache library, find any cities or countries mentioned in each headline.
    - Normalize both the headlines and the city/country names by removing accent marks. This can be done with the unidecode library.
    - Watch out for multiple cities in a headline and for matches on short words! The match should cover the entire city name (for example, San Marino), not just part of it (San).
    
3. Put the extracted data into a pandas DataFrame with three columns: headline, city, country.

4. Make sure there were no issues with the extraction by sampling some of the headlines and examining the city and country names.
    - One method for finding problems is to look for the most common names and see if there are any issues.
    
5. Once you are confident you’ve found all the cities/countries in each headline, save the DataFrame for the next part.



In [1]:
# 1. Load Data
with open('data/headlines.txt', 'r') as f:
    raw_headlines = f.readlines()
raw_headlines[:20]

['Zika Outbreak Hits Miami\n',
 'Could Zika Reach New York City?\n',
 'First Case of Zika in Miami Beach\n',
 'Mystery Virus Spreads in Recife, Brazil\n',
 'Dallas man comes down with case of Zika\n',
 'Trinidad confirms first Zika case\n',
 'Zika Concerns are Spreading in Houston\n',
 'Geneve Scientists Battle to Find Cure\n',
 'The CDC in Atlanta is Growing Worried\n',
 'Zika Infested Monkeys in Sao Paulo\n',
 'Brownsville teen contracts Zika virus\n',
 'Mosquito control efforts in St. Louis take new tactics with Zika threat\n',
 'San Juan reports 1st U.S. Zika-related death amid outbreak\n',
 'Flu outbreak in Galveston, Texas\n',
 'Zika alert â€“ Manila now threatened\n',
 'Zika afflicts 7 in Iloilo City\n',
 'New Los Angeles Hairstyle goes Viral\n',
 'Louisiana Zika cases up to 26\n',
 'Orlando volunteers aid Zika research\n',
 'Zika infects pregnant woman in Cebu\n']

Potential Problems
- Mis-encoded non-ASCII characters (`'Zika alert â€“ Manila now threatened\n'`)
- Quantities as words vs. digits
- Abbreviations (MCD)
- Quotes
- Capitalization
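
The stray characters in the Manila headline look like UTF-8 bytes that were decoded with a single-byte Windows code page. The following is a minimal sketch, assuming the damage really is a UTF-8/cp1252 mix-up (the true encoding of the source file is not confirmed here). It reproduces that kind of damage from a clean string and then reverses it. The helper name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mix-up; return the text unchanged if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already clean), so leave it alone
        return text


# Reproduce the damage: an en dash encoded as UTF-8, then decoded as cp1252
original = "Zika alert \u2013 Manila now threatened"
damaged = original.encode("utf-8").decode("cp1252")

print(repr(damaged))
print(repr(repair_mojibake(damaged)))
```

If the assumption holds, opening the file with an explicit `encoding='utf-8'` would avoid the problem at the source. The repair function is only useful when the text has already been decoded incorrectly.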

In [2]:
# Clean data
from unidecode import unidecode

def clean(line):
    return unidecode(line).strip()

headlines = [clean(line) for line in raw_headlines]
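
To illustrate what the accent removal in `clean` does, here is a rough standard-library sketch. It is a stand-in, not the unidecode implementation: unidecode also transliterates non-Latin scripts, while this version only drops combining marks. The name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in ["São Paulo", "Genève", "Bogotá"]:
    print(name, "->", strip_accents(name))
```

Applying the same normalization to both the headlines and the place names is what lets a headline spelled `Sao Paulo` match a city stored as `São Paulo`.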

In [3]:
# 2. Match cities/countries
import re
import geonamescache 

gc = geonamescache.GeonamesCache()

# Prepare country patterns
country_structs = gc.get_countries_by_names()  # dict keyed by country name
raw_countries = [clean(name) for name in country_structs]
countries = sorted(set(raw_countries))
# Escape each name so characters such as '.' are matched literally,
# and wrap it in word boundaries to avoid partial-word matches
country_patterns_txt = [f'\\b{re.escape(c)}\\b' for c in countries]
country_patterns = [re.compile(p) for p in country_patterns_txt]


# Prepare city patterns
city_structs = gc.get_cities()  # dict of city records
raw_cities = [city_structs[key]['name'] for key in city_structs]
clean_cities = [clean(c) for c in raw_cities]
cities = sorted(set(clean_cities))

city_patterns_txt = [f'\\b{re.escape(c)}\\b' for c in cities]
city_patterns = [re.compile(p) for p in city_patterns_txt]
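
An alternative to one compiled pattern per name is a single alternation with the longest names listed first, so that the regex engine tries `San Marino` before `San`. This is a sketch of that idea, not the approach used in this notebook. The helper `build_name_pattern`, the short name list, and the second and third headlines are hypothetical.

```python
import re


def build_name_pattern(names):
    """Compile one regex that matches any of the names as whole words."""
    # Longest names first, so 'San Marino' is tried before 'San'
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"\b(?:{alternation})\b")


# Hypothetical sample names, not the full geonamescache list
sample_names = ["San", "San Marino", "Miami", "Miami Beach", "Recife"]
pattern = build_name_pattern(sample_names)

print(pattern.findall("First Case of Zika in Miami Beach"))
print(pattern.findall("Travelers carry virus from Recife to San Marino"))
print(pattern.findall("Santiago reports new cases"))
```

Because `findall` returns every non-overlapping match, this form also reports several cities in one headline, and the word boundaries keep `San` from matching inside `Santiago`.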




In [4]:
# Find matches in each headline and build an intermediate list of tuples

def match_patterns(patterns, headline):
    """Return the longest match of any pattern in the headline, or None.

    Keeping only the longest match avoids reporting a short name that is
    part of a longer one. Only one name is returned per headline.
    """
    all_results = []
    for pattern in patterns:
        # findall returns an empty list when there is no match
        all_results.extend(pattern.findall(headline))
    if not all_results:
        return None
    all_results.sort(key=len)
    return all_results[-1]


def find_references(debug=False):
    """Return a list of (headline, country, city) tuples for all headlines."""
    processed_headlines = []
    for headline in headlines:
        country = match_patterns(country_patterns, headline)
        city = match_patterns(city_patterns, headline)
        if debug:
            print(headline)
            print(f'  COUNTRY: {country}')
            print(f'  CITY   : {city}')
        processed_headlines.append((headline, country, city))
    return processed_headlines

results = find_references()



In [5]:
# Check results
#for x in results: print(x)

from collections import Counter

# Count how often each city was matched, most frequent first
found_cities = [city for _, _, city in results if city is not None]
sorted_frequencies = Counter(found_cities).most_common()
sorted_frequencies

# Inspect the headlines matched to the second most common city
[f for f in results if f[2] == sorted_frequencies[1][0]]


[('Lower Hospitalization in Monroe after Hepatitis D Vaccine becomes Mandatory',
  None,
  'Monroe'),
 ('Spike of Syphilis Cases in West Monroe', None, 'Monroe'),
 ('West Nile Virus Hits Monroe', None, 'Monroe'),
 ('The Spread of Respiratory Syncytial Virus in Monroe has been Confirmed',
  None,
  'Monroe')]
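
Besides the frequency check, step 4 of the workflow suggests sampling. Here is a minimal sketch that lists the headlines with no city match and draws a repeatable random sample for manual review. It uses a small hypothetical `sample_results` list in the same `(headline, country, city)` shape as `results`, and the unmatched headline in it is invented for illustration.

```python
import random

# Hypothetical rows in the same (headline, country, city) shape as `results`
sample_results = [
    ("Zika Outbreak Hits Miami", None, "Miami"),
    ("Mystery Virus Spreads in Recife, Brazil", "Brazil", "Recife"),
    ("Researchers identify new virus strain", None, None),
    ("Spike of Syphilis Cases in West Monroe", None, "Monroe"),
]

# Headlines where no city was found deserve a manual look
unmatched = [headline for headline, _, city in sample_results if city is None]
print(unmatched)

# A seeded generator makes the spot check repeatable between runs
rng = random.Random(0)
for headline, country, city in rng.sample(sample_results, k=2):
    print(f"{headline!r} -> city={city}, country={country}")
```

Reading a sample next to the extracted names is how cases such as `West Monroe` being reported as `Monroe` tend to surface.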

In [6]:
# Convert to pandas data frame
import pandas as pd
import numpy as np

df = pd.DataFrame(results, columns=['headline', 'country', 'city'])
df.fillna(value=np.nan, inplace=True)  # translate None to NaN (pd.np is no longer available in current pandas)
df

| | headline | country | city |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | | Miami |
| 1 | Could Zika Reach New York City? | | New York City |
| 2 | First Case of Zika in Miami Beach | | Miami Beach |
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
| 4 | Dallas man comes down with case of Zika | | Dallas |
| 5 | Trinidad confirms first Zika case | | Trinidad |
| 6 | Zika Concerns are Spreading in Houston | | Houston |
| 7 | Geneve Scientists Battle to Find Cure | | Geneve |
| 8 | The CDC in Atlanta is Growing Worried | | Atlanta |
| 9 | Zika Infested Monkeys in Sao Paulo | | Sao Paulo |


In [7]:
# Dump file
df.to_json('geo-headlines.json')