In [1]:
# display all output and eliminate scrolling in output areas

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


<h2>Discovering Disease Outbreaks from News Headlines</h2>

<h3>1. Load headline data and examine for data quality issues</h3>

In [3]:
# key libraries

from unidecode import unidecode
import geonamescache
import re
import numpy as np
import pandas as pd

<h4>Loading and verifying headlines</h4>

In [4]:
# load headlines

fName = "data/headlines.txt"

# use a context manager so the file handle is closed after reading
with open(fName, 'r') as f:
    headlines = [line.strip() for line in f]

print('Num of headlines', len(headlines))
print('\nSample headlines:\n')
for headline in headlines[:10]:
    print(headline)



Num of headlines 650

Sample headlines:

Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika
Trinidad confirms first Zika case
Zika Concerns are Spreading in Houston
Geneve Scientists Battle to Find Cure
The CDC in Atlanta is Growing Worried
Zika Infested Monkeys in Sao Paulo


<h4>Some potential problems</h4>
<ul>
    <li><i>Accent Marks</i> -- Many cities in the world have names with accents (e.g. São Paulo, Brazil). The headlines, however, contain no accent marks (e.g. Zika Infested Monkeys in Sao Paulo). We need a search scheme that ignores the accents.</li>
    <li><i>Headlines without Country or State/Province/Region Names</i> -- Most of the headlines have no country names or state/province/region names (e.g. in the above sample there is only one - Recife, Brazil). Since most city names are not unique (e.g. there are at least 6 cities in the world named Los Angeles and there are at least 11 cities in the U.S. named Dallas), we need to have a way to determine which of the possibilities is most probable.</li>
    <li><i>Number of City Names that are Common (English) Words</i> -- In the 'geonamescache' dictionaries (discussed below) there are at least 5 two-letter city names, including 'Of'; 125 three-letter names, including 'Man', 'Gap', and 'Bar'; and 720 four-letter names, including 'Best', 'Date', 'Much', and 'Same'. We therefore need a strategy for handling uppercase letters and capitalized words, and a way to decide whether a word is a location. For instance, in the headline above, "Dallas man comes down with case of Zika", we know that 'man' is not a location, but how do we determine this programmatically?</li>
    <li><i>City Names with Two or More Words</i> -- Many city names consist of two or more words, and often the individual words are city names themselves. In the above examples, a case in point is Miami Beach: Miami is a legitimate location and so is Miami Beach. The same is true for San Francisco: 'San' is the name of a city and, obviously, so is San Francisco. Again, we need a search strategy that chooses the most likely candidate.</li>
    <li><i>Punctuation Marks in Location Names</i> -- Occasionally, location names appear with punctuation marks attached: for example, an apostrophe marking possession (e.g. Chicago's First Zika Case Confirmed) or a hyphen joining the name to another word (e.g. Thailand-Zika Virus in Bangkok). The first instinct is to strip the punctuation marks. The problem is that 64 of the city names in the geonamescache city dictionary contain one or more apostrophes and 758 are hyphenated.</li>
    <li><i>Headlines with Missing or Unmatched Location Names</i> -- Some headlines lack location names altogether (e.g. 'Zika Virus sparks International Concern'). Others specify a location but not a specific city or country (e.g. 'Louisiana Zika cases up to 26' or 'New Zika Case Confirmed in Sarasota County'). While we can jettison the first example, what about the other two? Eventually, we want to associate specific locations with the names, so what method would we use in these sorts of cases?</li>
    <li><i>Headlines with no Disease Mentioned</i> -- While the initial focus is on identifying location names, some headlines have a location but no clearly named disease. Cases in point from the above examples are 'Geneve Scientists Battle to Find Cure' and 'The CDC in Atlanta is Growing Worried', or one that is not displayed: 'Cancun hit by Party Fever.' With the first two, the full story might provide the name of the disease. In the third case, there is no underlying disease. The question is: do we eliminate these stories? If so, how do we do it programmatically?</li>
</ul>
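As a rough illustration of the accent problem, the sketch below strips accent marks using only the standard library. It is a stand-in for illustration, not the method this notebook uses (the notebook relies on unidecode); the helper name strip_accents is hypothetical. Note that this approach only removes combining marks, so characters that unidecode would transliterate (such as 'ø') pass through unchanged.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks (stdlib-only stand-in for unidecode)."""
    # NFKD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # keep only the characters that are not combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("São Paulo"))  # Sao Paulo
print(strip_accents("Genève"))     # Geneve
```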

<h3>2. Match cities/countries within each headline</h3>

Use regex and the geonamescache library to match city and country names within each headline, normalizing to remove accent marks and ensuring that the complete (not partial) city name is matched.

In [5]:
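# function to convert a country or city name in geonamescache to a compiled regex
# that matches the name as a whole word, with or without accent marks

def name_to_regex(name):
    decoded_name = unidecode(name)
    # escape the names so that any regex metacharacters in them (e.g. '.')
    # are matched literally
    if name != decoded_name:
        regex = fr'\b({re.escape(name)}|{re.escape(decoded_name)})\b'
    else:
        regex = fr'\b{re.escape(name)}\b'
    return re.compile(regex, flags=re.IGNORECASE)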
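To see what the word-boundary anchors and the IGNORECASE flag do in practice, here is a small sketch using hand-written patterns of the same form. The 'Santiago' headline is made up for illustration. The possessive case is handled because the apostrophe counts as a word boundary, while the case-insensitive match on 'man' shows why a later capitalization check is needed.

```python
import re

# Hand-written patterns of the same form that name_to_regex produces
chicago = re.compile(r'\bChicago\b', flags=re.IGNORECASE)
man = re.compile(r'\bMan\b', flags=re.IGNORECASE)
san = re.compile(r'\bSan\b', flags=re.IGNORECASE)

# The apostrophe is a non-word character, so \b matches right after 'Chicago'
print(bool(chicago.search("Chicago's First Zika Case Confirmed")))  # True

# IGNORECASE lets the city name 'Man' match the common word 'man'
print(bool(man.search("Dallas man comes down with case of Zika")))  # True

# \b prevents a match inside a longer word (hypothetical headline)
print(bool(san.search("Zika arrives in Santiago")))                 # False
```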

<h4>Converting country and city names to regex form</h4>

In [6]:
# converting countries and cities to regex form using geonamescache dictionaries

gc = geonamescache.GeonamesCache()

# Countries
countries = [country['name'] for country in gc.get_countries().values()]
country_to_name = {name_to_regex(name): name for name in countries}
print(f"{len(countries)} countries converted to regex form for country_to_name dictionary")

# Cities
cities = [city['name'] for city in gc.get_cities().values()]
city_to_name = {name_to_regex(name): name for name in cities}
print(f"{len(cities)} cities converted to regex form for city_to_name dictionary")

252 countries converted to regex form for country_to_name dictionary
24336 cities converted to regex form for city_to_name dictionary


In [7]:
# function to find locations (countries and cities) in text
# keeps only matches that start with an uppercase letter and returns the
# longest one, so the complete (not partial) city name is matched

def get_locs_in_headline(headline, dictionary):
    locs_in_headline = set()
    for regex, name in sorted(dictionary.items(), key=lambda x: x[1]):          
        match = regex.search(headline)
        if match:
            if headline[match.start()].isupper():
                locs_in_headline.add(unidecode(name))
    locs = list(locs_in_headline)
    if locs:
        return max(locs, key=len)
    return None
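The selection logic can be tried without loading geonamescache. The sketch below applies the same two rules (the match must start with an uppercase letter, and the longest surviving name wins) to a hypothetical three-entry dictionary. The names toy_dict and longest_capitalized_match are made up for this illustration.

```python
import re

# Hypothetical three-entry dictionary standing in for city_to_name
toy_names = ["Miami", "Miami Beach", "Of"]
toy_dict = {
    re.compile(rf"\b{re.escape(n)}\b", flags=re.IGNORECASE): n
    for n in toy_names
}

def longest_capitalized_match(headline, dictionary):
    candidates = []
    for regex, name in dictionary.items():
        match = regex.search(headline)
        # discard matches on lowercase words such as 'of' or 'man'
        if match and headline[match.start()].isupper():
            candidates.append(name)
    # prefer the longest name, e.g. 'Miami Beach' over 'Miami'
    return max(candidates, key=len) if candidates else None

print(longest_capitalized_match("First Case of Zika in Miami Beach", toy_dict))  # Miami Beach
print(longest_capitalized_match("Zika Virus sparks concern", toy_dict))          # None
```

Picking the longest name is a heuristic: it resolves the Miami / Miami Beach case, but it cannot tell a capitalized verb from a place name.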

In [8]:
# finding city and country names in headlines

matched_countries = [get_locs_in_headline(headline, country_to_name) for headline in headlines]
matched_cities = [get_locs_in_headline(headline, city_to_name) for headline in headlines]

In [9]:
# samples of countries and cities

print('Sample of countries in headlines:', matched_countries[0:10])
print('\nSample of cities in headlines:', matched_cities[0:10])


Sample of countries in headlines: [None, None, None, 'Brazil', None, None, None, None, None, None]

Sample of cities in headlines: ['Miami', 'New York City', 'Miami Beach', 'Recife', 'Dallas', 'Trinidad', 'Houston', 'Geneve', 'Atlanta', 'Sao Paulo']


<h3>3. Extract data into pandas DataFrame with 3 cols - headline, city, country</h3>

In [10]:
data = {'Headline': headlines, 'City': matched_cities, 'Country': matched_countries}
df = pd.DataFrame(data)

<h3>4. Review Sample of headlines, cities and countries for any remaining errors</h3>

In [11]:
#review samples

df.head(5)

df.loc[[17, 236]]

Unnamed: 0,Headline,City,Country
0,Zika Outbreak Hits Miami,Miami,
1,Could Zika Reach New York City?,New York City,
2,First Case of Zika in Miami Beach,Miami Beach,
3,"Mystery Virus Spreads in Recife, Brazil",Recife,Brazil
4,Dallas man comes down with case of Zika,Dallas,


Unnamed: 0,Headline,City,Country
17,Louisiana Zika cases up to 26,,
236,Zika Virus Sparks 'International Concern',Sparks,


Most of the issues with inadvertently matching common words as city names (e.g. 'of', the third problem listed in section 1) and with matching a short name in place of the longer, correct one (e.g. Miami instead of Miami Beach, the fourth problem) have been eliminated. Others remain. A case in point is the identification of Sparks as the city in the headline 'Zika Virus Sparks International Concern'. Sparks is a legitimate city name, but in this context it is a capitalized verb rather than a place name.

Additionally, there are still headlines with no matched city name. The headline 'Louisiana Zika cases up to 26' provides a state name but no city name, and there are 38 other headlines where the city name is also missing. Because these 39 headlines are a small share of the 650, they are dropped.
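For reference, the size of that share can be worked out from the two counts the notebook prints: 650 headlines loaded and 611 kept in the final DataFrame. The variable names here are illustrative only.

```python
total_headlines = 650      # from the 'Num of headlines' output
remaining_headlines = 611  # from the 'headlines remaining' output

# headlines with no matched city, and their share of the full set
unmatched = total_headlines - remaining_headlines
share = unmatched / total_headlines

print(unmatched)       # 39
print(f"{share:.1%}")  # 6.0%
```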

In [12]:
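# Final DataFrame with unmatched headlines eliminated
# None's converted to NaN
# .copy() makes the filtered frame independent of df, so the in-place
# replace below does not operate on a view of the original

df_headlines_cities_countries = df[pd.notna(df['City'])][['Headline','City','Country']].copy()
df_headlines_cities_countries.replace(to_replace=[None], value=np.nan, inplace=True)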

print(f"{len(df_headlines_cities_countries)} headlines remaining")
df_headlines_cities_countries

611 headlines remaining


Unnamed: 0,Headline,City,Country
0,Zika Outbreak Hits Miami,Miami,
1,Could Zika Reach New York City?,New York City,
2,First Case of Zika in Miami Beach,Miami Beach,
3,"Mystery Virus Spreads in Recife, Brazil",Recife,Brazil
4,Dallas man comes down with case of Zika,Dallas,
...,...,...,...
645,Rumors about Rabies spreading in Jerusalem hav...,Jerusalem,
646,More Zika patients reported in Indang,Indang,
647,Suva authorities confirmed the spread of Rotav...,Suva,
648,More Zika patients reported in Bella Vista,Bella Vista,


<h3>5. Saving DataFrame for Further Analysis</h3>

In [13]:
# save by storing in 'pickle' file

df_headlines_cities_countries.to_pickle("data/df_headlines_cities_countries.pkl")