# Identifying the most-mentioned countries in the 2021 US spending omnibus
## (A sloppy attempt by a mostly-frontend engineer who hasn't worked with Python in a while and doesn't know enough about NLP)

## 1. Read bill text

To get the text file used here, I extracted the bill's plaintext from this [source PDF](https://docs.house.gov/billsthisweek/20201221/BILLS-116HR133SA-RCP-116-68.pdf) with xpdf's `pdftotext`. Unfortunately, this leaves the line numbering in place, and hyphenated words are broken across lines. I rejoined those words by removing every match of this regex:

```
-$\n+\d+\s+
```

The text is still imperfect after that: some multiword country names may be split across lines, which would break those references, but I don't expect this to affect the counts much.
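
In Python, that cleanup could look roughly like the sketch below. It assumes `re.MULTILINE` so that `$` matches at the end of each line; the sample string and the `unwrap_hyphenated` helper are made up for illustration.

```python
import re

# Same pattern as above: a trailing hyphen, the line break(s),
# the next line's number, and the whitespace after it.
HYPHEN_WRAP = re.compile(r"-$\n+\d+\s+", flags=re.MULTILINE)


def unwrap_hyphenated(text: str) -> str:
    """Rejoin words that were hyphenated across numbered lines."""
    return HYPHEN_WRAP.sub("", text)


# Made-up sample in the shape pdftotext produces for this bill
sample = "for the appro-\n12 priations made available\n13 under this Act"
cleaned = unwrap_hyphenated(sample)
print(cleaned)
```

Note that only the line numbers directly after a hyphenated break are removed; the other line numbers stay in the text.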

In [3]:
with open('bills-116hr133sa-rcp-116-68.txt', 'r', encoding="ISO-8859-1") as f:
    bill_text = f.read()
print(len(bill_text))

6895822


## 2. Use `flashgeotext` to get (rough) country counts

This bill is very big, almost 7 million characters. I tried working with [spaCy](https://spacy.io/) but had to break the work into chunks to avoid hitting memory limits, and ultimately didn't get a great result, mostly for lack of patience: spaCy's stock models aren't trained to extract *just* countries (they extract "geopolitical entities"), and I got impatient with the runtime while trying to figure out how to filter down to countries.
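
Chunking along these lines can be sketched as follows. This is a minimal illustration that assumes it's acceptable to cut at line breaks; `chunk_text` and its `max_chars` default are hypothetical, and this is not the code used for the spaCy attempt.

```python
def chunk_text(text, max_chars=100_000):
    """Yield pieces of `text`, cut at line breaks, each at most `max_chars` long.

    A single line longer than `max_chars` is yielded as its own oversized chunk.
    """
    chunk, size = [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk if adding this line would exceed the limit
        if chunk and size + len(line) > max_chars:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield "".join(chunk)


sample = "line one\nline two\nline three\n"
chunks = list(chunk_text(sample, max_chars=18))
print(chunks)
```

Cutting at line breaks still risks splitting a country name that wraps across two lines, which is the same caveat as in section 1.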

[flashgeotext](https://github.com/iwpnd/flashgeotext) is simpler and looks more like what I wanted: just produce a count of mentions of countries with no intermediate steps required, and a much faster search method.

In [2]:
from flashgeotext.geotext import GeoText

In [3]:
geotext = GeoText()

2020-12-22 20:48:43.498 | DEBUG    | flashgeotext.lookup:add:194 - cities added to pool
2020-12-22 20:48:43.502 | DEBUG    | flashgeotext.lookup:add:194 - countries added to pool
2020-12-22 20:48:43.502 | DEBUG    | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']


In [4]:
countries = geotext.extract(bill_text, span_info=False).get('countries')

In [5]:
print(countries)

{'United States': {'count': 2408}, 'Mexico': {'count': 53}, 'Canada': {'count': 43}, 'Sudan': {'count': 51}, 'Puerto Rico': {'count': 40}, 'China': {'count': 118}, 'Iran': {'count': 18}, 'North Korea': {'count': 8}, 'Russia': {'count': 39}, 'Singapore': {'count': 1}, 'Australia': {'count': 4}, 'Morocco': {'count': 2}, 'Georgia': {'count': 4}, 'Cuba': {'count': 19}, 'Turkey': {'count': 3}, 'Palau': {'count': 5}, 'South Korea': {'count': 5}, 'Israel': {'count': 34}, 'Ukraine': {'count': 20}, 'Afghanistan': {'count': 66}, 'Iraq': {'count': 29}, 'Jordan': {'count': 10}, 'Lebanon': {'count': 8}, 'Egypt': {'count': 16}, 'Tunisia': {'count': 3}, 'Oman': {'count': 1}, 'Colombia': {'count': 17}, 'Japan': {'count': 3}, 'Taiwan': {'count': 37}, 'United Kingdom': {'count': 8}, 'Bahrain': {'count': 1}, 'Myanmar': {'count': 17}, 'Cambodia': {'count': 12}, 'Ethiopia': {'count': 1}, 'Pakistan': {'count': 5}, 'Philippines': {'count': 2}, 'Sri Lanka': {'count': 6}, 'Zimbabwe': {'count': 3}, 'Palestine':

## 3. Spot-check

Spot-checking the above just by searching inside the original PDF, the results look decent. But there are some issues that would require a fancier approach to fix: the "Sudan" count includes South Sudan mentions (outdated country list?), and "Mexico" includes New Mexico, which bloats the count for mentions of Mexico-the-country to ~25% higher than it should be. The lookup isn't aware of US states, so nothing lets "New Mexico" win as the longer string match. So we need to go at least a bit fancier to get a good result.
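
A quick way to size this kind of overlap is to count the short name and subtract the longer names that contain it. A rough sketch on a made-up sample string (the `count_standalone` helper is hypothetical, and it assumes each longer form contains the short name exactly once):

```python
import re


def count_whole(text, phrase):
    """Count whole-word occurrences of `phrase` in `text`."""
    return len(re.findall(rf"\b{re.escape(phrase)}\b", text))


def count_standalone(text, name, longer_forms):
    """Count `name` occurrences that are not part of any of `longer_forms`."""
    shadowed = sum(count_whole(text, form) for form in longer_forms)
    return count_whole(text, name) - shadowed


# Made-up sample, not taken from the bill
sample = "New Mexico, Mexico, and the Gulf of Mexico; South Sudan and Sudan."
mexico_only = count_standalone(sample, "Mexico", ["New Mexico"])
sudan_only = count_standalone(sample, "Sudan", ["South Sudan"])
print(mexico_only, sudan_only)
```

This only works as a spot-check, since every longer form has to be listed by hand.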

## 4. Try wiring up `pycountry`'s country database + `pyahocorasick`

flashgeotext uses an [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) implementation under the hood for its string searching. Let's drop down one level of abstraction and re-implement to handle the country lookup we need by building an "automaton" more directly, using [`pyahocorasick`](https://pypi.org/project/pyahocorasick/). Reading around, it looks like it might have better handling for overlapping matches than the implementation used in flashgeotext.

We basically do this by dumping [`pycountry`](https://pypi.org/project/pycountry/)'s database of countries and US subdivisions/states into ahocorasick:

In [6]:
import ahocorasick
import pycountry

A = ahocorasick.Automaton()

# Add countries to the automaton
for country in pycountry.countries:    
    # `country.name` is the [ISO English "short country name"](https://unstats.un.org/unsd/tradekb/knowledgebase/country-code).
    # Unfortunately, it's not always all that short. We need to special-case to handle some countries with long "short" names
    # that are commonly abbreviated further:
    # "Iran, Islamic Republic of" => "Iran"
    country_name = country.name.split(",")[0]
    # Special-case Russian Federation too...
    if country_name == 'Russian Federation':
        country_name = 'Russia'
        A.add_word('Russian Federation', ('COUNTRY', 'Russia')) 
        
    entity = ('COUNTRY', country_name,)
    A.add_word(country_name, entity)
    if hasattr(country, 'official_name'):
        A.add_word(country.official_name, entity)

# Add US states and territories
for subdivision in pycountry.subdivisions.get(country_code='US'):
    if subdivision.type == 'State':
        A.add_word(subdivision.name, ('US_STATE', subdivision.name,))
    else:
        A.add_word(subdivision.name, ('US_TERRITORY', subdivision.name,))
        
A.make_automaton()

## Iterate through matches the automaton found; get counts
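
Before the real cell, here is a toy version of the overlap rule it applies. `pyahocorasick`'s `iter` yields `(end_index, value)` pairs, so two names that end at the same character are overlapping matches. The `iter_matches` function below is a naive pure-Python stand-in for the automaton, written only for illustration; it is not how the library works internally.

```python
def iter_matches(text, words):
    """Naive stand-in for an automaton scan.

    Returns (end_index, word) for every occurrence of every word,
    overlaps included, sorted by end index.
    """
    hits = []
    for word in words:
        start = text.find(word)
        while start != -1:
            hits.append((start + len(word) - 1, word))
            start = text.find(word, start + 1)
    return sorted(hits)


def longest_per_end(hits):
    """Keep only the longest word among matches sharing an end index."""
    best = {}
    for end, word in hits:
        if end not in best or len(word) > len(best[end]):
            best[end] = word
    return sorted(best.items())


text = "Funds for New Mexico and for Mexico."
hits = iter_matches(text, ["Mexico", "New Mexico"])
resolved = longest_per_end(hits)
print(resolved)
```

The first "Mexico" is reported twice, once on its own and once as the tail of "New Mexico"; keeping the longest match per end index leaves one state and one country.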

In [7]:
import re

last_idx = None
entities = []

word_char = re.compile(r'\w')

for (idx, entity,) in A.iter(bill_text):
    # Look ahead to next character. If it's a word character, we don't want this
    # entity - it's a prefix match for something else (e.g. "India" in "Indian")
    next_char = bill_text[idx + 1 : idx + 2]
    if (word_char.match(next_char)):
        continue

    if idx == last_idx:
        # Overlapping match found. Filter to the longest match; only replace previous match
        # if new entity name is longer.
        # (e.g. when "New Mexico" match is followed by "Mexico" - "New Mexico" wins)
        if len(entity[1]) > len(entities[-1][1]):
            entities[-1] = entity
    else:
        entities.append(entity)
    last_idx = idx

from collections import Counter

# Keep only the country matches and tally them by name
country_counts = Counter(name for (kind, name) in entities if kind == 'COUNTRY')

# most_common() returns (name, count) pairs sorted by count, descending
country_counts_sorted = country_counts.most_common()
print('\n'.join(f'{name}, {count}' for name, count in country_counts_sorted))

United States, 2286
Belarus, 119
China, 116
Afghanistan, 66
Mexico, 47
Sudan, 46
Canada, 43
Russia, 39
Taiwan, 37
Israel, 34
India, 33
Iraq, 29
Hong Kong, 22
Cuba, 19
Ukraine, 19
Iran, 18
Colombia, 17
Egypt, 16
Korea, 13
Cambodia, 12
Guatemala, 11
Honduras, 11
Nepal, 11
Northern Mariana Islands, 10
Jordan, 10
El Salvador, 10
Micronesia, 8
Lebanon, 8
United Kingdom, 7
Marshall Islands, 6
Somalia, 6
Sri Lanka, 6
Palau, 5
Pakistan, 5
South Sudan, 5
Azerbaijan, 5
Australia, 4
Haiti, 4
Venezuela, 4
Armenia, 4
Turkey, 3
Tunisia, 3
Japan, 3
Libya, 3
Yemen, 3
Zimbabwe, 3
Saudi Arabia, 3
Iceland, 3
Norway, 3
New Zealand, 3
Jersey, 3
Kenya, 3
Tanzania, 3
Morocco, 2
Samoa, 2
Kuwait, 2
Nicaragua, 2
Philippines, 2
Palestine, 2
Cameroon, 2
Central African Republic, 2
Congo, 2
Bangladesh, 2
Peru, 2
Jamaica, 2
Lithuania, 2
Singapore, 1
Syrian Arab Republic, 1
Oman, 1
Bahrain, 1
Ethiopia, 1
Greenland, 1
Uzbekistan, 1
Western Sahara, 1
Chad, 1
Niger, 1
Nigeria, 1
Malawi, 1
Thailand, 1
Belize, 1
Costa Ri

## Summary

This is where I stopped due to time constraints. It seems to be much more accurate than the attempt earlier in this notebook. Spot-checking it shows we seem to be matching genuine references to countries, not nationalities, and we're not confusing "New Mexico" with "Mexico". But:

 - "Gulf of Mexico" counts as a "Mexico" reference. To handle that, we need some way of disambiguating non-country, non-US-state geopolitical entities like "Gulf of Mexico".
 - Much worse, "North Korea" is nowhere to be seen: I sloppily stripped the ", Republic of" and ", Democratic People's Republic of" suffixes from the ISO short names, so both countries are counted under "Korea". This carelessly reunifies Korea.
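
One cheap way to approach the "Gulf of Mexico" problem is to register such phrases as decoy entries, so that they absorb the match instead of the country. The sketch below swaps in a plain regex alternation rather than the automaton, purely to keep the example small; the `LABELS` table and its `OTHER` label are hypothetical.

```python
import re
from collections import Counter

# Hypothetical lookup table: every name we want to recognize, with a label.
LABELS = {
    "Mexico": "COUNTRY",
    "New Mexico": "US_STATE",
    "Gulf of Mexico": "OTHER",  # decoy: recognized, but never counted
}

# Longest names first, so names sharing a prefix resolve to the longer one.
# Longer phrases that start earlier ("Gulf of Mexico") win anyway, because
# the regex scans left to right and matches do not overlap.
names = sorted(LABELS, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")


def count_countries(text):
    """Count only the matches labeled COUNTRY."""
    matches = (m.group(0) for m in pattern.finditer(text))
    return Counter(name for name in matches if LABELS[name] == "COUNTRY")


# Made-up sample, not taken from the bill
sample = "Drilling in the Gulf of Mexico; trade with Mexico; labs in New Mexico."
counts = count_countries(sample)
print(counts)
```

The catch is that someone still has to supply the decoy list, which is where a model with more semantic awareness would earn its keep.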
 
For my purposes (learning exercise), I think I'll be happy with this after adding a new list of short country names that patches the ISO short names. This should fix the last issue, but not the first one. To get more accuracy, a fancier model with more semantic awareness is likely needed.
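
That patch list could be as simple as a dictionary keyed by ISO alpha-2 code. A minimal sketch; `SHORT_NAME_OVERRIDES` and `short_name` are hypothetical names, and the overrides shown are just the cases that came up in this notebook:

```python
# Hypothetical patch list: common short names keyed by ISO 3166-1 alpha-2 code
SHORT_NAME_OVERRIDES = {
    "KP": "North Korea",
    "KR": "South Korea",
    "RU": "Russia",
    "IR": "Iran",
    "SY": "Syria",
}


def short_name(alpha_2, iso_name):
    """Return the patched short name, falling back to the text before the comma."""
    if alpha_2 in SHORT_NAME_OVERRIDES:
        return SHORT_NAME_OVERRIDES[alpha_2]
    return iso_name.split(",")[0]


print(short_name("KP", "Korea, Democratic People's Republic of"))
print(short_name("KR", "Korea, Republic of"))
print(short_name("BO", "Bolivia, Plurinational State of"))
```

In the loop from section 4, this would replace the `country.name.split(",")[0]` line with something like `short_name(country.alpha_2, country.name)`. The text searched for would then be "North Korea" rather than "Korea", so the ISO and official names would still need to be added as extra keys for the same entity.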