# Day 2: Foreign Policies

I thought it would be a nice idea to look at some of U.S. President Obama's speeches and mine them using `word2vec`.

Inspired by [this blog post](http://byterot.blogspot.com/2015/06/five-crazy-abstractions-my-deep-learning-word2doc-model-just-did-NLP-gensim.html) by aliostad. 

The first thing to do with any text analysis project is to get the corpus. The speech texts I used were found [here](http://obamaspeeches.com/). I used `jusText` ([github](https://github.com/miso-belica/jusText)) with some success to remove boilerplate text from the website.

The pipeline: download the speeches, clean them up a bit, tokenize them with `spaCy`, and map them onto a vector space with `word2vec` ([github](https://github.com/danielfrg/word2vec); a Python module which wraps C binaries that implement the word2vec algorithm) trained on the corpus. I used a list of countries I found online to find out which countries were mentioned most, and then used the convenience functions in `word2vec` to list, for each country, the top three tokens ranked by cosine similarity in the model's space. Results below.

I would have used `spaCy` and/or `gensim` (both great) a bit more aggressively if I had had the time. Perhaps something worth returning to.

Example results:
```
UKRAINE ~ REGION, COURSE, NUMBER
AFGHANISTAN ~ UNION, U.S., BETWEEN
ISRAEL ~ TEST, AMENDMENT, AROUND
GEORGIA ~ PROCESS, BUILT, BUILDING
IRAN ~ REFORM, LOCAL, INCREASE
INDIA ~ THOUSANDS, STUDENTS, EFFORTS
CHINA ~ SERIES, CITIZENSHIP, TERRORIST
KENYA ~ CHICAGO, DURING, MAN
IRAQ ~ WAR, PRESIDENT, IN
RUSSIA ~ HEART, OPEN, STEM
```
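
The ranking behind lists like these is cosine similarity between word vectors. Here is a minimal pure-Python sketch of that idea, not the `word2vec` module's own implementation. The helper names (`cosine_similarity`, `nearest_tokens`) and the three-dimensional vectors are invented for illustration; a trained model would supply real vectors, 100-dimensional by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_tokens(query, vectors, n=3):
    """Rank every other token by cosine similarity to `query`."""
    scored = [
        (token, cosine_similarity(vectors[query], vec))
        for token, vec in vectors.items()
        if token != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "IRAQ": [0.9, 0.1, 0.0],
    "WAR": [0.8, 0.2, 0.1],
    "STUDENTS": [0.0, 0.1, 0.9],
    "PRESIDENT": [0.5, 0.5, 0.1],
}

ranked = nearest_tokens("IRAQ", toy_vectors, n=2)
print(", ".join(token for token, score in ranked))
```

With a corpus this small, the real vectors are noisy, so it is probably wise not to read too much into any single neighbour.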

In [117]:
import codecs
import glob
import json
import justext
import requests
import word2vec
import spacy.en
import collections
import itertools

In [118]:
# Initialize the parser and prepare token probabilities for stopword filtering

nlp = spacy.en.English()
probs = [lex.prob for lex in nlp.vocab]
probs.sort()

In [119]:
with open('urllist.txt') as f:
    url_list = f.readlines()
    url_list = [url.strip() for url in url_list]

In [120]:
def get_corpus_from_url_list(url_list):
    """Generates a list of documents stripped of boilerplate text."""
    corpus = []
    
    for url in url_list:
        response = requests.get(url)
        
        # Remove boilerplate with jusText
        paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
        content_paragraphs = []

        for paragraph in paragraphs[:-1]:
            if not paragraph.is_boilerplate:
                content_paragraphs.append(paragraph.text)
        
        speech = "\n".join(content_paragraphs)
        corpus.append(speech)
    
    return corpus            

In [121]:
big_corpus = get_corpus_from_url_list(url_list)

In [122]:
# Convenience functions to save from downloading the corpus over and over

def save_corpus():
    with open('corpus.json', 'w') as fp:
        json.dump(big_corpus, fp)

In [123]:
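def load_corpus():
    # Return the loaded list; assigning to a local name inside the
    # function would leave the module-level big_corpus untouched.
    # Usage: big_corpus = load_corpus()
    with open('corpus.json', 'r') as fp:
        return json.load(fp)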
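
A quick way to check caching helpers like these is a round trip through a temporary file. The sketch below is a variant that takes the corpus and the file path as arguments, which is my assumption and not how the notebook cells are written. The sample documents are made up.

```python
import json
import os
import tempfile


def save_corpus(corpus, path):
    """Write a list of document strings to `path` as JSON."""
    with open(path, 'w') as fp:
        json.dump(corpus, fp)


def load_corpus(path):
    """Read the list of document strings back and return it."""
    with open(path, 'r') as fp:
        return json.load(fp)


# Made-up sample documents, standing in for the downloaded speeches.
sample = ["First speech text.", "Second speech text."]
path = os.path.join(tempfile.mkdtemp(), 'corpus.json')

save_corpus(sample, path)
restored = load_corpus(path)
print(restored == sample)
```

Passing the corpus in and returning it avoids relying on a global name, so the same helpers work in a fresh session.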

In [124]:
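# Lump the whole corpus together; not interested in document-level stats.
# Join and replace with spaces so that words on either side of a
# document or paragraph boundary are not fused together.
bundle = " ".join(big_corpus)
bundle = bundle.replace('\n', ' ')
bundle = bundle.replace('  ', ' ')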
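
Whitespace handling matters at this step: deleting a newline outright, as opposed to swapping it for a space, fuses the words on either side of it. A small sketch on invented strings compares that with a `split()`/`join()` normalisation. The `docs` list is hypothetical, and the normalisation is one option among several.

```python
# Invented two-document corpus with newlines and a double space.
docs = ["Thank you.\nGod bless", "America.\n\nGood  evening"]

# Deleting newlines and joining with an empty string fuses neighbours.
fused = "".join(docs).replace('\n', '')

# Splitting on any whitespace and re-joining collapses every run
# of spaces and newlines into a single space.
normalized = " ".join(" ".join(docs).split())

print(fused)
print(normalized)
```

The `split()` approach also handles runs of three or more spaces, which a single `replace('  ', ' ')` pass would only partly collapse.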

In [125]:
# Tokenize with spaCy

tokens = nlp(bundle)

In [126]:
# Filter common words: keep only tokens whose log-probability is below
# that of the 100th most probable entry in spaCy's vocabulary

cleaned_tokens = [tok for tok in tokens if tok.prob < probs[-100]]
cleaned_tokens_strings = [tok.string.upper().strip() for tok in cleaned_tokens]
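
The filter works by comparing each token's log-probability against a cutoff taken from the sorted vocabulary. Here is a toy sketch of the same logic without spaCy. The `log_probs` table and both helper names are invented for illustration, and the cutoff uses the top 3 instead of the top 100.

```python
# Hypothetical log-probabilities; spaCy would supply these via lex.prob.
log_probs = {
    "the": -3.5,
    "of": -4.2,
    "and": -4.0,
    "iraq": -11.8,
    "reform": -12.4,
}


def make_cutoff(all_log_probs, top_k):
    """Log-probability of the top_k-th most probable word."""
    return sorted(all_log_probs)[-top_k]


def drop_common(tokens, log_probs, cutoff):
    """Keep tokens strictly less probable than the cutoff."""
    return [tok for tok in tokens if log_probs[tok] < cutoff]


cutoff = make_cutoff(log_probs.values(), top_k=3)
kept = drop_common(["the", "reform", "of", "iraq", "and"], log_probs, cutoff)
print(kept)
```

Because the comparison is strict, the word sitting exactly at the cutoff is dropped along with everything more frequent than it.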

In [127]:
with open('countries.txt', 'r') as fp:
    countries = fp.readlines()
    countries = [line.strip() for line in countries]
    
mentioned_countries = [tok for tok in cleaned_tokens_strings if tok in countries]



In [128]:
country_counter = collections.Counter(mentioned_countries)
common_countries = dict(country_counter.most_common(10))
common_countries

{u'AFGHANISTAN': 31,
 u'CHINA': 31,
 u'GEORGIA': 11,
 u'INDIA': 14,
 u'IRAN': 23,
 u'IRAQ': 286,
 u'ISRAEL': 16,
 u'KENYA': 46,
 u'RUSSIA': 29,
 u'UKRAINE': 11}
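
One thing to keep in mind here: `Counter.most_common` returns a list of `(token, count)` pairs sorted by count, and wrapping that list in a plain `dict` under Python 2 does not preserve the ranking. If the order matters, a sketch like the following keeps the list as it is. The `mentions` data is made up.

```python
import collections

# Made-up mentions, standing in for the filtered country tokens.
mentions = ["IRAQ", "KENYA", "IRAQ", "IRAN", "IRAQ", "KENYA"]

counter = collections.Counter(mentions)

# List of (token, count) pairs, most frequent first.
ranked = counter.most_common(2)

# Just the tokens, still in ranked order.
top_tokens = [token for token, count in ranked]

print(ranked)
print(top_tokens)
```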

In [129]:
cleaned_corpus = " ".join(tok.string.upper() for tok in cleaned_tokens)

# Write out the cleaned corpus into a text file for use by word2vec

with open('corpus.txt', 'wb') as f:
    f.write(cleaned_corpus)

# Chunk common stuff into a phrase (see word2vec docs)

word2vec.word2phrase('corpus.txt', 'corpus-phrases.txt')

[u'word2phrase', u'-train', u'corpus.txt', u'-output', u'corpus-phrases.txt', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']


In [130]:
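# Train the model and save to corpus.bin
# Note: this trains on corpus.txt; pass 'corpus-phrases.txt' instead
# to train on the phrase-chunked text written by word2phrase above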

word2vec.word2vec('corpus.txt', 'corpus.bin')

In [131]:
# Load up the model

model = word2vec.load('corpus.bin')

# The model has a default vector size of 100; can be tweaked for performance
# The vocabulary of this corpus has 3523 words

len(model.vocab), model.vectors.shape

(3523, (3523, 100))

In [132]:
def top_n(indexes, metrics, n=3):
    """Turns index/score arrays into (token, score) tuples; keeps the first n."""
    matches = model.generate_response(indexes, metrics).tolist()
    return matches[:n]

def closest_to(token):
    """Gets the tokens nearest to `token` in the space using cosine similarity.
    model.cosine() returns a tuple of (indexes, metrics), where the indexes
    point into the model's vocabulary. That tuple is passed to top_n(),
    which prettifies it and keeps the top n."""
    indexes, metrics = model.cosine(token)
    return top_n(indexes, metrics)

In [133]:
for country in common_countries.keys():
    closest = closest_to(country)
    pretty = ", ".join([x[0] for x in closest])
    print "{} ~ {}".format(country, pretty)

UKRAINE ~ REGION, NUMBER, EFFORTS
AFGHANISTAN ~ UNION, U.S., BETWEEN
ISRAEL ~ TEST, AROUND, PERSONAL
GEORGIA ~ PROCESS, PROCEDURES, BUILT
IRAN ~ REFORM, INCREASE, LOCAL
INDIA ~ STUDENTS, HOUR, THOUSANDS
CHINA ~ SERIES, TERRORIST, LAND
KENYA ~ CHICAGO, FOUND, BEING
IRAQ ~ WAR, PRESIDENT, IN
RUSSIA ~ HEART, LEGAL, EFFORT


In [134]:
country_pairings = list(itertools.combinations(common_countries.keys(),2))
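
`itertools.combinations` yields each unordered pair exactly once, in the order of the input, so ten countries give 45 pairings. A tiny sketch on a shortened, hard-coded list (my own choice of four names) shows the shape of the result:

```python
import itertools

# Shortened list for illustration; the notebook uses the ten most
# frequently mentioned countries.
countries = ["UKRAINE", "AFGHANISTAN", "ISRAEL", "GEORGIA"]

pairs = list(itertools.combinations(countries, 2))

# n items give n * (n - 1) / 2 unordered pairs.
print(len(pairs))
print(pairs[0])
```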

In [136]:
# word2vec has the concept of an analogy; with only positive terms
# (neg=[]) it looks for tokens near the combination of the two countries

for x, y in country_pairings:
    indexes, metrics = model.analogy(pos=[x, y], neg=[], n=10)
    analogy = top_n(indexes, metrics)
    pretty = ", ".join([i[0] for i in analogy])
    print "{} + {} ~ {}".format(x, y, pretty)

UKRAINE + AFGHANISTAN ~ UNION, U.S., IRAQI
UKRAINE + ISRAEL ~ REGION, COURSE, USED
UKRAINE + GEORGIA ~ COURSE, LEADERSHIP, SOLDIERS
UKRAINE + IRAN ~ REGION, EFFORTS, LOCAL
UKRAINE + INDIA ~ EFFORTS, STUDENTS, DANGEROUS
UKRAINE + CHINA ~ REGION, EFFORTS, INTELLIGENCE
UKRAINE + KENYA ~ NUMBER, BOTH, SUCH
UKRAINE + IRAQ ~ AGAINST, UNION, WAR
UKRAINE + RUSSIA ~ EFFORTS, REGION, NUMBER
AFGHANISTAN + ISRAEL ~ ECONOMIC, SUCH, REGION
AFGHANISTAN + GEORGIA ~ AMENDMENT, WITHIN, LEADERSHIP
AFGHANISTAN + IRAN ~ EFFORTS, SUCH, LED
AFGHANISTAN + INDIA ~ EFFORTS, STUDENTS, DANGEROUS
AFGHANISTAN + CHINA ~ LED, EFFORTS, SUCH
AFGHANISTAN + KENYA ~ BOTH, STATE, SUCH
AFGHANISTAN + IRAQ ~ WAR, AGAINST, CIVIL
AFGHANISTAN + RUSSIA ~ EFFORTS, SUCH, U.S.
ISRAEL + GEORGIA ~ TEST, AS, PROCESS
ISRAEL + IRAN ~ ABILITY, AROUND, FULL
ISRAEL + INDIA ~ FULL, STUDENTS, RESEARCH
ISRAEL + CHINA ~ TEST, HEART, AMENDMENT
ISRAEL + KENYA ~ HEART, HE, BEING
ISRAEL + IRAQ ~ U.S., UNION, AGAINST
ISRAEL + RUSSIA ~ HEART, AMENDME