## Candidate identification and keyword extraction
Directions and resources: http://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/

Methodology of automatic keyphrase extraction:

1. Identify: a set of words and phrases that could convey the topical content of a document is identified.

2. Extract: candidates are scored/ranked and the “best” are selected as a document’s keyphrases.


__Candidate Identification__

* remove stop words and punctuation; 
* filter for words with certain parts of speech;
* (or) filter for multi-word phrases matching certain POS patterns;
* use external knowledge bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.

__Options__

* take all of the n-grams (where 1 ≤ n ≤ 5)
* limit candidates to only "noun phrases" matching the POS pattern '{(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}' (a regular expression written in a simplified format used by NLTK’s RegexpParser()). This matches any number of adjectives followed by at least one noun that may be joined by a preposition to one other adjective(s)+noun(s) sequence.
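
The first option can be sketched in a few lines of plain Python. This is only an illustration, not the code used later in this notebook: the function name `ngram_candidates`, the tiny stop list, and the rule of dropping n-grams that start or end with a stop word are all assumptions made for the example.

```python
import string

# Tiny illustrative stop list (an assumption); the notebook itself uses NLTK's English list.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}
PUNCT = set(string.punctuation)

def ngram_candidates(tokens, max_n=5):
    """Return all n-grams (1 <= n <= max_n) that neither start nor end
    with a stop word and contain no punctuation-only token."""
    tokens = [t.lower() for t in tokens]
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # skip n-grams bounded by stop words
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            # skip n-grams containing a token made only of punctuation
            if any(all(ch in PUNCT for ch in tok) for tok in gram):
                continue
            candidates.append(" ".join(gram))
    return candidates

tokens = "The central bank raised the policy rate .".split()
cands = ngram_candidates(tokens, max_n=3)
print(cands)
```

Taking all n-grams produces many more candidates than the POS-pattern option, including verb fragments such as "bank raised", which is one reason to prefer the noun-phrase pattern.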

__Keyphrase extraction__

1. Unsupervised Algorithms
    * no "training data" needed
    * graph-based ranking methods: the document is represented as a network whose nodes are candidate keyphrases and whose edges (optionally weighted by the degree of relatedness) connect related candidates. Implementation: a graph-based ranking algorithm such as PageRank.
    

2. Supervised Algorithms
    * need text with already-labeled examples (“training data”)
    * Two primary developments deal with __task reformulation__ and __feature design__

##### Task Reformulation

__Binary classification problem__: some fraction of candidates is classified as keyphrases and the rest as non-keyphrases. Methods: Naive Bayes, decision trees, and support vector machines. Problem: this reformulation is conceptually problematic, because it scores each candidate independently, whereas humans judge keyphrases relative to one another.<br>
Implementation: KEA (as published in "Practical Automatic Keyphrase Extraction") uses TF*IDF and position of first occurrence (while filtering on phrase length) to identify keyphrases.

__Ranking problem__: a function is trained to rank candidates pairwise according to their degree of “keyness”. The best candidates rise to the top, and the top N are taken to be the document’s keyphrases.<br> Implementation: a linear Ranking SVM that ranks candidate keyphrases.
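
To make the pairwise reformulation concrete, here is a minimal sketch of how labeled candidates can be turned into preference pairs and how a linear scoring function can be fit to them. It uses a simple perceptron update as a stand-in, not a Ranking SVM, and the feature values, labels, and function names are made up for illustration.

```python
from itertools import combinations

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pairwise_examples(features, labels):
    """Turn per-candidate (features, label) into difference vectors.
    A pair with different labels yields (x_i - x_j, +1) if candidate i
    should outrank candidate j, else (x_i - x_j, -1)."""
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        if labels[i] == labels[j]:
            continue  # no preference between equally labeled candidates
        diff = [a - b for a, b in zip(features[i], features[j])]
        pairs.append((diff, 1 if labels[i] > labels[j] else -1))
    return pairs

def train_perceptron(pairs, epochs=20):
    """Learn weights w so that sign(w . diff) matches each preference."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for diff, y in pairs:
            if y * dot(w, diff) <= 0:  # misordered pair: nudge the weights
                w = [wi + y * di for wi, di in zip(w, diff)]
    return w

# Hypothetical features: [tf*idf score, 1 - relative position of first occurrence]
candidates = ["exchange rate", "percent", "monetary policy", "staff"]
features = [[0.9, 0.8], [0.3, 0.2], [0.7, 0.9], [0.2, 0.5]]
labels = [1, 0, 1, 0]  # 1 = keyphrase according to made-up human labels

w = train_perceptron(pairwise_examples(features, labels))
scores = {c: dot(w, x) for c, x in zip(candidates, features)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])  # the top N candidates are taken as keyphrases
```

The point of the difference vectors is that the model is only ever asked which of two candidates is better, which matches the relative way humans judge keyphrases.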

##### Feature design

* __frequency statistics + other statistical features__: phrase length (number of constituent words), phrase position (normalized position within a document of first and/or last occurrence therein), and “supervised keyphraseness” (number of times a keyphrase appears as such in the training data). 
* __structural features__: titles, abstracts, intros and conclusions, metadata etc.
* __external resource-based features__: “Wikipedia-based keyphraseness” assumes that keyphrases are more likely to appear as Wiki article links and/or titles.
* __phrase commonness__: compare a candidate’s frequency in a document with respect to its frequency in an external corpus.
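
As an illustration of the last feature, phrase commonness can be approximated by a log-ratio of relative frequencies. The sketch below handles single-word terms only; the function names, the smoothing constant, and the toy token lists are assumptions for the example rather than part of this notebook's pipeline.

```python
import math
from collections import Counter

def relative_frequency(term, tokens):
    """Share of tokens equal to `term` (unigrams only, for brevity)."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0

def commonness_ratio(term, doc_tokens, ref_tokens, smoothing=1e-6):
    """Log-ratio of a term's relative frequency in the document to its
    relative frequency in a reference corpus. Positive values mean the
    term is more frequent in the document than in the reference."""
    doc_rf = relative_frequency(term, doc_tokens)
    ref_rf = relative_frequency(term, ref_tokens)
    # smoothing avoids division by zero for terms absent from the reference
    return math.log((doc_rf + smoothing) / (ref_rf + smoothing))

doc = "kwacha inflation rate kwacha policy the the".split()
reference = "the rate of the the policy of the market the".split()

for term in ("kwacha", "rate", "the"):
    print(term, round(commonness_ratio(term, doc, reference), 2))
```

A term such as "kwacha", which is frequent in the document but absent from the reference corpus, gets a large positive value, while a function word like "the" gets a negative one.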

In [1]:
import os, sys, re, csv
import heapq
import json
import string
import gensim
from gensim import corpora, models, similarities
import itertools
from operator import itemgetter
import nltk
from nltk import *
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

#### Get the mapping from document ID to corpus category (categories were assigned by regex based on headings)

In [3]:
def map_dictionary():
    path1 = '/Users/dariaulybina/Desktop/georgetown-analytics_global-economics/global-economics/KeywordsANDClustering/CorpusCatMapJuly1.json'
    # the with-block closes the file on exit, so no explicit close() is needed
    with open(path1) as df:
        maindict = json.load(df)
    return maindict
maindict = map_dictionary()

In [4]:
# Fallback: load the raw data directly in case something goes wrong with the corpus reader
def get_data():
    path1 = '/Users/dariaulybina/Desktop/georgetown-analytics_global-economics/global-economics/KeywordsANDClustering/FinalCleanJuly1.json'
    # the with-block closes the file on exit, so no explicit close() is needed
    with open(path1) as datafile:
        data = json.load(datafile)
    return data
data = get_data()

In [27]:
#corpusdir = nltk.data.find('/Users/dariaulybina/Desktop/global-economics-master/corpusCategory/') 
# Check that the reader works and load the whole corpus
reader = CategorizedPlaintextCorpusReader('/Users/dariaulybina/Desktop/georgetown-analytics_global-economics/global-economics/KeywordsANDClustering/corpusCategory/', r'\w+\d+_.*\.txt', cat_map=maindict)
fids = reader.fileids() #names of files
example = reader.raw(fileids = 'Zambia2013_5.txt') #example for testing

#Check if you can see the category per assigned id
print(reader.categories(fileids = 'Zambia2013_5.txt'))

#Check if you can see all the categories available
print("All categories: {}".format(reader.categories())) #print all categories in a list

# List all file names assigned to one category
# Note: this sample is too small and has to be combined with another category
print(reader.fileids(categories=['Monetary'])) # document IDs in the Monetary category

['Monetary']
All categories: ['Context', 'External', 'Financial', 'Fiscal', 'Monetary', 'Other', 'Real', 'Risks']
['Afghanistan2015_4.txt', 'Afghanistan2015_6.txt', 'Albania2013_4.txt', 'Albania2016_6.txt', 'Angola2014_6.txt', 'Angola2015_4.txt', 'Angola2016_4.txt', 'Armenia2014_4.txt', 'Aruba2015_6.txt', 'Azerbaijan2014_4.txt', 'Azerbaijan2016_5.txt', 'Belarus2014_5.txt', 'Belarus2016_6.txt', 'Bhutan2014_3.txt', 'Bhutan2016_2.txt', 'Bolivia2015_4.txt', 'Bolivia2016_4.txt', 'Brazil2016_5.txt', 'Burundi2014_4.txt', 'Cabo_Verde2014_7.txt', 'Cabo_Verde2016_9.txt', 'Canada2014_3.txt', 'Canada2016_8.txt', 'China2016_7.txt', 'Costa_Rica2014_7.txt', 'Costa_Rica2016_7.txt', 'Croatia2016_8.txt', 'Czech_Republic2014_5.txt', 'Czech_Republic2016_5.txt', 'Dominican_Republic2015_3.txt', 'Ethiopia2014_2.txt', 'Ethiopia2016_4.txt', 'Fiji2014_3.txt', 'Fiji2014_5.txt', 'Fiji2015_5.txt', 'Fiji2015_7.txt', 'Germany2014_6.txt', 'Ghana2014_4.txt', 'Guyana2016_2.txt', 'Honduras2016_6.txt', 'Hong_Kong2015_3.t

In [6]:
def extract_candidate_chunks(text, grammar=r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'):
    # exclude candidates that are stop words or entirely punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # tokenize, POS-tag, and chunk using regular expressions
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text))
    all_chunks = list(itertools.chain.from_iterable(nltk.chunk.tree2conlltags(chunker.parse(tagged_sent))
                                                    for tagged_sent in tagged_sents))
    # join constituent chunk words into a single chunked phrase
    candidates = [' '.join(word for word, pos, chunk in group).lower()
                  for key, group in itertools.groupby(all_chunks, lambda word__pos__chunk: word__pos__chunk[2] != 'O') if key]
    return [cand for cand in candidates if cand not in stop_words and not all(char in punct for char in cand) and not cand.isdigit()]

def extract_candidate_words(text, good_tags=set(['JJ','JJR','JJS','NN','NNP','NNS','NNPS'])):
    # exclude candidates that are stop words or entirely punctuation
    punct = set(string.punctuation)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # tokenize and POS-tag words
    tagged_words = itertools.chain.from_iterable(nltk.pos_tag_sents(nltk.word_tokenize(sent)
                                                                    for sent in nltk.sent_tokenize(text)))
    # filter on certain POS tags and lowercase all words
    candidates = [word.lower() for word, tag in tagged_words if tag in good_tags and word.lower() not in stop_words and not all(char in punct for char in word)]
    return candidates

# Keyphrase selection: frequency statistic-based approach with gensim (another option is scikit-learn)
# Replace candidates='chunks' with 'words' to compare outputs
def score_keyphrases_by_tfidf(texts, candidates):
    MODELS_DIR = '/Users/dariaulybina/Desktop/georgetown-analytics_global-economics/global-economics/KeywordsANDClustering/models/'
    # extract candidates from each text in texts, either chunks or words
    extract = {
        'chunks': extract_candidate_chunks,
        'words': extract_candidate_words,
    }[candidates]

    boc_texts = [
        extract(texts.raw(fileid)) for fileid in texts.fileids()
    ]

    # make gensim dictionary and corpus
    dictionary = corpora.Dictionary(boc_texts)
    dictionary.save(os.path.join(MODELS_DIR,'GenDict.dict'))
    # compile the corpus: each document becomes a bag-of-candidates vector of (id, count) pairs
    corpus = [dictionary.doc2bow(boc_text) for boc_text in boc_texts]
    # save the vectorized corpus as a .mm (Matrix Market) file
    corpora.MmCorpus.serialize(os.path.join(MODELS_DIR,"CorpSer.mm"),corpus) 
    # transform corpus with tf*idf model
    tfidf = gensim.models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    return corpus_tfidf, dictionary

#### TF-IDF keyphrases built from single words filtered by part-of-speech tag ('words') were less interesting than those built from 'chunks', i.e. sequences matching a predefined part-of-speech pattern

In [7]:
#tfidfs, id2word = score_keyphrases_by_tfidf(reader, 'words')
#print(type(id2word))
#print(id2word)

In [8]:
tfidfc, id2wordc = score_keyphrases_by_tfidf(reader, 'chunks')

In [9]:
print(type(id2wordc))
print(id2wordc)

<class 'gensim.corpora.dictionary.Dictionary'>
Dictionary(151754 unique tokens: ['dec', 'total demand', 'new spending priorities', 'strong record on financial inclusion', 'country policies']...)


### I find the results of "chunks" + gensim extraction more reliable and interesting than those of "words"

You can try both and decide what works for you.

__Option 'Chunks':__ Dictionary(151754 unique tokens: ['dilution of government ownership', 'safety net system', 'maltese financial system', 'nordic cooperation', 'decisive structural reform']...)<br>
__Option 'Words':__ Dictionary(31153 unique tokens: ['priv', 'renewable', 'prices—equivalent', 'mine—brought', 'terms-of-trade-driven']...)
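
For intuition about the scores used to rank these candidates, here is a small pure-Python sketch of a TF*IDF weighting. It assumes raw term counts, a base-2 logarithm for the inverse document frequency, and unit-length normalization per document, which is my understanding of gensim's `TfidfModel` defaults; please check the gensim documentation before relying on that. The function name and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: list of candidate lists (one list per document).
    Returns one {candidate: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: number of documents containing each candidate
    df = Counter(cand for doc in docs for cand in set(doc))
    scored = []
    for doc in docs:
        tf = Counter(doc)
        weights = {c: tf[c] * math.log2(n_docs / df[c]) for c in tf}
        # candidates present in every document get weight 0 and are dropped
        weights = {c: w for c, w in weights.items() if w > 0}
        # scale each document vector to unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        scored.append({c: w / norm for c, w in weights.items()} if norm else {})
    return scored

docs = [
    ["exchange rate", "inflation", "exchange rate"],
    ["inflation", "donor support"],
    ["inflation", "opium cultivation", "exchange rate"],
]
scores = tfidf_scores(docs)
for doc_scores in scores:
    print(sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True))
```

Under this weighting, candidates that occur once in a document and in no other document all receive exactly the same weight, which would be consistent with ties among a document's top-ranked phrases.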

In [15]:
print(data[0])

{'text': "(As of September 30, 2016) Membership Status: Joined: August 31, 1962; General Resources Account: Quota Fund holdings of currency (Exchange rate) Reserve Tranche Position SDR Department: Net cumulative allocation Holdings Outstanding Purchases and Loans: ESF Arrangements B. Latest Financial Arrangements Article VIII %Allocation Type ESF Date of Expiration Amount Approved Amount Drawn Arrangement Dec 19, 2008 Jun 10, 2010 Apr 28, 2003 Apr 20, 1998 Apr 27, 2006 Apr 19, 2002 Formerly PRGF. Projected Payments to Fund Principal Charges/Interest Forthcoming Total When a member has overdue financial obligations outstanding for more than three months, the amount of such arrears will be shown in this section. Page 86 Implementation of HIPC Initiative: Commitment of HIPC assistance Decision point date Assistance committed by all creditors (US$ million) Of which: IMF assistance (US$ million) (SDR equivalent in millions) Completion point date II. Disbursement of IMF assistance (SDR milli

In [16]:
updateD = {}
for d in data:
    identif = d['key2']
    upD = {
        identif: {'text': d['text'],'header': d['header'],'tag':d['tag'],
                  'country': d['country'],'year': d['year'], 'key_old':d['key1']}
    }
    updateD.update(upD)

In [18]:
#Print and save top 10 keywords by TF-IDF ranking per document
finalD = {}
with open(os.path.join('/Users/dariaulybina/Desktop/georgetown-analytics_global-economics/global-economics/KeywordsANDClustering/gensimTop10Ranks.txt'),'w') as f:
    for idx, doc in enumerate(tfidfc):
        internalD = updateD[fids[idx]]
        print("Document '{}' key phrases:".format(fids[idx]))
        #Get top 10 terms by TF-IDF score
        newL = []
        for wid, score in heapq.nlargest(10, doc, key=itemgetter(1)):
            newDict = {
                'word': id2wordc[wid],
                'score': score
            }
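            newL.append(newDict)
            print("{:0.3f}: {}".format(score, id2wordc[wid]))
            # end each record with a newline so the file has one phrase per line
            f.write("{:0.3f}: {}\n".format(score, id2wordc[wid]))
        print()
        # internalD is the same object as updateD[fids[idx]], so this assignment
        # already updates updateD; calling updateD.update(internalD) would
        # wrongly copy 'text', 'header', etc. into the top level of updateD
        internalD['vocab'] = newL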


Document 'Afghanistan2014_0.txt' key phrases:
0.343: afghanistan
0.283: donors
0.124: donor support
0.119: macroeconomic stability
0.111: nato summit in chicago
0.111: monitoring board meeting
0.111: joint coordination
0.111: subsequent reviews
0.111: external current account positions
0.111: aid at similar levels

Document 'Afghanistan2014_1.txt' key phrases:
0.412: drug industry
0.151: candidates
0.139: first round
0.135: presidential election
0.118: united nations office on drugs
0.118: nkb
0.118: provincial elections
0.118: impact on economic stability
0.118: unodc
0.118: opium cultivation

Document 'Afghanistan2014_10.txt' key phrases:
0.274: identified vehicle value approach
0.274: tarval
0.137: purchasing module
0.137: valuation database
0.137: pre-implementation compliance program
0.137: local valuation specialists
0.137: large ministries
0.137: valuation module of asycuda world
0.137: trained unit
0.137: valuation management tool

Document 'Afghanistan2014_11.txt' key phrases:

In [19]:
updateD['Senegal2016_0.txt']

{'country': 'Senegal',
 'header': 'RELATIONS WITH THE FUND',
 'key_old': '-doc-cr1701_0.txt',
 'tag': 'Other',
 'text': "(As of September 30, 2016) Membership Status: Joined: August 31, 1962; General Resources Account: Quota Fund holdings of currency (Exchange rate) Reserve Tranche Position SDR Department: Net cumulative allocation Holdings Outstanding Purchases and Loans: ESF Arrangements B. Latest Financial Arrangements Article VIII %Allocation Type ESF Date of Expiration Amount Approved Amount Drawn Arrangement Dec 19, 2008 Jun 10, 2010 Apr 28, 2003 Apr 20, 1998 Apr 27, 2006 Apr 19, 2002 Formerly PRGF. Projected Payments to Fund Principal Charges/Interest Forthcoming Total When a member has overdue financial obligations outstanding for more than three months, the amount of such arrears will be shown in this section. Page 86 Implementation of HIPC Initiative: Commitment of HIPC assistance Decision point date Assistance committed by all creditors (US$ million) Of which: IMF assistance

In [21]:
import json
almost = json.dumps(updateD)
with open("POSpatternOUT.json","w") as f:
    f.write(almost)

### TextRank Algorithm

The importance of a candidate is determined by its relatedness to other candidates, where “relatedness” may be measured by two terms’ frequency of co-occurrence or by their semantic relatedness. The method assumes that more important candidates are related to a greater number of other candidates, and that more of those related candidates are themselves considered important. It does not, however, ensure that the selected keyphrases cover all major topics, although multiple variations try to compensate for this weakness.

Essentially, a document is represented as a network whose nodes are candidate keyphrases (typically only key words) and whose edges (optionally weighted by the degree of relatedness) connect related candidates. Then, a graph-based ranking algorithm, such as Google’s famous PageRank, is run over the network, and the highest-scoring terms are taken to be the document’s keyphrases.

* __TextRank__
* __DivRank__: attempts to ensure good topic coverage
* __Topic-based clustering method__
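
The graph-ranking step can be illustrated without any graph library. The sketch below builds an undirected, unweighted co-occurrence graph from adjacent words and runs a plain power-iteration PageRank over it. The damping factor of 0.85, the fixed number of iterations, and the toy word sequence are assumptions; the notebook itself relies on `networkx.pagerank`.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected, unweighted graph
    given as a list of (node, node) edges."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        # each node receives an equal share of every neighbor's current rank
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
            for node in neighbors
        }
    return rank

words = ["exchange", "rate", "policy", "rate", "inflation", "rate", "policy"]
edges = list(zip(words, words[1:]))  # window of 2: only adjacent words co-occur
ranks = pagerank(edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph "rate" is connected to every other word, so it ends up with the highest score, which is the behavior the description above predicts for well-connected candidates.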

In [22]:
# Text Rank Algorithm
def score_keyphrases_by_textrank(text, n_keywords=0.05):
    from itertools import takewhile, tee
    import networkx, nltk
    
    # tokenize for all words, and extract *candidate* words
    words = [word.lower()
             for sent in nltk.sent_tokenize(text)
             for word in nltk.word_tokenize(sent)]
    candidates = extract_candidate_words(text)
    # build graph, each node is a unique candidate
    graph = networkx.Graph()
    graph.add_nodes_from(set(candidates))
    # iterate over word-pairs, add unweighted edges into graph
    def pairwise(iterable):
        """s -> (s0,s1), (s1,s2), (s2, s3), ..."""
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)
    for w1, w2 in pairwise(candidates):
        if w2:
            graph.add_edge(*sorted([w1, w2]))
    # score nodes using default pagerank algorithm, sort by score, keep top n_keywords
    ranks = networkx.pagerank(graph)
    if 0 < n_keywords < 1:
        n_keywords = int(round(len(candidates) * n_keywords))
    word_ranks = {word_rank[0]: word_rank[1]
                  for word_rank in sorted(ranks.items(), key=lambda x: x[1], reverse=True)[:n_keywords]}
    keywords = set(word_ranks.keys())
    # merge keywords into keyphrases
    keyphrases = {}
    j = 0
    for i, word in enumerate(words):
        if i < j:
            continue
        if word in keywords:
            kp_words = list(takewhile(lambda x: x in keywords, words[i:i+10]))
            avg_pagerank = sum(word_ranks[w] for w in kp_words) / float(len(kp_words))
            keyphrases[' '.join(kp_words)] = avg_pagerank
            # counter as hackish way to ensure merged keyphrases are non-overlapping
            j = i + len(kp_words)
    
    return sorted(keyphrases.items(), key=lambda x: x[1], reverse=True)

def extract_candidate_features(candidates, doc_text, doc_excerpt, doc_title):
    import collections, math, nltk, re
    
    candidate_scores = collections.OrderedDict()
    
    # get word counts for document
    doc_word_counts = collections.Counter(word.lower()
                                          for sent in nltk.sent_tokenize(doc_text)
                                          for word in nltk.word_tokenize(sent))
    
    for candidate in candidates:
        
        pattern = re.compile(r'\b'+re.escape(candidate)+r'(\b|[,;.!?]|\s)', re.IGNORECASE)
        
        # frequency-based
        # number of times candidate appears in document
        cand_doc_count = len(pattern.findall(doc_text))
        # the count can be 0 for several reasons in a simplified example like this one; skip such candidates
        if not cand_doc_count:
            print('**WARNING: {} not found!'.format(candidate))
            continue
    
        # statistical
        candidate_words = candidate.split()
        max_word_length = max(len(w) for w in candidate_words)
        term_length = len(candidate_words)
        # get frequencies for term and constituent words
        sum_doc_word_counts = float(sum(doc_word_counts[w] for w in candidate_words))
        try:
            # lexical cohesion doesn't make sense for 1-word terms
            if term_length == 1:
                lexical_cohesion = 0.0
            else:
                lexical_cohesion = term_length * (1 + math.log(cand_doc_count, 10)) * cand_doc_count / sum_doc_word_counts
        except (ValueError, ZeroDivisionError) as e:
            lexical_cohesion = 0.0
        
        # positional
        # found in title, key excerpt
        in_title = 1 if pattern.search(doc_title) else 0
        in_excerpt = 1 if pattern.search(doc_excerpt) else 0
        # first/last position, difference between them (spread)
        doc_text_length = float(len(doc_text))
        first_match = pattern.search(doc_text)
        abs_first_occurrence = first_match.start() / doc_text_length
        if cand_doc_count == 1:
            spread = 0.0
            abs_last_occurrence = abs_first_occurrence
        else:
            for last_match in pattern.finditer(doc_text):
                pass
            abs_last_occurrence = last_match.start() / doc_text_length
            spread = abs_last_occurrence - abs_first_occurrence

        candidate_scores[candidate] = {'term_count': cand_doc_count,
                                       'term_length': term_length, 'max_word_length': max_word_length,
                                       'spread': spread, 'lexical_cohesion': lexical_cohesion,
                                       'in_excerpt': in_excerpt, 'in_title': in_title,
                                       'abs_first_occurrence': abs_first_occurrence,
                                       'abs_last_occurrence': abs_last_occurrence}

    return candidate_scores


### Score keyphrases by textrank
The __score_keyphrases_by_textrank__ function is one of two implementations of the TextRank algorithm. Only unigram candidates (not chunks or n-grams) are added to the network as nodes, the co-occurrence window size is fixed at 2 (so only adjacent words are said to “co-occur”), and the edges between nodes are unweighted (rather than weighted by the number of co-occurrences). The N top-scoring candidates are taken to be the document's keywords (by default, n_keywords=0.05 keeps the top 5% of candidates); sequences of adjacent keywords are merged to form keyphrases, and their individual PageRank scores are averaged so as not to favor longer keyphrases.
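
The merging step described above can be isolated in a short sketch. It mirrors the idea of the function (runs of adjacent keywords become one phrase, scored by the mean of the word scores) but is a simplified rewrite; `merge_keywords`, the toy sentence, and the scores are made up for illustration.

```python
from itertools import takewhile

def merge_keywords(words, word_ranks, max_len=10):
    """Merge runs of adjacent keywords into phrases, each scored by the
    mean rank of its constituent words."""
    keyphrases = {}
    i = 0
    while i < len(words):
        if words[i] in word_ranks:
            # collect the run of consecutive keywords starting at position i
            run = list(takewhile(lambda w: w in word_ranks, words[i:i + max_len]))
            keyphrases[" ".join(run)] = sum(word_ranks[w] for w in run) / len(run)
            i += len(run)  # skip past the run so phrases never overlap
        else:
            i += 1
    return keyphrases

words = "the foreign exchange rate and the policy rate rose".split()
word_ranks = {"foreign": 0.02, "exchange": 0.04, "rate": 0.06, "policy": 0.03}
merged = merge_keywords(words, word_ranks)
print(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))
```

Because scores are averaged rather than summed, the two-word phrase "policy rate" outranks the three-word phrase "foreign exchange rate" here, so length alone gives no advantage.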

In [31]:
# `example` was assigned earlier: the raw text of one article (Zambia2013_5.txt), used for testing
#mon = reader.fileids(categories=['Monetary'])
print(score_keyphrases_by_textrank(example))

[('boz', 0.02475723282140375), ('exchange', 0.02000276709204734), ('policy', 0.019263767620855295), ('exchange rate', 0.01908776216451627), ('policy rate', 0.018718262428920246), ('rate', 0.018172757236985197), ('percent', 0.01559210434929047), ('reserves', 0.015532798112084877), ('staff', 0.014991675698561273), ('foreign exchange', 0.014641565208155165), ('foreign exchange transactions', 0.01378903925569018), ('monetary policy', 0.013087421179656377), ('transactions', 0.01208398735076021), ('international reserves', 0.011595661937878594), ('authorities', 0.011502547281351314), ('boz’s', 0.010012013408469612), ('international transactions', 0.009871256557216261), ('payments', 0.009170146375930832), ('inflation', 0.009015946761577985), ('kwacha', 0.00887784231604226), ('bond', 0.008465357075252672), ('tax payments', 0.00793567274349104), ('subsidies', 0.00762343132444717), ('monitoring', 0.007065969305773568), ('regulation', 0.006940068661316348), ('fees', 0.006914981070863331), ('tax',

#### Extraction of candidate features: specific features of a keyphrase
#### Optional: inspect the quality and statistical parameters of a chosen keyword or group of keywords

In [None]:
candidates = ["liquidity"]
doc_text = example
doc_title  = 'None'
doc_excerpt = 'None'
candidate_scores = extract_candidate_features(candidates, doc_text, doc_excerpt, doc_title)
print("Title: {}".format(doc_title))
print('Term of interest: "{}"'.format(candidates[0]))
print(candidate_scores)