In [1]:
doc = """It is, perhaps, dreadfully apt that an invasion which began 20 years ago as a counter-terrorism operation has ended in the horror of a mass casualty terrorist attack. The US-led attempt to destroy al-Qaida and rescue Afghanistan from the Taliban was undercut by the Iraq war, which spawned Islamic State. Now the circle is complete as an Afghan IS offshoot emerges as America’s new nemesis.
The Kabul airport atrocity shows just how difficult it is to break the cycle of violence, vengeance and victimisation. Joe Biden’s swift vow to hunt down the perpetrators and “make them pay” presumably means US combat forces will again be in action in Afghanistan soon. If the past is any guide, mistakes will be made, civilians will die, local communities will be antagonised. Result: more terrorists.

It is an obvious irony that US military chiefs in Kabul are collaborating with the Taliban, their sworn enemy, against the common IS foe as the evacuation ends. This suggests negotiators, on both sides, could have tried harder to reach a workable peace deal. It may augur well for future cooperation, for example on humanitarian aid. But the Taliban has many faces – and many cannot be trusted.

Last week’s events have raised yet more questions about Biden’s judgment and competence. He will be blamed personally. His predicament recalls the downfall of another Democratic president, Jimmy Carter. After the disastrous failure of Operation Eagle Claw to rescue US hostages in Tehran in April 1980, Carter was voted out of office the following November.

Biden faces Republican calls to resign. His approval ratings have plunged. But he defiantly insists that quitting Afghanistan is the right thing to do. Polls suggest most Americans agree, though they are critical of how it has been managed. Unlike in Carter’s time, the next presidential election is three years away. By then the agony and humiliations of recent days may be a distant memory.

The Kabul debacle also casts doubt on Biden’s new counter-terrorism strategy, which reportedly downgrades the threat posed by Islamist terrorism to the US. His national security team wants to shift global priorities and resources to meet different, 21st-century challenges to American hegemony, such as China, cyberwarfare and the climate crisis.

Biden is said to want to use the 20th anniversary of the 11 September al-Qaida attacks on New York and Washington to declare America’s “forever wars” over – for which he will claim credit. Setting the Afghan shambles aside, he is expected to say the era of invasion, occupation, nation-building and the “global war on terror” is at an end. “The US approach should centre on gathering intelligence, training indigenous forces, and maintaining air power as well as special forces capabilities for the occasional strike when necessary,” foreign policy analysts Bruce Riedel and Michael O’Hanlon argued recently.

No one knows whether such a costly, hard to organise strategy will work in the long term. But the shift is already having tangible consequences. In Iraq, for example, US combat operations will cease in December. About 2,500 Americans will stay, to train and advise. In Syria, a small number of special forces will remain. Iraqis understandably worry about an IS comeback and an Afghan-style implosion. The same story of US disengagement and drawback is heard across the Middle East as the US “pivots” to Asia. Combat aircraft are being redeployed, carrier battlegroups may be reassigned to the Pacific theatre, and anti-missile batteries are being withdrawn from Iraq, Kuwait, Jordan and Saudi Arabia. Most of these assets were pointed at Iran, deemed a prime sponsor of terrorism.

In the Sahel, west Africa, the Democratic Republic of the Congo and Mozambique, the US barely registers in the fight against Boko Haram and assorted IS and al-Qaida affiliates. The impressively named US Africa Command is headquartered in Stuttgart. President Muhammadu Buhari warns that Nigeria could suffer a similar fate to Afghanistan without a “comprehensive partnership” with the US. “Some sense the west is losing its will for the fight,” he said. For US allies, all this points to a new era of enforced self-sufficiency and greater uncertainty. While Islamist-inspired attacks in the US have been rare since 9/11, in Europe many hundreds have died. Yet collective European counter-terrorism efforts often lack a military cutting edge. An exception was France’s ill-supported Operation Barkhane in Mali – until it was halted this year after suffering many casualties for little gain.

The chaos in Afghanistan has vividly dramatised the ongoing threat from international terrorism. With up to 10,000 foreign Islamist fighters in the country, according to the UN, fears grow it will again become a launchpad for global jihad. So the prospect of a less directly engaged, homeland-focused American counter-terrorism approach is alarming for partners dependent on US leadership and protection.

European Nato allies, sniping at Biden, are in denial. They don’t want to admit his Afghan withdrawal is just the start of something bigger. And as recent events painfully demonstrate, the UK is not remotely able to fend for itself."""

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

n_gram_range = (1, 1)
stop_words = "english"

# Extract candidate words/phrases
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
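# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
candidates = list(count.get_feature_names_out())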

In [3]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

top_n = 10
distances = cosine_similarity(doc_embedding, candidate_embeddings)
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]

In [5]:
keywords

['battlegroups',
 'kabul',
 'wars',
 'vengeance',
 'afghanistan',
 'war',
 'taliban',
 'terrorism',
 'terrorists',
 'terrorist']
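
Note that `argsort` sorts in ascending order, so the list above runs from the least to the most similar of the top 10. The following is a minimal sketch of returning keywords most-similar-first together with their scores. The helper name `top_keywords` and the similarity values are made up for illustration; they are not produced by the model above.

```python
import numpy as np

# Toy similarity scores (made-up values, not produced by the model above)
toy_candidates = ["kabul", "war", "taliban", "terrorism", "airport"]
toy_similarities = np.array([[0.31, 0.42, 0.47, 0.55, 0.12]])

def top_keywords(similarities, words, top_n):
    """Return (word, score) pairs, most similar first (hypothetical helper)."""
    # argsort is ascending, so take the last top_n indices and reverse them
    order = similarities.argsort()[0][-top_n:][::-1]
    return [(words[i], float(similarities[0, i])) for i in order]

ranked = top_keywords(toy_similarities, toy_candidates, top_n=3)
print(ranked)
```

The same reversal (`[::-1]`) could be applied to the index slice in the cell above if a descending list is preferred.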

## Max Sum Similarity

Max sum similarity looks for the set of items whose pairwise distances sum to a maximum. In our case, we want to maximize the candidate similarity to the document whilst minimizing the similarity between candidates.
To do this, we select the top 20 keywords/keyphrases by similarity to the document and, from those 20, select the 10 that are the least similar to each other:

In [6]:
import numpy as np
import itertools

def max_sum_sim(doc_embedding, word_embeddings, words, top_n, nr_candidates):
    # Similarity of each word to the document, and of each word to every other word
    distances = cosine_similarity(doc_embedding, word_embeddings)
    distances_candidates = cosine_similarity(word_embeddings,
                                            word_embeddings)

    # Keep the nr_candidates words most similar to the document
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[index] for index in words_idx]
    distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]

    # Find the combination of top_n words that are the least similar to each other
    min_sim = np.inf
    candidate = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum(distances_candidates[i][j] for i in combination for j in combination if i != j)
        if sim < min_sim:
            candidate = combination
            min_sim = sim

    return [words_vals[idx] for idx in candidate]

In [7]:
max_sum_sim(doc_embedding, candidate_embeddings, candidates, top_n=10, nr_candidates=20)

['nemesis',
 'iraqis',
 'afghan',
 'casualties',
 'iraq',
 'battlegroups',
 'kabul',
 'vengeance',
 'afghanistan',
 'terrorist']
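
The search above is brute force: choosing 10 out of 20 candidates means checking 184,756 combinations, and the count grows quickly as `nr_candidates` increases. To make the selection rule easier to follow, here is a small sketch on hand-made 2D vectors. The vectors and the helper names `cosine_sim_matrix` and `least_similar_subset` are assumptions for illustration only, not real embeddings.

```python
import itertools
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def least_similar_subset(sim, k):
    """Brute-force search for the k indices with the lowest summed pairwise similarity."""
    best, best_sum = None, np.inf
    for combo in itertools.combinations(range(sim.shape[0]), k):
        total = sum(sim[i, j] for i, j in itertools.combinations(combo, 2))
        if total < best_sum:
            best, best_sum = combo, total
    return best

# Hand-made vectors: "war" and "wars" point almost the same way
toy_words = ["war", "wars", "kabul", "biden"]
toy_vectors = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]])

sim = cosine_sim_matrix(toy_vectors, toy_vectors)
picked = [toy_words[i] for i in least_similar_subset(sim, 2)]
print(picked)
```

Because "war" and "wars" are near-duplicates, the search never returns them together; it prefers a pair pointing in different directions.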

## Maximal Marginal Relevance

The final method for diversifying our results is Maximal Marginal Relevance (MMR). MMR tries to minimize redundancy and maximize the diversity of results in text summarization tasks. Fortunately, a keyword extraction algorithm called EmbedRank has implemented a version of MMR that allows us to use it for diversifying our keywords/keyphrases.
We start by selecting the keyword/keyphrase that is the most similar to the document. Then, we iteratively select new candidates that are both similar to the document and not similar to the already selected keywords/keyphrases:

In [8]:
import numpy as np

def mmr(doc_embedding, word_embeddings, words, top_n, diversity):

    # Extract similarity within words, and between words and the document
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)

    # Initialize candidates and already choose the best keyword/keyphrase
    keywords_idx = [np.argmax(word_doc_similarity)]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    for _ in range(top_n - 1):
        # Extract similarities within candidates and
        # between candidates and selected keywords/phrases
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)

        # Calculate MMR
        # (named mmr_scores so the local variable does not shadow the function name)
        mmr_scores = (1-diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]

        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    return [words[idx] for idx in keywords_idx]

In [9]:
mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.4)

['terrorist', 'taliban', 'president', 'stuttgart', 'war']
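
The `diversity` parameter controls the trade-off: at 0 the selection reduces to plain similarity ranking, while higher values penalise candidates that resemble keywords already chosen. Below is a minimal sketch of that effect on hand-made 2D vectors. The function `mmr_select`, the helper `cosine_sim_matrix`, and the toy vectors are assumptions for illustration, not real embeddings.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr_select(doc_vec, word_vecs, words, top_n, diversity):
    """Greedy MMR selection on toy vectors (illustrative sketch)."""
    doc_sim = cosine_sim_matrix(word_vecs, doc_vec)[:, 0]
    word_sim = cosine_sim_matrix(word_vecs, word_vecs)

    # Start with the word most similar to the document
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(words)) if i != selected[0]]

    for _ in range(top_n - 1):
        # Highest similarity of each remaining word to any selected word
        redundancy = word_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)

    return [words[i] for i in selected]

# "terrorist" and "terrorists" are near-duplicates; "stuttgart" is far from both
toy_doc = np.array([[1.0, 0.0]])
toy_words = ["terrorist", "terrorists", "stuttgart"]
toy_vecs = np.array([[1.0, 0.1], [1.0, 0.2], [0.3, 1.0]])

no_diversity = mmr_select(toy_doc, toy_vecs, toy_words, top_n=2, diversity=0.0)
high_diversity = mmr_select(toy_doc, toy_vecs, toy_words, top_n=2, diversity=0.7)
print(no_diversity, high_diversity)
```

With `diversity=0.0` the near-duplicate is picked second; with `diversity=0.7` the redundancy penalty pushes the less similar but more distinct word ahead, which is the same mechanism that lets 'stuttgart' appear in the result above.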

## Using YAKE

In [10]:
import yake

In [12]:
language = "en"
max_ngram_size = 2
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 20

custom_kw_extractor = yake.KeywordExtractor(
    lan=language,
    n=max_ngram_size,
    dedupLim=deduplication_threshold,
    dedupFunc=deduplication_algo,
    windowsSize=windowSize,
    top=numOfKeywords,
    features=None,
)
keywords = custom_kw_extractor.extract_keywords(doc)

# Each item is a (keyword, score) tuple; in YAKE a lower score means a more relevant keyword
for kw in keywords:
    print(kw)

('dreadfully apt', 0.022785133990938462)
('mass casualty', 0.02352471727593688)
('Islamic State', 0.0327129687997089)
('Biden', 0.0389498680684717)
('Afghanistan', 0.045659116945111045)
('casualty terrorist', 0.05995041515882425)
('years ago', 0.07344405230642585)
('spawned Islamic', 0.07395018514095082)
('Taliban', 0.07651791790620827)
('Kabul', 0.07809060377021813)
('Carter', 0.09388114756437542)
('forces', 0.09907256564367235)
('Afghan', 0.1032908658437702)
('Iraq', 0.10571918067182962)
('counter-terrorism', 0.10602240199663233)
('operation', 0.1071839223733533)
('rescue Afghanistan', 0.11903176152297258)
('Jimmy Carter', 0.12708494174394855)
('American counter-terrorism', 0.13268264221041906)
('Biden faces', 0.13444137671172468)
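
Since YAKE assigns lower scores to more relevant keywords, the list can be post-filtered by a score cut-off or by phrase length. The following is a small sketch using a few pairs copied from the output above. The helper `best_keywords` and the 0.05 threshold are arbitrary choices for illustration, not part of the YAKE API.

```python
# A few (keyword, score) pairs copied from the output above
yake_keywords = [
    ("dreadfully apt", 0.022785133990938462),
    ("mass casualty", 0.02352471727593688),
    ("Islamic State", 0.0327129687997089),
    ("Biden", 0.0389498680684717),
    ("Afghanistan", 0.045659116945111045),
    ("casualty terrorist", 0.05995041515882425),
]

def best_keywords(pairs, max_score, ngram_size=None):
    """Keep pairs scoring at or below max_score, optionally of one n-gram size (hypothetical helper)."""
    kept = [
        (kw, score)
        for kw, score in pairs
        if score <= max_score and (ngram_size is None or len(kw.split()) == ngram_size)
    ]
    # Lower score first, i.e. most relevant first
    return sorted(kept, key=lambda pair: pair[1])

unigrams = best_keywords(yake_keywords, max_score=0.05, ngram_size=1)
bigrams = best_keywords(yake_keywords, max_score=0.05, ngram_size=2)
print(unigrams)
print(bigrams)
```

Filtering this way makes it easier to separate named entities such as 'Biden' from incidental bigrams like 'dreadfully apt', which YAKE ranks highly here on statistical grounds alone.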
