# **In-Class Assignment: Keyphrase Extraction**

## *IS 5150*
## Name: KEY

In this first in-class assignment for Topic 6, we will learn the most basic form of document summarization: key phrase extraction. We will go through an example of pulling key phrases via collocations of various-sized n-grams, and then use some annotation methods to perform key phrase extraction via weighted tag-based phrase extraction.

Let's start with finding collocations in *Alice in Wonderland*, which we can access through the `gutenberg` corpus. We will use several functions from `nltk` to extract common bi-grams (two-word combinations) and tri-grams (three-word combinations).

First, we will import our dependencies. The `text_normalizer` function is available from my GitHub repo [here](https://github.com/docsfox/Text_Mining/).

## **1) Setting up the Corpus**

#### Text Normalizer Function

In [None]:
# text normalizer

import nltk
nltk.download('stopwords')
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

from pprint import pprint
import numpy as np
import re
from bs4 import BeautifulSoup

import spacy
nlp = spacy.load('en_core_web_sm')                                                                                            # dependencies

import unicodedata

!pip install contractions
import contractions

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    [s.extract() for s in soup(['iframe', 'script'])]                                                                         # html parsing
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)                                                                   # collapse runs of line breaks ('|' inside [] would match a literal pipe)
    return stripped_text

def tokenize_text(text):                                                                                                      # text tokenization
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences] 
    return word_tokens

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')                            # accent removal
    return text

def expand_contractions(text):                                                                                                # expand contractions
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word))
    expanded_text = ' '.join(expanded_words)                                                                                  # join once, after the loop (also safe for empty input)
    return expanded_text

def remove_special_characters(text, remove_digits=False):                                                                    # special character removal
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'                                                      # A-Z, not A-z (A-z also matches [ \ ] ^ _ `)
    text = re.sub(pattern, '', text)
    return text

def simple_stemmer(text):                                                                                                   # stemmer
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

def lemmatize_text(text):
    text = nlp(text)                                                                                                        # lemmatizer
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):                                                   # stopword removal
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,                                                # define normalize corpus function
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus


#### Other dependencies

In [None]:
# requires nltk.download() of 'punkt', 'gutenberg', and 'averaged_perceptron_tagger'
from nltk.corpus import gutenberg

from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import TrigramAssocMeasures

from operator import itemgetter
import itertools

from gensim import corpora, models
from gensim.summarization import keywords   # note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0

#### Let's bring in our Alice text from the `gutenberg` corpus and then go ahead and apply our normalize_corpus function to clean the text.

In [None]:
alice = gutenberg.sents(fileids='carroll-alice.txt')
alice = [' '.join(ts) for ts in alice]
norm_alice = list(filter(None, normalize_corpus(alice, text_lemmatization=False)))

In [None]:
norm_alice[0:10]

['[ alice adventures wonderland lewis carroll ]',
 'chapter',
 'rabbit hole',
 'alice beginning get tired sitting sister bank nothing twice peeped book sister reading pictures conversations use book thought alice without pictures conversation',
 'considering mind well could hot day made feel sleepy stupid whether pleasure making daisy chain would worth trouble getting picking daisies suddenly white rabbit pink eyes ran close',
 'nothing remarkable alice think much way hear rabbit say oh dear',
 'oh dear',
 'shall late',
 'thought afterwards occurred ought wondered time seemed quite natural rabbit actually took watch waistcoat pocket looked hurried alice started feet flashed across mind never seen rabbit either waistcoat pocket watch take burning curiosity ran across field fortunately time see pop large rabbit hole hedge',
 'another moment went alice never considering world get']

**Why would we set lemmatization to false, if the end goal is to find frequent collocations?**

We may want to retain tense and other inflectional information so that we can best represent the meaning of key phrases; for example, "the cat pajamas" and "the cats pajamas" shouldn't be treated as the same phrase.

## **2) Defining an N-Grams function**

Next we will define a function for computing n-grams, so that we can group collocations based on different sizes of n; n = 1 is a unigram, n = 2 is a bi-gram, etc.

In [None]:
def compute_ngrams(sequence, n):
    return list(
            zip(*(sequence[index:] 
                     for index in range(n)))
    )

In [None]:
# let's see what it does

print("bigrams:", compute_ngrams(["the", "broken", "door", "hinge"], 2))            # bigrams
print("trigrams:", compute_ngrams(["the", "broken", "door", "hinge"], 3))           # trigrams

bigrams: [('the', 'broken'), ('broken', 'door'), ('door', 'hinge')]
trigrams: [('the', 'broken', 'door'), ('broken', 'door', 'hinge')]


## **3) Generate Top N-grams**

Before we can generate the most frequent n-grams, we need to "flatten" our corpus into one continuous string, so that we can find the most frequent n-grams over the whole text.

To get the most frequent n-grams, we will combine our `flatten_corpus` function, our `compute_ngrams` function, and the `FreqDist` class from `nltk` to count the frequency of each n-gram and then return the most frequent.

In [None]:
def flatten_corpus(corpus):
    return ' '.join([document.strip() 
                     for document in corpus])
    

def get_top_ngrams(corpus, ngram_val=1, limit=5):                                                     # default parameters are unigram, top 5 most frequent

    corpus = flatten_corpus(corpus)                                                                   # convert corpus to one long string
    tokens = nltk.word_tokenize(corpus)                                                               # tokenize the flattened corpus to word tokens

    ngrams = compute_ngrams(tokens, ngram_val)                                                        # compute n_grams on tokens, based on ngram_val value
    ngrams_freq_dist = nltk.FreqDist(ngrams)                                                          # find the frequency distribution of the ngrams
    sorted_ngrams_fd = sorted(ngrams_freq_dist.items(),                                               # sort the n-grams from most to least frequent
                              key=itemgetter(1), reverse=True)
    sorted_ngrams = sorted_ngrams_fd[0:limit]                                                         # select first in list, up to the limit
    sorted_ngrams = [(' '.join(text), freq)                                                           # join together the n-grams into a list
                     for text, freq in sorted_ngrams]

    return sorted_ngrams
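
One side effect of flattening is that n-grams can span sentence boundaries (the last word of one sentence gets paired with the first word of the next). As a rough sketch of an alternative, assuming we want to count n-grams within each sentence only, we could use `collections.Counter` from the standard library. The function name `top_ngrams_per_sentence` and the three toy sentences are made up for illustration; this is not part of the notebook's pipeline.

```python
from collections import Counter

def top_ngrams_per_sentence(sentences, n=2, limit=5):
    """Count n-grams inside each sentence, so none cross a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()                              # whitespace tokens (text is already normalized)
        counts.update(zip(*(tokens[i:] for i in range(n))))    # same zip trick as compute_ngrams
    return [(' '.join(gram), freq) for gram, freq in counts.most_common(limit)]

# hypothetical toy corpus: three already-normalized "sentences"
toy_sentences = ['oh dear', 'oh dear', 'shall late']
result = top_ngrams_per_sentence(toy_sentences, n=2)
print(result)
```

With the flattened approach, the same toy corpus would also produce the cross-boundary bigrams "dear oh" and "dear shall", which never occur inside a single sentence.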

**Let's find the top 10 bigrams. What parameters do I need to change?**

In [None]:
get_top_ngrams(corpus=norm_alice, ngram_val=2, limit=10)

[('said alice', 123),
 ('mock turtle', 56),
 ('march hare', 31),
 ('said king', 29),
 ('thought alice', 26),
 ('white rabbit', 22),
 ('said hatter', 22),
 ('said mock', 20),
 ('said caterpillar', 18),
 ('said gryphon', 18)]

**Next, let's find the top 10 trigrams. What information do we ascertain about the text from these n-grams?**

In [None]:
get_top_ngrams(corpus=norm_alice, ngram_val=3, limit=10)

[('said mock turtle', 20),
 ('said march hare', 10),
 ('poor little thing', 6),
 ('little golden key', 5),
 ('certainly said alice', 5),
 ('white kid gloves', 5),
 ('march hare said', 5),
 ('mock turtle said', 5),
 ('know said alice', 4),
 ('might well say', 4)]

The most frequent n-grams are mostly speech attributions (e.g., "said alice", "said mock turtle"), which suggests that there is a lot of dialogue in the book. They also give us an idea of the main characters. They don't tell us much about the plot, though.

**We can also utilize the `nltk` collocation finder functions if we want to go beyond finding just the raw frequencies, like so:**

You can find the other metrics of collocation in the documentation [here](https://tedboy.github.io/nlps/generated/generated/nltk.BigramAssocMeasures.html).

In [None]:
finder = BigramCollocationFinder.from_documents([item.split() 
                                                for item 
                                                in norm_alice])
finder

print("Raw Frequencies:", finder.nbest(BigramAssocMeasures.raw_freq, 10))
print("PMI:", finder.nbest(BigramAssocMeasures.pmi, 10))
print("MI_like:", finder.nbest(BigramAssocMeasures.mi_like, 10))

Raw Frequencies: [('said', 'alice'), ('mock', 'turtle'), ('march', 'hare'), ('said', 'king'), ('thought', 'alice'), ('said', 'hatter'), ('white', 'rabbit'), ('said', 'mock'), ('said', 'caterpillar'), ('said', 'gryphon')]
PMI: [('abide', 'figures'), ('acceptance', 'elegant'), ('accounting', 'tastes'), ('accustomed', 'usurpation'), ('act', 'crawling'), ('adjourn', 'immediate'), ('adoption', 'energetic'), ('affair', 'trusts'), ('agony', 'terror'), ('alarmed', 'proposal')]
MI_like: [('mock', 'turtle'), ('march', 'hare'), ('said', 'alice'), ('soo', 'oop'), ('white', 'rabbit'), ('join', 'dance'), ('beg', 'pardon'), ('beau', 'ootiful'), ('mary', 'ann'), ('yer', 'honour')]
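
The PMI list looks very different from the raw-frequency list. A minimal sketch of why, assuming the usual definitions PMI = log2(N * c(w1, w2) / (c(w1) * c(w2))) and an MI-like score of c(w1, w2)^3 / (c(w1) * c(w2)), where c is a count and N is the total number of tokens. This is a simplified reading of the two measures written from scratch, not `nltk`'s implementation, and the token list is a made-up toy example.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (pmi, mi_like)} for every adjacent word pair."""
    n_words = len(tokens)
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), n_pair in bigram_counts.items():
        marginals = word_counts[w1] * word_counts[w2]      # product of the two word counts
        pmi = math.log2(n_pair * n_words / marginals)      # rewards pairs whose words rarely occur apart
        mi_like = n_pair ** 3 / marginals                  # cubing the pair count favors frequent pairs
        scores[(w1, w2)] = (pmi, mi_like)
    return scores

# hypothetical toy text: one frequent pair and one pair that occurs exactly once
tokens = ['said', 'alice', 'said', 'alice', 'said', 'king', 'rare', 'pair']
scores = bigram_scores(tokens)
for bigram in [('said', 'alice'), ('rare', 'pair')]:
    pmi, mi_like = scores[bigram]
    print(bigram, round(pmi, 3), round(mi_like, 3))
```

In this toy text, the pair that occurs only once gets the higher PMI because both of its words never appear anywhere else, while the MI-like score ranks the frequent pair higher. That pattern would be consistent with the outputs above, where PMI surfaces one-off pairs and MI_like surfaces character names.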


In [None]:
finder = TrigramCollocationFinder.from_documents([item.split() 
                                                for item 
                                                in norm_alice])

print("Raw Frequencies:", finder.nbest(TrigramAssocMeasures.raw_freq, 10))
print("PMI:", finder.nbest(TrigramAssocMeasures.pmi, 10))
print("MI_like:", finder.nbest(TrigramAssocMeasures.mi_like, 10))

Raw Frequencies: [('said', 'mock', 'turtle'), ('said', 'march', 'hare'), ('poor', 'little', 'thing'), ('little', 'golden', 'key'), ('march', 'hare', 'said'), ('mock', 'turtle', 'said'), ('white', 'kid', 'gloves'), ('beau', 'ootiful', 'soo'), ('certainly', 'said', 'alice'), ('might', 'well', 'say')]
PMI: [('accustomed', 'usurpation', 'conquest'), ('adjourn', 'immediate', 'adoption'), ('adoption', 'energetic', 'remedies'), ('ancient', 'modern', 'seaography'), ('apple', 'roast', 'turkey'), ('arithmetic', 'ambition', 'distraction'), ('brother', 'latin', 'grammar'), ('canvas', 'bag', 'tied'), ('cherry', 'tart', 'custard'), ('circle', 'exact', 'shape')]
MI_like: [('accustomed', 'usurpation', 'conquest'), ('adjourn', 'immediate', 'adoption'), ('adoption', 'energetic', 'remedies'), ('ancient', 'modern', 'seaography'), ('apple', 'roast', 'turkey'), ('arithmetic', 'ambition', 'distraction'), ('brother', 'latin', 'grammar'), ('canvas', 'bag', 'tied'), ('cherry', 'tart', 'custard'), ('circle', 'ex

**What, if any, additional information do we glean about the topic/plot of the text that would be helpful for text summarization?**

## **Weighted Tag-Based Phrase Extraction**

We will use a different text source for this example: some Wikipedia text about elephants (why not?).



## **1) Setting up the corpus**

In [None]:
with open('/content/elephants.txt', 'r') as f:    # read-only is enough; the context manager closes the file
    data = f.readlines()
sentences = nltk.sent_tokenize(data[0])
len(sentences)

29

In [None]:
sentences[:3]

['Elephants are large mammals of the family Elephantidae and the order Proboscidea.',
 'Three species are currently recognised: the African bush elephant (Loxodonta africana), the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus).',
 'Elephants are scattered throughout sub-Saharan Africa, South Asia, and Southeast Asia.']

**Thinking about the fact that we want to extract noun phrases from our corpus AND retain meaning of the key phrases we extract, what are some of the normalization steps we might want to skip and why?**

Capitalization helps us identify proper nouns, so we skip lowercasing. Stopwords may be important for parsing (e.g., identifying a noun phrase by its determiner + noun pattern), so we keep them. Lemmatization can strip away meaning that we need to interpret the key phrases, so we skip it as well.

In [None]:
norm_sentences = normalize_corpus(sentences, text_lower_case=False, 
                                  text_lemmatization=False, stopword_removal=False)
norm_sentences[:3]

['Elephants are large mammals of the family Elephantidae and the order Proboscidea ',
 'Three species are currently recognised the African bush elephant Loxodonta africana the African forest elephant L cyclotis and the Asian elephant Elephas maximus ',
 'Elephants are scattered throughout subSaharan Africa South Asia and Southeast Asia ']

## **2) Extract all noun phrase chunks using shallow parsing**

In the first step of weighted tag-based extraction, we want to select all of the noun phrases. Remember that shallow parsing relies on basic part-of-speech tags, which we can produce using the built-in `nltk.pos_tag`.

We will use some regex to identify noun phrases based on the combination of the typical POS that constitute a noun phrase: determiners, adjectives, and nouns.

> **What are some noun phrases we could miss out on based on our basic regex?**

More elaborate noun phrases that have additional modifiers or other tags preceding the noun; also missing post modification of nouns ("the girl with red hair").
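
To make the pattern concrete, here is a rough pure-Python sketch of what a grammar like `{<DT>? <JJ>* <NN.*>+}` matches: an optional determiner, any number of adjectives, then one or more nouns. It is an illustration written from scratch, not how `nltk`'s `RegexpParser` works internally, and the POS tags in the example are hand-assigned assumptions rather than real tagger output.

```python
def chunk_noun_phrases(tagged):
    """Greedy left-to-right match of DT? JJ* NN.*+ over (word, tag) pairs."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == 'DT':                          # optional determiner
            j += 1
        while j < n and tagged[j][1] == 'JJ':             # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith('NN'):    # one or more nouns (NN, NNS, NNP, NNPS)
            k += 1
        if k > j:                                         # at least one noun matched: keep the chunk
            phrases.append(' '.join(word for word, _ in tagged[i:k]))
            i = k
        else:                                             # no noun here: move on one token
            i += 1
    return phrases

# hand-tagged example (tags are assumed, not produced by nltk.pos_tag)
tagged = [('the', 'DT'), ('girl', 'NN'), ('with', 'IN'), ('red', 'JJ'), ('hair', 'NN')]
print(chunk_noun_phrases(tagged))
```

The preposition "with" breaks the match, so the post-modified phrase comes out as two separate chunks ("the girl" and "red hair") instead of one.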

In [None]:
def get_chunks(sentences, grammar = r'NP: {<DT>? <JJ>* <NN.*>+}', stopword_list=stopword_list):
    all_chunks = []
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    for sentence in sentences:      
        tagged_sents = [nltk.pos_tag(nltk.word_tokenize(sentence))]      
        chunks = [chunker.parse(tagged_sent) 
                      for tagged_sent in tagged_sents]
        wtc_sents = [nltk.chunk.tree2conlltags(chunk)
                         for chunk in chunks]                                                       # creates a triple of words, tags, and chunk tags
        flattened_chunks = list(
                            itertools.chain.from_iterable(
                                wtc_sent for wtc_sent in wtc_sents)
                           )
        valid_chunks_tagged = [(status, [wtc for wtc in chunk]) 
                                   for status, chunk 
                                       in itertools.groupby(flattened_chunks, 
                                                lambda word_pos_chunk: word_pos_chunk[2] != 'O')]   # group consecutive tokens by whether they are inside a chunk (chunk tag != 'O')
        valid_chunks = [' '.join(word.lower() 
                                for word, tag, chunk in wtc_group 
                                    if word.lower() not in stopword_list)                           # generate phrases from each chunk group
                                        for status, wtc_group in valid_chunks_tagged
                                            if status]                                   
        all_chunks.append(valid_chunks)
    return all_chunks

In [None]:
chunks = get_chunks(norm_sentences)
chunks

## **3) Compute TF-IDF Weights for each chunk**

In [None]:
def get_tfidf_weighted_keyphrases(sentences, 
                                  grammar=r'NP: {<DT>? <JJ>* <NN.*>+}',                                         # set regex grammar for shallow parsing
                                  top_n=10):
    
    valid_chunks = get_chunks(sentences, grammar=grammar)                                                       # get valid chunks using get_chunks function
                                     
    dictionary = corpora.Dictionary(valid_chunks)                                                               # create dictionary for valid_chunks
    corpus = [dictionary.doc2bow(chunk) for chunk in valid_chunks]                                              # convert each sentence's chunks to a bag-of-words vector
    
    tfidf = models.TfidfModel(corpus)                                                                           # apply tfidf to chunks in corpus
    corpus_tfidf = tfidf[corpus]                                                                                
    
    weighted_phrases = {dictionary.get(idx): value                                                              # create dictionary of chunks and their tf-idf weights
                           for doc in corpus_tfidf 
                               for idx, value in doc}
                            
    weighted_phrases = sorted(weighted_phrases.items(), 
                              key=itemgetter(1), reverse=True)                                                 # sort the weighted phrases in descending order
    weighted_phrases = [(term, round(wt, 3)) for term, wt in weighted_phrases]                                 # round weights to 3 decimal places
    
    return weighted_phrases[:top_n]                                                                            # return top_n weighted phrases
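
To see where the weights come from, here is a small from-scratch sketch of TF-IDF over chunks. It assumes the weighting that I understand to be `gensim`'s default for `TfidfModel` (raw term frequency times log2(N / document frequency), then each document's vector scaled to unit length); treat that as an assumption and check the `gensim` documentation for your version. The function name `tfidf_weights` and the toy chunks are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {chunk: weight} dict per document, L2-normalized per document."""
    n_docs = len(docs)
    doc_freq = Counter(chunk for doc in docs for chunk in set(doc))   # number of documents containing each chunk
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {chunk: tf * math.log2(n_docs / doc_freq[chunk])
               for chunk, tf in term_freq.items()}
        raw = {chunk: w for chunk, w in raw.items() if w > 0}         # drop chunks that occur in every document
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({chunk: w / norm for chunk, w in raw.items()} if norm else {})
    return weighted

# hypothetical toy corpus: each inner list holds the chunks of one sentence
toy_chunks = [['water'], ['large ears', 'long trunk'], ['water', 'asia']]
weights = tfidf_weights(toy_chunks)
for doc_weights in weights:
    print({chunk: round(w, 3) for chunk, w in doc_weights.items()})
```

Because of the per-sentence normalization, a chunk that is alone in its sentence gets a weight of 1.0, and two equally rare chunks sharing a sentence each get about 0.707 (1 / sqrt(2)). The scores therefore reflect how a sentence's weight is split among its chunks, not only how rare a chunk is.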

## **4) Return top weighted phrases**

In [None]:
get_tfidf_weighted_keyphrases(sentences = norm_sentences, top_n=30)

[('water', 1.0),
 ('asia', 0.807),
 ('wild', 0.764),
 ('great weight', 0.707),
 ('pillarlike legs', 0.707),
 ('southeast asia', 0.693),
 ('subsaharan africa south asia', 0.693),
 ('body temperature', 0.693),
 ('flaps', 0.693),
 ('fissionfusion society', 0.693),
 ('multiple family groups', 0.693),
 ('art folklore religion literature', 0.693),
 ('popular culture', 0.693),
 ('ears', 0.681),
 ('males', 0.653),
 ('males bulls', 0.653),
 ('family elephantidae', 0.607),
 ('large mammals', 0.607),
 ('years', 0.607),
 ('environments', 0.577),
 ('impact', 0.577),
 ('keystone species', 0.577),
 ('cetaceans', 0.577),
 ('elephant intelligence', 0.577),
 ('primates', 0.577),
 ('dead individuals', 0.577),
 ('kind', 0.577),
 ('selfawareness', 0.577),
 ('different habitats', 0.57),
 ('marshes', 0.57)]

**What sorts of insights can we derive about the content of this text based on these key phrases? How would you compare these to the n-gram collocation key phrases?**

## **Gensim Keyword Function**

We will talk more about this function in the automated document summarization notebook, but here is one more (fast) way to extract keywords from the text that is built into `gensim`.

In [None]:
key_words = keywords(data[0], ratio=1.0, scores=True, lemmatize=True)
[(item, round(score, 3)) for item, score in key_words][:30]

[('african bush elephant', 0.261),
 ('including', 0.141),
 ('family', 0.137),
 ('cow', 0.124),
 ('forests', 0.108),
 ('female', 0.103),
 ('asia', 0.102),
 ('tigers', 0.098),
 ('ivory', 0.098),
 ('sight', 0.098),
 ('objects', 0.098),
 ('males', 0.088),
 ('known', 0.087),
 ('religion', 0.087),
 ('folklore', 0.087),
 ('larger ears', 0.085),
 ('water', 0.075),
 ('highly recognisable', 0.075),
 ('breathing lifting', 0.074),
 ('flaps', 0.073),
 ('africa', 0.072),
 ('gomphotheres', 0.072),
 ('animals tend', 0.071),
 ('success', 0.071),
 ('south', 0.07),
 ('habitat destruction', 0.068),
 ('elephantidae', 0.068),
 ('increased testosterone', 0.067),
 ('iucn', 0.067),
 ('biggest threats', 0.067)]

**This method pulled some of the same and also some different keywords/phrases, and also applied a different scoring/weighting mechanism. What are some of the differences in the keywords/phrases that were identified between the two?**