In [1]:
from polyglot.text import Text
import nltk
import string
from nltk.corpus import stopwords
from collections import Counter

# Term Frequency

The poor performance of the previous algorithms led us to try another method. Although RAKE and TextRank are more sophisticated, they do not work well in our case. Further study is needed to understand why. One possible reason is that our documents are 'dirty': because the text was extracted from a PDF, it does not form tidy sentences and paragraphs.

Because of that, we try a simpler method: take the most frequent phrases as keywords. Here is the implementation:

* tokenize the document into words
* get the POS tag for each word
* get the candidate keywords: every noun phrase (consecutive NOUN/PROPN)
* remove candidate keywords that contain stopwords
* make sure that each noun is indeed a noun (hard in general, but we can at least check that it is not punctuation)
* return the candidate keywords

In [2]:
s = """
Pemberi kerja adalah orang perseorangan, pengusaha, badan hukum, atau badan-badan lainnya yang mempekerjakan tenaga kerja dengan membayar upah atau imbalan dalam bentuk lain.
Pengusaha adalah orang perseorangan, persekutuan, atau badan hukum yang menjalankan suatu perusahaan milik sendiri.
"""

We tokenize the document into words and extract the POS tag of each word using the `polyglot` library. The POS tagger works surprisingly well.

In [3]:
tagged_words = Text(s.lower()).pos_tags
tagged_words

[('pemberi', 'NOUN'),
 ('kerja', 'NOUN'),
 ('adalah', 'VERB'),
 ('orang', 'NOUN'),
 ('perseorangan', 'NOUN'),
 (',', 'PUNCT'),
 ('pengusaha', 'NOUN'),
 (',', 'PUNCT'),
 ('badan', 'NOUN'),
 ('hukum', 'NOUN'),
 (',', 'PUNCT'),
 ('atau', 'CONJ'),
 ('badan', 'NOUN'),
 ('-', 'PUNCT'),
 ('badan', 'NOUN'),
 ('lainnya', 'ADJ'),
 ('yang', 'PRON'),
 ('mempekerjakan', 'VERB'),
 ('tenaga', 'NOUN'),
 ('kerja', 'NOUN'),
 ('dengan', 'ADP'),
 ('membayar', 'VERB'),
 ('upah', 'NOUN'),
 ('atau', 'CONJ'),
 ('imbalan', 'NOUN'),
 ('dalam', 'ADP'),
 ('bentuk', 'NOUN'),
 ('lain', 'ADJ'),
 ('.', 'PUNCT'),
 ('pengusaha', 'NOUN'),
 ('adalah', 'VERB'),
 ('orang', 'NOUN'),
 ('perseorangan', 'NOUN'),
 (',', 'PUNCT'),
 ('persekutuan', 'NOUN'),
 (',', 'PUNCT'),
 ('atau', 'CONJ'),
 ('badan', 'NOUN'),
 ('hukum', 'NOUN'),
 ('yang', 'PRON'),
 ('menjalankan', 'VERB'),
 ('suatu', 'DET'),
 ('perusahaan', 'NOUN'),
 ('milik', 'NOUN'),
 ('sendiri', 'ADJ'),
 ('.', 'PUNCT')]

Next, we extract the keyphrases we want. We could create n-grams and treat every n-gram as a keyphrase, but that does not scale well because the number of n-grams is large. The solution is to restrict extraction to a certain class of keyphrase. In our case, that class is the noun phrase: a phrase that consists of consecutive nouns or proper nouns (NOUN/PROPN), optionally followed by adjectives (ADJ). This is why we need the POS tagger: with the tags, we can easily construct noun phrases using this rule. In the example above, *tenaga kerja* and *badan hukum* can be considered noun phrases.
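To get a feel for why treating every n-gram as a candidate gets out of hand, here is a small sketch that enumerates all n-grams up to length 3 for a short token list. The helper name `all_ngrams` and the choice of `max_n=3` are ours, purely for illustration.

```python
def all_ngrams(tokens, max_n):
    """Return every contiguous n-gram of length 1..max_n as a tuple."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

# eight tokens taken from the second sentence of the example
tokens = "pengusaha adalah orang perseorangan persekutuan atau badan hukum".split()
candidates = all_ngrams(tokens, max_n=3)

# 8 unigrams + 7 bigrams + 6 trigrams
print(len(candidates))
```

Even eight tokens yield 21 candidates, and most of them (such as *adalah orang*) are not meaningful phrases. Restricting candidates to noun phrases keeps the list much shorter.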

Extracting noun phrases is the task of a parser. We can either build our own parser or use the one provided by the `nltk` library. We choose the latter: we use `RegexpParser` and write our own grammar for it as a regex rule. The grammar says that we are interested in noun phrases (NP), each constructed from one or more NOUN or PROPN tokens, followed by zero or more ADJ tokens.
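To make the rule concrete, here is a plain-Python sketch of what the pattern `<NOUN|PROPN>+ <ADJ>*` matches: a run of one or more NOUN/PROPN tokens, extended by any adjectives that follow. This is only an illustration of the rule, not how `nltk` implements it, and the function name `chunk_noun_phrases` is hypothetical.

```python
def chunk_noun_phrases(tagged):
    """Greedily group (word, tag) pairs into NOUN/PROPN+ ADJ* phrases."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        # consume one or more NOUN/PROPN tokens
        while j < len(tagged) and tagged[j][1] in ('NOUN', 'PROPN'):
            j += 1
        if j == i:
            # no noun starts here, move on to the next token
            i += 1
            continue
        # consume zero or more trailing ADJ tokens
        while j < len(tagged) and tagged[j][1] == 'ADJ':
            j += 1
        phrases.append(' '.join(word for word, _ in tagged[i:j]))
        i = j
    return phrases

# hand-picked tokens and tags copied from the tagger output above
sample = [
    ('bentuk', 'NOUN'), ('lain', 'ADJ'), ('.', 'PUNCT'),
    ('pengusaha', 'NOUN'), ('adalah', 'VERB'),
    ('badan', 'NOUN'), ('hukum', 'NOUN'),
]
print(chunk_noun_phrases(sample))
```

Note that a single noun such as *pengusaha* still counts as a phrase under this rule, because the adjective part is optional.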

In [4]:
grammar = "NP: {<NOUN|PROPN>+ <ADJ>*}"
parser = nltk.RegexpParser(grammar)
parse_tree = parser.parse(tagged_words)
parse_tree.pprint()

(S
  (NP pemberi/NOUN kerja/NOUN)
  adalah/VERB
  (NP orang/NOUN perseorangan/NOUN)
  ,/PUNCT
  (NP pengusaha/NOUN)
  ,/PUNCT
  (NP badan/NOUN hukum/NOUN)
  ,/PUNCT
  atau/CONJ
  (NP badan/NOUN)
  -/PUNCT
  (NP badan/NOUN lainnya/ADJ)
  yang/PRON
  mempekerjakan/VERB
  (NP tenaga/NOUN kerja/NOUN)
  dengan/ADP
  membayar/VERB
  (NP upah/NOUN)
  atau/CONJ
  (NP imbalan/NOUN)
  dalam/ADP
  (NP bentuk/NOUN lain/ADJ)
  ./PUNCT
  (NP pengusaha/NOUN)
  adalah/VERB
  (NP orang/NOUN perseorangan/NOUN)
  ,/PUNCT
  (NP persekutuan/NOUN)
  ,/PUNCT
  atau/CONJ
  (NP badan/NOUN hukum/NOUN)
  yang/PRON
  menjalankan/VERB
  suatu/DET
  (NP perusahaan/NOUN milik/NOUN sendiri/ADJ)
  ./PUNCT)


In the parse tree above, the noun phrases are the subtrees labelled NP. Now we walk through the tree and collect the noun phrases that we want. We add two more rules: we only keep noun phrases longer than one word, and we discard phrases that contain stopwords.

In [5]:
# collect NP subtrees as keywords by walking the parse tree;
# keep phrases longer than one word that contain no stopwords
stopword_set = set(stopwords.words('indonesian'))
keywords = []
for subtree in parse_tree.subtrees():
    if subtree.label() == 'NP' and len(subtree.leaves()) > 1:
        words = [item[0] for item in subtree.leaves()]
        # skip phrases that contain any stopword
        if not stopword_set.intersection(words):
            keywords.append(' '.join(words))
            
keywords

['pemberi kerja',
 'orang perseorangan',
 'badan hukum',
 'tenaga kerja',
 'orang perseorangan',
 'badan hukum']

Now we have our noun phrases! Let's wrap the steps in a function.

In [6]:
def nounphrase_extractor(text):
    # tokenize to word and get the POS tag
    tagged_words = Text(text.lower()).pos_tags
    
    # no POS tagger is perfect, and this one sometimes
    # returns a wrong tag; one simple correction is to
    # force the tag of every punctuation token to PUNCT
    tagged_words = [
        (word, 'PUNCT') if word in string.punctuation else (word, tag) 
        for word, tag in tagged_words
    ]
        
    # now parse the words
    # get every word that has consecutive NOUN or PROPN
    # and optionally followed by one or more ADJ
    # this is called NOUN PHRASE (NP)
    # we use NLTK regex parser
    grammar = "NP: {<NOUN|PROPN>+ <ADJ>*}"
    parser = nltk.RegexpParser(grammar)
    parse_tree = parser.parse(tagged_words)
    
    # collect NP subtrees as keywords by walking the parse tree;
    # keep phrases longer than one word that contain no stopwords
    stopword_set = set(stopwords.words('indonesian'))
    keywords = []
    for subtree in parse_tree.subtrees():
        if subtree.label() == 'NP' and len(subtree.leaves()) > 1:
            words = [item[0] for item in subtree.leaves()]
            # skip phrases that contain any stopword
            if not stopword_set.intersection(words):
                keywords.append(' '.join(words))
    
    return keywords

In [7]:
nounphrase_extractor(s)

['pemberi kerja',
 'orang perseorangan',
 'badan hukum',
 'tenaga kerja',
 'orang perseorangan',
 'badan hukum']

We consider a phrase to be a keyphrase if it occurs frequently in the document. This measure is called term frequency, and it is the first component of the popular TF-IDF algorithm. Now let's parse a full document to see whether the term frequency method can extract meaningful keyphrases.

In [8]:
with open('data/13_2003_raw.txt') as f:
    doc = f.read()
    
phrases = nounphrase_extractor(doc)
phrases

['www.hukumonline.com undang',
 'undang republik indonesia nomor',
 'rahmat tuhan',
 'esa presiden republik indonesia',
 'pembangunan nasional',
 'rangka pembangunan manusia indonesia seutuhnya',
 'pelaksanaan pembangunan nasional',
 'tenaga kerja',
 'tujuan pembangunan',
 'kedudukan tenaga kerja',
 'pembangunan ketenagakerjaan',
 'kualitas tenaga kerja',
 'peran sertanya',
 'peningkatan perlindungan tenaga kerja',
 'keluarganya sesuai',
 'martabat kemanusiaan',
 'tenaga kerja',
 'hak dasar',
 'kesamaan kesempatan',
 'dasar apapun',
 'kesejahteraan pekerja',
 'perkembangan kemajuan dunia usaha',
 'bidang ketenagakerjaan',
 'tuntutan pembangunan ketenagakerjaan',
 'huruf a',
 'dewan perwakilan rakyat republik indonesia',
 'presiden republik indonesia',
 'tenaga kerja',
 'tenaga kerja',
 'pemberi kerja',
 'orang perseorangan',
 'badan hukum',
 'tenaga kerja',
 'orang perseorangan',
 'badan hukum',
 'orang perseorangan',
 'badan hukum',
 'orang perseorangan',
 'badan hukum',
 'huruf a',
 

If we count the occurrences of each phrase, we get its term frequency (TF).

In [9]:
c = Counter(phrases)
c.most_common(20)

[('perjanjian kerja', 81),
 ('serikat buruh', 74),
 ('serikat pekerja', 60),
 ('peraturan perusahaan', 49),
 ('peraturan perundang', 36),
 ('pemutusan hubungan kerja', 35),
 ('bidang ketenagakerjaan', 32),
 ('tenaga kerja', 31),
 ('keputusan menteri', 30),
 ('ketentuan pasal', 24),
 ('undang nomor', 24),
 ('kali ketentuan pasal', 20),
 ('pelatihan kerja', 19),
 ('mogok kerja', 19),
 ('huruf a', 18),
 ('lock out', 17),
 ('hubungan kerja', 16),
 ('uang penggantian hak sesuai', 16),
 ('pemberi kerja', 15),
 ('tenaga kerja asing', 15)]

If we look at the 20 most frequent phrases, some of them are relevant, like *perjanjian kerja*, *serikat buruh*, *serikat pekerja*, and *peraturan perusahaan*. Some are not useful, though, like *huruf a* or *keputusan menteri*. Why? Because those phrases appear often in other regulations as well, so they are not unique to our document.

If you know how TF-IDF works, you will recognize that such phrases can be filtered out by the IDF part of the algorithm. Using TF-IDF is the natural improvement on this method. To do that, however, you need a large corpus of regulations so that you can count how many documents in the corpus contain each phrase. Nevertheless, term frequency does a decent job of extracting keyphrases from a single document in an unsupervised manner.
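As a rough illustration of how the IDF part would help, here is a minimal sketch that scores phrases by raw term frequency times `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the phrase. This is one common TF-IDF variant, not necessarily the one you would use in practice. The three-document corpus is made-up toy data and the function name `tf_idf` is hypothetical; a real corpus would come from running `nounphrase_extractor` on many regulations.

```python
import math
from collections import Counter

def tf_idf(doc_phrases, corpus):
    """Score each phrase in doc_phrases by raw TF times log(N / df).

    corpus is a list of phrase lists and must include doc_phrases,
    so every scored phrase has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    df = Counter()
    for phrases in corpus:
        # count each phrase once per document
        df.update(set(phrases))
    tf = Counter(doc_phrases)
    return {p: tf[p] * math.log(n_docs / df[p]) for p in tf}

# toy data: 'huruf a' appears in every document
doc_a = ['perjanjian kerja', 'perjanjian kerja', 'huruf a', 'huruf a', 'huruf a']
doc_b = ['huruf a', 'keputusan menteri']
doc_c = ['huruf a', 'pajak penghasilan']
corpus = [doc_a, doc_b, doc_c]

scores = tf_idf(doc_a, corpus)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(phrase, round(score, 3))
```

In this toy corpus *huruf a* is the most frequent phrase in `doc_a`, yet its score drops to zero because it occurs in every document, while *perjanjian kerja* keeps a positive score.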