# Keyword Extraction Using RAKE

The RAKE (Rapid Automatic Keyword Extraction) algorithm splits text at delimiters (stop words and punctuation) to identify candidate keywords, then scores each candidate from the co-occurrences of its words within the candidates. Keywords can contain multiple tokens. The original algorithm also merges adjoining keywords into a single candidate when they appear next to each other at least twice, in the same order and separated by the same interior stop words.
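
The scoring idea can be illustrated with a minimal sketch. It is not the rake-nltk implementation: the tiny stop word set and the helper names `candidate_phrases` and `rake_scores` are assumptions made for illustration, punctuation handling is omitted, and the score used is the degree-to-frequency ratio summed over the words of each phrase.

```python
from collections import defaultdict

# Tiny illustrative stop word set (an assumption; real runs use a full list).
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "for"}

def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stop words."""
    phrases, current = [], []
    for word in text.lower().split():
        if word in stopwords:
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(tuple(current))
    return phrases

def rake_scores(phrases):
    """Score each phrase as the sum of degree(w) / frequency(w) over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Each word co-occurs with every word of its phrase, itself included.
            degree[word] += len(phrase)
    return {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }

phrases = candidate_phrases("the little pig ran to the pig sty")
scores = rake_scores(phrases)
print(scores)
```

Longer phrases made of words that rarely appear elsewhere score highest, which is why multi-word candidates tend to rank above single words.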

Installing dependencies

In [4]:
! pip install rake-nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.6


In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm

from rake_nltk import Rake


In this cell, we download the NLTK resources that rake-nltk depends on: the stop word list and the punkt tokenizer.

In [18]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
data = pd.read_csv('metadata.csv')
data.head()

Unnamed: 0,audio_path,filename,subset,speaker_id,chapter_id,file_id,id,sex,minute,speaker_name,sentence
0,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0019.wav,train-clean-360,100.0,121669.0,19.0,100-121669-0019,F,25.06,Judy Bieber,BUT WHEN HE CAME TO THE STY THERE WAS NO PIG T...
1,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0014.wav,train-clean-360,100.0,121669.0,14.0,100-121669-0014,F,25.06,Judy Bieber,AND SO HE WENT SOFTLY UP TO THE PIG STY AND RE...
2,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0013.wav,train-clean-360,100.0,121669.0,13.0,100-121669-0013,F,25.06,Judy Bieber,THE FARMER GAVE ME NOTHING BUT A SCOLDING BUT ...
3,../input/librispeech-asr-wav-dataset/train-cle...,100-121674-0035.wav,train-clean-360,100.0,121674.0,35.0,100-121674-0035,F,25.06,Judy Bieber,AH YOU CAN RUN ABOUT ALL DAY IN SUMMER AND IN ...
4,../input/librispeech-asr-wav-dataset/train-cle...,100-121674-0017.wav,train-clean-360,100.0,121674.0,17.0,100-121674-0017,F,25.06,Judy Bieber,IT WAS ONE MORNING AFTER CHRISTMAS SAID THE RA...


In [14]:
# drop rows with missing sentences
data = data.dropna(subset=['sentence'])
data

Unnamed: 0,audio_path,filename,subset,speaker_id,chapter_id,file_id,id,sex,minute,speaker_name,sentence
0,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0019.wav,train-clean-360,100.0,121669.0,19.0,100-121669-0019,F,25.06,Judy Bieber,BUT WHEN HE CAME TO THE STY THERE WAS NO PIG T...
1,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0014.wav,train-clean-360,100.0,121669.0,14.0,100-121669-0014,F,25.06,Judy Bieber,AND SO HE WENT SOFTLY UP TO THE PIG STY AND RE...
2,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0013.wav,train-clean-360,100.0,121669.0,13.0,100-121669-0013,F,25.06,Judy Bieber,THE FARMER GAVE ME NOTHING BUT A SCOLDING BUT ...
3,../input/librispeech-asr-wav-dataset/train-cle...,100-121674-0035.wav,train-clean-360,100.0,121674.0,35.0,100-121674-0035,F,25.06,Judy Bieber,AH YOU CAN RUN ABOUT ALL DAY IN SUMMER AND IN ...
4,../input/librispeech-asr-wav-dataset/train-cle...,100-121674-0017.wav,train-clean-360,100.0,121674.0,17.0,100-121674-0017,F,25.06,Judy Bieber,IT WAS ONE MORNING AFTER CHRISTMAS SAID THE RA...
...,...,...,...,...,...,...,...,...,...,...,...
27401,../input/librispeech-asr-wav-dataset/train-cle...,5672-88363-0030.wav,train-clean-360,5672.0,88363.0,30.0,5672-88363-0030,M,25.21,Jacob Paul Starr,THERE IS A PROPHECY ABROAD THAT PROHIBITION WI...
27402,../input/librispeech-asr-wav-dataset/train-cle...,5672-88363-0025.wav,train-clean-360,5672.0,88363.0,25.0,5672-88363-0025,M,25.21,Jacob Paul Starr,THE LARGER THE COUNTY SEAT THE LARGER THE NON ...
27403,../input/librispeech-asr-wav-dataset/train-cle...,5672-75791-0019.wav,train-clean-360,5672.0,75791.0,19.0,5672-75791-0019,M,25.21,Jacob Paul Starr,AND ON THIS WISE WE SHALL BE UNDER NO OBLIGATI...
27404,../input/librispeech-asr-wav-dataset/train-cle...,5672-88367-0030.wav,train-clean-360,5672.0,88367.0,30.0,5672-88367-0030,M,25.21,Jacob Paul Starr,AND THE MUMMY MUTILATED OR DESTROYED COULD NOT...


In [15]:
sentences = data['sentence'].values
sentences

array(['BUT WHEN HE CAME TO THE STY THERE WAS NO PIG TO BE SEEN AND HE SEARCHED ALL ROUND THE PLACE FOR A GOOD HOUR WITHOUT FINDING IT',
       'AND SO HE WENT SOFTLY UP TO THE PIG STY AND REACHED OVER AND GRABBED THE LITTLE PIG BY THE EARS THE PIG SQUEALED OF COURSE BUT THE FARMER WAS MAKING SO MUCH NOISE HIMSELF THAT HE DID NOT HEAR IT',
       'THE FARMER GAVE ME NOTHING BUT A SCOLDING BUT THERE WAS A VERY NICE PIG RUNNING AROUND THE YARD HOW BIG WAS IT ASKED BARNEY OH JUST ABOUT BIG ENOUGH TO MAKE A NICE DINNER FOR YOU AND ME',
       ...,
       'AND ON THIS WISE WE SHALL BE UNDER NO OBLIGATION TO THE SAID KING I REPLIED THAT I HEARD AND OBEYED BEING UNABLE TO OPPOSE HIS COMMAND',
       'AND THE MUMMY MUTILATED OR DESTROYED COULD NOT ENTERTAIN THE GUEST EGYPT CRIED OUT THROUGH THOUSANDS OF YEARS FOR THE ULTIMATE RESURRECTION OF THE WHOLE MAN HIS COMING FORTH BY DAY',
       'THE CHOICE RESIDENTIAL DISTRICTS ARE VOTED DRY FOR REAL ESTATE REASONS THE MEN WHO DO THIS DRINK FREELY AT

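In this step we preprocess the text before running RAKE: we lowercase it, replace sentence punctuation with spaces, normalize curly quotes and dashes to their ASCII equivalents, and collapse repeated whitespace. Stop words are not removed here; RAKE uses them as phrase delimiters during extraction.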

In [16]:
def preprocess(sent):
    sent = sent.lower()
    sent = sent.replace('.', " ")
    sent = sent.replace(',', " ")
    sent = sent.replace('?', " ")
    sent = sent.replace('!', " ")
    sent = sent.replace('’', "'")
    sent = sent.replace('‘', "'")
    sent = sent.replace('“', '"')
    sent = sent.replace('”', '"')
    sent = sent.replace('—', '-')
    sent = sent.replace('–', '-')
    sent = ' '.join(sent.split()).strip()
    return sent

sentences = [preprocess(sent) for sent in sentences]
sentences

['but when he came to the sty there was no pig to be seen and he searched all round the place for a good hour without finding it',
 'and so he went softly up to the pig sty and reached over and grabbed the little pig by the ears the pig squealed of course but the farmer was making so much noise himself that he did not hear it',
 'the farmer gave me nothing but a scolding but there was a very nice pig running around the yard how big was it asked barney oh just about big enough to make a nice dinner for you and me',
 'ah you can run about all day in summer and in winter and enjoy yourself in your own way said santa but the poor little children are obliged to stay in the house in the winter and on rainy days in the summer',
 'it was one morning after christmas said the rabbit who seemed to enjoy talking now that he had overcome his fear of dorothy and i was sitting by the road side when santa claus came riding back in his empty sleigh he does not come home quite so fast as he goes',
 'and
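
The same normalization can also be written with a translation table. The sketch below is an alternative formulation, not the notebook's code; the names `PUNCT_TABLE` and `normalize` are hypothetical, and the mapping is copied from `preprocess` above.

```python
# Hypothetical table-driven equivalent of preprocess(); names are illustrative.
PUNCT_TABLE = str.maketrans({
    '.': ' ', ',': ' ', '?': ' ', '!': ' ',   # sentence punctuation -> space
    '’': "'", '‘': "'",                       # curly single quotes -> ASCII
    '“': '"', '”': '"',                       # curly double quotes -> ASCII
    '—': '-', '–': '-',                       # em/en dashes -> hyphen
})

def normalize(sent):
    """Lowercase, map punctuation via PUNCT_TABLE, and collapse whitespace."""
    return ' '.join(sent.lower().translate(PUNCT_TABLE).split())

print(normalize('“Hello,” she said—WAIT!  Really?'))
```

A single `translate` call makes one pass over the string instead of one pass per `replace`, and keeps the character mapping in one place. Since the LibriSpeech transcripts shown here are already unpunctuated upper-case text, the visible effect on this dataset is lowercasing.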

In this part, we use the RAKE algorithm to extract the five top-ranked keywords from each sentence. If RAKE returns fewer than five keywords for a sentence, we fill the remaining slots with the empty string ''.
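
The padding rule can be sketched in isolation, independent of RAKE. The helper name `top_k_padded` and its parameters are hypothetical; it takes any ranked list of phrases and returns exactly `k` entries.

```python
def top_k_padded(phrases, k=5, fill=''):
    """Return the first k phrases, padded with `fill` up to length k."""
    top = list(phrases[:k])
    return top + [fill] * (k - len(top))

# A sentence that produced only two ranked phrases gets three empty slots.
print(top_k_padded(['went softly', 'pig sty']))
```

Padding to a fixed length keeps every per-sentence result the same size, so the results line up with the DataFrame rows when they are stored as columns.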

In [20]:
keyword_1, keyword_2, keyword_3, keyword_4, keyword_5 = [], [], [], [], []
keyword_lists = [keyword_1, keyword_2, keyword_3, keyword_4, keyword_5]

for sentence in tqdm.tqdm(sentences):
    r = Rake()
    r.extract_keywords_from_text(sentence)
    # Fetch the ranked phrases once per sentence and reuse them for every slot.
    phrases = r.get_ranked_phrases()
    for i, keywords in enumerate(keyword_lists):
        # Pad with '' when RAKE returns fewer than five phrases.
        keywords.append(phrases[i] if i < len(phrases) else '')

keyword_1

100%|██████████| 27406/27406 [00:10<00:00, 2654.18it/s]


['good hour without finding',
 'went softly',
 'nice pig running around',
 'way said santa',
 'santa claus came riding back',
 'apparently mournful meditation',
 'big asked dorothy',
 'view mister rivers',
 'kind one morning',
 'bones picked clean finally',
 'toy rabbit look wonderfully life like',
 'scholar pleased',
 'would allow',
 'know every flower',
 'occupations converse',
 'never stopped running',
 'one big room',
 'case replied santa',
 'day thought fitted thought opinion met opinion',
 'making toys',
 'severe beating',
 'time elapsed',
 'earnestly felt yet strictly restrained zeal breathed soon',
 'magic collar asked dorothy',
 'santa never forgets',
 'father go hungry replied barney unless',
 'grey small antique structure',
 'house tom started',
 'tom fell',
 'chased tom away',
 'farmer said never',
 'sisters would expostulate',
 'bad ways one morning tom tom',
 'running back home',
 'force compressed condensed controlled',
 'nicest little room',
 'please old santa',
 'rabbi

In [21]:
data['keyword_1'] = keyword_1
data['keyword_2'] = keyword_2
data['keyword_3'] = keyword_3
data['keyword_4'] = keyword_4
data['keyword_5'] = keyword_5


In [22]:
data.head()


Unnamed: 0,audio_path,filename,subset,speaker_id,chapter_id,file_id,id,sex,minute,speaker_name,sentence,keyword_1,keyword_2,keyword_3,keyword_4,keyword_5
0,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0019.wav,train-clean-360,100.0,121669.0,19.0,100-121669-0019,F,25.06,Judy Bieber,BUT WHEN HE CAME TO THE STY THERE WAS NO PIG T...,good hour without finding,sty,seen,searched,round
1,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0014.wav,train-clean-360,100.0,121669.0,14.0,100-121669-0014,F,25.06,Judy Bieber,AND SO HE WENT SOFTLY UP TO THE PIG STY AND RE...,went softly,pig sty,pig squealed,much noise,little pig
2,../input/librispeech-asr-wav-dataset/train-cle...,100-121669-0013.wav,train-clean-360,100.0,121669.0,13.0,100-121669-0013,F,25.06,Judy Bieber,THE FARMER GAVE ME NOTHING BUT A SCOLDING BUT ...,nice pig running around,asked barney oh,nice dinner,farmer gave,big enough
3,../input/librispeech-asr-wav-dataset/train-cle...,100-121674-0035.wav,train-clean-360,100.0,121674.0,35.0,100-121674-0035,F,25.06,Judy Bieber,AH YOU CAN RUN ABOUT ALL DAY IN SUMMER AND IN ...,way said santa,poor little children,rainy days,winter,winter
4,../input/librispeech-asr-wav-dataset/train-cle...,100-121674-0017.wav,train-clean-360,100.0,121674.0,17.0,100-121674-0017,F,25.06,Judy Bieber,IT WAS ONE MORNING AFTER CHRISTMAS SAID THE RA...,santa claus came riding back,come home quite,road side,one morning,enjoy talking
