
# Keyword Extraction Test with Detective Short Stories

This is a quick test of what two widely used text-processing tools, `rake-nltk` and `spacy`, pull from a Sherlock Holmes short story. These tools are designed for pulling keywords from scientific, academic, and other nonfiction texts, not from fiction.

The `rake` results are mostly fragments of dialogue with stray quotation marks and punctuation rather than usable keywords. The `spacy` results are largely named entities, mixed with numbers, dates, and other capitalized words that appear in the middle of sentences.

The ultimate goal is to find an algorithmic way to describe the distinctive features of clues in detective short stories. Because the genre developed well-worn conventions, I expect to find some patterns in how clues appear in the texts. However, following Poe's model in the foundational story "The Purloined Letter," those patterns may have to do with introducing objects that are, in a sense, 'hiding in plain sight.'

In [1]:
!pip install rake-nltk

Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl.metadata (6.4 kB)
Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.6


In [4]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
from rake_nltk import Rake

In [5]:
rake_nltk_var = Rake()

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
%cd '/content/drive/MyDrive/Colab Notebooks'

/content/drive/MyDrive/Colab Notebooks


In [12]:
with open('data/texts/MSH03.txt', encoding='utf-8') as f:
    text = f.read()

In [14]:
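# Split the text into candidate phrases and score them
rake_nltk_var.extract_keywords_from_text(text)
# Phrases ordered from highest to lowest score
keyword_extracted = rake_nltk_var.get_ranked_phrases()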

In [15]:
print(keyword_extracted)

['two small things — mere formalities —', 'slip ?” “ impossible ,” answered pycroft', 'stock exchange list every morning .’ “‘', 'british broken hills ?’ “‘ seven', 'clear enough ,” said holmes impatiently', 'exit ?” “ none .” “', 'whole thing hinges upon two points', 'soul ,” groaned hall pycroft', 'financial agent ,” printed upon', 'easily enough ,” said hall pycroft', 'new zealand consolidated ?’ “‘', 'last sudden move .” “', 'little deductive problems .” “', 'look remarkably robust .” “', 'voice within bade us enter', 'six .’ “‘ wonderful !’', 'seventy minutes ,” holmes remarked', 'saturdays .” “ never mind', 'attendant —” “ ah yes', 'aroused .” hall pycroft shook', 'new clerk named hall pycroft', 'gold .” sherlock holmes rubbed', '.’ “‘ well ,’ said', '.” “‘“ done !” said', '£ 500 .”’ “', 'goes !” cried hall pycroft', 'managing director ?” “ quite', '?” “ certainly .” “', 'brisk drumming upon woodwork', 'san remo .’ “', 'past results .” “', 'accountant ,” said holmes', 'shows real
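Many of the phrases above still contain curly quotes and runs of punctuation (for example `'slip ?” “ impossible ,” answered pycroft'`). One plausible cause, not verified against this run, is that `rake-nltk` by default treats only the ASCII characters in `string.punctuation` as phrase boundaries, so typographic marks and multi-character punctuation tokens survive inside phrases. Below is a minimal sketch of a preprocessing step under that assumption. `normalize_punctuation` and `TYPOGRAPHIC_MAP` are hypothetical names introduced here, not part of `rake-nltk`.

```python
import re

# Hypothetical mapping (an assumption, not part of rake-nltk):
# typographic marks -> ASCII equivalents found in string.punctuation.
TYPOGRAPHIC_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u2014": "-",  # em dash
    "\u2013": "-",  # en dash
})


def normalize_punctuation(text: str) -> str:
    """Return text with ASCII punctuation, each mark separated by spaces."""
    # 1. Replace typographic marks with ASCII equivalents.
    text = text.translate(TYPOGRAPHIC_MAP)
    # 2. Pad every punctuation character with spaces so that runs such as
    #    ?" become two separate single-character tokens.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # 3. Collapse the extra whitespace introduced by the padding.
    return re.sub(r"\s+", " ", text).strip()


sample = "slip?\u201d \u201cImpossible,\u201d answered Pycroft"
cleaned = normalize_punctuation(sample)
print(cleaned)
```

The cleaned string could then be passed to `rake_nltk_var.extract_keywords_from_text(normalize_punctuation(text))`. Whether this yields more useful phrases for fiction is untested here. Note that padding also splits contractions and possessives at the apostrophe.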

In [16]:
import spacy

In [19]:
nlp = spacy.load("en_core_web_sm")

In [20]:
doc = nlp(text)
print(doc.ents)

(Paddington, Farquhar, one, twelve hundred, a very few years, three months, Sherlock Holmes, Baker Street, one morning, June, the British Medical Journal, Watson, Watson, only last night, Birmingham, Holmes, half, Summer, three days last week, Holmes, more than a few weeks, so wet a June, Holmes, Birmingham, four, Holmes, two, three inches, Hall Pycroft, cabby, half, first, Birmingham, Sherlock Holmes, seventy minutes, Holmes, Hall Pycroft, Watson, Pycroft, Johnnie, Watson, Coxon & Woodhouse's, Draper's Gardens, the spring, Venezuelan, five years, Coxon, twenty-seven, three pounds, Coxon's, Mawson & Williams's, Lombard Street, E. C., London, next Monday, first, week, Coxon's, 17, that very evening, Arthur Pinner, Hall Pycroft, ‘Lately, Coxon & Woodhouse's, ‘And now on the staff, Mawson, “‘Quite so, ‘Well, Parker, Coxon, every morning, Ayrshires, “‘A hundred and six and a quarter to a hundred and five and seven-eighths, ‘And New Zealand, “‘A hundred and four., ‘And British Broken Hills,
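The entity list above mixes names and places with numbers and dates (`one`, `twelve hundred`, `three months`). spaCy attaches a label to each entity, available as `ent.label_`, so one way to narrow the list is to keep only name-like labels and count repeats. The sketch below works on plain `(text, label)` pairs so that it runs without a loaded model. `filter_entities` and `NAME_LABELS` are hypothetical names, and the labels in `sample` were assigned by hand for illustration. They are not actual model output.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumption: these labels are the "name-like" ones worth keeping.
# All are labels used by spaCy's English pipelines.
NAME_LABELS = frozenset({"PERSON", "GPE", "LOC", "ORG", "FAC"})


def filter_entities(
    pairs: Iterable[Tuple[str, str]],
    keep: frozenset = NAME_LABELS,
) -> Counter:
    """Count entity texts whose label is in `keep`; drop everything else."""
    return Counter(text for text, label in pairs if label in keep)


# Hand-labeled illustration, NOT the model's real predictions.
sample = [
    ("Paddington", "GPE"),
    ("one", "CARDINAL"),
    ("three months", "DATE"),
    ("Sherlock Holmes", "PERSON"),
    ("Holmes", "PERSON"),
    ("Holmes", "PERSON"),
    ("Birmingham", "GPE"),
]
counts = filter_entities(sample)
print(counts.most_common())
```

With the notebook's `doc`, the call would be `filter_entities((ent.text, ent.label_) for ent in doc.ents)`. This only removes numeric and date entities. Mislabeled dialogue fragments would still need separate handling.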