Library implementing surprising frequent phrase detection as defined in "Characterising Semantically Coherent Classes of Text Using Feature Discovery" (2019).

andehr/sfpd

Surprisingly Frequent Phrase Detection

Library implementing surprising frequent phrase detection as defined in "Characterising Semantically Coherent Classes of Text Using Feature Discovery".

If you use this tool, please include the following citation:

Robertson, Andrew David, 2019. Characterising semantically coherent classes of text through feature discovery (Doctoral thesis, University of Sussex).

Here's the BibTeX entry:

@phdthesis{robertson2019characterising,
  title={Characterising semantically coherent classes of text through feature discovery},
  author={Robertson, Andrew David},
  year={2019},
  school={University of Sussex}
}

Install

From source:

python setup.py install

Then install the relevant spaCy language model, e.g. English ("en"):

python -m spacy download en

Usage

Command-line

Use the command-line interface:

python -m sfpd.cli --help

For example, the following finds surprising phrases in target.csv relative to background.csv, reading the text from the CSV column whose header is text:

python -m sfpd.cli --target target.csv --background background.csv --text-col-name text

Programmatically

Get an iterable of strings for the target data and another for the background data. If you have two CSV files with a text column, you can use a helper function:

from sfpd.util import iter_large_csv_text

target = iter_large_csv_text(target_path, text_column_name)
background = iter_large_csv_text(background_path, text_column_name)
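For readers who want to see what streaming a text column looks like, here is a minimal sketch using only the standard library. It is not the library's implementation: the function name iter_csv_text_sketch is hypothetical, and the choice to skip empty cells is an assumption.

```python
import csv
from typing import Iterator


def iter_csv_text_sketch(path: str, text_column: str) -> Iterator[str]:
    """Yield the text column of a CSV one row at a time (hypothetical stand-in)."""
    with open(path, newline="", encoding="utf-8") as handle:
        # DictReader uses the first row as the header, so cells are looked up by name.
        for row in csv.DictReader(handle):
            text = row.get(text_column)
            if text:  # assumption: rows with a missing or empty cell are skipped
                yield text
```

A generator like this is exhausted after one pass, which is why the phrase expansion step further down calls the helper again rather than reusing target.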

Build Python Counter objects for frequency distributions of the tokens in target and background:

from sfpd.words import count_words

target_counts = count_words(target, min_count=4, language="en")
background_counts = count_words(background, min_count=4, language="en")

This is the default way tokens are counted for finding surprising words. You can tailor the counting however you like by providing your own Counter objects. The counting for phrase expansion, however, is done internally in a specific way.
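As a rough illustration of supplying your own counts, the sketch below builds a Counter with a simple regex tokenizer. The name count_tokens_sketch, the tokenization rule, and the reading of min_count as "drop tokens seen fewer than this many times" are all assumptions made for the example; count_words itself may tokenize and filter differently.

```python
import re
from collections import Counter
from typing import Iterable


def count_tokens_sketch(texts: Iterable[str], min_count: int = 1) -> Counter:
    """Count lowercased word tokens, dropping any seen fewer than min_count times."""
    counts = Counter()
    for text in texts:
        # Assumption: a token is a run of letters, digits or apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return Counter({tok: n for tok, n in counts.items() if n >= min_count})


target_counts = count_tokens_sketch(["the cat sat", "the cat ran"], min_count=2)
print(target_counts)
```

Whatever tokenizer you choose, use the same one for both the target and the background so that the two distributions are comparable.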

Next, find surprising words using one of the word scoring methods:

from sfpd.words import top_words_llr, top_words_sfpd, top_words_chi2

# My method 
words = top_words_sfpd(target_counts, background_counts)

# Log likelihood ratio method from https://github.com/tdunning/python-llr
words = top_words_llr(target_counts, background_counts)

# Chi-square method
words = top_words_chi2(target_counts, background_counts)

The returned words object is a pandas DataFrame containing each proposed word, its score, and its count in both the target and background corpora.
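To give a feel for what a surprise score measures, here is a small sketch of the standard log-likelihood ratio (Dunning's G²) computed from two Counter objects. It is an illustration of the general statistic only: the names llr_2x2 and score_words_llr are hypothetical, the rule of keeping only words that are relatively more frequent in the target is an assumption, and top_words_llr may differ in detail.

```python
import math
from collections import Counter
from typing import List, Tuple


def llr_2x2(k11: int, k12: int, k21: int, k22: int) -> float:
    """Dunning's G^2 statistic for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for i, j, k in ((0, 0, k11), (0, 1, k12), (1, 0, k21), (1, 1, k22)):
        if k > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            g2 += k * math.log(k * total / (rows[i] * cols[j]))
    return 2.0 * g2


def score_words_llr(target: Counter, background: Counter) -> List[Tuple[str, float]]:
    """Rank target words by G^2 against the background, highest first."""
    t_total, b_total = sum(target.values()), sum(background.values())
    scored = []
    for word, t in target.items():
        b = background.get(word, 0)
        # Assumption: keep only words over-represented in the target.
        # Cross-multiplying avoids dividing by an empty background total.
        if t * b_total > b * t_total:
            scored.append((word, llr_2x2(t, b, t_total - t, b_total - b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = score_words_llr(Counter({"zebra": 10, "the": 90}),
                         Counter({"zebra": 1, "the": 99}))
```

In this toy input, "zebra" is ten times more frequent in the target than in the background, so it is the only word kept, while "the" is dropped because it is relatively rarer in the target.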

Next you can expand these words to phrases:

from sfpd.phrases import get_top_phrases

top_phrases = get_top_phrases(words["word"].values, iter_large_csv_text(target_path, text_column_name))
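The idea of growing a word into a phrase can be pictured with a deliberately naive sketch: count every n-gram in the target that contains a seed word and keep the most frequent ones. This is not how get_top_phrases works internally (the source only says the counting is done "in a specific way"); expand_words_naive, its whitespace tokenization, and its max_n and top_k parameters are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Ngram = Tuple[str, ...]


def expand_words_naive(seed_words: Iterable[str], texts: Iterable[str],
                       max_n: int = 3, top_k: int = 2) -> Dict[str, List[Tuple[Ngram, int]]]:
    """For each seed word, return its top_k most frequent containing n-grams."""
    seeds = list(seed_words)
    per_word = {w: Counter() for w in seeds}
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                for w in set(ngram):
                    if w in per_word:
                        per_word[w][ngram] += 1
    # Dict insertion order keeps the seeds in the order they were given.
    return {w: per_word[w].most_common(top_k) for w in per_word}


expanded = expand_words_naive(
    ["learning"],
    ["machine learning is fun", "deep machine learning", "machine learning rocks"],
    top_k=1,
)
```

Raw frequency alone tends to favour short n-grams, which is one reason a real phrase expansion method needs more care than this.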

The output will be a TopPhrases object, which you can interrogate for phrase information:

sfpd/sfpd/phrases.py, lines 402 to 420 in 6bfabb8:

class TopPhrases:

    def __init__(self, counters, num_phrases):
        """
        Each word is associated with 1 or more phrases. The ordering of most surprising word first is maintained.
        """
        self.data = OrderedDict((c.root_form, c.top_ngrams(num_phrases)) for c in counters)

    def raw_phrases(self):
        """
        For each surprising word in turn, collect all of its phrases, all together in one big list.
        """
        return [self.phrase2str(phrase[0]) for word, phrases in self.data.items() for phrase in phrases]

    def top_phrases_per_word(self):
        return OrderedDict((word, phrases[0]) for word, phrases in self.data.items())

    def phrase2str(self, phrase):
        return " ".join(phrase)
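The following sketch shows how the two accessor methods behave. It copies the TopPhrases excerpt so that it runs on its own and feeds it a stub in place of the library's real counters. StubCounter is hypothetical, and the assumption that each phrase is a (token tuple, count) pair is inferred from the phrase[0] indexing in the excerpt rather than confirmed by the source.

```python
from collections import OrderedDict


class StubCounter:
    """Hypothetical stand-in for the library's per-word n-gram counters."""

    def __init__(self, root_form, ngrams):
        self.root_form = root_form
        self._ngrams = ngrams

    def top_ngrams(self, num_phrases):
        return self._ngrams[:num_phrases]


class TopPhrases:
    # Copied from the excerpt above (docstrings omitted for brevity).
    def __init__(self, counters, num_phrases):
        self.data = OrderedDict((c.root_form, c.top_ngrams(num_phrases)) for c in counters)

    def raw_phrases(self):
        return [self.phrase2str(phrase[0]) for word, phrases in self.data.items() for phrase in phrases]

    def top_phrases_per_word(self):
        return OrderedDict((word, phrases[0]) for word, phrases in self.data.items())

    def phrase2str(self, phrase):
        return " ".join(phrase)


# Assumption: each phrase is a (tuple of tokens, count) pair.
counters = [
    StubCounter("learning", [(("machine", "learning"), 12), (("deep", "learning"), 7)]),
    StubCounter("network", [(("neural", "network"), 9)]),
]
top = TopPhrases(counters, num_phrases=2)
raw = top.raw_phrases()            # every phrase as a string, most surprising word first
best = top.top_phrases_per_word()  # first phrase entry for each word
```

Note that raw_phrases returns plain strings, whereas top_phrases_per_word returns each word's first phrase entry unchanged, so pass its token tuple through phrase2str if you want a string.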
