Library implementing surprising frequent phrase detection as defined in "Characterising Semantically Coherent Classes of Text Through Feature Discovery".
If you use this tool, please include the following citation:
Robertson, Andrew David, 2019. Characterising semantically coherent classes of text through feature discovery (Doctoral thesis, University of Sussex).
Here's the BibTeX:
@phdthesis{robertson2019characterising,
  title={Characterising semantically coherent classes of text through feature discovery},
  author={Robertson, Andrew David},
  year={2019},
  school={University of Sussex}
}
Install from source:
python setup.py install
Then install the relevant spaCy language model, e.g. the English model "en":
python -m spacy download en
Use the command-line interface:
python -m sfpd.cli --help
For example, the following will find surprising phrases in target.csv versus background.csv, where the text is found under the CSV header text:
python -m sfpd.cli --target target.csv --background background.csv --text-col-name text
Get an iterable of strings for the target data and background data. If you have two CSV files with a text column, you can use a helper function:
from sfpd.util import iter_large_csv_text
target = iter_large_csv_text(target_path, text_column_name)
background = iter_large_csv_text(background_path, text_column_name)
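If your text is not in CSV files, any iterable of strings will do. For example, here is a minimal sketch assuming one document per line in plain text files (iter_text_lines is not part of the library):
def iter_text_lines(path):
    # Yield one document per non-empty line of a plain text file.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

target = iter_text_lines("target.txt")
background = iter_text_lines("background.txt")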
Build Python Counter objects for frequency distributions of the tokens in target and background:
from sfpd.words import count_words
target_counts = count_words(target, min_count=4, language="en")
background_counts = count_words(background, min_count=4, language="en")
This is the default way tokens are counted for finding surprising words. The counting can be tailored however you like by providing your own Counter objects; note, however, that the counting for phrase expansion is done internally in a specific way.
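For example, here is a minimal sketch of supplying your own counts, assuming a simple lowercase whitespace tokeniser (my_count_words is hypothetical and not part of the library):
from collections import Counter
from sfpd.util import iter_large_csv_text

def my_count_words(texts, min_count=4):
    # Hypothetical custom counting: lowercase whitespace tokenisation.
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Keep only tokens seen at least min_count times (analogous to the min_count argument above).
    return Counter({w: c for w, c in counts.items() if c >= min_count})

target_counts = my_count_words(iter_large_csv_text(target_path, text_column_name))
background_counts = my_count_words(iter_large_csv_text(background_path, text_column_name))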
Next, find surprising words using one of the surprising-word methods:
from sfpd.words import top_words_llr, top_words_sfpd, top_words_chi2
# My method
words = top_words_sfpd(target_counts, background_counts)
# Log likelihood ratio method from https://github.com/tdunning/python-llr
words = top_words_llr(target_counts, background_counts)
# Chi-square method
words = top_words_chi2(target_counts, background_counts)
The returned words object is a pandas DataFrame containing each proposed word, its score, and its counts in the target and background corpora.
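For example, a minimal way to inspect the result (only the "word" column is relied on below; other column names may differ):
# `words` is a pandas DataFrame of proposed words with their scores and counts.
print(words.head(20))
surprising_words = words["word"].values  # the proposed word strings, used for phrase expansion below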
Next you can expand these words to phrases:
from sfpd.phrases import get_top_phrases
top_phrases = get_top_phrases(words["word"].values, iter_large_csv_text(target_path, text_column_name))
The output will be a TopPhrases object, which you can interrogate for phrase information.