css459/qsum

Qsum

Summarization platform curated towards search queries

By Eric Lin, Cole Smith

Requirements

  • nltk - pip install nltk
  • scipy - pip install scipy

System Design Pipeline

Preprocessing

The pipeline begins by preprocessing the article files into Python objects. Articles are stored in files, and preprocessing begins by tokenizing every sentence in the article and building a Sentence object from each one.

When a Sentence is initialized, it stores both the original sentence string and a processed version. The processed version is produced by the following steps:

  • Non-ASCII character removal
  • Stop word removal
  • Stemming (Snowball)

At the end of a preprocessing operation a Sentence might look like this:

  • Original:

    I had to fire General Flynn because he lied to the Vice President and the FBI, Mr. Trump wrote.

  • Preprocessed:

    [u'fire', u'general', u'flynn', u'lie', u'vice', u'presid', u'fbi', 'mr', u'trump', u'wrote']
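
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual Sentence class: the function name `preprocess`, the tiny `STOP_WORDS` set, the regex tokenization, and the pluggable `stem` parameter are all assumptions. In real use the stop-word list and the stemmer would presumably come from NLTK (for example, the `stem` method of `SnowballStemmer("english")`); they are left out here so the sketch has no dependencies.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption);
# a fuller list, such as NLTK's English stop words, would be used in practice.
STOP_WORDS = {"i", "had", "to", "because", "he", "the", "and"}


def preprocess(sentence, stem=lambda word: word, stop_words=STOP_WORDS):
    """Return the list of processed tokens for one sentence."""
    # 1. Non-ASCII character removal: drop anything that cannot be encoded as ASCII.
    ascii_only = sentence.encode("ascii", errors="ignore").decode("ascii")
    # Lowercase and split into alphabetic word tokens (assumed tokenization).
    tokens = re.findall(r"[a-z]+", ascii_only.lower())
    # 2. Stop-word removal.
    kept = [t for t in tokens if t not in stop_words]
    # 3. Stemming: identity by default; pass a Snowball stemmer's `stem` method for real use.
    return [stem(t) for t in kept]


tokens = preprocess(
    "I had to fire General Flynn because he lied to the Vice President and the FBI."
)
print(tokens)
```

With the default identity stemmer, words such as "lied" and "president" are left unstemmed; passing a real Snowball stemmer as `stem` would reduce them to stems like those shown in the example above.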

Similarity Scoring

Similarity scoring uses the count of common stems in two preprocessed lists (Sentences) A and B. The more stems lists A and B have in common, the higher the similarity score.

The similarity score is defined as: count_common_terms / log10(len(a) * len(b))

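This accounts for sentences of any length: the score is inversely proportional to the log of the product of the two list lengths, so longer sentences are not favored simply for containing more stems.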
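
A minimal sketch of this formula, under a few assumptions that the text does not settle: "common terms" is taken to mean distinct stems present in both lists, and the function name `similarity` is hypothetical. Note that `log10(len(a) * len(b))` is undefined when either list is empty and is zero when both lists hold a single stem; returning 0.0 in those cases is a choice made here, not necessarily what the project does.

```python
import math


def similarity(a, b):
    """Score two lists of stems as common_stems / log10(len(a) * len(b))."""
    # Assumption: count distinct stems that appear in both lists.
    common = len(set(a) & set(b))
    length_product = len(a) * len(b)
    # Assumption: log10 is undefined at 0 and equals 0 at 1,
    # so fall back to 0.0 instead of raising or dividing by zero.
    if length_product <= 1:
        return 0.0
    return common / math.log10(length_product)


a = ["fire", "general", "flynn", "lie"]
b = ["trump", "fire", "flynn"]
score = similarity(a, b)  # two common stems divided by log10(4 * 3)
print(score)
```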

Raw scores are ordinal: they are not normalized and are meaningful only relative to one another. However, the Sentence class provides a method that normalizes similarities relative to its own instance:

get_similar_scores_to_self(sentences)

A list of similarity scores is generated by comparing self to every Sentence object s in the list sentences. These scores are then normalized using the minimum and maximum similarity scores found for that list.

Query (self): Mr. Trump's thoughts on the tax cut 

Output: 0.611894630494 I had to fire General Flynn because he lied to the Vice President and the FBI, Mr. Trump wrote.
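
The normalization step might look like the sketch below. It assumes "normalized from the min and max" means standard min-max scaling into the range [0, 1]; the function name `normalize` and the handling of empty or constant score lists are assumptions, and this is not the body of `get_similar_scores_to_self` itself.

```python
def normalize(scores):
    """Min-max scale a list of raw similarity scores into [0, 1]."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    # Assumption: when every score is identical there is no spread to scale,
    # so return zeros rather than dividing by zero.
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


raw = [1.0, 2.0, 3.0]
normalized = normalize(raw)
print(normalized)
```

Under this scheme the least similar sentence in the list scores 0 and the most similar scores 1, so a value such as the one in the output above would sit a little over halfway between the two.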

The similarity scores are used as the graph edge weights in our data structure.
