Qsum

Summarization platform tailored to search queries

By Eric Lin, Cole Smith

Requirements

nltk - pip install nltk
scipy - pip install scipy

System Design Pipeline

Preprocessing

The pipeline begins by preprocessing the article files into Python objects. Articles are stored in files, and preprocessing tokenizes every sentence in the article and builds a Sentence object from each one.

When a Sentence is initialized, it stores the original sentence string and a processed version of it. The processed version is produced by the following steps:

Non-ASCII character removal
Stop word removal
Stemming (Snowball)

At the end of preprocessing, a Sentence might look like this:

Original:

I had to fire General Flynn because he lied to the Vice President and the FBI, Mr. Trump wrote.

Preprocessed:

[u'fire', u'general', u'flynn', u'lie', u'vice', u'presid', u'fbi', 'mr', u'trump', u'wrote']
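The three steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the regex tokenizer are all assumptions made for the example, and the stemmer is passed in as a callable so the sketch runs without any downloads.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# a real run would likely use a fuller list such as the one shipped with nltk.
STOP_WORDS = {"i", "had", "to", "because", "he", "the", "and",
              "a", "an", "of", "on", "in"}


def preprocess(sentence, stem=lambda word: word, stop_words=STOP_WORDS):
    """Return the list of processed terms for one sentence.

    `stem` is any callable mapping a word to its stem; the default
    leaves words unchanged.
    """
    # 1. Non-ASCII character removal.
    ascii_only = sentence.encode("ascii", "ignore").decode("ascii")
    # Lowercase and split into alphanumeric tokens (punctuation is dropped).
    tokens = re.findall(r"[a-z0-9]+", ascii_only.lower())
    # 2. Stop word removal.
    kept = [token for token in tokens if token not in stop_words]
    # 3. Stemming.
    return [stem(token) for token in kept]
```

To apply Snowball stemming, pass `stem=SnowballStemmer("english").stem` with `SnowballStemmer` imported from `nltk.stem.snowball`. With the default identity stemmer, words such as "lied" and "president" are kept unstemmed.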

Similarity Scoring

Similarity scoring counts the stems that two preprocessed lists (Sentences) A and B have in common. The more stems A and B share, the higher the similarity score.

The similarity score is defined as: count_common_terms / log10(len(a) * len(b))

This accounts for sentences of any length: the score is inversely proportional to the log of the product of the two lengths, so longer sentences need more common stems to reach the same score.
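A minimal sketch of this formula is shown below. Two points are assumptions rather than documented behavior: common stems are counted as distinct terms (set intersection), and the score falls back to 0.0 when the log term would be zero or undefined (an empty list, or two single-term lists).

```python
import math


def similarity(a, b):
    """Score two lists of stems: common terms / log10(len(a) * len(b))."""
    product = len(a) * len(b)
    # Assumption: log10 is undefined for 0 and equals 0 for 1, so there is
    # no usable denominator in those cases; return 0.0 instead.
    if product <= 1:
        return 0.0
    # Assumption: each shared stem counts once, however often it repeats.
    count_common_terms = len(set(a) & set(b))
    return count_common_terms / math.log10(product)
```

For example, two lists of three and four stems that share two stems score 2 / log10(12), roughly 1.85.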

Raw scores are ordinal and not normalized. However, the Sentence class provides a method to normalize similarities relative to a given instance:

get_similar_scores_to_self(sentences)

A list of similarity scores is generated by comparing self to every Sentence object s in the list sentences. These scores are then normalized using the minimum and maximum similarity scores found for that list.

Query (self): Mr. Trump's thoughts on the tax cut 

Output: 0.611894630494 I had to fire General Flynn because he lied to the Vice President and the FBI,  Mr. Trump wrote.
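The normalization step can be sketched as standard min-max scaling. The standalone `normalize_scores` function below is hypothetical and only illustrates the arithmetic; the actual get_similar_scores_to_self method may handle edge cases differently. Returning zeros when all scores are equal is an assumption.

```python
def normalize_scores(scores):
    """Rescale raw similarity scores to the range [0, 1] (min-max)."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    # Assumption: if every score is identical there is no spread to scale,
    # so all normalized scores are reported as 0.0.
    if highest == lowest:
        return [0.0 for _ in scores]
    return [(score - lowest) / (highest - lowest) for score in scores]
```

After scaling, the least similar sentence in the list scores 0.0 and the most similar scores 1.0, so values are comparable only within the list they were computed from.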

The similarity scores are used as the graph edge weights in our data structure.
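One way to hold such weights is a nested dictionary keyed by sentence index. The README does not describe the actual data structure, so the `build_graph` function, the undirected fully connected layout, and the `score` parameter below are all assumptions made for illustration.

```python
from itertools import combinations


def build_graph(items, score):
    """Build an undirected weighted graph as {node: {neighbor: weight}}.

    `items` is a list of preprocessed term lists and `score` is any
    callable that returns a similarity for two of them.
    """
    graph = {index: {} for index in range(len(items))}
    # Score each unordered pair once and store the weight in both directions.
    for i, j in combinations(range(len(items)), 2):
        weight = score(items[i], items[j])
        graph[i][j] = weight
        graph[j][i] = weight
    return graph
```

Here `graph[i][j]` is the edge weight between sentences i and j, and any similarity function that takes two term lists can be passed as `score`.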
