- nltk: pip install nltk
- scipy: pip install scipy
The pipeline begins by preprocessing the article files into Python objects.
Articles are stored in files, and preprocessing begins by tokenizing each article
into sentences and building a Sentence object from each one.
When a Sentence is initialized, it stores the original sentence string, and a
processed string. The processed string undergoes the following pipeline:
- Non-ASCII character removal
- Stop word removal
- Stemming (Snowball)
At the end of a preprocessing operation, a Sentence might look like this:
- Original: I had to fire General Flynn because he lied to the Vice President and the FBI, Mr. Trump wrote.
- Preprocessed: [u'fire', u'general', u'flynn', u'lie', u'vice', u'presid', u'fbi', 'mr', u'trump', u'wrote']
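The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual Sentence code: the function name preprocess, the tiny STOP_WORDS set, and the regex tokenization are all assumptions, and the stemmer is passed in as a callable so the sketch runs without any third-party package.

```python
import re

# Tiny illustrative stop-word set (an assumption); a real list, such as
# NLTK's English stop words, is much larger.
STOP_WORDS = {"i", "had", "to", "the", "and", "he", "because", "a", "of", "on"}


def preprocess(sentence, stem=lambda word: word, stop_words=STOP_WORDS):
    """Return the processed term list for one sentence string."""
    # Step 1: drop every non-ASCII character.
    ascii_text = sentence.encode("ascii", "ignore").decode("ascii")
    # Assumed tokenization: lowercase runs of letters, punctuation discarded.
    tokens = re.findall(r"[a-z]+", ascii_text.lower())
    # Step 2: remove stop words. Step 3: stem the remaining terms.
    return [stem(token) for token in tokens if token not in stop_words]
```

To apply Snowball stemming, pass the stem method of NLTK's SnowballStemmer("english") as the stem argument; with the default identity stemmer the terms are only filtered and lowercased.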
Similarity scoring uses the count of stems shared by two preprocessed lists (Sentences) a and b. The more stems the two lists have in common, the higher the similarity score.
The similarity score is defined as: count_common_terms / log10(len(a) * len(b))
The denominator grows with the log of the product of the two list lengths, so the score works for sentences of any length and long sentences do not score highly merely because they contain more terms.
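As a rough sketch of the score defined above, the function below counts shared stems and divides by the log of the length product. Two details are assumptions not stated in the text: shared stems are counted as distinct terms (a set intersection), and the score falls back to 0.0 when either list is empty or when the denominator would be zero (two one-term lists give log10(1) = 0).

```python
import math


def similarity(a, b):
    """Score two preprocessed term lists: common terms / log10(len(a) * len(b))."""
    if not a or not b:
        return 0.0  # assumed fallback for empty lists
    denominator = math.log10(len(a) * len(b))
    if denominator <= 0:
        return 0.0  # assumed fallback: log10(1) == 0 for two one-term lists
    # Assumption: each shared stem counts once, regardless of repeats.
    common_terms = len(set(a) & set(b))
    return common_terms / denominator
```

For two ten-term lists that share two stems, the denominator is log10(100) = 2, so the score is 1.0.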
Raw scores are ordinal: they are useful for ranking but are not normalized.
However, the Sentence class provides a method to normalize similarities around its instance:
get_similar_scores_to_self(sentences)
A list of similarity scores is generated by comparing self to every Sentence
object s in the list sentences. These scores are then normalized using the minimum
and maximum similarity scores found for that list.
Example:
- Query (self): Mr. Trump's thoughts on the tax cut
- Output: 0.611894630494 I had to fire General Flynn because he lied to the Vice President and the FBI, Mr. Trump wrote.
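Normalizing from the minimum and maximum score is commonly done with min-max scaling, sketched below. The helper name min_max_normalize is hypothetical, and the handling of an empty list or of identical scores is an assumption; the actual get_similar_scores_to_self method may differ.

```python
def min_max_normalize(scores):
    """Rescale raw similarity scores linearly into the range [0, 1]."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Assumed fallback: identical scores carry no ranking information.
        return [0.0 for _ in scores]
    return [(score - lowest) / (highest - lowest) for score in scores]
```

With this scaling, the least similar sentence in the list maps to 0.0 and the most similar one to 1.0.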
The similarity scores are used as the graph edge weights in our data structure.
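One possible way to hold those edge weights is a dictionary keyed by sentence index pairs, sketched below. The function name build_edge_weights, the dictionary layout, and the choice to omit zero-weight edges are assumptions for illustration; the project's own graph structure is not described here. The scoring function is passed in as a parameter, so any pairwise similarity can be used.

```python
from itertools import combinations


def build_edge_weights(sentences, score):
    """Map each sentence index pair (i, j), i < j, to its similarity weight."""
    weights = {}
    for i, j in combinations(range(len(sentences)), 2):
        weight = score(sentences[i], sentences[j])
        if weight > 0:  # assumption: pairs with no similarity get no edge
            weights[(i, j)] = weight
    return weights
```

Because the score is symmetric in its two arguments, storing each unordered pair once is enough to describe an undirected weighted graph.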