#### Main Pipeline
By: Daniel Gobalakrishnan

In [64]:
import sys
sys.path.append("../")

from src.inc.parser import Parser
from src.inc.topic_selector import TopicSelector
from src.inc.wiki_summarizer import WikiSummarizer

_______________________
Read text from a file using the Parser class. In this case, I extract the text from a PDF file.

In [65]:
path = "../samples/sample.pdf"
parser = Parser(path)
text = parser.get_text()

In [66]:
print(text)

Welcome to Smallpdf Ready to take document management to the next level? Digital Documents—All In One Place With the new Smallpdf experience, you can freely upload, organize, and share digital documents. When you enable the ‘Storage’ option, we’ll also store all processed files here. Enhance Documents in One Click When you right-click on a file, we’ll present you with an array of options to convert, compress, or modify it. Access Files Anytime, Anywhere You can access files stored on Smallpdf from your computer, phone, or tablet. We’ll also sync files from the Smallpdf Mobile App to our online portal Collaborate With Others Forget mundane administrative tasks. With Smallpdf, you can request e-signatures, send large files, or even enable the Smallpdf G Suite App for your entire organization.


_______________________
Select keywords (named entities) from the text using the TopicSelector class. It uses automatic language detection by default. The class can also extract a set number of common words from the text, based on a given minimum frequency value.

In [67]:
text = """Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, 
        is an American former professional basketball player and businessman. By acclamation, 
        Michael Jordan is the greatest basketball player of all time. He was integral in helping 
        to popularize the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the 
        process. He played 15 seasons in the NBA, winning six championships with the Chicago 
        Bulls. He is the principal owner and chairman of the Charlotte Hornets of the National 
        Basketball Association and of 23XI Racing in the NASCAR Cup Series."""

ts = TopicSelector(text=text, min_freq=2, lang="auto", n_common_words=8)
keywords = ts.get_keywords()

In [68]:
print(keywords)

{'NBA Chicago Bulls', 'Charlotte Hornets National Basketball', 'Michael Jordan', 'NBA', 'American', 'Michael Jeffrey Jordan'}
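
The "common words with a minimum frequency" idea can be illustrated with a minimal standard-library sketch. This is not the TopicSelector implementation: the function `common_words`, the tiny `STOP_WORDS` set, and the tokenizing regex are all hypothetical, and only the parameter names `min_freq` and `n_common_words` are borrowed from the call above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "with", "by"}


def common_words(text, min_freq=2, n_common_words=8):
    """Return up to n_common_words words that occur at least min_freq times."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # most_common() sorts by count, so the frequency filter keeps the top words.
    return [w for w, c in counts.most_common(n_common_words) if c >= min_freq]


sample = "Jordan played for the Bulls. The Bulls won six titles with Jordan."
print(common_words(sample, min_freq=2, n_common_words=8))
```

Raising `min_freq` shrinks the result, while `n_common_words` only caps how many candidates are considered before the frequency filter is applied.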


_______________________
Find Wikipedia pages for the keywords and summarize each one using either a frequency-based approach or a KMeansClusterer approach. The summaries can also be translated into a target language.
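
To make the frequency-based option more concrete, here is a minimal sketch of extractive summarization by word frequency. It is an assumption about the general technique, not the WikiSummarizer code: the function `summarize_by_frequency` is hypothetical, it borrows only the parameter names `summary_len` and `min_word_freq`, and it skips stop-word removal, which a real implementation would probably include.

```python
import re
from collections import Counter


def summarize_by_frequency(text, summary_len=2, min_word_freq=1):
    """Return the summary_len highest-scoring sentences in their original order."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Only words seen at least min_word_freq times contribute to the score.
    freq = {w: c for w, c in Counter(words).items() if c >= min_word_freq}

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq.get(t, 0) for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:summary_len])  # restore document order
    return " ".join(sentences[i] for i in chosen)


demo = "The Bulls won the title. Jordan led the Bulls. It rained on Tuesday."
print(summarize_by_frequency(demo, summary_len=1, min_word_freq=2))
```

Sentences are scored by the average frequency of their words, and the selected sentences are emitted in document order so the summary still reads naturally.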

In [69]:
keywords = ['National Basketball Association', 'Michael Jordan', 'Lebron James']
wiki = WikiSummarizer(keywords=keywords, summarizer='freq', min_word_freq=3, dist_metric='cosine', n_clusters=3,
                      max_sent_len=30, summary_len=5, lang='auto', min_summary_char_len=100, target='french')
summaries = wiki.get_summaries()

In [70]:
for keyword, summary in summaries.items():
    print(keyword, ": ", summary, "\n")

National Basketball Association :  Chaque équipe joue six des deux équipes des deux autres divisions de sa conférence quatre fois (24 matchs) et les quatre équipes restantes trois fois (12 matchs).Les équipes post-saison sont l'équipe AL-NBA, l'équipe de toute la défense et l'équipe entièrement recrue;Chacun se compose de cinq joueurs.Par conséquent, l'équipe avec le meilleur record de saison régulière de la ligue est garantie à domicile à domicile à chaque série qu'il joue.La ligue a commencé à utiliser son format actuel, avec les huit meilleures équipes de chaque conférence avançant indépendamment de l'alignement divisionnaire, à la saison 2015-16.Dans la saison 2015-2016, les guerriers ont terminé la saison 73-9, le record de la meilleure saison dans l'histoire de la NBA. 

Michael Jordan :  Il détient les records de la NBA pour la moyenne de la saison régulière de carrière (30,12 points par match) et la moyenne des séries de changements de séries de carrière (33,45 points par match