#### Main Pipeline
By: Daniel Gobalakrishnan

In [None]:
import sys
sys.path.append("../")

from src.inc.parser import Parser
from src.inc.topic_selector import TopicSelector
from src.inc.wiki_summarizer import WikiSummarizer

_______________________
Read text from a file using the Parser class. In this case, I extract the text from a PDF file.

In [None]:
path = "../samples/sample.pdf"
parser = Parser(path)
text = parser.get_text()

In [None]:
print(text)

_______________________
Select keywords (named entities) from the text using the TopicSelector class. Language detection is automatic by default. The class can also extract a set number of common words from the text, keeping only those that meet a given minimum frequency.
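
To make the common-word option concrete, here is a minimal standard-library sketch of frequency-based word selection. It is not the TopicSelector implementation: the function name `common_words`, the tiny `STOP_WORDS` set, and the regex tokenizer are assumptions made for illustration, and only the parameter names `min_freq` and `n_common_words` are borrowed from the call in the next cell.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to", "he", "by", "his", "with", "was", "also"}


def common_words(text, min_freq=2, n_common_words=8):
    """Return up to n_common_words non-stop words that occur at least min_freq times."""
    # Lowercase and keep alphanumeric runs as tokens (a deliberately naive tokenizer).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # most_common() sorts by descending count, so the frequency filter keeps a prefix of it.
    return [word for word, freq in counts.most_common(n_common_words) if freq >= min_freq]


sample = "Jordan played for the Bulls. Jordan won six titles with the Bulls. The NBA grew."
print(common_words(sample, min_freq=2, n_common_words=8))
```

Raising min_freq shrinks the result toward words that dominate the text, while n_common_words caps how many are returned even when many words pass the threshold.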

In [None]:
text = """Michael Jeffrey Jordan (born February 17, 1963), also known by his initials MJ, 
        is an American former professional basketball player and businessman. By acclamation, 
        Michael Jordan is the greatest basketball player of all time. He was integral in helping 
        to popularize the NBA around the world in the 1980s and 1990s, becoming a global cultural icon in the 
        process. He played 15 seasons in the NBA, winning six championships with the Chicago 
        Bulls. He is the principal owner and chairman of the Charlotte Hornets of the National 
        Basketball Association and of 23XI Racing in the NASCAR Cup Series."""

ts = TopicSelector(text=text, min_freq=2, lang="auto", n_common_words=8)
keywords = ts.get_keywords()

In [None]:
print(keywords)

_______________________
Find the Wikipedia page for each keyword and summarize it using either a frequency-based approach or a k-means clustering approach (KMeansClusterer). The summaries can also be translated into a target language.
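
As a rough picture of what a frequency-based summary does, the sketch below scores each sentence by the average corpus frequency of its words and keeps the top few in their original order. It is an assumption-laden illustration, not the WikiSummarizer code: `freq_summary` is a hypothetical helper, the sentence splitter is naive, and `max_sent_len` is interpreted here as a word count, which may differ from the class.

```python
import re
from collections import Counter


def freq_summary(text, summary_len=2, max_sent_len=30):
    """Pick the summary_len highest-scoring sentences and return them in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freqs = Counter(word for words in tokenized for word in words)

    scored = []
    for idx, words in enumerate(tokenized):
        # Skip empty or overly long sentences (max_sent_len is counted in words here).
        if not words or len(words) > max_sent_len:
            continue
        # Average word frequency, so long sentences are not favored automatically.
        score = sum(freqs[w] for w in words) / len(words)
        scored.append((score, idx))

    # Highest scores first; ties keep document order because sorted() is stable.
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:summary_len]
    # Restore document order for readability.
    return " ".join(sentences[idx] for _, idx in sorted(top, key=lambda pair: pair[1]))


doc = "Jordan played basketball. Jordan won titles. Cats sleep."
print(freq_summary(doc, summary_len=2))
```

A clustering-based variant would instead embed the sentences, group them into n_clusters, and pick a representative sentence from each group, which tends to cover more distinct subtopics than pure frequency scoring.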

In [None]:
keywords = ['National Basketball Association', 'Michael Jordan', 'LeBron James']
wiki = WikiSummarizer(keywords=keywords, summarizer='freq', dist_metric='cosine', n_clusters=3,
                      max_sent_len=30, summary_len=5, lang='auto', min_summary_char_len=100, target='french')
summaries = wiki.get_summaries()

In [None]:
# Print each keyword alongside its summary
for keyword, summary in summaries.items():
    print(keyword, ": ", summary, "\n")