## Instructions you need to follow:
<li>Write your code only in the comment area marked "Your Code Here".</li>
<li>Do not upload your own file. Make the necessary changes in the Jupyter notebook file already present on the server.</li>
<li>Several cells in the Assignment Jupyter notebook are empty and read-only. Do not attempt to remove or edit them; they are used to grade your notebook, and changing them may result in 0 points.</li>

In [1]:
import nltk
import os
import _sqlite3
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize,word_tokenize 
from gensim import corpora, models, similarities
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.similarities.docsim import Similarity

# Question 1

In [2]:
"""
Question 1

Write a function that takes the file name of the Wikipedia page listing all ancient Greek
philosophers (saved as "Index.html" in your workspace) and returns a list of tuples containing
the name of each philosopher and the path to their individual article file.

Example of use: get_philosophers("Index.html")

The output should be a list of tuples:

[('Acrion', 'Philosophers/Acrion.html'),
 ('Adrastus of Aphrodisias', 'Philosophers/Adrastus of Aphrodisias.html'),
 ('Aedesia', 'Philosophers/Aedesia.html'),
 ('Aedesius', 'Philosophers/Aedesius.html'),
 ('Aeneas of Gaza', 'Philosophers/Aeneas of Gaza.html'),
 ('Aenesidemus', 'Philosophers/Aenesidemus.html'),
 ...]
 
  
NOTE: To speed up processing, the table in "Index.html" has been shortened compared
to the one online on wikipedia.org. Do not worry if some philosophers are missing from
your results; this is intentional.

"""

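def get_philosophers(filename):
    from bs4 import BeautifulSoup

    # Read the index page, closing the file handle when done.
    with open(filename, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    result = []

    # The first cell of each data row links to the philosopher's article.
    sub_table = soup.find("table", class_="wikitable sortable")
    for tr in sub_table.find_all("tr"):
        # Header rows have no <td>; also skip rows whose first cell lacks a titled link.
        if tr.td and tr.td.a and tr.td.a.get("title"):
            text = tr.td.a.get("title")
            path = 'Philosophers/' + text + '.html'
            result.append((text, path))

    return result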
# Once done, try this:
filenames = get_philosophers("Index.html")
filenames

[('Acrion', 'Philosophers/Acrion.html'),
 ('Adrastus of Aphrodisias', 'Philosophers/Adrastus of Aphrodisias.html'),
 ('Aedesia', 'Philosophers/Aedesia.html'),
 ('Aedesius', 'Philosophers/Aedesius.html'),
 ('Aeneas of Gaza', 'Philosophers/Aeneas of Gaza.html'),
 ('Aenesidemus', 'Philosophers/Aenesidemus.html'),
 ('Aesara', 'Philosophers/Aesara.html'),
 ('Aeschines of Neapolis', 'Philosophers/Aeschines of Neapolis.html'),
 ('Aeschines of Sphettus', 'Philosophers/Aeschines of Sphettus.html'),
 ('Aetius of Antioch', 'Philosophers/Aetius of Antioch.html'),
 ('Agapius (philosopher)', 'Philosophers/Agapius (philosopher).html'),
 ('Agathobulus', 'Philosophers/Agathobulus.html'),
 ('Agathosthenes', 'Philosophers/Agathosthenes.html'),
 ('Agrippa the Skeptic', 'Philosophers/Agrippa the Skeptic.html'),
 ('Albinus (philosopher)', 'Philosophers/Albinus (philosopher).html'),
 ('Alcinous (philosopher)', 'Philosophers/Alcinous (philosopher).html'),
 ('Alcmaeon of Croton', 'Philosophers/Alcmaeon of Crot
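
Before moving on, it can help to confirm that every path in the list actually points to a file in the workspace. The following is a minimal sketch, not part of the graded solution: `missing_files` and its `exists` parameter are hypothetical names introduced here for illustration, and the sample data uses a stand-in check so the block runs without the assignment files.

```python
import os


def missing_files(pairs, exists=os.path.exists):
    """Return the (name, path) tuples whose path is not found.

    `exists` is a hypothetical hook (defaulting to os.path.exists) so the
    check can be tried without touching the file system.
    """
    return [(name, path) for name, path in pairs if not exists(path)]


# Example with a stand-in check, so it runs without the assignment files.
sample = [("Acrion", "Philosophers/Acrion.html"),
          ("Aedesia", "Philosophers/Aedesia.html")]
on_disk = {"Philosophers/Acrion.html"}
print(missing_files(sample, exists=lambda path: path in on_disk))
```

In the notebook itself, `missing_files(filenames)` would use the real file system; an empty list would suggest that all article files are where Question 2 expects them.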

# Question 2

In [67]:
"""
Question 2


Write a function that scrapes the text on a philosopher's page and returns it as a text 
string. The input is the name of the file that contains the philosopher's page.

Example of use: get_text('Philosophers/Acrion.html')
should output the text of the page.
'Acrion was a Locrian and a Pythagorean philosopher...'
"""

def get_text(file):
    from bs4 import BeautifulSoup

    # Read the article, closing the file handle when done.
    with open(file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # Concatenate the text of every paragraph (<p>) on the page.
    return "".join(tag.get_text() for tag in soup.find_all('p'))


# Once done, try this:
get_text("Philosophers/Acrion.html")

'Acrion was a Locrian and a Pythagorean philosopher.[1]  He is mentioned by Valerius Maximus[2] under the name of Arion. According to William Smith, Arion is a false reading, instead of Acrion.[3]\n'
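
The scraped text still carries Wikipedia citation markers such as `[1]` and a trailing newline. The assignment does not ask for them to be removed, so the following is only an optional sketch of one way to clean the string; `strip_citations` is a hypothetical helper name, and the pattern assumes markers are purely numeric.

```python
import re


def strip_citations(text):
    """Remove numeric citation markers like [1] and tidy the whitespace."""
    # Drop bracketed numeric markers such as [1] or [23].
    without_markers = re.sub(r"\[\d+\]", "", text)
    # Collapse the runs of whitespace left behind and trim both ends.
    return re.sub(r"\s+", " ", without_markers).strip()


raw = ("Acrion was a Locrian and a Pythagorean philosopher.[1]  "
       "He is mentioned by Valerius Maximus[2] under the name of Arion.\n")
print(strip_citations(raw))
```

Whether to apply such cleanup depends on the expected output of the grader, so it is safer to keep `get_text` returning the raw paragraph text and clean a copy where needed.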

# Question 3

In [60]:
"""
Question 3

Use the files under the "Philosophers" folder to construct an LSI model.
Then, use the LSI model to find the most similar philosopher for each of the philosophers
found in Question 1, based on the content of their Wikipedia articles. You should not go
online to scrape the data; everything you need is in your Jupyter notebook working directory.

The function should have as input the list of tuples created in Question 1.

The output format should also be a list of tuples. Each tuple should contain a philosopher's name
and the name of the most similar other philosopher. Note that the two names cannot be the same.

The output should look like this:

[('Acrion', 'Arignote'),
 ('Adrastus of Aphrodisias', 'Lycophron (Sophist)'),
 ('Aedesia', 'Heliodorus of Alexandria'),
 ('Aedesius', 'Chrysanthius'),
 ('Aeneas of Gaza', 'Archytas'),
 ...]


"""

def run(filenames):
    result = []

    # Tokenize each article: lowercase, drop stopwords and non-alphanumeric tokens.
    philosophers_texts = [[word for word in get_text(path).lower().split()
                           if word not in STOPWORDS and word.isalnum()]
                          for _, path in filenames]
    dictionary = corpora.Dictionary(philosophers_texts)
    corpus = [dictionary.doc2bow(text) for text in philosophers_texts]

    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=10)

    # Build the similarity index once instead of once per philosopher.
    index = similarities.MatrixSimilarity(lsi[corpus])

    for i, (name, _) in enumerate(filenames):
        # Reuse the bag-of-words vector already computed for this article.
        sims = index[lsi[corpus[i]]]
        ranked = sorted(enumerate(sims), key=lambda item: -item[1])
        # Take the best match that is not the philosopher itself.
        value = next(j for j, _ in ranked if j != i)
        result.append((name, filenames[value][0]))

    return result

# Once done, try this:
run(filenames)

[('Acrion', 'Xenophilus'),
 ('Adrastus of Aphrodisias', 'Aspasius'),
 ('Aedesia', 'Arete of Cyrene'),
 ('Aedesius', 'Amelius'),
 ('Aeneas of Gaza', 'Agrippa the Skeptic'),
 ('Aenesidemus', 'Ammonius Saccas'),
 ('Aesara', 'Apollonius Cronus'),
 ('Aeschines of Neapolis', 'Aedesius'),
 ('Aeschines of Sphettus', 'Antipater of Cyrene'),
 ('Aetius of Antioch', 'Apollonius of Tyre (philosopher)'),
 ('Agapius (philosopher)', 'Zenodotus (philosopher)'),
 ('Agathobulus', 'Arete of Cyrene'),
 ('Agathosthenes', 'Anaxarchus'),
 ('Agrippa the Skeptic', 'Arete of Cyrene'),
 ('Albinus (philosopher)', 'Alexamenus of Teos'),
 ('Alcinous (philosopher)', 'Aristoclea'),
 ('Alcmaeon of Croton', 'Apollodorus the Epicurean'),
 ('Alexamenus of Teos', 'Aristocles of Messene'),
 ('Alexander of Aegae', 'Athenaeus of Seleucia'),
 ('Alexander of Aphrodisias', 'Aristotle of Mytilene'),
 ('Alexicrates', 'Arete of Cyrene'),
 ('Alexinus', 'Antipater of Cyrene'),
 ('Amelius', 'Aedesius'),
 ('Ammonius Hermiae', 'Anaxarch
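
The requirement that a philosopher cannot be paired with themselves is easy to get wrong when ranking by similarity. gensim's `MatrixSimilarity` ranks documents by cosine similarity; the sketch below reproduces that ranking step in plain Python on toy two-dimensional vectors, skipping each document's own index explicitly. The function names and the vectors are hypothetical and only illustrate the selection logic, not the LSI model itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_other(names, vectors):
    """Pair each name with the most similar *other* name."""
    result = []
    for i, name in enumerate(names):
        # Consider every index except i, so a document never matches itself.
        best = max((j for j in range(len(names)) if j != i),
                   key=lambda j: cosine(vectors[i], vectors[j]))
        result.append((name, names[best]))
    return result


# Toy topic vectors standing in for LSI projections.
names = ["A", "B", "C"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(most_similar_other(names, vectors))
```

Excluding the index directly, rather than assuming the document itself always ranks first, also behaves sensibly when two articles have identical vectors.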

In [61]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
