# Document Similarity Analysis

### I authored this code as part of an assignment for the master's level Python course I took at Columbia University. 

### This example uses Wikipedia’s List of Ancient Greek Philosophers to compare philosophers' biographies and identify, for each philosopher, the one whose biography is most similar. The original page can be found here:
#### https://en.wikipedia.org/wiki/List_of_ancient_Greek_philosophers

### My code is broken down into three parts:

- First, I scrape the philosophers' names from the Wikipedia list page and construct the list of article files that make up the corpus.
- Next, I write a function that returns the text content of an article given only its file path.
- Finally, I build the LSI model to match every philosopher to the most similar other philosopher, based on their Wikipedia biographies.
 
 
### Skills used: 
- Web Scraping
- LSI/LSA


### Installs, loading libraries and necessary methods 

In [None]:
!pip3 install gensim

In [None]:
import nltk
import os

from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize,word_tokenize 
from gensim import corpora, models, similarities
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.similarities.docsim import Similarity

## Part One

### Write a function that takes the file name of the Wikipedia page containing all Ancient Greek Philosophers (saved as "Index.html" in my workspace) and returns a list of tuples, each containing the name of a philosopher and the path to that philosopher's individual article file.

### Expected output: A list of tuples containing the name of the philosopher and the path to their biography file.
[('Acrion', 'Philosophers/Acrion.html'),
 ('Adrastus of Aphrodisias', 'Philosophers/Adrastus of Aphrodisias.html'),
 ('Aedesia', 'Philosophers/Aedesia.html'),
 ('Aedesius', 'Philosophers/Aedesius.html'),
 ('Aeneas of Gaza', 'Philosophers/Aeneas of Gaza.html'),
 ('Aenesidemus', 'Philosophers/Aenesidemus.html'),
 ...]
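
To make the row-by-row extraction concrete, here is a minimal sketch that runs on a tiny inline table instead of the saved Wikipedia page. The HTML snippet is a made-up stand-in for the real markup, and the sketch uses Python's built-in `html.parser` rather than `lxml` so that it has one less dependency; both choices are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one table of the list page.
html = """
<table><tbody>
  <tr><th>Name</th></tr>
  <tr><td><a href="/wiki/Acrion" title="Acrion">Acrion</a></td></tr>
  <tr><td><a href="/wiki/Aedesia" title="Aedesia">Aedesia</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
pairs = []
for row in soup.find("tbody").find_all("tr"):
    link = row.find("a")
    # Header rows have no link, so they are skipped.
    if link is None or not link.get("title"):
        continue
    name = link.get("title")
    pairs.append((name, "Philosophers/" + name + ".html"))

print(pairs)
```

Each tuple pairs the link's `title` attribute with a path built from it, which is the same shape as the expected output above.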

In [None]:
import codecs
from bs4 import BeautifulSoup

In [None]:
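def get_philosophers(filename):
    # Parse the saved list page; the context manager closes the file afterwards.
    with codecs.open(filename, 'r', 'utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    filenames = []
    table_body = soup.find('tbody')
    for row in table_body.find_all('tr'):
        # The first link in a row carries the philosopher's name in its title attribute.
        link = row.find('a')
        if link is None or not link.get('title'):
            continue  # skip header rows and rows without a titled link
        philosopher_name = link.get('title')
        philosopher_link = "Philosophers/" + philosopher_name + ".html"
        filenames.append((philosopher_name, philosopher_link))

    return filenames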
                        
    
# Once done, try this:
filenames = get_philosophers("philosophers.html")
filenames

## Part Two

### This section scrapes the text on a philosopher's page and returns it as a text string. The input is the name of the file that contains the philosopher's page.

### For example: 
get_text('Philosophers/Acrion.html') will output the text of the page.
'Acrion was a Locrian and a Pythagorean philosopher...'
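
As a small illustration of the paragraph extraction, the sketch below collects the text of every `<p>` tag from an inline HTML string. The markup is a hypothetical stand-in for a saved article, and `html.parser` is used here in place of `lxml` purely to keep the example light.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two paragraphs separated by non-paragraph content.
html = (
    "<html><body><h1>Acrion</h1>"
    "<p>Acrion was a Locrian.</p>"
    "<div>Navigation</div>"
    "<p>He was a Pythagorean philosopher.</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Direct concatenation keeps only what is inside the <p> tags.
text = "".join(paragraphs)
# Joining with a space avoids gluing the last word of one paragraph
# to the first word of the next.
spaced_text = " ".join(paragraphs)

print(text)
print(spaced_text)
```

The heading and the `<div>` are ignored because only `<p>` tags are visited. Whether a separator is needed depends on the page: paragraphs that already end in a newline will not run together.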


In [None]:
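def get_text(file):
    # Parse the saved article; the context manager closes the file afterwards.
    with codecs.open(file, 'r', 'utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # Concatenate the text of every paragraph tag on the page.
    return "".join(tag.get_text() for tag in soup.find_all('p'))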


# Once done, try this:
get_text("Philosophers/Agathobulus.html")

## Part Three

### This section uses the files found in the "Philosophers" folder to construct an LSA model.  The LSA model is then used to find the most similar philosopher for each of the philosophers found in Part One, based on the content of their Wikipedia articles. 
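
To show the idea behind LSA/LSI on a small scale, here is a rough sketch that uses plain NumPy SVD rather than gensim's `LsiModel`. The four toy documents are invented for the example and are not taken from the Wikipedia articles; the choice of `k = 2` topics is likewise an assumption.

```python
import numpy as np

# Toy corpus (hypothetical text).
docs = [
    "stoic philosopher taught ethics logic",
    "stoic school taught ethics virtue",
    "mathematician studied geometry and numbers",
    "geometry numbers and astronomy",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for t in tokens for w in t})

# Term-document count matrix: one row per term, one column per document.
A = np.array([[t.count(w) for t in tokens] for w in vocab], dtype=float)

# Truncated SVD: keep only the k strongest latent topics.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

# Cosine similarity between every pair of documents in topic space.
unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
sims = unit @ unit.T

print(np.round(sims, 2))
```

In this toy corpus the two "stoic" documents end up close together in topic space and far from the two "geometry" documents, which is the behaviour the full model relies on.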

Note: In this section, I do not go online to scrape the data; everything needed is in this Jupyter notebook working directory.

The function takes the list of tuples created in Part One as its input.

The output is also a list of tuples. Each tuple contains a philosopher's name and the name of the most similar other philosopher, so the two names in a tuple are never the same.

#### The output looks like:

[('Acrion', 'Athenodoros Cananites'),

 ('Adrastus of Aphrodisias', 'Andronicus of Rhodes'),
 
 ('Aedesia', 'Ammonius of Athens'),
 
 ('Aedesius', 'Arete of Cyrene'),
 
 ('Aeneas of Gaza', 'Ammonius Hermiae'),
 
 ...]
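
The "most similar other philosopher" rule can be sketched independently of the model. The helper below, `most_similar_other`, is a hypothetical name, and the names and scores are made-up values chosen only to show the selection logic.

```python
def most_similar_other(names, sims):
    """Pair each name with the other name that has the highest score."""
    pairs = []
    for i, name in enumerate(names):
        # Consider every candidate except the item itself.
        candidates = [j for j in range(len(names)) if j != i]
        best = max(candidates, key=lambda j: sims[i][j])
        pairs.append((name, names[best]))
    return pairs


# Hypothetical similarity matrix: sims[i][j] is the score between i and j.
names = ["A", "B", "C"]
sims = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

result = most_similar_other(names, sims)
print(result)
```

Excluding the item's own index explicitly is safer than assuming the item always ranks first, since ties or empty documents can change the ordering.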



In [None]:
def run(filenames):
    # Read each article once and flatten newlines so every document is one line.
    documents = [get_text(path).replace('\n', ' ') for _, path in filenames]

    # Tokenize: lowercase, split on whitespace, drop stop words and
    # any token that is not purely alphanumeric.
    stoplist = set('for a of the and to in'.split())
    texts = [
        [word for word in document.lower().split()
         if word not in STOPWORDS and word not in stoplist and word.isalnum()]
        for document in documents
    ]

    # Map tokens to ids and convert each document to a bag-of-words vector.
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=10)

    # Build the similarity index once instead of once per query.
    index = similarities.MatrixSimilarity(lsi[corpus])

    most_similar = []
    for x, (name, _) in enumerate(filenames):
        sims = index[lsi[corpus[x]]]  # similarity of document x to every document
        ranked = sorted(enumerate(sims), key=lambda item: -item[1])
        # Take the best match that is not the philosopher's own article.
        best = next(i for i, _ in ranked if i != x)
        most_similar.append((name, filenames[best][0]))

    return most_similar


# Once done, try this:
run(filenames)
