<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s List of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists of
musicians (e.g., List of big band musicians). You can pick any five
you like but make sure that the list has the words “musicians” in
it and that the list has at least 30 musicians listed
<li>Collect the urls of all the musicians on those five pages and place them in a list
<li>Grab the content of each musician in the list and place them in a list (of documents)
<li>Build an LSI model using this data. This is your "reference" data set
<li>Now grab another list of musicians from wikipedia and create a new list of documents using the detail from each musicians page. This is your "musician" data set
<li>For each musician in the new list, find the musician in the reference data set that is the closest in similarity. 
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set
</ol>
<h4>Use the code below to build your solution

<p><span style="color:blue">get_musicians</span>: A function that, given a "list of musicians" url, returns a list containing the names of the musicians and the urls for their wikipedia pages
<p>non_musician_finder tries its best to remove links that are not musician links from the page (not perfect, but good enough!)

In [1]:
def get_musicians(url):
    from bs4 import BeautifulSoup
    import requests
    page_soup = BeautifulSoup(requests.get(url).content,'lxml')
    li_tags = page_soup.find_all('li')
    all_musicians = list()
    for tag in li_tags:
        if tag.get('id'):
            continue

        try:
            tag.find('sup',class_="reference")
            link = tag.find('a').get('href')
            name = tag.find('a').get_text()
            if "/wiki/" in link and non_musician_finder(link):
                all_musicians.append((name,"https://en.wikipedia.org" + link))
        except:
            pass
    return all_musicians

def non_musician_finder(link):
    non_musician_words = ['Category','Template','Portal','List','File','Template','Special','Main','Help','User']
    for word in non_musician_words:
        if word in link:
            return False
    return True

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_klezmer_musicians"
get_musicians(url)

[('Michael Alpert', 'https://en.wikipedia.org/wiki/Michael_Alpert'),
 ('József Balogh', 'https://en.wikipedia.org/wiki/J%C3%B3zsef_Balogh'),
 ('Shloimke (Sam) Beckerman',
  'https://en.wikipedia.org/wiki/Shloimke_(Sam)_Beckerman'),
 ('Sidney Beckerman',
  'https://en.wikipedia.org/wiki/Sidney_Beckerman_(musician)'),
 ('Ofer Ben-Amots', 'https://en.wikipedia.org/wiki/Ofer_Ben-Amots'),
 ('Alan Bern', 'https://en.wikipedia.org/wiki/Alan_Bern'),
 ('Geoff Berner', 'https://en.wikipedia.org/wiki/Geoff_Berner'),
 ('Naftule Brandwein', 'https://en.wikipedia.org/wiki/Naftule_Brandwein'),
 ('Stuart Brotman', 'https://en.wikipedia.org/wiki/Veretski_Pass_(band)'),
 ('Don Byron', 'https://en.wikipedia.org/wiki/Don_Byron'),
 ('Brian Choper', 'https://en.wikipedia.org/wiki/Brian_Choper'),
 ('Adrienne Cooper', 'https://en.wikipedia.org/wiki/Adrienne_Cooper'),
 ('Abe Elenkrieg', 'https://en.wikipedia.org/wiki/Abe_Elenkrieg'),
 ('Giora Feidman', 'https://en.wikipedia.org/wiki/Giora_Feidman'),
 ('German 

<h4>get_musician_text(url): returns the page text of the wikipedia page associated with a musician</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (musician, url) pair from our musicians list

In [3]:
def get_musician_text(url):
    from bs4 import BeautifulSoup
    import requests
    all_text = ''
    try:
        page_soup = BeautifulSoup(requests.get(url).content,'lxml')
        for p_tag in page_soup.find_all('p'):
            all_text += p_tag.get_text()
    except:
        return None
    return all_text

<h4>testing get_musician_text</h4>

In [4]:
url = "https://en.wikipedia.org/wiki/Ofer_Ben-Amots"
get_musician_text(url)

'Ofer Ben-Amots (Hebrew: עופר בן-אמוץ; born October 20, 1955) is an Israeli-American composer and teacher of music composition and theory at Colorado College. His music is inspired by Jewish folklore of Eastern-European Yiddish and Judeo-Spanish Ladino traditions. The interweaving of folk elements with contemporary textures creates the dynamic tension that permeates and defines Ben-Amots\' musical language.[1]\nBorn in Haifa, Israel, Ofer Ben-Amots gave his first piano concert at age nine and at age sixteen was awarded first prize in the Chet Piano Competition. Later, following composition studies with Joseph Dorfman at Tel Aviv University, he was invited to study at the Conservatoire de Musique in Geneva, Switzerland. There he studied with Pierre Wismer and privately with Alberto Ginastera. Ben-Amots is an alumnus of the Hochschule für Musik in Detmold, Germany, where he studied with Martin C. Redel and Dietrich Manicke and graduated with degrees in composition, music theory, and pian

<p><span style="color:blue">get_all_musicians</span>: A function that, given a list of genres, returns a list containing the names of the musicians and the urls for their wikipedia pages associated with that list of genres
<p>The function should return a list of (name,url) pairs for all the musicians in the list of genres
<p>You need to:
<ol>
<li>iterate through the list of genres
<li>initialize a list "all_musicians"
<li>construct a url for the list of musicians (I've done these first three steps for you)
<li>call get_musicians for that url
<li>extend all_musicians by what get_musicians returns

In [5]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
    
        #My code here
        musicians = get_musicians(url)
        for musician in musicians:
#             m_url = musician[1]
#             musician_text = get_musician_text(m_url)
            all_musicians.append(musician)
    
    return all_musicians

<h4>Example of how to use get_all_musicians</h4>

In [88]:
genre_list = ['klezmer_musicians','British_blues_musicians']
all_musicians = get_all_musicians(genre_list)
len(all_musicians)

191

<p><span style="color:blue">get_all_musician_docs</span>: A function that, given the list of (musician,url) pairs, returns two lists, a list of musicians and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_musicians list
<li>extract the name and the url of the musician
<li>get the text using the get_musician_text() function
<li>if the function returns None, ignore it and move to the next musician
<li>otherwise, append the name ot the musician_names list and the text to the musician_texts list
<li>return musician_names and musician_texts


In [89]:
def get_all_musician_docs(all_musicians):
    musician_names = list()
    musician_texts = list()
    for musician in all_musicians:
        name = musician[0]
        url = musician[1]
        #Your code here
        
        text = get_musician_text(url)
        
        musician_names.append(name)
        musician_texts.append(text)
        
    return musician_names,musician_texts
   

In [90]:
musician_names, musician_texts = get_all_musician_docs(all_musicians)

In [91]:
all_musicians

[('Michael Alpert', 'https://en.wikipedia.org/wiki/Michael_Alpert'),
 ('József Balogh', 'https://en.wikipedia.org/wiki/J%C3%B3zsef_Balogh'),
 ('Shloimke (Sam) Beckerman',
  'https://en.wikipedia.org/wiki/Shloimke_(Sam)_Beckerman'),
 ('Sidney Beckerman',
  'https://en.wikipedia.org/wiki/Sidney_Beckerman_(musician)'),
 ('Ofer Ben-Amots', 'https://en.wikipedia.org/wiki/Ofer_Ben-Amots'),
 ('Alan Bern', 'https://en.wikipedia.org/wiki/Alan_Bern'),
 ('Geoff Berner', 'https://en.wikipedia.org/wiki/Geoff_Berner'),
 ('Naftule Brandwein', 'https://en.wikipedia.org/wiki/Naftule_Brandwein'),
 ('Stuart Brotman', 'https://en.wikipedia.org/wiki/Veretski_Pass_(band)'),
 ('Don Byron', 'https://en.wikipedia.org/wiki/Don_Byron'),
 ('Brian Choper', 'https://en.wikipedia.org/wiki/Brian_Choper'),
 ('Adrienne Cooper', 'https://en.wikipedia.org/wiki/Adrienne_Cooper'),
 ('Abe Elenkrieg', 'https://en.wikipedia.org/wiki/Abe_Elenkrieg'),
 ('Giora Feidman', 'https://en.wikipedia.org/wiki/Giora_Feidman'),
 ('German 

In [27]:
for name, text in zip(musician_names, musician_texts):
    text_file = open(name, "w", encoding='utf-8')
    print(name)
    text_file.write(text)
    text_file.close()

Michael Alpert
József Balogh
Shloimke (Sam) Beckerman
Sidney Beckerman
Ofer Ben-Amots
Alan Bern
Geoff Berner
Naftule Brandwein
Stuart Brotman
Don Byron
Brian Choper
Adrienne Cooper
Abe Elenkrieg
Giora Feidman
German Goldenshteyn
David Julian Gray
Elaine Hoffman-Watts
Alex Jacobowitz
David Krakauer
César Lerner
Margot Leverett
Frank London
Joseph Moskowitz
Hankus Netsky
Moni Ovadia
Pete Rushefsky
Henry Sapoznik
Abe Schwartz
Elizabeth Schwartz
Cookie Segelstein
Andy Statman
Yale Strom
Alicia Svigals
Dave Tarras
Ginger Baker
Long John Baldry
Chris Barber
Norman Beaker
Jeff Beck
Duster Bennett
Graham Bond
Marcus Bonfanti
John Bonham
Geoff Bradford
Jack Bruce
Danny Bryant
Eric Burdon
Eric Clapton
Cyril Davies
Chris Farlowe
Mick Fleetwood
Peter Green
Mick Jagger
Brian Jones
Laurence Jones
Paul Jones
Wizz Jones
Jo Ann Kelly
Dave Kelly
Danny Kirwan
Alexis Korner
Paul Kossoff
Hugh Laurie
Alvin Lee
Bernie Marsden
John Mayall
Chantel McGregor
Tony McPhee
John McVie
Micky Moody
Gary Moore
Billy Ni

<h4>Example of how to use get_all_musician_docs</h4>

In [92]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)
reference_names

['Michael Alpert',
 'József Balogh',
 'Shloimke (Sam) Beckerman',
 'Sidney Beckerman',
 'Ofer Ben-Amots',
 'Alan Bern',
 'Geoff Berner',
 'Naftule Brandwein',
 'Stuart Brotman',
 'Don Byron',
 'Brian Choper',
 'Adrienne Cooper',
 'Abe Elenkrieg',
 'Giora Feidman',
 'German Goldenshteyn',
 'David Julian Gray',
 'Elaine Hoffman-Watts',
 'Alex Jacobowitz',
 'David Krakauer',
 'César Lerner',
 'Margot Leverett',
 'Frank London',
 'Joseph Moskowitz',
 'Hankus Netsky',
 'Moni Ovadia',
 'Pete Rushefsky',
 'Henry Sapoznik',
 'Abe Schwartz',
 'Elizabeth Schwartz',
 'Cookie Segelstein',
 'Andy Statman',
 'Yale Strom',
 'Alicia Svigals',
 'Dave Tarras',
 'Ginger Baker',
 'Long John Baldry',
 'Chris Barber',
 'Norman Beaker',
 'Jeff Beck',
 'Duster Bennett',
 'Graham Bond',
 'Marcus Bonfanti',
 'John Bonham',
 'Geoff Bradford',
 'Jack Bruce',
 'Danny Bryant',
 'Eric Burdon',
 'Eric Clapton',
 'Cyril Davies',
 'Chris Farlowe',
 'Mick Fleetwood',
 'Peter Green',
 'Mick Jagger',
 'Brian Jones',
 'Lau

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number

In [93]:
import nltk
from nltk.corpus import PlaintextCorpusReader
# from nltk.book import *
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from collections import OrderedDict
import pprint
import gensim.summarization
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities

# get the corpora and the dictionaries
root = 'C:\\code\\python\\Assignments\\All_musicians'
data = '.*'
musician_data = PlaintextCorpusReader(root, data)
musician_data

<PlaintextCorpusReader in 'C:\\code\\python\\Assignments\\All_musicians'>

In [94]:
doc_list = [musician_data]
all_text = musician_data.raw()

documents = [doc.raw() for doc in doc_list]
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [95]:
# make the LSI model
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)

<h3>Construct the "musician" data set</h3>
<h4>Example</h4>

In [96]:
musician_genre_list = ['acid_rock_artists']
musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(musicians)

<h4>find the most similar musicians for each new musician from our reference data set</h4>

In [99]:
table_data = list()
for index,musician in enumerate(musician_docs):
    
    doc = musician
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    i = similarities.MatrixSimilarity(lsi[corpus])
    sims = i[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])

    most_similar_musician = sims[0][0]
    
    table_data.append((musician_names[index],reference_names[most_similar_musician]))
    
print(table_data)
    

  if np.issubdtype(vec.dtype, np.int):


[('The 13th Floor Elevators', 'Michael Alpert'), ('Alice Cooper', 'Michael Alpert'), ('The Amboy Dukes', 'Michael Alpert'), ('Amon Düül', 'Michael Alpert'), ('Big Brother and the Holding Company', 'Michael Alpert'), ('Black Sabbath', 'Michael Alpert'), ('Blue Cheer', 'Michael Alpert'), ('Blues Magoos', 'Michael Alpert'), ('The Charlatans', 'Michael Alpert'), ('Count Five', 'Michael Alpert'), ('Country Joe and the Fish', 'Michael Alpert'), ('Coven', 'Michael Alpert'), ('Cream', 'Michael Alpert'), ('Deep Purple', 'Michael Alpert'), ('The Deviants', 'Michael Alpert'), ('The Doors', 'Michael Alpert'), ('The Electric Prunes', 'Michael Alpert'), ('The Fugs', 'Michael Alpert'), ('Grateful Dead', 'Michael Alpert'), ('The Great Society', 'Michael Alpert'), ('The Groundhogs', 'Michael Alpert'), ('Hawkwind', 'Michael Alpert'), ('Iron Butterfly', 'Michael Alpert'), ('Jefferson Airplane', 'Michael Alpert'), ('The Jimi Hendrix Experience', 'Michael Alpert'), ('Janis Joplin', 'Michael Alpert'), ('JPT