<h1>Document Similarity using LSI - Vidhi Agrawal</h1>

<ol>
<li>From Wikipedia’s Lists of writers page (https://en.wikipedia.org/wiki/Lists_of_writers), pick five lists of
writers (e.g., List of detective fiction authors). You can pick any five
you like, but make sure that each list has at least 30 writers
<li>Collect the urls of all the writers on those five pages and place them in a list
<li>Grab the content of each writer in the list and place them in a list (of documents)
<li>Build an LSI model using this data. This is your "reference" data set
<li>Now grab another list of writers from wikipedia and create a new list of documents using the detail from each writers page. This is your "writer" data set
<li>For each writer in the new list, find the writer in the reference data set that is the least close in similarity (with a similarity not lower than 0.6).
<li>Print a table that contains each writer from the writer data set and the most similar writer from the reference data set
</ol>
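<p>Step 6 can be read as a two-stage selection: discard every reference writer whose similarity is below 0.6, then take the smallest score that remains. The sketch below illustrates that reading on made-up data. The helper name <code>least_similar_above</code> and the sample scores are hypothetical and are not part of the assignment.

```python
def least_similar_above(scored_writers, threshold=0.6):
    """Return the (name, score) pair with the lowest score that is still
    >= threshold, or None when no pair reaches the threshold."""
    eligible = [pair for pair in scored_writers if pair[1] >= threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda pair: pair[1])


# Hypothetical scores for one writer against four reference writers
sample_scores = [("A", 0.95), ("B", 0.61), ("C", 0.42), ("D", 0.78)]
print(least_similar_above(sample_scores))
```

<p>Returning None for the empty case keeps the caller from crashing when no reference writer reaches the threshold.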


<p><span style="color:blue">get_writers</span>: A function that, given a "list of writers" url, returns a list containing the names of the writers and the urls for their wikipedia pages
<p>non_writer_finder tries its best to remove links that are not writer links from the page (not perfect, but good enough!)

In [1]:
def get_writers(url):
    from bs4 import BeautifulSoup
    import requests
    page_soup = BeautifulSoup(requests.get(url).content,'lxml')
    li_tags = page_soup.find_all('li')
    all_writers = list()
    for tag in li_tags:
        # Skip list items that have an id attribute
        if tag.get('id'):
            continue

        # Use the first link in the list item; skip items without one
        anchor = tag.find('a')
        if anchor is None:
            continue
        link = anchor.get('href')
        name = anchor.get_text()
        if link and "/wiki/" in link and non_writer_finder(link):
            all_writers.append((name,"https://en.wikipedia.org" + link))
    return all_writers

def non_writer_finder(link):
    non_writer_words = ['Category','Template','Portal','List','File','Special','Main','Help','User','https']
    for word in non_writer_words:
        if word in link:
            return False
    return True
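
<p>One reason the filter is "not perfect" is that it matches substrings anywhere in the link, so a link such as /wiki/Mark_Helprin is rejected because it contains 'Help'. The sketch below shows one possible alternative that only looks at the namespace prefix before a colon. The function name <code>looks_like_article</code> and the namespace set are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import unquote

# Namespace prefixes that mark non-article pages (an illustrative subset)
NON_ARTICLE_NAMESPACES = {"Category", "Template", "Portal", "File",
                          "Special", "Help", "User", "Wikipedia", "Talk"}


def looks_like_article(link):
    """Return True for links of the form /wiki/<title> whose title has no
    known namespace prefix and is not a list page or the main page."""
    if not link.startswith("/wiki/"):
        return False  # external or non-article link
    title = unquote(link[len("/wiki/"):])
    namespace, sep, _ = title.partition(":")
    if sep and namespace in NON_ARTICLE_NAMESPACES:
        return False
    return not title.startswith(("List_of_", "Lists_of_", "Main_Page"))


print(looks_like_article("/wiki/Mark_Helprin"))
print(looks_like_article("/wiki/Help:Contents"))
print(looks_like_article("/wiki/List_of_mystery_writers"))
```

<p>This is still a heuristic: it cannot tell a writer's page from any other article that happens to be linked from a list item.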

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_detective_fiction_authors"
get_writers(url)

[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https://en.wikipedia.org/wiki/M._C._Beaton'),
 ('E. C. Bentley', 'https://en.wikipedia.org/wiki/Edmund_Clerihew_Bentley'),
 ('Larry Beinhart', 'https://en.wikipedia.

<h4>get_writer_text(url): returns the page text of the wikipedia page associated with a writer</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (writer, url) pair from our writers list

In [3]:
def get_writer_text(url):
    from bs4 import BeautifulSoup
    import requests
    all_text = ''
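    try:
        page_soup = BeautifulSoup(requests.get(url).content,'lxml')
        # Concatenate the text of every paragraph on the page
        for p_tag in page_soup.find_all('p'):
            all_text += p_tag.get_text()
    except Exception:
        # Any download or parsing failure means no usable text for this writer
        return None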
    return all_text

<h4>testing get_writer_text</h4>

In [4]:
url = "https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)"
get_writer_text(url)

'\nKate Atkinson MBE (born 20 December 1951) is an English writer of novels, plays and short stories.[1] She is known for creating the Jackson Brodie series of detective novels, which has been adapted into the BBC One series Case Histories.[1][2] She won the Whitbread Book of the Year prize in 1995 in the Novels category for Behind the Scenes at the Museum, winning again in 2013 and 2015 under its new name the Costa Book Awards.[1]\nThe daughter of a shopkeeper, Atkinson was born in York, the setting for several of her books.[3] She studied English literature at the University of Dundee, gaining her master\'s degree in 1974.[1] Atkinson subsequently studied for a doctorate in American literature, with a thesis titled "The post-modern American short story in its historical context".[3] She failed at the viva (oral examination) stage. After leaving the university, she took on a variety of jobs, from home help to legal secretary and teacher.[4]\nHer first novel, Behind the Scenes at the M

<p><span style="color:blue">get_all_writers</span>: A function that, given a list of genres, returns a list containing the names of the writers and the urls for their wikipedia pages associated with that list of genres
<p>The function should return a list of (name,url) pairs for all the writers in the list of genres
<p>You need to:
<ol>
<li>initialize a list "all_writers"
<li>iterate through the list of genres
<li>construct a url for the list of writers (I've done these first three steps for you)
<li>call get_writers for that url
<li>extend all_writers by what get_writers returns
</ol>

In [5]:
def get_all_writers(genre_list):
    all_writers = list()
    for genre in genre_list:
        url = f"https://en.wikipedia.org/wiki/List_of_{genre}"
        genre_specific_writers=get_writers(url)
        all_writers.extend(genre_specific_writers)
    return all_writers

<h4>Example of how to use get_all_writers</h4>

In [6]:
genre_list = ['detective_fiction_authors', 'romantic_novelists', 'mystery_writers','thriller_writers', 'political_authors']
all_writers = get_all_writers(genre_list)
all_writers

[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https://en.wikipedia.org/wiki/M._C._Beaton'),
 ('E. C. Bentley', 'https://en.wikipedia.org/wiki/Edmund_Clerihew_Bentley'),
 ('Larry Beinhart', 'https://en.wikipedia.

<p><span style="color:blue">get_all_writer_docs</span>: A function that, given the list of (writer,url) pairs, returns two lists, a list of writers and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_writers list
<li>extract the name and the url of the writer
<li>get the text using get_writer_text
<li>if the function returns None, ignore it and move to the next writer
<li>otherwise, append the name to the writer_names list and the text to the writer_texts list
<li>return writer_names and writer_texts
</ol>


In [7]:
def get_all_writer_docs(all_writers):
    writer_names = list()
    writer_texts = list()
    for name, url in all_writers:
        writer_doc = get_writer_text(url)
        # Skip writers whose page could not be downloaded or parsed
        if writer_doc is not None:
            writer_names.append(name)
            writer_texts.append(writer_doc)
    return writer_names,writer_texts

<h4>Example of how to use get_all_writer_docs</h4>

In [8]:
reference_names,reference_docs = get_all_writer_docs(all_writers)
print(len(reference_names),len(reference_docs))

1881 1881


In [9]:
reference_names

['Mario Acevedo',
 'Douglas Adams',
 'Margery Allingham',
 'Rudolfo Anaya',
 'Gosho Aoyama',
 'Frank Arnau',
 'Taku Ashibe',
 'Ace Atkins',
 'Kate Atkinson',
 'Yukito Ayatsuji',
 'Nevada Barr',
 'Earle Basinsky',
 'M. C. Beaton',
 'E. C. Bentley',
 'Larry Beinhart',
 'Earl Derr Biggers',
 'Cara Black',
 'Emery Bonett',
 'John Bonett',
 'Rhys Bowen',
 'Leigh Brackett',
 'P. J. Brackston',
 'Collin Brooks',
 'Lilian Jackson Braun',
 'John Burdett',
 'James Lee Burke',
 'Meg Cabot',
 'John Dickson Carr',
 'Jessie Chandler',
 'Raymond Chandler',
 'Leslie Charteris',
 'G. K. Chesterton',
 'Shunshin Chin',
 'Agatha Christie',
 'Carol Higgins Clark',
 'Barbara Cleverly',
 'Edmund Crispin',
 'Brian Cleeve',
 'Ann Cleeves',
 'Michael Collins',
 'Michael Connelly',
 'Patricia Cornwell',
 'Robert Crais',
 'Bill Crider',
 'Amanda Cross',
 'Chris Culver',
 'Clive Cussler',
 'Sharadindu Bandyopadhyay',
 'Jordan Dane',
 'Lindsey Davis',
 'Jeffery Deaver',
 'Ted Dekker',
 'Colin Dexter',
 'Graham Diam

In [10]:
reference_docs

['Mario Acevedo (born July 6, 1955) is an American novelist and artist, known for his series of urban fantasy novels featuring the vampire private investigator Felix Gomez.  He lives and works in Denver, Colorado.[1] Acevedo was born in El Paso, Texas.[2][page\xa0needed] Before becoming a published writer, Acevedo held jobs as a military helicopter pilot, paratrooper, infantry officer, engineer, art teacher, software programmer, and assorted others.[2] He was also deployed as a soldier and artist for the U.S. Army during Operation Desert Storm.[2]\n\nThis article about a novelist of the United States born in the 1950s is a stub. You can help Wikipedia by expanding it.',
 '\nDouglas Noël Adams (11 March 1952 – 11 May 2001) was an English author, humorist, and screenwriter, best known for The Hitchhiker\'s Guide to the Galaxy. Originally a 1978 BBC radio comedy, The Hitchhiker\'s Guide to the Galaxy developed into a "trilogy" of five books that sold more than 15\xa0million copies in his 

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number
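
<p>To make the three terms concrete, the sketch below builds "texts", a "dictionary", and a "corpus" by hand for two toy documents, using only the standard library. It is meant to show the shape of the data conceptually; the ids that gensim assigns may differ, and the toy documents and the name <code>token_to_id</code> are made up for illustration.

```python
from collections import Counter

toy_docs = ["English writer of detective novels",
            "American writer of western novels and short stories"]

# texts: one list of lowercase tokens per document
texts = [doc.lower().split() for doc in toy_docs]

# dictionary: a mapping from each distinct token to an integer id
token_to_id = {}
for text in texts:
    for token in text:
        token_to_id.setdefault(token, len(token_to_id))

# corpus: each document as a sorted list of (token_id, count) pairs
corpus = [sorted(Counter(token_to_id[t] for t in text).items())
          for text in texts]

print(token_to_id)
print(corpus)
```

<p>The LSI model is then fitted on the corpus, so each document ends up as a short vector of topic weights instead of a long vector of word counts.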

In [11]:
#Code for LSI model goes here
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, STOPWORDS
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities

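# Strip surrounding whitespace, then split on whitespace and keep only
# lowercase tokens that are alphanumeric and not stopwords
documents = [doc.strip() for doc in reference_docs]
texts = [[word for word in document.lower().split()
          if word not in STOPWORDS and word.isalnum()]
         for document in documents]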
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=3)
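
<p>One side effect of splitting on whitespace and then keeping only tokens that pass isalnum() is that any word with punctuation attached, such as "novels," or "stories.[1]", is dropped entirely. The sketch below compares that approach with a regex-based alternative on one sentence taken from the scraped text shown earlier. The regex tokenizer is an assumed alternative, not what the notebook uses, and <code>STOP</code> is a tiny hypothetical stopword set standing in for wordcloud's STOPWORDS.

```python
import re

sample = "She is an English writer of novels, plays and short stories.[1]"
STOP = {"she", "is", "an", "of", "and"}  # tiny illustrative stopword set

# Approach used in the cell above: whitespace split, keep alphanumeric tokens
split_tokens = [w for w in sample.lower().split()
                if w not in STOP and w.isalnum()]

# Alternative: remove citation markers like [1], then extract letter/digit runs
no_citations = re.sub(r"\[\d+\]", "", sample.lower())
regex_tokens = [w for w in re.findall(r"[a-z0-9]+", no_citations)
                if w not in STOP]

print(split_tokens)   # 'novels,' and 'stories.[1]' are dropped
print(regex_tokens)   # 'novels' and 'stories' are kept
```

<p>Note that the character class [a-z0-9] would also break up non-ASCII names such as "Noël", so a real pipeline may need a broader pattern.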

In [12]:
list(dictionary.iteritems())

[(0, '1950s'),
 (1, 'acevedo'),
 (2, 'american'),
 (3, 'army'),
 (4, 'art'),
 (5, 'article'),
 (6, 'artist'),
 (7, 'assorted'),
 (8, 'becoming'),
 (9, 'born'),
 (10, 'deployed'),
 (11, 'desert'),
 (12, 'el'),
 (13, 'expanding'),
 (14, 'fantasy'),
 (15, 'featuring'),
 (16, 'felix'),
 (17, 'held'),
 (18, 'helicopter'),
 (19, 'help'),
 (20, 'infantry'),
 (21, 'investigator'),
 (22, 'jobs'),
 (23, 'july'),
 (24, 'known'),
 (25, 'lives'),
 (26, 'mario'),
 (27, 'military'),
 (28, 'novelist'),
 (29, 'novels'),
 (30, 'operation'),
 (31, 'private'),
 (32, 'published'),
 (33, 'series'),
 (34, 'software'),
 (35, 'soldier'),
 (36, 'states'),
 (37, 'united'),
 (38, 'urban'),
 (39, 'vampire'),
 (40, 'wikipedia'),
 (41, 'works'),
 (42, '10'),
 (43, '11'),
 (44, '12'),
 (45, '13'),
 (46, '15'),
 (47, '16'),
 (48, '17'),
 (49, '18610'),
 (50, '1952'),
 (51, '1959'),
 (52, '1970'),
 (53, '1974'),
 (54, '1976'),
 (55, '1977'),
 (56, '1978'),
 (57, '1983'),
 (58, '1984'),
 (59, '1986'),
 (60, '1991'),
 (6

<h3>Construct the writer data set</h3>
<h4>Example</h4>

In [13]:
writer_genre_list = ['Western_fiction_authors']        
all_writers = get_all_writers(writer_genre_list)
writer_names, writer_docs = get_all_writer_docs(all_writers)

<h4>find the least similar writers with at least 0.6 similarity for each new writer from our reference data set</h4>
<li>Write code to print table_data after the for loop ends
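
<p>The scores returned by MatrixSimilarity are cosine similarities between vectors in topic space. The sketch below computes the same quantity by hand for hypothetical 3-topic vectors; the vectors are invented for illustration and do not come from the model in this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical 3-topic vectors for a query writer and two reference writers
query = [2.0, 0.5, 0.1]
ref_close = [4.0, 1.0, 0.2]    # same direction as query, different length
ref_far = [0.1, 0.2, 3.0]      # dominated by a different topic

print(round(cosine_similarity(query, ref_close), 4))
print(round(cosine_similarity(query, ref_far), 4))
```

<p>Because cosine similarity ignores vector length, two documents whose topic weights point in the same direction score close to 1 even if one is much longer. With only a few topics, many documents may end up nearly parallel, which is one plausible reason to experiment with the number of topics.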

In [14]:
table_data_min = list()
table_data_max = list()

# Build the similarity index over the reference corpus once, outside the loop
index = similarities.MatrixSimilarity(lsi[corpus])

# For every writer in the writer data set
for i, doc in enumerate(writer_docs):

    # Calculate similarity with each writer in the reference data set
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    sims = index[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])

    # Keep only reference writers with similarity >= 0.6
    writers_similar = [(writer, similarity) for writer, similarity in sims if similarity >= 0.6]
    if not writers_similar:
        # No reference writer reaches the threshold; skip this writer
        continue

    # Reference writer that is least similar while still above the threshold
    min_similarity_author = min(writers_similar, key=lambda x: x[1])

    # Reference writer that is most similar
    most_similar_author = max(writers_similar, key=lambda x: x[1])

    table_data_min.append((writer_names[i], min_similarity_author[1], reference_names[min_similarity_author[0]]))
    table_data_max.append((writer_names[i], most_similar_author[1], reference_names[most_similar_author[0]]))

table_data_min

[('Edward Abbey', 0.60937726, 'Claire Robyns'),
 ('Andy Adams', 0.6419992, 'Heather Graham'),
 ('William Lacey Amy', 0.60775954, 'Heather Graham'),
 ('Rudolfo Anaya', 0.6260803, 'Gheorghe Alexandrescu'),
 ('Todhunter Ballard', 0.60401464, 'Jean-Paul Sartre'),
 ('S. Omar Barker', 0.646691, 'Emma Richmond'),
 ('Rex Beach', 0.60420865, 'Karl Marx'),
 ('James Warner Bellah', 0.62159526, 'William F. Buckley Jr.'),
 ('Don Bendell', 0.6295767, 'Karl Marx'),
 ('Tom W. Blackburn', 0.61910653, 'Emma Richmond'),
 ('James Carlos Blake', 0.63592166, 'Gheorghe Alexandrescu'),
 ('William Blinn', 0.6429734, 'Christopher Buckley'),
 ('Stephen Bly', 0.644722, 'Agha Shorish Kashmiri'),
 ('Frank Bonham', 0.6076695, 'Jürgen Habermas'),
 ('Allan R. Bosworth', 0.60314715, 'John Stuart Mill'),
 ('Peter Bowen', 0.6127011, 'Agha Shorish Kashmiri'),
 ('B.M. Bower', 0.60221404, 'Gheorghe Alexandrescu'),
 ('Leigh Brackett', 0.61293584, 'John Stuart Mill'),
 ('Max Brand', 0.60568166, 'Karl Marx'),
 ('Lyle Brandt', 

In [15]:
table_data_max

[('Edward Abbey', 0.9999749, 'David Ignatius'),
 ('Andy Adams', 0.9999399, 'Jillian Hunter'),
 ('William Lacey Amy', 0.99997765, 'Ginny Aiken'),
 ('Rudolfo Anaya', 1.0000001, 'Rudolfo Anaya'),
 ('Todhunter Ballard', 0.9999424, 'Lisa Jackson'),
 ('S. Omar Barker', 0.9999941, 'Jill Christian'),
 ('Rex Beach', 0.99991167, 'Amitav Ghosh'),
 ('James Warner Bellah', 0.99996793, 'Karen Harper'),
 ('Don Bendell', 0.99999994, 'Don Bendell'),
 ('Tom W. Blackburn', 0.99972, 'Thriller fiction'),
 ('James Carlos Blake', 0.99996954, 'Douglas Adams'),
 ('William Blinn', 0.9999321, 'Peter James'),
 ('Stephen Bly', 0.9999753, 'Janet Caird'),
 ('Frank Bonham', 0.99998176, 'Carla Kelly'),
 ('Allan R. Bosworth', 0.99999225, 'Miriam Allen deFord'),
 ('Peter Bowen', 1.0, 'Peter Bowen'),
 ('B.M. Bower', 0.9999721, 'Peter May'),
 ('Leigh Brackett', 1.0, 'Leigh Brackett'),
 ('Max Brand', 0.99999255, "Brian D'Amato"),
 ('Lyle Brandt', 0.9999967, 'Ramsey Campbell'),
 ('Peter Brandvold', 0.99999005, 'Kathryn Ross
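
<p>The raw lists above can be turned into an aligned text table with plain string formatting. The sketch below is one way to do it; the helper name <code>format_table</code>, the column widths, and the header labels are assumptions, and the sample rows are copied from the table_data_min output shown earlier.

```python
def format_table(rows, headers=("Writer", "Similarity", "Reference writer")):
    """Return the rows as a list of aligned text lines."""
    lines = [f"{headers[0]:<25}{headers[1]:>12}  {headers[2]}"]
    lines.append("-" * 60)
    for writer, similarity, reference in rows:
        lines.append(f"{writer:<25}{similarity:>12.4f}  {reference}")
    return lines


# A few rows copied from the table_data_min output
sample_rows = [("Edward Abbey", 0.60937726, "Claire Robyns"),
               ("Andy Adams", 0.6419992, "Heather Graham"),
               ("Rex Beach", 0.60420865, "Karl Marx")]

for line in format_table(sample_rows):
    print(line)
```

<p>The same function can be called with table_data_min or table_data_max once the loop has finished.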

# Very simple sentiment analysis

In this part we will run a simple sentiment analysis to find out which writer has the most positive description.

Define a function simple_sentiment_analysis(writer_names, writer_docs) that takes as inputs the list of writer names and the parallel list of their descriptions.
The expected output is a list in which each element is a tuple containing the writer's name, the percentage of positive words in their description, and the percentage of negative words in their description.
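
As a small worked example of the two percentages, the sketch below counts positive and negative words in a ten-token description. The token list and both word sets are hypothetical; the real word lists come from the function provided further down.

```python
tokens = ["a", "brilliant", "but", "flawed", "and", "gloomy",
          "novel", "that", "won", "praise"]
positive = {"brilliant", "won", "praise"}   # hypothetical positive words
negative = {"flawed", "gloomy"}             # hypothetical negative words

# Percentage = matching tokens / all tokens * 100
pct_positive = round(100 * sum(t in positive for t in tokens) / len(tokens), 2)
pct_negative = round(100 * sum(t in negative for t in tokens) / len(tokens), 2)

print(pct_positive, pct_negative)
```

Three of the ten tokens are positive and two are negative, so the description scores 30.0 and 20.0.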

In [16]:
#Example output
"""
[('William Blinn', 0.81, 0.54),
 ('Stephen Bly', 0.75, 0.94),
 ('Frank Bonham', 3.73, 0.62)
 ...]
"""


To ensure results can be compared please use the following function to define your list of positive and negative words:

In [17]:
def get_pos_neg_words():
    def get_words(url):
        import requests
        words = requests.get(url).content.decode('latin-1')
        # Drop empty lines and any line containing ';' (comment lines)
        return [word for word in words.split('\n') if word and ';' not in word]
    # Get lists of positive and negative words
    p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
    n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    return positive_words,negative_words
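
The line filter inside get_words can be checked without a network connection by running it on a string. The sketch below does that; the contents of `sample_file` are invented to imitate a word list with comment lines and blank lines, and are not the actual file contents.

```python
# Made-up file contents: two comment lines, blank lines, and three words
sample_file = "; header comment\n;\n\nabound\nabounds\n\nabundance\n"

# Same rule as get_words: drop empty lines and lines containing ';'
words = [line for line in sample_file.split("\n")
         if line and ";" not in line]

# A set makes repeated membership tests fast when scoring many tokens
word_set = set(words)

print(words)
print("abound" in word_set)
```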

In [18]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vidhiagrawal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [19]:
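def simple_sentiment_analysis(writer_names, writer_docs):
    results = list()
    positive_words, negative_words = get_pos_neg_words()
    # Sets give fast membership tests
    positive_words, negative_words = set(positive_words), set(negative_words)
    for writer, doc in zip(writer_names, writer_docs):
        tokens = word_tokenize(doc)  # tokenize each description once
        if not tokens:
            # Empty description: report zero for both percentages
            results.append((writer, "0.00", "0.00"))
            continue
        cpos = sum(1 for word in tokens if word in positive_words)
        cneg = sum(1 for word in tokens if word in negative_words)
        results.append((writer, "%.2f" % (cpos / len(tokens) * 100), "%.2f" % (cneg / len(tokens) * 100)))
    return results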

In [20]:
simple_sentiment_analysis(writer_names, writer_docs)

[('Edward Abbey', '2.09', '2.40'),
 ('Andy Adams', '1.94', '1.94'),
 ('William Lacey Amy', '1.16', '1.62'),
 ('Rudolfo Anaya', '1.77', '1.11'),
 ('Todhunter Ballard', '1.60', '3.19'),
 ('S. Omar Barker', '1.20', '0.13'),
 ('Rex Beach', '2.66', '1.60'),
 ('James Warner Bellah', '0.67', '0.67'),
 ('Don Bendell', '2.27', '1.14'),
 ('Tom W. Blackburn', '1.45', '0.66'),
 ('James Carlos Blake', '2.19', '1.10'),
 ('William Blinn', '0.81', '0.54'),
 ('Stephen Bly', '0.75', '0.94'),
 ('Frank Bonham', '3.73', '0.62'),
 ('Allan R. Bosworth', '1.89', '0.63'),
 ('Peter Bowen', '2.61', '0.87'),
 ('B.M. Bower', '1.60', '1.49'),
 ('Leigh Brackett', '1.37', '2.24'),
 ('Max Brand', '0.26', '0.79'),
 ('Lyle Brandt', '5.62', '0.62'),
 ('Peter Brandvold', '0.88', '0.44'),
 ('Matt Braun', '1.62', '0.00'),
 ('Dee Brown', '2.62', '0.59'),
 ('Anthony Burgess', '1.31', '1.62'),
 ('Walter Noble Burns', '0.33', '1.64'),
 ('Daniel Carlson ', '2.00', '0.67'),
 ('David Wynford Carnegie', '1.75', '1.63'),
 ('Forrest 
