# Towards recommender systems: Topic Modeling
This project starts from the idea of building a recommender system using Python. The final project will be a model able to recommend a book from a corpus based on its position in the vector space and the topics it is made of. <br>
Topic modeling is an unsupervised technique to extract the hidden topics or structures from large corpora. The topics are hidden because they are not clearly stated by the author(s) of the corpus. <br>
There are several algorithms that can be used for topic modeling, such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis); in this project, I will use LDA through its Gensim implementation. <br>
The project consists of four phases: 

1.   Data collection
2.   Data cleaning
3.   LDA implementation
4.   Recommender system implementation.

I chose this project because algorithms of this kind have many real-life applications, from filtering spam email to recommending products to customers based on what they have written about other products.


---



## 1. Data collection through scraping
The first phase consists of collecting the textual data that will make up the dataset. We will collect several .txt files of books from Project Gutenberg (https://www.gutenberg.org), a website that hosts books whose copyright has expired. The books are freely downloadable in several formats. 

### 1.1 Building the initial dictionary
To get the data, we will use the modules `requests` and `BeautifulSoup`: starting from the website's search page, we will sort the results by popularity. Each book in the list will be stored as a dictionary inside a bigger dictionary whose keys are the book IDs (taken from Project Gutenberg). Once the dictionary is complete, we will dump it to a JSON file, for which we need to import the module `json`. We will also need regular expressions to run some checks, so we add the module `re` to the import list.
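
The shape of this data structure can be sketched in a few lines. This is only an illustration: the three IDs are used as sample keys, and the names `book_ids`, `books_sketch`, `serialized` and `restored` are hypothetical and do not appear in the project code.

```python
import json

# Sample keys, used only to illustrate the shape of the data.
book_ids = ["84", "1342", "46"]

# Outer dictionary: one key per book ID, each mapped to an (initially empty) record.
books_sketch = {book_id: {} for book_id in book_ids}

# Serialize to a JSON string and parse it back to check that nothing is lost.
serialized = json.dumps(books_sketch, ensure_ascii=False, indent=2)
restored = json.loads(serialized)

print(restored == books_sketch)
```

JSON object keys are always strings, so keeping the IDs as strings means the dictionary looks the same before and after the round trip.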






In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
import requests
from bs4 import BeautifulSoup as bs
import re
import json 
from time import sleep

We have to make sure of some things while we put the links with the books in our dictionary: <br>
* We should pick only books in English 
* We should avoid poetry and drama, to keep data consistency 
* We should not pick dictionaries 
* We should avoid picking audiobooks, since they don't have a `.txt` file.

We define the functions `gen_en()`, `not_prose()` and `no_audiobooks()` to accomplish each task.
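
The subject-based exclusion rule can be illustrated without any network access. The sketch below is an assumption-laden simplification: it works on a plain list of subject strings, matches case-insensitively (which is broader than matching specific spellings), and the names `EXCLUDED_SUBJECTS` and `has_excluded_subject` are hypothetical.

```python
import re

# Subject labels the project wants to exclude.
# Case-insensitive matching is an assumption made here for brevity.
EXCLUDED_SUBJECTS = re.compile(r"poetry|drama|dictionaries|glossaries", re.IGNORECASE)

def has_excluded_subject(subjects):
    """Return True if any subject string mentions an excluded category."""
    return any(EXCLUDED_SUBJECTS.search(subject) for subject in subjects)

# Hypothetical subject lists, not taken from the site.
print(has_excluded_subject(["Science fiction", "Gothic fiction"]))
print(has_excluded_subject(["English poetry -- 19th century"]))
```

Separating the rule from the scraping step makes it possible to test the filter on hand-written inputs before running it against the site.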


In [2]:
def gen_en(link):
  """This function takes a link to a Project Gutenberg book page and checks whether the language is English"""
  try:
    soup = bs(requests.get(link).text, "lxml")
    sleep(1)
    if soup.find("tr", property="dcterms:language").get("content") == "en": # we check for the language: we are interested in English books only
      return link
  except AttributeError:
    print(link)
    print("Problem with language!")

In [3]:
def not_prose(link):
  """This function takes a link to a Project Gutenberg book page and checks whether the book is poetry, drama or a dictionary"""
  try:
    soup = bs(requests.get(link).text, "lxml")
    sleep(1)
    subjects = soup.find_all("a", class_ = "block") # we look for the "subjects" section
    for sub in subjects:
      sub = sub.get_text().strip()
      if re.search(r"[pP]oetry|[dD]rama|Dictionaries|Glossaries", sub): # we check whether it is classified as poetry, drama or a dictionary
        print("Excluded because "+link+" contains poetry, drama or a dictionary!")
        return link # the function returns the link that does not refer to a prose book; returning also ends the loop, since a single matching subject is enough to exclude the book
  except AttributeError:
    print(link)
    print("Problem with poetry|drama|dictionary!")

In [4]:
def no_audiobooks(link):
  """This function takes a link to a Project Gutenberg book and checks whether it is an audiobook"""
  try:
    soup = bs(requests.get(link).text, "lxml")
    sleep(1) 
    if soup.find("td", property="dcterms:type").get_text() == "Sound": # we check the html element containing the book type
      print("Excluded because "+link+" contains an audiobook!")
      return link
  except AttributeError:
    print(link)
    print("Problem with audiobook!")

We want to build a dictionary whose keys are the book IDs. Therefore, we create a function to get the book ID from any Project Gutenberg link.

In [5]:
def get_id(link):
  """This function takes a Project Gutenberg book link and returns the book id"""
  if re.search(r"[0-9]+(/also/)$", link):
    link = re.sub(r"(/also/)", "", link)
  span = re.search(r"[0-9]+$",link).span() # from the link we can obtain the id number of the book: we use regex to do so
  book_id = link[span[0]:span[1]] # with span() we are able to know the position of the starting and ending characters that denote the book ID
  return book_id
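
As a cross-check of the idea, the same extraction can be sketched with a single capturing group. This is an alternative illustration, not the function used in the rest of the notebook; `extract_book_id` is a hypothetical name, and the sketch assumes the ID always follows `/ebooks/` in the link.

```python
import re

def extract_book_id(link):
    """Return the numeric ID that follows '/ebooks/' in a link, or None if there is none."""
    match = re.search(r"/ebooks/(\d+)", link)
    return match.group(1) if match else None

print(extract_book_id("https://www.gutenberg.org/ebooks/84"))
print(extract_book_id("https://www.gutenberg.org/ebooks/42324/also/"))
```

Because the group captures only the digits, the `/also/` suffix does not need to be removed first.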

All the control functions are now defined. They are put together in another function, `get_good_books()`, which returns the list of Project Gutenberg links that are valid for the project and puts their IDs inside a dictionary. There is one problem, though: some books might be duplicates listed under another ID. Therefore, we define two new functions to get the author and the title of each book, and we pass one more parameter to `get_good_books()`: a control list that contains each `(author, title)` tuple. If the tuple is already in the list, we do not put the book in the dictionary, because it is a duplicate with a different ID.
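
The duplicate check can be sketched in isolation. The example below is a simplified illustration under stated assumptions: it uses a `set` instead of a list for the membership test, the function name `drop_duplicate_books` is hypothetical, and the third candidate is an invented duplicate with a made-up ID.

```python
def drop_duplicate_books(candidates):
    """Keep the ID of the first occurrence of each (author, title) pair."""
    seen = set()   # (author, title) pairs already accepted
    kept = []      # IDs of the books that were kept
    for book_id, author, title in candidates:
        key = (author, title)
        if key in seen:
            continue  # same work under a different ID: skip it
        seen.add(key)
        kept.append(book_id)
    return kept

candidates = [
    ("84", "Shelley, Mary Wollstonecraft", "Frankenstein; Or, The Modern Prometheus"),
    ("1342", "Austen, Jane", "Pride and Prejudice"),
    ("99999", "Austen, Jane", "Pride and Prejudice"),  # hypothetical duplicate ID
]
print(drop_duplicate_books(candidates))
```

A set makes each membership test constant-time on average, which matters little for a few hundred books but keeps the check cheap as the collection grows.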

In [6]:
def get_author(soup):
  """This function takes a soup object of a Project Gutenberg book link and returns the author of the book as Surname, Name"""
  try:
    author = soup.find("a", itemprop = "creator").get_text()
    try: # we need a nested try-except block: some authors follow the formula "Surname, Name, Birth-Death", while others lack the dates
      sp = re.search(r"(, [0-9]*\??\-[0-9]*\??$)", author).span() # we check whether the "name" cell also ends with the birth-death dates
      author_n = author[:sp[0]] # if so, we drop the birth-death dates and keep the name only
      return author_n # we return the name without the dates
    except AttributeError:
      return author # if the author does not have the birth-death dates, we return the name as it is
  except AttributeError:
    return "No author identified!"
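
The date-stripping step can be tried on plain strings. The sketch below reuses the same regular expression but applies it with `re.sub`, so no `span()` arithmetic is needed; `LIFE_DATES` and `strip_life_dates` are hypothetical names, and the input strings are illustrative examples of the "Surname, Name, Birth-Death" formula.

```python
import re

# Trailing ", <birth>-<death>" block; either year may be missing or followed by "?".
LIFE_DATES = re.compile(r", [0-9]*\??-[0-9]*\??$")

def strip_life_dates(author):
    """Remove a trailing birth-death block from an author string, if present."""
    return LIFE_DATES.sub("", author)

print(strip_life_dates("Shelley, Mary Wollstonecraft, 1797-1851"))
print(strip_life_dates("Austen, Jane"))
```

Strings without dates are returned unchanged, so the two cases do not need separate branches.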

In [7]:
def get_title(soup):
  """This function takes a soup object of a Project Gutenberg book link and returns the title of the book, checking also whether there is a subtitle"""
  try:
    title = soup.find("td", itemprop = "headline").get_text().strip() # we look for the title of the book
    if re.search(r"(\r)", title) : # some titles are longer because they also contain the subtitle. Given that we're not interested in it, we keep only the title
      title= title.split("\r")[0]
    return title
  except AttributeError:
    print("Problem with title!")

In [8]:
def get_good_books(list_of_links, a_control_list, book_dict):
  """This function takes a list of links to books in the Project Gutenberg website, a control list to avoid duplicates with different IDs and a dictionary that will 
  contain the books. For each link, the function will check the language, the category of text and whether the book was already inserted with another ID.
  If the book passes all the controls, then its ID is put as a key in the dictionary; its value is another (empty) dictionary. """
  base_url= "https://www.gutenberg.org"
  en_books = [] # we also initialize an empty list: it will contain the links to all the english books
  for link in list_of_links:
    new_url = base_url+link # we create a new url to scrape by concatenating the base url with each of the link in the input list
    soup = bs(requests.get(new_url).text, "lxml")
    if gen_en(new_url) and not not_prose(new_url) and not no_audiobooks(new_url): # three checks: the book is in English, it is not poetry, drama or a dictionary, and it is not an audiobook
      author = get_author(soup) # we look for the author's name
      title = get_title(soup) # we look for the book title
      if (author, title) not in a_control_list: # we want to avoid having the same book twice under different IDs: the control list of (author, title) tuples tells us whether the book is already in our dictionary
        en_books.append(new_url) # if the tuple is not in the list, then we can move on and append the link to our final list
        a_control_list.append((author, title)) # we also add the new book to the control list, to avoid duplicates
      else:
        print("Excluded because "+title+" by "+author+" is already in the control list!")
  for link in en_books: # loop through the list of good books
    book_id = get_id(link) # we call the function that takes the id number from the link
    if book_id not in book_dict:
      book_dict[book_id] = {} # we create a key inside the dictionary corresponding to the id number; its value is another dictionary
  
  return en_books, book_dict # the function returns the list containing the links to the "good" books and the dictionary; the dictionary is later dumped as a json object because that is more portable and easy to write to a separate file


We are ready to create the main function. This is the function that actually scrapes the search page of Project Gutenberg, sorted by popularity (https://www.gutenberg.org/ebooks/search/?sort_order=downloads). The function calls `get_good_books()` to keep only the books that are valid for the project, and then recursively calls itself on the next page of results until the initial dictionary contains at least 500 items. When it does, the function returns the list of links to the individual books and the dictionary, which is structured as follows:

```
dict = {
  <bookID> : {},
  <bookID> : {},
  <bookID> : {},
  ...
}
```
We also have a `try-except` block in the function. This is because there will be a second scraping phase in which we pick the books that the site recommends for each book in our dictionary. They are listed on a page that is identical to the initial one, except that it is a single page with no further pages to navigate (see example: https://www.gutenberg.org/ebooks/42324/also/). Therefore, to avoid errors, the function executes the `except` code when it scrapes a recommended books page. 
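
The "keep following the next page until enough items are collected" logic can also be written as a loop. The sketch below is an iterative illustration, not the recursive function used in this notebook: `collect_until`, `fake_fetch` and `FAKE_SITE` are hypothetical names, and the fake site stands in for the real scraping step so that the example runs offline.

```python
def collect_until(first_page, fetch_page, target):
    """Follow 'next page' links until at least `target` items are collected or pages run out."""
    collected = []
    page = first_page
    while page is not None and len(collected) < target:
        items, page = fetch_page(page)  # fetch_page returns (items, next page or None)
        collected.extend(items)
    return collected

# Stand-in for the real scraping step: maps a page key to (items, next page key).
FAKE_SITE = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c", "d"], "p3"),
    "p3": (["e"], None),
}

def fake_fetch(page):
    return FAKE_SITE[page]

print(collect_until("p1", fake_fetch, target=3))
```

As in the recursive version, the threshold is checked once per page, so the result can contain more items than the target. A loop also stops cleanly when there is no next page.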


In [9]:
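def scrape_random(a_url, a_control_list=None, a_dict=None, a_list=None):
  """This function scrapes the search page of Project Gutenberg, when the option "sort by popularity" is selected. 
  It takes as input the relative url of a results page, a control list, a dictionary and a list to fill, and returns the list of links to the valid books together with the dictionary.
  When it is called with the url alone, as done for the "recommended books" pages, it is meant to fall through to the "except" branch and return the list of book links found on that page"""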
  url = "https://www.gutenberg.org"+a_url
  source = requests.get(url).text # we use get() to get the source code of the webpage
  soup = bs(source, "lxml") # we create the soup object with BeautifulSoup
  try:
    pages_links = soup.find("span", class_="links").find_all("a") # we collect the links for "Next", "Previous", "First" and "Last" pages

    for p in pages_links: # we loop through the links of the pages
      if p.get("title") == "Go to the next page of results.": # check whether tag attribute "title" corresponds to next page
        next_page = (p.get("href")) # we save the link to the next page in a variable; this link will be used as input to this function again, recursively, until we reach at least 500 books.
    books = soup.find_all("li", class_="booklink") # we now look for the links to the individual pages of each book
    booklinks = [book.find("a", class_="link").get("href") for book in books] # we save the links in a list
    good_links, a_dict = get_good_books(booklinks, a_control_list, a_dict)
    a_list.extend(good_links)
    print(len(a_dict))

    if len(a_dict) >= 500: # when we have reached 500 books:
      return a_list, a_dict # we stop the function
    else:
      return scrape_random(next_page,a_control_list, a_dict,a_list) # we haven't reached 500 books yet, so the function recursively calls itself
  
  except TypeError: # This will be useful when we scrape the "recommended books" pages
    books = soup.find_all("li", class_="booklink") # we now look for the links to the individual pages of each book
    booklinks = [book.find("a", class_="link").get("href") for book in books] # we save the links in a list
    return booklinks

In [None]:
# Initialization of variables passed as arguments
control = []
books = {}
to_books = []

In [None]:
# this takes a while to run...
to_books, books = scrape_random("/ebooks/search/?sort_order=downloads", control, books, to_books)

Excluded because https://www.gutenberg.org/ebooks/2542 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/16328 contains poetry, drama or a dictionary!
22
Excluded because https://www.gutenberg.org/ebooks/3825 contains poetry, drama or a dictionary!
46
Excluded because https://www.gutenberg.org/ebooks/58585 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/1727 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/42108 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/1934 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/6130 contains poetry, drama or a dictionary!
66
Excluded because https://www.gutenberg.org/ebooks/1001 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.

In [None]:
# Saving data in drive project folder
with open("drive/MyDrive/CSTA_Proj/data.json", 'w') as f:
  json.dump(books, f, ensure_ascii=False, indent=2) # for portability, we dump the newly created dictionary in a json file

with open("drive/MyDrive/CSTA_Proj/control.txt", 'a') as d: # the "with" block closes the file automatically
  for tup in control:
    d.write(str(tup)+"\n")

with open("drive/MyDrive/CSTA_Proj/to_books.txt", "a") as t_b:
  for link in to_books:
    t_b.write(link+"\n")
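
Since the control list is written as one `str(tuple)` per line, it can be read back later with `ast.literal_eval`. The sketch below shows the round trip under stated assumptions: it writes to a temporary directory instead of the Drive folder, and `control_sketch` and `reloaded` are hypothetical names.

```python
import ast
import os
import tempfile

control_sketch = [
    ("Austen, Jane", "Pride and Prejudice"),
    ("Carroll, Lewis", "Alice's Adventures in Wonderland"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "control.txt")
    # Write one tuple per line, in the same textual form used above.
    with open(path, "w", encoding="utf-8") as f:
        for tup in control_sketch:
            f.write(str(tup) + "\n")
    # Read the lines back and turn each one into a tuple again.
    with open(path, encoding="utf-8") as f:
        reloaded = [ast.literal_eval(line.strip()) for line in f if line.strip()]

print(reloaded == control_sketch)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of file.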

In [None]:
# let's check our control list:
control

[('Shelley, Mary Wollstonecraft', 'Frankenstein; Or, The Modern Prometheus'),
 ('Austen, Jane', 'Pride and Prejudice'),
 ('Dickens, Charles',
  'A Christmas Carol in Prose; Being a Ghost Story of Christmas'),
 ('Swift, Jonathan', 'A Modest Proposal'),
 ('Carroll, Lewis', "Alice's Adventures in Wonderland"),
 ('Hawthorne, Nathaniel', 'The Scarlet Letter'),
 ('Stevenson, Robert Louis', 'The Strange Case of Dr. Jekyll and Mr. Hyde'),
 ('Dickens, Charles', 'A Tale of Two Cities'),
 ('Leblanc, Maurice',
  'The Extraordinary Adventures of Arsene Lupin, Gentleman-Burglar'),
 ('Doyle, Arthur Conan', 'The Adventures of Sherlock Holmes'),
 ('Melville, Herman', 'Moby Dick; Or, The Whale'),
 ('Kafka, Franz', 'Metamorphosis'),
 ('Rand, Ayn', 'Anthem'),
 ('Gilman, Charlotte Perkins', 'The Yellow Wallpaper'),
 ('Brontë, Charlotte', 'Jane Eyre: An Autobiography'),
 ('Twain, Mark', 'Adventures of Huckleberry Finn'),
 ('Benedict, Saint, Abbot of Monte Cassino',
  'St. Benedict’s Rule for Monasteries'),


In [None]:
# let's see our list containing the links to the single book pages:
to_books

['https://www.gutenberg.org/ebooks/84',
 'https://www.gutenberg.org/ebooks/1342',
 'https://www.gutenberg.org/ebooks/46',
 'https://www.gutenberg.org/ebooks/1080',
 'https://www.gutenberg.org/ebooks/11',
 'https://www.gutenberg.org/ebooks/25344',
 'https://www.gutenberg.org/ebooks/43',
 'https://www.gutenberg.org/ebooks/98',
 'https://www.gutenberg.org/ebooks/6133',
 'https://www.gutenberg.org/ebooks/1661',
 'https://www.gutenberg.org/ebooks/2701',
 'https://www.gutenberg.org/ebooks/5200',
 'https://www.gutenberg.org/ebooks/1250',
 'https://www.gutenberg.org/ebooks/1952',
 'https://www.gutenberg.org/ebooks/1260',
 'https://www.gutenberg.org/ebooks/76',
 'https://www.gutenberg.org/ebooks/50040',
 'https://www.gutenberg.org/ebooks/174',
 'https://www.gutenberg.org/ebooks/219',
 'https://www.gutenberg.org/ebooks/74',
 'https://www.gutenberg.org/ebooks/23',
 'https://www.gutenberg.org/ebooks/205',
 'https://www.gutenberg.org/ebooks/160',
 'https://www.gutenberg.org/ebooks/345',
 'https://w

### 1.2 Filling the book dictionary 
In this phase, we define the functions that fill the data structure defined above. We will need some of the functions defined above and some new ones, the main one being `create_dict()`. We will get information about each book: author, title, categories (stored as "Subjects" in Project Gutenberg), the link to its `.txt` file, which we will need for LDA, and the list of recommended books. The recommended books will be the "Silver Standard" against which the recommender system will be evaluated.
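
The shape of a filled entry can be sketched as follows. The key names follow those used in `create_dict()`; the subject and recommendation values are placeholders rather than real data, and `REQUIRED_KEYS` and `missing_fields` are hypothetical helpers added only for illustration.

```python
REQUIRED_KEYS = ("Author", "Title", "Subjects", "Text link", "Similar books")

def missing_fields(record):
    """Return the expected keys that a book record does not contain yet."""
    return [key for key in REQUIRED_KEYS if key not in record]

filled = {
    "Author": "Shelley, Mary Wollstonecraft",
    "Title": "Frankenstein; Or, The Modern Prometheus",
    "Subjects": ["<subject 1>", "<subject 2>"],           # placeholders
    "Text link": "/files/84/84-0.txt",
    "Similar books": [("<id>", "<author>", "<title>")],   # placeholder tuple
}

print(missing_fields(filled))
print(missing_fields({}))
```

A check of this kind makes it easy to spot entries that were created but never filled, for example after an interrupted run.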

### 1.3 Adding books to the list
The recommended books will not just be saved as a list of strings: they will also be added to the dictionary following the whole procedure described so far. They are added if they are in English, if they are not poetry, drama or audiobooks, and if they are not already in the dictionary under a different ID. The entries of the recommended books will then be filled as well, but the books recommended for them will not be added to the dictionary as new entries: the parameter `Recommended` of `create_dict()` is a flag that tells whether the books being processed are initial books or recommended ones. When it is `True`, the part of the function that adds the recommended books to the dictionary is skipped.  

In [10]:
def get_texts(soup):
  """This function takes a soup object of a Project Gutenberg book page and returns the link to its .txt file"""
  for a in soup.find_all("a", title= "Download"): # we want to get the links to the utf-8 encoded .txt files 
    ty = a.get("type") # we check the attribute "type"; it is None when the tag has no such attribute
    if ty and "text/plain" in ty: # each book has a different value for the attribute "type", but all the possible values contain the string "text/plain", so we look for that substring
      return a.get("href") # the function returns the link

In [11]:
def get_recommendations(soup):
  """This function takes a soup object that refers to a Project Gutenberg book page and looks for the recommended books.
  It returns two lists: the first contains a tuple (id, author, title) for each recommended book that passes the checks,
  the second contains the links to those books. The ID alone would be a problem, given that we got rid of duplicates
  with different IDs, so the author and title are stored as well"""
  links = soup.find("h2", class_="header", text="Similar Books").findAllNext("a")
  base_url = "https://www.gutenberg.org"
  for l in links:
    if l.find("span", text = "Readers also downloaded…") in l.findAllNext("span"): # we look for the "recommended books" page
      also = l.get("href")
      recommendations = scrape_random(also) # we call again the function scrape_random() that will execute the code in the "except" mode; it returns a list of links
      # we prepare two lists: one will contain id, author and title of the recommended book, while the other will contain the link to it
      recoms = [] 
      rec_links = []
      for link in recommendations: # we loop through the recommendations
        new_link = base_url+link # we make a gutenberg link
        if gen_en(new_link) and not not_prose(new_link) and not no_audiobooks(new_link): # we check if the book is valid
          soup = bs(requests.get(new_link).text, "lxml") # soup 
          author = get_author(soup) # we get author and title of the book
          title = get_title(soup)
          id = get_id(link) 
          recoms.append((id, author, title)) # we append to recoms a tuple containing the id, the author and the title of the recommended book
          rec_links.append(new_link) # we append the link to the other list
      return recoms, rec_links # the function returns the two lists

In [12]:
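def create_dict(links, a_control_list, book_dict, to_texts, Recommended = False):
  """This function takes a list of links to Gutenberg books, the control list of (author, title) tuples, the path to the json file
  containing the dictionary created with get_good_books(), and the list collecting the links to the texts. It fills the dictionary and returns it with the list of text links"""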
  with open(book_dict) as f:
    books = json.load(f) # loading the json file containing the dictionary; the keys of the dictionary are the books' id numbers
  for link in links:
    if not re.search(r"(https://www.gutenberg.org).*", link):
      link = "https://www.gutenberg.org"+link
    id = get_id(link)
    soup = bs(requests.get(link).text, "lxml")
    if id in books: # if the id of the book is in the dictionary:
      print(id)
      author = get_author(soup) #we get the author of the book by calling the function get_author()
      books[id]["Author"] = author  # we create a key inside the dictionary corresponding to the book id and we store the author there
      title = get_title(soup) # we get the title of the book by calling the function get_title()
      books[id]["Title"] = title # we assign the title value to the key "Title" to each book in the dictionary
      subjects = [subject.get_text().strip() for subject in soup.find_all("a", class_="block")] # we look for the subjects, sort of categories. We save them in a list; the list is the value of the key "Subjects" for each book in the dictionary
      books[id]["Subjects"] = subjects
      href = get_texts(soup) # we call get_texts() to get the link to the text of each book
      to_texts.append(href)  # we append the link to the list I created
      books[id]["Text link"] = href # we also add a specific key in the dictionary containing the link  
      recoms, rec_links = get_recommendations(soup) # we call the function get_recommendations(), that returns the lists of recommended books
      books[id]["Similar books"] = recoms # we store them as silver standard
      if not Recommended: # flag for recursion: only initial books add their recommended books to the dictionary
        for rec in rec_links:
          rec_id = get_id(rec) # we look for the id of the recommended book
          rec_soup = bs(requests.get(rec).text, "lxml") # we get the recommended book soup object
          # we look for the author and title of the recommended book: 
          rec_author = get_author(rec_soup) 
          rec_title = get_title(rec_soup)
          # we make sure to add to the dictionary books that are entirely new: not only a different ID, but also different author and title
          if rec_id not in books and (rec_author, rec_title) not in a_control_list: 
            a_control_list.append((rec_author, rec_title)) # if the book is totally new, we make sure the system remembers it to avoid duplication 
            books[rec_id] = {} # we add its ID to the dictionary
        new_recs = [rec for rec in rec_links if rec not in links]
        with open(book_dict, 'w') as f:
          json.dump(books, f, ensure_ascii=False, indent=2) #finally, we dump the finished dictionary in a json file in order for the function to process it again
        books, to_texts = create_dict(new_recs, a_control_list, book_dict, to_texts, Recommended=True) # the function calls itself again in order to fill the newly created dictionaries, but this time it won't add new recommended books
  return books, list(set(to_texts)) # the function returns the dictionary and the list containing the links to the texts (without accidental duplicates)


In [None]:
to_texts =[]
type(control)

In [None]:
# Due to runtime errors, I had to divide the dictionaries in chunks. Unfortunately I was able to fill only 
# the first and part of the second.
books_1 = {k: books[k] for k in list(books)[:101]}
to_books_1 = to_books[:101]
with open("drive/MyDrive/CSTA_Proj/books_1.json", "w") as f:
  json.dump(books_1, f, ensure_ascii=False, indent=2) 

with open("drive/MyDrive/CSTA_Proj/to_books_1.txt", "a") as f: # the "with" block closes the file automatically
  for l in to_books_1:
    f.write(l+"\n")

books_2 = {k:books[k] for k in list(books)[101:201]}
books_3 = {k:books[k] for k in list(books)[201:301]}
books_4 = {k:books[k] for k in list(books)[301:401]}
books_5 = {k:books[k] for k in list(books)[401:]}
to_books_2 = to_books[101:201]
to_books_3 = to_books[201:301]
to_books_4 = to_books[301:401]
to_books_5 = to_books[401:]
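
The manual slicing above can be generalized. The sketch below is one possible way to split a dictionary into chunks of a fixed size; note that it produces uniform chunks, unlike the slices above, where the first chunk holds 101 items. `chunk_dict` and `sample` are hypothetical names.

```python
from itertools import islice

def chunk_dict(mapping, size):
    """Yield successive sub-dictionaries with at most `size` items each."""
    iterator = iter(mapping.items())
    while True:
        chunk = dict(islice(iterator, size))  # take the next `size` items, if any
        if not chunk:
            return
        yield chunk

sample = {str(i): {} for i in range(7)}
chunks = list(chunk_dict(sample, 3))
print([len(c) for c in chunks])
```

Because dictionaries preserve insertion order, the chunks follow the order in which the books were added.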

In [None]:
# Again dumping dictionaries in json
with open("drive/MyDrive/CSTA_Proj/books_2.json", "w") as f:
  json.dump(books_2, f, ensure_ascii=False, indent=2) 

with open("drive/MyDrive/CSTA_Proj/books_3.json", "w") as f:
  json.dump(books_3, f, ensure_ascii=False, indent=2)

with open("drive/MyDrive/CSTA_Proj/books_4.json", "w") as f:
  json.dump(books_4, f, ensure_ascii=False, indent=2)  

with open("drive/MyDrive/CSTA_Proj/books_5.json", "w") as f:
  json.dump(books_5, f, ensure_ascii=False, indent=2) 

with open("drive/MyDrive/CSTA_Proj/to_books_2.txt", "a") as f:
  for l in to_books_2:
    f.write(l+"\n")

with open("drive/MyDrive/CSTA_Proj/to_books_3.txt", "a") as f:
  for l in to_books_3:
    f.write(l+"\n")

with open("drive/MyDrive/CSTA_Proj/to_books_4.txt", "a") as f:
  for l in to_books_4:
    f.write(l+"\n")

with open("drive/MyDrive/CSTA_Proj/to_books_5.txt", "a") as f:
  for l in to_books_5:
    f.write(l+"\n")

In [None]:
# this takes a really long time. It fills the little dictionaries in the main dictionary, and adds more items to
# the latter if it is recommended by another item that was already there.
books, to_texts = create_dict(to_books_1, control, "drive/MyDrive/CSTA_Proj/books_1.json", to_texts)

84
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/2542 contains poetry, drama or a dictionary!
1342
46
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
30368
1080
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/2542 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/9800 contains poetry, drama or a dictionary!
Excluded because https://www.gutenberg.org/ebooks/16328 contains poetry, drama or a dictionary!
11
25344
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
43
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
98
Excluded because https://www.gutenberg.org/ebooks/844 contains poetry, drama or a dictionary!
6133
Excluded beca

In [None]:
# again dumping dicts
with open("drive/MyDrive/CSTA_Proj/books_1_final.json", "w") as f:
  json.dump(books, f, ensure_ascii = False, indent =2)

with open("drive/MyDrive/CSTA_Proj/to_texts_1.txt", "a") as f:
  for l in to_texts:
    f.write(l+"\n")

In [None]:
with open("drive/MyDrive/CSTA_Proj/books_1.json") as f:
  books_1 = json.load(f)

Now we have the dictionary. Let's see the links that will lead us to the text files of each book:

In [93]:
to_texts = []
for k in books_1: # looping through the dictionary
  if books_1[k] == {}: # some entries could not be filled due to runtime errors
    print(k+" is missing")
  else:
    href_txt = books_1[k]["Text link"]
    if href_txt is not None:
      to_texts.append(href_txt) # let's get only valid links
    else:
      print(None)

to_texts

3296 is missing
375 is missing
7256 is missing
None
16377 is missing
5614 is missing
4913 is missing
34180 is missing
4902 is missing
4542 is missing
64061 is missing
63884 is missing
55278 is missing


['/files/84/84-0.txt',
 '/files/1342/1342-0.txt',
 '/files/46/46-0.txt',
 '/files/1080/1080-0.txt',
 '/files/11/11-0.txt',
 '/files/25344/25344-0.txt',
 '/files/43/43-0.txt',
 '/files/98/98-0.txt',
 '/files/6133/6133-0.txt',
 '/files/1661/1661-0.txt',
 '/files/2701/2701-0.txt',
 '/ebooks/5200.txt.utf-8',
 '/files/1250/1250-0.txt',
 '/files/1952/1952-0.txt',
 '/files/1260/1260-0.txt',
 '/files/76/76-0.txt',
 '/files/50040/50040-0.txt',
 '/ebooks/174.txt.utf-8',
 '/files/219/219-0.txt',
 '/files/74/74-0.txt',
 '/ebooks/23.txt.utf-8',
 '/files/205/205-0.txt',
 '/files/160/160-0.txt',
 '/ebooks/345.txt.utf-8',
 '/files/1400/1400-0.txt',
 '/files/2852/2852-0.txt',
 '/files/1232/1232-0.txt',
 '/files/16/16-0.txt',
 '/files/120/120-0.txt',
 '/files/55/55.txt',
 '/files/25929/25929-0.txt',
 '/files/408/408-0.txt',
 '/files/2591/2591-0.txt',
 '/ebooks/514.txt.utf-8',
 '/files/215/215-0.txt',
 '/files/2600/2600-0.txt',
 '/ebooks/1497.txt.utf-8',
 '/ebooks/19942.txt.utf-8',
 '/files/158/158-0.txt

In [None]:
# From the first dictionary, this is the number of links to texts that we collected:
len(to_texts)

242

## 2. Data cleaning
Now our dictionary is filled and we have a list of links leading to each book's .txt file. The next step is collecting the text from the links and cleaning it, in order to make it usable for training the LDA model.

In [13]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
import spacy
nltk.download("stopwords")
nltk.download("punkt")
nlp = spacy.load("en_core_web_sm", disable = ['parser', 'ner'])
stops = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [16]:
def gutenberg_clean_text(book_url):
    """This function takes a url to a gutenberg .txt file and cleans it, removing the Gutenberg head and tail. 
    Since there are many different formattings, there are several options to be checked. """
    r = requests.get(book_url)
    r.encoding = "utf-8-sig"
    t = r.text
    try:
      if re.search("LETTERS FROM AN AMERICAN FARMER", t):
        head = re.search("LETTERS FROM AN AMERICAN FARMER", t).span()
        tail = re.search("The End", t).span()
        return t[head[0]:tail[0]]
      head = re.search(r"\*\*\*\s*START OF .* PROJECT GUTENBERG EBOOK.*\s*.*\*\*\*", t).span()
      tail = re.search(r"(?i)\**End of .* Project Gutenberg (EBook)?.*", t).span()
      if re.search(r"ST\. BENEDICT’S\s*RULE FOR MONASTERIES", t):
        head = re.search(r"ST\. BENEDICT’S\s*RULE FOR MONASTERIES\s*PROLOGUE\s*", t).span()
        return t[head[0]:tail[0]]
      return t[head[1]: tail[0]]
    except AttributeError:
      pattern = r"\*\*\* START OF .* PROJECT GUTENBERG EBOOK\s*.*\*\*\*" # the original ".**" is not a valid regex (multiple repeat)
      if re.search(pattern, t):
        head = re.search(pattern, t).span()
        tail = re.search(r"(?i)\**End of .* Project Gutenberg (EBook)?\s*.*", t).span()
        return t[head[1]: tail[0]]
      else: 
        print("This is not cleanable! Remove it!")
        return book_url
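
The core idea of the function, cutting the text between the start and end markers, can be tried on a small in-memory string. The sketch below is a simplified illustration: the patterns are looser than the ones used above, `strip_boilerplate` is a hypothetical name, and the sample text is invented.

```python
import re

# Simplified marker patterns; each stays within one line because "." does not match newlines.
START = re.compile(r"\*\*\*\s*START OF .*?PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)
END = re.compile(r"\*\*\*\s*END OF .*?PROJECT GUTENBERG EBOOK", re.IGNORECASE)

def strip_boilerplate(text):
    """Return the text between the start and end markers, or None if either is missing."""
    start = START.search(text)
    end = END.search(text)
    if start is None or end is None:
        return None
    return text[start.end():end.start()].strip()

sample = (
    "Licence preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence footer\n"
)
print(strip_boilerplate(sample))
```

Returning `None` when a marker is missing lets the caller decide what to do with texts that cannot be cleaned.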

In [None]:
gutenberg_clean_text("https://www.gutenberg.org/files/1257/1257-0.txt")



In [17]:
def clean_text(raw_text):
  """This function takes a text, cleaned from Gutenberg head and tail, and returns a list of lemmatized tokens and
  the frequencies of the tokens."""
  unwanted = ["”", "’", "“","-", "''", "`",".", "--", "n't", "'s", "'", "_","...", "..","—", "chapter", "would", "could", "take", "there"]
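  if re.search(r"(?i)Chapter [ixcvld]*[0-9]*", raw_text):
    chap_list = re.findall(r"(?i)Chapter [ixcvld]*[0-9]*.*", raw_text)
    # add the word following "Chapter" in each heading to the unwanted tokens; skip bare "Chapter" matches
    unwanted.extend([chap.split()[1].lower() for chap in chap_list[1:] if len(chap.split()) > 1])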
  tokens = [tok.lower() for tok in word_tokenize(raw_text) if tok not in punctuation and tok not in unwanted and tok.lower() not in stops]
  freq_dict = {}
  
  lemmatized_toks = lemmatize(tokens)
  lemmatized_toks = [tok for tok in lemmatized_toks if tok != "-PRON-" and tok.lower() not in unwanted and tok.lower() not in punctuation and tok.lower() not in stops]
  for tok in lemmatized_toks:
    if tok not in freq_dict:
      freq_dict[tok] = 1
    else:
      freq_dict[tok] += 1
  return lemmatized_toks, freq_dict
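
As a side note, the frequency dictionary built by the explicit loop can also be obtained with `collections.Counter` from the standard library. The sketch below uses made-up tokens and only illustrates that the two approaches give the same counts.

```python
from collections import Counter

# Made-up lemmatized tokens for illustration.
tokens = ["letter", "sea", "letter", "frost", "sea", "letter"]

# Explicit counting loop, as in clean_text.
freq_manual = {}
for tok in tokens:
    if tok not in freq_manual:
        freq_manual[tok] = 1
    else:
        freq_manual[tok] += 1

# Equivalent counting with Counter (a dict subclass).
freq_counter = Counter(tokens)

print(freq_counter.most_common(2))
print(dict(freq_counter) == freq_manual)
```

Since `Counter` is a `dict` subclass, it can be passed to `json.dump` in the same way as a plain dictionary.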

In [18]:
def lemmatize(tokens):
  """This function uses SpaCy to lemmatize a list of tokens"""
  lemmatized_tokens = []
  for token in tokens:
    lemma = [tok.lemma_ for tok in nlp(token)]
    lemmatized_tokens.extend(lemma)
  return lemmatized_tokens

In [94]:
# checking some problematic links
to_texts[205]

'/ebooks/3726.txt.utf-8'

In [None]:
# bad links from the first chunk of data (books_1 dictionary)
excluded_files_1 = [to_texts[129], to_texts[205], to_texts[234]]

In [None]:
# bad links from the second chunk of data (books_2 dictionary)
excluded_files_2 = ["/ebooks/2800.txt.utf-8", "/ebooks/1.txt.utf-8", "/ebooks/13309.txt.utf-8", "/ebooks/3330.txt.utf-8", "/ebooks/31100.txt.utf-8"]

In [None]:
# how many good links do we have from the second dictionary?
len(to_texts_2)-len(excluded_files_2)

89

In [None]:
to_texts.index("/ebooks/690.txt.utf-8")

234

In [None]:
# checking if the formatting is problematic
for link in to_texts_2[80:]:
  if link not in excluded_files_2:
    link = "https://www.gutenberg.org"+link
    print(link)
    gutenberg_clean_text(link)

https://www.gutenberg.org/ebooks/14407.txt.utf-8
https://www.gutenberg.org/ebooks/14872.txt.utf-8
https://www.gutenberg.org/ebooks/14814.txt.utf-8
https://www.gutenberg.org/ebooks/14837.txt.utf-8
https://www.gutenberg.org/ebooks/14868.txt.utf-8
https://www.gutenberg.org/ebooks/15284.txt.utf-8
https://www.gutenberg.org/ebooks/14220.txt.utf-8
https://www.gutenberg.org/ebooks/15137.txt.utf-8
https://www.gutenberg.org/ebooks/582.txt.utf-8
https://www.gutenberg.org/ebooks/45264.txt.utf-8
https://www.gutenberg.org/files/1212/1212-0.txt
https://www.gutenberg.org/ebooks/42078.txt.utf-8
https://www.gutenberg.org/ebooks/601.txt.utf-8


In [None]:
# creating the corpus
corpus_new = []
for link in to_texts_2:
  if link not in excluded_files_2:
    link = "https://www.gutenberg.org"+link
    print(link)
    txt = gutenberg_clean_text(link)
    lemmas, freqs = clean_text(txt)
    corpus_new.append(lemmas)
    with open("drive/MyDrive/CSTA_Proj/freq_dicts_new.json", "a") as f:
      json.dump(freqs, f, ensure_ascii=False, indent=2)


with open("drive/MyDrive/CSTA_Proj/corpus_new.txt","a") as f:
  for t in corpus_new:
    f.write("\t".join(t)+"\n")

corpus_new

https://www.gutenberg.org/ebooks/851.txt.utf-8
https://www.gutenberg.org/files/3090/3090-0.txt
https://www.gutenberg.org/ebooks/600.txt.utf-8
https://www.gutenberg.org/files/1837/1837-0.txt
https://www.gutenberg.org/files/10/10-0.txt
https://www.gutenberg.org/ebooks/19337.txt.utf-8
https://www.gutenberg.org/ebooks/103.txt.utf-8
https://www.gutenberg.org/ebooks/148.txt.utf-8
https://www.gutenberg.org/ebooks/32415.txt.utf-8
https://www.gutenberg.org/files/132/132-0.txt
https://www.gutenberg.org/files/3300/3300-0.txt
https://www.gutenberg.org/ebooks/14838.txt.utf-8
https://www.gutenberg.org/files/121/121-0.txt
https://www.gutenberg.org/files/2413/2413-0.txt
https://www.gutenberg.org/files/786/786-0.txt
https://www.gutenberg.org/ebooks/815.txt.utf-8
https://www.gutenberg.org/ebooks/21279.txt.utf-8
https://www.gutenberg.org/ebooks/16643.txt.utf-8
https://www.gutenberg.org/files/25717/25717-0.txt
https://www.gutenberg.org/files/141/141-0.txt
https://www.gutenberg.org/ebooks/7142.txt.utf-8
ht

[['produce',
  'anonymous',
  'volunteer',
  'narrative',
  'captivity',
  'restoration',
  'mrs',
  'mary',
  'rowlandson',
  'mrs',
  'mary',
  'rowlandson',
  'sovereignty',
  'goodness',
  'god',
  'together',
  'faithfulness',
  'promise',
  'display',
  'narrative',
  'captivity',
  'restoration',
  'mrs',
  'mary',
  'rowlandson',
  'commend',
  'desire',
  'know',
  'lord',
  'dealing',
  'especially',
  'dear',
  'child',
  'relation',
  'second',
  'addition',
  'sic',
  'correct',
  'amend',
  'write',
  'hand',
  'private',
  'use',
  'make',
  'public',
  'earnest',
  'desire',
  'friend',
  'benefit',
  'afflict',
  'deut',
  '32.39',
  'see',
  'even',
  'god',
  'kill',
  'make',
  'alive',
  'wound',
  'heal',
  'neither',
  'deliver',
  'hand',
  'tenth',
  'february',
  '1675',
  'come',
  'indians',
  'great',
  'number',
  'upon',
  'lancaster',
  'first',
  'come',
  'sunrise',
  'hearing',
  'noise',
  'gun',
  'look',
  'several',
  'house',
  'burn',
  'smoke',

In [None]:
corpus_new = []
for link in to_texts[130:]:
  if link not in excluded_files_1:
    link = "https://www.gutenberg.org"+link
    print(link)
    txt = gutenberg_clean_text(link)
    lemmas, freqs = clean_text(txt)
    corpus_new.append(lemmas)
    with open("drive/MyDrive/CSTA_Proj/freq_dicts_new.json", "a") as f:
      json.dump(freqs, f, ensure_ascii=False, indent=2)


with open("drive/MyDrive/CSTA_Proj/corpus_new.txt","a") as f:
  for t in corpus_new:
    f.write("\t".join(t)+"\n")

corpus_new

https://www.gutenberg.org/files/59785/59785-0.txt
https://www.gutenberg.org/files/1828/1828-0.txt
https://www.gutenberg.org/ebooks/15143.txt.utf-8
https://www.gutenberg.org/ebooks/34931.txt.utf-8
https://www.gutenberg.org/ebooks/3796.txt.utf-8
https://www.gutenberg.org/files/1354/1354-0.txt
https://www.gutenberg.org/files/5343/5343-0.txt
https://www.gutenberg.org/ebooks/20417.txt.utf-8
https://www.gutenberg.org/files/5000/5000-8.txt
https://www.gutenberg.org/files/1257/1257-0.txt
https://www.gutenberg.org/files/2759/2759-0.txt
https://www.gutenberg.org/files/43515/43515-0.txt
https://www.gutenberg.org/files/2302/2302-0.txt
https://www.gutenberg.org/files/4558/4558-0.txt
https://www.gutenberg.org/ebooks/48731.txt.utf-8
https://www.gutenberg.org/files/58344/58344-0.txt
https://www.gutenberg.org/ebooks/32117.txt.utf-8
https://www.gutenberg.org/files/5946/5946-0.txt
https://www.gutenberg.org/files/46675/46675-0.txt
https://www.gutenberg.org/ebooks/3581.txt.utf-8
https://www.gutenberg.org/e

[['produce',
  'david',
  'widger',
  'index',
  'project',
  'gutenberg',
  'work',
  'james',
  'joyce',
  'compile',
  'david',
  'widger',
  'content',
  'dubliner',
  'chamber',
  'music',
  'portrait',
  'artist',
  'young',
  'man',
  'ulysses',
  'exile',
  'table',
  'content',
  'volume',
  'dubliner',
  'james',
  'joyce',
  'content',
  'sister',
  'encounter',
  'araby',
  'eveline',
  'race',
  'two',
  'gallants',
  'boarding',
  'house',
  'little',
  'cloud',
  'counterpart',
  'clay',
  'painful',
  'case',
  'ivy',
  'day',
  'committee',
  'room',
  'mother',
  'grace',
  'dead',
  'chamber',
  'music',
  'james',
  'joyce',
  'content',
  'first',
  'line',
  'string',
  'earth',
  'air',
  'make',
  'music',
  'sweet',
  'twilight',
  'turn',
  'amethyst',
  'deep',
  'deep',
  'blue',
  'hour',
  'thing',
  'repose',
  'lonely',
  'watcher',
  'sky',
  'shy',
  'star',
  'go',
  'forth',
  'heaven',
  'maidenly',
  'disconsolate',
  'lean',
  'window',
  'goldenh

In [None]:
# add texts from books_2 to corpus_new
to_texts_2 =[]
with open("drive/MyDrive/CSTA_Proj/to_texts_2", "r") as f:
  for line in f.readlines():
    to_texts_2.append(line.strip())

In [None]:
for l in to_texts_2:
  if l is None:
    print(to_texts_2.index(l))

In [28]:
with open("drive/MyDrive/CSTA_Proj/books_1.json") as f:
  books_1 = json.load(f) 

with open("drive/MyDrive/CSTA_Proj/books_2.json") as f:
  books_2 = json.load(f)

In [None]:
# let's see the first item in the book dictionary
list(books_1.items())[0]

('84',
 {'Author': 'Shelley, Mary Wollstonecraft',
  'Similar books': [['1342', 'Austen, Jane', 'Pride and Prejudice'],
   ['345', 'Stoker, Bram', 'Dracula'],
   ['46',
    'Dickens, Charles',
    'A Christmas Carol in Prose; Being a Ghost Story of Christmas'],
   ['11', 'Carroll, Lewis', "Alice's Adventures in Wonderland"],
   ['42324',
    'Shelley, Mary Wollstonecraft',
    'Frankenstein; Or, The Modern Prometheus'],
   ['98', 'Dickens, Charles', 'A Tale of Two Cities'],
   ['43',
    'Stevenson, Robert Louis',
    'The Strange Case of Dr. Jekyll and Mr. Hyde'],
   ['1661', 'Doyle, Arthur Conan', 'The Adventures of Sherlock Holmes'],
   ['2701', 'Melville, Herman', 'Moby Dick; Or, The Whale'],
   ['25344', 'Hawthorne, Nathaniel', 'The Scarlet Letter'],
   ['1080', 'Swift, Jonathan', 'A Modest Proposal'],
   ['174', 'Wilde, Oscar', 'The Picture of Dorian Gray'],
   ['5200', 'Kafka, Franz', 'Metamorphosis'],
   ['1260', 'Brontë, Charlotte', 'Jane Eyre: An Autobiography'],
   ['76', 'T

In [None]:
title2idx_1 = {books_1[k]["Title"]:k for k in books_1 if books_1[k] != {} and books_1[k]["Text link"] is not None and books_1[k]["Text link"] not in excluded_files_1}

In [None]:
len(title2idx_1)

239

In [None]:
title2idx_2 = {books_2[k]["Title"]:k for k in books_2 if books_2[k] != {} and books_2[k]["Text link"] is not None and books_2[k]["Text link"] not in excluded_files_2}

In [None]:
len(title2idx_2)

89

In [38]:
# loading the corpus
crp = []
with open("drive/MyDrive/CSTA_Proj/corpus_new.txt", "r") as f:
  for l in f.readlines():
    if l not in crp:
      crp.append(l)
  

In [227]:
# what is the first document (book)?
crp[0]

'frankenstein\tmodern\tprometheus\tmary\twollstonecraft\tgodwin\tshelley\tcontent\tletter\tletter\tletter\tletter\tletter\tmrs\tsaville\tengland\tst\tpetersburgh\tdec\t11th\trejoice\thear\tdisaster\taccompany\tcommencement\tenterprise\tregard\tevil\tforeboding\tarrive\tyesterday\tfirst\ttask\tassure\tdear\tsister\twelfare\tincrease\tconfidence\tsuccess\tundertaking\talready\tfar\tnorth\tlondon\twalk\tstreet\tpetersburgh\tfeel\tcold\tnorthern\tbreeze\tplay\tupon\tcheek\tbrace\tnerve\tfill\tdelight\tunderstand\tfeel\tbreeze\ttravel\tregion\ttowards\tadvance\tgive\tforetaste\ticy\tclimes\tinspirit\twind\tpromise\tdaydreams\tbecome\tfervent\tvivid\ttry\tvain\tpersuade\tpole\tseat\tfrost\tdesolation\tever\tpresent\timagination\tregion\tbeauty\tdelight\tmargaret\tsun\tever\tvisible\tbroad\tdisk\tskirt\thorizon\tdiffusing\tperpetual\tsplendour\tleave\tsister\tput\ttrust\tprecede\tnavigator\tsnow\tfrost\tbanished\tsailing\tcalm\tsea\tmay\twafted\tland\tsurpass\twonder\tbeauty\tevery\tregion\th

In [228]:
# So, with "crp" as the list containing all the texts collected from the two dictionaries, the number of texts collected is:
print("Number of unique texts collected: "+str(len(crp)))

Number of unique texts collected: 319


In [39]:
# we transform the list of strings (plain text with \t separated words) into a list of lists of strings,
# stripping the trailing newline so that it does not stick to the last token:
crp = [line.rstrip("\n").split("\t") for line in crp]
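
When a corpus is written as tab-separated lines and read back with `readlines()`, each line keeps its trailing newline, which ends up glued to the last token unless it is stripped before splitting. The sketch below shows the round trip on made-up documents, using an in-memory buffer in place of the file on Drive (an assumption made only to keep the example self-contained).

```python
import io

# Made-up tokenized documents.
docs = [["frankenstein", "modern", "prometheus"], ["letter", "sea"]]

# Write one document per line, tokens separated by tabs.
buf = io.StringIO()
for doc in docs:
    buf.write("\t".join(doc) + "\n")

# Read the lines back.
buf.seek(0)
raw_lines = buf.readlines()

# Splitting directly keeps "\n" attached to the last token of each line.
naive = [line.split("\t") for line in raw_lines]
# Stripping the newline first restores the original tokens.
stripped = [line.rstrip("\n").split("\t") for line in raw_lines]

print(repr(naive[0][-1]))
print(stripped == docs)
```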

In [230]:
crp[0:1]

[['frankenstein',
  'modern',
  'prometheus',
  'mary',
  'wollstonecraft',
  'godwin',
  'shelley',
  'content',
  'letter',
  'letter',
  'letter',
  'letter',
  'letter',
  'mrs',
  'saville',
  'england',
  'st',
  'petersburgh',
  'dec',
  '11th',
  'rejoice',
  'hear',
  'disaster',
  'accompany',
  'commencement',
  'enterprise',
  'regard',
  'evil',
  'foreboding',
  'arrive',
  'yesterday',
  'first',
  'task',
  'assure',
  'dear',
  'sister',
  'welfare',
  'increase',
  'confidence',
  'success',
  'undertaking',
  'already',
  'far',
  'north',
  'london',
  'walk',
  'street',
  'petersburgh',
  'feel',
  'cold',
  'northern',
  'breeze',
  'play',
  'upon',
  'cheek',
  'brace',
  'nerve',
  'fill',
  'delight',
  'understand',
  'feel',
  'breeze',
  'travel',
  'region',
  'towards',
  'advance',
  'give',
  'foretaste',
  'icy',
  'climes',
  'inspirit',
  'wind',
  'promise',
  'daydreams',
  'become',
  'fervent',
  'vivid',
  'try',
  'vain',
  'persuade',
  'pole

### 3. LDA Training
For topic modeling, we will use the Gensim implementation of LDA (Latent Dirichlet Allocation), a popular algorithm for uncovering the hidden topics in a corpus. LDA represents each document as a distribution over topics and each topic as a distribution over words: during training, the model learns which words characterize each topic and which topics make up each document. What we need:

*   `corpus`: the corpus in bag-of-words form, i.e. a document-term matrix where each cell holds the frequency of a term in a given document 
*   `num_topics`: a number of topics K, that should be decided a priori 
*   `id2word`: a dictionary that associates each word in the corpus to a certain ID 
*   `passes`: the number of times the data should be seen by the model during training. 
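
To make the two kinds of distributions concrete, here is a minimal sketch with made-up numbers (the topic names, words and probabilities are hypothetical, not output of the models trained in this notebook). Under LDA, the probability of a word in a document is the topic-weighted mixture `p(w | d) = sum_k p(w | k) * p(k | d)`:

```python
# Hypothetical topic-word distributions: each topic is a distribution over words.
topic_word = {
    "religion": {"god": 0.6, "lord": 0.3, "ship": 0.1},
    "sea": {"god": 0.1, "lord": 0.1, "ship": 0.8},
}

# Hypothetical document-topic distribution: the document is mostly about the sea.
doc_topic = {"religion": 0.25, "sea": 0.75}

def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(word | topic) * p(topic | doc)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())

word_probs = {w: word_probability(w, doc_topic, topic_word)
              for w in ("god", "lord", "ship")}

for w, p in word_probs.items():
    print(f"p({w} | doc) = {p:.3f}")
```

Training works in the opposite direction: only the word counts are observed, and the model estimates both sets of distributions from them.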



In [19]:
# we import the tools we need
import gensim
from gensim import corpora
from gensim import models
from pprint import pprint

In [40]:
# we build the dictionary using the class provided by gensim; it maps each token to an integer id (and lets us look up the token of a given id).
dictionary = corpora.Dictionary(crp)

In [41]:
# now we use the doc2bow method to get the term frequencies of each document: doc2bow converts each doc into a list of tuples (token_id, token_count):
corpus = [dictionary.doc2bow(doc) for doc in crp]
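
To see what a bag-of-words representation contains, the idea can be sketched in plain Python. This is a simplified stand-in, not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and gensim may assign different ids to the same tokens.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a list of tokens into a sorted list of (token_id, count) tuples."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Made-up tokenized documents.
docs = [["letter", "sea", "letter"], ["sea", "frost"]]

token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bows)
```

Word order is discarded in this representation: only which tokens occur and how often is kept, which is all LDA needs.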

In [174]:
# What does "Frankenstein; Or, the modern Prometheus" look like now?
corpus[0]

[(0, 2),
 (1, 2),
 (2, 1),
 (3, 2),
 (4, 2),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 2),
 (12, 3),
 (13, 1),
 (14, 5),
 (15, 1),
 (16, 17),
 (17, 6),
 (18, 1),
 (19, 2),
 (20, 2),
 (21, 9),
 (22, 1),
 (23, 6),
 (24, 1),
 (25, 1),
 (26, 3),
 (27, 1),
 (28, 8),
 (29, 6),
 (30, 1),
 (31, 2),
 (32, 1),
 (33, 3),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 10),
 (39, 3),
 (40, 1),
 (41, 2),
 (42, 11),
 (43, 1),
 (44, 15),
 (45, 4),
 (46, 2),
 (47, 5),
 (48, 11),
 (49, 25),
 (50, 4),
 (51, 2),
 (52, 2),
 (53, 1),
 (54, 4),
 (55, 1),
 (56, 1),
 (57, 10),
 (58, 10),
 (59, 1),
 (60, 2),
 (61, 1),
 (62, 1),
 (63, 1),
 (64, 2),
 (65, 20),
 (66, 2),
 (67, 1),
 (68, 6),
 (69, 2),
 (70, 1),
 (71, 3),
 (72, 1),
 (73, 17),
 (74, 17),
 (75, 9),
 (76, 3),
 (77, 3),
 (78, 1),
 (79, 2),
 (80, 2),
 (81, 1),
 (82, 1),
 (83, 3),
 (84, 1),
 (85, 11),
 (86, 4),
 (87, 3),
 (88, 13),
 (89, 2),
 (90, 2),
 (91, 2),
 (92, 1),
 (93, 1),
 (94, 3),
 (95, 14),
 (96, 7),
 (97, 1),
 (98, 1),
 (99, 

In [181]:
# which word does the id 149 correspond to?
dictionary[149]

'agony'

In [178]:
# human-readable version of output above:
human_c = []
for (token_id, freq) in corpus[0]:
    human_c.append((dictionary[token_id], freq))

In [179]:
human_c

[('11th', 2),
 ('12th', 2),
 ('13th', 1),
 ('17—.', 2),
 ('18th', 2),
 ('19th', 1),
 ('26th', 1),
 ('27th', 1),
 ('28th', 1),
 ('2d', 1),
 ('31st', 1),
 ('5th', 2),
 ('7th', 3),
 ('9th', 1),
 ('abandon', 5),
 ('abbey', 1),
 ('abhor', 17),
 ('abhorrence', 6),
 ('abhorrent', 1),
 ('ability', 2),
 ('abject', 2),
 ('able', 9),
 ('aboard', 1),
 ('abode', 6),
 ('abortion', 1),
 ('abortive', 1),
 ('abroad', 3),
 ('abrupt', 1),
 ('absence', 8),
 ('absent', 6),
 ('absolute', 1),
 ('absolutely', 2),
 ('absolution', 1),
 ('absorb', 3),
 ('abstain', 1),
 ('abstruse', 1),
 ('abyss', 1),
 ('accede', 1),
 ('accent', 10),
 ('accept', 3),
 ('acceptance', 1),
 ('access', 2),
 ('accident', 11),
 ('accidentally', 1),
 ('accompany', 15),
 ('accomplish', 4),
 ('accomplishment', 2),
 ('accord', 5),
 ('accordingly', 11),
 ('account', 25),
 ('accumulate', 4),
 ('accumulation', 2),
 ('accuracy', 2),
 ('accurate', 1),
 ('accurse', 4),
 ('accusation', 1),
 ('accusations.—poor', 1),
 ('accuse', 10),
 ('accustomed'

#### 3.1 Building the LDA model: attempt \#1
Now we start building the LDA model. Since we don't know in advance how many topics we need, we will experiment with different parameters (number of topics, number of passes). We start with a single model with 10 topics and 10 passes. 


In [None]:
# import newly created corpora, put together and eliminate duplicates
# import gensim corpora to create corpus and build dictionary
# import LDA model from gensim
# train using corpus, 100 iterations and 10 topics at least

In [20]:
import time

In [None]:
# Let's try: 10 passes, 10 topics:
lda_model_1 = models.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary, passes=10)

In [None]:
pprint(lda_model_1.print_topics())

[(0,
  '0.022*"shall" + 0.017*"say" + 0.015*"unto" + 0.014*"lord" + 0.012*"god" + '
  '0.011*"come" + 0.011*"man" + 0.008*"son" + 0.008*"thou" + 0.008*"go"'),
 (1,
  '0.020*"de" + 0.012*"en" + 0.007*"ter" + 0.006*"la" + 0.005*"say" + '
  '0.005*"one" + 0.005*"le" + 0.004*"come" + 0.004*"go" + 0.004*"er"'),
 (2,
  '0.009*"return" + 0.007*"p." + 0.005*"may" + 0.004*"c." + 0.004*"l." + '
  '0.004*"tom" + 0.003*"see" + 0.003*"emperor" + 0.003*"ii" + 0.003*"two"'),
 (3,
  '0.023*"shall" + 0.019*"unto" + 0.018*"lord" + 0.016*"thou" + 0.015*"say" + '
  '0.015*"god" + 0.012*"thy" + 0.011*"ye" + 0.010*"thee" + 0.010*"man"'),
 (4,
  '0.015*"say" + 0.009*"one" + 0.008*"go" + 0.007*"see" + 0.006*"man" + '
  '0.006*"come" + 0.006*"make" + 0.005*"know" + 0.004*"well" + 0.004*"time"'),
 (5,
  '0.006*"one" + 0.005*"may" + 0.004*"man" + 0.004*"see" + 0.004*"make" + '
  '0.003*"great" + 0.003*"say" + 0.003*"time" + 0.003*"two" + 0.003*"give"'),
 (6,
  '0.014*"say" + 0.009*"go" + 0.007*"one" + 0.007*"com

#### 3.2 Building the LDA model: attempt \#2
The results don't look great. Let's try another approach: first, let's increase the number of passes to 20; second, let's have a look at the "Subjects" given by Project Gutenberg, which might be a good indicator of the ideal number of topics.  

In [None]:
# Let's check the subjects
subjects = []
for k in books_1:
  if books_1[k] != {} and books_1[k]["Text link"] is not None and books_1[k]["Text link"] not in excluded_files_1:
    subjects.extend(books_1[k]["Subjects"])
for k in books_2:
  if books_2[k] != {} and books_2[k]["Text link"] is not None and books_2[k]["Text link"] not in excluded_files_2:
    subjects.extend(books_2[k]["Subjects"])

list(set(subjects))



['Mathematics',
 'Feminist fiction',
 'English language -- Grammar',
 'Race relations -- Fiction',
 'Political science -- Early works to 1800',
 'Fishes',
 'Superman (Philosophical concept)',
 'Self-actualization (Psychology) -- Fiction',
 'Philosophy and religion',
 'Africa -- Fiction',
 'Sex customs -- Greece',
 'Manned space flight -- Fiction',
 'Saint Petersburg (Russia) -- Fiction',
 'Hens -- Juvenile fiction',
 'Stoics',
 'Russia -- History -- 1801-1917 -- Fiction',
 'Women teachers -- Fiction',
 'Finn, Huckleberry (Fictitious character) -- Fiction',
 "Children's stories, French -- Translations into English",
 'Sled dogs -- Fiction',
 'California -- History, Local',
 'Oral tradition -- England',
 'Murder -- Fiction',
 'Young women -- Social life and customs -- Juvenile fiction',
 'Rationalism',
 'Ducks -- Juvenile fiction',
 'French literature -- 18th century',
 'United States -- Social conditions',
 'Logical atomism',
 'Private investigators -- England -- Fiction',
 'Alice (Fict

In [None]:
# How many unique subjects do we have?
len(set(subjects))

688

In [None]:
sorted(set(subjects))

['Abolitionists -- United States -- Biography',
 'Accident victims -- Fiction',
 'Adoptees -- Fiction',
 'Adultery -- Fiction',
 'Adventure and adventurers -- Fiction',
 'Adventure stories',
 'Adventure stories, English',
 'Adventure stories, French -- Translations into English',
 'Africa -- Fiction',
 'Africa -- History',
 'African American abolitionists -- Biography',
 'African American men -- Fiction',
 'African American orators',
 'African Americans',
 'African Americans -- Crimes against',
 'African Americans -- Fiction',
 'African Americans -- Georgia',
 'African Americans -- History -- Sources',
 'African Americans -- Politics and government -- 20th century',
 'African Americans -- Social conditions -- To 1964',
 'Africans -- France -- Fiction',
 'Ahab, Captain (Fictitious character) -- Fiction',
 'Alchemy',
 'Alcoholics -- Fiction',
 'Alice (Fictitious character from Carroll) -- Juvenile fiction',
 'Alienation (Social psychology) -- Fiction',
 'American fiction -- 19th century'

Looking at the "Subjects", we can identify some macro-categories that we can take as topics:
"Juvenile fiction", "Women fiction", "Politics", "Erotic fiction", "History", "Mystery", "Science", "Religion and Philosophy", "Biography", "Sci-Fi and Fantasy". We can take 10 as the minimum number of topics and build several models, increasing up to a maximum of 40 topics. We can then use the Coherence Score to evaluate the results: the model with the highest coherence score might be the one with the best topics. The 5-topic model serves only as a control.

In [None]:
lda_model_5 = models.LdaModel(corpus=corpus, num_topics=5, id2word=dictionary, passes=20)
lda_model_5.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_5.lda")

In [None]:
lda_model_10 = models.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary, passes=20)
lda_model_10.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_10.lda")

In [None]:
lda_model_15 = models.LdaModel(corpus=corpus, num_topics=15, id2word=dictionary, passes=20)
lda_model_15.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_15.lda")

In [None]:
lda_model_20 = models.LdaModel(corpus=corpus, num_topics=20, id2word=dictionary, passes=20)
lda_model_20.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_20.lda")

In [None]:
lda_model_25 = models.LdaModel(corpus=corpus, num_topics=25, id2word=dictionary, passes=20)
lda_model_25.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_25.lda")

In [None]:
lda_model_30 = models.LdaModel(corpus=corpus, num_topics=30, id2word=dictionary, passes=20)
lda_model_30.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_30.lda")

In [None]:
lda_model_35 = models.LdaModel(corpus=corpus, num_topics=35, id2word=dictionary, passes=20)
lda_model_35.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_35.lda")

In [None]:
lda_model_40 = models.LdaModel(corpus=corpus, num_topics=40, id2word=dictionary, passes=20)
lda_model_40.save("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_40.lda")

In [None]:
lda_models = []
lda_models.append(lda_model_5)
lda_models.append(lda_model_10)
lda_models.append(lda_model_15)
lda_models.append(lda_model_20)
lda_models.append(lda_model_25)
lda_models.append(lda_model_30)
lda_models.append(lda_model_35)
lda_models.append(lda_model_40)

Before moving on, we introduce Topic Coherence in order to evaluate the topics. This metric measures the degree of semantic similarity between the high-scoring words of a topic, which helps distinguish semantically interpretable topics from the others. In this project, I chose C_v coherence, which combines word co-occurrence statistics (NPMI computed over a sliding window) with cosine similarity.

In [None]:
topicnums = [5,10,15,20,25,30,35,40]
scores = []
# pair each model with its number of topics instead of looking it up with index()
for n, model in zip(topicnums, lda_models):
  coherence_model_lda = models.CoherenceModel(model=model, texts=crp, dictionary=dictionary, coherence='c_v')
  coherence_lda = coherence_model_lda.get_coherence()
  scores.append(coherence_lda)
  print("\nCoherence Score of lda_model_"+str(n)+": ", coherence_lda)

print("Highest coherence score: "+str(max(scores)))


Coherence Score of lda_model_5:  0.37060712193829987

Coherence Score of lda_model_10:  0.3804100102150585

Coherence Score of lda_model_15:  0.39526847985354213

Coherence Score of lda_model_20:  0.3804068274515874

Coherence Score of lda_model_25:  0.4266375740410572

Coherence Score of lda_model_30:  0.37065157743965954

Coherence Score of lda_model_35:  0.3827153667922313

Coherence Score of lda_model_40:  0.39799711040404845
Highest coherence score: 0.4266375740410572
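
The scores printed above can be ranked to see how close the candidates are to each other. The sketch below copies the values from this output into plain Python; note that the gaps are small and that LDA training is stochastic, so the ranking may change between runs unless a fixed `random_state` is passed to `LdaModel`.

```python
# Numbers of topics and the C_v scores copied from the output above.
topicnums = [5, 10, 15, 20, 25, 30, 35, 40]
scores = [
    0.37060712193829987, 0.3804100102150585, 0.39526847985354213,
    0.3804068274515874, 0.4266375740410572, 0.37065157743965954,
    0.3827153667922313, 0.39799711040404845,
]

score_by_topics = dict(zip(topicnums, scores))

# Number of topics with the highest coherence score.
best_n = max(score_by_topics, key=score_by_topics.get)

# Full ranking, best first.
ranking = sorted(score_by_topics.items(), key=lambda kv: kv[1], reverse=True)
for n, score in ranking:
    print(f"{n:>2} topics: {score:.4f}")

print("Best number of topics:", best_n)
```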


The highest coherence score is achieved by the model with 25 topics. Let's print out the topics:

In [29]:
lda_model_25 = models.LdaModel.load("drive/MyDrive/CSTA_Proj/ldaModels/lda_model_25.lda")

In [None]:
pprint(lda_model_25.print_topics(num_topics=-1))

[(0,
  '0.029*"de" + 0.019*"en" + 0.013*"r" + 0.012*"k" + 0.012*"b" + 0.011*"p" + '
  '0.011*"ter" + 0.010*"kt" + 0.009*"q" + 0.009*"la"'),
 (1,
  '0.020*"say" + 0.015*"go" + 0.012*"come" + 0.009*"one" + 0.009*"see" + '
  '0.008*"little" + 0.007*"get" + 0.007*"make" + 0.006*"know" + 0.006*"look"'),
 (2,
  '0.010*"say" + 0.007*"go" + 0.007*"one" + 0.006*"know" + 0.006*"lupin" + '
  '0.006*"come" + 0.006*"see" + 0.006*"man" + 0.005*"luis" + '
  '0.005*"raskolnikov"'),
 (3,
  '0.012*"prince" + 0.008*"say" + 0.007*"pierre" + 0.006*"one" + 0.006*"see" + '
  '0.005*"go" + 0.005*"come" + 0.005*"time" + 0.005*"feel" + 0.004*"give"'),
 (4,
  '0.024*"shall" + 0.024*"unto" + 0.021*"lord" + 0.016*"say" + 0.016*"thou" + '
  '0.013*"thy" + 0.011*"god" + 0.011*"ye" + 0.011*"man" + 0.011*"thee"'),
 (5,
  '0.021*"man" + 0.008*"book" + 0.007*"call" + 0.007*"woman" + 0.006*"make" + '
  '0.006*"paine" + 0.005*"time" + 0.005*"word" + 0.005*"say" + 0.005*"one"'),
 (6,
  '0.018*"say" + 0.011*"mr" + 0.007*"se

#### 3.3 Building the LDA model: attempt \#3
They don't look bad, but improvements can be made. Let's remove from the corpus abbreviation-like tokens such as "p." and "c." and single letters such as "r" and "k", which carry no topical meaning. 
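
One way to spot such tokens is to parse the topic strings returned by `print_topics`. The sketch below is only an illustration: it takes the string of topic 2 from attempt #1 as a literal, and the helper names `parse_topic` and `is_confounder` are hypothetical.

```python
import re

# Topic 2 from attempt #1, copied from the output shown earlier.
topic_string = ('0.009*"return" + 0.007*"p." + 0.005*"may" + 0.004*"c." + 0.004*"l." + '
                '0.004*"tom" + 0.003*"see" + 0.003*"emperor" + 0.003*"ii" + 0.003*"two"')

def parse_topic(topic_string):
    """Turn '0.009*"return" + ...' into a list of (word, weight) pairs."""
    matches = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in matches]

def is_confounder(token):
    """True for single-character tokens and tokens containing a letter followed by a dot."""
    return len(token) <= 1 or re.search(r"[a-z]\.", token) is not None

pairs = parse_topic(topic_string)
confounders = [word for word, _ in pairs if is_confounder(word)]

print(confounders)
```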

In [231]:
for tok in crp[0]:
  if re.search(r"[a-z]\.", tok):
    print(tok)
  if len(tok) <= 1:
    print(tok)

r.
r.w
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
m.
n
m.
countenance.—ay
m.
m.
accusations.—poor
m.
i.


In [21]:
def remove_confounders(list_of_toks):
  """This function removes tokens like "p." and single-character tokens from a list of tokens"""
  # build a new list instead of calling remove() while iterating, which would skip elements
  return [tok for tok in list_of_toks if not re.search(r"[a-z]\.", tok) and len(tok) != 1]
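
A note on filtering token lists like this: calling `list.remove` on a list while a `for` loop is iterating over it makes the loop skip the element that follows each removal, so some unwanted tokens can survive. The sketch below, on made-up tokens (the helper names `remove_in_place` and `filter_copy` are hypothetical), contrasts that pattern with building a new list.

```python
import re

def remove_in_place(tokens):
    """Removes matching tokens while iterating over the same list (skips elements)."""
    for tok in tokens:
        if re.search(r"[a-z]\.", tok) or len(tok) == 1:
            tokens.remove(tok)
    return tokens

def filter_copy(tokens):
    """Builds a new list with only the tokens to keep."""
    return [tok for tok in tokens
            if not re.search(r"[a-z]\.", tok) and len(tok) != 1]

# Made-up tokens: two consecutive abbreviations, a single letter, another abbreviation.
toy = ["m.", "m.", "letter", "n", "p.", "sea"]

leftover = remove_in_place(list(toy))  # work on a copy so that toy stays intact
clean = filter_copy(toy)

print(leftover)
print(clean)
```

The list comprehension also leaves the input untouched, which makes it easier to compare the corpus before and after cleaning.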

In [70]:
# we define a new corpus
new_crp = []
for book in crp:
  new_crp.append(remove_confounders(book))

In [71]:
for tok in new_crp[0]:
  if re.search(r"[a-z]\.", tok):
    print(tok)

In [72]:
len(new_crp)

319

In [287]:
# saving the new corpus in the project folder ("w" so that re-running the cell does not append duplicates)
with open("drive/MyDrive/CSTA_Proj/new_crp.txt", "w") as f:
  for b in new_crp:
    f.write("\t".join(b)+"\n")

In [42]:
new_crp = []

with open("drive/MyDrive/CSTA_Proj/new_crp.txt", "r") as f:
  for l in f.readlines():
    new_crp.append(l)

In [43]:
new_crp[0]

'frankenstein\tmodern\tprometheus\tmary\twollstonecraft\tgodwin\tshelley\tcontent\tletter\tletter\tletter\tletter\tletter\tmrs\tsaville\tengland\tst\tpetersburgh\tdec\t11th\trejoice\thear\tdisaster\taccompany\tcommencement\tenterprise\tregard\tevil\tforeboding\tarrive\tyesterday\tfirst\ttask\tassure\tdear\tsister\twelfare\tincrease\tconfidence\tsuccess\tundertaking\talready\tfar\tnorth\tlondon\twalk\tstreet\tpetersburgh\tfeel\tcold\tnorthern\tbreeze\tplay\tupon\tcheek\tbrace\tnerve\tfill\tdelight\tunderstand\tfeel\tbreeze\ttravel\tregion\ttowards\tadvance\tgive\tforetaste\ticy\tclimes\tinspirit\twind\tpromise\tdaydreams\tbecome\tfervent\tvivid\ttry\tvain\tpersuade\tpole\tseat\tfrost\tdesolation\tever\tpresent\timagination\tregion\tbeauty\tdelight\tmargaret\tsun\tever\tvisible\tbroad\tdisk\tskirt\thorizon\tdiffusing\tperpetual\tsplendour\tleave\tsister\tput\ttrust\tprecede\tnavigator\tsnow\tfrost\tbanished\tsailing\tcalm\tsea\tmay\twafted\tland\tsurpass\twonder\tbeauty\tevery\tregion\th

In [44]:
# strip the trailing newline before splitting and skip empty lines
new_crp = [line.rstrip("\n").split("\t") for line in new_crp if line.strip()]

In [45]:
new_crp[:1]

[['frankenstein',
  'modern',
  'prometheus',
  'mary',
  'wollstonecraft',
  'godwin',
  'shelley',
  'content',
  'letter',
  'letter',
  'letter',
  'letter',
  'letter',
  'mrs',
  'saville',
  'england',
  'st',
  'petersburgh',
  'dec',
  '11th',
  'rejoice',
  'hear',
  'disaster',
  'accompany',
  'commencement',
  'enterprise',
  'regard',
  'evil',
  'foreboding',
  'arrive',
  'yesterday',
  'first',
  'task',
  'assure',
  'dear',
  'sister',
  'welfare',
  'increase',
  'confidence',
  'success',
  'undertaking',
  'already',
  'far',
  'north',
  'london',
  'walk',
  'street',
  'petersburgh',
  'feel',
  'cold',
  'northern',
  'breeze',
  'play',
  'upon',
  'cheek',
  'brace',
  'nerve',
  'fill',
  'delight',
  'understand',
  'feel',
  'breeze',
  'travel',
  'region',
  'towards',
  'advance',
  'give',
  'foretaste',
  'icy',
  'climes',
  'inspirit',
  'wind',
  'promise',
  'daydreams',
  'become',
  'fervent',
  'vivid',
  'try',
  'vain',
  'persuade',
  'pole

In [73]:
# we create a new dictionary
new_dictionary = corpora.Dictionary(new_crp)

In [74]:
# and of course a new bow corpus
new_corpus = [new_dictionary.doc2bow(doc) for doc in new_crp]

In [242]:
# we create the models with the indicated topics
new_models = []
topics = [5,10,15,20,25,30]
for n in topics:
  print(time.time())
  new_lda_model = models.LdaModel(corpus=new_corpus, num_topics=n, id2word=new_dictionary, passes=20)
  new_lda_model.save("drive/MyDrive/CSTA_Proj/ldaModels/new_lda_model_"+str(n)+".lda")
  new_models.append(new_lda_model)
  print(time.time())

1612795940.6110427
1612796266.701522
1612796266.7023304
1612796621.9780324
1612796621.979054
1612796996.1095314
1612796996.1096854
1612797393.7125685
1612797393.7138832
1612797817.6818361
1612797817.6834016
1612798290.291002


In [244]:
# Computing topic coherence
scores = []
# pair each model with its number of topics instead of looking it up with index()
for n, model in zip(topics, new_models):
  coherence_model_lda = models.CoherenceModel(model=model, texts=new_crp, dictionary=new_dictionary, coherence='c_v')
  coherence_lda = coherence_model_lda.get_coherence()
  scores.append(coherence_lda)
  print("\nCoherence Score of lda_model_"+str(n)+": ", coherence_lda)

print("Highest coherence score: "+str(max(scores)))


Coherence Score of lda_model_5:  0.3720924284541359

Coherence Score of lda_model_10:  0.3565481286114929

Coherence Score of lda_model_15:  0.3685054465593391

Coherence Score of lda_model_20:  0.3925468596597481

Coherence Score of lda_model_25:  0.36756264727877985

Coherence Score of lda_model_30:  0.38681294132272703
Highest coherence score: 0.3925468596597481


It looks like the best coherence score for the new models is given by 20 topics. Let's load this model and see the topics it gives us:

In [30]:
new_lda_20 = models.LdaModel.load("drive/MyDrive/CSTA_Proj/ldaModels/new_lda_model_20.lda")

In [246]:
pprint(new_lda_20.print_topics(num_topics=-1))

[(0,
  '0.007*"see" + 0.006*"one" + 0.006*"may" + 0.005*"make" + 0.005*"give" + '
  '0.004*"time" + 0.004*"say" + 0.004*"must" + 0.004*"eye" + 0.004*"well"'),
 (1,
  '0.007*"one" + 0.007*"man" + 0.007*"write" + 0.006*"great" + 0.006*"love" + '
  '0.006*"make" + 0.006*"come" + 0.006*"time" + 0.006*"book" + 0.006*"day"'),
 (2,
  '0.010*"return" + 0.005*"may" + 0.003*"tom" + 0.003*"emperor" + 0.003*"ii" + '
  '0.003*"see" + 0.003*"de" + 0.003*"two" + 0.003*"year" + 0.003*"first"'),
 (3,
  '0.026*"god" + 0.017*"ye" + 0.017*"shall" + 0.014*"thou" + 0.012*"say" + '
  '0.011*"unto" + 0.010*"nephi" + 0.010*"come" + 0.008*"alma" + '
  '0.008*"people"'),
 (4,
  '0.011*"go" + 0.010*"come" + 0.009*"make" + 0.009*"say" + 0.009*"one" + '
  '0.008*"upon" + 0.008*"great" + 0.007*"people" + 0.007*"time" + 0.007*"man"'),
 (5,
  '0.011*"say" + 0.006*"like" + 0.006*"one" + 0.006*"go" + 0.005*"see" + '
  '0.005*"come" + 0.005*"man" + 0.005*"little" + 0.004*"look" + 0.004*"know"'),
 (6,
  '0.015*"say" + 0.0

In [75]:
# what is Frankenstein made of? (topic id, probability) pairs for the first document
transf_new_corpus = new_lda_20[new_corpus]
for topic_prob in transf_new_corpus[0]:
  print(topic_prob)

(0, 0.15025349)
(1, 0.017877858)
(6, 0.13062367)
(12, 0.2552825)
(13, 0.039789923)
(15, 0.37923113)
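
The pairs above form a sparse vector: gensim lists only the topics whose probability exceeds a minimum threshold, so the missing topic ids can be read as (close to) zero. As a minimal sketch, the values printed for Frankenstein can be turned into a dense 20-dimensional vector and its dominant topic extracted; the helper name `to_dense` is hypothetical.

```python
# (topic_id, probability) pairs copied from the output above.
doc_topics = [(0, 0.15025349), (1, 0.017877858), (6, 0.13062367),
              (12, 0.2552825), (13, 0.039789923), (15, 0.37923113)]

def to_dense(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        vec[topic_id] = prob
    return vec

dense = to_dense(doc_topics, num_topics=20)

# Topic with the highest probability for this document.
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])

print(dense)
print("Dominant topic:", dominant_topic, "with probability", dominant_prob)
```

Fixed-length vectors of this kind are what allows books to be compared by their position in the topic space.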


#### 3.4 Building the LDA model: attempt \#4
Some of the topics are good, but we can still do better. Let's try to keep only nouns and adjectives and remove everything else: nouns and adjectives tend to convey more topical meaning than verbs.

In [261]:
# Let's check the POS tag of each token and keep only nouns and adjectives
new_doc = []
doc = nlp(" ".join(new_crp[0]))
for tok in doc:
  print(tok.pos_, tok.tag_)
  if tok.pos_ in ("NOUN", "ADJ"):
    new_doc.append(tok.text)

Streaming output truncated to the last 5000 lines.
ADJ JJ
PROPN NNP
NOUN NN
NUM CD
NOUN NN
ADJ JJS
NOUN NN
VERB VB
ADJ JJ
ADJ JJ
PROPN NNP
PROPN NNP
PROPN NNP
PROPN NNP
PRON NN
NOUN NN
VERB VB
ADJ JJ
NOUN NN
NOUN NN
VERB VBP
NOUN NN
VERB VBD
PRON NN
NOUN NN
VERB VBP
ADJ JJ
NOUN NN
ADJ JJ
VERB VBP
ADJ JJ
NOUN NN
VERB VBP
ADJ JJ
NOUN NN
ADP IN
NOUN NN
ADV RB
ADJ JJ
ADV RB
VERB VBP
PROPN NNP
NOUN NN
NOUN NN
PROPN NNP
PROPN NNP
NOUN NN
ADV RB
ADJ JJ
VERB VBP
ADV RB
ADJ JJ
NOUN NN
VERB VBG
ADJ JJ
NOUN NN
VERB VBP
DET DT
ADJ JJ
NOUN NN
PROPN NNP
PROPN NNP
PROPN NNP
ADJ JJ
ADJ JJ
NOUN NN
VERB VBP
ADV RB
PROPN NNP
PROPN NNP
PROPN NNP
VERB VBD
NOUN NN
SCONJ IN
ADJ JJ
PROPN NNP
PROPN NNP
NOUN NN
NOUN NN
NOUN NN
NOUN NN
NOUN NN
ADV RB
VERB VB
NOUN NN
NOUN NN
PROPN NNP
PROPN NNP
VERB VBD
ADJ JJ
PROPN NNPS
PROPN NNP
PROPN NNP
PROPN NNP
VERB VBP
NOUN NN
PROPN NNP
PROPN NNP
PROPN NNPS
PROPN NNP
PROPN NNPS
PROPN NNP
VERB VB
ADV RBR
PROPN NNP
PROPN NNP
PROPN NNP
PROPN NNP
NOUN NN
PROPN NNP

In [260]:
new_doc

['modern',
 'rejoice',
 'disaster',
 'commencement',
 'enterprise',
 'regard',
 'evil',
 'arrive',
 'yesterday',
 'first',
 'task',
 'assure',
 'dear',
 'sister',
 'welfare',
 'increase',
 'confidence',
 'success',
 'undertaking',
 'cold',
 'northern',
 'breeze',
 'play',
 'cheek',
 'brace',
 'nerve',
 'fill',
 'delight',
 'breeze',
 'travel',
 'region',
 'advance',
 'foretaste',
 'wind',
 'promise',
 'daydreams',
 'fervent',
 'vivid',
 'vain',
 'persuade',
 'pole',
 'seat',
 'frost',
 'desolation',
 'present',
 'imagination',
 'region',
 'beauty',
 'delight',
 'visible',
 'broad',
 'disk',
 'skirt',
 'horizon',
 'perpetual',
 'splendour',
 'sister',
 'trust',
 'precede',
 'navigator',
 'snow',
 'frost',
 'sailing',
 'land',
 'surpass',
 'wonder',
 'beauty',
 'region',
 'habitable',
 'globe',
 'production',
 'features',
 'example',
 'phenomena',
 'body',
 'undiscovered',
 'solitude',
 'country',
 'eternal',
 'light',
 'wondrous',
 'power',
 'attract',
 'needle',
 'celestial',
 'observa

In [58]:
# whole books exceed spaCy's default text-length limit, so we raise it to avoid errors
nlp.max_length = 20000000

In [22]:
def na_only(list_of_toks):
  """Take a list of tokens and drop every token that is not a noun or an adjective."""
  na_toks = []
  # we skip NER and the parser to save time and memory: we only need the POS tags
  doc = nlp(" ".join(list_of_toks), disable=['ner', 'parser'])
  for tok in doc:
    if tok.pos_ in ("NOUN", "ADJ"):
      na_toks.append(tok.text)
  return na_toks

In [76]:
# We build another corpus
crpna = []
for book in new_crp:
  crpna.append(na_only(book)) 

In [288]:
# Saving to the project folder ("w" overwrites, so re-running the cell does not duplicate the books)
with open("drive/MyDrive/CSTA_Proj/crpna.txt", "w") as f:
  for b in crpna:
    f.write("\t".join(b)+"\n")

In [48]:
crpna = []
with open("drive/MyDrive/CSTA_Proj/crpna.txt", "r") as f:
  for l in f:
    crpna.append(l.rstrip("\n"))  # drop the newline, otherwise it sticks to the last token

In [50]:
crpna = [line.split("\t") for line in crpna]
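
The save and load cells above can be folded into two small helpers. This is only a sketch: the names `save_corpus` and `load_corpus` are hypothetical, the toy documents reuse a few tokens from the output above, and the file goes to a temporary folder rather than to the project folder.

```python
import os
import tempfile

def save_corpus(docs, path):
    """Write one document per line, with tokens separated by tabs."""
    # "w" overwrites the file, so running this twice does not duplicate documents
    with open(path, "w", encoding="utf-8") as f:
        for tokens in docs:
            f.write("\t".join(tokens) + "\n")

def load_corpus(path):
    """Read the file back into a list of token lists."""
    with open(path, encoding="utf-8") as f:
        # strip the newline before splitting, otherwise it sticks to the last token
        return [line.rstrip("\n").split("\t") for line in f]

# toy documents, only for illustration
toy_docs = [["modern", "disaster", "enterprise"], ["cold", "northern", "breeze"]]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_crpna.txt")

save_corpus(toy_docs, toy_path)
restored = load_corpus(toy_path)
print(restored == toy_docs)
```

The round trip only works if tokens never contain a tab or a newline, which should hold for a corpus that has already been tokenized and cleaned.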

In [77]:
# another id2word dictionary
na_dictionary = corpora.Dictionary(crpna)

In [78]:
#and another bow corpus
corpusna = [na_dictionary.doc2bow(doc) for doc in crpna]

In [275]:
# building the models and computing the topic coherence scores
scores=[]
na_models = []
for t in topics:
  print("Creating model...")
  model = models.LdaModel(corpus=corpusna, num_topics=t, passes=20, id2word=na_dictionary)
  model.save("drive/MyDrive/CSTA_Proj/ldaModels/na_model_"+str(t)+".lda")
  na_models.append(model)
  print("Calculating coherence score...")
  coherence_model_lda = models.CoherenceModel(model=model, texts=crpna, dictionary=na_dictionary, coherence='c_v')
  coherence_lda = coherence_model_lda.get_coherence()
  scores.append(coherence_lda)
  print("\nCoherence Score of lda_model_"+str(t)+": ", coherence_lda)

print("Highest coherence score: "+str(max(scores)))


Creating model...
Calculating coherence score...

Coherence Score of lda_model_5:  0.33259917231152103
Creating model...
Calculating coherence score...

Coherence Score of lda_model_10:  0.37014386540447475
Creating model...
Calculating coherence score...

Coherence Score of lda_model_15:  0.34798100810551075
Creating model...
Calculating coherence score...

Coherence Score of lda_model_20:  0.36135707747060314
Creating model...
Calculating coherence score...

Coherence Score of lda_model_25:  0.3600935643275151
Creating model...
Calculating coherence score...

Coherence Score of lda_model_30:  0.3610735479108727
Highest coherence score: 0.37014386540447475
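
The loop prints the highest score but not the number of topics it belongs to. A minimal sketch of how to recover it, assuming `topics` holds the six values that appear in the log labels (5 to 30 in steps of 5) and copying the scores from the log above:

```python
# topic counts inferred from the labels in the log (assumption)
topics = [5, 10, 15, 20, 25, 30]
# coherence scores copied from the log
scores = [0.33259917231152103, 0.37014386540447475, 0.34798100810551075,
          0.36135707747060314, 0.3600935643275151, 0.3610735479108727]

# pair each topic count with its score and sort from the highest score down
ranked = sorted(zip(topics, scores), key=lambda pair: pair[1], reverse=True)
best_t, best_score = ranked[0]

print(f"Best number of topics: {best_t} (c_v = {best_score:.4f})")
print("Runner-up:", ranked[1][0])
```

The gap between the runner-up and the models that follow it is very small, so the ranking below the first place should not be over-interpreted.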


In [31]:
# Let's implement the 10 topic na model:
na_model_10 = models.LdaModel.load("drive/MyDrive/CSTA_Proj/ldaModels/na_model_10.lda")

In [277]:
# Visualizing topics
pprint(na_model_10.print_topics())

[(0,
  '0.015*"man" + 0.008*"government" + 0.007*"time" + 0.007*"people" + '
  '0.006*"state" + 0.005*"power" + 0.005*"great" + 0.005*"good" + '
  '0.005*"right" + 0.005*"work"'),
 (1,
  '0.010*"good" + 0.009*"quixote" + 0.008*"time" + 0.008*"great" + '
  '0.006*"sancho" + 0.006*"man" + 0.005*"day" + 0.004*"hand" + 0.004*"word" + '
  '0.004*"lady"'),
 (2,
  '0.013*"man" + 0.011*"time" + 0.010*"great" + 0.008*"ship" + 0.007*"good" + '
  '0.007*"day" + 0.006*"little" + 0.006*"place" + 0.006*"many" + '
  '0.006*"people"'),
 (3,
  '0.015*"great" + 0.012*"country" + 0.010*"value" + 0.010*"price" + '
  '0.009*"part" + 0.008*"money" + 0.008*"time" + 0.006*"different" + '
  '0.006*"good" + 0.006*"capital"'),
 (4,
  '0.027*"man" + 0.019*"people" + 0.016*"son" + 0.015*"day" + 0.012*"thing" + '
  '0.012*"child" + 0.010*"hand" + 0.009*"land" + 0.009*"word" + '
  '0.008*"behold"'),
 (5,
  '0.011*"little" + 0.010*"time" + 0.008*"good" + 0.007*"old" + 0.006*"hand" + '
  '0.006*"day" + 0.005*"much" + 

In [79]:
# What's Frankenstein made of?
transf_na_crp_1 = na_model_10[corpusna]
for article in transf_na_crp_1[0]:
  print(article)

(1, 0.015994446)
(2, 0.21569625)
(5, 0.32476985)
(6, 0.15360296)
(8, 0.0140636545)
(9, 0.26034093)
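
Only six of the ten topics are listed because, as far as I know, Gensim leaves out the topics whose probability falls below the model's `minimum_probability` threshold. A small sketch of how to read such an output, with the pairs copied from the cell above:

```python
# (topic_id, probability) pairs copied from the output above
frankenstein_topics = [(1, 0.015994446), (2, 0.21569625), (5, 0.32476985),
                       (6, 0.15360296), (8, 0.0140636545), (9, 0.26034093)]

# the dominant topic is the pair with the highest probability
dominant_topic, dominant_share = max(frankenstein_topics, key=lambda pair: pair[1])

# whatever is missing from the total belongs to the topics that were not listed
listed_mass = sum(prob for _topic, prob in frankenstein_topics)

print("Dominant topic:", dominant_topic)
print("Probability mass not listed:", round(1 - listed_mass, 4))
```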


In [32]:
# Let's also implement the second best:
na_model_20 = models.LdaModel.load("drive/MyDrive/CSTA_Proj/ldaModels/na_model_20.lda")

In [279]:
# Visualizing topics
pprint(na_model_20.print_topics())

[(0,
  '0.013*"man" + 0.008*"time" + 0.007*"little" + 0.006*"eye" + 0.006*"day" + '
  '0.006*"hand" + 0.006*"old" + 0.005*"woman" + 0.004*"good" + 0.004*"night"'),
 (1,
  '0.026*"little" + 0.013*"old" + 0.010*"man" + 0.010*"good" + 0.009*"tree" + '
  '0.008*"child" + 0.008*"day" + 0.008*"great" + 0.007*"time" + 0.006*"water"'),
 (2,
  '0.015*"man" + 0.014*"thing" + 0.008*"word" + 0.008*"idea" + 0.006*"sense" + '
  '0.006*"knowledge" + 0.006*"world" + 0.006*"nature" + 0.005*"great" + '
  '0.005*"fact"'),
 (3,
  '0.010*"little" + 0.010*"time" + 0.009*"man" + 0.009*"old" + 0.008*"good" + '
  '0.007*"hand" + 0.006*"eye" + 0.006*"day" + 0.006*"room" + 0.006*"thing"'),
 (4,
  '0.014*"government" + 0.012*"man" + 0.010*"state" + 0.008*"power" + '
  '0.008*"people" + 0.007*"time" + 0.007*"nation" + 0.006*"right" + '
  '0.006*"law" + 0.006*"great"'),
 (5,
  '0.013*"enemy" + 0.013*"war" + 0.010*"great" + 0.009*"time" + 0.008*"ship" + '
  '0.007*"force" + 0.007*"attack" + 0.007*"man" + 0.006*"orde

In [80]:
# What's Frankenstein made of?
na_transf_crp_2 = na_model_20[corpusna]
for article in na_transf_crp_2[0]:
  print(article)

(0, 0.21325858)
(2, 0.024676401)
(6, 0.52985597)
(7, 0.048470914)
(8, 0.09273489)
(9, 0.018129656)
(11, 0.03640005)
(12, 0.01915581)
(18, 0.010223361)


### 4. Creating the recommender system
The results of both models are fairly satisfactory, and we are almost at the end. We have the books and we have the topics: it is time to build a recommender system on top of them. Part of the code I use is taken from https://humboldt-wi.github.io/blog/research/information_systems_1819/is_lda_final/.

We build a similarity matrix over our corpus: each cell contains the similarity between two documents of the corpus, computed on their topic distributions.
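
The score stored in each cell by Gensim's `MatrixSimilarity` is a cosine similarity. To make that concrete, here is a plain-Python sketch of the same measure on sparse `(topic_id, weight)` vectors. The function name `cosine_sparse` and the three topic distributions are made up for illustration; this is not Gensim's implementation.

```python
import math

def cosine_sparse(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (topic_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # only topics present in both vectors contribute to the dot product
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# made-up topic distributions, only for illustration
doc_a = [(0, 0.7), (3, 0.3)]
doc_b = [(0, 0.6), (3, 0.4)]
doc_c = [(1, 1.0)]

print(cosine_sparse(doc_a, doc_a))  # a document compared with itself
print(cosine_sparse(doc_a, doc_b))  # two documents with a similar topic mix
print(cosine_sparse(doc_a, doc_c))  # two documents with no topic in common
```

Two books with almost the same topic mix score close to 1, while two books that share no topic score 0. With few topics, many books end up with nearly identical mixes, which is why scores very close to 1 are common in the recommendation lists.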

In [23]:
from gensim import similarities

In [54]:
#building the similarity matrix
index = similarities.MatrixSimilarity(lda_model_25[corpus])

In [None]:
index

In [55]:
# We transform the BoW counts into a topic space of lower dimensionality
transformed_corpus = lda_model_25[corpus]

To show the titles of the books we need the dictionary we built at the beginning. Since we have two of them, we have to merge them and make sure that there are no accidental duplicates.

In [24]:
def merge_dicts(dict_1, dict_2):
  """Merge two book dictionaries without losing data, applying the same filters
  used when building the corpus."""
  clean_dict = {}
  for d in (dict_1, dict_2):
    for k, book in d.items():
      # skip empty entries, books without a text link and the manually excluded files
      if book != {} and book["Text link"] is not None and book["Text link"] not in excluded_files:
        clean_dict[k] = book
  return clean_dict

In [33]:
excluded_files = ["/ebooks/2800.txt.utf-8", "/ebooks/1.txt.utf-8", "/ebooks/13309.txt.utf-8", "/ebooks/3330.txt.utf-8", "/ebooks/31100.txt.utf-8", "/ebooks/690.txt.utf-8", "/ebooks/3726.txt.utf-8" ]

In [25]:
def create_catalogue(dict_):
  """Build two lookup tables from a book dictionary: position -> book id and
  book id -> position. The position follows the order of the books in the dictionary."""
  new_d = {}
  for k in dict_:
    if dict_[k] != {}:
      new_d[k] = dict_[k]
  idx2book = dict(enumerate(new_d))
  book2idx = {v: k for k, v in idx2book.items()}
  return idx2book, book2idx

In [34]:
# Creating the unified dictionary and the catalogue
books_final = merge_dicts(books_1, books_2)
idx2books, books2idx = create_catalogue(books_final)

In [37]:
with open("drive/MyDrive/CSTA_Proj/books_final.json", "w") as f:
  json.dump(books_final,f, ensure_ascii=False, indent=2)

In [None]:
with open("drive/MyDrive/CSTA_Proj/books_final.json") as f:
  books_final = json.load(f)

In [35]:
print("Total number of books: "+str(len(books_final)))
print("----")
print("Exactly the same as the length of our corpus.")

Total number of books: 319
----
Exactly the same as the length of our corpus.


In [56]:
# Creating a title -> book id mapping
title2idx = {books_final[k]["Title"]:k for k in books_final}

In [26]:
def recommender(title):
  """Take a title (used as a regex pattern); for every book whose title matches, show the
  10 books most similar to it according to the topics."""
  books_checked = 0 # counts the books whose title does not match
  for i in range(len(books_final)):
    recommendation_scores = [] # we initialize an empty list
    if re.search(title, books_final[idx2books[i]]["Title"]): # we use regex to check the title
      lda_vectors = transformed_corpus[i] # the topic distribution of the matching document
      sims = index[lda_vectors] # similarity between this document and every document in the index
      sims = list(enumerate(sims)) # pairs in the format (document position, similarity score)
      for sim in sims:
        book_num = sim[0] # the document position, which we use as index into the catalogue
        recommendation_score = [books_final[idx2books[book_num]]["Title"], sim[1]] # [title, similarity score]
        recommendation_scores.append(recommendation_score)
            
      recommendation = sorted(recommendation_scores, key=lambda x: x[1], reverse=True) # we sort from the highest score down
      
      print('Here are your recommendations for "{}":'.format(title))
      display(recommendation[:11]) # 11 entries: the queried book normally ranks first, followed by its top 10
    else:
      books_checked +=1 # let's check another book
    # what if we asked for a title that we don't have?
    if books_checked == len(books_final):
      print('Sorry, but it looks like "{}" is not available.'.format(title))

In [284]:
recommender("Frankenstein")

Here are your recommendations for "Frankenstein":


[['Frankenstein; Or, The Modern Prometheus', 1.0],
 ['The Life and Adventures of Robinson Crusoe', 0.9853834],
 ['The Life and Most Surprising Adventures of Robinson Crusoe, of York, Mariner (1801)',
  0.9830116],
 ['The Life and Adventures of Robinson Crusoe (1808)', 0.98249674],
 ['The Further Adventures of Robinson Crusoe', 0.9802911],
 ["Seneca's Morals of a Happy Life, Benefits, Anger and Clemency", 0.9780187],
 ['The Interesting Narrative of the Life of Olaudah Equiano, Or Gustavus Vassa, The AfricanWritten By Himself',
  0.9757753],
 ["Gulliver's Travels into Several Remote Nations of the World", 0.97442174],
 ['Poor Folk', 0.97437114],
 ['A Discourse Upon the Origin and the Foundation of the Inequality Among Mankind',
  0.9730244],
 ['Letters from an American Farmer', 0.97095066]]

In [202]:
for article in transformed_corpus[0]:
  print(article)

(8, 0.0688134)
(9, 0.12797539)
(11, 0.26709473)
(14, 0.09152394)
(15, 0.015535369)
(16, 0.2611612)
(17, 0.03984548)
(23, 0.092489265)


In [203]:
# checking the silver standard
books_final["84"]["Similar books"]

[['1342', 'Austen, Jane', 'Pride and Prejudice'],
 ['345', 'Stoker, Bram', 'Dracula'],
 ['46',
  'Dickens, Charles',
  'A Christmas Carol in Prose; Being a Ghost Story of Christmas'],
 ['11', 'Carroll, Lewis', "Alice's Adventures in Wonderland"],
 ['42324',
  'Shelley, Mary Wollstonecraft',
  'Frankenstein; Or, The Modern Prometheus'],
 ['98', 'Dickens, Charles', 'A Tale of Two Cities'],
 ['43',
  'Stevenson, Robert Louis',
  'The Strange Case of Dr. Jekyll and Mr. Hyde'],
 ['1661', 'Doyle, Arthur Conan', 'The Adventures of Sherlock Holmes'],
 ['2701', 'Melville, Herman', 'Moby Dick; Or, The Whale'],
 ['25344', 'Hawthorne, Nathaniel', 'The Scarlet Letter'],
 ['1080', 'Swift, Jonathan', 'A Modest Proposal'],
 ['174', 'Wilde, Oscar', 'The Picture of Dorian Gray'],
 ['5200', 'Kafka, Franz', 'Metamorphosis'],
 ['1260', 'Brontë, Charlotte', 'Jane Eyre: An Autobiography'],
 ['76', 'Twain, Mark', 'Adventures of Huckleberry Finn'],
 ['41445',
  'Shelley, Mary Wollstonecraft',
  'Frankenstein; 

With the unpolished 25-topic model, none of the recommendations appears in the silver standard provided by Project Gutenberg. Let's try with the other models:

In [81]:
index_20 = similarities.MatrixSimilarity(new_lda_20[new_corpus])

In [82]:
index_na_10 = similarities.MatrixSimilarity(na_model_10[corpusna])

In [83]:
index_na_20 = similarities.MatrixSimilarity(na_model_20[corpusna])

I redefine the function so that it accepts different models, corpora and similarity matrices:

In [27]:
def recommender(title, corpus, model, index):
  """Take a title (used as a regex pattern), a BoW corpus, an LDA model and the matching
  similarity index; for every book whose title matches, show the 20 books most similar to it
  according to the topics."""
  books_checked = 0
  for i in range(len(books_final)):
    recommendation_scores = []
    if re.search(title, books_final[idx2books[i]]["Title"]):
      transf_crp = model[corpus]
      lda_vectors = transf_crp[i]
      sims = index[lda_vectors]
      sims = list(enumerate(sims))
      for sim in sims:
        book_num = sim[0]
        recommendation_score = [books_final[idx2books[book_num]]["Title"], sim[1]]
        recommendation_scores.append(recommendation_score)
            
      recommendation = sorted(recommendation_scores, key=lambda x: x[1], reverse=True)
      
      print('Here are your recommendations for "{}":'.format(title))
      display(recommendation[:21]) # 21 entries: the queried book normally ranks first, followed by its top 20
    else:
      books_checked +=1
            
    if books_checked == len(books_final):
      print('Sorry, but it looks like "{}" is not available.'.format(title))

In [309]:
recommender("Frankenstein", new_corpus, new_lda_20, index_20)

Here are your recommendations for "Frankenstein":


[['Frankenstein; Or, The Modern Prometheus', 0.9999201],
 ['Narrative of the Life of Frederick Douglass, an American Slave', 0.8747028],
 ['Letters from an American Farmer', 0.8566257],
 ['The Souls of Black Folk', 0.8504074],
 ['1900; or, The last President', 0.8359371],
 ['The Negro Problem', 0.818243],
 ['Masterpieces of Negro EloquenceThe Best Speeches Delivered by the Negro from the days ofSlavery to the Present Time',
  0.8115776],
 ['Why is the Negro Lynched?', 0.7933333],
 ['The Autobiography of Benjamin Franklin', 0.7928943],
 ['The Autobiography of an Ex-Colored Man', 0.78978646],
 ['The Monk: A Romance', 0.78843075],
 ['Autobiography of Benjamin Franklin', 0.78802854],
 ['Ourika', 0.7818582],
 ['Memoirs of Benjamin Franklin; Written by Himself. [Vol. 1 of 2]',
  0.7760607],
 ['The Works of Edgar Allan Poe — Volume 4', 0.77598226],
 ['Barry Lyndon', 0.7580004],
 ['How to Speak and Write Correctly', 0.7575912],
 ['The Writings of Thomas Paine, Complete', 0.7557366],
 ['The Dec

In [85]:
recommender("Pride and Prejudice", corpusna, na_model_10, index_na_10)

Here are your recommendations for "Pride and Prejudice":


[['Pride and Prejudice', 1.0],
 ['Persuasion', 0.99999493],
 ['Emma', 0.9999169],
 ['Mansfield Park', 0.999784],
 ['Sense and Sensibility', 0.99967164],
 ['Northanger Abbey', 0.99856526],
 ['Dombey and Son', 0.9906347],
 ['David Copperfield', 0.9905523],
 ['Bleak House', 0.9905132],
 ['The Letters of Jane Austen', 0.98617226],
 ["Oliver Twist; or, The Parish Boy's Progress. Illustrated", 0.98453766],
 ['Oliver Twist', 0.98435897],
 ['In the Cage', 0.9734034],
 ['Chess Fundamentals', 0.9677506],
 ['The Diary of a Nobody', 0.9639306],
 ['Love and Freindship [sic]', 0.9600153],
 ['Great Expectations', 0.9573937],
 ['Jane Eyre: An Autobiography', 0.9377261],
 ['Villette', 0.93148],
 ['The Turn of the Screw', 0.9287051],
 ['Hard Times', 0.92625314]]

In [84]:
recommender("Pride and Prejudice", corpusna, na_model_20, index_na_20)

Here are your recommendations for "Pride and Prejudice":


[['Pride and Prejudice', 1.0],
 ['Emma', 1.0],
 ['Sense and Sensibility', 1.0],
 ['Mansfield Park', 0.99901533],
 ['Persuasion', 0.9988051],
 ['Northanger Abbey', 0.99871975],
 ['Love and Freindship [sic]', 0.99758077],
 ['Ourika', 0.991024],
 ['The Monk: A Romance', 0.9622138],
 ['Popular Tales', 0.95278895],
 ['Chronicles of the Canongate, 1st Series', 0.9340574],
 ['The Castle of Otranto', 0.91495025],
 ['Frankenstein; Or, The Modern Prometheus', 0.90877664],
 ['The Letters of Jane Austen', 0.9013624],
 ['Mathilda', 0.8689619],
 ['Candide', 0.8106996],
 ['Pamela Censured', 0.7448243],
 ['Barry Lyndon', 0.65044606],
 ['The Idiot', 0.62263393],
 ['Letters from an American Farmer', 0.5524795],
 ['A Modest Proposal', 0.53056765]]
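
The comparison with the silver standard was done by eye. It could also be automated; below is a minimal sketch, assuming the recommendations are `[title, score]` pairs and the silver standard entries are `[id, author, title]` lists, as in the outputs above. The function name `silver_hits` is hypothetical, the two lists are short excerpts copied from the Frankenstein outputs, and the last call uses made-up recommendations.

```python
def silver_hits(recommendations, silver_standard, query_title, k=10):
    """Return the recommended titles (top k, query excluded) found in the silver standard."""
    silver_titles = {entry[2] for entry in silver_standard}
    # the queried book usually ranks first, so we drop it before taking the top k
    top_k = [title for title, _score in recommendations if title != query_title][:k]
    return [title for title in top_k if title in silver_titles]

query = "Frankenstein; Or, The Modern Prometheus"

# excerpt of the recommendations given by the 25-topic model
recs_25 = [
    ["Frankenstein; Or, The Modern Prometheus", 1.0],
    ["The Life and Adventures of Robinson Crusoe", 0.9853834],
    ["Gulliver's Travels into Several Remote Nations of the World", 0.97442174],
    ["Poor Folk", 0.97437114],
    ["Letters from an American Farmer", 0.97095066],
]

# excerpt of the silver standard of book 84
silver_excerpt = [
    ["1342", "Austen, Jane", "Pride and Prejudice"],
    ["345", "Stoker, Bram", "Dracula"],
    ["42324", "Shelley, Mary Wollstonecraft", "Frankenstein; Or, The Modern Prometheus"],
    ["1080", "Swift, Jonathan", "A Modest Proposal"],
]

print(silver_hits(recs_25, silver_excerpt, query))
# made-up recommendations, to show what a hit looks like
print(silver_hits([["Dracula", 0.9], ["Emma", 0.8]], silver_excerpt, query))
```

Matching on the exact title is fragile, because different editions of the same book have slightly different titles; matching on the Project Gutenberg ids would probably be more reliable.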

### 5. Conclusion
It seems that the best results are obtained by the 10-topic model trained on nouns and adjectives only. Even if the books recommended by Project Gutenberg do not appear at the same positions, it is already quite satisfying to see them in the top 20.
However, this work has some limitations that must be taken into account. Firstly, the corpus is quite small: the data collection had to be interrupted earlier than expected due to runtime/server errors that I could not overcome. Secondly, the data I collected is not entirely clean, and some of the books in the book dictionary are missing some information because of the different formatting of the Project Gutenberg landing pages.