In [255]:
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup as bs
from unidecode import unidecode
from collections import Counter

In this notebook, we use the BeautifulSoup library to scrape Project Gutenberg's author index pages and collect the books listed for a given author.

## Extracting the Information

In [58]:
authweb = "http://www.gutenberg.org/browse/authors/c"
r = requests.get(authweb).text
soup = bs(r, "html.parser")

In [59]:
print(soup.prettify()[:812])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="ebooks, ebook, books, book, free, online, audio" name="keywords"/>
  <meta content="33000+ free ebooks online" name="description"/>
  <meta content="public" name="classification"/>
  <meta content="text/css" http-equiv="Content-Style-Type"/>
  <script type="application/javascript">
   if (top != self) {
        top.location.replace ('http://www.gutenberg.org');
        alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');
      }
  </script>
  <link href="/css/pg-002.css" rel="stylesheet" type="text/css"/>


BeautifulSoup exposes several element types. The two most commonly used are `Tag`, which may contain other nested tags, and `NavigableString`, which holds the text found inside the HTML document.

In [60]:
print(f"{soup.title}\t {type(soup.title)}")

<title>Browse By Author: C - Project Gutenberg</title>	 <class 'bs4.element.Tag'>


In [61]:
print(f"{soup.title.string}\t {type(soup.title.string)}")

Browse By Author: C - Project Gutenberg	 <class 'bs4.element.NavigableString'>


In [71]:
def match_author(link, author):
    """
    Check whether the text of an 'a' tag contains the name of
    an author. The tag is expected to look like:
    '<a name="a45634">Saar, Ferdinand von, 1833-1906</a>'
    
    Parameters
    ----------
    link: bs4.element.Tag or None
        The link which contains the name of the author
    author: string
        Name (possibly partial) of the author to compare
    Returns
    -------
    bool
        True if 'author' is found in the text of the link
    """
    # 'h2.a' is None when the heading has no link, and 'link.string'
    # is None when the link does not wrap a single string
    if link is None or link.string is None:
        return False
    return author in link.string

def titles_languages_links(soup, author):
    """
    Given a 'bs4.BeautifulSoup' element of a webpage
    of authors from Project Gutenberg, return all the
    titles, languages and links of every book of the
    author
    """
    titles, languages, links = list(), list(), list()
    # Compiled regex to remove the parentheses around the
    # language and '(as author)' or any other similar match.
    # A raw string avoids invalid escape sequences such as "\("
    subre = re.compile(r"( \(as [A-Za-z]+\)|\(|\))")
    h2_vect = soup.find_all("h2")
    for h2 in h2_vect:
        h2_link = h2.a
        if match_author(h2_link, author):
            # The first sibling of h2 is a new line ("\n"), the
            # second sibling is a list of elements, out of which we
            # care about those that have a 'pgdbetext' class
            elements = h2.next_sibling.next_sibling.find_all(attrs={"class": "pgdbetext"})
            for element in elements:
                titles.append(element.a.text)
                links.append(element.a.get("href"))
                languages.append(subre.sub("", element.contents[1].strip()))
    return titles, languages, links

def author_dataframe(author, soup, remove_duplicates=True):
    """
    Obtain a dataframe with columns: title, language and link
    for a given author
    """
    headers = ["title", "language", "link"]
    auth_titles = pd.DataFrame(dict(zip(headers, titles_languages_links(soup, author))))
    if remove_duplicates:
        # Keep only the first occurrence of each title
        auth_titles = auth_titles.drop_duplicates("title")
    
    return auth_titles
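
To see what the compiled pattern in `titles_languages_links` does, the sketch below applies it to a few hand-written strings. The sample strings are assumptions made up for illustration; the exact text that Project Gutenberg places after each book link may differ.

```python
import re

# Same pattern as in titles_languages_links: drop " (as <role>)"
# and any remaining parentheses
subre = re.compile(r"( \(as [A-Za-z]+\)|\(|\))")

# Hypothetical samples of the text that follows a book link
samples = [
    " (Spanish) (as Author)\n",
    " (English) (as Translator)\n",
    " (Finnish)\n",
]

# strip() removes the surrounding whitespace before the substitution,
# mirroring the call inside the function
cleaned = [subre.sub("", sample.strip()) for sample in samples]
print(cleaned)
```

Only the language name survives in each case, which is what ends up in the `language` column.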

In [85]:
authweb = "http://www.gutenberg.org/browse/authors/c"
r = requests.get(authweb).text
soup = bs(r, "html.parser")

author = "Cervantes Saavedra, Miguel"
authordf = author_dataframe(author, soup)
authordf.head()

Unnamed: 0,language,link,title
0,Finnish,/ebooks/45203,Älykkään ritarin Don Quijote de la Manchan elä...
1,Dutch,/ebooks/28469,Don Quichot van La Mancha
2,Spanish,/ebooks/2000,Don Quijote
3,English,/ebooks/996,Don Quixote
4,English,/ebooks/14420,The Exemplary Novels of Cervantes


## Extracting Specific Book Information

In [243]:
book_title = "Don Quijote"
book_link = authordf.query(f"language == 'Spanish' and title == '{book_title}'").link.iloc[0]
bookid = re.sub("[^0-9]", "", book_link)
bookweb = f"http://www.gutenberg.org/cache/epub/{bookid}/pg{bookid}.txt"
r = requests.get(bookweb)
book = r.text
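
The book id is whatever remains after stripping every non-digit character from the link. A minimal sketch of that step follows, using the link listed for 'Don Quijote' in the table above. The helper name `plain_text_url` is hypothetical, and the URL pattern is simply the one used in the cell above.

```python
import re

def plain_text_url(book_link):
    """Build the plain-text URL for a link such as '/ebooks/2000'."""
    # Keep only the digits of the link, which form the book id
    bookid = re.sub(r"[^0-9]", "", book_link)
    return f"http://www.gutenberg.org/cache/epub/{bookid}/pg{bookid}.txt"

url = plain_text_url("/ebooks/2000")
print(url)
```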

In [244]:
# Markers that delimit the text of the book: 'TASA' is the first
# section of this edition, and the footer line marks the end
start_string = r"TASA"
end_string = r"End of Project Gutenberg's"
print(start_string)
print(end_string)

TASA
End of Project Gutenberg's


### Clean book
For this part, we will clean the book by:
* removing unnecessary parts of the book, such as headers and footers;
* removing newline sequences;
* replacing accents;
* removing punctuation marks;
* lowercasing letters.

We start by slicing the book: we keep the text from the beginning of the first section, which in this case is named 'TASA', up to (but not including) Project Gutenberg's footer, which is not part of the book.

#### Slicing

In [245]:
start_index = re.search(start_string, book).span()[0]
end_index = re.search(end_string, book).span()[0]
book = book[start_index: end_index]

print(book[:150])

TASA

Yo, Juan Gallo de Andrada, escribano de Cámara del Rey nuestro señor, de
los que residen en su Consejo, certifico y doy fe que, habiendo vist
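
`re.search` returns `None` when a marker is missing, in which case calling `.span()` raises an `AttributeError`. A minimal sketch of a more defensive version of the slicing step follows, run on a toy string. The function name `slice_between` and the sample text are assumptions for illustration.

```python
import re

def slice_between(text, start_pattern, end_pattern):
    """Return text from the first start match up to the first end match."""
    start = re.search(start_pattern, text)
    end = re.search(end_pattern, text)
    if start is None or end is None:
        raise ValueError("start or end marker not found in text")
    # Keep the start marker, drop everything from the end marker onwards
    return text[start.start():end.start()]

# Made-up miniature of the downloaded file
toy = "Header lines\r\nTASA\r\nYo, Juan Gallo\r\nEnd of Project Gutenberg's footer"
body = slice_between(toy, r"TASA", r"End of Project Gutenberg's")
print(repr(body))
```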


#### Remove newline sequences

As the sample above shows, the book contains many line breaks. Specifically, they are made up of the newline sequence `\r\n`.

In [247]:
# Replacing all instances of newline sequences
book = re.sub(r"(\r\n)+", " ", book)
print(book[:150])

TASA Yo, Juan Gallo de Andrada, escribano de Cámara del Rey nuestro señor, de los que residen en su Consejo, certifico y doy fe que, habiendo visto po
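
The pattern `(\r\n)+` only matches Windows-style line endings, so a copy of the text with bare `\n` line endings would be left untouched. One possible variant, shown here on a made-up string, collapses any run of whitespace instead:

```python
import re

# Made-up sample mixing "\r\n" and bare "\n" line endings
sample = "TASA\r\n\r\nYo, Juan Gallo de Andrada,\nescribano de Camara"

# Original approach: only runs of "\r\n" become a single space
only_crlf = re.sub(r"(\r\n)+", " ", sample)

# Variant: any run of whitespace (spaces, tabs, "\n", "\r\n") becomes one space
any_whitespace = re.sub(r"\s+", " ", sample)

print(repr(only_crlf))
print(repr(any_whitespace))
```

The variant also collapses repeated spaces, which helps later when the text is split into words.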


#### Replacing accents

In [248]:
book = unidecode(book)
print(book[:150])

TASA Yo, Juan Gallo de Andrada, escribano de Camara del Rey nuestro senor, de los que residen en su Consejo, certifico y doy fe que, habiendo visto po
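
`unidecode` is a third-party package. When only accents need to be stripped, a standard-library alternative is to decompose each character with `unicodedata.normalize` and drop the combining marks. This is a different technique from the one used in this notebook, sketched here for comparison; the function name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove accents by dropping the combining marks of decomposed characters."""
    # NFD splits 'á' into 'a' plus a combining acute accent
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents("escribano de Cámara del Rey nuestro señor")
print(plain)
```

Unlike `unidecode`, this approach leaves characters that have no decomposition unchanged instead of transliterating them to ASCII.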


#### Removing punctuation marks

In [249]:
book = re.sub(r"[^\w\s]", "", book)
print(book[:150])

TASA Yo Juan Gallo de Andrada escribano de Camara del Rey nuestro senor de los que residen en su Consejo certifico y doy fe que habiendo visto por los
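
In Python 3, `\w` matches Unicode letters, digits and the underscore, so the pattern `[^\w\s]` removes punctuation such as `,` and `¿` but keeps underscores. A small check on a made-up string:

```python
import re

# Made-up sample containing commas, a colon, question marks and underscores
sample = "certifico y doy fe que, habiendo visto: ¿_todo_?"

# Remove every character that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", sample)
print(no_punct)
```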


#### Lowercasing

In [253]:
book = book.lower()
print(book[:150])

tasa yo juan gallo de andrada escribano de camara del rey nuestro senor de los que residen en su consejo certifico y doy fe que habiendo visto por los


Looking at a bigger sample of the cleaned book:

In [254]:
book[:500]

'tasa yo juan gallo de andrada escribano de camara del rey nuestro senor de los que residen en su consejo certifico y doy fe que habiendo visto por los senores del un libro intitulado el ingenioso hidalgo de la mancha compuesto por miguel de cervantes saavedra tasaron cada pliego del dicho libro a tres maravedis y medio el cual tiene ochenta y tres pliegos que al dicho precio monta el dicho libro docientos y noventa maravedis y medio en que se ha de vender en papel y dieron licencia para que a es'

## Ngram Analysis

In [258]:
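# split() with no argument splits on any run of whitespace,
# so consecutive spaces do not produce empty tokens
book_list = book.split()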
book_list[:10]

['tasa',
 'yo',
 'juan',
 'gallo',
 'de',
 'andrada',
 'escribano',
 'de',
 'camara',
 'del']

In [262]:
book_1g = Counter(book_list)
book_1g.most_common(10)

[('que', 21475),
 ('de', 18297),
 ('y', 18188),
 ('la', 10362),
 ('a', 9823),
 ('el', 9487),
 ('en', 8241),
 ('no', 6335),
 ('se', 5078),
 ('los', 4748)]
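
The same `Counter` works for longer n-grams: pairing each token with its successors yields tuples that can be counted directly. A minimal sketch follows, assuming a helper named `ngrams` (hypothetical) and using a short excerpt of the cleaned book as sample input rather than the full text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    # zip over n shifted views of the token list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

# Excerpt taken from the cleaned sample shown earlier
excerpt = ("tasaron cada pliego del dicho libro a tres maravedis y medio "
           "el cual tiene ochenta y tres pliegos que al dicho precio monta "
           "el dicho libro docientos y noventa maravedis y medio")
tokens = excerpt.split()

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

Applied to `book_list`, the same call would give the most frequent word pairs of the whole book.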