# Retrieving Yesterday's Top 100 Books
## Author: Ahria Dominguez
### Last Updated: 6/30/2024

In this project, we access Project Gutenberg's "Top 100 EBooks yesterday" page at www.gutenberg.org and parse it to print yesterday's top 100 books. The steps are inspecting the site's SSL certificate, requesting the page from the URL, and using BeautifulSoup to parse the HTML.

#### Import libraries.

In [1]:
# Imports the Requests, Regex, BeautifulSoup, SSL, and URLLib.Request libraries.
import requests
import re
from bs4 import BeautifulSoup
import ssl
import urllib.request

#### Check the SSL certificate.

In [2]:
# Assigns the URL to a 'url' variable.
url = 'https://www.gutenberg.org/browse/scores/top'

# Creates a default SSL context, which verifies the server's certificate and hostname
# so that the connection to the URL is secure.
context = ssl.create_default_context()

# Opens the URL with the context variable as a parameter to ensure secure connection.
with urllib.request.urlopen(url, context=context) as ulib:
    sock = ulib.fp.raw._sock # Reaches the underlying socket through private attributes, which may change between Python versions.
    cert_info = sock.getpeercert() # This obtains the SSL certificate information.
    print("Issuer:", cert_info['issuer']) # Prints the issuer's information from the SSL certificate.
    print("-") # Breaks up the text.
    print("Expiry Date:", cert_info['notAfter']) # Prints the expiration date of the SSL certificate.

Issuer: ((('countryName', 'US'),), (('organizationName', 'Network Solutions L.L.C.'),), (('commonName', 'Network Solutions RSA OV SSL CA 3'),))
-
Expiry Date: Apr  8 23:59:59 2025 GMT
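
The raw certificate fields above are nested tuples and a date string, which are awkward to work with. The following is a minimal sketch of one way to tidy them, using values copied from the output above so that it runs without network access. The helper names `flatten_name` and `expiry_as_datetime` are hypothetical and not part of the original notebook.

```python
import ssl
from datetime import datetime, timezone


def flatten_name(name):
    """Turn the nested tuples returned by getpeercert() into a plain dict."""
    return {key: value for rdn in name for key, value in rdn}


def expiry_as_datetime(not_after):
    """Convert a certificate 'notAfter' string into a timezone-aware datetime."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Values copied from the cell output above (assumed unchanged), so no request is made.
issuer = ((('countryName', 'US'),),
          (('organizationName', 'Network Solutions L.L.C.'),),
          (('commonName', 'Network Solutions RSA OV SSL CA 3'),))
not_after = 'Apr  8 23:59:59 2025 GMT'

issuer_info = flatten_name(issuer)
expires = expiry_as_datetime(not_after)

print(issuer_info['organizationName'])
print(expires.isoformat())
```

With a datetime in hand, the expiry can be compared against `datetime.now(timezone.utc)` to check whether the certificate is still valid.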


#### Read the HTML from the URL.

In [3]:
# Makes a GET request to the URL, stores the HTML in 'html_text', and displays the result.
# Certificate verification is left on (the default) to stay consistent with the SSL check above.
response = requests.get(url)
html_text = response.text
html_text



'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n <meta charset="UTF-8"/>\n\n<title>Top 100 | Project Gutenberg</title>\n <link rel="stylesheet" href="/gutenberg/style.css?v=1.1">\n <link rel="stylesheet" href="/gutenberg/collapsible.css?1.1">\n <link rel="stylesheet" href="/gutenberg/new_nav.css?v=1.321231">\n<link rel="stylesheet" href="/gutenberg/pg-desktop-one.css">\n <meta name="viewport" content="width=device-width, initial-scale=1">\n <meta name="keywords" content="books, ebooks, free, kindle, android, iphone, ipad"/>\n <meta name="google-site-verification" content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io"/>\n <meta name="alexaVerifyID" content="4WNaCljsE-A82vP_ih2H_UqXZvM"/>\n <link rel="copyright" href="https://www.gnu.org/copyleft/fdl.html"/>\n <link rel="icon" type="image/png" href="/gutenberg/favicon.ico" sizes="16x16" />\n <meta property="og:title"        content="Project Gutenberg" />\n <meta property="og:type"         content="website" />\n <

#### Write a small function to check the status of the web request.

In [4]:
# Creates a function to check the status of the web request. It will print that the request was good only if
# it gives the status code of 200 and will print that it was bad for any other codes.
def check_status(resp):
    if resp.status_code==200:
        print("Good")
    else:
        print("Bad")

In [5]:
# Uses the function to check the status of the web request.
check_status(response)

Good
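
As a possible variation, the check can return a boolean instead of printing, which makes it usable in conditions. This is only a sketch: `is_ok` is a hypothetical name, and a `SimpleNamespace` object stands in for a real response so the example runs offline. The Requests library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
from types import SimpleNamespace


def is_ok(resp):
    """Return True only when the response status code is 200."""
    return resp.status_code == 200


# Stand-in objects with a 'status_code' attribute, used here instead of real responses.
good_response = SimpleNamespace(status_code=200)
bad_response = SimpleNamespace(status_code=404)

print(is_ok(good_response))
print(is_ok(bad_response))
```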


#### Decode the response and pass it on to BeautifulSoup for HTML parsing.

In [6]:
# Decodes the response to get ready to pass it along to BeautifulSoup. 
contents = response.content.decode(response.encoding)

In [7]:
# Parses the HTML data using BeautifulSoup.
soup = BeautifulSoup(contents, 'html.parser')
# Displays the parsed HTML, which is much easier to read than the raw string.
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Top 100 | Project Gutenberg</title>
<link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>
<link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>
<link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>
<link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">
<meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>
<meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>
<link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">
<link href="/gutenberg/favicon.ico" rel="icon" sizes="16x16" type="image/png">
<meta content="Project Gutenberg" property="og:title"/>
<meta content="website" property="og:type"/>
<meta content="https://www.gutenberg.org/" proper

#### Find all the *href* attributes of the anchor tags and store them in the list of links. Check what the list looks like - print the first 30 elements.

In [8]:
# Begins an empty list to store the links.
links = []

# Finds every anchor ('a') tag and appends its href attribute to the empty list.
for link in soup.find_all('a'):
    links.append(link.get('href'))

In [9]:
# Prints out the first 30 elements in the list.
links[:30]

['/',
 '/about/',
 '/about/',
 '/policy/collection_development.html',
 '/about/contact_information.html',
 '/about/background/',
 '/policy/permission.html',
 '/policy/privacy_policy.html',
 '/policy/terms_of_use.html',
 '/ebooks/',
 '/ebooks/',
 '/ebooks/bookshelf/',
 '/browse/scores/top',
 '/ebooks/offline_catalogs.html',
 '/help/',
 '/help/',
 '/help/copyright.html',
 '/help/errata.html',
 '/help/file_formats.html',
 '/help/faq.html',
 '/policy/',
 '/help/public_domain_ebook_submission.html',
 '/help/submitting_your_own_work.html',
 '/help/mobile.html',
 '/attic/',
 '/donate/',
 '/donate/',
 '#books-last1',
 '#authors-last1',
 '#books-last7']

#### Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 eBooks. 

In [10]:
# Applies a regular expression to every link and stores the digit strings found in each one in 'numeric_digits'.
numeric_digits = [re.findall(r'\d+', link) for link in links]
# Prints out the results.
numeric_digits

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['1'],
 ['1'],
 ['7'],
 ['7'],
 ['30'],
 ['30'],
 ['1513'],
 ['2701'],
 ['49010'],
 ['145'],
 ['2641'],
 ['100'],
 ['37106'],
 ['67979'],
 ['16389'],
 ['6761'],
 ['394'],
 ['4085'],
 ['2160'],
 ['6593'],
 ['1259'],
 ['5197'],
 ['1342'],
 ['11'],
 ['84'],
 ['14358'],
 ['2554'],
 ['5200'],
 ['345'],
 ['2000'],
 ['1661'],
 ['98'],
 ['4300'],
 ['73938'],
 ['174'],
 ['30254'],
 ['28054'],
 ['64317'],
 ['73936'],
 ['2600'],
 ['4363'],
 ['27827'],
 ['1998'],
 ['73937'],
 ['73932'],
 ['2650'],
 ['76'],
 ['1232'],
 ['43'],
 ['2591'],
 ['46633'],
 ['844'],
 ['6130'],
 ['2680'],
 ['1400'],
 ['1184'],
 ['2542'],
 ['73943'],
 ['8800'],
 ['73933'],
 ['1080'],
 ['73931'],
 ['10'],
 ['244'],
 ['45'],
 ['600'],
 ['46'],
 ['16'],
 ['768'],
 ['8492'],
 ['4973'],
 ['15474'],
 ['5740'],
 ['1952'],
 ['16119'],
 ['1260'],
 ['73942'],
 ['73952'],
 ['1727'],
 ['36034'],
 ['996

#### Initialize an empty list to hold the file numbers. Loop over the appropriate range of links and use *regex* to find the numeric digits in each link's *href* string. Use the *findall* method.

In [11]:
# Creates an empty list to place the file numbers into.
file_num=[]

# Loops over the slice of 'links' whose entries contain digits, strips whitespace from each link, finds the
# digits, and appends the first number to 'file_num' as an integer, so the new list has no empty entries.
for link in links[27:-6]:
    link=link.strip()
    num_digits=re.findall('[0-9]+',link)
    file_num.append(int(num_digits[0]))
    
# One could also do something like 'non_empty_indices = [index for index, element in 
# enumerate(numeric_digits) if element]' or 'file_numbers = re.findall(r'\d+', ' '.join(links))' 
# to make it even simpler.

In [12]:
# Prints the new list to show it worked.
print(file_num)

[1, 1, 7, 7, 30, 30, 1513, 2701, 49010, 145, 2641, 100, 37106, 67979, 16389, 6761, 394, 4085, 2160, 6593, 1259, 5197, 1342, 11, 84, 14358, 2554, 5200, 345, 2000, 1661, 98, 4300, 73938, 174, 30254, 28054, 64317, 73936, 2600, 4363, 27827, 1998, 73937, 73932, 2650, 76, 1232, 43, 2591, 46633, 844, 6130, 2680, 1400, 1184, 2542, 73943, 8800, 73933, 1080, 73931, 10, 244, 45, 600, 46, 16, 768, 8492, 4973, 15474, 5740, 1952, 16119, 1260, 73942, 73952, 1727, 36034, 996, 31552, 14859, 31284, 26184, 58585, 73940, 398, 1399, 1497, 74, 73951, 514, 219, 120, 135, 14880, 73944, 67098, 36, 55, 17489, 5827, 12, 33283, 2814, 1, 1, 7, 7, 30, 30, 65, 838, 37, 492, 9, 18, 53, 36, 102, 314, 68, 3930, 69, 90, 35920, 9742, 975, 2858, 6202, 7, 779, 220, 603, 111, 537, 26815, 1325, 35, 136, 30, 705, 60, 505, 85, 61, 93, 1896, 94, 1039, 1961, 79, 125, 42, 34095, 132, 1735, 507, 5490, 59, 898, 25267, 120, 481, 80, 355, 113, 251, 1736, 586, 861, 190, 451, 28, 9248, 32269, 326, 2183, 23, 4589, 7862, 420, 500, 73, 36
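
The slice `links[27:-6]` depends on the page layout, and the result above still includes numbers from navigation anchors such as `#books-last1`. A possible alternative, sketched below, keeps only links that look like book pages. It assumes that book links have the form `/ebooks/<number>`, which the output above does not show directly. The name `extract_ebook_ids` and the sample list are hypothetical.

```python
import re

# Assumed shape of a book link: '/ebooks/' followed only by digits.
EBOOK_LINK = re.compile(r'^/ebooks/(\d+)$')


def extract_ebook_ids(hrefs):
    """Return the numeric IDs of all links that match the assumed book-link pattern."""
    ids = []
    for href in hrefs:
        if href is None:  # link.get('href') returns None for anchors without an href
            continue
        match = EBOOK_LINK.match(href.strip())
        if match:
            ids.append(int(match.group(1)))
    return ids


# Small made-up sample mixing navigation links and book links.
sample_links = ['/', '/ebooks/', '#books-last1', None,
                '/ebooks/1513', '/ebooks/2701', '/donate/']
ebook_ids = extract_ebook_ids(sample_links)
print(ebook_ids)
```

Because the filter looks at the shape of each link rather than its position, it does not need hard-coded slice boundaries.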

#### Use the .*text* attribute and print only the first 2,000 characters.

In [13]:
# Prints the first 2,000 characters of 'soup.text'.
print(soup.text[:2000])





Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright How-To
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Ways to donate







Frequently Viewed or Downloaded
These listings are based on the number of times each eBook gets downloaded.
      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.

Downloaded Books
2024-06-29332864
last 7 days2439141
last 30 days11605585



Top 100 EBooks yest

#### Search in the extracted text (using a regular expression) from the soup object to find the names of the top 100 eBooks (yesterday's ranking).

In [14]:
# Compiles a regular expression that captures the text following the 'Top 100 EBooks yesterday' heading up to
# the next blank line. 're.DOTALL' lets '.' match newline characters, so the capture can span the whole
# newline-separated list of book titles.
names = re.compile(r'Top 100 EBooks yesterday\n\n(.*?)(?=\n\n)', re.DOTALL)

# Finds the matches of the names in the soup object.
matches = re.findall(names, soup.text)

# Nicely prints the book names.
for match in matches:
    print(match)

Romeo and Juliet by William Shakespeare (2711)
Moby Dick; Or, The Whale by Herman Melville (2459)
Æsop's Fables: A Version for Young Readers by Aesop and J. H.  Stickney (2090)
Middlemarch by George Eliot (1853)
A Room with a View by E. M.  Forster (1829)
The Complete Works of William Shakespeare by William Shakespeare (1766)
Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (1749)
The Blue Castle: a novel by L. M.  Montgomery (1615)
The Enchanted April by Elizabeth Von Arnim (1599)
The Adventures of Ferdinand Count Fathom — Complete by T.  Smollett (1510)
Cranford by Elizabeth Cleghorn Gaskell (1480)
The Adventures of Roderick Random by T.  Smollett (1474)
The Expedition of Humphry Clinker by T.  Smollett (1450)
History of Tom Jones, a Foundling by Henry Fielding (1447)
Twenty years after by Alexandre Dumas and Auguste Maquet (1437)
My Life — Volume 1 by Richard Wagner (1428)
Pride and Prejudice by Jane Austen (1424)
Alice's Adventures in Wonderland by Lewis Carroll (1042)

#### Create a starting index. It should point at the text *Top 100 eBooks yesterday.* Use the *splitlines* method of soup.text. It splits the lines of text of the soup object.

In [15]:
# Splits the lines of text and assigns them to 'split_line_text'.
split_line_text = soup.text.splitlines()

# Enumerates over the split text because there are two instances of 'Top 100 EBooks yesterday' on the web page,
# so I need to find the second one, which is the one right above the book titles.
instances = [i for i, line in enumerate(split_line_text) if line == 'Top 100 EBooks yesterday']

# Assigns the correct instance of 'Top 100 EBooks yesterday' to 'starting_idx'.
starting_idx = instances[1]

#### Loop 1-100 to add the strings of the next 100 lines to a temporary list.

In [16]:
# Creates a new empty list.
text_list = []

# Loops over the next 100 lines of text and adds them to the empty list. The first title sits two lines below
# the starting_idx line (a blank line separates them), so appending begins at starting_idx + 2.
for i in range(100):
    text_list.append(split_line_text[starting_idx + 2 + i])

In [17]:
# Prints the list to show it worked.
print(text_list)

['Romeo and Juliet by William Shakespeare (2711)', 'Moby Dick; Or, The Whale by Herman Melville (2459)', "Æsop's Fables: A Version for Young Readers by Aesop and J. H.  Stickney (2090)", 'Middlemarch by George Eliot (1853)', 'A Room with a View by E. M.  Forster (1829)', 'The Complete Works of William Shakespeare by William Shakespeare (1766)', 'Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (1749)', 'The Blue Castle: a novel by L. M.  Montgomery (1615)', 'The Enchanted April by Elizabeth Von Arnim (1599)', 'The Adventures of Ferdinand Count Fathom — Complete by T.  Smollett (1510)', 'Cranford by Elizabeth Cleghorn Gaskell (1480)', 'The Adventures of Roderick Random by T.  Smollett (1474)', 'The Expedition of Humphry Clinker by T.  Smollett (1450)', 'History of Tom Jones, a Foundling by Henry Fielding (1447)', 'Twenty years after by Alexandre Dumas and Auguste Maquet (1437)', 'My Life — Volume 1 by Richard Wagner (1428)', 'Pride and Prejudice by Jane Austen (1424)', "Ali

In [18]:
# Checks that there are 100 titles in the list.
len(text_list)

100
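
The index search and the loop can also be combined into a single slice. The sketch below illustrates the idea on a small made-up list of lines. The function name `lines_after_marker` and its parameters are hypothetical, and the default offset of 2 assumes one blank line between the heading and the first title, as on this page.

```python
def lines_after_marker(lines, marker, count, occurrence=1, offset=2):
    """Return 'count' lines starting 'offset' lines after the chosen occurrence of 'marker'.

    'occurrence' is zero-based, so the default of 1 selects the second match.
    """
    positions = [i for i, line in enumerate(lines) if line == marker]
    start = positions[occurrence] + offset
    return lines[start:start + count]


# Made-up lines that mimic the page: the marker appears twice, and the titles
# follow the second occurrence after one blank line.
sample_lines = ['Top 100 EBooks yesterday', 'navigation text',
                'Top 100 EBooks yesterday', '',
                'Book A (3)', 'Book B (2)', 'Book C (1)', '', 'other text']
sample_titles = lines_after_marker(sample_lines, 'Top 100 EBooks yesterday', 3)
print(sample_titles)
```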

#### Use a regular expression to extract only text from the name strings and append it to an empty list. Use *match* and *span* to find the indices and use them.

In [19]:
# Creates a new empty list to store just the titles.
new_titles = []

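# Iterates over 'text_list', matches the leading run of ASCII letters and spaces in each entry, and uses the
# match's span (start and end indices) to slice that text into 'new_titles'. The match stops at the first
# other character (punctuation, digit, or accented letter), so some titles come out shortened or empty.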
for i in range(100):
    start_idx, end_idx = re.match('^[a-zA-Z ]*', text_list[i]).span()
    new_titles.append(text_list[i][start_idx:end_idx])

In [20]:
# Loops through all the titles and prints them one-by-one, showing yesterday's top 100 eBooks.
for title in new_titles:
    print(title)

Romeo and Juliet by William Shakespeare 
Moby Dick

Middlemarch by George Eliot 
A Room with a View by E
The Complete Works of William Shakespeare by William Shakespeare 
Little Women
The Blue Castle
The Enchanted April by Elizabeth Von Arnim 
The Adventures of Ferdinand Count Fathom 
Cranford by Elizabeth Cleghorn Gaskell 
The Adventures of Roderick Random by T
The Expedition of Humphry Clinker by T
History of Tom Jones
Twenty years after by Alexandre Dumas and Auguste Maquet 
My Life 
Pride and Prejudice by Jane Austen 
Alice
Frankenstein
A Little Book of Filipino Riddles 
Crime and Punishment by Fyodor Dostoyevsky 
Metamorphosis by Franz Kafka 
Dracula by Bram Stoker 
Don Quijote by Miguel de Cervantes Saavedra 
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
A Tale of Two Cities by Charles Dickens 
Ulysses by James Joyce 
The man who was pale by Jack Sharkey 
The Picture of Dorian Gray by Oscar Wilde 
The Romance of Lust
The Brothers Karamazov by Fyodor Dostoyevsky 
The Gr
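
Several titles above are cut short because the pattern `^[a-zA-Z ]*` stops at the first character that is not an ASCII letter or a space. If the goal is only to drop the download count in parentheses, one possible alternative is to remove that trailing suffix and keep everything else. This is a sketch, not the notebook's original method. The name `strip_download_count` is hypothetical, and the sample entries are copied from the list printed earlier.

```python
import re

# Matches a trailing download count such as ' (2711)' at the end of an entry.
COUNT_SUFFIX = re.compile(r'\s*\(\d+\)\s*$')


def strip_download_count(entry):
    """Remove the trailing '(number)' from a title line, leaving the rest untouched."""
    return COUNT_SUFFIX.sub('', entry)


# Entries copied from the 'text_list' output above.
sample_entries = [
    'Romeo and Juliet by William Shakespeare (2711)',
    'Moby Dick; Or, The Whale by Herman Melville (2459)',
    "Æsop's Fables: A Version for Young Readers by Aesop and J. H.  Stickney (2090)",
]
clean_titles = [strip_download_count(entry) for entry in sample_entries]
for title in clean_titles:
    print(title)
```

Punctuation and non-ASCII characters in the title are preserved, because only the parenthesized number at the end of the line is removed.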