Pick a website and describe your objective

### Project Outline

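- We're going to scrape Project Gutenberg (https://www.gutenberg.org/).
- We'll collect the book titles listed under "Top 100 EBooks yesterday" on its top-scores page (https://www.gutenberg.org/browse/scores/top).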

In [1]:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re

Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download
- Download and save web pages locally using the requests library
- Create a function to automate downloading for different topics/search queries
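
The download-and-save steps in this list could be wrapped in one small helper. This is a minimal sketch rather than the notebook's own code: the name `download_page`, the `pages` output directory, and the rule for deriving a filename from the URL path are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

import requests


def download_page(url, dest_dir="pages", timeout=10):
    """Download `url` and save its HTML under `dest_dir` (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an HTTPError on 4xx/5xx responses

    # Use the last path segment as the filename, e.g. "/browse/scores/top" -> "top.html".
    # A bare "/" path falls back to "index.html".
    name = urlparse(url).path.strip("/").split("/")[-1] or "index"

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    path = dest / f"{name}.html"
    path.write_text(response.text, encoding="utf-8")
    return path
```

Under these assumptions, `download_page('https://www.gutenberg.org/browse/scores/top')` would write the page to `pages/top.html` and return that path, so the same function can be reused for other URLs.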

In [4]:
top100url = 'https://www.gutenberg.org/browse/scores/top'

In [5]:
response = requests.get(top100url)

Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup
- Use the right properties and methods to extract the required information
- Create functions to extract from the page into lists and dictionaries
- (Optional) Use a REST API to acquire additional information
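
As an illustration of extracting page content into lists and dictionaries, here is a hedged sketch. It assumes the ranked books sit in an `<ol>` that follows an element with the id `books-last1` (the id appears as an anchor link on the page; the surrounding structure is an assumption). `SAMPLE_HTML` is made-up markup with invented counts, and `extract_entries` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the structure of the live page is an assumption.
SAMPLE_HTML = """
<h2 id="books-last1">Top 100 EBooks yesterday</h2>
<ol>
  <li><a href="/ebooks/84">Frankenstein by Mary Shelley (100)</a></li>
  <li><a href="/ebooks/1342">Pride and Prejudice by Jane Austen (90)</a></li>
</ol>
"""


def extract_entries(html, heading_id="books-last1"):
    """Return one dict per link in the first <ol> after the given heading."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(id=heading_id)
    if heading is None:
        return []
    listing = heading.find_next("ol")
    if listing is None:
        return []
    return [
        {"text": a.get_text(strip=True), "href": a.get("href")}
        for a in listing.find_all("a")
    ]


entries = extract_entries(SAMPLE_HTML)
```

Anchoring on an element id instead of a line offset in `soup.text` should be less sensitive to whitespace changes in the page, provided the assumed structure holds.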

In [6]:
# response.text decodes the body and still works when response.encoding is None
contents = response.text

In [7]:
soup = BeautifulSoup(contents, 'html.parser')

In [9]:
def status_check(r):
    """Print whether the request succeeded; return 1 for HTTP 200, else -1."""
    if r.status_code == 200:
        print('Success!')
        return 1
    else:
        print('Failed!')
        return -1

In [10]:
status_check(response)

Success!


1

In [11]:
# Collect the href of every <a> tag (None for anchors without an href)
lst_links = [link.get('href') for link in soup.find_all('a')]

In [12]:
lst_links[:30]

['/',
 '/about/',
 '/about/',
 '/policy/collection_development.html',
 '/about/contact_information.html',
 '/about/background/',
 '/policy/permission.html',
 '/policy/privacy_policy.html',
 '/policy/terms_of_use.html',
 '/ebooks/',
 '/ebooks/',
 '/ebooks/bookshelf/',
 '/browse/scores/top',
 '/ebooks/offline_catalogs.html',
 '/help/',
 '/help/',
 '/help/copyright.html',
 '/help/errata.html',
 '/help/file_formats.html',
 '/help/faq.html',
 '/policy/',
 '/help/public_domain_ebook_submission.html',
 '/help/submitting_your_own_work.html',
 '/help/mobile.html',
 '/attic/',
 '/donate/',
 '/donate/',
 '#books-last1',
 '#authors-last1',
 '#books-last7']
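
Most of these links are site navigation. A possible way to keep only links to individual books is sketched below. It assumes book pages have paths of the form `/ebooks/<number>`; the helper name `book_urls` and the sample hrefs in the usage line are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.gutenberg.org/"
BOOK_HREF = re.compile(r"^/ebooks/\d+$")  # assumed pattern for book pages


def book_urls(hrefs, base_url=BASE_URL):
    """Return absolute URLs for hrefs that look like book pages, without duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        # link.get('href') yields None for <a> tags that have no href attribute.
        if href and BOOK_HREF.match(href) and href not in seen:
            seen.add(href)
            urls.append(urljoin(base_url, href))
    return urls


# Hypothetical input mixing navigation links with book links.
sample_urls = book_urls(['/', '/ebooks/', None, '#books-last1', '/ebooks/84', '/ebooks/84'])
```

In the notebook, the same function could be applied to `lst_links` directly.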

In [18]:
lst_titles_temp=[]

In [19]:
start_idx = soup.text.splitlines().index('Top 100 EBooks yesterday')

In [20]:
lines = soup.text.splitlines()  # split once instead of on every iteration
for i in range(100):
    lst_titles_temp.append(lines[start_idx+2+i])

In [21]:
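lst_titles = []
for i in range(100):
    # Keep only the leading run of ASCII letters and spaces. This cuts each
    # entry at its first punctuation mark, digit, or accented letter, so
    # "The Great Gatsby by F. Scott Fitzgerald" becomes "The Great Gatsby by F".
    lst_titles.append(re.match(r'^[a-zA-Z ]*', lst_titles_temp[i]).group())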

In [22]:
for title in lst_titles:
    print(title)

Top 
Top 
Top 
Top 


Top 

Frankenstein
Pride and Prejudice by Jane Austen 
The Scarlet Letter by Nathaniel Hawthorne 
A Christmas Carol in Prose
Alice
Dracula by Bram Stoker 
The Picture of Dorian Gray by Oscar Wilde 
A Modest Proposal by Jonathan Swift 
A Doll
Moby Dick
The Importance of Being Earnest
The Strange Case of Dr
The Great Gatsby by F
A Tale of Two Cities by Charles Dickens 
Jane Eyre
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
Heart of Darkness by Joseph Conrad 
Narrative of the Life of Frederick Douglass
The Wolfe of Badenoch by Thomas Dick
The Yellow Wallpaper by Charlotte Perkins Gilman 
The Prince by Niccol
Metamorphosis by Franz Kafka 
Walden
Old Granny Fox by Thornton W
Adventures of Huckleberry Finn by Mark Twain 
Crime and Punishment by Fyodor Dostoyevsky 
The Awakening
Anthem by Ayn Rand 
The Odyssey by Homer 
Great Expectations by Charles Dickens 
Grimms
Next Stop
The Life of the Caterpillar by Jean
The Adventures of Tom Sawyer
The Hound of the Bas
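
Several entries in the output above are cut at the first apostrophe, period, or accented letter. One possible alternative, sketched here under the assumption that each raw line has the form `<title> by <author> (<download count>)`, is to strip only the trailing count and then split on the last " by ". The helper name `split_title_author` and the download count in the usage line are made up for illustration.

```python
import re

# Assumed line format: "<title> by <author> (<download count>)".
TRAILING_COUNT = re.compile(r"\s*\(\d+\)\s*$")


def split_title_author(line):
    """Split one listing line into a title and an author (None if absent)."""
    text = TRAILING_COUNT.sub("", line).strip()
    # Split on the last " by " so that a "by" inside the title stays intact.
    title, sep, author = text.rpartition(" by ")
    if not sep:
        return {"title": text, "author": None}
    return {"title": title, "author": author}


example = split_title_author("The Great Gatsby by F. Scott Fitzgerald (1234)")
# example == {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald"}
```

Lines that do not follow the assumed format, such as a title that contains " by " but has no author, would still need separate handling.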

Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing and saving CSVs
- Execute the function with different inputs to create a dataset of CSV files
- Verify the information in the CSV files by reading them back using Pandas
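
The save-and-verify steps in this list could look like the sketch below. It uses only the standard-library `csv` module so that it runs without extra dependencies; in the notebook, `pandas.read_csv(path)` could do the read-back instead. The function names and the `title`/`author` column names are assumptions.

```python
import csv

FIELDNAMES = ["title", "author"]  # assumed column names


def save_books_csv(rows, path):
    """Write a list of {'title', 'author'} dicts to `path` as CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def load_books_csv(path):
    """Read the CSV back as a list of dicts, for verification."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Comparing the result of `load_books_csv` with the rows that were written is a quick check that commas and non-ASCII characters in titles survive the round trip.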

Document and share your work

- Add proper headings and documentation in your Jupyter notebook
- Publish your Jupyter notebook to your GitHub portfolio
- (Optional) Write a blog post about your project and share it online