# Web Scraping of Haodoo Backup Using BeautifulSoup Take 2
### David Lowe
### August 5, 2022

SUMMARY: This project aims to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: Haodoo is a website that hosts classic Chinese literature for its readers’ enjoyment. The name Haodoo can be translated from Chinese as “Good Reads.” The site collects hard-to-find Chinese texts and books and makes them available for online reading. Its collection includes over 3,500 text and audiobook titles.

In the previous Take1 iteration, we scraped the website and obtained all the book titles and their assigned categories. In this Take2 iteration, we use the information collected in Take1 to derive the download links for each book in every available file format.

Starting URL: https://haodoo.org/

## Task 1. Prepare Environment

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import shutil

In [2]:
# Begin the timer for the script processing
START_TIME_SCRIPT = datetime.now()

# Specifying the URL of the web page to be scraped
website_url = "https://haodoo.org/"

## Task 2. Collect Links to Individual Books

In [3]:
df_title_all = pd.read_csv('py_webscraping_beautifulsoup_haodoo_descriptions.csv')
print(df_title_all.head())

  category_text title_text author_name                        title_link  \
0          世紀百強       【吶喊】          魯迅  https://haodoo.org/?M=book&P=435   
1          世紀百強       【邊城】         沈從文  https://haodoo.org/?M=book&P=394   
2          世紀百強     《駱駝祥子》          老舍  https://haodoo.org/?M=book&P=401   
3          世紀百強       【傳奇】         張愛玲  https://haodoo.org/?M=book&P=430   
4          世紀百強       【圍城】         錢鍾書  https://haodoo.org/?M=book&P=399   

                                    book_description  
0  ['\n', <font color="CC0000">魯迅</font>, '《吶喊》',...  
1  ['\n', <font color="CC0000">沈從文</font>, '《邊城》'...  
2  ['\n', <font color="CC0000">老舍</font>, '《駱駝祥子》...  
3  ['\n', <font color="CC0000">張愛玲</font>, '《傳奇》'...  
4  ['\n', <font color="CC0000">錢鍾書</font>, '《圍城》'...  
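
The `book_description` column holds the stringified list of tags saved in Take1, so BeautifulSoup can re-parse each cell as HTML and locate the `<input>` buttons. The following is a minimal sketch of that step, not the notebook's own code. The description string is made up, and its `<input>` attributes are assumptions inferred from the parsing code in the next cell rather than copied from the site. The sketch also uses the standard-library `html.parser` backend instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical cell value: a stringified list of tags, similar in shape to the
# book_description column shown above (the <input> markup is an assumption)
sample_description = r"""['\n', <font color="CC0000">魯迅</font>, '《吶喊》', <input type="button" value="updb" onclick="DownloadUpdb('A435')"/>, <input type="button" value="epub" onclick="DownloadEpub('A435')"/>]"""

# Text between the tags is kept as plain text; the tags themselves are parsed normally
soup = BeautifulSoup(sample_description, 'html.parser')

# Collect the onclick handler of every <input> element
onclick_values = [tag.get('onclick') for tag in soup.find_all('input')]
print(onclick_values)
```

Each `onclick` value carries both the file format and the book identifier, which is why the next cell only needs string operations to build the download links.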


In [4]:
df_book_links = pd.DataFrame(columns=['category_text', 'title_text', 'author_name', 'title_link',
                                 'book_format', 'book_file', 'book_link', 'storage_link'])

for i in range(len(df_title_all)) :
    record_category = df_title_all.iloc[i]['category_text']
    record_title = df_title_all.iloc[i]['title_text']
    record_author = df_title_all.iloc[i]['author_name']
    record_link = df_title_all.iloc[i]['title_link']
    print('Processing', record_category, record_title, record_author, record_link)
    book_desc = BeautifulSoup(df_title_all.iloc[i]['book_description'], 'lxml')
    link_objects = book_desc.find_all("input")
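    for link in link_objects :
        link_string = link.get('onclick')
        # Skip inputs without an onclick handler (link.get returns None for them)
        if link_string and link_string.startswith('Download') :
            # Locate the pair of single quotes that enclose the book identifier
            paren_start = link_string.find('\'')
            paren_end = link_string.find('\'', paren_start+1)
            # str.find returns -1 (not None) when the quote is missing
            if (paren_start != -1) and (paren_end != -1) :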
                book_format = link_string[8:paren_start-1].upper()
                if book_format == 'UPDB' :
                    file_extension = '.updb'
                elif book_format == 'PDB' :
                    file_extension = '.pdb'
                elif book_format == 'PDF' :
                    file_extension = '.pdf'
                elif book_format == 'PRC' :
                    file_extension = '.prc'
                elif book_format == 'MOBI' :
                    file_extension = '.mobi'
                elif book_format == 'EPUB' or book_format == 'VEPUB' :
                    file_extension = '.epub'
                else :
                    print(link_string, 'is not parsed correctly')
                    file_extension = '.UNKNOWN'
                book_file = link_string[paren_start+1:paren_end] + file_extension
                book_link = 'https://www.haodoo.org/?M=d&P=' + book_file
                storage_link = 'https://haodoo.org/EBOOK/' + book_format + '/' + book_file
                df_book_links.loc[len(df_book_links)] = [record_category, record_title, record_author, record_link,
                                               book_format, book_file, book_link, storage_link]

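                # Optional download step: list the formats to fetch (e.g. ['EPUB']); an empty list skips downloading
                extensions_to_download = []
                if book_format in extensions_to_download :
                    time.sleep(1)  # Pause between requests to avoid overloading the server
                    with requests.get(book_link, stream=True) as r:
                        r.raise_for_status()  # Stop on HTTP errors rather than saving an error page
                        with open(book_file, 'wb') as f:
                            shutil.copyfileobj(r.raw, f)
                    print('Captured the ebook file:', book_file)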

Processing 世紀百強 【吶喊】 魯迅 https://haodoo.org/?M=book&P=435
Processing 世紀百強 【邊城】 沈從文 https://haodoo.org/?M=book&P=394
Processing 世紀百強 《駱駝祥子》 老舍 https://haodoo.org/?M=book&P=401
Processing 世紀百強 【傳奇】 張愛玲 https://haodoo.org/?M=book&P=430
Processing 世紀百強 【圍城】 錢鍾書 https://haodoo.org/?M=book&P=399
Processing 世紀百強 《子夜》 茅盾 https://haodoo.org/?M=book&P=437
Processing 世紀百強 【台北人】 白先勇 https://haodoo.org/?M=book&P=244
Processing 世紀百強 【家】 巴金 https://haodoo.org/?M=book&P=443
Processing 世紀百強 【呼蘭河傳】 蕭紅 https://haodoo.org/?M=book&P=434
Processing 世紀百強 【老殘遊記】 劉鶚 https://haodoo.org/?M=book&P=85
Processing 世紀百強 《寒夜》 巴金 https://haodoo.org/?M=book&P=484
Processing 世紀百強 《彷徨》 魯迅 https://haodoo.org/?M=book&P=436
Processing 世紀百強 《官場現形記》 李伯元 https://haodoo.org/?M=book&P=446
Processing 世紀百強 《財主底兒女們》 路翎 https://haodoo.org/?M=book&P=476
Processing 世紀百強 【將軍族】 陳映真 https://haodoo.org/?M=book&P=448
Processing 世紀百強 【沉淪】 郁達夫 https://haodoo.org/?M=book&P=451
Processing 世紀百強 【死水微瀾】 李劼人 https://haodoo.org/?M=book&P=483
Processi
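
The handler parsing in the cell above can also be sketched more compactly with a regular expression and a lookup table. This is an illustrative alternative, not the notebook's own method. It assumes every download handler has the shape `Download<Format>('<id>')`, which is inferred from the slicing logic above. The names `ONCLICK_PATTERN`, `FORMAT_EXTENSIONS`, and `parse_onclick` are hypothetical.

```python
import re

# Assumed handler shape, inferred from the slicing logic above: Download<Format>('<id>')
ONCLICK_PATTERN = re.compile(r"^Download(\w+)\('([^']+)'\)")

# Format-to-extension mapping, mirroring the if/elif chain in cell In [4]
FORMAT_EXTENSIONS = {
    'UPDB': '.updb',
    'PDB': '.pdb',
    'PDF': '.pdf',
    'PRC': '.prc',
    'MOBI': '.mobi',
    'EPUB': '.epub',
    'VEPUB': '.epub',
}

def parse_onclick(link_string):
    """Return (book_format, book_file), or None if the handler does not match."""
    if not link_string:
        return None
    match = ONCLICK_PATTERN.match(link_string)
    if match is None:
        return None
    book_format = match.group(1).upper()
    # Unrecognized formats get the same '.UNKNOWN' marker as the original loop
    extension = FORMAT_EXTENSIONS.get(book_format, '.UNKNOWN')
    return book_format, match.group(2) + extension

# Hypothetical onclick values for illustration
samples = ["DownloadUpdb('A435')", "DownloadVEpub('AV435')", None, "SetTitle('x')"]

# Collect one dictionary per parsed link instead of growing a DataFrame row by row
rows = []
for sample in samples:
    parsed = parse_onclick(sample)
    if parsed is not None:
        rows.append({'book_format': parsed[0], 'book_file': parsed[1]})
print(rows)
```

Collecting the rows in a plain list and calling `pd.DataFrame(rows)` once at the end avoids the repeated `df_book_links.loc[len(df_book_links)] = ...` assignments, which get slower as the frame grows to thousands of rows.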

In [5]:
print(df_book_links.head())

  category_text title_text author_name                        title_link  \
0          世紀百強       【吶喊】          魯迅  https://haodoo.org/?M=book&P=435   
1          世紀百強       【吶喊】          魯迅  https://haodoo.org/?M=book&P=435   
2          世紀百強       【吶喊】          魯迅  https://haodoo.org/?M=book&P=435   
3          世紀百強       【吶喊】          魯迅  https://haodoo.org/?M=book&P=435   
4          世紀百強       【吶喊】          魯迅  https://haodoo.org/?M=book&P=435   

  book_format   book_file                                 book_link  \
0        UPDB   A435.updb   https://www.haodoo.org/?M=d&P=A435.updb   
1         PRC    A435.prc    https://www.haodoo.org/?M=d&P=A435.prc   
2        MOBI  AV435.mobi  https://www.haodoo.org/?M=d&P=AV435.mobi   
3        EPUB   A435.epub   https://www.haodoo.org/?M=d&P=A435.epub   
4       VEPUB  AV435.epub  https://www.haodoo.org/?M=d&P=AV435.epub   

                                storage_link  
0    https://haodoo.org/EBOOK/UPDB/A435.updb  
1      https://haodoo.

In [6]:
print(df_book_links.tail())

      category_text title_text author_name                       title_link  \
17527          小說園地      《碧玉樓》           　  https://haodoo.org/?M=book&P=43   
17528          小說園地      《碧玉樓》           　  https://haodoo.org/?M=book&P=43   
17529          小說園地      《碧玉樓》           　  https://haodoo.org/?M=book&P=43   
17530          小說園地      《碧玉樓》           　  https://haodoo.org/?M=book&P=43   
17531          小說園地      《碧玉樓》           　  https://haodoo.org/?M=book&P=43   

      book_format  book_file                                book_link  \
17527        UPDB   G43.updb   https://www.haodoo.org/?M=d&P=G43.updb   
17528         PDB    G43.pdb    https://www.haodoo.org/?M=d&P=G43.pdb   
17529         PRC    G43.prc    https://www.haodoo.org/?M=d&P=G43.prc   
17530        EPUB   G43.epub   https://www.haodoo.org/?M=d&P=G43.epub   
17531       VEPUB  GV43.epub  https://www.haodoo.org/?M=d&P=GV43.epub   

                                   storage_link  
17527    https://haodoo.org/EBOOK/UP

In [7]:
# Write the collected links directly to a UTF-8 encoded CSV file
df_book_links.to_csv('py_webscraping_beautifulsoup_haodoo_links.csv', index=False, encoding='utf-8')
print('Total number of book links found from web scraping:', len(df_book_links))

Total number of book links found from web scraping: 17532


In [8]:
print('Total time for the script:', datetime.now() - START_TIME_SCRIPT)

Total time for the script: 0:00:28.395384
