<img src="https://raw.githubusercontent.com/codecaviar/digital_asset_management/master/assets/bingyune-and-company-logo-6400x3600.png" align="left" width="200" height="auto">

<br/> <br/> <br/> <br/>

# Beautiful Soup: How to Extract Data When Web Scraping

**BingYune Chen**, Principal Data Scientist<br>
2020-12-09 | 6 minute read

Data is provided by [Read Light Novel](https://www.readlightnovel.org/solo-leveling) 
    | Source code is on [GitHub](https://github.com/codecaviar)

---

The goal of this project is to provide a practical introduction to web scraping. Beautiful Soup will cater to most of your web scraping needs, from navigating a parsed page to running advanced searches through the results.

## Table of Contents

1. [**Project Overview**](#overview)
2. [**Inspect Your Data Source**](#inspect_data)
3. [**Scrape and Parse HTML Content**](#scrape_parse)
4. [**Extract Text From HTML Content**](#extract_text)
5. [**Conclusion**](#conclusion)

<a class="anchor" id="overview"></a>
# 1. Project Overview

Web scraping is a technique used to extract and save large amounts of data from websites. Data, in general terms, is any set of information that is collected, shared, or stored for some purpose, such as text or numbers written on a piece of paper, bytes or bits inside the memory of a smartphone, and even facts or beliefs stored inside a person's mind. Unfortunately, the data displayed by most websites, whether text, media, or any other format, can only be viewed through a web browser. When you try to collect the data you want manually, you might spend a lot of time clicking, scrolling, and searching. These concerns with time and repetition are especially pressing if you need large amounts of data from websites that are regularly updated with new content. Web scraping automates the collection process, performing the same manual task in a fraction of the time. Because every website is different and many websites change constantly, each website needs its own treatment to extract the data relevant to you.

<a class="anchor" id="problem_statement"></a>
## 1.1 Problem Statement

> The goal of this project is to provide a practical introduction to web scraping in Python.

The project makes use of webpages from [Read Light Novel](https://www.readlightnovel.org/terms-of-service). The website gives users free online access to light novels, web novels, Korean novels, and Chinese novels. The project will scrape the pages of a web novel and make them available to turn into an audiobook. *Solo Leveling*, also known as *I Alone Level Up*, is a South Korean web novel written by Chugong. The web novel has been serialized on KakaoPage, [Kakao](https://en.wikipedia.org/wiki/Kakao)'s digital comic and fiction platform, since July 25, 2016, and has been published by D&C Media under their Papyrus label since November 4, 2016. The novel has been licensed in English by Webnovel under the title *Only I Level Up*.

<a class="anchor" id="inspect_data"></a>
# 2. Inspect Your Data Source

Explore the website and interact with it just like any normal user. What is the layout of results when using the site's native search interface? Are there more detailed descriptions in some sections of the website? How does the content change when you click on sections of the website? A lot of information can also be encoded in the URL (of your browser's address bar). Try to pick apart the URL of the site you're currently on:

```
https://www.readlightnovel.org/solo-leveling/chapter-132
```

You can deconstruct the above URL into two main parts:

1. **The base URL** identifies the website itself. In the example above, the base URL is https://www.readlightnovel.org/.
2. **The path** identifies a specific page on the website. In the example above, the path is solo-leveling/chapter-132 (the title of the web novel, plus the specific chapter).

Many websites also pass additional values through query parameters. Normally, query parameters consist of a start (denoted by a question mark), additional information (keys and values joined together by an equals sign), and a separator (an ampersand between multiple parameters). The simplified example URL has no query parameters; the chapter is selected by the path alone, so notice what happens when you change the number after "/chapter-".
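To make the URL anatomy concrete, here is a minimal sketch that uses the standard library's `urllib.parse` to split the example URL. Strictly speaking, `urlsplit` reports `solo-leveling/chapter-132` as the path, and the query string of this URL is empty. The second URL (`example.com/search?...`) is a made-up illustration of what real query parameters look like, not a page used in this project.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.readlightnovel.org/solo-leveling/chapter-132"
parts = urlsplit(url)

# Rebuild the base URL from the scheme and the host name
base_url = f"{parts.scheme}://{parts.netloc}/"
# Split the path into its segments: novel title, then chapter
path_segments = parts.path.strip("/").split("/")

print(base_url)       # https://www.readlightnovel.org/
print(path_segments)  # ['solo-leveling', 'chapter-132']
print(repr(parts.query))  # '' (this URL has no query string)

# Hypothetical URL with a query string: ? starts it, = joins key and value, & separates pairs
example = "https://example.com/search?q=solo+leveling&page=2"
params = parse_qs(urlsplit(example).query)
print(params)         # {'q': ['solo leveling'], 'page': ['2']}
```

Splitting the URL this way makes it easy to see which piece to vary when you want a different chapter.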

Lastly, developer tools can help you understand the structure of a website. All modern browsers come with developer tools installed. In this tutorial, you'll see how to work with the developer tools in Mozilla Firefox. The process will be very similar to other modern browsers. In Firefox, you can open up the developer tools through the menu *Tools → Web Developer → Inspector*. You can also access them by right-clicking on the page and selecting the Inspect option, or by using a keyboard shortcut. Developer tools allow you to interactively explore the site's [document object model](https://en.wikipedia.org/wiki/Document_Object_Model). The more you get to know the structure of the page you're working with, the easier it will be to scrape it. 

<a class="anchor" id="scrape_parse"></a>
# 3. Scrape and Parse HTML Content

First, you'll want to get the site's HTML code into your Python script so that you can interact with it. For this task, you'll use Python's [requests](https://realpython.com/python-requests/) library. This approach works for static HTML content; it does not work for pages hidden behind a login or for pages that generate their content dynamically with JavaScript. After you successfully scrape some static HTML content from the Internet, you can use Beautiful Soup to parse the lengthy response and make it more accessible and readable. [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you would interact with a web page using developer tools.

In [1]:
# Import libraries needed to retrieve HTML
import requests # make HTTP requests to web pages

# Perform an HTTP request and retrieve the HTML as a Python object
url = "https://www.readlightnovel.org/solo-leveling/chapter-137"
# Make a request to the web page
page = requests.get(url)
# Use the text attribute (decoded string) or the content attribute (raw bytes) to display the HTML
print(page.text)

# In case you ever get lost in a large pile of HTML, remember that you can always go back 
# to your browser and use developer tools to further explore the HTML structure.

<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="utf-8">
<title>Read Solo Leveling Chapter 137</title>
<meta name="description" content="Solo Leveling, Solo Leveling chapter 137, read Solo Leveling, read Solo Leveling chapter 137, readlightnovel.org">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="shortcut icon" href="/favicon.ico">
<meta property="og:image" content="https://www.readlightnovel.org/uploads/posters/1545815013.jpg" />
<link rel="stylesheet" href="https://www.readlightnovel.org/assets/styles/minify.css?v=5.2">
<link rel="stylesheet" href="https://www.readlightnovel.org/assets/styles/night.css?v=2.7">
<link rel="alternate" hreflang="en" href="https://www.readlightnovel.org/solo-leveling/chapter-137" />
<link rel="canonical" href="https://www.readlightnovel.org/solo-leveling/chapter-137" />
<script src="https://www.readlightnovel.org/assets/scripts/minify.js?v=10"></script>
<script src="https://www.readlightnovel.org/assets/scripts/j

In [2]:
# Import libraries needed to parse html code
from bs4 import BeautifulSoup

# Create a beautiful soup object
soup = BeautifulSoup(page.text, 'html.parser') # use appropriate parser

type(soup) # bs4.BeautifulSoup

bs4.BeautifulSoup

In [3]:
# Create function to download and parse a single chapter
def webscrap_chapter(chapter):
    # Build the chapter URL from the chapter number
    url = "https://www.readlightnovel.org/solo-leveling/chapter-{}".format(chapter)
    # Make a request to a web page
    page = requests.get(url)
    # Parse the response text using html parser
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup # return Beautiful Soup object
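
The `webscrap_chapter` function assumes that every request succeeds. As a hedged sketch of one way to guard against a failed request, the helper below retries until the server answers with HTTP status 200. The names `chapter_url` and `fetch_html`, the retry count, and the delay are all assumptions made for illustration; they are not part of the original project. The `fetch` argument is any callable that returns an object with `status_code` and `text` attributes, which is what `requests.get` returns.

```python
import time

def chapter_url(chapter, base="https://www.readlightnovel.org/solo-leveling"):
    """Build the URL for one chapter, following the pattern used above."""
    return "{}/chapter-{}".format(base, chapter)

def fetch_html(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times; return the HTML text or None."""
    for attempt in range(retries):
        response = fetch(url)
        if response.status_code == 200:
            return response.text  # success: hand back the page source
        if attempt < retries - 1:
            time.sleep(delay)  # pause before the next attempt
    return None  # every attempt returned a non-200 status
```

With the `requests` library imported, a call would look like `html = fetch_html(requests.get, chapter_url(137))`. The sketch only checks status codes; network exceptions raised by `fetch` are not caught here.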

<a class="anchor" id="extract_text"></a>
# 4. Extract Text From HTML Content

In an HTML web page, every element is defined by an HTML tag and can also have an `id` attribute assigned. As the name suggests, an `id` is meant to make the element uniquely identifiable on the page. At the time of this writing, the landmark you're looking for is the `<hr>` tag, which is used to mark thematic changes in the content. Because `<hr>` is an empty element, the chapter text appears after it rather than inside it, and every line of that content is in a `<p>` element, which represents a paragraph in the HTML. Paragraphs are block-level elements, and a parser will automatically close an open `<p>` if another block-level element starts before the closing `</p>` tag.

**Note:** Periodically switching back to your browser and interactively exploring the page using developer tools helps you learn how to find the exact elements you're looking for when parsing your data.
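
Before working on the live page, it can help to try the search methods on a small, known document. The sketch below parses a hand-written HTML snippet (invented for this illustration, not copied from the site) and looks elements up by `id`, by tag name, and by class. It uses the name `sample` so that it does not overwrite the `soup` object created earlier.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only
html = """
<div id="chapter">
  <hr>
  <p class="meta">Published at some date</p>
  <p>"Why?"</p>
  <p>"Because I like moving around solo."</p>
  <hr>
</div>
"""
sample = BeautifulSoup(html, "html.parser")

container = sample.find(id="chapter")        # look up one element by its id
first_p = sample.find("p")                   # first <p> in the document
all_p = sample.find_all("p")                 # list of every <p>
meta = sample.find_all("p", class_="meta")   # only <p> elements with class "meta"

# get_text() returns the text inside an element without the surrounding tags
lines = [p.get_text(strip=True) for p in all_p if "meta" not in p.get("class", [])]
print(lines)  # ['"Why?"', '"Because I like moving around solo."']
```

Note that `get_text()` is Beautiful Soup's own way to drop tags; the sections below reach the same goal with regular expressions and string methods.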

In [4]:
# Find a specific element by its tag name
element = soup.find('p') # find one element
print(element.prettify())

<p style="margin-bottom:15px;">
 <i class="fa fa-clock-o">
 </i>
 Published
at 25th of March 2019 09:04:03 AM
</p>



In [5]:
# Find all elements by their HTML tag name
soup_elems = soup.find_all('p') # find all occurrences that match a given tag
for elem in soup_elems:
    print(elem, end='\n'*2)

# In contrast, find() returns only the first element that matches the given
# filter. find_all() searches the whole document and returns every match
# in a single list.

<p style="margin-bottom:15px;"> <i class="fa fa-clock-o"></i> Published
at 25th of March 2019 09:04:03 AM</p>

<p> </p>

<p>"HUH?"</p>

<p>The answer sounded the same but its 'nuance' was rather a lot different than the one that came before . If she was kidding around just now, then she was dead serious this time . </p>

<p>"What's wrong? You think it's weird?"</p>

<p>" . . . . . Oppa, why are you naming your Guild like that?"</p>

<p>"Because I like moving around solo . "</p>

<p>"It does sound like you, but still, isn't it a bit strange to name your Guild like that?"</p>

<p>"Why?"</p>

<p>"Isn't your ability summoning out those black-armoured soldiers?"</p>

<p>"Yeah . "</p>

<p>"So, if you get technical about it, you aren't fighting alone, right?"</p>

<p>Now that he heard her opinion, that made some sense . Jin-Woo nodded his head . </p>

<p>'I may think of it as just another one of my skills but it won't look that way to other people, is that it?'</p>

<p>Indeed, she had a point

<a class="anchor" id="regular_expressions"></a>
## 4.1 Use Regular Expressions

Regular expressions, or regexes for short, are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s [re](https://docs.python.org/3/library/re.html) module. A regex is a special text string that describes a search pattern, with support for text matching, repetition, branching, and pattern composition. It is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or documents. Patterns can match ASCII (Latin) letters as well as Unicode text in other scripts, along with digits, punctuation, and special characters like `!@#$^&*()`.

Regular expressions use special characters called metacharacters to denote different patterns. For example, the asterisk character (`*`) stands for zero or more of whatever comes just before the asterisk. Special sequences work similarly: `\d` matches any single digit.
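
As a small illustration of the two patterns just described, the sketch below applies `*` and `\d` to short strings. The sample sentence is made up for this example.

```python
import re

text = "Chapter 137 was published on the 25th of March 2019"

# \d matches one digit at a time
single_digits = re.findall(r"\d", "a1b22")
print(single_digits)   # ['1', '2', '2']

# \d+ matches runs of one or more digits
numbers = re.findall(r"\d+", text)
print(numbers)         # ['137', '25', '2019']

# o* matches zero or more "o" characters after the "l"
repeats = re.findall(r"lo*", "l lo loo")
print(repeats)         # ['l', 'lo', 'loo']

# Parentheses capture part of a match, here the chapter number
match = re.search(r"Chapter (\d+)", text)
chapter_number = match.group(1)
print(chapter_number)  # 137
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.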

In [6]:
# Use regular expressions
import re # explore regular expressions

pattern = re.compile(r'<.*?>') # match any HTML tag, from an opening < to the nearest >
# .*? non-greedily matches all text after the opening <, stopping at the first match of >
# Use re.sub() to remove all tags and return only the text
base_text = re.sub(pattern, '', str(soup_elems)) # sub is short for substitute
print(base_text)

[  Published
She wants the d... Ahahahahaha...
, 
wait if the guy had a metaloid beast in his body doesn't that mean MC can after killing him extract his blood, purify it and get another powerful bloodline and metallic abilities!?
, 
waiting for the moment he beat them all up until they cry who the fuck do they think they are lol
, 
A trustworthy royalty is unfortunately a bad one
, 
To me it is
, © COPYRIGHT  READLIGHTNOVEL.ORG. ALL RIGHTS RESERVED.]


<a class="anchor" id="string_functions"></a>
## 4.2 Use String Functions

Similar to regular expressions, [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) are also used frequently for web scraping and provide another way to extract information from a web page's HTML. For example, `.find()` returns the index of the first occurrence of a substring, which you can use to slice out or remove specific characters or blocks of text from the base text. Alternatively, you can use `.replace()` to return a copy of the string with all occurrences of a substring replaced by a new substring.
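
The following sketch shows both methods on a short, invented string that imitates the shape of the scraped text. One detail worth keeping in mind is that `.find()` returns `-1` when the substring is missing, and using `-1` as a slice boundary silently drops the last character instead of raising an error.

```python
# Invented sample that imitates the scraped text
raw = "Published\nat 25th of March 2019, He nodded his head . She smiled ."

# Locate the publication header: from "Published" up to and including the first comma
start = raw.find("Published")
end = raw.find(",") + 1
header = raw[start:end]

# Remove the header, then fix the stray space before each period
body = raw.replace(header, "").replace(" .", ".").strip()
print(body)     # He nodded his head. She smiled.

# .find() signals "not found" with -1 rather than an exception
missing = raw.find("Translator")
print(missing)  # -1
```

Checking for `-1` before slicing is a cheap way to avoid trimming text by accident.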

In [7]:
# Use .find to remove all user comments at the end of the chapter
base_text[1:base_text.find('&lt; Chapter')] 



In [8]:
# Use .find to locate specific blocks of text that need to be removed
base_text[base_text.find('Published\nat'):base_text.find(',') + 1]

'Published\nat 25th of March 2019 09:04:03 AM,'

In [9]:
# Use .replace to correct punctuation errors
clean_text = base_text.replace(' .', '.').replace(' , ', ' ')
clean_text # note there are several more punctuation errors that need to be fixed



In [10]:
# Apply all cleaning steps to find elements by HTML tag name (iterate)
def webscrap_elements(soup_object):
    soup_elems = soup_object.find_all('p')
    pattern = re.compile(r'<.*?>')
    base_text = re.sub(pattern, '', str(soup_elems))
    # Remove blocks of text related to publication and translation notes
    clean_text1 = base_text[:base_text.find("Translator’s Notes:")]
    clean_text2 = clean_text1[1:clean_text1.find('&lt; Chapter')] 
    remove_text = [clean_text2[clean_text2.find('Published\nat'):clean_text2.find(',') + 1]]
    # Use a raw string so regex escapes such as \( and \d reach the re module intact
    remove_text += list(set(re.findall(r"\(\d\)|\[\d\]|\(TL:.*?\)|\xa0", clean_text2)))
    clean_text3 = clean_text2
    for i in remove_text:
        clean_text3 = clean_text3.replace(i, '').replace(' ,', '').replace(',,', ',')
    # Remove extra spaces and commas
    clean_text4 = (clean_text3.replace(' .', '.').replace(' ”', '”').replace(
        " ’", "’").replace('. ],', '.]').replace('?,', '?').replace(
        '!,', '!').replace('-,', '-').replace('”,', '”').replace("’,", "’"))
    # Remove comment section: keep only the text before the first newline
    newline_at = clean_text4.find('\n')
    final = clean_text4 if newline_at == -1 else clean_text4[:newline_at] + ' '
    return final

<a class="anchor" id="conclusion"></a>
# 5. Conclusion

Writing automated web scraping programs is fun, and the use cases of web scraping for business and personal needs are endless. For example, a business might use web scraping to gather contact details of businesses from websites like linkedin.com. Similarly, an individual might use web scraping to gather product details (price, images, rating, review, etc.) from a website like amazon.com to test and train machine learning models. Just remember, not everyone wants you pulling data from their web servers. If you’re scraping a page respectfully for educational purposes, then you’re unlikely to have any problems. Still, it’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out [Legal Perspectives on Scraping Data From The Modern Web](https://www.lawinsociety.org/legal-perspectives-on-scraping-data-from-the-modern-web).

In [13]:
import timeit # check start and end time for each process

# Iterate through all chapters and apply web scraping functions
for i in range(1, 270 + 1):
    starttime = timeit.default_timer()
    chap = webscrap_chapter(i)
    clean_chap = webscrap_elements(chap)
    # Open the text file in append mode; the with block closes it automatically
    with open("./assets/book_solo_leveling.txt", "a+") as f:
        f.write(clean_chap) # write to text file
        f.write("\n\n\n") # add space between chapters
    # Confirm steps in process
    print("Completed writing Chapter {} in {} seconds".format(i, round(timeit.default_timer() - starttime, 4)))

Completed writing Chapter 1 in 1.2479 seconds
Completed writing Chapter 2 in 1.7744 seconds
Completed writing Chapter 3 in 1.965 seconds
Completed writing Chapter 4 in 0.9212 seconds
Completed writing Chapter 5 in 2.3778 seconds
Completed writing Chapter 6 in 0.9171 seconds
Completed writing Chapter 7 in 1.1046 seconds
Completed writing Chapter 8 in 0.9844 seconds
Completed writing Chapter 9 in 1.9896 seconds
Completed writing Chapter 10 in 0.9807 seconds
Completed writing Chapter 11 in 0.9663 seconds
Completed writing Chapter 12 in 1.6988 seconds
Completed writing Chapter 13 in 1.3559 seconds
Completed writing Chapter 14 in 1.2891 seconds
Completed writing Chapter 15 in 1.392 seconds
Completed writing Chapter 16 in 1.2367 seconds
Completed writing Chapter 17 in 2.0297 seconds
Completed writing Chapter 18 in 2.0811 seconds
Completed writing Chapter 19 in 1.6073 seconds
Completed writing Chapter 20 in 2.125 seconds
Completed writing Chapter 21 in 1.3877 seconds
Completed writing Chapter

Completed writing Chapter 175 in 0.9044 seconds
Completed writing Chapter 176 in 0.6097 seconds
Completed writing Chapter 177 in 1.6861 seconds
Completed writing Chapter 178 in 2.1376 seconds
Completed writing Chapter 179 in 1.1917 seconds
Completed writing Chapter 180 in 1.5296 seconds
Completed writing Chapter 181 in 0.9574 seconds
Completed writing Chapter 182 in 0.6721 seconds
Completed writing Chapter 183 in 1.3486 seconds
Completed writing Chapter 184 in 1.2521 seconds
Completed writing Chapter 185 in 2.2308 seconds
Completed writing Chapter 186 in 0.9666 seconds
Completed writing Chapter 187 in 0.9402 seconds
Completed writing Chapter 188 in 1.1008 seconds
Completed writing Chapter 189 in 0.97 seconds
Completed writing Chapter 190 in 1.9665 seconds
Completed writing Chapter 191 in 2.0562 seconds
Completed writing Chapter 192 in 1.2138 seconds
Completed writing Chapter 193 in 0.9115 seconds
Completed writing Chapter 194 in 1.0682 seconds
Completed writing Chapter 195 in 0.9867 se

<a class="anchor" id="acknowledgments"></a>
### Acknowledgments 

The project referenced the following resources:
* https://realpython.com/beautiful-soup-web-scraper-python/
* https://realpython.com/python-web-scraping-practical-introduction/

---
<em>The Code Caviar</em> is a digital magazine about data science and analytics that dives deep into key topics, so you can experience the thrill of solving at scale. 