# WEB SCRAPING

### Scraping Rules

### Libraries Used

Beautiful Soup is a Python library for parsing HTML and XML documents. It is used mainly in web scraping to navigate the parsed document tree and to search, filter, and modify it with Pythonic syntax. It gives easy access to tags, attributes, and text content, which makes it well suited to extracting specific data from web pages. Beautiful Soup is lightweight and easy to learn, but it does not make HTTP requests itself; the page content has to be fetched by other means.

Scrapy is a comprehensive web scraping framework written in Python. It handles the entire scraping process: making HTTP requests, parsing HTML/XML responses, following links, and storing the scraped data. It is highly customizable: you define spiders (scraping bots) that specify the scraping rules, data extraction methods, and data storage mechanisms. Scrapy suits more complex scraping tasks and can handle large-scale projects efficiently.

Selenium is a widely used open-source framework for automating web browsers. It serves many purposes, and in web scraping it is commonly used for dynamic websites or JavaScript-rendered content. Selenium lets you control a browser programmatically, automate user actions (e.g., clicking buttons, filling forms), and extract data from the rendered pages. It supports several browsers through drivers such as ChromeDriver and GeckoDriver.

In [1]:
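# Note: the second <h1> in the heading line is an opening tag, not </h1>,
# so html.parser nests the rest of the body inside two <h1> elements.
html = '<!Doctype html>\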
<html>\
<head>\
<title> Testing Web Page </title>\
</head>\
<body>\
<h1> Web Scarping <h1>\
<p id ="first_para">\
Let\'s start learning\
<b>\
Web Scraping\
</b>\
</p>\
<p class = "abc" id = "second_para">\
You can read more about beautiful soap from <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a>\
</p>\
<p class ="abc">\
<a href = "https://codingninjas.in/"> Coding Ninjas </a>\
</p>\
</body>\
</html>'

bs4 is the Python package name for Beautiful Soup 4, a popular library for web scraping and for parsing HTML/XML documents. It provides a simple, Pythonic way to extract data from web pages by navigating and searching the parsed document tree.

In [2]:
from bs4 import BeautifulSoup

In [3]:
data = BeautifulSoup(html, 'html.parser')
data

<!DOCTYPE html>
<html><head><title> Testing Web Page </title></head><body><h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1></body></html>

In [4]:
type(data)

bs4.BeautifulSoup

In [5]:
print(data.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing Web Page
  </title>
 </head>
 <body>
  <h1>
   Web Scarping
   <h1>
    <p id="first_para">
     Let's start learning
     <b>
      Web Scraping
     </b>
    </p>
    <p class="abc" id="second_para">
     You can read more about beautiful soap from
     <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">
      here
     </a>
    </p>
    <p class="abc">
     <a href="https://codingninjas.in/">
      Coding Ninjas
     </a>
    </p>
   </h1>
  </h1>
 </body>
</html>


In [6]:
data.title

<title> Testing Web Page </title>

In [7]:
data.head

<head><title> Testing Web Page </title></head>

In [8]:
data.p

<p id="first_para">Let's start learning<b>Web Scraping</b></p>

In [9]:
data.h1

<h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1>

In [10]:
print(data.title)
print(data.title.name)
print(data.title.string)

<title> Testing Web Page </title>
title
 Testing Web Page 


In [11]:
print(data.title.attrs)

{}


In [12]:
data.p.attrs

{'id': 'first_para'}

In [13]:
data.p['id']

'first_para'

In [14]:
data.p.get('id')

'first_para'
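
The two access styles above behave differently when an attribute is absent. The following is a minimal sketch on a hypothetical one-tag snippet (not the notebook's `html` string), assuming the standard `bs4` behavior in which a tag's attributes act like a dictionary:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a paragraph with an id but no class attribute
soup = BeautifulSoup('<p id="first_para">Hello</p>', 'html.parser')
tag = soup.p

# .get() returns None (or a supplied default) when the attribute is absent
missing_class = tag.get('class')
default_class = tag.get('class', [])

# Square-bracket access raises KeyError for a missing attribute
try:
    tag['class']
    raised = False
except KeyError:
    raised = True

print(missing_class, default_class, raised)
```

Using `.get()` is usually the safer choice when scraping pages whose tags do not all carry the same attributes.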

In [15]:
data.get_text()

" Testing Web Page  Web Scarping Let's start learningWeb ScrapingYou can read more about beautiful soap from  here  Coding Ninjas "
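
The output above runs words together ("learningWeb") because `get_text()` joins the text nodes with no separator. As a small sketch on a hypothetical snippet, the `separator` and `strip` parameters of `get_text()` can tidy this up:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with padding around each text node
snippet = "<p> Let's start learning <b> Web Scraping </b></p>"
soup = BeautifulSoup(snippet, 'html.parser')

# Default: text nodes are concatenated exactly as they appear
raw = soup.get_text()

# strip=True trims each text node; separator is placed between them
clean = soup.get_text(separator=' ', strip=True)

print(repr(raw))
print(repr(clean))
```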

In [16]:
data.find('p')

<p id="first_para">Let's start learning<b>Web Scraping</b></p>

In [17]:
data.find('h1')

<h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1>

### Navigate Tree

#### 1. Searching Parse Tree

In [18]:
li = data.find_all('p')
for i in li:
    print(i)

<p id="first_para">Let's start learning<b>Web Scraping</b></p>
<p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p>
<p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p>


In [19]:
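# The second positional argument is matched against the class attribute,
# so this searches for <p class="a">, not for <p> and <a> tags.
data.find_all('p', 'a')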

[]
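
The empty list above is the result of `find_all` treating a second positional string as a class filter. If the intent is to match either tag name, passing a list of names is one way to do it. A minimal sketch on a hypothetical snippet (the class values are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one paragraph with class "abc", one with class "a"
snippet = '<p class="abc">Read <a href="#">here</a></p><p class="a">Short</p>'
soup = BeautifulSoup(snippet, 'html.parser')

# Tag name plus class filter: only <p class="a"> matches
by_class = soup.find_all('p', 'a')

# A list of names matches any tag whose name is in the list
either_tag = soup.find_all(['p', 'a'])

print(by_class)
print([t.name for t in either_tag])
```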

In [20]:
data.find_all(True)

[<html><head><title> Testing Web Page </title></head><body><h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1></body></html>,
 <head><title> Testing Web Page </title></head>,
 <title> Testing Web Page </title>,
 <body><h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1></body>,
 <h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://ww

In [21]:
data.find_all(id = 'first_para')

[<p id="first_para">Let's start learning<b>Web Scraping</b></p>]

In [22]:
data.find_all(class_ = 'abc')

[<p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p>,
 <p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p>]

#### 2. Going Down

In [23]:
for i in li:
    print(i.string)

None
None
 Coding Ninjas 
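
The `None` values above appear because `.string` is only defined when a tag has exactly one child; if that child is itself a tag with a single string, that string is returned. A short sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> has two children, the second has one
snippet = '<p>Start <b>bold</b></p><p><a href="#">link</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')
first, second = soup.find_all('p')

# Two children (a text node and <b>): .string is ambiguous, so it is None
print(first.string)

# One child (<a>) that has one string: .string descends into it
print(second.string)

# get_text() always works, whatever the number of children
print(first.get_text())
```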


In [24]:
for i in li:
    print(list(i.strings))

["Let's start learning", 'Web Scraping']
['You can read more about beautiful soap from ', ' here ']
[' Coding Ninjas ']


In [25]:
for i in li:
    print(list(i.stripped_strings))  # stripped_strings trims leading and trailing whitespace and skips whitespace-only strings

["Let's start learning", 'Web Scraping']
['You can read more about beautiful soap from', 'here']
['Coding Ninjas']


In [26]:
li = data.html.contents
print(len(li))
print(li)

2
[<head><title> Testing Web Page </title></head>, <body><h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1></body>]


In [27]:
li2 = data.html.children
for i in li2:
    print(i)

<head><title> Testing Web Page </title></head>
<body><h1> Web Scarping <h1><p id="first_para">Let's start learning<b>Web Scraping</b></p><p class="abc" id="second_para">You can read more about beautiful soap from <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> here </a></p><p class="abc"><a href="https://codingninjas.in/"> Coding Ninjas </a></p></h1></h1></body>


`.contents` returns a list of all immediate children of a tag. It includes both tags and non-tag elements (such as text) that are direct children of the tag.

`.children` returns an iterator over the same immediate children, both tags and non-tag elements (such as text). Unlike the list from `.contents`, the iterator cannot be indexed or measured with `len()`, and it can be consumed only once.
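
To make the difference concrete, here is a minimal sketch on a hypothetical snippet in which a paragraph has a text node, a tag, and another text node as direct children:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> has three direct children
soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')
p = soup.p

contents = p.contents   # a list: supports len() and indexing
children = p.children   # an iterator: must be looped over

n = len(contents)
kinds = [type(c).__name__ for c in children]

# The iterator is now exhausted; the list is still intact
leftover = list(children)

print(n, kinds, leftover)
```

Both attributes stop at direct children: the text inside `<b>` is a grandchild of `<p>` and does not appear in either.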