<a href="https://colab.research.google.com/github/akhilendra2k25/eda-dv/blob/test-branch/eda_dv_scraping_bs4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploratory Data Analysis & Data Visualization - Web Scraping**

![Colab](https://github.com/user-attachments/assets/5c871f71-cb3f-4858-bd09-3a6a24c8c10a)

## This repo explores web scraping with the Beautiful Soup library.

Other Python libraries, such as Scrapy, can be used for the same purpose as BS4. Selenium is used in professional setups to automate the scraping process. Requests is another library; it sends 'GET' requests to the web server of the website you are trying to scrape data from.

## <span style="color:red">Quickstart :-</span>

The code below imports the required 'requests' and 'bs4' libraries, then sends a 'GET' request to the website's server to retrieve the HTML of the webpage. The data type of the resulting soup is **'bs4.BeautifulSoup'**.

In [40]:
import requests
from bs4 import BeautifulSoup

url = ("https://beautiful-soup-4.readthedocs.io/en/latest/")

raw_html_page = requests.get(url)

soup = BeautifulSoup(raw_html_page.content, "lxml")
print(f"The data type for soup is : {type(soup)}")

The data type for soup is : <class 'bs4.BeautifulSoup'>
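The quickstart cell assumes the request succeeds. A minimal sketch of a more defensive version follows; the helper names `parse_html` and `fetch_soup`, the 10-second timeout, and the choice of the built-in `html.parser` (instead of `lxml`) are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html, parser="html.parser"):
    """Build a BeautifulSoup tree from an HTML string or bytes (hypothetical helper)."""
    return BeautifulSoup(html, parser)


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download a page and parse it (hypothetical helper).

    raise_for_status() raises requests.HTTPError on a 4xx/5xx response,
    so an error page is never parsed silently.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_html(response.content, parser)


# Offline demonstration on an inline document, so no network access is needed.
demo = parse_html("<html><head><title>Demo page</title></head><body></body></html>")
print(type(demo))
print(demo.title.string)
```

With network access, `fetch_soup(url)` could stand in for the `requests.get` and `BeautifulSoup(...)` lines of the cell above.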


In [41]:
# print(soup.prettify())  # Uncomment this line to inspect the prettified HTML output.

The examples below show the syntax for:
- Extracting the **title** element.
- Reading the **.name** attribute of the targeted Tag.
- Accessing the name of the parent Tag.
- Using the .text property to extract the text content of a Tag.
- Using the .string attribute to extract the text content of a Tag.
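The last two items differ in a way that is easy to miss. Here is a small sketch on an inline snippet; the HTML string and variable names are illustrative and not taken from the notebook, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html = '<p><a href="https://example.com/">Beautiful Soup</a> is a Python library.</p>'
p = BeautifulSoup(html, "html.parser").p

# <a> has exactly one child, a string, so .string returns it.
only_string = p.a.string

# <p> has two children (an <a> Tag and a string), so .string is None.
mixed_string = p.string

# .text and get_text() concatenate every string inside the Tag.
full_text = p.get_text()

# strip=True strips each fragment and joins them with no separator,
# so the two fragments run together.
stripped = p.get_text(strip=True)

# Passing a separator keeps the fragments apart.
spaced = p.get_text(" ", strip=True)

print(repr(only_string), repr(mixed_string))
print(repr(full_text))
print(repr(stripped))
print(repr(spaced))
```

When a Tag may contain nested Tags, `get_text()` with an explicit separator is usually the safer choice than `.string`.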

In [57]:
bold_s = "\033[1m"  # ANSI escape code: start bold text
bold_e = "\033[0m"  # ANSI escape code: reset formatting

print(f"{bold_s}soup.title returns the whole element, i.e. the Tag together with its content:{bold_e}", soup.title)
print(f"{bold_s}The data type returned by soup.title is:{bold_e}", type(soup.title))
print()

print(f"{bold_s}The .name attribute of a Tag returns the name of the Tag:{bold_e}", soup.title.name)
print(f"{bold_s}Its data type is:{bold_e}", type(soup.title.name))
print()

print(f"{bold_s}soup.title.parent.name returns the name of the parent Tag of the <title> Tag:{bold_e}", soup.title.parent.name)
print(f"{bold_s}Its data type is:{bold_e}", type(soup.title.parent.name))
print()

print(f"{bold_s}soup.title.text returns the text content of the <title> Tag:{bold_e}", soup.title.text)
print(f"{bold_s}Its data type is:{bold_e}", type(soup.title.text))
print()

print(f"{bold_s}soup.title.string returns the text content of the <title> Tag,\nbut only if the Tag has a single child string (or a single nested Tag that does); otherwise it returns None:{bold_e}\n", soup.title.string)
print(f"{bold_s}Its data type is:{bold_e}", type(soup.title.string))

soup.title returns the whole element, i.e. the Tag together with its content: <title>Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation</title>
The data type returned by soup.title is: <class 'bs4.element.Tag'>

The .name attribute of a Tag returns the name of the Tag: title
Its data type is: <class 'str'>

soup.title.parent.name returns the name of the parent Tag of the <title> Tag: head
Its data type is: <class 'str'>

soup.title.text returns the text content of the <title> Tag: Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation
Its data type is: <class 'str'>

soup.title.string returns the text content of the <title> Tag,
but only if the Tag has a single child string (or a single nested Tag that does); otherwise it returns None:
 Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation
Its data type is: <class 'bs4.element.NavigableString'>


In [66]:
print(soup.p)
print(type(soup.p))
print()

print(soup.p.text)
print(type(soup.p.text))
print()

print(f"{bold_s}Notice the output None because of nested <a> anchor Tag:{bold_e}\n" ,soup.p.string)
print(type(soup.p.string))
print()

print(soup.p.get_text())
print(type(soup.p.get_text()))
print()

print(f"{bold_s}Notice that strip=True strips each text fragment and joins them with no separator, so 'Beautiful Soup' runs straight into 'is a':{bold_e}\n", soup.p.get_text(strip=True))
print(type(soup.p.get_text(strip=True)))
print()


<p><a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.</p>
<class 'bs4.element.Tag'>

Beautiful Soup is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.
<class 'str'>

Notice the output None because of nested <a> anchor Tag:
 None
<class 'NoneType'>

Beautiful Soup is a
Python library for pulling data out of HTML and XML files. It works
with your favorite parser to provide idiomatic ways of navigating,
searching, and modifying the parse tree. It commonly saves programmers
hours or days of work.
<class 'str'>

Notice 

In [80]:
print(soup.a)
print(type(soup.a))
print()

print(soup.a.text)
print(type(soup.a.text))
print()

print(soup.find_all("a"))
print(f"{bold_s}This returns {bold_e}",type(soup.find_all("a")))
print()

print(soup.find_all("a")[0])
print(type(soup.find_all("a")[0]))
print()

<a class="icon icon-home" href="#"> Beautiful Soup
          

          
          </a>
<class 'bs4.element.Tag'>

 Beautiful Soup
          

          
          
<class 'str'>

[<a class="icon icon-home" href="#"> Beautiful Soup
          

          
          </a>, <a class="reference internal" href="#">Beautiful Soup Documentation</a>, <a class="reference internal" href="#getting-help">Getting help</a>, <a class="reference internal" href="#quick-start">Quick Start</a>, <a class="reference internal" href="#installing-beautiful-soup">Installing Beautiful Soup</a>, <a class="reference internal" href="#problems-after-installation">Problems after installation</a>, <a class="reference internal" href="#installing-a-parser">Installing a parser</a>, <a class="reference internal" href="#making-the-soup">Making the soup</a>, <a class="reference internal" href="#kinds-of-objects">Kinds of objects</a>, <a class="reference internal" href="#tag"><code class="docutils literal notranslate"><span

In [81]:
for link in soup.find_all("a"):
    print(link.get("href"))

#
#
#getting-help
#quick-start
#installing-beautiful-soup
#problems-after-installation
#installing-a-parser
#making-the-soup
#kinds-of-objects
#tag
#name
#attributes
#multi-valued-attributes
#navigablestring
#beautifulsoup
#comments-and-other-special-strings
#navigating-the-tree
#going-down
#navigating-using-tag-names
#contents-and-children
#descendants
#string
#strings-and-stripped-strings
#going-up
#parent
#parents
#going-sideways
#next-sibling-and-previous-sibling
#next-siblings-and-previous-siblings
#going-back-and-forth
#next-element-and-previous-element
#next-elements-and-previous-elements
#searching-the-tree
#kinds-of-filters
#a-string
#a-regular-expression
#a-list
#true
#a-function
#find-all
#the-name-argument
#the-keyword-arguments
#searching-by-css-class
#the-string-argument
#the-limit-argument
#the-recursive-argument
#calling-a-tag-is-like-calling-find-all
#find
#find-parents-and-find-parent
#find-next-siblings-and-find-next-sibling
#find-previous-siblings-and-find-previous-
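Most of the `href` values printed above are fragment links such as `#quick-start`, which are only meaningful relative to the page they came from. A hedged sketch of turning them into absolute URLs with the standard library's `urljoin` follows; the inline HTML and the names `base_url` and `demo_soup` are illustrative assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL of the page the links were scraped from (same URL as the quickstart cell).
base_url = "https://beautiful-soup-4.readthedocs.io/en/latest/"

html = """
<ul>
  <li><a href="#quick-start">Quick Start</a></li>
  <li><a href="http://lxml.de/">lxml parser</a></li>
  <li><a name="no-href">anchor without href</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute,
# so a["href"] cannot raise KeyError.
absolute_links = [
    urljoin(base_url, a["href"]) for a in demo_soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```

`urljoin` leaves URLs that are already absolute untouched, so the same call handles both kinds of link.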

In [86]:
for text in soup.find_all("a"):
    print(text.get_text())

 Beautiful Soup
          

          
          
Beautiful Soup Documentation
Getting help
Quick Start
Installing Beautiful Soup
Problems after installation
Installing a parser
Making the soup
Kinds of objects
Tag
Name
Attributes
Multi-valued attributes
NavigableString
BeautifulSoup
Comments and other special strings
Navigating the tree
Going down
Navigating using tag names
.contents and .children
.descendants
.string
.strings and stripped_strings
Going up
.parent
.parents
Going sideways
.next_sibling and .previous_sibling
.next_siblings and .previous_siblings
Going back and forth
.next_element and .previous_element
.next_elements and .previous_elements
Searching the tree
Kinds of filters
A string
A regular expression
A list
True
A function
find_all()
The name argument
The keyword arguments
Searching by CSS class
The string argument
The limit argument
The recursive argument
Calling a tag is like calling find_all()
find()
find_parents() and find_parent()
find_next_siblings() and find_n
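The first link in the output above carries several blank lines, because `get_text()` returns the whitespace exactly as it appears in the HTML. One possible way to normalise it is sketched below; the helper name `clean_text` and the inline HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<a class="icon icon-home" href="#"> Beautiful Soup


          </a>
<a href="#differences-between-parsers">Differences
between parsers</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def clean_text(tag):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(tag.get_text().split())


labels = [clean_text(a) for a in demo_soup.find_all("a")]
print(labels)
```

Note that `get_text(strip=True)` only trims the ends of each text fragment, so a line break in the middle of a link text would survive it.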

In [114]:
print(soup.select("p a"))
print(type(soup.select("p a")))
print()

print(soup.select("p a")[0])
print(type(soup.select("p a")[0]))
print()

[<a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>, <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a>, <a class="reference internal" href="#porting-code-to-bs4">Porting code to BS4</a>, <a class="reference external" href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">send mail to the discussion group</a>, <a class="reference internal" href="#diagnose"><span class="std std-ref">what the diagnose() function says</span></a>, <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a>, <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/download/4.x/">download the Beautiful Soup 4 source tarball</a>, <a class="reference external" href="http://lxml.de/">lxml parser</a>, <a class="reference external" href="http://code.google.com/p/html5lib/">html5lib pa

In [116]:
print(soup.select("p a")[0].get("href"))
print(type(soup.select("p a")[0].get("href")))

http://www.crummy.com/software/BeautifulSoup/
<class 'str'>


In [117]:
print(soup.select("p a")[0].get_text())
print(type(soup.select("p a")[0].get_text()))

Beautiful Soup
<class 'str'>


In [121]:
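# Note the space: "a [href]" is a descendant selector (elements with an href *inside* an <a>),
# which matches nothing on this page. Use "a[href]" (no space) to select anchors that have an href.
print(soup.select("a [href]"))
print(type(soup.select("a [href]")))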
print()

[]
<class 'bs4.element.ResultSet'>
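The empty list above comes down to one space in the selector. A small sketch contrasting the attribute selector `a[href]` with the descendant selector `a [href]` follows; the inline HTML and variable names are illustrative and not from the notebook.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="http://lxml.de/">lxml parser</a> <a name="top">no href</a></p>
<a href="#tag"><code>Tag</code></a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Attribute selector: <a> elements that have an href attribute.
with_href = [a["href"] for a in demo_soup.select("a[href]")]

# Descendant selector: elements with an href nested inside an <a>; none exist here.
nested = demo_soup.select("a [href]")

# select_one returns the first match (or None) instead of a list.
first_in_p = demo_soup.select_one("p a[href]")

# Attribute prefix match: anchors whose href starts with "http".
external = demo_soup.select('a[href^="http"]')

print(with_href)
print(nested)
print(first_in_p.get_text())
print(len(external))
```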



In [113]:
for link in soup.select("p a"):
    print(link.get("href"))
print()

http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
#porting-code-to-bs4
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
#diagnose
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
http://www.crummy.com/software/BeautifulSoup/download/4.x/
http://lxml.de/
http://code.google.com/p/html5lib/
#differences-between-parsers
#id17
#navigating-the-tree
#searching-the-tree
#navigating-the-tree
#searching-the-tree
#replace-with
#navigating-the-tree
#searching-the-tree
#tag
#navigating-the-tree
#searching-the-tree
#modifying-the-tree
#tag
#searching-the-tree
#id11
#attrs
#recursive
#id12
#limit
#kwargs
#kinds-of-filters
#kinds-of-filters
#a-string
#a-regular-expression
#a-list
#a-function
#the-value-true
#a-string
#a-regular-expression
#a-list
#a-function
#the-value-true
#multivalue
#a-string
#a-regular-expression
#a-list
#a-function
#the-value-true
#id11
#attrs
#recursive
#id12
#kwargs
#navigating-us
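The list above repeats many links and mixes in-page fragments with external URLs. One way to de-duplicate and separate them is sketched below, under the assumption that a leading `#` is enough to identify an in-page link; the inline HTML and variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="http://lxml.de/">lxml</a> <a href="#tag">Tag</a>
<a href="#searching-the-tree">Searching</a> <a href="#tag">Tag</a>
<a href="http://lxml.de/">lxml again</a></p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# "p a[href]" skips anchors without an href, so no None values appear.
hrefs = [a["href"] for a in demo_soup.select("p a[href]")]

# dict.fromkeys removes duplicates while keeping first-seen order.
unique_hrefs = list(dict.fromkeys(hrefs))

internal = [h for h in unique_hrefs if h.startswith("#")]
external = [h for h in unique_hrefs if not h.startswith("#")]

print(internal)
print(external)
```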

In [112]:
for text in soup.select("p a"):
    print(text.get_text())
print()

Beautiful Soup
Beautiful Soup 3
Porting code to BS4
send mail to the discussion group
what the diagnose() function says
Beautiful Soup 3
download the Beautiful Soup 4 source tarball
lxml parser
html5lib parser
Differences
between parsers
Parsing XML
Navigating the tree
Searching the tree
Navigating
the tree
Searching the tree
replace_with()
Navigating the tree
Searching the tree
Tag
Navigating the tree
Searching the tree
Modifying the tree
Tag
Searching the tree
name
attrs
recursive
string
limit
**kwargs
Kinds of filters
Kinds of filters
a
string
a regular expression
a list
a function
the value
True
a string
a regular
expression
a list
a function
the value True
Remember
a string
a
regular expression
a list
a function
the value True
name
attrs
recursive
string
**kwargs
Navigating using tag
names
name
attrs
string
limit
**kwargs
name
attrs
string
**kwargs
.parent
.parents
name
attrs
string
limit
**kwargs
name
attrs
string
**kwargs
.next_siblings
name
attrs
string
limit
**kwargs
name
attr