# **The essential fundamentals of web scraping are:**


*   To understand the basics of HTML and CSS.
*   HTML gives a web page its structure; CSS controls its presentation.
*   To explore the web page structure using the browser's developer tools.
*   To make HTTP requests and get HTML responses.
*   To extract specific, structured information using Beautiful Soup.






# **BeautifulSoup**


*   Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. 
*   Say you’ve found some webpages that display data relevant to your work/research, such as date or address information, but that do not provide any way of downloading the data directly. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. 
*   It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.
*   This approach suits static content, that is, content already present in the HTML returned by a plain HTTP request. Content that JavaScript adds after the page loads will not appear in that response.
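
As a rough illustration of the workflow described above, the sketch below pulls date and address fields out of a small HTML fragment and stores them as plain Python dictionaries. The markup, the class names (`event`, `date`, `address`), and the values are all invented for the example; a real page will need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for illustration
html = """
<ul>
  <li class="event"><span class="date">2021-03-14</span> <span class="address">12 Example Road</span></li>
  <li class="event"><span class="date">2021-04-02</span> <span class="address">7 Sample Street</span></li>
</ul>
"""

# html.parser is used so that the example needs no extra install
soup = BeautifulSoup(html, "html.parser")

records = []
for item in soup.find_all("li", class_="event"):
    records.append({
        # get_text(strip=True) drops the markup and surrounding whitespace
        "date": item.find("span", class_="date").get_text(strip=True),
        "address": item.find("span", class_="address").get_text(strip=True),
    })

print(records)
```

From here the list of dictionaries could be written to a CSV or JSON file with the standard library.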




# **Basic Terms in Web Scraping**

1.   **Crawler**: a web bot that visits a set of web pages and accumulates links (URLs), deriving new URLs from each new HTML page it visits. A crawler may or may not save the pages' content to storage. It does not go deeper unless explicitly programmed to.
2.   **Scraper**: a bot that visits the web pages of a given set of URLs. Unlike a crawler, it does not collect new URLs; it visits pre-collected URLs and retrieves the relevant data to put into storage.
3.   **Parser**: an offline program that processes or analyses given data to derive a proper data structure. It extracts information from unstructured data, whether read from storage or taken directly from the web (e.g. HTML).
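
The division of labour between the three roles can be sketched in a few lines. This is only a toy model: the `PAGES` dictionary stands in for a website so that no network access is needed, and the function names `parse`, `crawl`, and `scrape` are made up for the illustration. The crawler here is explicitly programmed to keep following links until none are left.

```python
from bs4 import BeautifulSoup

# Hypothetical in-memory "site": URL -> HTML
PAGES = {
    "/home": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<h1>Page A</h1> <a href="/b">B</a>',
    "/b": '<h1>Page B</h1>',
}

def parse(html):
    """Parser: turn raw HTML into a navigable tree."""
    return BeautifulSoup(html, "html.parser")

def crawl(start):
    """Crawler: follow links from start and collect every URL reached."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.append(url)
        for link in parse(PAGES[url]).find_all("a"):
            queue.append(link["href"])
    return seen

def scrape(urls):
    """Scraper: visit known URLs and pull out the data of interest."""
    data = {}
    for url in urls:
        heading = parse(PAGES[url]).h1
        data[url] = heading.string if heading else None
    return data

urls = crawl("/home")
print(urls)
print(scrape(urls))
```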





# **Types of Parser**

1.   **html.parser** :  built-in, no extra dependencies needed.
2.   **html5lib** : the most lenient; it parses pages the way a web browser does, so it is the better choice when the HTML is broken.
3.   **lxml** : the fastest.
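
One way to compare the parsers is to feed each of them the same broken markup and look at how it gets repaired. The snippet below is a small sketch of that idea; the broken string is invented, and parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: the <p> and <b> tags are never closed
broken_html = "<p>Unclosed paragraph<b>bold text"

results = {}
for parser in ("html.parser", "html5lib", "lxml"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser] = None

for parser, repaired in results.items():
    shown = repaired if repaired is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```

The parsers may wrap the fragment differently, for example by adding `<html>` and `<body>` tags, which is why the same page can produce slightly different trees depending on the parser chosen.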


# **How to Make a Soup out of an HTML File**

(Note: here "soup" means the parse tree that Beautiful Soup builds from the HTML.)

In [5]:
from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('intro_to_soup_html.html') as file:
        return file.read()

# Make soup
# Syntax: BeautifulSoup(html_data, parser)
# The parser can be 'lxml' (installed separately) or the built-in 'html.parser'

html_file = read_file()
#print(html_file)


soup = BeautifulSoup(html_file, 'lxml')
#print(soup)
#type(soup)

# prettify() returns the parse tree as an indented string
print(soup.prettify())


<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Intro_to_soup
  </title>
 </head>
 <body>
  <div>
   <p>
    In first div
   </p>
  </div>
  <div>
   <p>
    In second div
   </p>
  </div>
 </body>
</html>
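
As a variation on the cell above, `BeautifulSoup` also accepts an open file handle, so the file does not have to be read into a string first. The sketch below writes a throwaway HTML file into a temporary directory so that it can run anywhere; the file name `example.html` and its contents are assumptions made for the example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

html = (
    "<html><head><title>Intro_to_soup</title></head>"
    "<body><div><p>In first div</p></div></body></html>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.html"   # hypothetical file name
    path.write_text(html, encoding="utf-8")

    # Pass the open file handle straight to BeautifulSoup
    with open(path, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "html.parser")

print(soup.title.string)
print(soup.p.string)
```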


# **How to Make a Soup out of any Website HTML**

In [13]:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

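ua = UserAgent()
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': ua.chrome}
page = requests.get('https://www.amazon.in', headers=headers, timeout=10)
page.raise_for_status()  # stop early if the server returned an error status
#print(page.content)

soup = BeautifulSoup(page.content, 'lxml')  # or 'html.parser'

print(soup.prettify())


# identify some tags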


<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-in">
 <!-- sp:feature:head-start -->
 <head>
  <script>
   var aPageStart = (new Date()).getTime();
  </script>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   var ue_t0=ue_t0||+new Date();
  </script>
  <!-- sp:feature:cs-optimization -->
  <meta content="on" http-equiv="x-dns-prefetch-control"/>
  <link href="https://images-eu.ssl-images-amazon.com" rel="dns-prefetch"/>
  <link href="https://m.media-amazon.com" rel="dns-prefetch"/>
  <link href="https://completion.amazon.com" rel="dns-prefetch"/>
  <script type="text/javascript">
   window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function

In [6]:
pip install fake_useragent

Note: you may need to restart the kernel to use updated packages.


# **Analyzing HTML Tags**

In [33]:
from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('tags.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')
#print(soup)

# Accessing tags
meta = soup.meta
#print(meta) # gives us the first occurrence of the meta tag

div = soup.div
#print(div) # gives us the first occurrence of the div tag

# Tag attributes can be read with the .get() method
# or with dictionary-style access
print("Value of Charset via get method is: ")
print(meta.get("charset"))

print("Value of Charset via dictionary is: ")
print(meta["charset"]) # a tag can be treated as a dictionary

# modify attributes at runtime
body = soup.body
#print(body.prettify()) # prints entire body content
#print(body['style'])  # raises KeyError, as body has no style attribute yet
body['style'] = 'some style'
print(body['style']) # prints: some style

# Multi-valued attributes
print(body['class']) # class is multi-valued, so a list is returned: first and second

Value of Charset via get method is: 
UTF-8
Value of Charset via dictionary is: 
UTF-8
some style
['first', 'second']
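
The difference between the two ways of reading an attribute shows up when the attribute is missing. The following sketch, using made-up markup rather than `tags.html`, illustrates that `.get()` returns `None` (or a default) while dictionary-style access raises `KeyError`; it also shows `.name`, `.attrs`, and deleting an attribute.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = '<body class="first second" id="main"><p>Hello</p></body>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

print(body.name)              # the tag's name
print(body.attrs)             # all attributes as a dictionary
print(body.get("style"))      # .get() returns None for a missing attribute
print(body.get("style", ""))  # or a default value of your choice

try:
    body["style"]             # dictionary-style access raises KeyError instead
except KeyError:
    print("no style attribute")

del body["id"]                # attributes can also be deleted
print(body)
```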


# **Navigable Strings**

In [1]:
from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('intro_to_soup_html.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

# Navigable strings in the HTML file are: Intro_to_soup, In first div, In second div

# To access the string inside a tag, use the .string attribute
title = soup.title

#print(title)         # the complete HTML element is printed
#print(title.string)  # only the string inside the element is printed


# .replace_with() swaps a navigable string for new text
print("Before replacing:")
print(title)

title.string.replace_with("title has been changed") # replaces "Intro_to_soup" with "title has been changed"

print("After replacing:")
print(title)

Before replacing:
<title>Intro_to_soup</title>
After replacing:
<title>title has been changed</title>
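
One detail worth knowing about `.string` is that it only works when a tag has a single child. The sketch below, on invented markup, shows `.string` returning `None` for a tag with mixed content and `get_text()` as a possible fallback that joins all the strings inside the tag.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup: the second <p> contains text plus a nested <b> tag
html = "<div><p>In first div</p><p>Two <b>parts</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

print(isinstance(first.string, NavigableString))  # the string is a NavigableString
print(first.string)
print(second.string)      # None, because this tag has more than one child
print(second.get_text())  # joins every string inside the tag
```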


# **Navigating Through Tag Names**

In [None]:
from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# example  -- accessing tags directly by their tag names
title = soup.title
print(title) # prints the first title tag

p = soup.p
print(p) # prints the first p tag


<title>
            The Dormouse's story
        </title>
<p class="title">
<b>
                The Dormouse's story
            </b>
</p>


# **Navigating Through Child Tags**

In [42]:
from bs4 import BeautifulSoup

def read_file():
    file = open('three_sisters.html')
    data = file.read()
    file.close()
    return data

soup = BeautifulSoup(read_file(),'lxml')

# tag.contents         -- returns a list of children
head = soup.head
#print(head.contents)

for child in head.contents:
    #print(child)
    pass

body = soup.body
#print(body.contents)
for child in body.contents:
    #print(child, end='\n\n\n\n')  # end='\n\n\n\n' only adds space between children; it can be dropped
    pass


# .children         -- returns an iterator over the same children
for child in body.children:
    print(child, end='\n\n\n\n')







<b></b>








<p class="title">
<b>
                The Dormouse's story
            </b>
</p>








<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                Elsie
            </a>
            ,
            <a class="sister" href="http://example.com/lacie" id="link2">
                Lacie
            </a>
                and
            <a class="sister" href="http://example.com/tillie" id="link2">
                Tillie
            </a>
                ; and they lived at the bottom of a well.
        </p>








<p class="story">
<b>
                The End
            </b>
</p>
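
The blank gaps in the output above come from the newline strings that sit between the tags and are children too. If only the tags are wanted, one option is to filter on the `Tag` type, as in this sketch (the markup is a shortened stand-in for `three_sisters.html`).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the body of three_sisters.html
html = "<body>\n<b></b>\n<p class='title'>One</p>\n<p class='story'>Two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .children yields whitespace strings as well as tags; keep only the tags
child_tags = [child for child in soup.body.children if isinstance(child, Tag)]

print([tag.name for tag in child_tags])
print(len(soup.body.contents), len(child_tags))
```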










# Navigating with Beautiful Soup - Going Up - use three_sisters.html

There are three types of movement across the HTML parse tree:

1.   Down the tree: from the body tag to a p tag
2.   Up the tree: from a p tag to the body tag
3.   Sideways: from one p tag to another p tag
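
The three movements listed above can be seen side by side in a compact sketch. The markup is invented and deliberately has no whitespace between tags, so that `.next_sibling` lands directly on the next tag.

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so siblings are adjacent tags
html = "<html><body><p id='one'>1</p><p id='two'>2</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.body.p               # down: body -> first p
print(first_p["id"])
print(first_p.parent.name)          # up: p -> body
print(first_p.next_sibling["id"])   # sideways: p -> next p
```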






In [4]:
# This script shows how to move up the HTML parse tree from a child tag

from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')
title = soup.title

parent = title.parent
#print(parent)        # prints the complete parent element
#print(parent.name)   # tag.name gives the parent tag's name


# .parent
p = soup.p
print(p)  # prints the first occurrence of the p tag
print(p.parent) # prints the complete body tag, since it is the parent of the p tag
print(p.parent.name) # prints only the name of the parent

# Note: all p tags are siblings in this HTML.
# The tree starts at soup, whose child is html; html has head and body as
# children; head and body have children depending on the structure of the web page.

# html
html = soup.html
#print(type(html.parent))   # the parent of html is the BeautifulSoup object, the top-level parent of every parsed tag


# soup
print(soup.parent) # prints None, as soup is at the top of the hierarchy

<p class="title">
<b>
                The Dormouse's story
            </b>
</p>
<body>
<b></b>
<p class="title">
<b>
                The Dormouse's story
            </b>
</p>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                Elsie
            </a>
            ,
            <a class="sister" href="http://example.com/lacie" id="link2">
                Lacie
            </a>
                and
            <a class="sister" href="http://example.com/tillie" id="link2">
                Tillie
            </a>
                ; and they lived at the bottom of a well.
        </p>
<p class="story">
<b>
                The End
            </b>
</p>
</body>
body
None


In [None]:
'''This script describes the .parents attribute: how to access all the
parents of a particular tag'''

from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

# .parents returns a generator over all the ancestors of a tag.
# We use the 'a' tag: its parent is the 'p' tag, whose parent is the 'body' tag, and so on.
# Moving up the tree: a --> p --> body --> html --> BeautifulSoup object

link = soup.a
#print(link) # prints the first a tag
#print(link.parents) # prints a generator object
#print(link.parent) # prints the enclosing p tag
#print(link.parent.name) # prints only the name of the a tag's parent

for parent in link.parents:
    print(parent.name) # p --> body --> html --> [document]
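
Since the cell above has no recorded output, here is a self-contained sketch of the same walk up the tree on a small invented document. The last ancestor is the `BeautifulSoup` object itself, whose `.name` is the string `[document]`.

```python
from bs4 import BeautifulSoup

# Made-up markup: one link nested inside a paragraph
html = "<html><body><p>Names: <a href='http://example.com/elsie'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

link = soup.a

# Collect the name of every ancestor, from the nearest to the top of the tree
ancestors = [parent.name for parent in link.parents]
print(ancestors)
```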


# Navigating with Beautiful Soup - Going Sideways (moving through siblings) - use three_sisters.html

In [None]:
# This script demonstrates moving from the current tag to its next sibling tag
# Here we are moving sideways

# Observe that the first b tag and the p tags are siblings

from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

body = soup.body
p = soup.body.p
#print(p) # prints the first p tag, with class "title"

# body - contents
print(body.contents)

# .next_sibling
# Our task now is to move from the p tag "title" to the next p tag "story".
# Observe in the output of print(body.contents) that there is a newline
# character "\n" between them, and only then the p tag "story".
#print(p.next_sibling) # prints a blank line, as the next sibling is the newline character
#print(p.next_sibling.next_sibling) # prints the p tag "story": moving sideways



['\n', <b></b>, '\n', <p class="title">
<b>
                The Dormouse's story
            </b>
</p>, '\n', <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                Elsie
            </a>
            ,
            <a class="sister" href="http://example.com/lacie" id="link2">
                Lacie
            </a>
                and
            <a class="sister" href="http://example.com/tillie" id="link2">
                Tillie
            </a>
                ; and they lived at the bottom of a well.
        </p>, '\n', <p class="story">
<b>
                The End
            </b>
</p>, '\n']

In [None]:
'''
This script demonstrates moving from the current tag to its previous sibling tag.
Here we are moving sideways, from the body tag to the head tag.
'''

from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

body = soup.body

# contents - html
#print(soup.html.contents) # prints the children of the html tag as a list

# We shall move from the body tag to the head tag
# .previous_sibling
#print(body.previous_sibling) # prints a blank line, as the previous sibling is the newline character
#print(body.previous_sibling.previous_sibling) # prints the head tag, the previous sibling of the body tag


In [None]:
'''
This script demonstrates moving from the current tag to its next and previous sibling tags.
Here we are moving sideways: from the first 'p' tag to the following 'p' tags,
and back to the preceding 'b' tag.
'''

from bs4 import BeautifulSoup

def read_file():
    # The context manager closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')
p = soup.body.p
#print(p)   # prints the first 'p' tag: <p class="title">


# .next_siblings (after the first 'p' tag there are two more sibling tags: p, p)
# Use a conditional expression to skip the '\n' strings: (value if condition else '')

for sibling in p.next_siblings:
  #print(sibling.name if sibling != '\n' else '') # note: newline characters are skipped; see the tree
  pass


# .previous_siblings (before the first 'p' tag there is only one 'b' tag)
# note: newline characters are skipped here too; see the tree

for sibling in p.previous_siblings:
  print(sibling if sibling != '\n' else '')
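
An alternative to skipping the newline strings by hand is to let Beautiful Soup search for the next or previous sibling that is a tag. The sketch below uses `find_next_sibling()` and `find_previous_sibling()` on a shortened stand-in for `three_sisters.html`; the markup is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the body of three_sisters.html
html = "<body>\n<b></b>\n<p class='title'>T</p>\n<p class='story'>S</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
p = soup.body.p

print(repr(p.next_sibling))                # the newline string comes first
print(p.find_next_sibling("p")["class"])   # jumps straight to the next p tag
print(p.find_previous_sibling("b"))        # and back to the b tag
```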
