# <center>HTML Basics</center>

___

Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It is often complemented by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript (JS).

- HTML describes the structure of a web page and tells the browser how to display the content.
- HTML consists of a series of elements, represented by tags.
- Browsers do not display the HTML tags, but use them to render the content of the page.


### Simple HTML Document

- The `<!DOCTYPE html>` declaration defines this document to be HTML5
- The `<html>` element is the root element of an HTML page
- The `<head>` element contains meta information about the document
- The `<title>` element specifies a title for the document
- The `<body>` element contains the visible page content
- The `<h1>` element defines a large heading
- The `<p>` element defines a paragraph

<img src='img/pagestructure.jpg'  width="700" height="700" >
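
As a rough illustration of the structure listed above, the sketch below stores a minimal page in a Python string and walks it with the standard library's `html.parser`. The page text and the `TagCollector` class are made up for this example; they are not part of the original activity.

```python
from html.parser import HTMLParser

# A minimal page built from the elements listed above (the text is invented).
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A large heading</h1>
    <p>A paragraph.</p>
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Hypothetical helper: record the doctype and every start tag in order."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.tags = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, in document order
        self.tags.append(tag)


collector = TagCollector()
collector.feed(PAGE)
print(collector.doctype)
print(collector.tags)
```

The printed tag list follows the nesting order of the page: the root element first, then the head and its title, then the body and its visible content.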

### HTML Tags

HTML tags are element names surrounded by angle brackets. They normally come in pairs, with a start tag such as `<p>` and an end tag such as `</p>`. Some elements have no content and are written as a single self-closing tag, such as the line break `<br />`.

Some commonly used tags:

- __Anchor__ tag (`<a>`): Links one page to another page.
- __List__ tag (`<li>`): Defines an item in a list.
- __Ordered List__ tag (`<ol>`): Lists content in a particular order.
- __Unordered List__ tag (`<ul>`): Lists content without a particular order.
- __Image__ tag (`<img>`): Adds an image to an HTML document.
- __Table__ tags (`<table>`, `<tr>`, `<th>`, `<td>`): Create a table in an HTML document.
- __Form__ tag (`<form>`): Creates an HTML form for user input.


Let's implement these below with a very basic example.

<html>
    <head>
        <title>First HTML</title>
    </head>    
    <body bgcolor = 'yellow'>  
        <p><i>Write this code in a text file, save it as html format and open in browser.</i></p>
        <ol>  
            <li>List item 1</li>  
            <li>List item 2</li> 
        </ol>
        <ul type = 'circle'>  
            <li>List item</li>  
            <li>List item</li> 
        </ul>
        <form> 
            <p>Submit you Request</p>
            <input type='text' maxlength='30'>  
            <input type='Submit' value='Submit'>  
        </form>
        <img src='img/sale.jpg' width='200' height='200' >
        <a href = 'https://www.amazon.com/'>BUY@AMAZON</a>
        <br/>
        <table bordercolor = 'black' bgcolor = 'lightpink'> 
            <tr> 
                <th>Month</th> 
                <th>Expenses</th> 
            </tr> 
            <tr> 
                <td>January</td> 
                <td>100</td> 
            </tr> 
        </table>
    </body>    
</html>    

# <center>Web Scraping with BeautifulSoup</center>

___

This activity uses the following modules:

- Requests : To make web requests

- Beautiful Soup : To extract data from the HTML response
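
A minimal sketch of how the two modules might fit together is shown below. The helper names `parse_title` and `fetch_title` are hypothetical, and `https://example.com` is only a placeholder URL; the parsing step is demonstrated on an inline string so it does not need a network connection.

```python
import requests
from bs4 import BeautifulSoup


def parse_title(html_text):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html_text, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


def fetch_title(url, timeout=10):
    """Download a page and return its title (performs a real web request)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return parse_title(response.text)


# Offline check with an inline document; fetch_title("https://example.com")
# would apply the same parsing to a live page.
print(parse_title("<html><head><title>FirstHTML</title></head></html>"))
```

Keeping the download and the parsing in separate functions makes the parsing logic easy to try on saved or hand-written HTML.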

BeautifulSoup can extract single or multiple occurrences of a specific tag, and can also accept search criteria based on attributes, using methods such as:

- find()

- find_all()

- select()
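
The attribute-based search mentioned above could look like the following sketch. The snippet is a cut-down, made-up fragment, and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

snippet = """
<ul type="circle"><li>UnorderedListItem</li></ul>
<form>
  <input type="text" maxlength="30">
  <input type="Submit" value="Submit">
</form>
"""
soup = BeautifulSoup(snippet, "html.parser")

# find(): first <input> whose type attribute equals "text"
text_box = soup.find("input", attrs={"type": "text"})
print(text_box["maxlength"])

# find_all(): every <input> that has a value attribute (True means "any value")
with_value = soup.find_all("input", attrs={"value": True})
print(len(with_value))

# select(): a similar filter written as a CSS attribute selector
circle_items = soup.select('ul[type="circle"] > li')
print([li.get_text() for li in circle_items])
```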

#### Points To Remember:

- The logic to extract data usually depends on the HTML structure of the webpage, so changes to that structure can break the logic.

- A website's content may be subject to applicable laws, so read the site's terms and conditions before using its content.
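
One way to soften the first point is to check for missing tags instead of assuming they exist. The sketch below is illustrative only: `extract_price`, the `price` class, and both layouts are invented for the example.

```python
from bs4 import BeautifulSoup


def extract_price(html_text):
    """Return the price text, or None when the expected markup is missing."""
    soup = BeautifulSoup(html_text, "html.parser")
    # find() returns None if no tag matches, so check before using the result
    cell = soup.find("td", attrs={"class": "price"})
    if cell is None:
        return None
    return cell.get_text(strip=True)


old_layout = "<table><tr><td class='price'> 100 </td></tr></table>"
new_layout = "<div><span class='amount'>100</span></div>"  # after a redesign

print(extract_price(old_layout))
print(extract_price(new_layout))
```

Returning `None` for the redesigned layout lets the caller log the problem or skip the page rather than crash with an `AttributeError`.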


In [28]:
html = "<html><head><title>FirstHTML</title></head><body bgcolor='yellow'>\
        <p><i>Write this code in a text file, save it as html format and open in browser.</i></p>\
        <ol><li>OrderedListItem</li></ol><ul type='circle'><li>UnorderedListItem</li></ul>\
        <form><p>Submit you Request</p><input type='text' maxlength='30'><input type='Submit' value='Submit'></form>\
        <img src='sale.jpg' width='200' height='200'>\
        <a href='https://www.amazon.com/'>BUY@AMAZON</a><br/>\
        <table bordercolor='black' bgcolor='lightpink'><tr><th>Month</th><th>Expenses</th></tr>\
        <tr><td>Jan</td><td>100</td><td>Feb</td><td>105</td></tr></table></body></html>"


In [31]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   FirstHTML
  </title>
 </head>
 <body bgcolor="yellow">
  <p>
   <i>
    Write this code in a text file, save it as html format and open in browser.
   </i>
  </p>
  <ol>
   <li>
    OrderedListItem
   </li>
  </ol>
  <ul type="circle">
   <li>
    UnorderedListItem
   </li>
  </ul>
  <form>
   <p>
    Submit you Request
   </p>
   <input maxlength="30" type="text"/>
   <input type="Submit" value="Submit"/>
  </form>
  <img height="200" src="sale.jpg" width="200"/>
  <a href="https://www.amazon.com/">
   BUY@AMAZON
  </a>
  <br/>
  <table bgcolor="lightpink" bordercolor="black">
   <tr>
    <th>
     Month
    </th>
    <th>
     Expenses
    </th>
   </tr>
   <tr>
    <td>
     Jan
    </td>
    <td>
     100
    </td>
    <td>
     Feb
    </td>
    <td>
     105
    </td>
   </tr>
  </table>
 </body>
</html>


In [3]:
soup.a

<a href="https://www.amazon.com/">BUY@AMAZON</a>

In [4]:
soup.table

<table bgcolor="lightpink" bordercolor="black"><tr><th>Month</th><th>Expenses</th></tr> <tr><td>Jan</td><td>100</td><td>Feb</td><td>105</td></tr></table>

In [5]:
soup.table.attrs

{'bordercolor': 'black', 'bgcolor': 'lightpink'}

In [6]:
soup.table['bgcolor']

'lightpink'
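
Subscript access assumes the attribute exists. The following sketch, using a made-up one-tag document, shows a few ways to handle attributes that may be absent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table bordercolor='black' bgcolor='lightpink'></table>", "html.parser"
)
table = soup.table

# Subscript access raises KeyError for a missing attribute ...
try:
    table["width"]
except KeyError:
    print("no width attribute")

# ... while get() returns None, or a default of your choice
print(table.get("width"))
print(table.get("width", "auto"))

# has_attr() tests for presence without reading the value
print(table.has_attr("bgcolor"))
```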

In [7]:
soup.li

<li>OrderedListItem</li>

- __find()__: Takes the name of a tag as a string and returns the first matching tag in the parsed document, or `None` if there is no match.

In [8]:
soup.find('li')

<li>OrderedListItem</li>

- __find_all()__: Extracts all occurrences of a particular tag from the page response.

find_all returns a `ResultSet`, which offers index-based access to the matches and can be iterated over with a for loop.

find_all also accepts a list of tags, as in `soup.find_all(['th', 'td'])`, and keyword arguments such as `id` to find tags with a specific id.

In [9]:
soup.find_all('li')

[<li>OrderedListItem</li>, <li>UnorderedListItem</li>]

In [10]:
soup.ul.li

<li>UnorderedListItem</li>

In [11]:
soup.find_all('li')[1]

<li>UnorderedListItem</li>
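
The list-of-tags and `id` forms of find_all described above can be sketched as follows. The two small tables and their ids (`expenses`, `income`) are invented for this example.

```python
from bs4 import BeautifulSoup

snippet = (
    "<table id='expenses'>"
    "<tr><th>Month</th><th>Expenses</th></tr>"
    "<tr><td>Jan</td><td>100</td></tr>"
    "</table>"
    "<table id='income'><tr><td>Feb</td></tr></table>"
)
soup = BeautifulSoup(snippet, "html.parser")

# A list of names matches any of the listed tags, in document order
cells = soup.find_all(["th", "td"])
print([cell.get_text() for cell in cells])

# Keyword arguments filter on attributes; here the id attribute
expenses = soup.find_all(id="expenses")
print(len(expenses))

# The result supports indexing and iteration like a list
for cell in expenses[0].find_all("td"):
    print(cell.get_text())
```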

- __select()__: Finds tags using CSS selectors and returns a list of all matches.

In [12]:
soup.select('html head title')

[<title>FirstHTML</title>]

In [13]:
soup.select('p')

[<p><i>Write this code in a text file, save it as html format and open in browser.</i></p>,
 <p>Submit you Request</p>]

In [14]:
soup.select('form > p')

[<p>Submit you Request</p>]
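
The difference between the descendant selector (a space) and the child selector (`>`) is easy to miss, so here is a small sketch on a made-up fragment. It also uses `select_one()`, which returns a single tag rather than a list.

```python
from bs4 import BeautifulSoup

snippet = (
    "<body><p><i>Intro</i></p>"
    "<form><p>Submit your request</p><input type='text'></form>"
    "<a href='https://www.amazon.com/'>BUY@AMAZON</a></body>"
)
soup = BeautifulSoup(snippet, "html.parser")

# "body p" (descendant) matches every <p> anywhere inside <body>
all_paragraphs = soup.select("body p")
print(len(all_paragraphs))

# "body > p" (child) matches only <p> tags directly under <body>
direct_paragraphs = soup.select("body > p")
print(len(direct_paragraphs))

# select_one() returns the first match, or None, instead of a list
link = soup.select_one("a[href]")
print(link["href"])
```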

### Extract text without tags

In [33]:
soup.ul.li.string

'UnorderedListItem'

In [16]:
list(soup.strings)

['FirstHTML',
 ' ',
 'Write this code in a text file, save it as html format and open in browser.',
 ' ',
 'OrderedListItem',
 'UnorderedListItem',
 ' ',
 'Submit you Request',
 ' ',
 ' ',
 'BUY@AMAZON',
 ' ',
 'Month',
 'Expenses',
 ' ',
 'Jan',
 '100',
 'Feb',
 '105']

In [34]:
soup.ul.li.string.replace_with('UL 1')

'UnorderedListItem'

In [35]:
soup.ul.li.text

'UL 1'
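
`.string` only works when a tag has a single child. For mixed content, `get_text()` and `stripped_strings` are usually more convenient, as in this sketch on an invented fragment.

```python
from bs4 import BeautifulSoup

snippet = "<p>Total: <b>100</b> USD</p> <p>  Done  </p>"
soup = BeautifulSoup(snippet, "html.parser")
first = soup.p

# .string is None when a tag has more than one child
print(first.string)

# get_text() joins the text of all descendants
print(first.get_text())

# stripped_strings skips whitespace-only strings and trims the rest
print(list(soup.stripped_strings))
```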

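Beautiful Soup also provides navigation properties such as:

- __children__ and __descendants__: `children` iterates over a tag's direct children; `descendants` iterates over everything nested inside it, at every level.

- __next_sibling__ and __previous_sibling__: To traverse nodes at the same level.

- __next_element__ and __previous_element__: To move to whatever was parsed immediately after or before a node, regardless of nesting level.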

In [19]:
soup.body.contents

[' ',
 <p><i>Write this code in a text file, save it as html format and open in browser.</i></p>,
 ' ',
 <ol><li>OrderedListItem</li></ol>,
 <ul type="circle"><li>UL 1</li></ul>,
 ' ',
 <form><p>Submit you Request</p><input maxlength="30" type="text"/><input type="Submit" value="Submit"/></form>,
 ' ',
 <img height="200" src="sale.jpg" width="200"/>,
 ' ',
 <a href="https://www.amazon.com/">BUY@AMAZON</a>,
 <br/>,
 ' ',
 <table bgcolor="lightpink" bordercolor="black"><tr><th>Month</th><th>Expenses</th></tr> <tr><td>Jan</td><td>100</td><td>Feb</td><td>105</td></tr></table>]

In [20]:
list(soup.body.children)

[' ',
 <p><i>Write this code in a text file, save it as html format and open in browser.</i></p>,
 ' ',
 <ol><li>OrderedListItem</li></ol>,
 <ul type="circle"><li>UL 1</li></ul>,
 ' ',
 <form><p>Submit you Request</p><input maxlength="30" type="text"/><input type="Submit" value="Submit"/></form>,
 ' ',
 <img height="200" src="sale.jpg" width="200"/>,
 ' ',
 <a href="https://www.amazon.com/">BUY@AMAZON</a>,
 <br/>,
 ' ',
 <table bgcolor="lightpink" bordercolor="black"><tr><th>Month</th><th>Expenses</th></tr> <tr><td>Jan</td><td>100</td><td>Feb</td><td>105</td></tr></table>]

In [21]:
list(soup.body.descendants)

[' ',
 <p><i>Write this code in a text file, save it as html format and open in browser.</i></p>,
 <i>Write this code in a text file, save it as html format and open in browser.</i>,
 'Write this code in a text file, save it as html format and open in browser.',
 ' ',
 <ol><li>OrderedListItem</li></ol>,
 <li>OrderedListItem</li>,
 'OrderedListItem',
 <ul type="circle"><li>UL 1</li></ul>,
 <li>UL 1</li>,
 'UL 1',
 ' ',
 <form><p>Submit you Request</p><input maxlength="30" type="text"/><input type="Submit" value="Submit"/></form>,
 <p>Submit you Request</p>,
 'Submit you Request',
 <input maxlength="30" type="text"/>,
 <input type="Submit" value="Submit"/>,
 ' ',
 <img height="200" src="sale.jpg" width="200"/>,
 ' ',
 <a href="https://www.amazon.com/">BUY@AMAZON</a>,
 'BUY@AMAZON',
 <br/>,
 ' ',
 <table bgcolor="lightpink" bordercolor="black"><tr><th>Month</th><th>Expenses</th></tr> <tr><td>Jan</td><td>100</td><td>Feb</td><td>105</td></tr></table>,
 <tr><th>Month</th><th>Expenses</th></t

In [22]:
print(soup.td)
print(soup.td.next_sibling)

<td>Jan</td>
<td>100</td>


In [23]:
print(list(soup.td.next_siblings))

[<td>100</td>, <td>Feb</td>, <td>105</td>]


In [24]:
print(soup.tr.previous_sibling)
print(soup.tr.previous_element)

None
<table bgcolor="lightpink" bordercolor="black"><tr><th>Month</th><th>Expenses</th></tr> <tr><td>Jan</td><td>100</td><td>Feb</td><td>105</td></tr></table>
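
Whitespace between tags is itself a node, which can make sibling navigation surprising. The sketch below, on a cut-down made-up table, contrasts `next_sibling`, `find_next_sibling()`, and `next_element`.

```python
from bs4 import BeautifulSoup

snippet = "<table><tr><th>Month</th></tr> <tr><td>Jan</td></tr></table>"
soup = BeautifulSoup(snippet, "html.parser")
header_row = soup.tr

# The space between the two rows is a text node, so it is the next sibling
print(repr(header_row.next_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag
data_row = header_row.find_next_sibling("tr")
print(data_row)

# next_element follows parse order, so it descends into the row's first child
print(header_row.next_element)
```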


### Parse based on condition

In [25]:
from bs4 import SoupStrainer

s = BeautifulSoup(html, 'html.parser', parse_only = SoupStrainer('body'))
print(s.prettify())

<body bgcolor="yellow">
 <p>
  <i>
   Write this code in a text file, save it as html format and open in browser.
  </i>
 </p>
 <ol>
  <li>
   OrderedListItem
  </li>
 </ol>
 <ul type="circle">
  <li>
   UnorderedListItem
  </li>
 </ul>
 <form>
  <p>
   Submit you Request
  </p>
  <input maxlength="30" type="text"/>
  <input type="Submit" value="Submit"/>
 </form>
 <img height="200" src="sale.jpg" width="200"/>
 <a href="https://www.amazon.com/">
  BUY@AMAZON
 </a>
 <br/>
 <table bgcolor="lightpink" bordercolor="black">
  <tr>
   <th>
    Month
   </th>
   <th>
    Expenses
   </th>
  </tr>
  <tr>
   <td>
    Jan
   </td>
   <td>
    100
   </td>
   <td>
    Feb
   </td>
   <td>
    105
   </td>
  </tr>
 </table>
</body>
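
A `SoupStrainer` can target smaller pieces than the whole body. The following sketch, on an invented fragment, keeps only list items in one pass and only links with an `href` in another; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup, SoupStrainer

snippet = (
    "<body><p>Intro</p>"
    "<ol><li>one</li></ol>"
    "<ul><li>two</li></ul>"
    "<a href='https://www.amazon.com/'>BUY@AMAZON</a></body>"
)

# Keep only <li> tags (and their contents) while parsing
only_items = SoupStrainer("li")
items_soup = BeautifulSoup(snippet, "html.parser", parse_only=only_items)
print([li.get_text() for li in items_soup.find_all("li")])

# A strainer can also filter on attributes, here any <a> with an href
only_links = SoupStrainer("a", href=True)
links_soup = BeautifulSoup(snippet, "html.parser", parse_only=only_links)
print([a["href"] for a in links_soup.find_all("a")])
```

Because everything outside the strainer is never added to the tree, tags such as `<p>` are simply absent from the resulting soup.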


### Remove a tag from the soup

In [26]:
s.table.decompose()
print(s.prettify())

<body bgcolor="yellow">
 <p>
  <i>
   Write this code in a text file, save it as html format and open in browser.
  </i>
 </p>
 <ol>
  <li>
   OrderedListItem
  </li>
 </ol>
 <ul type="circle">
  <li>
   UnorderedListItem
  </li>
 </ul>
 <form>
  <p>
   Submit you Request
  </p>
  <input maxlength="30" type="text"/>
  <input type="Submit" value="Submit"/>
 </form>
 <img height="200" src="sale.jpg" width="200"/>
 <a href="https://www.amazon.com/">
  BUY@AMAZON
 </a>
 <br/>
</body>


### Remove the first occurrence of a tag in the soup

In [27]:
s.li.extract()
print(s.prettify())

<body bgcolor="yellow">
 <p>
  <i>
   Write this code in a text file, save it as html format and open in browser.
  </i>
 </p>
 <ol>
 </ol>
 <ul type="circle">
  <li>
   UnorderedListItem
  </li>
 </ul>
 <form>
  <p>
   Submit you Request
  </p>
  <input maxlength="30" type="text"/>
  <input type="Submit" value="Submit"/>
 </form>
 <img height="200" src="sale.jpg" width="200"/>
 <a href="https://www.amazon.com/">
  BUY@AMAZON
 </a>
 <br/>
</body>
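
The two removal methods differ in what they hand back: `extract()` returns the removed tag, while `decompose()` destroys it and returns `None`. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

snippet = "<ol><li>first</li><li>second</li></ol><p>note</p>"
soup = BeautifulSoup(snippet, "html.parser")

# extract() removes the tag from the tree and returns it for further use
removed = soup.li.extract()
print(removed.get_text())
print(soup.ol)

# decompose() removes the tag and destroys it; it returns None
result = soup.p.decompose()
print(result)
print(soup)
```

Use `extract()` when the removed piece is still needed, for example to move it elsewhere, and `decompose()` when it can be discarded.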
