# Web Scraping

## Introduction to Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

In [18]:
!pip install beautifulsoup4



In [19]:
# import libraries

In [20]:
from bs4 import BeautifulSoup

Let's take an example of how to parse HTML text using Beautiful Soup.

In [21]:
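# No parser is named, so Beautiful Soup picks the best one installed (the output below matches lxml)
soup = BeautifulSoup("<p>Some<b>bad<i>HTML")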

In [22]:
print(soup.prettify())  # pretty-print the parse tree as an indented string

<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
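
The call above does not name a parser, so the result depends on which parsers are installed. As a minimal sketch, assuming you want the same behaviour on every machine, you can pass Python's built-in `"html.parser"` as the second argument. The variable names here are only illustrative.

```python
from bs4 import BeautifulSoup

broken_html = "<p>Some<b>bad<i>HTML"

# "html.parser" ships with Python, so no extra install is needed.
soup_builtin = BeautifulSoup(broken_html, "html.parser")

# The built-in parser closes the open tags but does not wrap
# the fragment in <html> and <body> elements.
closed_markup = str(soup_builtin)
print(closed_markup)
print(soup_builtin.prettify())
```

Different parsers can repair the same broken markup in different ways, so naming one explicitly keeps results reproducible.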


### Let's scrape an HTML form

<!DOCTYPE html>
<html>
<body>

<h2>Text Input</h2>

<form>
  First name:<br>
  <input type="text" name="firstname">
  <br>
  Last name:<br>
  <input type="text" name="lastname">
</form>

<p>Note that the form itself is not visible.</p>

<p>Also note that the default width of a text input field is 20 characters.</p>

</body>
</html>


In [23]:
# Open index.html (shown above) from the data folder and read its contents into text
with open("../data/index.html") as f:
    text = f.read()

In [24]:
print(text)

<!DOCTYPE html>
<html>
<body>

<h2>Text Input</h2>

<form>
  First name:<br>
  <input type="text" name="firstname">
  <br>
  Last name:<br>
  <input type="text" name="lastname">
</form>

<p>Note that the form itself is not visible.</p>

<p>Also note that the default width of a text input field is 20 characters.</p>

</body>
</html>



In [25]:
soup = BeautifulSoup(text)  # to parse the HTML, just pass the text to BeautifulSoup

In [26]:
soup.title  # returns None (so nothing is displayed) because the HTML has no <title> element
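
Because a missing element comes back as `None`, calling a method on it directly would raise an `AttributeError`. A small sketch of one way to guard against that, using a made-up HTML string and the built-in `"html.parser"` (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup with no <title> element.
html_without_title = "<html><body><h2>Text Input</h2></body></html>"
soup = BeautifulSoup(html_without_title, "html.parser")

title_tag = soup.title  # None, because there is no <title>

# Check for None before reading the text of the tag.
if title_tag is not None:
    title_text = title_tag.get_text()
else:
    title_text = "(no title)"

print(title_text)
```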

In [27]:
soup.p  # returns the first <p> (paragraph) element

<p>Note that the form itself is not visible.</p>

In [28]:
soup.h2  # returns the first <h2> element

<h2>Text Input</h2>

In [29]:
soup.form  # returns the first <form> element

<form>
  First name:<br/>
<input name="firstname" type="text"/>
<br/>
  Last name:<br/>
<input name="lastname" type="text"/>
</form>
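
The object returned by `soup.form` is itself a tag that can be searched, which is handy when you only want elements inside the form. The sketch below assumes a slightly extended page with one hypothetical extra `<input>` outside the form, to show that the scoped search ignores it.

```python
from bs4 import BeautifulSoup

# Same form as above, plus a hypothetical input that sits outside the form.
page = """
<h2>Text Input</h2>
<input type="hidden" name="outside">
<form>
  First name:<br>
  <input type="text" name="firstname">
  <br>
  Last name:<br>
  <input type="text" name="lastname">
</form>
"""
soup = BeautifulSoup(page, "html.parser")

form = soup.form

# Search only inside the form, not the whole document.
inputs_in_form = [tag["name"] for tag in form.find_all("input")]

# stripped_strings yields the text pieces inside the form,
# with surrounding whitespace removed and empty pieces skipped.
labels = list(form.stripped_strings)

print(inputs_in_form)
print(labels)
```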

In [30]:
soup.find_all("p")  # returns a list of all <p> elements in the HTML file

[<p>Note that the form itself is not visible.</p>,
 <p>Also note that the default width of a text input field is 20 characters.</p>]

In [31]:
soup.find_all("input")  # returns all <input> elements in the HTML file

[<input name="firstname" type="text"/>, <input name="lastname" type="text"/>]

In [32]:
soup.find_all("h2")  # returns all <h2> elements in the HTML file

[<h2>Text Input</h2>]
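
Once you have the tags, you usually want their attributes or text rather than the tags themselves. Here is a short sketch of reading attributes from the `<input>` elements. It parses an inline copy of the form with `"html.parser"` so it runs on its own, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

form_html = """
<form>
  First name:<br>
  <input type="text" name="firstname">
  <br>
  Last name:<br>
  <input type="text" name="lastname">
</form>
"""
soup = BeautifulSoup(form_html, "html.parser")

# A tag behaves like a dictionary of its attributes.
field_names = [tag["name"] for tag in soup.find_all("input")]

# .get() returns None instead of raising KeyError for a missing attribute.
first_input = soup.find("input")
placeholder = first_input.get("placeholder")

# To filter on the HTML attribute called "name", pass it through attrs,
# because the name= keyword of find() refers to the tag name.
last_name_input = soup.find("input", attrs={"name": "lastname"})

print(field_names)
print(last_name_input["type"])
```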


# How to copy a web page onto a local machine

In [34]:
import requests  # HTTP library used to download the page

In [35]:
# For example, let's take the Wikipedia page of Shah Rukh Khan.
# We want to save a copy of the web page on our local machine as srk.html.
url = "https://en.wikipedia.org/wiki/Shah_Rukh_Khan"

In [37]:
srk = open('../output/srk.html', 'w', encoding='utf-8')  # first open srk.html in write mode (text, not binary)

In [38]:
text = requests.get(url)  # fetch the Wikipedia page; requests.get returns a Response object

In [39]:
srk.write(text.text)  # write its content into srk.html; returns the number of characters written

705675
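
The cells above never close the file and do not check whether the download succeeded. As a hedged sketch of a tidier version, the two hypothetical helpers below (`fetch_html` and `save_html` are names made up for this example) check the HTTP status and use a `with` block so the file is closed automatically. The 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def save_html(html, path):
    """Write the HTML to path and return the number of characters written."""
    # The with block closes the file even if the write fails.
    with open(path, "w", encoding="utf-8") as f:
        return f.write(html)


# Usage (needs network access and an existing ../output folder):
# html = fetch_html("https://en.wikipedia.org/wiki/Shah_Rukh_Khan")
# save_html(html, "../output/srk.html")
```

Writing with an explicit `utf-8` encoding avoids errors on systems whose default encoding cannot represent every character on the page.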