### Web Scraping with Python

#### **1. BeautifulSoup**
It is used for web scraping: pulling data out of HTML and XML files. It creates a parse tree from the page source code that can be used to extract data in a hierarchical and more readable manner.

In [2]:
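from bs4 import BeautifulSoup
import requests

URL = "http://www.example.com"
page = requests.get(URL, timeout=10)  # download the page; give up after 10 seconds
page.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(page.content, "html.parser")  # build the parse tree
soup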

<!DOCTYPE html>
<html lang="en"><head><title>Example Domain</title><meta content="width=device-width, initial-scale=1" name="viewport"/><style>body{background:#eee;width:60vw;margin:15vh auto;font-family:system-ui,sans-serif}h1{font-size:1.5em}div{opacity:0.8}a:link,a:visited{color:#348}</style><body><div><h1>Example Domain</h1><p>This domain is for use in documentation examples without needing permission. Avoid use in operations.<p><a href="https://iana.org/domains/example">Learn more</a></p></p></div></body></head></html>
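Once the parse tree exists, individual elements can be pulled out by tag name instead of reading the whole document. The following is a minimal sketch, assuming a small inline HTML string (`sample_html`, a stand-in for `page.content` so that no network request is needed); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical inline page used in place of a downloaded one.
sample_html = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example Domain</h1>"
    "<p><a href='https://iana.org/domains/example'>Learn more</a></p>"
    "</body></html>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string   # text inside <title>
first_link = sample_soup.find("a")      # first <a> tag in the tree
link_target = first_link["href"]        # value of its href attribute
link_text = first_link.get_text()       # visible link text

print(page_title, link_target, link_text)
```

The same calls should work on the `soup` built from a real response, provided the page actually contains a `<title>` and at least one `<a>` tag.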

#### **2. Scrapy**
Scrapy is an open-source, collaborative web crawling framework for Python. It is used to crawl websites and extract data from their pages.

In [4]:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # identifier Scrapy uses for this spider
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # select every <div class="quote"> block and yield its quote text
        for quote in response.css('div.quote'):
            yield {'quote': quote.css('span.text::text').get()}

In [6]:
QuotesSpider('hello')

<QuotesSpider 'hello' at 0x2e24d4fe710>

#### **3. Selenium**
Selenium is a tool used for controlling web browsers through programs and automating browser tasks.

### Applications of Web Scraping
1. Price comparison
2. Email address gathering
3. Social media scraping

### Web Scraping Lab

In [11]:
pip install --upgrade beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.14.2-py3-none-any.whl.metadata (3.8 kB)
Downloading beautifulsoup4-4.14.2-py3-none-any.whl (106 kB)
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.12.3
    Uninstalling beautifulsoup4-4.12.3:
      Successfully uninstalled beautifulsoup4-4.12.3
Successfully installed beautifulsoup4-4.14.2
Note: you may need to restart the kernel to use updated packages.


In [12]:
pip install --upgrade pandas


Collecting pandas
  Downloading pandas-2.3.3-cp313-cp313-win_amd64.whl.metadata (19 kB)
Downloading pandas-2.3.3-cp313-cp313-win_amd64.whl (11.0 MB)

ERROR: Could not install packages due to an OSError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\awais\\AppData\\Local\\Temp\\pip-unpack-cn4dm3l8\\pandas-2.3.3-cp313-cp313-win_amd64.whl'
Consider using the `--user` option or check the permissions.



In [13]:
import warnings
warnings.simplefilter('ignore')

In [14]:
from bs4 import BeautifulSoup
import requests

In [15]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [16]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

In [19]:
soup = BeautifulSoup(html, 'html.parser')

In [20]:
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <title>\n   Page Title\n  </title>\n </head>\n <body>\n  <h3>\n   <b id="boldest">\n    Lebron James\n   </b>\n  </h3>\n  <p>\n   Salary: $ 92,000,000\n  </p>\n  <h3>\n   Stephen Curry\n  </h3>\n  <p>\n   Salary: $85,000, 000\n  </p>\n  <h3>\n   Kevin Durant\n  </h3>\n  <p>\n   Salary: $73,200, 000\n  </p>\n </body>\n</html>\n'

In [21]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



### Tags


In [22]:
tag_object = soup.title
print('tag object:', tag_object)

tag object: <title>Page Title</title>


In [23]:
print("tag object type: ", type(tag_object))

tag object type:  <class 'bs4.element.Tag'>


In [24]:
soup.h3

<h3><b id="boldest">Lebron James</b></h3>

In [25]:
tag_object = soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>
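Accessing a tag as an attribute, as in `soup.h3`, returns only the first matching tag in the document. A small sketch of that behaviour, assuming a shortened copy of the lab's HTML (`short_html` and the other names are illustrative):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the lab document with two <h3> tags.
short_html = (
    "<html><body>"
    "<h3><b id='boldest'>Lebron James</b></h3>"
    "<h3> Stephen Curry</h3>"
    "</body></html>"
)
short_soup = BeautifulSoup(short_html, "html.parser")

first_h3 = short_soup.h3              # first <h3> only, like find("h3")
all_h3 = short_soup.find_all("h3")    # every <h3>, in document order

print(first_h3.name)        # tag name as a string
print(first_h3.get_text())  # text of the tag and its descendants
print(len(all_h3))          # number of <h3> tags found
```

When every occurrence is needed, `find_all` is the usual choice; attribute access is a shortcut for the first one.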

### Children, Parents, and Siblings

In [27]:
tag_child = tag_object.b
tag_child #Lebron James

<b id="boldest">Lebron James</b>

In [28]:
parent_tag = tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

In [30]:
sibling_1 = tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>
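The sibling here is a `<p>` tag because the `html` string has no whitespace between tags. If the markup is written with line breaks, `next_sibling` may return the newline text node instead. The sketch below assumes a hypothetical multi-line fragment (`spaced_html`) to show the difference, and uses `find_next_sibling`, which skips text nodes.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment with newlines between the tags.
spaced_html = """<body>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
</body>"""
spaced_soup = BeautifulSoup(spaced_html, "html.parser")
heading = spaced_soup.h3

raw_sibling = heading.next_sibling             # the newline between the tags
tag_sibling = heading.find_next_sibling("p")   # next <p> tag, text nodes skipped

print(repr(raw_sibling))
print(isinstance(raw_sibling, NavigableString))
print(tag_sibling)
```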

In [32]:
sibling_2 = sibling_1.next_sibling
sibling_2

<h3> Stephen Curry</h3>

In [33]:
sibling_2.next_sibling

<p> Salary: $85,000, 000 </p>
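Walking siblings one at a time works, but the same heading-then-paragraph pattern can be handled in a loop. This is one possible sketch, assuming the lab's HTML is copied into a hypothetical variable `lab_html`; it pairs each `<h3>` with the `<p>` that follows it.

```python
from bs4 import BeautifulSoup

# Copy of the lab's document body (head omitted for brevity).
lab_html = (
    "<html><body>"
    "<h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p>"
    "<h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p>"
    "<h3> Kevin Durant </h3><p> Salary: $73,200, 000</p>"
    "</body></html>"
)
lab_soup = BeautifulSoup(lab_html, "html.parser")

players = {}
for heading in lab_soup.find_all("h3"):
    salary_tag = heading.find_next_sibling("p")   # the <p> after this <h3>
    player_name = heading.get_text(strip=True)    # strip surrounding spaces
    players[player_name] = salary_tag.get_text(strip=True)

for player_name, salary in players.items():
    print(player_name, "->", salary)
```

The salary strings are kept exactly as they appear in the page, including the irregular spacing; cleaning them into numbers would be a separate step.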

In [34]:
tag_child['id']

'boldest'

In [35]:
tag_child.attrs

{'id': 'boldest'}

In [36]:
tag_child.get('id')

'boldest'
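Square-bracket access and `get` return the same value when the attribute exists, but they differ when it is missing. A brief sketch, assuming a one-tag document (`attr_soup`, an illustrative name) and a lookup of `class`, which this tag does not have:

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
bold_tag = attr_soup.b

existing = bold_tag.get("id")               # attribute is present
missing = bold_tag.get("class")             # attribute is absent: None
with_default = bold_tag.get("class", [])    # absent, with a fallback value

try:
    bold_tag["class"]                       # brackets raise on a missing key
    missing_raises = False
except KeyError:
    missing_raises = True

print(existing, missing, with_default, missing_raises)
```

`get` tends to be the safer choice when scraping pages where an attribute may or may not be present.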