# Web scraping project with BeautifulSoup

**1. Searching and extracting from the HTML**

To get HTML web pages into Python, first import the necessary libraries.

In [52]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://webscraper.io/test-sites/e-commerce/allinone/phones/touch'

In [53]:
page = requests.get(url) #Get the HTML page in Python
page

<Response [200]>

Response 200: the request for the URL succeeded.
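A 200 is not guaranteed on every run. As a minimal sketch, the request can be wrapped so that a failed download stops the notebook early. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(url)` would return the same text as `page.text`, or raise instead of silently passing an error page on to the parser.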

In [54]:
soup = BeautifulSoup(page.text, 'lxml') #Parse the HTML text into a searchable tree using the lxml parser
soup

<!DOCTYPE html>
<html lang="en">
<head>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
<title>Allinone | Web Scraper Test Sites</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords"/>
<meta content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed." name="description"/>
<link href="/favicon.png" rel

In [55]:
#find returns the first tag matching the given name and attributes (in this case, everything inside the header tag is displayed)
soup.find('header')
#or
#soup.header

<header class="navbar fixed-top navbar-expand-lg navbar-dark navbar-static svg-background" id="navbar-top" role="banner">
<div class="container">
<div class="navbar-header">
<a data-bs-target=".side-collapse" data-bs-target-2=".side-collapse-container" data-bs-toggle="collapse-side">
<button aria-controls="navbar" aria-expanded="false" class="navbar-toggler float-end collapsed" data-bs-target="#navbar" data-bs-target-2=".side-collapse-container" data-bs-target-3=".side-collapse" data-bs-toggle="collapse" type="button">
<span class="visually-hidden">Toggle navigation</span>
<span class="icon-bar top-bar"></span>
<span class="icon-bar middle-bar"></span>
<span class="icon-bar bottom-bar"></span>
<span class="icon-bar extra-bottom-bar"></span>
</button>
</a>
<div class="navbar-brand">
<a href="/"><img alt="Web Scraper" src="/img/logo_white.svg"/></a>
</div>
</div>
<div class="side-collapse in">
<nav class="navbar-collapse collapse" id="navbar" role="navigation">
<ul class="nav navbar-nav 
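To see how `find` and the attribute shorthand relate, here is a small sketch on a made-up HTML snippet. The snippet and variable names are assumptions, and it uses Python's built-in `html.parser` so it runs without lxml. It also shows that `find` returns `None` when nothing matches, which is worth checking before calling methods on the result.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (an assumption, not the real page)
html = "<html><body><header id='top'><a href='/'>Home</a></header></body></html>"
doc = BeautifulSoup(html, "html.parser")

by_find = doc.find("header")   # explicit search for the first <header>
by_attr = doc.header           # shorthand for the same search
same_tag = by_find is by_attr  # both refer to the same Tag object

missing = doc.find("footer")   # there is no <footer>, so this is None
footer_text = missing.get_text() if missing is not None else ""
```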

Getting all the attributes of the header tag

In [56]:
soup.header.attrs

{'role': 'banner',
 'class': ['navbar',
  'fixed-top',
  'navbar-expand-lg',
  'navbar-dark',
  'navbar-static',
  'svg-background'],
 'id': 'navbar-top'}
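Individual attributes can be read from a tag much like dictionary entries. The sketch below uses a shortened copy of the header tag (an assumption for illustration) and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = '<header class="navbar fixed-top" id="navbar-top" role="banner"></header>'
header = BeautifulSoup(html, "html.parser").header

header_id = header["id"]            # raises KeyError if the attribute is absent
classes = header.get("class", [])   # class is multi-valued, so it comes back as a list
lang = header.get("lang")           # None, because the tag has no lang attribute
has_role = header.has_attr("role")  # True
```

Using `.get()` rather than square brackets avoids an exception when an attribute may be missing.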

Go to the webpage (https://webscraper.io/test-sites/e-commerce/allinone/phones/touch) in Google Chrome, right-click the page, 
and click Inspect. Move the cursor over the page to select the product area of the test site.
Get the name of the "class" from the Elements panel by double-clicking the specific element (in this case: *class="col-lg-9"*).

**The `find` function of BeautifulSoup returns the first tag in the HTML content that matches the requested tag name and attributes.**

**2. Running the find command in BeautifulSoup to get the first occurrence of a tag, attribute, string or comment.**

In [57]:
soup.find('div', {'class':'col-lg-9'}) 

<div class="col-lg-9">
<h1 class="page-header">Phones / Touch</h1>
<div class="row">
<div class="col-md-4 col-xl-4 col-lg-4">
<div class="card thumbnail">
<div class="product-wrapper card-body">
<img alt="item" class="img-fluid card-img-top image img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
<div class="caption">
<h4 class="price float-end card-title pull-right">$24.99</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/allinone/product/1" title="Nokia 123">Nokia 123</a>
</h4>
<p class="description card-text">7 day battery</p>
</div>
<div class="ratings">
<p class="review-count float-end">11 reviews</p>
<p data-rating="3">
<span class="ws-icon ws-icon-star"></span>
<span class="ws-icon ws-icon-star"></span>
<span class="ws-icon ws-icon-star"></span>
</p>
</div>
</div>
</div>
</div>
<div class="col-md-4 col-xl-4 col-lg-4">
<div class="card thumbnail">
<div class="product-wrapper card-body">
<img alt="item" class="img-fluid card-img-top image img-responsive" s

In [58]:
#Get the price of the first phone listed; find returns only the first occurrence of the tag
soup.find('h4', {'class':'price'})

<h4 class="price float-end card-title pull-right">$24.99</h4>

In [59]:
#Another way to write the same find call, using the class_ keyword
soup.find('h4', class_='price')

<h4 class="price float-end card-title pull-right">$24.99</h4>
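The two spellings above are interchangeable. The keyword is written `class_` because `class` is a reserved word in Python, and a single class name matches even when the tag carries several classes. A minimal sketch on two copied price tags (the snippet and variable names are assumptions):

```python
from bs4 import BeautifulSoup

html = """
<h4 class="price float-end card-title pull-right">$24.99</h4>
<h4 class="price float-end card-title pull-right">$57.99</h4>
"""
doc = BeautifulSoup(html, "html.parser")

via_dict = doc.find("h4", {"class": "price"})         # attrs passed positionally
via_kwarg = doc.find("h4", class_="price")            # class_ keyword
via_attrs = doc.find("h4", attrs={"class": "price"})  # attrs passed by name

first_price = via_kwarg.get_text()  # text inside the first matching tag
```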

**3. Different ways to run find_all (in the BeautifulSoup library):**

In [60]:
soup.find_all('h4', {'class':'price'})

[<h4 class="price float-end card-title pull-right">$24.99</h4>,
 <h4 class="price float-end card-title pull-right">$57.99</h4>,
 <h4 class="price float-end card-title pull-right">$93.99</h4>,
 <h4 class="price float-end card-title pull-right">$109.99</h4>,
 <h4 class="price float-end card-title pull-right">$118.99</h4>,
 <h4 class="price float-end card-title pull-right">$499.99</h4>,
 <h4 class="price float-end card-title pull-right">$899.99</h4>,
 <h4 class="price float-end card-title pull-right">$899.99</h4>,
 <h4 class="price float-end card-title pull-right">$899.99</h4>]

In [61]:
soup.find_all('h4', {'class':'price'})[:4] #Keep only the first 4 results

[<h4 class="price float-end card-title pull-right">$24.99</h4>,
 <h4 class="price float-end card-title pull-right">$57.99</h4>,
 <h4 class="price float-end card-title pull-right">$93.99</h4>,
 <h4 class="price float-end card-title pull-right">$109.99</h4>]
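Slicing collects every match first and then discards the rest. `find_all` also accepts a `limit` argument that stops the search once enough matches are found. A small sketch on generated price tags (the sample data is an assumption):

```python
from bs4 import BeautifulSoup

# Five stand-in price tags built in a loop
html = "".join(f'<h4 class="price">${n}.99</h4>' for n in (24, 57, 93, 109, 118))
doc = BeautifulSoup(html, "html.parser")

sliced = doc.find_all("h4", class_="price")[:4]         # find all, then slice
limited = doc.find_all("h4", class_="price", limit=4)   # stop after 4 matches
prices = [tag.get_text() for tag in limited]
```

Both approaches return the same four tags here; `limit` mainly matters on large pages.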

In [62]:
soup.find_all('a', class_ = 'title')

[<a class="title" href="/test-sites/e-commerce/allinone/product/1" title="Nokia 123">Nokia 123</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/2" title="LG Optimus">LG Optimus</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/3" title="Samsung Galaxy">Samsung Galaxy</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/4" title="Nokia X">Nokia X</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/5" title="Sony Xperia">Sony Xperia</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/6" title="Ubuntu Edge">Ubuntu Edge</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/7" title="Iphone">Iphone</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/8" title="Iphone">Iphone</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/9" title="Iphone">Iphone</a>]
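The `href` values in these tags are relative paths. If full links are needed, one option is to join them with the page URL using the standard library's `urljoin`. This is a sketch on two copied tags; `base_url` and `links` are names chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://webscraper.io/test-sites/e-commerce/allinone/phones/touch"
html = """
<a class="title" href="/test-sites/e-commerce/allinone/product/1" title="Nokia 123">Nokia 123</a>
<a class="title" href="/test-sites/e-commerce/allinone/product/2" title="LG Optimus">LG Optimus</a>
"""
doc = BeautifulSoup(html, "html.parser")

# Read the title and href attributes of each link and make the URL absolute
links = [
    {"name": a["title"], "url": urljoin(base_url, a["href"])}
    for a in doc.find_all("a", class_="title")
]
```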

In [63]:
soup.find_all('p', class_ = 'description')

[<p class="description card-text">7 day battery</p>,
 <p class="description card-text">3.2" screen</p>,
 <p class="description card-text">5 mpx. Android 5.0</p>,
 <p class="description card-text">Andoid, Jolla dualboot</p>,
 <p class="description card-text">GPS, waterproof</p>,
 <p class="description card-text">Sapphire glass</p>,
 <p class="description card-text">White</p>,
 <p class="description card-text">Silver</p>,
 <p class="description card-text">Black</p>]
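BeautifulSoup also offers `select`, which takes a CSS selector and, for a simple tag-plus-class query, returns the same tags as `find_all`. This is shown only as an alternative spelling, on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<div class="caption">
  <h4 class="price">$24.99</h4>
  <p class="description card-text">7 day battery</p>
</div>
<p class="other">not a description</p>
"""
doc = BeautifulSoup(html, "html.parser")

with_find_all = doc.find_all("p", class_="description")
with_select = doc.select("p.description")  # CSS selector: <p> with class "description"
texts = [p.get_text() for p in with_select]
```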

**4. Getting all the data from find_all into lists**

In [64]:
product_name = soup.find_all('a', class_ = 'title')
product_name

[<a class="title" href="/test-sites/e-commerce/allinone/product/1" title="Nokia 123">Nokia 123</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/2" title="LG Optimus">LG Optimus</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/3" title="Samsung Galaxy">Samsung Galaxy</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/4" title="Nokia X">Nokia X</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/5" title="Sony Xperia">Sony Xperia</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/6" title="Ubuntu Edge">Ubuntu Edge</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/7" title="Iphone">Iphone</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/8" title="Iphone">Iphone</a>,
 <a class="title" href="/test-sites/e-commerce/allinone/product/9" title="Iphone">Iphone</a>]

In [65]:
description = soup.find_all('p', class_ = 'description')
description

[<p class="description card-text">7 day battery</p>,
 <p class="description card-text">3.2" screen</p>,
 <p class="description card-text">5 mpx. Android 5.0</p>,
 <p class="description card-text">Andoid, Jolla dualboot</p>,
 <p class="description card-text">GPS, waterproof</p>,
 <p class="description card-text">Sapphire glass</p>,
 <p class="description card-text">White</p>,
 <p class="description card-text">Silver</p>,
 <p class="description card-text">Black</p>]

In [66]:
price = soup.find_all('h4', class_ = 'price')
price

[<h4 class="price float-end card-title pull-right">$24.99</h4>,
 <h4 class="price float-end card-title pull-right">$57.99</h4>,
 <h4 class="price float-end card-title pull-right">$93.99</h4>,
 <h4 class="price float-end card-title pull-right">$109.99</h4>,
 <h4 class="price float-end card-title pull-right">$118.99</h4>,
 <h4 class="price float-end card-title pull-right">$499.99</h4>,
 <h4 class="price float-end card-title pull-right">$899.99</h4>,
 <h4 class="price float-end card-title pull-right">$899.99</h4>,
 <h4 class="price float-end card-title pull-right">$899.99</h4>]

In [67]:
reviews = soup.find_all('p', class_ = 'review-count float-end')
reviews

[<p class="review-count float-end">11 reviews</p>,
 <p class="review-count float-end">11 reviews</p>,
 <p class="review-count float-end">3 reviews</p>,
 <p class="review-count float-end">4 reviews</p>,
 <p class="review-count float-end">6 reviews</p>,
 <p class="review-count float-end">2 reviews</p>,
 <p class="review-count float-end">10 reviews</p>,
 <p class="review-count float-end">8 reviews</p>,
 <p class="review-count float-end">1 reviews</p>]
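The reviews search passes the full string `'review-count float-end'`. That works here because it equals the tag's class attribute exactly, but it would miss a tag whose classes appear in a different order. Searching for one class is less fragile. A sketch with a deliberately reordered second tag (the snippet is an assumption):

```python
from bs4 import BeautifulSoup

html = """
<p class="review-count float-end">11 reviews</p>
<p class="float-end review-count">3 reviews</p>
"""
doc = BeautifulSoup(html, "html.parser")

exact = doc.find_all("p", class_="review-count float-end")  # exact string, order matters
single = doc.find_all("p", class_="review-count")           # any tag having this class
css = doc.select("p.review-count.float-end")                # both classes, any order
```

Here `exact` finds only the first tag, while `single` and `css` find both.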

In [68]:
product_name_list = []
for i in product_name:
    name = i.text
    product_name_list.append(name)
print(product_name_list)

['Nokia 123', 'LG Optimus', 'Samsung Galaxy', 'Nokia X', 'Sony Xperia', 'Ubuntu Edge', 'Iphone', 'Iphone', 'Iphone']


In [69]:
price_list = []
for i in price:
    price_list.append(i.text)
print(price_list)

['$24.99', '$57.99', '$93.99', '$109.99', '$118.99', '$499.99', '$899.99', '$899.99', '$899.99']


In [70]:
descriptions_list = []
for i in description:
    descriptions_list.append(i.text)
descriptions_list

['7 day battery',
 '3.2" screen',
 '5 mpx. Android 5.0',
 'Andoid, Jolla dualboot',
 'GPS, waterproof',
 'Sapphire glass',
 'White',
 'Silver',
 'Black']

In [71]:
reviews_list = []
for i in reviews:
    reviews_list.append(i.text)
reviews_list

['11 reviews',
 '11 reviews',
 '3 reviews',
 '4 reviews',
 '6 reviews',
 '2 reviews',
 '10 reviews',
 '8 reviews',
 '1 reviews']
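The four loops above follow the same pattern, so they could be collapsed into one small function built on a list comprehension. The helper name `texts` is hypothetical; `get_text(strip=True)` additionally trims surrounding whitespace, which `.text` does not.

```python
from bs4 import BeautifulSoup

html = """
<a class="title" title="Nokia 123"> Nokia 123 </a>
<a class="title" title="LG Optimus">LG Optimus</a>
"""
doc = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: text of each tag with surrounding whitespace removed."""
    return [tag.get_text(strip=True) for tag in tags]


product_name_list = texts(doc.find_all("a", class_="title"))
```

The same call would then serve for prices, descriptions and reviews.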

In [72]:
table = pd.DataFrame({'Product Name':product_name_list, 'Price':price_list,'Description':descriptions_list,'Reviews':reviews_list})
print(table)

     Product Name    Price             Description     Reviews
0       Nokia 123   $24.99           7 day battery  11 reviews
1      LG Optimus   $57.99             3.2" screen  11 reviews
2  Samsung Galaxy   $93.99      5 mpx. Android 5.0   3 reviews
3         Nokia X  $109.99  Andoid, Jolla dualboot   4 reviews
4     Sony Xperia  $118.99         GPS, waterproof   6 reviews
5     Ubuntu Edge  $499.99          Sapphire glass   2 reviews
6          Iphone  $899.99                   White  10 reviews
7          Iphone  $899.99                  Silver   8 reviews
8          Iphone  $899.99                   Black   1 reviews
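In this table, Price and Reviews are still strings, so they cannot be sorted or averaged numerically. One possible follow-up step, sketched on a three-row copy of the table with the Description column left out for brevity, is to convert them with pandas string methods:

```python
import pandas as pd

table = pd.DataFrame({
    "Product Name": ["Nokia 123", "LG Optimus", "Samsung Galaxy"],
    "Price": ["$24.99", "$57.99", "$93.99"],
    "Reviews": ["11 reviews", "11 reviews", "3 reviews"],
})

# Remove the literal dollar sign, then convert to float
table["Price"] = table["Price"].str.replace("$", "", regex=False).astype(float)
# Keep only the digits from strings like "11 reviews", then convert to int
table["Reviews"] = table["Reviews"].str.extract(r"(\d+)", expand=False).astype(int)

# With numeric prices, sorting gives the cheapest product first
cheapest = table.sort_values("Price").iloc[0]["Product Name"]
```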
