### WEB SCRAPING
* Gathering data from a website programmatically.
* The collected data can be used in other projects.
* We use Python for web scraping.
* The data comes from the front end of the website (HTML + CSS + JS).

### PYTHON WEB SCRAPING PROGRAM
* Gather images and objects.

### Main things we need to understand
* Rules of web scraping.
* Limitations of web scraping.
* Basic HTML and CSS.

### Rules of web scraping
* Always try to get permission first. If you send too many requests, your IP address might get blocked.
* Some sites automatically block scraping software.
* Check the applicable laws and the site's terms to confirm that scraping is allowed.
* Every website is unique, so every scraping script is too. A slight change or update to the site may completely break your script.
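One way to act on the permission rule is to read a site's robots.txt before scraping. The sketch below is only an illustration, not a complete compliance check: it uses the standard library's `urllib.robotparser` on an inlined, made-up robots.txt so it runs offline, and the user agent name `my-scraper` is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a made-up user agent name used only for this example.
allowed_public = parser.can_fetch("my-scraper", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "http://www.example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

If a crawl delay is declared, pausing that many seconds between requests (for example with `time.sleep`) lowers the chance of being blocked.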

#### For effective basic web scraping, we only need to look at HTML and CSS. Python can inspect HTML and CSS elements programmatically and extract information from the website.

#### To web scrape with Python, we can use the BeautifulSoup and requests libraries.


### Setting up Web scraping libraries
Install from the Anaconda prompt:
* pip install requests
* pip install lxml
* pip install bs4

In a Jupyter notebook:
Import requests and bs4.

## GRABBING A PAGE TITLE

In [1]:
import requests

In [2]:
result = requests.get("http://www.example.com")

In [3]:
type(result)

requests.models.Response

In [4]:
result.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [5]:
import bs4

In [6]:
soup = bs4.BeautifulSoup(result.text, "lxml")

In [7]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

In [8]:
soup.select('title')

[<title>Example Domain</title>]

In [9]:
soup.select('p')

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

In [10]:
soup.select('h1')[0].getText()

'Example Domain'

In [11]:
site_paragraph = soup.select("p")

In [12]:
site_paragraph[0].getText()

'This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.'
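Indexing with `[0]` raises an `IndexError` when a selector matches nothing. A minimal sketch of a more defensive helper follows. The helper name `first_text` and the sample HTML are made up for illustration, and `"html.parser"` is used instead of `"lxml"` only so the example has no extra dependency.

```python
import bs4

# Small stand-in for a downloaded page (hypothetical content).
SAMPLE_HTML = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <h1>Example Domain</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def first_text(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    matches = soup.select(selector)
    if not matches:
        return None
    return matches[0].getText().strip()


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
page_title = first_text(soup, "title")
first_paragraph = first_text(soup, "p")
missing = first_text(soup, "table")  # no <table> in the sample, so this is None

print(page_title, first_paragraph, missing)
```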

## GRABBING A CLASS

* soup.select('div') = All elements with the div tag.
* soup.select('#some_id') = The element whose id is 'some_id'.
* soup.select('.some_class') = All elements whose class includes 'some_class'.
* soup.select('div span') = Any span elements nested anywhere inside a div element.
* soup.select('div>span') = Any span elements that are direct children of a div element, with nothing in between.
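The selector patterns above can be tried on a small HTML string before pointing them at a real page. This is a sketch on made-up markup; `"html.parser"` stands in for `"lxml"` to keep it dependency-free.

```python
import bs4

# Hypothetical markup: one div with a direct span and a deeper span, plus a span outside.
SAMPLE_HTML = """
<div id="intro" class="box">
  <span>direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span class="box note">outside any div</span>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

by_tag = soup.select("div")          # every <div>
by_id = soup.select("#intro")        # the element with id="intro"
by_class = soup.select(".box")       # the div and the outer span both carry class "box"
descendants = soup.select("div span")  # spans at any depth inside a div
children = soup.select("div>span")     # only spans that are direct children of a div

print(len(by_tag), len(by_id), len(by_class), len(descendants), len(children))
```

Note the difference between the last two: the span wrapped in a `<p>` counts as a descendant of the div but not as a direct child.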

In [13]:
res = requests.get('https://en.wikipedia.org/wiki/London')

In [14]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [15]:
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>London - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b29ae56b-659c-46a5-a444-255e65b47905","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"London","wgTitle":"London","wgCurRevisionId":1027699310,"wgRevisionId":1027699310,"wgArticleId":17867,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with non-numeric formatnum arguments","Webarchive template wayback links","Wikipedia articles incorporating a citation from the 1911 Encyclopaedia Britannica with Wikisource reference"

In [16]:
first_item = soup.select('.toctext')[0]

In [17]:
first_item.text

'Toponymy'

In [18]:
for item in soup.select('.toctext'):
    print(item.text)

Toponymy
History
Prehistory
Roman London
Anglo-Saxon and Viking period London
Middle Ages
Early modern
Late modern and contemporary
Administration
Local government
National government
Policing and crime
Geography
Scope
Status
Topography
Climate
Districts
Architecture
Cityscape
Natural history
Demography
Age structure and median age (2018)
Ethnic groups
Religion
Accents
Economy
The City of London
Media and technology
Tourism
Transport
Aviation
Rail
Underground and DLR
Suburban
Inter-city and international
Freight
Buses, coaches and trams
Cable car
Cycling
Port and river boats
Roads
Education
Tertiary education
Primary and secondary education
Culture
Leisure and entertainment
Literature, film and television
Museums, art galleries and libraries
Music
Recreation
Parks and open spaces
Walking
Sport
Notable people
See also
Notes
References
Bibliography
External links


## GRABBING AN IMAGE

In [19]:
res = requests.get("https://en.wikipedia.org/wiki/London")

In [20]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [21]:
soup.select('.thumbimage')

[<img alt="" class="thumbimage" data-file-height="842" data-file-width="1080" decoding="async" height="172" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/39/Map_of_London%2C_1300.svg/220px-Map_of_London%2C_1300.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/39/Map_of_London%2C_1300.svg/330px-Map_of_London%2C_1300.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/39/Map_of_London%2C_1300.svg/440px-Map_of_London%2C_1300.svg.png 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="488" data-file-width="320" decoding="async" height="259" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Siege_of_London_%28MS_1168%29.jpg/170px-Siege_of_London_%28MS_1168%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Siege_of_London_%28MS_1168%29.jpg/255px-Siege_of_London_%28MS_1168%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/d/d3/Siege_of_London_%28MS_1168%29.jpg 2x" width="170"/>,
 <img alt="" class="thumbimage" d

In [22]:
london = soup.select('.thumbimage')[10]

In [23]:
london

<img alt="" class="thumbimage" data-file-height="3021" data-file-width="4388" decoding="async" height="151" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/220px-London_from_Primrose_Hill_May_2013.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/330px-London_from_Primrose_Hill_May_2013.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/440px-London_from_Primrose_Hill_May_2013.jpg 2x" width="220"/>

In [24]:
london['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/220px-London_from_Primrose_Hill_May_2013.jpg'

<img src = "//upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/220px-London_from_Primrose_Hill_May_2013.jpg">
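The `src` value starts with `//`, so it has no scheme and cannot be passed to `requests.get` as-is. Rather than typing `https:` by hand, one option is `urllib.parse.urljoin`, which borrows the scheme from the page URL. A small sketch, using the same values as above:

```python
from urllib.parse import urljoin

# URL of the page the image tag was found on.
page_url = "https://en.wikipedia.org/wiki/London"

# Protocol-relative src value, as returned by london['src'].
src = "//upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/220px-London_from_Primrose_Hill_May_2013.jpg"

# urljoin fills in the missing scheme from page_url.
image_url = urljoin(page_url, src)
print(image_url)
```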

In [25]:
image_link = requests.get("https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/London_from_Primrose_Hill_May_2013.jpg/220px-London_from_Primrose_Hill_May_2013.jpg")

In [35]:
#image_link.content

In [26]:
f = open('my_computer_image.jpg', 'wb')

In [27]:
f.write(image_link.content)

9789

In [28]:
f.close()
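The open, write, close sequence can also be written with a `with` block, which closes the file even if the write fails. The sketch below uses stand-in bytes and a temporary path so it runs without a download. The helper name `save_bytes` is made up; in the notebook the data would be `image_link.content`.

```python
import os
import tempfile


def save_bytes(path, data):
    """Write raw bytes to path and return the number of bytes written."""
    # The with-block closes the file automatically, even if write() raises.
    with open(path, "wb") as f:
        return f.write(data)


# Stand-in bytes for illustration; not a real image.
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"
target = os.path.join(tempfile.gettempdir(), "my_computer_image_demo.jpg")

written = save_bytes(target, fake_image)
print(written)
```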

## A site specifically used to practice web scraping.
www.toscrape.com

#### We will practice grabbing elements across multiple pages.

#### GOAL: GET THE TITLE OF EVERY BOOK WITH A 2-STAR RATING

In [29]:
import requests
import bs4

In [30]:
'http://books.toscrape.com/catalogue/page-2.html'

'http://books.toscrape.com/catalogue/page-2.html'

In [31]:
'http://books.toscrape.com/catalogue/page-3.html'

'http://books.toscrape.com/catalogue/page-3.html'

In [32]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

In [33]:
base_url.format('20')

'http://books.toscrape.com/catalogue/page-20.html'

In [34]:
page_num=12
base_url.format(page_num)

'http://books.toscrape.com/catalogue/page-12.html'
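Because the page number is the only part of the URL that changes, the same template can generate a whole list of page URLs. A short sketch that builds the first three without downloading anything:

```python
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

# range(1, 4) yields 1, 2, 3; the upper bound is excluded.
page_urls = [base_url.format(n) for n in range(1, 4)]

for url in page_urls:
    print(url)
```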

In [35]:
res = requests.get(base_url.format(1))

In [36]:
soup = bs4.BeautifulSoup(res.text, "lxml")

In [37]:
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:30" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="

In [38]:
soup.select(".product_pod")

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../media/cach

In [39]:
len(soup.select(".product_pod"))

20

In [42]:
products = soup.select(".product_pod")

In [43]:
example = products[0]

In [44]:
example

<article class="product_pod">
<div class="image_container">
<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

### Check for a 2-star rating

In [47]:
str(example)

'<article class="product_pod">\n<div class="image_container">\n<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>\n</div>\n<p class="star-rating Three">\n<i class="icon-star"></i>\n<i class="icon-star"></i>\n<i class="icon-star"></i>\n<i class="icon-star"></i>\n<i class="icon-star"></i>\n</p>\n<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>\n<div class="product_price">\n<p class="price_color">Â£51.77</p>\n<p class="instock availability">\n<i class="icon-ok"></i>\n    \n        In stock\n    \n</p>\n<form>\n<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>\n</form>\n</div>\n</article>'

In [48]:
'star-rating Two' in str(example)

False

### A cleaner and more reliable way: select by CSS class

In [50]:
example.select(".star-rating.Three")  # "star-rating Three" is two classes; replace the space with a dot in the selector

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]
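To see why the selector is the sturdier check, compare both approaches on two made-up snippets, the second of which lists its classes in the opposite order. This is a hypothetical scenario; the practice site may never reorder its classes. The helper names are assumptions for the example.

```python
import bs4

# Two hypothetical product snippets; the second lists its classes in reverse order.
SNIPPETS = [
    '<article class="product_pod"><p class="star-rating Two"></p></article>',
    '<article class="product_pod"><p class="Two star-rating"></p></article>',
]


def has_rating_by_string(product, rating):
    """Substring test: depends on the exact text of the class attribute."""
    return f"star-rating {rating}" in str(product)


def has_rating_by_selector(product, rating):
    """Selector test: matches both classes regardless of their order."""
    return len(product.select(f".star-rating.{rating}")) > 0


products = [
    bs4.BeautifulSoup(snippet, "html.parser").select(".product_pod")[0]
    for snippet in SNIPPETS
]
string_results = [has_rating_by_string(p, "Two") for p in products]
selector_results = [has_rating_by_selector(p, "Two") for p in products]

print(string_results, selector_results)
```

The substring test misses the reordered snippet, while the selector matches both.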

In [51]:
example.select('a')

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [53]:
example.select('a')[1]["title"]

'A Light in the Attic'

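### Follow these steps:
* Check whether a book has 2 stars, either with a string check ('star-rating Two' in str(example)) or with example.select('.star-rating.Two'), which returns an empty list when there is no match.
* Use example.select('a')[1]['title'] to grab the book title.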
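The two steps can be combined into one function that handles a single page. The sketch below runs on a made-up HTML fragment shaped like the product markup shown earlier. The function name `titles_with_rating` and the sample books are assumptions, and `"html.parser"` replaces `"lxml"` only to avoid an extra dependency.

```python
import bs4

# Hypothetical page fragment with the same structure as a product_pod article.
SAMPLE_PAGE = """
<article class="product_pod">
  <a href="book-one/index.html"><img alt="Book One" class="thumbnail" src="one.jpg"/></a>
  <p class="star-rating Two"></p>
  <h3><a href="book-one/index.html" title="Book One">Book One</a></h3>
</article>
<article class="product_pod">
  <a href="book-two/index.html"><img alt="Book Two" class="thumbnail" src="two.jpg"/></a>
  <p class="star-rating Five"></p>
  <h3><a href="book-two/index.html" title="Book Two">Book Two</a></h3>
</article>
"""


def titles_with_rating(html, rating):
    """Return the titles of all books on one page that carry the given rating class."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    titles = []
    for product in soup.select(".product_pod"):
        # An empty list is falsy, so this skips books with a different rating.
        if product.select(f".star-rating.{rating}"):
            # The second <a> tag in each product holds the full title.
            titles.append(product.select("a")[1]["title"])
    return titles


two_star = titles_with_rating(SAMPLE_PAGE, "Two")
print(two_star)
```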

In [67]:
# Print the title of every book across all 50 catalogue pages
for n in range(1, 51):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = bs4.BeautifulSoup(res.text, "lxml")
    
    for link in soup.find_all('a'):
        if link.has_attr('title'):
            print(link['title'])

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
In Her Wake
How Music Works
Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More
Chase Me (Paris Nights #2)
Black Dust
Birdsong: A Story in Pictures
A

Modern Romance
Miss Peregrineâs Home for Peculiar Children (Miss Peregrineâs Peculiar Children #1)
Louisa: The Extraordinary Life of Mrs. Adams
Little Red
Library of Souls (Miss Peregrineâs Peculiar Children #3)
Large Print Heart of the Pride
I Had a Nice Time And Other Lies...: How to find love & sh*t like that
Hollow City (Miss Peregrineâs Peculiar Children #2)
Grumbles
Full Moon over Noahâs Ark: An Odyssey to Mount Ararat and Beyond
Frostbite (Vampire Academy #2)
Follow You Home
First Steps for New Christians (Print Edition)
Finders Keepers (Bill Hodges Trilogy #2)
Fables, Vol. 1: Legends in Exile (Fables #1)
Eureka Trivia 6.0
Drive: The Surprising Truth About What Motivates Us
Done Rubbed Out (Reightman & Bailey #1)
Doing It Over (Most Likely To #1)
Deliciously Ella Every Day: Quick and Easy Recipes for Gluten-Free Snacks, Packed Lunches, and Simple Meals
Dark Notes
Daring Greatly: How the Courage to Be Vulnerable Transforms the Way We Live, Love, Parent, and Lead
Close t

Mothering Sunday
Mother, Can You Not?
M Train
Lilac Girls
Lies and Other Acts of Love
Lab Girl
Keep Me Posted
It Didn't Start with You: How Inherited Family Trauma Shapes Who We Are and How to End the Cycle
Grey (Fifty Shades #4)
Exit, Pursued by a Bear
Daredevils
Cravings: Recipes for What You Want to Eat
Born for This: How to Find the Work You Were Meant to Do
Arena
Adultery
A Mother's Reckoning: Living in the Aftermath of Tragedy
A Gentleman's Position (Society of Gentlemen #3)
11/22/63
10% Happier: How I Tamed the Voice in My Head, Reduced Stress Without Losing My Edge, and Found Self-Help That Actually Works
10-Day Green Smoothie Cleanse: Lose Up to 15 Pounds in 10 Days!
Without Shame
Watchmen
Unlimited Intuition Now
Underlying Notes
The Shack
The New Brand You: Your New Image Makes the Sale for You
The Moosewood Cookbook: Recipes from Moosewood Restaurant, Ithaca, New York
The Flowers Lied
The Fabric of the Cosmos: Space, Time, and the Texture of Reality
The Book of Mormon
The Ar

Isla and the Happily Ever After (Anna and the French Kiss #3)
If I Stay (If I Stay #1)
I Know Why the Caged Bird Sings (Maya Angelou's Autobiography #1)
Harry Potter and the Deathly Hallows (Harry Potter #7)
Fruits Basket, Vol. 5 (Fruits Basket #5)
Foundation (Foundation (Publication Order) #1)
Fool Me Once
Find Her (Detective D.D. Warren #8)
Evicted: Poverty and Profit in the American City
Drama
Dracula the Un-Dead
Digital Fortress
Death Note, Vol. 5: Whiteout (Death Note #5)
Data, A Love Story: How I Gamed Online Dating to Meet My Match
Critique of Pure Reason
Booked
Blue Lily, Lily Blue (The Raven Cycle #3)
Approval Junkie: Adventures in Caring Too Much
An Abundance of Katherines
America's War for the Greater Middle East: A Military History
Alight (The Generations Trilogy #2)
A Girl's Guide to Moving On (New Beginnings #2)
A Game of Thrones (A Song of Ice and Fire #1)
A Feast for Crows (A Song of Ice and Fire #4)
A Clash of Kings (A Song of Ice and Fire #2)
Vogue Colors A to Z: A Fa

Girl in the Blue Coat
Fruits Basket, Vol. 3 (Fruits Basket #3)
Friday Night Lights: A Town, a Team, and a Dream
Fire Bound (Sea Haven/Sisters of the Heart #5)
Fifty Shades Freed (Fifty Shades #3)
Fellside
Extreme Prey (Lucas Davenport #26)
Eragon (The Inheritance Cycle #1)
Eclipse (Twilight #3)
Dune (Dune #1)
Dracula
Do Androids Dream of Electric Sheep? (Blade Runner #1)
Disrupted: My Misadventure in the Start-Up Bubble
Dead Wake: The Last Crossing of the Lusitania
David and Goliath: Underdogs, Misfits, and the Art of Battling Giants
Darkfever (Fever #1)
Dark Places
Crazy Rich Asians (Crazy Rich Asians #1)
Counting Thyme
Cosmos
Civilization and Its Discontents
Cinder (The Lunar Chronicles #1)
Catastrophic Happiness: Finding Joy in Childhood's Messy Years
Career of Evil (Cormoran Strike #3)
Breaking Dawn (Twilight #4)
Brave Enough
Boy Meets Boy
Born to Run: A Hidden Tribe, Superathletes, and the Greatest Race the World Has Never Seen
Blink: The Power of Thinking Without Thinking
Black F

In [83]:
two_star_titles = []
for n in range(1, 51):
    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    soup = bs4.BeautifulSoup(res.text, "lxml")

    # Each book on the page lives in an <article class="product_pod">
    for book in soup.select('.product_pod'):
        # A non-empty result means this book carries the two-star rating classes
        if book.select('.star-rating.Two'):
            # The second <a> tag holds the full title
            title = book.select('a')[1]['title']
            two_star_titles.append(title)
            print(title)

Starving Hearts (Triangular Trade Trilogy, #1)
Libertarianism for Beginners
It's Only the Himalayas
How Music Works
Maude (1883-1993):She Grew Up with the country
You can't bury them all: Poems
Reasons to Stay Alive
Without Borders (Wanderlove #1)
Soul Reader
Security
Saga, Volume 5 (Saga (Collected Editions) #5)
Reskilling America: Learning to Labor in the Twenty-First Century
Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics
Obsidian (Lux #1)
My Paris Kitchen: Recipes and Stories
Masks and Shadows
Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)
Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)
Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)
I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)
Giant Days, Vol. 2 (Giant Days #5-8)
Everydata: The Misinformation Hidden in the Little Data You Consume Every 