## Basic Web Scraping

### Loading the source code of a web page

In [2]:
import requests
help(requests)

Help on package requests:

NAME
    requests

DESCRIPTION
    requests HTTP library
    ~~~~~~~~~~~~~~~~~~~~~
    
    Requests is an HTTP library, written in Python, for human beings. Basic GET
    usage:
    
       >>> import requests
       >>> r = requests.get('http://python.org')
       >>> r.status_code
       200
       >>> 'Python is a programming language' in r.content
       True
    
    ... or POST:
    
       >>> payload = dict(key1='value1', key2='value2')
       >>> r = requests.post("http://httpbin.org/post", data=payload)
       >>> print(r.text)
       {
         ...
         "form": {
           "key2": "value2",
           "key1": "value1"
         },
         ...
       }
    
    The other HTTP methods are supported - see `requests.api`. Full documentation
    is at <http://python-requests.org>.
    
    :copyright: (c) 2014 by Kenneth Reitz.
    :license: Apache 2.0, see LICENSE for more details.

PACKAGE CONTENTS
    adapters
    api
    auth
    certs
    com

In [5]:
r = requests.get('https://pythonhow.com/example.html')

In [6]:
type(r)

requests.models.Response

In [8]:
c = r.content
c

b'<!DOCTYPE html>\n<html>\n<head>\n<style>\ndiv.cities {\n    background-color:black;\n    color:white;\n    margin:20px;\n    padding:20px;\n} \n</style>\n</head>\n<body>\n\n<h1 align="center"> Here are three big cities </h1>\n\n<div class="cities">\n<h2>London</h2>\n<p>London is the capital of England and it\'s been a British settlement since 2000 years ago. </p>\n</div>\n\n<div class="cities">\n<h2>Paris</h2>\n<p>Paris is the capital city of France. It was declared capital since 508.</p>\n</div>\n\n<div class="cities">\n<h2>Tokyo</h2>\n<p>Tokyo is the capital of Japan and one of the most populated cities in the world.</p>\n</div>\n\n</body>\n</html>'

In [9]:
type(c)

bytes
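
Because `r.content` is a `bytes` object, it has to be decoded before it can be handled as text. The sketch below uses a shortened, hard-coded stand-in for the response body (the names `sample_bytes` and `sample_text` are hypothetical) and assumes the page is UTF-8 encoded. On a real response, `r.text` returns an already-decoded `str`.

```python
# Shortened stand-in for r.content; not the full page shown above.
sample_bytes = b'<h2>London</h2>\n<p>London is the capital of England.</p>'

# Assumption: the page is UTF-8 encoded.
sample_text = sample_bytes.decode('utf-8')

print(type(sample_bytes).__name__)  # bytes
print(type(sample_text).__name__)   # str
print(sample_text.splitlines()[0])  # <h2>London</h2>
```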

#### Use the bs4 module to parse the content

In [12]:
from bs4 import BeautifulSoup

In [15]:
soup = BeautifulSoup(c, 'html.parser')
print(soup.prettify())  # inspect the parsed page as indented HTML

<!DOCTYPE html>
<html>
 <head>
  <style>
   div.cities {
    background-color:black;
    color:white;
    margin:20px;
    padding:20px;
}
  </style>
 </head>
 <body>
  <h1 align="center">
   Here are three big cities
  </h1>
  <div class="cities">
   <h2>
    London
   </h2>
   <p>
    London is the capital of England and it's been a British settlement since 2000 years ago.
   </p>
  </div>
  <div class="cities">
   <h2>
    Paris
   </h2>
   <p>
    Paris is the capital city of France. It was declared capital since 508.
   </p>
  </div>
  <div class="cities">
   <h2>
    Tokyo
   </h2>
   <p>
    Tokyo is the capital of Japan and one of the most populated cities in the world.
   </p>
  </div>
 </body>
</html>
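
Once the page is parsed, individual tags can also be reached as attributes of the soup object. This is a minimal sketch on a hard-coded fragment of the page above (the names `html` and `demo_soup` are hypothetical), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hard-coded fragment standing in for the downloaded page.
html = '<html><body><h1 align="center"> Here are three big cities </h1></body></html>'
demo_soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching tag.
title = demo_soup.h1
print(title.get_text(strip=True))  # Here are three big cities

# Tag attributes are read like dictionary keys.
print(title["align"])  # center
```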


#### Extract all 'div' elements with class 'cities'

In [22]:
all = soup.find_all("div", {"class": "cities"})  # note: the name `all` shadows Python's built-in all()
all

[<div class="cities">
 <h2>London</h2>
 <p>London is the capital of England and it's been a British settlement since 2000 years ago. </p>
 </div>, <div class="cities">
 <h2>Paris</h2>
 <p>Paris is the capital city of France. It was declared capital since 508.</p>
 </div>, <div class="cities">
 <h2>Tokyo</h2>
 <p>Tokyo is the capital of Japan and one of the most populated cities in the world.</p>
 </div>]

In [18]:
type(all)

bs4.element.ResultSet

In [21]:
len(all)

3
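
The same class filter can be written in other ways. The sketch below compares the dictionary form used above with the `class_` keyword and a CSS selector, on a small hard-coded document (the `html` string and the variable names are hypothetical, and the extra `div` with class `other` is only there to show that it is filtered out).

```python
from bs4 import BeautifulSoup

html = """
<div class="cities"><h2>London</h2><p>Capital of England.</p></div>
<div class="cities"><h2>Paris</h2><p>Capital of France.</p></div>
<div class="other"><h2>Not a city</h2></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Three ways to select <div class="cities"> elements.
by_dict = demo_soup.find_all("div", {"class": "cities"})
by_keyword = demo_soup.find_all("div", class_="cities")  # class_ avoids the reserved word `class`
by_css = demo_soup.select("div.cities")                  # CSS selector syntax

print(len(by_dict), len(by_keyword), len(by_css))  # 2 2 2
```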

#### Extract the first 'div' element

In [24]:
first = soup.find("div",{"class":"cities"})
first

<div class="cities">
<h2>London</h2>
<p>London is the capital of England and it's been a British settlement since 2000 years ago. </p>
</div>

#### Get the h2 element

In [25]:
first.find_all("h2")

[<h2>London</h2>]

In [29]:
first.find_all("h2")[0].text

'London'

In [30]:
for item in all:
    print(item.find_all("h2")[0].text)

London
Paris
Tokyo
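
When only one match is expected, `find` can replace `find_all(...)[0]`. A minimal sketch on a hard-coded `div` (the names `html`, `div`, `heading`, and `missing` are hypothetical): `find` returns `None` when nothing matches, whereas indexing an empty `find_all` result raises `IndexError`.

```python
from bs4 import BeautifulSoup

html = '<div class="cities"><h2> London </h2><p>Capital of England.</p></div>'
div = BeautifulSoup(html, "html.parser").find("div", class_="cities")

# find returns the first matching tag directly.
heading = div.find("h2")
print(heading.get_text(strip=True))  # London

# A tag that is not present yields None, which can be checked before use.
missing = div.find("h3")
print(missing is None)  # True
```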


#### Get the paragraphs

In [31]:
for item in all:
    print(item.find_all("p")[0].text)

London is the capital of England and it's been a British settlement since 2000 years ago. 
Paris is the capital city of France. It was declared capital since 508.
Tokyo is the capital of Japan and one of the most populated cities in the world.
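
The heading and paragraph loops can be combined to collect each city into a record. This is a sketch under assumptions: the `html` string is a shortened stand-in for the page, and the `cities` list and its `name` and `description` keys are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="cities"><h2>London</h2><p>London is the capital of England. </p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital city of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

cities = []
for div in demo_soup.find_all("div", class_="cities"):
    h2 = div.find("h2")
    p = div.find("p")
    if h2 is None or p is None:
        continue  # skip blocks that lack a heading or a paragraph
    cities.append({
        "name": h2.get_text(strip=True),
        "description": p.get_text(strip=True),  # strip=True drops trailing whitespace
    })

for city in cities:
    print(city["name"], "->", city["description"])
```

A list of dictionaries like this can be passed on to other tools, for example written out as CSV or JSON.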
