### Web Scraping

- **Requests**: The first thing we’ll need to do to scrape a web page is to download the page.
- We can **download pages** using the Python requests library.
- The requests library will make a **GET request** to a web server, which will download the **HTML contents** of a given web page for us.

In [2]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

- After running our request, we get a **Response object**.
- This object has a **status_code property**, which indicates if the page was downloaded successfully:

In [3]:
page.status_code

200

- A status_code of 200 means that the page downloaded successfully.
- The first digit of a status code tells us the outcome:
    - A code starting with a **2** generally indicates **success**.
    - A code starting with a **4 or a 5** indicates an **error**.
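
The first-digit rule above can be sketched as a small helper. This is only an illustration: `describe_status` is a hypothetical name, not part of the requests library, and it looks at nothing but the leading digit. In real code, the Response object's own `raise_for_status()` method is one way to turn 4xx and 5xx codes into exceptions.

```python
def describe_status(status_code: int) -> str:
    """Classify an HTTP status code by its leading digit (hypothetical helper)."""
    leading_digit = status_code // 100
    if leading_digit == 2:
        return "success"
    if leading_digit == 3:
        return "redirect"
    if leading_digit in (4, 5):
        return "error"
    return "other"


print(describe_status(200))  # success
print(describe_status(404))  # error
print(describe_status(503))  # error
```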

- We can print out the **HTML content** of the page using the **content property**:

In [4]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

- We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. 

In [6]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

- We can now print out the **HTML content of the page**, formatted nicely, using the **prettify method** on the **BeautifulSoup object:**

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


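- We can select the elements at the top level of the page using the **children property** of soup. 
- Since children returns an iterator, we call list on it:

In [8]: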
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

- The above tells us that there are two main items at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag.
- There is a newline character (\n) in the list as well.
- Let’s see what the type of each element in the list is:

In [9]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

- The first is a **Doctype object**, which contains information about the type of the document.
- The second is a **NavigableString**, which represents text found in the HTML document.
- The final item is a **Tag object**, which contains other nested tags. 
   - The Tag object allows us to navigate through an HTML document, and extract other tags and text.
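
Because the whitespace between tags shows up as NavigableString items, it can help to keep only the Tag objects when walking the tree. The following is a minimal sketch, assuming an inline copy of the simple page so that it runs without a network call. The names `html_doc` and `tag_children` are illustrative and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page (assumption: same markup as the downloaded page).
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def tag_children(node):
    """Return only the Tag children of a node, skipping text and doctype items."""
    return [child for child in node.children if isinstance(child, Tag)]


top_level = tag_children(soup)
print([tag.name for tag in top_level])                   # ['html']
print([tag.name for tag in tag_children(top_level[0])])  # ['head', 'body']
```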

In [10]:
html = list(soup.children)[2]

#### Now, we can find the children inside the html tag:

In [11]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [12]:
body = list(html.children)[3]

In [13]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

#### We can now isolate the p tag:

In [14]:
p = list(body.children)[1]

#### Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

In [16]:
p.get_text()

'Here is some simple content for this page.'
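
The index-based navigation above depends on where the newline strings fall in each list of children. As a hedged alternative, BeautifulSoup also lets us reach the first matching descendant by using the tag name as an attribute. A minimal sketch, assuming an inline copy of the page (`html_doc` is an illustrative name):

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page, so no download is needed.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup.html.body.p follows the first <html>, then its first <body>, then its first <p>.
p = soup.html.body.p
print(p.get_text())  # Here is some simple content for this page.
```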

### OR

#### Finding all instances of a tag at once using the find_all method

In [17]:
soup = BeautifulSoup(page.content,'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

#### find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [18]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
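
When a page has more than one matching tag, looping is usually more convenient than indexing. A minimal sketch, assuming a small inline document with two paragraphs (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document used only for this example.
html_doc = """<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <p> tag, in document order.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)  # ['First paragraph.', 'Second paragraph.']
```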

### OR

#### If you instead only want to find the first instance of a tag, you can use the find method, which returns a single Tag object rather than a list:

In [19]:
soup.find('p')

<p>Here is some simple content for this page.</p>
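
One thing to keep in mind: when nothing matches, find returns None rather than an empty list, so it is worth checking the result before calling get_text. A minimal sketch, assuming an inline document (`first_text` is a hypothetical helper name, not a BeautifulSoup function):

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")


def first_text(soup, tag_name, default=""):
    """Return the text of the first tag_name element, or default if there is none."""
    tag = soup.find(tag_name)
    return tag.get_text() if tag is not None else default


print(first_text(soup, "p"))             # Here is some simple content for this page.
print(repr(first_text(soup, "table")))   # '' because the page has no table tag
```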

### Searching for tags by class and id

- Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. 
- We can also use them when scraping to target the specific elements we want to extract.


In [22]:
page1 = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page1.content, 'html.parser')
print(soup)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>


In [24]:
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


#### We can use the find_all method to search for any p tag that has the class outer-text:

In [23]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

#### In the below example, we’ll look for any tag that has the class outer-text:

In [25]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

#### We can also search for elements by id:

In [26]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
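
A tag can carry several classes, and class_ matches if any one of them equals the value we pass. A minimal sketch, assuming an inline, whitespace-trimmed copy of part of the page above (`html_doc` is an illustrative name):

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the ids_and_classes page body (assumption: same classes and ids).
html_doc = """<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ is compared against each individual class, so both "first-item" tags are found.
first_items = soup.find_all("p", class_="first-item")
print([tag["id"] for tag in first_items])  # ['first', 'second']

# The class attribute is exposed as a list of class names.
print(soup.find(id="first")["class"])      # ['inner-text', 'first-item']
```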

#### We can also search for items using CSS selectors

- BeautifulSoup objects support searching a page via CSS selectors using the select method. 
- We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

In [27]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

#### Similarly, we can select the body tag that is inside of the html tag:

In [28]:
soup.select("html body")

[<body>
 <div>
 <p class="inner-text first-item" id="first">
                 First paragraph.
             </p>
 <p class="inner-text">
                 Second paragraph.
             </p>
 </div>
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>
 </body>]
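
CSS selectors can also target classes and ids directly, using the usual `.class` and `#id` syntax. A minimal sketch, assuming an inline, whitespace-trimmed copy of part of the page above. `select_one` is the variant of select that returns only the first match.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the ids_and_classes page body (assumption: same classes and ids).
html_doc = """<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# p.outer-text: every <p> that has the class outer-text.
outer = soup.select("p.outer-text")
print(len(outer))  # 2

# #first: the element whose id is "first".
print(soup.select("#first")[0].get_text())  # First paragraph.

# select_one returns the first match (or None) instead of a list.
print(soup.select_one("div p")["id"])  # first
```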

#### Let’s scrape weather data in Web_scraping2