First, we download the page with the requests library.

In [1]:
import requests

In [2]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

In [3]:
page

<Response [200]>

We check whether the page was downloaded successfully by looking at its status code.
A status code starting with "2" generally means success; a code starting with "4" or "5" means an error.

In [6]:
page.status_code

200
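
The first-digit rule described above can be sketched as a small helper. The function name `describe_status` is hypothetical and used only for illustration; it is not part of requests.

```python
def describe_status(status_code: int) -> str:
    """Classify an HTTP status code by its first digit."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    # 1xx and 3xx codes are neither success nor error in this simple scheme.
    return "other"


print(describe_status(200))
print(describe_status(404))
```

In practice, requests also offers `page.raise_for_status()`, which raises an exception for 4xx and 5xx responses, so a helper like this is mostly useful for logging.
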

Now that the page has been successfully downloaded, we check the content of the page.

In [5]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

To parse the downloaded page, we use the Beautiful Soup library.

In [8]:
from bs4 import BeautifulSoup

In [9]:
soup = BeautifulSoup(page.content, 'html.parser')

We use the prettify method to have a nicely printed result.

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


Because tags are nested inside other tags, we move through the structure one level at a time. Let's first select all the elements at the top level of the page using the "children" property of the soup.

In [11]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

Let's check the type of items in the list.

In [12]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

The first element is a Doctype object, which contains information about the type of the document.

The second element is a NavigableString, which represents text found in the HTML document.

The final element is a Tag object, which contains other nested tags.

The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text.

The kinds of objects are described in the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects
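
Since the Tag object is the one we usually care about, it can help to filter the children by type instead of relying on list positions. The following is a minimal sketch that uses the same markup as the simple example page, pasted inline so it runs without a network connection.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same markup as the simple example page shown earlier.
html_doc = (
    "<!DOCTYPE html>\n<html>\n    <head>\n"
    "        <title>A simple example page</title>\n    </head>\n"
    "    <body>\n        <p>Here is some simple content for this page.</p>\n"
    "    </body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects, skipping the doctype and the whitespace string.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in top_level_tags]
print(tag_names)
```
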

We can now select the html tag and its children by taking the third item in the list (remember that the list contains 'html', '\n' and <html>):

In [13]:
html = list(soup.children)[2]

We can also use the children property on the html tag.

In [14]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

Note that the html tag itself, which was the third element in the earlier list, no longer appears in the output; we now see only its children.

There are two tags here: head and body. We want to extract the elements inside the body. 

In [16]:
body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

In [18]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag (remember that list indexing starts at [0], so the p tag is at index [1]):

In [21]:
p = list(body.children)[1]

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

In [22]:
p.get_text()

'Here is some simple content for this page.'
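
As a shorter alternative to indexing into each list of children, Beautiful Soup also lets us reach the first matching child tag by attribute access. This is a sketch on an inline copy of the page markup (whitespace removed), not a replacement for the step-by-step walk above.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# soup.body.p returns the first <p> tag inside the first <body> tag.
text = soup.body.p.get_text()
print(text)
```
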

## Finding all instances of a tag at once

Instead of navigating through the children one level at a time, we can use the find_all method, which finds all the instances of a tag on a page.

In [23]:
soup = BeautifulSoup(page.content, 'html.parser')

In [24]:
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [25]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
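
When a page has more than one matching tag, looping over the result is usually more convenient than indexing. Here is a small sketch; the two-paragraph markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs, for illustration only.
html_doc = "<body><p>First paragraph.</p><p>Second paragraph.</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <p> tag in document order.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)
```
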

If you instead only want to find the first instance of a tag, you can use the find method, which returns a single Tag object (or None if nothing matches):

In [26]:
soup.find('p')

<p>Here is some simple content for this page.</p>
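
One thing to keep in mind is that find returns None when no tag matches, so calling get_text on the result directly can fail. A minimal sketch of a defensive check, using inline markup:

```python
from bs4 import BeautifulSoup

html_doc = "<body><p>Here is some simple content for this page.</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

table = soup.find("table")  # there is no <table> in this markup
paragraph = soup.find("p")

# Guard against None before calling methods on the result.
table_text = table.get_text() if table is not None else None
paragraph_text = paragraph.get_text() if paragraph is not None else None
print(table_text)
print(paragraph_text)
```
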

## Searching for tags by class and id

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify specific elements we want to scrape.

Let's scrape a new webpage.

In [28]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Now, we can use the find_all method to search for items by __class__ or by __id__. In the example below, we’ll search for any p tag that has the class outer-text:

In [29]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In the example below, we’ll look for any tag that has the class outer-text:

In [30]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by id:

In [32]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
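
The text inside these tags is surrounded by newlines and indentation. The sketch below, run on a fragment of the markup shown above, combines a tag name with an id and uses the `strip=True` argument of get_text to remove that whitespace.

```python
from bs4 import BeautifulSoup

# A fragment of the ids_and_classes.html markup shown above.
html_doc = """
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Restrict the search to <p> tags with the given id.
first = soup.find("p", id="first")
raw_text = first.get_text()
clean_text = first.get_text(strip=True)
print(repr(raw_text))
print(repr(clean_text))
```
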

# Using CSS Selectors

You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

p a — finds all a tags inside of a p tag.

body p a — finds all a tags inside of a p tag inside of a body tag.

html body — finds all body tags inside of an html tag.

p.outer-text — finds all p tags with a class of outer-text.

p#first — finds all p tags with an id of first.

body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

You can learn more about CSS selectors here: https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors

BeautifulSoup objects support searching a page via CSS selectors using the __select__ method. We can use CSS selectors to _find all the p tags in our page that are inside of a div_ like this:

In [36]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

Note that the _select_ method above returns a list of Tag objects, just like _find_all_ (whereas _find_ returns a single Tag).
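
A few of the selectors listed earlier can be tried out with select and select_one. This is a sketch on a condensed copy of the page markup (whitespace inside the tags trimmed), so the extracted text is shorter than in the live page.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes.html markup.
html_doc = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# p.outer-text: every <p> with the class outer-text.
outer = [p.get_text() for p in soup.select("p.outer-text")]
# p#first: every <p> with the id first.
by_id = [p.get_text() for p in soup.select("p#first")]
# select_one returns only the first match instead of a list.
first_inner = soup.select_one("div p").get_text()

print(outer)
print(by_id)
print(first_inner)
```
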

# Exploring Page Structures with Chrome DevTools

Open the webpage in Chrome, then open the browser menu and choose "More tools" > "Developer tools".

Mouse over a specific word and click inspect to open up the tag that contains that text in the elements panel.

After getting to know the page structure, we know enough to download the page and start parsing it. In the code below, we:

Download the web page containing the forecast.<br>
Create a BeautifulSoup object to parse the page. <br>
Find the div with id seven-day-forecast, and assign to seven_day. <br>
Inside seven_day, find each individual forecast item. <br>
Extract and print the first forecast item.

In [39]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

In [40]:
soup = BeautifulSoup(page.content, 'html.parser')

In [41]:
seven_day = soup.find(id="seven-day-forecast")

In [42]:
forecast_items = seven_day.find_all(class_="tombstone-container")

In [48]:
today = forecast_items[0]

In [49]:
print(today.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Mostly cloudy, with a high near 58. Calm wind becoming east southeast around 5 mph in the afternoon. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Mostly cloudy, with a high near 58. Calm wind becoming east southeast around 5 mph in the afternoon. "/>
 </p>
 <p class="short-desc">
  Mostly Cloudy
 </p>
 <p class="temp temp-high">
  High: 58 °F
 </p>
</div>


# Extracting information from the page

As you can see, the forecast item today contains all the information we want. There are 4 pieces of information we can extract:

The name of the forecast item: in this case, Today. <br>
The description of the conditions: this is stored in the title attribute of img. <br>
A short description of the conditions: in this case, Mostly Cloudy. <br>
The temperature high: in this case, 58 degrees. <br>

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

In [50]:
period = today.find(class_="period-name").get_text()
short_desc = today.find(class_="short-desc").get_text()
temp = today.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Today
Mostly Cloudy
High: 58 °F


Now, we can extract the title attribute from the img tag. To do this, we just treat the Tag object like a dictionary, and pass in the attribute we want as a key:

In [51]:
img = today.find("img")
desc = img['title']
print(desc)

Today: Mostly cloudy, with a high near 58. Calm wind becoming east southeast around 5 mph in the afternoon. 
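
The four extractions can be gathered into one dictionary per forecast item. The sketch below runs on markup copied from the prettified item above (with the title text shortened); `text_of` is a hypothetical helper introduced here, and the None checks are an added precaution in case a field is missing.

```python
from bs4 import BeautifulSoup

# Markup copied from the prettified forecast item above (title shortened).
item_html = """
<div class="tombstone-container">
 <p class="period-name">Today<br/><br/></p>
 <p><img alt="Today: Mostly cloudy, with a high near 58." class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Mostly cloudy, with a high near 58."/></p>
 <p class="short-desc">Mostly Cloudy</p>
 <p class="temp temp-high">High: 58 °F</p>
</div>
"""
item = BeautifulSoup(item_html, "html.parser").find(class_="tombstone-container")


def text_of(parent, class_name):
    """Return the stripped text of the first match, or None if it is missing."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


img = item.find("img")
forecast = {
    "period": text_of(item, "period-name"),
    "short_desc": text_of(item, "short-desc"),
    "temp": text_of(item, "temp"),
    # Tag.get returns None instead of raising KeyError if the attribute is absent.
    "desc": img.get("title") if img is not None else None,
}
print(forecast)
```
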


# Extracting all the information from the page

In the code below, we:

Select all items with the class _period-name_ inside an item with the class _tombstone-container_ in _seven_day_. <br>
Use a list comprehension to call the get_text method on each Tag object.

In [52]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday']

As you can see above, our technique gets us each of the period names, in order. We can apply the same technique to get the other 3 fields:

In [56]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

In [58]:
print(short_descs)

['Mostly Cloudy', 'DecreasingClouds', 'Slight ChanceShowers', 'Partly Cloudy', 'Slight ChanceShowers', 'ChanceShowers', 'Slight ChanceShowers', 'Mostly Clear', 'Sunny']


In [59]:
print(temps)

['High: 58 °F', 'Low: 49 °F', 'High: 65 °F', 'Low: 51 °F', 'High: 68 °F', 'Low: 52 °F', 'High: 66 °F', 'Low: 51 °F', 'High: 66 °F']


In [60]:
print(descs)

['Today: Mostly cloudy, with a high near 58. Calm wind becoming east southeast around 5 mph in the afternoon. ', 'Tonight: Cloudy, then gradually becoming partly cloudy, with a low around 49. West southwest wind around 6 mph becoming calm  in the evening. ', 'Monday: A 20 percent chance of showers.  Mostly sunny, with a high near 65. Light and variable wind becoming north around 6 mph in the morning. ', 'Monday Night: Partly cloudy, with a low around 51. North wind around 7 mph. ', 'Tuesday: A 20 percent chance of showers after 11am.  Mostly sunny, with a high near 68. North wind 5 to 7 mph becoming west northwest in the afternoon. ', 'Tuesday Night: A 30 percent chance of showers, mainly before 11pm.  Partly cloudy, with a low around 52. New precipitation amounts of less than a tenth of an inch possible. ', 'Wednesday: A 20 percent chance of showers.  Mostly sunny, with a high near 66.', 'Wednesday Night: Mostly clear, with a low around 51.', 'Thursday: Sunny, with a high near 66.']
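
Some of the extracted strings run words together, for example 'MondayNight' and 'DecreasingClouds'. A likely cause is that the words are separated by a `<br/>` tag rather than a space; the markup below is an assumption based on the `<br/>` tags visible in the 'Today' item, not a copy of the live page. Under that assumption, passing a separator to get_text keeps the words apart:

```python
from bs4 import BeautifulSoup

# Assumed markup: a <br/> between the two words, with no surrounding spaces.
tag = BeautifulSoup('<p class="period-name">Monday<br/>Night</p>', "html.parser").p

# Default behaviour joins the text pieces with nothing in between.
joined = tag.get_text()
# A separator is inserted between the pieces; strip=True trims each piece.
spaced = tag.get_text(separator=" ", strip=True)
print(joined)
print(spaced)
```
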


# Combining our data into a Pandas Dataframe

A DataFrame is an object that can store tabular data.

To build one, we call the DataFrame class and pass in each list of items that we have. We pass them in as part of a dictionary. Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

In [62]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})

In [65]:
weather

|   | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Today | Mostly Cloudy | High: 58 °F | Today: Mostly cloudy, with a high near 58. Cal... |
| 1 | Tonight | DecreasingClouds | Low: 49 °F | Tonight: Cloudy, then gradually becoming partl... |
| 2 | Monday | Slight ChanceShowers | High: 65 °F | Monday: A 20 percent chance of showers. Mostl... |
| 3 | MondayNight | Partly Cloudy | Low: 51 °F | Monday Night: Partly cloudy, with a low around... |
| 4 | Tuesday | Slight ChanceShowers | High: 68 °F | Tuesday: A 20 percent chance of showers after ... |
| 5 | TuesdayNight | ChanceShowers | Low: 52 °F | Tuesday Night: A 30 percent chance of showers,... |
| 6 | Wednesday | Slight ChanceShowers | High: 66 °F | Wednesday: A 20 percent chance of showers. Mo... |
| 7 | WednesdayNight | Mostly Clear | Low: 51 °F | Wednesday Night: Mostly clear, with a low arou... |
| 8 | Thursday | Sunny | High: 66 °F | Thursday: Sunny, with a high near 66. |


We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

In [66]:
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)  # raw string so \d is passed to the regex engine unchanged
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    58
1    49
2    65
3    51
4    68
5    52
6    66
7    51
8    66
Name: temp_num, dtype: object
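
To see what the regular expression does on its own, here is the same pattern applied with Python's built-in re module. The named group `temp_num` captures the first run of digits; the match is a string (which is why the Series above has dtype object), so it still needs to be converted to an integer.

```python
import re

# Same pattern as in the str.extract call: one named group matching digits.
TEMP_PATTERN = re.compile(r"(?P<temp_num>\d+)")

temps = ["High: 58 °F", "Low: 49 °F"]

# search finds the first match; group("temp_num") returns it as a string.
temp_strings = [TEMP_PATTERN.search(t).group("temp_num") for t in temps]
temp_nums = [int(s) for s in temp_strings]
print(temp_strings)
print(temp_nums)
```
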

We could then find the mean of all the high and low temperatures:

In [67]:
weather["temp_num"].mean()

58.44444444444444

We could also only select the rows that happen at night:

In [68]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [69]:
weather[is_night]

|   | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight | DecreasingClouds | Low: 49 °F | Tonight: Cloudy, then gradually becoming partl... | 49 | True |
| 3 | MondayNight | Partly Cloudy | Low: 51 °F | Monday Night: Partly cloudy, with a low around... | 51 | True |
| 5 | TuesdayNight | ChanceShowers | Low: 52 °F | Tuesday Night: A 30 percent chance of showers,... | 52 | True |
| 7 | WednesdayNight | Mostly Clear | Low: 51 °F | Wednesday Night: Mostly clear, with a low arou... | 51 | True |
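
Since highs and lows are mixed in one column, the overall mean is hard to interpret. As a possible next step, the sketch below computes separate means for day and night rows with groupby. It rebuilds a small DataFrame from the temperature strings printed earlier so that it runs on its own; the `part_of_day` column is a name introduced here for illustration.

```python
import pandas as pd

# Temperature strings copied from the output printed earlier.
temps = ["High: 58 °F", "Low: 49 °F", "High: 65 °F", "Low: 51 °F", "High: 68 °F",
         "Low: 52 °F", "High: 66 °F", "Low: 51 °F", "High: 66 °F"]
weather = pd.DataFrame({"temp": temps})
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
weather["is_night"] = weather["temp"].str.contains("Low")

# Turn the boolean flag into a readable label, then average within each group.
weather["part_of_day"] = weather["is_night"].map({True: "night", False: "day"})
mean_by_part = weather.groupby("part_of_day")["temp_num"].mean()
print(mean_by_part)
```
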


Reference: https://www.dataquest.io/blog/web-scraping-tutorial-python/