In [2]:
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [5]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [4]:
from bs4 import BeautifulSoup

In [8]:
soup = BeautifulSoup(page.content, 'html.parser')

In [12]:
# soup.prettify()
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

There are two tags at the top level of the page: the initial <!DOCTYPE html> tag and the <html> tag. There is also a newline character (\n) in the list, sitting between the two.

In [13]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

Check the type of each element in the list:

In [15]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we'll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects in the BeautifulSoup documentation.
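Since only Tag objects can be navigated further, it can help to filter the children by type. The following is a minimal sketch on an inline HTML string (the string `html_doc` is made up for illustration, not fetched from the page above):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline document, so the example runs without a network request.
html_doc = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Names of the classes of the top-level children, in document order.
kinds = [type(child).__name__ for child in soup.children]

# Keep only Tag objects, skipping the doctype and the whitespace string.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in top_level_tags]

print(kinds)
print(tag_names)
```

Filtering with `isinstance` avoids relying on a fixed position in the list, which can shift if the page's whitespace changes.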

We can now select the html tag and its children by taking the third item in the list:



In [17]:
html = list(soup.children)[2]

In [22]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [23]:
body = list(html.children)[3]

In [24]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

Isolate the `<p>` tag by taking the second item in the body's children, then extract its text with get_text:

In [26]:
p = list(body.children)[1]

In [27]:
p.get_text()

'Here is some simple content for this page.'
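Indexing into `children` works, but the positions depend on the newline strings between tags. As an alternative sketch, BeautifulSoup also lets you reach the first matching descendant by using the tag name as an attribute. The `html_doc` string below is a hand-typed copy of the example page, used here so the snippet runs offline:

```python
from bs4 import BeautifulSoup

# Hand-typed copy of the simple example page (assumption: same structure as the fetched page).
html_doc = (
    "<!DOCTYPE html>\n<html>\n<head><title>A simple example page</title></head>\n"
    "<body>\n<p>Here is some simple content for this page.</p>\n</body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Positional navigation, as in the cells above.
html = list(soup.children)[2]
body = list(html.children)[3]
p_by_index = list(body.children)[1]

# Attribute-style navigation: each step returns the first tag with that name.
p_by_name = soup.html.body.p

print(p_by_name.get_text())
```

Both routes end at the same paragraph here; the attribute form is shorter but only ever returns the first match at each step.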

## Finding all instances of a tag at once

In [28]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [29]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

In [30]:
soup.find('p')

<p>Here is some simple content for this page.</p>
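One practical difference between the two methods shows up when nothing matches: `find` returns `None`, while `find_all` returns an empty list. A small sketch on a made-up inline document:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document with a single paragraph and no table.
soup = BeautifulSoup("<html><body><p>Only paragraph.</p></body></html>", "html.parser")

first_p = soup.find("p")             # a Tag
missing = soup.find("table")         # None, because there is no <table>
all_tables = soup.find_all("table")  # empty result list

# Guard against None before calling get_text on the result of find.
text = first_p.get_text() if first_p is not None else ""
print(text, missing, len(all_tables))
```

Checking for `None` before calling methods on the result of `find` avoids an `AttributeError` when a page does not contain the expected tag.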

## Searching by Classes & IDs

In [31]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [32]:
# find any p tag with 'outer-text' class
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [33]:
# any tag with 'outer-text' class
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [34]:
# by id
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

### With CSS Selectors
```
p a — finds all a tags inside of a p tag.
body p a — finds all a tags inside of a p tag inside of a body tag.
html body — finds all body tags inside of an html tag.
p.outer-text — finds all p tags with a class of outer-text.
p#first — finds all p tags with an id of first.
body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
```

In [35]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
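A few more of the selectors listed above can be tried the same way. This sketch uses a condensed, hand-typed version of the ids_and_classes page (an assumption, so that it runs without a request) together with `select` and `select_one`:

```python
from bs4 import BeautifulSoup

# Condensed copy of the example page; whitespace differs from the real page.
html_doc = """
<html><body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# p.outer-text: all p tags with the class outer-text.
outer = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

# p#first: the p tag with id first; select_one returns a single tag or None.
first = soup.select_one("p#first").get_text(strip=True)

# body p.outer-text: p tags with class outer-text anywhere inside body.
in_body = len(soup.select("body p.outer-text"))

print(outer, first, in_body)
```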

--- 

In [36]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Patchy drizzle and fog after 11pm.  Patchy smoke before 11pm. Increasing clouds, with a low around 54. Southwest wind 13 to 20 mph, with gusts as high as 25 mph. " class="forecast-icon" src="newimages/medium/nra.png" title="Tonight: Patchy drizzle and fog after 11pm.  Patchy smoke before 11pm. Increasing clouds, with a low around 54. Southwest wind 13 to 20 mph, with gusts as high as 25 mph. "/>
 </p>
 <p class="short-desc">
  Patchy
  <br/>
  Drizzle and
  <br/>
  Patchy Fog
 </p>
 <p class="temp temp-low">
  Low: 54 °F
 </p>
</div>


The forecast item tonight contains four pieces of information we can extract:

The name of the forecast item: in this case, Tonight.

The description of the conditions: this is stored in the title property of img.

A short description of the conditions: in this case, Patchy Drizzle and Patchy Fog.

The temperature low: in this case, 54 degrees.

In [37]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print('period', period)
print('short_desc', short_desc)
print('temp', temp)

period Tonight
short_desc PatchyDrizzle andPatchy Fog
temp Low: 54 °F
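Note that short_desc comes out as 'PatchyDrizzle andPatchy Fog': the `<br/>` tags contribute no text, so the pieces are joined with nothing between them. One possible fix, sketched on a hand-typed copy of the short-desc paragraph, is to pass a separator to get_text:

```python
from bs4 import BeautifulSoup

# Hand-typed copy of the short-desc paragraph from the forecast item.
snippet = '<p class="short-desc">Patchy<br/>Drizzle and<br/>Patchy Fog</p>'
tag = BeautifulSoup(snippet, "html.parser").find(class_="short-desc")

# Default behaviour: text pieces are concatenated directly.
joined = tag.get_text()

# With a separator, each piece is stripped and joined with a space.
spaced = tag.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```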


Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

In [38]:
img = tonight.find("img")
desc = img['title']

print(desc)

Tonight: Patchy drizzle and fog after 11pm.  Patchy smoke before 11pm. Increasing clouds, with a low around 54. Southwest wind 13 to 20 mph, with gusts as high as 25 mph. 
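Dictionary-style access raises a `KeyError` when the attribute is absent. If some img tags might lack a title, `get` is a safer option. The sketch below uses a made-up img tag, not one taken from the forecast page:

```python
from bs4 import BeautifulSoup

# Hypothetical tag for illustration: it has a title but no alt attribute.
snippet = '<img class="forecast-icon" src="icon.png" title="Tonight: Clear.">'
img = BeautifulSoup(snippet, "html.parser").find("img")

title = img["title"]    # present, so indexing works
alt = img.get("alt")    # absent, so get returns None instead of raising
classes = img["class"]  # class is multi-valued, so this is a list

print(title, alt, classes)
```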


Select all items with the class period-name inside an item with the class tombstone-container in seven_day.


Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [39]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'IndependenceDay',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight']

The cell above gets the period names from each `tombstone-container`. The cell below gets the other three fields: the short descriptions, the temperatures, and the full descriptions.

In [40]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['PatchyDrizzle andPatchy Fog', 'PatchyDrizzle andPatchy Fogthen MostlySunny andBreezy', 'IncreasingClouds', 'GradualClearing', 'Mostly Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Sunny', 'Mostly Clear']
['Low: 54 °F', 'High: 62 °F', 'Low: 54 °F', 'High: 67 °F', 'Low: 55 °F', 'High: 71 °F', 'Low: 55 °F', 'High: 74 °F', 'Low: 54 °F']
['Tonight: Patchy drizzle and fog after 11pm.  Patchy smoke before 11pm. Increasing clouds, with a low around 54. Southwest wind 13 to 20 mph, with gusts as high as 25 mph. ', 'Independence Day: Patchy drizzle before 8am.  Patchy fog before 11am.  Otherwise, cloudy, then gradually becoming mostly sunny, with a high near 62. Breezy, with a west southwest wind 14 to 24 mph, with gusts as high as 31 mph. ', 'Wednesday Night: Increasing clouds, with a low around 54. West wind 10 to 20 mph, with gusts as high as 25 mph. ', 'Thursday: Cloudy through mid morning, then gradual clearing, with a high near 67. West wind 8 to 18 mph, with gusts as high as 24 mph. ', 'T

In [41]:
import pandas as pd
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })
weather

Unnamed: 0,period,short_desc,temp,desc
0,Tonight,PatchyDrizzle andPatchy Fog,Low: 54 °F,Tonight: Patchy drizzle and fog after 11pm. P...
1,IndependenceDay,PatchyDrizzle andPatchy Fogthen MostlySunny an...,High: 62 °F,Independence Day: Patchy drizzle before 8am. ...
2,WednesdayNight,IncreasingClouds,Low: 54 °F,"Wednesday Night: Increasing clouds, with a low..."
3,Thursday,GradualClearing,High: 67 °F,"Thursday: Cloudy through mid morning, then gra..."
4,ThursdayNight,Mostly Cloudy,Low: 55 °F,"Thursday Night: Mostly cloudy, with a low arou..."
5,Friday,Mostly Cloudy,High: 71 °F,"Friday: Mostly cloudy, with a high near 71."
6,FridayNight,Partly Cloudy,Low: 55 °F,"Friday Night: Partly cloudy, with a low around..."
7,Saturday,Sunny,High: 74 °F,"Saturday: Sunny, with a high near 74."
8,SaturdayNight,Mostly Clear,Low: 54 °F,"Saturday Night: Mostly clear, with a low aroun..."


In [42]:
# pull out numeric temperatures with regex
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    54
1    62
2    54
3    67
4    55
5    71
6    55
7    74
8    54
Name: temp_num, dtype: object
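The output above still has dtype object because `str.extract` returns strings; the conversion to integers happens only in the `astype` call. A self-contained sketch of the same two steps, using three of the temperature strings from the table as sample data:

```python
import pandas as pd

# Sample values copied from the temp column above.
temps = pd.Series(["Low: 54 °F", "High: 62 °F", "Low: 55 °F"], name="temp")

# Step 1: pull out the first run of digits as a string.
extracted = temps.str.extract(r"(?P<temp_num>\d+)", expand=False)

# Step 2: convert the strings to integers so they can be used numerically.
temp_num = extracted.astype(int)

print(extracted.tolist())
print(temp_num.tolist())
```

The raw string prefix (`r"..."`) keeps Python from treating `\d` as a string escape sequence.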

In [44]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
weather[is_night]

Unnamed: 0,period,short_desc,temp,desc,temp_num,is_night
0,Tonight,PatchyDrizzle andPatchy Fog,Low: 54 °F,Tonight: Patchy drizzle and fog after 11pm. P...,54,True
2,WednesdayNight,IncreasingClouds,Low: 54 °F,"Wednesday Night: Increasing clouds, with a low...",54,True
4,ThursdayNight,Mostly Cloudy,Low: 55 °F,"Thursday Night: Mostly cloudy, with a low arou...",55,True
6,FridayNight,Partly Cloudy,Low: 55 °F,"Friday Night: Partly cloudy, with a low around...",55,True
8,SaturdayNight,Mostly Clear,Low: 54 °F,"Saturday Night: Mostly clear, with a low aroun...",54,True


---

In [5]:
page = requests.get("https://www.courts.com.sg/samsung-wa75h4400ss-sp-top-load-washer-7-5kg-ip086415")
soup = BeautifulSoup(page.content, 'html.parser')

In [7]:
price = soup.find(class_="price").get_text()

In [8]:
price

'S$359.00'
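The scraped price is a string with a currency prefix. If a numeric value is needed, one option is to pull out the digits with a regular expression. The helper `parse_price` below is a hypothetical name introduced for this sketch, and it assumes prices use a dot as the decimal separator and optional commas for thousands:

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first number found in text as a Decimal, or None if there is none."""
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


amount = parse_price("S$359.00")
print(amount)
```

`Decimal` is used rather than `float` so that the cents are preserved exactly.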