# Web Scraping

Reference: https://hackernoon.com/web-scraping-bf2d814cc572

In [1]:
import requests

In [2]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

After running our request, we get a Response object. This object has a status_code property, which indicates whether the page was downloaded successfully. A status_code of 200 means that the page downloaded successfully. A status code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.

In [3]:
page.status_code

200
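
To make the status code ranges concrete, here is a minimal sketch that classifies a code using only the standard library. The helper name `describe_status` is made up for this illustration and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Classify an HTTP status code by its leading digit."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, describe_status(code))
```

With a real Response object, `page.ok` and `page.raise_for_status()` give a similar check without writing the ranges by hand.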

We use the BeautifulSoup library to parse the document. Its prettify method returns the HTML as a neatly indented string, and we pass that string to print() so the line breaks and indentation are displayed properly.

In [6]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content,'html.parser')

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


Now we can select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it.

In [8]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]
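
The three items in that list are not all tags: the first is the doctype, the second is a newline string, and only the third is the html tag. A small sketch, using an inline copy of the same page so that it runs without a network request, filters the children down to real tags. The variable names here are chosen for the illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# children yields every direct child node, including the doctype and
# whitespace strings, so we inspect the type of each one.
print([type(child).__name__ for child in soup.children])

# Keep only real tags, skipping the doctype and the newline strings.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])

# Tags can also be reached by attribute access instead of list positions.
paragraph_text = soup.body.p.get_text()
print(paragraph_text)
```

Filtering on `Tag` avoids depending on list positions such as `[2]` or `[3]`, which shift whenever the whitespace in the page changes.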

In [9]:
html = list(soup.children)[2]

In [10]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [12]:
body = list(html.children)[3]

In [13]:
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [14]:
p = list(body.children)[1]
p.get_text()

'Here is some simple content for this page.'

We can use the get_text method to extract all of the text inside a tag.

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want every instance of a tag, we can instead use the find_all method, which returns a list of all matching tags on the page. If we only want the first match, the find method returns a single tag.

In [15]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
# find_all returns a list, so we index into it to get a single tag
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

In [16]:
soup.find('p').get_text()

'Here is some simple content for this page.'
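
The example page has only one paragraph, so find_all and find look the same there. The sketch below uses a small made-up document with three paragraphs to show the difference, along with filtering by class and id. The HTML and its class and id values are invented for the illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<p class="intro" id="first">First paragraph.</p>
<p class="intro">Second paragraph.</p>
<p id="last">Third paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = soup.find_all("p")                    # every <p> on the page
first_paragraph = soup.find("p")                       # first match only
intro_paragraphs = soup.find_all("p", class_="intro")  # filter by CSS class
last_paragraph = soup.find(id="last")                  # filter by id
missing = soup.find("table")                           # no match gives None

print(len(all_paragraphs))
print(first_paragraph.get_text())
print([p.get_text() for p in intro_paragraphs])
print(last_paragraph.get_text())
print(missing)
```

Because find returns None when nothing matches, calling get_text() on its result directly raises an AttributeError on pages where the tag is absent.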

# WS1

In [17]:
import requests
from bs4 import BeautifulSoup

In [18]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.XThEcegzZPY")

In [19]:
soup = BeautifulSoup(page.content,'html.parser')

In [26]:
seven_day = soup.find(id="seven-day-forecast")
# seven_day

In [29]:
forecast_items = seven_day.find_all(class_="tombstone-container")
# forecast_items

In [30]:
# take the first forecast item
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 72. Light west southwest wind becoming west 6 to 11 mph in the afternoon. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 72. Light west southwest wind becoming west 6 to 11 mph in the afternoon. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 72 °F
 </p>
</div>


We'll extract the name of the forecast item, its short description, and its temperature.

In [31]:
period = tonight.find(class_ = "period-name").get_text()
short_Desc = tonight.find(class_ = "short-desc").get_text()
temp = tonight.find(class_ = "temp temp-high").get_text()

print(period)
print(short_Desc)
print(temp)

Today
Sunny
High: 72 °F
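
Searching for the exact class string "temp temp-high" only works for daytime items. Assuming night items carry "temp temp-low" instead (which the "Low:" values further down suggest), the same find call would return None for them. A minimal sketch of a safer lookup, using a cut-down, hand-written night item and a hypothetical helper named `text_or_none`:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one forecast item; not copied from the live page.
item_html = """
<div class="tombstone-container">
 <p class="period-name">Tonight</p>
 <p class="short-desc">Partly Cloudy</p>
 <p class="temp temp-low">Low: 56 °F</p>
</div>
"""
item = BeautifulSoup(item_html, "html.parser")


def text_or_none(parent, css_class):
    """Return the stripped text of the first element with css_class, or None."""
    found = parent.find(class_=css_class)
    return found.get_text(strip=True) if found is not None else None


# A single class name matches any element that has it among its classes.
print(text_or_none(item, "temp"))
# A string with a space is compared against the whole class attribute.
print(text_or_none(item, "temp temp-high"))
```

Searching on the single class "temp" covers both the high and low variants with one call.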


In [34]:
periods=[pds.get_text() for pds in seven_day.select(".tombstone-container .period-name")]
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(periods)
print(short_descs)
print(temps)
print(descs)

['Today', 'Tonight', 'Thursday', 'ThursdayNight', 'Friday', 'FridayNight', 'Saturday', 'SaturdayNight', 'Sunday']
['Sunny', 'Partly Cloudy', 'Mostly Sunny', 'Mostly Cloudy', 'Partly Sunny', 'Mostly Cloudy', 'Mostly Sunny', 'Mostly Clear', 'Sunny']
['High: 72 °F', 'Low: 56 °F', 'High: 70 °F', 'Low: 57 °F', 'High: 69 °F', 'Low: 57 °F', 'High: 75 °F', 'Low: 59 °F', 'High: 74 °F']
['Today: Sunny, with a high near 72. Light west southwest wind becoming west 6 to 11 mph in the afternoon. ', 'Tonight: Partly cloudy, with a low around 56. West southwest wind 7 to 13 mph. ', 'Thursday: Mostly sunny, with a high near 70. West southwest wind 7 to 13 mph. ', 'Thursday Night: Mostly cloudy, with a low around 57. West southwest wind 8 to 11 mph. ', 'Friday: Partly sunny, with a high near 69. West southwest wind 7 to 10 mph. ', 'Friday Night: Mostly cloudy, with a low around 57.', 'Saturday: Mostly sunny, with a high near 75.', 'Saturday Night: Mostly clear, with a low around 59.', 'Sunday: Sunny, wi
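
The select method takes a CSS selector, and ".tombstone-container .period-name" means "any element with class period-name inside an element with class tombstone-container". The output above also contains names such as 'ThursdayNight', because get_text() joins the text on either side of a `<br/>` with no separator. The sketch below reproduces both points on a cut-down, hand-written fragment. The markup and image file names are assumptions for illustration, not copied from the live page.

```python
from bs4 import BeautifulSoup

forecast_html = """
<div id="seven-day-forecast">
 <div class="tombstone-container">
  <p class="period-name">Today<br/><br/></p>
  <p><img title="Today: Sunny, with a high near 72." src="a.png"/></p>
  <p class="temp temp-high">High: 72 °F</p>
 </div>
 <div class="tombstone-container">
  <p class="period-name">Thursday<br/>Night</p>
  <p><img title="Thursday Night: Mostly cloudy, with a low around 57." src="b.png"/></p>
  <p class="temp temp-low">Low: 57 °F</p>
 </div>
</div>
"""
soup = BeautifulSoup(forecast_html, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

name_tags = seven_day.select(".tombstone-container .period-name")

# Default get_text() concatenates the pieces around <br/> directly.
periods_plain = [tag.get_text() for tag in name_tags]
# Passing a separator and strip=True keeps the words apart.
periods_spaced = [tag.get_text(" ", strip=True) for tag in name_tags]

temps = [t.get_text(strip=True) for t in seven_day.select(".tombstone-container .temp")]
titles = [img["title"] for img in seven_day.select(".tombstone-container img")]

print(periods_plain)
print(periods_spaced)
print(temps)
print(titles)
```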

In [35]:
import pandas as pd
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Today | Sunny | High: 72 °F | Today: Sunny, with a high near 72. Light west ... |
| 1 | Tonight | Partly Cloudy | Low: 56 °F | Tonight: Partly cloudy, with a low around 56. ... |
| 2 | Thursday | Mostly Sunny | High: 70 °F | Thursday: Mostly sunny, with a high near 70. W... |
| 3 | ThursdayNight | Mostly Cloudy | Low: 57 °F | Thursday Night: Mostly cloudy, with a low arou... |
| 4 | Friday | Partly Sunny | High: 69 °F | Friday: Partly sunny, with a high near 69. Wes... |
| 5 | FridayNight | Mostly Cloudy | Low: 57 °F | Friday Night: Mostly cloudy, with a low around... |
| 6 | Saturday | Mostly Sunny | High: 75 °F | Saturday: Mostly sunny, with a high near 75. |
| 7 | SaturdayNight | Mostly Clear | Low: 59 °F | Saturday Night: Mostly clear, with a low aroun... |
| 8 | Sunday | Sunny | High: 74 °F | Sunday: Sunny, with a high near 74. |


In [36]:
# Raw string so that \d is passed to the regex engine unchanged
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
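
To see what the pattern does on its own, here is a small sketch that applies the same named group with the standard re module to a few of the temperature strings above. The helper name `extract_temp` is made up for this illustration.

```python
import re

# Same pattern as above: a named group capturing the first run of digits.
TEMP_PATTERN = re.compile(r"(?P<temp_num>\d+)")


def extract_temp(text):
    """Return the first integer found in text, or None if there is none."""
    match = TEMP_PATTERN.search(text)
    return int(match.group("temp_num")) if match else None


temps = ["High: 72 °F", "Low: 56 °F", "High: 70 °F"]
temp_values = [extract_temp(t) for t in temps]
print(temp_values)
```

In the pandas call, the group name temp_num is what names the extracted result, and astype('int') plays the role of the int() conversion here.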

In [37]:
weather

| | period | short_desc | temp | desc | temp_num |
|---|---|---|---|---|---|
| 0 | Today | Sunny | High: 72 °F | Today: Sunny, with a high near 72. Light west ... | 72 |
| 1 | Tonight | Partly Cloudy | Low: 56 °F | Tonight: Partly cloudy, with a low around 56. ... | 56 |
| 2 | Thursday | Mostly Sunny | High: 70 °F | Thursday: Mostly sunny, with a high near 70. W... | 70 |
| 3 | ThursdayNight | Mostly Cloudy | Low: 57 °F | Thursday Night: Mostly cloudy, with a low arou... | 57 |
| 4 | Friday | Partly Sunny | High: 69 °F | Friday: Partly sunny, with a high near 69. Wes... | 69 |
| 5 | FridayNight | Mostly Cloudy | Low: 57 °F | Friday Night: Mostly cloudy, with a low around... | 57 |
| 6 | Saturday | Mostly Sunny | High: 75 °F | Saturday: Mostly sunny, with a high near 75. | 75 |
| 7 | SaturdayNight | Mostly Clear | Low: 59 °F | Saturday Night: Mostly clear, with a low aroun... | 59 |
| 8 | Sunday | Sunny | High: 74 °F | Sunday: Sunny, with a high near 74. | 74 |


# WS2

In [38]:
import requests
from bs4 import BeautifulSoup

In [45]:
page = requests.get("https://www.news.com.au/world")


In [46]:
page.status_code

200

In [47]:
soup = BeautifulSoup(page.content,'html.parser')

In [56]:
headlines = [head.get_text() for head in soup.find_all(class_="heading")]

# ToDo

Read up on and implement an algorithm that categorizes documents based on keywords.
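
As a possible starting point, here is a minimal sketch of one simple approach: count how many words of a text appear in each category's keyword set and pick the category with the most hits. The categories, keywords, and sample headlines are all hypothetical placeholders, and this is only one of many ways such an algorithm could work.

```python
import re
from collections import Counter

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "politics": {"election", "president", "parliament", "minister"},
    "weather": {"storm", "flood", "cyclone", "heatwave"},
    "sport": {"match", "tournament", "olympics", "coach"},
}


def categorize(text, category_keywords=CATEGORY_KEYWORDS):
    """Return the category whose keywords occur most often in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word in words:
        for category, keywords in category_keywords.items():
            if word in keywords:
                counts[category] += 1
    if not counts:
        return "uncategorized"
    # most_common(1) returns a list holding the single top (category, count) pair.
    return counts.most_common(1)[0][0]


# Made-up sample headlines, not scraped from any site.
samples = [
    "President calls early election",
    "Cyclone brings flood warnings",
    "Nothing relevant here",
]
for headline in samples:
    print(headline, "->", categorize(headline))
```

The scraped headlines list could be passed through `categorize` one item at a time in the same way.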