# Data Science Recap

Example Coursera DataScience course: labs/DP0701EN/Webscraping postal codes of Canada-Part 1 2 and 3.ipynb

  * https://towardsdatascience.com/introduction-to-web-scraping-with-beautifulsoup-e87a06c2b857
  * https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319

  * https://www.dataquest.io/blog/web-scraping-tutorial-python/ - Beginner
  * https://www.dataquest.io/blog/web-scraping-beautifulsoup/
  * https://www.datacamp.com/community/tutorials/web-scraping-python-nlp
  * https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
  * https://www.datacamp.com/community/tutorials/web-scraping-using-python


## 1 Web Scraping with BeautifulSoup - National Weather Service
https://www.dataquest.io/blog/web-scraping-tutorial-python/


  * **Requests**
  * **Beautiful Soup**
  * Scrapy
  * Selenium

<img src = "https://forecast.weather.gov/wwamap/png/US.png" width = 400 align = 'left'>

### Install the BeautifulSoup4 and Requests Python packages

In [1]:
pip install BeautifulSoup4 requests

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/1a/b7/34eec2fe5a49718944e215fde81288eec1fa04638aa3fb57c1c6cd0f98c3/beautifulsoup4-4.8.0-py3-none-any.whl (97kB)
Collecting soupsieve>=1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/0b/44/0474f2207fdd601bb25787671c81076333d2c80e6f97e92790f8887cf682/soupsieve-1.9.3-py2.py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.8.0 soupsieve-1.9.3
Note: you may need to restart the kernel to use updated packages.


## D. Requests and BeautifulSoup - National Weather Service

### Import Libraries and Parse HTML

In [2]:
# importing libraries
from bs4 import BeautifulSoup

# https://www.pythonforbeginners.com/requests/using-requests-in-python
import requests

import pandas as pd

In [3]:
# NWS San Francisco

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
page
#page.status_code
#page.content

<Response [200]>

### Prepare and parse the webpage object with BeautifulSoup

In [4]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

#define url to scrape
url = "http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168"

#connect to website
try:
    r = requests.get(url, headers=headers)
    print("Connection to", url, "successful")
except requests.exceptions.RequestException:
    # only network-level failures end up here; an HTTP error status does not raise
    print("An error occurred.")

Connection to http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168 successful
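
The `try`/`except` above only reports failures that raise an exception. `requests.get` returns normally for an HTTP error status such as 404, so the success message can be misleading. A minimal sketch of a stricter fetch follows; the helper name `fetch_page` and the 10 second timeout are assumptions for illustration, not part of the original notebook. The lower half builds a `Response` by hand so the behaviour can be seen without network access.

```python
import requests


def fetch_page(url, headers=None, timeout=10):
    """Hypothetical helper: download a page and fail loudly on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx codes
    response.raise_for_status()
    return response


# Offline illustration: a hand-built Response carrying an error status
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    outcome = "ok"
except requests.exceptions.HTTPError:
    outcome = "http error"
print(outcome)
```

In the notebook, `fetch_page(url, headers=headers)` could stand in for the plain `requests.get` call if failing early on a bad status is preferred.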


In [5]:
# parse the downloaded page content with BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>

### Extracting information from the page - only first item

In [19]:
#list(soup.children)

soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id = 'seven-day-forecast') # look up div tag with id 'seven-day-forecast'
#print(seven_day.prettify()) # show resulting content formatted with prettify() - find() and not find_all()
forecast_items = seven_day.find_all(class_ = 'tombstone-container') # find_all forecast items
tonight = forecast_items[0] # the first item in the forecast_items list
print(tonight)

<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly clear, with a steady temperature around 59. West wind 8 to 10 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a steady temperature around 59. West wind 8 to 10 mph. "/></p><p class="short-desc">Mostly Clear</p><p class="temp temp-low">Low: 59 °F</p></div>


In [32]:
# get the information from the forecast_item

period = tonight.find(class_ = 'period-name').text

img = tonight.find('img')
desc = img['title']

short = tonight.find(class_ = 'short-desc').text
low = tonight.find(class_ = 'temp temp-low').text

#results
#print(img)
print(period)
print(desc)
print(short)
print(low)

Tonight
Tonight: Mostly clear, with a steady temperature around 59. West wind 8 to 10 mph. 
Mostly Clear
Low: 59 °F
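
`find()` returns `None` when nothing matches, so a daytime item (which has no `temp-low` paragraph) would make `.text` fail with an `AttributeError`. The sketch below, using a trimmed copy of the markup printed above, shows one way to guard against that. The helper name `text_or_none` is hypothetical, and the `img` attributes are shortened for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first forecast item shown above
html = """
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img class="forecast-icon" title="Tonight: Mostly clear"/></p>
<p class="short-desc">Mostly Clear</p>
<p class="temp temp-low">Low: 59 °F</p>
</div>
"""
item = BeautifulSoup(html, "html.parser")


def text_or_none(parent, css_class):
    """Hypothetical helper: text of the first tag with css_class, or None."""
    tag = parent.find(class_=css_class)
    return tag.get_text() if tag is not None else None


low = text_or_none(item, "temp-low")    # matches one of the tag's classes
high = text_or_none(item, "temp-high")  # not present in this item
print(low, high)
```

Searching for the single class `temp-low` rather than the exact string `'temp temp-low'` also keeps working if the order of the classes in the page changes.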


<hr>

### Extracting all information from the page

In [35]:
# CSS selector

period_tags = seven_day.select(".tombstone-container .period-name")
print(period_tags)

# list comprehension
periods = [pt.get_text() for pt in period_tags]
periods

[<p class="period-name">Tonight<br/><br/></p>, <p class="period-name">Thursday<br/><br/></p>, <p class="period-name">Thursday<br/>Night</p>, <p class="period-name">Friday<br/><br/></p>, <p class="period-name">Friday<br/>Night</p>, <p class="period-name">Saturday<br/><br/></p>, <p class="period-name">Saturday<br/>Night</p>, <p class="period-name">Sunday<br/><br/></p>, <p class="period-name">Sunday<br/>Night</p>]


['Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight']
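
The labels come out as `ThursdayNight` because `get_text()` concatenates the text on either side of the `<br/>` tag. A small sketch, assuming a two-item sample of the markup shown above, of how the `separator` and `strip` arguments of `get_text()` keep the words apart:

```python
from bs4 import BeautifulSoup

# Two sample items in the same shape as the forecast markup
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight<br/><br/></p></div>
  <div class="tombstone-container"><p class="period-name">Thursday<br/>Night</p></div>
</div>
"""
sample = BeautifulSoup(html, "html.parser")
tags = sample.select(".tombstone-container .period-name")

# Default behaviour: text fragments are joined without a separator
joined = [t.get_text() for t in tags]

# Join fragments with a space and strip surrounding whitespace
spaced = [t.get_text(separator=" ", strip=True) for t in tags]

print(joined)
print(spaced)
```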

In [39]:
# CSS selector and list comprehension combined

periods = [pt.get_text() for pt in seven_day.select(".tombstone-container .period-name")]
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
# this line differs: it selects the img tags and reads their title attribute instead of the tag text
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(periods)
print(short_descs)
print(temps)
print(descs)

['Tonight', 'Thursday', 'ThursdayNight', 'Friday', 'FridayNight', 'Saturday', 'SaturdayNight', 'Sunday', 'SundayNight']
['Mostly Clear', 'Sunny', 'Mostly Clear', 'Mostly Sunny', 'Partly Cloudy', 'Sunny', 'Clear', 'Mostly Sunny', 'Mostly Clear']
['Low: 59 °F', 'High: 70 °F', 'Low: 58 °F', 'High: 74 °F', 'Low: 58 °F', 'High: 77 °F', 'Low: 58 °F', 'High: 74 °F', 'Low: 59 °F']
['Tonight: Mostly clear, with a steady temperature around 59. West wind 8 to 10 mph. ', 'Thursday: Sunny, with a high near 70. Calm wind becoming west 5 to 9 mph in the afternoon. ', 'Thursday Night: Mostly clear, with a low around 58. West wind 6 to 11 mph becoming light west northwest  after midnight. ', 'Friday: Mostly sunny, with a high near 74. Light and variable wind becoming west 6 to 11 mph in the afternoon. ', 'Friday Night: Partly cloudy, with a low around 58. West wind 8 to 13 mph becoming light southwest  in the evening. ', 'Saturday: Sunny, with a high near 77.', 'Saturday Night: Clear, with a low around

### Put all information in a pandas DataFrame

In [40]:
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
weather

|   | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | Mostly Clear | Low: 59 °F | Tonight: Mostly clear, with a steady temperatu... |
| 1 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. Calm win... |
| 2 | ThursdayNight | Mostly Clear | Low: 58 °F | Thursday Night: Mostly clear, with a low aroun... |
| 3 | Friday | Mostly Sunny | High: 74 °F | Friday: Mostly sunny, with a high near 74. Lig... |
| 4 | FridayNight | Partly Cloudy | Low: 58 °F | Friday Night: Partly cloudy, with a low around... |
| 5 | Saturday | Sunny | High: 77 °F | Saturday: Sunny, with a high near 77. |
| 6 | SaturdayNight | Clear | Low: 58 °F | Saturday Night: Clear, with a low around 58. |
| 7 | Sunday | Mostly Sunny | High: 74 °F | Sunday: Mostly sunny, with a high near 74. |
| 8 | SundayNight | Mostly Clear | Low: 59 °F | Sunday Night: Mostly clear, with a low around 59. |


<hr>

### Crunching and Cleanup

*Regular Expressions* and *Series.str.extract*

In [43]:
# Get the temperature from the temp text

# extract the digits with a named group; a raw string keeps \d from being treated as a string escape
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)  # the result is still a string
weather["temp_num"] = temp_nums.astype('int')  # cast the extracted strings to int and store them in the column temp_num
weather

|   | period | short_desc | temp | desc | temp_num |
|---|---|---|---|---|---|
| 0 | Tonight | Mostly Clear | Low: 59 °F | Tonight: Mostly clear, with a steady temperatu... | 59 |
| 1 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. Calm win... | 70 |
| 2 | ThursdayNight | Mostly Clear | Low: 58 °F | Thursday Night: Mostly clear, with a low aroun... | 58 |
| 3 | Friday | Mostly Sunny | High: 74 °F | Friday: Mostly sunny, with a high near 74. Lig... | 74 |
| 4 | FridayNight | Partly Cloudy | Low: 58 °F | Friday Night: Partly cloudy, with a low around... | 58 |
| 5 | Saturday | Sunny | High: 77 °F | Saturday: Sunny, with a high near 77. | 77 |
| 6 | SaturdayNight | Clear | Low: 58 °F | Saturday Night: Clear, with a low around 58. | 58 |
| 7 | Sunday | Mostly Sunny | High: 74 °F | Sunday: Mostly sunny, with a high near 74. | 74 |
| 8 | SundayNight | Mostly Clear | Low: 59 °F | Sunday Night: Mostly clear, with a low around 59. | 59 |
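
To see what the pattern does on a single string, the same named group can be tried with the standard `re` module. This is a sketch outside pandas; the helper name `extract_temp` and the sample strings are assumptions for illustration.

```python
import re

# Same pattern as in the str.extract call: one or more digits, named temp_num
TEMP_PATTERN = re.compile(r"(?P<temp_num>\d+)")


def extract_temp(text):
    """Hypothetical helper: first run of digits as int, or None if absent."""
    match = TEMP_PATTERN.search(text)
    return int(match.group("temp_num")) if match else None


samples = ["Low: 59 °F", "High: 70 °F", "n/a"]
values = [extract_temp(s) for s in samples]
print(values)
```

Note that `\d+` ignores a leading minus sign, so a forecast below zero would need a pattern such as `-?\d+`.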


In [45]:
# check if the value is of the type int
weather['temp_num'][0] - 10


49

In [46]:
# calculate the mean of the temperatures of the forecast

weather['temp_num'].mean()

65.22222222222223

In [86]:
# convert Fahrenheit to Celsius

'''
fahrenheit = int(input("Enter a temperature in Fahrenheit: "))
celsius = (fahrenheit - 32) * 5.0 / 9.0
print("Temperature:", fahrenheit, "Fahrenheit =", celsius, "C")
'''

weather['temp_nums_C'] = ((weather['temp_num'] - 32) * 5.0 / 9.0).round(2)
#weather = weather.drop("temp_nums_C2", axis=1)
weather

|   | period | short_desc | temp | desc | temp_num | temp_nums_C |
|---|---|---|---|---|---|---|
| 0 | Tonight | Mostly Clear | Low: 59 °F | Tonight: Mostly clear, with a steady temperatu... | 59 | 15.0 |
| 1 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. Calm win... | 70 | 21.11 |
| 2 | ThursdayNight | Mostly Clear | Low: 58 °F | Thursday Night: Mostly clear, with a low aroun... | 58 | 14.44 |
| 3 | Friday | Mostly Sunny | High: 74 °F | Friday: Mostly sunny, with a high near 74. Lig... | 74 | 23.33 |
| 4 | FridayNight | Partly Cloudy | Low: 58 °F | Friday Night: Partly cloudy, with a low around... | 58 | 14.44 |
| 5 | Saturday | Sunny | High: 77 °F | Saturday: Sunny, with a high near 77. | 77 | 25.0 |
| 6 | SaturdayNight | Clear | Low: 58 °F | Saturday Night: Clear, with a low around 58. | 58 | 14.44 |
| 7 | Sunday | Mostly Sunny | High: 74 °F | Sunday: Mostly sunny, with a high near 74. | 74 | 23.33 |
| 8 | SundayNight | Mostly Clear | Low: 59 °F | Sunday Night: Mostly clear, with a low around 59. | 59 | 15.0 |
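
The conversion applied to the column is C = (F - 32) × 5/9. As a quick check of the values in the table, here is the same arithmetic as a plain function; the name `fahrenheit_to_celsius` is an assumption for illustration.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to Celsius, rounded to two decimals."""
    return round((temp_f - 32) * 5.0 / 9.0, 2)


# The distinct temperatures that appear in the forecast table
temps_f = [59, 70, 58, 74, 77]
temps_c = [fahrenheit_to_celsius(t) for t in temps_f]
print(temps_c)
```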


In [94]:
### Categorize the day and night records

is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool
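
A boolean column like `is_night` is mainly useful for filtering and grouping. The sketch below rebuilds a four-row subset of the table (an assumption, so that the block runs on its own) and compares mean temperatures for day and night.

```python
import pandas as pd

# Four-row subset of the forecast table, rebuilt here so the block is standalone
sample = pd.DataFrame({
    "temp": ["Low: 59 °F", "High: 70 °F", "Low: 58 °F", "High: 74 °F"],
    "temp_num": [59, 70, 58, 74],
})
sample["is_night"] = sample["temp"].str.contains("Low")

# Boolean indexing: keep only the night rows
night_rows = sample[sample["is_night"]]

# Group by the flag and average the numeric temperature
means = sample.groupby("is_night")["temp_num"].mean().to_dict()
print(means)
```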

In [88]:
### explore DataFrame
# https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

# a notebook cell displays only the value of its last expression
weather.ndim
weather.head()
weather.tail()
weather.describe()
weather.dtypes
weather.shape

(9, 6)

In [95]:
weather.dtypes

period          object
short_desc      object
temp            object
desc            object
temp_num         int64
temp_nums_C    float64
is_night          bool
dtype: object

In [96]:
weather

|   | period | short_desc | temp | desc | temp_num | temp_nums_C | is_night |
|---|---|---|---|---|---|---|---|
| 0 | Tonight | Mostly Clear | Low: 59 °F | Tonight: Mostly clear, with a steady temperatu... | 59 | 15.0 | True |
| 1 | Thursday | Sunny | High: 70 °F | Thursday: Sunny, with a high near 70. Calm win... | 70 | 21.11 | False |
| 2 | ThursdayNight | Mostly Clear | Low: 58 °F | Thursday Night: Mostly clear, with a low aroun... | 58 | 14.44 | True |
| 3 | Friday | Mostly Sunny | High: 74 °F | Friday: Mostly sunny, with a high near 74. Lig... | 74 | 23.33 | False |
| 4 | FridayNight | Partly Cloudy | Low: 58 °F | Friday Night: Partly cloudy, with a low around... | 58 | 14.44 | True |
| 5 | Saturday | Sunny | High: 77 °F | Saturday: Sunny, with a high near 77. | 77 | 25.0 | False |
| 6 | SaturdayNight | Clear | Low: 58 °F | Saturday Night: Clear, with a low around 58. | 58 | 14.44 | True |
| 7 | Sunday | Mostly Sunny | High: 74 °F | Sunday: Mostly sunny, with a high near 74. | 74 | 23.33 | False |
| 8 | SundayNight | Mostly Clear | Low: 59 °F | Sunday Night: Mostly clear, with a low around 59. | 59 | 15.0 | True |
