### Web Scraping Tutorial
 https://www.dataquest.io/blog/web-scraping-tutorial-python/
 
In this tutorial, we’ll show you how to perform web scraping using Python 3 and the BeautifulSoup library. We’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

<html>
    <head>
    </head>
    <body>
        
        <p> List of HTML
        <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element">  tags </a>  
        </p>
    </body>
</html>
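The markup above is a minimal HTML document: a paragraph inside `<body>` that contains one link. As a rough sketch of how Beautiful Soup reads such a document, the snippet below parses the same markup from an inline string, so no download is needed. The names `sample_html`, `sample_soup`, and `link` are illustrative only.

```python
from bs4 import BeautifulSoup

# The sample document from above, held in a string (no network access needed).
sample_html = """
<html>
    <head>
    </head>
    <body>
        <p> List of HTML
        <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element">  tags </a>
        </p>
    </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; attributes are read like dict keys.
link = sample_soup.find("a")
link_text = link.get_text(strip=True)
link_url = link["href"]

print(link_text, "->", link_url)
```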

In [3]:
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
if page.status_code == 200:
    print("Page downloaded successfully")

Page downloaded successfully
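The check above prints a message on success but stays silent on failure. One possible alternative, sketched here with a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout, is to let `requests` raise an exception when the download does not succeed.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or raise if the request fails.

    fetch_html and the default timeout are illustrative assumptions,
    not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Example call (requires network access):
# html = fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")
```

Callers can then wrap the call in `try`/`except requests.exceptions.RequestException` to handle timeouts, connection errors, and bad status codes in one place.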


In [4]:
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
print(soup.prettify())


<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [7]:
list(soup.children)


['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [8]:
[type(item) for item in list(soup.children)] 


[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [9]:
html = list(soup.children)[2]


In [11]:
list(html.children)


['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [13]:
body = list(html.children)[3]


In [16]:
p = list(body.children)[1]

In [17]:
p.get_text()

'Here is some simple content for this page.'
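The index values used above (`[2]`, `[3]`, `[1]`) depend on where the whitespace strings happen to fall between tags. A small sketch of a less position-sensitive walk follows: it keeps only the `Tag` children at each level. The helper name `child_tags` is hypothetical, and the markup is the same simple page, inlined as a string.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def child_tags(node):
    """Return only the Tag children of a node, skipping strings and the doctype."""
    return [child for child in node.children if isinstance(child, Tag)]


html = child_tags(soup)[0]      # the <html> element
head, body = child_tags(html)   # <head> and <body>, whitespace ignored
p = child_tags(body)[0]         # the only <p> in the body

print(p.get_text())
```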

In [19]:
soup.find('p').get_text()


'Here is some simple content for this page.'

In [20]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [21]:
soup.find_all('p', class_='outer-text')


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [22]:
soup.find_all(class_="outer-text")


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [26]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
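Two details of the searches above are easy to miss: `class_` matches a tag when any one of its classes equals the given name, and `find` returns a single tag (or `None`) rather than a list. The sketch below shows both on a condensed copy of the same markup, inlined as a string.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Matches both outer paragraphs, including the one that also has "first-item".
outer = soup.find_all("p", class_="outer-text")

# Matches one inner and one outer paragraph, since both carry "first-item".
first_items = soup.find_all(class_="first-item")
first_item_ids = [p["id"] for p in first_items]

# find() returns the tag itself; the class attribute is exposed as a list.
first = soup.find(id="first")
first_classes = first["class"]

print(len(outer), first_item_ids, first_classes)
```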

In [27]:
soup.select("div p")


[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

In [28]:
soup.select("p#first")


[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
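The `select` calls above take CSS selectors. As a short sketch on a condensed copy of the same markup, the example below combines a descendant selector, a class selector, and `select_one`, which returns the first match or `None`. The id `third` is deliberately absent, to show the no-match case.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Descendant selector: every <p> anywhere inside a <div>.
inner = [p.get_text(strip=True) for p in soup.select("div p")]

# Class selector: every <p> carrying the outer-text class.
outer = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("p#first")
missing = soup.select_one("p#third")

print(inner, outer, first["id"], missing)
```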

In [33]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  This
  <br>
   Afternoon
  </br>
 </p>
 <p>
  <img alt="This Afternoon: A 30 percent chance of showers.  Mostly sunny, with a high near 55. West wind around 15 mph, with gusts as high as 20 mph. " class="forecast-icon" src="newimages/medium/hi_shwrs30.png" title="This Afternoon: A 30 percent chance of showers.  Mostly sunny, with a high near 55. West wind around 15 mph, with gusts as high as 20 mph. "/>
 </p>
 <p class="short-desc">
  Chance
  <br>
   Showers
  </br>
 </p>
 <p class="temp temp-high">
  High: 55 °F
 </p>
</div>


In [34]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

ThisAfternoon
ChanceShowers
High: 55 °F
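The output above runs words together ("ThisAfternoon", "ChanceShowers") because the two words are separated only by a `<br>` tag, which contributes no text. One way to keep them apart, sketched on a simplified copy of the forecast item, is to pass a separator to `get_text`.

```python
from bs4 import BeautifulSoup

# Simplified copy of one forecast item, inlined so the example runs offline.
item_html = """
<div class="tombstone-container">
 <p class="period-name">This<br>Afternoon</p>
 <p class="short-desc">Chance<br>Showers</p>
 <p class="temp temp-high">High: 55 °F</p>
</div>
"""
item = BeautifulSoup(item_html, "html.parser")

# separator=" " joins the text fragments with a space; strip=True trims each one.
period = item.find(class_="period-name").get_text(separator=" ", strip=True)
short_desc = item.find(class_="short-desc").get_text(separator=" ", strip=True)
temp = item.find(class_="temp").get_text(separator=" ", strip=True)

print(period)
print(short_desc)
print(temp)
```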


In [35]:
img = tonight.find("img")
desc = img['title']

print(desc)

This Afternoon: A 30 percent chance of showers.  Mostly sunny, with a high near 55. West wind around 15 mph, with gusts as high as 20 mph. 
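Indexing a tag with `img['title']` raises `KeyError` when the attribute is absent. If some images might lack a title, `Tag.get` is a gentler option because it returns `None` instead. The markup and file names in this sketch are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second image has no title attribute.
snippet = BeautifulSoup(
    '<p><img src="a.png" title="Tonight: Showers likely."/></p>'
    '<p><img src="b.png"/></p>',
    "html.parser",
)

images = snippet.find_all("img")

# get() returns None for a missing attribute instead of raising KeyError.
titles = [img.get("title") for img in images]

print(titles)
```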


In [36]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['ThisAfternoon',
 'Tonight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday']

In [37]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['ChanceShowers', 'ShowersLikely', 'Slight ChanceShowers', 'Partly Cloudy', 'Sunny', 'Mostly Clear', 'Sunny', 'Clear', 'Sunny']
['High: 55 °F', 'Low: 45 °F', 'High: 54 °F', 'Low: 43 °F', 'High: 57 °F', 'Low: 45 °F', 'High: 59 °F', 'Low: 47 °F', 'High: 59 °F']
['This Afternoon: A 30 percent chance of showers.  Mostly sunny, with a high near 55. West wind around 15 mph, with gusts as high as 20 mph. ', 'Tonight: Showers likely, mainly after 10pm.  Mostly cloudy, with a low around 45. West northwest wind 7 to 14 mph, with gusts as high as 18 mph.  Chance of precipitation is 60%. New precipitation amounts of less than a tenth of an inch possible. ', 'Monday: A 20 percent chance of showers.  Mostly sunny, with a high near 54. North northwest wind 7 to 13 mph.  New precipitation amounts of less than a tenth of an inch possible. ', 'Monday Night: Partly cloudy, with a low around 43. Northwest wind 11 to 14 mph, with gusts as high as 18 mph. ', 'Tuesday: Sunny, with a high near 57. North wind 

In [38]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
weather

| | desc | period | short_desc | temp |
|---|---|---|---|---|
| 0 | This Afternoon: A 30 percent chance of showers... | ThisAfternoon | ChanceShowers | High: 55 °F |
| 1 | Tonight: Showers likely, mainly after 10pm. M... | Tonight | ShowersLikely | Low: 45 °F |
| 2 | Monday: A 20 percent chance of showers. Mostl... | Monday | Slight ChanceShowers | High: 54 °F |
| 3 | Monday Night: Partly cloudy, with a low around... | MondayNight | Partly Cloudy | Low: 43 °F |
| 4 | Tuesday: Sunny, with a high near 57. North win... | Tuesday | Sunny | High: 57 °F |
| 5 | Tuesday Night: Mostly clear, with a low around... | TuesdayNight | Mostly Clear | Low: 45 °F |
| 6 | Wednesday: Sunny, with a high near 59. | Wednesday | Sunny | High: 59 °F |
| 7 | Wednesday Night: Clear, with a low around 47. | WednesdayNight | Clear | Low: 47 °F |
| 8 | Thursday: Sunny, with a high near 59. | Thursday | Sunny | High: 59 °F |
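Building the DataFrame from four separately scraped lists works only if every forecast item contains every field. If one item lacks a temperature, the lists end up with different lengths or fall out of alignment. A possible alternative, sketched here on made-up markup with a hypothetical helper `text_or_none`, is to build one record per forecast item.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up markup: the second item has no temperature and no image.
forecast_html = """
<div id="seven-day-forecast">
 <div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p><img title="Tonight: Showers likely."/></p>
  <p class="short-desc">Showers<br>Likely</p>
  <p class="temp temp-low">Low: 45 °F</p>
 </div>
 <div class="tombstone-container">
  <p class="period-name">Monday</p>
  <p class="short-desc">Sunny</p>
 </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the text of the first match under parent, or None if absent."""
    tag = parent.select_one(selector)
    return tag.get_text(separator=" ", strip=True) if tag else None


seven_day = BeautifulSoup(forecast_html, "html.parser").find(id="seven-day-forecast")

records = []
for item in seven_day.find_all(class_="tombstone-container"):
    img = item.find("img")
    records.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": img.get("title") if img else None,
    })

# Each record becomes one row; missing fields show up as missing values.
weather = pd.DataFrame(records)
print(weather)
```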


In [39]:
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    55
1    45
2    54
3    43
4    57
5    45
6    59
7    47
8    59
Name: temp_num, dtype: object
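The `dtype: object` in the output above indicates that the extracted digits are still strings; only the `temp_num` column assigned with `astype` holds integers. The sketch below repeats the step on a three-row sample, writing the pattern as a raw string so that `\d` reaches the regex engine unchanged, and using `pd.to_numeric` as one possible way to convert.

```python
import pandas as pd

# Three sample values copied from the temperatures scraped above.
temps = pd.Series(["High: 55 °F", "Low: 45 °F", "High: 54 °F"], name="temp")

# The named group sets the name of the resulting Series; values are strings.
temp_strings = temps.str.extract(r"(?P<temp_num>\d+)", expand=False)

# Convert the digit strings to numbers before doing arithmetic.
temp_ints = pd.to_numeric(temp_strings)

print(temp_strings.name)
print(temp_ints.tolist())
```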

In [40]:
weather["temp_num"].mean()


51.555555555555557

In [41]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [42]:
weather[is_night]


| | desc | period | short_desc | temp | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight: Showers likely, mainly after 10pm. M... | Tonight | ShowersLikely | Low: 45 °F | 45 | True |
| 3 | Monday Night: Partly cloudy, with a low around... | MondayNight | Partly Cloudy | Low: 43 °F | 43 | True |
| 5 | Tuesday Night: Mostly clear, with a low around... | TuesdayNight | Mostly Clear | Low: 45 °F | 45 | True |
| 7 | Wednesday Night: Clear, with a low around 47. | WednesdayNight | Clear | Low: 47 °F | 47 | True |
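Beyond filtering with the boolean mask, the same `is_night` column can split the average temperature into day and night values. This is a small sketch using the nine temperatures shown earlier; the name `mean_by_period` is illustrative.

```python
import pandas as pd

# The nine temperature strings from the forecast scraped above.
weather = pd.DataFrame({
    "temp": ["High: 55 °F", "Low: 45 °F", "High: 54 °F", "Low: 43 °F",
             "High: 57 °F", "Low: 45 °F", "High: 59 °F", "Low: 47 °F",
             "High: 59 °F"],
})
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)
weather["is_night"] = weather["temp"].str.contains("Low")

# One mean per group: is_night == False (days) and is_night == True (nights).
mean_by_period = weather.groupby("is_night")["temp_num"].mean()

print(mean_by_period)
```

For this data the nights average 45.0 °F and the days 56.8 °F, consistent with the overall mean of about 51.56 computed earlier.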


In [1]:
# Import required modules
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Create the values as a dictionary of lists
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}

# Create a dataframe
raw_df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])

# View a dataframe
raw_df

| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |


In [5]:
# Create a variable with the URL to this tutorial
url = 'http://nbviewer.ipython.org/github/chrisalbon/code_py/blob/master/beautiful_soup_scrape_table.ipynb'

# Scrape the HTML at the url
r = requests.get(url)

# Turn the HTML into a Beautiful Soup object
soup = BeautifulSoup(r.text, 'html.parser')

In [6]:
# Create five lists to store the scraped data in
first_name = []
last_name = []
age = []
preTestScore = []
postTestScore = []

# Find the first element that has class="dataframe"
table = soup.find(class_='dataframe')

# Find all the <tr> tag pairs, skip the first one, then for each remaining row:
for row in table.find_all('tr')[1:]:
    # Create a variable of all the <td> tag pairs in each <tr> tag pair,
    col = row.find_all('td')

    # Create a variable of the string inside 1st <td> tag pair,
    column_1 = col[0].string.strip()
    # and append it to first_name variable
    first_name.append(column_1)

    # Create a variable of the string inside 2nd <td> tag pair,
    column_2 = col[1].string.strip()
    # and append it to last_name variable
    last_name.append(column_2)

    # Create a variable of the string inside 3rd <td> tag pair,
    column_3 = col[2].string.strip()
    # and append it to age variable
    age.append(column_3)

    # Create a variable of the string inside 4th <td> tag pair,
    column_4 = col[3].string.strip()
    # and append it to preTestScore variable
    preTestScore.append(column_4)

    # Create a variable of the string inside 5th <td> tag pair,
    column_5 = col[4].string.strip()
    # and append it to postTestScore variable
    postTestScore.append(column_5)

# Create a variable of the value of the columns
columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}

# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
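The loop above repeats the same three lines for each column. A more compact variant, sketched here on a small inline table rather than the live page, pairs each row's cells with a list of column names. The markup is an assumption modeled on the HTML that pandas typically renders for a DataFrame, and only three of the five columns are shown.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Assumed markup, modeled on a rendered DataFrame: index in <th>, values in <td>.
table_html = """
<table class="dataframe">
  <tr><th></th><th>first_name</th><th>last_name</th><th>age</th></tr>
  <tr><th>0</th><td>Jason</td><td>Miller</td><td>42</td></tr>
  <tr><th>1</th><td>Molly</td><td>Jacobson</td><td>52</td></tr>
</table>
"""
column_names = ["first_name", "last_name", "age"]

table = BeautifulSoup(table_html, "html.parser").find(class_="dataframe")

rows = []
for row in table.find_all("tr")[1:]:  # skip the first row, as in the loop above
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    rows.append(dict(zip(column_names, cells)))

# Passing columns= fixes the column order explicitly.
df = pd.DataFrame(rows, columns=column_names)
print(df)
```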

In [7]:
df

| | age | first_name | last_name | postTestScore | preTestScore |
|---|---|---|---|---|---|
| 0 | 42 | Jason | Miller | 25 | 4 |
| 1 | 52 | Molly | Jacobson | 94 | 24 |
| 2 | 36 | Tina | Ali | 57 | 31 |
| 3 | 24 | Jake | Milner | 62 | 2 |
| 4 | 73 | Amy | Cooze | 70 | 3 |
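Although the scraped table looks the same as the original, every value taken from a `<td>` is a string, so arithmetic on `age` or the score columns would not behave as expected. A minimal sketch of converting those columns follows, using a two-row sample; the `score_gain` column is a hypothetical addition for illustration.

```python
import pandas as pd

# Two sample rows in the form the scraper produces: every value is a string.
df = pd.DataFrame({
    "first_name": ["Jason", "Molly"],
    "age": ["42", "52"],
    "preTestScore": ["4", "24"],
    "postTestScore": ["25", "94"],
})

# Convert only the columns that are meant to be numeric.
numeric_columns = ["age", "preTestScore", "postTestScore"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric)

# With numeric dtypes, column arithmetic works as expected.
df["score_gain"] = df["postTestScore"] - df["preTestScore"]

print(df)
```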
