# <center> Please go to https://ccv.jupyter.brown.edu </center>

## <center> Congratulations on mastering basic Python! </center>
### <center> Let's move on to a new major topic, web scraping! </center>

## By the end of today, you will have learned about:
- Introduction to HTML
- Making a request to a webpage and creating a beautiful soup object
- Simple and advanced navigation through a soup object
- Scraping weather data

- Introduction to HTML
- <font color='LIGHTGRAY'> Making a request to a webpage and creating a beautiful soup object </font>
- <font color='LIGHTGRAY'> Simple and advanced navigation through a soup object </font>
- <font color='LIGHTGRAY'> Scraping weather data </font>

# Intro to HTML Navigation
https://www.dataquest.io/blog/web-scraping-tutorial-python/

- <font color='LIGHTGRAY'> Introduction to HTML </font>
- Making a request to a webpage and creating a beautiful soup object
- <font color='LIGHTGRAY'> Simple and advanced navigation through a soup object </font>
- <font color='LIGHTGRAY'> Scraping weather data </font>

## Making a request to a webpage and creating a beautiful soup object

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup


In [None]:
import pandas as pd

In [None]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(page)

https://www.restapitutorial.com/httpstatuscodes.html

In [None]:
print(page.status_code)

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
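
As a rough illustration of the rule above, the sketch below groups a status code by its first digit. The helper name `describe_status` is made up for this example; `HTTPStatus` comes from Python's standard library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a rough category for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Print the standard phrase and our rough category for a few common codes.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

With a real response you would pass `page.status_code` to the helper.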

## Exercise: make a request to a page of your choice, assign it to a variable, and print the status code. Use any variable name except for `page`.

In [None]:
print(page.content)

#### The raw content is hard to read. Let's turn it into a Beautiful Soup object, which we can print in a readable form and navigate with many useful attributes and methods.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

- <font color='LIGHTGRAY'> Introduction to HTML </font>
- <font color='LIGHTGRAY'> Making a request to a webpage and creating a beautiful soup object </font>
- Simple and advanced navigation through a soup object
- <font color='LIGHTGRAY'> Scraping weather data </font>

## Simple navigation through the soup object

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup.

In [None]:
print(soup.prettify())
print()
print(list(soup.children))

Let's look at each child's type.

In [None]:
print([type(item) for item in list(soup.children)])

As you can see, all of the items are BeautifulSoup objects. 
* The first is a Doctype object, which contains information about the type of the document. 
* The second is a NavigableString, which represents text found in the HTML document. 
* The final item is a Tag object, which contains other nested tags. 

The most important object type, and the one we’ll deal with most often, is the Tag object. The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects
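
To see the object types without downloading anything, here is a minimal sketch that parses a tiny hand-written page. The HTML string and the `demo_` names are assumptions made for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny hand-written page, used here instead of a downloaded one.
demo_html = "<!DOCTYPE html>\n<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Collect the class name of each top-level child.
kinds = [type(child).__name__ for child in demo_soup.children]
print(kinds)

# Keep only the Tag objects, which are the ones we can navigate into.
tags_only = [child for child in demo_soup.children if isinstance(child, Tag)]
print([tag.name for tag in tags_only])
```

Filtering with `isinstance` is one way to skip the whitespace strings that sit between tags.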

#### Extract the children of the html tag

In [None]:
html_tag = list(soup.children)[2]
print(html_tag)

In [None]:
print(list(html_tag.children))

#### Extract the children of the body tag

In [None]:
body_tag = list(html_tag.children)[3]
print(body_tag)

In [None]:
print(list(body_tag.children))

## Exercise: Extract the children of the head tag

#### Extract the text in the p tag using list indices

In [None]:
p_tag = list(body_tag.children)[1]
print(p_tag)

In [None]:
print(p_tag.get_text())
print(type(p_tag.get_text()))

## Exercise: Extract the text in the title tag

#### Extract the text in the p tag using dot notation

In [None]:
p_tag = body_tag.p
print(p_tag)

In [None]:
print(p_tag.get_text())

#### You can string multiple tags together using dot notation.

In [None]:
print(soup.prettify())
print()
print(soup.html.body.p.text)
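
One thing to keep in mind with dot notation: it returns only the first matching tag. The sketch below shows this on a small hand-written page (the HTML and the `demo_` names are assumptions for this example).

```python
from bs4 import BeautifulSoup

demo_html = "<html><head><title>Demo page</title></head><body><p>First</p><p>Second</p></body></html>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Chained dot notation walks down one tag name at a time.
first_p_text = demo_soup.html.body.p.get_text()
print(first_p_text)

# Dot notation only returns the first match, so the second p tag is not reachable this way.
# Asking for a tag that does not exist returns None rather than raising an error.
missing = demo_soup.body.table
print(missing)
```

Because a missing tag gives `None`, chaining further from it (for example `demo_soup.body.table.tr`) would fail with an `AttributeError`.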

## Exercise: extract the text in the title tag using multiple tag dot notation

## More advanced navigation

### Searching for tags

#### The `find` function

In [None]:
print(soup.find('p')) # find first instance of tag

In [None]:
print(soup.find('p').get_text()) # find first instance of tag and get text

#### The `find_all` function

In [None]:
print(soup.find_all('p')) # find all instances of tag

In [None]:
print(soup.find_all('p').get_text()) # raises AttributeError: find_all returns a list-like ResultSet, which has no get_text()

In [None]:
print(soup.find_all('p')[0].get_text()) # find all instances of tag and get text of first tag
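
The difference between the two functions is easiest to see side by side: `find` returns a single tag, while `find_all` returns a list-like collection of tags. Here is a minimal sketch on a hand-written snippet (the HTML and variable names are made up for this example).

```python
from bs4 import BeautifulSoup

demo_html = "<body><p>First</p><p>Second</p></body>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

first = demo_soup.find("p")       # a single Tag (or None if nothing matches)
all_p = demo_soup.find_all("p")   # a list-like ResultSet of Tags

print(first.get_text())

# get_text() belongs to each Tag, so loop over the results to read them all.
texts = [p.get_text() for p in all_p]
print(texts)

# When nothing matches, find returns None and find_all returns an empty result.
print(demo_soup.find("table"), len(demo_soup.find_all("table")))
```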

### Searching for tags by class and id

#### Let's make a request to a slightly more complicated web page

In [None]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

In [None]:
print(soup.find_all('p'))

In [None]:
print(soup.find_all(class_="outer-text")) # find all tags with the "outer-text" class

In [None]:
print(soup.find_all('p', class_='first-item')) # find all p tags with the "first-item" class

In [None]:
print(soup.find_all(id="first")) # find all tags with the "first" id
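
A tag can carry several classes at once, and `class_` matches if any one of them fits. The sketch below tries this on a hand-written snippet; the HTML only imitates the kind of page used above and is an assumption for this example.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
demo_html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Matches both tags that include "first-item" among their classes.
by_class = demo_soup.find_all(class_="first-item")
print([tag["id"] for tag in by_class])

# Combine a tag name with a class to narrow the search.
by_tag_and_class = demo_soup.find_all("p", class_="outer-text")
print(len(by_tag_and_class))

# An id should be unique on a page, so find (not find_all) is usually enough.
by_id = demo_soup.find(id="second")
print(by_id.get_text())
```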

## Exercise: find all tags with the 'inner-text' class

### Searching for tags using Selectors

* `p a`: finds all `a` tags inside of a `p` tag.
* `body p a`: finds all `a` tags inside of a `p` tag inside of a `body` tag.
* `html body`: finds all `body` tags inside of an `html` tag.
* `.outer-text`: finds all tags with a class of `outer-text`.
* `p.outer-text`: finds all `p` tags with a class of `outer-text`.
* `p#first`: finds all `p` tags with an id of `first`.
* `body p.outer-text`: finds any `p` tags with a class of `outer-text` inside of a `body` tag.

Find out more about selectors here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
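
A few of the selectors listed above can be tried on a small hand-written page. The HTML below is an assumption made for this example, not one of the pages used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
demo_html = """
<html><body>
<div><p class="inner-text">Inner <a href="https://example.com">link</a></p></div>
<p class="outer-text" id="first">Outer one</p>
<p class="outer-text">Outer two</p>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

links = demo_soup.select("p a")               # a tags inside a p tag
outer = demo_soup.select("body p.outer-text") # p tags with class outer-text inside body
first = demo_soup.select("p#first")           # p tags with id first

print([a.get_text() for a in links])
print([p.get_text() for p in outer])
print(first)

# select always returns a list; select_one returns the first match or None.
print(demo_soup.select_one(".inner-text").get_text())
```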

In [None]:
print(soup.select("div p"))

## Exercise: search for all tags with the 'inner-text' class using CSS selectors

- <font color='LIGHTGRAY'> Introduction to HTML </font>
- <font color='LIGHTGRAY'> Making a request to a webpage and creating a beautiful soup object </font>
- <font color='LIGHTGRAY'> Simple and advanced navigation through a soup object </font>
- Scraping weather data

# Scraping Weather Data
https://forecast.weather.gov/MapClick.php?lat=41.8239&lon=-71.412#.XkRoblNKglI

## Developer Tools in Google Chrome
You can start the developer tools in Chrome by clicking `View -> Developer -> Developer Tools`

The elements panel will show you all the HTML tags on the page, and let you navigate through them. It’s a really handy feature!

By right-clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel.

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast.

If you click around in the elements panel and explore the div, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.

We now know enough to download the page and start parsing it.

#### Download the web page containing the forecast.

In [None]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=41.8239&lon=-71.412#.XkRoblNKglI")

#### Create a BeautifulSoup object to parse the page.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

#### Find the element with id seven-day-forecast-list, and assign it to seven_day

In [None]:
seven_day = soup.find(id="seven-day-forecast-list")
print(seven_day)

#### Inside seven_day, find each individual forecast item.

In [None]:
forecast_items = seven_day.find_all(class_="forecast-tombstone")
print(list(forecast_items)[0])

## Exercise: Extract and print the first forecast item.

## Extracting information from the page

#### Use the `get_text()` method

In [None]:
tonight = forecast_items[0]  # the first forecast item, from the exercise above
print(tonight.prettify())
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
print(period)
print(short_desc) 

## Exercise: extract the temperature using the `get_text()` method

#### You can treat a tag like a dictionary, where the keys are attribute names and the values are attribute values

In [None]:
print(tonight.prettify())
img = tonight.find("img")
print(img)

In [None]:
desc = img['title']
print(desc)
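
Attribute access can be tried on a single hand-written tag. In this sketch the attribute values are made up, and `img_demo` is a name chosen for this example.

```python
from bs4 import BeautifulSoup

# Hand-written tag for illustration; the attribute values are made up.
demo_html = '<img src="night.png" alt="Clear" title="Tonight: Clear, with a low around 28.">'
img_demo = BeautifulSoup(demo_html, "html.parser").find("img")

print(img_demo["title"])      # square brackets raise KeyError if the attribute is missing
print(img_demo.get("width"))  # .get() returns None for a missing attribute instead
print(img_demo.attrs)         # all attributes as a plain dict
```

Using `.get()` is a reasonable choice when you are not sure every tag on a page carries the attribute you want.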

## Extracting all information from the page

In [None]:
print(seven_day.prettify())
period_tags = seven_day.find_all('p', class_='period-name')  # reused in the next cell
print(period_tags)

In [None]:
periods = [pt.get_text() for pt in period_tags]
print(periods)

In [None]:
short_descs = [sd.get_text() for sd in seven_day.find_all(class_="short-desc")]
temps = [t.get_text() for t in seven_day.find_all(class_="temp")]
descs = [d["title"] for d in seven_day.find_all("img")]

print(short_descs)
print(temps)
print(descs)

## Collect the same information using Selectors

In [None]:
period_tags = seven_day.select(".forecast-tombstone .period-name")
print(period_tags)

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".forecast-tombstone .short-desc")]
temps = [t.get_text() for t in seven_day.select(".forecast-tombstone .temp")]
descs = [d["title"] for d in seven_day.select(".forecast-tombstone img")]
print(short_descs)
print(temps)
print(descs)
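
Collecting each column separately works as long as every forecast item has every field. As an alternative sketch, the code below loops over the items one at a time, so a missing field becomes `None` and the rows stay aligned. The HTML is a hand-written fragment that only imitates the class names used above, and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the class names used above; not real forecast data.
demo_html = """
<ul id="seven-day-forecast-list">
  <li class="forecast-tombstone">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 28 F</p>
  </li>
  <li class="forecast-tombstone">
    <p class="period-name">Thursday</p>
    <p class="short-desc">Sunny</p>
  </li>
</ul>
"""
demo_seven_day = BeautifulSoup(demo_html, "html.parser")


def text_or_none(item, selector):
    """Return the text of the first match for selector inside item, or None."""
    found = item.select_one(selector)
    return found.get_text() if found is not None else None


# Build one dictionary per forecast item.
rows = []
for item in demo_seven_day.select(".forecast-tombstone"):
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
    })
print(rows)
```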

## Storing weather data in a DataFrame

In [None]:
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
print(weather)
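
Building a DataFrame from a dictionary of lists only works when all lists have the same length. The sketch below checks this first, using made-up values in place of the scraped lists (`demo_columns` and `weather_demo` are names chosen for this example).

```python
import pandas as pd

# Made-up values standing in for the scraped lists.
demo_columns = {
    "period": ["Tonight", "Thursday"],
    "short_desc": ["Mostly Clear", "Sunny"],
    "temp": ["Low: 28 F", "High: 45 F"],
    "desc": ["Tonight: Mostly clear.", "Thursday: Sunny."],
}

# pd.DataFrame raises a ValueError if the lists differ in length, so check first.
lengths = {name: len(values) for name, values in demo_columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f"Column lengths differ: {lengths}")

weather_demo = pd.DataFrame(demo_columns)
print(weather_demo.shape)
print(weather_demo.head())
```

If the check fails on real data, printing `lengths` shows which column is short and needs a closer look.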

## Writing the dataframe to a CSV or Excel file

In [None]:
weather.to_csv('data/weather.csv', index=False)
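
Note that `to_csv` does not create missing folders, so the `data` folder has to exist before the cell above runs. Here is a minimal sketch that creates the folder, writes the file, and reads it back. It uses a temporary folder and a made-up one-row table so that it does not touch your real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up one-row table standing in for the weather DataFrame.
weather_demo = pd.DataFrame({"period": ["Tonight"], "temp": ["Low: 28 F"]})

# A temporary folder keeps this demo from touching your real files.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    csv_path = out_dir / "weather.csv"

    weather_demo.to_csv(csv_path, index=False)  # index=False leaves out the row numbers
    round_trip = pd.read_csv(csv_path)          # read the file back to check it

print(round_trip)
```

For an Excel file, `weather_demo.to_excel(...)` works in a similar way, but it needs an Excel engine such as `openpyxl` to be installed.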