<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<CENTER>
<H1 style="color:red">
Accessing Web Resources with Python
</H1>
</CENTER>

In [None]:
from __future__ import print_function

## <font color='red'>Reference Documents</font>

* <a href="http://zetcode.com/python/requests/">Python Requests Tutorial</a>
* <a href="https://realpython.com/python-requests/">Python’s Requests Library (Guide)</a>
* <a href="https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/">What is web scraping</a>
* <a href="https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184">Building a Web Scraper from start to finish</a>
* <a href="https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/">Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup</a>

## <font color='red'>What will be Covered?</font>
+ Accessing Web Pages with Requests
+ Web Scraping with JSON
+ Web Scraping with Beautiful Soup

![fig_scrap](https://miro.medium.com/max/1400/1*4BnBQE9Bu-EQ-gGz25x8pg.png)
Image Source: gurutechnolabs.com

## <font color='red'>Python requests</font>

* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).
* Requests is a third-party package, not part of the Python standard library. It can be installed with `pip install requests`.

In [None]:
import requests as reqs

print(reqs.__version__)
print(reqs.__copyright__)

The json module enables you to convert between JSON and Python Objects. 

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

In [None]:
resp = reqs.get("http://www.webcode.me")

print(resp.text)

We can use the module `re` to strip the HTML markup from the content.

In [None]:
import re

resp = reqs.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)
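
The same regular expression can be tried offline on a small string. The sketch below is only an illustration: `sample_html` is a made-up stand-in for `resp.text`, and `strip_tags` is a hypothetical helper that also decodes HTML entities and collapses leftover whitespace. A regular expression is a rough tool for this job and will not handle every page correctly.

```python
import re
from html import unescape

# Made-up markup standing in for resp.text (assumption for illustration)
sample_html = "<html><body><h1>Title</h1><p>Fish &amp; chips</p></body></html>"

def strip_tags(html_text):
    """Replace anything that looks like a tag with a space, then tidy up."""
    no_tags = re.sub(r"<[^<]+?>", " ", html_text)
    # Decode entities such as &amp; and collapse runs of whitespace
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()

stripped_sample = strip_tags(sample_html)
print(stripped_sample)
```

Replacing each tag with a space rather than an empty string keeps the words of adjacent elements from running together.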

**HTTP Request**
- An HTTP request is a message sent from the client to the server to retrieve information or to trigger an action.
- The `request()` function of `requests` creates and sends a new request.
- Convenience functions such as `get()`, `post()`, and `put()` wrap `request()` for specific HTTP methods.

Create a `GET` request and send it to the web site.

In [None]:
resp = reqs.request(method='GET', url="http://www.webcode.me")
print(resp.text)

**Getting the Status of a Web Page**
- We perform an HTTP request with the `get()` method and check for the returned status.
- 200 is a standard response for a successful HTTP request and 404 tells that the requested resource could not be found.

In [None]:
resp = reqs.get("http://www.webcode.me")
print(resp.status_code)

In [None]:
resp = reqs.get("http://www.webcode.me/news")
print(resp.status_code)
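
Status codes fall into families (2xx success, 3xx redirection, 4xx client error, 5xx server error). As a rough sketch, the standard library's `http.HTTPStatus` can turn a numeric code such as `resp.status_code` into a readable label. The helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return "{} {} ({})".format(code, phrase, family)

for code in (200, 404):
    print(describe_status(code))
```

In the notebook, the same helper could be called as `describe_status(resp.status_code)`.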

**Other Information**

In [None]:
print("\t URL:      ", resp.url)
print("\t Encoding: ", resp.encoding)
print("\t Time:     ", resp.elapsed)

**`requests` `head()` Method**
- The `head()` method retrieves document headers. 
- The headers consist of fields, including date, server, content type, or last modification time.

In [None]:
resp = reqs.head("http://www.webcode.me")

print("Server: " + resp.headers['server'])
print("Last modified: " + resp.headers['last-modified'])
print("Content type: " + resp.headers['content-type'])

**`requests` `get()` Method**
- The `get()` method issues a GET request to the server. 
- The GET method requests a representation of the specified resource.

The following script sends a variable with a value to the `httpbin.org` server. The variable is specified directly in the URL.

In [None]:
resp = reqs.get("https://httpbin.org/get?name=Peter")
print(resp.text)

The `get()` method takes a `params` argument in which we can specify the query parameters as a dictionary.

We send a GET request to the web site and pass the data through the `params` argument.

In [None]:
payload = {'name': 'Peter', 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

In [None]:
print(resp.text)
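
To see what kind of URL results from a parameter dictionary without contacting any server, the encoding can be reproduced with the standard library. This is a sketch of the outcome, not of how `requests` works internally. The names `demo_payload`, `full_url`, and `decoded` are chosen here for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://httpbin.org/get"
demo_payload = {'name': 'Peter', 'age': 23}

# Encode the dictionary as key=value pairs joined by '&'
full_url = base_url + "?" + urlencode(demo_payload)
print(full_url)

# Going the other way: recover the parameters from the URL.
# Values come back as lists of strings, because a key may repeat.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Note that the integer `23` travels as the string `'23'`, since a query string carries text only.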

**Other Methods**

```python
requests.post('https://httpbin.org/post', data={'key':'value'})
requests.put('https://httpbin.org/put', data={'key':'value'})
requests.delete('https://httpbin.org/delete')
requests.patch('https://httpbin.org/patch', data={'key':'value'})
requests.options('https://httpbin.org/get')
```
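
The `post()` row in the summary table lists both a `data` and a `json` argument. The sketch below shows the difference by preparing two requests locally with `requests.Request(...).prepare()`, which builds the request without sending it. The variable names are illustrative only.

```python
import json
import requests as reqs

post_url = "https://httpbin.org/post"

# prepare() builds the request locally; nothing is sent over the network
form_req = reqs.Request("POST", post_url, data={"key": "value"}).prepare()
json_req = reqs.Request("POST", post_url, json={"key": "value"}).prepare()

# data= produces a form-encoded body
print(form_req.headers["Content-Type"], form_req.body)

# json= produces a JSON document as the body
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

Which of the two to use depends on what the receiving server expects.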

### Summary of `requests` Methods 

| Method	| Description |
| :--- | :--- |
| delete(url, args)	| Sends a DELETE request to the specified url | 
| get(url, params, args)	| Sends a GET request to the specified url | 
| head(url, args)	| Sends a HEAD request to the specified url | 
| patch(url, data, args)	| Sends a PATCH request to the specified url | 
| post(url, data, json, args)	| Sends a POST request to the specified url | 
| put(url, data, args)	| Sends a PUT request to the specified url | 
| request(method, url, args)	| Sends a request of the specified method to the specified url| 

---

# <font color='red'>Web Scraping</font>
> Web scraping is a mechanism for collecting large amounts of data from web pages and storing it in whatever format is required, so that the extracted data can then be analyzed.

![fig_json](https://daveberesford.co.uk/wp-content/uploads/2019/02/data-scraping-960x594.png)
Image Source: daveberesford.co.uk

Web scraping is used to extract or “scrape” data from any web page on the Internet.
Web scraping is performed using a “**web scraper**” (or a “bot” or a “web spider” or “web crawler”). 
A web-scraper is a program that goes to web pages, downloads the contents, extracts data out of the contents and then saves the data to a file or a database.

Web scraping involves a three-step process:

1. **Step 1**: Send an HTTP request to the webpage you want to scrape
     - The server responds to the request by returning the HTML content of the target webpage.
2. **Step 2**: Parse the HTML content
     - A parser is needed to create a nested structure of the HTML data. 
3. **Step 3**: Pull data out of HTML
     - We use Python packages such as `json` and Beautiful Soup to pull out data and store them.
     
![fig_scrap](https://www.scrapehero.com/wp/wp-content/uploads/2018/01/xhow-does-a-web-scraper-work-simple-2.png.pagespeed.ic.MeNRriGmi9.webp)
Image Source: scrapehero.com
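
Steps 2 and 3 can be sketched offline with the standard library's `html.parser`. In this minimal illustration, Step 1 is replaced by a hard-coded string (`sample_page`, made up for the example), and `LinkCollector` is a hypothetical parser that pulls out the target of every link.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Step 1 stand-in: in practice this text would come from an HTTP response
sample_page = ('<html><body><a href="/a.html">A</a>'
               '<p>text</p><a href="/b.html">B</a></body></html>')

# Step 2: parse the HTML; Step 3: read the extracted data
collector = LinkCollector()
collector.feed(sample_page)
print(collector.links)
```

Beautiful Soup, introduced later, offers a more convenient interface for the same kind of task.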

Web scrapers crawl websites, extract data from them, transform the data into a usable structured format, and load it into a file or database for subsequent use.

A typical web scraper has the following components:

![fig_scrap](https://www.scrapehero.com/wp/wp-content/uploads/2018/01/xComponents-of-web-scraper1.png.pagespeed.ic.uNMyC_Y5W4.webp)

## <font color='blue'>Web Scraping with JSON</font>

**First Example**

In [None]:
url = "https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY"
page_content = reqs.get(url)

Process the data with JSON

In [None]:
import json
if page_content.status_code == 200:
    json_page = json.loads(page_content.text)

Print the keys:

In [None]:
for x in json_page:
    print(x)

Print the keys and values:

In [None]:
for x in json_page:
    print(x+" -->", json_page[x])
    print()

In [None]:
from pprint import pprint
pprint(json_page)
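
`json.loads()` turns a JSON object into a Python dictionary, so the usual dictionary tools apply. The sketch below uses a made-up string (`sample_text`) shaped like the response above. Its values are invented, and it assumes that a response may lack some keys, such as `hdurl`. Using `.get()` with a fallback avoids a `KeyError` in that case.

```python
import json

# Invented sample shaped like the API response (values are not real data)
sample_text = ('{"date": "2020-01-01", "media_type": "video", '
               '"title": "Sample title", "url": "https://example.com/clip"}')

sample_page = json.loads(sample_text)
print(type(sample_page))

# Prefer the high-resolution link when present, otherwise fall back to "url"
image_url = sample_page.get("hdurl", sample_page["url"])
print(image_url)
```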

Plot images:

In [None]:
%matplotlib inline
from skimage import io

io.imshow(io.imread(json_page["url"]))
io.show()

In [None]:
io.imshow(io.imread(json_page["hdurl"]))
io.show()

<font color="red">If you want to download the file on your local system:</font>

In [None]:
import os
import urllib.request

url_name = json_page["url"]
loc_file_name = os.path.basename(url_name)

urllib.request.urlretrieve(url_name, loc_file_name)
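
Taking the base name of the whole URL works when the URL ends with the file name. If the URL carries a query string, a slightly more careful sketch is to split off the path first. The helper `local_name_from_url` and the example URL are hypothetical.

```python
import os
from urllib.parse import urlsplit

def local_name_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    path = urlsplit(url).path
    return os.path.basename(path)

example_url = "https://example.com/images/2001/picture.jpg?size=large"
example_name = local_name_from_url(example_url)
print(example_name)
```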

**Second Example**

We will fetch Citi Bike NYC (bike sharing system) data from <a href="https://feeds.citibikenyc.com/stations/stations.json">https://feeds.citibikenyc.com/stations/stations.json</a> and convert it into a dictionary.

Get the JSON string data from Citi Bike NYC using the `requests` library

In [None]:
json_response= reqs.get("https://feeds.citibikenyc.com/stations/stations.json")

Check type of json_response object

In [None]:
print(type(json_response.text))

Parse the data with the `loads()` function of the `json` library

In [None]:
bike_dict = json.loads(json_response.text)

Check the type of `bike_dict`

In [None]:
print(type(bike_dict))

List all the keys

In [None]:
for key in bike_dict:
    print(key)

Get the first entry stored under the `stationBeanList` key

In [None]:
print(bike_dict['stationBeanList'][0]) 

In [None]:
print(json.dumps(bike_dict['stationBeanList'][0], indent=3)) 
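
Since the value under `stationBeanList` is a list of dictionaries, it can be filtered with a list comprehension. The sketch below runs on a small invented feed (`sample_feed`). The field names `stationName` and `availableBikes` are assumptions about the feed's layout and should be checked against the keys printed above.

```python
import json

# Invented data; field names are assumed, not taken from the live feed
sample_feed = json.loads("""
{"executionTime": "2020-01-01 12:00:00 PM",
 "stationBeanList": [
   {"id": 1, "stationName": "Station A", "availableBikes": 5},
   {"id": 2, "stationName": "Station B", "availableBikes": 0},
   {"id": 3, "stationName": "Station C", "availableBikes": 12}
 ]}
""")

# Keep only the names of stations that currently have at least one bike
stations_with_bikes = [s["stationName"]
                       for s in sample_feed["stationBeanList"]
                       if s["availableBikes"] > 0]
print(stations_with_bikes)
```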

---

## <font color='blue'>Python Beautiful Soup</font>

- Web scraping allows you to download the HTML of a website and extract the data that you need.
- Beautiful Soup is a Python library for scraping data from websites.
- Beautiful Soup creates a parse tree from parsed HTML and XML documents.

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
source = reqs.get("http://www.webcode.me")
print(source)

**Create a beautiful soup object**

In [None]:
mysoup = bso(source.text, 'html.parser')
print(mysoup.prettify())

**Title of the page**

In [None]:
print(mysoup.title)

**Get the tag name**

In [None]:
print(mysoup.title.name)

**Get values**

In [None]:
print(mysoup.title.string)

In [None]:
print(mysoup.title.text)

**Beginning navigation**

In [None]:
print(mysoup.title.parent.name)

**Getting specific tags**
- HTML is made up of tags. It stores all of its data in them, and in the midst of all that clutter lies the data we need. 
- The `find` method searches for the first tag with the needed name.
- The `find_all` method searches for all tags with the needed tag name and returns them as a list.

Assume that we want to find paragraph tags `<p>`:

In [None]:
print(mysoup.p)

In [None]:
print(mysoup.find('p'))
print(type(mysoup.find('p')))

In [None]:
print(mysoup.find_all('p'))
print(type(mysoup.find_all('p')))

In [None]:
print(mysoup.find_all('p')[-1])

In [None]:
for i, paragraph in enumerate(mysoup.find_all('p')):
    print("Type: {}".format(type(paragraph)))
    print("Inner Text {}: {}".format(i+1, paragraph.text))

In [None]:
body = mysoup.find_all('body')
print(body)

In [None]:
print("Type body:       ", type(body))
print("Type inner body: ", type(body[0]))

In [None]:
print(body[0].find_all('p'))

In [None]:
print(mysoup.find_all('body')[0].find_all('p'))
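
The difference between `find` and `find_all` is easy to check on a small document that does not depend on a live website. In this sketch, `sample_html` is made up for the example. Note that `find` returns `None` when no tag matches, which is worth testing for before accessing attributes.

```python
from bs4 import BeautifulSoup as bso

# Made-up page used only for this illustration
sample_html = """
<html><head><title>Sample page</title></head>
<body>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""
sample_soup = bso(sample_html, "html.parser")

first_p = sample_soup.find("p")        # first matching tag only
all_p = sample_soup.find_all("p")      # list-like collection of all matches
missing = sample_soup.find("table")    # no <table> in the page, so None

print(first_p.get_text())
print([p.get_text() for p in all_p])
print(first_p["class"])                # class values come back as a list
print(missing)
```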

**Grab the text**

- Use the method `get_text`.

In [None]:
print(mysoup.get_text())

### <font color="blue">Web Scraping Rules</font>

![fig_ethics](https://hackernoon.com/hn-images/0*MPt2rectMhwklT63.jpg)

As reference, check: <a href="https://info.scrapinghub.com/web-scraping-guide/web-scraping-best-practices">The Web Scraping Best Practices Guide</a> or watch the video <a href="https://www.youtube.com/watch?v=i7DEy-ZB_Lk">Is Web Scraping Legal?</a>

- Check a website’s Terms and Conditions before you scrape it.
- Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
- The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
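
One way to respect the "one request per second" guideline is to space out the calls explicitly. The sketch below is a minimal, hypothetical helper (`polite_fetch`). The `fetch` argument is any function that takes a URL, for example `reqs.get`. The demo uses `str.upper` as a stand-in so that no network access is needed.

```python
import time

def polite_fetch(urls, fetch, min_interval=1.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, starting calls at least
    min_interval seconds apart."""
    results = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait only for the part of the interval not already elapsed
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function and a short interval
demo_results = polite_fetch(["page-1", "page-2"], fetch=str.upper,
                            min_interval=0.05)
print(demo_results)
```

With real pages, a call might look like `polite_fetch(list_of_urls, fetch=reqs.get)`, which keeps the default one-second spacing.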

### Exercise 1
- Get information from the website `https://astg606.github.io/py_courses/virtual_courses/`
- Print the list of all the classes offered.

### Exercise 2

- Access the website `https://www.marketwatch.com`, and
- Extract the current values (together with the changes and percentages) of the three major indices that appear on the top.

## <font color="blue">Application</font>

- We want to get all book names on the historic New York Times Best Sellers lists (Business section)
- The purpose is to:
     1. Help to compile my reading list in 2020
     2. Serve as reference to use Python for simple web analytics
- We use the Python packages: `Pandas`, `Requests` and `Beautiful Soup`
- We save data in `pickle` and `csv` formats.

The example was taken from: <a href="https://towardsdatascience.com/building-my-2020-reading-list-with-a-simple-python-script-b610c7f2c223">Building my 2020 reading list with a simple Python script</a> by Pan Wu.

In [None]:
import pandas as pd

# Create an empty Pandas dataframe
nylist = pd.DataFrame()

beg_year = 2013
end_year = 2020
for the_year in range(beg_year, end_year):
    for the_month in range(1, 13):
        cur_month = str(the_month).zfill(2) # month in two digits
        # First build the URL for this year and month, then fetch the page with Requests
        url = 'https://www.nytimes.com/books/best-sellers/{0}/{1}/01/business-books/'.format(the_year, cur_month)
        page = reqs.get(url)
        print(" --  try: {0}, {1} -- ".format(the_year, cur_month))
        
        # Ensure proper result is returned
        if page.status_code != 200:
            print("      Missing data for Year {} and Month {}".format(the_year, cur_month))
            continue
        
        # one may want to use BeautifulSoup to parse the right elements out
        soup = bso(page.text, 'html.parser')
        
        # the specific class names are unique for this URL and they don't change across all URLs
        top_list = soup.findAll("ol", {"class": "css-12yzwg4"})[0].findAll("div", {"class": "css-xe4cfy"})
        print("Year: {} - Month: {} - How many in the top list: {}".format(the_year, the_month, len(top_list)))
        
        # loop through the Best Seller list in each Year-Month, and append the information into a pandas DataFrame
        for i in range(len(top_list)):
            book   = top_list[i].contents[0]
            title  = book.findAll("h3", {"class": "css-5pe77f"})[0].text
            author = book.findAll("p",  {"class": "css-hjukut"})[0].text
            review = book.get("href")
            # print("{0}, {1}; review: {2}".format(title, author, review))
            one_item = pd.Series([the_year, the_month, title, author, i+1, review], 
                                 index=['year', 'month', 'title', 'author', 'rank', 'review'])
            # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
            nylist = pd.concat([nylist, one_item.to_frame().T], ignore_index=True)

# write out the result to a pickle file for easy analysis later.
nylist.to_pickle("nylist.pkl")
nylist.to_csv("nylist.csv", index=False)

In [None]:
nylist

### Exercise 3

- Write a Python script that reads the pickle file `nylist.pkl` or the csv file `nylist.csv` and prints its content.