# Web Scraping

### The Internet and World Wide Web
- The Internet is a physical network of cables and routers, and a set of protocols for moving information across that network.
- The World Wide Web (WWW) is an information space on the Internet. It combines several concepts: 
    - Uniform Resource Locator (URL)
    - Hypertext Transfer Protocol (HTTP)
    - Hypertext Markup Language (HTML)

### Uniform Resource Locator
- URLs are a system of globally unique identifiers for resources on the Web and elsewhere.
- <code>scheme://host[:port]/path[?query][#fragment]</code>
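- scheme is a protocol such as http, https, etc.
- host is a domain name or IP address, such as google.com or localhost
- :port is optional and selects a service on the host, so a single host can serve separate websites on different ports (defaults: 80 for http, 443 for https)
- path is the path to a particular resource, such as /index.html
- ?query is optional: parameters passed to the resource as name=value pairs separated by &
- #fragment is optional: a location within the resource, such as a section of a page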
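
The URL template above can be explored with Python's standard library. This is a minimal sketch using `urllib.parse.urlsplit`; the URL itself is a made-up example chosen so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL chosen to exercise every component of the template.
url = "https://example.com:8080/docs/index.html?q=scraping&page=2#tools"

parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /docs/index.html
print(parts.query)     # q=scraping&page=2
print(parts.fragment)  # tools
```

When the URL has no explicit port, `parts.port` is `None` and the client falls back to the default port for the scheme.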

### Hypertext Transfer Protocol (HTTP)
- HTTP is the core communications protocol for retrieving data on the Web
- It consists of messages (requests and responses) sent between a client and a server

![request_response](request_response.png)
- HTTP Request

![get_request](get_request.png)
    - First line contains:
        - HTTP method, here GET 
        - Requested URL 
        - HTTP version
    - Rest of request may contain: 
        - Headers such as User-Agent: a description of the client (used, e.g., to decide whether to serve the mobile version of a website)
- HTTP Methods
    - GET
        - Most common method, used to get data
    - POST 
        - Used to send data to server, e.g. form entries, search queries
- HTTP Responses

![response](response.png)
    - First line contains:
        - HTTP version
        - HTTP response code
    - Rest of response contains:
        - Additional headers: Server, Content-Type, etc.
        - Requested Content
    - HTTP Response Codes
        - 1xx: informational
        - 2xx: Success
            - 200: OK
        - 3xx: Redirection
            - 301: Moved Permanently
        - 4xx: Client Errors
            - 404: Not Found
            - 403: Forbidden
        - 5xx: Server Errors
- HTTP GET Request Parameters

![query](query.png)
    - Query string with parameters sent in the URL of a GET request
    - Parameter names and values work like the keys and values of a Python dictionary
    - Should not be used for sensitive data (auth tokens, etc.), because URLs end up in browser history and server logs
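
The request/response cycle described above can be tried without touching the network. The sketch below uses only the standard library: it starts a throwaway server on localhost, sends one GET request with a query string, and inspects the response. The `EchoHandler` class, the `/search` path, the parameter names, and the User-Agent value are all made up for illustration.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlsplit


class EchoHandler(BaseHTTPRequestHandler):
    """Toy handler (hypothetical): echoes the query parameters back as JSON."""

    def do_GET(self):
        # self.path holds the request target, e.g. "/search?q=web+scraping&page=2"
        query = parse_qs(urlsplit(self.path).query)
        body = json.dumps(query).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request log line


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Parameter names and values start out as a dictionary ...
params = {"q": "web scraping", "page": 2}
# ... and are encoded into the query string of the request URL.
target = "/search?" + urlencode(params)

conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", target, headers={"User-Agent": "scraping-notes-demo"})
response = conn.getresponse()
status = response.status                        # HTTP response code
reason = response.reason                        # text that accompanies the code
content_type = response.getheader("Content-Type")
echoed = json.loads(response.read())            # requested content
conn.close()
server.shutdown()
server.server_close()

print(target)
print(status, reason)
print(content_type)
print(echoed)
```

Note that the server receives every value as a string (the page number comes back as `"2"`), so a scraper or API client that needs numbers has to convert them itself.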

### Hypertext Markup Language (HTML)
   - [LEARN MORE ABOUT HTML](https://www.w3schools.com/html/html_intro.asp)
   - Web browsers (Use Chrome) receive HTML documents from a web server or from local storage and render the documents into multimedia web pages. HTML describes the structure of a web page semantically and originally included cues for the appearance of the document.
   - HTML source code (View Page Source or Inspect)
![html_source](html_source.png)
    - HTML overview:
        - HTML elements are the building blocks of HTML pages; HTML elements are represented by tags; HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
        - Most tags come in pairs (opening and closing):
            - html
            - div
            - p 
            - a
            - body
        - Some do not (void elements, often called self-closing tags):
            - img
            - br
        - Extra whitespace and indentation between tags generally don't matter (unlike indentation in Python)
    - Set up a basic HTML page in your text editor
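
As a rough illustration of the tag structure described above, the sketch below holds a minimal page in a Python string and walks it with the standard library's `html.parser`. The page content and the `TagCollector` class are made up for this example; for real scraping, BeautifulSoup is usually more convenient.

```python
from html.parser import HTMLParser

# A minimal page (hypothetical content) with paired tags and two void tags.
PAGE = """<!DOCTYPE html>
<html>
  <body>
    <div>
      <p>Hello, <a href="https://example.com">world</a>!</p>
      <img src="cat.png" alt="A cat">
      <br>
    </div>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record every opening and closing tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


collector = TagCollector()
collector.feed(PAGE)
collector.close()

# Tags that were opened but never closed are the void ("self-closing") ones.
void_tags = [tag for tag in collector.opened if tag not in collector.closed]

print(collector.opened)  # ['html', 'body', 'div', 'p', 'a', 'img', 'br']
print(void_tags)         # ['img', 'br']
```

Saving the `PAGE` string to a file such as `index.html` and opening it in a browser gives the same basic page in rendered form.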

### Common Web Technologies
- CSS
    - .css files or code embedded in style tags
- JavaScript
    - .js files or code embedded in script tags
    - browser uses the code to create dynamic content
- Server-side languages
    - Java, Ruby (on Rails), PHP, Python (Django), JavaScript (Node.js)

### Scraping Overview
- Scraping is the process of programmatically extracting information from websites 
- Anything that you can view in a web browser can potentially be scraped
- Reasons to Scrape:
    - Some websites offer services (APIs) that allow you to get data directly. So why scrape?
        - Not all websites provide an API 
        - Not all of a website’s content is available through its API
        - APIs often use tokens to limit the amount of data that can be requested (With scraping there is, in principle, no limit)
- Note:
    - Credit all sources 
        - Publishing scraped content can be a copyright violation 
    - Don’t overload websites
        - Most sites will block you before you can do this
    - You are not anonymous on the web
        - Unless you take explicit steps (VPN, Tor, etc.) to do so
    - Follow the rules of robots.txt
- Common Web Resources:
    - Google Maps
    - OpenStreetMap
    - Mapbox
    - Twitter API
    - American Community Survey
    - Bureau of Labor Statistics
    - Wikipedia tables
    - Google Trends (No Need to Scrape)
- Tools:
    - [requests](https://2.python-requests.org/en/master/): Python module for retrieving web resources (mostly static) [tutorial](https://realpython.com/python-requests/#getting-started-with-requests)
    - [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): Python module for traversing and extracting elements from a web page.
    - [pandas.read_html()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html): reads a well-formatted html table into a pandas DataFrame
    - [selenium](https://selenium-python.readthedocs.io/): scrape JavaScript-heavy pages by automating a real web browser
    - [scrapy](https://docs.scrapy.org/en/latest/topics/selectors.html): Python framework for crawling websites and extracting data from them
    - regex: using the [re module](https://docs.python.org/3/library/re.html) to parse specific patterns; [good reference](https://www.tutorialspoint.com/python/python_reg_expressions.htm)
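
The notes above about robots.txt and not overloading websites can be sketched with the standard library's `urllib.robotparser`. The robots.txt content, the URLs, and the user-agent name below are made up for illustration; against a real site you would point the parser at `https://<host>/robots.txt` with `set_url()` and `read()` instead of parsing a string.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "scraping-notes-demo"  # hypothetical client name
urls = [
    "https://example.com/wiki/Web_scraping",
    "https://example.com/private/report.html",
    "https://example.com/wiki/HTTP",
]

# Keep only the URLs that the rules allow this client to fetch.
allowed = [url for url in urls if parser.can_fetch(USER_AGENT, url)]

# Use the site's requested delay, or fall back to 1 second if none is given.
delay = parser.crawl_delay(USER_AGENT) or 1

for i, url in enumerate(allowed):
    if i > 0:
        time.sleep(delay)  # pause between consecutive requests
    print("fetching", url)  # a real scraper would call requests.get(url) here
```

Here the `/private/` page is skipped and the two remaining pages are visited one second apart.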

```python
# Read HTML tables
import pandas as pd
pd.read_html("https://en.wikipedia.org/wiki/Economy_of_China", skiprows=2)[1]
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rank | CN¥ | Nominal(US$) | PPP(intl$.) | realgrowth(%) | Share(%) | Rank | CN¥ | Nominal(US$) | PPP(intl$.) | Share(%) | | |
| 1 | China | | 82712.20 | 12250.39 | 23589.60 | 6.9 | 100 | | 59660 | 8836 | 17015 | 100.0 | 1386395.0 |
| 2 | Guangdong | 1 | 8987.92 | 1331.19 | 2563.36 | 7.5 | 10.87 | 8 | 81089 | 12010 | 23127 | 136.0 | 109240.0 |
| 3 | Jiangsu | 2 | 8590.09 | 1272.27 | 2449.90 | 7.2 | 10.39 | 4 | 107189 | 17176 | 32570 | 180.0 | 79875.0 |
| 4 | Shandong | 3 | 7267.82 | 1076.43 | 2072.79 | 7.4 | 8.79 | 9 | 72851 | 10790 | 20777 | 122.0 | 99470.0 |
| 5 | Zhejiang | 4 | 5176.83 | 766.73 | 1476.44 | 7.8 | 6.26 | 5 | 92057 | 13634 | 26255 | 154.0 | 55645.0 |
| 6 | Henan | 5 | 4498.82 | 666.31 | 1283.07 | 7.8 | 5.44 | 19 | 47129 | 6980 | 13441 | 79.0 | 95062.0 |
| 7 | Sichuan | 6 | 3698.02 | 547.71 | 1054.68 | 8.1 | 4.47 | 22 | 44651 | 6613 | 12735 | 75.0 | 82330.0 |
| 8 | Hubei | 7 | 3652.30 | 540.94 | 1041.64 | 7.8 | 4.42 | 11 | 61971 | 9179 | 17674 | 104.0 | 58685.0 |
| 9 | Hebei | 8 | 3596.40 | 532.66 | 1025.70 | 6.7 | 4.35 | 18 | 47985 | 7107 | 13685 | 80.0 | 74475.0 |


```python
# Wikipedia scrape example:
# https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# Only needed if the time.sleep() calls below are enabled
# import time

base = 'https://en.wikipedia.org'
path = '/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft'
response = requests.get(base + path)
page = bs(response.text, 'html.parser')
bolds = page.find_all('b')[:-1]
a_tags = page.find_all('a')


def get_table_data(accident_page, header_name):
    '''
    Look up a header (e.g. 'Date', 'Flight origin') on an accident page
    and return the corresponding value, or None if the header is missing.
    Input 1: accident_page (bs4.BeautifulSoup object)
    Input 2: header_name, the table header text (<th>header_name</th>)
    Output: the table data text (<td>output</td>)
    '''
    header = accident_page.find('th', string=header_name)
    if header is not None:
        return header.find_next('td').text
    return None


def get_all_info(a_or_b_tag, accident_page):
    '''
    Collect the table data for one accident into an accident_info
    dictionary and append it to accidents_list.
    Input 1: an <a> or <b> tag from the page that lists all accidents;
    its text is used as the accident name.
    Input 2: accident_page (bs4.BeautifulSoup object) for that accident
    Output: none; accidents_list grows to [{}, {}, {}, ...]
    '''
    accident_info = {}
    accident_info['Date'] = get_table_data(accident_page, 'Date')
    accident_info['Destination'] = get_table_data(accident_page, 'Destination')
    accident_info['Fatalities'] = get_table_data(accident_page, 'Fatalities')
    accident_info['Flight origin'] = get_table_data(accident_page, 'Flight origin')
    accident_info['Name'] = a_or_b_tag.text
    accident_info['Operator'] = get_table_data(accident_page, 'Operator')

    accidents_list.append(accident_info)


# Create an empty list to collect the accident info dictionaries
accidents_list = []

# Some entries are formatted as <a><b>Incident</b></a>
for a in a_tags:
    if a.find('b') is not None:
        accident_path = a.get('href')
        accident_response = requests.get(base + accident_path)
        accident_page = bs(accident_response.text, 'html.parser')

        get_all_info(a, accident_page)

        # time.sleep() spaces out the requests and so lowers the risk of
        # being blocked. It is disabled here because it makes the run too slow.
        # time.sleep(2)

# Others are formatted as <b><a>Incident</a></b>
for b in bolds:
    if b.find('a') is not None:
        accident_path = b.find('a').get('href')
        accident_response = requests.get(base + accident_path)
        accident_page = bs(accident_response.text, 'html.parser')

        get_all_info(b, accident_page)

        # time.sleep(2)

all_accidents_info = pd.DataFrame(accidents_list).drop_duplicates()
all_accidents_info.to_csv('accidents.csv', index=False)
```

```python
# ACS Developer API example
# Variable list: https://api.census.gov/data/2016/acs/acs5/variables.html
# Other geographies: zcta, census tract,
# e.g. &for=tract:*&in=state:17 (FIPS code)

import pandas as pd
import requests

api = 'https://api.census.gov/data/2016/acs/acs5'

params = {
    'get': 'NAME,B01001_001E,B01001_026E,B02001_002E,B01002_001E,B07011_001E',
    'for': 'county:*',
    'in': 'state:51',
    # In case an API key is needed:
    # 'key': 'YOUR_API_KEY',
}

j = requests.get(api, params=params).json()

# Variables selected from the API:
# B01001_001E: total population; B01001_026E: total female population
# B02001_002E: white population; B01002_001E: median age
# B07011_001E: median income (past 12 months)

# The first row of the response holds the column names
df = pd.DataFrame(j[1:], columns=j[0]).drop(['state', 'county'], axis=1)

df.columns = ['County/City', 'Total_Population', 'Total_Female_Population',
              'White_Population', 'Median_Age', 'Median_Income']

# Create two more columns for data analysis
df['Female_Proportion'] = (df['Total_Female_Population'].astype(float)
                           / df['Total_Population'].astype(float))
df['White_Proportion'] = (df['White_Population'].astype(float)
                          / df['Total_Population'].astype(float))

# Remove the exact ', Virginia' suffix. str.rstrip(', Virginia') would instead
# strip any trailing characters drawn from that set of characters.
df['County/City'] = df['County/City'].str.replace(', Virginia$', '', regex=True)

df.to_csv('acs.csv', index=False)
```
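
The ACS example relies on the shape of the JSON response: a header row followed by one list of strings per geography. The sketch below reproduces that shape with made-up rows (the place names and numbers are invented, not Census data) and processes it with plain Python, which can help when checking the logic without an internet connection.

```python
# Made-up rows that mimic the shape of the response: a header row
# followed by one list of strings per geography.
rows = [
    ["NAME", "B01001_001E", "B01001_026E", "state", "county"],
    ["Example County, Virginia", "1000", "520", "51", "001"],
    ["Sample city, Virginia", "250", "125", "51", "510"],
]

# Split off the header and pair it with each data row.
header, *data = rows
records = [dict(zip(header, row)) for row in data]

for record in records:
    # str.removesuffix (Python 3.9+) drops the exact suffix; str.rstrip would
    # instead strip any trailing characters drawn from the given set.
    record["NAME"] = record["NAME"].removesuffix(", Virginia")
    # Values arrive as strings, so convert before doing arithmetic.
    record["female_share"] = int(record["B01001_026E"]) / int(record["B01001_001E"])

for record in records:
    print(record["NAME"], record["female_share"])
```

The string-to-number conversion is the same step that `.astype(float)` performs in the pandas version.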

### Project Resources
- Use [plotly](https://plot.ly/python/) instead of matplotlib to create interactive web plots
- Use [django](https://www.djangoproject.com/) to build web apps around your data analysis
- [Collaborating on GitHub](https://help.github.com/en/categories/collaborating-with-issues-and-pull-requests)
    - branches
    - pull requests
    - merge conflicts
