# Web scraping
## Background
### How does web browsing work?
We request information from a server by sending it a URL (Uniform Resource Locator).
- The client enters a URL into a web browser (e.g. Chrome);
- The web browser sends a request to the web server;
- The web server sends 'html', 'css', and 'javascript' files back to the web browser;
- The web browser renders these files into a human-readable web page.
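
The first step above starts from a URL. As a minimal sketch, the standard-library `urllib.parse.urlsplit` function shows the parts a browser reads from it. The IMDb address is the one used in the _urllib_ example of this notebook; the variable name `parts` is only illustrative.

```python
from urllib.parse import urlsplit

# Split a URL into the parts the browser uses to build its request.
parts = urlsplit("https://www.imdb.com/chart/top")

print(parts.scheme)  # protocol used to talk to the server
print(parts.netloc)  # host name of the web server
print(parts.path)    # resource requested from that server
```

The scheme and host tell the browser how and where to connect, and the path names the document it asks for.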

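### What is the structure of a webpage?
- HTML: the content and structure of the page;
- CSS: the presentation (layout, colors, fonts);
- JavaScript: the interactive behavior.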
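
To make the three layers concrete, here is a small sketch that walks a made-up page with the standard-library `html.parser` module. The page text and the `TagCollector` class are assumptions for illustration only, not part of any site used in this notebook.

```python
from html.parser import HTMLParser

# A tiny, made-up page: HTML carries the content, CSS the style,
# and JavaScript the behavior.
PAGE = """<html>
<head>
<style>h1 { color: red; }</style>
<script>console.log("loaded");</script>
</head>
<body><h1>Top movies</h1></body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(PAGE)
print(collector.tags)
```

The `style` tag holds the CSS and the `script` tag holds the JavaScript, while the remaining tags describe the HTML content itself.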

## Document Load
### How to load these documents into Python?
'urllib' and 'requests' are two of the most commonly used packages.
- 'urllib' is part of the Python standard library, so no installation is needed;
- 'requests' can be installed from the terminal with:
    - _pip install requests_
        - use pip3 for Python 3
        - add "--user" at the end to install into your user directory, e.g. on a shared server

### _urllib_ package

In [1]:
from urllib.request import urlopen # import the urlopen function from the urllib package
URL = "https://www.imdb.com/chart/top"
html = urlopen(URL) # send the request; returns an HTTPResponse object
#print(dir(html))
print(html)
#html_docs = html.read() # read the raw response body as bytes
html_docs = html.read().decode() # read the body and decode the bytes into a str (UTF-8 by default)
#print(html_docs)

<http.client.HTTPResponse object at 0x0000014B858875F8>
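
The cell above leaves the response open and assumes the page is UTF-8. A slightly more defensive variant might look like the sketch below. The helper names `decode_body` and `fetch_html` and the 10-second timeout are assumptions, not part of the original notebook.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is declared."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a str (hypothetical helper)."""
    # The with-block closes the connection once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Example call (needs network access):
# html_docs = fetch_html("https://www.imdb.com/chart/top")
```

Reading the charset from the response headers avoids garbled text when a server does not send UTF-8.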


### _requests_ package
_requests_ is the only _Non-GMO_ HTTP library for Python, safe for human consumption. Using __Python 3__ with _requests_ is strongly recommended.

In [2]:
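import requests
r = requests.get(URL) # send a GET request; URL is defined in the cell above
#print(dir(r))

#print(r.text) # body decoded to str
#print(r.content) # raw body as bytes
#print(r.content.decode())
#r.status_code # HTTP status code, e.g. 200
#r.apparent_encoding # encoding guessed from the body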

Query parameters can be passed to a page through the _params_ argument; _requests_ encodes them into the URL.

In [3]:
import webbrowser
URL = "http://www.google.com/search"
param = {"q": "bumblebee"}
r = requests.get(URL, params=param)
webbrowser.open(r.url)

True
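
To see what the _params_ argument does without opening a browser or touching the network, the request can be prepared but not sent. This is a minimal sketch; the variable name `prepared` is illustrative.

```python
import requests

# Build the request without sending it, to inspect the URL requests would use.
prepared = requests.Request(
    "GET", "http://www.google.com/search", params={"q": "bumblebee"}
).prepare()

print(prepared.url)
```

The dictionary is appended to the URL as a query string, which is the same address stored in `r.url` after a real request.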

#### Post
We test the post function using this [page](http://pythonscraping.com/pages/files/form.html). A POST request sends data to the server, and the server responds accordingly.

In [4]:
data = {'firstname': 'Yao', 'lastname': 'Chen'}
r = requests.post('http://pythonscraping.com/pages/files/processing.php', data=data)
print(r.text)

Hello there, Yao Chen!
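
As a rough illustration of what the _data_ argument puts on the wire, the same request can be prepared without sending it. This is only a sketch for inspection; the variable names are illustrative.

```python
import requests

data = {"firstname": "Yao", "lastname": "Chen"}

# Prepare the POST request without sending it.
prepared = requests.Request(
    "POST", "http://pythonscraping.com/pages/files/processing.php", data=data
).prepare()

print(prepared.headers["Content-Type"])  # how the body is encoded
print(prepared.body)                     # the form fields as sent to the server
```

Unlike _params_, the form fields travel in the request body and not in the URL.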


#### Uploading an image
We test file uploading using this [page](http://pythonscraping.com/files/form2.html).

In [5]:
with open('../img/PU.jpg', 'rb') as f: # open in binary mode; the file is closed after the upload
    file = {'uploadFile': f}
    r = requests.post('http://pythonscraping.com/pages/files/processing2.php', files=file)
print(r.text)

uploads/PU.jpg
The file PU.jpg has been uploaded.
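
The _files_ argument also accepts a `(filename, file object, content type)` tuple, which is handy when the data is already in memory. The sketch below prepares such an upload without sending it. The placeholder bytes and the `image/jpeg` content type are assumptions for illustration.

```python
import io

import requests

# Placeholder bytes standing in for a real image file (assumption).
fake_image = io.BytesIO(b"not a real JPEG")
files = {"uploadFile": ("PU.jpg", fake_image, "image/jpeg")}

# Prepare the upload without sending it, to inspect the encoded request.
prepared = requests.Request(
    "POST", "http://pythonscraping.com/pages/files/processing2.php", files=files
).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])             # encoding used for file uploads
print(b'filename="PU.jpg"' in prepared.body)  # file name is part of the body
```

File uploads are sent as a multipart body, with the field name and file name recorded alongside the file contents.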


#### Login
Use the post method to log in to a [website](http://pythonscraping.com/pages/cookies/login.html).

In [6]:
login = {'username': 'Yao', 'password': 'password'}
r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=login)
print(r.cookies.get_dict())

{'loggedin': '1', 'username': 'Yao'}


In [7]:
session = requests.Session() # a Session stores cookies and sends them with later requests
r = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=login)
print(r.cookies.get_dict())

{'loggedin': '1', 'username': 'Yao'}
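
Both cells print the same cookies, so the benefit of a session only shows on the next request. The sketch below illustrates this offline. The cookie is set by hand as a stand-in for a successful login, and the constant `WELCOME` and the variable names are illustrative.

```python
import requests

WELCOME = "http://pythonscraping.com/pages/cookies/welcome.php"

session = requests.Session()
# Stand-in for the cookie a successful login would store in the session.
session.cookies.set("loggedin", "1")

# Prepared through the session: the stored cookie is attached automatically.
with_session = session.prepare_request(requests.Request("GET", WELCOME))
# Prepared on its own: nothing is carried over.
without_session = requests.Request("GET", WELCOME).prepare()

print(with_session.headers.get("Cookie"))
print(without_session.headers.get("Cookie"))
```

With plain `requests.post` and `requests.get` calls, the cookies would have to be passed by hand through the _cookies_ argument on every request.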
