
Learn web scraping

by Aven, IMA, NYU Shanghai

What does web scraping do?

Even though some websites, like The New York Times and Twitter, provide well-documented APIs with access to their information and low-level operations, most others don't, or at least not in an ideal form. For websites that do not provide data in a convenient format such as CSV or JSON, people scrape and acquire their own version of the data. In general, websites are presented as HTML, which means that each web page is a structured document that can be analyzed. Web scraping basically fetches that HTML content, then sorts, analyzes, and presents the data in a structured, customized way.

Requests vs. urllib(2): Get data from the server.

There are different ways to fetch HTML content with code, and urllib2, as a Python built-in library, is usually the first one to catch a beginner's eye. These days, though, people use the Requests module instead because of its speed and readability. lxml is a fairly extensive library for parsing XML and HTML documents very quickly, and it even handles messed-up tags in the process. For a more general comparison of the two libraries, see the further reading and the code demo below.

# with urllib2
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2

gh_url = 'https://api.github.com'

req = urllib2.Request(gh_url)

# HTTP basic auth takes three pieces: a password manager, an auth handler,
# and an opener built from that handler
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, gh_url, 'user', 'pass')

auth_manager = urllib2.HTTPBasicAuthHandler(password_manager)
opener = urllib2.build_opener(auth_manager)

urllib2.install_opener(opener)

handler = urllib2.urlopen(req)

print (handler.getcode())
print (handler.headers.getheader('content-type'))

# with requests
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests

# one call handles both the request and the basic auth
r = requests.get('https://api.github.com', auth=('user', 'pass'))

print (r.status_code)
print (r.headers['content-type'])

Lxml vs. BeautifulSoup: Analyze and extract desired data.

Generally speaking, BeautifulSoup is older while lxml is up to date and robust (not confirmed yet). Plus, I haven't really put time into BeautifulSoup at all. Important tip: use a CSS selector as the element locator.
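
For instance, here is a minimal sketch of that tip; it assumes the cssselect package is installed alongside lxml, and the URL and selector are placeholders, not from any demo in this repo:

# a minimal sketch: CSS selectors as element locators with lxml
# (assumes the cssselect package is installed; URL and selector are placeholders)
import requests
from lxml import html

r = requests.get('https://example.com')
tree = html.fromstring(r.content)

# pick elements with ordinary CSS syntax instead of XPath
for link in tree.cssselect('a'):
    print (link.get('href'), link.text_content())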

Requests + lxml demo: download an image from a website.

Requests sends a request to the server and gets the page source as the response; lxml takes the raw HTML from Requests and structures the content for later manipulation.
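
A minimal sketch of that pipeline might look like the following; the URL is a placeholder, and it assumes the first <img> tag on the page points at a downloadable image:

# a minimal sketch of the requests + lxml image download
import requests
from lxml import html

page = requests.get('https://example.com')   # placeholder URL
tree = html.fromstring(page.content)
tree.make_links_absolute(page.url)           # turn relative src attributes into full URLs

# grab the src attribute of every <img> tag
img_urls = tree.xpath('//img/@src')

# download the first image and write the raw bytes to disk
img = requests.get(img_urls[0])
with open('image.jpg', 'wb') as f:
    f.write(img.content)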

What about Selenium (vs. requests + lxml)?

What role does Selenium play compared with the requests + lxml combination? To understand this, some background information:

  • static vs. dynamic web page: What's the difference between a static and a dynamic website? This article gives a basic description of the difference between the two main kinds of websites. The simplest and most important point (for web scraping) is how the content is presented: fixed, or dynamically generated.

  • structured vs. unstructured web content: Some websites are built to display structured content: fixed HTML with CSS styling that responds to no actions. Others format their content with templates, and the data being presented has to be collected manually; presentation in the latter case is regarded as unstructured. The process of transforming this unstructured content into a structured form, such as a spreadsheet, XML, or CSV representation, is what we call web scraping.

Requests + lxml works well with static web pages, but not so much with dynamic ones. When scraping Sina Weibo, for example, a user's posts will not be fully loaded unless the user scrolls down, so Selenium's simulated browsing is a better approach.

A note to mention here: lxml will sometimes be used together with Selenium for complicated cases, but it is not always necessary. Selenium has its own easy-to-use methods for extracting text or other values from elements, as the sketch below shows.
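
Here is a minimal sketch of simulated browsing, assuming Chrome with chromedriver installed; the URL and the .post-text selector are placeholders, not Weibo's real markup:

# a minimal sketch of simulated browsing with Selenium
# (assumes Chrome + chromedriver; URL and selector are placeholders)
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/feed')

# scroll to the bottom a few times so lazily-loaded posts appear
for _ in range(3):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)   # give the page time to render the new content

# Selenium extracts text on its own, no lxml required
for post in driver.find_elements_by_css_selector('.post-text'):
    print (post.text)

driver.quit()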

Selenium demo: Scraping Weibo.

  • Weibot: Webpage
  • Weibot: Github

Not web scraping: acquire structured content.

Sina Weibo also provides an API for developers, and here is a tutorial. Sina's officially authorized Python SDK is here.