# Web Scraping with Python

Feng Li

School of Statistics and Mathematics

Central University of Finance and Economics

[feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)

[https://feng.li/python](https://feng.li/python)

# What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.


In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamentals of web scraping and crawling in Part I and turns to advanced topics in Part II. I suggest that all readers study the first part carefully and consult the more specialized material in the second part as needed.

# Your First Web Scraper

## Let's try a toy example first

In [1]:
from urllib.request import urlopen
html = urlopen('https://feng.li/python/')
print(html.read())

b'<!doctype html>\n<html lang="en-US" class="respect-color-scheme-preference">\n<head>\n\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<title>Python\xe7\xa8\x8b\xe5\xba\x8f\xe8\xae\xbe\xe8\xae\xa1 &#8211; Dr. Feng Li</title>\n<meta name=\'robots\' content=\'max-image-preview:large\' />\n<link rel=\'dns-prefetch\' href=\'//s.w.org\' />\n<link rel="alternate" type="application/rss+xml" title="Dr. Feng Li &raquo; Feed" href="https://feng.li/feed/" />\n<link rel="alternate" type="application/rss+xml" title="Dr. Feng Li &raquo; Comments Feed" href="https://feng.li/comments/feed/" />\n\t\t<script>\n\t\t\twindow._wpemojiSettings = {"baseUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/13.1.0\\/72x72\\/","ext":".png","svgUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/13.1.0\\/svg\\/","svgExt":".svg","source":{"concatemoji":"https:\\/\\/feng.li\\/wordpress\\/wp-includes\\/js\\/wp-emoji-release.min.js?ver=5.8.2"}};\n\t\t\t!function

The raw bytes above are hard to read. Parsing the response with BeautifulSoup gives a cleaner, structured view of the same page.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://feng.li/python/')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs)

<!DOCTYPE html>

<html class="respect-color-scheme-preference" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Python程序设计 – Dr. Feng Li</title>
<meta content="max-image-preview:large" name="robots"/>
<link href="//s.w.org" rel="dns-prefetch"/>
<link href="https://feng.li/feed/" rel="alternate" title="Dr. Feng Li » Feed" type="application/rss+xml"/>
<link href="https://feng.li/comments/feed/" rel="alternate" title="Dr. Feng Li » Comments Feed" type="application/rss+xml"/>
<script>
			window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.1.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/13.1.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/feng.li\/wordpress\/wp-includes\/js\/wp-emoji-release.min.js?ver=5.8.2"}};
			!function(e,a,t){var n,r,o,i=a.createElement("canvas"),p=i.getContext&&i.getContext("2d");function s(e,t){var a=String.fromCharCode;p.cl
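
Both cells above assume that the request succeeds. A minimal sketch of a more defensive fetch, assuming only the standard library; the helper name `fetch` and the 10-second timeout are illustrative choices, not part of the original notebook:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error {} for {}'.format(err.code, url))
    except URLError as err:
        # The resource could not be reached (bad host name, no network, ...).
        print('Could not reach {}: {}'.format(url, err.reason))
    return None
```

Calling `fetch('https://feng.li/python/')` would then return the page bytes, or `None` with a short message, instead of stopping the notebook with a traceback.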

## The complete case

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://feng.li/python/')
bs = BeautifulSoup(html.read(), 'html.parser')
# find_all is the current name of the deprecated findAll alias
entries = bs.find_all('div', {'class': 'entry-content'})
for entry in entries:
    print(entry.get_text())




Contents1 课程简介2 授课教师3 参考书4 讲课视频5 幻灯片
课程简介
Python程序设计是面向财经和统计专业学生开设的一门以应用为主的编程课程，该课程最早由李丰老师在中央财经大学以公开讲座的形式开设，后成为中央财经大学金融、会计和MBA项目的核心课程。
授课教师


李丰博士现任中央财经大学统计与数学学院副院长、副教授、硕士生导师。博士毕业于瑞典斯德哥尔摩大学，研究领域包括贝叶斯统计学，预测方法，大数据分布式学习等。曾获瑞典皇家统计学会 Cramér 奖，国际贝叶斯学会青年奖励基金， 第二届全国高校经管类实验教学案例大赛二等奖。主持和参与多项国家自然科学基金项目。
李丰博士最新研究成果发表在统计期刊 Journal of Computational and Graphical Statistics，Journal of Business and Economic Statistics, Statistical Analysis and Data Mining，经济与管理学期刊 International Journal of Forecasting，Journal of Business Research，运筹学期刊European Journal of Operational Research, Journal of the Operational Research Society，人工智能期刊 Expert Systems with Applications，医学期刊 BMJ Open, Journal of Surgical Research, Journal of Affective Disorders等。同时著有 Bayesian Modeling of Conditional Densities，《大数据分布式计算与案例》和《统计计算》。


参考书
Python可以被广泛地使用在财经领域，以下列出一些零基础书目。
类别书名中译本数据分析Python for Data Analysis (by Wes McKinney)利用Python进行数据分析（原书第2版）数据抓取Web Scraping with Python: Collecting More Data from the Modern Web (by Ryan Mitchel
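
The tag search used in the cell above can be tried offline. A minimal sketch, assuming a made-up HTML snippet (the markup below is illustrative and not taken from feng.li) and using `find_all`, the current name of `findAll`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the calls.
html = """
<html><body>
  <h1>Course page</h1>
  <div class="entry-content"><p>Introduction</p><p>Slides</p></div>
  <div class="sidebar"><p>Links</p></div>
</body></html>
"""

bs = BeautifulSoup(html, 'html.parser')

# find_all returns a list of every matching tag (possibly empty).
entries = bs.find_all('div', {'class': 'entry-content'})
texts = [p.get_text() for div in entries for p in div.find_all('p')]
print(texts)

# find returns the first match, or None when nothing matches,
# so check the result before calling methods on it.
missing = bs.find('div', {'id': 'artibody'})
print(missing is None)
```

Only the paragraphs inside the `entry-content` block are collected; the sidebar is skipped because its class does not match.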

## Web Scraping with `BeautifulSoup`

In [7]:
from collections import OrderedDict
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

def get_list(comp, page):
    """Fetch one page of Sina news search results for a company.

    Args:
        comp: URL-encoded query string for the company name, e.g. the output of urlencode({'q': ...}).
        page: The page number.

    Returns:
        newsData: An ordered dictionary with the news title as key and [date, source, abstract, url] as value.

    """
    newsData = OrderedDict()
    # comp fills the first %s and page the second; col=1_7 restricts the search to Sina financial news
    href = 'http://search.sina.com.cn/?%s&range=title&c=news&num=20&col=1_7&page=%s' % (comp, page)
    html = requests.get(href)
    # Parsing html
    soup = BeautifulSoup(html.content, 'html.parser')
    divs = soup.find_all('div', {"class": "r-info r-info2"})
    for div in divs:
        head = div.find_all('h2')[0]
        # News title
        titleinfo = head.find('a')
        title = titleinfo.get_text()
        # News url
        url = titleinfo['href']
        # Other info
        otherinfo = head.find('span', {"class": "fgray_time"}).get_text()
        source, date, time = otherinfo.split()
        # News abstract
        abstract = div.find('p', {"class": "content"}).get_text()
        newsData[title] = [date, source, abstract, url]
    return newsData



if __name__ == "__main__":
    compRawStr = '百度'
    # Dealing with character encoding
    comp = compRawStr.encode('gbk')
    d = {'q': comp}
    pname = urlencode(d)
    # Scraping and printing the first two pages
    for page in range(1, 3):
        newsData = get_list(pname, page)
        for ky in newsData:
            print('\001'.join([ky] + newsData[ky])) # "\001" as separator
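
The encoding step in the cell above can be checked in isolation. A small sketch, using only the standard library, of how `urlencode` percent-encodes the GBK bytes of the keyword compared with its default UTF-8 behaviour; whether the search endpoint still expects GBK is an assumption carried over from the cell above:

```python
from urllib.parse import urlencode

keyword = '百度'

# Percent-encode the GBK bytes, as the cell above does.
gbk_query = urlencode({'q': keyword.encode('gbk')})
# For comparison, urlencode uses UTF-8 by default when given a str.
utf8_query = urlencode({'q': keyword})

print(gbk_query)   # q=%B0%D9%B6%C8
print(utf8_query)  # q=%E7%99%BE%E5%BA%A6
```

The two query strings differ, so sending the wrong one to a server that expects the other encoding would search for garbled text.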

In [6]:
import logging

import requests
from bs4 import BeautifulSoup




def get_body(href):
    """Function to retrieve news content given its url.

    Args:
        href: url of the news to be crawled.

    Returns:
        content: the crawled news content.

    """
    html = requests.get(href)
    soup = BeautifulSoup(html.content, 'html.parser')
    div = soup.find('div', {"id": "artibody"})
    if div is None:
        # Page has no article body (layout changed or page missing)
        return ''
    # Join the stripped text of every paragraph
    paras = div.find_all('p')
    return ''.join(p.get_text().strip().replace("\n", "") for p in paras)



if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    # Getting and printing content for each url in the crawled web list pages
    with open("data/baidu_list") as f:
        for line in f:
            title, date, source, abstract, href = line.strip().split('\001')
            # Printing progress onto console
            logging.info('Scraping ' + href)
            content = get_body(href)
            print('\001'.join([title, date, source, abstract, href, content]))

FileNotFoundError: [Errno 2] No such file or directory: 'data/baidu_list'
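
The `FileNotFoundError` above occurs because `data/baidu_list` does not exist yet. The second script appears to expect the output of the first one, saved under that path with one news item per line and `\001`-separated fields. A minimal sketch of that missing step, assuming `news_data` has the shape returned by `get_list`; the helper names `save_news_list` and `load_news_list` are hypothetical:

```python
import os
from collections import OrderedDict

SEP = '\001'  # same field separator as in the cells above


def save_news_list(news_data, path):
    """Write one line per news item: title, date, source, abstract, url."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        for title, fields in news_data.items():
            f.write(SEP.join([title] + fields) + '\n')


def load_news_list(path):
    """Read the file back into a list of five-field records."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n').split(SEP) for line in f]
```

With this in place, `save_news_list(newsData, 'data/baidu_list')` after each call to `get_list` would create the file that the second script reads.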

# Web Crawling with `Scrapy`

One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: find all links on a page, evaluate the difference between internal and external links, go to new pages. These basic patterns are useful to know and to be able to write from scratch, but the Scrapy library handles many of these details for you.
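
To make one of these repeated tasks concrete, here is a minimal from-scratch sketch of separating internal from external links, assuming the `href` values have already been extracted from a page. The function name `split_links` is hypothetical, and comparing host names exactly is a simplification (subdomains count as external):

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Sort hrefs found on page_url into internal and external absolute URLs."""
    site = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc == site:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

Scrapy provides its own link extraction and domain filtering, so code like this is mainly useful for understanding what the framework does for you.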

##  Installing Scrapy

- After Anaconda is installed, you can install Scrapy by using this command:
   
      conda install -c conda-forge scrapy



## Dealing with Different Website Layouts

Fortunately, in most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are pre-selected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.

The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.
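
A minimal sketch of this one-parser-per-site approach, assuming two made-up sites with different layouts. The host names, tag choices, and the `Content` record are all hypothetical:

```python
from collections import namedtuple
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# The Python object returned for every scraped page, whatever the site.
Content = namedtuple('Content', ['url', 'title', 'body'])


def parse_site_a(url, bs):
    # Hypothetical layout: title in <h1>, text in <div class="article">.
    title = bs.find('h1').get_text(strip=True)
    body = bs.find('div', {'class': 'article'}).get_text(strip=True)
    return Content(url, title, body)


def parse_site_b(url, bs):
    # Hypothetical layout: title in <h2 class="headline">, text in <section>.
    title = bs.find('h2', {'class': 'headline'}).get_text(strip=True)
    body = bs.find('section').get_text(strip=True)
    return Content(url, title, body)


# One parser per pre-selected site, keyed by host name.
PARSERS = {'a.example.com': parse_site_a, 'b.example.com': parse_site_b}


def parse_page(url, html):
    """Pick the parser for the page's host and return a Content record."""
    parser = PARSERS[urlparse(url).netloc]
    return parser(url, BeautifulSoup(html, 'html.parser'))
```

Because every parser returns the same `Content` record, the code that stores or analyses the results does not need to know which site a page came from.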


## Initializing a New Spider

To create a new Scrapy project in the current directory, run the following from the **command line (NOT THE PYTHON PROMPT)**:
```
    scrapy startproject wikiSpider
```    
    
This creates a new subdirectory named wikiSpider in the directory where the command was run. Inside this directory is the following file structure:

- scrapy.cfg
- wikiSpider
  - spiders
     - `__init__.py`
  - items.py
  - middlewares.py
  - pipelines.py
  - settings.py
  - `__init__.py`
  
  
## Generate some spiders with templates from the command line

    scrapy genspider example example.com 
    scrapy genspider example2 example.com 
    scrapy genspider example3 example2.com 
    
  
## Writing a Simple Scraper

To create a crawler, you will add a new file inside the spiders directory at wikiSpider/wikiSpider/spiders/article.py. In your newly created **article.py** file, write the following:

```python
    import scrapy

    class ArticleSpider(scrapy.Spider):
        name='article'

        def start_requests(self):
            urls = [
                'http://en.wikipedia.org/wiki/Python_%28programming_language%29',
                'https://en.wikipedia.org/wiki/Functional_programming',
                'https://en.wikipedia.org/wiki/Monty_Python']
            return [scrapy.Request(url=url, callback=self.parse) for url in urls]

        def parse(self, response):
            url = response.url
            title = response.css('h1::text').extract_first()
            print('URL is: {}'.format(url))
            print('Title is: {}'.format(title))
```

## Run this article spider

You can run this article spider by navigating to the wikiSpider/wikiSpider/spiders directory, where article.py was saved, and running the following from the command line:

    scrapy runspider article.py
        
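## Run your project from the project root directory

Here `table` stands for the `name` attribute of a spider in the project; to run the spider defined above, use `article` instead.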

    scrapy crawl table -o table.csv  --logfile table.log
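
The same output and logging options can also be kept in the project's `settings.py` instead of being typed on every run. A sketch of such a fragment, assuming a recent Scrapy release (the `FEEDS` setting was introduced in Scrapy 2.1); the file names mirror the command above, and the politeness values are illustrative assumptions:

```python
# wikiSpider/wikiSpider/settings.py (fragment)

# Export scraped items to CSV, like `-o table.csv`.
FEEDS = {
    'table.csv': {'format': 'csv'},
}

# Write the log to a file, like `--logfile table.log`.
LOG_FILE = 'table.log'

# Illustrative politeness settings: respect robots.txt and
# wait one second between requests to the same site.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
```

Note that a feed export only contains items that a spider yields from its callbacks; a spider that only prints its results produces an empty file.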
    

## Scrapy Shell

To explore a page interactively, run the following from the command line:

    scrapy shell "http://en.wikipedia.org/wiki/Python_%28programming_language%29"

# Lab 

Use the `scrapy` framework to reimplement the scraper we studied with `BeautifulSoup`.