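### Multithreaded concurrency
#### At any given moment, CPython lets only one thread execute Python bytecode, so Python's concurrency is achieved by switching between threads.
- The CPython interpreter is not thread-safe internally. To avoid the race conditions this would cause, it uses a global interpreter lock (GIL): only one thread may execute at a time. When a thread blocks on an I/O operation, it releases the GIL so that another thread can continue.
    - GIL is a term from CPython, the most widely used Python interpreter. It stands for global interpreter lock and is essentially similar to an operating-system mutex. Before a Python thread executes in the CPython interpreter, it must acquire the GIL, which keeps every other thread from executing until the lock is released.
    - CPython takes turns running the Python threads. What the user sees is "pseudo-parallelism": threads execute in an interleaved fashion that imitates truly parallel threads.
    - CPython introduced the GIL for two main reasons:
        - First, the designers wanted to sidestep complex race conditions in areas such as memory management.
        - Second, CPython relies heavily on C libraries, and most C libraries are not natively thread-safe (thread safety costs performance and adds complexity).
    - CPython has another mechanism, historically called check_interval: the interpreter periodically forces the running thread to release the GIL so that other threads get a chance to execute. In Python 2 the interval was counted in bytecode instructions (`sys.setcheckinterval`); since Python 3.2 it is time-based (`sys.getswitchinterval()` and `sys.setswitchinterval()`, 5 ms by default).
- How can you work around the GIL? The GIL is a restriction imposed by the CPython interpreter. Code that does not need the CPython interpreter to execute it is not subject to the GIL.
    - Many high-performance use cases already have Python libraries implemented in C. NumPy's matrix operations, for example, are implemented in C and can release the GIL while they run, so they are largely unaffected by it.
    - Bypass CPython by using another implementation, such as Jython (a Python interpreter implemented in Java).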
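
The two behaviours described in the list above can be sketched as follows, assuming CPython 3.2 or later. `sys.getswitchinterval()` reports the interval after which the interpreter asks the running thread to give up the GIL, and `time.sleep` stands in for a blocking I/O call that releases the GIL. The helper names `blocking_task` and `run_threads` are made up for this illustration.

```python
import sys
import threading
import time


def blocking_task(seconds):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(seconds)


def run_threads(n_threads, seconds):
    """Start n_threads blocking tasks and return the total elapsed time."""
    threads = [
        threading.Thread(target=blocking_task, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Time-based successor of the old check interval, in seconds.
    print("switch interval:", sys.getswitchinterval())
    elapsed = run_threads(n_threads=4, seconds=0.2)
    # The four waits overlap, so the total stays well below 4 * 0.2 s.
    print("4 blocking threads took {:.2f} s".format(elapsed))
```

Replacing `time.sleep` with a pure-Python CPU loop would be expected to remove most of the overlap, because such a loop holds the GIL while it runs.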

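- `with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:` creates a pool of at most 5 worker threads and waits for them to finish when the block exits.
- `executor.map(function, inputs)` calls the function on each input and returns the results in input order.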
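
A minimal sketch of the two calls above, with no network access so it can be run anywhere. `slow_double` and `double_all` are hypothetical names, and the short sleep only imitates a slow I/O call.

```python
import concurrent.futures
import time


def slow_double(x):
    """Pretend to wait on I/O, then return twice the input."""
    time.sleep(0.05)
    return x * 2


def double_all(values, max_workers=5):
    """Apply slow_double to every value using a thread pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns a lazy iterator; results come back in input order,
        # regardless of which thread finishes first.
        return list(executor.map(slow_double, values))


if __name__ == "__main__":
    print(double_all([1, 2, 3, 4, 5]))  # prints [2, 4, 6, 8, 10]
```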


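### Multiprocess parallelism
- `with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:` creates a pool of worker processes. Each process has its own interpreter and its own GIL, so CPU-bound work can run in parallel.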
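
A hedged sketch of a CPU-bound job spread over worker processes. It assumes the code is saved to a `.py` file and run as a script, so that the workers can import `count_primes`. Both function names are illustrative.

```python
import concurrent.futures


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, max_workers=5):
    """Run count_primes for each limit in a pool of worker processes."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters: with the "spawn" start method every worker
    # re-imports this file, and unguarded code would run again in each one.
    print(count_primes_parallel([10_000, 20_000, 30_000]))
```

Arguments and return values travel between processes by pickling, so they should be small and picklable. For I/O-bound work a thread pool is usually the cheaper choice.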


In [6]:
import requests
import bs4
import concurrent.futures

def fetch_content(url):
    try:
        # Add headers to make the request look more like a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        # requests.get is thread-safe
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        if resp.text:
            print('Read {} from {}'.format(len(resp.text), url))
            return resp.text
        else:
            print('Empty response from {}'.format(url))
            return None
    except requests.RequestException as e:
        print('Error fetching {}: {}'.format(url, e))
        return None

def crawl_movie(url):
    init_page = fetch_content(url)
    if init_page is None:
        # fetch_content already reported the failure; nothing to parse
        return
    init_soup = bs4.BeautifulSoup(init_page, 'lxml')

    movie_names, urls_to_fetch, movie_dates, pages = [], [], [], []
    all_movies = init_soup.find('div', id="showing-soon")
    for movie in all_movies.find_all('div', class_='item'):
        all_a_tag = movie.find_all('a')
        all_li_tag = movie.find_all('li')
        # eg:<a href="http://example.com/1">Link 1</a>
        movie_name = all_a_tag[1].text
        url_to_fetch = all_a_tag[1]['href']
        movie_date = all_li_tag[0].text

        movie_names.append(movie_name)
        urls_to_fetch.append(url_to_fetch)
        movie_dates.append(movie_date)

    # Download the detail pages concurrently; map() keeps the input order
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        pages.extend(executor.map(fetch_content, urls_to_fetch))

    for movie_name, movie_date, page in zip(movie_names, movie_dates, pages):
        if page is None:
            # Skip detail pages that failed to download
            continue
        soup_item = bs4.BeautifulSoup(page, 'lxml')
        img_tag = soup_item.find("img")
        img_src = img_tag['src'] if img_tag else None
        print('{} {} {}'.format(movie_name, movie_date, img_src))

if __name__ == "__main__":
    url = "https://movie.douban.com/cinema/later/beijing/"
    crawl_movie(url)

Error fetching https://movie.douban.com/cinema/later/beijing/: 403 Client Error: Forbidden for url: https://sec.douban.com/b?r=https%3A%2F%2Fmovie.douban.com%2Fcinema%2Flater%2Fbeijing%2F

In [39]:
import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))


def download_all(sites):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(download_one, sites)

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()

Download 15 sites in 0.0686110001988709 seconds


Process SpawnProcess-95:
Process SpawnProcess-96:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/lib/python3.12/concurrent/futures/process.py", line 251, in _process_worker
    call_item = call_queue.get(block=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'download_one' on <module '__main__' (<class '_frozen_importlib.BuiltinImporter'>)>
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 10
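
One plausible reading of the `AttributeError` above, offered as an interpretation rather than something the notebook states: `ProcessPoolExecutor` hands the function to its workers by pickling it, and pickle stores a function only as a module name plus a function name. The `SpawnProcess` names in the traceback suggest the `spawn` start method, where each worker is a fresh interpreter whose `__main__` does not contain functions defined in a notebook cell, so the lookup by name fails. The sketch below uses `math.factorial` purely as an example of an importable function; `describe_pickle` is a hypothetical helper.

```python
import math
import multiprocessing
import pickle


def describe_pickle(func):
    """Return the pickled bytes for func; they hold names, not code."""
    return pickle.dumps(func)


if __name__ == "__main__":
    payload = describe_pickle(math.factorial)
    # The payload only names the module and the function to look up again.
    print(b"math" in payload, b"factorial" in payload)
    # Unpickling imports the module and fetches the attribute by name.
    print(pickle.loads(payload) is math.factorial)
    # "spawn" starts workers as fresh interpreters; "fork" copies the parent.
    print("start method:", multiprocessing.get_start_method())
```

Under that reading, moving `download_one` into an importable `.py` module, or running the code as a script instead of in a notebook cell, is the usual way around the error.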

In [1]:
import requests
import bs4
import concurrent.futures

def fetch_content(url):
    try:
        # Add headers to make the request look more like a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        # requests.get is thread-safe
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        if resp.text:
            print('Read {} from {}'.format(len(resp.text), url))
            return resp.text
        else:
            print('Empty response from {}'.format(url))
            return None
    except requests.RequestException as e:
        print('Error fetching {}: {}'.format(url, e))
        return None

def crawl_movie(url):
    init_page = fetch_content(url)
    if init_page is None:
        # fetch_content already reported the failure; nothing to parse
        return
    init_soup = bs4.BeautifulSoup(init_page, 'lxml')

    movie_names, urls_to_fetch, movie_dates, pages = [], [], [], []
    all_movies = init_soup.find('div', id="showing-soon")
    for movie in all_movies.find_all('div', class_='item'):
        all_a_tag = movie.find_all('a')
        all_li_tag = movie.find_all('li')
        # eg:<a href="http://example.com/1">Link 1</a>
        movie_name = all_a_tag[1].text
        url_to_fetch = all_a_tag[1]['href']
        movie_date = all_li_tag[0].text

        movie_names.append(movie_name)
        urls_to_fetch.append(url_to_fetch)
        movie_dates.append(movie_date)


    with concurrent.futures.ProcessPoolExecutor() as executor:
        pages.extend(executor.map(fetch_content, urls_to_fetch))
    
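    for movie_name, movie_date, page in zip(movie_names, movie_dates, pages):
        if page is None:
            # Skip detail pages that failed to download
            continue
        soup_item = bs4.BeautifulSoup(page, 'lxml')
        img_tag = soup_item.find("img")
        img_src = img_tag['src'] if img_tag else None
        print('{} {} {}'.format(movie_name, movie_date, img_src))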

if __name__ == "__main__":
    url = "https://movie.douban.com/cinema/later/beijing/"
    crawl_movie(url)

Read 99349 from https://movie.douban.com/cinema/later/beijing/


Process SpawnProcess-3:
Process SpawnProcess-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/lib/python3.12/concurrent/futures/process.py", line 251, in _process_worker
    call_item = call_queue.get(block=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'fetch_content' on <module '__main__' (<class '_frozen_importlib.BuiltinImporter'>)>
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/anaconda3/lib/python3.12/multiprocessing/process.py", line 108

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

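### futures
- `executor.submit(func)` schedules the call and returns the newly created future instance.
- `done()` is non-blocking and immediately reports whether the future has finished.
- `add_done_callback(fn)` registers fn to be called, with the future as its argument, once the future completes.
- `result()` returns the future's result, blocking until the future completes; if the call raised an exception, `result()` re-raises it.
- `as_completed(fs)` takes an iterable of futures fs and returns an iterator that yields each future as it completes.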
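
The five calls listed above can be exercised together in a short sketch. `slow_square` and `run_demo` are hypothetical names, and the delays only simulate tasks of different lengths.

```python
import concurrent.futures
import time


def slow_square(x, delay):
    """Sleep for `delay` seconds, then return x squared (stand-in for real work)."""
    time.sleep(delay)
    return x * x


def run_demo():
    finished = []  # filled in by the done-callback

    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        # submit() schedules the call and returns a Future immediately.
        futures = {
            executor.submit(slow_square, x, delay): x
            for x, delay in [(2, 0.3), (3, 0.1), (4, 0.2)]
        }
        for future in futures:
            # The callback receives the finished future as its only argument.
            future.add_done_callback(lambda f: finished.append(f.result()))

        # as_completed() yields futures in completion order,
        # which need not match the submission order.
        completion_order = [
            futures[future] for future in concurrent.futures.as_completed(futures)
        ]

    # After the with-block every future is done, so result() returns at once.
    results = {x: future.result() for future, x in futures.items()}
    all_done = all(future.done() for future in futures)
    return completion_order, results, sorted(finished), all_done


if __name__ == "__main__":
    order, results, finished, all_done = run_demo()
    print("completion order:", order)
    print("results:", results)
    print("callback saw:", finished, "all done:", all_done)
```

Compared with `executor.map`, `submit` plus `as_completed` lets the caller handle each result as soon as it is ready and deal with failures one future at a time.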

In [4]:
import concurrent.futures
import requests
import time

def download_one(url):
    try:
        resp = requests.get(url)
        print('Read {} from {}'.format(len(resp.content), url))
    except requests.RequestException as e:
        print('Error fetching {}: {}'.format(url, e))
        return None

def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        to_do = []
        for site in sites:
            future = executor.submit(download_one, site)
            to_do.append(future)
            
        for future in concurrent.futures.as_completed(to_do):
            # Handle each future separately so one failure does not stop the rest
            try:
                future.result()
            except concurrent.futures.TimeoutError:
                print('Request timed out')
            except concurrent.futures.CancelledError:
                print('Request was cancelled')
            except requests.RequestException as e:
                print(f'Request failed: {e}')
            except Exception as e:
                print(f'Unexpected error occurred: {e}')


def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()

Unexpected error occurred: A process in the process pool was terminated abruptly while the future was running or pending.
Download 15 sites in 0.11836389999371022 seconds
