# "A curious case of `requests.get`"
> "Python Crawler"

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]

It is quite easy to write a crawler in Python:

In [None]:
import requests
requests.get("https://google.com")

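But in a production environment that is not enough: `requests.get` has no default timeout, so a stalled server can hang the call indefinitely, and a transient failure is never retried.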

After a few iterations, I settled on the following.

In [None]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_TIMEOUT = 10  # seconds


class TimeoutHTTPAdapter(HTTPAdapter):
    """Transport adapter that applies a default timeout to every request."""

    def __init__(self, *args, **kwargs):
        # Take our own "timeout" option out before HTTPAdapter sees the kwargs.
        self.timeout = kwargs.pop("timeout", DEFAULT_TIMEOUT)
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Use the adapter default unless the caller passed an explicit timeout.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)


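def get_http_session(timeout=DEFAULT_TIMEOUT, retry_count=1):
    """Return a Session with a default timeout and automatic retries."""
    retry_strategy = Retry(
        total=retry_count,
        raise_on_redirect=True,
        # Retry on rate limiting and transient server errors.
        status_forcelist=[429, 500, 502, 503, 504],
        # Only retry idempotent methods. Requires urllib3 >= 1.26
        # (older releases call this parameter method_whitelist).
        allowed_methods=["HEAD", "GET", "OPTIONS"],
    )
    session = requests.Session()
    adapter = TimeoutHTTPAdapter(timeout=timeout, max_retries=retry_strategy)
    # Route both schemes through the same adapter.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session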

In [None]:
get_http_session(timeout=3, retry_count=3).get("https://google.com")

But in reality, the internet is full of dead bodies and broken links.
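
One way to cope with that is to treat every failure as data instead of letting it crash the crawl. The sketch below is only an illustration, not a definitive implementation: `check_links` is a hypothetical helper name, and it assumes that any `requests.exceptions.RequestException` (connection error, timeout, exhausted retries, or an error status raised by `raise_for_status`) means the link should be counted as dead.

```python
import requests


def check_links(session, urls):
    """Split urls into (alive, dead) using a requests-style session.

    Hypothetical helper: a link counts as dead when the request raises
    any RequestException, including HTTP error statuses.
    """
    alive, dead = [], []
    for url in urls:
        try:
            response = session.get(url)
            # Turn 4xx/5xx responses into an HTTPError.
            response.raise_for_status()
        except requests.exceptions.RequestException as exc:
            # Keep the exception type so the failure can be inspected later.
            dead.append((url, type(exc).__name__))
        else:
            alive.append(url)
    return alive, dead
```

It could be called with the session built earlier, for example `check_links(get_http_session(timeout=3, retry_count=3), ["https://google.com"])`. Passing the session in as an argument keeps the timeout and retry policy in one place.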
