<a href="https://colab.research.google.com/github/anilaksu/Algorithmic-Trading-Codes/blob/Web-scraping-with-Asyncio-in-Python/Asynchronous_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web scraping with Asyncio and Python**


Anil Aksu

Personal e-mail: aaa293@cornell.edu

**Asyncio enables concurrent programming in Python.**

**Outline:**


1.   Fundamentals
  * Synchronous vs Asynchronous
  * Blocking & Timeouts
  * Scraping with Selenium
  * Async Web Scraping with chromedriver and arsenic
  * Hide Arsenic logs
2.   Extraction & Formatting
  * Async Data with Python Pandas
  * Prepare to scrape multiple URLs
  * Extract Product Data
  * Async Product Data Extraction
3.   Prepare for re-usability
  * Modules & Submodules
  * Service Specific Submodule
  * Decouple logging & scraper
4.   Storing Data
  * Synchronous SQL storage with Pandas
  * Store scraped data to SQL
  * Inspect stored data in Jupyter
  * Scraping URLs from stored link table
  * Scrape paginated list view
  * Results & Timing



In [24]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive/MyDrive/ColabNotebooks/FinanceAlgorithms
!ls  # shell command that lists the files in the current working directory

Mounted at /content/gdrive
/content/gdrive/MyDrive/ColabNotebooks/FinanceAlgorithms
 2013-03-08options.csv	      EURUSD_Options_Data.csv	   OptionsTrading.ipynb
 2013-03-08stocks.csv	      EURUSD_Options_Data.gsheet   PriceJump.gdraw
'Asynchronous Python.ipynb'  'ForEx&IndexData.xls'	  'Stock Markets Codes.ipynb'


# **1. Fundamentals**

In [73]:
# Here we install required libraries for asynchronous programming
!python3 -V
!which pip3
!pip3 install requests-html --upgrade --no-cache-dir
!pip3 install selenium --upgrade --no-cache-dir
!pip3 install arsenic --upgrade --no-cache-dir
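# google-chrome-stable is a system package, not a PyPI package, so this pip call fails (see the error in the output)
!pip3 install google-chrome-stable --upgrade --no-cache-dir
!pip3 install chromedriver-autoinstaller --upgrade --no-cache-dir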

Python 3.10.12
/usr/local/bin/pip3
ERROR: Could not find a version that satisfies the requirement google-chrome-stable (from versions: none)
ERROR: No matching distribution found for google-chrome-stable
Collecting chromedriver-autoinstaller
  Downloading chromedriver_autoinstaller-0.6.4-py3-none-any.whl (7.6 kB)
Installing collected packages: chromedriver-autoinstaller
Successfully installed chromedriver-autoinstaller-0.6.4


## 1.1 Sync vs Async

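Synchronous code runs tasks consecutively: each call blocks until it finishes. Asynchronous code runs them concurrently: while one task waits, the event loop switches to another.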

**Synchronous Code**

In [5]:
import time  # time.sleep blocks the whole thread while it waits

iteration_times = [1, 3, 2, 4]

# Emulates a blocking process that takes some time
def sleeper(seconds, i = -1):
  if i != -1:
    print(f"{i}\t{seconds}s")
  time.sleep(seconds)

def run():
  for i, second in enumerate(iteration_times):
    sleeper(second, i=i)

run()

0	1s
1	3s
2	2s
3	4s


**Asynchronous Code**

In [11]:
import asyncio

iteration_times = [1, 3, 2, 4]

async def a_sleeper(seconds, i = -1):
  if i != -1:
    print(f"{i}\t{seconds}s")
  await asyncio.sleep(seconds)

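async def a_run():
  results = []
  for i, second in enumerate(iteration_times):
    # create_task schedules the coroutine on the event loop without waiting for it
    results.append(
        asyncio.create_task(a_sleeper(second, i=i))
        )
  return results

# a_run returns as soon as the tasks are scheduled, so they are still pending when printed
results = await a_run()

print(results)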

[<Task pending name='Task-23' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>, <Task pending name='Task-24' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>, <Task pending name='Task-25' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>, <Task pending name='Task-26' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>]
0	1s
1	3s
2	2s
3	4s
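
The cell above only schedules the tasks, which is why the printed list shows them as pending. A minimal sketch of waiting for all of them with `asyncio.gather` and timing the run follows. The shortened delays, the return value of `a_sleeper`, and the helper name `run_concurrently` are assumptions made for illustration. It is written as a standalone script; inside a notebook cell, where an event loop is already running, use `await run_concurrently(delays)` instead of `asyncio.run(...)`.

```python
import asyncio
import time

# Shorter delays than the notebook's [1, 3, 2, 4] so the sketch runs quickly (assumption)
delays = [0.1, 0.3, 0.2, 0.4]

async def a_sleeper(seconds, i=-1):
    # Non-blocking sleep: the event loop can run other tasks meanwhile
    await asyncio.sleep(seconds)
    return i

async def run_concurrently(delays):
    start = time.perf_counter()
    # gather schedules every coroutine and waits until all of them have finished
    results = await asyncio.gather(
        *(a_sleeper(s, i=i) for i, s in enumerate(delays))
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

if __name__ == "__main__":
    results, elapsed = asyncio.run(run_concurrently(delays))
    print(results)
    print(f"elapsed: {elapsed:.2f}s (sum of delays: {sum(delays):.2f}s)")
```

Because the sleeps overlap, the elapsed time should be close to the longest delay rather than the sum of all delays, which is what the synchronous loop would take.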


## 1.2 Blocking & Timeouts

In [13]:
def sleeper(seconds, i = -1):
  if i != -1:
    print(f"{i}\t{seconds}s")
  time.sleep(seconds)

In [49]:
async def a_sleeper(seconds, i = -1, timeout = 4):
  if i != -1:
    print(f"{i}\t{seconds}s")
  await asyncio.wait_for(asyncio.sleep(seconds), timeout=timeout)

await a_sleeper(3)  # awaiting blocks this cell until the coroutine finishes, like any other notebook cell

In [23]:
# Scheduling the coroutine as a task lets this cell return immediately; the task keeps running in the background while the rest of the notebook stays usable
loop = asyncio.get_event_loop()
# loop = asyncio.new_event_loop()
# asyncio.run()
loop.create_task(a_sleeper(123))

<Task pending name='Task-33' coro=<a_sleeper() running at <ipython-input-14-f326cdd00bdf>:1>>

In [37]:
# asyncio.wait splits the tasks into done and pending once the timeout elapses
# Wrap the coroutines in tasks: passing bare coroutines to asyncio.wait is deprecated since Python 3.8 and rejected from 3.11
tasks = [asyncio.create_task(a_sleeper(1)), asyncio.create_task(a_sleeper(4))]
done, pending = await asyncio.wait(tasks, timeout = 2)


In [38]:
done

{<Task finished name='Task-46' coro=<a_sleeper() done, defined at <ipython-input-14-f326cdd00bdf>:1> result=None>}

In [39]:
pending

{<Task finished name='Task-47' coro=<a_sleeper() done, defined at <ipython-input-14-f326cdd00bdf>:1> result=None>}

In [32]:
# Wait for the tasks that were still pending to finish
await asyncio.wait(pending)

({<Task finished name='Task-36' coro=<a_sleeper() done, defined at <ipython-input-14-f326cdd00bdf>:1> result=None>},
 set())

In [43]:
# asyncio.wait_for raises asyncio.TimeoutError when the task exceeds the timeout
try:
  await asyncio.wait_for(a_sleeper(5), timeout = 3)
except asyncio.TimeoutError:
  print("Task failed")

Task failed
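
`asyncio.wait` with a timeout does not stop the tasks that miss the deadline; they keep running in the background. A minimal sketch of cancelling them explicitly follows. The helper name `run_with_deadline` and the return value of `a_sleeper` are assumptions for illustration. It is written as a standalone script; in a notebook cell, `await run_with_deadline(...)` directly.

```python
import asyncio

async def a_sleeper(seconds):
    await asyncio.sleep(seconds)
    return seconds

async def run_with_deadline(delays, timeout):
    # asyncio.wait expects tasks (or futures), not bare coroutines
    tasks = [asyncio.create_task(a_sleeper(s)) for s in delays]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    # Cancel whatever missed the deadline so nothing keeps running in the background
    for task in pending:
        task.cancel()
    # Let the cancellations complete; return_exceptions=True collects
    # the CancelledError of each cancelled task instead of raising it
    await asyncio.gather(*pending, return_exceptions=True)
    finished = sorted(task.result() for task in done)
    return finished, len(pending)

if __name__ == "__main__":
    finished, cancelled = asyncio.run(run_with_deadline([0.01, 2.0], timeout=0.5))
    print(f"finished: {finished}, cancelled: {cancelled}")
```

By contrast, `asyncio.wait_for` cancels the awaited task itself when its timeout expires, so no manual cleanup is needed in that case.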


## 1.3 Scraping with Selenium

In [8]:
url = 'https://www.spoonflower.com/en/shop?on=fabric%27'

In [23]:
import re
import requests
from requests_html import HTML
import pandas as pd

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

In [19]:
def scraper(url):
  options = Options()
  options.add_argument("--headless")
  options.add_argument("--no-sandbox")
  options.add_argument("--disable-dev-shm-usage")
  driver = webdriver.Chrome(options = options)
  try:
    driver.get(url)
    return driver.page_source
  finally:
    driver.quit()  # always close the browser, even if the page load fails

def extract_id_slug(url_path):
  regex = r"^[^\s]+/(?P<id>\d+)-(?P<slug>[\w_-]+)$"
  group = re.match(regex, url_path)
  if not group:
    return None, None
  return group['id'], group['slug']
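
To make the regular expression concrete, here is a small usage sketch of `extract_id_slug`. The two paths are hypothetical and made up for illustration; they are not taken from the scraped page.

```python
import re

def extract_id_slug(url_path):
    # Same pattern as in the notebook: "<prefix>/<digits>-<slug>"
    regex = r"^[^\s]+/(?P<id>\d+)-(?P<slug>[\w_-]+)$"
    match = re.match(regex, url_path)
    if not match:
        return None, None
    return match["id"], match["slug"]

# Hypothetical paths, for illustration only
print(extract_id_slug("/en/fabric/123-sample-floral-print"))  # ('123', 'sample-floral-print')
print(extract_id_slug("/en/shop"))                            # (None, None)
```

Note that the id comes back as a string, not an integer, because regex groups always capture text.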

In [15]:
# Here we pull the data from the page
content = scraper(url)
html_r = HTML(html = content)

In [22]:
fabric_links = [x for x in list(html_r.links) if x.startswith("/en/fabric")]   # Links for fabric images

datas = []

for path in fabric_links:
  id_, slug_ = extract_id_slug(path)  # product id and slug
  data = {
      "id" : id_,
      "slug" : slug_,
      "path" : path,
      "scraped" : 0 # True / False -> 1 / 0
  }
  datas.append(data)
  #print(id_, slug_)

In [25]:
# Load the records into a DataFrame
df = pd.DataFrame(datas)
df.head()

Unnamed: 0,id,slug,path,scraped
0,15632743,whimsigothic-moth-wallpaper-24-by-garabateo,/en/fabric/15632743-whimsigothic-moth-wallpape...,0
1,12300699,cozy-night-sky-large-full-moon-stars-over-clou...,/en/fabric/12300699-cozy-night-sky-large-full-...,0
2,12878048,blackthorn-by-william-morris-antique-colors-24...,/en/fabric/12878048-blackthorn-by-william-morr...,0
3,13614377,vintage-summer-romanticism-maximalism-moody-fl...,/en/fabric/13614377-vintage-summer-romanticism...,0
4,3730688,william-morris-antiqued-pimpernel-original-on-...,/en/fabric/3730688-william-morris-antiqued-pim...,0


In [26]:
# Save the DataFrame to a local CSV file and read it back
df.to_csv("local.csv", index = False)
df = pd.read_csv("local.csv")
df.head(5)

Unnamed: 0,id,slug,path,scraped
0,15632743,whimsigothic-moth-wallpaper-24-by-garabateo,/en/fabric/15632743-whimsigothic-moth-wallpape...,0
1,12300699,cozy-night-sky-large-full-moon-stars-over-clou...,/en/fabric/12300699-cozy-night-sky-large-full-...,0
2,12878048,blackthorn-by-william-morris-antique-colors-24...,/en/fabric/12878048-blackthorn-by-william-morr...,0
3,13614377,vintage-summer-romanticism-maximalism-moody-fl...,/en/fabric/13614377-vintage-summer-romanticism...,0
4,3730688,william-morris-antiqued-pimpernel-original-on-...,/en/fabric/3730688-william-morris-antiqued-pim...,0
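
One thing to keep in mind with the CSV round trip: `pd.read_csv` infers column types, so ids that were extracted as strings come back as integers unless a dtype is given. A minimal sketch, using hypothetical records in the same shape as above and an in-memory buffer in place of `local.csv`:

```python
import io
import pandas as pd

# Hypothetical rows in the same shape as the notebook's records
records = [
    {"id": "00123", "slug": "sample-a", "path": "/en/fabric/00123-sample-a", "scraped": 0},
    {"id": "456", "slug": "sample-b", "path": "/en/fabric/456-sample-b", "scraped": 0},
]
df = pd.DataFrame(records)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
inferred = pd.read_csv(buffer)                    # column types are inferred
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"id": str})  # force the id column to stay text

print(inferred["id"].tolist())  # [123, 456]
print(as_text["id"].tolist())   # ['00123', '456']
```

Whether this matters depends on how the ids are used later; if they are only ever compared with other values read from the same file, the inferred integers are fine.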


## 1.4 Async Web Scraping with `chromedriver` and `arsenic`

In [76]:
%%writefile async_scrape.py

import os
import asyncio
from arsenic import get_session, keys, browsers, services
import pandas as pd
from requests_html import HTML
import itertools
import re
import time
import pathlib


async def extract_id_slug(url_path):
  regex = r"^[^\s]+/(?P<id>\d+)-(?P<slug>[\w_-]+)$"
  group = re.match(regex, url_path)
  if not group:
    return None, None
  return group['id'], group['slug']

async def scraper(url):
  service = services.Chromedriver()
  browser = browsers.Chrome(chromeOptions = {
    'args': ['--headless', '--disable-gpu']
  })

  async with get_session(service, browser) as session:
    await session.get(url)
    body = await session.get_page_source()
    print(body)
    return body

# main script
if __name__ == "__main__":
  url = 'https://www.spoonflower.com/en/shop?on=fabric%27'
  #loop = asyncio.get_event_loop()
  #results = loop.run_until_complete(scraper(url))
  results = asyncio.run(scraper(url))  # we can run alternatively this way

Overwriting async_scrape.py


In [77]:
!python async_scrape.py

Traceback (most recent call last):
  File "/content/gdrive/MyDrive/ColabNotebooks/FinanceAlgorithms/async_scrape.py", line 37, in <module>
    results = asyncio.run(scraper(url))  # we can run alternatively this way
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/content/gdrive/MyDrive/ColabNotebooks/FinanceAlgorithms/async_scrape.py", line 26, in scraper
    async with get_session(service, browser) as session:
  File "/usr/local/lib/python3.10/dist-packages/arsenic/__init__.py", line 16, in __aenter__
    self.session = await start_session(self.service, self.browser, self.bind)
  File "/usr/local/lib/python3.10/dist-packages/arsenic/__init__.py", line 28, in start_session
    driver = await service.start()
  File "/usr/local/lib/python3.10/dist-packages/arsenic/services.py", line 122, in start
    return awai

# **2. Extraction & Formatting**