<a href="https://colab.research.google.com/github/anilaksu/Algorithmic-Trading-Codes/blob/Web-scraping-with-Asyncio-in-Python/Asynchronous_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web scraping with Asyncio and Python**


Anil Aksu

Personal e-mail: aaa293@cornell.edu

**Asyncio lets us write concurrent programs in Python.**

**Outline:**


1.   Fundamentals
  * Synchronous vs Asynchronous
  * Blocking & Timeouts
  * Scraping with Selenium
  * Async Web Scraping with ChromeDriver and arsenic
  * Hide Arsenic logs
2.   Extraction & Formatting
  * Async Data with Python Pandas
  * Prepare to scrape multiple URLs
  * Extract Product Data
  * Async Product Data Extraction
3.   Prepare for re-usability
  * Modules & Submodules
  * Service Specific Submodule
  * Decouple logging & scraper
4.   Storing Data
  * Synchronous SQL storage with Pandas
  * Store scraped data to SQL
  * Inspect stored data in Jupyter
  * Scraping URLs from stored link table
  * Scrape paginated list view
  * Results & Timing



#**1. Fundamentals**

In [3]:
# Here we install required libraries for asynchronous programming
!python3 -V
!which pip3
!pip3 install requests-html --upgrade --no-cache-dir
!pip3 install selenium --upgrade --no-cache-dir
!pip3 install arsenic --upgrade --no-cache-dir

Python 3.10.12
/usr/local/bin/pip3
Collecting selenium
  Downloading selenium-4.18.1-py3-none-any.whl (10.0 MB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.24.0-py3-none-any.whl (460 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

##1.1 Sync vs Async

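Synchronous code runs tasks consecutively: each one must finish before the next starts. Asynchronous code runs them concurrently: while one task is waiting, the others can make progress.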

**Synchronous Code**

In [5]:
import time # time.sleep blocks the whole program while it waits

iteration_times = [1, 3, 2, 4]

# Emulates a process that takes some time to finish
def sleeper(seconds, i = -1):
  if i != -1:
    print(f"{i}\t{seconds}s")
  time.sleep(seconds)

def run():
  for i, second in enumerate(iteration_times):
    sleeper(second, i=i)

run()

0	1s
1	3s
2	2s
3	4s


**Asynchronous Code**

In [11]:
import asyncio

iteration_times = [1, 3, 2, 4]

async def a_sleeper(seconds, i = -1):
  if i != -1:
    print(f"{i}\t{seconds}s")
  await asyncio.sleep(seconds)

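async def a_run():
  results = []
  for i, second in enumerate(iteration_times):
    # create_task schedules the coroutine on the event loop right away
    results.append(
        asyncio.create_task(a_sleeper(second, i=i))
        )
  # The tasks are returned without being awaited, so they are still pending here
  return results

# Top-level await works in a notebook cell because Jupyter already runs an event loop
results = await a_run()

print(results)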

[<Task pending name='Task-23' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>, <Task pending name='Task-24' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>, <Task pending name='Task-25' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>, <Task pending name='Task-26' coro=<a_sleeper() running at <ipython-input-11-2040a03a9e27>:5>>]
0	1s
1	3s
2	2s
3	4s
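The difference between the two cells above shows up most clearly in the total run time. The following is a minimal sketch, not part of the original notebook: it uses scaled-down durations (an assumption, so it finishes quickly) and `asyncio.gather` to wait for all sleepers. The sequential version should take roughly the sum of the durations, the concurrent version roughly the longest one.

```python
import asyncio
import time

# Scaled-down durations (assumption) so the comparison finishes quickly.
iteration_times = [0.1, 0.3, 0.2, 0.4]

def run_sync():
    """Sleep for each duration in turn; total time is roughly the sum."""
    start = time.perf_counter()
    for seconds in iteration_times:
        time.sleep(seconds)
    return time.perf_counter() - start

async def run_async():
    """Sleep for all durations concurrently; total time is roughly the max."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(s) for s in iteration_times))
    return time.perf_counter() - start

sync_elapsed = run_sync()
# In a notebook cell, use `async_elapsed = await run_async()` instead,
# because asyncio.run cannot be called while an event loop is already running.
async_elapsed = asyncio.run(run_async())

print(f"sync:  {sync_elapsed:.2f}s")
print(f"async: {async_elapsed:.2f}s")
```

Unlike `a_run` above, `run_async` awaits its coroutines, so it only returns once every sleeper has finished.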


##1.2 Blocking & Timeouts

In [13]:
def sleeper(seconds, i = -1):
  if i != -1:
    print(f"{i}\t{seconds}s")
  time.sleep(seconds)

In [49]:
async def a_sleeper(seconds, i = -1, timeout = 4):
  if i != -1:
    print(f"{i}\t{seconds}s")
  await asyncio.wait_for(asyncio.sleep(seconds), timeout=timeout)

await a_sleeper(3) # Awaiting directly blocks this cell until the coroutine finishes, just as a running cell blocks the notebook

In [23]:
# Scheduling a task on the notebook's running event loop returns immediately,
# so the rest of the notebook stays usable while the task runs in the background
loop = asyncio.get_event_loop()
# loop = asyncio.new_event_loop()
# asyncio.run()
loop.create_task(a_sleeper(123))

<Task pending name='Task-33' coro=<a_sleeper() running at <ipython-input-14-f326cdd00bdf>:1>>
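A scheduled task can be awaited later to collect its result. The sketch below illustrates that pattern under a few assumptions: this `a_sleeper` is a simplified stand-in that returns a string, and the durations are shortened. It is wrapped in `asyncio.run` so it also works outside a notebook.

```python
import asyncio

async def a_sleeper(seconds):
    """Simplified stand-in for the notebook's sleeper; returns a message."""
    await asyncio.sleep(seconds)
    return f"slept {seconds}s"

async def main():
    # Schedule the task; this call returns immediately.
    background = asyncio.create_task(a_sleeper(0.2))
    # The task has not had a chance to finish yet.
    was_done_immediately = background.done()
    # Other work can happen here while the task sleeps.
    await asyncio.sleep(0.05)
    # Awaiting the task blocks only until it completes, then yields its result.
    result = await background
    return was_done_immediately, result

# In a notebook cell, use `await main()` instead of asyncio.run(main()).
was_done_immediately, result = asyncio.run(main())
print(was_done_immediately, result)
```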

In [37]:
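# Split the tasks into done and pending sets, depending on whether they finish within the timeout.
# The coroutines are wrapped in tasks first: passing bare coroutines to asyncio.wait
# is deprecated since Python 3.8 and raises an error from Python 3.11
tasks = [asyncio.create_task(a_sleeper(1)), asyncio.create_task(a_sleeper(4))]
done, pending = await asyncio.wait(tasks, timeout = 2)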


In [38]:
done

{<Task finished name='Task-46' coro=<a_sleeper() done, defined at <ipython-input-14-f326cdd00bdf>:1> result=None>}

In [39]:
pending

{<Task finished name='Task-47' coro=<a_sleeper() done, defined at <ipython-input-14-f326cdd00bdf>:1> result=None>}

In [32]:
# Wait for the remaining pending tasks to finish
await asyncio.wait(pending)

({<Task finished name='Task-36' coro=<a_sleeper() done, defined at <ipython-input-14-f326cdd00bdf>:1> result=None>},
 set())
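Note that `asyncio.wait` does not cancel the tasks that are still pending when the timeout expires; they keep running, which is why a task from the `pending` set can show up as finished when it is inspected in a later cell. As a rough sketch with shortened durations and a simplified `a_sleeper` (both assumptions), the pending tasks can be cancelled explicitly when their results are no longer needed:

```python
import asyncio

async def a_sleeper(seconds):
    """Simplified stand-in for the notebook's sleeper; returns its duration."""
    await asyncio.sleep(seconds)
    return seconds

async def main():
    tasks = [asyncio.create_task(a_sleeper(s)) for s in (0.1, 0.5)]
    # After 0.3s only the 0.1s task has finished.
    done, pending = await asyncio.wait(tasks, timeout=0.3)
    done_results = sorted(task.result() for task in done)
    pending_count = len(pending)
    # asyncio.wait leaves pending tasks running, so cancel them explicitly.
    for task in pending:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising it.
    await asyncio.gather(*pending, return_exceptions=True)
    all_cancelled = all(task.cancelled() for task in pending)
    return done_results, pending_count, all_cancelled

# In a notebook cell, use `await main()` instead of asyncio.run(main()).
done_results, pending_count, all_cancelled = asyncio.run(main())
print(done_results, pending_count, all_cancelled)
```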

In [43]:
# wait_for raises asyncio.TimeoutError if the task exceeds the timeout limit
try:
  await asyncio.wait_for(a_sleeper(5), timeout = 3)
except asyncio.TimeoutError:
  print("Task failed")

Task failed
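In contrast to `asyncio.wait`, `asyncio.wait_for` cancels the wrapped coroutine when the timeout expires. The sketch below makes that visible; `slow_job` and `cleanup_log` are hypothetical names introduced for illustration, and the durations are shortened.

```python
import asyncio

cleanup_log = []  # hypothetical list recording what happened inside the job

async def slow_job():
    """Hypothetical long-running job that notices when it is cancelled."""
    try:
        await asyncio.sleep(1)
    except asyncio.CancelledError:
        # wait_for cancels the job on timeout; record it and re-raise.
        cleanup_log.append("cancelled")
        raise

async def main():
    try:
        await asyncio.wait_for(slow_job(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"
    return "finished"

# In a notebook cell, use `await main()` instead of asyncio.run(main()).
status = asyncio.run(main())
print(status, cleanup_log)
```

Catching `CancelledError` inside the job is a convenient place to release resources such as an open browser session, as long as the exception is re-raised afterwards.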


##1.3 Scraping with Selenium