<a href="https://colab.research.google.com/github/armandossrecife/teste/blob/main/download_sync_async.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Utilidades

## Classe que extrai nome e extensão de um arquivo em uma URL

Uma URL (Uniform Resource Locator), ou endereço web, é como um endereço postal na internet. Ela indica a localização exata de um recurso online, como uma página web, imagem, vídeo ou outro arquivo.

**Componentes de uma URL:**

  * **Esquema:** Indica o protocolo utilizado para acessar o recurso. Os mais comuns são:
      * **http:** Protocolo de transferência de hipertexto, utilizado para páginas web não seguras.
      * **https:** Versão segura do HTTP, que utiliza criptografia para proteger a comunicação.
      * **ftp:** Protocolo de transferência de arquivos, utilizado para transferir arquivos entre computadores.
      * **mailto:** Utilizado para enviar e-mails.
  * **Host:** Identifica o servidor onde o recurso está localizado. É composto pelo nome de domínio (ex: [URL inválido removido]) e, opcionalmente, o número da porta.
  * **Caminho:** Especifica a localização exata do recurso dentro do servidor. É semelhante à estrutura de pastas em um computador.
  * **Parâmetros de consulta:** Fornecem informações adicionais ao servidor, como resultados de busca ou configurações personalizadas. São separados do caminho por um ponto de interrogação (?) e os pares chave-valor são separados por um & (ampersand).
  * **Fragmento:** Indica uma parte específica de uma página, como uma seção ou um link dentro da página. É separado do resto da URL por um jogo da velha (\#).

**Exemplo:**

```
https://www.exemplo.com/pasta/arquivo.html?parametro1=valor1&parametro2=valor2#secao
```

  * **https:** Protocolo seguro.
  * **[www.exemplo.com](https://www.google.com/url?sa=E&source=gmail&q=https://www.exemplo.com):** Nome de domínio do servidor.
  * **/pasta/arquivo.html:** Caminho do arquivo dentro do servidor.
  * **?parametro1=valor1\&parametro2=valor2:** Parâmetros de consulta.
  * **\#secao:** Fragmento, indicando uma seção específica da página.

**Para que serve cada componente:**

  * **Esquema:** Define como o navegador deve se conectar ao servidor.
  * **Host:** Identifica o servidor específico onde o recurso está hospedado.
  * **Caminho:** Direciona o navegador para o local exato do arquivo dentro do servidor.
  * **Parâmetros de consulta:** Permitem que você passe informações adicionais ao servidor, como resultados de busca, configurações personalizadas ou dados de formulários.
  * **Fragmento:** Permite que você navegue diretamente para uma parte específica de uma página, sem precisar carregar a página inteira novamente.

In [22]:
from urllib.parse import urlparse
import os

class Util:
  def extrair_nome_extensao_url(self, url):
    try:
      parsed_url = urlparse(url)
      if parsed_url.scheme not in ('http', 'https', 'ftp'):
        raise ValueError(f"Unsupported protocol: {parsed_url.scheme}")

      caminho_arquivo = parsed_url.path
      if not caminho_arquivo:
        raise ValueError("Missing file path in URL")

      #nome_arquivo, extensao = os.path.splitext(os.path.basename(caminho_arquivo))
      nome_arquivo, extensao = os.path.basename(caminho_arquivo).rsplit('.', 1)

      if not nome_arquivo:
        raise ValueError("Missing file name")

      return nome_arquivo, extensao

    except Exception as ex:
      raise ValueError(f"{str(ex)}") from ex

In [27]:
util = Util()
url = "https://exemplo.com/caminho/para/arquivo.txt"
nome_arquivo, extensao = util.extrair_nome_extensao_url(url)
print(nome_arquivo)  # Saída: arquivo
print(extensao)      # Saída: .txt

arquivo
txt


In [28]:
url2 = "https://www.exemplo.com/pasta/arquivo.html?parametro1=valor1&parametro2=valor2#secao"
nome_arquivo, extensao = util.extrair_nome_extensao_url(url2)
print(nome_arquivo)  # Saída: arquivo
print(extensao)      # Saída: .html

arquivo
html


## Limpa a estrutura das pastas que vão armazenar os arquivos

In [29]:
!rm -rf sincrono && mkdir sincrono
!rm -rf assincrono && mkdir assincrono
!rm -rf threads && mkdir threads

In [25]:
!ls -lia

total 28
6815750 drwxr-xr-x 1 root root 4096 Jan 13 15:19 .
5505040 drwxr-xr-x 1 root root 4096 Jan 13 14:25 ..
5509883 drwxr-xr-x 2 root root 4096 Jan 13 15:19 assincrono
2621447 drwxr-xr-x 4 root root 4096 Jan  9 14:24 .config
6815751 drwxr-xr-x 1 root root 4096 Jan  9 14:24 sample_data
5509882 drwxr-xr-x 2 root root 4096 Jan 13 15:19 sincrono
5509884 drwxr-xr-x 2 root root 4096 Jan 13 15:19 threads


In [26]:
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy


## Prepara as URLs dos arquivos de exemplo (Download de arquivos públicos)

In [30]:
my_raw_data_site = "https://raw.githubusercontent.com/armandossrecife/teste/main"
my_url1 = my_raw_data_site + "/" + "Adrienne.mp4"
my_url2 = my_raw_data_site + "/" + "Pizigani_1367_Chart_10MB.jpg"
my_url3 = my_raw_data_site + "/" + "Kalimba.mp3"
my_url4 = my_raw_data_site + "/" + "screen_matrix.jpeg"
my_url5 = my_raw_data_site + "/" + "demo.zip"

my_urls = [my_url1, my_url2, my_url3, my_url4, my_url5]

my_util = Util()
my_filenames = []
for url in my_urls:
  print(url)
  nome_arquivo, extensao = my_util.extrair_nome_extensao_url(url)
  filename = f"{nome_arquivo}.{extensao}"
  print(f"{nome_arquivo}.{extensao}")
  my_filenames.append(filename)

https://raw.githubusercontent.com/armandossrecife/teste/main/Adrienne.mp4
Adrienne.mp4
https://raw.githubusercontent.com/armandossrecife/teste/main/Pizigani_1367_Chart_10MB.jpg
Pizigani_1367_Chart_10MB.jpg
https://raw.githubusercontent.com/armandossrecife/teste/main/Kalimba.mp3
Kalimba.mp3
https://raw.githubusercontent.com/armandossrecife/teste/main/screen_matrix.jpeg
screen_matrix.jpeg
https://raw.githubusercontent.com/armandossrecife/teste/main/demo.zip
demo.zip


# Download síncrono

https://requests.readthedocs.io

Requests is an elegant and simple HTTP library for Python, built for human beings.

Exemplo de uso da biblioteca requets:

In [35]:
import requests

def buscar_cep(cep):
    url = f'https://viacep.com.br/ws/{cep}/json/'
    response = requests.get(url)
    dados = response.json()
    return dados

cep = '64007250'
resultado = buscar_cep(cep)
print(resultado)
print(f"Logradouro: {resultado['logradouro']}")
print(f"Bairro: {resultado['bairro']}")
print(f"Cidade: {resultado['localidade']}")
print(f"Estado: {resultado['uf']}")

{'cep': '64007-250', 'logradouro': 'Rua Território Fernando de Noronha', 'complemento': '', 'unidade': '', 'bairro': 'Aeroporto', 'localidade': 'Teresina', 'uf': 'PI', 'estado': 'Piauí', 'regiao': 'Nordeste', 'ibge': '2211001', 'gia': '', 'ddd': '86', 'siafi': '1219'}
Logradouro: Rua Território Fernando de Noronha
Bairro: Aeroporto
Cidade: Teresina
Estado: PI


## Funções para download de arquivos

In [37]:
import requests

# Faz o download de um único arquivo
def download_one_file(url, filename, path):
  response = requests.get(url, stream=True)
  if response.status_code == 200:
    total_size = int(response.headers['content-length'])  # Get total file size
    print(f"Total file size: {total_size} bytes")
    path = os.path.join(path, filename)
    with open(path, 'wb') as f:
      for chunk in response.iter_content(1024):
        f.write(chunk)
    print(f"Downloaded {filename}")
  else:
    print(f"Failed to download {filename}")

# Faz o download sincrono de uma lista de arquivos
def download_files_synchronous(my_urls, my_filenames, path):
  # Download each file synchronously
  for url, filename in zip(my_urls, my_filenames):
    download_one_file(url, filename, path)

In [39]:
my_urls

['https://raw.githubusercontent.com/armandossrecife/teste/main/Adrienne.mp4',
 'https://raw.githubusercontent.com/armandossrecife/teste/main/Pizigani_1367_Chart_10MB.jpg',
 'https://raw.githubusercontent.com/armandossrecife/teste/main/Kalimba.mp3',
 'https://raw.githubusercontent.com/armandossrecife/teste/main/screen_matrix.jpeg',
 'https://raw.githubusercontent.com/armandossrecife/teste/main/demo.zip']

In [6]:
import datetime

now1 = datetime.datetime.now()
print(now1)

download_files_synchronous(my_urls, my_filenames, 'sincrono')
print("All files downloaded (synchronously)")

now2 = datetime.datetime.now()
print(now2)

time_diff = now2 - now1
print(time_diff)

2025-01-13 14:26:12.107071
Total file size: 14944332 bytes
Downloaded Adrienne.mp4
Total file size: 10174706 bytes
Downloaded Pizigani_1367_Chart_10MB.jpg
Total file size: 8414449 bytes
Downloaded Kalimba.mp3
Total file size: 265136 bytes
Downloaded screen_matrix.jpeg
Total file size: 69856 bytes
Downloaded demo.zip
All files downloaded (synchronously)
2025-01-13 14:26:14.794788
0:00:02.687717


# Download assíncrono

https://docs.aiohttp.org

Asynchronous HTTP Client/Server for asyncio and Python.

https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)

https://en.wikipedia.org/wiki/Async/await

In [7]:
!pip install aiohttp



In [8]:
!pip install aiodns

Collecting aiodns
  Downloading aiodns-3.2.0-py3-none-any.whl.metadata (4.0 kB)
Collecting pycares>=4.0.0 (from aiodns)
  Downloading pycares-4.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Downloading aiodns-3.2.0-py3-none-any.whl (5.7 kB)
Downloading pycares-4.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m288.6/288.6 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pycares, aiodns
Successfully installed aiodns-3.2.0 pycares-4.5.0


In [9]:
import asyncio
import aiohttp
import datetime

In [10]:
async def teste_async():
  async with aiohttp.ClientSession() as session:
    async with session.get('http://python.org') as response:
      print("Status:", response.status)
      print("Content-type:", response.headers['content-type'])
      html = await response.text()

In [11]:
async def call_teste_async():
  await teste_async()

await call_teste_async()

Status: 200
Content-type: text/html; charset=utf-8


In [12]:
async def download_async(url, filename, path):
  """Downloads a file from the given URL and saves it with the specified filename."""
  async with aiohttp.ClientSession() as session:
    async with session.get(url) as response:
      total_size = int(response.headers['content-length'])  # Get total file size
      print(f"Total file size: {total_size} bytes")
      if response.status == 200:
        path = os.path.join(path, filename)
        with open(path, 'wb') as f:
          # Read content in chunks using aiohttp.read()
          while True:
            chunk = await response.content.read(1024)  # Read in chunks of 1024 bytes
            if not chunk:
              break
            f.write(chunk)
        print(f"Downloaded {filename}")
      else:
        print(f"Failed to download {filename}")

async def download_files_asynchronous(my_urls, my_filenames, path):
  """Downloads all files asynchronously."""
  tasks = []
  for url, filename in zip(my_urls, my_filenames):
    tasks.append(asyncio.create_task(download_async(url, filename, path)))

  # Run all tasks concurrently using asyncio.run
  await asyncio.gather(*tasks)

In [13]:
now1 = datetime.datetime.now()
print(now1)

async def main():
  await download_files_asynchronous(my_urls, my_filenames, 'assincrono')

# Call main as an async function and await it
await main()

now2 = datetime.datetime.now()
print(now2)

time_diff = now2 - now1
print(time_diff)

2025-01-13 14:26:27.874665
Total file size: 10174706 bytes
Total file size: 69856 bytes
Total file size: 8414449 bytes
Total file size: 14944332 bytes
Total file size: 265136 bytes
Downloaded demo.zip
Downloaded screen_matrix.jpeg
Downloaded Kalimba.mp3
Downloaded Pizigani_1367_Chart_10MB.jpg
Downloaded Adrienne.mp4
2025-01-13 14:26:28.309048
0:00:00.434383


# Downloads usando Threads

https://docs.python.org/3/library/threading.html

https://en.wikipedia.org/wiki/Thread_(computing)

In [14]:
import threading
import requests

def download_files_via_threads(urls, filenames, path):
  """Downloads multiple files asynchronously using threads."""
  threads = []
  for url, filename in zip(urls, filenames):
    thread = threading.Thread(target=download_one_file(url, filename, path), args=(url, filename, path))
    threads.append(thread)
    thread.start()

  # Wait for all threads to finish (blocking)
  for thread in threads:
    thread.join()

In [15]:
now1 = datetime.datetime.now()
print(now1)

download_files_via_threads(my_urls, my_filenames, 'threads')

now2 = datetime.datetime.now()
print(now2)

time_diff = now2 - now1
print(time_diff)

2025-01-13 14:26:28.342254
Total file size: 14944332 bytes
Downloaded Adrienne.mp4
Total file size: 10174706 bytes
Downloaded Pizigani_1367_Chart_10MB.jpg
Total file size: 8414449 bytes
Downloaded Kalimba.mp3
Total file size: 265136 bytes
Downloaded screen_matrix.jpeg
Total file size: 69856 bytes
Downloaded demo.zip
2025-01-13 14:26:29.316053
0:00:00.973799


# Conceitos chaves

## Chamadas síncronas

In [16]:
import time

def task1_sync():
  """Simulates a long-running task that takes 2 seconds."""
  print("Task 1 started")
  print(datetime.datetime.now())
  time.sleep(2)  # Simulate work for 2 seconds
  print("Task 1 finished")
  print(datetime.datetime.now())

def task2_sync():
  """Simulates a shorter task that takes 1 second."""
  print("Task 2 started")
  print(datetime.datetime.now())
  time.sleep(1)  # Simulate work for 1 second
  print("Task 2 finished")
  print(datetime.datetime.now())

def run_tasks_sinc():
  """Runs two tasks concurrently using asyncio."""
  task1_sync()
  task2_sync()

  print("All tasks finished")

In [17]:
print("#"*50)
print("Chamada síncrona")
now1 = datetime.datetime.now()
print(now1)
print("-"*50)

run_tasks_sinc()

now2 = datetime.datetime.now()
print(now2)
print("-"*50)
print(f"Tempo total: {now2-now1}")

##################################################
Chamada síncrona
2025-01-13 14:26:29.353808
--------------------------------------------------
Task 1 started
2025-01-13 14:26:29.355535
Task 1 finished
2025-01-13 14:26:31.357393
Task 2 started
2025-01-13 14:26:31.357455
Task 2 finished
2025-01-13 14:26:32.358374
All tasks finished
2025-01-13 14:26:32.358670
--------------------------------------------------
Tempo total: 0:00:03.004862


## Chamadas assíncronas

In [18]:
import asyncio

async def task1_async():
  """Simulates a long-running task that takes 2 seconds."""
  print("Task 1 started")
  print(datetime.datetime.now())
  await asyncio.sleep(2)  # Simulate work for 2 seconds
  print("Task 1 finished")
  print(datetime.datetime.now())

async def task2_async():
  """Simulates a shorter task that takes 1 second."""
  print("Task 2 started")
  print(datetime.datetime.now())
  await asyncio.sleep(1)  # Simulate work for 1 second
  print("Task 2 finished")
  print(datetime.datetime.now())

async def run_tasks_asinc():
  """Runs two tasks concurrently using asyncio."""
  task1_future = asyncio.create_task(task1_async())
  task2_future = asyncio.create_task(task2_async())

  # Wait for both tasks to complete concurrently (non-blocking)
  await task1_future
  await task2_future

  print("All tasks finished")

# Run the event loop in terminal
#loop = asyncio.get_event_loop()
#loop.run_until_complete(run_tasks())
#loop.close()

In [19]:
async def call_run_tasks():
  await run_tasks_asinc()

print("#"*50)
print("Chamada assíncrona")
now1 = datetime.datetime.now()
print(now1)
print("-"*50)

await call_run_tasks()

now2 = datetime.datetime.now()
print(now2)
print("-"*50)
print(f"Tempo total: {now2-now1}")

##################################################
Chamada assíncrona
2025-01-13 14:26:32.413042
--------------------------------------------------
Task 1 started
2025-01-13 14:26:32.424320
Task 2 started
2025-01-13 14:26:32.425335
Task 2 finished
2025-01-13 14:26:33.429250
Task 1 finished
2025-01-13 14:26:34.426474
All tasks finished
2025-01-13 14:26:34.426812
--------------------------------------------------
Tempo total: 0:00:02.013770


# Usando threads

In [20]:
import threading
import time

def task1():
  """Simulates a long-running task (2 seconds) using threading."""
  print("Task 1 started (Thread)")
  print(datetime.datetime.now())
  time.sleep(2)  # Simulate work for 2 seconds
  print("Task 1 finished (Thread)")
  print(datetime.datetime.now())

def task2():
  """Simulates a shorter task (1 second) using threading."""
  print("Task 2 started (Thread)")
  print(datetime.datetime.now())
  time.sleep(1)  # Simulate work for 1 second
  print("Task 2 finished (Thread)")
  print(datetime.datetime.now())

def run_tasks_via_threads():
  """Runs two tasks concurrently using threads."""
  thread1 = threading.Thread(target=task1)
  thread2 = threading.Thread(target=task2)

  # Start threads
  thread1.start()
  thread2.start()

  # Wait for threads to finish
  thread1.join()
  thread2.join()

  print("All tasks finished (Threads)")

In [21]:
print("#"*50)
print("Chamada assíncrona usando Threads")
now1 = datetime.datetime.now()
print(now1)
print("-"*50)

run_tasks_via_threads()

now2 = datetime.datetime.now()
print(now2)
print("-"*50)
print(f"Tempo total: {now2-now1}")

##################################################
Chamada assíncrona usando Threads
2025-01-13 14:26:34.520739
--------------------------------------------------
Task 1 started (Thread)
2025-01-13 14:26:34.527608
Task 2 started (Thread)
2025-01-13 14:26:34.543504
Task 2 finished (Thread)
2025-01-13 14:26:35.544396
Task 1 finished (Thread)
2025-01-13 14:26:36.531470
All tasks finished (Threads)
2025-01-13 14:26:36.532000
--------------------------------------------------
Tempo total: 0:00:02.011261
