2 changes: 2 additions & 0 deletions .env
@@ -0,0 +1,2 @@
KAFKA_BROKER=localhost:9092
KAFKA_TOPIC=scraping
91 changes: 39 additions & 52 deletions README.md
@@ -3,68 +3,55 @@

# Python Backend Challenge

**Objective:** Implement a web scraper in Python to collect data from the "Scrape This Site" web page, structure the data as JSON, and send it to a Kafka queue.
**Objective:** The system consists of a crawler that collects country data from a website and sends it to a Kafka queue, from which it can be visualized through a Streamlit application.

**Requirements:**
## Technical details

1. Data Collection:
**How the crawler works**:

- Scrape the site https://www.scrapethissite.com/pages/simple/.
- Collect the data for all listed countries, focusing specifically on the population data.
- The crawler is run manually; it collects country data from https://www.scrapethissite.com/pages/simple/ and sends it to a Kafka queue (a sketch of the spider is shown after this list).
- The crawler uses a pool of rotating proxies to avoid being blocked by the site. (Note: the crawler may be slow because it relies on free proxies.)
- The crawler also rotates User-Agent headers, likewise to avoid being blocked by the site.
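
The spider module (`scraping/scraping/spiders/countries_spider.py`) is not part of this diff; the snippet below is only a rough sketch of what it might look like, assuming the usual `.country-*` CSS classes on the Scrape This Site page and the `ScrapingItem` fields defined in `items.py`.

```
# Hypothetical sketch of countries_spider.py (the real file is not shown in this PR).
# The CSS selectors below are assumptions about the page markup.
import scrapy
from scraping.items import ScrapingItem


class CountriesSpider(scrapy.Spider):
    name = "countries"
    start_urls = ["https://www.scrapethissite.com/pages/simple/"]

    def parse(self, response):
        # Each country block carries its name, capital, population and area.
        for country in response.css("div.country"):
            item = ScrapingItem()
            item["nameCountry"] = country.css("h3.country-name::text").get(default="").strip()
            item["capitalCountry"] = country.css("span.country-capital::text").get()
            item["populationCountry"] = country.css("span.country-population::text").get()
            item["areaCountry"] = country.css("span.country-area::text").get()
            yield item
```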

2. Data Structuring:
**Kafka integration**:

- Structure the collected data as JSON.
- Use Python classes or dictionaries to represent the data structure. It must contain at least the fields "Country" and "Population".
- The crawler publishes the data to a Kafka queue, which is consumed by the Streamlit application (see the consumer sketch after this list).
- Kafka is configured through docker-compose to make the project easy to run.
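
Based on `items.py` and the `KafkaPipeline` in `pipelines.py`, each crawl publishes a single message containing a JSON array of country records. A minimal consumer sketch follows; the group id, timeout and sample values are only illustrative and not part of this PR.

```
# Illustrative consumer for the batch published by KafkaPipeline.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # KAFKA_BROKER from .env
    "group.id": "scraping-demo",            # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["scraping"])            # KAFKA_TOPIC from .env

msg = consumer.poll(10.0)                   # wait up to 10 s for the batch
if msg is not None and msg.error() is None:
    countries = json.loads(msg.value())
    # Expected shape (values are only an example):
    # [{"nameCountry": "Andorra", "capitalCountry": "Andorra la Vella",
    #   "populationCountry": 84000, "areaCountry": 468.0}, ...]
    print(len(countries), "countries received")
consumer.close()
```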

3. Kafka Integration:
**Streamlit application**:

- Send the structured data to a Kafka queue.
- Provide the Docker files (Dockerfile and docker-compose, if applicable) for the Kafka instance used in the test.
- The Streamlit application consumes the data from the Kafka queue and displays it in a table, along with a chart of the countries with the highest population densities (a sketch is shown below).
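
`streamlit-frontend.py` is not included in this diff, so the following is only a minimal sketch of the display logic described above, assuming density is computed as population divided by area; names and values are placeholders.

```
# Minimal sketch of the Streamlit display logic (streamlit-frontend.py is not in this diff).
import pandas as pd
import streamlit as st

# `countries` stands in for the decoded JSON batch read from Kafka
# (see the consumer sketch above); the single record is a placeholder.
countries = [
    {"nameCountry": "Andorra", "capitalCountry": "Andorra la Vella",
     "populationCountry": 84000, "areaCountry": 468.0},
]

df = pd.DataFrame(countries)
df["density"] = df["populationCountry"] / df["areaCountry"]  # people per km²

st.dataframe(df)  # full table of scraped countries
# Bar chart of the ten densest countries.
top = df.nlargest(10, "density").set_index("nameCountry")
st.bar_chart(top["density"])
```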

## Installation

**Bonus points:**
**Prerequisites:**

- Implement logic and algorithms to keep the scraper from being blocked, such as:
  - Rotating proxies.
  - Variable intervals between requests.
  - Identifying and manipulating headers (User-Agent) to simulate different browsers or devices.
- Docker Compose
- Python 3
- Pip
- Git

**What will be evaluated:**
**Installation:**

1. Code quality and organization.
2. Ability to define and use Python classes or dictionaries.
3. Kafka integration and correct configuration of the Docker environment for Kafka.
4. Implementation of the bonus points (if applicable).
5. Code documentation and instructions for running the project.
- Clone the repository
  - `git clone git@github.com:MauroTony/Teste-Backend-Python.git`
  - `cd Teste-Backend-Python`
  - `git checkout main`
- Run docker-compose
  - `docker-compose up -d`
- Install the project dependencies
  - `pip install -r requirements.txt`
- Configure the environment variables
  - Make sure a .env file exists at the project root
  - Make sure the KAFKA_BROKER and KAFKA_TOPIC variables are set and adjust them if necessary (a quick check is shown after this list)
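
These variables are read by `config.py` through `python-dotenv`; a quick sanity check, assuming it is run from the project root:

```
# Confirms that the values from .env are picked up by config.py.
from config import get_config

config = get_config()
print(config.KAFKA_BROKER)  # expected: localhost:9092
print(config.KAFKA_TOPIC)   # expected: scraping
```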

**Execution:**

**Submission instructions:**

1. The candidate must fork this repository and, once development is finished, open a pull request for the team to review.
2. Include a README with clear instructions on how to run and test the project.

---
#### LICENSE
```
MIT License

Copyright (c) 2016 ZenoX IA

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
- Start Kafka
  - `docker-compose up -d`
- Start the Streamlit app
  - `streamlit run streamlit-frontend.py`
- Run the crawler
  - `cd scraping`
  - `python runCrawler.py`
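- (Optional) Confirm that the batch reached the topic; one way, assuming the console consumer shipped with the confluentinc/cp-kafka image, is:
  - `docker-compose exec kafka kafka-console-consumer --bootstrap-server localhost:9092 --topic scraping --from-beginning --max-messages 1`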

13 changes: 13 additions & 0 deletions config.py
@@ -0,0 +1,13 @@
import os
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())


class GeneralConfig:
    KAFKA_TOPIC: str = os.getenv('KAFKA_TOPIC')
    KAFKA_BROKER: str = os.getenv('KAFKA_BROKER')


def get_config() -> GeneralConfig:
    return GeneralConfig()
21 changes: 21 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,21 @@
version: '3.8'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
    restart: on-failure
Binary file added requirements.txt
Binary file not shown.
7 changes: 7 additions & 0 deletions scraping/runCrawler.py
@@ -0,0 +1,7 @@
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scraping.spiders.countries_spider import CountriesSpider

# Run the countries spider with the project settings (execute from the scraping/
# directory, as described in the README, so the Scrapy settings module is found).
process = CrawlerProcess(get_project_settings())
process.crawl(CountriesSpider)
process.start()
Empty file added scraping/scraping/__init__.py
Empty file.
8 changes: 8 additions & 0 deletions scraping/scraping/items.py
@@ -0,0 +1,8 @@
import scrapy


class ScrapingItem(scrapy.Item):
    # One record per country scraped from the site.
    nameCountry = scrapy.Field()
    capitalCountry = scrapy.Field()
    populationCountry = scrapy.Field()
    areaCountry = scrapy.Field()
44 changes: 44 additions & 0 deletions scraping/scraping/middlewares.py
@@ -0,0 +1,44 @@
# Downloader middlewares used to rotate User-Agent headers and proxies.
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

import random
import logging


class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(user_agents=crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a random User-Agent from settings.USER_AGENTS for every request.
        request.headers.setdefault('User-Agent', random.choice(self.user_agents))


class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # Route every request through a random proxy from settings.PROXIES.
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy

    def process_exception(self, request, exception, spider):
        # On connection errors, reschedule a copy of the request so it is retried
        # (process_request will then assign a different random proxy).
        if isinstance(exception, (ConnectionRefusedError, TimeoutError)):
            self.logger.warning(f"Failed to connect using proxy {request.meta['proxy']}, retrying with a different proxy...")
            new_request = request.copy()
            new_request.dont_filter = True
            new_request.priority = request.priority + 1
            return new_request
        return None
55 changes: 55 additions & 0 deletions scraping/scraping/pipelines.py
@@ -0,0 +1,55 @@
import json
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
from confluent_kafka import Producer
import logging

logger = logging.getLogger('MyPipelineLogger')

class DataProcessingPipeline:

    def process_item(self, item, spider):
        # Drop items that are missing the fields needed later (population and area).
        if not item.get('populationCountry') or not item.get('areaCountry'):
            raise DropItem("Item is missing required fields")

        try:
            item['populationCountry'] = int(item['populationCountry'])
            item['areaCountry'] = float(item['areaCountry'])
        except ValueError:
            raise DropItem("Could not convert population/area to numbers")
        return item


class KafkaPipeline:

    def __init__(self, kafka_broker, kafka_topic):
        self.kafka_broker = kafka_broker
        self.kafka_topic = kafka_topic
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            kafka_broker=crawler.settings.get('KAFKA_BROKER'),
            kafka_topic=crawler.settings.get('KAFKA_TOPIC')
        )

    def open_spider(self, spider):
        self.producer = Producer({'bootstrap.servers': self.kafka_broker})

    def close_spider(self, spider):
        # Publish the buffered batch once the spider finishes, then flush the producer.
        self.process_all_items()
        self.producer.flush()

    def process_item(self, item, spider):
        # Buffer items in memory; they are sent as a single JSON array on close.
        self.items.append(dict(item))
        return item

    def process_all_items(self):
        try:
            if self.items:
                content = json.dumps(self.items)
                self.producer.produce(self.kafka_topic, content)
                logger.info(f"Sending data to Kafka: {content}")
        except Exception as e:
            logger.error(f"Error sending data to Kafka: {e}")
116 changes: 116 additions & 0 deletions scraping/scraping/settings.py
@@ -0,0 +1,116 @@
# Scrapy settings for scraping project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# from config import get_config

BOT_NAME = "scraping"

SPIDER_MODULES = ["scraping.spiders"]
NEWSPIDER_MODULE = "scraping.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scraping (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "scraping.middlewares.ScrapingSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    "scraping.middlewares.RandomUserAgentMiddleware": 200,
    "scraping.middlewares.RandomProxyMiddleware": 100,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scraping.pipelines.DataProcessingPipeline": 300,
    "scraping.pipelines.KafkaPipeline": 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

USER_AGENTS = [
    # Chrome 91 Windows 10
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    # Firefox 89 Windows 10
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    # Safari 14 macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    # iPhone X Safari
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
]

PROXIES = [
    'http://181.191.94.126:8999',
    'http://201.91.82.155:3128',
    'http://191.243.46.162:43241',
]

KAFKA_BROKER = 'localhost:9092'
KAFKA_TOPIC = 'scraping'
4 changes: 4 additions & 0 deletions scraping/scraping/spiders/__init__.py
@@ -0,0 +1,4 @@
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.