forked from mordax7/flathunter
HTTP errors on abstract_crawler.py using docker #168
Comments
The URLs in your config file appear to be invalid. In the urls section, you
need to include links to the pages on WG-gesucht, immoscout and immowelt
that you actually want to crawl.
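
A quick way to spot the problem before starting the bot is to check the `urls` list for entries that still end in the template's literal `...`. The snippet below is only a minimal sketch and not part of flathunter; `find_placeholder_urls` is a hypothetical helper name, and the list is copied from the config quoted in this issue.

```python
from urllib.parse import urlparse


def find_placeholder_urls(urls):
    """Return the entries that look like unedited template placeholders.

    Hypothetical helper for illustration; flathunter does not ship this.
    """
    suspicious = []
    for url in urls:
        parsed = urlparse(url)
        # Flag entries that still end in the sample config's literal "..."
        # or that have no host at all.
        if url.rstrip().endswith("...") or not parsed.netloc:
            suspicious.append(url)
    return suspicious


# The urls section exactly as posted in this issue.
config_urls = [
    "https://www.immobilienscout24.de/Suche/...",
    "https://www.immowelt.de/...",
    "https://www.ebay-kleinanzeigen.de/...",
    "https://www.wg-gesucht.de/...",
]

for url in find_placeholder_urls(config_urls):
    print(f"placeholder, replace with a real search URL: {url}")
```

All four entries are flagged here, which would be consistent with the 404 responses in the log: the crawler requests the placeholder addresses as they are written.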
Arthur
y-71 ***@***.***> wrote on Fri, 29 Apr 2022, 19:43:
… I tried to set up a Telegram bot using Docker.
I ended up using this config file, since I just want to get it working for now
and don't want to deal with filters yet:
---
# Enable verbose mode (print DEBUG log messages)
# verbose: true
# Should the bot endlessly loop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
    active: yes
    sleeping_time: 600
# Location of the Database to store already seen offerings
# Defaults to the current directory
#database_location: /path/to/database
# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, and www.ebay-kleinanzeigen.de.
# List the URLs in the following format:
# urls:
urls:
    - https://www.immobilienscout24.de/Suche/...
    - https://www.immowelt.de/...
    - https://www.ebay-kleinanzeigen.de/...
    - https://www.wg-gesucht.de/...
# Define filters to exclude flats that don't meet your criteria.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
# 'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More on Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
#   excluded_titles:
#     - "wg"
#     - "zwischenmiete"
#   min_price: 700
#   max_price: 1000
#   min_size: 50
#   max_size: 80
#   max_price_per_square: 1000
# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
# blacklist:
#   - Innenstadt
# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
# - "bicycling"
# - "transit" (public transport)
# - "driving"
# - "walking"
#
# The example configuration below includes a place for
# "John", located at the main train station of munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
#   - name: John
#     destination: Hauptbahnhof, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: bicycling
#         title: "Rad"
#   - name: Jane
#     destination: Karlsplatz, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: driving
#         title: "Auto"
# Multiline message (yes, the | is supposed to be there),
# to format the message received from the Telegram bot.
#
# Available placeholders:
# - {title}: The title of the expose
# - {rooms}: Number of rooms
# - {price}: Price for the flat
# - {durations}: Durations calculated by GMaps, see above
# - {url}: URL to the expose
message: |
    {title}
    Zimmer: {rooms}
    Größe: {size}
    Preis: {price}
    Ort: {address}
    {url}
# Calculating durations requires access to the Google Maps API.
# Below you can configure the URL to access the API, with placeholders.
# The URL should most probably just be kept like that.
# To use the Google Maps API, an API key is required. You can obtain one
# without costs from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True
google_maps_api:
    key: YOUR_API_KEY
    url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
    enable: False
# If you are planning to scrape immoscout24.de, the bot will need
# to circumvent the site's captcha protection by using a captcha
# solving service. Register at either imagetypers or 2captcha
# (the former is preferred), deposit some funds, uncomment the
# corresponding lines below and replace your API key/token.
# You will also have to install a Chrome WebDriver and enter its
# executable path below; the driver_arguments can be left as is.
# captcha:
#   imagetypers:
#     token: alskdjaskldjfklj
#   2captcha:
#     api_key: alskdjaskldjfklj
#   driver_path: YOUR_CHROME_DRIVER_PATH
#   driver_arguments:
#     - "--headless"
# You can select whether to be notified by telegram or via a mattermost
# webhook. For all notifiers selected here a configuration must be provided
# below.
# notifiers:
#   - telegram
#   - mattermost
notifiers:
    - telegram
#   - mattermost
# Sending messages using Telegram requires a Telegram Bot configured.
# Telegram.org offers good documentation on how to create a bot.
# Once you have read it, this will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot.
#
telegram:
    bot_token: [a set of numbers which I suppose are the bot's ID]:[a token]
    receiver_ids:
        - [ my ID ]
# Sending messages via mattermost requires a webhook url provided by a
# mattermost server. You can find a description how to set up a webhook with
# the official mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
#   webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356
# If you are running the web interface, you can configure Login with Telegram support
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
#   bot_name: bot_name_xxx
#   domain: flathunter.example.com
#   session_key: SomeSecretValue
#   listen:
#     host: 127.0.0.1
#     port: 8080
# If you are deploying to google cloud,
# uncomment this and set it to your project id. More info in the readme.
# google_cloud_project_id: my-flathunters-project-id
# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True
the output is:
18:32:36|crawl_wggesucht.py|ERROR Got response (404):
18:32:36|abstract_crawler.py|ERROR ]: Got response (404):
[2022/04/29 18:42:38|abstract_crawler.py|ERROR ]: Got response (404):
[2022/04/29 18:42:38|abstract_crawler.py|ERROR ]: Got response (500):
This output appears every time the pages are crawled,
and I don't receive any message on Telegram.