`HTTP errors` on `abstract_crawler.py` using docker #168

y-71 · 2022-04-29T18:43:35Z

I've tried to use setup a telegram bot using docker

I ended up using this config file since i just want to make it work now and don't want to handle filters yet:

---
# Enable verbose mode (print DEBUG log messages)
# verbose: true

# Should the bot endlessly looop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
    active: yes
    sleeping_time: 600

# Location of the Database to store already seen offerings
# Defaults to the current directory
#database_location: /path/to/database

# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, and www.ebay-kleinanzeigen.de.
# List the URLs in the following format:
# urls:
urls:
  - https://www.immobilienscout24.de/Suche/...
  - https://www.immowelt.de/...
  - https://www.ebay-kleinanzeigen.de/...
  - https://www.wg-gesucht.de/...

# Define filters to exclude flats that don't meet your critera.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
#   'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More to Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
#   excluded_titles:
#     - "wg"
#     - "zwischenmiete"
#   min_price: 700
#   max_price: 1000
#   min_size: 50
#   max_size: 80
#   max_price_per_square: 1000

# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
# blacklist:
#   - Innenstadt

# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#  
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
# 	- "bicycling"
#	- "transit" (public transport)
#	- "driving"
#   - "walking"
# 
# The example configuration below includes a place for
# "John", located at the main train station of munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
#     - name: John
#       destination: Hauptbahnhof, München
#       modes: 
#           - gm_id: transit
#             title: "Öff."
#           - gm_id: bicycling
#             title: "Rad"
#     - name: Jane
#       destination: Karlsplatz, München
#       modes: 
#           - gm_id: transit
#             title: "Öff."
#           - gm_id: driving
#             title: "Auto"

# Multiline message (yes, the | is supposed to be there), 
# to format the message received from the Telegram bot. 
# 
# Available placeholders:
# 	- {title}: The title of the expose
#	- {rooms}: Number of rooms
#	- {price}: Price for the flat
# 	- {durations}: Durations calculated by GMaps, see above
#	- {url}: URL to the expose
message: |
    {title}
    Zimmer: {rooms}
    Größe: {size}
    Preis: {price}
    Ort: {address}

    {url}

# Calculating durations requires access to the Google Maps API. 
# Below you can configure the URL to access the API, with placeholders.
# The URL should most probably just kept like that. 
# To use the Google Maps API, an API key is required. You can obtain one
# without costs from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True
google_maps_api:
    key: YOUR_API_KEY
    url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
    enable: False

# If you are planning to scrape immoscout24.de, the bot will need 
# to circumvent the sites captcha protection by using a captcha 
# solving service. Register at either imagetypers or 2captcha 
# (the former is prefered), desposit some funds, uncomment the 
# corresponding lines below and replace your API key/token.
# you will also have to install a Chrome Web Driver and write below 
# the executable path, the driver_arguments can be left as is.
# captcha:
#       imagetypers:
#             token: alskdjaskldjfklj
#       2captcha:
#             api_key: alskdjaskldjfklj
#       driver_path: YOUR_CHROME_DRIVER_PATH
#       driver_arguments:
#         - "--headless"

# You can select whether to be notified by telegram or via a mattermost
# webhook. For all notifiers selected here a configuration must be provided
# below.
# notifiers:
#   - telegram
#   - mattermost
notifiers:
    - telegram
    # - mattermost

# Sending messages using Telegram requires a Telegram Bot configured. 
# Telegram.org offers a good documentation about how to create a bot.
# Once you read it, will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot. 
#
telegram:
  bot_token: [a set of numbers which I suppose are the bot's ID]:[a token]
  receiver_ids:
    - [ my ID ]

# Sending messages via mattermost requires a webhook url provided by a
# mattermost server. You can find a description how to set up a webhook with
# the official mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
#   webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356

# If you are running the web interface, you can configure Login with Telegram support
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
#    bot_name: bot_name_xxx
#    domain: flathunter.example.com
#    session_key: SomeSecretValue
#    listen:
#      host: 127.0.0.1
#      port: 8080

# If you are deploying to google cloud,
# uncomment this and set it to your project id. More info in the readme.
# google_cloud_project_id: my-flathunters-project-id

# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True

the output is:

18:32:36|crawl_wggesucht.py|ERROR Got response (404): 
18:32:36|abstract_crawler.py|ERROR   ]: Got response (404): 
[2022/04/29 18:42:38|abstract_crawler.py|ERROR   ]:  Got response (404): 
[2022/04/29 18:42:38|abstract_crawler.py|ERROR   ]: Got response (500):

the output is everytime the crawled page

and i don't receive any message on telegram

The text was updated successfully, but these errors were encountered:

codders · 2022-04-30T09:15:43Z

The URLs in your config file appear to be invalid. In the urls section, you need to include links to the pages on WG-gesucht, immoscout and immowelt that you actually want to crawl. Arthur y-71 ***@***.***> schrieb am Fr., 29. Apr. 2022, 19:43:

…

I've tried to use setup a telegram bot using docker I ended up using this config file since i just want to make it work now and don't want to handle filters yet: --- # Enable verbose mode (print DEBUG log messages) # verbose: true # Should the bot endlessly looop through the URLs? # Between each loop it waits for <sleeping_time> seconds. # Note that Ebay will (temporarily) block your IP if you # poll too often - don't lower this below 600 seconds if you # are crawling Ebay. loop: active: yes sleeping_time: 600 # Location of the Database to store already seen offerings # Defaults to the current directory #database_location: /path/to/database # List the URLs containing your filter properties below. # Currently supported services: www.immobilienscout24.de, # www.immowelt.de, www.wg-gesucht.de, and www.ebay-kleinanzeigen.de. # List the URLs in the following format: # urls: urls: - https://www.immobilienscout24.de/Suche/... - https://www.immowelt.de/... - https://www.ebay-kleinanzeigen.de/... - https://www.wg-gesucht.de/... # Define filters to exclude flats that don't meet your critera. # Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size', # 'max_price', 'min_price', and 'excluded_titles'. # # 'excluded_titles' takes a list of regex patterns that match against # the title of the flat. Any matching titles will be excluded. # More to Python regex here: https://docs.python.org/3/library/re.html # # Example: # filters: # excluded_titles: # - "wg" # - "zwischenmiete" # min_price: 700 # max_price: 1000 # min_size: 50 # max_size: 80 # max_price_per_square: 1000 # There are often city districts in the address which # Google Maps does not like. Use this blacklist to remove # districts from the search. # blacklist: # - Innenstadt # If an expose includes an address, the bot is capable of # displaying the distance and time to travel (duration) to # some configured other addresses, for specific kinds of # travel. # # Available kinds of travel ('gm_id') can be found in the # Google Maps API documentation, but basically there are: # - "bicycling" # - "transit" (public transport) # - "driving" # - "walking" # # The example configuration below includes a place for # "John", located at the main train station of munich. # Two kinds of travel (bicycle and transit) are requested, # each with a different label. Furthermore a place for # "Jane" is included, located at the given destination and # with the same kinds of travel. # durations: # - name: John # destination: Hauptbahnhof, München # modes: # - gm_id: transit # title: "Öff." # - gm_id: bicycling # title: "Rad" # - name: Jane # destination: Karlsplatz, München # modes: # - gm_id: transit # title: "Öff." # - gm_id: driving # title: "Auto" # Multiline message (yes, the | is supposed to be there), # to format the message received from the Telegram bot. # # Available placeholders: # - {title}: The title of the expose # - {rooms}: Number of rooms # - {price}: Price for the flat # - {durations}: Durations calculated by GMaps, see above # - {url}: URL to the expose message: | {title} Zimmer: {rooms} Größe: {size} Preis: {price} Ort: {address} {url} # Calculating durations requires access to the Google Maps API. # Below you can configure the URL to access the API, with placeholders. # The URL should most probably just kept like that. # To use the Google Maps API, an API key is required. You can obtain one # without costs from the Google App Console (just google for it). # Additionally, to enable the API calls in the code, set the 'enable' key to True google_maps_api: key: YOUR_API_KEY url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival} enable: False # If you are planning to scrape immoscout24.de, the bot will need # to circumvent the sites captcha protection by using a captcha # solving service. Register at either imagetypers or 2captcha # (the former is prefered), desposit some funds, uncomment the # corresponding lines below and replace your API key/token. # you will also have to install a Chrome Web Driver and write below # the executable path, the driver_arguments can be left as is. # captcha: # imagetypers: # token: alskdjaskldjfklj # 2captcha: # api_key: alskdjaskldjfklj # driver_path: YOUR_CHROME_DRIVER_PATH # driver_arguments: # - "--headless" # You can select whether to be notified by telegram or via a mattermost # webhook. For all notifiers selected here a configuration must be provided # below. # notifiers: # - telegram # - mattermost notifiers: - telegram # - mattermost # Sending messages using Telegram requires a Telegram Bot configured. # Telegram.org offers a good documentation about how to create a bot. # Once you read it, will make sense. Still: bot_token should hold the # access token of your bot and receiver_ids should list the client ids # of receivers. Note that those receivers are required to already have # started a conversation with your bot. # telegram: bot_token: [a set of numbers which I suppose are the bot's ID]:[a token] receiver_ids: - [ my ID ] # Sending messages via mattermost requires a webhook url provided by a # mattermost server. You can find a description how to set up a webhook with # the official mattermost documentation: # https://docs.mattermost.com/developer/webhooks-incoming.html # mattermost: # webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356 # If you are running the web interface, you can configure Login with Telegram support # Follow the instructions here to register your domain with the Telegram bot: # https://core.telegram.org/widgets/login # # website: # bot_name: bot_name_xxx # domain: flathunter.example.com # session_key: SomeSecretValue # listen: # host: 127.0.0.1 # port: 8080 # If you are deploying to google cloud, # uncomment this and set it to your project id. More info in the readme. # google_cloud_project_id: my-flathunters-project-id # For websites like idealista.it, there are anti-crawler measures that can be # circumvented using proxies. # use_proxy_list: True the output is: 18:32:36|crawl_wggesucht.py|ERROR Got response (404): 18:32:36|abstract_crawler.py|ERROR ]: Got response (404): [2022/04/29 18:42:38|abstract_crawler.py|ERROR ]: Got response (404): [2022/04/29 18:42:38|abstract_crawler.py|ERROR ]: Got response (500): the output is everytime the crawled page and i don't receive any message on telegram — Reply to this email directly, view it on GitHub <#168>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAEK5VAY3UXOQNN6XDLZ2TVHQUWFANCNFSM5UW5HWQQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

y-71 closed this as completed Apr 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`HTTP errors` on `abstract_crawler.py` using docker #168

`HTTP errors` on `abstract_crawler.py` using docker #168

y-71 commented Apr 29, 2022

codders commented Apr 30, 2022 via email

HTTP errors on abstract_crawler.py using docker #168

HTTP errors on abstract_crawler.py using docker #168

Comments

y-71 commented Apr 29, 2022

codders commented Apr 30, 2022 via email

`HTTP errors` on `abstract_crawler.py` using docker #168

`HTTP errors` on `abstract_crawler.py` using docker #168