
Crawly does not return any data. #62

Closed
ghost opened this issue Mar 17, 2020 · 21 comments
@ghost

ghost commented Mar 17, 2020

Hello!

I copied the example repo and tried to run it on my system. What am I doing wrong?

iex(1)> Crawly.Engine.start_spider(Esl)

16:46:27.399 [info]  Starting the manager for Elixir.Esl
 
16:46:27.409 [debug] Starting requests storage worker for Elixir.Esl...
 
16:46:27.514 [debug] Started 4 workers for Elixir.Esl
:ok
iex(2)> 
16:47:27.515 [info]  Current crawl speed is: 0 items/min
 
16:47:27.515 [info]  Stopping Esl, itemcount timeout achieved
@oltarasenko
Collaborator

@Unumus sorry, which repo are you talking about?

@oltarasenko
Collaborator

It might happen due to connectivity issues, for example.

@Ziinc
Collaborator

Ziinc commented Mar 18, 2020

Could be related to httpoison not being updated.

#14

@ghost
Author

ghost commented Mar 18, 2020

@Ziinc @oltarasenko Thanks for your replies!

Yes, the quick start guide specified an older version of Crawly.

I updated it to the latest one. Now I have a different issue.

iex(1)> Crawly.Engine.start_spider(Spider.Esl)

13:18:17.777 [debug] Starting the manager for Elixir.Spider.Esl
 
13:18:17.779 [debug] Starting requests storage worker for Elixir.Spider.Esl...
 
13:18:18.151 [debug] Started 4 workers for Elixir.Spider.Esl
:ok
iex(2)> 
13:18:18.787 [debug] Could not parse item, error: :error, reason: :undef, stacktrace: [{Floki, :find, ["<!DOCTYPE html>\n\n<!--[if IE 9 ]>    <html class=\"ie ie9 no-js\" lang=\"en\"> <![endif]-->\n..." <> ..., "a.more"], []}, {Spider.Esl, :parse_item, 1, [file: 'lib/spider/example_test.ex', line: 14]}, {Crawly.Worker, :parse_item, 1, [file: 'lib/crawly/worker.ex', line: 112]}, {:epipe, :run, 2, [file: '/Users/mycomputer/Documents/Projects/Playgraound/crawler_esl/deps/epipe/src/epipe.erl', line: 23]}, {Crawly.Worker, :handle_info, 2, [file: 'lib/crawly/worker.ex', line: 43]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 637]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 711]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]
          
 
13:18:18.788 [debug] Crawly worker could not process the request to "https://www.erlang-solutions.com/blog.html"
                  reason: :undef
 
13:19:18.152 [info]  Current crawl speed is: 0 items/min
 
13:19:18.152 [info]  Stopping Spider.Esl, itemcount timeout achieved

The spider code.

defmodule Spider.Esl do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end

@oltarasenko
Collaborator

Sorry @Unumus, it's my bug. I need to update the documentation. Please add Floki to the dependencies section of your mix.exs file:

  defp deps do
    [
      {:crawly, "~> 0.8.0"},
      {:floki, "~> 0.20.0"}
    ]
  end
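After adding the dependency, run mix deps.get and restart your iex session (iex -S mix) so that Floki is compiled and loaded; otherwise the :undef error shown above will persist.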


@ghost
Author

ghost commented Mar 18, 2020

I added Floki to my deps. I am still trying to run the example code you present in the guides; I am on EslSpider right now.

iex(1)> Crawly.Engine.start_spider(EslSpider)       

15:07:27.474 [debug] Starting the manager for Elixir.EslSpider
 
15:07:27.476 [debug] Starting requests storage worker for Elixir.EslSpider...
 
15:07:27.484 [debug] Started 8 workers for Elixir.EslSpider
:ok
iex(2)> 
15:07:29.372 [info]  Dropping item: %{title: "", url: "https://www.erlang-solutions.com/blog.html"}. Reason: missing required fields
 
15:07:29.719 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
 
15:07:29.993 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
 
15:07:30.004 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
 
15:07:30.017 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
 
.....
 
15:08:27.485 [info]  Current crawl speed is: 148 items/min
 
15:09:27.486 [info]  Current crawl speed is: 0 items/min
 
15:09:27.486 [info]  Stopping EslSpider, itemcount timeout achieved

The data was not stored, and the /tmp folder was not created.

The config file.

use Mix.Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
   ],
   port: 4001

@oltarasenko
Collaborator

Oh, @Unumus, could you please point to an existing folder in {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}?

@Ziinc
Collaborator

Ziinc commented Mar 18, 2020

If it is Windows, specifying folder: "/tmp" would result in an error, if I'm not wrong, since that path doesn't exist on Windows. Without the :folder option, it should determine the temp folder path based on the system.

@Unumus the behaviour (on Unix) is such that it will create a file called /tmp/MySpiderName.jl
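For reference, a minimal config sketch of that variant, assuming the fallback behaviour described above and keeping the other options from the config posted earlier in this thread:

use Mix.Config

config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    # No :folder option: per the comment above, the pipeline should fall back
    # to the system temp directory (e.g. /tmp on Unix) and write EslSpider.jl there.
    {Crawly.Pipelines.WriteToFile, extension: "jl"}
  ]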

@ghost
Author

ghost commented Mar 18, 2020

I pointed the config at the existing folder /test and created the file EslSpider.jl there. No errors occur. However, the EslSpider.jl file is empty. I am on macOS 10.15.3.

The last lines of output:


16:07:57.450 [debug] Dropping request: https://www.erlang-solutions.com/blog.html, as it's already processed
 
16:08:47.266 [info]  Current crawl speed is: 148 items/min
 
16:09:47.267 [info]  Current crawl speed is: 0 items/min
 
16:09:47.267 [info]  Stopping EslSpider, itemcount timeout achieved

@ghost
Author

ghost commented Mar 18, 2020

Sorry about the last post, I did not restart my iex after the changes in the file were saved.

When I point the config at an existing folder, I see the following errors:

....

16:17:38.670 [error] Pipeline crash: Elixir.Crawly.Pipelines.WriteToFile, error: :error, reason: {:badmatch, {:error, :enoent}}, args: [extension: "jl", folder: "/test"]

16:18:26.447 [info]  Current crawl speed is: 148 items/min
 
16:19:26.448 [info]  Current crawl speed is: 0 items/min
 
16:19:26.448 [info]  Stopping EslSpider, itemcount timeout achieved

@oltarasenko
Collaborator

Hello, sorry to double-check: do you have the /test folder on your machine? Usually that's not the case on Unix-type systems. Another question is permissions.

Could you please do the following in your shell (bash, zsh): ls -la /test

@oltarasenko
Collaborator

Another question. What happens if you do:

File.open("/test/spider.jl", [:binary, :write, :utf8])

in your iex?
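For reference, the two outcomes to expect from that check (standard File.open/2 return values, not Crawly-specific; the PID value is illustrative):

iex> File.open("/tmp/spider.jl", [:binary, :write, :utf8])
{:ok, #PID<0.123.0>}   # the directory exists and is writable

iex> File.open("/test/spider.jl", [:binary, :write, :utf8])
{:error, :enoent}      # the /test directory does not exist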

@ghost
Author

ghost commented Mar 18, 2020

Hello, sorry to double-check: do you have the /test folder on your machine? Usually that's not the case on Unix-type systems. Another question is permissions.

Could you please do the following in your shell (bash, zsh): ls -la /test

mycomputer@MacBook-Pro-Mycomputer the_crawler % ls -la /test
ls: /test: No such file or directory

However, I can see the folder in my IDE:
[Screenshot 2020-03-18 at 16:53:03]

Another question. What happens if you do:

File.open("/test/spider.jl", [:binary, :write, :utf8])

in your iex?

[Screenshot 2020-03-18 at 16:49:49]

@oltarasenko
Collaborator

Ok, now it's clear. The folder you're seeing in your IDE is not /test at the filesystem root but the test folder inside your project.

The folder in {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} should be an absolute path.

Otherwise, what do you see when calling: cat /tmp/EslSpider.jl
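A hedged sketch of the corrected pipelines section, with a comment stating an assumption about how the two paths resolve, based on the discussion above:

config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    # "/tmp" is absolute and exists on macOS/Linux; "/test" points at the
    # filesystem root, not at the test folder visible inside the project.
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]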

@ghost
Author

ghost commented Mar 18, 2020

@oltarasenko Oh, I did not expect this! It works now. Thanks for the support!

@ghost ghost closed this as completed Mar 18, 2020
@Ziinc
Collaborator

Ziinc commented Mar 19, 2020

Sorry @Unumus, it's my bug. I need to update the documentation. Please add Floki to the dependencies section of your mix.exs file:

  defp deps do
    [
      {:crawly, "~> 0.8.0"},
      {:floki, "~> 0.20.0"}
    ]
  end

Need to update docs to:

  • update quick start version
  • add floki/meeseeks as dependency

@Ziinc
Collaborator

Ziinc commented Mar 24, 2020

Docs have been updated. Pending patch release.

@Ziinc Ziinc closed this as completed Mar 24, 2020
@sbpipb

sbpipb commented Jun 27, 2023

Having the same issue even though the path to the tmp directory exists.


iex(2)>
08:41:26.926 [error] Pipeline crash: Elixir.Crawly.Pipelines.WriteToFile, error: :error, reason: :undef, args: [folder: "/home/sbpipb/projects/price_spider/tmp", extension: "jl"]


import Config

config :crawly,
    middlewares: [
        {Crawly.Middlewares.UserAgent, user_agents: [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        ]}
        {Crawly.Pipelines.WriteToFile,
        folder: "/home/sbpipb/projects/price_spider/tmp",
        extension: "jl"}
    ]



08:42:26.921 [info] Current crawl speed is: 1 items/min

08:43:26.922 [info] Current crawl speed is: 0 items/min


iex(6)>     File.open("/home/sbpipb/projects/price_spider/tmp/test.jl", [:binary, :write, :utf8])

{:ok, #PID<0.490.0>}
iex(7)>     File.open("/home/sbpipb/projects/price_spider/tmp/test.jl", [:binary, :write, :utf8])
{:ok, #PID<0.492.0>}
iex(8)>

@oltarasenko
Collaborator

@sbpipb It's better to open new issues for cases like this.

Your problem seems to be related to an incorrect configuration in your config file. E.g. you have listed the Crawly.Pipelines.WriteToFile item pipeline in the list of middlewares. Please move it to the item pipelines instead and test whether that helps.

Check paragraph 4 of the quickstart https://hexdocs.pm/crawly/readme.html#quickstart for details.
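A hedged sketch of the corrected config for the snippet above: the UserAgent middleware stays under :middlewares, while WriteToFile moves to :pipelines together with a JSON encoder, following the quickstart layout linked above:

import Config

config :crawly,
  middlewares: [
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
     ]}
  ],
  pipelines: [
    # Encode items to JSON before writing them to disk, as in the quickstart.
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile,
     folder: "/home/sbpipb/projects/price_spider/tmp",
     extension: "jl"}
  ]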

@sbpipb

sbpipb commented Jun 27, 2023

@oltarasenko thanks for the help! I opted to upgrade my dependencies to match the quickstart guide.

I ran the mix gen config command and now it is working for me. Thanks again for the swift replies!
