
Feature/benchmark #116

Closed

Conversation

filipevarjao
Contributor

@filipevarjao filipevarjao commented Jul 2, 2020

This PR adds the ability to run a local benchmark of web scraping tool performance. The metrics collected are the number of requests and items per minute, memory usage, and the number of reductions on the spider process.

It starts a dummy local HTTP server and generates several URLs in order to perform concurrent requests per domain.
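The per-process metrics mentioned above (memory and reductions) are standard `Process.info/2` items, so a minimal sketch of how they could be sampled might look like this (the module and function names are illustrative, not part of this PR):

```elixir
defmodule BenchMetrics do
  @doc """
  Returns the memory footprint (bytes) and reduction count of a process.
  Both `:memory` and `:reductions` are standard `Process.info/2` keys.
  """
  def sample(spider_pid) when is_pid(spider_pid) do
    [{:memory, memory}, {:reductions, reductions}] =
      Process.info(spider_pid, [:memory, :reductions])

    %{memory: memory, reductions: reductions}
  end
end

# Usage: sample the calling process itself.
BenchMetrics.sample(self())
```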

@@ -0,0 +1,39 @@
defmodule BenchTest do
Collaborator


Why do we need this test?

Contributor Author


To keep the test coverage up to 80%, and to show that it is possible to run the benchmark with a different spider.

@@ -0,0 +1,48 @@
defmodule Features.Manager.TestSpider do
Collaborator


I like the idea of splitting this code as you're suggesting. But could we address it in a separate PR?

@@ -1,6 +1,8 @@
defmodule ManagerTest do
Collaborator


Maybe this one is unrelated to bench as well.

@filipevarjao filipevarjao force-pushed the feature/benchmark branch 3 times, most recently from c1d32df to d98043c Compare July 6, 2020 19:22
Logger.info("Adding 10 workers for #{name}")

Enum.map(1..10, fn _x ->
DynamicSupervisor.start_child(name, {Crawly.Worker, [name]})
Collaborator


I would prefer to have this inside Crawly, doing something like Crawly.Manager.add_worker(spider_name).
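The suggested helper could be sketched as follows. This is hypothetical, it is the reviewer's proposed API, not code that exists in Crawly, and it assumes the spider name is also the name under which its DynamicSupervisor is registered:

```elixir
defmodule Crawly.Manager do
  @doc """
  Hypothetical helper suggested in the review: starts one extra worker
  under the spider's DynamicSupervisor.
  """
  def add_worker(spider_name) do
    DynamicSupervisor.start_child(spider_name, {Crawly.Worker, [spider_name]})
  end
end
```

With such a helper, the benchmark loop above would become `Enum.map(1..10, fn _ -> Crawly.Manager.add_worker(name) end)`.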


{:stored_requests, req_count} = Crawly.RequestsStorage.stats(name)

{_, pid, :worker, _} =
Collaborator

@oltarasenko oltarasenko Jul 7, 2020


I would want to extend the API to get a specific manager, using the following semantics:
Crawly.Engine.get_manager(spider_name)

Supervisor.which_children(Map.get(spiders, name))
|> Enum.find(&match?({Crawly.Manager, _, :worker, [Crawly.Manager]}, &1))
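The `get_manager/1` API the reviewer proposes could wrap exactly that lookup. A hypothetical sketch (assuming `spiders` is the engine's map of spider names to supervisor pids, e.g. held in GenServer state):

```elixir
# Hypothetical Crawly.Engine.get_manager/1, per the review suggestion.
# Returns the manager pid, or nil if the spider is not running.
def get_manager(spiders, spider_name) do
  spiders
  |> Map.get(spider_name)
  |> Supervisor.which_children()
  |> Enum.find_value(fn
    {Crawly.Manager, pid, :worker, [Crawly.Manager]} -> pid
    _other -> nil
  end)
end
```

Using `Enum.find_value/2` with a multi-clause function returns the pid directly and avoids raising a `MatchError` on non-manager children.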

{:info, info} = GenServer.call(pid, :collect_metrics)
Collaborator


This could be an API of Crawly.Manager as well. Let's extend it in a separate PR, please.
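If the metrics collection were moved into the manager as suggested, the `GenServer.call(pid, :collect_metrics)` above would be backed by a callback along these lines (a sketch of a possible follow-up, not code from this PR):

```elixir
# Hypothetical handle_call in Crawly.Manager for a follow-up PR:
# replies with the manager process's own memory and reduction counters.
def handle_call(:collect_metrics, _from, state) do
  [{:memory, memory}, {:reductions, reductions}] =
    Process.info(self(), [:memory, :reductions])

  {:reply, {:info, %{memory: memory, reductions: reductions}}, state}
end
```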

@@ -0,0 +1,31 @@
defmodule Crawly.Bench.BenchRouter do
Collaborator


I wanted to check whether it's possible (or whether we should do it) to abstract this into a separate node, e.g. having an .exs file that is started separately from Crawly. That way it could potentially run as a standalone process, avoiding any possible collision.
