Support for HTTP proxies #26

Closed
nfreear opened this issue Dec 12, 2017 · 19 comments

nfreear commented Dec 12, 2017

Hi @duzun,

Great work on the library. Are you open to adding support for HTTP proxies to it?

I've written a test PHP Gist which demos what's involved (currently without proxy authentication).

Alternatively, is there a way to use hQuery::fromFile() with stream_context_create()?

I'll investigate, and update this ticket ;).

Thanks,

Nick

Author

nfreear commented Dec 12, 2017

To answer my own question, the way to support an HTTP proxy is:

$http_context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'user_agent' => 'MyAgent/1.0 +url',
        'proxy' => 'my-proxy.example.com:80',
        'header' => [],
]]);

$htmldoc = hQuery::fromFile( $scrape_url, false, $http_context );

See: GitHub: nfreear/school-closure .. bin/school-closure.php

Do you want to add an FAQ / Question label to this ticket?

Thanks,

Nick

@duzun duzun added the question label Dec 12, 2017
Owner

duzun commented Dec 12, 2017

Hi @nfreear

Thanks for the info!

Probably I should add more info to README to make things clear for new users.

Owner

duzun commented Dec 21, 2017

I’ve updated the README to clarify how to use ‘hQuery::fromFile’ for remote requests.

I still won’t be happy until we have HTTPlug support!
PRs are welcome

@eugeniojauregui

Hello!
Thank you for this great library.
Is there any way to use a proxy server that requires a username and password?
Thank you very much
Regards!

Owner

duzun commented Jun 26, 2018

Try this:

$auth = base64_encode('LOGIN:PASSWORD');
$http_context = stream_context_create([
    'http' => [
        'method' => 'GET',
        'user_agent' => 'MyAgent/1.0 +url',
        'proxy' => 'my-proxy.example.com:80',
        'request_fulluri' => true,
        'header' =>  "Proxy-Authorization: Basic $auth",
]]);

$htmldoc = hQuery::fromFile( $scrape_url, false, $http_context );


gmmedia commented Sep 16, 2018

How can I use a proxy with fromURL?

thanks
Jochen

Author

nfreear commented Sep 17, 2018

Hi Jochen,

I think at the moment you can't use a proxy with fromURL. The techniques documented above make fromFile behave like fromURL, but with more control over the HTTP connection, e.g. use of a proxy.
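
As a minimal sketch, the two snippets above could be wrapped in one small helper. Note that proxy_context_options() is a hypothetical name used only for illustration; it is not part of hQuery, and the proxy host, login and password are placeholders.

```php
<?php
// Hypothetical helper (NOT part of hQuery): builds the 'http' stream-context
// options shown earlier in this thread for a given proxy.
function proxy_context_options(string $proxy, ?string $login = null, ?string $password = null): array
{
    $http = [
        'method'          => 'GET',
        'user_agent'      => 'MyAgent/1.0 +url',
        'proxy'           => $proxy,   // e.g. 'my-proxy.example.com:80'
        'request_fulluri' => true,     // as in the authenticated example above
        'header'          => [],
    ];

    // Add Basic proxy authentication only when a login is given.
    if ($login !== null) {
        $auth = base64_encode($login . ':' . (string) $password);
        $http['header'][] = "Proxy-Authorization: Basic $auth";
    }

    return ['http' => $http];
}

$options      = proxy_context_options('my-proxy.example.com:80', 'LOGIN', 'PASSWORD');
$http_context = stream_context_create($options);

// Then, as above:
// $htmldoc = hQuery::fromFile($scrape_url, false, $http_context);
```

Calling the helper without a login gives the unauthenticated variant from my earlier comment.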

What is your use-case?

Ta,

Nick


gmmedia commented Sep 17, 2018

Hey Nick,
I use the scraper to compare prices.

If I use fromFile, I get blocked (429 TOO MANY REQUESTS) after my very first request.

fromURL works very well; I just need my script to sleep after 100 requests. I have 10 proxies, so with them I could avoid sleeping altogether.

Jochen

Owner

duzun commented Sep 18, 2018

The main focus of this library is parsing of big HTML documents, not fetching them.
The fetching methods are there just for convenience.
There are excellent libraries for making HTTP requests out there, with lots of options (proxy, cache, retry, follow redirects etc).
Take for example Guzzle, which is PSR-7 compliant and has the ability to make HTTP requests through a proxy.
Another interesting option is HTTPlug, which is kind of a lego: plug any number of plugins to a PSR-7 client for as many features as you want.

After you've managed to set up the HTTP request with any PSR-7 compliant library, you can either feed the Response instance directly to hQuery::fromHTML($response, $request->getUri()), or just fetch the HTML as a string and feed that to hQuery::fromHTML($html, $url).

Here is a theoretical example (I did not test it!):

composer require php-http/guzzle6-adapter php-http/message php-http/discovery

use duzun\hQuery;

use Http\Discovery\MessageFactoryDiscovery;
use Http\Adapter\Guzzle6\Client as GuzzleAdapter;

$config = [
    'timeout' => 7,
    'proxy' => [
        'http'  => 'tcp://localhost:8125', // Use this proxy with "http"
        'https' => 'tcp://localhost:9124', // Use this proxy with "https"
        'no' => ['.mit.edu', 'foo.com']    // Don't use a proxy with these
    ]
];

$client = GuzzleAdapter::createWithConfig($config);
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET', 
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());

Though, what you are trying to do sounds a little too aggressive. Bots should be nice and not overwhelm servers with too many requests. Try to be nice and don't forget to add a From: your@email.com header to the requests!


gmmedia commented Sep 20, 2018

Hey duzun,
thank you, that helped me very much.

But that line:
$doc = hQuery::fromHTML($response, $request->getUri());
produces the following error:
PHP Catchable fatal error: Object of class GuzzleHttp\Psr7\Response could not be converted to string in /mnt/c/Users/joe/Projekte/bot/vendor/duzun/hquery/src/hQuery/HTML_Parser.php on line 124

Can you help me out again?

thanks
Jochen

Owner

duzun commented Sep 21, 2018

Hey @gmmedia,
I'm glad I could be of help.

The error you are getting is weird, because GuzzleHttp\Psr7\Response implements Psr\Http\Message\ResponseInterface, which has a ->getBody() method, so hQuery::fromHTML() should be able to extract the response body from that instance.

I've added an example of using guzzle6-adapter with hQuery, and it works with no issue.

You could either figure out what the issue is, or just switch to the second option: extract the body from the response object as a string and use that with hQuery::fromHTML().

It would be nice to know any details of your issue though.
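
For reference, here is a minimal sketch of that second option. The helper name response_html() is hypothetical and not part of hQuery, and the anonymous class is only a stand-in for a real PSR-7 response so the snippet runs without Guzzle installed. With a real response the cast (string) $response->getBody() is all that is needed.

```php
<?php
// Hypothetical helper: accepts a PSR-7 style response (anything with getBody())
// or a plain string, and returns the HTML as a string.
function response_html($response): string
{
    if (is_object($response) && method_exists($response, 'getBody')) {
        return (string) $response->getBody(); // PSR-7 streams implement __toString()
    }
    return (string) $response;
}

// Stand-in for a real response object (assumption, for illustration only).
$fakeResponse = new class {
    public function getBody()
    {
        return new class {
            public function __toString(): string
            {
                return '<html><body><a href="/x">link</a></body></html>';
            }
        };
    }
};

$html = response_html($fakeResponse);

// Then feed the string to the parser:
// $doc = hQuery::fromHTML($html, 'http://example.com/someDoc.html');
```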


gmmedia commented Sep 25, 2018

Hello @duzun!
I would be lost without your examples. I am a "code user" not a real programmer :)

That error occurred because I did not include hQuery with this line: require_once __DIR__ . '/../hquery.php';
I thought that was done by composer's autoload.php. Why is that not the case?

It is working now, but the new code does not follow a 302 redirect.
I found that Guzzle follows redirects by default. How can I make your guzzle6-adapter.php follow redirects?

I also miss the caching function now. How can I implement it again?

thanks
Jochen

Owner

duzun commented Sep 25, 2018

hQuery should be auto-loaded by composer, if you have installed it with composer, of course (composer require duzun/hquery).
Try composer dump-autoload, then it should work without explicitly including hquery.php.

guzzle6-adapter is a HTTPlug client, which means all other features besides raw HTTP requests have to be implemented through plugins.

If you don't like this approach (HTTPlug), just use any other library for fetching and caching HTML documents. Maybe Guzzle directly.

hQuery::fromURL() supports only simple filesystem caching and has no proxy support so far. I wish it had proxy support, but I have no motivation to implement it.
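
If you fetch with another library, simple filesystem caching can be approximated in a few lines. The following is only a rough sketch under my own assumptions, not hQuery's actual caching code: cached_fetch(), the cache directory and the TTL are all hypothetical, and the dummy fetcher stands in for whatever HTTP client you use.

```php
<?php
// Hypothetical helper: returns the cached body for $url if the cache file is
// younger than $ttl seconds, otherwise calls $fetch($url) and stores the result.
function cached_fetch(string $url, callable $fetch, string $cacheDir, int $ttl = 3600): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $file = $cacheDir . '/' . sha1($url) . '.html';

    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return file_get_contents($file);      // cache hit
    }

    $html = $fetch($url);                     // cache miss: do the real request
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}

// Dummy fetcher so the sketch runs offline; replace with a real HTTP call.
$calls = 0;
$fetch = function (string $url) use (&$calls): string {
    $calls++;
    return "<html><body>fetched $url</body></html>";
};

$cacheDir = sys_get_temp_dir() . '/hquery-cache-demo-' . uniqid();
$first    = cached_fetch('http://example.com/someDoc.html', $fetch, $cacheDir);
$second   = cached_fetch('http://example.com/someDoc.html', $fetch, $cacheDir);

// $doc = hQuery::fromHTML($second, 'http://example.com/someDoc.html');
```

The second call is served from the cache file, so the fetcher runs only once.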


gmmedia commented Sep 25, 2018

Do you have any idea why the redirects are not working?

Owner

duzun commented Sep 25, 2018

Yes, I've mentioned it above:

guzzle6-adapter is a HTTPlug client, which means all other features besides raw HTTP requests have to be implemented through plugins.

You have to use RedirectPlugin with guzzle6-adapter.

Here is an alternative example, using just guzzle client (no HTTPlug): guzzle example.


gmmedia commented Sep 26, 2018

I think the guzzle client is the easier version for me. Thank you, I will try that now.

Why do you not load hquery from your vendor directory with autoload.php?
You always use require_once __DIR__ . '/../hquery.php'; in your examples.

Owner

duzun commented Sep 26, 2018

Why do you not load hquery from your vendor directory with autoload.php?
You always use require_once __DIR__ . '/../hquery.php'; in your examples.

hQuery is not a dependency of the examples/ "project": it is not required through composer, and thus is not available through autoload.php in examples/. I could instead require ../vendor/autoload.php from the parent folder, but I don't see that as an improvement.

Also, hQuery could be added to a project without composer too. Examples demonstrate just that - include hquery.php and you are ready to go.


gmmedia commented Sep 26, 2018

I am asking because all the other classes work fine for me with composer and autoload.
Only hquery is not working. Maybe because composer is loading version 2.1.0 and I load version 2.2.0 from github?

Owner

duzun commented Sep 27, 2018

Maybe because composer is loading version 2.1.0 and I load version 2.2.0 from github?

You are right, I just forgot to add the v2.2.0 tag to the repo (facepalm).

Try composer update; it should fetch the latest version that is on GitHub.

Thank you for catching this issue! :-)
