
Support for non-HTTP CONNECT proxies #140

Closed · callumgare opened this issue May 17, 2024 · 4 comments
Labels: t-tooling (Issues with this label are in the ownership of the tooling team.)

callumgare commented May 17, 2024
I'd like to use got-scraping with @loopback/http-caching-proxy (I have some tests that I want to be able to re-run many times without putting extra load on, or getting blocked by, the upstream servers I'm hitting). However, @loopback/http-caching-proxy only supports the style of proxying where the full URL is included in the HTTP request line (e.g. GET https://www.example.com/index.html HTTP/1.1) rather than the CONNECT method (e.g. CONNECT www.example.com:443 HTTP/1.1). Presumably this is because a CONNECT-style proxy simply forwards encrypted HTTPS traffic between the client and the server, and so has nothing it can cache. However, got-scraping doesn't seem to support the non-CONNECT proxy style, meaning there's no easy way to use a caching proxy server with it.
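To make the difference concrete, here's a minimal sketch (separate from the reproduction below) of what an absolute-form proxy request looks like on the wire, using node:net; the proxy host and port are placeholders:

import net from "node:net";

// Absolute-form proxying: the full URL goes in the request line, so the proxy
// can see, and therefore cache, the exchange. CONNECT-style proxying instead
// opens an opaque tunnel (CONNECT www.example.com:443) the proxy can't inspect.
const socket = net.connect(3000, "localhost", () => {
  socket.write(
    "GET http://www.example.com/index.html HTTP/1.1\r\n" +
    "Host: www.example.com\r\n" +
    "Connection: close\r\n\r\n",
  );
});
socket.on("data", (chunk) => process.stdout.write(chunk));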

Example:

# Install deps
npm install got-scraping @loopback/http-caching-proxy

# Start proxy
env DEBUG=loopback:http-caching-proxy node -e 'const {HttpCachingProxy} = require("@loopback/http-caching-proxy"); const proxy = new HttpCachingProxy({cachePath: "/tmp/.proxy-cache", port: 3000}); proxy.start().then(() => console.log(`Proxy started at ${proxy.url}`))'

# Send request
node -e 'import("got-scraping").then(({gotScraping}) => { gotScraping.get({url: "https://apify.com", proxyUrl: "http://localhost:3000"}).then(({ body }) => console.log(body)) })'

The last command prints this:

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

RequestError: Proxy responded with 501 Not Implemented: 0 bytes

(501 is the response @loopback/http-caching-proxy gives when it receives a CONNECT request)

So this ticket is a request to support non-CONNECT proxies. Thanks for your consideration and for such a handy library!

@B4nan added the t-tooling label May 20, 2024
barjin (Contributor) commented May 22, 2024

Hello and thank you for your interest in this project!

got-scraping actually already supports this type of proxy with the HttpRegularProxyAgent (implementation here).

This agent is only used for an http:// proxy with an http:// target. Something like this will work:

gotScraping
    .get({
        url: 'http://jindrich.bar/', // https:// will fail
        proxyUrl: 'http://127.0.0.1:3000/',
    })

This is because, without CONNECT support on the proxy side, got-scraping cannot create a secure TLS connection to the target server through the proxy: the proxy acts as a MITM and can read all the content sent from the server. This would lead to a TLS error - or, worse, could be exploited by attackers if we worked around the error (e.g. by suppressing it).
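As a rough illustration of the kind of error involved (a sketch, assuming a plain-HTTP proxy listening on port 3000; an HTTP server that receives a TLS ClientHello typically replies with plain-text HTTP, which the client then reports as a TLS protocol error):

import tls from "node:tls";

// Attempting a TLS handshake directly against a plain-HTTP proxy fails,
// because the proxy answers with an HTTP response instead of a ServerHello.
const socket = tls.connect({ host: "localhost", port: 3000, servername: "example.com" });
socket.on("error", (err) => console.error(err.message)); // e.g. "wrong version number"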

I see you've already commented on an issue in the loopback repo - you all have good points there. If I can add my two cents: I don't think it makes much sense to run a caching proxy locally, exactly because of TLS. Perhaps you can mock the network requests in your tests with nock... or something similar?

callumgare (Author) commented May 23, 2024

> got-scraping actually already supports this type of proxy with the HttpRegularProxyAgent (implementation here).

Good to know! Thanks.

> you all have good points there. If I can add my two cents: I don't think it makes much sense to run a caching proxy locally, exactly because of TLS

I can see how those could definitely be issues in a lot of situations, but unless I'm misunderstanding things, I think local caching remains a strong use case, and not one where TLS errors are likely or MITM attacks are much of a concern.

Requesting HTTPS resources via a target-based proxy means that, as far as got is concerned, it's making an HTTP request, and as far as the end server is concerned, it's receiving an HTTPS connection. I don't think there's anything to tip the end server off that this is anything other than a standard HTTPS connection (again, if I'm understanding things correctly?). If you're accessing sites that are heavily client-side driven, then there could well be issues with client-side JavaScript freaking out because a resource it expects to be accessed via HTTPS is instead being accessed via HTTP. But in that case you're probably more likely to be using Playwright or Puppeteer rather than got. So I don't think someone using got-scraping for this is likely to run into any TLS issues (but once again I could be misunderstanding something here).

MITM attacks are of course a serious concern in general. However, I'm not sure that's much of a risk in this situation. The only major reason I can think of, when using got-scraping, for wanting to access HTTPS sites over an HTTP target-based proxy rather than the much more standard (for HTTPS) CONNECT method is so the content can be transparently cached. And if you're setting up a caching proxy as part of your app (with something like @loopback/http-caching-proxy), then the first HTTP hop from got to the caching proxy never leaves your machine, and so provides little extra opportunity for a MITM attack (unless an attacker has access to your machine, in which case you've probably got bigger problems). Of course you do need to trust your proxy, which is likely written by a third party, but that's not much more risk than trusting got-scraping or any of the deps that got-scraping uses.

I can certainly see that supporting HTTPS requests via an HTTP target-based proxy would give someone an extra way to shoot themselves in the foot if they used it in a silly way, like sending traffic through an unencrypted proxy over the internet. But since, if this feature were implemented, I imagine it would be something you'd have to explicitly opt into, I think a warning in the same documentation that describes how to enable it should provide reasonable mitigation.

To make a stronger case for why I think such a feature is helpful, I'll flesh out the use case I'm envisioning. got-scraping is obviously intended for scraping websites, and of course web-scraping code is often pretty brittle, since the website can be updated at any time with no warning. So you might regularly need to update your scraping code, and when you do, you may need to make many requests to ensure things are working right. In an ideal world you'd even have a tool like Vitest running in watch mode, re-running an extensive test suite on every code change so you get immediate feedback on whether you've fixed something or not. That's the setup I have in the app I'm using got-scraping for.

However, that can mean repeatedly making the same request even though you expect the result to be unchanged. Depending on the situation that might not be an issue at all, but if you're dealing with tight API limits it certainly can be.

> Perhaps you can mock the network requests in your tests with nock... or something similar?

Nock looks very cool! I hadn't come across it before, so thanks for introducing me. I don't see a pre-existing way to use it for automatically caching responses, but it doesn't seem like it would be too hard to write something similar to @loopback/http-caching-proxy on top of nock. I've actually come up with another workaround (since I really wanted transparent caching of requests, and landed on this soon after creating this issue). It's ugly, and I still think it would be good to have support at the got-scraping level, but it works! (At least for my situation.)

I've added a small bit of code, just before a request is actually passed to got-scraping, that rewrites a request like https://www.example.com/index.html to http://localhost:<port of first proxy>/index.html and adds the header x-real-origin: https://www.example.com:

// Rewrite the URL to point at the local proxy, preserving the path and the
// query string, and stash the real origin in a custom header so the proxy
// can restore it later.
if (cachingProxyPort) {
  const targetUrl = new URL(url)
  url = `http://localhost:${cachingProxyPort}${targetUrl.pathname}${targetUrl.search}`
  additionalHeaders["x-real-origin"] = targetUrl.origin
}
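For example (values illustrative; 3000 stands in for the first proxy's port), the rewrite turns:

// Before: url = "https://www.example.com/index.html?page=2"
// After:  url = "http://localhost:3000/index.html?page=2"
//         additionalHeaders["x-real-origin"] = "https://www.example.com"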

got-scraping will then diligently send the request to the first proxy. That proxy reformats it into the absolute-URL form expected by an HTTP target-based proxy (since that's what the caching proxy requires):

import http from "http";
// The most popular Node HTTP proxy library (the one this is a refactor of) has
// been essentially unmaintained for the past 4 years and has some serious bugs.
import { ProxyServer } from '@refactorjs/http-proxy'

const firstProxy = new ProxyServer();
const firstProxyServer = http.createServer(async function (req, res) {
  const realOriginHeaderName = 'x-real-origin'
  // The header value could be a string or an array of strings, so unwrap it.
  const realOrigin = [req.headers[realOriginHeaderName]].flat()[0]
  if (!realOrigin) {
    throw new Error(`Received request without ${realOriginHeaderName} header`)
  }
  // Restore the original Host header and turn the path back into an absolute
  // URL, which is the form a target-based proxy expects.
  req.headers.host = new URL(realOrigin).hostname
  req.url = realOrigin + req.url
  firstProxy.web(req, res, { target: secondProxyServerAddress, toProxy: true, prependPath: false });
}).listen(0);

const firstProxyServerAddress = firstProxyServer.address()

Finally, requests sent out by the first proxy are received by the second (caching) proxy:

import { HttpCachingProxy } from '@loopback/http-caching-proxy';
import path from 'path';
import { fileURLToPath } from 'url';

// __dirname doesn't exist in ES modules, so derive it from import.meta.url.
const __dirname = path.dirname(fileURLToPath(import.meta.url));

const secondProxyServer = new HttpCachingProxy({
  cachePath: path.resolve(__dirname, '.proxy-cache'),
  ttl: 24 * 60 * 60 * 1000, // 1 day
});
await secondProxyServer.start();
const secondProxyServerAddress = secondProxyServer.url
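For completeness, a rough sketch of wiring it all together (assuming the snippets above share one ES module, with the caching proxy defined before the first proxy so secondProxyServerAddress is in scope by the time requests arrive):

import { gotScraping } from "got-scraping";

// The rewrite step targets the *first* proxy, which forwards to the caching one.
const targetUrl = new URL("https://apify.com/");
const { body } = await gotScraping.get({
  url: `http://localhost:${firstProxyServerAddress.port}${targetUrl.pathname}`,
  headers: { "x-real-origin": targetUrl.origin },
});
console.log(body.length);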

Again, ugly! So I'd love to see support built into got-scraping, but if anyone else is interested in this use case, hopefully this helps as a workaround. So far I haven't run into any TLS issues with the sites I'm using, but your mileage may vary.

barjin (Contributor) commented May 24, 2024

I see what you are trying to achieve - but keep in mind that if a regular user requests a URL with the https:// protocol, they will (rightfully) assume it goes over an HTTP-over-TLS connection to the target server - no matter what proxy is on the way.

I understand that most people using got-scraping probably know what they are doing, but we're using this package in many more user-friendly packages (namely Crawlee), where it's somewhat hidden and people might get confused.

We might look into it in the future if we see more push for this feature. For now, I'll have to icebox this, as it seems very niche and potentially problematic. If you find the time, feel free to shoot us a PR - I'm pretty sure it's gonna be as easy as repurposing the HttpRegularProxyAgent somehow - or even including your "intermediate" proxy server itself in got-scraping. If you decide to go this way, make sure this feature is hidden behind some descriptive-enough option, though.

callumgare (Author) commented

Makes sense! Thanks for your thoughtful and considered response ❤️ If I end up getting some time to put together a PR I'll reopen this, but in the meantime I'll close it to reduce noise on your end. Thanks again!
