Suggestion: built-in Downloader support for POST requests #108

Closed
usenrong opened this issue Apr 23, 2014 · 7 comments

@usenrong

Hi Huang, I've recently been using webmagic in a project and ran into many pages that require POST requests while crawling. I extended webmagic's HttpClientDownloader as follows:

 protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers) {
        RequestBuilder requestBuilder = null;
        if (request.getExtra("isPost") != null) { // POST request
            requestBuilder = RequestBuilder.post().setUri(request.getUrl());
            NameValuePair[] nameValuePair = (NameValuePair[]) request.getExtra("nameValuePair");
            // Guard against a missing extra to avoid a NullPointerException
            if (nameValuePair != null && nameValuePair.length > 0) {
                requestBuilder.addParameters(nameValuePair);
            }
        } else { // GET request
            requestBuilder =  RequestBuilder.get().setUri(request.getUrl());
        }

        if (headers != null) {
            for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
                requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
            }
        }
        RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
                .setConnectionRequestTimeout(site.getTimeOut())
                .setSocketTimeout(site.getTimeOut())
                .setConnectTimeout(site.getTimeOut())
                .setCookieSpec(CookieSpecs.BEST_MATCH);
        if (site != null && site.getHttpProxy() != null) {
            requestConfigBuilder.setProxy(site.getHttpProxy());
        }
        requestBuilder.setConfig(requestConfigBuilder.build());
        return requestBuilder.build();
    }

It might be more elegant to make isPost and the request parameter array properties of Request.

@code4craft
Owner

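After thinking it over, I've decided to add basic support for HTTP methods. Request gets a method property that supports six methods: GET, POST, HEAD, PUT, DELETE, and TRACE. A new method in HttpClientDownloader selects the request builder based on this property.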

protected RequestBuilder selectRequestMethod(String method) {
    if (method == null || method.equalsIgnoreCase(HttpConstant.Method.GET)) {
        //default get
        return RequestBuilder.get();
    } else if (method.equalsIgnoreCase(HttpConstant.Method.POST)) {
        return RequestBuilder.post();
    } else if (method.equalsIgnoreCase(HttpConstant.Method.HEAD)) {
        return RequestBuilder.head();
    } else if (method.equalsIgnoreCase(HttpConstant.Method.PUT)) {
        return RequestBuilder.put();
    } else if (method.equalsIgnoreCase(HttpConstant.Method.DELETE)) {
        return RequestBuilder.delete();
    } else if (method.equalsIgnoreCase(HttpConstant.Method.TRACE)) {
        return RequestBuilder.trace();
    }
    throw new IllegalArgumentException("Illegal HTTP Method " + method);
}

Use request.setMethod(String method) to set it, and prefer the constants defined in us.codecraft.webmagic.constant.HttpConstant.Method.

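Some notes:

  1. For simplicity, deduplication still considers only the URL: two Requests with the same URL are deduplicated even if their methods differ. You can, however, extend LocalDuplicatedRemovedScheduler to implement your own deduplication logic.
  2. The method is a String rather than an enum because a Request may be passed to remote processes, and a String avoids complicating serialization and deserialization.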
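
To make note 1 concrete, here is a minimal sketch of method-aware deduplication. It is not webmagic code: the class `MethodAwareDeduplicator` and its methods are hypothetical names, and it only illustrates the idea of keying on the method plus the URL. How this would be wired into a custom scheduler is left open.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper, not part of webmagic: remembers (method, URL) pairs
 * so that a GET and a POST to the same URL are treated as distinct requests.
 */
public class MethodAwareDeduplicator {

    private final Set<String> seen = new HashSet<>();

    /** Builds the dedup key; a null method is treated as GET, the default. */
    public static String keyOf(String method, String url) {
        String normalized = (method == null) ? "GET" : method.toUpperCase(Locale.ROOT);
        return normalized + " " + url;
    }

    /** Returns true if this (method, URL) pair was seen before; otherwise records it. */
    public synchronized boolean isDuplicate(String method, String url) {
        return !seen.add(keyOf(method, url));
    }

    public static void main(String[] args) {
        MethodAwareDeduplicator dedup = new MethodAwareDeduplicator();
        System.out.println(dedup.isDuplicate("GET", "http://www.demo.com"));  // false
        System.out.println(dedup.isDuplicate("POST", "http://www.demo.com")); // false
        System.out.println(dedup.isDuplicate("post", "http://www.demo.com")); // true
    }
}
```

The method is upper-cased before comparison, which mirrors the case-insensitive matching used when the downloader selects the request method.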

@code4craft
Owner

As for the POST parameters, they are still obtained via request.getExtra("nameValuePair").

protected RequestBuilder selectRequestMethod(Request request) {
        String method = request.getMethod();
        if (method == null || method.equalsIgnoreCase(HttpConstant.Method.GET)) {
            //default get
            return RequestBuilder.get();
        } else if (method.equalsIgnoreCase(HttpConstant.Method.POST)) {
            RequestBuilder requestBuilder = RequestBuilder.post();
            NameValuePair[] nameValuePair = (NameValuePair[]) request.getExtra("nameValuePair");
            if (nameValuePair != null && nameValuePair.length > 0) {
                requestBuilder.addParameters(nameValuePair);
            }
            return requestBuilder;
        } else if (method.equalsIgnoreCase(HttpConstant.Method.HEAD)) {
            return RequestBuilder.head();
        } else if (method.equalsIgnoreCase(HttpConstant.Method.PUT)) {
            return RequestBuilder.put();
        } else if (method.equalsIgnoreCase(HttpConstant.Method.DELETE)) {
            return RequestBuilder.delete();
        } else if (method.equalsIgnoreCase(HttpConstant.Method.TRACE)) {
            return RequestBuilder.trace();
        }
        throw new IllegalArgumentException("Illegal HTTP Method " + method);
    }

@vuuihc

vuuihc commented Apr 20, 2015

Huang, is there an example that actually submits a POST request? I've seen the implementation above, but I don't know how to use it.

@yes-github

Webmagic POST request example
————————————————————————————————————————————
PageProcessor pageProcessor = new DemoPageProcessor();
Spider spider = Spider.create(pageProcessor);
Request request = new Request("http://www.demo.com");
Map<String, Object> nameValuePair = new HashMap<String, Object>();
NameValuePair[] values = new NameValuePair[1];
values[0] = new BasicNameValuePair("_version", "2.2.1");
nameValuePair.put("nameValuePair", values);
request.setExtras(nameValuePair);
request.setMethod(HttpConstant.Method.POST);
spider.addRequest(request);
spider.run();

@wuqunfei

Could this be simpler, for example by passing a HashMap directly? The current approach is awkward to use.
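
One possible shape for such a convenience, sketched under assumptions: the class `FormParams` and its `toPairs` method are hypothetical and not part of webmagic, and `Map.Entry` stands in for HttpClient's `NameValuePair` so the sketch runs with the JDK alone. In real code each entry would presumably become a `BasicNameValuePair`, as in the example above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of webmagic: turns a plain map into name/value pairs. */
public class FormParams {

    /**
     * Converts a map into a list of name/value pairs in the map's iteration order.
     * Entries with a null key or null value are skipped; a null map yields an empty list.
     */
    public static List<Map.Entry<String, String>> toPairs(Map<String, String> params) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        if (params == null) {
            return pairs;
        }
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getKey() != null && e.getValue() != null) {
                // Stand-in for: new BasicNameValuePair(e.getKey(), e.getValue())
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the pairs come out in the order they were added
        Map<String, String> params = new LinkedHashMap<>();
        params.put("_version", "2.2.1");
        for (Map.Entry<String, String> pair : toPairs(params)) {
            System.out.println(pair.getKey() + "=" + pair.getValue());
        }
    }
}
```

With a helper like this, the caller would only build a map, and the conversion to the array stored under the "nameValuePair" extra could happen in one place.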

@l4dfire

l4dfire commented Feb 29, 2016

This doesn't seem to work for crawling: passing parameters gives the same result as not passing them.

@SuperChrisliu

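"For simplicity, deduplication still considers only the URL: two Requests with the same URL are deduplicated even if their methods differ." RESTful APIs depend on the request method, so ignoring the method like this doesn't seem right.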

@code4craft code4craft reopened this Dec 3, 2016