Question for the maintainer about regular expression matching, answers appreciated #399

Closed
EdronCai opened this issue Nov 24, 2016 · 8 comments

Comments

@EdronCai

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

/**

Given the code above, how do I match the list data? The regular expression does not match.

@code4craft
Owner

Which regex do you mean? There is too much code. If this is a regex problem, please provide:
the text to be matched
the regular expression

@code4craft
Owner

code4craft commented Nov 25, 2016

pve_1092_1/\\?
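Here ? is a quantifier meaning the preceding character occurs zero or one time, so it must be escaped with \ to match a literal question mark.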
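
The escaping point above can be checked with plain `java.util.regex`, independent of WebMagic. The following is a minimal sketch, not code from the issue: the class name `EscapeDemo` and the helper `found` are hypothetical, and the URL is the one that appears in the error log later in this thread.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    /** Hypothetical helper: true if the regex matches anywhere in the text. */
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String url = "http://cs.58.com/zhaozu/pn1/pve_1092_1/"
                + "?PGTID=0d30000d-0019-e4b5-3eff-f37b576ebd78&ClickID=1";

        // Unescaped: "/?" means "an optional slash", so "PGTID" would have to
        // follow "pve_1092_1" or "pve_1092_1/" directly. It does not.
        System.out.println(found("pve_1092_1/?PGTID", url));

        // Escaped: "\\?" in Java source is the regex \? (a literal question mark).
        System.out.println(found("pve_1092_1/\\?PGTID", url));

        // Alternative: let Pattern.quote escape a literal fragment as a whole.
        System.out.println(found(Pattern.quote("pve_1092_1/?PGTID"), url));
    }
}
```

Note the double backslash: the Java string literal `"\\?"` produces the two-character regex `\?`.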

@EdronCai
Author

@code4craft OK, I will try it now and report back shortly.

@EdronCai
Author

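@code4craft Thank you very much, the regex now matches. But how should I look this kind of thing up? Without your help, I could not find this point about regular expressions anywhere in the documentation. I assume this is general regular-expression behavior?
After the regex passed, I ran into the following problem: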
[pool-1-thread-1] ERROR [us.codecraft.webmagic.Spider$1] - process request Request{url='http://cs.58.com/zhaozu/pn1/pve_1092_1/?PGTID=0d30000d-0019-e4b5-3eff-f37b576ebd78&ClickID=1', method='null', extras={statusCode=200}, priority=0} error
java.lang.IllegalArgumentException: invalid regex
at us.codecraft.webmagic.selector.RegexSelector.<init>(RegexSelector.java:39) ~[webmagic-core-0.5.2.jar:na]
at us.codecraft.webmagic.selector.RegexSelector.<init>(RegexSelector.java:45) ~[webmagic-core-0.5.2.jar:na]
at us.codecraft.webmagic.selector.Selectors.regex(Selectors.java:12) ~[webmagic-core-0.5.2.jar:na]
at us.codecraft.webmagic.selector.AbstractSelectable.regex(AbstractSelectable.java:77) ~[webmagic-core-0.5.2.jar:na]
at com.goldencloud.office.project.service.Client58Processor.process(Client58Processor.java:29) ~[classes/:na]
at us.codecraft.webmagic.Spider.processRequest(Spider.java:420) ~[webmagic-core-0.5.2.jar:na]
at us.codecraft.webmagic.Spider$1.run(Spider.java:322) ~[webmagic-core-0.5.2.jar:na]
at us.codecraft.webmagic.selector.thread.CountableThreadPool$1.run(CountableThreadPool.java:74) [webmagic-core-0.5.2.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_112]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_112]
Caused by: java.util.regex.PatternSyntaxException: Unclosed group near index 151
(http://cs\.58\.com/zhaozu/\w+\.shtml?psid=124606730193983807135678330&entinfo=\w+\&iuType=p_0&PGTID=0d30000d-0019-e991-490c-51f69d09bf8f&ClickID=\w+\)
^
at java.util.regex.Pattern.error(Pattern.java:1955) ~[na:1.8.0_112]
at java.util.regex.Pattern.accept(Pattern.java:1813) ~[na:1.8.0_112]
at java.util.regex.Pattern.group0(Pattern.java:2908) ~[na:1.8.0_112]
at java.util.regex.Pattern.sequence(Pattern.java:2051) ~[na:1.8.0_112]
at java.util.regex.Pattern.expr(Pattern.java:1996) ~[na:1.8.0_112]
at java.util.regex.Pattern.compile(Pattern.java:1696) ~[na:1.8.0_112]
at java.util.regex.Pattern.<init>(Pattern.java:1351) ~[na:1.8.0_112]
at java.util.regex.Pattern.compile(Pattern.java:1054) ~[na:1.8.0_112]
at us.codecraft.webmagic.selector.RegexSelector.<init>(RegexSelector.java:37) ~[webmagic-core-0.5.2.jar:na]
... 10 common frames omitted
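
A likely reading of the `Unclosed group` error above: the pattern printed in the trace ends with `\)`, which makes the closing parenthesis a literal character, so the opening `(` never gets its match. The `?` after `.shtml` is also unescaped, the same problem as before. The sketch below reproduces this with plain `java.util.regex`; the class name `UnclosedGroupDemo`, the shortened patterns, and the sample URL are hypothetical stand-ins, not the poster's actual code.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class UnclosedGroupDemo {
    // Shortened stand-in for the failing pattern: note the trailing "\\)".
    static final String BROKEN =
            "(http://cs\\.58\\.com/zhaozu/\\w+\\.shtml?ClickID=\\w+\\)";

    // Closing parenthesis left unescaped, literal question mark escaped.
    static final String FIXED =
            "(http://cs\\.58\\.com/zhaozu/\\w+\\.shtml\\?ClickID=\\w+)";

    /** Returns true if the regex compiles, false on a syntax error. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Report where the compiler gave up, as in the stack trace.
            System.out.println(e.getDescription() + " near index " + e.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles(BROKEN));
        System.out.println(compiles(FIXED));

        // Hypothetical sample URL, used only to show the fixed pattern matching.
        String sample = "http://cs.58.com/zhaozu/12345x.shtml?ClickID=1";
        System.out.println(Pattern.compile(FIXED).matcher(sample).find());
    }
}
```

Compiling the pattern once in a small test like this, before handing it to a selector, surfaces syntax errors earlier than a crawl run does.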

@code4craft
Owner

code4craft commented Nov 25, 2016

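Regular expressions are a general-purpose technique, so the documentation does not cover them; please search for it. I will not reply further on this issue.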

@EdronCai
Author

@code4craft Found it, thanks.

@sutra added this to the WebMagic-0.7.4 milestone on Oct 26, 2020