Customize regular expression to match URLs #79

Closed
alter2000 opened this issue Jan 28, 2019 · 19 comments

@alter2000

I have a couple of use cases here: one that wants only https?:// URLs, one for everything except mailto: links, and any custom protocol prefix. The easiest way to fit all this in is with a custom regex rather than putting them as options. I haven't looked into the code to see how feasible this is and how it works right now, so it might even be easier to create something else entirely to handle custom regexes.
Or is it already implemented? I didn't find anything in the issues or on Google.

Anyway, thanks for the work. It's really nice.

@firecat53
Owner

All the regex code is in urlscan/urlscan.py lines 261-286. These are the current horrific regexes:

URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'
HTTPURLPATTERN = (r'(?:(https?|file|ftps?)://' + URLINTERNALPATTERN +
                  r'*' + URLTRAILINGPATTERN + r')')
# Used to guess that blah.blah.blah.TLD is a URL.
....
TLDS = load_tlds()
GUESSEDURLPATTERN = (r'(?:[\w\-%]+(?:\.[\w\-%]+)*\.(?:' +
                     '|'.join(TLDS) + ')$)')
URLRE = re.compile(r'(?:<(?:URL:)?)?(' + HTTPURLPATTERN + '|' +
                   GUESSEDURLPATTERN +
                   r'|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]))>?',
                   flags=re.U)

Haven't touched these in quite a while :D I've avoided adding the complexity of
a config file up to now, but I'm not sure a command line option would be
particularly friendly for a regex. Looking at the regexes above, do you have a
sense of what the regex might be to detect the URLs you would be filtering?

@alter2000
Author

Thanks, will check out tomorrow. I guess only the HTTPURLPATTERN and the compile call will have to be modified in the beginning for some of my use cases, but I can make a PR for XDG-compliant config some time next week.
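
As a rough illustration of the "only HTTPURLPATTERN and the compile call" idea, here is a minimal sketch that reuses the two character classes quoted above and restricts the scheme list. The `build_url_re` helper and its `schemes` parameter are hypothetical names, not part of urlscan, and the sketch drops the guessed-TLD and email alternatives entirely.

```python
import re

# Character classes copied from the snippet quoted earlier in the thread.
URLINTERNALPATTERN = r'[{}()@\w/\\\-%?!&.=:;+,#~]'
URLTRAILINGPATTERN = r'[{}(@\w/\-%&=+#]'


def build_url_re(schemes=("https?",)):
    """Hypothetical helper: compile a URL regex limited to the given schemes.

    Unlike the original URLRE, this has no guessed-TLD or email alternative,
    so bare domains and mailto: addresses are never matched.
    """
    scheme_alt = "|".join(schemes)
    pattern = (r'(?:(?:' + scheme_alt + r')://' + URLINTERNALPATTERN +
               r'*' + URLTRAILINGPATTERN + r')')
    return re.compile(pattern, flags=re.U)


HTTP_ONLY = build_url_re()
text = "See https://example.com/a?b=1, mail mailto:me@example.com or ftp://host/file."
print(HTTP_ONLY.findall(text))
```

Passing `("https?", "ftps?")` instead would also pick up the `ftp://` link in the sample text.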

@firecat53
Owner

Oof, well I'm kinda dumb...forgot that I added in a config file for people that use different palettes 🙄 . I don't use it so it slipped my mind! So you could probably add something in there if that makes sense. It's just a json file.

@alter2000
Author

I've looked at urlchooser.py and as far as I can see, I will have to add the logic to add a regex array to the config file. I was thinking about giving the user access to some of the prebuilt regexes somehow. What would you suggest, do I use an array in the JSON file to chop the regex to make it easier to read and understand, or cut it short somewhere else (maybe a separate file with just the regex), since JSON and regular expression storage don't go very well together without a load of backslashes?

Even though I don't think it's worth converting to YAML for just this, we could get by with treating the JSON as YAML somewhat easily.
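
To make the backslash concern concrete, here is a small sketch using only the standard `json` module. It encodes the email part of the regex quoted earlier, and also shows the "array of fragments" idea. The `"regex"` and `"regex_parts"` keys are assumptions for illustration, not keys urlscan is known to read.

```python
import json
import re

# Email alternative taken from the URLRE quoted earlier in the thread.
pattern = r'(mailto:)?[\w\-.]+@[\w\-.]*[\w\-]'

# Stored as a single JSON string, every backslash has to be doubled.
encoded = json.dumps({"regex": pattern})
print(encoded)

# Round-tripping gives the original pattern back unchanged.
decoded = json.loads(encoded)["regex"]
assert decoded == pattern

# Splitting the regex into an array keeps each piece short,
# but does not remove the need for escaping inside each string.
parts_json = json.dumps({"regex_parts": ["(mailto:)?", r"[\w\-.]+@", r"[\w\-.]*[\w\-]"]})
joined = "".join(json.loads(parts_json)["regex_parts"])
assert joined == pattern
re.compile(joined)
```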

@firecat53
Owner

Hmm...it'll be a bit before I can sit down and dig into this, but I have to ask...is this worth the effort for such a restricted use case? Would a slightly modified local version of urlscan installed as urlscan-https be an easier solution?

@alter2000
Author

I just have the free time and enough knowhow to be able to do this. Since I'm going to either use urlscan or urlview anyway, I thought about making it a public fork and eventually merging it.

Simply changing ~8 lines would be much much easier, but I'd rather make it more general (albeit with a chance of new bugs) for all than just changing 2 paragraphs. If you're okay with it, I can work on another config rule for the regex.

@firecat53
Owner

What about just using a separate config file 'customregex.py' that just contains those variables? Then if it exists, you can just reg = importlib.import_module('customregex') and set the variables from the file instead. Seems like it's easier doing that than trying to figure out escaping for either a JSON or ConfigParser config file. We can just put a note in the manpage for advanced usage and add a command line switch to generate the customregex.py file.

Thoughts?
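
A minimal sketch of that loading step, assuming the file lives in some config directory that is temporarily put on `sys.path`. The `load_url_re` function, the `config_dir` argument and the simplified default pattern are all hypothetical; only `importlib.import_module('customregex')` comes from the comment above.

```python
import importlib
import re
import sys

# Simplified stand-in for the real default; not the pattern urlscan ships.
DEFAULT_HTTPURLPATTERN = r'(?:(?:https?|file|ftps?)://\S+)'


def load_url_re(config_dir):
    """Compile HTTPURLPATTERN from config_dir/customregex.py if it exists.

    Falls back to the default when the module is missing or does not
    define HTTPURLPATTERN.
    """
    pattern = DEFAULT_HTTPURLPATTERN
    sys.path.insert(0, config_dir)
    try:
        custom = importlib.import_module('customregex')
        pattern = getattr(custom, 'HTTPURLPATTERN', pattern)
    except ImportError:
        pass  # no customregex.py found: keep the default
    finally:
        sys.path.remove(config_dir)
    return re.compile(pattern, flags=re.U)
```

One trade-off worth noting: importing a Python file runs whatever code it contains, so this trades the escaping problem for trusting the contents of the user's config directory.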

@alter2000
Author

That seems like the best idea. Will get to it this weekend.

@firecat53
Owner

Hold off until you see some commits either on develop or master adding keybindings to the config file. I did some significant refactoring yesterday that hasn't been pushed to Github yet and that might affect what you're working on!

@rslindee

Just to add to this:

I personally don't have much of a use for scanning mailto: links in emails and I often find this clutters things up. I'd love to ignore mailto, either via modifying the regex or via something along the lines of a simple "--nomail" argument.

Thank you again for all your hard work on this!
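
One way a "--nomail" switch could work is to keep the regex as it is and drop matches that hit the named `email` group. The sketch below is hypothetical: the flag is only proposed in this thread, the `url` alternative is a simplified stand-in, and only the `email` group is copied from the URLRE quoted earlier.

```python
import argparse
import re

URLRE = re.compile(
    r'(?P<url>https?://[^\s<>]+)'                       # simplified URL alternative
    r'|(?P<email>(mailto:)?[\w\-.]+@[\w\-.]*[\w\-])',    # email group from the thread
    flags=re.U)


def extract(text, nomail=False):
    """Return all matches, skipping email/mailto hits when nomail is set."""
    found = []
    for m in URLRE.finditer(text):
        if nomail and m.group('email'):
            continue
        found.append(m.group(0))
    return found


parser = argparse.ArgumentParser()
parser.add_argument('--nomail', action='store_true',
                    help='ignore mailto: links and bare email addresses')
args = parser.parse_args(['--nomail'])

text = "Write to mailto:bob@example.org or visit https://example.org/docs"
print(extract(text, nomail=args.nomail))
```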

@rafaeluriarte

rafaeluriarte commented Apr 15, 2019

+1 for an argument to ignore mailto.... Has anyone managed to do it?

@alter2000
Author

I've finally got some free time now, so I'm fleshing out ideas to work on this week.

We could have configuration options in $XDG_CONFIG_HOME/urlscan:

  • inside config.json: PITA to write, edit and manage, would not recommend
  • inside a separate file (named e.g. regex) that overrides the default config
    • should this be a single regex or a Python expression we could evaluate into one?

And/or as a flag:

  • path to a file
  • Python regex string as argument

I don't know which one is the best fit, since for my use case I'm just hardcoding my regex into the file, although it would be useful to others.
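
For the "separate file with a single regex" option, a sketch could look like the following. It assumes the directory follows the XDG Base Directory convention (`$XDG_CONFIG_HOME`, defaulting to `~/.config`) and that the file is named `regex` as suggested above. Both function names are hypothetical.

```python
import os
import re
from pathlib import Path


def regex_file_path(environ=os.environ):
    """Hypothetical location of the override file: <config home>/urlscan/regex."""
    base = environ.get('XDG_CONFIG_HOME') or os.path.join(
        os.path.expanduser('~'), '.config')
    return Path(base) / 'urlscan' / 'regex'


def load_regex(path, default):
    """Compile the regex stored verbatim in path, or default if absent/empty."""
    try:
        raw = Path(path).read_text(encoding='utf-8').strip()
    except FileNotFoundError:
        return re.compile(default)
    return re.compile(raw) if raw else re.compile(default)
```

Because the file holds the pattern verbatim, nothing needs escaping beyond what the regex itself requires.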

@firecat53
Owner

I think my vote is still for doing as I described above:

What about just using a separate config file 'customregex.py' that just contains those variables? Then if it exists, you can just reg = importlib.import_module('customregex') and set the variables from the file instead. Seems like it's easier doing that than trying to figure out escaping for either a JSON or ConfigParser config file. We can just put a note in the manpage for advanced usage and add a command line switch to generate the customregex.py file.

@kylebarbour

kylebarbour commented Nov 20, 2019

I think this issue might be pretty common. I integrate urlscan with mutt, and it gets its most usage with HTML email with embedded links. Picking some recent HTML emails and sending them through urlscan, I wind up with multiple pages of links, many of which are mailto: or href links in HTML tags, sometimes surrounded by multiple pages of CSS code and other similar things that a regex could help with.

@rpolve
Contributor

rpolve commented Dec 10, 2020

And/or as a flag:
* path to a file
* Python regex string as argument

I'd prefer this approach, as it can be generalized/repurposed for the widest range of use cases.

E.g. I have a keybinding for piping the terminal buffer into urlscan, and it interprets stuff like some_archive.zip as a URL, which I obviously don't want.

It would help if I could just --regex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'.

I'll see if I can come up with a functional PR.
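
To show what that pattern does with the terminal-buffer case, here is a small sketch. The `--regex` flag is the one proposed above, not an option confirmed to exist at this point in the thread, and `find_urls` is a hypothetical helper.

```python
import argparse
import re

# The pattern from the comment above, split across two lines for readability.
PATTERN = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
           r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+')


def find_urls(text, pattern):
    """Return every full match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


parser = argparse.ArgumentParser()
parser.add_argument('--regex', default=None,
                    help='Python regex used to find URLs (hypothetical flag)')
args = parser.parse_args(['--regex', PATTERN])

buffer_text = "unpack some_archive.zip then open https://example.com/path"
print(find_urls(buffer_text, args.regex))
```

Since the pattern requires an explicit `http://` or `https://` prefix, `some_archive.zip` is never considered.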

@firecat53
Owner

@rpolve - Perhaps have the regex option available in two places: (1) in the existing config file for global regex changes, and (2) as a command line switch which would override the config file (for special use cases, like processing the terminal buffer vs. general email links).

What do you think? Thanks for your interest!!
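
The precedence described here (command line over config file over built-in default) could be sketched as below. The `"regex"` config key, the `resolve_pattern` function and the default pattern are assumptions for illustration only.

```python
import json
import re

# Simplified stand-in for the built-in default pattern.
DEFAULT_PATTERN = r'https?://\S+'


def resolve_pattern(cli_regex, config_text):
    """Pick the regex source: CLI value first, then config file, then default.

    cli_regex is the value of a hypothetical command line switch (or None);
    config_text is the raw JSON config contents (or None if there is no file).
    """
    config = json.loads(config_text) if config_text else {}
    pattern = cli_regex or config.get('regex') or DEFAULT_PATTERN
    return re.compile(pattern, flags=re.U)


config_text = r'{"regex": "ftps?://\\S+"}'
print(resolve_pattern(None, config_text).pattern)            # config file wins over default
print(resolve_pattern(r'https://\S+', config_text).pattern)  # CLI wins over config file
```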

@rpolve
Contributor

rpolve commented Dec 12, 2020

In the existing config file for global regex changes

Sorry, do you mean the --genconf one? Or something else?

@rpolve
Contributor

rpolve commented Jan 7, 2021

Hi. Did you have any chance to take a look at PR #102?

@firecat53
Owner

Studying for a promotional exam. It'll be a month or so before I sit down to any of my projects. Sorry!
