Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add regexp related eslint plugins #202

Merged
merged 11 commits into from
Oct 20, 2023

Conversation

kurtextrem
Copy link
Contributor

@kurtextrem kurtextrem commented Oct 7, 2023

As discussed, this adds a few eslint plugins.

The IP/IPv6 regexes get a bit bigger, because they have a lot of capture groups that were unused, so every capture group now is no longer capturing (via ?:).

By the way, while going over the ruleset of the unicorn plugin, I've found two rules:

  • no-array-reduce - in my experience reduce is indeed slower than doing manual loops. Have you tested replacing the reduce calls already?
  • no-useless-spread - I see some spread calls, e.g. in src/methods/strict. Is immutability needed here? (e.g. what if instead of spreading it, we'd use something like { _schema: schema, _parse: () => (...) }. Other functions then could check e.g. !(key in schema._schema.object))

And additionally, currently it looks to me like regexes are created every time a schema with e.g. ipv4 is validated. Creating regexps is expensive and caching them would be faster. This is also true for serverless, where you're billed for run costs (and not startup costs). So moving the regex creation outside, moves the regex compilation time to the startup time (so regex is compiled only once instead of every invocation).

@netlify
Copy link

netlify bot commented Oct 7, 2023

Deploy Preview for valibot ready!

Name Link
🔨 Latest commit e1969e3
🔍 Latest deploy log https://app.netlify.com/sites/valibot/deploys/6532bb8df6d2af0007602a26
😎 Deploy Preview https://deploy-preview-202--valibot.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@fabian-hiller
Copy link
Owner

Thank you for your contribution!

The IP/IPv6 regexes get a bit bigger, because they have a lot of capture groups that were unused, so every capture group now is no longer capturing (via ?:).

That's fine!

By the way, while going over the ruleset of the unicorn plugin, I've found two rules:

  • no-array-reduce - in my experience reduce is indeed slower than doing manual loops. Have you tested replacing the reduce calls already?

I have not tested it yet, but I am open to this change. It should have little effect on the bundle size due to compression. Do you want to take a look at it or should I do it?

  • no-useless-spread - I see some spread calls, e.g. in src/methods/strict. Is immutability needed here? (e.g. what if instead of spreading it, we'd use something like { _schema: schema, _parse: () => (...) }. Other functions then could check e.g. !(key in schema._schema.object))

The problem is that the methods that currently use spread copy the schema and only modify the ._parse method. Since schemas can have different keys, this must be done dynamically. Modifying the existing schema directly is unfortunately not an option, as this would lead to bugs. Also, _schema: schema is not an alternative as it would break compatibility.

And additionally, currently it looks to me like regexes are created every time a schema with e.g. ipv4 is validated. Creating regexps is expensive and caching them would be faster. This is also true for serverless, where you're billed for run costs (and not startup costs). So moving the regex creation outside, moves the regex compilation time to the startup time (so regex is compiled only once instead of every invocation).

Great catch! I tend to check with a benchmark if your assumption is correct and if so, I would initialize the regex lazy, because then only those are initialized that are actually used.

Copy link
Owner

@fabian-hiller fabian-hiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for you contribution!

library/package.json Outdated Show resolved Hide resolved
library/src/schemas/special/special.test.ts Show resolved Hide resolved
library/src/validations/email/email.ts Outdated Show resolved Hide resolved
@@ -12,8 +12,8 @@ import { getOutput, getPipeIssues, isLuhnAlgo } from '../../utils/index.ts';
*/
export function imei<TInput extends string>(error?: ErrorMessage) {
return (input: TInput): PipeResult<TInput> =>
!/^\d{2}[ |/|-]?\d{6}[ |/|-]?\d{6}[ |/|-]?\d$/.test(input) ||
!isLuhnAlgo(input)
// eslint-disable-next-line security/detect-unsafe-regex -- false positive according to https://devina.io/redos-checker
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this rule is so buggy, we should consider removing it in general or create an issue on GitHub.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. this is one of the longest standing issues of the security plugin: eslint-community/eslint-plugin-security#28

Two alternatives exist:

For my personal config I've opted with sticking with eslint-plugin-security until they've sorted out what to use, but I wasn't aware of eslint-plugin-redos-detector until now. Maybe worth trying out. I'll also mention that one in the thread above

Copy link
Owner

@fabian-hiller fabian-hiller Oct 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as there is no fix, I prefer to disable or replace this linting rule globally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll play with the two plugins and update the PR accordingly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

colinhacks/zod#2849 the zod PR is also interesting for this PR I'd say

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should keep an eye on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to go with redos-detector, since it seems to find more (non-false positives) of redos regexps.

@fabian-hiller fabian-hiller self-assigned this Oct 7, 2023
@fabian-hiller fabian-hiller added enhancement New feature or request priority This has priority labels Oct 7, 2023
@kurtextrem
Copy link
Contributor Author

Do you want to take a look at it or should I do it?

If you got time, feel free to, as you can probably judge better how much impact it has. In any case, I can provide this benchmark: https://www.measurethat.net/Benchmarks/Show/2472/0/arrayreduce-vs-for-loop-vs-arrayforeach, where reduce is much slower. By removing reduce, we also get rid of on closure per execution (e.g. in comparable), so double win.

Also, _schema: schema is not an alternative as it would break compatibility.

Backwards compat you mean?

Great catch! I tend to check with a benchmark if your assumption is correct and if so, I would initialize the regex lazy, because then only those are initialized that are actually used.

Through tree-shaking regexps that aren't used shouldn't execute anyway, or am I missing something?
so e.g. the following should work:

const regex = /.../;

export const isABC = () => () => regex.test(...)

if isABC is used somewhere in the application, regex will compile at startup time.
As alternative, if you're concerned a lot of regexps are in the bundle but aren't directly on the execution path, we could use https://github.com/sindresorhus/lazy-value, but that then adds a small overhead (as instead of reading a variable it becomes a function call). However, based on https://v8.dev/blog/regexp-tier-up, I'd say we should pick the first approach and not overcomplicate.
For global regexps, we need to make sure to reset the .lastIndex property (or use String#replaceAll, which is one of the eslint rules, currently commented out in the PR)

To be sure, we could check the flamegraph of the startup to figure out if the simple approach significantly slows down the startup. Back in 2019, big regexps definitely did slow down: garycourt/uri-js#40. However, the v8 blog above is from half a year later, so I do think there is a chance this is much better nowadays.


Unfortunately the uri-js issue made me realize the current regexp used for checking ipv6 in valibot might not be correct, as per garycourt/uri-js#40 (comment). The fastify team is now also using a different approach to parse ipv6 addresses: https://github.com/fastify/fast-uri/blob/main/lib/utils.js#L27. And they have a fast path for rejecting invalid ipv4's (https://github.com/fastify/fast-uri/blob/main/lib/utils.js#L6).
Example invariant for valibot's regex: fe80::3:bEFf:5b:%3NG (generated via https://marketplace.visualstudio.com/items?itemName=Nixinova.inlay-regex). It validates to true with valibot's ipv6 regex, but not with the one from uri-js.

@fabian-hiller
Copy link
Owner

If you got time, feel free to, as you can probably judge better how much impact it has...

Yes, I will do that. I have created an issue: #203

Also, _schema: schema is not an alternative as it would break compatibility.

Backwards compat you mean?

No, the point is that schemas must be able to be passed to other schemas and methods in a type-safe manner. This requires a consistent approach. Therefore, we cannot simply change the structure. Also, the modification methods should only modify and return a schema instead of producing a new schema or format.

Through tree-shaking regexps that aren't used shouldn't execute anyway, or am I missing something? so e.g. the following should work:

const regex = /.../;

export const isABC = () => () => regex.test(...)

The problem here is that if isABC is used in a schema, but the regex is never used because a user is just clicking through a website, for example, there is still an initial cost for initializing the regex. My approach would be to check if the regex is already initialized when the validation is executed. If not, it will be initialized before it is used.

But as you already mentioned, we would check if all the extra work really improves the performance.

Unfortunately the uri-js issue made me realize the current regexp...

Can you create an issue or PR?

@kurtextrem
Copy link
Contributor Author

Yes, I will do that. I have created an issue: #203

Thank you, I've also created an issue for the ipv6 issue: #206

No, the point is that schemas must be able to be passed to other schemas and methods in a type-safe manner. This requires a consistent approach. Therefore, we cannot simply change the structure. Also, the modification methods should only modify and return a schema instead of producing a new schema or format.

As far as I can tell we can control every schema, so if we were to rework every schema to use _schema instead of spreading, we'd be consistent. I might be missing something though, I'll maybe try experimenting with it and make a draft PR when I get to some results.

The problem here is that if isABC is used in a schema, but the regex is never used because a user is just clicking through a website, for example, there is still an initial cost for initializing the regex. My approach would be to check if the regex is already initialized when the validation is executed. If not, it will be initialized before it is used.

Agreed.
Speaking of, maybe that's also useful for schema creation (lazy evaluation). If parse is never called on the schema, why pay the cost of creating the schema?
I'm not sure how much impact this has in the end though, you can probably tell better how expensive schema creation could turn out.

@fabian-hiller
Copy link
Owner

As far as I can tell we can control every schema, so if we were to rework every schema to use _schema instead of spreading, we'd be consistent. I might be missing something though, I'll maybe try experimenting with it and make a draft PR when I get to some results.

Feel free to take a look at it. For now, I think spread is fine here. It is called only once for initialization. The difference without spread should not be noticeable in the real world, away from specific benchmarks.

Agreed. Speaking of, maybe that's also useful for schema creation (lazy evaluation). If parse is never called on the schema, why pay the cost of creating the schema? I'm not sure how much impact this has in the end though, you can probably tell better how expensive schema creation could turn out.

Either this is not possible due to dynamic arguments of a schema or I don't understand what you mean. Maybe you can explain it in more detail and provide a minimal code example.

@fabian-hiller
Copy link
Owner

To get a faster merge, I can offer you to remove eslint-config-canonical and eslint-plugin-unicorn and disable security/detect-unsafe-regex globally. If eslint-plugin-redos-detector is a replacement for security/detect-unsafe-regex and works, you can add it. For everything else where we are not sure yet like eslint-config-canonical and eslint-plugin-unicorn, you can create another PR.

@kurtextrem
Copy link
Contributor Author

kurtextrem commented Oct 8, 2023

Sounds like a plan! That's fine for me.

kurtextrem and others added 2 commits October 15, 2023 14:27
@kurtextrem kurtextrem changed the title chore: add more eslint plugins chore: add regexp related eslint plugins Oct 15, 2023
@kurtextrem
Copy link
Contributor Author

Either this is not possible due to dynamic arguments of a schema or I don't understand what you mean. Maybe you can explain it in more detail and provide a minimal code example.

I think I was a bit confused, I meant more like the schema itself. Think about React lazy useState.

Example (from readme):

// Create login schema with email and password
const LoginSchema = object({
  email: string([email()]),
  password: string([minLength(8)]),
});

// later on user interaction:
parse(LoginSchema, { email: '', password: '' });

vs. lazy:

// Create login schema with email and password
const LoginSchema = () => object({
  email: string([email()]),
  password: string([minLength(8)]),
});

// later on user interaction
parse(LoginSchema, { email: '', password: '' });

In the 2nd code example, the LoginSchema would only be "generated" once the parse actually needs the schema. So if e.g. a user never submits the form where email & password are validated, their browser never has to create the schema.

@kurtextrem
Copy link
Contributor Author

kurtextrem commented Oct 15, 2023

For the avoid spread topic, I played around a bit and I'm actually confident it would work (but is not backwards compatible). Example strict.ts:

return {
    ...schema,
    _parse(input, info) {
      const result = schema._parse(input, info);
      return !result.issues &&
        Object.keys(input as object).some((key) => !(key in schema.object)) // <-- ℹ️ note the use of `schema` here
        ? getSchemaIssues(
            info,
            'object',
            'strict',
            error || 'Invalid keys',
            input
          )
        : result;
    }
}

The returned object has spread schema into the returned schema, however, the _parse function uses the original schema, which means now two objects live inside the memory heap (original schema and cloned one, next to the newly returned schema).
So when changing it to e.g.:

return {
    _schema: schema,
    _parse(input, info) {
      const result = schema._parse(input, info);
      return !result.issues &&
        Object.keys(input as object).some((key) => !(key in schema.object)) // <-- ℹ️ we're still referencing the argument `schema`
        ? getSchemaIssues(
            info,
            'object',
            'strict',
            error || 'Invalid keys',
            input
          )
        : result;
    }
}

The code keeps working as intended. Only if someone called strict(object(...)).async = true, then it'd no longer work and it should be updated to strict(object(...))._schema.async = true (so this would be semver major, breaking).

That all being said, I think lazy evaluation instead of avoiding spreads is probably more beneficial.

@fabian-hiller
Copy link
Owner

In the 2nd code example, the LoginSchema would only be "generated" once the parse actually needs the schema. So if e.g. a user never submits the form where email & password are validated, their browser never has to create the schema.

Compared to Zod and ArkType, Valibot's initialization is very efficient. So I would not focus on it at the moment. But in the long run, we can look at it again.

The code keeps working as intended. Only if someone called strict(object(...)).async = true, then it'd no longer work and it should be updated to strict(object(...))._schema.async = true (so this would be semver major, breaking).

I would not go down this path at this time. I think copying a schema with the spread syntax is just right in this use case. With your approach the return value is now no longer of the type ObjectSchema. This has the consequence that the schema can no longer be used in other schemas or methods without restrictions.

@fabian-hiller
Copy link
Owner

fabian-hiller commented Oct 15, 2023

Thank you for your contribution. I will try to review and merge this PR next week.

Copy link
Owner

@fabian-hiller fabian-hiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest looks good and ready to merge. Thank you!

*
* @param error The error message.
*
* @returns A validation function.
*/
export function ipv4<TInput extends string>(error?: ErrorMessage) {
return (input: TInput): PipeResult<TInput> =>
!/^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}$/.test(input)
!/^(?:(?:(?=(25[0-5]))\1|(?=(2[0-4]\d))\2|(?=(1\d{2}))\3|(?=(\d{1,2}))\4)\.){3}(?:(?=(25[0-5]))\5|(?=(2[0-4]\d))\6|(?=(1\d{2}))\7|(?=(\d{1,2}))\8)$/u.test(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly was changed in this regex that it is now so much longer and more complicated? Did you do this by hand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, from the zod PR: colinhacks/zod#2849. The previous regex is vulnerable to redos attacks.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked our current regex on devina.io and redosdetector.com and it seems to be safe. Can you check it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://redosdetector.com/?pattern=%5E%28%2825%5B0-5%5D%7C%282%5B0-4%5D%7C1%5Cd%7C%5B1-9%5D%7C%29%5Cd%29%5C.%3F%5Cb%29%7B4%7D%24&caseInsensitive=false&unicode=true

redosdetector is sensitive to / (devina.io isn't), maybe you copy pasted the leading and trailing slash in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After taking a manual look, it appears like the regex is not vulnerable (tested also via Rescue and regexploit) and top voted on stackoverflow.

with capture groups, the regex is much faster: https://esbench.com/bench/6532596e7ff73700a4debb6a than the one from e.g. zod or fastify

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redosdetector says that both, the new and the old regex, are insecure: https://redosdetector.com/?pattern=%5E%28%3F%3A%28%3F%3A25%5B0-5%5D%7C%28%3F%3A2%5B0-4%5D%7C1%5Cd%7C%5B1-9%5D%29%3F%5Cd%29%5C.%3F%5Cb%29%7B4%7D%24&caseInsensitive=false&unicode=false

Therefore, I am unsure how to proceed. Basically, I could merge the PR anyway and we'll look at that in a separate issue or PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, only the one from the zod PR isn't flagged by redosdetector (the one you commented on). As mentioned in my comment above, I'd say let's ignore it for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey!

So the reason it’s flagging it is because it’s hitting the max backtrack limit given there are so many permutations. You can see if you change the number of groups to {2} it’s fine and {3} is if you increase the number of steps with redos-detector check --unicode "^(?:(?:25[0-5]|(?:2[0-4]|1\d|[1-9])?\d)\.?\b){3}$" --maxBacktracks -1 --maxSteps -1. With {4} though it would take too long to finish.

The larger regex that pollyfills atomic groups does technically make a difference. E.g. if the input was 255.2! I think it will backtrack with the following:

  • 255.2 = 25[0-5]\.2
  • 255.2 = 25[0-5]\.[1-9]
  • 255.2 = 25[0-5]\.\d
  • 25 = 2[0-4]
  • 2 = 2
  • 2 = [1-9]
  • 2 = \d

Whereas with the pollyfilled atomic group it would be just:

  • 255.2 = 25[0-5]\.2
  • 255.2 = 25[0-5]\.[1-9]
  • 255.2 = 25[0-5]\.\d

With 255.255.255.25! there would be a bigger difference (although probably not a concerning level).

Also if we look at one part in reality it wouldn’t be that high because it doesn’t group the results into ones that can match the same string. E.g. with ^(1|1|2|2)$ the tool will currently report 2 backtracks even though there can only be 1.

Definitely some room for improvement there. Created tjenkinson/redos-detector#445

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking in!

library/src/validations/ipv6/ipv6.ts Outdated Show resolved Hide resolved
library/src/validations/ulid/ulid.ts Show resolved Hide resolved
library/tsconfig.json Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority This has priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants