
feat(933151): update PHP fn to latest version #3228

Closed
wants to merge 12 commits

Conversation

jptosso
Member

@jptosso jptosso commented May 28, 2023

This PR adds a new set of PHP functions based on @fzipi's command `grep -o --no-file -R 'ZEND_FUNCTION(.*)' | cut -f2 -d\( | cut -f1 -d\) | sort > input.txt`, which extracts the list of functions from the PHP source code.

This PR includes instructions and scripts to build this list using the following criteria:

  • We extract the function list from PHP source code
  • We remove all functions that are English dictionary words
  • We search all functions used on Github API
  • We filter only functions used at least 100 times.

Apparently, the previous criterion was to ignore PHP functions that are not unsafe, but this list includes all of them.
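The filtering criteria above can be sketched in a few lines (a toy illustration only: the word set and usage counts below are made-up stand-ins for the real dictionary lookup and the GitHub search results):

```python
# Sketch of the filtering criteria described above.
# english_words stands in for a real dictionary lookup,
# usage_counts for the per-function GitHub search hit counts.
english_words = {"assert", "sort", "copy", "count"}
usage_counts = {
    "array_map": 250000,
    "assert": 90000,        # English word -> dropped
    "str_contains": 180000,
    "zend_test_func": 12,   # used fewer than 100 times -> dropped
}

MIN_USES = 100

def build_list(functions, english_words, usage_counts, min_uses=MIN_USES):
    """Keep function names that are not plain English words and
    appear at least `min_uses` times on GitHub."""
    return sorted(
        fn for fn in functions
        if fn not in english_words
        and usage_counts.get(fn, 0) >= min_uses
    )

result = build_list(usage_counts.keys(), english_words, usage_counts)
# result == ["array_map", "str_contains"]
```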

Fixes #2685.

@fzipi
Member

fzipi commented May 30, 2023

Thanks @jptosso for this one! It was a long-standing list that needed updating.

The approach you took is also reproducible, which makes this a gem. Let's get more people involved.

@theMiddleBlue @airween @dune73 @theseion What do you think about the list splitting?

The idea could also be ported to cover the first ~40 functions used in the top PHP web shells whose code is available.

@fzipi
Member

fzipi commented May 30, 2023

The test is failing because we somehow classify functions like print_r as non-dangerous. Does this make sense?

```yaml
  - test_title: 933151-3
    desc: non-dangerous PHP functions, removed to reduce FP
    stages:
      - stage:
          input:
            data: x=Print_r%28%20%29
            dest_addr: 127.0.0.1
            headers:
              Host: localhost
              User-Agent: "OWASP CRS test agent"
              Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
            method: POST
            port: 80
            uri: /?foo=filemtime%28%24foo%29
          output:
            no_log_contains: id "933151"
```

Review threads (resolved):
util/php-dictionary-gen/gen.sh
util/php-dictionary-gen/filter_dict.py
util/php-dictionary-gen/gen.sh
util/php-dictionary-gen/gen.sh
util/php-dictionary-gen/bot.py
util/php-dictionary-gen/README.md
@theseion
Contributor

Great stuff @jptosso!

What do you think about the list splitting?

I don't understand the question. What would we have after splitting?

Excluding print_r and friends is probably important. It's easy to imagine an application parameter named print_recurring, for example.

@dune73
Member

dune73 commented May 31, 2023

This is very cool. Thank you @jptosso. Also not understanding the splitting question.

But I think some discussion is due on how this greatly improved rule blends in with the existing PHP function-name rules, namely the following matrix:

                  Exists in English language  |  Does not exist in English language
    High risk     933160 (PL2)                |  933150 (PL1)
    Low risk      933161 (PL3)                |  933151 (PL2)

I think you are trying to stay in the right quadrant, but I am not sure the automatic dictionary cuts it. We probably have to take snippets of words into consideration (e.g. eval as in medieval) and also non-dictionary words that regularly appear in texts, like the sqrt that you list.

Probably we need an option to define manual allow- and deny-lists.

But this left aside: Great work. Thank you very much.

@theMiddleBlue
Contributor

awesome contribution @jptosso !!

Agree with @dune73: we should try to keep that list categorized, or at least exclude English words from PL1.

@fzipi
Member

fzipi commented May 31, 2023

@theseion @dune73 Rule 933151 will now also match an additional `(`. If that doesn't help, can someone come up with a justified list of things to remove?

fzipi and others added 4 commits May 31, 2023 08:38
Co-authored-by: Max Leske <th3s3ion@gmail.com>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Contributor

@theseion theseion left a comment


Not a requirement, but I think it would be useful to have a Dockerfile + script to build and run. Not all of us use a *nix environment, and you wouldn't have to explain that `apt install dict` is required.

Review thread (resolved): util/php-dictionary-gen/gen.sh
@theseion
Contributor

theseion commented Jun 1, 2023

@theseion @dune73 Rule 933151 will now also match an additional `(`. If that doesn't help, can someone come up with a justified list of things to remove?

That chained rule could even be improved to not just find any opening parenthesis but match something like <word><optional spaces><opening parenthesis>. We could even do a full second pass if the first pass found something, using @rx and then use a regex that requires word boundaries. That would get rid of partial matches like medieval.
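To illustrate the word-boundary idea (Python `re` here purely as a demonstration; the actual rule would use ModSecurity's `@rx` operator), a pattern like `\beval\b\s*\(` matches real calls but not substrings:

```python
import re

# Hedged sketch: anchor the function name with word boundaries and
# require an opening parenthesis after optional whitespace.
pattern = re.compile(r"\beval\b\s*\(")

samples = ["eval($x)", "eval ($x)", "medieval (times)", "retrieval(x)"]
hits = [s for s in samples if pattern.search(s)]
# hits == ["eval($x)", "eval ($x)"] -- 'medieval' and 'retrieval' no longer match
```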

Another option could be to take out the partial English words like eval and put them into another list used in an additional rule, maybe at PL 3. That would give users more flexibility when tuning.

@dune73
Member

dune73 commented Jun 1, 2023

I like all the proposed solutions to the eval problem, and I think it's useful to avoid a manually maintained list. The simpler the better.

As for the proposed Docker setup to create rules: isn't this something that goes beyond this rule alone? I'd rather not take such a fundamental decision based on an individual rule, or we'll end up with very heterogeneous build instructions.

@theseion
Contributor

theseion commented Jun 3, 2023

As for the proposed Docker setup to create rules: isn't this something that goes beyond this rule alone? I'd rather not take such a fundamental decision based on an individual rule, or we'll end up with very heterogeneous build instructions.

Not necessarily. You can treat a Docker container as part of a script. You let something run inside the container, grab the output and throw the container away. It's like running a script on another host that has capabilities that you don't have on your own machine. I wasn't proposing a complete containerized infrastructure for scripting lists.

@dune73
Member

dune73 commented Jun 5, 2023

Sure, I see how Docker would be isolated, but I still think it's overkill if all you want to avoid is explaining `apt install dict`. And are there really people developing CRS without a working Unix environment?

@fzipi
Member

fzipi commented Jun 5, 2023

My concern is not so much Docker vs. native. The real question is how to avoid manual work unless it is strictly necessary. If things stay manual, the PL1 lists will take another 10 years to update. And we want to avoid that situation.

@dune73
Member

dune73 commented Jun 5, 2023

I think you are generally right, @fzipi. That should be the main concern.

Yet I think a Unix box can be assumed. It's not the `apt install` that would delay updates by several years ...

@fzipi
Member

fzipi commented Jun 5, 2023

I'll be updating the python script, no worries. Let's swarm over the func splitting problem.

Co-authored-by: Max Leske <th3s3ion@gmail.com>
@dune73
Member

dune73 commented Jun 16, 2023

filter_dict.py should probably check for the existence of the `dict` binary instead of hiding all errors.

@dune73
Member

dune73 commented Jun 16, 2023

Getting 401s when running bot.py despite having a working TOKEN. Any clue?

@fzipi
Member

fzipi commented Jun 16, 2023

Will take a look and reply.

@dune73
Member

dune73 commented Jun 16, 2023

OK, I think I got this.

@jptosso proposes 3 scripts to automate 933151. I think we can use the scripts to create all the rules in the 933150/51/60/61 rule group.

New scripts in use:

  • gen.sh - wrapper script that also downloads PHP source code and extracts function names
  • filter_dict.py - script to call dict and decide whether a function name is an English word
  • bot.py - script to query the GitHub API for the frequency of use of a given function name (currently not working for me, but the purpose is clear)

The existing rules take risk and existence in the English language into consideration, all manually. This new approach automates the English-language check and brings frequency of use on GitHub to the table. This is very useful.

I don't think we can let go of the risk dimension completely. But we can limit the risk assessment to the English-language terms that are also PHP function names. That list is relatively small, and new high-risk additions are rare to nonexistent.

With this approach, I think we can catch all 4 PHP function name issues.

Here is a mermaid diagram of what I have in mind:

```mermaid
flowchart TD
    id0((PHP Function \n names from \n PHP source code))
    idlist((Manual list \n of high-risk \n English words))
    id1{Is word an English word?}
    id2{Is it on the manual high-risk list?}
    id3{Is it a frequently used function name?}
    id933160[Rule 933160]
    id933161[Rule 933161]
    id933150[Rule 933150]
    id933151[Rule 933151]
    id0---->id1
    id1--yes-->id2
    id1--no-->id3
    id2--yes-->id933160
    id2--no-->id933161
    id3--yes-->id933150
    id3--no-->id933151
    idlist---->id2
```

@dune73
Member

dune73 commented Jun 21, 2023

Simplifying the setup a bit. New mermaid:

```mermaid
flowchart TD
    idlist((Manual list \n of high-risk \n English words))
    id933160[Rule 933160\nregex]
    idlist------>id933160
    id0((PHP Function \n names from \n PHP source code))
    id1{Is word an English word?}
    id2{Is it a frequently\nused function\nname?}
    id933161[Rule 933161\nregex]
    id933150[Rule 933150\npmFromFile]
    id933151[Rule 933151\npmFromFile]
    id0---->id1
    id1--yes--->id933161
    id1--no-->id2
    id2--yes-->id933150
    id2--no-->id933151
```
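In code, the simplified flow routes every extracted function name to exactly one generated list (933160 stays manually curated). A toy classifier, where the two sets are stand-ins for the spell check and the GitHub frequency check:

```python
# Stand-ins for the real checks: spell.sh (is it an English word?)
# and bot.py (is it frequently used on GitHub?).
ENGLISH_WORDS = {"assert", "sort", "copy"}
FREQUENT = {"array_map", "str_contains"}

def target_rule(fn: str) -> str:
    """Route a PHP function name per the simplified flowchart."""
    if fn in ENGLISH_WORDS:
        return "933161"   # English word -> regex rule
    if fn in FREQUENT:
        return "933150"   # frequent non-English name -> PL1 pmFromFile
    return "933151"       # rare non-English name -> PL2 pmFromFile
```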

@dune73
Member

dune73 commented Jun 21, 2023

@fzipi and @M4tteoP : I still can't get the token thing to work properly. Script fails for me. Does this work for you? If I can get it working, I'll be happy to come up with a cleaner base script and Matteo can then polish it into a new PR. Also looking at 933160 by hand. The script should be able to output 933161, 933150 and 933151 sources.

@fzipi
Member

fzipi commented Jun 21, 2023

Let me try it...

Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
@fzipi
Member

fzipi commented Jun 21, 2023

@dune73 Ok, cleaned up a bit.

  • removed the `dict` dependency; it is now pure Python.
  • added a Pipfile for dependencies and isolation.
  • updated the README.

It works properly for me with the token; hopefully it is cleaner now.

@theseion
Contributor

PR for spell.sh update is here: #3242.

@dune73
Member

dune73 commented Jun 21, 2023

OK. I got it. There was a problem with my API token. Solved now.

Thank you for the cleanup.

Script is running as we speak, but I think the filenames are botched. gen.sh `cd`s into the PHP source and then tries to run the filter and bot scripts as local scripts. I created symlinks, so it works, but is it only me?

@dune73
Member

dune73 commented Jun 22, 2023

For the record: The script checking for occurrences of php functions across github takes 16+ hours here. I guess that's the cost of doing business. Correct?

@fzipi
Member

fzipi commented Jun 22, 2023

For the record: The script checking for occurrences of PHP functions across GitHub takes 16+ hours here. I guess that's the cost of doing business. Correct?

Yes. This is because of the rate limit on GitHub. We can only do about 1 query per second, I think; there is a sleep in the code for that reason. The total time in minutes would then be roughly (# of words to check) / 60.

@dune73
Member

dune73 commented Jun 22, 2023

Got you. Thanks. Makes sense.

I'm currently reimplementing bot.py. I'm aiming to save the results into a local file with a timestamp. That way the script can run daily, and the update via GitHub calls only happens after the cache times out.
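A minimal sketch of that caching idea (the file name, cache format, and one-week timeout are assumptions for illustration, not the actual bot.py):

```python
import json
import os
import time

CACHE_FILE = "usage_cache.json"   # hypothetical local cache file
MAX_AGE = 7 * 24 * 3600           # re-query GitHub after one week

def load_cache(path=CACHE_FILE):
    """Read the cached counts, or start fresh if none exist."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def lookup(fn, cache, fetch, now=None):
    """Return the cached count for `fn`; call `fetch` (the GitHub
    query) only when the entry is missing or older than MAX_AGE."""
    now = time.time() if now is None else now
    entry = cache.get(fn)
    if entry is not None and now - entry["ts"] < MAX_AGE:
        return entry["count"]
    count = fetch(fn)
    cache[fn] = {"count": count, "ts": now}
    return count
```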

@dune73
Member

dune73 commented Jun 23, 2023

Existing 933160 is described as follows:

# - Rule 933160: ~220 words which are common in PHP code, but have a higher chance to cause
#               false positives in natural language or other contexts.
#               Examples: 'chr', 'eval'.
#               To mitigate false positives, a regexp looks for PHP function syntax, e.g. 'eval()'.
#               Regexp is generated from function names in util/regexp-assemble/data/933160.data

But this is what it actually looks like:

array_diff_uassoc
array_diff_ukey
array_filter
array_intersect_uassoc
array_intersect_ukey
array_map
array_reduce
array_udiff
array_udiff_assoc
array_udiff_uassoc
array_uintersect
array_uintersect_assoc
array_uintersect_uassoc
assert
assert_options
base64_encode
bson_decode
bson_encode
bzopen
chr
convert_uuencode
...

Maybe this is only me, but I do not see how array_intersect_uassoc has a higher chance to cause false positives in natural language or other contexts unless the other context is talking about PHP code.

Here is existing 933161

abs
acos
acosh
array
arsort
asin
asinh
asort
assert
atan
atan2
atanh
basename
bindec
...

I'm now a bit stuck, but I think we should move to the new recipe outlined above. Running the new spell.sh (the WordNet-based spell.sh) against the PHP functions is meant to give us 933161. Here is that list:

abs
asin
assert
compact
constant
copy
cos
cosh
count
crypt
current
date
decoct
define
defined
end
exec
explode
extract
file
flock
floor
flush
glob
hash
header
implode
key
link
log
mail
max
min
name
next
pack
pass
pi
pow
rand
range
rename
reset
round
serialize
shuffle
sin
sleep
sort
system
tan
time
touch
trim
unpack
virtual

This is substantially smaller than the existing 933160, but also substantially smaller than existing 933161. I like how the automatic list does not come with atan2, but it also misses basename.

What do we do?

@theseion
Contributor

I agree. We might have to start maintaining a list of words we want to treat as common. These words will be hard to detect automatically since they are domain specific but they would lead to false positives. Examples: basename, unset, syslog, symlink. We could then check against that list as well in the spell script.

@dune73
Member

dune73 commented Jun 26, 2023

Ah, what you propose is making the spell.sh check two-fold: first against the English language via WordNet (or aspell, ispell, whatever), and second against a custom list of widespread terms that are not official English but are used casually and are also keywords in different domains (so not only PHP, but Bash, JS, whatever).

I would suggest incorporating this into spell.sh and enabling it by default (with accompanying command-line options where you can choose to get only the first or the second check as described above).
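That two-fold check might look roughly like this (the inline word sets are placeholders for WordNet and the proposed custom list; the flag names are invented to mirror the suggested command-line options):

```python
# Placeholder word sets: the real spell.sh would consult WordNet
# (or aspell/ispell) and a maintained file of casual cross-domain terms.
ENGLISH = {"sort", "copy", "assert", "file", "count"}
CASUAL_TERMS = {"basename", "unset", "syslog", "symlink"}

def is_common_word(word, english_only=False, casual_only=False):
    """Two-fold check: English dictionary first, custom list second.
    The keyword flags mimic the proposed command-line options."""
    if english_only:
        return word in ENGLISH
    if casual_only:
        return word in CASUAL_TERMS
    return word in ENGLISH or word in CASUAL_TERMS
```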

@theseion
Contributor

Yes, precisely.

@dune73
Member

dune73 commented Aug 31, 2023

Thank you very much for launching the discussion with this PR @jptosso. I'm closing this in favor of #3273.

@dune73 dune73 closed this Aug 31, 2023
Successfully merging this pull request may close these issues.

update file php-function-names-933151.data
5 participants