
feat(933151): update PHP fn to latest version #3228

Closed
wants to merge 12 commits

Conversation

jptosso
Member

@jptosso jptosso commented May 28, 2023

This PR adds a new set of PHP functions based on @fzipi's command `grep -o --no-file -R 'ZEND_FUNCTION(.*)' | cut -f2 -d\( | cut -f1 -d\) | sort > input.txt`, which extracts the list of functions from the PHP source code.

This PR includes instructions and scripts to build this list using the following criteria:

  • We extract the function list from PHP source code
  • We remove all functions that are English dictionary words
  • We search all functions used on Github API
  • We filter only functions used at least 100 times.

Apparently, the previous criterion was to ignore PHP functions that are not unsafe, but this list includes all of them.
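The filtering criteria above can be sketched in a few lines (a toy illustration only: the word set and usage counts below are made-up stand-ins for the real dictionary lookup and the GitHub search results):

```python
# Sketch of the filtering criteria described above.
# english_words stands in for a real dictionary lookup,
# usage_counts for the per-function GitHub search hit counts.
english_words = {"assert", "sort", "copy", "count"}
usage_counts = {
    "array_map": 250000,
    "assert": 90000,        # English word -> dropped
    "str_contains": 180000,
    "zend_test_func": 12,   # used fewer than 100 times -> dropped
}

MIN_USES = 100

def build_list(functions, english_words, usage_counts, min_uses=MIN_USES):
    """Keep function names that are not plain English words and
    appear at least `min_uses` times on GitHub."""
    return sorted(
        fn for fn in functions
        if fn not in english_words
        and usage_counts.get(fn, 0) >= min_uses
    )

result = build_list(usage_counts.keys(), english_words, usage_counts)
# result == ["array_map", "str_contains"]
```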

Fixes #2685.

@fzipi
Member

fzipi commented May 30, 2023

Thanks @jptosso for this one! It was a long-standing list that needed updating.

The approach you took is also reproducible, which makes this a gem. Let's get more people involved.

@theMiddleBlue @airween @dune73 @theseion What do you think about the list splitting?

The idea could also be ported to cover the first ~40 functions used in the top PHP web shells whose code is available.

@fzipi
Member

fzipi commented May 30, 2023

The test is failing because we somehow classify functions like print_r as non-dangerous. Does this make sense?

```yaml
  - test_title: 933151-3
    desc: non-dangerous PHP functions, removed to reduce FP
    stages:
      - stage:
          input:
            data: x=Print_r%28%20%29
            dest_addr: 127.0.0.1
            headers:
              Host: localhost
              User-Agent: "OWASP CRS test agent"
              Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
            method: POST
            port: 80
            uri: /?foo=filemtime%28%24foo%29
          output:
            no_log_contains: id "933151"
```

Review threads (resolved):
util/php-dictionary-gen/gen.sh
util/php-dictionary-gen/filter_dict.py
util/php-dictionary-gen/gen.sh
util/php-dictionary-gen/gen.sh
util/php-dictionary-gen/bot.py
util/php-dictionary-gen/README.md
@theseion
Contributor

Great stuff @jptosso!

What do you think about the list splitting?

I don't understand the question. What would we have after splitting?

Excluding print_r and friends is probably important. It's easy to imagine an application parameter named print_recurring, for example.

@dune73
Member

dune73 commented May 31, 2023

This is very cool. Thank you @jptosso. Also not understanding the splitting question.

But I think some discussion is due on how this greatly improved rule blends in with the existing PHP function-name rules, namely the following matrix:

                  Exists in English language  |  Does not exist in English language
    High risk     933160 (PL2)                |  933150 (PL1)
    Low risk      933161 (PL3)                |  933151 (PL2)

I think you are trying to stay in the right quadrant, but I am not sure the automatic dictionary cuts it. We probably have to take snippets of words into consideration (e.g. eval as in medieval) and also non-dictionary words that regularly appear in texts, like the sqrt that you list.

Probably we need an option to define manual allow- and deny-lists.

But this left aside: Great work. Thank you very much.

@theMiddleBlue
Contributor

awesome contribution @jptosso !!

Agree with @dune73: we should try to keep that list categorized, or at least exclude English words from PL1.

@fzipi
Member

fzipi commented May 31, 2023

@theseion @dune73 Rule 933151 will now also match an additional `(`. If that doesn't help, can someone come up with a justified list of things to remove?

fzipi and others added 4 commits May 31, 2023 08:38
Co-authored-by: Max Leske <th3s3ion@gmail.com>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Contributor

@theseion theseion left a comment


Not a requirement, but I think it would be useful to have a Dockerfile + script to build and run. Not all of us use a *nix environment, and you wouldn't have to explain that `apt install dict` is required.

Review thread (resolved): util/php-dictionary-gen/gen.sh
@theseion
Contributor

theseion commented Jun 1, 2023

@theseion @dune73 Rule 933151 will now also match an additional `(`. If that doesn't help, can someone come up with a justified list of things to remove?

That chained rule could even be improved to not just find any opening parenthesis but match something like <word><optional spaces><opening parenthesis>. We could even do a full second pass if the first pass found something, using @rx and then use a regex that requires word boundaries. That would get rid of partial matches like medieval.
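To illustrate the word-boundary idea (Python `re` here purely as a demonstration; the actual rule would use ModSecurity's `@rx` operator), a pattern like `\beval\b\s*\(` matches real calls but not substrings:

```python
import re

# Hedged sketch: anchor the function name with word boundaries and
# require an opening parenthesis after optional whitespace.
pattern = re.compile(r"\beval\b\s*\(")

samples = ["eval($x)", "eval ($x)", "medieval (times)", "retrieval(x)"]
hits = [s for s in samples if pattern.search(s)]
# hits == ["eval($x)", "eval ($x)"] -- 'medieval' and 'retrieval' no longer match
```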

Another option could be to take out the partial English words like eval and put them into another list used in an additional rule, maybe at PL 3. That would give users more flexibility when tuning.

@dune73
Member

dune73 commented Jun 1, 2023

I like all the proposed solutions to the eval problem, and I think it's useful to avoid a manually maintained list. The simpler the better.

As for the proposed Docker setup to create rules: isn't this something that goes beyond this rule alone? I'd rather not take such a fundamental decision based on an individual rule, or we'll end up with very heterogeneous build instructions.

@theseion
Contributor

theseion commented Jun 3, 2023

As for the proposed Docker setup to create rules: isn't this something that goes beyond this rule alone? I'd rather not take such a fundamental decision based on an individual rule, or we'll end up with very heterogeneous build instructions.

Not necessarily. You can treat a Docker container as part of a script. You let something run inside the container, grab the output and throw the container away. It's like running a script on another host that has capabilities that you don't have on your own machine. I wasn't proposing a complete containerized infrastructure for scripting lists.

@dune73
Member

dune73 commented Jun 5, 2023

Sure, I see how Docker would be isolated, but I still think it's overkill if all you want to avoid is explaining `apt install dict`. And are there really people developing CRS without a working Unix environment?

@fzipi
Member

fzipi commented Jun 5, 2023

My concern is not so much Docker vs. native. The real question is how to avoid manual work unless it is strictly necessary. If things stay manual, the PL1 lists will take another 10 years to update. And we want to avoid that situation.

@dune73
Member

dune73 commented Jun 5, 2023

I think you are generally right, @fzipi. That should be the main concern.

Yet I think a Unix box can be assumed. It's not the `apt install` that would delay updates by several years ...

@fzipi
Member

fzipi commented Jun 5, 2023

I'll be updating the python script, no worries. Let's swarm over the func splitting problem.

Co-authored-by: Max Leske <th3s3ion@gmail.com>
@dune73
Member

dune73 commented Jun 16, 2023

filter_dict.py should probably check for the existence of the `dict` binary instead of hiding all errors.

@dune73
Member

dune73 commented Jun 16, 2023

Getting 401s when running bot.py despite having a working TOKEN. Any clue?

@fzipi
Member

fzipi commented Jun 16, 2023

Will take a look and reply.

@dune73
Member

dune73 commented Jun 16, 2023

OK, I think I got this.

@jptosso proposes 3 scripts to automate 933151. I think we can use the scripts to create all the rules in the 933150/51/60/61 rule group.

New scripts in use:

  • gen.sh - wrapper script that also downloads PHP source code and extracts function names
  • filter_dict.py - script to call dict and decide whether a function name is an English word
  • bot.py - script to query the GitHub API for the frequency of use of a given function name (currently not working for me, but the purpose is clear)

The existing rules take risk and existence in the English language into consideration, all manually. This new approach automates the English-language check and brings frequency of use on GitHub to the table. This is very useful.

I don't think we can let go of the risk dimension completely. But we can limit the risk assessment to the English-language terms that are also PHP function names. That list is relatively small, and new high-risk additions are rare to nonexistent.

With this approach, I think we can catch all 4 PHP function name issues.

Here is a mermaid diagram of what I have in mind:

```mermaid
flowchart TD
    id0((PHP Function \n names from \n PHP source code))
    idlist((Manual list \n of high-risk \n English words))
    id1{Is word an English word?}
    id2{Is it on the manual high-risk list?}
    id3{Is it a frequently used function name?}
    id933160[Rule 933160]
    id933161[Rule 933161]
    id933150[Rule 933150]
    id933151[Rule 933151]
    id0---->id1
    id1--yes-->id2
    id1--no-->id3
    id2--yes-->id933160
    id2--no-->id933161
    id3--yes-->id933150
    id3--no-->id933151
    idlist---->id2
```

@dune73
Member

dune73 commented Jun 21, 2023

Simplifying the setup a bit. New mermaid:

```mermaid
flowchart TD
    idlist((Manual list \n of high-risk \n English words))
    id933160[Rule 933160\nregex]
    idlist------>id933160
    id0((PHP Function \n names from \n PHP source code))
    id1{Is word an English word?}
    id2{Is it a frequently\nused function\nname?}
    id933161[Rule 933161\nregex]
    id933150[Rule 933150\npmFromFile]
    id933151[Rule 933151\npmFromFile]
    id0---->id1
    id1--yes--->id933161
    id1--no-->id2
    id2--yes-->id933150
    id2--no-->id933151
```
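In code, the simplified flow routes every extracted function name to exactly one generated list (933160 stays manually curated). A toy classifier, where the two sets are stand-ins for the spell check and the GitHub frequency check:

```python
# Stand-ins for the real checks: spell.sh (is it an English word?)
# and bot.py (is it frequently used on GitHub?).
ENGLISH_WORDS = {"assert", "sort", "copy"}
FREQUENT = {"array_map", "str_contains"}

def target_rule(fn: str) -> str:
    """Route a PHP function name per the simplified flowchart."""
    if fn in ENGLISH_WORDS:
        return "933161"   # English word -> regex rule
    if fn in FREQUENT:
        return "933150"   # frequent non-English name -> PL1 pmFromFile
    return "933151"       # rare non-English name -> PL2 pmFromFile
```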

@dune73
Member

dune73 commented Jun 21, 2023

@fzipi and @M4tteoP : I still can't get the token thing to work properly. Script fails for me. Does this work for you? If I can get it working, I'll be happy to come up with a cleaner base script and Matteo can then polish it into a new PR. Also looking at 933160 by hand. The script should be able to output 933161, 933150 and 933151 sources.

@fzipi
Member

fzipi commented Jun 21, 2023

Let me try it...

Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
Signed-off-by: Felipe Zipitria <felipe.zipitria@owasp.org>
@fzipi
Member

fzipi commented Jun 21, 2023

@dune73 Ok, cleaned up a bit.

  • removed the `dict` dependency; it is now pure Python.
  • added a Pipfile for dependencies and isolation.
  • updated the README.

It works properly for me with the token; hopefully it is cleaner now.

@theseion
Contributor

PR for spell.sh update is here: #3242.

@dune73
Member

dune73 commented Jun 21, 2023

OK. I got it. There was a problem with my API token. Solved now.

Thank you for the cleanup.

Script is running as we speak, but I think the filenames are botched. gen.sh `cd`s into the PHP source and then tries to run the filter and bot scripts as local scripts. I created symlinks, so it works, but is it only me?

@dune73
Member

dune73 commented Jun 22, 2023

For the record: The script checking for occurrences of php functions across github takes 16+ hours here. I guess that's the cost of doing business. Correct?

@fzipi
Member

fzipi commented Jun 22, 2023

For the record: The script checking for occurrences of PHP functions across GitHub takes 16+ hours here. I guess that's the cost of doing business. Correct?

Yes. This is because of the rate limit on GitHub. We can only do about 1 query per second, I think; there is a sleep in the code for that reason. The total time in minutes would then be roughly (# of words to check) / 60.

@dune73
Member

dune73 commented Jun 22, 2023

Got you. Thanks. Makes sense.

I'm currently reimplementing bot.py. I'm aiming to save the results into a local file with a timestamp. That way the script can run daily, and the update via GitHub calls only happens after the cache times out.
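A minimal sketch of that caching idea (the file name, cache format, and one-week timeout are assumptions for illustration, not the actual bot.py):

```python
import json
import os
import time

CACHE_FILE = "usage_cache.json"   # hypothetical local cache file
MAX_AGE = 7 * 24 * 3600           # re-query GitHub after one week

def load_cache(path=CACHE_FILE):
    """Read the cached counts, or start fresh if none exist."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def lookup(fn, cache, fetch, now=None):
    """Return the cached count for `fn`; call `fetch` (the GitHub
    query) only when the entry is missing or older than MAX_AGE."""
    now = time.time() if now is None else now
    entry = cache.get(fn)
    if entry is not None and now - entry["ts"] < MAX_AGE:
        return entry["count"]
    count = fetch(fn)
    cache[fn] = {"count": count, "ts": now}
    return count
```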

@dune73
Member

dune73 commented Jun 23, 2023

Existing 933160 is described as follows:

# - Rule 933160: ~220 words which are common in PHP code, but have a higher chance to cause
#               false positives in natural language or other contexts.
#               Examples: 'chr', 'eval'.
#               To mitigate false positives, a regexp looks for PHP function syntax, e.g. 'eval()'.
#               Regexp is generated from function names in util/regexp-assemble/data/933160.data

But this is what it actually looks like:

array_diff_uassoc
array_diff_ukey
array_filter
array_intersect_uassoc
array_intersect_ukey
array_map
array_reduce
array_udiff
array_udiff_assoc
array_udiff_uassoc
array_uintersect
array_uintersect_assoc
array_uintersect_uassoc
assert
assert_options
base64_encode
bson_decode
bson_encode
bzopen
chr
convert_uuencode
...

Maybe this is only me, but I do not see how array_intersect_uassoc has a higher chance to cause false positives in natural language or other contexts unless the other context is talking about PHP code.

Here is existing 933161

abs
acos
acosh
array
arsort
asin
asinh
asort
assert
atan
atan2
atanh
basename
bindec
...

I'm now a bit stuck, but I think we should move to the new recipe outlined above. Running the new spell.sh (the WordNet-based spell.sh) against the PHP functions is meant to give us 933161. Here is that list:

abs
asin
assert
compact
constant
copy
cos
cosh
count
crypt
current
date
decoct
define
defined
end
exec
explode
extract
file
flock
floor
flush
glob
hash
header
implode
key
link
log
mail
max
min
name
next
pack
pass
pi
pow
rand
range
rename
reset
round
serialize
shuffle
sin
sleep
sort
system
tan
time
touch
trim
unpack
virtual

This is substantially smaller than the existing 933160, but also substantially smaller than existing 933161. I like how the automatic list does not come with atan2, but it also misses basename.

What do we do?

@theseion
Contributor

I agree. We might have to start maintaining a list of words we want to treat as common. These words will be hard to detect automatically since they are domain specific but they would lead to false positives. Examples: basename, unset, syslog, symlink. We could then check against that list as well in the spell script.

@dune73
Member

dune73 commented Jun 26, 2023

Ah, what you propose is making the spell.sh check two-fold: first against the English language via WordNet (or aspell, ispell, whatever), and second against a custom list of widespread terms that are not official English but are used casually and are also keywords in different domains (so not only PHP, but Bash, JS, whatever).

I would suggest incorporating this into spell.sh and enabling it by default (with accompanying command-line options where you can choose to get only the first or the second check as described above).
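That two-fold check might look roughly like this (the inline word sets are placeholders for WordNet and the proposed custom list; the flag names are invented to mirror the suggested command-line options):

```python
# Placeholder word sets: the real spell.sh would consult WordNet
# (or aspell/ispell) and a maintained file of casual cross-domain terms.
ENGLISH = {"sort", "copy", "assert", "file", "count"}
CASUAL_TERMS = {"basename", "unset", "syslog", "symlink"}

def is_common_word(word, english_only=False, casual_only=False):
    """Two-fold check: English dictionary first, custom list second.
    The keyword flags mimic the proposed command-line options."""
    if english_only:
        return word in ENGLISH
    if casual_only:
        return word in CASUAL_TERMS
    return word in ENGLISH or word in CASUAL_TERMS
```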

@theseion
Contributor

Yes, precisely.

@dune73
Member

dune73 commented Aug 31, 2023

Thank you very much for launching the discussion with this PR @jptosso. I'm closing this in favor of #3273.

@dune73 dune73 closed this Aug 31, 2023
Successfully merging this pull request may close these issues.

update file php-function-names-933151.data
5 participants