This repository has been archived by the owner on Jun 15, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 47
/
processing-java
23 lines (13 loc) · 2.11 KB
/
processing-java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Processing Java files in the ranking module
Separating scanning for programming languages
We have chosen to treat Java and C executables in a different way. There are good reasons for this:
* strings that are very common in C programs might be very significant for Java programs or vice versa. If all strings would be used for all scans a string found in lots of C programs, but only one Java program, it would be considered irrelevant, even though it is very significant.
* although embedding Java in C programs and vice versa does happen it is not the most common situation
The database in the ranking module has a separate field 'language' where for each string that has been extracted the language of the file is recorded. The language is determined by looking at the extension of the file and using a special lookup table that maps extensions to a programming language.
Processing binaries and narrowing results
Java binaries contain quite a bit of data that is not useful for our string based search, such as datatypes, etcetera. Also, when running the command 'strings' on Java class files sometimes some additional whitespace (like a tab) is printed in front of the stringdata we want to use, because the Java compiler has inserted that at one point.
It is possible to get just string constants out of Java binaries (both .class files and Android's DEX files) and discard all other information. For Java class files this can be done using jcf-dump (part of gcc) and processing the output. For DEX files this can be done by running Dedexer and processing its output.
Granularity of scans
Granularity could possibly be an issue when scanning Java class files. Executables that are generated when compiling C programs (like ELF executables or libraries) usually contain many more strings than
Java class files, which are conceptually perhaps closer to object files than to executables. So the amount of strings extracted from a Java class file compared to a 'normal' executable is significantly lower. Whether or not this will affect the result is currently unknown.
Since Dalvik bytecode is always in one archive it is not a problem there.