EMOTET: a State-Machine reversing exercise
Around the 20th of December 2020, there was one of the "usual" EMOTET email campaign hitting several countries. I had the possibility to get some sample and I decided to make this little analysis, to deep dive some specific aspects of the malware itself.
In particular I had a look to how the malware has been written, with an analysis of the interesting techniques used.
There is a very good analysis done by Fortinet in 2019, where the also the first stage has been analyzed. My exercise is more focused on the second stage on a recent sample.
In this repository you will find all the DLLs, scripts and tools used for the analysis, with the annotated Ghidra project file, with all the mapping to my findings (API calls, program logic, etc). You can use this as starting point for additional investigation on it. Enjoy ;-)
The infection chain
EMOTET is usually spread by using e-mail campaign (in this case in Italian language)
This particular sample is coming from what we can call the usual infection chain:
- delivery of an e-mail with a malicious zipped document
- once opened, the document runs an obfuscated powershell script and downloads the 2nd stage
- the 2nd stage (in form of a DLL) is then executed
- the 2nd stage establish some persistence and try to connect a C2
The initial triage
All the files used for this analysis are in the repository. The "dangerous" ones are password protected (with the usual pwd).
The DLL (
sg.dll) has the following characteristics:
File Name: sg.dll Size: 340480 SHA1: b08e07b1d91f8724381e765d695601ea785d8276
This DLL exports a single function named
RunDLL: once executed, it decrypts "in-memory" an additional DLL. This one, dumped as
dump_1_0418.bin, is the target of my analysis:
File Name: dump_1_0418.bin Size: 122880 SHA1: 57cd8eac09714effa7b6f70b34039bbace4a3e23
An initial overview of the dumped DLL, shows immediately that we don't have any string visible in it, no imports and a first look to the disassembly shows a heavily obfuscated code. We need to do some work here.
I fired up Ghidra and started to snoop around. Starting from the only exported function
RunDLL you quickly end up to
FUN_10009716 where you can spot a main loop with a kind of "State-Machine":
It looks like that a given double-word (stored in
ECX) is controlling what the program is doing. But this looks convoluted and not very easy to unroll, since nothing is really in clear. For example, if you try to isolate the library API call in x64dbg, you will face something like this:
Every single API call is done in this way: there is a bunch of
MOV, XOR, SHIFT and PUSH followed by a call to
xxx606F (first red box), which decode in EAX the address of the function (called by the second red box). The number of
PUSH just before the
CALL EAX are the parameters, which could be worth to inspect.
The same "state" approach is also used in several sub-functions, not only in the main loop. So, everything looks time consuming, and I'd like to find a way to get the high level picture of it.
This tool is a little gem: Speakeasy can emulate the execution of user and kernel mode malware, allowing you to interact with the emulated code by using quick Python scripts. What I'd like to do was to map every single state of the machine (
ECX value of the main loop), to something more meaningful, like DLL API calls.
I had to work a bit to get what I wanted:
- the emulation was failing in more than one point, with some invalid read. I investigated a bit the reason, and I saw that sometimes the
CALL EAXdone in some location was not valid (
EAXset to 0). I decided to get the easy way and just skip these calls
- I had to modify the call to a specific API (
- I mapped the machine state
- added a
--stateswitch to control the flow of the emulation. You can use it to explore all the states (ex.
--state 0x167196bc). You may encounter errors if needed parts are not initialized, but you can reconstruct the proper flow by looking at the Ghidra decompilation
- in a second iteration, knowing where strings are decrypted, I added a dump of all the strings in clear (see following sections)
Then the execution of the final script (
python emu_emotetdll.py -f sg.dll) gave me something very interesting. The list of the imported DLLs (with related addresses):
0x10017a4c: 'kernel32.LoadLibraryW("advapi32.dll")' -> 0x78000000 0x10017a4c: 'kernel32.LoadLibraryW("crypt32.dll")' -> 0x58000000 0x10017a4c: 'kernel32.LoadLibraryW("shell32.dll")' -> 0x69000000 0x10017a4c: 'kernel32.LoadLibraryW("shlwapi.dll")' -> 0x67000000 0x10017a4c: 'kernel32.LoadLibraryW("urlmon.dll")' -> 0x54500000 0x10017a4c: 'kernel32.LoadLibraryW("userenv.dll")' -> 0x76500000 0x10017a4c: 'kernel32.LoadLibraryW("wininet.dll")' -> 0x7bc00000 0x10017a4c: 'kernel32.LoadLibraryW("wtsapi32.dll")' -> 0x63000000 ...
and a lot of API calls, mapped to the machine state:
[+] State: 1de2d3e5 0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280 0x10018080: 'kernel32.HeapAlloc(0x7280, 0x8, 0x4c)' -> 0x72a0 [+] State: 5c80354 0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280 0x10018080: 'kernel32.HeapAlloc(0x7280, 0x8, 0x20)' -> 0x72f0 0x10017a4c: 'kernel32.LoadLibraryW("advapi32.dll")' -> 0x78000000 0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280 0x10014b3a: 'kernel32.HeapFree(0x7280, 0x0, 0x72f0)' -> 0x1 0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280 ...
This list was not complete (because I skipped on purpose some failing calls and probably some calls were not correctly intercepted), but it gave me an overall picture of what was going on. Thanks FireEye!
With the help of Speakeasy output and a combination of dynamic and static analysis (done with x64gdb and Ghidra), I was able to reconstruct the main flows of the Malware. Consider that these flows are not complete, they are high level snapshot of what is going on for some (not all) the "states". I'm sure something is missing. This is the "main" flow
Then we have the "Persistency" flow (the yellow boxes are the interesting ones):
And the initial "C2" communication flow:
Not all the states were explored. I focused on persistence and initial C2. The great thing of this approach is that you can now alter the execution flow, by setting the
ECX value you want to explore or execute.
I added a lot of details in the Ghidra file, by renaming the API calls and inserting comments. Every number reported in the graphs (ex 19a) are in the comments, so you can easily track the code section.
I renamed the functions with this standard:
- a single underscore in front of API calls
- a double underscore in front of internal function calls
Interesting findings: encrypted strings
All the strings are encrypted in a BLOB, located, in this particular dumped sample, at
The green box is the XOR key and the yellow one is the length of the string. The function used to perform the decryption is the
Every single string is decrypted and then removed from memory after usage. This is true even for C format strings. So you will not find anything in memory if you try to inspect the mapped sections at runtime.
As said before, I added a specific section in the Speakeasy script to dump those strings.
Interesting findings: list of C2 servers
IP of C2 are dumped form the same BLOB (in this case at
0x1CA00) just after the decryption in step
As stated in Fortinet Analysis, this list is made of IP (green box) and port (yellow box). You can decode the whole list if you pass this part of the binary in the following python code:
import sys import struct b = bytearray(sys.stdin.buffer.read()) for x in range(0,len(b),8): print('%u.%u.%u.%u:%u' % (b[x+3],b[x+2],b[x+1],b[x],struct.unpack('<H',bytes(b[x+4:x+6]))))
You can find the full list extracted in IoC section.
Interesting findings: persistence
This particular sample obtain persistency by installing a System Service. This campaign deployed different versions of the DLL using also different techniques:
Run Registry Key is one of them.
The section installing the service is the 20a (state
0x204C3E9E). The high level steps are the following:
decrypt the format string
generates random chars to build the service name (which results in something like
get one random "Service Description" from the existing ones, and use it as description of the new service
Interesting findings: encrypted communications with C2
In section 8a (state
0x1C904052) we can spot out the load of a RSA public key
After this we have a call to
CryptGenKey with algo
CALG_AES_128. So it looks that the sample is going to use a symmetric key to encrypt communication.
In section 20a (state
0x386459ce) we see how the communication is encrypted:
- CryptEncrypt of the buffer to send, with the previous key
- CryptExportKey encrypted with the RSA public key
- the exported and encrypted symmetric key is then prepended to the buffer sent via HTTP
The analysis is far to be complete, there are a lot of unexplored part of the sample. At the end my goal was to build a procedure to make the analysis easier, even for different or future samples, where it would be faster to understand the overall picture.
C2 IP list:
18.104.22.168:80 22.214.171.124:80 126.96.36.199:443 188.8.131.52:8080 184.108.40.206:80 220.127.116.11:8080 18.104.22.168:443 22.214.171.124:8080 126.96.36.199:80 188.8.131.52:443 184.108.40.206:80 220.127.116.11:8080 18.104.22.168:443 22.214.171.124:8080 126.96.36.199:7080 188.8.131.52:80 184.108.40.206:80 220.127.116.11:443 18.104.22.168:443 22.214.171.124:8080 126.96.36.199:80 188.8.131.52:8080 184.108.40.206:8080 220.127.116.11:8080 18.104.22.168:80 22.214.171.124:443 126.96.36.199:8080 188.8.131.52:443 184.108.40.206:80 220.127.116.11:7080 18.104.22.168:8080 22.214.171.124:80 126.96.36.199:443 188.8.131.52:8080 184.108.40.206:7080 220.127.116.11:8080 18.104.22.168:443 22.214.171.124:80 126.96.36.199:443 188.8.131.52:80 184.108.40.206:8080 220.127.116.11:7080 18.104.22.168:7080 22.214.171.124:8080 126.96.36.199:80 188.8.131.52:80 184.108.40.206:80 220.127.116.11:80 18.104.22.168:8080 22.214.171.124:80 126.96.36.199:80 188.8.131.52:8080 184.108.40.206:80 220.127.116.11:8080 18.104.22.168:80 22.214.171.124:8080 126.96.36.199:8080 188.8.131.52:80 184.108.40.206:443 220.127.116.11:80 18.104.22.168:80 22.214.171.124:80 126.96.36.199:8080 188.8.131.52:80 184.108.40.206:7080 220.127.116.11:80 18.104.22.168:80 22.214.171.124:8080 126.96.36.199:80 188.8.131.52:80 184.108.40.206:7080 220.127.116.11:443 18.104.22.168:8080 22.214.171.124:443 126.96.36.199:8080 188.8.131.52:80 184.108.40.206:80 220.127.116.11:8080 18.104.22.168:443 22.214.171.124:4143 126.96.36.199:8080 188.8.131.52:80 184.108.40.206:80 220.127.116.11:8090 18.104.22.168:8080 22.214.171.124:8080 126.96.36.199:7080 188.8.131.52:7080 184.108.40.206:8080 220.127.116.11:80 18.104.22.168:443 22.214.171.124:80 126.96.36.199:80 188.8.131.52:443 184.108.40.206:80 220.127.116.11:443 18.104.22.168:35457 22.214.171.124:19752 126.96.36.199:47055