# Section 3: Attributes

Attributes are fundamental to the learning process: once they are turned into features and labeled correctly, they serve as the input used to train a classifier. In this section we show how to extract attributes from PE and APK files. There are two ways to extract attributes from software: statically and dynamically.

## Static Analysis

Static analysis consists of extracting attributes from a piece of software without executing it. Typical statically extracted attributes include signature strings, byte sequences, system calls, control flow graphs, and imported libraries [Gandotra et al. 2014]. In this course we study the static analysis of Portable Executable (PE) and Android Package Kit (APK) files.

### PE Files

To extract static attributes from PE files, we use the [pefile](https://github.com/erocarrera/pefile) library, which gives easy access to most of the information contained in the PE headers, the sections, and their data [Yonts 2010, Saxe and Sanders 2018]. Here is an example of how to open a PE file located at "./datasets/samples/pe/".

In [1]:
import pefile
# file location
file_location = "./datasets/samples/pe/WinRAR.exe"
# open file using pefile
pe = pefile.PE(file_location)

Now that the PE file is loaded, we can obtain its attributes, as shown below. The file header holds seven attributes: Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader and Characteristics. Each attribute can also be read individually through the FILE_HEADER variable of the pe object.

In [2]:
# get attributes from file
print(pe.FILE_HEADER)
# get the number of sections from the file header (to read any other attribute, replace "NumberOfSections" with a name listed in the previous print)
print(pe.FILE_HEADER.NumberOfSections)

[IMAGE_FILE_HEADER]
0x114      0x0   Machine:                       0x8664    
0x116      0x2   NumberOfSections:              0x8       
0x118      0x4   TimeDateStamp:                 0x5C72EA4B [Sun Feb 24 19:02:35 2019 UTC]
0x11C      0x8   PointerToSymbolTable:          0x0       
0x120      0xC   NumberOfSymbols:               0x0       
0x124      0x10  SizeOfOptionalHeader:          0xF0      
0x126      0x12  Characteristics:               0x22      
8
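
The raw integers in the file header become easier to reason about once they are decoded. The following is a minimal sketch, not part of pefile: `decode_file_header` is a hypothetical helper, and the two lookup tables contain only a small subset of the constants defined in the PE/COFF specification. The input values are copied from the output above.

```python
from datetime import datetime, timezone

# Subset of the Characteristics flag bits (PE/COFF specification).
CHARACTERISTICS_FLAGS = {
    0x0001: "RELOCS_STRIPPED",
    0x0002: "EXECUTABLE_IMAGE",
    0x0020: "LARGE_ADDRESS_AWARE",
    0x0100: "32BIT_MACHINE",
    0x2000: "DLL",
}

# Subset of the Machine identifiers.
MACHINE_TYPES = {0x014C: "I386", 0x8664: "AMD64"}


def decode_file_header(machine, time_date_stamp, characteristics):
    """Turn raw FILE_HEADER integers into human-readable values."""
    flags = [name for bit, name in sorted(CHARACTERISTICS_FLAGS.items())
             if characteristics & bit]
    return {
        # Unknown machine types fall back to their hexadecimal form.
        "machine": MACHINE_TYPES.get(machine, hex(machine)),
        # TimeDateStamp is a number of seconds since the Unix epoch (UTC).
        "compiled_at": datetime.fromtimestamp(time_date_stamp, tz=timezone.utc),
        "flags": flags,
    }


# Values copied from the FILE_HEADER printed above.
decoded = decode_file_header(0x8664, 0x5C72EA4B, 0x22)
print(decoded["machine"])                  # AMD64
print(decoded["compiled_at"].isoformat())  # 2019-02-24T19:02:35+00:00
print(decoded["flags"])                    # ['EXECUTABLE_IMAGE', 'LARGE_ADDRESS_AWARE']
```

In the notebook, the same helper could be called with `pe.FILE_HEADER.Machine`, `pe.FILE_HEADER.TimeDateStamp` and `pe.FILE_HEADER.Characteristics`.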


Attributes from the optional header are available in the same way: just access the OPTIONAL_HEADER variable of the pe object.

In [3]:
# print attributes from optional header
print(pe.OPTIONAL_HEADER)
# get size of code from optional header (if you want any other attribute, just change "SizeOfCode" to any other attribute listed in previous print)
print(pe.OPTIONAL_HEADER.SizeOfCode)

[IMAGE_OPTIONAL_HEADER64]
0x128      0x0   Magic:                         0x20B     
0x12A      0x2   MajorLinkerVersion:            0xE       
0x12B      0x3   MinorLinkerVersion:            0x0       
0x12C      0x4   SizeOfCode:                    0x10C800  
0x130      0x8   SizeOfInitializedData:         0x1BDA00  
0x134      0xC   SizeOfUninitializedData:       0x0       
0x138      0x10  AddressOfEntryPoint:           0xF1588   
0x13C      0x14  BaseOfCode:                    0x1000    
0x140      0x18  ImageBase:                     0x140000000
0x148      0x20  SectionAlignment:              0x1000    
0x14C      0x24  FileAlignment:                 0x200     
0x150      0x28  MajorOperatingSystemVersion:   0x5       
0x152      0x2A  MinorOperatingSystemVersion:   0x2       
0x154      0x2C  MajorImageVersion:             0x0       
0x156      0x2E  MinorImageVersion:             0x0       
0x158      0x30  MajorSubsystemVersion:         0x5       
0x15A      0x32  MinorSubsyst

Besides the headers, it is also possible to obtain the dynamic libraries a file imports through the DIRECTORY_ENTRY_IMPORT variable of the pe object. This variable holds one object per DLL used by the program, and the library name is available through that object's dll variable, as in the example below.

In [4]:
# get imported dlls:
dlls = []
# walk in DIRECTORY_ENTRY_IMPORT
for d in pe.DIRECTORY_ENTRY_IMPORT:
    # append dll to dlls list
    dlls.append(d.dll)
# print dlls list
print(dlls)

[b'KERNEL32.dll', b'USER32.dll', b'GDI32.dll', b'COMDLG32.dll', b'ADVAPI32.dll', b'SHELL32.dll', b'ole32.dll', b'OLEAUT32.dll', b'SHLWAPI.dll', b'POWRPROF.dll', b'COMCTL32.dll', b'UxTheme.dll', b'gdiplus.dll', b'MSIMG32.dll']
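
A list of DLL names has to be turned into numbers before a classifier can use it. One common option is a presence vector over a fixed vocabulary. The sketch below assumes such an encoding; both helper functions and the vocabulary are hypothetical, and in practice the vocabulary would be collected from the training set.

```python
def normalize_dll_names(raw_names):
    """Decode pefile's byte strings and lowercase them (file names are case-insensitive on Windows)."""
    return sorted({name.decode("ascii", errors="replace").lower() for name in raw_names})


def dll_presence_vector(sample_dlls, vocabulary):
    """Return 1 or 0 per vocabulary entry, depending on whether the sample imports it."""
    present = set(sample_dlls)
    return [1 if dll in present else 0 for dll in vocabulary]


# A few entries copied from the output above.
raw = [b'KERNEL32.dll', b'USER32.dll', b'ole32.dll', b'gdiplus.dll']
sample = normalize_dll_names(raw)

# Hypothetical vocabulary, fixed so that every sample yields a vector of the same length.
vocabulary = ["advapi32.dll", "kernel32.dll", "user32.dll", "ws2_32.dll"]

print(sample)                                   # ['gdiplus.dll', 'kernel32.dll', 'ole32.dll', 'user32.dll']
print(dll_presence_vector(sample, vocabulary))  # [0, 1, 1, 0]
```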


Through the same variable, it is possible to get all the functions (API calls) that the program imports from each library, as shown below.

In [5]:
# get used api calls
symbols = []
# walk in DIRECTORY_ENTRY_IMPORT
for i in pe.DIRECTORY_ENTRY_IMPORT:
    # walk in imported symbols
    for s in i.imports:
        # skip symbols without a name (for example, those imported by ordinal)
        if s.name is not None:
            # append to symbols list
            symbols.append(s.name)
# print symbols
print(symbols)

[b'DeviceIoControl', b'BackupRead', b'BackupSeek', b'GetShortPathNameW', b'GetLongPathNameW', b'GetFileType', b'GetStdHandle', b'FlushFileBuffers', b'GetFileTime', b'GetDiskFreeSpaceExW', b'GetVersionExW', b'GetCurrentDirectoryW', b'GetFullPathNameW', b'FoldStringW', b'LoadResource', b'SizeofResource', b'FindResourceW', b'LoadLibraryExW', b'CompareStringA', b'GetCurrentThread', b'SetThreadPriority', b'SetThreadExecutionState', b'CreateEventW', b'GetSystemDirectoryW', b'SetCurrentDirectoryW', b'GetFullPathNameA', b'SetPriorityClass', b'GetProcessAffinityMask', b'CreateThread', b'InitializeCriticalSection', b'EnterCriticalSection', b'LeaveCriticalSection', b'DeleteCriticalSection', b'SetEvent', b'ResetEvent', b'ReleaseSemaphore', b'CreateSemaphoreW', b'GetSystemTime', b'TzSpecificLocalTimeToSystemTime', b'GetCPInfo', b'IsDBCSLeadByte', b'WideCharToMultiByte', b'CompareStringW', b'GetModuleHandleExW', b'GetCompressedFileSizeW', b'EnumResourceNamesW', b'EnumResourceLanguagesW', b'BeginUpda
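
The flat list above loses the information about which library each function comes from. A possible variation, sketched here under assumptions, groups the names by DLL. `imports_by_dll` is a hypothetical helper that relies only on the `DIRECTORY_ENTRY_IMPORT`, `dll`, `imports` and `name` attributes used in the cells above. The `getattr` default reflects our understanding that pefile does not set `DIRECTORY_ENTRY_IMPORT` when a file has no import directory. A stand-in object replaces the real PE file so that the sketch runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def imports_by_dll(pe):
    """Map each imported DLL name to the list of function names imported from it."""
    grouped = defaultdict(list)
    # The default value guards against files that have no import directory.
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode("ascii", errors="replace").lower()
        for symbol in entry.imports:
            # Symbols imported by ordinal have no name and are skipped.
            if symbol.name is not None:
                grouped[dll].append(symbol.name.decode("ascii", errors="replace"))
    return dict(grouped)


# Stand-in that mimics the attributes read above (not a real pefile object).
fake_pe = SimpleNamespace(DIRECTORY_ENTRY_IMPORT=[
    SimpleNamespace(dll=b"KERNEL32.dll", imports=[
        SimpleNamespace(name=b"DeviceIoControl"),
        SimpleNamespace(name=None),
        SimpleNamespace(name=b"CreateThread"),
    ]),
])

result = imports_by_dll(fake_pe)
print(result)  # {'kernel32.dll': ['DeviceIoControl', 'CreateThread']}
```

In the notebook, calling `imports_by_dll(pe)` on the object opened earlier should work the same way.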

Print all PE header information:

In [6]:
# print(pe.dump_info())

To get file header information in a simpler way, we recommend using the dictionary returned by *pe.dump_dict()*, which holds the same data that *pe.dump_info()* prints in the previous example. This step helps us build an attribute vector from which features for our machine learning models can later be extracted.

In [7]:
# pe.dump_dict() returns the dict used to build the string from the previous example
for k, v in pe.dump_dict()['FILE_HEADER'].items():
    # check if it is a dict to get its value
    if isinstance(v,dict):
        print("{}: {}".format(k, v["Value"]))

NumberOfSections: 8
PointerToSymbolTable: 0
Machine: 34404
NumberOfSymbols: 0
SizeOfOptionalHeader: 240
TimeDateStamp: 0x5C72EA4B [Sun Feb 24 19:02:35 2019 UTC]
Characteristics: 34
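
Note that TimeDateStamp comes back as a string while the other values are integers, so the values need some normalization before they can be placed in a numeric vector. The sketch below shows one way to do it. `header_to_numbers` is a hypothetical helper, and the input dictionary is a hand-written stand-in that only mimics the `Value` entries printed above. The string handling assumes the value starts with a hexadecimal number, as TimeDateStamp does here.

```python
def header_to_numbers(header):
    """Convert one header section of dump_dict() into a {field: int} mapping."""
    numbers = {}
    for field, entry in header.items():
        # Entries that are not dicts (such as the structure name) carry no value.
        if not isinstance(entry, dict):
            continue
        value = entry["Value"]
        if isinstance(value, str):
            # e.g. "0x5C72EA4B [Sun Feb 24 ...]": keep only the leading hex number.
            value = int(value.split()[0], 16)
        numbers[field] = value
    return numbers


# Stand-in shaped like pe.dump_dict()["FILE_HEADER"], with values from the output above.
file_header = {
    "Structure": "IMAGE_FILE_HEADER",
    "Machine": {"Value": 34404},
    "NumberOfSections": {"Value": 8},
    "TimeDateStamp": {"Value": "0x5C72EA4B [Sun Feb 24 19:02:35 2019 UTC]"},
    "Characteristics": {"Value": 34},
}

numbers = header_to_numbers(file_header)
# Sorting the field names gives every sample the same column order.
vector = [numbers[name] for name in sorted(numbers)]
print(vector)  # [34, 34404, 8, 1551034955]
```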


For more information about pefile, we recommend reading this [article](https://axcheron.github.io/pe-format-manipulation-with-pefile/), which presents more interesting examples.

#### Exercise:
Based on the samples presented, create a Python function that extracts attributes from a single PE file and returns them as a list. After that, extract the attributes from all files located at './datasets/samples/pe/' and create an attribute matrix.

In [8]:
from glob import glob
# data path
path = "./datasets/samples/pe/*.exe"
# get list of files
files = glob(path)
# method to extract attributes from a single file
def extract_attributes(file):
    # continue here...
    return

### APK Files

In this example, we use the [androguard](https://github.com/androguard/androguard) library to obtain static attributes from APK files. It can extract information from the manifest, the resources, the disassembled DEX code, and much more. As with pefile, here is an example of how to open an APK sample located at "./datasets/samples/apk/" (this may take some time, depending on the APK size). The main difference is that this library returns three objects: an APK object (which provides all the APK information), a list of DalvikVMFormat objects (which correspond to the DEX files found inside the APK) and an Analysis object (which contains special classes to deal with multi-DEX apps). For more details about this library, we recommend reading its [documentation](https://androguard.readthedocs.io/en/latest/index.html).

In [9]:
from androguard.misc import AnalyzeAPK
# file location
file_location = "./datasets/samples/apk/com.whatsapp_2.19.203-452877_minAPI15(armeabi-v7a)(nodpi)_apkmirror.com.apk"
# open file using AnalyzeAPK
a, d, dx = AnalyzeAPK(file_location)

After reading the APK, it is possible to get its permissions:

In [10]:
print(a.get_permissions())

['android.permission.AUTHENTICATE_ACCOUNTS', 'android.permission.CHANGE_WIFI_STATE', 'android.permission.ACCESS_FINE_LOCATION', 'android.permission.INTERNET', 'android.permission.SEND_SMS', 'android.permission.WRITE_SYNC_SETTINGS', 'com.sec.android.provider.badge.permission.WRITE', 'android.permission.WRITE_EXTERNAL_STORAGE', 'com.huawei.android.launcher.permission.WRITE_SETTINGS', 'com.whatsapp.permission.BROADCAST', 'android.permission.REQUEST_INSTALL_PACKAGES', 'android.permission.BROADCAST_STICKY', 'android.permission.RECEIVE_BOOT_COMPLETED', 'com.whatsapp.permission.MAPS_RECEIVE', 'android.permission.READ_SYNC_STATS', 'android.permission.NFC', 'com.huawei.android.launcher.permission.READ_SETTINGS', 'android.permission.VIBRATE', 'android.permission.WRITE_CONTACTS', 'android.permission.READ_SYNC_SETTINGS', 'android.permission.MANAGE_ACCOUNTS', 'android.permission.RECEIVE_SMS', 'com.sonymobile.home.permission.PROVIDER_INSERT_BADGE', 'android.permission.GET_TASKS', 'com.whatsapp.permi
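
Permission lists can be summarized into a few numeric attributes. The sketch below is one possible summary, not something androguard provides: `permission_features` is a hypothetical helper, and using the `android.permission.` prefix to separate platform permissions from vendor or app-defined ones is only a heuristic. The sample list reuses a few entries from the output above.

```python
PLATFORM_PREFIX = "android.permission."


def permission_features(permissions):
    """Count platform and custom permissions and flag SMS-related requests."""
    platform = sorted(p for p in permissions if p.startswith(PLATFORM_PREFIX))
    custom = sorted(p for p in permissions if not p.startswith(PLATFORM_PREFIX))
    sms = any(p.endswith((".SEND_SMS", ".RECEIVE_SMS")) for p in platform)
    return {
        "n_platform": len(platform),
        "n_custom": len(custom),
        "requests_sms": int(sms),
        "custom": custom,
    }


# A few entries copied from the output above.
permissions = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "com.whatsapp.permission.BROADCAST",
    "com.huawei.android.launcher.permission.WRITE_SETTINGS",
]

features = permission_features(permissions)
print(features["n_platform"], features["n_custom"], features["requests_sms"])  # 2 2 1
```

In the notebook, the input would be the list returned by `a.get_permissions()`.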

It is also possible to get a list of all activities, which are defined in the AndroidManifest.xml:

In [11]:
print(a.get_activities())



With the same object, we can get the package and app name:

In [12]:
print(a.get_package())
print(a.get_app_name())

com.whatsapp


invalid decoded string length
invalid decoded string length


WhatsApp


We can also get the numeric version, the version string, and the minimal, maximal, target and effective SDK versions:

In [13]:
print(a.get_androidversion_code())
print(a.get_androidversion_name())
print(a.get_min_sdk_version())
print(a.get_max_sdk_version())
print(a.get_target_sdk_version())

452877
2.19.203
15
None
28
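
Version information may be missing (the maximal SDK version is `None` above), and we assume here that the getters return strings rather than integers. A minimal sketch of how such values could be made numeric follows. `to_int` and `sdk_features` are hypothetical helpers, and using -1 as the marker for a missing value is an arbitrary choice.

```python
def to_int(value, missing=-1):
    """Convert a manifest value to int; None or non-numeric input maps to `missing`."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return missing


def sdk_features(version_code, min_sdk, max_sdk, target_sdk):
    """Build numeric attributes from the version fields of an APK."""
    min_v, max_v, target_v = to_int(min_sdk), to_int(max_sdk), to_int(target_sdk)
    both_known = min_v != -1 and target_v != -1
    return {
        "version_code": to_int(version_code),
        "min_sdk": min_v,
        "target_sdk": target_v,
        "has_max_sdk": int(max_v != -1),
        # Distance between the target and the minimal SDK version, when both are known.
        "sdk_span": target_v - min_v if both_known else -1,
    }


# Values copied from the output above.
features = sdk_features("452877", "15", None, "28")
print(features)
```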


To get the decoded XML of the AndroidManifest.xml (the code is commented out because the output is long):

In [14]:
# print(a.get_android_manifest_axml().get_xml())

To get all classes from the DEX files (the code is commented out because the output is long):

In [15]:
# print(dx.get_classes())

#### Exercise:
As in the PE static attributes exercise, create a Python function that extracts attributes from a single APK file and returns them as a list. After that, extract the attributes from all files located at './datasets/samples/apk/' and create an attribute matrix.

In [16]:
from glob import glob
# data path
path = "./datasets/samples/apk/*.apk"
# get list of files
files = glob(path)
# method to extract attributes from a single file
def extract_attributes(file):
    # continue here...
    return

## Dynamic Analysis

Dynamic analysis consists of extracting attributes from a piece of software by executing it in a controlled environment (virtual machine, simulator, emulator, sandbox, etc.). During the execution, it is common to monitor the invoked system calls and their parameters, the information flow, the execution traces, the resources used, and so on [Gandotra et al. 2014]. For Windows, the literature describes many strategies for dynamically analyzing software [Botacin et al. 2018]. One of the most widely used tools is [Cuckoo Sandbox](https://cuckoo.sh/docs/), an open source tool that automates the whole dynamic analysis process for Windows files and reports their execution traces (changed and read files, changed registry keys, etc.), network traffic, and even changes in memory [Oktavianto and Muhardianto 2013]. As this is an extensive topic, we recommend that interested readers consult Cuckoo's documentation. The topic is just as extensive for Android, so we recommend the surveys [Hoffmann et al. 2016, Tam et al. 2017] for more information about the dynamic analysis tools available for that platform.

It is also possible to get dynamic execution reports through VirusTotal (for both Android and Windows). However, this is part of the private API, which is paid. An example report for a Windows program is shown below: a JSON document produced by a version of Cuckoo Sandbox modified by VirusTotal, containing information about the behavior of the sample, such as the executed functions (with their parameters) and the hosts contacted in the network traffic. This is a compacted version of the original log, which can be found [here](https://tinyurl.com/y36uxamz). Another example of dynamic analysis, this time for an Android app, can be found [here](https://tinyurl.com/y25wkow6).

>{ "info":{ "started":"2013-02-27 14:44:31", "duration":"15 seconds", "version":"v0.1", "ended":"2013-02-27 14:44:46" },
>
>   "network":{ "hosts":["0.0.0.0", "255.255.255.255", "10.0.2.2", "10.0.2.15", "239.255.255.250", "224.0.0.22", "65.55.21.14", "10.0.2.255"] },
>
>   "behavior":{  
>
>      "processes":[  
>
>         {  "parent_id":"1940",
>
>            "process_id":"2000",
>
>            "process_name":"6c7a2a4dae13df742a60c0fe3c1d319eaeb6f10eb63a10ea3cce234bbdc08c9e",
>
>            "first_seen":"20130227134444.940",
>
>            "calls":[ 
>
>               {  "category":"device", "status":"SUCCESS",
>
>                  "return":"",
>
>                  "timestamp":"20130227134444.940",
>
>                  "repeated":6, "api":"DeviceIoControl",
>
>                  "arguments":[  
>
>                     { "name":"hDevice", "value":"0x00000044" },
>
>                     { "name":"dwIoControlCode", "value":"0x00390008" },
>
>                     { "name":"lpInBuffer", "value":"0x77e46318" },
>
>                     { "name":"nInBufferSize", "value":"0x00000100" },
>
>                     { "name":"lpOutBuffer", "value":"0x0012fbbc" },
>
>                     { "name":"nOutBufferSize", "value":"0x00000100" },
>
>                     { "name":"lpBytesReturned", "value":"0x0012fbb4" },
>
>                     { "name":"lpOverlapped", "value":"0x00000000" } ]
>
>               } ] } ],
>
>      "summary":{  
>
>         "files":[ "C:\\6c7a2a4dae13df742a60c0fe3c1d319eaeb6f10eb63a10ea3cce234bbdc08c9e" ]
>
>} } }
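
A report like the one above can be reduced to a few dynamic attributes with the standard `json` module. The sketch below is a rough illustration: `behavior_features` is a hypothetical helper, the embedded report is a shortened copy of the example with only the keys that are read, and treating `repeated` as the number of additional identical calls is an assumption about the report format.

```python
import json
from collections import Counter

# Shortened copy of the report above, keeping only the keys read below.
report_text = """
{ "network": { "hosts": ["10.0.2.2", "65.55.21.14", "10.0.2.2"] },
  "behavior": { "processes": [
      { "process_id": "2000",
        "calls": [
          { "category": "device", "status": "SUCCESS",
            "repeated": 6, "api": "DeviceIoControl" }
        ] } ] } }
"""


def behavior_features(report):
    """Count API calls per name and collect the contacted hosts."""
    api_counts = Counter()
    processes = report.get("behavior", {}).get("processes", [])
    for process in processes:
        for call in process.get("calls", []):
            # Assumption: "repeated" holds the number of additional identical calls.
            api_counts[call["api"]] += 1 + int(call.get("repeated", 0))
    hosts = sorted(set(report.get("network", {}).get("hosts", [])))
    return {
        "n_processes": len(processes),
        "api_counts": dict(api_counts),
        "hosts": hosts,
    }


features = behavior_features(json.loads(report_text))
print(features)
```

The API counts could then be mapped onto a fixed vocabulary of function names to obtain a vector of the same length for every sample.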

## References

[Botacin et al. 2018] Botacin, M., de Geus, P. L., and Grégio, A. (2018). The other guys: automated analysis of marginalized malware. Journal of Computer Virology and Hacking Techniques, 14(1):87–98.

[Gandotra et al. 2014] Gandotra, E., Bansal, D., and Sofat, S. (2014). Malware analysis and classification: A survey. Journal of Information Security, 5(2):56–64.

[Hoffmann et al. 2016] Hoffmann, J., Rytilahti, T., Maiorca, D., Winandy, M., Giacinto, G., and Holz, T. (2016). Evaluating analysis tools for android apps: Status quo and robustness against obfuscation. pages 139–141.

[Oktavianto and Muhardianto 2013] Oktavianto, D. and Muhardianto, I. (2013). Cuckoo Malware Analysis. Packt Publishing.

[Saxe and Sanders 2018] Saxe, J. and Sanders, H. (2018). Malware Data Science: Attack Detection and Attribution. No Starch Press, San Francisco, CA, USA.

[Tam et al. 2017] Tam, K., Feizollah, A., Anuar, N. B., Salleh, R., and Cavallaro, L. (2017). The evolution of android malware and android analysis techniques. ACM Comput. Surv., 49(4):76:1–76:41.

[Yonts 2010] Yonts, J. (2010). Building a Malware Zoo. The SANS Institute.