# Section 1.2: Datasets

This section introduces how to create and read a dataset effectively. We classify datasets into three formats:

* **Raw Data:** data kept exactly as they were collected. For example: PE or ELF executables and APK packages;
* **Attributes:** filtered metadata extracted from the raw data, with less noise and a focus on the data that really matters. For example: a CSV with metadata, the execution logs of a piece of software, or data extracted from its header;
* **Features:** features extracted from the attributes or raw data, ready to be used by a classifier. For example: feature vectors extracted from the attributes collected before.

With these definitions in mind, we will first introduce how to extract attributes from raw data. 

## Requirements:

* **Python Version**: make sure you are using **Python 3.5 or higher**.
* **Libraries:** all the Python libraries used are listed in the file "requirements.txt". To install them, run the following command (using pip): 
> *pip install -r requirements.txt*
* **Datasets:** the datasets located in the folder "./datasets/" are used throughout the entire course. They are all .zip archives. When extracting them, make sure the .csv files end up in this same folder. To extract an archive, use the following command from a terminal: 
> *unzip \<filename\>.zip*
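
If the `unzip` command is not available (for example on Windows), the same extraction step can be sketched with Python's standard `zipfile` module. The helper name `extract_archives` is our own and not part of the course material; it assumes every archive should be unpacked into the folder that contains it.

```python
import zipfile
from pathlib import Path


def extract_archives(folder):
    """Extract every .zip archive found in `folder` into that same folder.

    Returns the names of all extracted members, in archive order.
    """
    folder = Path(folder)
    extracted = []
    # sorted() only makes the processing order deterministic
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted
```

Calling `extract_archives("./datasets/")` would then unpack all archives in place. If an archive stores its files inside a subfolder, the .csv files still have to be moved up by hand.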

## Extracting Attributes from PE Files

There are two ways to extract attributes from software: statically and dynamically.

### Static Attributes

To extract static attributes from PE files, we are going to use the library [pefile](https://github.com/erocarrera/pefile). Most of the information contained in the PE headers, the sections and their data is easily accessible through it.

Here is an example using one PE sample, located at "./datasets/samples/pe/".

In [1]:
import pefile
# file location
file_location = "./datasets/samples/pe/WinRAR.exe"
# open file using pefile
pe = pefile.PE(file_location)

Now that we have our PE file loaded, let's get its header attributes:

In [2]:
# get attributes from file
print(pe.FILE_HEADER)
# get the number of sections from the file header (if you want any other attribute, just change "NumberOfSections" to any other attribute listed in the previous print)
print(pe.FILE_HEADER.NumberOfSections)

[IMAGE_FILE_HEADER]
0x114      0x0   Machine:                       0x8664    
0x116      0x2   NumberOfSections:              0x8       
0x118      0x4   TimeDateStamp:                 0x5C72EA4B [Sun Feb 24 19:02:35 2019 UTC]
0x11C      0x8   PointerToSymbolTable:          0x0       
0x120      0xC   NumberOfSymbols:               0x0       
0x124      0x10  SizeOfOptionalHeader:          0xF0      
0x126      0x12  Characteristics:               0x22      
8

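The `TimeDateStamp` field is stored as a plain integer (seconds since the Unix epoch), and the bracketed date in the output above is only a rendering of it. As a minimal sketch, the integer can be converted with the standard library; the helper name `timestamp_to_utc` is ours, and the value is hard-coded from the output above so the snippet runs without the sample file.

```python
from datetime import datetime, timezone


def timestamp_to_utc(time_date_stamp):
    """Convert a PE TimeDateStamp (seconds since the Unix epoch) to a UTC datetime."""
    return datetime.fromtimestamp(time_date_stamp, tz=timezone.utc)


# value copied from the FILE_HEADER output above
compiled_at = timestamp_to_utc(0x5C72EA4B)
print(compiled_at.isoformat())  # 2019-02-24T19:02:35+00:00
```

With a loaded file, the same helper could be called as `timestamp_to_utc(pe.FILE_HEADER.TimeDateStamp)`.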

It is also possible to get attributes from the optional header:

In [3]:
# print attributes from optional header
print(pe.OPTIONAL_HEADER)
# get size of code from optional header (if you want any other attribute, just change "SizeOfCode" to any other attribute listed in the previous print)
print(pe.OPTIONAL_HEADER.SizeOfCode)

[IMAGE_OPTIONAL_HEADER64]
0x128      0x0   Magic:                         0x20B     
0x12A      0x2   MajorLinkerVersion:            0xE       
0x12B      0x3   MinorLinkerVersion:            0x0       
0x12C      0x4   SizeOfCode:                    0x10C800  
0x130      0x8   SizeOfInitializedData:         0x1BDA00  
0x134      0xC   SizeOfUninitializedData:       0x0       
0x138      0x10  AddressOfEntryPoint:           0xF1588   
0x13C      0x14  BaseOfCode:                    0x1000    
0x140      0x18  ImageBase:                     0x140000000
0x148      0x20  SectionAlignment:              0x1000    
0x14C      0x24  FileAlignment:                 0x200     
0x150      0x28  MajorOperatingSystemVersion:   0x5       
0x152      0x2A  MinorOperatingSystemVersion:   0x2       
0x154      0x2C  MajorImageVersion:             0x0       
0x156      0x2E  MinorImageVersion:             0x0       
0x158      0x30  MajorSubsystemVersion:         0x5       
0x15A      0x32  MinorSubsyst

Getting imported DLLs:

In [4]:
# get imported dlls:
dlls = []
# walk in DIRECTORY_ENTRY_IMPORT
for d in pe.DIRECTORY_ENTRY_IMPORT:
    # append dll to dlls list
    dlls.append(d.dll)
# print dlls list
print(dlls)

[b'KERNEL32.dll', b'USER32.dll', b'GDI32.dll', b'COMDLG32.dll', b'ADVAPI32.dll', b'SHELL32.dll', b'ole32.dll', b'OLEAUT32.dll', b'SHLWAPI.dll', b'POWRPROF.dll', b'COMCTL32.dll', b'UxTheme.dll', b'gdiplus.dll', b'MSIMG32.dll']

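The DLL names come back as byte strings with inconsistent casing (`b'KERNEL32.dll'` next to `b'ole32.dll'`), which makes them awkward to compare across samples. One possible normalization is sketched below. The helper name `imported_dlls` is our own, and the use of `getattr` reflects our assumption that a file without an import directory has no `DIRECTORY_ENTRY_IMPORT` attribute at all. The function only relies on each entry having a `.dll` attribute, so it is demonstrated here on stand-in objects instead of a real file.

```python
from types import SimpleNamespace


def imported_dlls(pe):
    """Return the imported DLL names as lowercase text strings."""
    # assumption: the attribute may be missing when the file imports nothing
    entries = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
    return [entry.dll.decode("utf-8", errors="replace").lower() for entry in entries]


# stand-in for a loaded pefile.PE object, using names from the output above
fake_pe = SimpleNamespace(DIRECTORY_ENTRY_IMPORT=[
    SimpleNamespace(dll=b"KERNEL32.dll"),
    SimpleNamespace(dll=b"ole32.dll"),
])
print(imported_dlls(fake_pe))  # ['kernel32.dll', 'ole32.dll']
```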

Getting imported API calls from these DLLs:

In [5]:
# get used api calls
symbols = []
# walk in DIRECTORY_ENTRY_IMPORT
for i in pe.DIRECTORY_ENTRY_IMPORT:
    # walk in imported symbols
    for s in i.imports:
        # check if symbol is valid
        if s.name is not None:
            # append to symbols list
            symbols.append(s.name)
# print symbols
print(symbols)

[b'DeviceIoControl', b'BackupRead', b'BackupSeek', b'GetShortPathNameW', b'GetLongPathNameW', b'GetFileType', b'GetStdHandle', b'FlushFileBuffers', b'GetFileTime', b'GetDiskFreeSpaceExW', b'GetVersionExW', b'GetCurrentDirectoryW', b'GetFullPathNameW', b'FoldStringW', b'LoadResource', b'SizeofResource', b'FindResourceW', b'LoadLibraryExW', b'CompareStringA', b'GetCurrentThread', b'SetThreadPriority', b'SetThreadExecutionState', b'CreateEventW', b'GetSystemDirectoryW', b'SetCurrentDirectoryW', b'GetFullPathNameA', b'SetPriorityClass', b'GetProcessAffinityMask', b'CreateThread', b'InitializeCriticalSection', b'EnterCriticalSection', b'LeaveCriticalSection', b'DeleteCriticalSection', b'SetEvent', b'ResetEvent', b'ReleaseSemaphore', b'CreateSemaphoreW', b'GetSystemTime', b'TzSpecificLocalTimeToSystemTime', b'GetCPInfo', b'IsDBCSLeadByte', b'WideCharToMultiByte', b'CompareStringW', b'GetModuleHandleExW', b'GetCompressedFileSizeW', b'EnumResourceNamesW', b'EnumResourceLanguagesW', b'BeginUpda
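
The flat list above loses the information about which DLL each API call belongs to. A minimal sketch that keeps this relation is shown below. The helper name `imports_by_dll` is ours, and the function is exercised on stand-in objects that only mimic the `.dll`, `.imports` and `.name` attributes used in the cells above.

```python
from types import SimpleNamespace


def imports_by_dll(pe):
    """Map each imported DLL (lowercase text) to the list of API names taken from it."""
    grouped = {}
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode("utf-8", errors="replace").lower()
        names = grouped.setdefault(dll, [])
        for symbol in entry.imports:
            # same validity check as above: skip symbols without a name
            if symbol.name is not None:
                names.append(symbol.name.decode("utf-8", errors="replace"))
    return grouped


# stand-in for a loaded pefile.PE object (hypothetical data)
fake_pe = SimpleNamespace(DIRECTORY_ENTRY_IMPORT=[
    SimpleNamespace(dll=b"KERNEL32.dll", imports=[
        SimpleNamespace(name=b"DeviceIoControl"),
        SimpleNamespace(name=None),
        SimpleNamespace(name=b"GetFileType"),
    ]),
])
print(imports_by_dll(fake_pe))  # {'kernel32.dll': ['DeviceIoControl', 'GetFileType']}
```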

Print all PE header information:

In [6]:
print(pe.dump_info())

----------DOS_HEADER----------

[IMAGE_DOS_HEADER]
0x0        0x0   e_magic:                       0x5A4D    
0x2        0x2   e_cblp:                        0x90      
0x4        0x4   e_cp:                          0x3       
0x6        0x6   e_crlc:                        0x0       
0x8        0x8   e_cparhdr:                     0x4       
0xA        0xA   e_minalloc:                    0x0       
0xC        0xC   e_maxalloc:                    0xFFFF    
0xE        0xE   e_ss:                          0x0       
0x10       0x10  e_sp:                          0xB8      
0x12       0x12  e_csum:                        0x0       
0x14       0x14  e_ip:                          0x0       
0x16       0x16  e_cs:                          0x0       
0x18       0x18  e_lfarlc:                      0x40      
0x1A       0x1A  e_ovno:                        0x0       
0x1C       0x1C  e_res:                         
0x24       0x24  e_oemid:                       0x0       
0x26       0x26

A simpler way to get the file header information is the dict returned by *pe.dump_dict()*, which contains the same information as the string printed in the previous example. This step helps us build an attribute vector from which we can later extract features for our machine learning models.

In [7]:
# pe.dump_dict() returns a dict with the same information printed in the previous example
for k, v in pe.dump_dict()['FILE_HEADER'].items():
    # check if it is a dict to get its value
    if isinstance(v, dict):
        print("{}: {}".format(k, v["Value"]))

Characteristics: 34
NumberOfSections: 8
TimeDateStamp: 0x5C72EA4B [Sun Feb 24 19:02:35 2019 UTC]
PointerToSymbolTable: 0
Machine: 34404
NumberOfSymbols: 0
SizeOfOptionalHeader: 240

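Note that most values above are integers, while `TimeDateStamp` is a string that mixes the hexadecimal value with a readable date. For an attribute vector it is usually more convenient to have numbers only. The sketch below shows one way to do this; the helper name `header_attributes` is ours, and the sample dict is hand-written to mimic the structure suggested by the loop above (the `Structure`, `FileOffset` and `Offset` keys are assumptions), so it runs without a PE file.

```python
def header_attributes(header_dict):
    """Flatten one header section of the dump dict into {name: value}."""
    attributes = {}
    for key, entry in header_dict.items():
        # non-dict entries carry no "Value" and are skipped, as in the cell above
        if not isinstance(entry, dict):
            continue
        value = entry["Value"]
        if isinstance(value, str):
            # e.g. "0x5C72EA4B [Sun Feb 24 ...]": keep the leading number if there is one
            try:
                value = int(value.split()[0], 0)
            except (ValueError, IndexError):
                pass  # not numeric: keep the original string
        attributes[key] = value
    return attributes


# hand-written sample mimicking the FILE_HEADER entry (structure is an assumption)
sample = {
    "Structure": "IMAGE_FILE_HEADER",
    "Machine": {"FileOffset": 276, "Offset": 0, "Value": 34404},
    "NumberOfSections": {"FileOffset": 278, "Offset": 2, "Value": 8},
    "TimeDateStamp": {"FileOffset": 280, "Offset": 4,
                      "Value": "0x5C72EA4B [Sun Feb 24 19:02:35 2019 UTC]"},
}
print(header_attributes(sample))
# {'Machine': 34404, 'NumberOfSections': 8, 'TimeDateStamp': 1551034955}
```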

#### Exercise:
Based on the examples presented, write a Python function that extracts attributes from a single PE file and returns them as a list. After that, extract the attributes from all files located at './datasets/samples/pe/' and build an attribute matrix.

In [8]:
from glob import glob
# data path
path = "./datasets/samples/pe/*.exe"
# get list of files
files = glob(path)
# function to extract attributes from a single file
def extract_attributes(file):
    # continue here...
    return
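
As a hint for the second half of the exercise only: once each file yields its attributes, the rows must share the same column order to form a matrix. The sketch below shows one possible way to assemble it. It assumes, hypothetically, that the attributes of each file are available as a dict; the helper name `build_matrix`, the sample values, and the choice of `0` for a missing attribute are ours and not part of the exercise statement.

```python
def build_matrix(rows, columns):
    """Turn a list of attribute dicts into a list of lists with a fixed column order."""
    # assumption: a missing attribute is filled with 0
    return [[row.get(column, 0) for column in columns] for row in rows]


# hypothetical attributes for two files
rows = [
    {"NumberOfSections": 8, "SizeOfCode": 1099776},
    {"NumberOfSections": 5},
]
columns = ["NumberOfSections", "SizeOfCode"]
print(build_matrix(rows, columns))  # [[8, 1099776], [5, 0]]
```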