## Computing the hash of a sample

In [4]:
!curl -O https://www.python.org/ftp/python/3.7.2/python-3.7.2-amd64.exe

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.9M  100 24.9M    0     0  4231k      0  0:00:06  0:00:06 --:--:-- 5190k


We import `hashlib`, a standard Python library for hash computation.
We also specify the file we will be hashing; in this case, the file is
`python-3.7.2-amd64.exe`.

In [5]:
import sys
import hashlib

filename = "python-3.7.2-amd64.exe"

Instantiate MD5 and SHA256 objects and specify the size of the chunks we will be reading.

In [6]:
BUF_SIZE = 65536
md5 = hashlib.md5()
sha256 = hashlib.sha256()

We can then read the file in chunks of 64 KB and build the hashes incrementally with the `.update(data)` method. This works because the method computes the hash of the concatenation of everything passed to it so far. In other words, `hash.update(a)` followed by `hash.update(b)` is equivalent to `hash.update(a+b)`.
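
The equivalence can be checked directly. The following is a minimal sketch using two arbitrary byte strings; `a` and `b` are illustrative values, not data taken from the sample.

```python
import hashlib

# Two arbitrary chunks, standing in for consecutive reads from a file.
a = b"first chunk, "
b = b"second chunk"

# Feed the chunks one at a time.
incremental = hashlib.sha256()
incremental.update(a)
incremental.update(b)

# Hash the concatenation in a single call.
one_shot = hashlib.sha256(a + b)

digests_match = incremental.hexdigest() == one_shot.hexdigest()
print(digests_match)  # prints True
```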

In [7]:
with open(filename, "rb") as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha256.update(data)

Finally, print the hashes as hexadecimal digits.

In [8]:
print("MD5: ", md5.hexdigest())
print("SHA256: ", sha256.hexdigest())

MD5:  ff258093f0b3953c886192dec9f52763
SHA256:  0fe2a696f5a3e481fed795ef6896ed99157bcef273ef3c4a96f2905cbdb3aa13


## On YARA Rules

YARA rules can be extremely simple checks on a file,
such as `rule over_100kb {condition: filesize > 100KB}`.


In [18]:
YARA_RULES = """rule is_a_pdf
              {
                  strings: 
                      $pdf_magic = {25 50 44 46}
                  condition:
                      $pdf_magic at 0
              }"""

In [19]:
YARA_RULES

'rule is_a_pdf\n              {\n                  strings: \n                      $pdf_magic = {25 50 44 46}\n                  condition:\n                      $pdf_magic at 0\n              }'

To check the YARA rule, save it to a file and run
`yara rule.yara PythonBrochure`
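
For intuition, the expression `$pdf_magic at 0` asks whether the four bytes `25 50 44 46` (ASCII `%PDF`) appear at offset 0 of the file. A rough plain-Python equivalent, offered only as an illustration and not as a substitute for YARA, might look like this; `starts_with_pdf_magic` and `file_is_pdf` are hypothetical helper names.

```python
# The same four bytes as $pdf_magic in the rule, i.e. b"%PDF".
PDF_MAGIC = bytes.fromhex("25504446")

def starts_with_pdf_magic(data: bytes) -> bool:
    """Mirror the check `$pdf_magic at 0` on an in-memory buffer."""
    return data[:4] == PDF_MAGIC

def file_is_pdf(path: str) -> bool:
    """Apply the same check to the first four bytes of a file on disk."""
    with open(path, "rb") as f:
        return starts_with_pdf_magic(f.read(4))

print(starts_with_pdf_magic(b"%PDF-1.7\n"))    # a PDF-style header
print(starts_with_pdf_magic(b"MZ\x90\x00"))    # a PE-style header
```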

## Examining a PE Header

Portable Executable (PE) files are a common Windows file type. PE files include `.exe`, `.dll`, and `.sys` files.
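
A PE file starts with a DOS header whose first two bytes are `MZ`, and the 4-byte little-endian field at offset `0x3C` (`e_lfanew`) gives the offset of the `PE\0\0` signature. The sketch below checks both on a raw buffer without any third-party module. `looks_like_pe` is a hypothetical helper, and the buffer is synthetic, built only to exercise the check.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Check the DOS 'MZ' magic and the PE signature that e_lfanew points to."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: 4-byte little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Synthetic 0x44-byte buffer for illustration only, not a loadable executable.
fake = bytearray(0x44)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x40)
fake[0x40:0x44] = b"PE\x00\x00"

print(looks_like_pe(bytes(fake)))  # prints True
```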

In [22]:
!pip install pefile
import pefile

Collecting pefile
  Downloading pefile-2019.4.18.tar.gz (62 kB)
Building wheels for collected packages: pefile
  Building wheel for pefile (setup.py) ... done
  Created wheel for pefile: filename=pefile-2019.4.18-py3-none-any.whl size=60822 sha256=2027eee08b832bbe19a45db433784fc394cccf374669c433cdc1cb9416860fbe
  Stored in directory: /root/.cache/pip/wheels/e4/0c/b1/8950a0d751fcd42dfd7943069545b33430408a50e5d8deef0c
Successfully built pefile
Installing collected packages: pefile
Successfully installed pefile-2019.4.18


In [23]:
pe = pefile.PE(filename)

Let's list the imports of the PE file.

In [25]:
for entry in pe.DIRECTORY_ENTRY_IMPORT:
    print(entry.dll)
    for imp in entry.imports:
        print(hex(imp.address), imp.name)

b'ADVAPI32.dll'
0x44b000 b'RegCloseKey'
0x44b004 b'RegOpenKeyExW'
0x44b008 b'OpenProcessToken'
0x44b00c b'AdjustTokenPrivileges'
0x44b010 b'LookupPrivilegeValueW'
0x44b014 b'InitiateSystemShutdownExW'
0x44b018 b'GetUserNameW'
0x44b01c b'RegQueryValueExW'
0x44b020 b'RegDeleteValueW'
0x44b024 b'CloseEventLog'
0x44b028 b'OpenEventLogW'
0x44b02c b'ReportEventW'
0x44b030 b'ConvertStringSecurityDescriptorToSecurityDescriptorW'
0x44b034 b'DecryptFileW'
0x44b038 b'CreateWellKnownSid'
0x44b03c b'InitializeAcl'
0x44b040 b'SetEntriesInAclW'
0x44b044 b'ChangeServiceConfigW'
0x44b048 b'CloseServiceHandle'
0x44b04c b'ControlService'
0x44b050 b'OpenSCManagerW'
0x44b054 b'OpenServiceW'
0x44b058 b'QueryServiceStatus'
0x44b05c b'SetNamedSecurityInfoW'
0x44b060 b'CheckTokenMembership'
0x44b064 b'AllocateAndInitializeSid'
0x44b068 b'SetEntriesInAclA'
0x44b06c b'SetSecurityDescriptorGroup'
0x44b070 b'SetSecurityDescriptorOwner'
0x44b074 b'SetSecurityDescriptorDacl'
0x44b078 b'InitializeSecurityDescriptor'


Listing the sections of the file

In [27]:
for section in pe.sections:
    print(section.Name, hex(section.VirtualAddress), hex(section.Misc_VirtualSize), section.SizeOfRawData)

b'.text\x00\x00\x00' 0x1000 0x49937 301568
b'.rdata\x00\x00' 0x4b000 0x1ed60 126464
b'.data\x00\x00\x00' 0x6a000 0x1730 2560
b'.wixburn' 0x6c000 0x38 512
b'.rsrc\x00\x00\x00' 0x6d000 0x165f4 91648
b'.reloc\x00\x00' 0x84000 0x3dfc 15872


Listing a full dump of parsed information

In [28]:
print(pe.dump_info())

----------DOS_HEADER----------

[IMAGE_DOS_HEADER]
0x0        0x0   e_magic:                       0x5A4D    
0x2        0x2   e_cblp:                        0x90      
0x4        0x4   e_cp:                          0x3       
0x6        0x6   e_crlc:                        0x0       
0x8        0x8   e_cparhdr:                     0x4       
0xA        0xA   e_minalloc:                    0x0       
0xC        0xC   e_maxalloc:                    0xFFFF    
0xE        0xE   e_ss:                          0x0       
0x10       0x10  e_sp:                          0xB8      
0x12       0x12  e_csum:                        0x0       
0x14       0x14  e_ip:                          0x0       
0x16       0x16  e_cs:                          0x0       
0x18       0x18  e_lfarlc:                      0x40      
0x1A       0x1A  e_ovno:                        0x0       
0x1C       0x1C  e_res:                         
0x24       0x24  e_oemid:                       0x0       
0x26       0x26

## Featurizing the PE header

We will extract features from the PE header to be used in building a classifier of malware and benign samples. We will continue using the pefile module.

In [43]:
from os import listdir
from os.path import isfile, join

We define a function to collect the section names of a file and preprocess them for readability and normalization.

In [44]:
def get_section_names(pe):
    list_of_section_names = []
    for sec in pe.sections:
        normalized_name = sec.Name.decode().replace("\x00", "").lower()
        list_of_section_names.append(normalized_name)
    return list_of_section_names
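
Section names are stored as fixed 8-byte fields padded with null bytes, which is why the raw values printed earlier look like `b'.text\x00\x00\x00'`. The sketch below applies the same normalization to a few of those raw values. The `raw_names` list is copied from the section listing above, `normalize_section_name` is a hypothetical helper, and `errors="replace"` is an added assumption so that a name that is not valid UTF-8 would not raise.

```python
# Raw section names as printed by pefile for the sample above.
raw_names = [b".text\x00\x00\x00", b".rdata\x00\x00", b".wixburn"]

def normalize_section_name(raw: bytes) -> str:
    """Decode a raw 8-byte section name, drop null padding, and lowercase it."""
    return raw.decode("utf-8", errors="replace").replace("\x00", "").lower()

normalized = [normalize_section_name(raw) for raw in raw_names]
print(normalized)  # prints ['.text', '.rdata', '.wixburn']
```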

Next, we create a convenience function to preprocess and standardize our imports.

In [45]:
def preprocess_imports(list_of_DLLs):
    return [x.decode().split(".")[0].lower() for x in list_of_DLLs]

We define a function to collect the imports from a file using pefile.

In [47]:
def get_imports(pe):
    list_of_imports = []
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        list_of_imports.append(entry.dll)
    return preprocess_imports(list_of_imports)
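
One caveat worth hedging against: pefile appears to set `DIRECTORY_ENTRY_IMPORT` only when the file has an import directory, so `get_imports` may raise `AttributeError` on a file without imports. A defensive variant could fall back to an empty list. The sketch uses stand-in objects rather than real PE files; `get_imports_safe` and the `SimpleNamespace` stubs are hypothetical.

```python
from types import SimpleNamespace

def get_imports_safe(pe):
    """Return normalized DLL names, or [] if the import directory is absent."""
    entries = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
    return [entry.dll.decode().split(".")[0].lower() for entry in entries]

# Stand-ins that mimic only the attributes used above.
with_imports = SimpleNamespace(
    DIRECTORY_ENTRY_IMPORT=[
        SimpleNamespace(dll=b"ADVAPI32.dll"),
        SimpleNamespace(dll=b"KERNEL32.dll"),
    ]
)
without_imports = SimpleNamespace()

print(get_imports_safe(with_imports))     # prints ['advapi32', 'kernel32']
print(get_imports_safe(without_imports))  # prints []
```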

Finally, we create lists to store our features. Here we populate them from a single file; the same steps apply to each file when iterating over a directory of samples.

In [49]:
import_corpus = []
num_sections = []
section_names = []

pe = pefile.PE(filename)
imports = get_imports(pe)
n_sections = len(pe.sections)
sec_names = get_section_names(pe)

import_corpus.append(imports)
num_sections.append(n_sections)
section_names.append(sec_names)

In [50]:
print(import_corpus)
print(num_sections)
print(section_names)

[['advapi32', 'user32', 'oleaut32', 'gdi32', 'shell32', 'ole32', 'kernel32', 'rpcrt4']]
[6]
[['.text', '.rdata', '.data', '.wixburn', '.rsrc', '.reloc']]
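
To extend the single-file example to a directory of samples, the `listdir`, `isfile`, and `join` imports from earlier can be used to enumerate files. The following is a sketch under stated assumptions: `list_sample_paths` and `featurize_samples` are hypothetical helpers, and `extract` is a caller-supplied function that returns `(imports, n_sections, sec_names)` for one path.

```python
from os import listdir
from os.path import isfile, join

def list_sample_paths(directory):
    """Return the sorted paths of regular files directly inside `directory`."""
    return sorted(
        join(directory, name)
        for name in listdir(directory)
        if isfile(join(directory, name))
    )

def featurize_samples(paths, extract):
    """Collect per-file features using `extract(path)` for each path."""
    import_corpus, num_sections, section_names = [], [], []
    for path in paths:
        imports, n_sections, sec_names = extract(path)
        import_corpus.append(imports)
        num_sections.append(n_sections)
        section_names.append(sec_names)
    return import_corpus, num_sections, section_names
```

With pefile, `extract` could wrap `pefile.PE(path)` together with `get_imports` and `get_section_names`. Files that are not valid PE files would need separate handling.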
