# Two-Tier Architecture - Data Pre-processing & Partitioning

## Data Pre-processing:
This notebook initiates the one-time data pre-processing pipeline of the two-tier architecture. The activities involved are:
* Generating the image-like representation of byte-sequence data, both for the whole PE sample and for the individual PE sections. The two representations are stored in separate pickle files for efficiency.

* Collecting the list of all PE sections available in the given dataset (```available_sections.csv```) and generating their word embeddings using Facebook's 'fasttext' library. The section-name-to-embedding mappings are stored in a lookup file, ```section_embeddings.csv```, which is used during Tier-2 operations.

Module Name: src/prepare_dataset.py
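
As an illustration of the "image-like representation" mentioned above, the following is a minimal sketch and not the project's actual implementation. The function name `bytes_to_image`, the row width, and the zero-padding of the last row are all assumptions made for this example.

```python
import math
from typing import List


def bytes_to_image(data: bytes, width: int = 16) -> List[List[int]]:
    """Reshape a byte sequence into rows of `width` values in [0, 255].

    Hypothetical helper: the last row is zero-padded so that every row
    has the same length, which is an assumption of this sketch.
    """
    rows = math.ceil(len(data) / width)
    padded = data + bytes(rows * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]


# Five bytes laid out as a 2 x 4 grid; the trailing cells are padding.
image = bytes_to_image(b"MZ\x90\x00\x03", width=4)
```

The same helper could be applied to a whole file or to the bytes of a single section, which mirrors the Tier-1 and Tier-2 split described above.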

### Pre-requisites:
1. Raw PE samples are placed in the required (raw_pe) directory.
2. The virtual environment and ipykernel are set up as per the requirements file.
3. The disk has 2x additional storage space for the pre-processed data, where 'x' is the size of the supplied dataset.
4. The supplied samples are parseable by the 'pefile' Python library.
5. Folder paths are set up in ```settings.py```.
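
The storage pre-requisite can be checked before starting a long run. The sketch below uses only the Python standard library; the helper names `dataset_size_bytes` and `has_room_for_preprocessing` are hypothetical and are not part of the project.

```python
import os
import shutil


def dataset_size_bytes(root: str) -> int:
    """Sum the sizes of all files found under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def has_room_for_preprocessing(raw_dir: str, target_dir: str, factor: float = 2.0) -> bool:
    """Return True if `target_dir` has at least `factor` times the raw dataset size free.

    The default factor of 2.0 follows the '2x additional storage' guideline above.
    """
    required = factor * dataset_size_bytes(raw_dir)
    free = shutil.disk_usage(target_dir).free
    return free >= required
```

A call such as `has_room_for_preprocessing("raw_pe", ".")` would compare the raw sample directory against the free space on the current drive; both paths here are placeholders.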

**Input:**
* List of directory paths (under raw_pe) that contain Benign and Malware raw PE samples.
* Boolean flag to indicate whether samples present in each directory are Benign or Malware.

**Outcome:**
* Raw_To_Pickle.csv: Contains the following fields to aid mapping of a raw sample to a pickle file and vice-versa:

|Indexed_File_Name|Benign-0 / Malware-1|Original_File_Name|MD5|SHA1|SHA256|
|-----------------|--------------------|----------------|---|----|------|
|pe_<PICKLE_FILE_INDEX>.pkl|1|xxx.exe|sample_md5|sample_sha1|sample_sha256|
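
To make the mapping concrete, here is a hedged sketch of how one row with the columns above could be assembled using `hashlib` and `csv`. The function `index_row` and the zero-based index are assumptions for illustration; the actual module may build its rows differently.

```python
import csv
import hashlib
import io


def index_row(file_index: int, is_malware: bool, original_name: str, content: bytes) -> list:
    """Build one row in the column order of the table above (hypothetical helper)."""
    return [
        f"pe_{file_index}.pkl",          # Indexed_File_Name
        1 if is_malware else 0,          # Benign-0 / Malware-1
        original_name,                   # Original_File_Name
        hashlib.md5(content).hexdigest(),
        hashlib.sha1(content).hexdigest(),
        hashlib.sha256(content).hexdigest(),
    ]


# Write the row to an in-memory buffer instead of a real file.
row = index_row(0, True, "xxx.exe", b"MZ")
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
```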


* Structure of files generated:
```bash
    Each Indexed_File_Name points to:
                |
                └── PKL_SOURCE_PATH
                        ├── t1
                        │   └── pe_<PICKLE_FILE_INDEX>.pkl
                        └── t2
                            └── pe_<PICKLE_FILE_INDEX>.pkl
```
* Format of Tier-1 pickle file:
```bash
    {
        "whole_bytes"        : < IMAGE REPRESENTATION FOR WHOLE BYTE SEQUENCE >, 
        "benign"             : < IS BENIGN ? >
    }
```
* Format of Tier-2 pickle file:
```bash
    {
        "benign"           : < IS BENIGN ? >,
        "size_byte"        : <>,
        "section_info"     : {
            "SECTION_NAME" : {
                "section_data"      : <>,
                "section_size_byte" : <>,
                "section_bounds"    : {
                    "start_offset" : <>,
                    "end_offset"   : <>
                }
            }
        }
    }
```
```
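
The nested Tier-2 layout can be walked with plain dictionary access once a file is unpickled. The sketch below builds a small stand-in record in memory; the concrete value types (a list of integers for `section_data`, integer offsets) are assumptions, since the format above leaves them unspecified, and `section_summary` is a hypothetical helper.

```python
import pickle

# Stand-in Tier-2 record; value types are assumed for illustration only.
sample = {
    "benign": False,
    "size_byte": 8,
    "section_info": {
        ".text": {
            "section_data": [1, 2, 3, 4],
            "section_size_byte": 4,
            "section_bounds": {"start_offset": 0, "end_offset": 4},
        },
        ".data": {
            "section_data": [5, 6, 7, 8],
            "section_size_byte": 4,
            "section_bounds": {"start_offset": 4, "end_offset": 8},
        },
    },
}


def section_summary(tier2: dict) -> dict:
    """Map each section name to (start_offset, end_offset, section_size_byte)."""
    return {
        name: (
            info["section_bounds"]["start_offset"],
            info["section_bounds"]["end_offset"],
            info["section_size_byte"],
        )
        for name, info in tier2["section_info"].items()
    }


# Round-trip through pickle to mimic reading a stored pe_<index>.pkl file.
restored = pickle.loads(pickle.dumps(sample))
summary = section_summary(restored)
```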

*__Note:__* Any information that would violate confidentiality is not retained in the final set of pickle files.

In [1]:
!python ../prepare_dataset.py

Detected Platform: win32
No fold index passed through CLI. Running all folds
Total Count: 1 Unprocessed/Skipped: 0
Total Count: 2 Unprocessed/Skipped: 0


Traceback (most recent call last):
  File "../prepare_dataset.py", line 130, in <module>
    total_unprocessed, total_processed = raw_pe_to_pkl(dir, cnst.RAW_SAMPLE_DIRS[dir], total_unprocessed, total_processed)
  File "../prepare_dataset.py", line 110, in raw_pe_to_pkl
    pd.DataFrame(list_idx).to_csv(cnst.DATASET_BACKUP_FILE, index=False, header=None, mode='a')
  File "C:\Users\anand\Anaconda3\envs\tf2\lib\site-packages\pandas\core\generic.py", line 3204, in to_csv
    formatter.save()
  File "C:\Users\anand\Anaconda3\envs\tf2\lib\site-packages\pandas\io\formats\csvs.py", line 188, in save
    compression=dict(self.compression_args, method=self.compression),
  File "C:\Users\anand\Anaconda3\envs\tf2\lib\site-packages\pandas\io\common.py", line 428, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, newline="")
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\03_GitWorks\\src\\data\\Raw_To_Pickle.csv'
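
The run above stops with a `FileNotFoundError` while opening `Raw_To_Pickle.csv` in append mode. Append mode creates a missing file but not a missing parent directory, so this error typically means the target folder does not exist yet. A minimal sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper and is not part of `prepare_dataset.py`.

```python
import os
import tempfile


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of `path` if needed and return `path` unchanged."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return path


# Demonstration in a temporary location: the 'data' folder does not exist beforehand.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "Raw_To_Pickle.csv")
    with open(ensure_parent_dir(target), "a", newline="") as handle:
        handle.write("pe_0.pkl,1\n")
    created = os.path.isfile(target)
```

Alternatively, creating the folder configured in ```settings.py``` by hand before running the notebook should have the same effect.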


## Data Partitioning

__Why Partitioning?__

The outcome of the data pre-processing step above is a set of pickle files for Tier-1 and Tier-2. Owing to their collectively huge size, they cannot be kept entirely in memory for processing. Storing them as separate pickle files on disk and reading them individually, ad hoc, at runtime also incurs a large disk-read overhead.

To reduce disk reads and to make full use of the available memory, we partition the pickle files.

```bash
        .
        ├── partitions                             
           └── <Specify_Dir_Name>
               ├── master_partition_tracker.csv
               ├── p<partition_index>.csv
               ├── t1_p<partition_index>.pkl
               └── t2_p<partition_index>.pkl
```

__What do we call a 'Partition'?__

A 'Partition' here is a single large pickle file that groups the data of several individual pickle files. Both Tier-1 and Tier-2 partitions follow the JSON-like key-value format below:

```bash
<t1|t2>_p<partition_index>.pkl
    {
        key=pe_<PICKLE_FILE_INDEX 1>: value={<TIER_1_DATA | TIER_2_DATA>}
        key=pe_<PICKLE_FILE_INDEX 2>: value={<TIER_1_DATA | TIER_2_DATA>}
        . 
        . 
        key=pe_<PICKLE_FILE_INDEX n>: value={<TIER_1_DATA | TIER_2_DATA>}
    }
```
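
The grouping described above can be sketched as a dictionary keyed by the indexed file name without its extension. This is an illustrative sketch only; `build_partition` is a hypothetical helper, and the stand-in records are far smaller than real Tier-1 or Tier-2 data.

```python
import os
import pickle
import tempfile


def build_partition(sample_dir: str, file_names: list, out_path: str) -> dict:
    """Merge per-sample pickle files into one dict and store it as a single pickle."""
    partition = {}
    for file_name in file_names:
        key = os.path.splitext(file_name)[0]  # "pe_12.pkl" -> "pe_12"
        with open(os.path.join(sample_dir, file_name), "rb") as handle:
            partition[key] = pickle.load(handle)
    with open(out_path, "wb") as handle:
        pickle.dump(partition, handle)
    return partition


# Demonstration with two tiny stand-in sample files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    names = ["pe_0.pkl", "pe_1.pkl"]
    for position, name in enumerate(names):
        with open(os.path.join(tmp, name), "wb") as handle:
            pickle.dump({"benign": position == 0}, handle)
    partition = build_partition(tmp, names, os.path.join(tmp, "t1_p0.pkl"))
    partition_keys = sorted(partition)
```

Reading one partition then costs a single disk access, after which every sample in it is available by key from memory.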

The data partitions are generated through a stratified sampling process over the list of DS1 samples, such that each partition contains a fairly equal ratio of benign and malware samples. The size of a partition can be controlled either by the required partition size or by the allowed number of files per partition, using the configuration parameters below in ```settings.py```.

```bash
        PARTITION_BY_COUNT        # Set to True for partitioning by file count. Otherwise, partition by allowed size.
        MAX_PARTITION_SIZE        # Use a value equivalent to 2GB, in bytes
        MAX_FILES_PER_PARTITION   # Ex: 7000
```
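
One way to picture partitioning by file count is sketched below. It is an assumption about how such a split could work and not the project's code: samples are grouped by label, shuffled, and dealt round-robin so that every partition receives a near-identical share of each class while staying within the file limit. The function name `stratified_partitions` and its `seed` parameter are hypothetical, and size-based partitioning is not shown.

```python
import math
import random


def stratified_partitions(samples, max_files_per_partition, seed=0):
    """Split (file_name, label) pairs into partitions with similar label ratios.

    Hypothetical sketch: the limit plays the role of MAX_FILES_PER_PARTITION.
    """
    rng = random.Random(seed)
    by_label = {}
    for name, label in samples:
        by_label.setdefault(label, []).append(name)

    # Concatenate the shuffled classes, then deal items out one by one.
    ordered = []
    for label in sorted(by_label):
        names = by_label[label]
        rng.shuffle(names)
        ordered.extend((name, label) for name in names)

    count = max(1, math.ceil(len(ordered) / max_files_per_partition))
    partitions = [[] for _ in range(count)]
    for position, item in enumerate(ordered):
        partitions[position % count].append(item)
    return partitions


# Six benign (0) and four malware (1) samples, at most five files per partition.
samples = [(f"pe_{i}.pkl", 0) for i in range(6)] + [(f"pe_{i}.pkl", 1) for i in range(6, 10)]
parts = stratified_partitions(samples, max_files_per_partition=5)
```

Dealing the class-sorted list round-robin keeps the per-class counts of any two partitions within one of each other, which is one simple reading of "fairly equal ratio".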

__Guideline:__
Assuming an available memory of 128GB and a batch_size between 64 and 128, we typically use partitions of size 2GB, each of which can hold the pickle data for approximately 7000 samples.


**Input:**
* Accepts the "Raw_To_Pickle.csv" generated at the end of the pre-processing step as input.
* Set the following flags with the provided values in src\config\settings.py

```bash
        REGENERATE_DATA = True        # To stratify input files list
        REGENERATE_PARTITIONS = True
        SKIP_CROSS_VALIDATION = True  # To perform only partitioning
```

**Outcomes:**
* master_partition_tracker.csv: Holds the total number of partitions generated.
* A directory called "partitions", located outside the project directory, that contains the actual partitioned data.

In [1]:
!python main.py

Detected Platform: win32

2020-07-03 13:07:25.344868: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-07-03 13:07:25.345543: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-07-03 13:07:32.432618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-03 13:07:33.225297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 computeCapability: 6.1
coreClock: 1.493GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 104.43GiB/s
2020-07-03 13:07:33.230084: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-07-03 13:07:33.234403: W tensorflow/stream_executor/platfor


No fold number passed through CLI. Running all folds



2020-07-03 13:07:33.287876: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2cb3a0477f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-03 13:07:33.288620: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-03 13:07:33.289584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 13:07:33.290290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      
2020-07-03 13:07:33.302476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 computeCapability: 6.1
coreClock: 1.493GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 104.43GiB/s
2020-07-03 13:07:33.310344: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudar