# Summary

Because the datasets are entirely text-based, deriving new features from them is essential.

This script extracts the following features from the existing data:
- Retrieve extra information from the email headers:
  - content_type
  - charset
  - content_transfer_encoding
- Check whether the email body contains any of the following:
  - html
  - javascript
  - css
  - html_form
  - html_iframe
- Count how many URLs are found in the email body
- Calculate the length of the 'Subject' used in the email
- Calculate the entropy of the email subject and body
- Count how many attachments are in the email
- Encode several categories as numeric values
  - content_type
  - content_transfer_encoding
  - charset

## Import libraries

Here we import the libraries needed

In [36]:
import pandas as pd
from datetime import datetime
from dateutil import parser
import re
import email

## Import Datasets to Pandas

Load the three datasets and store each one in its own variable.

In [37]:
fraudDataframe = pd.read_csv('datasets/clean/fraud-emails.csv')
phishingDataframe = pd.read_csv('datasets/clean/phishing-emails.csv')
enronDataframe = pd.read_csv('datasets/clean/enron-emails.csv')

Concatenate the two malicious datasets and store the result in one variable.

In [38]:
malicious_df = pd.concat([fraudDataframe, phishingDataframe], ignore_index=True)
enron_df = enronDataframe

## Extract Features from the datasets

In [39]:
# Row-wise DataFrame.apply is inefficient and will take too long for larger datasets.

def getExtraInfo(row):
    """Add content_type, charset and content_transfer_encoding parsed from row.raw_mail."""
    try:
        message = email.message_from_string(row.raw_mail)
        row['content_type'] = message.get_content_type()
        row['charset'] = message.get_content_charset()
        row['content_transfer_encoding'] = message['Content-Transfer-Encoding']
        return row
    except Exception:
        # Leave the row unchanged if raw_mail is missing or cannot be parsed.
        return row

The function above extracts the content_type, charset and content_transfer_encoding features.
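As a minimal sketch of what this kind of header extraction returns for a single message, the snippet below parses a made-up raw email with the standard-library `email` module. The `sample_raw` string is an assumption for illustration and is not taken from the datasets.

```python
import email

# Hypothetical raw message, used only for illustration.
sample_raw = (
    "From: alice@example.com\n"
    "Subject: Hello\n"
    'Content-Type: text/html; charset="ISO-8859-1"\n'
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "<html><body>Hi</body></html>\n"
)

message = email.message_from_string(sample_raw)

# MIME type from the Content-Type header, returned in lowercase.
content_type = message.get_content_type()
# charset parameter of the Content-Type header, returned in lowercase.
charset = message.get_content_charset()
# Header lookup returns the value as written in the message.
transfer_encoding = message["Content-Transfer-Encoding"]

print(content_type, charset, transfer_encoding)
```

Because the `Content-Transfer-Encoding` value comes back exactly as written, spelling and case variants of the same encoding can show up as separate categories.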

In [40]:
malicious_df = malicious_df.apply(getExtraInfo, axis=1)

Here we apply the function "getExtraInfo" to the malicious dataframe.

The three cells below show the unique categories extracted from the malicious dataframe.

In [42]:
malicious_df.content_type.unique()

array(['text/plain', 'multipart/mixed', 'multipart/alternative',
       'text/html', 'multipart/related',
       'text/html content-transfer-encoding: 8bit\\r\\n',
       'text/htmlcontent-transfer-encoding:8bitrn'], dtype=object)

In [41]:
malicious_df.charset.unique()

array(['us-ascii', 'iso-8859-1', 'windows-1252', 'ansi', None,
       'windows-1256', 'iso-8859-2', 'windows-1254', 'windows-1250',
       'gb2312', 'utf-8', 'x-user-defined', 'iso-8859-15', 'koi8-r',
       'windows-1251', 'windows-1253', 'unknown-8bit', 'windows-1257',
       'windows-125', 'tis-620', 'iso-2022-jp', '', 'iso-8859-9',
       'charset="iso-8859-1', 'euc-kr', 'ks_c_5601-1987', 'utf-7',
       'koi8-u'], dtype=object)

In [43]:
malicious_df.content_transfer_encoding.unique()

array(['8bit', '7bit', None, 'binary', 'quoted-printable', '7BIT',
       'base64', '8BIT', 'QUOTED-PRINTABLE', '7Bit', 'Quoted-Printable',
       '7Bit ', 'BASE64', '7bit ', '8bit\\r\\n',
       '7Bit\n\tboundary="--VHOABG67774"'], dtype=object)

As you can see, many of the categories are copies of each other and need to be cleaned before they can be used.

Applying the method above to enron_df would take much longer (30+ minutes). For that reason, feature extraction for enron_df is done by applying a function to a single Series, which runs much faster than the row-wise approach used for the malicious dataframe.

Below we apply an anonymous function to a Series and save the result back to the original dataframe. This method performs better and runs faster.

In [52]:
enron_df['content_type'] = enron_df.raw_mail.apply(lambda raw_mail: email.message_from_string(raw_mail).get_content_type())

In [55]:
enron_df['content_type'].unique()

array(['text/plain'], dtype=object)

The cell above extracts the content_type of each email; no cleaning is needed for this column.

In [57]:
enron_df['charset'] = enron_df.raw_mail.apply(lambda raw_mail: email.message_from_string(raw_mail).get_content_charset())

In [63]:
enron_df['charset'].unique()

array(['us-ascii', 'ansi_x3.4-1968', None], dtype=object)

The cell above extracts the charset of each email.

In [53]:
# Name the column so that only 'charset' is overwritten, not the whole row.
malicious_df.loc[malicious_df.charset == '', 'charset'] = None

The cell above replaces any empty-string charset with a None value.
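When assigning through `.loc`, a row mask on its own targets every column of the matching rows, while a row mask plus a column label targets a single cell per row. The sketch below shows the second form on a small made-up frame (`df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame standing in for malicious_df.
df = pd.DataFrame({
    "charset": ["utf-8", "", "us-ascii"],
    "subject": ["a", "b", "c"],
})

# Row mask plus column label: only the 'charset' cell of matching rows changes.
df.loc[df["charset"] == "", "charset"] = None

print(df)
```

The `subject` value of the affected row is left intact, which matters because the other columns are still needed for later features.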

In [49]:
enron_df['content_transfer_encoding'] = enron_df.raw_mail.apply(
    lambda raw_mail: email.message_from_string(raw_mail)['Content-Transfer-Encoding'])

The cell above extracts the content_transfer_encoding used by each email.

In [61]:
enron_df['content_transfer_encoding'].unique()

array(['7bit', 'quoted-printable', None, 'base64'], dtype=object)
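The three enron_df extractions each parse every raw message again. One possible variation, sketched below under the assumption that a helper like `extract_headers` (a hypothetical name) is acceptable, parses each message once and returns all three values together. The two-row `demo_df` is made up for illustration.

```python
import email
import pandas as pd

def extract_headers(raw_mail):
    """Parse one raw message and return (content_type, charset, encoding)."""
    message = email.message_from_string(raw_mail)
    return (
        message.get_content_type(),
        message.get_content_charset(),
        message["Content-Transfer-Encoding"],
    )

# Hypothetical two-row frame standing in for enron_df.
demo_df = pd.DataFrame({"raw_mail": [
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: 7bit\n\nhello\n",
    "Subject: no content headers\n\nhello\n",
]})

# One pass over the Series; each element becomes a 3-tuple.
parsed = demo_df["raw_mail"].apply(extract_headers)

header_cols = ["content_type", "charset", "content_transfer_encoding"]
headers = pd.DataFrame(parsed.tolist(), index=demo_df.index, columns=header_cols)
demo_df = demo_df.join(headers)

print(demo_df[header_cols])
```

A message without a `Content-Type` header still reports `text/plain`, because that is the default the `email` module falls back to, while the charset and transfer encoding come back as missing.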

## Cleaning duplicates

As seen earlier, the categories in the malicious dataframe contain many duplicates. To clean them, we manually replace each duplicate with its canonical value.

In [64]:
# Use a single .loc call (row mask, column label) instead of chained indexing,
# so the assignment is applied to malicious_df itself.
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['7BIT', '7Bit ', '7bit ', '7Bit\n\tboundary="--VHOABG67774"', '7Bit']), 'content_transfer_encoding'] = '7bit'
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['8bit\\r\\n', '8BIT']), 'content_transfer_encoding'] = '8bit'
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['QUOTED-PRINTABLE', 'Quoted-Printable']), 'content_transfer_encoding'] = 'quoted-printable'
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['BASE64']), 'content_transfer_encoding'] = 'base64'
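Listing every variant by hand works for the values seen so far. As an alternative sketch, not the approach used in this notebook, the variants could be normalized by rule: lowercase the value and keep only its first whitespace-separated token. The helper name `normalize_encoding` is hypothetical, and the sample values are copied from the unique values shown earlier.

```python
import pandas as pd

def normalize_encoding(value):
    """Lowercase a Content-Transfer-Encoding value and keep its first token."""
    if not isinstance(value, str):
        return value  # leave None / NaN untouched
    # Some values contain the literal characters backslash-r and backslash-n.
    cleaned = value.replace("\\r", " ").replace("\\n", " ")
    tokens = cleaned.split()  # also splits on real newlines and tabs
    return tokens[0].lower() if tokens else None

raw_values = pd.Series([
    '7BIT', '7Bit ', '8bit\\r\\n', 'Quoted-Printable',
    '7Bit\n\tboundary="--VHOABG67774"', 'BASE64',
])
normalized = raw_values.map(normalize_encoding)

print(normalized.unique())
```

The rule-based version also covers variants that have not been seen yet, at the cost of being less explicit about exactly which values are rewritten.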

From the datasets we can also extract other information, such as the email address domains used to send and receive each email.

In [15]:
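malicious_df['from_domain'] = malicious_df.parsed_from.str.split('@', expand=True)[1]
# NOTE: to_domain is also derived from parsed_from, so it duplicates from_domain;
# the recipient column was probably intended here.
malicious_df['to_domain'] = malicious_df.parsed_from.str.split('@', expand=True)[1]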

In [16]:
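enron_df['from_domain'] = enron_df.parsed_from.str.split('@', expand=True)[1]
# NOTE: to_domain is also derived from parsed_from, so it duplicates from_domain;
# the recipient column was probably intended here.
enron_df['to_domain'] = enron_df.parsed_from.str.split('@', expand=True)[1]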

The two cells above extract the email domain by splitting on "@" as the delimiter and taking the second value of the result.
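To illustrate how the split behaves, here is a small sketch on made-up addresses (the `addresses` Series is an assumption, not data from the datasets):

```python
import pandas as pd

# Hypothetical addresses; the last one has no "@" at all.
addresses = pd.Series(["alice@example.com", "bob@mail.example.org", "no-at-sign"])

# expand=True returns a DataFrame with one column per piece;
# column 0 is the local part and column 1 is the domain.
parts = addresses.str.split("@", expand=True)
domains = parts[1]

print(domains)
```

An address without "@" yields a missing value in the domain column, so downstream code should expect missing values there.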

In [65]:
malicious_df.loc[
    (malicious_df.content_type == 'text/htmlcontent-transfer-encoding:8bitrn') |
    (malicious_df.content_type == 'text/html content-transfer-encoding: 8bit\\r\\n'),
    'content_transfer_encoding'] = '8bit'

In [66]:
malicious_df.loc[
    (malicious_df.content_type == 'text/htmlcontent-transfer-encoding:8bitrn') |
    (malicious_df.content_type == 'text/html content-transfer-encoding: 8bit\\r\\n'),
    'content_type'] = "text/html"

The two cells above fix errors caused by dirty data that the parser cannot interpret correctly.

## Extracting Numeric and Boolean features

The two code cells below extract the numeric features using regex patterns.

In [19]:
# Count 'text/html' in the content type (1 for HTML emails, otherwise 0)
malicious_df['html'] = malicious_df.content_type.str.count(r'(text/html)')
# Count javascript markers ('<script' or a literal '.js')
malicious_df['javascript'] = malicious_df.raw_mail.str.count(r'(<script|\.js)')
# Count css markers ('<style' or a literal '.css')
malicious_df['css'] = malicious_df.raw_mail.str.count(r'(<style|\.css)')
# Count html form tags
malicious_df['html_form'] = malicious_df.raw_mail.str.count(r'(<form)')
# Count html iframe tags
malicious_df['html_iframe'] = malicious_df.raw_mail.str.count(r'(<iframe)')

In [20]:
# Count 'text/html' in the content type (1 for HTML emails, otherwise 0)
enron_df['html'] = enron_df.content_type.str.count(r'(text/html)')
# Count javascript markers ('<script' or a literal '.js')
enron_df['javascript'] = enron_df.raw_mail.str.count(r'(<script|\.js)')
# Count css markers ('<style' or a literal '.css')
enron_df['css'] = enron_df.raw_mail.str.count(r'(<style|\.css)')
# Count html form tags
enron_df['html_form'] = enron_df.raw_mail.str.count(r'(<form)')
# Count html iframe tags
enron_df['html_iframe'] = enron_df.raw_mail.str.count(r'(<iframe)')
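In these patterns the dot needs to be escaped, because an unescaped `.` in a regex matches any character. The sketch below compares the two forms on a made-up string (`text` is an assumption for illustration), using `re.findall`, which counts matches the same way `Series.str.count` does:

```python
import re

# Hypothetical snippet: one real script tag, one real '.js' file name,
# and two unrelated words that merely end in 'js'.
text = "<script src='app.js'></script> unrelated words: Ajs, xjs"

# '.' matches any character, so 'Ajs' and 'xjs' are counted too.
unescaped = len(re.findall(r"(<script|.js)", text))
# '\.' matches only a literal dot.
escaped = len(re.findall(r"(<script|\.js)", text))

print(unescaped, escaped)
```

The unescaped pattern overcounts here, which would inflate the javascript feature for emails that contain no script at all.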

In [21]:
def getURLs(text):
    count = len(re.findall(r'(https?://\S+)', text))
    return count

The function above counts how many substrings match the given pattern, which finds URLs that begin with http:// or https:// in the email.
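A short sketch of how this pattern behaves on a made-up body follows. The function name `count_urls`, the `sample` text, and the guard for non-string input are assumptions added for illustration; the guard is there because `re.findall` raises a `TypeError` when given a missing value such as NaN.

```python
import re

def count_urls(text):
    """Count substrings that start with http:// or https:// and run to the next whitespace."""
    if not isinstance(text, str):
        return 0  # assumption: treat a missing body as containing no URLs
    return len(re.findall(r"https?://\S+", text))

# Hypothetical email body.
sample = (
    "Verify at http://example.com/login now, or visit https://example.org. "
    "Not counted: www.example.net"
)

url_count = count_urls(sample)
print(url_count)
```

Links without a scheme, such as `www.example.net`, are not counted, and `\S+` keeps any trailing punctuation as part of the match. Neither affects the count in this example, but both are worth knowing when interpreting the feature.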

In [22]:
malicious_df['URLs_in_message'] = malicious_df.body.apply(getURLs)

In [23]:
enron_df['URLs_in_message'] = enron_df.body.apply(getURLs)

In the two cells above, the function created to count the number of URLs is applied to the "body" column.

In [24]:
malicious_df['subject_len'] = malicious_df.subject.apply(lambda x: len(f'{x}'))

In [25]:
enron_df['subject_len'] = enron_df.subject.apply(lambda x: len(f'{x}'))

Calculate the Shannon entropy of the subject and body

In [26]:
from collections import Counter
from math import log

def shannon(string):
    s = f'{string}'
    counts = Counter(s)
    frequencies = ((i / len(s)) for i in counts.values())
    return - sum(f * log(f, 2) for f in frequencies)

In [27]:
malicious_df['subject_entropy'] = malicious_df.subject.apply(shannon)
malicious_df['body_entropy'] = malicious_df.body.apply(shannon)

In [28]:
enron_df['subject_entropy'] = enron_df.subject.apply(shannon)
enron_df['body_entropy'] = enron_df.body.apply(shannon)
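Shannon entropy here is measured per character: H = -Σ p(c) · log2 p(c), where p(c) is the relative frequency of character c in the string. The sketch below is a standalone variant of the same calculation (the name `shannon_entropy` and the explicit empty-string guard are assumptions) applied to three small strings whose entropy is easy to verify by hand:

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    """Return the per-character Shannon entropy of text, in bits."""
    s = str(text)
    if not s:
        return 0.0  # assumption: define the entropy of an empty string as 0
    counts = Counter(s)
    return -sum((n / len(s)) * log2(n / len(s)) for n in counts.values())

one_symbol = shannon_entropy("aaaa")    # a single repeated character
two_symbols = shannon_entropy("abab")   # two equally likely characters
four_symbols = shannon_entropy("abcd")  # four equally likely characters

print(one_symbol, two_symbols, four_symbols)
```

These evaluate to 0, 1 and 2 bits respectively: the more evenly the characters are spread over a larger alphabet, the higher the entropy. Note that converting a missing value with `str` or an f-string yields the text 'nan', so a missing subject or body is scored as that three-character string.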

Count attachments by counting common attachment file extensions that appear in the email body

In [29]:
# Raw string so that '\.' reaches the regex engine as an escaped literal dot
pattern = r'\.(doc|exe|msi|pdf|docx|docm|ppt|pps|ppa|ppam|xls|xlsx|zip|rar|tar|gzip)'

malicious_df['attachement'] = malicious_df.body.str.count(pattern)
enron_df['attachement'] = enron_df.body.str.count(pattern)
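Counting file extensions in the body is a text heuristic. As an alternative sketch, not what this notebook does, attachments could be counted from the MIME structure of the raw message by walking its parts and counting those that carry a filename. The helper `count_attachments` and the `sample_raw` message are hypothetical.

```python
import email

def count_attachments(raw_mail):
    """Count MIME parts of a raw message that carry a filename."""
    message = email.message_from_string(raw_mail)
    return sum(1 for part in message.walk() if part.get_filename())

# Hypothetical multipart message with one text part and one PDF attachment.
sample_raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain\n"
    "\n"
    "See the attached invoice.pdf\n"
    "--BOUND\n"
    "Content-Type: application/pdf\n"
    'Content-Disposition: attachment; filename="invoice.pdf"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "JVBERi0xLjQK\n"
    "--BOUND--\n"
)

attachment_count = count_attachments(sample_raw)
print(attachment_count)
```

This counts any part with a filename, so inline images that carry one would be included as well. It also only works when the raw message still contains its MIME parts.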

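Transform the categorical text data into numeric categories. Missing values get code 0, because `pd.factorize` marks them with -1 before the `+ 1` shift.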

In [30]:
malicious_df['content_transfer_encoding'] = pd.factorize(malicious_df['content_transfer_encoding'])[0] + 1
malicious_df['content_type'] = pd.factorize(malicious_df['content_type'])[0] + 1
malicious_df['charset'] = pd.factorize(malicious_df['charset'])[0] + 1

In [31]:
enron_df['content_transfer_encoding'] = pd.factorize(enron_df['content_transfer_encoding'])[0] + 1
enron_df['content_type'] = pd.factorize(enron_df['content_type'])[0] + 1
enron_df['charset'] = pd.factorize(enron_df['charset'])[0] + 1
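Because each dataframe is factorized separately, the same category can receive different codes in the two frames: a value's code depends on the order in which it first appears. If the two outputs are later combined, one option is to factorize the concatenated column once so both frames share a single code table. The sketch below uses two made-up frames (`malicious_demo` and `enron_demo` are hypothetical) and a new `charset_code` column chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for malicious_df and enron_df.
malicious_demo = pd.DataFrame({"charset": ["us-ascii", "utf-8", None]})
enron_demo = pd.DataFrame({"charset": ["ansi_x3.4-1968", "us-ascii", None]})

# Factorize the concatenated column once so both frames share one code table.
combined = pd.concat(
    [malicious_demo["charset"], enron_demo["charset"]], ignore_index=True
)
codes, uniques = pd.factorize(combined)  # missing values get code -1
codes = codes + 1                        # shift so missing values become 0

# Split the shared codes back into the two frames by position.
n_malicious = len(malicious_demo)
malicious_demo["charset_code"] = codes[:n_malicious]
enron_demo["charset_code"] = codes[n_malicious:]

print(malicious_demo)
print(enron_demo)
```

With the shared table, 'us-ascii' maps to the same code in both frames, whereas factorizing `enron_demo` on its own would assign that code to 'ansi_x3.4-1968'.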

Save the results

In [32]:
enron_df.to_csv('datasets/explored/enron-emails-explored.csv', index=False)
malicious_df.to_csv('datasets/explored/malicious-emails-explored.csv', index=False)