# Summary

Because the datasets are entirely text based, deriving new features from them is essential.

This script tries to extract the following features from the existing data:
- Retrieve extra information from the email headers:
  - content_type
  - charset
  - content_transfer_encoding
- Check whether the email body contains the following:
  - html
  - javascript
  - css
  - html_form
  - html_iframe
- Count how many URLs are found in the email body
- Calculate the length of the 'Subject' used in the email

## Import libraries

In [35]:
import pandas as pd
from datetime import datetime
from dateutil import parser
import re
import email

## Import Datasets to Pandas

In [36]:
fraudDataframe = pd.read_csv('datasets/clean/fraud-emails.csv')
phishingDataframe = pd.read_csv('datasets/clean/phishing-emails.csv')
enronDataframe = pd.read_csv('datasets/clean/enron-emails.csv')

In [37]:
malicious_df = pd.concat([fraudDataframe, phishingDataframe], ignore_index=True)
enron_df = enronDataframe

Uncomment the code below and run it if the index was saved to the CSV

In [38]:
# malicious_df = malicious_df.drop(columns='Unnamed: 0')
# enron_df = enron_df.drop(columns=['Unnamed: 0','Unnamed: 0.1'])
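
When it is unclear which index columns were written to the CSV, one option is to drop every column whose name starts with `Unnamed`. The following is a minimal sketch on a made-up frame; the helper name `drop_unnamed_columns` and the sample data are illustrative and not part of this notebook.

```python
import pandas as pd

def drop_unnamed_columns(df):
    """Return a copy of df without the 'Unnamed: N' columns left by a saved index."""
    keep = ~df.columns.str.startswith('Unnamed')
    return df.loc[:, keep].copy()

# Made-up frame that mimics a CSV saved twice with its index
sample_df = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'Unnamed: 0.1': [0, 1],
    'raw_mail': ['first message', 'second message'],
})

cleaned_df = drop_unnamed_columns(sample_df)
print(list(cleaned_df.columns))
```

This avoids a `KeyError` when one of the hard-coded column names is absent.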

Extract extra information that can be used as features

In [39]:
# Row-wise apply is slow and will take too long on larger datasets

def getExtraInfo(row):
    # Parse the raw message once and copy three header fields onto the row
    try:
        message = email.message_from_string(row.raw_mail)
        row['content_type'] = message.get_content_type()
        row['charset'] = message.get_content_charset()
        row['content_transfer_encoding'] = message['Content-Transfer-Encoding']
        return row
    except Exception:
        # Leave the row unchanged if the message cannot be parsed
        return row

In [None]:
malicious_df = malicious_df.apply(getExtraInfo, axis=1)

Using the method above on enron_df takes much longer (30+ minutes) than the column-wise approach in the code below

In [40]:
enron_df['content_type'] = enron_df.raw_mail.apply(lambda raw_mail: email.message_from_string(raw_mail).get_content_type())

In [41]:
enron_df['charset'] = enron_df.raw_mail.apply(lambda raw_mail: email.message_from_string(raw_mail).get_content_charset())

In [42]:
enron_df['charset'].unique()

array(['us-ascii', 'ansi_x3.4-1968', None], dtype=object)

In [43]:
malicious_df.loc[malicious_df.charset == '', 'charset'] = None
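
Selecting the rows and the column in a single `.loc` call writes to the DataFrame itself, which is what the pandas indexing guide recommends over chained indexing. A small sketch on made-up data (the frame `sample_df` is illustrative):

```python
import pandas as pd

# Made-up frame with an empty-string charset that should become missing
sample_df = pd.DataFrame({'charset': ['', 'utf-8', None]})

# One .loc call: the row mask and the column label together
sample_df.loc[sample_df['charset'] == '', 'charset'] = None

print(sample_df['charset'].isna().tolist())
```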


In [44]:
enron_df['content_transfer_encoding'] = enron_df.raw_mail.apply(lambda raw_mail: email.message_from_string(raw_mail)['Content-Transfer-Encoding'])

Clean up some inconsistencies in the 'content_transfer_encoding' column of malicious_df

In [45]:
malicious_df.content_transfer_encoding.unique()

array(['8bit', '7bit', None, 'binary', 'quoted-printable', '7BIT',
       'base64', '8BIT', 'QUOTED-PRINTABLE', '7Bit', 'Quoted-Printable',
       '7Bit ', 'BASE64', '7bit ', '8bit\\r\\n',
       '7Bit\n\tboundary="--VHOABG67774"'], dtype=object)

In [46]:
# Map the variant spellings onto the four canonical lowercase values
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['7BIT', '7Bit ', '7bit ', '7Bit\n\tboundary="--VHOABG67774"', '7Bit']), 'content_transfer_encoding'] = '7bit'
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['8bit\\r\\n', '8BIT']), 'content_transfer_encoding'] = '8bit'
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['QUOTED-PRINTABLE', 'Quoted-Printable']), 'content_transfer_encoding'] = 'quoted-printable'
malicious_df.loc[malicious_df.content_transfer_encoding.isin(['BASE64']), 'content_transfer_encoding'] = 'base64'
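
Listing every variant by hand only covers the spellings seen so far. As a sketch of a more general rule, assuming the encoding name is always the first whitespace-delimited token, the values can be normalised with a small function. The name `normalize_cte` is illustrative, and the sample values are taken from the `unique()` output above.

```python
import pandas as pd

def normalize_cte(value):
    """Lowercase the first token of a Content-Transfer-Encoding value."""
    if pd.isna(value):
        return None
    # Treat a literal backslash-r-backslash-n sequence as whitespace
    tokens = value.replace('\\r\\n', ' ').split()
    return tokens[0].lower() if tokens else None

raw_values = [
    '7Bit ',
    '8bit\\r\\n',
    '7Bit\n\tboundary="--VHOABG67774"',
    'QUOTED-PRINTABLE',
    'BASE64',
    None,
]
normalized = [normalize_cte(value) for value in raw_values]
print(normalized)
```

On a DataFrame this could be applied with `malicious_df['content_transfer_encoding'].map(normalize_cte)`.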

Get the sender's email domain and the recipient's email domain

In [47]:
malicious_df['from_domain'] = malicious_df.parsed_from.str.split('@', expand=True)[1]
# Assumes the recipient addresses are stored in a column named parsed_to
malicious_df['to_domain'] = malicious_df.parsed_to.str.split('@', expand=True)[1]

In [48]:
enron_df['from_domain'] = enron_df.parsed_from.str.split('@', expand=True)[1]
# Assumes the recipient addresses are stored in a column named parsed_to
enron_df['to_domain'] = enron_df.parsed_to.str.split('@', expand=True)[1]
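
Splitting on `@` works when the column holds bare addresses. If some values carry a display name, such as `Alice <alice@example.com>`, a sketch based on the standard library's `email.utils.parseaddr` may be more robust. The helper name `get_domain` and the sample addresses are illustrative; the actual format of the parsed address columns is not shown in this notebook.

```python
from email.utils import parseaddr

def get_domain(address):
    """Return the lowercase domain of an email address, or None if there is none."""
    if not isinstance(address, str):
        return None
    # parseaddr separates an optional display name from the address itself
    _, addr = parseaddr(address)
    if '@' not in addr:
        return None
    return addr.rsplit('@', 1)[1].lower()

samples = ['Alice <alice@Example.com>', 'bob@enron.com', 'no-at-sign', None]
domains = [get_domain(sample) for sample in samples]
print(domains)
```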

Clean up some inconsistencies in the content_type and content_transfer_encoding columns

In [49]:
malicious_df.loc[
    (malicious_df.content_type == 'text/htmlcontent-transfer-encoding:8bitrn') |
    (malicious_df.content_type == 'text/html content-transfer-encoding: 8bit\\r\\n'),
    'content_transfer_encoding'] = '8bit'


In [50]:
malicious_df.loc[
    (malicious_df.content_type == 'text/htmlcontent-transfer-encoding:8bitrn') |
    (malicious_df.content_type == 'text/html content-transfer-encoding: 8bit\\r\\n'),
    'content_type'] = "text/html"


Add numeric, Boolean, and other features

In [51]:
# Flag emails whose content type is HTML
malicious_df['html'] = malicious_df.content_type.str.contains('text/html', case=False, regex=True)
# Flag emails that contain JavaScript (a <script> tag or a .js reference)
malicious_df['javascript'] = malicious_df.raw_mail.str.contains(r'<script|\.js', case=False, regex=True)
# Flag emails that contain CSS (a <style> tag or a .css reference)
malicious_df['css'] = malicious_df.raw_mail.str.contains(r'<style|\.css', case=False, regex=True)
# Flag emails that contain an HTML form
malicious_df['html_form'] = malicious_df.raw_mail.str.contains('<form', case=False, regex=True)
# Flag emails that contain an HTML iframe
malicious_df['html_iframe'] = malicious_df.raw_mail.str.contains('<iframe', case=False, regex=True)


In [52]:
# Flag emails whose content type is HTML
enron_df['html'] = enron_df.content_type.str.contains('text/html', case=False, regex=True)
# Flag emails that contain JavaScript (a <script> tag or a .js reference)
enron_df['javascript'] = enron_df.raw_mail.str.contains(r'<script|\.js', case=False, regex=True)
# Flag emails that contain CSS (a <style> tag or a .css reference)
enron_df['css'] = enron_df.raw_mail.str.contains(r'<style|\.css', case=False, regex=True)
# Flag emails that contain an HTML form
enron_df['html_form'] = enron_df.raw_mail.str.contains('<form', case=False, regex=True)
# Flag emails that contain an HTML iframe
enron_df['html_iframe'] = enron_df.raw_mail.str.contains('<iframe', case=False, regex=True)

Count how many URLs are in the message body

In [53]:
def getURLs(text):
    count = len(re.findall(r'(https?://\S+)', text))
    return count

In [54]:
malicious_df['URLs_in_message'] = malicious_df.body.apply(getURLs)

In [55]:
enron_df['URLs_in_message'] = enron_df.body.apply(getURLs)
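
`re.findall` raises a `TypeError` when it receives a missing value instead of a string. If some bodies are empty or missing, a guarded variant along these lines may help. This is a sketch; the name `count_urls` and the sample texts are illustrative.

```python
import re

# Same URL pattern as getURLs, compiled once for reuse
URL_PATTERN = re.compile(r'https?://\S+')

def count_urls(text):
    """Count http(s) URLs in text, treating non-string input as having none."""
    if not isinstance(text, str):
        return 0
    return len(URL_PATTERN.findall(text))

samples = [
    'Visit http://a.example and https://b.example/login now',
    'No links here',
    float('nan'),
]
counts = [count_urls(sample) for sample in samples]
print(counts)  # [2, 0, 0]
```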

Check if the email contains an attachment (the body mentions a file name with a common attachment extension)

In [56]:
# Non-capturing group of common attachment file extensions
pattern = r'\.(?:doc|exe|msi|pdf|docx|docm|ppt|pps|ppa|ppam|xls|xlsx|zip|rar|tar|gzip)'

malicious_df['attachement'] = malicious_df.body.str.contains(pattern, case=False)
enron_df['attachement'] = enron_df.body.str.contains(pattern, case=False)
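
The extension pattern flags any body that mentions such a file name, whether or not a file is attached. If the raw messages still carry their MIME parts, which is an assumption about this dataset, the standard `email` package can check for real attachments. The helper `has_attachment` and both sample messages are illustrative.

```python
import email

def has_attachment(raw_mail):
    """Return True if any MIME part is marked as an attachment or has a filename."""
    message = email.message_from_string(raw_mail)
    return any(
        part.get_content_disposition() == 'attachment' or bool(part.get_filename())
        for part in message.walk()
    )

# Made-up multipart message with one PDF attachment
with_file = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    '\n'
    'See attached.\n'
    '--XYZ\n'
    'Content-Type: application/pdf\n'
    'Content-Disposition: attachment; filename="invoice.pdf"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'JVBERi0=\n'
    '--XYZ--\n'
)
# Made-up plain message that only mentions a file name
mention_only = 'Subject: report\n\nPlease open report.pdf when you can.'

print(has_attachment(with_file), has_attachment(mention_only))
```

The two checks measure different things, so the regex-based column may still be useful as a separate "mentions a file" feature.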


Save results

In [57]:
enron_df.to_csv('datasets/explored/enron-emails-explored.csv', index=False)
malicious_df.to_csv('datasets/explored/malicious-emails-explored.csv', index=False)