# Assignment goal: clean data with Python regular expressions

In this assignment we use a corpus of fraudulent emails to practice cleaning and processing text.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Load the text
The full corpus is large, so we first practice on an excerpt, **sample_emails.txt**.

In [1]:
import re

In [2]:
# Read the sample text; the with-block closes the file automatically
with open('sample_emails.txt') as file:
    sample_corpus = file.read()
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

### Extract the sender information
Looking at the text, every sender line follows this format:

`From: "sender name" <sender email>`

In [3]:
pattern = r"(From:.+\".+\"\ <.+>)"
match = re.findall(pattern, sample_corpus, flags=re.M|re.I)
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']
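
The pattern above returns each header line as a whole. As a minimal sketch, assuming the same `From: "name" <address>` layout, named groups can capture the name and the address in one pass. The variable `header_text` and the group names `name` and `address` are illustrative and not part of the assignment.

```python
import re

# Illustrative header lines in the same layout as the sample output above.
header_text = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\n'
)

# Named groups pull the quoted name and the bracketed address out of each line.
sender_pattern = re.compile(
    r'^From:\s*"(?P<name>[^"]+)"\s*<(?P<address>[^>]+)>', flags=re.M
)

senders = [m.groupdict() for m in sender_pattern.finditer(header_text)]
for s in senders:
    print(s["name"], "|", s["address"])
```

Lines that do not quote the name would not match this sketch; they would need a second alternative.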

### Extract only the sender names

In [4]:
matchStr = '\n'.join(match)
pattern = r"(\".+\")"
matchName = re.findall(pattern, matchStr, flags=re.M)
for name in matchName:
    print(name)

"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"


### Extract only the sender email addresses

In [5]:
matchStr = '\n'.join(match)
pattern = r"((?<=<).+(?=>))"
matchEmail = re.findall(pattern, matchStr, flags=re.M)
for email in matchEmail:
    print(email)

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com
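
The greedy `.+` works here because each line holds exactly one bracketed address. The sketch below uses a hypothetical `Cc:` line (not taken from the dataset) to show what happens with two addresses on one line, and how a bounded character class avoids the problem.

```python
import re

# Hypothetical line with two bracketed addresses, used only for illustration.
line = "Cc: <first@example.com>, <second@example.org>"

# Greedy: .+ runs to the last ">" on the line.
greedy = re.findall(r"(?<=<).+(?=>)", line)

# Bounded: [^<>]+ cannot cross a bracket, so each address is matched separately.
bounded = re.findall(r"(?<=<)[^<>]+(?=>)", line)

print(greedy)
print(bounded)
```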


### Extract only the organization from each email address
e.g. james_ngola2002@maktoob.com --> maktoob

In [6]:
pattern = r"((?<=@).+(?=\.))"
matchEmail2 = re.findall(pattern, matchStr, flags=re.M)
for email in matchEmail2:
    print(email)

maktoob
spinfinder
epatra
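
With single-dot domains such as `maktoob.com`, the greedy pattern stops at the only dot. The sketch below uses a hypothetical multi-part domain to show that the greedy version then keeps everything up to the last dot. Taking the first label after `@` is one possible alternative; which label counts as the "organization" is an assumption that depends on the data.

```python
import re

# The second address is hypothetical and has a multi-part domain.
addresses = "james_ngola2002@maktoob.com\nsomeone@mail.example.co.uk"

# Greedy: everything between "@" and the last dot on the line.
greedy = re.findall(r"(?<=@).+(?=\.)", addresses, flags=re.M)

# First label only: stop at the first dot (or whitespace) after "@".
first_label = re.findall(r"(?<=@)[^.\s]+", addresses)

print(greedy)
print(first_label)
```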


### Combine the patterns above to return each sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [7]:
pattern = r"((?<=<).+(?=@)|(?<=@).+(?=\.))"
matchEmail3 = re.findall(pattern, matchStr, flags=re.M)
matchEmail3
for email in matchEmail3:
    print(email)

james_ngola2002
maktoob
bensul2004nng
spinfinder
obong_715
epatra
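
The alternation above yields one flat list in which accounts and organizations alternate. As a sketch of another option, two capture groups make `re.findall` return one `(account, organization)` tuple per address. The name `match_str` and the sample lines are illustrative.

```python
import re

# Illustrative sender lines in the same layout as the earlier output.
match_str = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'
)

# Group 1: account between "<" and "@"; group 2: first domain label before ".".
pairs = re.findall(r"<([^@<>]+)@([^.<>]+)\.", match_str)

for account, organization in pairs:
    print([account, organization])
```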


### Process the email data with regular expressions
This part also uses other Python packages (e.g. pandas, email) to organize and handle the data. The focus remains on the regular expressions.

### Read and split the emails
The file is read in as one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]

In [8]:
import re
import pandas as pd
import email

### Read the full text: fradulent_emails.txt
with open('fradulent_emails.txt', encoding='latin-1') as file:
    fradulent = file.read()

### Split the text into individual emails and store them in a list
### Note: look closely at the sample data to see what separates one email from the next

pattern = r"(?<=.)From r"
emails = re.split(pattern, fradulent, flags=re.S)
emails

# len(emails) # check how many emails there are

['From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCID
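
Because `re.split` discards the text it matches, the pattern above removes the `From r` marker from every email except the first. The sketch below splits on the newline before each marker instead, so every message keeps its first line intact. It assumes that each message starts on a new line with `From r `; the string `raw` is a made-up miniature of the file.

```python
import re

# Made-up miniature of the mailbox file with two messages.
raw = (
    "From r  Wed Oct 30 21:41:56 2002\nSubject: first\n\nbody one\n"
    "From r  Thu Oct 31 08:11:39 2002\nSubject: second\n\nbody two\n"
)

# The lookahead (?=...) checks for the marker without consuming it,
# so only the newline in front of it is removed by the split.
messages = re.split(r"\n(?=From r )", raw)

print(len(messages))
for m in messages:
    print(repr(m.splitlines()[0]))
```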

### Extract the names and addresses of all senders and recipients from the text

In [9]:
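emails_list = []    # empty list that will hold every email's information
mail = emails[2]
# print(mail)
# for mail in emails[:20]:    # only the first 20 emails (faster to process)
emails_dict = dict()    # empty dict for the extracted fields

### Sender name and address
# Step 1: get the sender line (hint: From:)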
pattern = r"From: .+"
sender = re.findall(pattern, mail, flags=re.M)
print(sender)

# Step 2: extract the name and address (hint: sometimes there is no match)
if sender:
    i = 0
    name = []
    address = []
    while (i < len(sender)) and (len(name) == 0 or len(address) == 0):
        if len(name) == 0:
            pattern = r"((?<=: )\w+ \w* *\w+$)"
            name = re.findall(pattern, sender[i])
        print('name: ', name)
        if len(address)==0:
            pattern = r"(\b\w+@.+\b)"
            address = re.findall(pattern, sender[i])
        print(address)
        i = i + 1
print(name, address)
# Step 3: store the name and address in the dict
emails_dict['name'] = name
emails_dict['address'] = address


['From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']
name:  []
['obong_715@epatra.com']
[] ['obong_715@epatra.com']
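
The empty `name` list follows from the pattern: `\w+ \w* *\w+$` expects an unquoted name that starts right after `: ` and runs to the end of the line, while this line has a quoted name followed by a bracketed address. As a cross-check on a regex-based answer, the standard library's `email.utils.parseaddr` splits a header value into a `(name, address)` pair. This is offered only as a sanity check, since the assignment itself is about regular expressions.

```python
from email.utils import parseaddr

header_line = 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>'

# Drop the "From:" field name, then let parseaddr split the remaining value.
value = header_line.split(":", 1)[1].strip()
name, address = parseaddr(value)

print(name)
print(address)
```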


In [10]:
emails_list = []    # list that collects one dict per email
for mail in emails[:20]:    # only the first 20 emails (faster to process)
    emails_dict = dict()    # fields extracted from this email

    ### Sender name and address
    # Step 1: collect the sender lines (hint: From:)
    pattern = r"From: .+"
    sender = re.findall(pattern, mail, flags=re.M)

    # Step 2: extract the name and address (hint: sometimes there is no match)
    # Reset both results for every email so that a failed match cannot
    # reuse values left over from the previous iteration.
    senderName = []
    senderAddress = []
    i = 0
    # Try each "From:" line until both a name and an address have been found.
    while (i < len(sender)) and (len(senderName) == 0 or len(senderAddress) == 0):
        if len(senderName) == 0:
            # Either everything before " <", or a bare name at the end of the line
            pattern = r"(?<=: ).+(?= <)|(?<=: )\w+ \w* *\w+$"
            senderName = re.findall(pattern, sender[i])

        if len(senderAddress) == 0:
            pattern = r"(\b\w+@.+\b)"
            senderAddress = re.findall(pattern, sender[i])
        i = i + 1

    # Step 3: store the name and address in the dict (None when nothing matched)
    emails_dict['senderName'] = senderName[0] if senderName else None
    emails_dict['senderAddress'] = senderAddress[0] if senderAddress else None
    
    ### Recipient address
    # Step 1: get the recipient line (hint: To:)
    pattern = r"^To: .+"
    receiver = re.findall(pattern, mail, flags=re.M)

    # Step 2: extract the address (hint: sometimes there is no match)
    receiverAddress = None    # reset so a previous email's value is never reused
    if receiver:
        pattern = r"(?<=: ).+@.+"
        receiverAddress = re.findall(pattern, receiver[0], flags=re.M)

    # Step 3: store the address list in the dict (None when there is no "To:" line)
    # Only the address is stored; the recipient name is not extracted.
    emails_dict['receiverAddress'] = receiverAddress
        
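    ### Message date
    # Step 1: get the timestamp on the first line, which follows two consecutive spaces
    pattern = r"(?<=  ).+:\d+:\d+.+"
    time = re.findall(pattern, mail, flags=re.M)

    # Step 2: extract the date parts (only DD, MMM and YYYY are needed)
    # Caution: in "Wed Oct 30 21:41:56 2002" the weekday has no leading space,
    # so `day` receives the month abbreviation ("Oct") and `month` receives
    # the day number ("30"), which is how the columns appear in the output.
    pattern = r"(?<= )\w{3}(?= )"
    day = re.findall(pattern, time[0], flags=re.I)

    pattern = r"(?<= )\d+(?= )"
    month = re.findall(pattern, time[0])

    pattern = r"(?<= )\d{4}"
    year = re.findall(pattern, time[0])

    # Step 3: store the date parts in the dict
    emails_dict['day'] = day[0]
    emails_dict['month'] = month[0]
    emails_dict['year'] = year[0]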
        
    ### Subject
    # Step 1: get the subject line (hint: Subject:)
    pattern = r"^Subject: .+"
    subject = re.findall(pattern, mail, flags=re.M)

    # Step 2: strip the leading "Subject: " prefix (9 characters)
    subject = subject[0][9:]

    # Step 3: store the subject in the dict
    emails_dict['subject'] = subject
    
    ### Message body
    # Here the standard-library email package extracts the body of the email
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None

    ### Append this email's dict to the list
    emails_list.append(emails_dict)
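
In the date step of the cell above, each field is picked out by its surrounding spaces, so the three-letter pattern skips the weekday and returns the month abbreviation. A sketch of an alternative, assuming the timestamp always looks like `Wed Oct 30 21:41:56 2002`, names every field in a single pattern so that none can be confused with another. The group names and the variable `timestamp` are illustrative.

```python
import re

timestamp = "Wed Oct 30 21:41:56 2002"

# One pattern with a named group per field; \s+ tolerates padded day numbers.
date_pattern = re.compile(
    r"(?P<weekday>[A-Za-z]{3})\s+(?P<month>[A-Za-z]{3})\s+(?P<day>\d{1,2})\s+"
    r"\d{2}:\d{2}:\d{2}\s+(?P<year>\d{4})"
)

m = date_pattern.search(timestamp)
date_parts = m.groupdict() if m else None   # None when the layout differs
print(date_parts)
```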

In [11]:
# Convert the results to a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

| | senderName | senderAddress | receiverAddress | day | month | year | subject | email_body |
|---|---|---|---|---|---|---|---|---|
| 0 | "MR. JAMES NGOLA." | james_ngola2002@maktoob.com | [webmaster@aclweb.org] | Oct | 30 | 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | "Mr. Ben Suleman" | bensul2004nng@spinfinder.com | [R@M] | Oct | 31 | 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | "PRINCE OBONG ELEME" | obong_715@epatra.com | [webmaster@aclweb.org] | Oct | 31 | 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | "PRINCE OBONG ELEME" | obong_715@epatra.com | [webmaster@aclweb.org] | Oct | 31 | 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | "Maryam Abacha" | m_abacha03@www.com | [R@M] | Nov | 1 | 2002 | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |
| 5 | Kuta David | davidkuta@postmark.net | [davidkuta@yahoo.com] | Nov | 2 | 2002 | Partnership | ATTENTION: ... |
| 6 | "Barrister tunde dosumu" | tunde_dosumu@lycos.com | [davidkuta@yahoo.com] | Nov | 2 | 2002 | Urgent Attention | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
| 7 | "William Drallo" | william2244drallo@maktoob.com | [webmaster@aclweb.org] | Nov | 3 | 2002 | URGENT BUSINESS PRPOSAL | FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2... |
| 8 | "MR USMAN ABDUL" | abdul_817@rediffmail.com | [R@M] | Nov | 4 | 2002 | THANK YOU | CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\... |
| 9 | "Tunde Dosumu" | barrister_td@lycos.com | [R@M] | Nov | 5 | 2002 | Urgent Assistance | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
