# 作業目標: 使用python正規表達式對資料進行清洗處理

這份作業我們會使用詐欺郵件的文本資料來作為清洗與處理的操作。
[資料集](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### 讀入資料文本
因原始文本較大，先使用部份擷取的**sample_emails.txt**來進行練習

In [1]:
import re

#讀取文本資料
#<your code>#
with open("sample_emails.txt", 'r') as f:
    sample_corpus = f.read()
print(sample_corpus)

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WH

### 讀取寄件者資訊
觀察文本資料可以發現, 寄件者資訊都符合以下格式

`From: <收件者姓名> <收件者電子郵件>`

In [2]:
pattern_obj = re.compile(pattern=r"From:.+\"(.+)\".+<((.+)@(.+).com)>")

result = pattern_obj.findall(sample_corpus) 
print(result)

[('MR. JAMES NGOLA.', 'james_ngola2002@maktoob.com', 'james_ngola2002', 'maktoob'), ('Mr. Ben Suleman', 'bensul2004nng@spinfinder.com', 'bensul2004nng', 'spinfinder'), ('PRINCE OBONG ELEME', 'obong_715@epatra.com', 'obong_715', 'epatra')]


### 只讀取寄件者姓名

In [3]:
for r in result:
    print(r[0])

MR. JAMES NGOLA.
Mr. Ben Suleman
PRINCE OBONG ELEME


### 只讀取寄件者電子信箱

In [4]:
for r in result:
    print(r[1])

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com


### 只讀取電子信箱中的寄件機構資訊
ex: james_ngola2002@maktoob.com --> 取maktoob

In [5]:
for r in result:
    print(r[3])

maktoob
spinfinder
epatra


### 結合上面的配對方式, 將寄件者的帳號與機構訊返回
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [6]:
for r in result:
    print(r[2], end=", ")
    print(r[3])

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra


### 使用正規表達式對email資料進行處理
這裡我們會使用到python其他的套件協助處理(ex: pandas, email, etc)，這裡我們只需要專注在正規表達式上即可，其他的套件是方便我們整理與處理資料。

### 讀取與切分Email
讀入的email為一個長字串，利用正規表達式切割讀入的資料成一封一封的email，並將結果以list表示。

輸出: [email_1, email_2, email_3, ....]

In [7]:
import re
import pandas as pd
import email

###讀取文本資料:fradulent_emails.txt###
#<your code>#
with open("all_emails.txt", "r", encoding="utf8", errors='ignore') as f:
    text = f.read()
    
###切割讀入的資料成一封一封的email###
###我們可以使用list來儲存每一封email###
###注意！這裡請仔細觀察sample資料，看資料是如何切分不同email###
emails = re.split(r"From r", text, flags=re.M)
emails = emails[1:] #移除第一項的空元素

len(emails) #查看有多少封email

3977

### 從文本中擷取所有寄件者與收件者的姓名和地址

In [8]:
emails_list = [] #創建空list來儲存所有email資訊


#send = re.compile(r"From:(.+)<((.+)@(.+)\..+)>")
send = re.compile(r"From:([.+ ])((.+)@.+.+)")

recieve = re.compile(r"To: ((.+)@.+)")

data = re.compile(r"^To: ((.+)@.+)")
for mail in emails[:20]: #只取前20筆資料 (處理速度比較快)
    emails_dict = dict() #創建空字典儲存資訊
    ###取的寄件者姓名與地址###

    sender = re.search(r"From:.*", mail)

    #Step1: 取的寄件者資訊 (hint: From:)


    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    name = re.search(r"(?<=\").+(?=\")", sender.group())
    address = re.search(r"\w\S*@.*\b", sender.group())
    if name == " ":
        name = None
    
    #Step3: 將取得的姓名與地址存入字典中
    #<your code>#

    if name:
        emails_dict['sender_name'] = name.group()
    else:
        emails_dict['sender_name'] = None
    if address:
        emails_dict['sender_email'] = address.group()
    else:
        emails_dict['sender_email'] = None
    
    
    ###取的收件者姓名與地址###
    #Step1: 取的寄件者資訊 (hint: To:)
    reciever = re.search(r"To:.*", mail)

    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    name = re.search(r"(?<=\").+(?=\")", sender.group())
    address = re.search(r"\w\S*@.*\b", sender.group())
    if name:
        emails_dict['recipient_name'] = name.group()
    else:
        emails_dict['recipient_name'] = None
    if address:
        emails_dict['recipient_email'] = address.group()
    else:
        emails_dict['recipient_email'] = None
    
    
    emails_dict['recipient_email'] = address.group()
    
    #Step3: 將取得的姓名與地址存入字典中
    emails_dict[name] = address
        
        
    ###取得信件日期###
    #Step1: 取得日期資訊 (hint: To:)
    date_info = re.search(r"Date:.*", mail)
    
    #Step2: 取得詳細日期(只需取得DD MMM YYYY)
    if date_info:
        date = re.search(r"\d+\s\w+\s\d+", date_info.group())
    else:
        date = None
        
    #Step3: 將取得的日期資訊存入字典中
    if date:
        emails_dict['date'] = date.group()
    else:
        emails_dict['date'] = None    
        
        
    ###取得信件主旨###
    #Step1: 取得主旨資訊 (hint: Subject:)
    subject_info = re.search(r"Subject:\s.*", mail)
    
    #Step2: 移除不必要文字 (hint: Subject: )
    if subject_info:
        subject = subject_info.group().replace("Subject: ", "")
    else:
        subject = None
        
    #Step3: 將取得的主旨存入字典中
    if subject:
        emails_dict['subject'] = subject
    else:
        emails_dict['subject'] = None    
    
    print(emails_dict)
    print("--------")
    
    ###取得信件內文###
    #這裡我們使用email package來取出email內文 (可以不需深究，本章節重點在正規表達式)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except:
        emails_dict["email_body"] = None
    
    ###將字典加入list###
    #<your code>#
    emails_list.append(emails_dict)

{'sender_name': 'MR. JAMES NGOLA.', 'sender_email': 'james_ngola2002@maktoob.com', 'recipient_name': 'MR. JAMES NGOLA.', 'recipient_email': 'james_ngola2002@maktoob.com', <re.Match object; span=(7, 23), match='MR. JAMES NGOLA.'>: <re.Match object; span=(26, 53), match='james_ngola2002@maktoob.com'>, 'date': '31 Oct 2002', 'subject': 'URGENT BUSINESS ASSISTANCE AND PARTNERSHIP'}
--------
{'sender_name': 'Mr. Ben Suleman', 'sender_email': 'bensul2004nng@spinfinder.com', 'recipient_name': 'Mr. Ben Suleman', 'recipient_email': 'bensul2004nng@spinfinder.com', <re.Match object; span=(7, 22), match='Mr. Ben Suleman'>: <re.Match object; span=(25, 53), match='bensul2004nng@spinfinder.com'>, 'date': '31 Oct 2002', 'subject': 'URGENT ASSISTANCE /RELATIONSHIP (P)'}
--------
{'sender_name': 'PRINCE OBONG ELEME', 'sender_email': 'obong_715@epatra.com', 'recipient_name': 'PRINCE OBONG ELEME', 'recipient_email': 'obong_715@epatra.com', <re.Match object; span=(7, 25), match='PRINCE OBONG ELEME'>: <re.M

In [9]:
#將處理結果轉化為dataframe
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,sender_name,sender_email,recipient_name,recipient_email,"<re.Match object; span=(7, 23), match='MR. JAMES NGOLA.'>",date,subject,email_body,"<re.Match object; span=(7, 22), match='Mr. Ben Suleman'>","<re.Match object; span=(7, 25), match='PRINCE OBONG ELEME'>",...,"<re.Match object; span=(7, 21), match='William Drallo'>","<re.Match object; span=(7, 21), match='MR USMAN ABDUL'>","<re.Match object; span=(7, 20), match='Tunde Dosumu'>","<re.Match object; span=(7, 20), match='Dr.Sam jordan'>","<re.Match object; span=(7, 25), match='COL. MICHAEL BUNDU'>","<re.Match object; span=(7, 24), match='MRS MARIAM ABACHA'>","<re.Match object; span=(7, 23), match=' DR. ANAYO AWKA '>","<re.Match object; span=(7, 23), match=' DR. ANAYO AWKA '>.1","<re.Match object; span=(7, 19), match='Victor Aloma'>","<re.Match object; span=(7, 19), match='Victor Aloma'>.1"
0,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,MR. JAMES NGOLA.,james_ngola2002@maktoob.com,"<re.Match object; span=(26, 53), match='james_...",31 Oct 2002,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...,,,...,,,,,,,,,,
1,Mr. Ben Suleman,bensul2004nng@spinfinder.com,Mr. Ben Suleman,bensul2004nng@spinfinder.com,,31 Oct 2002,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ...","<re.Match object; span=(25, 53), match='bensul...",,...,,,,,,,,,,
2,PRINCE OBONG ELEME,obong_715@epatra.com,PRINCE OBONG ELEME,obong_715@epatra.com,,31 Oct 2002,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,,"<re.Match object; span=(28, 48), match='obong_...",...,,,,,,,,,,
3,PRINCE OBONG ELEME,obong_715@epatra.com,PRINCE OBONG ELEME,obong_715@epatra.com,,31 Oct 2002,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...,,,...,,,,,,,,,,
4,Maryam Abacha,m_abacha03@www.com,Maryam Abacha,m_abacha03@www.com,,1 Nov 2002,I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope...",,,...,,,,,,,,,,
5,,davidkuta@postmark.net,,davidkuta@postmark.net,,02 Nov 2002,Partnership,ATTENTION: ...,,,...,,,,,,,,,,
6,Barrister tunde dosumu,tunde_dosumu@lycos.com,Barrister tunde dosumu,tunde_dosumu@lycos.com,,,Urgent Attention,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)...",,,...,,,,,,,,,,
7,William Drallo,william2244drallo@maktoob.com,William Drallo,william2244drallo@maktoob.com,,3 Nov 2002,URGENT BUSINESS PRPOSAL,FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2...,,,...,"<re.Match object; span=(24, 53), match='willia...",,,,,,,,,
8,MR USMAN ABDUL,abdul_817@rediffmail.com,MR USMAN ABDUL,abdul_817@rediffmail.com,,04 Nov 2002,THANK YOU,"CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\...",,,...,,"<re.Match object; span=(24, 48), match='abdul_...",,,,,,,,
9,Tunde Dosumu,barrister_td@lycos.com,Tunde Dosumu,barrister_td@lycos.com,,,Urgent Assistance,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)...",,,...,,,"<re.Match object; span=(23, 45), match='barris...",,,,,,,
