# Assignment goal: cleaning data with Python regular expressions

This assignment uses a corpus of fraudulent emails as the text to clean and process.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Reading the text data
The original corpus is large, so we first practice on a partial extract, **sample_emails.txt**.

In [1]:
# Read the text data (the with-block closes the file automatically)
with open('sample_emails.txt', 'r') as f:
    sample_corpus = f.read()

In [4]:
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

### Extracting sender information
Looking at the text, every sender line follows this format:

`From: <sender name> <sender email address>`

In [13]:
import re
pattern = r'From:.+'  # everything from "From:" to the end of that line
match = re.findall(pattern, sample_corpus)

In [None]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### Extracting only the sender name

In [42]:
sender = re.findall( r'From: "(.+)"', sample_corpus)
for s in sender:
    print(s)

MR. JAMES NGOLA.
Mr. Ben Suleman
PRINCE OBONG ELEME
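
The `.+` inside the quotes is greedy, so it runs to the last double quote on the line. That is harmless for the three sample headers, but a line with a second quoted string would be over-captured. The sketch below uses a made-up header line (an assumption, not taken from the corpus) to compare the greedy pattern with a negated character class that stops at the next quote:

```python
import re

# Hypothetical header with a second quoted string later on the same line.
line = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com> "extra"'

greedy = re.findall(r'From: "(.+)"', line)      # .+ runs to the LAST quote
bounded = re.findall(r'From: "([^"]+)"', line)  # [^"]+ stops at the next quote

print(greedy)
print(bounded)
```

For the sample data both patterns return the same names; the bounded form is simply less sensitive to extra quotes.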


### Extracting only the sender email address

In [43]:
sendermail = re.findall( r'From: ".+" <(.+)>', sample_corpus)
for s in sendermail:
    print(s)

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com
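
The name and the address can also be captured in one pass. The following is a minimal sketch, assuming every header has the quoted-name-plus-angle-brackets layout of the three sample lines; the group names `name` and `email` and the variable `sender_pattern` are chosen here for illustration:

```python
import re

# The three sender lines found in sample_emails.txt.
headers = [
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
]

# Named groups make the two captured parts self-describing.
sender_pattern = re.compile(r'From: "(?P<name>[^"]+)" <(?P<email>[^>]+)>')

senders = [sender_pattern.search(h).groupdict() for h in headers]
for s in senders:
    print(s['name'], '|', s['email'])
```

`groupdict()` returns a plain dict per header, which is convenient when the results are later loaded into a table.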


### Extracting only the sending organization from the email address
e.g. james_ngola2002@maktoob.com --> take maktoob

In [65]:
for m in sendermail:
    mailorg = re.findall(r'@(.+)\.com', m)  # escape the dot so it matches a literal "."
    print(mailorg)

['maktoob']
['spinfinder']
['epatra']
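
In a regular expression an unescaped `.` matches any character, so `.com` and `\.com` are different patterns. A small sketch with a made-up address (not from the corpus) shows where they diverge:

```python
import re

# Hypothetical address whose domain merely contains the letters "com".
address = 'user@telecom.org'

loose = re.findall(r'@(.+).com', address)    # "." matches any character, here the "e"
strict = re.findall(r'@(.+)\.com', address)  # "\." requires a literal dot

print(loose)
print(strict)
```

The loose pattern reports `tel` as the organization even though the address does not end in `.com`; the strict pattern correctly finds nothing.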


### Combining the patterns above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [66]:
for txt in sendermail:
    print(re.findall(r'(.+)@(.+)\.com', txt))

[('james_ngola2002', 'maktoob')]
[('bensul2004nng', 'spinfinder')]
[('obong_715', 'epatra')]
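
The pattern above only handles addresses that end in `.com`. The full corpus also contains addresses such as `samson_k_mani01@voila.fr` and `feltop11@starmail.co.za`. One possible generalization, sketched here under the assumption that the organization is the first label after the `@`, is to stop at the first dot instead of looking for `.com`:

```python
import re

addresses = [
    'james_ngola2002@maktoob.com',
    'samson_k_mani01@voila.fr',
    'feltop11@starmail.co.za',
]

# ([^@]+) = account, ([^.]+) = first domain label, then a literal dot.
account_org = re.compile(r'([^@]+)@([^.]+)\.')

pairs = [account_org.match(a).groups() for a in addresses]
for pair in pairs:
    print(pair)
```

An address with a subdomain (for example a hypothetical `user@mail.example.com`) would yield `mail` under this assumption, so the rule would need adjusting for such data.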


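### Processing email data with regular expressions
This part also uses other Python packages (e.g. pandas, email). We only need to focus on the regular expressions; the other packages are there to make organizing and processing the data easier.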
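
For orientation, here is a minimal sketch of what the standard-library `email` package does with a raw message. The message text is a shortened, hand-built fragment modeled on the first sample email, and `parseaddr` is used only to show that the library can split a sender header without a regular expression:

```python
import email
from email.utils import parseaddr

# Shortened message modeled on the first email in sample_emails.txt.
raw = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n'
    '\n'
    'DEAR FRIEND,\n'
)

msg = email.message_from_string(raw)      # parse headers and body
name, addr = parseaddr(msg['From'])       # split display name and address
body = msg.get_payload()                  # body text of a non-multipart message

print(name)
print(addr)
print(msg['Subject'])
```

The exercises below deliberately use regular expressions for the headers and rely on `email` only for the body.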

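### Reading and splitting the emails
The corpus is read in as one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]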

In [74]:
import re
import pandas as pd
import email

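# Read the full corpus (fradulent_emails.txt in the dataset; opened here as all_emails.txt)
with open('all_emails.txt', 'r', encoding='windows-1252') as f:
    email_corpus = f.read()

# Split the corpus into individual emails and store them in a list.
# Look closely at the sample data to see how one email is separated from the next:
# each message starts with a line of the form "From r  <timestamp>".
emails = re.split(r'From r.+\n', email_corpus)
emails = emails[1:]  # the first element after the split is empty, so drop it
len(emails)  # number of emails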

3977
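
The effect of `re.split` is easier to see on a tiny input. The sketch below builds a two-message toy corpus (invented text that only imitates the `From r  <timestamp>` separator line) and also anchors the pattern with `^` and `re.MULTILINE`, which is an optional tightening and not what the cell above does, so that the separator can only match at the start of a line:

```python
import re

# Toy corpus: two messages, each introduced by a "From r  <timestamp>" line.
toy_corpus = (
    'From r  Wed Oct 30 21:41:56 2002\n'
    'Subject: A\n\nbody one\n'
    'From r  Thu Oct 31 10:00:00 2002\n'
    'Subject: B\n\nbody two\n'
)

# ^ with re.MULTILINE restricts the separator to the beginning of a line.
parts = re.split(r'^From r .+\n', toy_corpus, flags=re.MULTILINE)

print(parts[0] == '')   # text before the first separator is empty
toy_emails = parts[1:]  # drop that empty leading element
print(len(toy_emails))
```

Because the corpus starts with a separator line, the first element of the split is always an empty string, which is why the cell above discards `emails[0]`.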

### Extracting the names and addresses of all senders and recipients

In [120]:
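emails_list = []  # list that collects one dict per email

for mail in emails[1000:1020]:  # process only 20 emails (indices 1000-1019) to keep this fast
    emails_dict = dict()  # dict that holds the fields of this email
    ### Sender name and address ###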
    
    try:
        
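        # Step 1: get the sender line (hint: From:)
        sender = re.search(r'From:.+', mail).group()

        # Step 2: get the name and address (hint: sometimes there is no match)
        # Note: the greedy .* in the name pattern also captures the start of the
        # address, which is why SenderName in the output contains part of it.
        SenderName = re.findall(r'From:\s("?.*"?)\s?<?\w*@\w*\.\w*>?', sender)
        SenderEmail = re.findall(r'\s?<?(\w*@\w*\.\w*)>?', sender)

        # Step 3: store the name and address in the dict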
        if (len(SenderName) > 0):
            emails_dict["SenderName"] = SenderName[0]
        else:
            emails_dict["SenderName"] = None


        if (len(SenderEmail) > 0):
            emails_dict["SenderEmail"] = SenderEmail[0]
        else:
            emails_dict["SenderEmail"] = None
    except AttributeError:  # re.search returned None: this email has no From: line
        emails_dict["SenderName"] = None
        emails_dict["SenderEmail"] = None

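    ### Recipient address ###
    # Step 1: get the recipient line (hint: To:)
    # Step 2: get the name and address (hint: sometimes there is no match)
    # Only the address part is extracted here. The pattern is not anchored,
    # so it also matches Reply-To: lines.
    recipient = re.findall(r'.*To: (.*)', mail)

    # Step 3: store the result in the dict
    emails_dict["recipientEmail"] = recipient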
        
        
    ### Sent date ###
    # Step 1: get the date line (hint: Date:)
    # Step 2: keep only the detailed date (DD MMM YYYY)
    try:
        SentDate = re.findall(r'Date: .* (\d{1,2} \w{3} \d{4})', mail)[0]
    except IndexError:  # no Date: line matched
        SentDate = None

    # Step 3: store the date in the dict
    emails_dict["Date"] = SentDate
        
        
    ### Subject ###
    # Step 1: get the subject line (hint: Subject:)
    # Step 2: drop the "Subject: " prefix; the capture group already excludes it
    try:
        subject = re.findall(r'Subject: (.*)', mail)[0]
    except IndexError:  # no Subject: line matched
        subject = None

    # Step 3: store the subject in the dict
    emails_dict["Subject"] = subject
    
    
    ### Body ###
    # The email package extracts the body here (no need to dig into it; this chapter is about regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None

    ### Append the dict to the list ###
    emails_list.append(emails_dict)
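
The recipient pattern `.*To: (.*)` is not tied to the start of a line, so it picks up `Reply-To:` headers and any body line containing `To: `. The sketch below is one way to tighten header lookups. The helper name `header` and the message text are invented for illustration, and the helper does not distinguish headers from body lines that happen to start with the same word:

```python
import re

# Hypothetical message; only the header layout imitates the corpus.
mail = (
    'Reply-To: frank2@mailsurf.com\n'
    'To: R@M\n'
    'Date: Fri, 16 Jul 2004 10:15:00 +0000\n'
    'Subject: Request for assistance\n'
    '\n'
    'Please reply To: nobody in particular\n'
)

def header(name, text):
    """Return the value of the first line starting with `name:`, or None."""
    m = re.search(rf'^{re.escape(name)}:[ \t]*(.*)$', text, flags=re.MULTILINE)
    return m.group(1) if m else None

unanchored = re.findall(r'.*To: (.*)', mail)  # pattern used in the cell above
recipient = header('To', mail)                # anchored: only the real To: line

# Pull DD MMM YYYY out of the Date header, if there is one.
date_match = re.search(r'\d{1,2} \w{3} \d{4}', header('Date', mail) or '')
sent_date = date_match.group() if date_match else None

print(unanchored)
print(recipient, '|', sent_date)
```

Returning `None` for a missing header keeps the caller free of `try`/`except` blocks.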



In [121]:
# Convert the results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,SenderName,SenderEmail,recipientEmail,Date,Subject,email_body
0,jamestshabalala,jamestshabalala@netscape.net,[jamestshabalala@netscape.net],15 Jul 2004,URGENT RESPONSE,FROM:MR JAMES TSHABALALA\nTEL:+ 27-83-424-7661...
1,"""SAMSON K.MANI"" <samson_k_mani01",samson_k_mani01@voila.fr,[samson_k_mani02@voila.fr],16 Jul 2004,BUSINESS PROPOSAL,Dear Sir=2C \nI know that this proposal letter...
2,"""Mr frank"" <frank13_",frank13_@mailsurf.com,"[frank2@mailsurf.com, R@M]",16 Jul 2004,Request for assistance Next of kin Claims,>From The Desk of Independent Committee of \nE...
3,"""oliver""<oliverfpaul",oliverfpaul@aib.com,"[R@E, feltop11@starmail.co.za, ofpaul6@yahoo.com]",19 Jul 2004,,=20\n\n=20\nMy name i...
4,"""oliver""<oliverfpaul",oliverfpaul@aib.com,"[R@E, feltop11@starmail.co.za, ofpaul6@yahoo.com]",19 Jul 2004,,=20\n\n=20\nMy name i...
5,"""Terry"" <gimlabpnuipfi",gimlabpnuipfi@rocketmail.com,"[""Joey"" <R@M>]",20 Jul 2004,business proposal,"<HTML><html>\n<body>\nDear Fr<!ar>ie<!me>nd,<B..."
6,sitholebaloy,sitholebaloy@netscape.net,[sitholebaloy@netscape.net],21 Jul 2004,|||||||| ASKING FOR YOUR ASSISTANCE (FOR US$21...,\nFROM: BALOY SITHOLE.\nTELL: 27-835-184-080\n...
7,sitholebaloy,sitholebaloy@netscape.net,[sitholebaloy@netscape.net],21 Jul 2004,|||||||| ASKING FOR YOUR ASSISTANCE (FOR US$21...,\nFROM: BALOY SITHOLE.\nTELL: 27-835-184-080\n...
8,sitholebaloy,sitholebaloy@netscape.net,[sitholebaloy@netscape.net],21 Jul 2004,|||||||| ASKING FOR YOUR ASSISTANCE (FOR US$21...,\nFROM: BALOY SITHOLE.\nTELL: 27-835-184-080\n...
9,"""yaya"" <dr_tyh",dr_tyh@hotmail.com,"[thomas-nimely@lycos.es, R@M]",21 Jul 2004,IN GOD WE TRUST,Dearest=2C\n\nI am Chief Thomas Nimely yaya=2C...
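
In the table above, `SenderName` often contains the beginning of the address (for example `"SAMSON K.MANI" <samson_k_mani01`), and for headers without a display name it holds the account name. This happens because the greedy `.*` in the name pattern expands right up to the `@`. The following is a sketch of one alternative, not the assignment's reference solution: the function name `split_sender` is hypothetical, and the header lines are reconstructed from the table rows:

```python
import re

def split_sender(from_line):
    """Split a 'From:' line into (name, address); either part may be None."""
    # Address: word characters, dots, plus signs, or hyphens around an "@".
    addr = re.search(r'[\w.+-]+@[\w.-]+', from_line)
    # Name: text between "From:" and "<", with optional surrounding quotes.
    name = re.search(r'From:\s*"?([^"<]*?)"?\s*<', from_line)
    name_text = name.group(1).strip() if name else ''
    return (name_text or None, addr.group() if addr else None)

# Header lines reconstructed from rows 0, 1 and 3 of the table.
examples = [
    'From: jamestshabalala@netscape.net',
    'From: "SAMSON K.MANI" <samson_k_mani01@voila.fr>',
    'From: "oliver"<oliverfpaul@aib.com>',
]
results = [split_sender(line) for line in examples]
for r in results:
    print(r)
```

A header that carries only an address yields `None` for the name, which keeps the two columns cleanly separated.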
