# 作業目標: 使用python正規表達式對資料進行清洗處理

這份作業我們會使用詐欺郵件的文本資料來作為清洗與處理的操作。
[資料集](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### 讀入資料文本
因原始文本較大，先使用部份擷取的**sample_emails.txt**來進行練習

In [160]:
#讀取文本資料
import re
with open('sample_emails.txt','r') as sample_corpus :
    txt = sample_corpus.read()

### 讀取寄件者資訊
觀察文本資料可以發現, 寄件者資訊都符合以下格式

`From: <收件者姓名> <收件者電子郵件>`

In [161]:
pattern = r"From: .*"
match = re.findall(pattern,txt)
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### 只讀取寄件者姓名

In [162]:
pattern = r"From:\s\"(.*)\""
match = re.findall(pattern,txt)
match

['MR. JAMES NGOLA.', 'Mr. Ben Suleman', 'PRINCE OBONG ELEME']

### 只讀取寄件者電子信箱

In [163]:
pattern = r"From:.*\<(.*)\>"
match = re.findall(pattern,txt)
match

['james_ngola2002@maktoob.com',
 'bensul2004nng@spinfinder.com',
 'obong_715@epatra.com']

### 只讀取電子信箱中的寄件機構資訊
ex: james_ngola2002@maktoob.com --> 取maktoob

In [164]:
pattern = r"From:.*\@(\w+).com>"
match = re.finditer(pattern,txt)
for ma in match:
    print(f'Match text: {ma.group(1)}',end='\n') #使用.group() or .group(0)返回配對的字串

Match text: maktoob
Match text: spinfinder
Match text: epatra


### 結合上面的配對方式, 將寄件者的帳號與機構訊返回
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [165]:
pattern = r"From:.*<(\w+)\@(\w+).com>"
match = re.finditer(pattern,txt)
for ma in match:
    print(f'Match text: {ma.group(1)}') #使用.group() or .group(0)返回配對的字串
    print(f'Match text: {ma.group(2)}')
    print('\n----分隔線----')

Match text: james_ngola2002
Match text: maktoob

----分隔線----
Match text: bensul2004nng
Match text: spinfinder

----分隔線----
Match text: obong_715
Match text: epatra

----分隔線----


### 使用正規表達式對email資料進行處理
這裡我們會使用到python其他的套件協助處理(ex: pandas, email, etc)，這裡我們只需要專注在正規表達式上即可，其他的套件是方便我們整理與處理資料。

### 讀取與切分Email
讀入的email為一個長字串，利用正規表達式切割讀入的資料成一封一封的email，並將結果以list表示。

輸出: [email_1, email_2, email_3, ....]

In [207]:
import re
import pandas as pd
import email

###讀取文本資料: all_emails.txt###
with open('all_emails.txt', encoding='utf-8') as all_emails:
    text = all_emails.read()
    
###切割讀入的資料成一封一封的email###
###我們可以使用list來儲存每一封email###
###注意！這裡請仔細觀察sample資料，看資料是如何切分不同email###
emails = re.split(r'From r', text)[1:]

len(emails) #查看有多少封email

3977

### 從文本中擷取所有寄件者與收件者的姓名和地址

In [226]:
emails_list = [] #創建空list來儲存所有email資訊

for mail in emails[:20]: #只取前20筆資料 (處理速度比較快)
    emails_dict = dict() #創建空字典儲存資訊
    ###取的寄件者姓名與地址###
    
    #Step1: 取的寄件者資訊 (hint: From:)
    send = r'From:.*'
    send_from = re.search(send,mail) 
    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    if send_from :
        send_name = re.search(r"\s\"(.*)\"",send_from.group())
        send_address =  re.search(r"\<(.*)\>", send_from.group())
    else :
        send_name , send_address = None, None
    #Step3: 將取得的姓名與地址存入字典中
    emails_dict['send_name'] = send_name.group() if send_name else None
    emails_dict['send_address'] = send_address.group() if send_address else None
        
    
    ###取的收件者姓名與地址###
    #Step1: 取的寄件者資訊 (hint: To:)
    receive = r'To:.*'
    receive_by = re.search(receive,mail) 
    
    #Step2: 取的姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    if receive_by:
        receive_name = re.search(r"\s(\w+)",receive_by.group())
        receive_address =  re.search(r"", receive_by.group())
    else :
        receive_name , receive_address = None, None
        
    #Step3: 將取得的姓名與地址存入字典中
    emails_dict['receive_name'] = receive_name.group() if receive_name else None
    emails_dict['receive_address'] = receive_address.group() if receive_address else None
        
        
    ###取得信件日期###
    #Step1: 取得日期資訊 (hint: Date:)
    date = re.search(r'Date:.*', mail)
    
    #Step2: 取得詳細日期(只需取得DD MMM YYYY)
    date_info = re.search(r'\d+\s\w+\s\d+', date.group()) if date else None
        
    #Step3: 將取得的日期資訊存入字典中
    emails_dict['Date'] = date_info.group() if date_info else None
        
        
    ###取得信件主旨###
    #Step1: 取得主旨資訊 (hint: Subject:)
    subject_info = re.search(r'Subject:.*', mail)
    
    #Step2: 移除不必要文字 (hint: Subject: )
    subject = re.sub(r'Subject: ', '', subject_info.group()) if subject_info else None
    
    #Step3: 將取得的主旨存入字典中
    emails_dict['Subject'] = subject
    
    
    ###取得信件內文###
    #這裡我們使用email package來取出email內文 (可以不需深究，本章節重點在正規表達式)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except:
        emails_dict["email_body"] = None
    
    ###將字典加入list###
    emails_list.append(emails_dict)

In [227]:
#將處理結果轉化為dataframe
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,send_name,send_address,receive_name,receive_address,Date,Subject,email_body
0,"""MR. JAMES NGOLA.""",<james_ngola2002@maktoob.com>,james_ngola2002@,,31 Oct 2002,URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...
1,"""Mr. Ben Suleman""",<bensul2004nng@spinfinder.com>,R@,,31 Oct 2002,URGENT ASSISTANCE /RELATIONSHIP (P),"Dear Friend,\n\nI am Mr. Ben Suleman a custom ..."
2,"""PRINCE OBONG ELEME""",<obong_715@epatra.com>,obong_715@,,31 Oct 2002,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
3,"""PRINCE OBONG ELEME""",<obong_715@epatra.com>,webmaster@,,31 Oct 2002,GOOD DAY TO YOU,FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
4,"""Maryam Abacha""",<m_abacha03@www.com>,m_abacha03@,,1 Nov 2002,I Need Your Assistance.,"Dear sir, \n \nIt is with a heart full of hope..."
5,,<davidkuta@postmark.net>,davidkuta@,,02 Nov 2002,Partnership,ATTENTION: ...
6,"""Barrister tunde dosumu""",<tunde_dosumu@lycos.com>,tunde_dosumu@,,,Urgent Attention,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)..."
7,"""William Drallo""",<william2244drallo@maktoob.com>,william2244drallo@,,3 Nov 2002,URGENT BUSINESS PRPOSAL,FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2...
8,"""MR USMAN ABDUL""",<abdul_817@rediffmail.com>,R@,,04 Nov 2002,THANK YOU,"CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\..."
9,"""Tunde Dosumu""",<barrister_td@lycos.com>,barrister_td@,,,Urgent Assistance,"Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)..."
