# Assignment goal: clean data with Python regular expressions

This assignment uses the text of fraudulent emails as the material for the cleaning and processing exercises.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Load the text data
Because the original text is large, we first practice on a partial extract, **sample_emails.txt**.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
file_path = './drive/My Drive/NLP/day04/sample_emails.txt'

In [11]:
# ref:https://www.opencli.com/python/4-ways-write-file-line-by-line-in-python

with open(file_path , mode='r') as f:
  line = f.readlines()

In [13]:
print(line)

['From r  Wed Oct 30 21:41:56 2002\n', 'Return-Path: <james_ngola2002@maktoob.com>\n', 'X-Sieve: cmu-sieve 2.0\n', 'Return-Path: <james_ngola2002@maktoob.com>\n', 'Message-Id: <200210310241.g9V2fNm6028281@cs.CU>\n', 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n', 'Reply-To: james_ngola2002@maktoob.com\n', 'To: webmaster@aclweb.org\n', 'Date: Thu, 31 Oct 2002 02:38:20 +0000\n', 'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n', 'X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\n', 'MIME-Version: 1.0\n', 'Content-Type: text/plain; charset="us-ascii"\n', 'Content-Transfer-Encoding: 8bit\n', 'X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\n', 'Status: O\n', '\n', 'FROM:MR. JAMES NGOLA.\n', 'CONFIDENTIAL TEL: 233-27-587908.\n', 'E-MAIL: (james_ngola2002@maktoob.com).\n', '\n', 'URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n', '\n', '\n', 'DEAR FRIEND,\n', '\n', 'I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGO

In [18]:
txt = "".join(line)  # concatenate the list of lines into one string
print(txt)

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WH

### Extract the sender information
Looking at the text, the sender information always follows this format:

`From: <sender name> <sender email>`
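As a minimal sketch of how that format can be taken apart in one pass, the example below uses named groups. The `header` string is copied from the sample data, while the names `from_pattern`, `name` and `address` are illustrative choices, not part of the assignment.

```python
import re

# One sender line, copied from the sample data.
header = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# ^From: anchors at the start of a line; the name is whatever sits between
# the double quotes and the address is whatever sits between < and >.
from_pattern = re.compile(
    r'^From:\s+"(?P<name>[^"]*)"\s+<(?P<address>[^>]+)>',
    re.MULTILINE,
)

m = from_pattern.search(header)
if m:
    print(m.group("name"))     # MR. JAMES NGOLA.
    print(m.group("address"))  # james_ngola2002@maktoob.com
```

Using `[^"]*` for the name avoids assuming how many words the name contains, which a fixed `\w+\s\w+\s\w+` sequence does.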

In [9]:
import re

In [21]:
#<your code>#
pattern = r'\w+:\s+"\w+\.*\s\w+\s\w*\.*"\s<\S+>'
text = txt

match = re.findall(pattern = pattern , string = text)

In [22]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### Extract only the sender's name

In [23]:
#<your code>#
pattern = r'"\w+\.*\s\w+\s\w+\.*"'
text = txt

match = re.findall(pattern=pattern , string = text)

In [24]:
match

['"MR. JAMES NGOLA."', '"Mr. Ben Suleman"', '"PRINCE OBONG ELEME"']

### Extract only the sender's email address

In [30]:
#<your code>#
pattern = r"(?<!Return-Path: )<\w*@\S+>"
text = txt

match = re.findall(pattern = pattern , string = text , flags= re.M)

In [31]:
match

['<james_ngola2002@maktoob.com>',
 '<bensul2004nng@spinfinder.com>',
 '<obong_715@epatra.com>']
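The negative lookbehind `(?<!Return-Path: )` is what keeps the `Return-Path:` lines out of the result. A small sketch on a two-line excerpt of the sample data (the variable names are illustrative) shows the difference with and without it:

```python
import re

# Two header lines copied from the sample data.
sample = (
    "Return-Path: <james_ngola2002@maktoob.com>\n"
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
)

# Without the lookbehind, both bracketed addresses match.
all_addresses = re.findall(r"<\w*@\S+>", sample)
print(len(all_addresses))  # 2

# (?<!Return-Path: ) rejects a '<' that is directly preceded by 'Return-Path: '.
addresses = re.findall(r"(?<!Return-Path: )<\w*@\S+>", sample)
print(addresses)  # ['<james_ngola2002@maktoob.com>']
```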

### Extract only the sending organization from the email address
e.g. james_ngola2002@maktoob.com --> take maktoob

In [32]:
#<your code>#
pattern = r"(?<=@)\S*(?=\.com>\nReply)|(?<=@)\S*(?=\.com>\nD)"
text = txt

match = re.findall(pattern=pattern , string = text)

In [33]:
match

['maktoob', 'spinfinder', 'epatra']
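The pattern above identifies the sender line by looking at what follows it (`Reply` or `D` on the next line). One alternative, sketched here under the assumption that the sender line always starts with `From:`, is to anchor on that prefix and capture the first label of the domain. `org_pattern` is an illustrative name.

```python
import re

# Three header lines copied from the sample data.
sample = (
    "Return-Path: <james_ngola2002@maktoob.com>\n"
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    "Reply-To: james_ngola2002@maktoob.com\n"
)

# ^From: (with re.MULTILINE) selects the sender line; the capture group
# takes everything after '@' up to the first '.' or '>'.
org_pattern = re.compile(r"^From:.*<[^@>]+@([^.>]+)", re.MULTILINE)

orgs = org_pattern.findall(sample)
print(orgs)  # ['maktoob']
```

With a single capture group, `re.findall` returns only the captured text, so no lookarounds are needed.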

### Combine the patterns above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [34]:
#<your code>#
pattern = r"(?<!Return-Path: )<(\w*)@(\S+(?=\.com))"
text = txt

match = re.findall(pattern=pattern , string = text)

In [35]:
match

[('james_ngola2002', 'maktoob'),
 ('bensul2004nng', 'spinfinder'),
 ('obong_715', 'epatra')]

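### Process email data with regular expressions
Here we also use other Python packages (e.g. pandas, email) to help with the processing. You only need to focus on the regular expressions; the other packages are just there to organize and handle the data.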
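For orientation, here is a minimal sketch of what the standard-library `email` package does with a message string. The `raw` message is a shortened, hypothetical version of the first sample email, and using `email.utils.parseaddr` is an addition for illustration rather than part of the exercise.

```python
import email
from email.utils import parseaddr

# Shortened, hypothetical message built from the first sample email.
raw = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    "To: webmaster@aclweb.org\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
    "\n"
    "DEAR FRIEND,\n"
)

msg = email.message_from_string(raw)

# Headers are available by name; the body is everything after the blank line.
print(msg["To"])                  # webmaster@aclweb.org
print(msg.get_payload().strip())  # DEAR FRIEND,

# parseaddr splits a header value into (display name, address).
name, address = parseaddr(msg["From"])
print(name)     # MR. JAMES NGOLA.
print(address)  # james_ngola2002@maktoob.com
```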

### Load and split the emails
The loaded emails form one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]

In [42]:
import re
import pandas as pd
import email

### Read the text data: fradulent_emails.txt ###
#<your code>#
file_path = './drive/My Drive/NLP/day04/all_emails.txt'
# https://github.com/Currie32/Spell-Checker/issues/14
with open(file_path , mode = 'r' , encoding = 'utf8' , errors='ignore') as f:
  line = f.read()
### Split the loaded data into individual emails ###
### A list can hold each email ###
### Note: look closely at the sample data to see how the emails are delimited ###
#<your code>#
pattern = r'From r'
emails = re.split(pattern = pattern , string = line)
emails = emails[1:]  # drop the text before the first separator

len(emails) # check how many emails there are

3977
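The split above uses the bare pattern `From r`, which would also match if those characters appeared in the middle of a message body. Whether the full corpus contains such lines is not checked here. The sketch below uses a made-up two-email string (the second separator date and the body line are hypothetical) to show how anchoring with `^` and `re.MULTILINE` restricts the split to line starts.

```python
import re

# Hypothetical mini-corpus: two emails, one body line containing "From r".
raw = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: first\n"
    "\n"
    "Greetings From rich uncle\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: second\n"
)

# Unanchored: also splits inside the body line "Greetings From rich uncle".
loose = re.split(r"From r", raw)[1:]

# Anchored: splits only where a line starts with "From r".
strict = re.split(r"^From r", raw, flags=re.MULTILINE)[1:]

print(len(loose), len(strict))  # 3 2
```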

In [43]:
print(emails[0])

  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE

### Extract the names and addresses of all senders and recipients from the text

In [55]:
mail = emails[0]
pattern = r'From:.+'
info = re.search(pattern= pattern, string = mail)
print(info.group())
name = re.search(pattern=r'".*"' , string=info.group())
address = re.search(pattern=r'<.*>' , string = info.group())
print(name.group())
print(address.group())

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
"MR. JAMES NGOLA."
<james_ngola2002@maktoob.com>


In [56]:
mail = emails[0]
pattern = r'To:.+'
info = re.search(pattern= pattern, string = mail)
print(info.group())
name = re.search(pattern=r'(?<=\").*(?=\")' , string=info.group())
address = re.search(pattern=r'(?<=\s).*' , string = info.group())
print(name)
print(address.group())

To: james_ngola2002@maktoob.com
None
james_ngola2002@maktoob.com
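The address printed above is the one from the `Reply-To:` header. An unanchored `To:.+` finds the `To:` inside `Reply-To:` first, because that header comes earlier in the message. A minimal sketch of the difference, using the two header lines from the first email (`loose` and `strict` are illustrative names):

```python
import re

# The two relevant header lines from the first email, in their original order.
headers = (
    "Reply-To: james_ngola2002@maktoob.com\n"
    "To: webmaster@aclweb.org\n"
)

# Unanchored: the first "To:" found is the tail of "Reply-To:".
loose = re.search(r"To:.+", headers).group()
print(loose)   # To: james_ngola2002@maktoob.com

# Anchored with ^ and re.MULTILINE: only a line that starts with "To:" matches.
strict = re.search(r"^To:.+", headers, flags=re.MULTILINE).group()
print(strict)  # To: webmaster@aclweb.org
```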


In [57]:
mail = emails[0]
pattern = r'Date:.+'
info = re.search(pattern= pattern, string = mail)
print(info.group())
date = re.search(pattern=r'(?<=Date: ).*' , string=info.group())
print(date.group())

Date: Thu, 31 Oct 2002 02:38:20 +0000
Thu, 31 Oct 2002 02:38:20 +0000


In [70]:
emails_list = [] # empty list to hold the info extracted from every email

for mail in emails[:20]: # use only the first 20 emails (faster to process)
    emails_dict = dict() # empty dict to hold this email's info
    ### Get the sender's name and address ###

    #Step1: get the sender info (hint: From:)
    #<your code>#
    info_send = re.search(pattern=r'From:.+' , string = mail)
    #Step2: get the name and address (hint: sometimes there is no match)
    #<your code>#
    if info_send is None:
        sender_name = None
        sender_address = None
    else:
        sender_name = re.search(pattern=r'".*"' , string=info_send.group())
        sender_address = re.search(pattern=r'<.*>' , string = info_send.group())
    #Step3: store the name and address in the dict
    #<your code>#
    if sender_name is not None:
        emails_dict['sender_name'] = sender_name.group()
    else:
        emails_dict['sender_name'] = 'None'
    if sender_address is not None:
        emails_dict['sender_address'] = sender_address.group()
    else:
        emails_dict['sender_address'] = 'None'
    ### Get the recipient's name and address ###
    #Step1: get the recipient info (hint: To:)
    #<your code>#
    info_recv = re.search(pattern=r'To:.+' , string = mail)
    #Step2: get the name and address (hint: sometimes there is no match)
    #<your code>#
    if info_recv is None:
        recv_name = None
        recv_address = None
    else:
        recv_name = re.search(pattern=r'(?<=\").*(?=\")' , string=info_recv.group())
        recv_address = re.search(pattern=r'(?<=\s).*' , string = info_recv.group())
    #Step3: store the name and address in the dict
    #<your code>#
    if recv_name is not None:
        emails_dict['recv_name'] = recv_name.group()
    else:
        emails_dict['recv_name'] = 'None'
    if recv_address is not None:
        emails_dict['recv_address'] = recv_address.group()
    else:
        emails_dict['recv_address'] = 'None'
        
    ### Get the email date ###
    #Step1: get the date info (hint: Date:)
    #<your code>#
    recv_date = re.search(pattern=r'Date' , string= mail)
    #Step2: get the detailed date (only DD MMM YYYY is needed)
    #<your code>#
    if recv_date is not None:
        date = re.search(pattern=r'\d+\s+\w*\s+\d+' , string=recv_date.group())
    else:
        date = None
    #Step3: store the date in the dict
    #<your code>#
    if date is not None:
        emails_dict['Date']=date.group()
    else:
        emails_dict['Date']='None'
  
    ### Get the email subject ###
    #Step1: get the subject info (hint: Subject:)
    #<your code>#
    subject_info = re.search(pattern=r'Subject:.*' , string=mail)
    #Step2: remove the unneeded text (hint: Subject: )
    #<your code>#
    if subject_info is not None:
        Subject = re.sub(pattern='Subject: ' , repl='' , string = subject_info.group())
    else:
        Subject = 'None'
    #Step3: store the subject in the dict
    #<your code>#
    emails_dict['Subject'] = Subject
    
    ### Get the email body ###
    # Here the email package extracts the body (no need to study it in depth; this chapter focuses on regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None
    
    ### Append the dict to the list ###
    #<your code>#
    emails_list.append(emails_dict)

In [71]:
# convert the processed results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

| | sender_name | sender_address | recv_name | recv_address | Date | Subject | email_body |
|---|---|---|---|---|---|---|---|
| 0 | "MR. JAMES NGOLA." | `<james_ngola2002@maktoob.com>` | | james_ngola2002@maktoob.com | | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | "Mr. Ben Suleman" | `<bensul2004nng@spinfinder.com>` | | R@M | | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | "PRINCE OBONG ELEME" | `<obong_715@epatra.com>` | | obong_715@epatra.com | | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | "PRINCE OBONG ELEME" | `<obong_715@epatra.com>` | | webmaster@aclweb.org | | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | "Maryam Abacha" | `<m_abacha03@www.com>` | | m_abacha03@www.com | | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |
| 5 | | `<davidkuta@postmark.net>` | | davidkuta@yahoo.com | | Partnership | ATTENTION: ... |
| 6 | "Barrister tunde dosumu" | `<tunde_dosumu@lycos.com>` | | tunde_dosumu@lycos.com | | Urgent Attention | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
| 7 | "William Drallo" | `<william2244drallo@maktoob.com>` | | william2244drallo@maktoob.com | | URGENT BUSINESS PRPOSAL | FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2... |
| 8 | "MR USMAN ABDUL" | `<abdul_817@rediffmail.com>` | | R@M | | THANK YOU | CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\... |
| 9 | "Tunde Dosumu" | `<barrister_td@lycos.com>` | | barrister_td@lycos.com | | Urgent Assistance | Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)... |
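The `Date` column above is empty in every row. In the loop, the first search matches only the literal word `Date`, so the second search has no digits to work with. A minimal sketch of one way to get `DD MMM YYYY` is shown below. The `mail` excerpt is a hypothetical two-line string with header values copied from the first email, and the stricter date pattern is an illustrative choice.

```python
import re

# Hypothetical two-line excerpt; the header values come from the first email.
mail = (
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n"
)

# Searching for the bare word keeps only "Date", so no date can be found in it.
bare = re.search(r"Date", mail).group()
print(re.search(r"\d+\s+\w*\s+\d+", bare))  # None

# Capturing the whole header line first leaves the date available
# for the second search.
date_line = re.search(r"^Date:.+", mail, flags=re.MULTILINE)
date = (
    re.search(r"\d{1,2}\s+[A-Za-z]{3}\s+\d{4}", date_line.group())
    if date_line
    else None
)
date_text = date.group() if date else "None"
print(date_text)  # 31 Oct 2002
```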
