## Ana Maria Renno Pocai 

### **Understanding the Dataset: Phishing Email Detection**

For this project, I worked with the **Phishing Email Dataset**, which contains metadata and content from various emails. The goal of this dataset is to help identify fraudulent or phishing emails—messages designed to deceive recipients into sharing sensitive information, like passwords or financial details, by pretending to come from legitimate sources.

#### **Dataset Features**
The dataset includes several key features:

- **Sender & Receiver** – Email addresses involved in the communication.  
- **Date & Time** – When the email was sent.  
- **Subject & Body** – The text content of the email.  
- **URLs in Emails** – Links that could indicate phishing attempts.  
- **Fraud Label** – A binary label (`1 = Fraudulent`, `0 = Legitimate`) that classifies each email.  


## Exercise 1: Identify Suspicious Email Domains
- Find the top 5 most frequent email domains in fraudulent transactions.
- Write a function to flag transactions from less common domains.


In [123]:
import pandas as pd
import re
import random
import numpy as np
from datetime import timedelta
import json

### Most frequent senders 

### **Identifying Suspicious Emails and Analyzing Common Fake Senders**

To better understand fraudulent emails in the dataset, I first filtered out the phishing emails by selecting only the ones labeled as `1` (fraudulent). Additionally, I ensured that only valid senders were considered by removing missing or empty values, as well as placeholder entries like `<>`.

#### **Data Cleaning and Normalization**
To analyze email patterns more effectively, I standardized the sender addresses:
- Converted all characters to lowercase.
- Removed spaces and hyphens to unify different formats of the same sender.


In [124]:
df = pd.read_csv("CEAS_08.csv")
suspicious_email = df[df["label"] == 1]
suspicious_email = suspicious_email[suspicious_email["sender"].notna()]  
suspicious_email = suspicious_email[suspicious_email["sender"].str.strip().ne('')]  
suspicious_email = suspicious_email[suspicious_email["sender"].ne('<>')]  

suspicious_email["normalized_sender"] = (
    suspicious_email["sender"]
    .str.lower()  
    .str.replace(r'[-\s]', '')  
)

### **Analyzing Common and Rare Fraudulent Email Senders**

To better understand phishing patterns, I looked at which fake senders appeared the most and which were the least common. This helps to identify email addresses that scammers use repeatedly, as well as those that might be unique or less frequent attempts.

#### **Most Common Fake Senders**
I counted how often each sender appeared in the dataset and picked out the top five most frequently used ones.


In [125]:
top_emails_types = (
    suspicious_email["sender"]
    .value_counts()
    .head(5)
)
print("Most commum fake emails")
print(top_emails_types)

Most commum fake emails
sender
Google AdWords-noreply <adwords-noreply@google.com>    36
Google AdWords <reactivation@google.com>               30
Google-AdWords-noreply <adwords-noreply@google.com>    29
Google-AdWords <reactivation@google.com>               25
Google-AdWords-Noreply <support@google.com>            23
Name: count, dtype: int64


#### **Less Common Fake Senders**  
I did the same as I did for the most common ones but this time looked at the five least frequent fake senders using `.tail(5)`.

In [126]:
less_emails_type = (
    suspicious_email["sender"]
    .value_counts()
    .tail(5)
)
print("Less commum fake emails")
print(less_emails_type)

Less commum fake emails
sender
Daily Top 10 <Lemettre-anotsab@impressmedia.com>    1
Richard Mcrae <dwvarelintlm@varelintl.com>          1
Daily Top 10 <rozannne-fissavan@0120874874.com>     1
sharp2@pba1873.com                                  1
CNN Alerts <idgetily1971@careplusnj.org>            1
Name: count, dtype: int64


### **Identifying Common Fake Email Domains**  
In this step, I extracted the domain of each fraudulent email address by applying a regular expression to the normalized sender's email. This allowed me to focus on the domain name (after the '@' symbol) and track which domains were frequently used in phishing attempts. 

In [127]:

suspicious_email["email_domain"] = suspicious_email["normalized_sender"].apply(
    lambda x: re.search(r"@([\w.-]+)", x).group(1) if re.search(r"@([\w.-]+)", x) else None
)
suspect_domains_s = suspicious_email.copy()

top_email_domains = suspicious_email["email_domain"].value_counts().head(5)
print("Most common fake email domains:")
print(top_email_domains)

Most common fake email domains:
email_domain
google.com             208
merriam-webster.com     83
wikipedia.org           75
yahoo.com               70
foxnews.com             50
Name: count, dtype: int64


#### **Identifying Less Common Fake Email Domains**  
For this part, I used the same technique as before to identify the least common fake email domains. 

In [128]:

least_email_domains = suspicious_email["email_domain"].value_counts().tail(5)
print("Least common fake email domains:")
print(least_email_domains)

Least common fake email domains:
email_domain
hamiltonsystems.us    1
3idinc.com            1
islands.vi            1
oyps.com              1
careplusnj.org        1
Name: count, dtype: int64


### **Detecting Fraudulent Emails Based on Rare Domains**

For this part of the project, I focused on detecting potentially fraudulent emails based on rare email domains.

1. **Identifying Rare Domains:**
   - First, I counted the frequency of each email domain from the suspicious emails dataset (`label == 1`).
   - Then, I created a list of rare domains by selecting those that appeared only once in the dataset.
   - These rare domains were stored in a set called `rare_domains`.

2. **Creating the Fraud Detection Function:**
   - I then created a function `is_fraudulent_email()` that checks whether an email's sender domain is in the `rare_domains` set and is not already part of the top phishing domains (`top_email_domains`).
   - If both conditions are satisfied, the email is flagged as potentially fraudulent.

In [129]:
domain_counts = suspicious_email["email_domain"].value_counts()
rare_domains = set(domain_counts[domain_counts == 1].index)
def is_fraudulent_email(sender):
    domain = re.search(r"@([\w.-]+)", str(sender))
    
    if domain:
        domain = domain.group(1)
        if domain in rare_domains and domain not in top_email_domains:
            return True 
    return False  

#### **Checking a Random Sender’s Email for Fraud**

In this step, I randomly selected an email sender from the dataset and used the `is_fraudulent_email()` function to check if the sender's domain is rare and part of the less common fraudulent domains. 

In [130]:
random_sender = df["sender"].dropna().sample(1).values[0]

is_fraud = is_fraudulent_email(random_sender)

print(f"Email: {random_sender}")
print(f"Is part of the less commun fraudulent emails domain? {is_fraud}")

Email: dew caryn <cortez@cancorpam.com>
Is part of the less commun fraudulent emails domain? False


### Most Frequent Receivers

For this part, I applied the same method used for analyzing the sender's emails to the receivers. 

In [131]:
suspicious_email = suspicious_email[suspicious_email["receiver"].notna()]  
suspicious_email = suspicious_email[suspicious_email["receiver"].str.strip().ne('')]  
suspicious_email = suspicious_email[suspicious_email["receiver"].ne('<>')]  

suspicious_email["normalized_receiver"] = (
    suspicious_email["receiver"]
    .str.lower()  
    .str.replace(r'[-\s]', '')  
)

top_emails_types = (
    suspicious_email["receiver"]
    .value_counts()
    .head(5)
)
print("Most commum fake emails")
print(top_emails_types)

Most commum fake emails
receiver
user2.2@gvc.ceas-challenge.cc       889
user2.4@gvc.ceas-challenge.cc       718
user2.13@gvc.ceas-challenge.cc      684
user7@gvc.ceas-challenge.cc         663
user7-ext4@gvc.ceas-challenge.cc    655
Name: count, dtype: int64


In [132]:
less_emails_type = (
    suspicious_email["receiver"]
    .value_counts()
    .tail(5)
)
print("Less commum fake emails")
print(less_emails_type)

Less commum fake emails
receiver
chanelledurrance@gvc.ceas-challenge.cc      1
josie@gvc.ceas-challenge.cc                 1
unwillingannita101@gvc.ceas-challenge.cc    1
dennise.willibrand@gvc.ceas-challenge.cc    1
elmia.plankey@gvc.ceas-challenge.cc         1
Name: count, dtype: int64


In [133]:

suspicious_email["email_domain"] = suspicious_email["normalized_receiver"].apply(
    lambda x: re.search(r"@([\w.-]+)", x).group(1) if re.search(r"@([\w.-]+)", x) else None
)
suspect_domains_r = suspicious_email.copy()
top_email_domains = suspicious_email["email_domain"].value_counts().head(5)
print("Most common fake email domains:")
print(top_email_domains)

Most common fake email domains:
email_domain
gvc.ceas-challenge.cc               21741
1c24b71122b5ade14892d25325fdf3c0       19
lists.lwv.org                           6
lists.debian.org                        4
speelzolder.com                         2
Name: count, dtype: int64


In [134]:

least_email_domains = suspicious_email["email_domain"].value_counts().tail(5)
print("Least common fake email domains:")
print(least_email_domains)

Least common fake email domains:
email_domain
online.no                 1
operamail.com             1
childreneverywhere.org    1
speedy.uwaterloo.c        1
poczta.onet.pl            1
Name: count, dtype: int64


In [135]:
domain_counts = suspicious_email["email_domain"].value_counts()
rare_domains = set(domain_counts[domain_counts == 1].index)
def is_fraudulent_email(receiver):
    domain = re.search(r"@([\w.-]+)", str(receiver))
    
    if domain:
        domain = domain.group(1)
        if domain in rare_domains and domain not in top_email_domains:
            return True 
    return False  

In [136]:
random_receiver = df["receiver"].dropna().sample(1).values[0]

is_fraud = is_fraudulent_email(random_receiver)

print(f"Email: {random_receiver}")
print(f"Is part of the less commun fraudulent emails domain? {is_fraud}")

Email: user8.1@gvc.ceas-challenge.cc
Is part of the less commun fraudulent emails domain? False


## Exercise 2: Regular Expressions for Data Validation
- Validate that email addresses in the dataset are correctly formatted.
- Identify and extract all numeric values appearing in descriptions.

### **Validating Email Addresses and Extracting Numbers**

In this step, I created a function `validate_email()` to validate the email addresses of both senders and receivers. The function uses regular expressions to first extract the email address and then check if it matches a valid email format. 

Additionally, I used a regular expression to extract numbers (such as amounts or values) from the email subject line.

In [137]:
extract_email_pattern = r"<([^<>]+)>"  

email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

def validate_email(email):
    email = str(email).strip() 
    match = re.search(extract_email_pattern, email)  
    if match:
        email = match.group(1)  
    return bool(re.match(email_pattern, email))  

df["valid_sender"] = df["sender"].apply(validate_email)
df["valid_receiver"] = df["receiver"].apply(validate_email)
df["extracted_numbers"] = df["subject"].astype(str).apply(lambda x: re.findall(r"\d+(?:\.\d+)?", x))
df

Unnamed: 0,sender,receiver,date,subject,body,label,urls,valid_sender,valid_receiver,extracted_numbers
0,Young Esposito <Young@iworld.de>,user4@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 16:31:02 -0700",Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1,1,True,True,[]
1,Mok <ipline's1983@icable.ph>,user2.2@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 18:31:03 -0500",Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1,1,False,True,[]
2,Daily Top 10 <Karmandeep-opengevl@universalnet...,user2.9@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 20:28:00 -1200",CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1,1,True,True,[10]
3,Michael Parker <ivqrnai@pobox.com>,SpamAssassin Dev <xrh@spamassassin.apache.org>,"Tue, 05 Aug 2008 17:31:20 -0600",Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0,1,True,True,[619753]
4,Gretchen Suggs <externalsep1@loanofficertool.com>,user2.2@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 19:31:21 -0400",SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1,1,True,True,[]
...,...,...,...,...,...,...,...,...,...,...
39149,CNN Alerts <charlene-detecton@btcmarketing.com>,email1007@gvc.ceas-challenge.cc,"Fri, 08 Aug 2008 10:34:50 -0400",CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1,0,True,True,[]
39150,CNN Alerts <idgetily1971@careplusnj.org>,email104@gvc.ceas-challenge.cc,"Fri, 08 Aug 2008 10:35:11 -0400",CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1,0,True,True,[]
39151,Abhijit Vyas <xpojhbz@gmail.com>,fxgmqwjn@triptracker.net,"Fri, 08 Aug 2008 22:00:43 +0800",Slideshow viewer,Hello there ! \nGreat work on the slide show v...,0,0,True,True,[]
39152,Joseph Brennan <vupzesm@columbia.edu>,zqoqi@spamassassin.apache.org,"Fri, 08 Aug 2008 09:00:46 -0500",Note on 2-digit years,"\nMail from sender , coming from intuit.com\ns...",0,0,True,True,[2]


In [138]:
print("Invalid Emails (Sender):")
print(df[~df["valid_sender"]][["sender"]])

Invalid Emails (Sender):
                                                  sender
1                           Mok <ipline's1983@icable.ph>
9          Daily Top 10 <orn|dent_1973@musicaedischi.it>
35                   Daily Top 10 <k|bet2005@0166.co.jp>
36                                               Amir <>
46                                               Nhan <>
...                                                  ...
38933                                           Agent <>
38997          CNN Alerts <}ndkl{de_1981@buy-a-home.com>
39012                Molly lightle <imetess{@joggers.es>
39120  CNN Alerts <shaida-elokuvan@1rjz_d3xft.cbqdf3z...
39142                                                 <>

[1166 rows x 1 columns]


In [139]:
print("\nInvalid Emails (Receiver):")
print(df[~df["valid_receiver"]][["receiver"]])


Invalid Emails (Receiver):
                                                receiver
60     "bsfegmjrr-hnq@python.org" <'bsfegmjrr-hnq@pyt...
139    gflanagan01@gvc.ceas-challenge.cc, user2.10@gv...
140    user2.3@gvc.ceas-challenge.cc, charline3738@gv...
175    hrkqm@gvc.ceas-challenge.cc, user2.14@gvc.ceas...
177    clement494@gvc.ceas-challenge.cc, user7@gvc.ce...
...                                                  ...
38971                                                NaN
38975                                                NaN
38998                                                NaN
39035                                                NaN
39111  rgkwpdahmhmail@cs.cmu.edu, faq@engr.orst.edu, ...

[1880 rows x 1 columns]


In [140]:
print("\nExtracted Numbers from Descriptions:")
df_filtered = df[df["extracted_numbers"].apply(lambda x: len(x) > 0)]
print(df_filtered[["subject", "extracted_numbers"]])


Extracted Numbers from Descriptions:
                                                 subject extracted_numbers
2                                   CNN.com Daily Top 10              [10]
3      Re: svn commit: r619753 - in /spamassassin/tru...          [619753]
7                                   CNN.com Daily Top 10              [10]
8      [Bug 5780] URI processing turns uuencoded stri...            [5780]
9                                   CNN.com Daily Top 10              [10]
...                                                  ...               ...
39143                                   Woman 38 Richter              [38]
39145                 Be larger than ever after 2 months               [2]
39148  Patients can access Our online health shop is ...           [24, 7]
39152                              Note on 2-digit years               [2]
39153                      [Python-Dev] PEP 370 heads up             [370]

[12771 rows x 2 columns]


## Exercise 3: Optimize the Algorithm
- Improve fraud detection by incorporating past customer transaction history.
- Implement an efficient way to flag repeated transactions within a short period.

### **Converting Date Strings to Datetime Format**

In this step, I created a function `convert_to_datetime()` to convert the date strings in the 'date' column into a proper `datetime` format. I used the `pd.to_datetime()` function from pandas, specifying the date format `%a, %d %b %Y %H:%M:%S %z` to handle the timezone information as well. 

In [141]:
def convert_to_datetime(date_string):
    try:
        return pd.to_datetime(date_string, format='%a, %d %b %Y %H:%M:%S %z', errors='raise')
    except ValueError:
        return pd.NaT

df['date'] = df['date'].apply(convert_to_datetime)
df


Unnamed: 0,sender,receiver,date,subject,body,label,urls,valid_sender,valid_receiver,extracted_numbers
0,Young Esposito <Young@iworld.de>,user4@gvc.ceas-challenge.cc,2008-08-05 16:31:02-07:00,Never agree to be a loser,"Buck up, your troubles caused by small dimensi...",1,1,True,True,[]
1,Mok <ipline's1983@icable.ph>,user2.2@gvc.ceas-challenge.cc,2008-08-05 18:31:03-05:00,Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...,1,1,False,True,[]
2,Daily Top 10 <Karmandeep-opengevl@universalnet...,user2.9@gvc.ceas-challenge.cc,2008-08-05 20:28:00-12:00,CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1,1,True,True,[10]
3,Michael Parker <ivqrnai@pobox.com>,SpamAssassin Dev <xrh@spamassassin.apache.org>,2008-08-05 17:31:20-06:00,Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...,0,1,True,True,[619753]
4,Gretchen Suggs <externalsep1@loanofficertool.com>,user2.2@gvc.ceas-challenge.cc,2008-08-05 19:31:21-04:00,SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...,1,1,True,True,[]
...,...,...,...,...,...,...,...,...,...,...
39149,CNN Alerts <charlene-detecton@btcmarketing.com>,email1007@gvc.ceas-challenge.cc,2008-08-08 10:34:50-04:00,CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1,0,True,True,[]
39150,CNN Alerts <idgetily1971@careplusnj.org>,email104@gvc.ceas-challenge.cc,2008-08-08 10:35:11-04:00,CNN Alerts: My Custom Alert,\n\nCNN Alerts: My Custom Alert\n\n\n\n\n\n\n ...,1,0,True,True,[]
39151,Abhijit Vyas <xpojhbz@gmail.com>,fxgmqwjn@triptracker.net,2008-08-08 22:00:43+08:00,Slideshow viewer,Hello there ! \nGreat work on the slide show v...,0,0,True,True,[]
39152,Joseph Brennan <vupzesm@columbia.edu>,zqoqi@spamassassin.apache.org,2008-08-08 09:00:46-05:00,Note on 2-digit years,"\nMail from sender , coming from intuit.com\ns...",0,0,True,True,[2]


### **Flagging Repeated Transactions**

For this part, I created a function `flag_repeated_transactions()` that identifies suspicious repeated transactions based on the sender's behavior. The function sorts the data by the sender and date to make sure transactions are in chronological order. It then loops through the rows, checking if the same sender has made another transaction within a specified time window (10 minutes). If a repeated transaction is detected, it flags the row by setting the column `repeated_transaction` to `True`.

I applied this function to the `suspicious_transactions` DataFrame, which contains only the fraudulent emails (label == 1). This helps to flag potential fraudulent behavior where multiple transactions happen in a short amount of time from the same sender.


In [142]:
suspicious_transactions = df[df["label"] == 1]
def flag_repeated_transactions(df, time_window_minutes=10):
    df = df.sort_values(by=['sender', 'date'])
    df['repeated_transaction'] = False
    
    for i in range(1, len(df)):
        if df.iloc[i]['sender'] == df.iloc[i-1]['sender'] and (df.iloc[i]['date'] - df.iloc[i-1]['date']) <= timedelta(minutes=time_window_minutes):
            df.at[i, 'repeated_transaction'] = True
    
    return df
df = flag_repeated_transactions(suspicious_transactions)

### **Flagging Customers with a Fraud History**

In this step, I created a function `flag_customer_fraud_history()` to identify customers who have a history of fraudulent transactions within a specified time window (30 days). The function iterates through the rows of the DataFrame, checking if a transaction is marked as fraudulent (label == 1). For each fraudulent transaction, it filters the `suspicious_transactions` DataFrame to find previous transactions made by the same sender within the past 30 days.

If the sender has more than two fraudulent transactions within this window, their transaction is flagged with the column `fraud_history` set to `True`. This helps to identify customers with a recurring pattern of fraudulent behavior.


In [143]:

def flag_customer_fraud_history(df, days_window=30):
    df['fraud_history'] = False
    for idx, row in df.iterrows():
        if row['label'] == 1:  
            fraud_history = suspicious_transactions[(suspicious_transactions['sender'] == row['sender']) &
                                                    (suspicious_transactions['date'] <= row['date']) &
                                                    (suspicious_transactions['date'] >= row['date'] - timedelta(days=days_window))]
            if len(fraud_history) > 2:
                df.at[idx, 'fraud_history'] = True
    
    return df

df = flag_customer_fraud_history(df)

### **Flagging Suspicious Transactions**

In this part, I created the function `flag_suspicious_transaction()` to flag suspicious transactions based on two factors: whether the transaction is repeated and whether the sender has a history of fraud. If either of these conditions is met, the transaction is flagged as suspicious, and the `is_suspicious` column is set to `True`. 

After applying this function to the DataFrame, I used `dropna()` to remove any rows with missing values (cleaning the dataset). Finally, I filtered out the suspicious transactions using `df_cleaned` to only show transactions where `is_suspicious` is `True`. 

In [144]:
def flag_suspicious_transaction(df):
    df['is_suspicious'] = False
    df['is_suspicious'] = df['repeated_transaction'] | df['fraud_history']
    return df

df = flag_suspicious_transaction(df)
df_cleaned = df.dropna(axis=0)

df_cleaned [(df_cleaned ['is_suspicious'] == True)]


Unnamed: 0,sender,receiver,date,subject,body,label,urls,valid_sender,valid_receiver,extracted_numbers,repeated_transaction,fraud_history,is_suspicious
7521,"""Arlene H. Mansfield"" <arlene.h_mansfieldtq@mo...","user2.14@gvc.ceas-challenge.cc, user2.14@gvc.c...",2008-08-05 22:39:29-05:00,Spice up your senses in bed,\n\n\n\n\n\nCanadian Rx-Medz at your fingertip...,1.0,0.0,True,False,[],True,False,True
16098,"""Live. Love. Take Our Pills"" <psjay.spann@cybe...",user2.1@gvc.ceas-challenge.cc,2008-08-06 21:08:28+02:00,The Most Trustworthy Online Drugstore,At Canadian Healthcare you will be able to fin...,1.0,1.0,True,True,[],True,False,True
12705,"""Ulyana M."" <info@jaimeferrer.com>",Hans <kalinduv@gvc.ceas-challenge.cc>,2008-08-06 16:28:46+04:00,First Touch - First Kiss,"\nPrivet, my gentleman!\n\nI am very much inte...",1.0,1.0,True,True,[],True,False,True
22607,<>,user2.8@gvc.ceas-challenge.cc,2008-08-07 04:24:55-05:00,"Ticket: plane, train. Sale",Food to your door step http://www.bandquick.co...,1.0,1.0,False,True,[],False,True,True
22748,<>,user3@gvc.ceas-challenge.cc,2008-08-07 04:33:57-05:00,We fix your organism,New stand up comedy show http://www.straplike....,1.0,1.0,False,True,[],False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3810,sinh <>,user2.8@gvc.ceas-challenge.cc,2008-08-06 06:29:53+03:00,Helping Men with low sex drives,"\nDoes your manhood feels heavy, we can make i...",1.0,1.0,False,True,[],True,False,True
9980,stelma <>,user2.2@gvc.ceas-challenge.cc,2008-08-06 10:46:40+02:00,I love to bang her,\nMen who ignore this report will regret http:...,1.0,1.0,False,True,[],True,False,True
9753,thayera@farnell.com,user2.13@gvc.ceas-challenge.cc,2008-08-06 01:04:30-07:00,Greetings from...?,\nYour neighbor is delivering you an Ecard fro...,1.0,1.0,True,True,[],True,False,True
102,user2.13@gvc.ceas-challenge.cc,user2.13@gvc.ceas-challenge.cc,2008-08-05 14:36:16-04:00,Angelina Jolie Free Video.,\n\n\n\n\n\n\n\n\n\n\n\nFree Video Nude Anjeli...,1.0,0.0,True,True,[],True,False,True


#### **Checking if a Receiver has Suspicious Transactions**

I randomly selected a receiver from the dataset and checked if they have a history of suspicious transactions. 

In [145]:
random_receiver = df["receiver"].dropna().sample(1).values[0]
print(f"Email: {random_receiver}")
has_suspicious_transactions = df_cleaned[df_cleaned['receiver'] == random_receiver]['is_suspicious'].any()
print(f"Has a history of suspicions transactions?  {has_suspicious_transactions}")

Email: user2.6@gvc.ceas-challenge.cc
Has a history of suspicions transactions?  True


## Exercise 4: File Handling and Reporting
- Generate a summary report of fraudulent transactions and save it to a JSON file.
- Create a function that reads the JSON report and prints key insights.

### **Generating a Fraudulent Transaction Report**

In this section, I generated a report summarizing the key insights regarding fraudulent transactions in the dataset. First, I filtered the DataFrame `df_cleaned` to isolate only the fraudulent transactions (where the label equals 1). Then, I calculated the following metrics:

- **Total Fraudulent Transactions**: The total number of fraudulent transactions.
- **Fraudulent Senders**: The number of occurrences of each sender involved in fraudulent transactions.
- **Fraudulent Receivers**: The number of occurrences of each receiver involved in fraudulent transactions.
- **Suspicious Transactions**: The total number of fraudulent transactions that were flagged as suspicious.
- **Repeated Transactions**: The total number of fraudulent transactions flagged as repeated.

Finally, I saved the results into a JSON file (`fraudulent_report.json`) for further analysis or reporting purposes.


In [146]:
fraudulent_transactions = df_cleaned[df_cleaned['label'] == 1]

report = {
    'total_fraudulent_transactions': int(len(fraudulent_transactions)),  
    'fraudulent_senders': fraudulent_transactions['sender'].value_counts().to_dict(),
    'fraudulent_receivers': fraudulent_transactions['receiver'].value_counts().to_dict(),
    'suspicious_transactions': int(fraudulent_transactions['is_suspicious'].sum()),  
    'repeated_transactions': int(fraudulent_transactions['repeated_transaction'].sum()),  
}

with open('fraudulent_report.json', 'w') as f:
    json.dump(report, f, indent=4)

### **Displaying the Fraudulent Transaction Report**

In this part of the project, I implemented a function `print_fraud_report()` to read and display the details of the fraudulent transaction report generated earlier.

In [147]:
def print_fraud_report(file_path):
    with open(file_path, 'r') as f:
        report = json.load(f)
    
    print("Fraudulent Transactions Report:")
    print(f"Total Fraudulent Transactions: {report['total_fraudulent_transactions']}")
    print(f"Suspicious Transactions: {report['suspicious_transactions']}")
    print(f"Repeated Transactions: {report['repeated_transactions']}")
    print("\nFraudulent Senders:")
    for sender, count in report['fraudulent_senders'].items():
        print(f"{sender}: {count} occurrences")
    
    print("\nFraudulent Receivers:")
    for receiver, count in report['fraudulent_receivers'].items():
        print(f"{receiver}: {count} occurrences")

print_fraud_report('fraudulent_report.json')

Fraudulent Transactions Report:
Total Fraudulent Transactions: 21812
Suspicious Transactions: 1074
Repeated Transactions: 466

Fraudulent Senders:
<>: 44 occurrences
Google AdWords-noreply <adwords-noreply@google.com>: 36 occurrences
Google AdWords <reactivation@google.com>: 30 occurrences
Google-AdWords-noreply <adwords-noreply@google.com>: 29 occurrences
Google-AdWords <reactivation@google.com>: 25 occurrences
Google-AdWords-Noreply <support@google.com>: 23 occurrences
AdWords-NoReplay <adwords-noreply@google.com>: 21 occurrences
Google AdWords <adwords-noreply@google.com>: 16 occurrences
Google-AdWords <adwords-noreply@google.com>: 16 occurrences
Google-AdWords <support-adwords@google.com>: 12 occurrences
Chase Bridges <Chase@sakai.zaq.ne.jp>: 10 occurrences
Jason Stern <Jason@ijablonec.cz>: 8 occurrences
Mac Justice <Mac@prin.edu>: 7 occurrences
Tonya Pierson <Tonya@comp.state.md.us>: 7 occurrences
Polly Posey <Polly@etma.pt>: 7 occurrences
Heidi Neely <Heidi@fang.enta.net>: 7 occu

### **Exercise 5: Improve Fraud Detection using Data Patterns**
- **Topics:** Fraud detection, anomaly detection, historical analysis

### **Identifying Fraudulent Email Domains**

In this section, I extracted the domain from the sender and receiver email addresses in the suspicious transactions using a function called `extract_domain()`. I then checked whether these domains had a history of fraud by examining their associated transactions. A list of domains with fraud history was created and stored in a new DataFrame, `domain_fraud_df`, ensuring there were no duplicates. Finally, I filtered this DataFrame to include only domains linked to fraudulent activity, allowing me to identify the domains that were more likely to be involved in phishing attempts.


In [149]:

def extract_domain(email):
    if isinstance(email, str) and email.strip():
        match = re.search(r"@([\w.-]+)", email)
        if match:
            return match.group(1)
    return None

domain_fraud_history = []

for index, row in df.iterrows():
    if row['is_suspicious']:  
        sender_domain = extract_domain(row['sender'])
        receiver_domain = extract_domain(row['receiver'])

        sender_fraud = row['fraud_history'] if sender_domain else False
        receiver_fraud = row['fraud_history'] if receiver_domain else False

        if sender_domain:
            domain_fraud_history.append((sender_domain, sender_fraud))
        if receiver_domain:
            domain_fraud_history.append((receiver_domain, receiver_fraud))

domain_fraud_df = pd.DataFrame(domain_fraud_history, columns=['domain', 'fraud_history'])

domain_fraud_df = domain_fraud_df.drop_duplicates()

domain_fraud_df


Unnamed: 0,domain,fraud_history
0,mothercare.co.uk,False
1,gvc.ceas-challenge.cc,False
2,cyberspann.com,False
4,jaimeferrer.com,False
6,gvc.ceas-challenge.cc,True
...,...,...
2057,ashaun.com,False
2059,wfreezesite.com,True
2067,mortgages-direct.com,False
2072,farnell.com,False
