# Part 1

This part covers the following steps:

#### Step 1: Function Definition: extract_emails

#### Step 2: Email Extraction and Data Processing

#### Step 3: Creating a DataFrame and Saving to CSV

#### Step 4: Printing a Status Message

In [1]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O 

In [2]:
import os
import re
import pandas as pd  # Make sure to import pandas

def extract_emails(root_dir):
    """
    Traverse the directory containing the raw Enron dataset, visiting only the _sent_mail folders,
    and collect each email's relative file path and Message-ID.
    
    Args:
    - root_dir: The root directory of the Enron dataset.
    
    Returns:
    - A DataFrame with a 'file' column (relative path) and a 'message' column (Message-ID, or "Unknown" if none is found).
    """
    data = []  # This will store tuples of (file_path, message_id)
    for root, dirs, files in os.walk(root_dir):
        if '_sent_mail' in root:
            for file in files:
                try:
                    file_path = os.path.join(root, file)
                    relative_path = os.path.relpath(file_path, start=root_dir).replace('\\', '/')
                    with open(file_path, 'r', encoding='latin1') as email_file:
                        email_content = email_file.read()
                        message_id_match = re.search(r"Message-ID: <(\S+)>", email_content)
                        if message_id_match:
                            message_id = message_id_match.group(1)
                        else:
                            message_id = "Unknown"
                    data.append((relative_path, message_id))
                except Exception as e:
                    print(f"Error reading {file_path}: {e}")
    
    # Create a DataFrame from the collected data
    df = pd.DataFrame(data, columns=['file', 'message'])
    return df

# Example usage: build the DataFrame of file paths and Message-IDs
root_dir = 'C:\\Users\\birhanubt\\Desktop\\enron_mail_20150507'
df_emails = extract_emails(root_dir)

# Save the DataFrame to CSV
csv_file_path = 'C:\\Users\\birhanubt\\Desktop\\Enron_emails.csv'  # Update this to where you want to save the CSV
df_emails.to_csv(csv_file_path, index=False)

print(f"CSV file has been saved to {csv_file_path}")



CSV file has been saved to C:\Users\birhanubt\Desktop\Enron_emails.csv


In [3]:

# Display the first few rows of the DataFrame
print(df_emails.head())

                       file                                      message
0  allen-p/_sent_mail/1000_  13505866.1075863688222.JavaMail.evans@thyme
1  allen-p/_sent_mail/1001_  30922949.1075863688243.JavaMail.evans@thyme
2  allen-p/_sent_mail/1002_  30965995.1075863688265.JavaMail.evans@thyme
3  allen-p/_sent_mail/1003_  16254169.1075863688286.JavaMail.evans@thyme
4  allen-p/_sent_mail/1004_  17189699.1075863688308.JavaMail.evans@thyme


#### Checking that the dataset was saved correctly

In [4]:
import pandas as pd

# Specify the path to my CSV file
csv_file_path = 'C:\\Users\\birhanubt\\Desktop\\Enron_emails.csv'

# Read the CSV file into a DataFrame
df_emails = pd.read_csv(csv_file_path)

# Display the first few rows of the DataFrame
print(df_emails.head())


                       file                                      message
0  allen-p/_sent_mail/1000_  13505866.1075863688222.JavaMail.evans@thyme
1  allen-p/_sent_mail/1001_  30922949.1075863688243.JavaMail.evans@thyme
2  allen-p/_sent_mail/1002_  30965995.1075863688265.JavaMail.evans@thyme
3  allen-p/_sent_mail/1003_  16254169.1075863688286.JavaMail.evans@thyme
4  allen-p/_sent_mail/1004_  17189699.1075863688308.JavaMail.evans@thyme


In [7]:
print(df_emails.isnull().sum())


file       0
message    0
dtype: int64


In [8]:
# Identifying empty messages
df_emails['is_empty'] = df_emails['message'].apply(lambda x: x.strip() == '')

# Summarizing or filtering based on 'is_empty'
print(f"Number of empty messages: {df_emails['is_empty'].sum()}")


Number of empty messages: 0
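Because 'extract_emails' writes the string "Unknown" when no Message-ID is found, neither the null check nor the empty-string check above would flag a missing ID. A minimal sketch of two extra checks follows. The sample DataFrame and the helper names 'count_placeholder_ids' and 'count_duplicate_ids' are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample that mimics the two-column layout of df_emails
sample = pd.DataFrame({
    "file": [
        "allen-p/_sent_mail/1000_",
        "allen-p/_sent_mail/1001_",
        "allen-p/_sent_mail/1002_",
    ],
    "message": [
        "13505866.1075863688222.JavaMail.evans@thyme",
        "Unknown",  # what extract_emails stores when the regex finds nothing
        "30965995.1075863688265.JavaMail.evans@thyme",
    ],
})

def count_placeholder_ids(df, placeholder="Unknown"):
    """Count rows whose 'message' value is the fallback placeholder."""
    return int((df["message"] == placeholder).sum())

def count_duplicate_ids(df):
    """Count rows whose 'message' value repeats an earlier row."""
    return int(df["message"].duplicated().sum())

print(f"Number of placeholder IDs: {count_placeholder_ids(sample)}")
print(f"Number of duplicate IDs: {count_duplicate_ids(sample)}")
```

The same two helpers could be called on 'df_emails' directly once it has been loaded from the CSV.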


In the cells above, I created a function called 'extract_emails' that traverses the Enron dataset directory and visits only the '_sent_mail' folders. For each email file, the function reads the content and extracts the Message-ID header using a regular expression, falling back to "Unknown" when no match is found. It collects a list of tuples, each containing the relative file path and the Message-ID, and converts that list into a pandas DataFrame. The DataFrame is then saved as a CSV file called 'Enron_emails.csv'.
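To make the regular-expression step concrete, here is a small sketch that applies the same pattern used in 'extract_emails' to an in-memory string. The helper name 'find_message_id' and the sample email text are assumptions made for illustration; only the pattern and the "Unknown" fallback come from the code above.

```python
import re

# Same pattern as in extract_emails: capture the text between < and >
MESSAGE_ID_PATTERN = re.compile(r"Message-ID: <(\S+)>")

def find_message_id(email_text, default="Unknown"):
    """Return the first Message-ID found in email_text, or default if none."""
    match = MESSAGE_ID_PATTERN.search(email_text)
    return match.group(1) if match else default

# Illustrative email text, not taken from the dataset
sample_email = (
    "Message-ID: <13505866.1075863688222.JavaMail.evans@thyme>\n"
    "Subject: (illustrative)\n"
    "\n"
    "Body text.\n"
)

print(find_message_id(sample_email))
print(find_message_id("Subject: no id here\n\nBody text.\n"))
```

Compiling the pattern once at module level avoids recompiling it for every file, which may matter when walking many thousands of emails.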

For my project, this process is an initial step in preparing my dataset. By indexing the sent emails by file path and Message-ID, I am building the foundation for training a Seq2Seq model to generate responses to incoming emails. The extracted data will be further processed and cleaned as necessary before being used to train my machine learning model, and the saved CSV file will be loaded into my training pipeline to feed data into the Seq2Seq model.
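The CSV built in this part stores only file paths and Message-IDs, while a Seq2Seq model would also need the sender and the body text. One possible way to pull those out later is Python's standard 'email' parser. This is a sketch under that assumption, not the method used in this notebook; the function name 'parse_sender_and_body' and the sample message are hypothetical.

```python
from email import message_from_string

def parse_sender_and_body(raw_email):
    """Return (sender, body) parsed from a raw email string.

    The sender falls back to "Unknown" when there is no From header,
    and multipart messages yield an empty body in this simple sketch.
    """
    msg = message_from_string(raw_email)
    sender = msg.get("From", "Unknown")
    body = "" if msg.is_multipart() else msg.get_payload()
    return sender, body.strip()

# Illustrative raw email, not taken from the dataset
raw = (
    "Message-ID: <13505866.1075863688222.JavaMail.evans@thyme>\n"
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Test\n"
    "\n"
    "Hello,\n"
    "This is the body.\n"
)

sender, body = parse_sender_and_body(raw)
print(sender)
print(body)
```

Using the header parser instead of a hand-written regular expression handles folded headers and header-name casing, at the cost of parsing each file in full.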

Overall, the above work helps automate the extraction and organization of email data, which is essential for training a model to automate email responses using Seq2Seq learning.

# From here, please refer to 'Final Code Part 2'