# Email Address Scraper

## Understanding the Imports

**Purpose:** This code imports necessary libraries for interacting with email, handling data, and performing text manipulation.

### Imported Libraries:

* **imaplib:** Used for interacting with IMAP email servers. This library allows you to connect to an email account, search for emails, and retrieve email data.
* **email:** Provides tools for parsing and handling email messages. It helps in extracting information like sender, recipient, subject, and message content from emails.
* **email.header:** Specifically used for decoding email headers, which might contain encoded characters or special formatting.
* **yaml:** Used for working with YAML (YAML Ain't Markup Language) data format. YAML is often used for configuration files or storing structured data.
* **re:** Provides regular expression functionalities for pattern matching and text manipulation.
* **csv:** Used for reading and writing data in CSV (Comma-Separated Values) format.
* **datetime:** Offers classes for manipulating dates and times, allowing you to work with date and time-related operations.

**Overall, these imports suggest that the code likely involves:**

* Fetching emails from an IMAP server
* Parsing email content
* Processing email data using regular expressions
* Handling configuration data (possibly from a YAML file)
* Creating or processing CSV files
* Performing date and time calculations



In [43]:
import imaplib
import email
from email.header import decode_header
import yaml
import re
import csv
from datetime import datetime

## Loading Credentials

### Purpose:
This code block loads credentials for accessing an email account from a YAML file.

### Steps:
1. **Import necessary library:**
   * `import yaml`: Imports the YAML library for loading configuration data from a YAML file.

2. **Specify credentials file path:**
   * `credentials_path = r"C:\Users\ASUS\OneDrive\Pictures\credentials Manufacturing.yml"`: Sets the path to the YAML file containing the credentials. The `r` prefix indicates a raw string, preventing potential escape sequence issues.

3. **Open the credentials file:**
   * `with open(credentials_path) as f:`: Opens the specified YAML file in read mode using a `with` statement, ensuring proper file closure even in case of exceptions.

4. **Load YAML content:**
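   * `my_credentials = yaml.load(f, Loader=yaml.FullLoader)`: Loads the contents of the YAML file into the `my_credentials` dictionary using the `yaml.load` function. `FullLoader` supports the full YAML language while avoiding arbitrary code execution, but for a plain key-value file like this one, `yaml.safe_load(f)` is the more conservative choice and is what PyYAML recommends for untrusted input.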

5. **Extract user and password:**
   * `user, password = my_credentials["user"], my_credentials["password"]`: Extracts the `user` and `password` values from the loaded YAML data and assigns them to respective variables.

### Assumptions:
* The YAML file (`credentials Manufacturing.yml`) is located at the specified path and contains a dictionary with keys `user` and `password`.
* The values for `user` and `password` are valid credentials for an email account.

### Potential Improvements:
* **Error handling:** Consider adding error handling to gracefully handle cases where the file doesn't exist, is inaccessible, or the YAML content is malformed.
* **Security:** For production environments, storing credentials in plain text YAML files is generally not recommended. Consider using encrypted storage or environment variables.
* **Code clarity:** Using more descriptive variable names like `credentials_file_path` instead of `credentials_path` can improve code readability.
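The environment-variable option mentioned above could look roughly like the following. This is a minimal sketch, not part of the original notebook: the function name `load_credentials_from_env` and the variable names `EMAIL_USER` and `EMAIL_PASSWORD` are assumptions chosen for illustration.

```python
import os


def load_credentials_from_env(env=os.environ):
    """Return (user, password) read from environment variables.

    EMAIL_USER and EMAIL_PASSWORD are hypothetical names chosen for
    this sketch; they are not defined anywhere in the original script.
    """
    missing = [key for key in ("EMAIL_USER", "EMAIL_PASSWORD") if not env.get(key)]
    if missing:
        # Fail early with a clear message instead of a KeyError later on
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["EMAIL_USER"], env["EMAIL_PASSWORD"]


# Demonstration with a plain dict standing in for os.environ
demo_env = {"EMAIL_USER": "someone@example.com", "EMAIL_PASSWORD": "app-password"}
user, password = load_credentials_from_env(demo_env)
print(user)
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.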

In [44]:
# Load credentials
credentials_path = "Enter_your_pathway_here"  # replace with the path to your credentials YAML file
with open(credentials_path) as f:
    my_credentials = yaml.load(f, Loader=yaml.FullLoader)

user, password = my_credentials["user"], my_credentials["password"]

## Connecting to the IMAP Server

### Code Explanation:

This code establishes a connection to the Gmail IMAP server using the provided credentials.

1. **Specify IMAP URL:**
   * `imap_url = 'imap.gmail.com'`: Defines the IMAP server address for Gmail.

2. **Create IMAP4_SSL object:**
   * `my_mail = imaplib.IMAP4_SSL(imap_url)`: Creates an IMAP4_SSL object to interact with the specified IMAP server using SSL encryption.

3. **Login to the email account:**
   * `my_mail.login(user, password)`: Logs in to the email account using the previously loaded `user` and `password` credentials.

### Important Notes:

* **Security:** Using plain text passwords is generally not recommended for production environments. Consider using encrypted storage or environment variables to protect sensitive information.
* **Two-Factor Authentication:** If you have two-factor authentication enabled for your Gmail account, you'll need to generate an app password to use with IMAP.
* **Error Handling:** It's essential to implement error handling to catch potential exceptions like connection errors or authentication failures.

### Example with Error Handling:

```python
import imaplib
import email
from email.header import decode_header
import yaml
import re
import csv
from datetime import datetime

# Load credentials
credentials_path = "enter_your_pathway_here"  # replace with the path to your credentials YAML file
with open(credentials_path) as f:
    my_credentials = yaml.load(f, Loader=yaml.FullLoader)

user, password = my_credentials["user"], my_credentials["password"]

imap_url = 'imap.gmail.com'

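try:
    my_mail = imaplib.IMAP4_SSL(imap_url)
    my_mail.login(user, password)
    print("Logged in successfully")
except imaplib.IMAP4.error as e:
    # Raised by imaplib for protocol-level failures such as bad credentials
    print("Error logging in:", e)
except OSError as e:
    # Raised for network problems (DNS failure, refused connection, timeout)
    print("Error connecting to the server:", e)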
```


In [45]:
imap_url = 'imap.gmail.com'
my_mail = imaplib.IMAP4_SSL(imap_url)
my_mail.login(user, password)

('OK', [b'suranjan.d@flipcarbon.info authenticated (Success)'])

## Selecting the Sent Mail Folder and Listing Email IDs

### Purpose:
This code snippet selects the "Sent Mail" folder in a Gmail account and fetches a list of email IDs for all emails within that folder.

### Breakdown:
1. **Selecting the Sent Mail folder:**
   * `my_mail.select('"[Gmail]/Sent Mail"')`: Selects the "Sent Mail" folder in the connected email account. The `select` method returns a tuple containing the status and a list whose single element is the number of messages in the mailbox.

2. **Searching for all emails:**
   * `_, data = my_mail.search(None, 'ALL')`: Searches for all emails in the currently selected folder. The `search` method returns a tuple of the status and a one-element list holding a single bytes string of space-separated message sequence numbers. The status is discarded using `_`, and the list is assigned to `data`.
   * `mail_ids = data[0].split()`: Splits that bytes string on whitespace, producing a list of individual IDs as bytes objects (for example `b'1'`, `b'2'`).

### Key Points:
* The code assumes that the user has successfully logged in to the email account using the `imaplib.IMAP4_SSL` and `login` methods in the previous steps.
* The `ALL` search criteria retrieves all emails in the selected folder. More specific search criteria can be used to filter emails based on various conditions.
* The `mail_ids` list now holds the ID of every email in the "Sent Mail" folder, each as a bytes object. These IDs are used later to fetch the email content.
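As an illustration of a more specific search, the sketch below builds a `SINCE` criterion and shows how the search response is split into IDs. The helper `since_criterion` and the sample response are assumptions made for this example; no server is contacted.

```python
from datetime import date

# IMAP dates use English month abbreviations regardless of locale,
# so the names are spelled out instead of relying on strftime("%b").
IMAP_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def since_criterion(day):
    """Build an IMAP SINCE search criterion (hypothetical helper)."""
    return f"SINCE {day.day:02d}-{IMAP_MONTHS[day.month - 1]}-{day.year}"


criterion = since_criterion(date(2024, 1, 1))
print(criterion)  # SINCE 01-Jan-2024

# With a live connection this would replace the 'ALL' search:
#   _, data = my_mail.search(None, criterion)

# Hand-written sample of the response shape: one bytes string in a list
sample_data = [b"1 2 3 10"]
sample_ids = sample_data[0].split()
print(sample_ids)  # [b'1', b'2', b'3', b'10']
```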


In [46]:
# Select the Sent Mail folder
my_mail.select('"[Gmail]/Sent Mail"')

# Search for all emails in Sent Mail
_, data = my_mail.search(None, 'ALL')
mail_ids = data[0].split()

## Understanding the `extract_email_addresses` Function

### Purpose:
This function extracts email addresses from a given text string.

### Breakdown:
1. **Import Regular Expression Library:**
   * `import re`: Imports the `re` module, which provides regular expression functionalities.

2. **Define the Function:**
   * `def extract_email_addresses(text):`: Defines a function named `extract_email_addresses` that takes a text string as input.

3. **Regular Expression Pattern:**
   * `pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'`: Defines a regular expression pattern to match email addresses. This pattern is designed to capture email addresses in a standard format.

4. **Finding Email Addresses:**
   * `return re.findall(pattern, text)`: Uses the `re.findall` function to find all occurrences of the specified pattern within the input text and returns them as a list of email addresses.

### Explanation of the Regular Expression:
* `\b`: Matches a word boundary, ensuring that the email address is a standalone word.
* `[A-Za-z0-9._%+-]+`: Matches one or more characters from the specified set (letters, numbers, dot, underscore, percent, plus, and hyphen).
* `@`: Matches the literal at symbol.
* `[A-Za-z0-9.-]+`: Matches one or more characters from the specified set (letters, numbers, dot, and hyphen).
* `\.`: Matches a literal dot.
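* `[A-Z|a-z]{2,}`: Matches two or more characters for the top-level domain. Inside a character class, `|` is a literal pipe rather than alternation, so this class accepts letters and also `|`. The intended form is `[A-Za-z]{2,}`.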
* `\b`: Matches a word boundary, ensuring the email address ends properly.
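The effect of the `|` inside the character class can be checked directly. The snippet below is a small experiment, assuming a made-up input string; `original` is the pattern from the notebook and `fixed` is the same pattern with `[A-Za-z]` for the top-level domain.

```python
import re

original = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
fixed = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# The second address contains a stray pipe in its top-level domain
text = "Contact alice@example.com or bob@example.co|m today"

original_matches = re.findall(original, text)
fixed_matches = re.findall(fixed, text)

print(original_matches)  # ['alice@example.com', 'bob@example.co|m']
print(fixed_matches)     # ['alice@example.com', 'bob@example.co']
```

With the original class, the pipe is swallowed into the match; with the corrected class, the match stops before it.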

### Potential Improvements:
* **Handle different email formats:** The current pattern does not cover every valid address format, such as quoted local parts or internationalized (non-ASCII) domain names.
* **Validate email addresses:** Consider adding a validation step to verify the extracted email addresses using a more robust email validation library.
* **Optimize performance:** For large text inputs, explore optimization techniques like using compiled regular expressions or alternative approaches.


In [47]:
# Function to extract email addresses from a string
def extract_email_addresses(text):
    # [A-Za-z] rather than [A-Z|a-z]: inside a character class, | is a literal pipe
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)

## Improving the Email Extraction Function

**The provided code is a good starting point, but it can be enhanced.**

### Issues with the Current Regex:
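* **Limited email format coverage:** It misses valid addresses that fall outside its character sets, such as quoted local parts or internationalized (non-ASCII) domains.
* **Potential false positives:** It can match strings that resemble email addresses but are not valid, such as `a..b@example.com` with consecutive dots. The `[A-Z|a-z]` class also accepts a literal `|` in the top-level domain.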

### Improved Function:

```python
import re

def extract_email_addresses(text):
  """Extracts email addresses from a given text string.

  Args:
    text: The input text string.

  Returns:
    A list of extracted email addresses.
  """

  # In verbose mode, whitespace and newlines inside the pattern are ignored.
  email_regex = r"""
  \b
  [a-zA-Z0-9._%+-]+   # local part
  @
  [a-zA-Z0-9.-]+      # domain, including any subdomains
  \.
  [a-zA-Z]{2,}        # top-level domain, no upper length limit
  \b
  """

  return re.findall(email_regex, text, re.VERBOSE)
```

### Explanation of Improvements:
* **Verbose mode:** Using `re.VERBOSE` lets the pattern span several lines with inline comments, which makes each component easier to read.
* **Corrected character class:** The top-level domain class is `[a-zA-Z]` rather than `[A-Z|a-z]`, so a stray `|` is no longer accepted. The `{2,}` quantifier keeps long top-level domains such as `.museum` matching; an upper bound like `{2,4}` would reject them.
* **Clearer function definition:** The function now includes a docstring explaining its purpose and parameters.

### Additional Considerations:
* **Email validation:** For critical applications, consider using a dedicated email validation library to verify the extracted email addresses.
* **Performance optimization:** If dealing with large text volumes, explore compiled regular expressions or alternative approaches for efficiency.
* **Complex email formats:** Be aware that some email addresses might have unusual formats that this regex might not capture.
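For header fields such as `To`, one alternative to a regex is the standard library's `email.utils.getaddresses`, which parses display names and angle-bracket addresses. The sketch below is an illustration only; the wrapper `extract_header_addresses` and the sample header are assumptions, not part of the original notebook.

```python
from email.utils import getaddresses


def extract_header_addresses(header_value):
    """Return the bare addresses from a To/Cc header value (hypothetical helper)."""
    if not header_value:
        # A missing header comes back as None from email_msg['To']
        return []
    # getaddresses returns (display_name, address) pairs
    return [addr for _name, addr in getaddresses([header_value]) if addr]


to_header = '"Doe, John" <john.doe@example.com>, jane@example.org'
parsed = extract_header_addresses(to_header)
print(parsed)  # ['john.doe@example.com', 'jane@example.org']
print(extract_header_addresses(None))  # []
```

Unlike the regex, this handles a comma inside a quoted display name and tolerates a missing header.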


In [48]:
# Function to extract email addresses from a string
def extract_email_addresses(text):
    # [A-Za-z] rather than [A-Z|a-z]: inside a character class, | is a literal pipe
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)

## The `extract_name` Function

### Function: `extract_name(body)`

**Purpose:**
Extracts the name from the first line of an email body.

**Arguments:**
* `body`: The email body as a string.

**Returns:**
* The extracted name as a string, or an empty string if no name is found.

**Explanation:**
1. **Extracts the first line:** Splits the email body into lines and takes the first one.
2. **Searches for a salutation:** Applies the regular expression `Dear\s+(\w+)` to the first line, ignoring case. Only "Dear" is recognized; greetings such as "Hi" or "Hello" are not handled.
3. **Captures a single word:** The group `(\w+)` captures only the first word after "Dear", so "Dear John Doe" yields "John".
4. **Returns the extracted name:** If a match is found, returns the captured word exactly as written (no case conversion). Otherwise, returns an empty string.

**Example Usage:**
```python
email_body = "Dear John Doe, \n\nHow are you?"
name = extract_name(email_body)
print(name)  # Output: John
```

**Note:**
* This function assumes a common email format and might not accurately extract names in all cases.
* For more complex name extraction scenarios, consider using natural language processing techniques or machine learning.
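Short of full natural language processing, a slightly broader regex can cover more greetings and multi-word names. The following is a sketch under stated assumptions: `extract_full_name` is a hypothetical variant, not the notebook's `extract_name`, and it assumes names consist of ASCII letters separated by spaces.

```python
import re

# Accepts "Dear", "Hi", or "Hello" followed by one or more alphabetic words
SALUTATION = re.compile(
    r'^\s*(?:Dear|Hi|Hello)\s+([A-Za-z]+(?:[ \t]+[A-Za-z]+)*)',
    re.IGNORECASE,
)


def extract_full_name(body):
    """Return the name after a greeting on the first line, in title case."""
    first_line = body.split('\n')[0]
    match = SALUTATION.match(first_line)
    return match.group(1).title() if match else ''


print(extract_full_name("Dear John Doe, \n\nHow are you?"))  # John Doe
print(extract_full_name("hi alice\nQuick question"))         # Alice
print(repr(extract_full_name("Thanks for the update")))      # ''
```

Note that `str.title()` lowercases inner capitals (for example "McDonald" becomes "Mcdonald"), so it may be preferable to return the name as written.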



In [49]:
# Function to extract name from the first line of the email body
def extract_name(body):
    first_line = body.split('\n')[0]
    match = re.search(r'Dear\s+(\w+)', first_line, re.IGNORECASE)
    if match:
        return match.group(1)
    return ''

## Function: `decode_subject(subject)`

**Purpose:** Decodes email subject headers.

**Args:**
* `subject`: Encoded email subject string.

**Returns:**
* Decoded subject string.

**Explanation:**
1. **Decode header:** Breaks down the subject into parts and their encodings using `decode_header`.
2. **Iterate and decode:** Loops through each part. If it's bytes, decodes it using the specified encoding or UTF-8. Otherwise, appends the part as is.
3. **Return decoded subject:** Combines all decoded parts into a single string and returns it.
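To see the decoding in action without a mail server, the sketch below encodes a non-ASCII subject with `email.header.Header` and passes the result through the same `decode_subject` logic. The sample subjects are made up for illustration.

```python
from email.header import Header, decode_header


def decode_subject(subject):
    """Same logic as the notebook's decode_subject, repeated to keep this block self-contained."""
    decoded_subject = ''
    for part, encoding in decode_header(subject):
        if isinstance(part, bytes):
            decoded_subject += part.decode(encoding or 'utf-8')
        else:
            decoded_subject += part
    return decoded_subject


# Build an RFC 2047 encoded-word the way a mail client would
encoded = Header("Café menu", "utf-8").encode()
print(encoded)                  # an encoded-word starting with =?utf-8?
print(decode_subject(encoded))  # Café menu

# Plain ASCII subjects pass through unchanged
print(decode_subject("Weekly report"))  # Weekly report
```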


In [50]:
# Function to decode the subject
def decode_subject(subject):
    decoded_parts = decode_header(subject)
    decoded_subject = ''
    for part, encoding in decoded_parts:
        if isinstance(part, bytes):
            decoded_subject += part.decode(encoding or 'utf-8')
        else:
            decoded_subject += part
    return decoded_subject

## Processing Emails in Batches

### Purpose:
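This code processes emails in batches so that each IMAP `fetch` call retrieves up to 100 messages instead of one. From each email it extracts the recipient addresses (the `To` header), the subject, and a name taken from the body's salutation, and stores one entry per recipient in a list.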

### Step-by-Step Explanation:

**1. Import necessary libraries:**
   * This step is assumed to be done before this code block.

**2. Set batch size:**
   * `batch_size = 100`: Defines the number of emails to process in each batch.

**3. Initialize email data list:**
   * `email_data = []`: Creates an empty list to store extracted email data.

**4. Process emails in batches:**
   * `for i in range(0, len(mail_ids), batch_size):`: Iterates through the list of email IDs in steps of `batch_size`.
     * `batch = mail_ids[i:i+batch_size]`: Creates a list of email IDs for the current batch.
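     * `batch_set = ','.join(batch.decode('utf-8') for batch in batch)`: Decodes each ID from bytes to `str` and joins them into a comma-separated string for fetching. The loop variable reuses the name `batch`; this works because the list is evaluated before the loop starts, but a distinct name such as `mail_id` would be clearer.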
     * `_, msg_data = my_mail.fetch(batch_set, '(RFC822)')`: Fetches emails for the current batch using IMAP.
     * `for response_part in msg_data:`: Iterates through the fetched email data.
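       * `if isinstance(response_part, tuple):`: Keeps only the tuple entries. `fetch` returns message data as tuples, interleaved with plain bytes entries such as `b')'` that carry no message content.
         * `email_body = response_part[1]`: Takes the raw message bytes (headers and body together, since `RFC822` requests the whole message).
         * `email_msg = email.message_from_bytes(email_body)`: Parses the raw bytes into an email message object.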
         * **Extract email information:**
           * `subject = decode_subject(email_msg['Subject'])`: Decodes the email subject.
           * `to_addresses = extract_email_addresses(email_msg['To'])`: Extracts email addresses from the 'To' field.
           * `name = ''`, `body = ''`: Initialize variables for name and body.
           * **Handle multipart emails:**
             * `if email_msg.is_multipart():`: Checks if the email is multipart.
               * Iterates through email parts, looking for the plain text part.
               * Extracts the body and name from the plain text part.
           * **Handle single-part emails:**
             * Extracts the body and name from the email directly.
         * **Store email data:**
           * Iterates through extracted email addresses.
           * Creates a dictionary for each email with address, name, and subject.
           * Appends the dictionary to the `email_data` list.
     * `print(f"Processed {len(batch)} emails. Total entries: {len(email_data)}")`: Prints a progress message.

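**5. About the `break`:**
   * The only `break` is inside the multipart loop. It stops at the first `text/plain` part, so later plain-text parts (for example attached text files) do not overwrite the body. It does not end batch processing.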
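The batching step on its own can be sketched as a small generator. The name `iter_batch_sets` is an assumption introduced for this example; the slicing and joining mirror the loop described above.

```python
def iter_batch_sets(mail_ids, batch_size):
    """Yield comma-separated ID strings, one per batch (hypothetical helper)."""
    for start in range(0, len(mail_ids), batch_size):
        batch = mail_ids[start:start + batch_size]
        # IDs arrive as bytes from search(); fetch() accepts a str message set
        yield ','.join(mail_id.decode('utf-8') for mail_id in batch)


sample_ids = [b'1', b'2', b'3', b'4', b'5']
batch_sets = list(iter_batch_sets(sample_ids, 2))
print(batch_sets)  # ['1,2', '3,4', '5']
```

Each yielded string could then be passed to `my_mail.fetch(batch_set, '(RFC822)')` as in the original loop.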

### Key Points:
* Processes emails in batches to optimize performance.
* Extracts the recipient addresses, the subject, and a name from the salutation; the body itself is read but not stored.
* Handles multipart emails to extract plain text content.
* Stores extracted data in a list for further processing.
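The multipart handling can be tried offline by building a message in memory. This is a minimal sketch: the helper `first_plain_text` and the sample message are assumptions for illustration, and the logic follows the same `walk()` / `get_payload(decode=True)` calls used in the loop.

```python
import email
from email.message import EmailMessage


def first_plain_text(msg):
    """Return the first text/plain body of a message, or '' if there is none."""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                return part.get_payload(decode=True).decode('utf-8', errors='ignore')
        return ''
    payload = msg.get_payload(decode=True)
    return payload.decode('utf-8', errors='ignore') if payload else ''


# Build a multipart/alternative message with plain-text and HTML versions
outgoing = EmailMessage()
outgoing['To'] = 'john@example.com'
outgoing['Subject'] = 'Greetings'
outgoing.set_content("Dear John,\n\nHow are you?")
outgoing.add_alternative("<p>Dear John,</p><p>How are you?</p>", subtype='html')

# Round-trip through bytes, as if the message had been fetched over IMAP
parsed = email.message_from_bytes(outgoing.as_bytes())
body = first_plain_text(parsed)
print(parsed.is_multipart())
print(body)
```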


In [51]:
# Process emails in batches
batch_size = 100
email_data = []

for i in range(0, len(mail_ids), batch_size):
    batch = mail_ids[i:i+batch_size]
    batch_set = ','.join(batch.decode('utf-8') for batch in batch)
    
    _, msg_data = my_mail.fetch(batch_set, '(RFC822)')
    
    for response_part in msg_data:
        if isinstance(response_part, tuple):
            email_body = response_part[1]
            email_msg = email.message_from_bytes(email_body)
            
            # Extract and decode subject
            subject = decode_subject(email_msg['Subject'])
            
            # Extract email addresses from 'To' field
            to_addresses = extract_email_addresses(email_msg['To'])
            
            # Extract name and body
            name = ''
            body = ''
            if email_msg.is_multipart():
                for part in email_msg.walk():
                    if part.get_content_type() == 'text/plain':
                        body = part.get_payload(decode=True).decode('utf-8', errors='ignore')
                        name = extract_name(body)
                        break
            else:
                body = email_msg.get_payload(decode=True).decode('utf-8', errors='ignore')
                name = extract_name(body)
            
            # Store data for each email
            for address in to_addresses:
                email_data.append({
                    'email': address,
                    'name': name,
                    'subject': subject
                })

    print(f"Processed {len(batch)} emails. Total entries: {len(email_data)}")

Processed 100 emails. Total entries: 100
Processed 100 emails. Total entries: 200
Processed 100 emails. Total entries: 300
Processed 100 emails. Total entries: 400
Processed 100 emails. Total entries: 500
Processed 100 emails. Total entries: 600
Processed 100 emails. Total entries: 700
Processed 100 emails. Total entries: 800
Processed 100 emails. Total entries: 900
Processed 100 emails. Total entries: 1000
Processed 100 emails. Total entries: 1100
Processed 100 emails. Total entries: 1200
Processed 100 emails. Total entries: 1300
Processed 100 emails. Total entries: 1400
Processed 100 emails. Total entries: 1500
Processed 100 emails. Total entries: 1600
Processed 100 emails. Total entries: 1700
Processed 100 emails. Total entries: 1800
Processed 100 emails. Total entries: 1900
Processed 100 emails. Total entries: 2000
Processed 100 emails. Total entries: 2100
Processed 100 emails. Total entries: 2200
Processed 100 emails. Total entries: 2300
Processed 100 emails. Total entries: 2400

## Generating a Filename with Timestamp

### Purpose:
Generates a filename with the current date and time in the format "sent_email_data_YYYYMMDD_HHMMSS.csv".

### Explanation:
1. **Import necessary library:**
   * `from datetime import datetime`: Imports the `datetime` class from the `datetime` module (done in the first cell), which is why the code calls `datetime.now()` directly.

2. **Get current time:**
   * `current_time = datetime.now().strftime("%Y%m%d_%H%M%S")`: Gets the current date and time and formats it as a string in the specified format.
     * `datetime.now()`: Gets the current datetime object.
     * `strftime("%Y%m%d_%H%M%S")`: Formats the datetime object as year, month, and day, then a single underscore, then hour, minute, and second, with no other separators.

3. **Create filename:**
   * `filename = f"sent_email_data_{current_time}.csv"`: Creates the filename by combining the prefix "sent_email_data_", the formatted current time, and the ".csv" extension using f-strings for string interpolation.

### Example Output:
If the code is run at 11:30 AM on April 25, 2024, the `filename` variable will contain the string "sent_email_data_20240425_113000.csv".

### Additional Considerations:
* **Time zone:** `datetime.now()` returns local time with no time zone attached. To use a particular time zone, pass it through the `tz` argument, e.g. `datetime.now(tz=timezone.utc)`.
* **File path:** To specify a complete file path, combine the filename with a directory path using `os.path.join`.
* **File format:** Other formats (e.g., `.xlsx`, `.json`) are possible, but changing the extension alone does not change the contents; the writing code must produce the matching format.
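The time zone and file path considerations can be combined in a short sketch. The helper `build_filename` and the `exports` directory are assumptions made for this example.

```python
import os
from datetime import datetime, timezone


def build_filename(now, directory="."):
    """Return a timestamped CSV path inside the given directory (hypothetical helper)."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return os.path.join(directory, f"sent_email_data_{stamp}.csv")


# A fixed moment makes the result reproducible
fixed_moment = datetime(2024, 4, 25, 11, 30, 0)
fixed_path = build_filename(fixed_moment, "exports")
print(os.path.basename(fixed_path))  # sent_email_data_20240425_113000.csv

# Time-zone-aware "now" in UTC instead of naive local time
utc_path = build_filename(datetime.now(tz=timezone.utc), "exports")
print(utc_path)
```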



In [52]:
# Generate a filename with current date and time
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"sent_email_data_{current_time}.csv"

## Code Breakdown: Writing Data to CSV File

### Purpose:
Writes the extracted email data to a CSV file with specified fieldnames.

### Explanation:
1. **Open CSV file:**
   * `with open(filename, 'w', newline='', encoding='utf-8') as csvfile:`: Opens a CSV file for writing in the specified filename with appropriate encoding and newline handling.

2. **Define fieldnames:**
   * `fieldnames = ['email', 'name', 'subject']`: Defines the column headers for the CSV file.

3. **Create CSV writer:**
   * `writer = csv.DictWriter(csvfile, fieldnames=fieldnames)`: Creates a CSV writer object to write data to the CSV file based on the defined fieldnames.

4. **Write header:**
   * `writer.writeheader()`: Writes the header row to the CSV file.

5. **Write data:**
   * `for data in email_data:`: Iterates through the list of email data dictionaries.
     * `writer.writerow(data)`: Writes each dictionary as a row to the CSV file.

6. **Print confirmation:**
   * `print(f"Email data has been saved to {filename}")`: Prints a confirmation message indicating the saved file.

### Key Points:
* Uses `csv.DictWriter` for efficient writing of dictionary data to CSV.
* Specifies `newline=''` so the `csv` module controls line endings itself; without it, extra blank rows can appear between records on Windows.
* Includes UTF-8 encoding for handling different character sets.
* Writes a header row for better data organization.
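To inspect what `csv.DictWriter` produces without creating a file, the same calls can be pointed at an in-memory buffer. The sample row below is made up for illustration.

```python
import csv
import io

sample_rows = [
    {'email': 'alice@example.com', 'name': 'Alice', 'subject': 'Hello, world'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['email', 'name', 'subject'])
writer.writeheader()
writer.writerows(sample_rows)  # equivalent to calling writerow() in a loop

csv_text = buffer.getvalue()
print(csv_text)
# email,name,subject
# alice@example.com,Alice,"Hello, world"
```

The subject contains a comma, so the writer wraps it in quotes automatically; this is one reason to prefer the `csv` module over joining strings by hand.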


In [53]:
# Write data to CSV file
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['email', 'name', 'subject']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for data in email_data:
        writer.writerow(data)

print(f"Email data has been saved to {filename}")

Email data has been saved to sent_email_data_20240712_131802.csv


## Closing the IMAP Connection

### Code:
```python
my_mail.close()
my_mail.logout()
```

### Explanation:
These lines ensure a proper and clean termination of the IMAP connection. 

* **`my_mail.close()`:** This command closes the currently selected mailbox. It's essential to release resources associated with the mailbox.
* **`my_mail.logout()`:** This command terminates the connection to the IMAP server, releasing all resources and closing the network connection.

### Importance:
* **Resource management:** Closing the connection prevents resource leaks and ensures efficient use of system resources.
* **Connection limits:** Many IMAP servers limit the number of simultaneous connections per account, so logging out promptly avoids leaving idle sessions that count against that limit.
* **Error handling:** In case of exceptions or errors, it's crucial to close the connection to avoid resource issues and potential connection problems.

Placed at the end of the script, these lines close the connection after a successful run. They do not run if an earlier statement raises an exception, so guaranteed cleanup requires wrapping the work in `try`/`finally`.
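One way to make the cleanup run even when processing fails is a `try`/`finally` wrapper. The sketch below uses a stand-in object so it can run without a server; `run_with_cleanup` and `FakeMail` are assumptions introduced for this example.

```python
import imaplib


def run_with_cleanup(mail, work):
    """Run work(mail), then close the mailbox and log out even if work fails."""
    try:
        return work(mail)
    finally:
        try:
            mail.close()  # only valid while a mailbox is selected
        except imaplib.IMAP4.error:
            pass  # no mailbox was selected, so there is nothing to close
        mail.logout()


class FakeMail:
    """Stand-in for an imaplib connection that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def close(self):
        self.calls.append("close")

    def logout(self):
        self.calls.append("logout")


def failing_work(mail):
    raise ValueError("simulated failure while processing")


fake = FakeMail()
try:
    run_with_cleanup(fake, failing_work)
except ValueError:
    pass  # the error still propagates, but cleanup has already happened

print(fake.calls)  # ['close', 'logout']
```

With a real connection, `my_mail` would be passed in place of `fake`, and `work` would contain the select, search, and fetch steps.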


In [54]:
my_mail.close()
my_mail.logout()

('BYE', [b'LOGOUT Requested'])