From cba9de3e21e8e044af710376e94638239a158b0c Mon Sep 17 00:00:00 2001
From: dat-a-man <98139823+dat-a-man@users.noreply.github.com>
Date: Sun, 15 Oct 2023 05:23:27 +0000
Subject: [PATCH 1/2] Updated Inbox Readme

---
 sources/inbox/README.md | 162 ++++++++++++++--------------------------
 1 file changed, 57 insertions(+), 105 deletions(-)

diff --git a/sources/inbox/README.md b/sources/inbox/README.md
index d2d0327be..ea00952cf 100644
--- a/sources/inbox/README.md
+++ b/sources/inbox/README.md
@@ -1,133 +1,85 @@
 # Inbox Source

-This source provides functionalities to collect inbox emails, get the attachment as a [file items](../filesystem/README.md#the-fileitem-file-representation),
-and store all relevant email information in a destination. It utilizes the `imaplib` library to
-interact with the IMAP server, and `dlt` library to handle data processing and transformation.
+This source collects inbox emails, retrieves attachments, and stores relevant email data. It uses
+the imaplib library for IMAP interactions and the dlt library for data processing.

-## Prerequisites
+Sources and resources that can be loaded using this verified source are:

-- Python 3.x
-- `dlt` library (you can install it using `pip install dlt`)
-- destination dependencies, e.g. `duckdb` (`pip install duckdb`)
+| Name | Description | |-------------------|----------------------------------------------------| |
+inbox_source | Gathers inbox emails and saves attachments locally | | get_messages_uids | Retrieves
+messages UUIDs from the mailbox | | get_messages | Retrieves emails from the mailbox using given
+UIDs | | get_attachments | Downloads attachments from emails using given UIDs |

-## Installation
+## Initialize the pipeline

-Make sure you have Python 3.x installed on your system.
-
-Install the required library by running the following command:
-
-```shell
-pip install dlt[duckdb]
+```bash
+dlt init inbox duckdb
 ```

-## Initialize the source
+Here, we chose duckdb as the destination. Alternatively, you can also choose redshift, bigquery, or
+any of the other [destinations.](https://dlthub.com/docs/dlt-ecosystem/destinations/)

-Initialize the source with dlt command:
+## Grab Inbox credentials

-```shell
-dlt init inbox duckdb
-```
+To learn about grabbing the Inbox credentials and configuring the verified source, please refer to
+the
+[full documentation here.](https://dlthub.com/docs/dlt-ecosystem/verified-sources/inbox#grab-credentials)
+
+## Add credential

-## Set email account credentials
+1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can
+   securely store your access tokens and other sensitive information. It's important to handle this
+   file with care and keep it safe. Here's what the file looks like:

-1. Open `.dlt/secrets.toml`.
-1. Enter the email account secrets:
    ```toml
+   # put your secret values and credentials here
+   # do not share this file and do not push it to github
    [sources.inbox]
-   host = 'imap.example.com'
-   email_account = "example@example.com"
-   password = 'set me up!'
+   host = "Please set me up!" # The host address of the email service provider.
+   email_account = "Please set me up!" # Email account associated with the service.
+   password = "Please set me up!" # # APP Password for the above email account.
    ```

-Use [App password](#getting-gmail-app-password) to set the password for a Gmail account.
-
-## Usage
-
-1. Ensure that the email account you want to access allows access by less secure apps (or use an
-   [app password](#getting-gmail-app-password)).
-2. Replace the placeholders in `.dlt/secrets.toml` with your IMAP server hostname, email account
-   credentials.
-
-
-## Functionality
-You access the messages and attachments via `inbox_source` `dlt` source. This source exposes the resources
-as described below. Typically you'll pick one of those (ie. **messages**) and load it into a table with
-a specific name:
-```python
-# get messages resource from the source
-messages = inbox_source(
-    filter_emails=("astra92293@gmail.com", "josue@sehnem.com")
-).messages
-# configure the messages resource to not get bodies of the messages
-messages = messages(include_body=False).with_name("my_inbox")
-# load messages to "my_inbox" table
-load_info = pipeline.run(messages)
-```
-This way you can create several extract pipelines with different combination of filters and processing steps from a single `inbox_source`.
-
-### Additional `inbox_source` arguments
-Please refer to `inbox_source()` docstring for options to filter email messages and attachments by sender, date or mime type.
+1. Replace the host, email and password value with the [previously copied one](#grab-credentials) to
+   ensure secure access to your Inbox resources.

-### messages resource
+   > When adding the App Password, remove any spaces. For instance, "abcd efgh ijkl mnop" should be
+   > "abcdefghijklmnop".

-This resource fetches the corresponding email messages using their UID. It yields a dictionary containing email
-metadata such as message UID, message ID, sender, subject, date, modification date, content type,
-and email body.
+1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to
+   add credentials for your chosen destination, ensuring proper routing of your data to the final
+   destination.

-Please refer to **imap_read_messages()** example pipeline in **inbox_pipeline.py** that loads messages from particulars into
-**duckdb**.
+## Run the pipeline

-### attachments resource
+1. Before running the pipeline, ensure that you have installed all the necessary dependencies by
+   running the command:

-This resource extracts attachments from the email message using their UID. It connects to
-the IMAP server, fetches the email message by its UID, parses body and looks for attachments.
-It yields [file items](../filesystem/README.md#the-fileitem-file-representation) where attachments
-are loaded in the **file_content** field. The original email message is present in **message** filed.
-
-Please refer to **imap_get_attachments()** example pipeline in **inbox_pipeline.py** that get **pdf** attachments
-from emails from particular senders, parsed pdfs and writes content into **duckdb**.
-
-### uids resource
-
-This is a dlt resource that connects to the IMAP server, logs in to the email account, and fetches
-email messages from the specified folder ('INBOX' by default). It yields a dictionary containing
-only message UID. You typically do not need to use this resource.
-
-## Accessing Gmail Inbox
-
-To connect to the Gmail server, we need the below information.
+   ```bash
+   pip install -r requirements.txt
+   ```

-- SMTP server DNS. Its value will be 'imap.gmail.com' in our case.
-- SMTP server port. The value will be 993. This port is used for Internet message access protocol
-  over TLS/SSL.
+   Prerequisites for fetching messages differ by provider. For Gmail:

-### Set up Gmail with a third-party email client
+   - Python 3.x
+   - dlt library: pip install dlt
+   - PyPDF2: pip install PyPDF2
+   - Specific destinations, e.g., duckdb: pip install duckdb
+   - (Note: Confirm based on your service provider.)

-An app password is a 16-digit passcode that gives a less secure app or device permission to access
-your Google Account. App passwords can only be used with accounts that have 2-Step Verification
-turned on.
+1. Once the pipeline has finished running, you can verify that everything loaded correctly by using
+   the following command:

-Step 1: Create and use App Passwords
-1. Go to your Google Account.
-2. Select Security.
-3. Under "How you sign in to Google", select **2-Step Verification** -> Turn it on.
-4. Select again **2-Step Verification**.
-5. At the bottom of the page, select App passwords.
-6. Enter a name of device that helps you remember where you’ll use the app password.
-7. Select Generate.
-8. To enter the app password, follow the instructions on your screen. The app password is the
-   16-character code that generates on your device.
-9. Select Done.
+   ```bash
+   dlt pipeline show
+   ```

-Read more in
-[this article](https://pythoncircle.com/post/727/accessing-gmail-inbox-using-python-imaplib-module/)
-or
-[Google official documentation.](https://support.google.com/mail/answer/185833#zippy=%2Cwhy-you-may-need-an-app-password)
+   For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also
+   use any custom name instead.

-Step 2: Turn on IMAP in Gmail
-1. In Gmail, in the top right, click Settings -> See all settings.
-2. At the top, click the Forwarding and POP/IMAP tab.
-3. In the IMAP Access section, select Enable IMAP.
-4. At the bottom, click Save Changes.
+For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)

-Read more in [official Google documentation.](https://support.google.com/a/answer/9003945#zippy=%2Cstep-turn-on-imap-in-gmail)
\ No newline at end of file
+💡 To explore additional customizations for this pipeline, we recommend referring to the official DLT
+Inbox documentation. It provides comprehensive information and guidance on how to further customize
+and tailor the pipeline to suit your specific needs. You can find the DLT Inbox documentation in
+the [Setup Guide: Inbox.](https://dlthub.com/docs/dlt-ecosystem/verified-sources/inbox)

From 3158276668758b770dfbe3614916cb7b316997a0 Mon Sep 17 00:00:00 2001
From: AstrakhantsevaAA
Date: Thu, 19 Oct 2023 14:44:57 +0200
Subject: [PATCH 2/2] update

---
 sources/inbox/README.md | 38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/sources/inbox/README.md b/sources/inbox/README.md
index ea00952cf..1dca753c5 100644
--- a/sources/inbox/README.md
+++ b/sources/inbox/README.md
@@ -1,14 +1,18 @@
 # Inbox Source

-This source collects inbox emails, retrieves attachments, and stores relevant email data. It uses
-the imaplib library for IMAP interactions and the dlt library for data processing.
+
+This source provides functionalities to collect inbox emails, get the attachment as a [file items](../filesystem/README.md#the-fileitem-file-representation),
+and store all relevant email information in a destination. It utilizes the `imaplib` library to
+interact with the IMAP server, and `dlt` library to handle data processing and transformation.

 Sources and resources that can be loaded using this verified source are:

-| Name | Description | |-------------------|----------------------------------------------------| |
-inbox_source | Gathers inbox emails and saves attachments locally | | get_messages_uids | Retrieves
-messages UUIDs from the mailbox | | get_messages | Retrieves emails from the mailbox using given
-UIDs | | get_attachments | Downloads attachments from emails using given UIDs |
+| Name | Description |
+|-------------------|-------------------------------------------|
+| inbox_source | Gathers inbox emails and saves attachments locally |
+| uids | Retrieves messages UUIDs from the mailbox |
+| messages | Retrieves emails from the mailbox using given UIDs |
+| attachments | Downloads attachments from emails using given UIDs |

 ## Initialize the pipeline

@@ -16,7 +20,7 @@ UIDs | | get_attachments | Downloads attachments from emails using given UIDs |
 dlt init inbox duckdb
 ```

-Here, we chose duckdb as the destination. Alternatively, you can also choose redshift, bigquery, or
+Here, we chose `duckdb` as the destination. Alternatively, you can also choose `redshift`, `bigquery`, or
 any of the other [destinations.](https://dlthub.com/docs/dlt-ecosystem/destinations/)

 ## Grab Inbox credentials
@@ -40,7 +44,7 @@ the
    password = "Please set me up!" # # APP Password for the above email account.
    ```

-1. Replace the host, email and password value with the [previously copied one](#grab-credentials) to
+1. Replace the host, email and password value to
    ensure secure access to your Inbox resources.

    > When adding the App Password, remove any spaces. For instance, "abcd efgh ijkl mnop" should be
@@ -59,13 +63,19 @@ the
    pip install -r requirements.txt
    ```

-   Prerequisites for fetching messages differ by provider. For Gmail:
+   Prerequisites for fetching messages differ by provider.
+
+   - For Gmail:
+
+     `pip install google-api-python-client>=2.86.0`
+
+     `pip install google-auth-oauthlib>=1.0.0`
+
+     `pip install google-auth-httplib2>=0.1.0`
+
+   - For pdf parsing:

-   - Python 3.x
-   - dlt library: pip install dlt
-   - PyPDF2: pip install PyPDF2
-   - Specific destinations, e.g., duckdb: pip install duckdb
-   - (Note: Confirm based on your service provider.)
+     `pip install PyPDF2`

 1. Once the pipeline has finished running, you can verify that everything loaded correctly by using
    the following command:
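
The README text in this patch series asks the reader to remove any spaces from the App Password before storing it in `.dlt/secrets.toml` ("abcd efgh ijkl mnop" should become "abcdefghijklmnop"). A minimal Python sketch of that cleanup follows. The helper name `normalize_app_password` is hypothetical and is not part of `dlt` or the `inbox` source; it only illustrates the string handling described in the note.

```python
def normalize_app_password(raw: str) -> str:
    """Return the app password with all whitespace removed.

    Hypothetical helper for illustration only; it is not provided by
    dlt or by the inbox verified source.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike.
    return "".join(raw.split())


if __name__ == "__main__":
    # Example value taken from the README note in the patch.
    print(normalize_app_password("abcd efgh ijkl mnop"))
```

The cleaned value is what would be pasted into the `password` key under `[sources.inbox]`; whether a given provider accepts the spaced form as well is not stated in the patch.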