diff --git a/sources/inbox/README.md b/sources/inbox/README.md
index d2d0327be..1dca753c5 100644
--- a/sources/inbox/README.md
+++ b/sources/inbox/README.md
@@ -1,133 +1,95 @@
 # Inbox Source
+
 This source provides functionalities to collect inbox emails, get the attachment as a [file items](../filesystem/README.md#the-fileitem-file-representation), and store all relevant email information in a destination. It utilizes the `imaplib` library to interact with the IMAP server, and `dlt` library to handle data processing and transformation.
-## Prerequisites
-
-- Python 3.x
-- `dlt` library (you can install it using `pip install dlt`)
-- destination dependencies, e.g. `duckdb` (`pip install duckdb`)
-
-## Installation
+Sources and resources that can be loaded using this verified source are:
-Make sure you have Python 3.x installed on your system.
+| Name | Description |
+|-------------------|-------------------------------------------|
+| inbox_source | Gathers inbox emails and saves attachments locally |
+| uids | Retrieves messages UUIDs from the mailbox |
+| messages | Retrieves emails from the mailbox using given UIDs |
+| attachments | Downloads attachments from emails using given UIDs |
-Install the required library by running the following command:
+## Initialize the pipeline
-```shell
-pip install dlt[duckdb]
+```bash
+dlt init inbox duckdb
 ```
-## Initialize the source
+Here, we chose `duckdb` as the destination. Alternatively, you can also choose `redshift`, `bigquery`, or
+any of the other [destinations.](https://dlthub.com/docs/dlt-ecosystem/destinations/)
-Initialize the source with dlt command:
+## Grab Inbox credentials
-```shell
-dlt init inbox duckdb
-```
+To learn about grabbing the Inbox credentials and configuring the verified source, please refer to
+the
+[full documentation here.](https://dlthub.com/docs/dlt-ecosystem/verified-sources/inbox#grab-credentials)
-## Set email account credentials
+## Add credential
+
+1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can
+ securely store your access tokens and other sensitive information. It's important to handle this
+ file with care and keep it safe. Here's what the file looks like:
-1. Open `.dlt/secrets.toml`.
-1. Enter the email account secrets:
 ```toml
+ # put your secret values and credentials here
+ # do not share this file and do not push it to github
 [sources.inbox]
- host = 'imap.example.com'
- email_account = "example@example.com"
- password = 'set me up!'
+ host = "Please set me up!" # The host address of the email service provider.
+ email_account = "Please set me up!" # Email account associated with the service.
+ password = "Please set me up!" # # APP Password for the above email account.
 ```
-Use [App password](#getting-gmail-app-password) to set the password for a Gmail account.
-
-## Usage
-
-1. Ensure that the email account you want to access allows access by less secure apps (or use an
- [app password](#getting-gmail-app-password)).
-2. Replace the placeholders in `.dlt/secrets.toml` with your IMAP server hostname, email account
- credentials.
-
-
-## Functionality
-You access the messages and attachments via `inbox_source` `dlt` source. This source exposes the resources
-as described below. Typically you'll pick one of those (ie. **messages**) and load it into a table with
-a specific name:
-```python
-# get messages resource from the source
-messages = inbox_source(
- filter_emails=("astra92293@gmail.com", "josue@sehnem.com")
-).messages
-# configure the messages resource to not get bodies of the messages
-messages = messages(include_body=False).with_name("my_inbox")
-# load messages to "my_inbox" table
-load_info = pipeline.run(messages)
-```
-This way you can create several extract pipelines with different combination of filters and processing steps from a single `inbox_source`.
-
-### Additional `inbox_source` arguments
-Please refer to `inbox_source()` docstring for options to filter email messages and attachments by sender, date or mime type.
+1. Replace the host, email and password value to
+ ensure secure access to your Inbox resources.
-### messages resource
+ > When adding the App Password, remove any spaces. For instance, "abcd efgh ijkl mnop" should be
+ > "abcdefghijklmnop".
-This resource fetches the corresponding email messages using their UID. It yields a dictionary containing email
-metadata such as message UID, message ID, sender, subject, date, modification date, content type,
-and email body.
+1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to
+ add credentials for your chosen destination, ensuring proper routing of your data to the final
+ destination.
-Please refer to **imap_read_messages()** example pipeline in **inbox_pipeline.py** that loads messages from particulars into
-**duckdb**.
+## Run the pipeline
-### attachments resource
+1. Before running the pipeline, ensure that you have installed all the necessary dependencies by
+ running the command:
-This resource extracts attachments from the email message using their UID. It connects to
-the IMAP server, fetches the email message by its UID, parses body and looks for attachments.
-It yields [file items](../filesystem/README.md#the-fileitem-file-representation) where attachments
-are loaded in the **file_content** field. The original email message is present in **message** filed.
+ ```bash
+ pip install -r requirements.txt
+ ```
-Please refer to **imap_get_attachments()** example pipeline in **inbox_pipeline.py** that get **pdf** attachments
-from emails from particular senders, parsed pdfs and writes content into **duckdb**.
+ Prerequisites for fetching messages differ by provider.
-### uids resource
+ - For Gmail:
-This is a dlt resource that connects to the IMAP server, logs in to the email account, and fetches
-email messages from the specified folder ('INBOX' by default). It yields a dictionary containing
-only message UID. You typically do not need to use this resource.
+ `pip install google-api-python-client>=2.86.0`
-## Accessing Gmail Inbox
+ `pip install google-auth-oauthlib>=1.0.0`
-To connect to the Gmail server, we need the below information.
+ `pip install google-auth-httplib2>=0.1.0`
-- SMTP server DNS. Its value will be 'imap.gmail.com' in our case.
-- SMTP server port. The value will be 993. This port is used for Internet message access protocol
- over TLS/SSL.
+ - For pdf parsing:
-### Set up Gmail with a third-party email client
+ `pip install PyPDF2`
-An app password is a 16-digit passcode that gives a less secure app or device permission to access
-your Google Account. App passwords can only be used with accounts that have 2-Step Verification
-turned on.
+1. Once the pipeline has finished running, you can verify that everything loaded correctly by using
+ the following command:
-Step 1: Create and use App Passwords
-1. Go to your Google Account.
-2. Select Security.
-3. Under "How you sign in to Google", select **2-Step Verification** -> Turn it on.
-4. Select again **2-Step Verification**.
-5. At the bottom of the page, select App passwords.
-6. Enter a name of device that helps you remember where you’ll use the app password.
-7. Select Generate.
-8. To enter the app password, follow the instructions on your screen. The app password is the
- 16-character code that generates on your device.
-9. Select Done.
+ ```bash
+ dlt pipeline <pipeline_name> show
+ ```
-Read more in
-[this article](https://pythoncircle.com/post/727/accessing-gmail-inbox-using-python-imaplib-module/)
-or
-[Google official documentation.](https://support.google.com/mail/answer/185833#zippy=%2Cwhy-you-may-need-an-app-password)
+ For example, the `pipeline_name` for the above pipeline example is `standard_inbox`, you may also
+ use any custom name instead.
-Step 2: Turn on IMAP in Gmail
-1. In Gmail, in the top right, click Settings -> See all settings.
-2. At the top, click the Forwarding and POP/IMAP tab.
-3. In the IMAP Access section, select Enable IMAP.
-4. At the bottom, click Save Changes.
+For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)
-Read more in [official Google documentation.](https://support.google.com/a/answer/9003945#zippy=%2Cstep-turn-on-imap-in-gmail)
\ No newline at end of file
+💡 To explore additional customizations for this pipeline, we recommend referring to the official DLT
+Inbox documentation. It provides comprehensive information and guidance on how to further customize
+and tailor the pipeline to suit your specific needs. You can find the DLT Inbox documentation in
+the [Setup Guide: Inbox.](https://dlthub.com/docs/dlt-ecosystem/verified-sources/inbox)