# Mailbagit Workshop

Welcome to the mailbagit demo workshop! We'll learn what a mailbag is, and how to create one using mailbagit!

Slides at [gregwiedeman.com/slides/mailbagitDLF2022.html](https://gregwiedeman.com/slides/mailbagitDLF2022.html)


# Installing mailbagit with Python


In [2]:
! pip install mailbagit[pst]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mailbagit[pst]
  Downloading mailbagit-0.4.1-py3-none-any.whl (56 kB)
Collecting extract-msg<1,>=0.34.3
  Downloading extract_msg-0.36.4-py2.py3-none-any.whl (160 kB)
Collecting requests<3,>=2.27.1
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
Collecting python-json-logger<3,<3pyparsing>=2.1.0,>=2.0.2
  Downloading python_json_logger-2.0.4-py3-none-any.whl (7.8 kB)
Collecting packaging<21.3,>=21.0
  Downloading packaging-21.2-py3-none-any.whl (40 kB)
Collecting cssutils<3,>=2.4.2
  Downloading cssutils-2.6.0-py3-none-any.whl (399 kB)
Collecting p

## Did it work?

In [1]:
! mailbagit -v

mailbagit 0.4.1
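
If you would rather check from Python than from the shell, the standard library can report the installed version of a package. This is a minimal sketch; the helper name `installed_version` is made up for this example and is not part of mailbagit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the install above worked, otherwise None
print(installed_version("mailbagit"))
```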


## Get some sample data

In [2]:
# mount your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
gdrive_path = '/content/gdrive/My Drive/'

# Make a folder called "mailbag_demo"
import os
mailbag_demo = os.path.join(gdrive_path, "mailbag_demo")
if not os.path.isdir(mailbag_demo):
  os.mkdir(mailbag_demo)

# Download sample data to folder
import urllib.request
data_url = "https://archives.albany.edu/static/mailbagWorkshopData.zip"
data_zip = os.path.join(mailbag_demo, "mailbagWorkshopData.zip")
urllib.request.urlretrieve(data_url, data_zip)

Mounted at /content/gdrive


('/content/gdrive/My Drive/mailbag_demo/mailbagWorkshopData.zip',
 <http.client.HTTPMessage at 0x7fda765e0750>)

In [None]:
# unzip
import zipfile
with zipfile.ZipFile('/content/gdrive/My Drive/mailbag_demo/mailbagWorkshopData.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/gdrive/My Drive/mailbag_demo')

## Sample Data

You should have a folder in your Google Drive called mailbag_demo. In it should be:
* Inbox (a folder of .EMLs)
* msgs (a folder of .MSGs)
* account.mbox
* enron.pst
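
A quick way to confirm the download and unzip steps worked is to check for those four items. This is a minimal sketch; the helper name `missing_sample_items` is an assumption for this example, and the folder path is the one used in the cells above.

```python
import os

# The four items the workshop data should contain
EXPECTED_ITEMS = ["Inbox", "msgs", "account.mbox", "enron.pst"]


def missing_sample_items(folder, expected=EXPECTED_ITEMS):
    """Return the expected names that are not present in folder."""
    return [
        name for name in expected
        if not os.path.exists(os.path.join(folder, name))
    ]


# An empty list means everything is in place
print(missing_sample_items("/content/gdrive/My Drive/mailbag_demo"))
```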

## Working directory setup

In [4]:
! pwd
%cd '/content/gdrive/My Drive/mailbag_demo'
! pwd
! ls

/content/gdrive/MyDrive/mailbag_demo
/content/gdrive/My Drive/mailbag_demo
/content/gdrive/My Drive/mailbag_demo
account.mbox  enron.pst  Inbox	mailbagWorkshopData.zip  msgs


## What mailbagit does

* Takes email export files
  * PST, MBOX, MSG, EML
  * Single files or directory of files
  * "companion" files option
* Packages them into a mailbag
* Creates derivative files
  * TXT, HTML, EML, MBOX
  * PDF, PDF-chrome, WARC

## Speed

* Things that are fast
  * MBOX, EML sources
  * TXT, HTML, EML, MBOX derivatives
* Things that are slow
  * PST, MSG sources
  * PDF, WARC derivatives

## CLI Options

* -r --dry-run
* --css path/to/styles.css
* -c --compress zip
* -f --companion_files

## CLI Options

* --capture-date
* --capture-agent
* --capture-agent-version
* Most bagit-python options
  * --processes
  * Checksums, --md5, --sha512
  * --source-organization
  * but not --quiet, --validate, --fast

## Privacy & Security concerns

* Email trackers
* PDF and WARC derivatives ping all URLs
* File inclusions
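
To get a feel for what a PDF or WARC derivative might request, you could list the external URLs in a message's HTML body first. The sketch below uses only the standard library and is not how mailbagit itself works; the class and function names and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class ExternalURLFinder(HTMLParser):
    """Collect http(s) URLs found in href and src attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                if value.lower().startswith(("http://", "https://")):
                    self.urls.append((tag, value))


def find_external_urls(html_body):
    """Return (tag, url) pairs for every external link or resource."""
    finder = ExternalURLFinder()
    finder.feed(html_body)
    finder.close()
    return finder.urls


# Hypothetical message body with a link and a 1-pixel tracking image
sample = (
    '<p>Hi <a href="https://example.com/page">link</a>'
    '<img src="http://tracker.example.com/pixel.gif" width="1"></p>'
)
for tag, url in find_external_urls(sample):
    print(tag, url)
```

Images and other `src` resources are the ones most likely to be fetched automatically when a page is rendered, so those are worth a closer look than plain links.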

## Try a "dry run"
This would package `account.mbox` into a mailbag called "test1" with HTML and EML derivatives

In [5]:
! mailbagit 'account.mbox' -i mbox -d html eml -m test1 -r

INFO: Reading: account.mbox
INFO: Found 331 messages.
100.0% [Processed 331 of 331 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Finished packaging mailbag.
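
If you want to script runs like this one, the same command line can be assembled in Python. This is only a sketch: the function `build_mailbagit_command` is hypothetical, and it uses just the flags shown in the cell above (`-i`, `-d`, `-m`, `-r`).

```python
import shlex


def build_mailbagit_command(source, input_format, derivatives, mailbag_name, dry_run=False):
    """Build the argument list for a mailbagit run."""
    command = ["mailbagit", source, "-i", input_format, "-d", *derivatives, "-m", mailbag_name]
    if dry_run:
        # -r is the dry-run flag used in the cell above
        command.append("-r")
    return command


command = build_mailbagit_command("account.mbox", "mbox", ["html", "eml"], "test1", dry_run=True)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run it without going through a shell, which avoids quoting problems with paths like `My Drive`.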


## Try it for real!
This will actually create the mailbag

In [6]:
! mailbagit 'account.mbox' -i mbox -d html eml -m test1

INFO: Reading: account.mbox
INFO: Found 331 messages.
100.0% [Processed 331 of 331 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Saving manifests...
INFO: Finished packaging mailbag.


### Make sure to run this to reset the sample data

In [7]:
! cp 'test1/data/mbox/account.mbox' '.'
! ls

account.mbox  enron.pst  Inbox	mailbagWorkshopData.zip  msgs  test1


## Let's try some MSG files
Do a "dry run" first, again

In [8]:
! mailbagit 'msgs' -i msg -d eml html txt -m test2 -r

INFO: Reading: msgs
INFO: Found 21 messages.
100.0% [Processed 21 of 21 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Finished packaging mailbag.


### We got some errors!
A dry run still creates an error report.
Take a look at the warnings report in `msgs/test2_warnings`.

## Mailbagit error reports

* external to mailbag
* created on -r --dry-run
  * more errors will show without dry-run
  * errors from subprocesses
    * wkhtmltopdf
    * chrome
* errors.csv
* .txt file for each message with error and full stack trace

## Let's try it for real
This will package all the .MSG files in `msgs` into a mailbag called "test2" with EML, HTML, and plain text derivatives

In [9]:
! mailbagit 'msgs' -i msg -d eml html txt -m test2 

INFO: Reading: msgs
INFO: Found 21 messages.
100.0% [Processed 21 of 21 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Saving manifests...
INFO: Finished packaging mailbag.


### Reset the data

In [10]:


! mv msgs/test2/data/msg/* msgs/
! ls

account.mbox  enron.pst  Inbox	mailbagWorkshopData.zip  msgs  test1
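
The `mv` reset above can also be done in Python, which may be handy if you reset the data often. This is a minimal sketch that mirrors the shell command; `restore_originals` is a name made up for this example.

```python
import os
import shutil


def restore_originals(payload_dir, destination_dir):
    """Move every item in payload_dir back into destination_dir."""
    moved = []
    for name in sorted(os.listdir(payload_dir)):
        shutil.move(
            os.path.join(payload_dir, name),
            os.path.join(destination_dir, name),
        )
        moved.append(name)
    return moved
```

For the mailbag created above, the call would be `restore_originals("msgs/test2/data/msg", "msgs")`.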


## What's in a mailbag?

* bagit.txt, manifests
* bag-info.txt
* mailbag.csv
* data (payload)


## bag-info.txt

```
Bag-Size: 34 MB
Bag-Software-Agent: bagit.py v1.8.1 <https://github.com/LibraryOfCongress/bagit-python>
Bag-Type: Mailbag
Bagging-Date: 2022-05-26
Bagging-Timestamp: 2022-05-26T16:15:48
EML-Agent: email
EML-Agent-Version: 3.9.12
External-Identifier: adeb0ab6-59b8-494c-be6a-de066f5c8f23
MSG-Agent: extract_msg
MSG-Agent-Version: 0.30.12
Mailbag-Agent: mailbagit
Mailbag-Agent-Version: 0.2.1
Mailbag-Source: msg
Original-Included: True
PDF-Agent: wkhtmltopdf
PDF-Agent-Version: wkhtmltopdf 0.12.6 (with patched qt)
Payload-Oxum: 36529495.87
WARC-Agent: warcio
WARC-Agent-Version: 1.7.4

```
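
Since bag-info.txt is a list of `Label: value` lines, it is easy to read into a dictionary. The parser below is a simplified sketch and not the one bagit-python uses: it folds indented continuation lines into the previous value and keeps only the last value if a label repeats. The function name `parse_bag_info` is an assumption.

```python
def parse_bag_info(text):
    """Parse 'Label: value' lines into a dict (simplified sketch)."""
    info = {}
    label = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and label is not None:
            # Indented lines continue the previous value
            info[label] += " " + line.strip()
            continue
        # Split on the first colon only, so URLs and timestamps stay intact
        label, _, value = line.partition(":")
        label = label.strip()
        info[label] = value.strip()
    return info


# A few lines taken from the example above
sample_bag_info = """Bag-Type: Mailbag
Bagging-Timestamp: 2022-05-26T16:15:48
Mailbag-Source: msg
Payload-Oxum: 36529495.87
"""
info = parse_bag_info(sample_bag_info)
print(info["Mailbag-Source"], info["Payload-Oxum"])
```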


## mailbag.csv headers

```
Error
Mailbag-Message-ID
Message-ID
Original-File
Message-Path
Derivatives-Path
Attachments (int)
Date
From
To
Cc
Bcc
Subject
Content-Type
```
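
With those headers, mailbag.csv can be filtered using Python's `csv` module, for example to list the messages that had errors. This is a sketch under assumptions: the sample rows are invented, and treating an empty or `False` value in the `Error` column as "no error" is a guess about how the column is filled, so check it against a real mailbag.csv.

```python
import csv
import io


def rows_with_errors(csv_file):
    """Return rows whose Error column is neither empty nor 'False' (assumed convention)."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if (row.get("Error") or "").strip() not in ("", "False")
    ]


# Hypothetical data using a subset of the headers listed above
sample_csv = io.StringIO(
    "Error,Mailbag-Message-ID,Subject\n"
    "False,1,Hello\n"
    "True,2,Broken message\n"
)
for row in rows_with_errors(sample_csv):
    print(row["Mailbag-Message-ID"], row["Subject"])
```

For a real mailbag, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object in.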


## data (payload)

* attachments
  * [Mailbag-Message-ID]
    * attachments.csv
    * test.pdf
* pst
  * export.pst
* eml
  * 
* pdf
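
The `Payload-Oxum` line in bag-info.txt summarizes this payload as `bytes.files`: the total size of everything under `data`, then the number of files. A minimal sketch for recomputing it, assuming a bag directory laid out as above (the function name `payload_oxum` is made up for this example):

```python
import os


def payload_oxum(bag_dir):
    """Return 'total_bytes.file_count' for everything under bag_dir/data."""
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return f"{total_bytes}.{file_count}"
```

Comparing `payload_oxum("test1")` with the value in `test1/bag-info.txt` is a quick sanity check. It is not a substitute for validating the checksums in the manifests.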
    

## PDFs won't work in Colab :(

You'll have to skip this part.

## Let's try a PST!

In [15]:
! mailbagit 'enron.pst' -i pst -d eml html -m test4

INFO: Reading: enron.pst
INFO: Found 71 messages.
100.0% [Processed 71 of 71 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Saving manifests...
INFO: Finished packaging mailbag.


### Reset the data

In [17]:
! cp 'test4/data/pst/enron.pst' '.'
! ls

enron.pst     mailbagWorkshopData.zip  test5


## Let's try to make some WARCs!


In [24]:
! mailbagit 'msgs' -i msg -d eml html warc -m test5

INFO: Reading: msgs
INFO: Found 21 messages.
100.0% [Processed 21 of 21 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Saving manifests...
INFO: Finished packaging mailbag.


### Reset the data

In [None]:
! mv msgs/test5/data/msg/* msgs/
! ls

## How mailbagit makes WARCs

* uses [warcio](https://github.com/webrecorder/warcio) and `requests`
* For WARC-Record-ID uses "mailbag", Mailbag-Message-ID
  * "http://mailbag/11/body.html"
* body.html
* headers.json
* duplicated in a metadata record, which replayweb.page does not use
* includes attachments
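
The example URL above suggests a simple pattern: the fixed host `mailbag`, then the Mailbag-Message-ID, then the file name. The helper below is only an illustration of that pattern, not code from mailbagit; the function name is hypothetical, and using the pattern for anything other than `body.html` is an assumption.

```python
def mailbag_record_url(mailbag_message_id, filename="body.html"):
    """Build a URL following the "http://mailbag/<ID>/<file>" pattern shown above."""
    return f"http://mailbag/{mailbag_message_id}/{filename}"


print(mailbag_record_url(11))
```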

## Lossiness in derivatives

* Most derivatives only contain part of the data
* EML and MBOX derivatives
  * tries to write complete object
  * uses original encoding
  * only possible for EML/MBOX sources
  * sometimes fails due to encoding issues
    * raises warning and generates from Model

## EML / MBOX from Model

* [mailbagit Model](https://github.com/UAlbanyArchives/mailbagit/blob/main/mailbagit/models.py)
* Contains all headers, HTML and TXT bodies, attachments
* Writes folder structure to X-Folder header
* multipart/mixed
  * multipart/alternative (text/plain if present)
  * multipart/alternative (text/html if present)
  * application/pdf (application/octet-stream if missing)
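
To see what a structure like this looks like in practice, here is a sketch using Python's built-in `email` package. It is not mailbagit's own code and may differ in detail from what mailbagit writes; the function name and sample values are invented for the example.

```python
from email.message import EmailMessage


def build_message(subject, folder, text_body, html_body, attachment_bytes, attachment_name):
    """Build a multipart/mixed message with both bodies and one PDF attachment."""
    message = EmailMessage()
    message["Subject"] = subject
    # Folder structure recorded in a custom header, as described above
    message["X-Folder"] = folder
    # Plain-text body first, then HTML as an alternative to it
    message.set_content(text_body)
    message.add_alternative(html_body, subtype="html")
    # Adding an attachment turns the message into multipart/mixed
    message.add_attachment(
        attachment_bytes,
        maintype="application",
        subtype="pdf",
        filename=attachment_name,
    )
    return message


message = build_message(
    "Test", "Inbox/Projects", "Hello", "<p>Hello</p>", b"%PDF-1.4", "test.pdf"
)
for part in message.walk():
    print(part.get_content_type())
```

Calling `message.as_bytes()` gives the content you would write to an `.eml` file.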
  

## Future Questions

* Packaging from IMAP
* Packaging over APIs
  * Gmail
  * Office365/Exchange
  * would require OAuth
* Exclusions/filtering  