# Mailbagit Workshop

Welcome to the mailbagit demo workshop! We'll learn what a mailbag is, and how to create one using mailbagit!

Slides at [gregwiedeman.com/slides/mailbagitDLF2022.html](https://gregwiedeman.com/slides/mailbagitDLF2022.html)


# Installing mailbagit with Python


In [1]:
! pip install mailbagit[pst]



## Did it work?

In [2]:
! mailbagit -v

mailbagit 0.7.0


## Set up a folder in your Google Drive

In [3]:
# mount your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
gdrive_path = '/content/gdrive/My Drive/'

# Make a folder called "mailbag_demo"
import os
mailbag_demo = os.path.join(gdrive_path, "mailbag_demo")
if not os.path.isdir(mailbag_demo):
  os.mkdir(mailbag_demo)

# Change to mailbag_demo directory
%cd '/content/gdrive/My Drive/mailbag_demo'
! pwd

Mounted at /content/gdrive
/content/gdrive/My Drive/mailbag_demo
/content/gdrive/My Drive/mailbag_demo


## Get some sample data


In [4]:
# Download sample data to folder
import urllib.request
data_url = "https://archives.albany.edu/static/mailbagWorkshopData.zip"
data_zip = os.path.join(mailbag_demo, "mailbagWorkshopData.zip")
urllib.request.urlretrieve(data_url, data_zip)

# unzip (reuses the data_zip and mailbag_demo paths defined above)
import zipfile
with zipfile.ZipFile(data_zip, 'r') as zip_ref:
    zip_ref.extractall(mailbag_demo)

# list contents
! ls

account.mbox  enron.pst  Inbox	mailbagWorkshopData.zip  msgs


You should now have a folder in your Google Drive called mailbag_demo. In it should be:
* Inbox (a folder of .EMLs)
* msgs (a folder of .MSGs)
* account.mbox
* enron.pst
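
If you want to confirm the download from code rather than by eye, a minimal sketch like the following could help. The helper name `missing_entries` is hypothetical (it is not part of mailbagit); the expected names are simply the ones listed above.

```python
import os

# Names taken from the list above; the helper itself is hypothetical.
EXPECTED = ["Inbox", "msgs", "account.mbox", "enron.pst"]

def missing_entries(folder, expected=EXPECTED):
    """Return the expected names that are not present in `folder`."""
    present = set(os.listdir(folder))
    return [name for name in expected if name not in present]

if __name__ == "__main__":
    demo = "/content/gdrive/My Drive/mailbag_demo"
    # Only check when the Drive folder is actually mounted.
    if os.path.isdir(demo):
        print(missing_entries(demo) or "All sample data is in place.")
```

An empty list means everything unpacked as expected.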

## What mailbagit does

* Takes email export files
  * PST, MBOX, MSG, EML
  * Single files or directory of files
  * "companion" files option
* Packages them into a mailbag
* Creates derivative files
  * TXT, HTML, EML, MBOX
  * PDF, PDF-chrome, WARC

## Speed

* Things that are fast
  * MBOX, EML sources
  * TXT, HTML, EML, MBOX derivatives
* Things that are slow
  * PST, MSG sources
  * PDF, WARC derivatives

## CLI Options

* -r --dry-run
* --css path/to/styles.css
* -c --compress zip
* -f --companion_files

## CLI Options (continued)

* --capture-date
* --capture-agent
* --capture-agent-version
* Most bagit-python options
  * --processes
  * Checksums, --md5, --sha512
  * --source-organization
  * not --quiet, --validate, --fast

## Privacy & Security concerns

* Email trackers
* PDF and WARC derivatives ping all URLs
* File inclusions

## Try a "dry run"
This would package `account.mbox` into a mailbag called "test1" with HTML and EML derivatives

In [5]:
! mailbagit 'account.mbox' -i mbox -d html eml -m test1 -r

INFO: Reading: account.mbox
INFO: Found 331 messages.
100.0% [Processed 331 of 331 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Finished packaging mailbag.
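
If you end up running the same command against several sources, you could build the argument list in Python instead of retyping it. This is only a sketch: `mailbagit_command` is a hypothetical helper, and it uses only the flags shown in this workshop (`-i`, `-d`, `-m`, `-r`).

```python
import shlex

def mailbagit_command(source, input_format, derivatives, name, dry_run=True):
    """Build a mailbagit argument list from the flags used in this workshop."""
    args = ["mailbagit", source, "-i", input_format, "-d", *derivatives, "-m", name]
    if dry_run:
        # -r / --dry-run: process the messages without packaging the mailbag
        args.append("-r")
    return args

if __name__ == "__main__":
    cmd = mailbagit_command("account.mbox", "mbox", ["html", "eml"], "test1")
    # shlex.join quotes arguments so the line can be pasted into a shell
    print(shlex.join(cmd))
```

The resulting list can also be passed to `subprocess.run` when you are ready to execute it.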


## Try it for real!
This will actually create the mailbag

In [9]:
! mailbagit 'account.mbox' -i mbox -d html eml -m test1 -k

usage: mailbagit [-h] [--processes PROCESSES] [--log LOG] [--shake_128] [--sha3_256] [--sha384]
                 [--blake2b] [--blake2s] [--sha3_384] [--sha256] [--md5] [--sha3_224] [--sha3_512]
                 [--shake_256] [--sha1] [--sha512] [--sha224]
                 [--source-organization SOURCE-ORGANIZATION]
                 [--organization-address ORGANIZATION-ADDRESS] [--contact-name CONTACT-NAME]
                 [--contact-phone CONTACT-PHONE] [--contact-email CONTACT-EMAIL]
                 [--external-description EXTERNAL-DESCRIPTION]
                 [--external-identifier EXTERNAL-IDENTIFIER]
                 [--bag-group-identifier BAG-GROUP-IDENTIFIER] [--bag-count BAG-COUNT]
                 [--internal-sender-identifier INTERNAL-SENDER-IDENTIFIER]
                 [--internal-sender-description INTERNAL-SENDER-DESCRIPTION]
                 [--bagit-profile-identifier BAGIT-PROFILE-IDENTIFIER] -m MAILBAG -i
                 {eml,mbox,msg,pst} [-d {eml,html,mbox,txt,w

## What's in a mailbag?

* bagit.txt, manifests
* bag-info.txt
* mailbag.csv
* data (payload)


## bag-info.txt

```
Bag-Size: 34 MB
Bag-Software-Agent: bagit.py v1.8.1 <https://github.com/LibraryOfCongress/bagit-python>
Bag-Type: Mailbag
Bagging-Date: 2022-05-26
Bagging-Timestamp: 2022-05-26T16:15:48
EML-Agent: email
EML-Agent-Version: 3.9.12
External-Identifier: adeb0ab6-59b8-494c-be6a-de066f5c8f23
MSG-Agent: extract_msg
MSG-Agent-Version: 0.30.12
Mailbag-Agent: mailbagit
Mailbag-Agent-Version: 0.2.1
Mailbag-Source: msg
Original-Included: True
PDF-Agent: wkhtmltopdf
PDF-Agent-Version: wkhtmltopdf 0.12.6 (with patched qt)
Payload-Oxum: 36529495.87
WARC-Agent: warcio
WARC-Agent-Version: 1.7.4

```
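
Because bag-info.txt is plain `Label: value` text, it is easy to read from a script. The following is a minimal sketch, not a full BagIt parser: `parse_bag_info` and `split_oxum` are hypothetical helpers, and a repeated label would simply keep its last value here. In BagIt, Payload-Oxum is the total byte count and the file count of the payload, separated by a period.

```python
def parse_bag_info(text):
    """Parse 'Label: value' lines; indented lines continue the previous value."""
    info = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last is not None:
            # Continuation line: append to the previous value
            info[last] += " " + line.strip()
            continue
        # Split on the first colon only, so URLs and timestamps stay intact
        label, _, value = line.partition(":")
        last = label.strip()
        info[last] = value.strip()
    return info

def split_oxum(oxum):
    """Split a Payload-Oxum value into (byte count, file count)."""
    octets, _, streams = oxum.partition(".")
    return int(octets), int(streams)

if __name__ == "__main__":
    sample = "Bag-Type: Mailbag\nMailbag-Source: msg\nPayload-Oxum: 36529495.87\n"
    info = parse_bag_info(sample)
    print(info["Mailbag-Source"], split_oxum(info["Payload-Oxum"]))
```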


## mailbag.csv headers

```
Error
Mailbag-Message-ID
Message-ID
Original-File
Message-Path
Derivatives-Path
Attachments (int)
Date
From
To
Cc
Bcc
Subject
Content-Type
```
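
Since mailbag.csv is an ordinary CSV file, the standard `csv` module can read it. This sketch lists the messages flagged in the `Error` column. It assumes the column names match the headers above and that any value other than empty or "False" marks a problem; that interpretation is an assumption, so check it against your own mailbag.csv. The function name is hypothetical.

```python
import csv
import io

def messages_with_errors(csv_file):
    """Return Mailbag-Message-IDs whose Error column looks truthy.

    Assumption: an empty value or 'False' means no error was recorded.
    """
    flagged = []
    for row in csv.DictReader(csv_file):
        value = (row.get("Error") or "").strip().lower()
        if value not in ("", "false"):
            flagged.append(row["Mailbag-Message-ID"])
    return flagged

if __name__ == "__main__":
    # Made-up rows for illustration only
    sample = io.StringIO(
        "Error,Mailbag-Message-ID,Subject\n"
        "False,1,Hello\n"
        "True,2,Broken message\n"
    )
    print(messages_with_errors(sample))
```

For a real mailbag, pass an open file handle instead, for example `open("test2/mailbag.csv", newline="", encoding="utf-8")` (the path is illustrative).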


## data (payload)

* attachments
  * [Mailbag-Message-ID]
    * attachments.csv
    * test.pdf
* pst
  * export.pst
* eml
  *
* pdf
    

## Let's try some MSG files
Do a "dry run" first, again

In [None]:
! mailbagit 'msgs' -i msg -d eml html txt -m test2 -r

INFO: Reading: msgs
INFO: Found 21 messages.
100.0% [Processed 21 of 21 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Finished packaging mailbag.


### We got some errors!
A dry run still creates an error report.
Take a look at the warnings report in msgs/test2_warnings.

## Mailbagit error reports

* external to mailbag
* created on -r --dry-run
  * more errors will show without dry-run
  * errors from subprocesses
    * wkhtmltopdf
    * chrome
* errors.csv
* .txt file for each message with error and full stack trace

## Let's try MSGs for real
This will package all the .MSG files in `msgs` into a mailbag called "test2" with EML, HTML, and plain text derivatives

In [None]:
! mailbagit 'msgs' -i msg -d eml html txt -m test2 -k

INFO: Reading: msgs
INFO: Found 21 messages.
100.0% [Processed 21 of 21 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Saving manifests...
INFO: Finished packaging mailbag.


## PDFs won't work in Colab :(

You'll have to skip this part as the PDF dependencies are not available in Google Colab.

## Let's try a PST!

In [None]:
! mailbagit 'enron.pst' -i pst -d eml html mbox -m test4 -k

INFO: Reading: enron.pst
INFO: Found 71 messages.
ERROR: Error writing MBOX derivative: OSError(38, 'Function not implemented')
ERROR: Error writing MBOX derivative: OSError(38, 'Function not implemented')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: OSError(38, 'Function not implemented')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, 'File exists')
ERROR: Error writing MBOX derivative: FileExistsError(17, '

## Let's try to make some WARCs!


In [10]:
! mailbagit 'msgs' -i msg -d eml html warc -m test5 -k

INFO: Reading: msgs
INFO: Found 21 messages.
100.0% [Processed 21 of 21 messages] 0.0s remaining
INFO: Writing CSV reports...
INFO: Saving manifests...
INFO: Finished packaging mailbag.


## WARCs from mailbagit

* uses [warcio](https://github.com/webrecorder/warcio) and `requests`
* For WARC-Target-URI uses "mailbag", Mailbag-Message-ID
  * "http://mailbag/11/body.html"
* body.html response record
* headers.json response record
* also duplicated in a metadata record, which replayweb.page does not use
* includes attachments
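
The target URI pattern above can be sketched in a few lines. This is inferred from the single example URI on this slide, not taken from mailbagit's source, and both helper names are hypothetical. Splitting a URI back apart may be handy when inspecting records in a WARC.

```python
from urllib.parse import urlsplit

def target_uri(message_id, filename):
    """Build a URI like the example above: http://mailbag/<id>/<file>."""
    return f"http://mailbag/{message_id}/{filename}"

def split_target_uri(uri):
    """Return (message_id, filename) from a mailbag-style target URI."""
    parts = urlsplit(uri)
    if parts.netloc != "mailbag":
        raise ValueError(f"not a mailbag target URI: {uri}")
    message_id, _, filename = parts.path.lstrip("/").partition("/")
    return message_id, filename

if __name__ == "__main__":
    uri = target_uri(11, "body.html")
    print(uri, split_target_uri(uri))
```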

## How mailbagit handles encoding

* Bodies and headers
* Tries listed encoding if present (and valid)
* Tries chardetect
  * raises warning if successful
  * raises error if failed
* replaces errors with listed encoding
* Writes derivatives as "UTF-8"
  * except EML/MBOX
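
The fallback order above can be sketched roughly as follows. This is not mailbagit's code: to stay within the standard library, a short list of candidate encodings stands in for real character detection, and the function name and return shape are hypothetical.

```python
def decode_with_fallback(raw, listed=None, candidates=("utf-8", "cp1252")):
    """Decode bytes, returning (text, encoding_used, notes)."""
    notes = []
    # 1. Try the encoding the message declares, if any.
    if listed:
        try:
            return raw.decode(listed), listed, notes
        except (LookupError, UnicodeDecodeError):
            notes.append(f"listed encoding {listed!r} is missing or invalid")
    # 2. Stand-in for detection: try a short list of common encodings.
    for candidate in candidates:
        try:
            text = raw.decode(candidate)
        except UnicodeDecodeError:
            continue
        notes.append(f"warning: guessed {candidate!r}")
        return text, candidate, notes
    # 3. Last resort: decode anyway, replacing undecodable bytes.
    notes.append("error: detection failed, replacing undecodable bytes")
    final = listed or "utf-8"
    try:
        return raw.decode(final, errors="replace"), final, notes
    except LookupError:
        # The listed name is not a codec Python knows; fall back to UTF-8.
        return raw.decode("utf-8", errors="replace"), "utf-8", notes

if __name__ == "__main__":
    mislabeled = "café".encode("cp1252")
    print(decode_with_fallback(mislabeled, listed="utf-8"))
```

Whatever encoding was used for reading, the decoded text can then be written out as UTF-8.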

## Lossiness in derivatives

* Most derivatives only contain part of the data
* Most derivatives get written to/from [the mailbagit Model](https://github.com/UAlbanyArchives/mailbagit/blob/main/mailbagit/models.py)
* EML and MBOX derivatives
  * tries to write complete object
  * only possible for EML/MBOX sources
  * uses original encoding
  * sometimes fails due to encoding issues
    * raises warning and generates from Model
    
## EML / MBOX from Model

* Contains all headers, HTML and TXT bodies, attachments
* Writes folder structure to X-Folder header
* multipart/mixed
  * multipart/alternative (text/plain if present)
  * multipart/alternative (text/html if present)
  * application/pdf (application/octet-stream if missing)
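
A structure like this can be built with Python's standard `email` package. The sketch below is an illustration under assumptions, not mailbagit's implementation: it puts the plain text and HTML bodies in one multipart/alternative part, and `build_message` with its parameters is hypothetical.

```python
from email.message import EmailMessage

def build_message(headers, text=None, html=None, attachments=(), folder=None):
    """Assemble a message; `attachments` holds (filename, bytes, mime_type)."""
    msg = EmailMessage()
    for name, value in headers.items():
        msg[name] = value
    if folder:
        # Keep the original folder structure in a custom header
        msg["X-Folder"] = folder
    if text is not None:
        msg.set_content(text)
    if html is not None:
        if text is not None:
            # Converts the message to multipart/alternative
            msg.add_alternative(html, subtype="html")
        else:
            msg.set_content(html, subtype="html")
    for filename, data, mime_type in attachments:
        # Fall back to application/octet-stream when the type is unknown
        maintype, subtype = (mime_type or "application/octet-stream").split("/", 1)
        # Converts the message to multipart/mixed on first use
        msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg

if __name__ == "__main__":
    msg = build_message(
        {"Subject": "Hello", "From": "sender@example.org"},
        text="Plain body",
        html="<p>HTML body</p>",
        attachments=[("test.pdf", b"%PDF-", "application/pdf")],
        folder="Inbox/Projects",
    )
    for part in msg.walk():
        print(part.get_content_type())
```

Walking the result shows the nesting: the outer multipart/mixed, then the alternative bodies, then each attachment.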
  
## Future Questions

* Packaging from IMAP
* Packaging over APIs
  * Gmail
  * Office365/Exchange
  * would require OAuth
* Exclusions/filtering