Skip to content

Latest commit

 

History

History
170 lines (109 loc) · 6.35 KB

mailinglists.rst

File metadata and controls

170 lines (109 loc) · 6.35 KB

Mailinglists

Below we describe, how the public mailing lists of each of the Internet standard developing organisations can be scraped from the web. Some mailng lists reach back to 1998 and are multiple GBs in size. Therefore, it can take a considerable amount of time to scrape an entire mailing list. This process can't be sped up, since one would commit a DDoS attack otherwise. So be prepared to leave your machine running over (multiple) night(s).

IETF

There are severa many ways to access the public mailing list data of the Internet Engineering Task Force (IETF).

The IETF documents many access methods here. We discuss several oprtions.

Export from Web Interface -----------------------

The IETF allow logged in users to export up to 5000 messages at a time from their on-line mail archive.

These are downloaded as a compressed directory of .mbox files.

Put this directoy in you archives/ directory to make it available for analysis.

Collect-mail from text archives

The full email archives are also available as text files (.mail`) available through the web.

BigBang comes with a script for collecting files like. For example, mail for a specific list can be collected using its archival URL.

For example, for the mailing list archive of the dnsop working group, run the following command from the root directory of this repository:

bigbang collect-mail --url https://www.ietf.org/mail-archive/text/dnsop/

More information is available in the CLI help interface: bigbang collect-mail --help

W3C

The World Wide Web Consortium (W3C) mailing archive is managed using the Hypermail 2.4.0 software and is hosted at:

https://lists.w3.org/Archives/Public/

There are two ways you can scrape the public mailing-list from that domain. First, one can write their own python script containing a variation of:

from bigbang.ingress import W3CMailList

mlist = W3CMailList.from_url(
    name="public-testtwf",
    url="https://lists.w3.org/Archives/Public/public-testtwf/",
    select={"years": 2014, "fields": "header"},
)
mlist.to_mbox(path_to_file)

Or one can use the command line script and a file containg all mailing-list URLs one wants to scrape:

bigbang collect-mail --file examples/url_collections/W3C.txt

3GPP

The 3rd Generation Partnership Project (3GPP) mailing archive is managed using the LISTSERV software and is hosted at:

https://list.etsi.org/scripts/wa.exe?HOME

In order to successfully scrape all public mailing lists, one needs to create an account here: https://list.etsi.org/scripts/wa.exe?GETPW1=&X=&Y=

There are two ways you can scrape the public mailing-list from that domain. First, one can write their own python script containing a variation of:

from bigbang.ingress import ListservMailList

mlist = ListservMailList.from_url(
    name="3GPP_TSG_SA_WG2_EMEET",
    url="https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_SA_WG2_EMEET",
    select={"fields": "header",},
    url_login="https://list.etsi.org/scripts/wa.exe?LOGON=INDEX",
    url_pref="https://list.etsi.org/scripts/wa.exe?PREF",
    login=auth_key,
)
mlist.to_mbox(path_to_file)

Or one can use the command line script and a file containg all mailing-list URLs one wants to scrape:

bigbang collect-mail --file examples/url_collections/listserv.3GPP.txt

IEEE

The Institute of Electrical and Electronics Engineers (IEEE) mailing archive is managed using the LISTSERV software and is hosted at:

https://listserv.ieee.org/cgi-bin/wa?INDEX

There are two ways you can scrape the public mailing-list from that domain. First, one can write their own python script containing a variation of:

from bigbang.ingress import ListservMailList

mlist = ListservMailList.from_url(
    name="IEEE-TEST",
    url="https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST",
    select={"fields": "header",},
    url_login="https://listserv.ieee.org/cgi-bin/wa?LOGON",
    url_pref="https://listserv.ieee.org/cgi-bin/wa?PREF",
    login=auth_key,
)
mlist.to_mbox(path_to_file)

Or one can use the command line script and a file containg all mailing-list URLs one wants to scrape:

bigbang collect-mail --file examples/url_collections/listserv.IEEE.txt

ICANN

The Internet Corporation for Assigned Names and Numbers (ICANN) mailing archive is managed using the Pipermail 0.09 format and is hosted at:

https://mm.icann.org/pipermail/<name_of_mailing_list>

where the part inside <name_of_mailing_list> needs to substituted by the name of the mailing list one wants to ingress.

Mailing lists in this format are scraped by reading their .txt or .txt.gz files of each month of a year. For a singled month, this can be done as follows

from bigbang.ingress import PipermailMailList

mlist = PipermailMailList.from_period_urls(
    name="accred-model",
    url="https://mm.icann.org/pipermail/accred-model",
    period_urls=["https://mm.icann.org/pipermail/accred-model/2018-August.txt.gz"],
    fields="total",
)

while an entire mailing list can be ingressed using

from bigbang.ingress import PipermailMailList

mlist = PipermailMailList.from_url(
    name="accred-model",
    url="https://mm.icann.org/pipermail/accred-model",
    select={
        "years": 2018,
        "fields": "total",
    },
)

Public Mailman 1 Web Archive

BigBang comes with a script for collecting files from public Mailman 1 web archives. An example of this is the scipy-dev mailing list page. To collect the archives of the scipy-dev mailing list, run the following command from the root directory of this repository:

bigbang collect-mail --url http://mail.python.org/pipermail/scipy-dev/

You can also give this command a file with several urls, one per line. One of these is provided in the examples/ directory.

bigbang collect-mail --file examples/urls.txt

Once the data has been collected, BigBang has functions to support analysis.