Archives Unleashed Cloud: Jupyter Notebooks
Jupyter notebooks to assist in creating additional analysis and visualizations of Archives Unleashed Cloud derivatives.
The following article provides a nice overview:
Deschamps, Ryan, Ruest, Nick, Lin, Jimmy, Fritz, Samantha, Milligan, Ian. The Archives Unleashed Notebook: Madlibs for Jumpstarting Scholarly Exploration. Proceedings of the 2019 IEEE/ACM Joint Conference on Digital Libraries (JCDL 2019), June 2019, Urbana-Champaign, Illinois.
- Python 3.7+
- Jupyter Notebook (1.0.0)
- au_notebook (0.0.3)
- matplotlib (3.0.2)
- numpy (1.15.1)
- pandas (0.23.4)
- networkx (2.2)
- nltk (3.4.5)
Anaconda is a Python distribution and package manager that helps you install packages and their dependencies, including many of the most popular ones used in data science. To run the Jupyter Notebook in an Anaconda environment, run the following:
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
pip install -r requirements.txt
python -m nltk.downloader punkt vader_lexicon stopwords
jupyter notebook
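After installing, it can be useful to confirm which versions ended up in your environment. The following is a minimal sketch, not part of the repository: it compares installed packages against the versions in the dependency list above using the standard library's `importlib.metadata` (available from Python 3.8; on 3.7 the `importlib_metadata` backport would be needed). The names `PINNED`, `installed_version`, and `compare_versions` are hypothetical, and a mismatch does not necessarily mean the notebooks will fail.

```python
from importlib import metadata  # standard library in Python 3.8+

# Versions copied from the dependency list above.
PINNED = {
    "matplotlib": "3.0.2",
    "numpy": "1.15.1",
    "pandas": "0.23.4",
    "networkx": "2.2",
    "nltk": "3.4.5",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(pinned, lookup=installed_version):
    """Return {package: (listed, installed)} for every package that differs."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in compare_versions(PINNED).items():
        print(f"{package}: listed {wanted}, installed {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.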
Docker is a container platform that bundles an application together with its dependencies, so once the image is built it works out of the box. There are two ways to run the Jupyter Notebook via Docker: pull the prebuilt image from Docker Hub, or build the image locally.
docker run --rm -it -p 8888:8888 archivesunleashed/auk-notebooks
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
docker build -t auk-notebook .
docker run --rm -it -p 8888:8888 auk-notebook
This repository comes with sample data. To use your own Archives Unleashed Cloud data instead, mount your data directory into the container:
docker run --rm -it -p 8888:8888 -v "/path/to/own/data:/home/jovyan/data" auk-notebook
Note: You must grant the within-container notebook user or group (NB_UID or NB_GID) write access to the host directory (e.g., sudo chown 1000 /some/host/folder/for/work).
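If you launch the container from a script, the mount command and the ownership note above can be sketched as follows. This is an illustration under assumptions, not part of the repository: the helper names `docker_run_command` and `is_owned_by` are hypothetical, UID 1000 is taken from the `chown` example above and may differ for your image, and the ownership check covers only the owner, not group permissions.

```python
import os
import shlex
from pathlib import Path

CONTAINER_DATA_DIR = "/home/jovyan/data"  # mount target from the command above
NOTEBOOK_UID = 1000  # UID from the chown example; an assumption for your image


def docker_run_command(host_data_dir, image="auk-notebook", port=8888):
    """Build the `docker run` argument list for a given host data directory."""
    # Docker bind mounts need an absolute host path.
    host_path = Path(host_data_dir).expanduser().resolve()
    return [
        "docker", "run", "--rm", "-it",
        "-p", f"{port}:{port}",
        "-v", f"{host_path}:{CONTAINER_DATA_DIR}",
        image,
    ]


def is_owned_by(path, uid=NOTEBOOK_UID):
    """Return True if `path` is owned by `uid` (POSIX only; ignores group access)."""
    return os.stat(path).st_uid == uid


if __name__ == "__main__":
    command = docker_run_command("/path/to/own/data")
    # Print a copy-pasteable shell command instead of running it.
    print(" ".join(shlex.quote(part) for part in command))
```

Returning an argument list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting issues.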
Types of Visualizations
The Jupyter Notebook can produce four types of analysis, for a total of 14 outputs:
- Domain Analysis: Provides information about what has been crawled (e.g. which domains) and how often.
- Text Analysis: Highlights the frequency of words through various filters including domain and year.
- Sentiment Analysis: Visualizes sentiment scores by domain and year.
- Network Analysis: Shows the connections and relationship among websites through network graph layouts.
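To give a sense of what the domain analysis computes, here is a minimal sketch that counts how often each domain appears in a list of URLs. It is not the notebook's own code: the notebooks work from Archives Unleashed Cloud derivative files and use pandas, whereas this uses only the standard library and a made-up in-memory list (`SAMPLE_URLS`) on reserved example domains. The function name `domain_counts` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up URLs for illustration only; not drawn from the sample dataset.
SAMPLE_URLS = [
    "https://www.example.org/news/1",
    "https://example.org/news/2",
    "http://EXAMPLE.org/about",
    "https://example.com/",
    "not a url",
]


def domain_counts(urls):
    """Count URLs per domain, ignoring a leading 'www.' and unparseable entries."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # lowercased by urlparse; None if absent
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


if __name__ == "__main__":
    for domain, count in domain_counts(SAMPLE_URLS).most_common():
        print(f"{domain}\t{count}")
```

The resulting counts could be passed to a plotting library such as matplotlib to produce a bar chart of the most frequently crawled domains.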
This application is available as open source under the terms of the Apache License, Version 2.0.
The example dataset in the data directory was created with the Archives Unleashed Cloud, and is drawn from the B.C. Teachers' Labour Dispute (2014), collected by the University of Victoria Libraries. We are grateful that they've allowed us to use this material. The full-text derivative file is a random sample (37,000 lines) of the complete file because of GitHub file size limitations.
If you use this material, please cite it along the following lines:
- Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.
- University of Victoria Libraries, B.C. Teachers' Labour Dispute (2014), Archive-It Collection 4867, https://archive-it.org/collections/4867.
This work is primarily supported by the Andrew W. Mellon Foundation. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.