![AOY Logo](https://raw.githubusercontent.com/BrockDSL/AOYTK/main/AOY_Logo.png)

All Our Yesterdays - A toolkit to explore web archives

[Homepage](https://brockdsl.github.io/AOTYK)

# File Format Analysis

This notebook provides some basic tools for looking at file type derivatives.

Some of the tools in this notebook are specific to derivatives containing information about image files. These tools are collected at the end of the notebook.

### Mount Google Drive
Mount your Google Drive so that working files and data persist across multiple runs of the notebook.


In [None]:
from google.colab import drive
drive.mount("/content/drive/")

### Load AOYTK

Download and import the All Our Yesterdays Toolkit.

In [None]:
!wget "https://raw.githubusercontent.com/BrockDSL/AOYTK/main/aoytk.py"
import aoytk

### Set Working Directory
Select a working folder for reading and writing data. By default, this path will be set to `/content/drive/MyDrive/AOY/`.

In [None]:
aoytk.display_path_select()

### Set up analysis environment and load data
Next, we'll create an analyzer, which gives us access to a variety of tools for analyzing our data, and then load the data file we want to work with.

In [None]:
atk = aoytk.Analyzer()

In [None]:
atk.load_data()

Now we can have a look at the data we've loaded in!

In [None]:
atk.data

## File Derivative Summary
The following cell will generate a brief summary of the information contained in the file derivative, including the number of files and domains represented and the number and type of different extensions present.


In [None]:
atk.display_file_summary()
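
To make the summary concrete, here is a minimal sketch of the kind of counts it describes, written in plain Python rather than with AOYTK itself. The rows and the `url` and `extension` field names are assumptions made for illustration; your derivative may use different names, and `summarize` is a hypothetical helper, not part of the toolkit.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical rows standing in for a file derivative (field names are assumptions).
rows = [
    {"url": "http://example.org/img/logo.png", "extension": "png"},
    {"url": "http://example.org/img/banner.jpg", "extension": "jpg"},
    {"url": "http://example.com/static/logo.png", "extension": "png"},
]


def summarize(rows):
    """Count files, distinct domains, and files per extension."""
    domains = {urlparse(row["url"]).netloc for row in rows}
    extensions = Counter(row["extension"] for row in rows)
    return {
        "files": len(rows),
        "domains": len(domains),
        "extensions": dict(extensions),
    }


summary = summarize(rows)
print(summary)
```

Here the domain is taken from the host part of each URL, which is one reasonable definition; the toolkit may derive domains differently.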

## Most Common Files
The following widget will display the top `x` files occurring in the dataset, where the number of occurrences is counted either by MD5 hash or by URL.


In [None]:
atk.display_top_files()
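
The difference between counting by MD5 hash and counting by URL can be illustrated with a small sketch. The sample rows, the `md5` and `url` field names, and the `top_files` helper are all assumptions for illustration and are not taken from AOYTK.

```python
from collections import Counter

# Hypothetical rows: the same file content (same MD5) served from two URLs.
rows = [
    {"url": "http://example.org/a/logo.png", "md5": "aaa111"},
    {"url": "http://example.com/b/logo.png", "md5": "aaa111"},
    {"url": "http://example.org/a/logo.png", "md5": "aaa111"},
    {"url": "http://example.org/photo.jpg", "md5": "bbb222"},
]


def top_files(rows, key, x):
    """Return the x most frequent values of `key` together with their counts."""
    return Counter(row[key] for row in rows).most_common(x)


by_hash = top_files(rows, "md5", 1)  # groups identical content, whatever its URL
by_url = top_files(rows, "url", 1)   # groups identical addresses only
print(by_hash, by_url)
```

Counting by hash treats copies of the same file at different addresses as one file, so its counts can be higher than the per-URL counts for the same content.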

### Files with the same MD5 hash shared between different domains

The following widget will print the hashes of files which appear in multiple different domains in the dataset, along with a list of the domains in which they appear.

In [None]:
atk.same_hash_different_domains()
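
As a rough sketch of the idea, the grouping can be done by collecting the set of domains seen for each hash and keeping only hashes with more than one domain. The rows, field names, and the `shared_hashes` helper below are hypothetical; this is not the toolkit's own implementation.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical rows (field names are assumptions).
rows = [
    {"url": "http://example.org/a/logo.png", "md5": "aaa111"},
    {"url": "http://example.com/b/logo.png", "md5": "aaa111"},
    {"url": "http://example.org/photo.jpg", "md5": "bbb222"},
]


def shared_hashes(rows):
    """Map each MD5 hash seen on more than one domain to its sorted domain list."""
    domains_by_hash = defaultdict(set)
    for row in rows:
        domains_by_hash[row["md5"]].add(urlparse(row["url"]).netloc)
    return {
        md5: sorted(domains)
        for md5, domains in domains_by_hash.items()
        if len(domains) > 1
    }


shared = shared_hashes(rows)
print(shared)
```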

## Image-specific Features

The following functions are specifically for use with derivatives containing information about image files.



### Display Image
The following widget allows you to view an image using its URL or MD5 hash value, taken either directly from your derivative CSV or from the output of one of the tools above. The widget will also produce a link to view the image on the Internet Archive's Wayback Machine site, in case it does not display correctly within the notebook. Please note: images may take some time to load into the notebook!

In [None]:
atk.view_image()
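
For reference, a Wayback Machine link to a single capture is commonly written as a timestamp followed by the original URL. The sketch below builds such a link; the `wayback_capture_link` helper and the sample timestamp are hypothetical, and whether the widget builds its link in exactly this way is an assumption.

```python
def wayback_capture_link(image_url, timestamp):
    """Build a Wayback Machine link for one capture.

    `timestamp` is a string of digits such as YYYYMMDD or YYYYMMDDhhmmss.
    """
    return "https://web.archive.org/web/" + timestamp + "/" + image_url


link = wayback_capture_link("http://example.org/hello_world.jpg", "20200101")
print(link)
```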

### Temporal Distribution of an Image

The following widget will generate a link that will allow you to view the number of times the specified image has been saved by the Internet Archive's Wayback Machine, and when it was saved.

In [None]:
atk.get_image_temporal_distribution()
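
The Wayback Machine lists all captures of an address when the timestamp in the link is replaced by `*`. A minimal sketch of building such a link is shown below; `wayback_history_link` is a hypothetical helper, and it is an assumption that the widget's link takes this form.

```python
def wayback_history_link(image_url):
    """Build a Wayback Machine link that lists every capture of a URL."""
    return "https://web.archive.org/web/*/" + image_url


history = wayback_history_link("http://example.org/hello_world.jpg")
print(history)
```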

### Download Images
Download the top `x` images, ranked by how often their URL or MD5 hash occurs, to the specified output folder. The output folder will be created if it does not already exist. This folder will be a subfolder of the working folder set using the `aoytk.display_path_select()` function earlier in the notebook (under "Set Working Directory"). By default, this is the `/content/drive/MyDrive/AOY` folder. To download all images in the derivative, move the slider to the maximum.

**Note:** Downloading large numbers of images may require significant storage space. Be mindful of this when saving large numbers of images to your Google Drive!

Files will be saved under their original filename, with the domain added as a prefix. An image with the name `hello_world.jpg` from domain `example.org` would be saved as `example_org_hello_world.jpg`.

In [None]:
atk.display_download_images()
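
The naming rule described above can be sketched in a few lines. `prefixed_filename` is a hypothetical helper that reproduces the `example.org` example only; how the toolkit handles other characters or duplicate names is not covered here.

```python
def prefixed_filename(domain, filename):
    """Prefix a filename with its domain, replacing dots in the domain with underscores."""
    return domain.replace(".", "_") + "_" + filename


saved_as = prefixed_filename("example.org", "hello_world.jpg")
print(saved_as)  # example_org_hello_world.jpg
```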