# 2019-05-31

Today, I am following:

- [Jake VanderPlas](https://github.com/jakevdp/PythonDataScienceHandbook)'s `03.01-Introducing-Pandas-Objects.ipynb`.
- Thomas Cram's [download instructions](https://github.com/coltongrainger/fy20pandas/blob/master/derived-examples/download_image.ipynb)
    > This notebook demonstrates how to download an image from a remote URL hosted at the National Archives Catalog.  These images are collections of scanned pages from historic ship logs; see https://catalog.archives.gov/id/23709729 for an example.
    
We'll load the same packages as on 2019-05-29.

In [1]:
import numpy as np
import pandas as pd

import os
import requests
from pathlib import Path

import random

Next, we pick a representative sample of images for bulk download. First, we list the available metadata files.

In [2]:
%ls metadata/*

metadata/2018-05-16-wood-metadata-federal-archives.txt

metadata/2018-05-16-NARA-master-manifest:
all.csv  CG.csv  Navy_A2.csv  Navy.csv  RAC.csv  sums.csv  USCS.csv


In [3]:
# NARA record group number -> short name used in the manifest file names
NARA_record_group_dict = {
    23: 'USCS',  # Records of the Coast and Geodetic Survey
    24: 'Navy',  # Records of the Bureau of Naval Personnel
    26: 'CG',    # Records of the U.S. Coast Guard
    261: 'RAC',  # Records of Former Russian Agencies
}
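
The short names in this dictionary match four of the CSV files listed above (`USCS.csv`, `Navy.csv`, `CG.csv`, `RAC.csv`). A minimal sketch of turning a record group number into a manifest path, assuming that naming pattern holds; `manifest_path` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

# Directory name taken from the listing above.
MANIFEST_DIR = Path("metadata/2018-05-16-NARA-master-manifest")

# Same mapping as NARA_record_group_dict, repeated so the block is self-contained.
record_groups = {23: "USCS", 24: "Navy", 26: "CG", 261: "RAC"}


def manifest_path(record_group):
    """Return the per-group manifest path (assumes files are named <short name>.csv)."""
    return MANIFEST_DIR / f"{record_groups[record_group]}.csv"


for number in record_groups:
    print(number, manifest_path(number).as_posix())
```

The listing also shows `Navy_A2.csv`, `all.csv`, and `sums.csv`, which this mapping does not cover.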

Import `all.csv` (just Wood's NARA metadata).

In [4]:
df = pd.read_csv('metadata/2018-05-16-NARA-master-manifest/all.csv')

# trim leading/trailing spaces from column names
df.rename(str.strip, axis='columns', inplace=True)
# remove redundant column
df.drop(columns=['Box or Volume Number.1'], inplace=True)
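
The `.1` suffix is what pandas appends when a CSV header repeats a column name, so the dropped column is presumably a second `Box or Volume Number` column in `all.csv`. A small sketch with made-up data (the toy CSV is an assumption, not the real manifest) that reproduces the clean-up and only drops the column if it really duplicates the first:

```python
import io

import pandas as pd

# Toy CSV with padded and repeated header names (illustrative data only).
raw = io.StringIO(
    " Ship Name ,Box or Volume Number,Box or Volume Number\n"
    "HASSLER,Box 1944,Box 1944\n"
)
toy = pd.read_csv(raw)
print(list(toy.columns))  # the repeated name gets a ".1" suffix

# Strip whitespace from every column name.
toy = toy.rename(columns=str.strip)

# Drop the suffixed column only when it carries the same values.
if toy["Box or Volume Number"].equals(toy["Box or Volume Number.1"]):
    toy = toy.drop(columns=["Box or Volume Number.1"])

print(list(toy.columns))
```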

In [5]:
# take a random sample of 10 entries from each record group
sample = pd.concat(
    [df.loc[df['Record Group'] == gp].sample(10, random_state=1)
     for gp in NARA_record_group_dict]
)
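
Note that `sample(10, ...)` raises a `ValueError` if a record group has fewer than 10 rows, because sampling is without replacement by default. A hedged sketch of a per-group sampler that caps the sample size at the group size; `sample_per_group` and the toy frame are hypothetical.

```python
import pandas as pd


def sample_per_group(frame, column, n, seed=1):
    """Sample up to n rows from each group of `column`, without replacement."""
    parts = []
    for _, group in frame.groupby(column):
        # Cap at the group size so small groups do not raise ValueError.
        parts.append(group.sample(min(len(group), n), random_state=seed))
    return pd.concat(parts)


# Illustrative data: group 24 has a single row.
toy = pd.DataFrame(
    {
        "Record Group": [23, 23, 23, 24, 26, 26],
        "Ship Name": ["A", "B", "C", "D", "E", "F"],
    }
)
picked = sample_per_group(toy, "Record Group", n=2)
print(picked.groupby("Record Group").size())
```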

In [6]:
# keep only entries whose NARA URL contains no spaces
# (a crude check: free-text notes contain spaces, URLs do not)
ndf = sample[~sample['NARA URL'].str.contains(" ")]
ndf

Unnamed: 0,Ship Name,Record Group,Entry Number,Box or Volume Number,Digital Directory,Start Date,End Date,Assets,Number of Images,Number of Pages,NARA URL,Geographic Focus
1744,HASSLER,23.0,102,Volume 71 and 74,hassler-1876,01/01/1876,09/24/1876,2,281,561,https://catalog.archives.gov/id/24332142,Arctic
1758,HASSLER,23.0,102,Box 1944,hassler-1890,01/01/1890,12/31/1890,1,288,573,https://catalog.archives.gov/id/24335367,Arctic
1349,DALE (DD-353),24.0,118-A1,b2721,Dale-DD-353-1943-03,03/01/1943,03/31/1943,1,67,67,https://catalog.archives.gov/id/24357119,Arctic
1850,IDAHO (BB-42),24.0,118-A1,Box 4796,idaho-bb-42-1944-05,05/01/1944,05/31/1944,1,101,101,http://catalog.archives.gov/id/17298664,Arctic
617,BEAR (AG-29),24.0,118-A1,Box 894,Bear-AG-29-1940-04,04/01/1940,04/30/1940,1,65,65,http://research.archives.gov/description/7794769,Arctic
1489,EDWARDS (DD-619),24.0,118-A1,b3195,Edwards-DD-619-1945-11,11/01/1945,11/30/1945,1,67,67,http://research.archives.gov/description/7795021,Arctic
986,CHARLESTON (PG-51),24.0,118-A1,Box 1956,Charleston-PG-51-1945-08,08/01/1945,08/31/1945,1,65,65,http://research.archives.gov/description/7794966,Arctic
2283,MONAGHAN (DD-354),24.0,118-A1,b6298,Monaghan-DD-354-1942-11,11/01/1942,11/30/1942,1,51,100,https://catalog.archives.gov/id/24361657,Arctic
2280,MONAGHAN (DD-354),24.0,118-A1,b6298,Monaghan-DD-354-1942-08,08/01/1942,08/31/1942,1,47,91,https://catalog.archives.gov/id/24361480,Arctic
1081,CHELAN,26.0,159A,Box 541,chelan-1935-10,10/01/1935,10/31/1935,1,42,83,https://catalog.archives.gov/id/23678516,Arctic
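
The no-spaces test is a coarse filter. A stricter check, sketched under the assumption that valid entries are `http(s)` URLs on one of the two hosts visible in the output above and end in a numeric ID; `is_plausible_nara_url` and `KNOWN_HOSTS` are hypothetical names.

```python
from urllib.parse import urlsplit

# Hosts seen in the sample output; an assumption, not a complete list.
KNOWN_HOSTS = {"catalog.archives.gov", "research.archives.gov"}


def is_plausible_nara_url(value):
    """True if value is an http(s) URL on a known host ending in a numeric ID."""
    if not isinstance(value, str) or " " in value.strip():
        return False  # missing values (NaN) and free-text notes fail here
    parts = urlsplit(value.strip())
    last_segment = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return (
        parts.scheme in {"http", "https"}
        and parts.netloc in KNOWN_HOSTS
        and last_segment.isdigit()
    )


print(is_plausible_nara_url("https://catalog.archives.gov/id/24332142"))
print(is_plausible_nara_url("see box 12"))
```

With pandas this could be applied as `sample[sample['NARA URL'].map(is_plausible_nara_url)]`.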


List unique ships.

In [7]:
ships = ndf['Ship Name'].unique()
ships

array(['HASSLER', 'DALE (DD-353)', 'IDAHO (BB-42)', 'BEAR (AG-29)',
       'EDWARDS (DD-619)', 'CHARLESTON (PG-51)', 'MONAGHAN (DD-354)',
       'CHELAN', 'STORIS', 'MANNING', 'UNALGA', 'ATALANTA'], dtype=object)

List ship and value count in `ndf`.

In [8]:
counts = ndf['Ship Name'].value_counts()  # count once, not once per ship
[(ship, counts[ship]) for ship in ships]

[('HASSLER', 2),
 ('DALE (DD-353)', 1),
 ('IDAHO (BB-42)', 1),
 ('BEAR (AG-29)', 1),
 ('EDWARDS (DD-619)', 1),
 ('CHARLESTON (PG-51)', 1),
 ('MONAGHAN (DD-354)', 2),
 ('CHELAN', 1),
 ('STORIS', 3),
 ('MANNING', 1),
 ('UNALGA', 1),
 ('ATALANTA', 2)]
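
The same ship/count pairs can be built without pandas. A minimal sketch using `collections.Counter`, which keeps keys in first-seen order; the list of names is a shortened, illustrative stand-in for `ndf['Ship Name']`.

```python
from collections import Counter

# Illustrative stand-in for the 'Ship Name' column.
names = ["HASSLER", "HASSLER", "DALE (DD-353)", "STORIS", "STORIS", "STORIS"]

counts = Counter(names)       # one pass over the names
pairs = list(counts.items())  # (ship, count) in order of first appearance
print(pairs)
# [('HASSLER', 2), ('DALE (DD-353)', 1), ('STORIS', 3)]
```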

### Define URL parameters

In [9]:
entry = ndf.sample(1, random_state=0)
entry

Unnamed: 0,Ship Name,Record Group,Entry Number,Box or Volume Number,Digital Directory,Start Date,End Date,Assets,Number of Images,Number of Pages,NARA URL,Geographic Focus
1758,HASSLER,23.0,102,Box 1944,hassler-1890,01/01/1890,12/31/1890,1,288,573,https://catalog.archives.gov/id/24335367,Arctic


In [10]:
base_url = 'https://catalog.archives.gov/'

In [11]:
nara_id = entry['NARA URL'].iloc[0].split("/")[-1]
nara_id

'24335367'
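
Splitting on `/` and taking the last piece works for the URL above but returns an empty string if the URL ends in a slash, and keeps any query string. A slightly more defensive sketch using the standard library; `extract_nara_id` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def extract_nara_id(url):
    """Return the trailing numeric path segment of a catalog URL as a string."""
    path = urlsplit(url).path.rstrip("/")  # drops query string and trailing slash
    candidate = path.rsplit("/", 1)[-1]
    if not candidate.isdigit():
        raise ValueError(f"no numeric ID at the end of {url!r}")
    return candidate


print(extract_nara_id("https://catalog.archives.gov/id/24335367"))   # 24335367
print(extract_nara_id("https://catalog.archives.gov/id/24335367/"))  # 24335367
```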

In [12]:
digital_directory = entry['Digital Directory'].iloc[0]
digital_directory

'hassler-1890'

In [13]:
# zero-pad to three digits so that 261 gives 'rg-261' rather than 'rg-0261'
record_group = "rg-{0:03d}".format(int(entry['Record Group'].iloc[0]))
record_group

'rg-023'
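
Zero-padding to a fixed width of three covers both the two-digit and the three-digit record groups. A small sketch, assuming the directory segment is always `rg-` followed by three digits (only `rg-023` and `rg-026` actually appear in these notes); `record_group_segment` is a hypothetical helper.

```python
def record_group_segment(record_group):
    """Format a record group number (int or float) as 'rg-NNN'."""
    return "rg-{0:03d}".format(int(record_group))


# The four record groups used in this notebook, as floats like in the manifest.
for value in (23.0, 24.0, 26.0, 261.0):
    print(value, "->", record_group_segment(value))
```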

In [14]:
num_images = int(entry['Number of Images'].iloc[0])
num_images

288

### Download images for a single entry
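
Once the image URLs are known, each file still has to be fetched and written to disk. A minimal sketch of that step, not the notebook's own code: `fetch_bytes` and `save_images` are hypothetical helpers, and the fetch uses the standard library's `urllib.request` in place of `requests` so the block runs without extra packages.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Return the body of `url`; urlopen raises HTTPError on 4xx/5xx responses."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_images(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL under dest_dir, named by its last path segment.

    Files that already exist are skipped, so an interrupted run can resume.
    Returns the list of paths written during this call.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if target.exists():
            continue
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Passing `fetch` as a parameter makes it easy to try the function with a stub before sending hundreds of requests to the catalog, for example `save_images(urls, "images/hassler-1890", fetch=lambda u: b"")`.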

In [17]:
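# Sample image URL:
# https://catalog.archives.gov/OpaAPI/media/23709293/content/dc-metro/rg-026/587169/0002/Aivik-1943-01/Aivik-1943-01_0004.JPG
# Note: the sample joins directory and image number with "_", while the loop
# below uses "-"; the "587169/0002" segment is copied from the sample and may
# differ for other entries.

for n in range(num_images):
    img = "{0}-{1:04d}.JPG".format(digital_directory, n)
    url = "{0}OpaAPI/media/{1}/content/dc-metro/{2}/587169/0002/{3}/{4}".format(
        base_url, nara_id, record_group, digital_directory, img)
    print(url)
    #r = requests.get(url)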
    

https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0000.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0001.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0002.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0003.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0004.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0005.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0006.JPG
https://catalog.archives.gov/OpaAPI/media/24335367/content/dc-metro/rg-023/587169/0002/hassler-1890/hassler-1890-0007.JPG
https://catalog.archives