## Using the Internet Archive API to access the Medical Heritage Library 

The Internet Archive offers an API (Application Programming Interface) that makes it easier to access its archives programmatically, including the [Medical Heritage Library](https://archive.org/details/medicalheritagelibrary). With just a few lines of code we can download thousands of PDFs and analyze them in different ways. This Jupyter Notebook is set up to help you download PDFs from the Medical Heritage Library.

First, load the "internetarchive" library for Python. This library handles communicating with the Internet Archive.

In [1]:
import internetarchive as ia

Now we run a search for the Medical Heritage Library, and print out how many documents we found:

In [2]:
search_results = ia.search_items('collection:medicalheritagelibrary')
print(search_results.num_found)

139025


You will see that there are over 100,000 items in the Medical Heritage Library. Wow! That's a lot of medical history. Internet Archive's API, right now, can only handle 10,000 items at a time, so try narrowing your search. Here we limit the results to works with the word "attention" in the title:

In [4]:
search_results = ia.search_items('collection:medicalheritagelibrary AND title:attention')
print(search_results.num_found)

47


Great! Now we have 47 items. Let's check out what is in that list.

Right now, the ```search_results``` variable is just a pointer to the search. But in order to get the actual IDs for all the items in the collection, we need to perform that search and store what it returns in a list:

In [5]:
search_items = list(search_results)
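
If you only want a quick look, you may not need the whole list. The following is a minimal sketch, assuming the search object can be iterated lazily like any other Python iterable. `peek` is a hypothetical helper (not part of the internetarchive library), and the generator is a stand-in for a real search so the example runs offline:

```python
from itertools import islice


def peek(results, n=10):
    """Return the first n results of a lazy iterable without reading the rest."""
    return list(islice(results, n))


# Stand-in for a lazy search: yields dicts shaped like the search results
# used in this notebook (each has an 'identifier' key).
fake_results = ({"identifier": "item-{}".format(i)} for i in range(1000))

first_three = peek(fake_results, 3)
print([result["identifier"] for result in first_three])
```

With a real search you would call `peek(search_results, 10)` instead of building the full list first.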

Now let's see what the list contains. This should print out the IDs for the first ten items on the list:

In [6]:
for item in search_items[0:10]:
    print("ID: {}".format(item['identifier']))

ID: 0217316.nlm.nih.gov
ID: 101188494.nlm.nih.gov
ID: 101525032.nlm.nih.gov
ID: 101644556.nlm.nih.gov
ID: 101644615.nlm.nih.gov
ID: 2543035RX1.nlm.nih.gov
ID: 2543035RX2.nlm.nih.gov
ID: 2544040R.nlm.nih.gov
ID: 2544041R.nlm.nih.gov
ID: 2544042R.nlm.nih.gov
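
To look at one of these items in a browser, you can turn its ID into a web address. This sketch assumes item pages follow the same `https://archive.org/details/<identifier>` pattern as the collection link at the top of this notebook. `details_url` is a hypothetical helper name:

```python
def details_url(identifier):
    """Build the archive.org page address for an item identifier."""
    return "https://archive.org/details/{}".format(identifier)


# Two of the identifiers printed above
for identifier in ["0217316.nlm.nih.gov", "101188494.nlm.nih.gov"]:
    print(details_url(identifier))
```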


Now that we know the IDs of what we're interested in, we can download these items. 

In [7]:
import os
from requests import ConnectionError


directory = "MedicalHeritagePDF-hysteria"

# Create the download directory if it does not exist yet
if not os.path.exists(directory):
    os.makedirs(directory)


for search_item in search_items:
    item_id = search_item['identifier']
    item = ia.get_item(item_id)
    try:
        # Download only the listed file formats for this item
        item.download(verbose=True, formats=['DjVuTXT', 'MARC', 'Text PDF', 'MPEG4'], destdir=directory)
    except ConnectionError:
        # Report which item failed so it can be retried later
        print("There's a connection error. Try to get {} again later".format(item_id))

Getting file 0217316.pdf...
Check!
Getting file 101188494.pdf...
Check!
Getting file 101525032.pdf...
Check!
Getting file 101644556.pdf...
Check!
Getting file 101644615.pdf...
Check!
Getting file 2543035RX1.pdf...
Check!
Getting file 2543035RX2.pdf...
Check!
Getting file 2544040R.pdf...
Check!
Getting file 2544041R.pdf...
Check!
Getting file 2544042R.pdf...
Check!


What are the things you can change in this code?

1. What formats to download.
2. The directory you save to.
3. Or--and this requires more explanation--the search terms themselves.
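
One way to make these choices explicit is to wrap the download loop in a function that takes them as arguments. This is only a sketch: `download_items` and its `get_item` parameter are hypothetical names introduced here, and catching `OSError` is an assumption that relies on the connection error from requests being a subclass of it in Python 3.

```python
import os


def download_items(identifiers, destdir, formats, get_item=None):
    """Download the chosen file formats for each identifier into destdir.

    Returns the identifiers that could not be downloaded.
    """
    if get_item is None:
        # Imported here so the function can be tried without the library installed
        import internetarchive as ia
        get_item = ia.get_item

    if not os.path.exists(destdir):
        os.makedirs(destdir)

    failed = []
    for identifier in identifiers:
        try:
            get_item(identifier).download(verbose=True, formats=formats, destdir=destdir)
        except OSError:
            # Assumption: requests' ConnectionError derives from OSError, so this
            # also catches other I/O failures (for example, a full disk).
            failed.append(identifier)
    return failed


# Example call (needs network access and the internetarchive library):
# ids = [result['identifier'] for result in search_items]
# failed = download_items(ids, "MedicalHeritagePDF-hysteria", ['Text PDF'])
```

Returning the failed IDs, rather than only printing them, makes it easy to pass them back in for a second attempt.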

Now, let's say you want to do other types of searches rather than searching by title. Don't worry! You can!

You can search for things like: 

title (the title of the work) 
ex: title:"civil war"

date (the date of the work formatted like so: [YEAR-MONTH-DAY TO YEAR-MONTH-DAY]) 
ex: date:[1820-01-01 TO 1830-12-31]

description (the description of the work)
ex: description:photograph

addeddate (the date the work was added)
ex: addeddate:2016
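
The date filter is a handy way to stay under the 10,000-item limit mentioned earlier: you can split one big search into one search per year. Below is a minimal sketch that only builds the query strings, using the date range format shown above. `year_range_queries` is a hypothetical helper name, and whether each year stays under the limit is something you would need to check with `num_found`:

```python
def year_range_queries(collection, first_year, last_year):
    """Yield one search query per year, each covering January 1 to December 31."""
    for year in range(first_year, last_year + 1):
        yield "collection:{} AND date:[{}-01-01 TO {}-12-31]".format(
            collection, year, year
        )


queries = list(year_range_queries("medicalheritagelibrary", 1820, 1822))
for query in queries:
    print(query)
```

Each of these strings can be passed to `ia.search_items` in the same way as the queries in the cells above.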

In [None]:
search_results = ia.search_items('collection:medicalheritagelibrary AND title:medical')
print(search_results.num_found)

0217316.nlm.nih.gov:
 skipping MedicalHeritagePDF-hysteria/0217316.nlm.nih.gov/0217316.nlm.nih.gov_marc.xml, file already exists based on length and date.
 skipping MedicalHeritagePDF-hysteria/0217316.nlm.nih.gov/0217316_djvu.txt, file already exists based on length and date.
 downloaded 0217316.nlm.nih.gov/0217316.pdf to MedicalHeritagePDF-hysteria/0217316.nlm.nih.gov/0217316.pdf
101188494.nlm.nih.gov:
 skipping MedicalHeritagePDF-hysteria/101188494.nlm.nih.gov/101188494.nlm.nih.gov_marc.xml, file already exists based on length and date.
 downloaded 101188494.nlm.nih.gov/101188494.pdf to MedicalHeritagePDF-hysteria/101188494.nlm.nih.gov/101188494.pdf
 skipping MedicalHeritagePDF-hysteria/101188494.nlm.nih.gov/101188494_djvu.txt, file already exists based on length and date.
101525032.nlm.nih.gov:
 skipping MedicalHeritagePDF-hysteria/101525032.nlm.nih.gov/101525032_djvu.txt, file already exists based on length and date.
 downloaded 101525032.nlm.nih.gov/101525032.pdf to MedicalHeritag

In [15]:
item.files


[{u'crc32': u'b478d272',
  u'format': u'Metadata',
  u'md5': u'ad730d5057dc7f2e669864266c56cd5d',
  u'mtime': u'1332960952',
  u'name': u'2544042R.nlm.nih.gov_marc.xml_meta.txt',
  u'sha1': u'6a84d626b5da58c0e8478bf41c252dcb51c35cda',
  u'size': u'826',
  u'source': u'original'},
 {u'crc32': u'396c2634',
  u'format': u'MARC',
  u'md5': u'67226829f6825492ee6c775586efc3b9',
  u'mtime': u'1332960952',
  u'name': u'2544042R.nlm.nih.gov_marc.xml',
  u'sha1': u'5128cddca5711a68876dd45b40038dcdd63266a6',
  u'size': u'9320',
  u'source': u'original'},
 {u'crc32': u'482cf668',
  u'format': u'Generic Raw Book Zip',
  u'md5': u'0831326e8785ad16bba47330c7fb0519',
  u'mtime': u'1332964415',
  u'name': u'2544042R_images.zip',
  u'sha1': u'd2f2ec3ca5eb040558af49a3fa3ebd7917ab16f4',
  u'size': u'464280844',
  u'source': u'original'},
 {u'crc32': u'a5641080',
  u'format': u'Metadata',
  u'md5': u'09a49760b382f4e02cb07443c7565169',
  u'mtime': u'1332964415',
  u'name': u'2544042R_images.zip_meta.txt',
 