# What if there is no (maintained) Python library?

![corpus DB home page](figs/corpusdb.png)
Corpus-DB, a textual corpus database for the digital humanities by @j0_0n (Jonathan Reeve), provides a small API documented at http://corpus-db.org/docs

# Requests
Since no Python library has been written specifically for Corpus-DB, we fall back on [requests](http://docs.python-requests.org/en/master/), the general-purpose Python library for making HTTP requests.
```
Kenny Meyers—
    Python HTTP: When in doubt, or when not in doubt, use Requests. Beautiful, simple, Pythonic.
```
If you haven't installed it yet, [open a terminal](https://github.com/GCDigitalFellows/installdri.github.io/blob/master/anaconda.md) and type:
```bash
conda install requests -y
```

In [1]:
#let's import requests
import requests

# Some of the corpus db API: http://corpus-db.org/docs
Get all the metadata for all books by a certain author.

Handles Project Gutenberg authors, for now. Write name in the form Last, First.
```http://corpus-db.org/api/author/<Last, First>```

Example: get metadata for all books by Jane Austen.
```http://corpus-db.org/api/author/Austen, Jane```

Get the full text for all books by a certain author.

Handles Project Gutenberg authors, for now. Write name in the form Last, First.
```http://corpus-db.org/api/author/<Last, First>/fulltext```

Example: get full text for all books by Jane Austen.
```http://corpus-db.org/api/author/Austen, Jane/fulltext```
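
The two endpoints above can be wrapped in a small helper. This is only a sketch: the function name `author_url` is made up for illustration, and it percent-encodes the comma and space in the author's name on the assumption that the server decodes them in the usual way.

```python
from urllib.parse import quote

BASE = "http://corpus-db.org/api/author"  # base path taken from the docs above


def author_url(name, fulltext=False):
    """Build a Corpus-DB author URL (hypothetical helper).

    `name` is written in the form 'Last, First'.
    """
    url = f"{BASE}/{quote(name)}"  # quote() encodes ', ' as '%2C%20'
    if fulltext:
        url += "/fulltext"
    return url


metadata_url = author_url("Austen, Jane")
fulltext_url = author_url("Austen, Jane", fulltext=True)
print(metadata_url)  # http://corpus-db.org/api/author/Austen%2C%20Jane
print(fulltext_url)  # http://corpus-db.org/api/author/Austen%2C%20Jane/fulltext
```

Either URL can then be passed to `requests.get`.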

In [2]:
# let's get the metadata for all books by Jane Austen
r = requests.get("http://corpus-db.org/api/author/Austen, Jane")

In [6]:
# what's in r?
r.status_code

200
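
Status codes in the 200s mean success; 400s and 500s mean something went wrong. As a rough sketch of how to read them, the standard library's `http.HTTPStatus` can turn a code into its official phrase. The helper name `describe_status` is made up for this example. With requests itself, `r.ok` and `r.raise_for_status()` do a similar check on a real response.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)
    kind = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({kind})"


ok_message = describe_status(200)
missing_message = describe_status(404)
print(ok_message)       # 200 OK (success)
print(missing_message)  # 404 Not Found (not a success)
```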

In [5]:
# http://docs.python-requests.org/en/master/api/#requests.Response

In [7]:
# 200 means nothing broke, but what's in r?
# here we're looking at the first 325 characters
r.text[:325]

'[{"lcsh":"{\'Love stories\', \'Psychological fiction\', \'Young women -- Fiction\', \'Dysfunctional families -- Fiction\', \'First loves -- Fiction\', \'Ship captains -- Fiction\', \'Regency fiction\', \'Rejection (Psychology) -- Fiction\', \'Motherless families -- Fiction\', \'England -- Social life and customs -- 19th century -- Fiction\'}",'

In [12]:
# 'lcsh' holds the Library of Congress Subject Headings
# r.json() parses the JSON response into Python lists and dicts,
# which are easier to work with than the raw text
r.json()[0]['lcsh']

"{'Love stories', 'Psychological fiction', 'Young women -- Fiction', 'Dysfunctional families -- Fiction', 'First loves -- Fiction', 'Ship captains -- Fiction', 'Regency fiction', 'Rejection (Psychology) -- Fiction', 'Motherless families -- Fiction', 'England -- Social life and customs -- 19th century -- Fiction'}"

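Note that the `lcsh` value is still a single string that happens to look like a Python set. One way to turn it into a real set is `ast.literal_eval` from the standard library, which evaluates literals only and does not run arbitrary code. This is a sketch on a shortened copy of the string, assuming every value follows the same format:

```python
import ast

# A shortened copy of the 'lcsh' string shown above
lcsh_raw = "{'Love stories', 'Psychological fiction', 'Regency fiction'}"

lcsh = ast.literal_eval(lcsh_raw)  # parses Python literals such as sets and lists
print(type(lcsh).__name__)  # set
print(sorted(lcsh))  # ['Love stories', 'Psychological fiction', 'Regency fiction']
```
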
In [13]:
# let's use pandas for a quick view
import pandas as pd
# pd.json_normalize is the current name for pd.io.json.json_normalize
meta = pd.json_normalize(r.json())
meta.head()

Unnamed: 0,Unnamed: 1,LCC,_repo,_version,alternative_title,author,authoryearofbirth,authoryearofdeath,contributor,covers,...,tableOfContents,title,titlepage_image,type,url,wikipedia,wp_info,wp_literary_genres,wp_publication_date,wp_subjects
0,102,{'PR'},Persuasion_105,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'archival', 'image_path': 'epu...",...,,Persuasion,,Text,http://www.gutenberg.org/ebooks/105,['https://fi.wikipedia.org/wiki/Viisasteleva_s...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Novels_about_nobility', 'John_Murray_(publis..."
1,116,{'PR'},Northanger-Abbey_121,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,,Northanger Abbey,,Text,http://www.gutenberg.org/ebooks/121,['https://fi.wikipedia.org/wiki/Northanger_Abb...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Novels_by_Jane_Austen', 'Novels_set_in_Somer..."
2,134,{'PR'},Mansfield-Park_141,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,,Mansfield Park,,Text,http://www.gutenberg.org/ebooks/141,['https://fi.wikipedia.org/wiki/Kasvattityt%C3...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Novels_by_Jane_Austen', 'British_novels_adap..."
3,151,{'PR'},Emma_158,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,,Emma,,Text,http://www.gutenberg.org/ebooks/158,['https://fi.wikipedia.org/wiki/Emma_(romaani)...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,['Novel_of_manners'],,"['Novels_by_Jane_Austen', 'Novels_about_nobili..."
4,154,{'PR'},Sense-and-Sensibility_161,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,,Sense and Sensibility,,Text,http://www.gutenberg.org/ebooks/161,['https://fi.wikipedia.org/wiki/J%C3%A4rki_ja_...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Debut_novels', 'Novels_by_Jane_Austen', 'Wor..."


In [14]:
# but it's a corpus, so where are the texts? let's fetch the full text too
r = requests.get("http://corpus-db.org/api/author/Austen, Jane/fulltext")

In [15]:
fulltext = pd.json_normalize(r.json())
fulltext.head()

Unnamed: 0,id,text
0,105,by Al Haines.\n\n\n\n\n\n\n\n\n\n\nPersuasion\...
1,121,\n\n\n\n\nNORTHANGER ABBEY\n\n\nby\n\nJane Aus...
2,141,\n\n\n\n\nMANSFIELD PARK\n\n(1814)\n\n\nBy Jan...
3,158,\n\n\n\n\nEMMA\n\nBy Jane Austen\n\n\n\n\nVOLU...
4,161,\nSpecial thanks are due to Sharon Partridge f...


In [16]:
#remove leading and trailing new lines
fulltext["text"] = fulltext['text'].str.strip("\n")
fulltext['text']

0     by Al Haines.\n\n\n\n\n\n\n\n\n\n\nPersuasion\...
1     NORTHANGER ABBEY\n\n\nby\n\nJane Austen (1803)...
2     MANSFIELD PARK\n\n(1814)\n\n\nBy Jane Austen\n...
3     EMMA\n\nBy Jane Austen\n\n\n\n\nVOLUME I\n\n\n...
4     Special thanks are due to Sharon Partridge for...
5     LADY SUSAN\n\nby Jane Austen\n\n\n\n\nI\n\n\nL...
6     LOVE AND FREINDSHIP AND OTHER EARLY WORKS\n\n(...
7     PRIDE AND PREJUDICE\n\nBy Jane Austen\n\n\n\nC...
8                              Transcriber's Note:\n...
9     Online Distributed Proofreading Team at http:/...
10    THE WORKS OF JANE AUSTEN\n\n\n\nEdited by Davi...
11       Note de transcription:\n   Les erreurs clai...
12      Au lecteur\n\n  Madame de Montolieu a tradui...
13      Au lecteur\n\n  Madame de Montolieu a tradui...
14      Au lecteur\n\n  Cette version électronique r...
15    produced from scanned images of public domain ...
16      Au lecteur\n\n  Madame de Montolieu a tradui...
17    [Transcriber's Note: letters that were sup

In [17]:
# how do we combine the two?
print(meta.columns)
print(fulltext.columns)

Index(['', 'LCC', '_repo', '_version', 'alternative_title', 'author',
       'authoryearofbirth', 'authoryearofdeath', 'contributor', 'covers',
       'creator', 'description', 'downloads', 'edition_identifiers',
       'edition_note', 'filename', 'formats', 'gutenberg_bookshelf',
       'gutenberg_issued', 'gutenberg_type', 'id', 'identifiers', 'jmdate',
       'language_note', 'languages', 'lcsh', 'production_note',
       'publication_date', 'publication_note', 'publisher', 'releaseDate',
       'rights', 'rights_url', 'series_note', 'subjects', 'summary',
       'tableOfContents', 'title', 'titlepage_image', 'type', 'url',
       'wikipedia', 'wp_info', 'wp_literary_genres', 'wp_publication_date',
       'wp_subjects'],
      dtype='object')
Index(['id', 'text'], dtype='object')


In [18]:
# id looks similar? using values just to make the printing cleaner
print(meta['id'].values)
print(fulltext['id'].values)

['105.0' '121.0' '141.0' '158.0' '161.0' '946.0' '1212.0' '1342.0'
 '21839.0' '25946.0' '31100.0' '33388.0' '35151.0' '35163.0' '36777.0'
 '37431.0' '37634.0' '42078.0' '42671.0' '45186.0']
['105' '121' '141' '158' '161' '946' '1212' '1342' '21839' '25946' '31100'
 '33388' '35151' '35163' '36777' '37431' '37634' '42078' '42671' '45186']


In [19]:
# let's clean up meta id by removing the trailing '.0'
# the pattern is anchored at the end, so only a literal '.0' suffix is removed
meta['id'] = meta['id'].str.replace(r'\.0$', '', regex=True)
meta['id'].values

array(['105', '121', '141', '158', '161', '946', '1212', '1342', '21839',
       '25946', '31100', '33388', '35151', '35163', '36777', '37431',
       '37634', '42078', '42671', '45186'], dtype=object)
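
A detail worth checking whenever trailing characters are removed from ids like these: Python's `str.rstrip` takes a set of characters rather than a suffix, so `rstrip('.0')` also eats trailing zeros that belong to the number itself. A minimal illustration in plain Python, using the id `31100.0` from the printed ids above:

```python
import re

raw_id = "31100.0"

# rstrip treats '.0' as a set of characters and strips every one of them from the right
stripped = raw_id.rstrip(".0")

# anchoring the pattern at the end removes exactly one literal '.0'
cleaned = re.sub(r"\.0$", "", raw_id)

print(stripped)  # 311
print(cleaned)   # 31100
```

An id mangled this way no longer matches its counterpart in `fulltext`, so that book would silently drop out of a merge.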

In [20]:
# how do we merge so we have metadata with text?
# left.merge(right, **kwargs); both frames share the key column 'id'
austen_corpus = meta.merge(fulltext, on='id')
austen_corpus.head()

Unnamed: 0,Unnamed: 1,LCC,_repo,_version,alternative_title,author,authoryearofbirth,authoryearofdeath,contributor,covers,...,title,titlepage_image,type,url,wikipedia,wp_info,wp_literary_genres,wp_publication_date,wp_subjects,text
0,102,{'PR'},Persuasion_105,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'archival', 'image_path': 'epu...",...,Persuasion,,Text,http://www.gutenberg.org/ebooks/105,['https://fi.wikipedia.org/wiki/Viisasteleva_s...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Novels_about_nobility', 'John_Murray_(publis...",by Al Haines.\n\n\n\n\n\n\n\n\n\n\nPersuasion\...
1,116,{'PR'},Northanger-Abbey_121,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,Northanger Abbey,,Text,http://www.gutenberg.org/ebooks/121,['https://fi.wikipedia.org/wiki/Northanger_Abb...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Novels_by_Jane_Austen', 'Novels_set_in_Somer...",NORTHANGER ABBEY\n\n\nby\n\nJane Austen (1803)...
2,134,{'PR'},Mansfield-Park_141,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,Mansfield Park,,Text,http://www.gutenberg.org/ebooks/141,['https://fi.wikipedia.org/wiki/Kasvattityt%C3...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Novels_by_Jane_Austen', 'British_novels_adap...",MANSFIELD PARK\n\n(1814)\n\n\nBy Jane Austen\n...
3,151,{'PR'},Emma_158,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,Emma,,Text,http://www.gutenberg.org/ebooks/158,['https://fi.wikipedia.org/wiki/Emma_(romaani)...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,['Novel_of_manners'],,"['Novels_by_Jane_Austen', 'Novels_about_nobili...",EMMA\n\nBy Jane Austen\n\n\n\n\nVOLUME I\n\n\n...
4,154,{'PR'},Sense-and-Sensibility_161,0.2.0,,"Austen, Jane",1775,1817,,"[{'cover_type': 'generated', 'image_path': 'co...",...,Sense and Sensibility,,Text,http://www.gutenberg.org/ebooks/161,['https://fi.wikipedia.org/wiki/J%C3%A4rki_ja_...,{'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...,,,"['Debut_novels', 'Novels_by_Jane_Austen', 'Wor...",Special thanks are due to Sharon Partridge for...
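
By default `merge` keeps only rows whose key appears in both frames, so any id that fails to match disappears without warning. One way to check is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. The two small frames below are made-up stand-ins for `meta` and `fulltext`, with one deliberately mismatched id:

```python
import pandas as pd

# Toy stand-ins for meta and fulltext (hypothetical data)
meta_toy = pd.DataFrame({"id": ["105", "121", "311"],
                         "title": ["Persuasion", "Northanger Abbey", "Works"]})
text_toy = pd.DataFrame({"id": ["105", "121", "31100"],
                         "text": ["...", "...", "..."]})

# how='outer' keeps unmatched rows; indicator=True labels each row's origin
check = meta_toy.merge(text_toy, on="id", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
unmatched_ids = sorted(unmatched["id"])
print(unmatched_ids)  # ['311', '31100']
```

An empty result means every row found its partner. The same check can be run on `meta` and `fulltext` before saving.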


In [21]:
# let's save out the new dataset
austen_corpus.to_csv("austen.csv")
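
When the CSV is read back later, pandas will guess the column types again and may turn the ids back into numbers. A small sketch of a round trip that keeps them as strings, using an in-memory buffer and made-up toy rows instead of the real file:

```python
import io

import pandas as pd

# Toy rows standing in for austen_corpus (hypothetical data)
toy = pd.DataFrame({"id": ["105", "121"],
                    "title": ["Persuasion", "Northanger Abbey"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False skips writing the row index as a column

buffer.seek(0)
as_numbers = pd.read_csv(buffer)  # pandas infers that id is numeric

buffer.seek(0)
as_strings = pd.read_csv(buffer, dtype={"id": str})  # force id to stay text

print(as_numbers["id"].tolist())  # [105, 121]
print(as_strings["id"].tolist())  # ['105', '121']
```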

# Try with a different author