# Meta data analysis of the 19th Century Books from the British Library

This notebook reviews the meta data that comes with the [19th Century book corpus from the British Library](https://data.bl.uk/digbks/db14.html).

## Load the data

Below we show how to load the data into a Python `list`, where each book's meta data is stored as one item in the list:


In [164]:
import json
from pathlib import Path
from typing import Dict, Any, List, Set

def load_book_meta_data(book_data_fp: Path=Path('.', 'book_data.json').resolve()
                        ) -> List[Dict[str, Any]]:
    with book_data_fp.open('r') as book_data_file:
        return json.load(book_data_file)

def unique_identifiers(book_data: List[Dict[str, Any]]) -> Set[str]:
    '''
    :returns: A set of unique book identifiers.
    '''
    identifiers = set()
    for book in book_data:
        identifiers.add(book['identifier'])
    return identifiers


book_meta_data = load_book_meta_data()
print('--------- META DATA START -------')
print(json.dumps(book_meta_data[0], indent=4, sort_keys=True))
print('--------- META DATA END -------')
print(f'The number of books {len(book_meta_data)}')
print('Are all books unique based on their identifier field? '
      f'{len(unique_identifiers(book_meta_data))==len(book_meta_data)}')


--------- META DATA START -------
{
    "authors": {},
    "corporate": {},
    "date": "1888",
    "datefield": "[1888]",
    "edition": "",
    "flickr_url_to_book_images": "http://www.flickr.com/photos/britishlibrary/tags/sysnum000000037",
    "identifier": "000000037",
    "imgs": {
        "0": {
            "000004": [
                "11194557546"
            ],
            "000006": [
                "11193640604"
            ],
            "000007": [
                "11104733396"
            ],
            "000010": [
                "11105407186",
                "11102797916"
            ],
            "000011": [
                "11193211526"
            ],
            "000012": [
                "11290477144"
            ],
            "000014": [
                "11291300706"
            ],
            "000015": [
                "11100321335"
            ],
            "000016": [
                "11195895503"
            ],
            "000021": [
                "1129

Above we can see that all of the books' meta data has been loaded into the `book_meta_data` list, and that the first item in that list holds the meta data for the book `A Gossip about Old Manchester. With illustrations. [Signed: A.]`

Because the meta data has been loaded as a list, we can easily see that there are **49,509** books. However, according to the [British Library OCR book corpus that contains the full text](https://data.bl.uk/digbks/db14.html), there should only be **49,455** books. For this reason we also check above that every book's identifier is unique, and we find that they are. The difference of 54 books could therefore be due to the British Library releasing a smaller corpus of OCR full-text books than the related meta data that we are analysing here.

However, we show below that there are 4 books, at list indexes `[9122, 14699, 33207, 46786]`, that contain no PDFs. One of them is shown as an example in the output below:

In [165]:
def books_with_no_pdfs(book_data: List[Dict[str, Any]]) -> List[int]:
    '''
    :return: A list of indexes whereby each index refers to an index in 
             the given `book_data` list whereby the meta data contains 
             no pdf data.
    '''
    no_pdf_indexes = []
    for index, book in enumerate(book_data):
        if 'pdf' not in book:
            no_pdf_indexes.append(index)
    return no_pdf_indexes
print(f'Indexes of books that contain no PDFs {books_with_no_pdfs(book_meta_data)}')
print(f'Example of index 9122:\n{json.dumps(book_meta_data[9122], indent=4, sort_keys=True)}')

Indexes of books that contain no PDFs [9122, 14699, 33207, 46786]
Example of index 9122:
{
    "authors": {
        "creator": [
            "COLEMAN, F. M."
        ]
    },
    "corporate": {},
    "date": "1807",
    "datefield": "1807",
    "edition": "",
    "flickr_url_to_book_images": "http://www.flickr.com/photos/britishlibrary/tags/sysnum000741339",
    "identifier": "000741339",
    "imgs": {
        "0": {
            "000007": [
                "11001418734"
            ],
            "000008": [
                "11000688195"
            ],
            "000010": [
                "11001417834"
            ],
            "000012": [
                "11001292025",
                "11001377336",
                "11219716895"
            ],
            "000013": [
                "11001516523"
            ],
            "000014": [
                "11219712374"
            ],
            "000016": [
                "11219715495"
            ],
            "000018": [
          

Even though the 4 books above contain no PDFs, they all contain images and other meta data, so we keep them in all further analysis. They are highlighted only to show that some books do not have PDFs.
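Because the `pdf` key can be absent, any later code that reads it needs a fallback. The following is a minimal sketch, assuming that `pdf` maps a volume index to a PDF identifier as in the examples printed in this notebook. `pdf_identifiers` is a hypothetical helper and the two sample books are trimmed down for illustration:

```python
from typing import Any, Dict

def pdf_identifiers(book: Dict[str, Any]) -> Dict[str, str]:
    '''
    :returns: The volume index to PDF identifier mapping for a book, or
              an empty dictionary when the book has no `pdf` key.
    '''
    return book.get('pdf', {})

# Trimmed, illustrative records: only the keys needed here are kept.
book_with_pdf = {'identifier': '000029505', 'pdf': {'0': 'lsidyv33510698'}}
book_without_pdf = {'identifier': '000741339', 'imgs': {'0': {}}}

print(pdf_identifiers(book_with_pdf))
print(pdf_identifiers(book_without_pdf))
```

Returning an empty dictionary rather than raising keeps loops over the whole corpus simple, at the cost of hiding the missing key from the caller.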

### Date formatting

The [meta data description](https://data.bl.uk/digbks/DB21.html) states that the `date` key/field is a more standardised version of the `datefield` key. Here we illustrate this by displaying the first 5 books where `date` and `datefield` differ (after stripping surrounding square brackets from both values):


In [166]:
from typing import Tuple
def difference_in_datefield(book_data: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    '''
    :returns: A list of tuples containing the `date` and the `datefield`
              values respectively for books where these values differ.
    '''
    date_differences = []
    for book in book_data:
        datefield = book['datefield'].strip('[]')
        date = book['date'].strip('[]')
        if datefield != date:
            date_differences.append((date, datefield))
    return date_differences
date_differences = difference_in_datefield(book_meta_data)
for date, datefield in date_differences[:5]:
    print(f'date: {date}    datefield: {datefield}')

date: 1879    datefield: 1879 [1878
date: 1898    datefield: 
date: 1887    datefield: 1887.
date: 1886    datefield: 1886.
date: 1840    datefield: 1840-51


As we can see, the `date` key is much more standardised and should be used when the date of a book is needed.
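In the rows above where `datefield` is not empty, `date` matches the first four-digit year that appears in `datefield`. The sketch below illustrates that relationship. It is only an assumption drawn from these five rows, not the British Library's documented method, and `first_year` is a hypothetical helper:

```python
import re
from typing import Optional

# Assumption: the first run of four digits in `datefield` is the year.
YEAR_PATTERN = re.compile(r'\d{4}')

def first_year(datefield: str) -> Optional[int]:
    '''
    :returns: The first four-digit number found in `datefield` as an
              integer, or None when there is no such number.
    '''
    match = YEAR_PATTERN.search(datefield)
    return int(match.group()) if match else None

for raw_datefield in ['1879 [1878', '', '1887.', '1840-51']:
    print(f'datefield: {raw_datefield!r}    first year: {first_year(raw_datefield)}')
```

An empty `datefield` yields `None`, which is consistent with the row above where `date` is `1898` but `datefield` is empty: in that case `date` must come from somewhere else.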

### Explore all possible meta data keys

Here we look at all the possible meta data keys and how often each one has a non-empty value, as a percentage of the whole corpus:


In [167]:
from collections import Counter

book_meta_data_keys = Counter()
number_books = len(book_meta_data)
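for book in book_meta_data:
    for key, value in book.items():
        if key == 'authors':
            # Only count a book as having an author when a `creator` is listed.
            if 'creator' in value:
                book_meta_data_keys.update([key])
        elif key == 'place':
            # Ignore places that are empty or contain only whitespace.
            if value and value.strip():
                book_meta_data_keys.update([key])
        elif value:
            # Any other key counts when its value is non-empty.
            book_meta_data_keys.update([key])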
book_meta_data_keys = {key: f'{round(float(value) / number_books, 4) * 100:.2f}'
                       for key, value in book_meta_data_keys.items()}
print(json.dumps(book_meta_data_keys, indent=4, sort_keys=True))

{
    "authors": "86.24",
    "corporate": "3.57",
    "date": "99.72",
    "datefield": "96.91",
    "edition": "8.33",
    "flickr_url_to_book_images": "100.00",
    "identifier": "100.00",
    "imgs": "62.98",
    "issuance": "100.00",
    "pdf": "99.99",
    "place": "99.07",
    "publisher": "48.83",
    "shelfmarks": "100.00",
    "title": "100.00"
}


As we can see above not all of the meta data exists for all books.

## Metadata

The following is probably the metadata we would want to collect, based on the statistics we created in the [explore all possible meta data keys section](#explore-all-possible-meta-data-keys):

1. Identifier -- 100%
2. Date -- 99.72%
3. Publisher -- 48.83%
4. Title -- 100%
5. Place -- 99.07%
6. Authors -- 86.24% -- The `authors` meta data can contain the following keys: `['Former owner', 'contributor', 'creator', 'engraver', 'fmo', 'former owner']`. We assume that the `creator` is the author of the book.

The meta data that is missing the most from this list is the `publisher`, which only `48.83%` of the books have.
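Collecting the fields listed above into one flat record per book could look like the sketch below. It assumes that every field other than `authors` holds a string, and it takes the first `creator` as the author. `extract_record` is a hypothetical helper and the sample book is trimmed down for illustration:

```python
from typing import Any, Dict, Optional

def extract_record(book: Dict[str, Any]) -> Dict[str, Optional[str]]:
    '''
    :returns: A flat dictionary with the identifier, date, publisher,
              title, place and author of the book. Missing or empty
              values are returned as None.
    '''
    def clean(key: str) -> Optional[str]:
        # Treat missing, empty and whitespace-only values the same way.
        value = (book.get(key) or '').strip()
        return value or None

    creators = book.get('authors', {}).get('creator', [])
    return {
        'identifier': book['identifier'],
        'date': clean('date'),
        'publisher': clean('publisher'),
        'title': clean('title'),
        'place': clean('place'),
        'author': creators[0].strip() if creators else None,
    }

# Trimmed, illustrative record: keys not listed here are simply absent.
sample_book = {'identifier': '000741339', 'date': '1807', 'edition': '',
               'authors': {'creator': ['COLEMAN, F. M.']}}
print(extract_record(sample_book))
```

Using `None` for every kind of missing value means that downstream code only has one case to check for.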

### Number of volumes

The other piece of meta data that might be useful to store is the number of volumes of the book. This can be found through the `pdf` and `imgs` meta data fields. Below we show a table containing the percentage of books that have the given number of volumes:


In [168]:
import pandas as pd

def number_volumes_per_book(book_data: List[Dict[str, Any]]) -> Dict[str, List[int]]:
    '''
    :returns: A dictionary with two keys: volumes and count. Where the
              volumes key contains a list of integers representing the 
              volumes that exist for all books. The count contains a 
              same size list with the counts for the corresponding 
              volumes.
    '''
    volumes_counter = Counter()
    for book in book_data:
        volume = None
        if 'pdf' not in book:
            volume = len(book['imgs'])
        else:
            volume = len(book['pdf'])
        volumes_counter.update([volume])
    volumes_count = {'volumes': [], 'count': []}
    for volume, count in volumes_counter.items():
        volumes_count['volumes'].append(volume)
        volumes_count['count'].append(count)
    return volumes_count

number_books = len(book_meta_data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    volume_count = number_volumes_per_book(book_meta_data)
    volume_count['%'] = [(count / number_books) * 100 for count in volume_count['count']]
    df = pd.DataFrame(volume_count)
    display(df.set_index('volumes').sort_index().T.round(2))

volumes,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,28,29,34,39,41,65
count,42448.0,3108.0,3106.0,287.0,155.0,113.0,52.0,51.0,27.0,48.0,18.0,22.0,10.0,9.0,6.0,4.0,6.0,6.0,6.0,3.0,1.0,3.0,1.0,6.0,3.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0
%,85.74,6.28,6.27,0.58,0.31,0.23,0.11,0.1,0.05,0.1,0.04,0.04,0.02,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As we can see, the majority of books (85.74%) have only one volume, and ~98.3% of books have at most 3 volumes. Thus few books in the collection have more than one volume, and very few have more than 3.
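The ~98.3% figure is the sum of the first three columns of the table. A small sketch of that arithmetic, with the counts copied from the table above; `share_with_at_most` is a hypothetical helper:

```python
from typing import Dict

def share_with_at_most(volume_counts: Dict[int, int], max_volumes: int,
                       total_books: int) -> float:
    '''
    :returns: The percentage of `total_books` that have at most
              `max_volumes` volumes.
    '''
    within = sum(count for volumes, count in volume_counts.items()
                 if volumes <= max_volumes)
    return within / total_books * 100

# Counts for 1, 2 and 3 volumes, copied from the table above.
first_three_columns = {1: 42448, 2: 3108, 3: 3106}
total_number_books = 49509
print(f'{share_with_at_most(first_three_columns, 3, total_number_books):.2f}%')
```

Within the notebook the full mapping could be built from the `volumes` and `count` lists that `number_volumes_per_book` returns, rather than being typed in by hand.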

### Date overview

Here we show the number of books in this dataset per decade:

In [169]:
book_decades_count = Counter()
for book in book_meta_data:
    if 'date' in book:
        if book['date']:
            book_decades_count.update([(int(book['date']) // 10) * 10])
book_decade = []
decade_count = []
decade_cuml_freq = []
current_cuml_freq = 0
for decade, count in sorted(book_decades_count.items(), key=lambda x: x[0]):
    book_decade.append(decade)
    decade_count.append(count)
    count_percentage = ((float(count) / number_books) * 100) + current_cuml_freq
    decade_cuml_freq.append(count_percentage)
    current_cuml_freq = count_percentage
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    df = pd.DataFrame({'decade': book_decade, 'count': decade_count, 'Cumulative %': decade_cuml_freq})
    display((df.set_index('decade').sort_index().T).round(2))


decade,1510,1520,1540,1550,1560,1570,1580,1590,1600,1610,1620,1630,1640,1650,1660,1670,1680,1690,1700,1710,1720,1730,1740,1750,1760,1770,1780,1790,1800,1810,1820,1830,1840,1850,1860,1870,1880,1890,1900,1910,1920,1930,1940
count,1.0,1.0,1.0,1.0,1.0,2.0,1.0,5.0,12.0,13.0,12.0,103.0,47.0,41.0,74.0,119.0,108.0,146.0,91.0,88.0,68.0,143.0,95.0,136.0,177.0,282.0,315.0,417.0,1139.0,1937.0,1963.0,2086.0,3228.0,4678.0,5258.0,6156.0,8102.0,12072.0,98.0,68.0,78.0,1.0,5.0
Cumulative %,0.0,0.0,0.01,0.01,0.01,0.01,0.02,0.03,0.05,0.08,0.1,0.31,0.4,0.49,0.64,0.88,1.09,1.39,1.57,1.75,1.89,2.18,2.37,2.64,3.0,3.57,4.21,5.05,7.35,11.26,15.23,19.44,25.96,35.41,46.03,58.46,74.83,99.21,99.41,99.55,99.71,99.71,99.72


As we can see, very few books in this dataset were published before the 19th century (this is best seen in the `Cumulative %` row). Furthermore, compared with the 19th century, the 20th century decades contain very few books.
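To put a number on this, the decade counts for 1800 to 1890 can be summed and divided by the size of the corpus. This is a sketch with the counts copied from the table above; `share_between` is a hypothetical helper, and treating the decades 1800 to 1890 as "the 19th century" is an assumption that follows the table's decade buckets:

```python
from typing import Dict

def share_between(decade_counts: Dict[int, int], start: int, end: int,
                  total_books: int) -> float:
    '''
    :returns: The percentage of `total_books` whose decade is greater
              than or equal to `start` and less than `end`.
    '''
    within = sum(count for decade, count in decade_counts.items()
                 if start <= decade < end)
    return within / total_books * 100

# Decade counts copied from the table above.
nineteenth_century_counts = {1800: 1139, 1810: 1937, 1820: 1963, 1830: 2086,
                             1840: 3228, 1850: 4678, 1860: 5258, 1870: 6156,
                             1880: 8102, 1890: 12072}
print(f'{share_between(nineteenth_century_counts, 1800, 1900, 49509):.2f}%')
```

The result agrees with the `Cumulative %` row: 99.21 at the 1890s minus 5.05 at the 1790s gives 94.16.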

### Author overview

Here we show the number of authors and how frequently each author appears in our dataset. All author names are normalised by lower-casing them and removing any whitespace on either side of the name:

In [170]:
author_counter = Counter()

for book in book_meta_data:
    if 'authors' in book:
        if book['authors']:
            if 'creator' in book['authors']:
                creator = book['authors']['creator']
                assert len(creator) == 1
                creator = creator[0].strip().lower()
                author_counter.update([creator])
# Top 20 authors:
print('Top 20 authors')
for author, count in author_counter.most_common(20):
    print(f'Author: {author}\nBooks written within this dataset: {count}')
# Total number of authors
print()
print(f'Total number of authors in the dataset: {len(author_counter)}')
single_authors = sum([count for count in author_counter.values() if count == 1])
print(f'Number of authors that only have one book in the dataset: {single_authors}')

Top 20 authors
Author: byron, george gordon byron - baron
Books written within this dataset: 154
Author: scott, walter - sir
Books written within this dataset: 106
Author: wood, henry - mrs
Books written within this dataset: 103
Author: dickens, charles
Books written within this dataset: 79
Author: oliphant, (margaret) - mrs
Books written within this dataset: 74
Author: shakespeare, william
Books written within this dataset: 62
Author: marryat, afterwards church, afterwards lean, florence.
Books written within this dataset: 58
Author: goldsmith, oliver
Books written within this dataset: 55
Author: ainsworth, william harrison
Books written within this dataset: 47
Author: dryden, john.
Books written within this dataset: 47
Author: burns, robert
Books written within this dataset: 46
Author: payn, james
Books written within this dataset: 42
Author: fenn, george manville.
Books written within this dataset: 41
Author: norie, john william.
Books written within this dataset: 41
Author: carey, 

As we can see, a lot of authors have only one book in the dataset (19,249). On the other hand, the most published author, George Gordon Byron, has 154 books in the dataset. From the author names we can also see that they are full names that include the authors' titles, e.g. Baron, Sir, etc.
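The names above also differ in small ways that could split one author into several entries, such as a trailing full stop (`dryden, john.`) or a title after a ` - ` separator. The sketch below shows a slightly stronger normalisation. It assumes that everything after the last ` - ` is a title or descriptor that can be dropped, which may not hold for every record, and `normalise_author` is a hypothetical helper:

```python
import re

def normalise_author(raw_name: str) -> str:
    '''
    :returns: The author name lower cased, without a trailing
              " - title" part and without a trailing full stop.
    '''
    name = raw_name.strip().lower()
    # Assumption: the text after the last " - " is a title or descriptor.
    name = re.sub(r'\s+-\s+[^-]*$', '', name)
    # Remove a trailing full stop, but keep it after a single-letter
    # initial such as the "m." in "coleman, f. m."
    name = re.sub(r'(?<=\w\w)\.$', '', name)
    return name.strip()

for raw_author in ['BYRON, George Gordon Byron - Baron', 'DRYDEN, John.',
                   'COLEMAN, F. M.']:
    print(f'{raw_author!r} -> {normalise_author(raw_author)!r}')
```

Whether `scott, walter - sir` and a plain `scott, walter` really are the same person is a judgement call, so counts produced this way should be spot-checked before they are relied on.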

### Place overview

Here we are going to explore the `place` meta data in the same way as we did with authors. 

In [145]:
place_counter = Counter()

for book in book_meta_data:
    if 'place' in book:
        place = book['place'].strip().lower()
        if place:
            place_counter.update([place])

print('Top 20 places')
for place, count in place_counter.most_common(20):
    print(f'Place: {place}\tCount: {count}')

print()
print(f'Total number of places in the dataset: {len(place_counter)}')
single_places = sum([count for count in place_counter.values() if count == 1])
print(f'Number of places that only have one book in the dataset: {single_places}')

Top 20 places
Place: london	Count: 23405
Place: paris	Count: 2165
Place: new york	Count: 1093
Place: edinburgh	Count: 1033
Place: leipzig	Count: 514
Place: philadelphia	Count: 442
Place: berlin	Count: 425
Place: dublin	Count: 381
Place: boston [mass.]	Count: 247
Place: wien	Count: 240
Place: boston	Count: 238
Place: glasgow	Count: 214
Place: bruxelles	Count: 206
Place: chicago	Count: 187
Place: oxford	Count: 181
Place: calcutta	Count: 175
Place: manchester	Count: 166
Place: stockholm	Count: 166
Place: madrid	Count: 153
Place: kjøbenhavn	Count: 143

Total number of places in the dataset: 5370
Number of places that only have one book in the dataset: 3764


Compared with the authors, there are far fewer places (5370). London alone accounts for roughly 47% of the dataset (23,405 of 49,509 books). However, there are still `3764` places that have only one book attached to them. The names might be difficult to match automatically to their actual place name, e.g. "Boston \[mass.\]" should be "Boston, Massachusetts". To look into this we shall use the [geonames](http://download.geonames.org/export/dump/) `allCountries.zip` list of country and city names to see how many of the place names we can link with an actual location:

In [181]:
country_name_fp = Path('.', 'CountriesNamesAlt.txt').resolve()
all_city_and_country_names = set()
with country_name_fp.open('r') as country_name_file:
    for name in country_name_file:
        name = name.strip()
        if name:
            all_city_and_country_names.add(name.lower())
print(f'Number of country and city names in the geonames list: {len(all_city_and_country_names)}')

Number of country and city names in the geonames list: 15106416


In [191]:
places_cannot_identify = []
place_split_lengths = Counter()
for place in place_counter.keys():
    if place not in all_city_and_country_names:
        place_split_length = len(place.split(','))
        place_split_lengths.update([place_split_length])
        places_cannot_identify.append(place)
print(f'Number of unique places that we cannot identify: {len(places_cannot_identify)}')
books_with_places_cannot_identify = 0
for place in places_cannot_identify:
    books_with_places_cannot_identify += place_counter[place]
print(f'Number of books whose place name cannot be resolved to a name in geonames: {books_with_places_cannot_identify}')

Number of unique places that we cannot identify: 3566
Number of books whose place name cannot be resolved to a name in geonames: 7154


After loading and lower-casing the geonames place names, we have about 15 million names to compare the book place names to. Out of the 5370 unique place names, we can match 1804 and cannot resolve 3566. That does seem like a lot of place names; however, across the whole book collection (49,509), those 3566 unique place names only affect 7154 books. Below we show the top 20 unresolved place names, ranked by the number of books they affect:

In [192]:
place_counter_no_name = Counter()
for place in places_cannot_identify:
    place_counter_no_name[place] = place_counter[place]
place_counter_no_name.most_common(20)

[('boston [mass.]', 247),
 ('kjøbenhavn', 143),
 ('с.-петербургъ', 129),
 ('edinburgh & london', 120),
 ('new-york', 97),
 ('münchen', 85),
 ('london, guildford [printed]', 80),
 ('санктпетербургъ', 78),
 ('edinburgh and london', 74),
 ('london]', 72),
 ('london & new york', 64),
 ('méxico', 59),
 ('london; guildford [printed]', 56),
 ("'s gravenhage", 51),
 ('london; edinburgh [printed]', 51),
 ('london, edinburgh [printed]', 51),
 ('zürich', 50),
 ('genève', 40),
 ('v praze', 35),
 ('с. петербургъ', 32)]

As we can see, `boston [mass.]` and `kjøbenhavn` affect the most books and, if resolved, would account for 390 books in total. A fair few entries contain more than one place name, e.g. `edinburgh & london`; some have extra text that needs to be removed, e.g. `[printed]`; and some are in the native language, e.g. `с. петербургъ`, which I believe is `Saint Petersburg`. Below we show the unresolved place names that affect the fewest books. These show larger errors, which mainly consist of extra text that is not relevant to the place name, e.g. the date.



In [198]:
place_counter_no_name.most_common()[-20:]

[('london: w. oxberry; sold by sherwood, neely & jones, 1812', 1),
 ('london: lowndes & hobbs; sold by hatchard, [1811]', 1),
 ('london: d. wilson & t. durham, 1752', 1),
 ('london: t. n. longman & o. rees, 1800', 1),
 ('london: richard bentley & son, 1885', 1),
 ('london: printed for nicholas vavasour, 1634', 1),
 ('london: macmillan & co., pp. xl, 400. 1895', 1),
 ('london, 1893', 1),
 ('boston: ticknor & co., 1886', 1),
 ('pp. xxvi, 375. london: jackson, walford & hodder; 1865', 1),
 ('pp. viii. 64. j. debrett: london, 1789', 1),
 ('g. eld: london, 1608', 1),
 ('j. y., for e. d. & n. e.: london, 1652', 1),
 ('for henry brome: london, 1661', 1),
 ('paris, amsterdam, 1768', 1),
 ('london, new york: george routledge & sons, [1882]', 1),
 ('london: printed by woodfall & kinder, 1860', 1),
 ('printed by h. e. carrington; london', 1),
 ('newcastle-upon-tyne, simpkin, marshall & co', 1),
 ('4 vol. parker & co.: oxford, 1891, 1892', 1)]
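Some of the simpler problems in the two lists above (bracketed annotations, stray brackets and several places joined together) could be reduced with a cleaning step before the geonames lookup. This is a rough sketch under those assumptions only: `candidate_place_names` is a hypothetical helper, and it does not handle native-language names, hyphenated variants such as `new-york`, or entries that include publishers and dates:

```python
import re
from typing import List

def candidate_place_names(raw_place: str) -> List[str]:
    '''
    :returns: A list of cleaned, lower cased candidate place names
              found in the given raw `place` value.
    '''
    place = raw_place.strip().lower()
    # Remove bracketed annotations such as "[printed]" or "[mass.]".
    place = re.sub(r'\[[^\]]*\]', ' ', place)
    # Remove any unmatched bracket that is left over, e.g. "london]".
    place = re.sub(r'[\[\]]', ' ', place)
    # Split on the separators that join several places together.
    parts = re.split(r'\s*(?:&|;|,|\band\b)\s*', place)
    return [part.strip() for part in parts if part.strip()]

for raw_place in ['boston [mass.]', 'edinburgh & london', 'london]',
                  'london; guildford [printed]']:
    print(f'{raw_place!r} -> {candidate_place_names(raw_place)}')
```

Each candidate could then be looked up in `all_city_and_country_names`. Note that dropping `[mass.]` also drops the information that separates Boston, Massachusetts from other places called Boston, so this trades precision for coverage.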

### Publisher overview

Here we are going to explore the `publisher` meta data in the same way as we did with authors and places. 

In [151]:
publisher_counter = Counter()

london_publishers = []
for book in book_meta_data:
    if 'publisher' in book:
        publisher = book['publisher'].strip().lower()
        if publisher:
            publisher_counter.update([publisher])
            if publisher == 'london':
                london_publishers.append(book)

print('Top 20 publishers')
for publisher, count in publisher_counter.most_common(20):
    print(f'Publisher: {publisher}\tCount: {count}')

print()
print(f'Total number of publishers in the dataset: {len(publisher_counter)}')
single_publishers = sum([count for count in publisher_counter.values() if count == 1])
print(f'Number of publishers that only have one book in the dataset: {single_publishers}')

Top 20 publishers
Publisher: hurst & blackett	Count: 418
Publisher: macmillan & co.	Count: 401
Publisher: chatto & windus	Count: 377
Publisher: sampson low & co.	Count: 374
Publisher: chapman & hall	Count: 315
Publisher: longmans & co.	Count: 296
Publisher: london	Count: 281
Publisher: john murray	Count: 274
Publisher: f. v. white & co.	Count: 259
Publisher: privately printed	Count: 252
Publisher: r. bentley & son	Count: 251
Publisher: ward & downey	Count: 247
Publisher: cassell & co.	Count: 246
Publisher: smith, elder & co.	Count: 216
Publisher: hutchinson & co.	Count: 216
Publisher: printed for the author	Count: 203
Publisher: remington & co.	Count: 183
Publisher: w. blackwood & sons	Count: 172
Publisher: simpkin, marshall & co.	Count: 171
Publisher: g. routledge & sons	Count: 167

Total number of publishers in the dataset: 7088
Number of publishers that only have one book in the dataset: 5177


This is more similar to the `place` meta data than to the authors: there are far fewer publishers in total (7088) and fewer publishers with only one book (5177). Some publishers are popular in the dataset, the most frequent being `hurst & blackett`. However, one publisher is interesting: `london`, which I thought would be a place name. An example of a book whose publisher is `London` is shown below. We can see that the `publisher` meta data has been mixed up with the `place` meta data: in this example `A. Millar` in the `place` field probably refers to [Andrew Millar](https://en.wikipedia.org/wiki/Andrew_Millar).

In [158]:
print(json.dumps(london_publishers[0], indent=4, sort_keys=True))

{
    "authors": {
        "contributor": [
            "HOME, John - Author of \u201cDouglas.\u201d"
        ]
    },
    "corporate": {},
    "date": "1758",
    "datefield": "1758",
    "edition": "",
    "flickr_url_to_book_images": "http://www.flickr.com/photos/britishlibrary/tags/sysnum000029505",
    "identifier": "000029505",
    "imgs": {
        "0": {
            "000003": [
                "10999869474"
            ],
            "000006": [
                "10999792016"
            ],
            "000007": [
                "10999931323"
            ],
            "000020": [
                "10999934983"
            ],
            "000035": [
                "10999795436"
            ],
            "000050": [
                "10999869144"
            ],
            "000064": [
                "10999871024"
            ]
        }
    },
    "issuance": "monographic",
    "pdf": {
        "0": "lsidyv33510698"
    },
    "place": "A. Millar",
    "publisher": "London",
  

In [163]:
date_range = sorted([int(book['date']) for book in london_publishers])
print(f'London publishing problem starts at date {date_range[0]} and ends in {date_range[-1]}')

London publishing problem starts at date 1667 and ends in 1898


We can see that the London `publisher` problem is not confined to the older books: it spans almost the whole date range of the dataset, from 1667 to 1898, and affects 281 books in total.
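The same kind of mix-up may exist for places other than London. One possible heuristic, shown here only as a sketch, is to flag books whose `publisher` is a known place name while their `place` is not. `publisher_looks_like_place` is a hypothetical helper, and the small `known_places` set stands in for a full list such as the geonames names loaded earlier:

```python
from typing import Any, Dict, Set

def publisher_looks_like_place(book: Dict[str, Any],
                               known_places: Set[str]) -> bool:
    '''
    :returns: True when the `publisher` value is a known place name and
              the `place` value is not, which suggests that the two
              fields have been mixed up.
    '''
    publisher = (book.get('publisher') or '').strip().lower()
    place = (book.get('place') or '').strip().lower()
    return publisher in known_places and place not in known_places

# Illustrative stand-in for a full set of lower cased place names.
known_places = {'london', 'paris', 'edinburgh'}
suspect_book = {'place': 'A. Millar', 'publisher': 'London'}
regular_book = {'place': 'London', 'publisher': 'John Murray'}

print(publisher_looks_like_place(suspect_book, known_places))
print(publisher_looks_like_place(regular_book, known_places))
```

Flagged books would still need manual review, since a publisher can legitimately share its name with a place.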

## Conclusion

To conclude we have covered:

1. How to load the meta data.
2. An example of what each book's meta data looks like.
3. The number of books with meta data (49,509).
4. Overview of the meta data that might be of interest.
5. Explored some of the meta data fields.  