# Initial Corpus Counts

*notes*

note regarding structure: this comes after I describe the bibliographical sources and the clean-up process for my origbib, but before I discuss the file sources.
Elements still needed: 
1. why discuss my sources in this depth - before drawing samples, need to know the shape of the foundation
2. discussion of why even this bibliography is not "representative" - I think this is handled in more introductory materials, but can highlight elements of it here.
3. a good place to requote a source about the function of anthologies vs bibliographies
4. I need a naming structure for my charts/images. Mix of chapter/sections and Fig ids?
5. I need to figure out a way to render footnotes (base Markdown doesn't seem to work?)
6. I need to figure out integrating citations smoothly (seem to be resources on this)
7. Create some charts for by decade instead of year - more clear? Also curious!

*last update: May 21*

---------------------

Following the iterative process of cleaning my data, my bibliography consisted of 4970 short titles (with multivolume works combined into one entry), along with their author and year of publication. Even without matching these titles to digital files, this dataset offers an opportunity to explore the metadata of travel writing, ranging from publication dates to cross-references. Because of the boundaries of my project goals and my resources, the findings presented below are only a glimpse of the benefits of this type of bibliographical collation; larger bibliographical projects such as *British Travel Writing*, which also records source cross-references, will offer a more comprehensive view of these networks.

This deep examination of my sources is also a critical step en route to digital corpus formation. The constitution of this foundational corpus influences what digital titles we may find and the distribution of publication dates. Lacking a comprehensive equivalent to Cox in the nineteenth century, my sources become more disparate. 



A variety of sources contributed to this data and influenced the numbers. Before we move into searching for files, it is important to understand the constitution of this foundational corpus, because it influences what works we may find.

This notebook calculates and visualizes numbers in my original bibliography. This includes:
1. Publication numbers sorted by year
2. Publication numbers sorted by source bibliography
   - Unique titles (i.e., titles that appear in only one bibliography)
   - Common titles (i.e., titles that appear in more than one bibliography)

There are a few things to note before diving into the numbers of this notebook. First, I delete titles that are dated 17-- (a total of 5 titles). Secondly, the `pubdate_short` column is drawn from the first four characters of the `pubdate` column, meaning that a work with a date range, such as Samuel Buck's "buck's antiquities or venerable remains" (published 1727-1740 according to Cox), is reduced to 1727. For many multi-volume titles (especially anthologies such as John Senex's *Modern Geography*, which can be published over several years or even decades), this decision creates a necessary distortion of the publication numbers; it affects 214 titles.
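The date handling described above can be sketched on a handful of rows. This is a minimal illustration, not the notebook's actual cleaning code: the `toy` frame is made up for the example, and only the column names `pubdate` and `pubdate_short` and the "17--" placeholder come from the bibliography.

```python
import pandas as pd

# Illustrative rows only; the column names mirror origbib.
toy = pd.DataFrame({"pubdate": ["1727-1740", "1775", "17--", "1800"]})

# Drop entries that are only dated "17--".
toy = toy[toy["pubdate"] != "17--"].copy()

# Keep the first four characters, so a date range collapses to its first year.
toy["pubdate_short"] = pd.to_numeric(toy["pubdate"].str[:4])

print(toy["pubdate_short"].tolist())
```

The range 1727-1740 ends up counted once, under 1727, which is the distortion noted above.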

I also take this opportunity to remind my readers (what am I, a game-show host?! ~change~ ) that not all of the titles are accurately recorded as first editions.

Let's jump in and look at some of the numbers, and try to come up with some of my analysis as I do so.

In [1]:
# I had trouble updating the config (ie, my account) details for plotly and cufflinks
# Running command line/Anaconda prompt as admin solved the issue
import plotly.plotly as py
import cufflinks as cf

import pandas as pd
import numpy as np

In [2]:
from markdown.extensions import footnotes

In [3]:
origbib = pd.read_excel('finalbibs/origbib.v.16.xlsx', encoding='utf8')
len(origbib)

4970

In [4]:
# drop rows where the pubdate column is 17--
origbib = origbib[origbib.pubdate != '17--']
origbib['pubdate_short'] = pd.to_numeric(origbib['pubdate_short'])
len(origbib)

4965

In [5]:
origbib['pubdate_short'].value_counts().iplot(kind='bar', yTitle = 'Number of Titles',
                                          xTitle = 'Publication Year', title = 'Publication Counts from Original Bibliography',
                                          filename='dissertation/origbib_counts')

Let's start with a description of the chart. The highest peak is 144 titles in 1800, and the late 1780s and the 1790s are also higher on average. Even without looking at which bibliography the various titles are from, the century divide is very clear: after 1800, the number of titles per year drops precipitously.

In [6]:
origbib['pubdate_short'].value_counts().sort_index()

1700    25
1701    18
1702    20
1703    21
1704    18
1705    23
1706    17
1707    28
1708    26
1709    20
1710    22
1711    24
1712    21
1713    10
1714    26
1715    26
1716    13
1717    27
1718    14
1719    17
1720    41
1721    12
1722    28
1723    22
1724    24
1725    26
1726    35
1727    29
1728    23
1729    23
        ..
1801    15
1802    12
1803    12
1804    14
1805    15
1806    11
1807    12
1808     6
1809    10
1810    16
1811     9
1812    10
1813    19
1814    50
1815    48
1816    41
1817    50
1818    53
1819    24
1820    19
1821    25
1822    23
1823    14
1824    26
1825    29
1826    13
1827    16
1828    20
1829    22
1830    25
Name: pubdate_short, Length: 131, dtype: int64

If titles were spread evenly across the period, both the mean and the median publication date would fall at the midpoint between 1700 and 1830, which is 1765. The mean of all of our publication dates, however, is 1771, and the median is 1777: the weight of the late eighteenth century drags the averages forward in time, though all three values fall within a 12-year span.

In [7]:
origbib['pubdate_short'].mean()

1771.158710976838

In [8]:
origbib['pubdate_short'].median()

1777.0
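As a quick check on the comparison above, the offsets from the midpoint can be worked out directly. This is only a sketch using the two values printed by the cells above; the variable names are mine.

```python
# Values reported by the mean() and median() cells above.
mean_year = 1771.158710976838
median_year = 1777.0

# Midpoint of the 1700-1830 range: where both averages would sit
# if titles were spread evenly across the period.
midpoint = (1700 + 1830) / 2

# How far each average is pulled toward the later years.
mean_offset = mean_year - midpoint
median_offset = median_year - midpoint

print(midpoint, round(mean_offset, 1), median_offset)
```

The median sits further from the midpoint than the mean does, which fits a distribution with a heavy cluster of titles in the late eighteenth century.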

# Bibliography by Source

I used 14 different sources for my bibliography (*see 1.XX for a more detailed description of each bibliography*). To better understand how each source contributes, we can compare several elements of these subcorpora:

1. Overall number of titles from each source
2. Publication dates from each source
3. Unique titles (titles only appear in one source)
4. Common titles (titles occur in multiple sources)

Because titles can appear in more than one source, totals calculated from the individual sources are inflated [~note passive] compared to the publication counts above.

As a reminder, our sources include:

In [9]:
listOfBibs = ['Robinson_W','Gove','BrynMawr', 'Leask', 'TEE', 'BDanth', 'Andrews',
 'Cox', 'BTW_W', 'Murray', 'Irish_McVeagh', 'BTW_EuroTour', 'NCCO_c19Trav', 'NCCO_TravelNarr']

## Overall Number of Titles from Each Source

In [10]:
###### make a list of value_counts() results, then make them into a df

overallValueCountsList = []
for bib in listOfBibs:
    overallValueCountsList.append(origbib[bib].value_counts())

# number of titles from each bibliography
# transpose makes the col names (ie, the bibliographies) into index rows

overallValueCountsDF = pd.concat(overallValueCountsList, axis=1).transpose()
overallValueCountsDF.rename(columns={'x':'OverallTitleCounts'}, inplace=True)
overallValueCountsDF.sort_values(by=['OverallTitleCounts'])

Unnamed: 0,OverallTitleCounts
NCCO_TravelNarr,10
Robinson_W,41
BDanth,54
Leask,61
Andrews,65
TEE,67
Gove,79
Irish_McVeagh,93
NCCO_c19Trav,116
Murray,118


In [9]:
overallValueCountsDF.sum()

OverallTitleCounts    5458
dtype: int64

In [10]:
# compare the inflation from cross-referenced titles
overallValueCountsDF.sum() - len(origbib)

OverallTitleCounts    493
dtype: int64

Cox clearly dominates: of the 4965 titles in the `origbib` corpus, 4247 (85.5%) are in Cox. The next closest source, BrynMawr, contributes only 198. BTW_EuroTour is particularly interesting, as it contains 172 titles but has a very limited scope: it focuses on travel between 1814 and 1818, excluding Britain and Ireland. If BTW_EuroTour's total is at all predictive of publication numbers (172 titles over a four-year span is 43 titles per year), then we might expect at least 1290 titles published from the end of 1800 to the end of 1830, rather than the 659 currently in my bibliography.
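The extrapolation from BTW_EuroTour can be laid out as arithmetic. This is a rough sketch, not a prediction: the four-year divisor is my reading of how the figure of 1290 is reached, and the five-year alternative is an added assumption that counts 1814 through 1818 inclusively.

```python
titles = 172        # BTW_EuroTour total reported above
target_years = 30   # 1801 through 1830

# Assumption A: treat 1814-1818 as a four-year span.
estimate_four = titles * target_years / 4

# Assumption B: count 1814, 1815, 1816, 1817, and 1818 inclusively (five years).
estimate_five = titles * target_years / 5

print(estimate_four, estimate_five)
```

Under either assumption the estimate is well above the number of post-1800 titles currently in the bibliography, and BTW_EuroTour's exclusion of Britain and Ireland suggests that both figures are low.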

Murray's impact on the publication of travel writing is clear as well; this single publishing house, which includes only 'Non-European' travel texts, contributes more titles than McVeagh's catalogue of Irish travels, or Gove's list of imaginary travels.

The anthologies and critical works, such as BDanth, Leask, Andrews, and TEE, have fewer titles in general, as we might expect from works that do not aim at the cataloguing function of bibliographies. The unique cases of the NCCO collection are commented on earlier. *keep this comment?*

In [11]:
overallValueCountsDF.drop('Cox').sort_values(by=['OverallTitleCounts']).iplot(kind='bar', yTitle = 'Number of Titles',
                                          xTitle = 'Source', title = 'Publication Counts by Source (except Cox)',
                                          filename='dissertation/origbib_counts_by_source')

~Stefan note~ I also tried out a bar graph (below) that includes Cox and uses logs to ease visualization (`logy=True`), but I'm not sure if it is more helpful than the graph minus Cox above, and [the documentation on the plotly page](https://plot.ly/pandas/log-plot/) is a little...empty.

In [12]:
overallValueCountsDF.sort_values(by=['OverallTitleCounts']).iplot(kind='bar', yTitle = 'Number of Titles',
                                          xTitle = 'Source', title = 'Publication Counts by Source',
                                          filename='dissertation/origbib_counts_by_sourceWithCox', logy=True)

## Publication Counts by Year from Each Source

Just as our sources have different overall publication counts, they also contribute in different ways to the striation ~awk?~ of publication dates. 


In [13]:
tempBibList = []

for bib in listOfBibs:
    
    # select rows with only that bib source
    tempDF = origbib[origbib[bib] == 'x']
    
    # create value_counts() series, name for each bib source
    tempValueCounts = tempDF['pubdate_short'].value_counts().sort_index().rename(bib)
    
    # add the series to a list, to be used for concating
    tempBibList.append(tempValueCounts)

# concat all the series together on the index (ie, on the pubdate_short)
bibPubDF = pd.concat([x for x in tempBibList], axis =1, sort=False)

bibPubDF.sort_index(axis=1, inplace=True)

# preview the first 5 rows
bibPubDF.head(5)

Unnamed: 0,Andrews,BDanth,BTW_EuroTour,BTW_W,BrynMawr,Cox,Gove,Irish_McVeagh,Leask,Murray,NCCO_TravelNarr,NCCO_c19Trav,Robinson_W,TEE
1700,,,,,,24.0,,1.0,,,,,,
1701,,,,,,18.0,,,,,,,,
1702,,,,,,19.0,,1.0,,,,,,
1703,,,,,,20.0,1.0,,,,,,,
1704,,,,,2.0,16.0,,,,,,,,


In [14]:
bibPubDF.iplot(kind='bar', yTitle = 'Number of Titles', xTitle = 'Year of Publication',
               title = "Publication Counts from Each Source", filename='dissertation/origbib_counts_by_each_source')

In [15]:
bibPubDF.iplot(kind='bar', barmode='stack', yTitle = 'Number of Titles', xTitle = 'Year of Publication',
               title = "Publication Counts from Each Source", filename='dissertation/origbib_counts_by_each_source_stacked')

In [16]:
# We can also visualize these separately, by bib 

for bib in listOfBibs:
    
    # select rows with only that bib source
    tempDF = origbib[origbib[bib] == 'x']
    
    # create value_counts() series sorted by year, named for each bib source
    tempValueCounts = tempDF['pubdate_short'].value_counts().sort_index().rename(bib)
    
    tempValueCounts.iplot(kind='bar', yTitle = 'Number of Titles from ' + bib, xTitle = 'Year of Publication',
               title = "Publication Counts from " + bib, filename='dissertation/origbib_counts_'+bib)


~Stefan note~ - There are some issues here with the colouring. The colours for Gove and McVeagh are way too similar, and they shouldn't be next to each other like that. In general, I think that the stacked chart is more clear than the other one above, but I'm wondering if I should collapse into decades?

The importance of Cox is clear, especially if you view the graph without his contributions (toggle Cox off in the chart above by clicking its entry in the key). He provides the only titles for 13 of the years before 1770: 1701, 1710, 1711, 1716, 1717, 1718, 1721, 1730, 1746, 1753, 1758, 1764, and 1769. It is important to remind the reader here that my bibliography does not include every single list of travel writing, nor even every bibliography as such; works such as Shirley H. Weber's *Voyages and Travels in Greece, the Near East and Adjacent Regions Made Previous to the Year 1801* would have contributed to the diversity of my corpus, but I did not transcribe them during corpus creation (see Relevant Section).
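One way to find the years for which Cox is the only source is sketched below. It is an illustration under assumptions: the `toy` frame copies only the first three years and three of the fourteen columns from the `bibPubDF` preview, and the names `others_empty` and `cox_only_years` are hypothetical.

```python
import pandas as pd

nan = float("nan")

# First three years from the bibPubDF preview, limited to three columns.
toy = pd.DataFrame(
    {
        "Cox": [24.0, 18.0, 19.0],
        "Gove": [nan, nan, nan],
        "Irish_McVeagh": [1.0, nan, 1.0],
    },
    index=[1700, 1701, 1702],
)

# A year counts as Cox-only when Cox has titles and every other source is empty.
others_empty = toy.drop(columns="Cox").isna().all(axis=1)
cox_only_years = toy[toy["Cox"].notna() & others_empty].index.tolist()

print(cox_only_years)
```

Run against the full `bibPubDF`, the same mask should recover the list of Cox-only years given above, though I have not verified that here.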


In general, more sources contribute to the years after 1790; 1795 and 1817 are the highest, with eleven sources each.

In [17]:
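# mark each year/source cell as 1 if the source lists at least one title that year, else 0
maxSourcesDF = bibPubDF.copy()
for col in maxSourcesDF.columns.tolist():
    maxSourcesDF[col] = (maxSourcesDF[col] >= 1).astype(int)

# count the contributing sources for each year
sourcesPerYear = maxSourcesDF.sum(axis=1, skipna=True).sort_values()

# show the 21 years with the most sources (a bare expression mid-cell is not displayed)
print(sourcesPerYear.tail(21))

sourcesPerYear.iplot(kind='bar', yTitle = 'Number of Sources', xTitle = 'Year',
               title = "Sources per Year", filename='dissertation/origbib_sources_per_year')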

These latter years, with their higher numbers of sources for each year, are due in part to the cross-section of sources that I used. As is evident below, the greatest overlap of bibliographies falls in the 1790s, which every source except BTW_EuroTour covers.

Note that some works, such as Cox, record titles decades before 1700; similarly, some sources, such as McVeagh, reach far beyond my 1830 limit. 

In [18]:
import plotly.plotly as py
import plotly.figure_factory as ff

origbibSourceDateCoverage = [dict(Task="Cox", Start='1650-01-01', Finish='1880-12-31'),
      
      # what to do for contemporary? just add note?
      dict(Task="Robinson", Start='1650-01-01', Finish='1880-12-31'),
      dict(Task="McVeagh", Start='1650-01-01', Finish='1880-12-31'),
      
      dict(Task="BrynMawr", Start='1650-01-01', Finish='1850-12-31'),
      dict(Task="Gove", Start='1700-01-01', Finish='1800-12-31'),
      dict(Task="BDanth", Start='1700-01-01', Finish='1830-12-31'),
      dict(Task="Leask", Start='1770-01-01', Finish='1840-12-31'),
      dict(Task="Andrews", Start='1760-01-01', Finish='1800-12-31'),
      dict(Task="TEE", Start='1772-01-01', Finish='1857-12-31'),
      dict(Task="NCCO_19thC", Start='1776-01-01', Finish='1880-12-31'),
      dict(Task="NCCO_TravNarr", Start='1786-01-01', Finish='1880-12-31'),
      dict(Task="Murray", Start='1773-01-01', Finish='1859-12-31'),
      dict(Task="BTW_W", Start='1780-01-01', Finish='1840-12-31'),
      dict(Task="BTW_EuroTour", Start='1814-01-01', Finish='1818-12-31'),
     ]

fig = ff.create_gantt(origbibSourceDateCoverage, title='Coverage of Year by Sources')
py.iplot(fig, filename='dissertation/source_time_coverage', world_readable=True, xTitle = 'Year')
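The overlap shown in the Gantt chart can also be counted year by year. The sketch below reuses the start and end years from the cell above (reduced to years), and the names `coverage` and `sources_per_year` are mine. It measures nominal coverage only, that is, whether a year falls inside a source's stated range, not whether the source actually lists a title for that year.

```python
# Start and end years copied from the Gantt cell above.
coverage = {
    "Cox": (1650, 1880), "Robinson": (1650, 1880), "McVeagh": (1650, 1880),
    "BrynMawr": (1650, 1850), "Gove": (1700, 1800), "BDanth": (1700, 1830),
    "Leask": (1770, 1840), "Andrews": (1760, 1800), "TEE": (1772, 1857),
    "NCCO_19thC": (1776, 1880), "NCCO_TravNarr": (1786, 1880),
    "Murray": (1773, 1859), "BTW_W": (1780, 1840), "BTW_EuroTour": (1814, 1818),
}

# Count the sources whose stated range includes each year of the project period.
sources_per_year = {
    year: sum(start <= year <= end for start, end in coverage.values())
    for year in range(1700, 1831)
}

# Find the years with the greatest nominal overlap.
peak = max(sources_per_year.values())
peak_years = [year for year, n in sources_per_year.items() if n == peak]

print(peak, peak_years[0], peak_years[-1])
```

Nominal coverage is an upper bound: a source whose range includes a year need not contribute a title to it, which is why the observed maximum of eleven sources per year is lower than the figure this count produces.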

## Most Common Titles

My sources are a mixture of bibliographies, critical texts, archival holdings, and curated databases, and only a fraction of the ones that are out there. The most common titles are therefore not a rock-solid indicator of a work's influence; they do, however, indicate what a combination of literary scholars, archivists, and bibliographers have considered important to (or, just as useful in our case, part of) the genre.

In [19]:
# count which texts have the most sources
origbib['numberofbibentries'] = origbib[listOfBibs].eq('x').sum(axis=1)
origbib['numberofbibentries'].value_counts()

1    4609
2     259
3      65
4      25
5       6
6       1
Name: numberofbibentries, dtype: int64

Of the 4965 titles in my bibliography, 4609 (92.8%) are listed in only one source. The number of titles shrinks as the number of cross-references grows, which makes sense considering that even if we count every title in sources other than Cox (1211), Cox still outweighs them with a total of 4247 titles.
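The cross-reference distribution also accounts for the inflated per-source total seen earlier. The following is a small consistency check using only the figures printed by the `value_counts()` cell above; the variable names are mine.

```python
# number of sources -> number of titles listed in that many sources
entries = {1: 4609, 2: 259, 3: 65, 4: 25, 5: 6, 6: 1}

# Each title counted once.
titles = sum(entries.values())

# Each title counted once per source that lists it.
listings = sum(n_sources * n_titles for n_sources, n_titles in entries.items())

# Extra listings created by cross-referenced titles.
inflation = listings - titles

print(titles, listings, inflation)
```

The weighted sum matches the per-source total of 5458, and the difference matches the inflation of 493 calculated earlier, so the two ways of counting agree.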

In [20]:
overallValueCountsDF.sort_values('OverallTitleCounts')

Unnamed: 0,OverallTitleCounts
NCCO_TravelNarr,10
Robinson_W,41
BDanth,54
Leask,61
Andrews,65
TEE,67
Gove,79
Irish_McVeagh,93
NCCO_c19Trav,116
Murray,118


In [21]:
overallValueCountsDF.sum()-overallValueCountsDF.drop('Cox').sum()

OverallTitleCounts    4247
dtype: int64

When we examine the results, we can see another effect of our diverse sources: of the 25 titles listed in four sources, the earliest is John Hawkesworth's *An Account of the Voyages undertaken...for making discoveries in the Southern Hemisphere*, published in 1773, and fewer than half of the titles were published before 1800. The mean publication date is 1804.68. *[~worth including decimals? or round up?]*

Most of our cross-references are concentrated in the latter parts of the time period, after sources such as the TEE, Leask, Murray, and BTW_W begin.

In [22]:
origbib[origbib['numberofbibentries'] == 4][['title', 'author', 'pubdate_short']+listOfBibs].sort_values('pubdate_short')


Unnamed: 0,title,author,pubdate_short,Robinson_W,Gove,BrynMawr,Leask,TEE,BDanth,Andrews,Cox,BTW_W,Murray,Irish_McVeagh,BTW_EuroTour,NCCO_c19Trav,NCCO_TravelNarr
2267,An Account of the Voyages undertaken by the or...,"Hawkesworth, John",1773,,,x,x,x,,,x,,,,,,
2370,journey to the highlands of scotland,"hanway, mary ann",1775,,,x,,,x,x,x,,,,,,
2376,journey to the western islands of scotland,"johnson, samuel",1775,,,x,,,x,x,x,,,,,,
3024,observations relative chiefly to picturesque b...,"gilpin, william",1786,,,x,,,,x,x,,,,,x,
3202,A Journey through the Crimea to Constantinople,"Craven, Lady Elizabeth",1789,x,,,,,x,,x,x,,,,,
3284,travels to discover the source of the nile,"bruce, james",1790,,,x,x,,x,,x,,,,,,
3461,travels through north & south carolina,"bartram, william",1792,,,,x,x,x,,x,,,,,,
3592,"Travels in India, during the years","Hodges, William",1793,,,x,x,x,,,x,,,,,,
3676,Two Voyages to Sierra Leone,"Falconbridge, Anna Maria",1794,x,,,,,x,,x,x,,,,,
3792,"A Voyage round the World, in the Gorgon Man of...","Parker, Mary Ann",1795,x,,,,x,,,x,x,,,,,


In [23]:
origbib[origbib['numberofbibentries'] == 4][['title', 'author', 'pubdate_short']+listOfBibs].sort_values('pubdate_short')['pubdate_short'].mean()


1804.68

Going from the 25 titles with four cross-references to the 6 titles with five, the average year of publication stays nearly the same (1804.68 versus 1804.33). For a moment, there is a rare gender parity in the corpus: Hester Lynch Piozzi's *Observations and reflections made...through France, Italy, and Germany* (1789), Mary Wollstonecraft's *Letters Written During a Short Residence in Sweden, Norway, and Denmark* (1796), and Anne Carter's *Letters from a Lady to Her Sister, During a Tour to Paris* (1814) are all bolstered by cross-references in women-focused bibliographies, including Robinson_W and BTW_W. Wollstonecraft and Piozzi are also well-known eighteenth-century figures involved in the world of letters, which perhaps strengthens their reputation. The titles by men include Mungo Park's famous *Travels in the Interior Districts of Africa* (1799), Edward Daniel Clarke's collection of *Travels in Various Countries of Europe, Asia, and Africa* (1810; NCCO lists a later, 1816 edition), and James Hingston Tuckey's *Narrative of an expedition to explore the river Zaire* (1818).

*[~extra comments could discuss here: Clarke's Travels is in both the NCCO collections, but it is a later edition. Also, attention to Africa evident in the men-authored texts.]*

In [24]:
origbib[origbib['numberofbibentries'] == 5][['title', 'author', 'pubdate_short']+listOfBibs].sort_values('pubdate_short')['pubdate_short'].mean()


1804.3333333333333

In [25]:
origbib[origbib['numberofbibentries'] == 5][['title', 'author', 'pubdate_short']+listOfBibs].sort_values('pubdate_short')


Unnamed: 0,title,author,pubdate_short,Robinson_W,Gove,BrynMawr,Leask,TEE,BDanth,Andrews,Cox,BTW_W,Murray,Irish_McVeagh,BTW_EuroTour,NCCO_c19Trav,NCCO_TravelNarr
3234,Observations and reflections made in the cours...,"Piozzi, Hester Lynch",1789,x,,x,,,x,,x,x,,,,,
3918,Letters Written during a Short Residence in Sw...,"Wollstonecraft, Mary",1796,x,,,x,,x,,x,x,,,,,
4139,travels in the interior districts of africa,"park, mungo",1799,,,x,x,x,x,,x,,,,,,
4440,"travels in various countries of europe, asia, ...","clarke, edward daniel",1810,,,x,x,,,,x,,,,,x,x
4476,Letters from a Lady to Her Sister during a Tou...,"Carter, Anne",1814,x,,x,,,,,,x,,,x,x,
4700,Narrative of an expedition to explore the rive...,"Tuckey, James Kingston",1818,,,x,,x,x,,,,x,,,x,


Only one title is listed in six of the sources. Ann Radcliffe's *A Journey Made in the Summer of 1794*, published in 1795, is listed in the women-focused sources of Robinson_W and BTW_W; the anthology BDanth; the picturesque-oriented Andrews; the all-consuming Cox; and the BrynMawr catalogue. Radcliffe is best known for her Gothic novels; her most famous, *The Mysteries of Udolpho*, was published the year before.

 The Mysteries of Udolpho (1794) for £500, while Cadell and Davies paid £800 for The Italian (1797),

In [26]:
origbib[origbib['numberofbibentries'] == 6][['title', 'author', 'pubdate_short']+listOfBibs]


Unnamed: 0,title,author,pubdate_short,Robinson_W,Gove,BrynMawr,Leask,TEE,BDanth,Andrews,Cox,BTW_W,Murray,Irish_McVeagh,BTW_EuroTour,NCCO_c19Trav,NCCO_TravelNarr
3799,A Journey Made in the Summer of 1794,"Radcliffe, Ann",1795,x,,x,,,x,x,x,x,,,,,


Why might this title be the most cross-referenced? The publication date of *A Journey Made in the Summer of 1794* is covered by the bulk of my sources, increasing the chance of its inclusion. Radcliffe's novels, such as *The Mysteries of Udolpho* (1794) and *The Italian* (1797), were extremely popular during her lifetime, and critics situate her as one of the most important influences on the Gothic novel. Indeed, Radcliffe's popularity comes at a confluence of social factors influencing expectations for women's participation in travel and publishing; as Elizabeth Bohls traces in her discussion of *The Mysteries of Udolpho*, "Both tourism and writing for publication took women into the public realm in potentially transgressive ways. Sensibility and taste helped legitimize a woman like Radcliffe in her pursuit of these dubious endeavors" (103). Four extracts from *A Journey* were published in the *Lady's Magazine* (three in 1795 and one in 1797). Radcliffe's publisher for *Udolpho* and *A Journey*, the radical George Robinson, was best known for his travel narratives, translations, and periodicals, "position[ing] Radcliffe and her work outside of the insular worlds of the British novel and the fashionable milieu of the circulating library, and makes her a significant part of early conversations about human rights and even what we might today call global citizenship" (287).

Radcliffe's oeuvre is significant in discussions of landscape aesthetics (note her presence in `Andrews`), the picturesque, and the Gothic, and accessible scholarly editions exist for most of her novels (Oxford has published editions of *Udolpho*, *A Sicilian Romance*, *The Romance of the Forest*, and *The Italian*, for example). Regrettably, there is still no easily accessible edition of *A Journey*.





## Titles That Appear in Only One Bibliography

In [27]:
uniqueTitlesDF = origbib[origbib['numberofbibentries'] == 1]

In [28]:
uniqueTitlesDF

Unnamed: 0,pubdate,pubdate_mod,title,author,author_2,short_title,orig_merged,vol,galenum,author_3,...,OLD_NCCO,origbib_short_title,mainbib_short_title,match_column,htrc_id,ecco_id,gale_id,file_match,pubdate_short,numberofbibentries
1,1790,1790,Travels from the Cape of Good Hope,"Levaillant, Francrois","Helme, Elizabeth (translator)",Travels from the Cape of Good Hope,Travels from the Cape of Good Hope + Levaillan...,,,,...,,,,,,,,,1790,1
2,1800,1800,Sketch of the Life and Literary Career of Augu...,"Kotzebue, August Friedrich Ferdinand von","Plumptre, Anne (translator)",Sketch of the Life and Literary Career,Sketch of the Life and Literary Career + Kotze...,,,,...,,,,,,,,,1800,1
3,1802,1802,Travels in the Crimea. A History of the Embass...,"Struve, Johann Christian von","Godwin, Mary Jane (translator)",Travels in the Crimea,"Travels in the Crimea + Struve, Johann Christi...",,,,...,,,,,,,,,1802,1
4,1812,1812,"Travels in Southern Africa, in the Years 1803,...","Lichtenstein, Martin Heinrich Carl","Plumptre, Anne (translator)","Travels in Southern Africa, in the Years 1803,...","Travels in Southern Africa, in the Years 1803,...",,,,...,,,,,,,,,1812,1
5,1813,1813,Voyages and Travels in Various Parts of the Wo...,"Langsdorff, George Heinrich von","Plumptre, Anne (translator)",Voyages and Travels in Various Parts of the World,Voyages and Travels in Various Parts of the Wo...,,,,...,,,,,,,,,1813,1
6,1813,1813,"Travels in the Morea, Albania, and Other Parts...","Pouqueville, François Charles Hugues Laurent","Plumptre, Anne (translator","Travels in the Morea, Albania, and Other Parts...",,,,,...,,,,,,,,,1813,1
7,1815,1815,Battle of Waterloo,"Eaton, Charlotte Anne",,Battle of Waterloo,"Battle of Waterloo + Eaton, Charlotte Anne",,,,...,,,,,,,,,1815,1
8,1821,1821,authentic narrative of the shipwreck and suffe...,"bradley, eliza",,authentic narrative of the shipwreck,authentic narrative of the shipwreck + bradley...,,,,...,,,,,,,,,1821,1
9,1826,1826,memoirs of the margravine,"Craven, Lady Elizabeth",,memoirs of the margravine,"memoirs of the margravine + Craven, Lady Eliza...",,,,...,,,,,,,,,1826,1
10,1829,1829,memoirs of lady fanshawe,"fanshawe, lady ann",,memoirs of lady fanshawe,"memoirs of lady fanshawe + fanshawe, lady ann",,,,...,,,,,,,,,1829,1


Using the same `value_counts` method as above, we can count how many unique titles each bibliography contributed.

In [29]:
uniqueTitleCountsList = []

for bib in listOfBibs:
    uniqueTitleCountsList.append(uniqueTitlesDF[bib].value_counts())

uniqueTitleCountsDF = pd.concat(uniqueTitleCountsList, axis=1).transpose()
uniqueTitleCountsDF.rename(columns={'x':'UniqueTitleCounts'}, inplace=True)
uniqueTitleCountsDF.sort_values(by=['UniqueTitleCounts'])

Unnamed: 0,UniqueTitleCounts
Robinson_W,5
BDanth,5
NCCO_TravelNarr,5
TEE,13
Andrews,13
Leask,24
BrynMawr,43
NCCO_c19Trav,43
Gove,51
Murray,63


For ease, I want to combine this with my other `df` - `overallValueCountsDF`

In [30]:
bibOrigTitleCountsDF = uniqueTitleCountsDF.join(overallValueCountsDF)

In [31]:
bibOrigTitleCountsDF

Unnamed: 0,UniqueTitleCounts,OverallTitleCounts
Robinson_W,5,41
Gove,51,79
BrynMawr,43,198
Leask,24,61
TEE,13,67
BDanth,5,54
Andrews,13,65
Cox,4026,4247
BTW_W,77,137
Murray,63,118


Which bibliography had the most unique titles compared to overall titles?

In [32]:
bibOrigTitleCountsDF['UniqueToOverallRatio'] = bibOrigTitleCountsDF['UniqueTitleCounts']/bibOrigTitleCountsDF['OverallTitleCounts']

In [33]:
bibOrigTitleCountsDF.sort_values('UniqueToOverallRatio')

Unnamed: 0,UniqueTitleCounts,OverallTitleCounts,UniqueToOverallRatio
BDanth,5,54,0.092593
Robinson_W,5,41,0.121951
TEE,13,67,0.19403
Andrews,13,65,0.2
BrynMawr,43,198,0.217172
NCCO_c19Trav,43,116,0.37069
Leask,24,61,0.393443
NCCO_TravelNarr,5,10,0.5
Murray,63,118,0.533898
BTW_W,77,137,0.562044


Initially, I found some of these rates to be surprising. Considering how anthologies are attuned to their respective canons, for a general teaching anthology such as BDanth to have roughly 9% of its titles (5 of 54) absent from all of my other sources implies that there is some funkiness between what the bibliographies and BDanth consider canonical.

However, these rates made me suspicious, and I thought that perhaps this is because of 



Interesting things to note about the chart: 

- The anthologies have fairly low unique values. This might be expected, as anthologies anthologize (lol) works that critics consider to have influenced, defined, or exemplified a particular genre. (a qt on the purpose of anthologies would be useful here.)
- Cox still dominates. His work was the most detailed and thorough, working with a broad definition of travel writing for his corpus.
- Some of these sources will naturally have more overlap than others. For example, some of the sources have a similar focus (such as women authors), and some sources relied on the others (BTW_W includes Leask, Cox, McVeagh, and Robinson_W, for example, or, more notably, BrynMawr serving as the foundation for NCCO_c19Trav).