# Study carrels, and SQL, and Python ("Oh my!")

The result of the Distant Reader process is the creation of a "study carrel" -- a structured data set with many components. One of those components is an SQLite database file. This notebook outlines many ways to extract information from the database and output the result in a number of different ways.


## Initialize

The first steps are to: 1) configure which database to read, 2) import SQLite functionality into the script, and 3) open ("initialize") a connection to the database.

In [1]:
# configure; define some constants
CARREL   = 'williamPenn-from-freebo'
TEMPLATE = './carrels/%s/etc/reader.db'


In [2]:
# require
import sqlite3


In [3]:
# initialize; connect to the database
db                     = TEMPLATE % CARREL
connection             = sqlite3.connect( db )
connection.row_factory = sqlite3.Row
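
One caveat about the cell above: `sqlite3.connect` does not check that the file is really a carrel database. If the file is missing but its directory exists, a new and empty database is silently created, and the mistake only surfaces later as a "no such table" error. The following is a minimal sketch of a guard; the function name `open_carrel` is hypothetical and not part of the Reader.

```python
import sqlite3
from pathlib import Path

def open_carrel(carrel, template='./carrels/%s/etc/reader.db'):
    """Return a connection to a carrel's database, or raise if the file is missing."""

    # build the path, and refuse to continue if there is no such file
    db = Path(template % carrel)
    if not db.is_file():
        raise FileNotFoundError("No database found at %s" % db)

    # connect, and make fields addressable by name
    connection = sqlite3.connect(str(db))
    connection.row_factory = sqlite3.Row
    return connection
```

With this in place, `connection = open_carrel( CARREL )` could stand in for the three lines of the initialization cell.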


##  Bibliographics

The database includes a table called "bib" for "bibliographics". This is the study carrel's central table, and it includes fields for things like identifier, author, title, date, and summary (a computed "abstract"). It also includes fields akin to extent, such as number of words, number of sentences, and readability score, as well as fields denoting the location of the cached original documents and their plain text transformations.
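
To see which fields a table actually offers, SQLite's `pragma table_info` can be queried like any other statement. The sketch below runs against a small in-memory stand-in; its bib table and column list are assumptions drawn from the description above, not the Reader's actual schema.

```python
import sqlite3

# build a tiny in-memory stand-in; the columns are assumed, not the real schema
demo = sqlite3.connect(':memory:')
demo.row_factory = sqlite3.Row
demo.execute("create table bib ( id text, author text, title text, date text, words int, flesch int )")

# list every table in the database
sql    = "select name from sqlite_master where type = 'table' order by name"
tables = [row['name'] for row in demo.execute(sql)]

# list the fields of the bib table
fields = [row['name'] for row in demo.execute("pragma table_info(bib)")]

# output
print(tables)
print(fields)
```

Against a real carrel, replace `demo` with the `connection` opened in the initialization cell.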

The following cells outline how to query the bib ("bibliographics") table.

In [4]:
# how many items are in the database; initialize a query
sql = "select count( id ) from bib"

# search; there is only one result, so only get a single item
result = connection.execute( sql ).fetchone()

# parse the result
count = result[ 0 ]

# output a formatted message
print( "There are %d documents in the database." % count )


There are 98 documents in the database.


In [5]:
# what is the average readability score of all documents; initialize and search
sql    = "select cast( avg( flesch ) as integer ) from bib"
result = connection.execute( sql ).fetchone()

# parse the result
score = result[ 0 ]

# output a formatted message
print( "The average readability score is %d." % score )
print( "Scores closer to 100 are easier to read. Scores closer to zero are more difficult." )


The average readability score is 91.
Scores closer to 100 are easier to read. Scores closer to zero are more difficult.


In [6]:
# create a rudimentary bibliography; initialize
header = ( 'id', 'author', 'title', 'date')
sql    = "select id, author, title, date from bib order by author;"

# search; find all rows
rows = connection.execute( sql )

# output the header
print( "\t".join( header ) )

# process each row; output a tab-delimited list, rendering missing values as empty strings
for row in rows : print( "\t".join( "" if field is None else str( field ) for field in row ) )
    

id	author	title	date
A44560	Penn, William, 1644-1718.	The spiritual bee, or, A miscellany of scriptural, historical, natural observations and occasional occurencyes applyed in divine meditations by an university pen	1662.
A54151	Penn, William, 1644-1718.	The guide mistaken, and temporizing rebuked, or, A brief reply to Jonathan Clapham''s book intituled, A guide to the true religion in which his religion is confuted, his hypocrisie is detected, his aspersions are reprehended, his contradictions are compared / by W.P., a friend to the true religion.	1668.
A54206	Penn, William, 1644-1718.	The sandy foundation shaken, or, Those so generally believed and applauded doctrines ... refuted from the authority of Scripture testimonies, and right reason / by W.P. ...	Printed in the Year, 1668.
A54235	Penn, William, 1644-1718.	Truth exalted, in a short, but sure testimony against all those religions, faiths, and vvorships that have been formed and followed in the darkness of apostacy ... by Willia
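
Joining fields by hand assumes that no value contains a tab. As a hedged alternative, the standard `csv` module quotes such values and writes missing values as empty fields. The sample rows below are invented for illustration.

```python
import csv
import io
import sqlite3

# invented sample data standing in for the bib table
demo = sqlite3.connect(':memory:')
demo.execute("create table bib ( id text, author text, title text, date text )")
demo.executemany("insert into bib values ( ?, ?, ?, ? )", [
    ('A1', 'Penn, William', 'A title', '1668'),
    ('A2', None, 'An anonymous\ttitle', None),
])

# initialize a tab-delimited writer; swap the buffer for sys.stdout or an open file as needed
header = ('id', 'author', 'title', 'date')
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')

# output the header and then every row
writer.writerow(header)
writer.writerows(demo.execute("select id, author, title, date from bib order by id"))
print(buffer.getvalue(), end='')
```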

In [7]:
# create a rudimentary bibliography with keywords and summary; configure and find all
sql  = '''select
            b.id,
            b.author,
            b.title,
            b.date,
            group_concat(w.keyword, '; ') as keywords,
            b.summary
          from
            bib as b,
            wrd as w
          where
            b.id = w.id
          group by
            b.id
          order by
            b.author'''
rows = connection.execute( sql )

# process each row
for row in rows : 
    
    # parse
    id, author, title, date, keywords, summary = row
    
    # output
    print( "          id: %s" % id )
    print( "      author: %s" % author )
    print( "       title: %s" % title )
    print( "        date: %s" % date )
    print( "  keyword(s): %s" % keywords )
    print( "     summary: %s" % summary )
    print()


          id: A23597
      author: Penn, William, 1644-1718.
       title: England''s great interest in the choice of this new Parliament dedicated to all her free-holders and electors.
        date: 1679]
  keyword(s): Choice; Government; Parliament; TCP
     summary: This keyboarded and encoded edition of the work described above is co-owned by the institutions providing financial support to the Early English Books Online Text Creation Partnership. England''s great interest in the choice of this new Parliament dedicated to all her free-holders and electors. England''s great interest in the choice of this new Parliament dedicated to all her free-holders and electors. EEBO-TCP is a partnership between the Universities of Michigan and Oxford and the publisher ProQuest to create accurately transcribed and encoded texts based on the image sets published by ProQuest via their Early English Books Online (EEBO) database (http://eebo.chadwyck.com). EEBO-TCP aimed to produce large quantities o
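
Because the query above uses an inner join, a document with zero keywords would be left out of the listing. If that matters, a left join keeps such documents. The sketch below shows the difference on invented sample data; only the bib and wrd table names and their id and keyword fields come from this notebook.

```python
import sqlite3

# invented sample data; the second document has no keywords
demo = sqlite3.connect(':memory:')
demo.row_factory = sqlite3.Row
demo.executescript('''
    create table bib ( id text, title text );
    create table wrd ( id text, keyword text );
    insert into bib values ( 'A1', 'First' ), ( 'A2', 'Second' );
    insert into wrd values ( 'A1', 'God' ), ( 'A1', 'Spirit' );
''')

# a left join keeps every row of bib, even when wrd has no match
sql = '''select
           b.id,
           b.title,
           group_concat( w.keyword, '; ' ) as keywords
         from
           bib as b
         left join
           wrd as w on b.id = w.id
         group by
           b.id
         order by
           b.id'''

# output; the keywords of a document without any are None
for row in demo.execute(sql):
    print("%s\t%s\t%s" % (row['id'], row['title'], row['keywords']))
```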

## Parts-of-speech, tokens, and words

The largest table in a study carrel -- by far -- is the pos ("parts-of-speech") table. This table contains each & every word from each & every document in a study carrel.

Each row in the pos table describes a "token", and a token may be a word, a number, a punctuation mark, or a combination of any of those things. Each token is associated with a bib ("document") id, a sentence id, a token id, the token, the token's lemma, and the token's part-of-speech label ("NN" for noun, "VB" for verb, "JJ" for adjective, etc.).

Given this data structure, it is possible to count & tabulate the frequency of words, word stems, the lemmas of words, and parts-of-speech. Given a word, word stem, lemma, or part-of-speech value, it is also possible to extract and rebuild all the sentences containing these values. So, for example, the student, researcher, or scholar can output all the sentences containing "Ahab" and/or "whale", and then they can do analysis against the result.

It is possible to apply combinations of SQL and grammars to the pos table, but such is discouraged. Instead the student, researcher, or scholar is encouraged to use alternative pattern-matching and/or machine learning techniques. Such techniques are implemented in the venerable Natural Language Toolkit (NLTK), spaCy, and textacy Python libraries.

The following cells describe a number of different -- and hopefully, interesting -- techniques for exploiting the pos table.
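
As a rough illustration of the counting and sentence rebuilding described above, the sketch below uses a tiny in-memory table. The column names (`id`, `sid`, `tid`, `token`, `lemma`, `pos`) are assumptions based on the description, and the two sentences are invented; check a real carrel's schema before reusing the SQL.

```python
import sqlite3

# build a stand-in pos table; column names and data are assumptions
demo = sqlite3.connect(':memory:')
demo.execute("create table pos ( id text, sid int, tid int, token text, lemma text, pos text )")
demo.executemany("insert into pos values ( ?, ?, ?, ?, ?, ? )", [
    ('moby', 1, 1, 'Ahab',   'ahab',  'NNP'),
    ('moby', 1, 2, 'chased', 'chase', 'VBD'),
    ('moby', 1, 3, 'whales', 'whale', 'NNS'),
    ('moby', 1, 4, '.',      '.',     '.'),
    ('moby', 2, 1, 'The',    'the',   'DT'),
    ('moby', 2, 2, 'whale',  'whale', 'NN'),
    ('moby', 2, 3, 'fled',   'flee',  'VBD'),
    ('moby', 2, 4, '.',      '.',     '.'),
])

# count & tabulate the lemmas of all tokens labeled as some sort of noun
sql = '''select lemma, count( lemma ) as total
         from pos
         where pos like 'N%'
         group by lemma
         order by total desc, lemma'''
frequencies = demo.execute(sql).fetchall()

# find each sentence containing a given lemma
sql       = "select distinct id, sid from pos where lemma = ? order by id, sid"
found     = demo.execute(sql, ('whale',)).fetchall()
sentences = []
for did, sid in found:

    # rebuild the sentence by joining its tokens in order; punctuation gets a leading space
    sql  = "select token from pos where id = ? and sid = ? order by tid"
    rows = demo.execute(sql, (did, sid))
    sentences.append(' '.join(row[0] for row in rows))

# output
print(frequencies)
print(sentences)
```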

## Keywords

Each document in the study carrel is associated with zero or more statistically computed keywords. These keywords are stored in a table called "wrd", and the table has only two fields: 1) id, and 2) keyword. The value of id is the value of a bib table id, and it is through this value that SQL joins can be established.

The cells below outline ways the wrd ("keywords") table can be used.


In [8]:
# how many keywords are in this carrel; initialize and get the result
sql     = "select count( keyword ) from wrd"
results = connection.execute( sql ).fetchone()

# parse
count = results[ 0 ]

# output a formatted message
print( "There are %d keywords in this carrel." % count )


There are 841 keywords in this carrel.


In [9]:
# how many distinct keywords exist in this carrel; initialize and search
sql     = "select count( distinct( lower( keyword ) ) ) from wrd"
results = connection.execute( sql ).fetchone()

# parse
count = results[ 0 ]

# output a formatted message
print( "There are %d distinct (read \"unique\") keywords in this carrel." % count )


There are 218 distinct (read "unique") keywords in this carrel.


In [10]:
# count and tabulate the keywords; so, what are the keywords and how often do they occur?

# configure and search
header = ( 'count', 'keyword' )
sql    = '''select
              lower(keyword),
              count(lower(keyword)) as count
            from
              wrd
            group by
              lower(keyword)
            order by
               count desc'''
rows   = connection.execute( sql )

# output a header
print( "\t".join( header ) )

# process each result
for row in rows :
    
    # parse and output as a tab-delimited list
    keyword, count = row
    print( "\t".join ( ( str( count ), keyword ) ) )


count	keyword
50	god
49	tcp
36	spirit
27	man
27	light
26	lord
26	church
25	people
24	christ
23	world
21	men
19	religion
18	truth
15	power
15	life
14	holy
13	law
12	government
10	laws
10	king
9	quakers
9	body
8	soul
8	scripture
8	father
8	conscience
7	scriptures
6	liberty
6	faith
5	william
5	rule
5	reason
5	meeting
5	gospel
5	english
5	book
4	word
4	son
4	parliament
4	kingdom
4	interest
4	doctrine
4	dissenters
4	court
4	christian
4	answer
4	adversary
3	way
3	thomas
3	penn
3	love
3	house
3	grace
3	faldo
3	eternal
3	civil
3	authority
3	apostle
2	year
2	verdict
2	thing
2	sufferings
2	state
2	saviour
2	river
2	province
2	protestants
2	prince
2	priest
2	popish
2	penal
2	papists
2	oaths
2	muggleton
2	ministry
2	knowledge
2	justice
2	jury
2	john
2	jesus
2	hat
2	guide
2	goods
2	glory
2	friends
2	england
2	duke
2	devil
2	death
2	day
2	country
2	city
2	christians
2	charter
2	books
2	blood
2	bishop
2	baptism
1	yea
1	writ
1	worship
1	work
1	women
1	warrant
1	w.p.
1	tryal
1	town
1	toleration
1	testi
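
When the full tabulation is longer than needed, a `limit` clause trims it to the most frequent keywords. The following is a minimal sketch on invented data, with the limit passed as a query parameter.

```python
import sqlite3

# invented sample data standing in for the wrd table
demo = sqlite3.connect(':memory:')
demo.execute("create table wrd ( id text, keyword text )")
demo.executemany("insert into wrd values ( ?, ? )", [
    ('A1', 'God'), ('A2', 'god'), ('A3', 'God'),
    ('A1', 'Spirit'), ('A2', 'spirit'),
    ('A3', 'Light'),
])

# configure; how many keywords to list
top = 2

# count & tabulate, but keep only the most frequent keywords
sql = '''select
           lower( keyword ) as word,
           count( * ) as total
         from
           wrd
         group by
           lower( keyword )
         order by
           total desc, word
         limit ?'''
rows = demo.execute(sql, (top,)).fetchall()

# output as a tab-delimited list
for word, total in rows:
    print("\t".join((str(total), word)))
```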

In [11]:
# list items with a given keyword

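# initialize; denote a keyword from the output of the previous cell; the value
# below matches nothing in this carrel, so expect no output until it is changed
keyword = 'trojans'

# build a query and execute it; sounds so brutal; the keyword is passed as a
# parameter (?) so quotes in the value cannot break the SQL
sql = '''select
           b.title
         from
           bib as b,
           wrd as w
         where
           lower( w.keyword ) = ?
           and
           b.id = w.id
         order by
           b.title'''
rows = connection.execute( sql, ( keyword, ) )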

# process each row; output a simple list
for row in rows :
    print( row[ 0 ] )
    print()
    

In [12]:
# find documents with more than one given keyword; perform a Boolean intersection

# configure with keywords from above, and remember, there may be zero documents in the result;
# the two values below do not co-occur in this carrel, so expect no output until they are changed
keyword01 = 'ulysses'
keyword02 = 'jove'

# initialize; the keywords are passed as parameters (?) rather than pasted into the SQL
sql = '''select
           b.title,
           group_concat( lower( w.keyword ), '; ' ) as keywords
         from
           bib as b,
           wrd as w,
           wrd as w1,
           wrd as w2
         where
           ( lower( w1.keyword ) = ? and b.id = w1.id )
           and
           ( lower( w2.keyword ) = ? and b.id = w2.id )
           and b.id = w.id
         group by
           b.id
         order by
           b.title'''

# search
rows = connection.execute( sql, ( keyword01, keyword02 ) )

# process each resulting row
for row in rows :
    
    # parse
    title    = row[ "title" ]
    keywords = row[ "keywords" ] 

    # output
    print( "     title: %s" % title )
    print( "  keywords: %s" % keywords )
    print()
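
The self-join above needs one extra alias of the wrd table per keyword. An alternative that scales to any number of keywords is to group by document and keep only the groups that match every keyword. The sketch below is one way to do that, not this notebook's method, and the data is invented. It assumes the given keywords are lower-cased and distinct.

```python
import sqlite3

# invented sample data
demo = sqlite3.connect(':memory:')
demo.row_factory = sqlite3.Row
demo.executescript('''
    create table bib ( id text, title text );
    create table wrd ( id text, keyword text );
    insert into bib values ( 'A1', 'First' ), ( 'A2', 'Second' ), ( 'A3', 'Third' );
    insert into wrd values
        ( 'A1', 'God' ), ( 'A1', 'Spirit' ), ( 'A1', 'Light' ),
        ( 'A2', 'God' ),
        ( 'A3', 'spirit' ), ( 'A3', 'god' );
''')

# configure; every one of these keywords must be present
keywords = ['god', 'spirit']

# initialize; only question marks are pasted into the SQL, never the keywords themselves
placeholders = ', '.join('?' for _ in keywords)
sql = '''select
           b.title
         from
           bib as b,
           wrd as w
         where
           b.id = w.id
           and lower( w.keyword ) in ( %s )
         group by
           b.id
         having
           count( distinct lower( w.keyword ) ) = ?
         order by
           b.title''' % placeholders

# search; the final parameter is the number of keywords each document must match
rows   = demo.execute(sql, keywords + [len(keywords)])
titles = [row['title'] for row in rows]

# output
print(titles)
```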
   

## URLs

Many documents include URLs, and the Reader does its best to identify those URLs and store them in a table called "url". The table includes three fields: 1) id, 2) url, and 3) domain. The value of id is a link back to the bib table. The value of url is the... URL. The domain value is the string after the initial "//" of a URL and before the first instance of "/".
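
The domain rule described above corresponds closely to what Python's `urllib.parse.urlsplit` reports as a URL's network location. The sketch below is an approximation for illustration, not the Reader's own extraction code; note that the network location also includes any port or user information. The function name `domain` is hypothetical.

```python
from urllib.parse import urlsplit

def domain(url):
    """Return the lower-cased network location of a URL."""
    return urlsplit(url).netloc.lower()

# output the domains of a couple of URLs
for url in ('http://www.tei-c.org', 'http://eebo.chadwyck.com/home'):
    print(domain(url))
```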

The cells below demonstrate some of the ways the url ("URLs") table can be used.

In [13]:
# how many URLs are in this carrel; initialize and search
sql = "select count( url ) from url"
results = connection.execute( sql ).fetchone()

count = results[ 0 ]

print( "There are %d URLs in this carrel." % count )

There are 194 URLs in this carrel.


In [14]:
# count & tabulate the URLs; initialize and search
header = ( 'count', 'url' )
sql    = "select url, count(url) as count from url group by url order by count desc"
rows   = connection.execute( sql )

# output a header
print( "\t".join( header ) )

# process each row
for row in rows :
    
    # parse and output
    url, count = row
    print( "\t".join( ( str( count), url ) ) )
    

count	url
97	http://www.tei-c.org
97	http://eebo.chadwyck.com


In [15]:
# how many unique domains are represented by the URLs; do the work
sql     = "select count( distinct( lower( domain ) ) ) from url"
results = connection.execute( sql ).fetchone()
print( "There are %d unique domains represented by the URLs in this carrel." % results[ 0 ] )


There are 2 unique domains represented by the URLs in this carrel.


In [16]:
# count & tabulate the domains; what domains are oft-mentioned

# configure and search
header = ( 'count', 'domain' )
sql    = '''select
              lower( domain ),
              count( lower( domain ) ) as count
            from
              url
            group by
              lower( domain )
            order by
              count desc'''
rows   = connection.execute( sql )

# output a header, and process each row
print( "\t".join( header ) )
for row in rows :
    
    # parse, and output some more
    domain, count = row
    print( "\t".join( ( str( count), domain )))


count	domain
97	www.tei-c.org
97	eebo.chadwyck.com


## Next steps

As a next step, return to the top of this notebook and change the value of "CARREL" to the name of a different study carrel found in this notebook's ./carrels directory, such as a carrel built from the works of Homer, Melville's Moby Dick, or Shakespeare's sonnets. Once you have changed the value, restart the notebook, and walk through it again. By doing so the concepts outlined here will be reinforced.
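
To see which carrels are available to choose from, the directory layout implied by the TEMPLATE constant can be scanned. This sketch assumes each carrel's database lives at ./carrels/NAME/etc/reader.db, as in the configuration cell; the function name `list_carrels` is hypothetical.

```python
from pathlib import Path

def list_carrels(root='./carrels'):
    """Return the sorted names of the directories under root containing etc/reader.db."""
    return sorted(db.parent.parent.name for db in Path(root).glob('*/etc/reader.db'))

# output the name of each available carrel, one per line
for carrel in list_carrels():
    print(carrel)
```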