# Introduction

## Current Projects and Ongoing Work:

* Rails: Upgrading RIAMCO ([Rhode Island Archival and Manuscript Collections Online](https://www.riamco.org/)) to a newer version of Rails. The site was previously a series of PHP scripts and XSLT templates.
* Python for Natural Language Processing: A current project with the Center for Digital Scholarship to analyze Twitter data to measure attitudes toward racial disparities in maternal health in the context of COVID-19.
* PHP: Brown's Online Course Reserves Application.

## Other Technical Skills

* Python
    * Web programming (Django, Bottle, Wagtail)
    * Visualization (matplotlib, Seaborn)
* Jupyter - sketching, presentations, notetaking, discussion.
* Databases (SQL, Solr)
* JavaScript (D3, jQuery)
* Git (collaboration, deployment)

In [46]:
#I need a few imports.
from collections import Counter, namedtuple
from difflib import HtmlDiff
import io
from math import ceil
from operator import itemgetter
import re
from tempfile import NamedTemporaryFile
from time import sleep
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

import nltk
from nltk.corpus import brown, stopwords
from nltk.stem.snowball import SnowballStemmer
from PIL import Image
from pydash import py_
import pytesseract
import spacy
from rake_nltk import Rake
from requests import get, post
from textblob import TextBlob
import textract

import bingsc_settings as sc

from IPython import display

# Querying _Chronicling America_

This is a simple REST API. I just use the Python `requests` package to query it.

In [2]:
#URL template for searching Chronicling America.
searchtemplate = '''https://chroniclingamerica.loc.gov/search/pages/results/\
?state=&x=0&y=0&dateFilterType=yearRange&rows={count}&searchType=basic\
&format=json&language=eng\
&date1={startDate}&date2={endDate}&proxtext={searchTerms}&page={page}'''
#Raw string, so the backslash escapes reach the regex engine unchanged.
searchopts = re.findall(r'\{(.+?)\}', searchtemplate)
searchopts

srch = {
    'searchTerms': 'mackenzie king', 
    'startDate': 1941, 
    'endDate': 1942,
    'count': 10,
}
searchterms = {x: srch.get(x, '') for x in searchopts}
searchterms

papersearch = get(searchtemplate.format(**searchterms)).json()
len(papersearch)

5
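The template approach above leaves URL escaping to `requests`. As a minimal sketch of an alternative, the query string can be built with `urlencode`, which escapes the search terms explicitly. The helper name `build_search_url` is hypothetical, and the sketch assumes the empty `state`, `x` and `y` parameters from the template can be left out.

```python
from urllib.parse import urlencode

# Base URL taken from the search template above.
SEARCH_URL = 'https://chroniclingamerica.loc.gov/search/pages/results/'


def build_search_url(terms, start_year, end_year, rows=10, page=1):
    """Return a Chronicling America page-search URL (hypothetical helper)."""
    params = {
        'dateFilterType': 'yearRange',
        'date1': start_year,
        'date2': end_year,
        'proxtext': terms,
        'rows': rows,
        'page': page,
        'searchType': 'basic',
        'language': 'eng',
        'format': 'json',
    }
    # urlencode escapes spaces and punctuation in the search terms.
    return SEARCH_URL + '?' + urlencode(params)


url = build_search_url('mackenzie king', 1941, 1942)
print(url)
```

Passing `page=2`, `page=3` and so on would request later pages of the same search.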

I'm looking at [page 30 of the October 26, 1941 _Sunday Star_](https://chroniclingamerica.loc.gov/lccn/sn83045462/1941-10-26/ed-1/seq-30/#date1=1941&index=7&rows=100&words=King+Mackenzie&searchType=basic&sequence=0&state=&date2=1945&proxtext=mackenzie+king&y=0&x=0&dateFilterType=yearRange&page=1).

In [33]:
i = papersearch['items'][1]
i

{'sequence': 30,
 'county': [None],
 'edition': None,
 'frequency': 'Daily',
 'id': '/lccn/sn83045462/1941-10-26/ed-1/seq-30/',
 'subject': ['Washington (D.C.)--fast--(OCoLC)fst01204505',
  'Washington (D.C.)--Newspapers.'],
 'city': ['Washington'],
 'date': '19411026',
 'title': 'Evening star. [volume]',
 'end_year': 1972,
 'note': ['"From April 25 through May 24, 1861 one sheet issues were published intermittently owing to scarcity of paper." Cf. Library of Congress, Photoduplication Service.',
  'Also issued on microfilm from Microfilming Corp. of America and the Library of Congress, Photoduplication Service.',
  'Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.',
  'Publisher varies: Noyes, Baker & Co., <1867>; Evening Star Newspaper Co., <1868->',
  "Suspended Jan. 1-6, 1971 because of a machinists' strike."],
 'state': ['District of Columbia'],
 'section_label': '',
 'type': 'page',
 'place_of_publication': 

In [4]:
text = i['ocr_eng']

# Bing Spell Check

I'm using a spell checker to try to correct OCR errors. My first try was with the [`PySpellchecker` module](https://pyspellchecker.readthedocs.io/en/latest/index.html), but this focuses on a single word at a time and wasn't particularly useful here. [Microsoft's Spell Check API](https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/) looks at context and does a better job.

Most of the complexity in my `bing_spellcheck` function involves keeping queries within Bing's 10,000-character per-request limit. I divide the text into sentences using `NLTK` and send groups of sentences to Bing. I also remove characters other than letters, numbers and some punctuation, since some characters seem to confuse the API.
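The grouping step can also be sketched on its own, separate from the API call. This version packs sentences greedily instead of estimating a group size up front, so it is an alternative to what `bing_spellcheck` does rather than a copy of it. The name `chunk_sentences` is hypothetical, and the input is assumed to be a list of sentences already produced by a tokenizer.

```python
def chunk_sentences(sentences, max_length=10000):
    """Group sentences into chunks of at most max_length characters."""
    chunks = []
    current = ''
    for sentence in sentences:
        candidate = sentence if not current else current + ' ' + sentence
        if len(candidate) <= max_length or not current:
            # Either the sentence fits, or it is too long on its own and
            # has to go out as an oversized chunk of one.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_sentences(['aaaa', 'bbbb', 'cc'], max_length=9))
```

A single sentence longer than the limit still comes back as its own chunk, so a caller would need to decide whether to truncate or skip it.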

In [5]:
scparams = {
    'mkt':'en-us',
    'mode':'proof'
}
scheaders = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': sc.KEY1,
    #'X-Search-Location': 'lat:41.823611;long:-71.422222;re:2000'
}

SC_MAX_LENGTH = 10000


def bing_spellcheck(text):
    """
    Send `text` to the Bing Spell Check API, breaking it into
    chunks if necessary to stay below the maximum input length of
    10,000 characters.
    
    Returns corrected text and the raw responses from Bing.
    """
    #Many (most? all?) of these documents seem to contain
    #some characters that break the spell checker.
    #This regex replaces anything but letters, numbers,
    #and common punctuation with spaces.
    text = re.sub(r'[^a-zA-Z0-9\.\?!,"\';:\- ]+', ' ', text)
    sentences = nltk.sent_tokenize(text)
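    #Estimated number of sentences per request.
    #max(1, ...) avoids dividing by zero when the text is empty.
    bit_length = ceil(len(sentences) / max(1, ceil(len(text) / SC_MAX_LENGTH)))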
    sctext = ''
    scoutp = []
    
    while len(sentences) > 0:
        bit = False
        mult = 1.1
        diff = 0

        #Find a small enough group of sentences to send to the API,
        #shrinking the group by a tenth of bit_length on each pass.
        while (bit is False or len(bit) > SC_MAX_LENGTH) and mult > .1:
            mult -= .1
            scount = ceil(bit_length*mult)
            bit = ' '.join(sentences[:scount])
        scbit = bit
        del(sentences[:scount])

        data = {
            'text': bit
        }
        rsp = post(sc.ENDPOINT, headers=scheaders, params=scparams, data=data)
        scdata = rsp.json()

        #Go through the suggested replacements and create a fixed copy of the text.
        #`diff` tracks how far earlier replacements have shifted later offsets.
        for repl in scdata.get('flaggedTokens', []):
            start = repl['offset'] - diff
            token = repl['token']
            end = start + len(token)

            sug = repl['suggestions'][0].get('suggestion', token)

            #print('token:', token, 'suggestion:', sug)
            #print(scbit[start-100:start], '----', scbit[start:end], '----', scbit[end:end+100])
            scbit = scbit[:start] + sug + scbit[end:]
            diff += len(token) - len(sug)

        sctext += scbit
        scoutp.append(scdata)
            
    return (sctext, scoutp)

sctext, _ = bing_spellcheck(text)
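The replacement step is easier to follow in isolation. The sketch below applies corrections from the end of the string backwards, so earlier offsets stay valid without a running `diff`. The function name `apply_corrections` is hypothetical, and the sample data is hand-made in the shape the code above reads (`offset`, `token`, `suggestions`); it is not real API output.

```python
def apply_corrections(text, flagged_tokens):
    """Return text with each flagged token replaced by its first suggestion."""
    # Work from the end of the string so earlier offsets stay valid.
    for flagged in sorted(flagged_tokens, key=lambda f: f['offset'], reverse=True):
        suggestions = flagged.get('suggestions') or []
        if not suggestions:
            # Nothing to substitute; leave the token as it is.
            continue
        start = flagged['offset']
        end = start + len(flagged['token'])
        text = text[:start] + suggestions[0]['suggestion'] + text[end:]
    return text


# Hand-made sample, not real API output.
sample = [
    {'offset': 0, 'token': 'Washingon', 'suggestions': [{'suggestion': 'Washington'}]},
    {'offset': 10, 'token': 'Sumiay', 'suggestions': [{'suggestion': 'Sunday'}]},
]
fixed = apply_corrections('Washingon Sumiay Star', sample)
print(fixed)
```

Skipping tokens with an empty `suggestions` list also avoids the `IndexError` that indexing `[0]` directly would raise.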

Using Python's `difflib`, we can review the changes the spell checker made.

In [6]:
hd = HtmlDiff()
tbl = hd.make_table(nltk.word_tokenize(text), nltk.word_tokenize(sctext), 
                    fromdesc='text', todesc='sctext', context=True, numlines=2)
display.HTML(tbl)

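|   | # | text |   | # | sctext |
|---|---|------|---|---|--------|
| n | 1 | “ | n |  |  |
|  | 2 | ss |  | 1 | ss |
|  | 3 | ; |  | 2 | ; |
|  | 7 | pumky |  | 6 | pumky |
|  | 8 | Pte |  | 7 | Pte |
| n | 9 | ■ | n |  |  |
|  | 10 | ' |  | 8 | ' |
| n | 11 | ° | n |  |  |
|  | 12 | “ |  |  |  |
|  | 13 | N |  | 9 | N |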
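A side-by-side table is good for spot checks, but a count of changed tokens gives a quicker overall picture. This is a minimal sketch using `difflib.SequenceMatcher`; the function name `summarize_changes` and the two short token lists are made up for illustration.

```python
from difflib import SequenceMatcher


def summarize_changes(before_tokens, after_tokens):
    """Count tokens replaced, deleted and inserted between two token lists."""
    counts = {'replace': 0, 'delete': 0, 'insert': 0}
    matcher = SequenceMatcher(a=before_tokens, b=after_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        # Count the longer side so a 2-for-1 replacement counts as 2.
        counts[tag] += max(i2 - i1, j2 - j1)
    return counts


# Toy token lists, made up for illustration.
before = ['The', 'Sumiay', 'Star', '■', 'Civics']
after = ['The', 'Sunday', 'Star', 'Civics']
changes = summarize_changes(before, after)
print(changes)
```

On the real data, the arguments would be the same two `nltk.word_tokenize` results passed to `make_table`.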


# Tesseract OCR

I experimented with re-OCR'ing the scanned images. The process is slow, and I wasn't impressed with the results.

In [7]:
pagedata = get(i['url']).json()
pageimg = get(pagedata['jp2']).content

In [8]:
imgdata = io.BytesIO()
imgdata.write(pageimg)
imgdata.seek(0)

img = Image.open(imgdata)
img.format = 'TIFF'

In [9]:
tesscfg = r'--oem 3 --psm 3'
newtext = pytesseract.image_to_string(img, config=tesscfg)

In [10]:
print(newtext)

Editorial Page
Features

.

The Sumiay Star

Civics

Organization News

 

TEN PAGES.

 

War With U. S. Seen Inevitable

If Tokio Plays for Big Stakes

Little Opposition Likely in Congress to Settling

Scores With Japan if Nipponese Push — -
Plan to Attack Siberia

By Constantine Brown.

Administrative quarters in Washing-
ton and most members of Congress—
including some of the inveterate isola-
tionists—are reported
Secretary Knox’s belief’'that war between
the United States and Japan is almost
inevitable.

If the Moscow front collapses, the per-
sonal position of Joseph Stalin may be
in jeopardy. Dictators cannot survive

defeats.
U. S. S. R. is open to unlimited ccnjec-
ture.

The Japanese have now assumed the
role that the Italians pleyed in 1949
after the fall of France. With the de-
feat of the principal Soviet armies gen-
erally conceded, Tokio is reported pian-
ning to strike at Siberia. The Germans
are encouraging them in this action.

Nazi agents and diplomats have been
extr

In [11]:
scnewtext, _ = bing_spellcheck(newtext)

In [12]:
hd = HtmlDiff()
tbl = hd.make_table(nltk.word_tokenize(sctext), nltk.word_tokenize(scnewtext), 
                    fromdesc='sctext', todesc='scnewtext', context=True, numlines=0)
display.HTML(tbl)

|   | # | sctext |   | # | scnewtext |
|---|---|--------|---|---|-----------|
| n | 1 | ss | n | 1 | Editorial |
|  | 2 | ; |  | 2 | Page |
|  | 3 | - |  | 3 | Features |
|  | 4 | - |  | 4 | . |
|  | 5 | pje |  | 5 | The |
|  | 6 | pumky |  | 6 | Summary |
|  | 7 | Pte |  | 7 | Star |
|  | 8 | ' |  | 8 | Civics |
|  | 9 | N |  | 9 | Organization |
|  | 10 | , |  | 10 | News |


# NLTK

Here I compare word frequencies in our search result with the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus) to identify words that are more common in the search result pages than in American English writing in general.

This is more useful as an illustration of the technique than as a source of reliable results, owing to some limitations of the Brown Corpus:
* It's relatively small (about 500 writing samples, ~1,000,000 words).
* The text dates from 1961. Results based on it probably won't be relevant for the older texts in the _Chronicling America_ collection.
* _Brown_ is intentionally diverse, containing a variety of writing styles. Some of the "uncommon words" here might just be "newspaper speak" rather than unusual words.



In [38]:
#Get a list of common words that add no real information.
sws = stopwords.words('english')

In [35]:
#Get word frequencies from the Brown Corpus.
brownfreqs = Counter(word.lower() for word in brown.words())
brownlen = len(brown.words())
brownprobs = {k: v/brownlen for k, v in brownfreqs.items()}

In [39]:
#Get word frequencies on our page, minus stopwords.
wordfreqs = Counter(word.lower() for word in nltk.word_tokenize(sctext) 
                            if not (word in sws or len(word) < 4))
textlen = len(nltk.word_tokenize(sctext))
wordprobs = {k: v/textlen for k, v in wordfreqs.items()}

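#Get differences in word frequencies in our page vs. Brown.
#Words missing from Brown default to a probability of 1, which gives them
#a negative score and pushes them to the bottom of the ranking.
upwords = {k: v - brownprobs.get(k, 1) for k, v in wordprobs.items()}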

In [40]:
sorted(upwords.items(), key=itemgetter(1), reverse=True)[:30]

[('hitler', 0.005094311778419546),
 ('canada', 0.0030972624451515904),
 ('japanese', 0.0025872353115259533),
 ('german', 0.0025596774227358136),
 ('canadian', 0.0024622948976096154),
 ('would', 0.002434838049499022),
 ('germany', 0.0020694575216780894),
 ('united', 0.0018886776068288143),
 ('siberia', 0.0018049365654256517),
 ('japan', 0.0017773786766355118),
 ('russia', 0.0017480984197959883),
 ('states', 0.0016199194607889342),
 ('program', 0.0014707971638452067),
 ('people', 0.0014097905587641183),
 ('king', 0.0014052097172965908),
 ('masses', 0.001298354167898782),
 ('government', 0.0012855738682004381),
 ('political', 0.001258808433098973),
 ('price', 0.0012234311577505894),
 ('order', 0.0011571887181853326),
 ('world', 0.0011323518421413024),
 ('prices', 0.0010993519278589435),
 ('support', 0.000996871028920611),
 ('control', 0.0009598401158588607),
 ('criticism', 0.0009528819133253088),
 ('wages', 0.000951159545275925),
 ('wage', 0.0009391029689302388),
 ('words', 0.000915919730
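Subtracting probabilities favors words that are frequent in both samples, which may be why `would` and `people` appear in the list. One common alternative, sketched here with made-up token lists, is to rank by the log of the ratio of the two probabilities with add-one smoothing. The function name `log_ratio_scores` is hypothetical.

```python
from collections import Counter
from math import log


def log_ratio_scores(page_tokens, reference_tokens):
    """Score each page word by log(P(word | page) / P(word | reference))."""
    page = Counter(page_tokens)
    ref = Counter(reference_tokens)
    vocab = set(page) | set(ref)
    # Add-one smoothing: every vocabulary word gets one extra count in
    # both samples, so unseen words never produce a zero probability.
    page_total = sum(page.values()) + len(vocab)
    ref_total = sum(ref.values()) + len(vocab)
    return {
        word: log(((page[word] + 1) / page_total) / ((ref[word] + 1) / ref_total))
        for word in page
    }


# Toy token lists, made up for illustration.
page_tokens = ['hitler', 'war', 'war', 'price']
reference_tokens = ['war', 'price', 'price', 'house', 'house', 'house']
scores = log_ratio_scores(page_tokens, reference_tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Unlike the cell above, this scoring ranks words that are absent from the reference corpus highest, so OCR debris would have to be filtered out first.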

# Entity Recognition

I tried two methods to find important terms in the text.

## TextBlob

For my first attempt I used [TextBlob](https://textblob.readthedocs.io/en/dev/)'s part-of-speech tagging and looked for groups of proper nouns. 

In [17]:
textb = TextBlob(sctext)

In [41]:
textb.tags[:100]

[('ss', 'NN'),
 ('pje', 'NN'),
 ('pumky', 'NN'),
 ('Pte', 'NNP'),
 ("'", 'POS'),
 ('N', 'NNP'),
 ('ws', 'NN'),
 ('b', 'NN'),
 ('TEX', 'NNP'),
 ('PAGES', 'NNP'),
 ('WASHINGTON', 'NNP'),
 ('D.', 'NNP'),
 ('C.', 'NNP'),
 ('OCTOBER', 'NNP'),
 ('26', 'CD'),
 ('1941', 'CD'),
 ('-r', 'NN'),
 ('n', 'NN'),
 ('War', 'NNP'),
 ('With', 'IN'),
 ('U.', 'NNP'),
 ('S.', 'NNP'),
 ('Seen', 'NNP'),
 ('Inevitable', 'NNP'),
 ('If', 'IN'),
 ('Tokio', 'NNP'),
 ('Plays', 'NNP'),
 ('for', 'IN'),
 ('Big', 'NNP'),
 ('Stakes', 'NNP'),
 ('Little', 'NNP'),
 ('Opposition', 'NNP'),
 ('Likely', 'NNP'),
 ('in', 'IN'),
 ('Congress', 'NNP'),
 ('to', 'TO'),
 ('Settling', 'VBG'),
 ('Scores', 'NNS'),
 ('With', 'IN'),
 ('Japan', 'NNP'),
 ('if', 'IN'),
 ('Nipponese', 'JJ'),
 ('Push', 'NNP'),
 ('Plan', 'NN'),
 ('to', 'TO'),
 ('Attack', 'VB'),
 ('Siberia', 'NNP'),
 ('By', 'IN'),
 ('Constantine', 'NNP'),
 ('Brown', 'NNP'),
 ('Administrative', 'JJ'),
 ('quarters', 'NNS'),
 ('in', 'IN'),
 ('Washington', 'NNP'),
 ('and', 'CC'),
 ('

In [43]:
#Get a list of just proper nouns.
[(idx, word) for idx, (word, pos) in enumerate(textb.tags) if pos in ('NNP', 'NNPS')][:100]

[(3, 'Pte'),
 (5, 'N'),
 (8, 'TEX'),
 (9, 'PAGES'),
 (10, 'WASHINGTON'),
 (11, 'D.'),
 (12, 'C.'),
 (13, 'OCTOBER'),
 (18, 'War'),
 (20, 'U.'),
 (21, 'S.'),
 (22, 'Seen'),
 (23, 'Inevitable'),
 (25, 'Tokio'),
 (26, 'Plays'),
 (28, 'Big'),
 (29, 'Stakes'),
 (30, 'Little'),
 (31, 'Opposition'),
 (32, 'Likely'),
 (34, 'Congress'),
 (39, 'Japan'),
 (42, 'Push'),
 (46, 'Siberia'),
 (48, 'Constantine'),
 (49, 'Brown'),
 (53, 'Washington'),
 (58, 'Congress'),
 (69, 'Navy'),
 (70, 'Secretary'),
 (71, 'Knox'),
 (78, 'United'),
 (79, 'States'),
 (81, 'Japan'),
 (87, 'Moscow'),
 (94, 'Joseph'),
 (95, 'Stalin'),
 (110, 'U.'),
 (111, 'S.'),
 (112, 'R.'),
 (119, 'Japanese'),
 (127, 'Italians'),
 (135, 'France'),
 (146, 'Tokyo'),
 (153, 'Siberia'),
 (155, 'Germans'),
 (171, 'Japan'),
 (185, 'Pacific'),
 (197, 'Russia'),
 (203, 'Japanese'),
 (218, 'Nazis'),
 (224, 'Moscow'),
 (234, 'Hence'),
 (235, 'Japan'),
 (243, 'United'),
 (244, 'States'),
 (252, 'Japan'),
 (260, 'Reich'),
 (262, 'Tokyo'),
 (265, 

In [20]:
nnp = []

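etags = enumerate(textb.tags)
for idx, (word, tag) in etags:
    if tag in ('NNP', 'NNPS'):
        newnnp = [idx, [word]]
        #Keep consuming tokens until the run of proper nouns ends.
        #The default value stops cleanly if the text ends inside a run,
        #instead of raising StopIteration.
        _, (word, tag) = next(etags, (None, (None, None)))
        while tag in ('NNP', 'NNPS'):
            newnnp[1].append(word)
            _, (word, tag) = next(etags, (None, (None, None)))
        nnp.append(newnnp)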

len(nnp)

387

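Taking groups of consecutive proper nouns gives us the list below. Because headline words are capitalized, headlines tend to merge into long runs such as `Big Stakes Little Opposition Likely`.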

In [44]:
nnp[:100]

[[3, ['Pte']],
 [5, ['N']],
 [8, ['TEX', 'PAGES', 'WASHINGTON', 'D.', 'C.', 'OCTOBER']],
 [18, ['War']],
 [20, ['U.', 'S.', 'Seen', 'Inevitable']],
 [25, ['Tokio', 'Plays']],
 [28, ['Big', 'Stakes', 'Little', 'Opposition', 'Likely']],
 [34, ['Congress']],
 [39, ['Japan']],
 [42, ['Push']],
 [46, ['Siberia']],
 [48, ['Constantine', 'Brown']],
 [53, ['Washington']],
 [58, ['Congress']],
 [69, ['Navy', 'Secretary', 'Knox']],
 [78, ['United', 'States']],
 [81, ['Japan']],
 [87, ['Moscow']],
 [94, ['Joseph', 'Stalin']],
 [110, ['U.', 'S.', 'R.']],
 [119, ['Japanese']],
 [127, ['Italians']],
 [135, ['France']],
 [146, ['Tokyo']],
 [153, ['Siberia']],
 [155, ['Germans']],
 [171, ['Japan']],
 [185, ['Pacific']],
 [197, ['Russia']],
 [203, ['Japanese']],
 [218, ['Nazis']],
 [224, ['Moscow']],
 [234, ['Hence', 'Japan']],
 [243, ['United', 'States']],
 [252, ['Japan']],
 [260, ['Reich']],
 [262, ['Tokyo']],
 [265, ['Wait']],
 [267, ['See']],
 [277, ['Pacific']],
 [280, ['Japan']],
 [285, ['Axis']
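The same grouping can be expressed with `itertools.groupby`, which avoids advancing the iterator by hand. This is a sketch with a hand-made tag list; `proper_noun_groups` is a hypothetical name, and the input is assumed to be a list of `(word, tag)` tuples like `textb.tags`.

```python
from itertools import groupby

PROPER_NOUN_TAGS = ('NNP', 'NNPS')


def proper_noun_groups(tags):
    """Return [start_index, [words]] for each run of consecutive proper nouns."""
    groups = []
    indexed = enumerate(tags)
    # Each item is (index, (word, tag)); group on whether the tag is a proper noun.
    for is_proper, run in groupby(indexed, key=lambda item: item[1][1] in PROPER_NOUN_TAGS):
        if is_proper:
            run = list(run)
            groups.append([run[0][0], [word for _idx, (word, _tag) in run]])
    return groups


# Hand-made sample in the same shape as textb.tags.
sample_tags = [('By', 'IN'), ('Constantine', 'NNP'), ('Brown', 'NNP'),
               ('in', 'IN'), ('Washington', 'NNP')]
groups = proper_noun_groups(sample_tags)
print(groups)
```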

## SpaCy

[SpaCy](https://spacy.io/) is another natural language processing package. Here I try using its default entity recognition system on the text.

In [45]:
nlp = spacy.load("en_core_web_sm")

This gives us a reasonable-looking list of entities, tagged by what kind of thing each one is (_Person_, _Geo-political entity_, etc.).

In [23]:
scydoc = nlp(sctext)

[(ent.text, ent.start_char, ent.end_char, ent.label_) 
     for ent in scydoc.ents]

[('ss;-', 1, 5, 'ORG'),
 ('TEX', 35, 38, 'ORG'),
 ('WASHINGTON', 46, 56, 'GPE'),
 ('D. C., OCTOBER 26, 1941', 58, 81, 'PERSON'),
 ('War With U. S.', 104, 118, 'WORK_OF_ART'),
 ('Nipponese', 235, 244, 'NORP'),
 ('Attack Siberia', 258, 272, 'PERSON'),
 ('Constantine Brown', 276, 293, 'PERSON'),
 ('Administrative quarters', 295, 318, 'DATE'),
 ('Washington', 322, 332, 'GPE'),
 ('Congress', 353, 361, 'ORG'),
 ('Navy', 431, 435, 'ORG'),
 ('Knox s', 446, 452, 'PERSON'),
 ('the United States', 477, 494, 'GPE'),
 ('Japan', 499, 504, 'GPE'),
 ('Inevitable', 515, 525, 'ORG'),
 ('Moscow', 534, 540, 'GPE'),
 ('Joseph Stalin', 583, 596, 'PERSON'),
 ('the U. S.  R.', 678, 691, 'GPE'),
 ('Japanese', 729, 737, 'NORP'),
 ('Italians', 773, 781, 'NORP'),
 ('1910', 792, 796, 'DATE'),
 ('France', 815, 821, 'GPE'),
 ('Soviet', 856, 862, 'NORP'),
 ('Tokyo', 890, 895, 'GPE'),
 ('Siberia', 930, 937, 'LOC'),
 ('Germans', 943, 950, 'NORP'),
 ('Nazi', 988, 992, 'NORP'),
 ('Japan', 1044, 1049, 'GPE'),
 ('American-

In [24]:
entities = set((ent.text, ent.start_char, ent.label_) 
     for ent in scydoc.ents
     if ent.label_ not in ('DATE', 'MONEY', 'CARDINAL', 'ORDINAL')
           and len(ent.text) > 3
    )
print(len(entities))
entities

376


{('"Mein Kampf  even then was available in English.', 23920, 'WORK_OF_ART'),
 ('"My New Order', 8675, 'WORK_OF_ART'),
 ('A Gangster Plus', 23104, 'WORK_OF_ART'),
 ('Adolf Hitler', 12261, 'PERSON'),
 ('America', 4469, 'GPE'),
 ('America', 7153, 'GPE'),
 ('America', 12840, 'GPE'),
 ('America', 23337, 'GPE'),
 ('America', 24005, 'GPE'),
 ('American', 1297, 'NORP'),
 ('American', 7196, 'NORP'),
 ('American', 12857, 'NORP'),
 ('American', 29455, 'NORP'),
 ('American', 30047, 'NORP'),
 ('American-Japanese', 1078, 'NORP'),
 ('Americans', 9790, 'NORP'),
 ('Aryan', 6853, 'PERSON'),
 ('Asia', 6425, 'LOC'),
 ('Asiatic', 3626, 'LOC'),
 ('Asiatic', 4868, 'GPE'),
 ('Asiatic', 7318, 'GPE'),
 ('Atlantic', 32173, 'LOC'),
 ('Attack Siberia', 258, 'PERSON'),
 ('Austrian', 14606, 'NORP'),
 ('Balkans', 9717, 'LOC'),
 ('Bargaining Out', 27388, 'ORG'),
 ('Berlin', 22404, 'GPE'),
 ('Berlin', 31841, 'GPE'),
 ('Big Business', 27435, 'ORG'),
 ('Bolshevism', 22650, 'ORG'),
 ('Bolshevist', 22764, 'NORP'),
 ('Bolsh

In [47]:
#Remove duplicates
ents = list(entities)
print(len(ents))
for e in range(len(ents)):
    if ents[e] != False:
        for f in range(e+1, len(ents)):
            if ents[f] != False:
                if ents[e][0].lower() == ents[f][0].lower():
                #dist = distance(ents[e][0], ents[f][0])
                #if 1 >= dist or (len(ents[e][0]) > 6 and 2 == dist):
                    ents[f] = False

ents = sorted(list(filter(None, ents)), key=itemgetter(1)) #Sort by position in text.
print(len(ents))

376
166
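The nested loop compares every pair of entities. A dictionary keyed on the lowercased text does the same job in one pass. In this sketch (`dedupe_entities` is a hypothetical name) the earliest mention is always the one kept, which is a small change from the loop above, where the survivor depends on the set's iteration order.

```python
def dedupe_entities(entities):
    """Keep one (text, start_char, label) tuple per case-insensitive text."""
    first_seen = {}
    # Visit entities in text order so the earliest mention is the one kept.
    for ent in sorted(entities, key=lambda e: e[1]):
        first_seen.setdefault(ent[0].lower(), ent)
    return list(first_seen.values())


# Hand-made sample in the same shape as the `entities` set.
sample = {('America', 4469, 'GPE'), ('America', 7153, 'GPE'),
          ('american', 1297, 'NORP'), ('American', 7196, 'NORP')}
unique = dedupe_entities(sample)
print(unique)
```

The result is already sorted by position in the text, so no separate sort is needed afterwards.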


# Wikidata

In [26]:
wikisearchurl = "https://www.wikidata.org/w/api.php?action=wbsearchentities&language=en&limit=1&format=json&"
wikigeturl = "https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&languages=en&format=json&"

This code searches Wikidata for each of our entities. In the end, I decided against doing it (one call to the API for each item in our list seemed excessive for this demo) but it might be worth pursuing. 

As a compromise, I could search for items that aren't found in my next attempt below.

In [27]:
"""
cnt = 0
mapdata = []
for search in ents:
    results = get(wikisearchurl + urlencode({'search': search[0]})).json()
    print(search)
    try: 
        print(results)
        res = results['search'][0]
        print(res)
        cnt += 1
    except (KeyError, IndexError):
        print('no results')
    print('-------------------------------')
    sleep(.05)
cnt
"""

This code retrieves Wikidata entries by article name. It misses some items that searching might retrieve, but this API allows us to request 50 titles per call.

In [48]:
outp = {}
while ents:
    srch = '|'.join(e[0] for e in ents[:50])
    del(ents[:50])
    results = get(wikigeturl + urlencode({'titles': srch})).json()
    outp.update({k: v for k, v in results['entities'].items() if k[0] != '-'})
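The batching can be separated from the network call, which makes it easy to check and leaves the original list intact. This is a minimal sketch; `title_batches` and `batch_urls` are hypothetical helper names, and the 120 placeholder titles are made up.

```python
from urllib.parse import urlencode

# Same base URL as wikigeturl above.
WIKI_GET_URL = ('https://www.wikidata.org/w/api.php?action=wbgetentities'
                '&sites=enwiki&languages=en&format=json&')


def title_batches(names, size=50):
    """Yield successive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def batch_urls(names, size=50):
    """Build one wbgetentities URL per batch of titles."""
    return [WIKI_GET_URL + urlencode({'titles': '|'.join(batch)})
            for batch in title_batches(names, size)]


# Placeholder titles, made up for illustration.
names = [f'Title {n}' for n in range(120)]
urls = batch_urls(names)
print(len(urls))
```

Each URL could then be fetched with `get(url).json()` as in the cell above.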

In [29]:
EntityData = namedtuple('EntityData', ('name', 'description', 'url', 'lat', 'long'))
entity_data = []

#P625 is Wikidata's "coordinate location" property.
for idx, ent in outp.items():
    entity_data.append(EntityData(py_.get(ent, 'labels.en.value'),
                                  py_.get(ent, 'descriptions.en.value'),
                                  f'https://www.wikidata.org/wiki/{idx}',
                                  py_.get(ent, 'claims.P625.0.mainsnak.datavalue.value.latitude', None),
                                  py_.get(ent, 'claims.P625.0.mainsnak.datavalue.value.longitude', None)))

# Folium

Finally, as an example of what the Wikidata data is good for, we can map the entities that have geographic coordinates.

In [49]:
import os
import folium
from folium.plugins import MarkerCluster


m = folium.Map()

marker_cluster = MarkerCluster().add_to(m)

for ent in entity_data:
    #Compare against None so that a coordinate of exactly 0 is still mapped.
    if ent.lat is not None and ent.long is not None:
        #print(ent)
        folium.Marker(
            location=[ent.lat, ent.long],
            popup=f'<h3>{ent.name}</h3>{ent.description}',
            icon=folium.Icon(color='blue'),
        ).add_to(marker_cluster)
    
#m.save(os.path.join('1000_MarkerCluster0.html'))

m
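For readers without `pydash`, the coordinate lookup can be written with plain dictionary access. This sketch returns `None` when an entity has no coordinate claim, since a coordinate of exactly 0 is falsy in Python and would be lost in a truthiness test. The function name `entity_coordinates` is hypothetical and the entity fragments are hand-made.

```python
def entity_coordinates(entity):
    """Return (latitude, longitude) from a Wikidata entity dict, or None."""
    try:
        value = entity['claims']['P625'][0]['mainsnak']['datavalue']['value']
        return (value['latitude'], value['longitude'])
    except (KeyError, IndexError, TypeError):
        # No coordinate claim, or a claim without a concrete value.
        return None


# Hand-made entity fragments in the shape the lookups above expect.
with_coords = {'claims': {'P625': [{'mainsnak': {'datavalue': {
    'value': {'latitude': 0.0, 'longitude': 37.9}}}}]}}
without_coords = {'claims': {}}

print(entity_coordinates(with_coords))
print(entity_coordinates(without_coords))
```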

# Closing

* Chronicling America API: https://chroniclingamerica.loc.gov/about/api/
* Wikidata API:  https://www.mediawiki.org/wiki/API:Web_APIs_hub
* [XML (Text Creation Partnership - Eighteenth Century Collections Online)](https://github.com/Text-Creation-Partnership/ECCO-TCP/blob/master/ecco_all/ecco_unfinished/XML/ECCO_unedited.xml/K134436.000.xml) [[Description](https://textcreationpartnership.org/tcp-texts/ecco-tcp-eighteenth-century-collections-online/)]

I ended up using two of these. I considered doing a similar analysis using the ECCO data, but thought I'd have more luck finding named entities from newspapers in Wikidata.