## Visualizing results - Original Geoparser

We ran the [Edinburgh geoparser](https://programminghistorian.org/en/lessons/geoparsing-text-with-edinburgh#patch-fix) to extract and resolve the locations contained in the **First Edition, Volume 1, A-B** (1771) of the *Encyclopaedia Britannica*.

We used the following command to run the query:

`spark-submit --py-files defoe.zip defoe/run_query.py nls_subsample.txt nls defoe.nls.queries.geoparser_pages -r geopaser_original_EB_144133901 -n 16`

The file nls_subsample.txt contains the path to the particular book that we are exploring:

`xxx/nls-data-encyclopaediaBritannica/144133901`

The results file ([geopaser_original_EB_144133901](http://localhost:8888/edit/results_NLS/geopaser_original_EB_144133901)) is stored in this repository under the *results_NLS* subdirectory.


### Loading the necessary libraries

In [1]:
import yaml
from IPython.display import display, HTML
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

### Functions

In [2]:
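def read_query_results(filename):
    # Load a defoe results file (YAML) from the results_NLS subdirectory.
    # safe_load is used because yaml.load() without an explicit Loader
    # raises an error in PyYAML 6 and later.
    with open('./results_NLS/' + filename, 'r') as f:
        query_results = yaml.safe_load(f)
    return query_results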


In [3]:
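def geresolution_dataframe(result):
    # Flatten the georesolution results into a single DataFrame with one row
    # per resolved location: Latitude, Longitude, Place and Page.
    dfs = []
    for book in result.keys():
        t_ind = 0  # index of the first row of the next page
        for k in result[book]:
            locs = k["georesolution_page"]
            page = k["text_unit id"]
            if locs != {}:
                data = []
                # A separate loop variable avoids shadowing the outer one.
                for loc_id in locs:
                    # Keep only entries that carry a [latitude, longitude] list
                    if isinstance(locs[loc_id], list):
                        c_locs = locs[loc_id].copy()
                        # Keys look like "Scotland-rb2": keep the text before the first hyphen
                        c_locs.append(loc_id.split("-")[0])
                        c_locs.append(page)
                        data.append(c_locs)
                e_ind = t_ind + len(data)
                if data:
                    df_page = pd.DataFrame(data, columns=['Latitude', 'Longitude', 'Place', 'Page'],
                                           index=list(range(t_ind, e_ind)))
                    dfs.append(df_page)
                    t_ind = e_ind
    df_total = pd.concat(dfs)
    return df_total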


This query does the following tasks:

- Ingests all the pages from the directory "144133901", which corresponds to the book "Encyclopaedia Britannica; or, A dictionary of arts and sciences, compiled upon a new plan … - First edition, 1771, Volume 1, A-B - EB.1".
- Cleans the text by applying two fixes: long-s and hyphenated words.
- Identifies *entities* using the original geotagging of the geoparser.
- From those entities, selects only the *location* entities and creates an XML document (in memory) per page containing them.
- Applies the georesolver to each XML document to obtain the latitude and longitude. **Important: everything is done in memory; no XML files are created in these steps.**
- Groups the results by book title and also collects some informative metadata.

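As a rough illustration of the hyphen fix mentioned in the list above, the sketch below joins word halves that are separated by a hyphen followed by whitespace. It is not defoe's implementation: the function name `join_hyphenated` and the regular expression are assumptions made for this example, and the long-s fix is not covered.

```python
import re


def join_hyphenated(text):
    """Join word halves split as 'pre- pare' (hyphen followed by whitespace).

    Hypothetical helper for illustration only; defoe's own fix may differ.
    """
    # A word character, a hyphen, at least one whitespace character, then
    # another word character: drop the hyphen and the whitespace.
    return re.sub(r"(\w)-\s+(\w)", r"\1\2", text)


sample = "in order to pre- pare it for the Recep- tion of Plants"
print(join_hyphenated(sample))
```

A pattern this simple also merges legitimate constructions such as "long- and short-term", so a production fix would need a dictionary check or similar safeguard.
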
As a result, we get one file per gazetteer|book, with one entry per page containing the following information:

    * archive_filename: Path to the gazetteer
    * clean_text: Page's clean text after applying the two fixes: long-s and hyphenated words
    * display_ER: Display of a page's entities found by the geoparser (in HTML format)
    * edition: Edition of the gazetteer
    * georesolution_page: Page's geolocations after applying the georesolver
    * model: defoe model (fmp|nls|papers|alto). In this case it is "nls"
    * text_unit: page (for other defoe models it could be "article")
    * num_text_unit: Number of text units. In this case, the number of pages of this particular gazetteer (e.g. 832)
    * page_filename: Page's filename (page's relative path)
    * text_unit id: The number of this page (e.g. Page 1)
    * lang_model: The language model applied (geoparser_original)
    * type_distribution: Type of document (newspaper|book). In this case it is "book"
    * year: Publication year


Example:
  - archive_filename: /home/tdm/datasets/encyclopaedia-britannica-sample/144133901
  - clean_text: "AGRICULTURE. 47 PART II. Of the various Operations upon the Soil, in\
    \ order to prepare it for the Recsp-' tion and Nourijhment of Plants. S e c t.\
  - edition: First edition, 1771, Volume 1, A-B
  - georesolution_page:
     - Scotland-rb2:
       - '57.68633318560074'
       - '-4.96890721218449'
  - lang_model: geoparser_original
  - model: nls
  - num_text_unit: 832
  - page_filename: alto/188083661.34.xml
  - place: Edinburgh
  - text_unit: page
  - text_unit id: Page73
  - type_distribution: book

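To make the record structure concrete, here is a minimal sketch that turns one such entry into DataFrame rows. The `record` dictionary is a hand-written stand-in trimmed to the two fields used here, and it assumes `georesolution_page` is a mapping from keys such as `Scotland-rb2` to a `[latitude, longitude]` list, as in the example above.

```python
import pandas as pd

# Hypothetical record, reduced to the fields needed for flattening.
record = {
    "text_unit id": "Page73",
    "georesolution_page": {
        "Scotland-rb2": ["57.68633318560074", "-4.96890721218449"],
    },
}

rows = []
for key, coords in record["georesolution_page"].items():
    # Skip anything that is not a [latitude, longitude] list.
    if isinstance(coords, list):
        lat, lon = coords
        place = key.split("-")[0]  # "Scotland-rb2" -> "Scotland"
        rows.append([lat, lon, place, record["text_unit id"]])

df_example = pd.DataFrame(rows, columns=["Latitude", "Longitude", "Place", "Page"])
print(df_example)
```

Note that the coordinates are kept as strings here, exactly as they appear in the record.
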


In [4]:
results=read_query_results('geopaser_original_EB_144133901')

In [5]:
df_total= geresolution_dataframe(results)

In [6]:
df_total

| | Latitude | Longitude | Place | Page |
|---|---|---|---|---|
| 0 | 57.68633318560074 | -4.96890721218449 | SCOTLAND | Page9 |
| 1 | 45.27190104864457 | -69.02811957460864 | Me | Page12 |
| 2 | 40.66852 | 16.60158 | Materia | Page12 |
| 3 | 53.69180622064813 | -2.423529257645026 | Scotland | Page13 |
| 4 | 27.77136719389102 | -83.82959337018778 | Florida | Page13 |
| 5 | 33.62540263492369 | -80.96431606708276 | Carolina | Page13 |
| 6 | 34.8381384 | -84.41853570000001 | Bart | Page13 |
| 7 | 18.11396006320902 | -77.28058823696463 | Jamaica | Page14 |
| 8 | -29.1 | 18.4166667 | Amam | Page14 |
| 9 | 35.67397 | -94.07353999999999 | Locke | Page14 |


In [7]:
df_total[["Place"]].count()

Place    5618
dtype: int64

In [8]:
df_total.sum()

Latitude     57.6863331856007445.2719010486445740.6685253.6...
Longitude    -4.96890721218449-69.0281195746086416.60158-2....
Place        SCOTLANDMeMateriaScotlandFloridaCarolinaBartJa...
Page         Page9Page12Page12Page13Page13Page13Page13Page1...
dtype: object

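The concatenated strings in the `sum()` output above suggest that the Latitude and Longitude columns hold text rather than numbers. A minimal sketch of converting them, using a small toy DataFrame instead of the real results (the empty-coordinate row is invented for illustration):

```python
import pandas as pd

# Toy data: coordinates stored as strings, one row with empty coordinates.
df = pd.DataFrame({
    "Latitude": ["57.68633318560074", "", "40.66852"],
    "Longitude": ["-4.96890721218449", "", "16.60158"],
    "Place": ["SCOTLAND", "Example", "Materia"],
})

# errors="coerce" turns values that cannot be parsed (such as "") into NaN.
for col in ["Latitude", "Longitude"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
print(df[["Latitude", "Longitude"]].describe())
```

After the conversion, numeric summaries such as `describe()` or `mean()` operate on floats instead of concatenating text.
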
In [9]:
df_total.groupby("Place").count()

| Place | Latitude | Longitude | Page |
|---|---|---|---|
| AACH | 1 | 1 | 1 |
| AADE | 1 | 1 | 1 |
| ABACA | 1 | 1 | 1 |
| ABACH | 1 | 1 | 1 |
| ABACO | 1 | 1 | 1 |
| ABADAN | 1 | 1 | 1 |
| ABADIR | 1 | 1 | 1 |
| ABAI | 1 | 1 | 1 |
| ABANCAI | 1 | 1 | 1 |
| ABANO | 1 | 1 | 1 |

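The grouped output above is ordered alphabetically, and the preview of `df_total` shows the same name in different casings ("SCOTLAND" and "Scotland"). The sketch below ranks places by frequency after normalizing case. The data is a toy sample, and treating the two casings as the same place is an assumption that may not suit every analysis.

```python
import pandas as pd

# Toy sample of place mentions with mixed casing.
df = pd.DataFrame({
    "Place": ["SCOTLAND", "Scotland", "Florida", "Scotland", "Jamaica", "Florida"],
})

# value_counts() sorts by descending frequency by default.
top_places = df["Place"].str.upper().value_counts()
print(top_places.head(3))
```
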

In [10]:
df_total.groupby("Page").count()

| Page | Latitude | Longitude | Place |
|---|---|---|---|
| Page100 | 9 | 9 | 9 |
| Page101 | 37 | 37 | 37 |
| Page102 | 25 | 25 | 25 |
| Page103 | 34 | 34 | 34 |
| Page104 | 24 | 24 | 24 |
| Page105 | 34 | 34 | 34 |
| Page106 | 42 | 42 | 42 |
| Page109 | 22 | 22 | 22 |
| Page110 | 23 | 23 | 23 |
| Page111 | 36 | 36 | 36 |

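Because the page labels are strings, grouping sorts them lexicographically, which is why "Page100" appears first. A minimal sketch of reordering the counts by page number, using toy data and assuming every label contains a single run of digits:

```python
import re

import pandas as pd

# Toy sample of page labels in the "PageN" format used by the results.
df = pd.DataFrame({
    "Page": ["Page100", "Page9", "Page12", "Page9", "Page100", "Page12", "Page12"],
})

counts = df.groupby("Page").size()

# Sort the labels by the integer they contain, then reindex the counts.
ordered_pages = sorted(counts.index, key=lambda p: int(re.search(r"\d+", p).group()))
counts_sorted = counts.loc[ordered_pages]
print(counts_sorted)
```
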

In [11]:
df_total.groupby(["Latitude", "Longitude"]).count()

| Latitude | Longitude | Place | Page |
|---|---|---|---|
| | | 298 | 298 |
| -0.1366818175840478 | 100.6383162769587 | 2 | 2 |
| -0.1544928803232608 | -78.44005994797827 | 1 | 1 |
| -0.2333333 | -78.33333330000001 | 1 | 1 |
| -0.6525817868904085 | 14.88498763015494 | 1 | 1 |
| -1.0753 | 121.7811 | 1 | 1 |
| -1.15 | 17.3333333 | 1 | 1 |
| -1.380613091643525 | -48.41852484800316 | 1 | 1 |
| -1.445725396559766 | 5.629326002193295 | 1 | 1 |
| -10.28333 | 14.95 | 1 | 1 |

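The first group above has blank coordinates and a count of 298. Since `groupby` drops missing (NaN) keys by default, these are most likely empty strings, that is, locations that were tagged but not resolved. The sketch below separates such rows from the resolved ones; the data and the placeholder names `PlaceA` and `PlaceB` are invented for illustration.

```python
import pandas as pd

# Toy data: two resolved locations and two with empty coordinate strings.
df = pd.DataFrame({
    "Latitude": ["57.68633318560074", "", "18.11396006320902", ""],
    "Longitude": ["-4.96890721218449", "", "-77.28058823696463", ""],
    "Place": ["Scotland", "PlaceA", "Jamaica", "PlaceB"],
    "Page": ["Page9", "Page12", "Page14", "Page14"],
})

# A row counts as unresolved if either coordinate is an empty string.
unresolved_mask = (df["Latitude"] == "") | (df["Longitude"] == "")
unresolved = df[unresolved_mask]
resolved = df[~unresolved_mask]

print(len(unresolved), "unresolved;", len(resolved), "resolved")
print(sorted(unresolved["Place"].unique()))
```

Listing the unresolved place names is a quick way to see which strings the georesolver could not match.
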