# Exploring Terms in the Encyclopaedia Britannica


### Loading the necessary libraries

In [1]:
import yaml
import matplotlib.pyplot as plt
import numpy as np
import collections
import matplotlib as mpl

In [2]:
import pandas as pd
from yaml import safe_load
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

## The dataframe contains the following information


- definition:           Definition of the article
- edition_num:          Edition number (1 to 8)
- header:               Header of the page on which the article appears
- num_article_words:    Number of words in the article
- place:                Place where the volume was published (e.g. Edinburgh)
- related_terms:        Related articles (see X article)
- source_text_file:     Path of the XML file from which the article was extracted
- term:                 Article name
- term_id_in_page:      Position of the article within the page
- start_page:           Page number on which the article starts
- end_page:             Page number on which the article ends
- title:                Title of the volume
- type_article:         Type of page [Full Page | Topic | Mix | Article]
- year:                 Year of the volume
- volume:               Volume number (e.g. 1)
- letters:              Letters covered by the volume (e.g. A-B)

### 1. Load dataframe from JSON file

In [3]:
df = pd.read_json('./results_NLS/results_eb_1_edition_postprocess_dataframe', orient="index") 

In [4]:
df = df[["term", "definition", "related_terms", "num_article_words", "header",
         "start_page", "end_page", "term_id_in_page", "type_article", "edition_num",
         "volume", "letters", "year", "title", "place", "source_text_file"]]
df

Unnamed: 0,term,definition,related_terms,num_article_words,header,start_page,end_page,term_id_in_page,type_article,edition_num,volume,letters,year,title,place,source_text_file
1,FIRSTARTICLE,S :u -I >;J .1 M U a C V',[],10,**■*,8,8,0,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082813.34.xml
10,AAM,"a Dutch measure for liquids, containing about ...",[],17,EncyclopaediaBritannica,15,15,5,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082904.34.xml
100,ABEYANCE,"in law, the expedtancy of an edate. Thus if la...",[],36,ABE,18,18,0,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082943.34.xml
1000,ALCACER,"de Sal, or Alcarez, a town of Portugal in the ...",[],26,ALBALC,106,106,24,Mix,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188084090.34.xml
10000,NYCTALOPIA,"in medicine, a two-sold disorder or the eye, o...",[],67,NYBNYS,473,473,4,Article,1,3,M-Z,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133903/alto/144810223.34.xml
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,NUTMEG,"the kernel of a large fruit, not unlike the Th...","[MACE, PEEMED, DUTCH, THELARGEFT, EAP-INDIES, ...",451,NUTNUT,472,472,10,Article,1,3,M-Z,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133903/alto/144810211.34.xml
9996,NUTRITION,"in the animal ceconomy, is the repairing the c...",[PISTACHIA],486,NUTNUT,472,473,11,Article,1,3,M-Z,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133903/alto/144810211.34.xml
9997,NUYS,"a town of Germany, twenty miles north of Co-",[],9,NYBNYS,473,473,1,Article,1,3,M-Z,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133903/alto/144810223.34.xml
9998,INYBURG,"a town of Denmark, situated at the eafiend of ...",[],25,NYBNYS,473,473,2,Article,1,3,M-Z,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133903/alto/144810223.34.xml


### 2. Group results by year

In [5]:
df.groupby("year").count()

Unnamed: 0_level_0,term,definition,related_terms,num_article_words,header,start_page,end_page,term_id_in_page,type_article,edition_num,volume,letters,title,place,source_text_file
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1771,13623,13623,13623,13623,13623,13623,13623,13623,13623,13623,13623,13623,13623,13623,13623
1773,13832,13832,13832,13832,13832,13832,13832,13832,13832,13832,13832,13832,13832,13832,13832


#### Remark:
So, we have 13623 entries in the 1771 volumes and 13832 entries in the 1773 volumes. Note that some terms are repeated, so these figures count entries, not unique terms.
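To see which terms are repeated within a year, the rows that share a `(term, year)` pair can be listed with `DataFrame.duplicated`. The sketch below uses a small made-up dataframe (`toy` is hypothetical, not the real data) only to illustrate the idea.

```python
import pandas as pd

# Toy data (hypothetical): "ABACUS" has two entries in 1771 and one in 1773.
toy = pd.DataFrame({
    "term": ["ABACUS", "ABACUS", "AAM", "ABACUS"],
    "year": [1771, 1771, 1771, 1773],
})

# Keep every row whose (term, year) pair occurs more than once.
repeated = toy[toy.duplicated(subset=["term", "year"], keep=False)]
print(repeated)
```

With `keep=False` all copies of a repeated pair are kept, not only the second and later ones.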

### 3. Group results by letters

In [6]:
df.groupby("letters").count()

Unnamed: 0_level_0,term,definition,related_terms,num_article_words,header,start_page,end_page,term_id_in_page,type_article,edition_num,volume,year,title,place,source_text_file
letters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
A-B,7906,7906,7906,7906,7906,7906,7906,7906,7906,7906,7906,7906,7906,7906,7906
C-L,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172,10172
M-Z,9377,9377,9377,9377,9377,9377,9377,9377,9377,9377,9377,9377,9377,9377,9377


### 4. Group results by letters and years

In [7]:
df.groupby(['year', 'letters'])["letters"].count()

year  letters
1771   A-B       3900
       C-L       5097
       M-Z       4626
1773   A-B       4006
       C-L       5075
       M-Z       4751
Name: letters, dtype: int64

#### Remark:
Note that some of those terms are repeated, so these figures count entries, not unique terms.
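The same counts can also be laid out as a table with one row per year and one column per volume, which makes the two printings easier to compare side by side. A minimal sketch with made-up rows (`toy` is hypothetical), using `pd.crosstab`:

```python
import pandas as pd

# Toy data (hypothetical): a handful of entries with their year and volume letters.
toy = pd.DataFrame({
    "year": [1771, 1771, 1771, 1773, 1773],
    "letters": ["A-B", "A-B", "C-L", "A-B", "M-Z"],
})

# Rows are years, columns are volume letter ranges, cells are entry counts.
# Combinations that never occur are filled with 0.
table = pd.crosstab(toy["year"], toy["letters"])
print(table)
```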

### 5. Filtering by TERMS: ABACUS

We are going to explore the term ABACUS.

Note that the first edition of the EB comprises 6 volumes: 3 published in 1771 and 3 published in 1773. However, the 1773 volumes are a reprint of the 1771 ones.

In [8]:
df_by_term=df.groupby(['term', 'year'])["term"].count()
df_by_term["ABACUS"]


year
1771    4
1773    2
Name: term, dtype: int64

#### 5.1 Exploring ABACUS in 1771

We are going to explore the term "ABACUS" in the volumes from 1771.

In [9]:
# Select the rows whose term is exactly "ABACUS" and whose volume was published in 1771
abacus_df = df[(df['term'] == "ABACUS") & (df['year'] == 1771)]
abacus_df

Unnamed: 0,term,definition,related_terms,num_article_words,header,start_page,end_page,term_id_in_page,type_article,edition_num,volume,letters,year,title,place,source_text_file
25,ABACUS,"a table strewed over with dust or sand, upon w...",[],23,EncyclopaediaBritannica,15,15,20,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082904.34.xml
26,ABACUS,"in architeflure, signifies the superior part o...",[],122,EncyclopaediaBritannica,15,16,21,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082904.34.xml
27,ABACUS,is also the name of an ancient instrument for ...,[],125,ABAABB,16,16,1,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082917.34.xml
28,ABACUS,"logijlicus, a right-angled triangle, whose sid...",[],50,ABAABB,16,16,2,Article,1,1,A-B,1771,"Encyclopaedia Britannica; or, A dictionary of ...",Edinburgh,144133901/alto/188082917.34.xml


#### Remark:
So, the term "ABACUS" appears 4 times in 1771, across two pages (15 and 16).
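When filtering by term, it is worth keeping in mind that `str.contains` matches substrings, whereas `==` matches the whole string. The sketch below shows the difference on made-up terms (the entry "ABACUSES" is hypothetical and is only there to make the two filters disagree).

```python
import pandas as pd

# Toy data (hypothetical): "ABACUSES" contains "ABACUS" as a substring.
toy = pd.DataFrame({
    "term": ["ABACUS", "ABACUSES", "ABADAN"],
    "year": [1771, 1771, 1771],
})

# Substring match: also selects "ABACUSES".
substring_match = toy[toy["term"].str.contains("ABACUS")]

# Exact match: selects only "ABACUS".
exact_match = toy[toy["term"] == "ABACUS"]

print(len(substring_match), len(exact_match))
```

For the ABACUS rows shown above both filters return the same 4 entries, but the exact match is the safer choice when counting a single term.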

#### 5.2 Getting the definition for each of them

In [10]:
for definition in abacus_df["definition"]:
    print(f"ABACUS - Definition: {definition}")
    print("---")

ABACUS - Definition: a table strewed over with dust or sand, upon which the ancient mathematicians drew their figures, It also signified a cupboard, or buffet.
---
ABACUS - Definition: in architeflure, signifies the superior part or member of the capital of a column, and serves as a kind of crowning to both. It was originally intended to represent a square tile covering a basket. The form of the abacus is not the same in all orders: in the Tuscan, Doric, and Ionic, it‘is generally square; but in the Corinthian and Compofite, its four sides are arched ir Avards, and embellilhed in the middle withornament, as a rose or other flower, Scammozzi uses abacus for a concave moulding on the capital of the Tuscan pedefial; and Palladio calls the plinth above the echinus, or boultin, in the Tufean and Doric orders, by the same name. See plate I. fig. i. and
---
ABACUS - Definition: is also the name of an ancient instrument for facilitating operations in arithmetic. It is vadoully contrived. That 

#### 5.3 Creating groups of terms and years

In [11]:
df.groupby(['term', 'year']).groups.keys()

dict_keys([('A', 1771), ('A', 1773), ('AA', 1771), ('AA', 1773), ('AAB', 1773), ('AABAM', 1771), ('AACH', 1771), ('AADE', 1771), ('AAHUS', 1771), ('AAM', 1771), ('AAR', 1771), ('AARSEO', 1771), ('AATTER', 1771), ('AB', 1771), ('AB', 1773), ('ABA', 1771), ('ABAC', 1771), ('ABACATUAIA', 1771), ('ABACH', 1771), ('ABACISCUS', 1771), ('ABACO', 1771), ('ABACOT', 1771), ('ABACRE', 1771), ('ABACTORES', 1771), ('ABACTUS', 1771), ('ABACUS', 1771), ('ABACUS', 1773), ('ABADAN', 1771), ('ABADAN', 1773), ('ABADDON', 1771), ('ABADDON', 1773), ('ABADIR', 1771), ('ABADIR', 1773), ('ABAFT', 1771), ('ABAFT', 1773), ('ABAI', 1771), ('ABAISED', 1771), ('ABAISED', 1773), ('ABAISSE', 1771), ('ABAISSE', 1773), ('ABALIENATION', 1771), ('ABALIENATION', 1773), ('ABANBO', 1771), ('ABANBO', 1773), ('ABANCAI', 1771), ('ABANCAI', 1773), ('ABANO', 1771), ('ABANO', 1773), ('ABAPTISTON', 1771), ('ABAPTISTON', 1773), ('ABARCA', 1771), ('ABARCA', 1773), ('ABARTICULATION', 1771), ('ABAS', 1771), ('ABAS', 1773), ('ABASCIA'

In [12]:
len(df.groupby(['term']).groups['ABACUS'])

6

### 6. Grouping the results by TERM, YEAR and DEFINITION

Group the data by term and year, and count the entries in each group. This shows how many times each term is repeated in each year.

In [13]:
a=df.groupby(['term', 'year'])['definition'].count()
a

term         year
A            1771    7
             1773    6
AA           1771    2
             1773    3
AAB          1773    1
                    ..
ZYGOMA       1771    1
             1773    1
ZYGOMATICUS  1773    1
ZYGOPHYLLUM  1771    1
             1773    1
Name: definition, Length: 26771, dtype: int64

**Remark**: This means that the term "A" appears 7 times in 1771 and 6 times in 1773.

#### 6.1 Obtaining for each term, its years and the definitions

We are going to create groups of ("TERM", "YEAR"), and for each of those groups, we are going to print their definition.

**Remark**: Only the first 10 groups are printed.

In [14]:
groups = df[['term', 'year', 'definition']].groupby(['term', 'year'])
# Print the first 10 (term, year) groups together with their sizes
for cont, (group_key, group) in enumerate(groups):
    if cont >= 10:
        break
    print(group)
    print("---- Len of this group: %s - group_key %s " % (len(group), group_key))


      term  year                                         definition
1016     A  1771                                    See Alchemilla.
1086     A  1771  gives a—by and “ the investigation of that sur...
12048    A  1771  r Y\ C / 7 f C.A ( ^y \ ~^\ \ C' A h \A v m aa...
12049    A  1771  I -w 'i <? ^' 0 IY\, y f‘ 1 A_-A IV^-/Y\< -'/W...
12722    A  1771  /kins may be tawed : but thc-se chiefly used f...
3236     A  1771  -Bladder, in physiology. See Air. ^//-Bladders...
6933     A  1771  in London is cieditor to (B) in Paris, value 1...
---- Len of this group: 7 - group_key ('A', 1771) 
      term  year                                         definition
20828    A  1773  in London is creditor to (B) in Paris, value t...
22717    A  1773  so served, and the day of appearance. When the...
23623    A  1773                                     f a 2±= m I ±P
24233    A  1773  l class. ' The calix consists of five leaves, ...
25141    A  1773  worate performance. Government lias, however, .

#### 6.2 Exploring how many times each term appears per year

Now, let's get the size of those groups, so we can see how many definitions we have per term and per year.
This is exactly the same as what we did in 6.1, but with the results in dataframe format.

In [15]:
g_year_term=df.groupby(['term', 'year']).size().reset_index()
g_year_term

Unnamed: 0,term,year,0
0,A,1771,7
1,A,1773,6
2,AA,1771,2
3,AA,1773,3
4,AAB,1773,1
...,...,...,...
26766,ZYGOMA,1771,1
26767,ZYGOMA,1773,1
26768,ZYGOMATICUS,1773,1
26769,ZYGOPHYLLUM,1771,1


#### 6.2.1 Grouping the previous results per year. 
Since the previous dataframe has one row per (term, year) pair, this gives the number of unique terms that we have per year.
**Remember that a term can appear several times per volume**.

In [16]:
g_year_term.groupby(['year']).size()

year
1771    13252
1773    13519
dtype: int64
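Grouping by term and year first and then counting rows per year should be equivalent to counting the distinct terms per year directly. A sketch on made-up data (`toy` is hypothetical) that checks that both routes agree:

```python
import pandas as pd

# Toy data (hypothetical): "A" is repeated within 1771 and also appears in 1773.
toy = pd.DataFrame({
    "term": ["A", "A", "AA", "A", "AAB"],
    "year": [1771, 1771, 1771, 1773, 1773],
})

# Route 1: one row per (term, year) pair, then count the pairs per year.
pairs = toy.groupby(["term", "year"]).size().reset_index()
two_step = pairs.groupby("year").size()

# Route 2: count the distinct terms per year directly.
direct = toy.groupby("year")["term"].nunique()

print(two_step.to_dict(), direct.to_dict())
```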

#### 6.3 Exploring in how many years each term appears

Here we explore, for each unique term, in how many years it appears.

**Remark**: In the first edition of the EB, the 3 volumes (A-B, C-L, and M-Z) were published in two years: 1771 and 1773. So each term appears in at least one year and in at most two.

In [17]:
# Count in how many years (1 or 2) each term appears
terms_per_ed = g_year_term[['term', 'year']].groupby(['term']).count()
terms_per_ed

Unnamed: 0_level_0,year
term,Unnamed: 1_level_1
A,2
AA,2
AAB,1
AABAM,1
AACH,1
...,...
ZUYDERSEE,2
ZWEIBRUGGEN,2
ZYGOMA,2
ZYGOMATICUS,1


This means that the term "A" appears in both years, whereas the term "AAB" appears in only one year.

#### 6.3.1 Exploring the terms that only appear in 1 year

In [18]:
terms_only_once=terms_per_ed[terms_per_ed["year"]<2].reset_index()
terms_only_once

Unnamed: 0,term,year
0,AAB,1
1,AABAM,1
2,AACH,1
3,AADE,1
4,AAHUS,1
...,...,...
4154,ZEDILE,1
4155,ZEGIAS,1
4156,ZEGILETHRON,1
4157,ZINC,1


We now split the previous results to find out how many of the terms that appear in only one year correspond to 1771 and how many to 1773.

In [19]:
list_terms_once = terms_only_once["term"].to_list()

# Keep the (term, year) rows of the single-year terms and count them per year
single_year = g_year_term[g_year_term["term"].isin(list_terms_once)]
cont_dict = single_year["year"].value_counts().sort_index().to_dict()
cont_dict

{1771: 1946, 1773: 2213}

**Remark**: This means that 1946 unique terms appear only in 1771, and 2213 unique terms appear only in 1773.
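This split can be cross-checked with plain Python sets: collect the terms seen in each year and take the set differences. A sketch on made-up data (`toy` is hypothetical, not the real dataframe):

```python
import pandas as pd

# Toy data (hypothetical): "AACH" appears only in 1771 and "AAB" only in 1773.
toy = pd.DataFrame({
    "term": ["A", "A", "AAB", "AACH", "AA", "AA"],
    "year": [1771, 1773, 1773, 1771, 1771, 1773],
})

# Distinct terms seen in each year.
terms_1771 = set(toy.loc[toy["year"] == 1771, "term"])
terms_1773 = set(toy.loc[toy["year"] == 1773, "term"])

# Terms present in one year but not in the other.
only_1771 = terms_1771 - terms_1773
only_1773 = terms_1773 - terms_1771

print(sorted(only_1771), sorted(only_1773))
```

Applied to the real data, `len(only_1771)` and `len(only_1773)` would be expected to match the two counts above.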

#### 6.3.2 Exploring the terms that appear in both years

In [20]:
terms_more_once=terms_per_ed[terms_per_ed["year"]>1].reset_index()
terms_more_once

Unnamed: 0,term,year
0,A,2
1,AA,2
2,AB,2
3,ABACUS,2
4,ABADAN,2
...,...,...
11301,ZUTPHEN,2
11302,ZUYDERSEE,2
11303,ZWEIBRUGGEN,2
11304,ZYGOMA,2
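As a consistency check, the number of terms shared by both years can be derived from the figures reported earlier: the unique terms of a year (6.2.1) minus the terms that appear only in that year (6.3.1). Both years should give the same result, which would also be expected to match the number of rows in `terms_more_once`. A small sketch of that arithmetic (the variable names are illustrative):

```python
# Figures reported earlier in this notebook.
unique_terms = {1771: 13252, 1773: 13519}      # distinct terms per year (6.2.1)
single_year_terms = {1771: 1946, 1773: 2213}   # terms found in only one year (6.3.1)

# Terms shared by both years, derived independently from each year.
shared = {year: unique_terms[year] - single_year_terms[year] for year in unique_terms}
print(shared)  # {1771: 11306, 1773: 11306}
```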
