<a href="https://colab.research.google.com/github/baltuna/LT4DH/blob/main/LT4DH_ELTeC_NER_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting part of speech from ELTeC-ENG

Adaptation of a great Colab by Borja Navarro for the LT4DH course at the University of the Basque Country.

This version (to be cleaned) uses English resources, in contrast to the Spanish ones used by Borja Navarro.

Original reference here:

Borja Navarro Colorado | University of Alicante

In this case, the information about part of speech has not been manually annotated in the corpus. It is first necessary to analyze the novels with an NLP system and then extract the linguistic information. The NLP system used is [SpaCy](https://spacy.io/).

The notebook shows:

- how to open a novel from ELTeC in COLAB,
- how to activate SpaCy in COLAB,
- how to analyze the novel with SpaCy, and
- how to extract information about part of speech.


## Loading ELTeC-ENG corpus in Colab

In [None]:
import zipfile

!wget "https://github.com/COST-ELTeC/ELTeC-eng/archive/refs/heads/master.zip" # paste the corpus URL here

with zipfile.ZipFile('master.zip', 'r') as zip_ref: # Opens the zip file in read mode (closed automatically)
  zip_ref.extractall() # Extracts files here (/content/)
!rm master.zip # Removes ZIP to save space

--2022-03-15 17:09:13--  https://github.com/COST-ELTeC/ELTeC-eng/archive/refs/heads/master.zip
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/COST-ELTeC/ELTeC-eng/zip/refs/heads/master [following]
--2022-03-15 17:09:13--  https://codeload.github.com/COST-ELTeC/ELTeC-eng/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 192.30.255.120
Connecting to codeload.github.com (codeload.github.com)|192.30.255.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [      <=>           ]  86.87M  4.05MB/s    in 22s     

2022-03-15 17:09:35 (4.01 MB/s) - ‘master.zip’ saved [91092993]
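
Once the archive is unpacked, it can help to confirm which novels are actually available before choosing one. The following is a minimal sketch, assuming the files end up in `/content/ELTeC-eng-master/level1/` (the path used later in this notebook); `list_novels` is a hypothetical helper, not part of the original Colab.

```python
from pathlib import Path

def list_novels(corpus_dir):
    """Return the sorted names of the XML files found directly in corpus_dir."""
    folder = Path(corpus_dir)
    if not folder.is_dir():
        return []  # nothing extracted yet, or the corpus lives in a different path
    return sorted(p.name for p in folder.glob("*.xml"))

# Assumed location after extraction in Colab.
novels = list_novels("/content/ELTeC-eng-master/level1/")
print(len(novels), "novels found")
print(novels[:5])  # first few file names, ready to copy into "novela_name"
```

If the list comes back empty, the download or extraction step probably did not run in the current session.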



## SpaCy: downloading and installing

[SpaCy](https://spacy.io/) is an NLP system. It analyzes part of speech and lemmas, syntax (dependencies) and named entities.

Three steps:

1. Import SpaCy to Colab
2. Download language module (English)
3. Activate module


In [None]:
import spacy

!python -m spacy download en_core_web_sm #Download English module (the "small" module in this case: "sm").

import en_core_web_sm
nlp_eng = en_core_web_sm.load() #Load English analyzer in "nlp_eng".

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 8.6 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## Analyzing a novel from ELTeC-ENG

Once we have downloaded the corpus and activated SpaCy, let's analyze one novel.

First, select a novel from the [ELTeC-ENG](https://github.com/COST-ELTeC/ELTeC-eng/tree/master/level1) corpus and copy its file name. Then paste the name into the variable "novela_name". In this example, we will analyze the novel [*Yeast*](https://github.com/COST-ELTeC/ELTeC-eng/tree/master/level1/ENG18510_Kingsley.xml): ENG18510_Kingsley.xml

In [None]:
import os
from bs4 import BeautifulSoup

novela_name = "ENG18510_Kingsley.xml" # Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r') as tei: #Opens the file
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. Letters are skipped (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n'

#print('Analyzing PoS and lemmas')
analisis = nlp_eng(novela_text) #Here the novel is analyzed with SpaCy. All the analysis is stored in "analisis" variable.
print('Done!')

Analyzing ENG18510_Kingsley.xml
Opening the file and extracting text
Done!
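
The text-extraction step can also be isolated into a small function, which makes it easy to try on a toy document before running it on a whole novel. The sketch below is not the notebook's method: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, assumes the files use the TEI namespace `http://www.tei-c.org/ns/1.0` and mark chapters as `<div type="chapter">`, and collapses whitespace inside each paragraph (the cell above keeps the original line breaks). `chapter_paragraphs` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # assumed TEI namespace

def chapter_paragraphs(xml_string):
    """Return the text of every <p> inside a <div type="chapter">."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for chapter in root.iterfind(".//tei:div[@type='chapter']", TEI_NS):
        for p in chapter.iterfind(".//tei:p", TEI_NS):
            # itertext() also collects text nested in child elements such as <hi>;
            # split/join collapses line breaks and runs of spaces into single spaces.
            text = " ".join("".join(p.itertext()).split())
            paragraphs.append(text)
    return paragraphs

# Toy document: the preface is not a chapter, so it should be skipped.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<front><div type="liminal"><p>Preface, skipped.</p></div></front>
<body><div type="chapter"><head>I</head>
<p>First <hi>short</hi> paragraph.</p><p>Second
   paragraph.</p></div></body></text></TEI>"""

print(chapter_paragraphs(sample))
```

Joining the returned list with `'\n'` gives a string comparable to "novela_text", minus the line breaks inside paragraphs.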


Now the whole analysis is stored in the "analisis" variable. All that remains is to iterate over it and extract the information: in this case, named entities. For how to extract information about syntax, named entities, etc., see [SpaCy 101](https://spacy.io/usage/spacy-101).
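
Before printing every entity, a quick tally by label gives an overview of what the analyzer found. The sketch below is only illustrative: `count_labels` and `unique_texts` are hypothetical helpers, and the stand-in entities are made up so that the code runs without a model. With SpaCy loaded, `analisis.ents` could be passed instead, since the helpers only read the `text` and `label_` attributes.

```python
from collections import Counter
from types import SimpleNamespace

def count_labels(ents):
    """Count how many entities carry each label."""
    return Counter(ent.label_ for ent in ents)

def unique_texts(ents, label):
    """Return the distinct, whitespace-normalized texts of entities with the given label."""
    return sorted({" ".join(ent.text.split()) for ent in ents if ent.label_ == label})

# Stand-in entities (invented for this sketch); with SpaCy, use analisis.ents instead.
fake_ents = [
    SimpleNamespace(text="Eden", label_="LOC"),
    SimpleNamespace(text="the\n     river", label_="LOC"),  # entity text may contain a line break
    SimpleNamespace(text="Eden", label_="LOC"),
    SimpleNamespace(text="Lancelot", label_="PERSON"),
]

print(count_labels(fake_ents).most_common())
print(unique_texts(fake_ents, "LOC"))
```

Normalizing whitespace matters here because an entity that spans a line break in the source text would otherwise be printed across two lines.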

In [None]:
from google.colab import files

def show_ents(doc): # works
  # Prints every entity as: text, start offset, end offset, label (tab-separated)
  for ent in doc.ents:
    print(ent.text+'\t'+str(ent.start_char)+'\t'+str(ent.end_char)+'\t'+ent.label_)
  print(str(len(doc.ents))+' entities were found')

def one_ent(doc): # works
  # Keeps and prints only the entities of one type (here: LOC)
  for ent in doc.ents:
    if ent.label_ == 'LOC':
      entitateak.append(ent)
      print(ent.text+'\t'+str(ent.start_char)+'\t'+str(ent.end_char)+'\t'+ent.label_)
  print(str(len(entitateak))+' entities were found')

def two_ents(doc): # works
  # Keeps and prints only the temporal entities (TIME or DATE)
  for ent in doc.ents:
    if ent.label_ in ('TIME', 'DATE'):
      entitateak.append(ent)
      print(ent.text+'\t'+str(ent.start_char)+'\t'+str(ent.end_char)+'\t'+ent.label_)
  print(str(len(entitateak))+' temporal entities were found')



entitateak=[]

########################ENTITY TYPES#############################
# CARDINAL, PERSON, ORG, GPE, NORP, EVENT, DATE, WORK_OF_ART, TIME, PRODUCT, LOC, LANGUAGE, FAC, ORDINAL


#################################PROGRAM##########################################

#show_ents(analisis)

one_ent(analisis)

#two_ents(analisis)


#entitateak_out = ''
#for entitate in entitateak:
#  entitateak.sort()
#  entity = entitate[0]
#  enttype = entitate[3]
#  entitateak_out+=str(entity)+'\t'+enttype+'\n'


#out = open('chosen_entities.csv', 'w') #Opens a file in write mode ("w").
#out.write(entitateak) # "Writes" the content of authors_titles_out in the file
#out.close() #Closes the file
#files.download('chosen_entities.csv')


##################################################################################

#num_pers = str(len(pers))

#print(num_pers)

#NVA = '\tNovel\tORG\tPER\tGEO\tGPE\tTIM\tART\tEVE\tNAT\n' #
#NVA = '\tNovel\tNouns\tVerbs\tAdjectives\tUnique_nouns\tUnique_verbs\tUnique_adjs\tNER\n' #

#nom_novela = 'Yeast'
#nom_novela = novela_name

#novela=novela_text

#unique_ent=[]

#orgs=[]
#pers=[]
#geos=[]
#geps=[]
#tims=[]
#arts=[]
#eves=[]
#nats=[]





#for ent in analisis.ents: 
 # if ent.ent_type == 'B-geo':
  #  eves.append(token.lemma_) #
 # elif token.pos_ == 'VERB':
  #  verbs.append(token.lemma_) #
   # if token.lemma not in unique_verb:
    #   unique_verb.append(token.lemma)
 # elif token.pos_ == 'ADJ':
  #  adjs.append(token.lemma_) #
   # if token.lemma not in unique_adj:
    #   unique_adj.append(token.lemma)


#num_nouns = str(len(nouns))
#num_verbs = str(len(verbs))
#num_adjs = str(len(adjs))
#num_uninouns = str (len(unique_noun))
#num_univerbs = str (len(unique_verb))
#num_uniadjs = str (len(unique_adj))
#num_eves = str(len(ners))

#NVA += '\t'+nom_novela+'\t'+num_nouns+'\t'+num_verbs+'\t'+num_adjs+'\t'+num_uninouns+'\t'+num_univerbs+'\t'+num_uniadjs+'\t'+num_eves+'\n' #

#print(NVA)

#salida = open('analisis_soloNombres.txt', 'w') #Creates the file, etc.
#salida.write(corpus_soloNombres)
#salida.close()
#files.download('analisis_soloNombres.txt')

#out = open('author_titles.csv', 'w') #Opens a file in write mode ("w").
#out.write(authors_titles_out) # "Writes" the content of authors_titles_out in the file
#out.close() #Closes the file
#files.download('author_titles.csv')


Eden	2583	2587	LOC
the south-east	2795	2809	LOC
Earth	8287	8292	LOC
Griselda	11026	11034	LOC
temper—(Lancelot	11038	11054	LOC
the
     river	12899	12913	LOC
Venus	19796	19801	LOC
Paradise	39821	39829	LOC
Eden	40861	40865	LOC
Earth	45767	45772	LOC
Plotinus	57100	57108	LOC
Avernus	57483	57490	LOC
whereof	68797	68804	LOC
the distant sea	78936	78951	LOC
Venus	84801	84806	LOC
Mops	102146	102150	LOC
Eden	104853	104857	LOC
the
     hedge	119616	119630	LOC
earth	128764	128769	LOC
Earth	133037	133042	LOC
Earth	136921	136926	LOC
Young England	146051	146064	LOC
Peelite	146729	146736	LOC
earth	174724	174729	LOC
earth	199919	199924	LOC
earth	199951	199956	LOC
earth	200786	200791	LOC
earth	202027	202032	LOC
the Cannibal Islands	208885	208905	LOC
Cannibal Island	209133	209148	LOC
gulf	267161	267165	LOC
Mahomet	285041	285048	LOC
Apostles	285702	285710	LOC
Christian England	289789	289806	LOC
the
     river	298868	298882	LOC
Redruth	361720	361727	LOC
earth	391790	391795	LOC
Jesuit	400260	400266	LOC
the 