<a href="https://colab.research.google.com/github/baltuna/LT4DH/blob/main/LT4DH_ELTeC_CountingWords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting part of speech from ELTeC-ENG

An adaptation of a great Colab by Borja Navarro for the LT4DH course at the University of the Basque Country.

This version (to be cleaned) uses English resources, in contrast to the Spanish ones used in Borja Navarro's original.

Original data here:

Borja Navarro Colorado | University of Alicante

In this case, the part-of-speech information has not been manually annotated in the corpus. The novels must first be analyzed with an NLP system, and the linguistic information is then extracted from its output. The NLP system used is [SpaCy](https://spacy.io/).

The notebook shows:

- how to open a novel from ELTeC in Colab and analyze it with SpaCy, and
- how to use SpaCy's output for DH analysis.


## Loading the ELTeC-ENG corpus in Colab

In [None]:
import zipfile

!wget "https://github.com/COST-ELTeC/ELTeC-eng/archive/refs/heads/master.zip" # paste here corpus url

with zipfile.ZipFile('master.zip', 'r') as zip_ref: #Opens the zip file in read mode and closes it afterwards
  zip_ref.extractall() #Extracts files here (/content/)
!rm master.zip #Removes ZIP to save space
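
The extraction step above can also be wrapped in a small helper that reports which XML files were unpacked. This is only a sketch: the function name `extract_corpus` and its parameters are hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, dest_dir="."):
    """Extract a corpus ZIP into dest_dir and list the XML files found there.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest)
    # Collect every .xml file below dest (for ELTeC, the novels in level1/).
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.xml"))
```

In Colab, a call such as `extract_corpus("master.zip", "/content")` should return the relative paths of the novels, which makes it easy to check that the download worked before choosing a file name.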

## SpaCy: downloading and installing

[SpaCy](https://spacy.io/) is an NLP system. It analyzes parts of speech and lemmas, syntax (dependencies), and named entities.

Three steps:

1. Import SpaCy into Colab
2. Download the language model (English)
3. Load the model


In [None]:
import spacy

!python -m spacy download en_core_web_sm #Download the English model (the "small" one in this case: "sm").

import en_core_web_sm
nlp_eng = en_core_web_sm.load() #Load English analyzer in "nlp_eng".

## Analyzing a novel from ELTeC-ENG

Once we have downloaded the corpus and activated SpaCy, let's analyze one novel.

First, select a novel from the [ELTeC-ENG](https://github.com/COST-ELTeC/ELTeC-eng/tree/master/level1) corpus and copy its file name. Then paste the name into the variable "novela_name". In this example, we analyze the file ENG18510_Kingsley.xml.

In [None]:
import os
from bs4 import BeautifulSoup

novela_name = "ENG18510_Kingsley.xml" # Put here the name of the file
dir_in = "/content/ELTeC-eng-master/level1/"

novela_text = '' 

print('Analyzing', novela_name)

ficheroEntrada = dir_in + novela_name
with open(ficheroEntrada, 'r', encoding='utf-8') as tei: #Opens the file as UTF-8 text
  print("Opening the file and extracting text")
  soup = BeautifulSoup(tei, 'xml') #Parse the XML
  capitulos = soup.find_all(type="chapter") #Only chapters are taken into account. No letters (To Do)
  for cap in capitulos:
    parrafos = cap.find_all('p') #Extract all paragraphs of each chapter
    for parrafo in parrafos:
      #print(parrafo.text)
      novela_text+=parrafo.text+'\n' #One paragraph per line

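print('Analyzing PoS and lemmas')
# SpaCy rejects texts longer than nlp.max_length (1,000,000 characters by default), so raise the limit for long novels.
nlp_eng.max_length = max(nlp_eng.max_length, len(novela_text) + 1)
analisis = nlp_eng(novela_text) #Here the novel is analyzed with SpaCy. The whole analysis is stored in the "analisis" variable.
print('Done!')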
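
To make the text-extraction step easier to follow, here is a minimal sketch of the same idea on a tiny inline sample. It uses the standard-library `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages. The sample document, the name `chapter_paragraphs`, and the assumption that chapters are `<div type="chapter">` elements in the TEI namespace are illustrative and have not been checked against every ELTeC file. Unlike the BeautifulSoup call above, which matches any element with `type="chapter"`, this version only looks at `<div>` elements.

```python
import xml.etree.ElementTree as ET

# TEI namespace (assumed here to be the one declared in the ELTeC files).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <front><div type="titlepage"><p>Title page text</p></div></front>
    <body>
      <div type="chapter"><head>Chapter 1</head><p>First paragraph.</p><p>Second <hi>paragraph</hi>.</p></div>
      <div type="chapter"><p>Third paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def chapter_paragraphs(xml_string):
    """Return the text of every <p> found inside a <div type="chapter">."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for div in root.iterfind(".//tei:div[@type='chapter']", TEI_NS):
        for p in div.iterfind(".//tei:p", TEI_NS):
            # itertext() also collects text nested in child elements such as <hi>.
            paragraphs.append("".join(p.itertext()).strip())
    return paragraphs


# One paragraph per line, as in the notebook's "novela_text".
novel_text = "\n".join(chapter_paragraphs(SAMPLE_TEI))
print(novel_text)
```

The title page paragraph is skipped because its `<div>` is not of type "chapter", which mirrors the notebook's choice to analyze chapters only.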

Now the whole analysis is stored in the "analisis" variable. All that remains is to iterate over it and extract the information we need: in this case, part of speech. For how to extract information about syntax, named entities, etc., see [SpaCy 101](https://spacy.io/usage/spacy-101).

In [None]:
NVA = '\tNovel\tNouns\tVerbs\tAdjectives\tUnique_nouns\tUnique_verbs\tUnique_adjs\n' #

nom_novela = 'Yeats'
#nom_novela = novela_name

nouns=[]
verbs=[]
adjs=[]

noun_counts= dict()
verb_counts= dict()
adj_counts= dict()

# Alternative (disabled): count lowercased surface forms instead of lemmas.
# for token in analisis: 
#   if token.pos_ == 'NOUN':
#     if token.text.lower() in noun_counts:
#        noun_counts[token.text.lower()] += 1
#     else:
#        noun_counts[token.text.lower()] = 1
#   elif token.pos_ == 'VERB':
#     if token.text.lower() in verb_counts:
#        verb_counts[token.text.lower()] += 1
#     else:
#         verb_counts[token.text.lower()] = 1
#   elif token.pos_ == 'ADJ':
#     if token.text.lower() in adj_counts:
#        adj_counts[token.text.lower()] += 1
#     else:
#        adj_counts[token.text.lower()] = 1

# Count lemmas per part of speech (nouns, verbs, adjectives).
for token in analisis:
  if token.pos_ == 'NOUN':
    noun_counts[token.lemma_] = noun_counts.get(token.lemma_, 0) + 1
  elif token.pos_ == 'VERB':
    verb_counts[token.lemma_] = verb_counts.get(token.lemma_, 0) + 1
  elif token.pos_ == 'ADJ':
    adj_counts[token.lemma_] = adj_counts.get(token.lemma_, 0) + 1




# Sort each dictionary by appearance count in descending order
sorted_nouns = sorted(noun_counts.items(), key=lambda x: x[1], reverse=True)
sorted_verbs = sorted(verb_counts.items(), key=lambda x: x[1], reverse=True)
sorted_adjs = sorted(adj_counts.items(), key=lambda x: x[1], reverse=True)

# Print the 50 most frequent nouns, then verbs, then adjectives, with their counts
for i, (noun, count) in enumerate(sorted_nouns):
    if i >= 50:
        break
    print(noun, count)

print("\n-----------------\n")

for i, (verb, count) in enumerate(sorted_verbs):
    if i >= 50:
        break
    print(verb, count)

print("\n-----------------\n")

for i, (adj, count) in enumerate(sorted_adjs):
    if i >= 50:
        break
    print(adj, count)
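
The counting and sorting above could also be written with `collections.Counter`. The following is a sketch under stated assumptions: the function `count_lemmas_by_pos` and the `Token` stand-in are hypothetical names introduced here. The stand-in mimics only the `lemma_` and `pos_` attributes of a SpaCy token so that the example runs without a language model. In the notebook, "analisis" could be passed in place of `demo_tokens`.

```python
from collections import Counter, namedtuple

# Minimal stand-in for a SpaCy token (only the two attributes used below).
Token = namedtuple("Token", ["lemma_", "pos_"])


def count_lemmas_by_pos(tokens, pos_tags=("NOUN", "VERB", "ADJ")):
    """Return one Counter of lemmas for each requested part-of-speech tag."""
    counts = {pos: Counter() for pos in pos_tags}
    for token in tokens:
        if token.pos_ in counts:
            counts[token.pos_][token.lemma_] += 1
    return counts


# Invented demo data, for illustration only.
demo_tokens = [
    Token("dog", "NOUN"), Token("run", "VERB"), Token("dog", "NOUN"),
    Token("old", "ADJ"), Token("the", "DET"),
]

counts = count_lemmas_by_pos(demo_tokens)
for pos, counter in counts.items():
    # most_common(50) returns up to 50 (lemma, count) pairs, most frequent first.
    print(pos, counter.most_common(50))
```

`Counter.most_common` replaces both the manual sort and the `enumerate` loop with an early `break`, and supporting another tag only requires extending `pos_tags`.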




