# NLP Basics Assessment
# Version IVAN MORAN - ERIK VERGARA

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ohtar10/icesi-nlp/blob/main/Sesion1/6-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus:
[_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). The story is in the public domain, and the corpus was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

## References
* [NLP - Natural Language Processing With Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python)
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

In [2]:
# Install only the required libraries to save resources.
!pip install -U spacy nltk
# Download both models loaded in the next cell.
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf
!python -m spacy validate

[0m[31mERROR: Could not find a version that satisfies the requirement spacy (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for spacy[0m[31m
[0m/Volumes/External Storage/Maestria ICESI - IA/Semestre 3/ProcesamientoLN/Unidad 1/Taller1/venv_NLP_Taller1/bin/python: No module named spacy
/Volumes/External Storage/Maestria ICESI - IA/Semestre 3/ProcesamientoLN/Unidad 1/Taller1/venv_NLP_Taller1/bin/python: No module named spacy


In [3]:
# Load a second model to compare performance:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp_t = spacy.load("en_core_web_trf")

ModuleNotFoundError: No module named 'spacy'
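The `ModuleNotFoundError` above indicates that the install step did not succeed in that environment. As a minimal sketch, a pre-flight check can report which packages are missing before any model is loaded. It relies on the fact that spaCy models installed with `spacy download` are regular Python packages; the helper name `missing_packages` is hypothetical and introduced only for illustration.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Packages this notebook expects; the list mirrors the install cell above.
required = ["spacy", "en_core_web_sm", "en_core_web_trf"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Running this check first gives a single readable message instead of a traceback in the middle of the notebook.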

**1. Create the document from the file `owlcreek.txt`**<br>
> Hint: Use `with open('./owlcreek.txt') as f:`

In [None]:
# !test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/Sesion1/owlcreek.txt

In [3]:
with open('./owlcreek.txt') as file:
    doc = nlp(file.read())

In [4]:
with open('./owlcreek.txt') as file:
    doc_t = nlp_t(file.read())

In [None]:
print(doc[:36])
print(doc_t[:36])

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  
AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  


The document was loaded successfully!

**2. How many tokens are in the file?**

In [5]:
print(len(doc))
print(len(doc_t))

4835
4835


**3. How many sentences are in the file?**
<br>Hint: You will need a list first

**SM model**

In [6]:
sentences = list(doc.sents)
len(sentences)

204

**TRF model**

In [7]:
sentences_t = list(doc_t.sents)
len(sentences_t)

213

**Difference in sentence identification between models**

In [None]:
for i, (s1, s2) in enumerate(zip(doc.sents, doc_t.sents)):
    if s1.text != s2.text:
        print(f"Difference at sentence {i}")
        print("SM:", s1.text)
        print("TRF:", s2.text)
        break

Difference at sentence 0
SM: AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  
TRF: AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce


**Comparison between models:**

The two models segment the text differently, which can be attributed to the following:
- Different architectures
- Different contextual capacity
- Different training data
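The loop above stops at the first mismatch, and once the two segmentations drift apart a pairwise `zip` no longer compares corresponding sentences. A rough alternative is to compare the sets of sentence start indices. This sketch assumes both docs share the same tokenization, which the identical token counts (4835) suggest; the function name and the toy indices are hypothetical.

```python
def boundary_differences(starts_a, starts_b):
    """Return sentence-start token indices found in only one of the two segmentations."""
    set_a, set_b = set(starts_a), set(starts_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Toy start indices for illustration. With the notebook's objects one would pass
# [s.start for s in doc.sents] and [s.start for s in doc_t.sents] instead.
starts_sm = [0, 36, 52, 80]
starts_trf = [0, 8, 14, 36, 52, 80]

only_sm, only_trf = boundary_differences(starts_sm, starts_trf)
print("Boundaries only in SM:", only_sm)
print("Boundaries only in TRF:", only_trf)
```

Each returned index marks a place where one model starts a new sentence and the other does not, so every disagreement is listed rather than only the first.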

**4. Print the second sentence of the document**
<br> Hint: Indices start at 0 and the title counts as the first sentence.

In [9]:
print("Second sentence of the document (SM model):\n", sentences[1])
print("Second sentence of the document (TRF model):\n", sentences_t[1])

Second sentence of the document (SM model):
 The man's hands were behind
his back, the wrists bound with a cord.  
Second sentence of the document (TRF model):
 

I




Conclusion: The models already differ from the first sentence onward.

**5. For each token in the previous sentence, print its `text`, `POS` tag, `dep` tag, and `lemma`**
<br>
The first sentence is analyzed here instead, since it allows a better comparison between the models.

In [12]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences[0]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
AN                  DET                 det                 an                  
OCCURRENCE          NOUN                nmod                occurrence          
AT                  PROPN               compound            AT                  
OWL                 PROPN               compound            OWL                 
CREEK               PROPN               compound            CREEK               
BRIDGE              PROPN               appos               BRIDGE              


                  SPACE               dep                 

                  
by                  ADP                 prep                by                  
Ambrose             PROPN               compound            Ambrose             
Bierce              PROPN               pobj                Bierce              


                  SPACE               dep                 

                  
I                   PRON    

In [13]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences_t[0]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
AN                  DET                 det                 an                  
OCCURRENCE          NOUN                ROOT                occurrence          
AT                  ADP                 prep                at                  
OWL                 PROPN               compound            OWL                 
CREEK               PROPN               compound            CREEK               
BRIDGE              PROPN               pobj                BRIDGE              


                  SPACE               dep                 

                  
by                  ADP                 prep                by                  
Ambrose             PROPN               compound            Ambrose             
Bierce              PROPN               appos               Bierce              
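In the tables above, the whitespace tokens (line breaks) are printed literally, which splits their rows across several lines. One possible fix, sketched here with a hypothetical helper `format_token_row`, is to format the text and lemma with `repr()` so that each token stays on a single row. The sample rows are taken from the SM output above.

```python
def format_token_row(text, pos, dep, lemma, width=20):
    """Format one table row, escaping whitespace so that each token stays on one line."""
    # The !r conversion applies repr(), so a line break is shown as the literal \n.
    return f"{text!r:{width}}{pos:{width}}{dep:{width}}{lemma!r:{width}}"


print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
rows = [
    ("BRIDGE", "PROPN", "appos", "BRIDGE"),
    ("\n\n", "SPACE", "dep", "\n\n"),
    ("by", "ADP", "prep", "by"),
]
for row in rows:
    print(format_token_row(*row))
```

With the notebook's objects the call would be `format_token_row(token.text, token.pos_, token.dep_, token.lemma_)` inside the existing loop.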


**6. Implement a matcher called *Swimming* that finds the occurrences of the phrase *swimming vigorously***
<br>
Hint: You should include an `'IS_SPACE': True` pattern between the two words.

In [18]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]
# pattern = [{'LOWER': 'swimming'}, {'LOWER': 'vigorously'}]
matcher.add("Swimming", [pattern])


In [19]:
found_matches = matcher(doc)
found_matches

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]

**TRF model**

In [20]:
matcher_t = Matcher(nlp_t.vocab)
pattern_t = [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]
matcher_t.add("Swimming_t", [pattern_t])
found_matches_t = matcher_t(doc_t)
found_matches_t

[(17092295986777929446, 1274, 1277), (17092295986777929446, 3609, 3612)]
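The pattern above requires a whitespace token between the two words, which fits this corpus because both occurrences are split by a line break. It would miss the phrase when both words sit on the same line. As a hedged sketch, marking the whitespace token as optional with `'OP': '?'` covers both cases. The blank pipeline `nlp_blank`, the sample text, and the name `SwimmingFlexible` are assumptions made for illustration and are not part of the original notebook.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: the Matcher only needs the tokenizer.
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("He was swimming\nvigorously. Later he kept swimming vigorously.")

matcher_demo = Matcher(nlp_blank.vocab)
# 'OP': '?' makes the whitespace token optional (zero or one occurrence).
pattern_demo = [
    {"LOWER": "swimming"},
    {"IS_SPACE": True, "OP": "?"},
    {"LOWER": "vigorously"},
]
matcher_demo.add("SwimmingFlexible", [pattern_demo])

matched_texts = [doc_demo[start:end].text for _, start, end in matcher_demo(doc_demo)]
print(matched_texts)
```

The same pattern could be registered on `nlp.vocab` or `nlp_t.vocab` to run against `doc` or `doc_t`.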

**7. Print the text around each match found**

In [21]:
start, end = found_matches[0][1:]
doc[start-9:end+13]

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home

In [23]:
start, end = found_matches[1][1:]
doc[start-7:end+5]

over his shoulder; he was now swimming
vigorously with the current.  

**TRF model**

In [24]:
start, end = found_matches_t[0][1:]
doc_t[start-9:end+13]

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home

In [25]:
start, end = found_matches_t[1][1:]
doc_t[start-7:end+5]

over his shoulder; he was now swimming
vigorously with the current.  
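The context offsets above (`start-9`, `end+13`, and so on) were tuned by hand for each match. For a match close to the beginning of the document, `start - before` could become negative and produce an unintended span. A minimal sketch of a clamped window follows; `context_window` is a hypothetical helper, and the token count and first match offsets are taken from the outputs above.

```python
def context_window(n_tokens, start, end, before=9, after=13):
    """Return (lo, hi) token indices around a match, clamped to [0, n_tokens]."""
    lo = max(0, start - before)
    hi = min(n_tokens, end + after)
    return lo, hi


# First match reported by the matcher, in a document of 4835 tokens.
print(context_window(4835, 1274, 1277))
# A hypothetical match near the start of the document: the lower bound is clamped to 0.
print(context_window(4835, 3, 6))
```

With the notebook's objects, the window would be used as `lo, hi = context_window(len(doc), start, end)` followed by `doc[lo:hi]`.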

**8. Print the sentence that contains each match found**

In [26]:
for sentence in sentences:
    for _, start, end in found_matches:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text, '\n')

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.   

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.   



**TRF model**

In [27]:
for sentence in sentences_t:
    for _, start, end in found_matches_t:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text, '\n')

 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home. 



The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current. 
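The nested loops above test every sentence against every match. Because sentences are contiguous and ordered, the containing sentence can also be located with a binary search over the sentence start indices. This is only a sketch: `sentence_index_for` and the toy offsets are hypothetical, and it returns the sentence that contains the first token of the match.

```python
from bisect import bisect_right


def sentence_index_for(sentence_starts, match_start):
    """Return the index of the sentence that contains token `match_start`.

    `sentence_starts` must be sorted in ascending order and begin at token 0.
    """
    return bisect_right(sentence_starts, match_start) - 1


# Toy sentence start offsets. With the notebook's objects one would use
# [s.start for s in sentences] and then index back into `sentences`.
demo_starts = [0, 36, 52, 80, 120]
print(sentence_index_for(demo_starts, 52))
print(sentence_index_for(demo_starts, 85))
```

spaCy also exposes the containing sentence directly through `Span.sent`, for example `doc[start:end].sent`, which avoids the manual search altogether.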



# Conclusions:

- The models differ in how they extract sentences. This is not an error or a problem in the code: each model learns to segment sentences differently.
- For the tasks in this notebook, the models showed no notable difference in performance: both found the same matches in the text and both tagged the tokens, each according to its own training. Some individual tags do differ, e.g. `AT` in the title is tagged `PROPN`/`compound` by SM and `ADP`/`prep` by TRF.
- The transformer-based model is more demanding: it can run on CPU, but noticeably slower, so in practice an environment with a GPU and CUDA is recommended, which limits its use.
