# **Universidad Icesi - Maestría en Inteligencia Artificial Aplicada**

***

### **Team:**

1. Alvaro Acosta
2. Jhonatan Estrada
3. Cristian Gonzalez
4. Danny Martinez

***

# NLP Basics Assessment

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ohtar10/icesi-nlp/blob/main/Sesion1/6-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus:
[The History of a Crime](https://en.wikipedia.org/wiki/The_History_of_a_Crime) by Victor Hugo (1877). The work is in the public domain, and the corpus was obtained from [Project Gutenberg](https://www.gutenberg.org/cache/epub/10381/pg10381.txt).

## References
* [NLP - Natural Language Processing With Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python)
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

In [None]:
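import warnings
from importlib import metadata

warnings.filterwarnings('ignore')

# Collect the lowercase names of all installed distributions
# (pkg_resources is deprecated, so importlib.metadata is used instead)
installed_packages = {(dist.metadata['Name'] or '').lower() for dist in metadata.distributions()}
IN_COLAB = 'google-colab' in installed_packages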

In [None]:
from pathlib import Path

# Define the dependencies for the requirements.txt file
requirements_text = """# Updated requirements
numpy==1.26.4
pandas==2.2.2
matplotlib==3.8.0
seaborn==0.12.2
scikit-learn==1.6.1
statsmodels==0.14.0
tqdm>=4.67.0
torch==2.2.0
torchvision==0.17.0
torchaudio==2.2.0
lightning==2.2.0.post0
tensorboard==2.19.0
bokeh==3.7.0
transformers[torch]==4.41.2
datasets==2.19.1
torchinfo==1.8.0
accelerate==0.30.1
evaluate==0.4.2
sentence-transformers==3.0.1
gradio==5.42.0
ollama==0.5.3
spacy==3.8.7
thinc>=8.3.4,<8.4.0
nltk==3.9.1
httpx[http2]==0.28.1
websockets>=14.0,<15.1
fsspec==2024.3.1
gcsfs==2024.3.1
"""

# Define the output file path
path = Path("requirements.txt")

# Write the dependencies in the output file
path.write_text(requirements_text.strip() + "\n", encoding="utf-8")

# Print the absolute path of the generated file
print(f"Saved to {path.resolve()}")

Saved to /content/requirements.txt


In [None]:
# Colab: uninstall OpenCV (prevents NumPy≥2), install requirements, force-reinstall spaCy/thinc, download model, pin NumPy 1.26.4, then check dependencies
!test '{IN_COLAB}' = 'True' && pip uninstall -y opencv-python opencv-python-headless opencv-contrib-python || true && pip install -U --no-cache-dir -r requirements.txt --force-reinstall "spacy==3.8.7" "thinc>=8.3.4,<8.4.0" && python -m spacy download en_core_web_sm && pip check || true

Found existing installation: opencv-python 4.12.0.88
Uninstalling opencv-python-4.12.0.88:
  Successfully uninstalled opencv-python-4.12.0.88
Found existing installation: opencv-python-headless 4.12.0.88
Uninstalling opencv-python-headless-4.12.0.88:
  Successfully uninstalled opencv-python-headless-4.12.0.88
Found existing installation: opencv-contrib-python 4.12.0.88
Uninstalling opencv-contrib-python-4.12.0.88:
  Successfully uninstalled opencv-contrib-python-4.12.0.88
Collecting spacy==3.8.7
  Downloading spacy-3.8.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.4.0,>=8.3.4
  Downloading thinc-8.3.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting numpy==1.26.4 (from -r requirements.txt (line 2))
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)

In [None]:
# Restart the kernel
import IPython; IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create the document from the file `The_History_of_a_Crime_Victor_Hugo.txt`**
<br>
> Hint: Use `with open('./The_History_of_a_Crime_Victor_Hugo.txt') as f:`

The TXT file must be located in the same directory as the notebook.

In [None]:
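# Read the corpus with an explicit encoding so the result does not depend on the platform default
with open('./The_History_of_a_Crime_Victor_Hugo.txt', encoding='utf-8') as file:
    doc = nlp(file.read())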

In [None]:
doc[195:497]

THE HISTORY OF A CRIME

THE TESTIMONY OF AN EYE-WITNESS


By VICTOR HUGO


Translated by T.H. JOYCE and ARTHUR LOCKER.




CONTENTS


CHAPTER

       THE FIRST DAY--THE AMBUSH.

    I. "Security"
   II. Paris sleeps--the Bell rings
  III. What had happened during the Night
   IV. Other Doings of the Night
    V. The Darkness of the Crime
   VI. "Placards"
  VII. No. 70, Rue Blanche
 VIII. "Violation of the Chamber"
   IX. An End worse than Death
    X. The Black Door
   XI. The High Court of Justice
  XII. The Mairie of the Tenth Arrondissement
 XIII. Louis Bonaparte's Side-face
  XIV. The D'Orsay Barracks
   XV. Mazas
  XVI. The Episode of the Boulevard St. Martin
 XVII. The Rebound of the 24th June, 1848, on the 2d December 1851
XVIII. The Representatives hunted down
  XIX. One Foot in the Tomb
   XX. The Burial of a Great Anniversary

       THE SECOND DAY--THE STRUGGLE.

    I. They come to Arrest me
   II. From the Bastille to the Rue de Cotte
  III. The St. Antoine Barricade
   I

The document was loaded successfully.
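The loaded text appears to include the Project Gutenberg license header, which is why the slice above starts at token 195. A minimal sketch of trimming that boilerplate before calling `nlp`, assuming the file uses the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines; the helper name `strip_gutenberg_boilerplate` and the sample string are hypothetical, and the marker wording should be checked against the actual file:

```python
import re

# Assumed marker lines; some files say "THIS" instead of "THE".
START = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if present."""
    start = START.search(text)
    end = END.search(text)
    lo = start.end() if start else 0          # begin right after the START line
    hi = end.start() if end else len(text)    # stop right before the END line
    return text[lo:hi].strip()

# Tiny stand-in for the real file contents
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE HISTORY OF A CRIME ***\n"
    "THE HISTORY OF A CRIME\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE HISTORY OF A CRIME ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Trimming the text this way would change the token and sentence counts reported in the following steps, so it is shown only as an option.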

**2. How many tokens are in the file?**

In [None]:
len(doc)

200481

The file contains a total of 200,481 tokens.

**3. How many sentences are in the file?**
<br>Hint: You will need a list first

In [None]:
sentences = list(doc.sents)
len(sentences)

9865

The file contains a total of 9,865 sentences.

**4. Print the third sentence of the document**
<br> Hint: Indices start at 0.

In [None]:
sentences[2]

If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.


**5. For each token in the previous sentence, print its `text`, `POS` tag, `dep` tag, and `lemma`**
<br>

In [None]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences[2]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
If                  SCONJ               mark                if                  
you                 PRON                nsubjpass           you                 
are                 AUX                 auxpass             be                  
not                 PART                neg                 not                 
located             VERB                advcl               locate              
in                  ADP                 prep                in                  
the                 DET                 det                 the                 
United              PROPN               compound            United              
States              PROPN               pobj                States              
,                   PUNCT               punct               ,                   

                   SPACE               dep                 
                   
you                 PRON    
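The table above breaks at the newline token because its text is printed literally. A minimal sketch of one way around this is to format the text and lemma fields with `!r`, so whitespace appears escaped. The rows are hard-coded copies of three tokens from the output above, standing in for `token.text`, `token.pos_`, `token.dep_`, and `token.lemma_`:

```python
# Rows copied from the output above: (text, POS, dep, lemma),
# including the newline token that broke the table layout.
rows = [
    ("States", "PROPN", "pobj", "States"),
    (",", "PUNCT", "punct", ","),
    ("\n", "SPACE", "dep", "\n"),
]

lines = ["{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma")]
for text, pos, dep, lemma in rows:
    # !r formats the repr, so a newline is shown as '\n' and stays on one line
    lines.append(f"{text!r:20}{pos:20}{dep:20}{lemma!r:20}")

print("\n".join(lines))
```

In the notebook, the same format string could be used inside the loop over `sentences[2]`.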

**6. Implement matchers named Rue, Louis, General, and NationalAssembly that find the occurrences of “Rue …”, “Louis”, “General [Name]”, and “National Assembly” throughout the text.**

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Repeated phrases in the text
pattern_1 = [{"LOWER": "national"}, {"LOWER": "assembly"}]
pattern_2 = [{"LOWER": "louis"}]
pattern_3 = [{"LOWER": "general"}, {"IS_TITLE": True, "OP": "+"}]
pattern_4 = [{"LOWER": "rue"}, {"IS_ALPHA": True, "OP": "+"}]

# Add the patterns
matcher.add("NationalAssembly", [pattern_1])
matcher.add("Louis", [pattern_2])
matcher.add("General", [pattern_3])
matcher.add("Rue", [pattern_4])

In [None]:
found_matches = matcher(doc)
found_matches

[(17195783254545358969, 284, 286),
 (17385344523318437648, 328, 329),
 (17195783254545358969, 418, 420),
 (17195783254545358969, 418, 421),
 (17195783254545358969, 579, 581),
 (17195783254545358969, 600, 602),
 (17195783254545358969, 656, 658),
 (17385344523318437648, 907, 908),
 (17385344523318437648, 1515, 1516),
 (17195783254545358969, 1567, 1569),
 (17195783254545358969, 1567, 1570),
 (17195783254545358969, 1567, 1571),
 (17385344523318437648, 1895, 1896),
 (17385344523318437648, 1916, 1917),
 (17385344523318437648, 2016, 2017),
 (17385344523318437648, 2027, 2028),
 (17385344523318437648, 2058, 2059),
 (17385344523318437648, 2111, 2112),
 (9422347423360276159, 2165, 2167),
 (17385344523318437648, 2209, 2210),
 (17385344523318437648, 2309, 2310),
 (17385344523318437648, 2475, 2476),
 (1557313747422490215, 2562, 2564),
 (17195783254545358969, 2702, 2704),
 (17195783254545358969, 2936, 2938),
 (17195783254545358969, 2936, 2939),
 (17195783254545358969, 3347, 3349),
 (17195783254545358
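The match list above contains overlapping spans for the same pattern, for example (418, 420) and (418, 421), because the `+` operator returns every possible extension. A minimal sketch, in plain Python, of keeping only the longest non-overlapping span per pattern; the helper name `keep_longest` is hypothetical and the sample tuples are copied from the output above:

```python
from collections import defaultdict

def keep_longest(matches):
    """For each match ID, keep only the longest non-overlapping spans."""
    # Longest spans first; ties broken by the earliest start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    taken = defaultdict(set)  # match_id -> token indices already covered
    kept = []
    for match_id, start, end in ordered:
        tokens = range(start, end)
        if any(i in taken[match_id] for i in tokens):
            continue  # overlaps a longer span of the same pattern
        taken[match_id].update(tokens)
        kept.append((match_id, start, end))
    # Restore document order
    return sorted(kept, key=lambda m: (m[1], m[2]))

sample = [
    (17195783254545358969, 284, 286),
    (17385344523318437648, 328, 329),
    (17195783254545358969, 418, 420),
    (17195783254545358969, 418, 421),
    (17195783254545358969, 1567, 1569),
    (17195783254545358969, 1567, 1570),
    (17195783254545358969, 1567, 1571),
]
longest = keep_longest(sample)
print(longest)
```

spaCy offers built-in alternatives that may be preferable in practice: `spacy.util.filter_spans` for `Span` objects and the `greedy="LONGEST"` argument of `Matcher.add`.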

In [None]:
from collections import Counter

# Print the total matches
total_matches = len(found_matches)
per_type = Counter(nlp.vocab.strings[m_id] for m_id, _, _ in found_matches)

print(f"TOTAL MATCHES: {total_matches}\n")
for label, count in per_type.most_common():
    print(f"{label}: {count}")

TOTAL MATCHES: 1202

Rue: 825
Louis: 222
General: 126
NationalAssembly: 29


The matcher found 1,202 matches in the text: Rue: 825 (68.64%), Louis: 222 (18.47%), General: 126 (10.48%), and NationalAssembly: 29 (2.41%). “Rue” was the most frequent pattern and “NationalAssembly” the least frequent. The counts are the number of spans returned for each pattern, so patterns that use the `+` operator can contribute several overlapping spans for a single occurrence, such as (418, 420) and (418, 421) in the match list.
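The percentages quoted above can be reproduced from the printed counts. A short sketch, with the counts copied by hand from the cell output:

```python
# Counts copied from the output of the previous cell
counts = {"Rue": 825, "Louis": 222, "General": 126, "NationalAssembly": 29}

total = sum(counts.values())
# Share of each pattern as a percentage of all matches
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

In the notebook, `per_type` could be passed in place of the hand-copied dictionary.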

**7. Print the text around some of the matches found**

In [None]:
start, end = found_matches[30][1:]
doc[start-37:end+34]

Owing to the Palace of the
Constituent Assembly having been nearly seized by a crowd of insurgents on
the 22d of June, 1848, and there being no barracks in the neighborhood,
General Cavaignac had constructed at three hundred paces from the
Legislative Palace, on the grass plots of the Invalides, several rows of
long huts, under which the grass was hidden.

In [None]:
start, end = found_matches[50][1:]
doc[start-19:end+39]

This was precisely the hour at which the Palace of the National Assembly
was invested. In the Rue de l'Université there is a door of the Palace
which is the old entrance to the Palais Bourbon, and which opened into
the avenue which leads to the house of the President of the Assembly.
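The fixed offsets used above work for these two matches, but for a match near the start of the document an expression like `start - 37` would go negative, and Python-style slicing interprets negative indices relative to the end of the sequence. A minimal sketch of clamping the window to the document bounds, shown on a plain token list; the helper name `context_bounds` and the sample sentence are hypothetical:

```python
def context_bounds(start, end, n_tokens, before=20, after=20):
    """Return (lo, hi) for a context window clamped to [0, n_tokens]."""
    lo = max(0, start - before)
    hi = min(n_tokens, end + after)
    return lo, hi

# Stand-in for a tokenized document
tokens = "In the Rue de Lille there was a door".split()
start, end = 2, 5  # span covering "Rue de Lille"

lo, hi = context_bounds(start, end, len(tokens), before=5, after=2)
print(" ".join(tokens[lo:hi]))
```

With a spaCy `Doc`, the same bounds could be used as `doc[lo:hi]` with `n_tokens=len(doc)`.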

**8. Print the sentence that contains each match found**

In [None]:
for sentence in sentences:
    for _, start, end in found_matches:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text, '\n')

Streaming output truncated to the last 5000 lines.
The section of the Rue de Lille lying between his house and
the Palais Bourbon was occupied by infantry. 

The section of the Rue de Lille lying between his house and
the Palais Bourbon was occupied by infantry. 

The section of the Rue de Lille lying between his house and
the Palais Bourbon was occupied by infantry. 

The section of the Rue de Lille lying between his house and
the Palais Bourbon was occupied by infantry. 

The Representatives, on quitting M. Daru, bent their steps on the side
of the Rue des Saints-Pères, and left the soldiers behind them. 

The Representatives, on quitting M. Daru, bent their steps on the side
of the Rue des Saints-Pères, and left the soldiers behind them. 

To the Mairie of the Tenth Arrondissement."

"What do you intend to do there?"

"To decree the deposition of Louis Bonaparte."

 

Situated in a narrow street
in that short section of the Rue de Grenelle-St.-Germain which l
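The nested loop above compares every sentence with every match and prints a sentence once per match it contains, which is why sentences repeat in the output. A minimal sketch of an alternative that reports each sentence once, using binary search over sentence start offsets; the helper name `sentences_with_matches` and the sample offsets are hypothetical:

```python
from bisect import bisect_right

def sentences_with_matches(sentence_bounds, matches):
    """Return sorted indices of sentences that contain at least one match.

    sentence_bounds: list of (start, end) token offsets, in document order.
    matches: list of (match_id, start, end) tuples.
    """
    starts = [s for s, _ in sentence_bounds]
    hits = set()
    for _, start, end in matches:
        # Last sentence whose start offset is <= the match start
        i = bisect_right(starts, start) - 1
        if i >= 0 and end <= sentence_bounds[i][1]:
            hits.add(i)
    return sorted(hits)

# Three stand-in sentences and three matches, two of them overlapping
bounds = [(0, 10), (10, 25), (25, 40)]
matches = [(1, 12, 14), (1, 12, 15), (2, 30, 31)]
found = sentences_with_matches(bounds, matches)
print(found)
```

In the notebook, the bounds could be built as `[(s.start, s.end) for s in sentences]`. spaCy also exposes `Span.sent`, so `doc[start:end].sent` gives the containing sentence directly.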

This exercise applied key natural language processing techniques to the selected corpus: tokenization, part-of-speech (POS) tagging, lemmatization, and rule-based matching. Tokenization split the text into minimal units (tokens); POS tagging assigned a grammatical category to each token; lemmatization reduced words to their base form; and the matchers detected specific patterns in the text.

These tools work at different levels of analysis: tokenization and POS tagging give the text a grammatical structure, while lemmatization groups the inflected forms of a word under a single base form. The matchers complement this by locating relevant expressions or structures. Together, these techniques support a more precise and structured reading of the text.