# NLP Basics Assessment

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus: **The Adventures of Sherlock Holmes** by Arthur Conan Doyle (1892). This work is in the public domain, and the corpus was obtained from **Project Gutenberg**.

**Import the libraries**

In [None]:
import warnings
import spacy
import re
from spacy.matcher import Matcher
import pkg_resources
import pandas as pd
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
warnings.filterwarnings('ignore')



**Detect whether the notebook is running in Google Colab**

In [None]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

print("Running in Google Colab?:", IN_COLAB)

if IN_COLAB:
    !wget https://raw.githubusercontent.com/YesidCastelblanco/Fundamentos_NLP/main/requirements.txt -O requirements.txt
    !pip install -r requirements.txt
else:
    print("Not running in Google Colab. Dependencies will not be installed automatically.")


Running in Google Colab?: True
--2025-08-15 18:01:28--  https://raw.githubusercontent.com/YesidCastelblanco/Fundamentos_NLP/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 350 [text/plain]
Saving to: ‘requirements.txt’


2025-08-15 18:01:28 (19.2 MB/s) - ‘requirements.txt’ saved [350/350]

Collecting lightning>=2.2.0.post0 (from -r requirements.txt (line 8))
  Downloading lightning-2.5.3-py3-none-any.whl.metadata (39 kB)
Collecting torchinfo>=1.8.0 (from -r requirements.txt (line 13))
  Downloading torchinfo-1.8.0-py3-none-any.whl.metadata (21 kB)
Collecting evaluate>=0.4.2 (from -r requirements.txt (line 15))
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting ollama>=0.2.1 (from -r requirements.txt (line 18)

**1. Create the document from the file sherlock_holmes.txt**

In [None]:
!test '{IN_COLAB}' = 'True' && wget  https://raw.githubusercontent.com/YesidCastelblanco/Fundamentos_NLP/refs/heads/main/Unidad1/sherlock_holmes.txt

--2025-08-15 18:04:20--  https://raw.githubusercontent.com/YesidCastelblanco/Fundamentos_NLP/refs/heads/main/Unidad1/sherlock_holmes.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607648 (593K) [text/plain]
Saving to: ‘sherlock_holmes.txt’


2025-08-15 18:04:20 (24.5 MB/s) - ‘sherlock_holmes.txt’ saved [607648/607648]



In [None]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Open and read the file
with open('sherlock_holmes.txt', 'r', encoding='utf-8') as file:
    texto = file.read()

# Remove the Project Gutenberg header and footer
inicio = re.search(r"\*\*\* START OF(.*?)\*\*\*", texto, re.DOTALL)
fin = re.search(r"\*\*\* END OF(.*?)\*\*\*", texto, re.DOTALL)

if inicio and fin:
    texto_limpio = texto[inicio.end():fin.start()].strip()
else:
    texto_limpio = texto  # If the markers are not found, use the whole text

# Process the clean text with spaCy
doc = nlp(texto_limpio)

# Basic information
print(f"Original text: {len(texto)} characters")
print(f"Clean text: {len(texto_limpio)} characters\n")

# Show the first 500 characters of the clean text
print("First 500 characters of the clean text:\n")
print(texto_limpio[:500])


Original text: 581565 characters
Clean text: 562202 characters

First 500 characters of the clean text:

The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I.     A Scandal in Bohemia
   II.    The Red-Headed League
   III.   A Case of Identity
   IV.    The Boscombe Valley Mystery
   V.     The Five Orange Pips
   VI.    The Man with the Twisted Lip
   VII.   The Adventure of the Blue Carbuncle
   VIII.  The Adventure of the Speckled Band
   IX.    The Adventure of the Engineer’s Thumb
   X.     The Adventure of the Noble Bachelor
   XI.    The Adventure of the Beryl Coronet
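
The header and footer stripping can be checked in isolation on a small string. The sketch below reuses the same two regular expressions; the `sample` text and the helper name `strip_gutenberg` are hypothetical and only illustrate the behavior, including the fallback when no markers are present.

```python
import re

# Hypothetical miniature of a Project Gutenberg file. The marker wording is
# assumed to follow the "*** START OF ... ***" / "*** END OF ... ***" form.
sample = (
    "License preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Body of the book.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)


def strip_gutenberg(text):
    """Return the text between the START and END markers, or the text unchanged."""
    start = re.search(r"\*\*\* START OF(.*?)\*\*\*", text, re.DOTALL)
    end = re.search(r"\*\*\* END OF(.*?)\*\*\*", text, re.DOTALL)
    if start and end:
        # Keep only what lies between the end of the START marker
        # and the beginning of the END marker.
        return text[start.end():end.start()].strip()
    return text


body = strip_gutenberg(sample)
print(body)
```

Because `(.*?)` is lazy, each pattern stops at the first closing `***` after the marker, so only the marker line itself is consumed.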
 


In [None]:
doc[:50]

The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I.     A Scandal in Bohemia
   II.    The Red-Headed League
   III.   A Case of Identity
   IV.    The Boscombe Valley Mystery
   V.     The Five Orange

**2. How many tokens are in the file?**

In [None]:
len(doc)

136993

**3. How many sentences are in the file?**


In [None]:
sentences = list(doc.sents)
len(sentences)

5800

**4. Print the second sentence of the document**
<br> Indices start at 0, and the title counts as the first sentence.

In [None]:
sentences[1]

The Red-Headed League
   III.   

**5. For each token in the previous sentence, print its `text`, `POS` tag, `dep` tag and `lemma`**
<br>

In [None]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences[1]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
The                 DET                 det                 the                 
Red                 PROPN               compound            Red                 
-                   PUNCT               punct               -                   
Headed              PROPN               compound            Headed              
League              PROPN               compound            League              

                   SPACE               dep                 
                   
III                 PROPN               ROOT                III                 
.                   PUNCT               punct               .                   
                    SPACE               dep                                     


**6. Write a matcher called *baker* that finds the occurrences of the phrase *Baker Street***
<br>
You should include an `'IS_SPACE': True` pattern between the two words. Note that this pattern only matches occurrences where a whitespace token (such as a line break) separates the two words.

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'baker'}, {'IS_SPACE': True}, {'LOWER': 'street'}]
matcher.add("baker", [pattern])

In [None]:
found_matches = matcher(doc)
found_matches

[(9822559787564794947, 40299, 40302),
 (9822559787564794947, 66767, 66770),
 (9822559787564794947, 70527, 70530),
 (9822559787564794947, 74770, 74773)]
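
Since the pattern above requires a whitespace token, it would miss *Baker Street* written on a single line, where spaCy does not emit a separate token for the single space. One possible variation, shown here as a sketch and not as part of the original exercise, marks the whitespace token as optional with `'OP': '?'`. The names `nlp_demo`, `matcher_demo`, `pattern_demo` and the sample sentence are hypothetical; a blank English pipeline is assumed to be enough because only the tokenizer is needed.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline: tokenizer only, no trained model required.
nlp_demo = spacy.blank("en")

matcher_demo = Matcher(nlp_demo.vocab)
# 'OP': '?' makes the whitespace token optional, so both layouts match.
pattern_demo = [
    {"LOWER": "baker"},
    {"IS_SPACE": True, "OP": "?"},
    {"LOWER": "street"},
]
matcher_demo.add("baker_optional_space", [pattern_demo])

# Hypothetical sample: one occurrence split by a line break, one on a single line.
doc_demo = nlp_demo("He lived in Baker\nStreet. We walked down Baker Street.")

matched_texts = [doc_demo[start:end].text for _, start, end in matcher_demo(doc_demo)]
print(matched_texts)
```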

**7. Print the text around each match found**

In [None]:
start, end = found_matches[0][1:]
doc[start-9:end+13]

had only known the quiet thinker and logician of Baker
Street would have failed to recognise him. His face flushed and
darkened
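
The cell above shows the context of the first match only. A minimal sketch of how the same window could be applied to every match follows, with a plain list of strings standing in for `doc`. The helper `context_window`, the `tokens` list and the `matches` offsets are hypothetical. The window is clamped so that a match near the start of the sequence does not produce a negative slice index.

```python
def context_window(tokens, start, end, before=9, after=13):
    """Return tokens[start - before : end + after], clamped to the sequence bounds."""
    left = max(0, start - before)           # a negative index would wrap around
    right = min(len(tokens), end + after)
    return tokens[left:right]


# Hypothetical token list and match offsets standing in for `doc` and `found_matches`.
tokens = "we drove back to Baker Street in time for breakfast".split()
matches = [(0, 4, 6)]  # (match_id, start, end), same shape as the Matcher output

for _, start, end in matches:
    print(" ".join(context_window(tokens, start, end, before=2, after=3)))
```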

**8. Print the sentence that contains each match found**

In [None]:
for sentence in sentences:
    for _, start, end in found_matches:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text, '\n')

Men who had only known the quiet thinker and logician of Baker
Street would have failed to recognise him. 

I think, Watson, that if we drive to Baker
Street we shall just be in time for breakfast.”




VII. 

Mr. Henry Baker
can have the same by applying at 6:30 this evening at 221B, Baker
Street.’ 

Then he stepped into
the cab, and in half an hour we were back in the sitting-room at Baker
Street. 
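
The nested loop above compares every sentence with every match. As an alternative sketch, assuming the sentence boundaries are sorted by start offset (as they are when they come from `doc.sents`), the containing sentence can be located with a binary search. The helper `make_sentence_lookup` and the offsets below are hypothetical.

```python
from bisect import bisect_right


def make_sentence_lookup(sentence_bounds):
    """Build a lookup over (start, end) token offsets sorted by start."""
    starts = [s for s, _ in sentence_bounds]

    def lookup(start, end):
        """Return the index of the sentence containing the match, or None."""
        i = bisect_right(starts, start) - 1  # last sentence starting at or before `start`
        if i >= 0 and sentence_bounds[i][1] >= end:
            return i
        return None

    return lookup


# Hypothetical offsets: three sentences and two matches.
bounds = [(0, 10), (10, 25), (25, 40)]
lookup = make_sentence_lookup(bounds)

for start, end in [(12, 15), (24, 27)]:  # the second match straddles a boundary
    print((start, end), "->", lookup(start, end))
```

With spaCy objects, the boundaries could be taken from `sentence.start` and `sentence.end`, the same attributes used in the cell above.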



# Sentiment Analysis on Movie Reviews

Now let's put some of these concepts into practice on a more realistic case. In this exercise we run sentiment analysis on a set of movie reviews. This is a simple binary classification task, and any model could be used for it; the additional work here is the preprocessing of the text inputs.

### References
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

In [None]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/Sesion1/moviereviews.tsv

--2025-08-15 02:25:11--  https://github.com/Ohtar10/icesi-nlp/raw/refs/heads/main/Sesion1/moviereviews.tsv
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Ohtar10/icesi-nlp/refs/heads/main/Sesion1/moviereviews.tsv [following]
--2025-08-15 02:25:12--  https://raw.githubusercontent.com/Ohtar10/icesi-nlp/refs/heads/main/Sesion1/moviereviews.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7507050 (7.2M) [text/plain]
Saving to: ‘moviereviews.tsv’


2025-08-15 02:25:14 (17.9 MB/s) - ‘moviereviews.tsv’ saved [7507050/7507050]



**Let's start by loading the dataset:**

In [None]:
reviews = pd.read_csv('./moviereviews.tsv', sep='\t')
reviews.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


**Next, let's do some cleaning: we remove nulls and empty values:**

---



In [None]:
reviews.dropna(inplace=True)
reviews.review = reviews.review.apply(lambda r: r.strip())
blanks = reviews[reviews.review == ''].index
reviews.drop(blanks, inplace=True)
reviews[reviews.review == ''].index

Index([], dtype='int64')
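
The effect of the cleaning step is easier to see on a small example. The sketch below uses a hypothetical `toy` DataFrame with one missing review and one whitespace-only review. It expresses the same idea with `dropna`, the vectorized `.str.strip()` and a boolean filter. Unlike the cell above, it restricts `dropna` to the `review` column, which is an assumption made for the illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing review and one whitespace-only review.
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["  great film  ", None, "   ", "dull plot"],
})

clean = (
    toy.dropna(subset=["review"])                         # remove missing reviews
       .assign(review=lambda d: d["review"].str.strip())  # trim surrounding whitespace
)
# Drop the reviews that are empty after trimming.
clean = clean[clean["review"] != ""].reset_index(drop=True)

print(clean)
```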

In [None]:
reviews.label.value_counts()

label,count
neg,969
pos,969


We have a balanced dataset with 969 examples per class.

To keep things simple, we will use VADER, a lexicon- and rule-based sentiment model, to compute the positive or negative score. It is already implemented in NLTK.

In [None]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
sid = SentimentIntensityAnalyzer()
reviews['scores'] = reviews.review.apply(lambda r: sid.polarity_scores(r))
reviews.head()

Unnamed: 0,label,review,scores
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co..."
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com..."
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com..."
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co..."
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co..."


With these scores we can now convert the result into a prediction label, using the `compound` score: positive values map to `pos` and everything else to `neg`:

In [None]:
reviews['compound'] = reviews.scores.apply(lambda s: s['compound'])
reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c > 0 else 'neg')
reviews.head()

Unnamed: 0,label,review,scores,compound,prediction
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484,neg
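
The labeling rule can be isolated in a small function to make its edge case explicit. The sketch below applies it to the five compound scores displayed above, plus an exactly neutral score. The function name `label_from_compound` and its `threshold` parameter are hypothetical and not part of the original notebook.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a compound score to 'pos' or 'neg' with the strict rule `compound > threshold`."""
    return "pos" if compound > threshold else "neg"


# Compound scores of the five rows displayed above, plus an exactly neutral score.
compounds = [-0.9125, -0.8618, 0.9951, 0.9972, -0.2484, 0.0]
labels = [label_from_compound(c) for c in compounds]
print(labels)
```

Because the comparison is strict, a review with a compound score of exactly 0 is labeled `neg` under this rule.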


Finally, we compute a few quality metrics for the model:

In [None]:
y_true = reviews.label.values
y_pred = reviews.prediction.values

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)


print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.6357069143446853

Classification Report:
              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938

Confusion Matrix:
[[427 542]
 [164 805]]
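
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps in reading it. The sketch below assumes the scikit-learn layout: rows are true labels, columns are predictions, and both follow the sorted label order `['neg', 'pos']`. The names `cm_counts` and `precision_recall` are hypothetical.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# both assumed to be in the order ['neg', 'pos'].
cm_counts = [[427, 542],
             [164, 805]]

total = sum(sum(row) for row in cm_counts)
accuracy = (cm_counts[0][0] + cm_counts[1][1]) / total


def precision_recall(cm, k):
    """Precision and recall for class index k of a square confusion matrix."""
    true_positive = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    return true_positive / predicted_k, true_positive / actual_k


neg_precision, neg_recall = precision_recall(cm_counts, 0)
pos_precision, pos_recall = precision_recall(cm_counts, 1)

print(f"accuracy: {accuracy:.4f}")
print(f"neg: precision={neg_precision:.2f} recall={neg_recall:.2f}")
print(f"pos: precision={pos_precision:.2f} recall={pos_recall:.2f}")
```

Read this way, the matrix shows that 542 of the 969 negative reviews are predicted as positive, which explains the low recall for `neg`.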
