# Text preprocessing

Before vectorizing the text (generating _embeddings_) with a model, the text has to be _sanitized_.

ChatGPT suggested the following steps:

- Tokenization: Break down your text into smaller units (tokens), like words or subwords.
- Cleaning: Remove noise like punctuation, stopwords, or special characters, depending on the nature of your data.
- Normalization: Convert everything to lowercase or perform stemming/lemmatization if necessary, though this can vary by use case.

Libraries such as spaCy or NLTK can help with these preprocessing steps.
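The three steps listed above can be sketched with the standard library alone. This is only an illustration: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical names invented for this example, and the regex tokenizer is far cruder than what spaCy or NLTK provide.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only)
STOPWORDS = {"the", "a", "of", "and", "is"}

def preprocess(text):
    """Normalize, tokenize, and clean a string using only the standard library."""
    # Normalization (casefold) followed by tokenization (runs of word characters)
    tokens = re.findall(r"\w+", text.casefold())
    # Cleaning: punctuation was already dropped by the regex; now drop stopwords
    return [t for t in tokens if t not in STOPWORDS]

sample = "The safety of the reactor is reviewed, and approved."
result = preprocess(sample)
print(result)  # ['safety', 'reactor', 'reviewed', 'approved']
```

The order of the steps is a design choice: lowercasing first lets a lowercase stopword list match capitalized tokens.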

## Environment

In [2]:
# Detect the execution environment: Colab, Kaggle, Codespaces, or local
import os

# if (firstrun):
if 'google.colab' in str(get_ipython()):
    environment = 'colab'
elif os.environ.get('PWD') == '/kaggle/working':
    environment = 'kaggle'
elif os.path.exists('/workspaces'):
    environment = 'codespaces'
else:
    environment = 'local'
print(environment)

codespaces


In [3]:
# Base directory (change it to match your own file system)
current = 'nlpTP'

# if (firstrun):
if environment == 'local':
    system_path = '/home/vbettachini/documents/universitet/FCEyN/maestríaDatos/' + current + '/'
elif environment == 'colab':
    from google.colab import drive
    drive.mount('/content/drive')
    system_path = '/content/drive/MyDrive/maestría/' + current + '/'
elif environment == 'codespaces':
    system_path = '/workspaces/' + current + '/'
elif environment == 'kaggle':
    # No Kaggle path is defined yet; fail here rather than with a NameError later
    raise NotImplementedError('system_path is not defined for Kaggle')

## Loading the test text

In [4]:
work_path = system_path + 'arn/normas_tokens/'
test_file = '4-2-3_r2.txt'
file_path = work_path + test_file
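The string concatenation above relies on every piece ending with a `/`. As a possible alternative, `pathlib` joins path components without that bookkeeping. The sketch below uses `PurePosixPath` so that it behaves the same on any operating system; the base directory mirrors the Codespaces branch and is only an example value.

```python
from pathlib import PurePosixPath

# Example base directory, mirroring the Codespaces branch of the notebook
base = PurePosixPath("/workspaces") / "nlpTP"

# The / operator inserts separators, so no trailing slashes are needed
example_path = base / "arn" / "normas_tokens" / "4-2-3_r2.txt"

print(example_path)       # /workspaces/nlpTP/arn/normas_tokens/4-2-3_r2.txt
print(example_path.name)  # 4-2-3_r2.txt
```

In the notebook itself, `pathlib.Path` (rather than `PurePosixPath`) would be the natural choice, since it can also open and test files.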

In [5]:
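# Load the text from the file at file_path
def load_text(file_path):
    # Explicit encoding (assuming the file is UTF-8) so accented characters
    # decode the same way on every platform
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

texto = load_text(file_path)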

## Text cleansing | ?

Source: https://dylancastillo.co/posts/nlp-snippets-clean-and-tokenize-text-with-python.html

In Natural Language Processing (NLP), text normalization and text cleansing are preprocessing operations that prepare text data for analysis or modeling. The two terms sometimes overlap, but they generally refer to different sets of tasks.

Text cleansing refers to operations that remove or correct unwanted elements in the text to make it cleaner and more consistent for processing. It usually involves:

- Removing unnecessary characters: Deleting punctuation marks, special symbols, emojis, and other non-textual content (e.g., !, #, @, etc.).
- Removing extra whitespace: Stripping unnecessary spaces, tabs, or newlines.
- Correcting spelling or grammatical errors: Fixing typos or misspelled words, though this may be considered part of more advanced cleaning.
- Filtering stopwords: Removing common words (like "the," "and," etc.) that do not contribute significant meaning.
    - Implemented with NLTK
- Removing URLs or email addresses: Cleaning the text of extraneous web addresses or emails that are not useful for text analysis.
- Handling special tokens: Removing or replacing certain patterns like hashtags, mentions, or digits that don't add to the context.
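Several of the items listed above (URLs, email addresses, extra whitespace) can be handled with regular expressions. The following is a minimal sketch, not a complete solution: the patterns are intentionally simple and will miss unusual URLs or addresses, and the `cleanse` function name is made up for this example.

```python
import re

# Simplified patterns (assumptions for illustration, not full RFC-compliant matchers)
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def cleanse(text):
    """Remove URLs and email addresses, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "Contact  info@example.org\tor see https://example.org/docs  for details."
cleaned = cleanse(raw)
print(cleaned)  # Contact or see for details.
```

Replacing matches with a space instead of an empty string avoids gluing the neighbouring words together; the final whitespace pass removes the surplus.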

### Removing punctuation

In [8]:
import re
from string import punctuation

In [9]:
sinPuntuación = re.sub(f"[{re.escape(punctuation)}]", "", texto)
sinPuntuación[500:700]

's que puedan afectar la seguridad radiológica o nuclear\n\nB ALCANCE\n2 Esta norma es aplicable al diseño puesta en marcha y operación de reactores de investigación\nEl cumplimiento de la presente norma y'
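`string.punctuation` only contains ASCII characters, so marks that are common in Spanish text, such as `¿`, `¡`, `«` and `»`, are not removed by the regular expression above. One possible alternative is to filter by Unicode category. The sketch below assumes that dropping every character in a `P*` category is acceptable; `strip_punctuation` is a hypothetical helper name.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

sample = "¿Es aplicable la norma? «Sí», ¡a todos los reactores!"
stripped = strip_punctuation(sample)
print(stripped)  # Es aplicable la norma Sí a todos los reactores
```

The two approaches are not equivalent: characters such as `$`, `+` or `<` are in `string.punctuation` but belong to Unicode symbol categories (`S*`), so this filter keeps them.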

## Text normalization | ?

Normalization involves transforming the text into a consistent, standardized format to make it suitable for algorithms that need regular input formats. It includes:

- Lowercasing: Converting all text to lowercase to avoid treating words like "Apple" and "apple" differently.
    - str.casefold()
- Expanding contractions: Converting contractions like "can't" to "cannot" for consistency.
- Stemming: Reducing words to their base or root form (e.g., "running" to "run") using heuristics.
- Lemmatization: Converting words to their canonical base form, considering their part of speech (e.g., "better" to "good").
- Standardizing formats: Ensuring dates, numbers, currencies, and other entities follow a consistent format.
- Accented character normalization: Converting accented characters to their non-accented equivalents (e.g., "é" to "e").
- Handling Unicode or ASCII conversions: Converting different encodings into a unified format.
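Two of the items above, lowercasing and accent normalization, can be sketched with the standard library. This is an illustration under the assumption that losing diacritics is acceptable for the task; `strip_accents` is a hypothetical helper name.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

normalized = strip_accents("Revisión de seguridad radiológica".casefold())
print(normalized)  # revision de seguridad radiologica
```

For Spanish this step deserves some care: the same procedure turns `ñ` into `n`, which can change the meaning of a word, so it may be preferable to skip it when the downstream model handles accented text.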

### Lowercase

### First letter uppercase -> lowercase

In [15]:
# minúsculas_texto = texto.lower()
minúsculas_texto = texto.casefold()  # lowercases the whole string, not only the first letter

In [16]:
minúsculas_texto[500:700]

'das de incendios, que puedan afectar la seguridad radiológica o nuclear.\n\nb. alcance\n2. esta norma es aplicable al diseño, puesta en marcha y operación de reactores de investigación.\nel cumplimiento d'
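Both the commented-out `lower()` call and `casefold()` lowercase the entire string. They differ only for a small number of characters, where `casefold()` is more aggressive because it is meant for caseless comparison. The German `ß` is the usual example; for most Spanish text the two calls are expected to give the same result.

```python
word = "Straße"

lowered = word.lower()    # ß is already lowercase, so it is kept
folded = word.casefold()  # ß is folded to "ss" for caseless matching

print(lowered)  # straße
print(folded)   # strasse

# On a heading like the one in the sample text, both calls agree
heading = "B. ALCANCE"
print(heading.lower() == heading.casefold())  # True
```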

## Tokenization | NLTK

In [17]:
try:
    import nltk
except ImportError:
    # Install NLTK only when the import fails
    !pip3 install nltk
    import nltk
# nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [18]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [23]:
# Pass the text directly; a variable named `input` would shadow the built-in function
tokens = word_tokenize(minúsculas_texto, language="spanish")
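`word_tokenize` depends on the downloaded Punkt data. For a quick sanity check without NLTK, a rough approximation can be written with a regular expression. This is a simplified sketch and not equivalent to NLTK's tokenizer: it only separates runs of word characters from individual punctuation marks. `simple_tokenize` is a hypothetical name.

```python
import re

# A token is either a run of word characters or a single non-space symbol
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_RE.findall(text)

sample_tokens = simple_tokenize("b. alcance\n2. esta norma es aplicable al diseño, puesta en marcha")
print(sample_tokens)
# ['b', '.', 'alcance', '2', '.', 'esta', 'norma', 'es', 'aplicable',
#  'al', 'diseño', ',', 'puesta', 'en', 'marcha']
```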

### Removing _stopwords_

In [20]:
# import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stopwords_ = set(stopwords.words("spanish"))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
clean_tokens = [token.lower() for token in tokens if token.lower() not in stopwords_ and token.isalpha()]
# clean_tokens = [t for t in tokens if not t in stopwords_]
clean_tokens[:40]

['autoridad',
 'regulatoria',
 'nuclear',
 'dependiente',
 'presidencia',
 'nacion',
 'ar',
 'revisión',
 'seguridad',
 'incendios',
 'reactores',
 'investigación',
 'aprobada',
 'resolución',
 'directorio',
 'autoridad',
 'regulatoria',
 'nuclear',
 'nº',
 'boletín',
 'oficial',
 'república',
 'argentina',
 'norma',
 'ar',
 'seguridad',
 'incendios',
 'reactores',
 'investigación',
 'objetivo',
 'establecer',
 'criterios',
 'seguridad',
 'incendios',
 'eventos',
 'generados',
 'explosiones',
 'derivadas',
 'incendios',
 'puedan']
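The list above still contains short tokens such as `'ar'` and `'nº'`. One way to spot leftovers like these is to count token frequencies and to list very short tokens. The sketch below uses a small hand-written sample rather than the real `clean_tokens`, and the two-character threshold is an arbitrary assumption.

```python
from collections import Counter

# Hand-written sample standing in for clean_tokens
sample_tokens = ["seguridad", "incendios", "ar", "seguridad", "nº", "incendios", "seguridad"]

counts = Counter(sample_tokens)
top = counts.most_common(2)
print(top)  # [('seguridad', 3), ('incendios', 2)]

# Very short tokens are candidates for manual review (threshold is an assumption)
short_tokens = [t for t in counts if len(t) <= 2]
print(short_tokens)  # ['ar', 'nº']
```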

### Stemming and Lemmatization

In [44]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Note: PorterStemmer and WordNetLemmatizer are both designed for English
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...


True

In [45]:
# Stem each token with a list comprehension
stemmed_tokens = [stemmer.stem(word) for word in clean_tokens]
stemmed_tokens[:40]

['autoridad',
 'regulatoria',
 'nuclear',
 'dependient',
 'presidencia',
 'nacion',
 'ar',
 'revisión',
 'seguridad',
 'incendio',
 'reactor',
 'investigación',
 'aprobada',
 'resolución',
 'directorio',
 'autoridad',
 'regulatoria',
 'nuclear',
 'nº',
 'boletín',
 'ofici',
 'república',
 'argentina',
 'norma',
 'ar',
 'seguridad',
 'incendio',
 'reactor',
 'investigación',
 'objetivo',
 'establec',
 'criterio',
 'seguridad',
 'incendio',
 'evento',
 'generado',
 'explosion',
 'derivada',
 'incendio',
 'puedan']
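`PorterStemmer` implements suffix rules for English, so stems such as `'ofici'` and `'establec'` above come from English rules applied to Spanish words. NLTK also ships a Snowball stemmer with Spanish support, which may be a better fit for this corpus. The sketch below only shows how it could be swapped in; it is not what the notebook ran, and its output is not reproduced here.

```python
from nltk.stem import SnowballStemmer

# "spanish" is one of the languages listed in SnowballStemmer.languages
spanish_stemmer = SnowballStemmer("spanish")

sample = ["incendios", "reactores", "oficial", "establecer"]
spanish_stems = [spanish_stemmer.stem(word) for word in sample]
print(spanish_stems)
```

The Snowball stemmer needs no extra `nltk.download` call as long as its `ignore_stopwords` option is left at the default.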

In [46]:
# Lemmatize each token with a list comprehension (applied to clean_tokens, not to the stems)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]
lemmatized_tokens[:40]

['autoridad',
 'regulatoria',
 'nuclear',
 'dependiente',
 'presidencia',
 'nacion',
 'ar',
 'revisión',
 'seguridad',
 'incendios',
 'reactores',
 'investigación',
 'aprobada',
 'resolución',
 'directorio',
 'autoridad',
 'regulatoria',
 'nuclear',
 'nº',
 'boletín',
 'oficial',
 'república',
 'argentina',
 'norma',
 'ar',
 'seguridad',
 'incendios',
 'reactores',
 'investigación',
 'objetivo',
 'establecer',
 'criterios',
 'seguridad',
 'incendios',
 'eventos',
 'generados',
 'explosiones',
 'derivadas',
 'incendios',
 'puedan']

### Removing junk terms
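The token lists above still contain entries such as `'ar'` and `'nº'`, which carry little meaning on their own. A minimal sketch of filtering them out follows. The `junk_terms` set is a hypothetical example built from those two tokens, not a list defined by this notebook, and the sample is hand-written.

```python
# Hypothetical junk list, taken from tokens visible in the outputs above
junk_terms = {"ar", "nº"}

# Hand-written sample standing in for the real token list
sample_tokens = ["norma", "ar", "seguridad", "nº", "incendios"]

filtered = [t for t in sample_tokens if t not in junk_terms]
print(filtered)  # ['norma', 'seguridad', 'incendios']
```

Using a set keeps the membership test fast even if the junk list grows.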

In [None]:
términos_basura = 