Spanish Text Preprocessing for NLP

This repository provides a Spanish text preprocessing tool for cleaning and transforming text data. It accepts either a single string or a CSV file, and CSV files can be processed in parallel using multiple threads.

Features

Text preprocessing in Spanish.
Efficient processing of CSV files in parallel.
Support for specifying the number of threads and column when processing CSV files.
Easy to use, with command-line options.

Dependencies

Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Usage

Text preprocessing

To preprocess a single text string, use the -text option:

python preprocess.py -text "Tu texto aquí"

CSV File Preprocessing

To preprocess a CSV file, use the following options:

python preprocess.py -file "<filename>" -t <number_of_threads> -c <column_name> -v <boolean>

<filename>: CSV file to process.
<number_of_threads>: Number of threads for parallel processing.
<column_name>: Name of the column to process in the CSV file.
<boolean>: If False, the output won't be shown in the console.
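To illustrate how a thread count and a column name might drive the CSV processing, here is a minimal sketch using only the standard library. It is not the repository's implementation: `clean`, `preprocess_column`, and `preprocess_csv` are hypothetical names, and `clean` is only a stand-in for the real preprocessing function.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def clean(text):
    """Stand-in for the real preprocessing step (hypothetical)."""
    return " ".join(text.lower().split())


def preprocess_column(rows, column, threads, func=clean):
    """Apply func to one column of every row using a pool of worker threads."""
    rows = list(rows)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the inputs
        cleaned = list(pool.map(func, (row[column] for row in rows)))
    return [{**row, column: value} for row, value in zip(rows, cleaned)]


def preprocess_csv(src, dst, column, threads=4):
    """Read CSV text from src, clean the given column, write CSV text to dst."""
    reader = csv.DictReader(src)
    fieldnames = reader.fieldnames or []  # reading this consumes the header row
    rows = preprocess_column(reader, column, threads)
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Usage with in-memory files; real files opened with open(..., newline="") work too.
src = io.StringIO("id,text\n1,  Hola   MUNDO \n2,Buenos Días\n")
dst = io.StringIO()
preprocess_csv(src, dst, column="text", threads=2)
print(dst.getvalue())
```

Only the selected column is rewritten; the other columns and the row order are left unchanged.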

Preprocessing options

In the preprocessing method, you can comment or uncomment any step to suit your needs. By default, stopword removal and lemmatization are commented out.

# In the source code, comment or uncomment steps as needed:

def preprocessing(self, text: str) -> str:
    text = self.lower_case(text)
    text = self.normalize_emoji(text)
    text = self.remove_non_ascii(text)
    ....

    tokenized_text = self.tokenize(text)
    # tokenized_text = self.remove_stopwords(tokenized_text)
    # tokenized_text = self.hunspell_lemmatizer(tokenized_text)

    text = " ".join(tokenized_text)
    return text
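As an alternative to editing comments in the source, optional steps could be switched on with a keyword argument. The sketch below only illustrates that idea and is not how the repository works: the function bodies are simplified stand-ins, `remove_stop` is a hypothetical parameter, and `STOPWORDS` is a tiny illustrative set rather than the tool's real list.

```python
# Tiny illustrative stopword set (assumption, not the tool's actual list)
STOPWORDS = {"a", "el", "en", "la", "mi", "y"}


def lower_case(text):
    return text.lower()


def tokenize(text):
    # Whitespace tokenization as a simple stand-in
    return text.split()


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def preprocess(text, remove_stop=False):
    """Run the mandatory steps, then the optional one only if requested."""
    text = lower_case(text)
    tokens = tokenize(text)
    if remove_stop:
        tokens = remove_stopwords(tokens)
    return " ".join(tokens)


print(preprocess("Ayer fui a ver a mi familia"))
print(preprocess("Ayer fui a ver a mi familia", remove_stop=True))
```

With the flag off, the text is only lowercased and re-joined; with it on, the tokens found in `STOPWORDS` are dropped.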

Examples:

python script.py -f "archivo.csv" -c "text" -v false -t 4

python script.py -text "Texto a procesar"
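For readers curious how options like these could be parsed, here is a minimal `argparse` sketch. It is an assumption about the approach, not the repository's code: the `dest` names, the default values, and the `str_to_bool` helper are all hypothetical, and the sketch uses the `-file` spelling from the Usage section.

```python
import argparse


def str_to_bool(value):
    """Convert 'true'/'false' style strings, since -v is passed as text."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Spanish text preprocessing (sketch)")
    parser.add_argument("-text", dest="text", help="text to preprocess")
    parser.add_argument("-file", dest="file", help="CSV file to process")
    # Defaults below are assumptions made for this sketch
    parser.add_argument("-t", dest="threads", type=int, default=1,
                        help="number of threads for parallel processing")
    parser.add_argument("-c", dest="column", help="name of the CSV column to process")
    parser.add_argument("-v", dest="verbose", type=str_to_bool, default=True,
                        help="if false, do not print the output to the console")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-file", "archivo.csv", "-c", "text", "-v", "false", "-t", "4"]
)
print(args.file, args.column, args.verbose, args.threads)
```

Passing `-v` through a converter function avoids the common pitfall of `type=bool`, which treats any non-empty string, including "false", as true.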

Preprocessing examples (input -> output):

"Ayer fui a ber a mi familia y nos sacamos fotos en el parqe." -> "ayer fui a ver a mi familia y nos sacamos fotos en el parque"
"Me compre un coxe nuebo, esta divino! 🚗✨" -> "me compre un come nuevo esta divino coche chispas"
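In the second example the emoji are replaced by Spanish words ("coche", "chispas"). A minimal sketch of that kind of substitution follows. The lookup table and the function body are assumptions for illustration and cover only the two emoji from the example; the tool's own `normalize_emoji` may work differently.

```python
# Hypothetical lookup table covering only the two emoji in the example above
EMOJI_NAMES = {"🚗": "coche", "✨": "chispas"}


def normalize_emoji(text):
    """Replace each known emoji with its Spanish name, padded by spaces."""
    pieces = []
    for char in text:
        # Per-character lookup: emoji built from several code points are not handled
        if char in EMOJI_NAMES:
            pieces.append(f" {EMOJI_NAMES[char]} ")
        else:
            pieces.append(char)
    # Collapse the extra whitespace introduced by the padding
    return " ".join("".join(pieces).split())


print(normalize_emoji("esta divino! 🚗✨"))
```

Punctuation is left in place here; in the example output it has been removed, presumably by a separate step.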

