**Notebook to preprocess the Spanish and German Wikipedia descriptions.**

The notebook removes unrelated information and normalizes the text with the following pipeline:
* Remove unnecessary information, such as description categories that are not relevant to the task
* Remove artifacts and accents from the text
* Preprocess by lowercasing and tokenizing, optionally removing numbers, punctuation and stopwords

# Libraries & Functions

In [1]:
'''Math & Data Libraries'''
import numpy as np
import pandas as pd

In [2]:
''' Miscellaneous Libraries'''
from tqdm import tqdm

In [3]:
'''NLP Libraries'''
import nltk
from nltk.corpus import stopwords
import string

import unidecode
import unicodedata

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## NLP Functions

In [4]:
def remove_artifacts(text):
    """
    Remove artifacts from text. 
    ---
    Parameters
    ----------
    text : str
        text from which artifacts should be removed

    Returns
    -------
    text_removed_artifacts : str
        text with removed artifacts
    """
    artifacts = ("nbsp", "<p>", "</p>", "<i>", "</i>", "<b>", "</b>", "&", "_x000D_",
                 "\"", "'", "<em>", "</em>", "<br>", "</br>")
    text = str(text)
    # Strip each artifact in turn (same order as a chained str.replace)
    for artifact in artifacts:
        text = text.replace(artifact, "")
    return text

def preprocess_description(text, remove_numbers = True, remove_stopwords = False):
    """
    Preprocess text with the following steps:
        1. Lowercasing
        2. Removal of digits and punctuation (if remove_numbers = True)
        3. Tokenization and re-joining to normalize whitespace
        4. Removal of single-character tokens (if remove_numbers = True)
        5. Removal of stopwords (if remove_stopwords = True)
    ---
    Parameters
    ----------
    text : str
        text to preprocess
    remove_numbers : boolean
        whether to remove digits, punctuation and single-character tokens
    remove_stopwords : boolean
        whether to remove stopwords (English NLTK list) from the text

    Returns
    -------
    preprocessed_text : str
        preprocessed text
    """
    text = text.lower()
    if remove_numbers:
        # Strip digits and ASCII punctuation in a single pass
        text = text.translate(str.maketrans('', '', string.digits + string.punctuation))

    # Tokenize and re-join to normalize whitespace
    tokens = nltk.word_tokenize(text)

    if remove_numbers:
        # Drop single-character tokens left over after stripping
        tokens = [token for token in tokens if len(token) > 1]
    text = ' '.join(tokens)

    if remove_stopwords:
        # NOTE: the English stopword list is applied regardless of the language of the text
        stopwords_ = set(stopwords.words('english'))
        tokens = [token for token in nltk.word_tokenize(text) if token not in stopwords_]
        text = ' '.join(tokens)

    return text
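
`preprocess_description` filters against the English stopword list, although the descriptions are Spanish and German. A minimal sketch of a language-aware variant is shown below. The helper name `remove_stopwords_for_language`, the tiny sample lists, and the sample sentences are all hypothetical illustrations, not part of the notebook or the dataset. In the notebook, the lists could instead come from `stopwords.words('spanish')` and `stopwords.words('german')`. Because the text has already passed through `unidecode`, accented entries in those lists would likely need the same transliteration before they can match.

```python
def remove_stopwords_for_language(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    stop_words = set(stop_words)  # set membership is O(1) per token
    return " ".join(token for token in text.split() if token not in stop_words)


# Tiny illustrative lists (assumption); real lists could be taken from
# nltk.corpus.stopwords.words('spanish') / words('german').
SPANISH_SAMPLE = {"es", "una", "de", "la", "el"}
GERMAN_SAMPLE = {"ist", "eine", "aus", "der", "die"}

# Made-up sample sentences, already lowercased and accent-free
spanish_text = "salsola oppositifolia es una planta de la familia amaranthaceae"
german_text = "macadamia integrifolia ist eine pflanzenart aus der familie proteaceae"

spanish_clean = remove_stopwords_for_language(spanish_text, SPANISH_SAMPLE)
german_clean = remove_stopwords_for_language(german_text, GERMAN_SAMPLE)
print(spanish_clean)
print(german_clean)
```

Passing the stopword collection as an argument keeps the function independent of any one language and makes it easy to test without downloading NLTK data.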

# Input Data

## Spanish Wikipedia

In [6]:
df_WIKI_ES = pd.read_excel("..//Datasets//Initial Databases//WIKI_orig_descriptions_ESP.xlsx")

In [7]:
df_WIKI_ES

Unnamed: 0,WIKI_id,description,source,name,Date Retrieved,Binomial Name
0,Summary,Salsola oppositifolia o salao borde es una pla...,1,Salsola oppositifolia,2023-05-09,Salsola oppositifolia
1,Hábitat,Es propia de zonas costeras. Tolerante a la sa...,2,Salsola oppositifolia,2023-05-09,Salsola oppositifolia
2,Descripción,"Se trata de un arbusto de hasta 2 m de altura,...",3,Salsola oppositifolia,2023-05-09,Salsola oppositifolia
3,Taxonomía,Salsola oppositifolia fue descrita por René L...,4,Salsola oppositifolia,2023-05-09,Salsola oppositifolia
4,Nombres comunes,"Castellano: barrilla, barrilla zagua, boja bar...",5,Salsola oppositifolia,2023-05-09,Salsola oppositifolia
...,...,...,...,...,...,...
24526,Summary,"Dierama pulcherrimum, es una especie de planta...",24527,Dierama pulcherrimum,2023-05-09,Dierama pulcherrimum
24527,Descripción,Se caracteriza por sus flores laxas de color g...,24528,Dierama pulcherrimum,2023-05-09,Dierama pulcherrimum
24528,Taxonomía,Dierama pulcherrimum fue descrita por (Hook.f....,24529,Dierama pulcherrimum,2023-05-09,Dierama pulcherrimum
24529,Referencias,,24530,Dierama pulcherrimum,2023-05-09,Dierama pulcherrimum


In [8]:
print("Initial Number of Categories:", df_WIKI_ES["WIKI_id"].nunique())

Initial Number of Categories: 1076


## German Wikipedia

In [9]:
df_WIKI_DE = pd.read_excel("..//Datasets//Initial Databases//WIKI_orig_descriptions_DE.xlsx")

In [10]:
df_WIKI_DE

Unnamed: 0,WIKI_id,description,source,name,Date Retrieved,Binomial Name
0,Summary,Macadamia integrifolia ist eine Pflanzenart au...,,Macadamia integrifolia,2023-08-09,Macadamia integrifolia
1,Beschreibung,Macadamia integrifolia wächst als Baum und er...,,Macadamia integrifolia,2023-08-09,Macadamia integrifolia
2,Verbreitung,Macadamia integrifolia ist in einem kleinen Ge...,,Macadamia integrifolia,2023-08-09,Macadamia integrifolia
3,Verwendung,Die Samen von Macadamia integrifolia sind essb...,,Macadamia integrifolia,2023-08-09,Macadamia integrifolia
4,Einzelnachweise,\n\n== Weblinks ==,,Macadamia integrifolia,2023-08-09,Macadamia integrifolia
...,...,...,...,...,...,...
20270,Beschreibung,"Der Kaukasische Klee ist eine ausdauernde, kra...",,Trifolium ambiguum,2023-08-09,Trifolium ambiguum
20271,Verbreitung,"Kaukasischer Klee ist auf Geröllfeldern, an W...",,Trifolium ambiguum,2023-08-09,Trifolium ambiguum
20272,Nutzung,Kaukasischer Klee wird vor allem in Osteuropa ...,,Trifolium ambiguum,2023-08-09,Trifolium ambiguum
20273,Literatur,"Michael Zohary, David Heller: The Genus Trifol...",,Trifolium ambiguum,2023-08-09,Trifolium ambiguum


In [11]:
print("Initial Number of Categories:", df_WIKI_DE["WIKI_id"].nunique())

Initial Number of Categories: 1361
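
The pipeline's first step, removing description categories that are not relevant to the task, is not shown in the cells of this notebook. A minimal sketch of how it could look follows. The exclusion list, the helper name `drop_categories`, and the toy DataFrame are assumptions made for illustration; which categories are actually irrelevant depends on the task.

```python
import pandas as pd

# Hypothetical exclusion list: reference-style sections that carry no description text
EXCLUDED_CATEGORIES = {"Referencias", "Einzelnachweise", "Literatur"}


def drop_categories(df, excluded, column="WIKI_id"):
    """Return a copy of df without the rows whose category is in `excluded`."""
    keep = ~df[column].isin(list(excluded))
    return df[keep].reset_index(drop=True)


# Toy data standing in for df_WIKI_ES / df_WIKI_DE
toy = pd.DataFrame({
    "WIKI_id": ["Summary", "Descripción", "Referencias", "Literatur"],
    "description": ["texto a", "texto b", "", "texto c"],
})

filtered = drop_categories(toy, EXCLUDED_CATEGORIES)
print(filtered["WIKI_id"].tolist())
print("Remaining categories:", filtered["WIKI_id"].nunique())
```

Inspecting `df["WIKI_id"].value_counts()` first is one way to decide which categories belong on such a list.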


# Preprocessing

## 1. Remove artifacts and accents from text
Artifacts such as `<p>`, `&nbsp;`, `<i>`, `<b>`, `<br>`, `<em>`, `_x000D_`, `"` and `'` are removed, and accented characters are transliterated to ASCII.
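
As a point of comparison, the same kind of cleanup can be sketched with the standard library's `html.unescape` and a regular expression. This is an alternative under stated assumptions, not the notebook's method: `remove_artifacts_regex` and the sample string are hypothetical. Unlike `remove_artifacts`, it decodes entities before stripping, so `&nbsp;` becomes a plain space (the chained replacements in `remove_artifacts` leave the trailing `;` behind), and a literal `&` is kept.

```python
import html
import re

# Matches opening, closing and self-closing forms of the listed tags only
TAG_PATTERN = re.compile(r"</?(?:p|i|b|em|br)\s*/?>", flags=re.IGNORECASE)


def remove_artifacts_regex(text):
    """Decode HTML entities, then strip simple tags, _x000D_ markers and quotes."""
    text = html.unescape(str(text))      # "&nbsp;" -> "\xa0", "&amp;" -> "&"
    text = text.replace("\xa0", " ")     # non-breaking space -> regular space
    text = TAG_PATTERN.sub("", text)     # drop <p>, </p>, <i>, <br/>, ...
    text = text.replace("_x000D_", "")   # Excel carriage-return marker
    return text.replace('"', "").replace("'", "")


# Made-up sample string, not taken from the dataset
sample = "<p>Es un <i>arbusto</i>&nbsp;de hasta 2 m_x000D_</p>"
cleaned = remove_artifacts_regex(sample)
print(cleaned)
```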

In [13]:
df_WIKI_ES.loc[:, "prep_description_1"] = df_WIKI_ES["description"].apply(remove_artifacts)
df_WIKI_ES.loc[:, "prep_description_1"] = df_WIKI_ES["prep_description_1"].apply(unidecode.unidecode)

df_WIKI_DE.loc[:, "prep_description_1"] = df_WIKI_DE["description"].apply(remove_artifacts)
df_WIKI_DE.loc[:, "prep_description_1"] = df_WIKI_DE["prep_description_1"].apply(unidecode.unidecode)
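
To make the effect of the accent-removal step concrete, the sketch below strips accents with the standard library's `unicodedata`, which the notebook imports but does not use. It is only an approximation of `unidecode.unidecode`: the helper name `strip_accents` is hypothetical, and characters without a Unicode decomposition, such as `ß`, are left unchanged here, whereas `unidecode` transliterates them (`ß` becomes `ss`). In both approaches German umlauts lose their diacritic (`ä` becomes `a`, not `ae`).

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative words; "Straße" is an extra example not found in the data shown above
spanish_word = strip_accents("Descripción")
german_word = strip_accents("wächst")
eszett_word = strip_accents("Straße")
print(spanish_word, german_word, eszett_word)
```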

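## 2. Remove digits, punctuation, lowercase & tokenize

Three variants are produced from the same cleaned text: `QA_description` (lowercased and tokenized only), `BERT_description` (digits and punctuation also removed) and `BOW_description` (stopwords also removed).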

In [14]:
df_WIKI_ES.loc[:, "QA_description"] = df_WIKI_ES["prep_description_1"].apply(lambda x: preprocess_description(x, remove_numbers=False, remove_stopwords=False))
df_WIKI_ES.loc[:, "BERT_description"] = df_WIKI_ES["prep_description_1"].apply(lambda x: preprocess_description(x, remove_numbers=True, remove_stopwords=False))
df_WIKI_ES.loc[:, "BOW_description"] = df_WIKI_ES["prep_description_1"].apply(lambda x: preprocess_description(x, remove_numbers=True, remove_stopwords=True))

df_WIKI_DE.loc[:, "QA_description"] = df_WIKI_DE["prep_description_1"].apply(lambda x: preprocess_description(x, remove_numbers=False, remove_stopwords=False))
df_WIKI_DE.loc[:, "BERT_description"] = df_WIKI_DE["prep_description_1"].apply(lambda x: preprocess_description(x, remove_numbers=True, remove_stopwords=False))
df_WIKI_DE.loc[:, "BOW_description"] = df_WIKI_DE["prep_description_1"].apply(lambda x: preprocess_description(x, remove_numbers=True, remove_stopwords=True))

# Save Data

In [15]:
df_WIKI_ES.to_excel("..//Datasets//Preprocessed Databases//WIKI_preprocessed_descriptions_ESP.xlsx", index = False)
df_WIKI_DE.to_excel("..//Datasets//Preprocessed Databases//WIKI_preprocessed_descriptions_DE.xlsx", index = False)