# NLP I: Uses and operations of NLTK

In this notebook we are going to put into practice the tokenisation of texts.

Tokenisation is the division of text into smaller pieces called tokens. Text can be tokenised into words or sentences, although tokenising into words is more common.

## Libraries and installation
### NLTK

First we need to import the NLTK library.

In [None]:
import nltk

If you need to install `nltk`, here is the webpage:

https://pypi.org/project/nltk/

If you do not need to install `nltk`, here you can find the webpage documentation:

https://pypi.org/project/nltk/

;P

## Working with the data

First we define a short text to work with, so that the examples are easy to follow.

In [None]:
frase = 'Me he comprado un coche rojo. Ahora tenemos que encontrar un seguro de coches a todo riesgo'

### Word tokenisation

We will use NLTK's `word_tokenize` function, which we import in the next step, and apply it to the text defined above.

Here is a brief explanation of the methods used: `.lower()` converts every word to lowercase, so that all tokens share the same formatting. `.isalpha()` returns `True` only when a token consists entirely of letters and `False` otherwise, which lets us discard punctuation marks, numbers, symbols, etc.
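
As a small illustration of those two string methods, the sketch below applies them to a hand-written list of tokens. The list is made up for this example and is not the output of `word_tokenize`.

```python
# Hypothetical tokens, written by hand for illustration only.
raw_tokens = ["Me", "he", "comprado", "un", "coche", "rojo", ".", "2024", "todo-riesgo"]

# Keep only purely alphabetic tokens and normalise them to lowercase.
alpha_tokens = [tok.lower() for tok in raw_tokens if tok.isalpha()]

print(alpha_tokens)
```

Note that `.isalpha()` is Unicode-aware, so accented Spanish words pass the test, while a hyphenated token such as `todo-riesgo` is dropped as a whole.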

#### NLTK Word Tokenize

We import the Word Tokenize component of the NLTK library to generate the tokens of our text.

Keep in mind that our text is in Spanish, so we will use the Spanish tokeniser for the analysis.

In [None]:
from nltk.tokenize import word_tokenize

### Getting the tokens

To get the tokens we simply call `word_tokenize(t, i)`, where:
* **t** is the text to tokenise
* **i** is the language, in our case `'spanish'`.

In [None]:
def tokenize(_frase):
    """
    Tokenises a sentence into individual words, discarding any non-alphabetic token and converting
    all words to lowercase.

    Args:
    _frase (str): The sentence to tokenise.

    Returns:
    list: A list of lowercase alphabetic tokens.
    """

    tokens = word_tokenize(_frase, "spanish")
    tokens = [word.lower() for word in tokens if word.isalpha()]

    return tokens

In [None]:
token_frase = tokenize(frase)
token_frase

Some errors might arise if you don't have all the necessary resources installed.

The NLTK library has subcomponents that are essential for various analyses. By running the following command:

`nltk.download()`

an execution window will launch, as illustrated in the image below:

<div style="text-align:center;">
<img src="Images/download.png" width="300">
</div>

However, this method isn't the most efficient way to download resources. A quicker approach is to specify the desired subcomponent within the parentheses. For example:

`nltk.download('module')`

This way, you can directly download the necessary modules without navigating through the execution window.
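
A minimal sketch of that idea follows: a hypothetical helper, `ensure_nltk_resources`, that downloads a resource only when it is missing. It relies on `nltk.data.find` raising `LookupError` for absent resources. The resource names `punkt` and `stopwords` are assumptions about what this notebook needs. Recent NLTK releases may ask for `punkt_tab` instead, and the `LookupError` message names the missing resource.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource only if it is not already installed.

    `resources` maps a lookup path (e.g. 'corpora/stopwords') to the name
    passed to the downloader (e.g. 'stopwords'). `find` and `download`
    default to nltk.data.find and nltk.download; they are parameters only
    so the helper can be tried out without touching the network.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be defined without NLTK
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, name in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# Assumed resource names for this notebook; adjust them to your NLTK version.
NOTEBOOK_RESOURCES = {
    "tokenizers/punkt": "punkt",
    "corpora/stopwords": "stopwords",
}
# ensure_nltk_resources(NOTEBOOK_RESOURCES)  # uncomment to run against NLTK
```

The helper returns the names it had to download, so an empty list means everything was already in place.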

### Stop words

Stop words are words that carry little meaning for our analysis, e.g. articles, conjunctions, determiners, auxiliary verbs, etc.

First we must import the **stopwords** corpus from NLTK.

In [None]:
from nltk.corpus import stopwords

We can easily see the words contained within stopwords by executing the following command `stopwords.words('spanish')`.

In [None]:
print(f"There are {len(stopwords.words('spanish'))} stopwords in Spanish:", stopwords.words('spanish'))

To remove stopwords from the text, we check each token against that list and drop the ones that appear in it.

In [None]:
def clean_sw(_tokens, language='spanish'):
    """
    Removes stopwords from a list of tokens based on the specified language.

    Args:
    _tokens (list of str): List of tokens (words) from which the stopwords will be removed.
    language (str, optional): The language of the stopwords. Defaults to 'spanish'.

    Returns:
    list of str: A list of tokens with the stopwords removed.

    Note:
    The function uses the NLTK library's list of stopwords for the removal. Ensure that
    the 'stopwords' dataset from NLTK is downloaded before using this function.
    """
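    # Build the lookup set once instead of reloading the list for every token.
    stop_set = set(stopwords.words(language))

    # Keep every token that is not a stopword, preserving the original order.
    clean_tokens = [token for token in _tokens if token not in stop_set]

    return clean_tokens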

In [None]:
clean_tokens = clean_sw(token_frase)
clean_tokens
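
The filtering idea can also be sketched without NLTK. The stopword set below is a tiny hand-picked subset used only for illustration; in the notebook the full list comes from `stopwords.words('spanish')`.

```python
# Tiny hand-picked subset of Spanish stopwords, for illustration only.
sample_stopwords = {"me", "he", "un", "que", "de", "a", "todo"}

sample_tokens = ["me", "he", "comprado", "un", "coche", "rojo"]

# Membership tests on a set take constant time on average,
# and the list comprehension preserves the original token order.
filtered = [tok for tok in sample_tokens if tok not in sample_stopwords]

print(filtered)
```

Storing the stopwords in a set rather than a list matters mostly for long texts, where each token would otherwise trigger a scan of the whole list.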

### Stemming

Stemming strips the suffixes that mark verb tenses, genders, plurals, etc., so that related word forms reduce to a common stem. This improves the counting and grouping of words in the analysed texts.
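
To make the idea concrete, here is a toy suffix stripper. It is a deliberately naive sketch and not the Snowball algorithm: the suffix list and the `toy_stem` helper are invented for this example.

```python
from collections import Counter

# Toy suffix list, NOT the Snowball rules: just enough to show the idea.
TOY_SUFFIXES = ("es", "as", "os", "s", "a", "o", "e")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["coche", "coches", "rojo", "roja", "rojos"]

# Different surface forms collapse onto the same stem and are counted together.
counts = Counter(toy_stem(w) for w in words)
print(counts)
```

The five word forms collapse into two stems, which is the grouping effect we are after. A real stemmer applies many more rules and handles exceptions that this sketch ignores.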

In our case, for Spanish, we will use the **Snowball** algorithm. We will import `SnowballStemmer` from the **nltk.stem** package.

In [None]:
from nltk.stem import SnowballStemmer

As this stemmer is multi-language, we will have to specify which language we want to use.

You can consult all the available languages, along with more documentation, at: https://www.nltk.org/_modules/nltk/stem/snowball.html. The supported languages are also listed in `SnowballStemmer.languages`.

In [None]:
spanish_stemmer = SnowballStemmer('spanish')

Next, we apply the stemmer to the tokens we generated earlier with the stopwords removed (any list of tokens works, even one that still contains stopwords).

In [None]:
stem_tokens = []

for token in clean_tokens:
    stem_tokens.append(spanish_stemmer.stem(token))

stem_tokens

### Lemmatisation

Lemmatisation, put very simply, maps each word to its dictionary form (its lemma), for example:

* Verbs: Eating -> Eat
* Plurals: Tables -> Table

This allows a more accurate grouping of words than stemming.

To do this in Spanish we will use the [spaCy library](https://spacy.io/), since NLTK does not perform this process in Spanish. Installing spaCy is simple, and you have two options:

**Option 1:** On the spaCy website, click on "USAGE" and follow the instructions under "Install spaCy".

- You can choose pip as the package manager.
- Under Hardware, try to always choose CPU.
- Under Trained pipelines, untick "English" and tick "Spanish".
- Select the pipeline optimised for efficiency.
- Finally, copy the generated commands and run them.

**Option 2:** Just run the following commands in an **Anaconda Prompt** terminal:
* `conda install -c conda-forge spacy`.
* `python -m spacy download es_core_news_sm`.

Once installed, import the library with `import spacy` and load the Spanish package with `spacy.load('es_core_news_sm')`.

In [None]:
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python -m spacy download es_core_news_sm

# The output of the last command says:
# You can now load the package via spacy.load('es_core_news_sm')

In [None]:
import spacy
nlp = spacy.load('es_core_news_sm')

Once the language has been imported and loaded, we will proceed to obtain the lemmas.

In [None]:
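def lematize(_tokens):
    """
    Returns the lemma of each token using the Spanish spaCy pipeline loaded in `nlp`.

    Args:
    _tokens (list of str): List of tokens to lemmatise.

    Returns:
    list of str: The lemmas, in the order spaCy produces them.
    """
    # spaCy works on text, so we join the tokens back into a single string.
    doc = nlp(' '.join(_tokens))

    return [token.lemma_ for token in doc]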

In [None]:
lematize(clean_tokens)

**Warm-up exercise:** Can you build the whole process with a sentence taken from Google?

In [None]:
# Type your code here:

sentence = ...


In [None]:
print(lem_tokens)

Can you see the changes after lemmatising?

**Exercise:** Suppose you do not want to keep proper nouns, surnames or similar words in your tokens.

Create a function `extra_clean()` that removes these unwanted words from your `lem_tokens`.

In [None]:
# Complete with your code below:

def extra_clean(_tokensIn):
    """
    Cleans a list of tokens by removing specific unwanted tokens.

    This function iterates through a list of tokens and removes any that are found
    in the predefined `_toDelete` list.

    Args:
    _tokensIn (list of str): The input list of tokens to be cleaned.

    Returns:
    list of str: A cleaned list of tokens with specific words removed.

    Note:
    This function creates a copy of the input list to ensure the original list remains unchanged.
    It is important to understand that direct modifications on the input list while iterating over it
    can lead to unexpected behavior. By working on a copy, such issues are avoided.
    """

    _toDelete = []  # Put the words to delete here

    _tokens = _tokensIn[:] # Do you really understand why we need a copy of the list?

    # Type your code here:
    #
    #

    return _tokens

In [None]:
final_tokens = extra_clean(lem_tokens)

In [None]:
print(final_tokens)

**Exercise:** Imagine you are developing a project where specific words, such as certain proper nouns and surnames, are considered undesirable and should be treated as stopwords. Your task is to create a library that allows users to customise their list of stopwords.

Build a library? Slow down cerebrito!!

Let us go step by step and create a minilibrary:

1) Create a folder in your directory tree called 'Libraries'.
2) Inside that folder, create an empty file called `__init__.py`.
3) In the same folder, create a file called `custom_stopwords.py` and, inside it, create a class called `tools` (the class is optional; I used one). As a method of this class, define a function with your code from `extra_clean`. Save it and now you can use it! ;)


Don't know how to use it? Hmm....

Follow these steps:

    1) Seriously!?!?!?! Import the library!!! At this point in life?!?!?!?

        `from my_folder_recently_created import my_recently_created_library`

    2) And then use it!!!

        `final_tokens = my_recently_created_library.the_name_of_my_class.my_method(lem_tokens)`

**Warning:** To ensure the changes in my library are reflected, we need to restart the kernel. This is because the kernel has already loaded the previous version of the library into memory. Upon restarting, all the loaded variables and libraries will be cleared, allowing you to load the updated version of the library the next time you import it.
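
As an alternative to a full restart, the standard library's `importlib.reload` re-executes a module that is already loaded. The sketch below shows this on a throwaway package built in a temporary directory. The names `demo_libraries`, `custom_stopwords_demo` and `TO_DELETE` are hypothetical and exist only for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temporary directory (hypothetical names).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_libraries"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
module_file = pkg / "custom_stopwords_demo.py"
module_file.write_text("TO_DELETE = ['juan']\n")

# Make the temporary directory importable and import the module once.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
from demo_libraries import custom_stopwords_demo

first = list(custom_stopwords_demo.TO_DELETE)

# Edit the file on disk, as you would do in an editor.
module_file.write_text("TO_DELETE = ['juan', 'garcia', 'madrid']\n")

# reload() re-executes the module source inside the existing module object.
importlib.reload(custom_stopwords_demo)
second = list(custom_stopwords_demo.TO_DELETE)

print(first, second)
```

One caveat: names that were imported with `from module import name` before the reload keep pointing at the old objects, so restarting the kernel remains the safest option when in doubt.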

**Observation:** Since Python 3.3, a folder without an `__init__.py` file can still be imported as a namespace package, so the library may work without it. Including the file is still the conventional way to mark the folder as a regular package.

In [None]:
# Type your code here:

**Final Exercise:** Now, inside the 'Libraries' folder, create a library called 'tokenization.py' with all the main functions written in this notebook and use it ;)

This way, running the following cell is equivalent to running all the previous cells of the notebook :D

In [None]:
# Type your code here: