# Natural language processing
## Preprocessing with NLTK and spaCy

In [2]:
# Importing necessary libraries
import json
import random
import re
import string

import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

### 1 - Preprocessing with NLTK
- Transform each document into a list of terms
- Create a vector of non-repeated terms from all documents
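
As a rough illustration of the two goals above, the sketch below turns a few toy documents into term lists and then collects the distinct terms into a sorted vocabulary. The `docs` strings and the whitespace-based `split()` are placeholders chosen for illustration; the notebook itself tokenizes with NLTK.

```python
# Toy documents (hypothetical examples, not part of the notebook's data)
docs = [
    "text preprocessing is a crucial step",
    "preprocessing makes text clean",
]

# 1) Each document becomes a list of terms (whitespace split as a stand-in tokenizer)
doc_terms = [doc.split() for doc in docs]

# 2) The vocabulary is the sorted set of distinct terms across all documents
vocabulary = sorted({term for terms in doc_terms for term in terms})

print(vocabulary)
# ['a', 'clean', 'crucial', 'is', 'makes', 'preprocessing', 'step', 'text']
```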

In [3]:
import nltk
from nltk.tokenize import word_tokenize  
from nltk.corpus import stopwords

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# With NLTK 3.6.6, WordNet also requires OMW 1.4
# (Open Multilingual WordNet)
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/florenciavela/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/florenciavela/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/florenciavela/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/florenciavela/nltk_data...


True

In [8]:
# Sample text for preprocessing
text = "Text preprocessing is a crucial step in any Natural Language Processing (NLP) project. Before gathering the features, we need to preprocess the text to ensure it is in a clean and structured format!!!1"
text

'Text preprocessing is a crucial step in any Natural Language Processing (NLP) project. Before gathering the features, we need to preprocess the text to ensure it is in a clean and structured format!!!1'

#### Tokenization

Tokenization is the process of splitting a sentence or document into individual words or terms. This is a fundamental step, as it simplifies the text into manageable pieces. For example, the sentence “NLP is fascinating” would be tokenized into [“NLP”, “is”, “fascinating”].

![Tokenization](../imgs/tokenization.png)
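
To make the idea concrete without any downloads, here is a minimal regex-based tokenizer. It is only a sketch: `simple_tokenize` is a hypothetical helper, not NLTK's algorithm, and `word_tokenize` handles cases this pattern does not (for example, it splits contractions such as "don't" into "do" and "n't").

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("NLP is fascinating"))
# ['NLP', 'is', 'fascinating']
print(simple_tokenize("a structured format!!!1"))
# ['a', 'structured', 'format', '!', '!', '!', '1']
```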

In [9]:
# Splitting the text into individual words
tokens = word_tokenize(text)
print("Tokens:", tokens)

Tokens: ['Text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'any', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', 'project', '.', 'Before', 'gathering', 'the', 'features', ',', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'is', 'in', 'a', 'clean', 'and', 'structured', 'format', '!', '!', '!', '1']


#### Removing punctuation

Punctuation marks are usually removed from the text since they often do not carry significant meaning for text analysis. This helps in reducing the complexity of the text data.

In [10]:
# Keeping only purely alphanumeric tokens, which removes punctuation marks
tokens_no_punct = [word for word in tokens if word.isalnum()]
print("Tokens without punctuation:", tokens_no_punct)

Tokens without punctuation: ['Text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'any', 'Natural', 'Language', 'Processing', 'NLP', 'project', 'Before', 'gathering', 'the', 'features', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'is', 'in', 'a', 'clean', 'and', 'structured', 'format', '1']
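
One side effect of `isalnum()` is worth keeping in mind: it drops any token that contains a non-alphanumeric character, not only standalone punctuation. A minimal sketch with hand-picked tokens (hypothetical, not taken from the sample text) compares it with a looser filter that only discards tokens made up entirely of punctuation:

```python
import string

tokens = ["state-of-the-art", "e-mail", "model", ",", "!", "..."]

# Filter used in this notebook: keeps only purely alphanumeric tokens
strict = [tok for tok in tokens if tok.isalnum()]

# Looser alternative: drop a token only if every character is punctuation
punct = set(string.punctuation)
loose = [tok for tok in tokens if not all(ch in punct for ch in tok)]

print(strict)  # ['model']
print(loose)   # ['state-of-the-art', 'e-mail', 'model']
```

Which filter is appropriate depends on whether hyphenated or dotted terms matter for the task.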


#### Removing numbers

Numbers are also removed unless they carry significant meaning for the specific NLP task. This is because numbers can add noise to the data, affecting the performance of the model.

In [11]:
# Removing numerical values from tokens
tokens_no_numbers = [word for word in tokens_no_punct if not word.isdigit()]
print("Tokens without numbers:", tokens_no_numbers)

Tokens without numbers: ['Text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'any', 'Natural', 'Language', 'Processing', 'NLP', 'project', 'Before', 'gathering', 'the', 'features', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'is', 'in', 'a', 'clean', 'and', 'structured', 'format']
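
When numbers do matter for the task, one possible alternative to deleting them is to replace each numeric token with a placeholder, so the model still sees that a number was present. The sketch below assumes a hypothetical `mask_numbers` helper and a `<NUM>` placeholder, neither of which is part of NLTK. Unlike `isdigit()`, which only matches tokens made of digits alone, the pattern also covers decimals and thousands separators.

```python
import re

# Digits, optionally followed by groups such as ".14" or ",000"
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(tokens, placeholder="<NUM>"):
    """Replace tokens that are entirely numeric (e.g. '42', '3.6', '1,000')."""
    return [placeholder if NUMBER_RE.fullmatch(tok) else tok for tok in tokens]

print(mask_numbers(["version", "3.6", "costs", "1,000", "dollars", "in", "2024", "q4"]))
# ['version', '<NUM>', 'costs', '<NUM>', 'dollars', 'in', '<NUM>', 'q4']
```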


#### Removing HTML or special characters

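In many cases, text data contains HTML tags or special symbols (e.g., &, @, #). Removing these elements cleans the text and makes it more suitable for analysis. The cell below strips any remaining non-word characters from each token; because the tokens were already filtered with `isalnum()`, it leaves this sample unchanged.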

In [13]:
# Using regex to remove special characters
tokens_cleaned = [re.sub(r'\W+', '', word) for word in tokens_no_numbers]
print("Tokens without special characters:", tokens_cleaned)

Tokens without special characters: ['Text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'any', 'Natural', 'Language', 'Processing', 'NLP', 'project', 'Before', 'gathering', 'the', 'features', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'is', 'in', 'a', 'clean', 'and', 'structured', 'format']
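
The regex above works on tokens that are already split. HTML markup is usually easier to remove from the raw string before tokenization. The following is a minimal sketch using only the standard library; `strip_html` is a hypothetical helper, and the tag pattern is deliberately crude, so a real HTML parser is a safer choice for messy pages.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(raw):
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)        # replace each tag with a space
    unescaped = html.unescape(no_tags)    # "&amp;" -> "&"
    return " ".join(unescaped.split())    # normalize runs of whitespace

print(strip_html("<p>Tom &amp; Jerry</p>"))
# Tom & Jerry
```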


#### Converting to lowercase

Lowercasing maps variants such as “Text” and “text” to the same term, which shrinks the vocabulary. It also matters for the next step, because NLTK's stop-word list is lowercase.

In [14]:
# Converting all tokens to lowercase to maintain uniformity
tokens_lower = [word.lower() for word in tokens_cleaned]
print("Tokens in lowercase:", tokens_lower)

Tokens in lowercase: ['text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'any', 'natural', 'language', 'processing', 'nlp', 'project', 'before', 'gathering', 'the', 'features', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'is', 'in', 'a', 'clean', 'and', 'structured', 'format']


#### Removing stop words

Stop words are common words (e.g., “and”, “the”, “is”) that do not contribute much to the meaning of a sentence. Removing these words can improve the efficiency of the model by reducing the volume of text data.

In [15]:
# Removing common words that do not contribute much to the meaning
stop_words = set(stopwords.words('english'))
tokens_no_stopwords = [word for word in tokens_lower if word not in stop_words]
print("Tokens without stop words:", tokens_no_stopwords)

Tokens without stop words: ['text', 'preprocessing', 'crucial', 'step', 'natural', 'language', 'processing', 'nlp', 'project', 'gathering', 'features', 'need', 'preprocess', 'text', 'ensure', 'clean', 'structured', 'format']
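
Stop-word removal is not always harmless: NLTK's English list includes negations such as “not” and “no”, and dropping them can invert the meaning of a sentence. The sketch below uses a small hand-written stop list as a stand-in (an assumption, not NLTK's full list) to show one way of keeping selected words.

```python
# Small hand-written stop list for illustration only (not NLTK's full list)
stop_words = {"the", "is", "a", "and", "not", "no"}

# Words to keep even though they appear in the stop list
keep = {"not", "no"}

tokens = ["the", "movie", "is", "not", "good"]

plain = [tok for tok in tokens if tok not in stop_words]
negation_aware = [tok for tok in tokens if tok not in (stop_words - keep)]

print(plain)           # ['movie', 'good']
print(negation_aware)  # ['movie', 'not', 'good']
```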


#### Stemming

Stemming reduces words to a root form by stripping suffixes with heuristic rules, so the result is not always a dictionary word. For example, the Porter stemmer reduces both “running” and “runs” to “run”, but it turns “studies” into “studi”. The approach is simple and fast, at the cost of sometimes producing non-existent words.

Examples:
* “running” -> “run”
* “runners” -> “runner”
* “studies” -> “studi”
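
To show why rule-based suffix stripping can yield non-words, here is a deliberately tiny stemmer. It is a toy illustration under simplified rules, not the Porter algorithm, and `toy_stem` is a hypothetical name; it happens to reproduce the three examples above but will mis-stem many other words.

```python
def toy_stem(word):
    """Strip a few common suffixes; a toy illustration, NOT the Porter algorithm."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        stem = word[: len(word) - len(suffix)]
        # Only strip when at least three characters of stem remain
        if word.endswith(suffix) and len(stem) >= 3:
            stem += replacement
            # Undo the doubled consonant left behind by "-ing" ("runn" -> "run")
            if suffix == "ing" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runners", "studies"]:
    print(word, "->", toy_stem(word))
# running -> run
# runners -> runner
# studies -> studi
```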

In [16]:
# Reducing words to their root form using PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens_no_stopwords]
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['text', 'preprocess', 'crucial', 'step', 'natur', 'languag', 'process', 'nlp', 'project', 'gather', 'featur', 'need', 'preprocess', 'text', 'ensur', 'clean', 'structur', 'format']


#### Lemmatization

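Lemmatization reduces words to their base or dictionary form (the lemma). For instance, “running” and “ran” are both reduced to “run” when they are treated as verbs. Unlike stemming, lemmatization returns words that exist in the dictionary, which preserves their meaning. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed through the `pos` argument, which is why “preprocessing” and “gathering” stay unchanged in the output below.

Examples:
* “running” -> “run” (when tagged as a verb, `pos="v"`)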
* “runners” -> “runner”
* “studies” -> “study”

In [17]:
# Reducing words to their dictionary form using WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_no_stopwords]
print("Lemmatized Tokens:", lemmatized_tokens)

Lemmatized Tokens: ['text', 'preprocessing', 'crucial', 'step', 'natural', 'language', 'processing', 'nlp', 'project', 'gathering', 'feature', 'need', 'preprocess', 'text', 'ensure', 'clean', 'structured', 'format']


![Stemming vs Lemmatization](../imgs/stemming-lemmatization.png)

### NLTK preprocess function

In [35]:
def nltk_preprocess(text):
    """Tokenize, lemmatize, and remove stop words and punctuation with NLTK."""
    print("Initial text:", text)
    # Tokenization
    nltk_tokenList = word_tokenize(text)

    # Lemmatization (every token is treated as a noun by default)
    lemmatizer = WordNetLemmatizer()
    nltk_lemmaList = [lemmatizer.lemmatize(word) for word in nltk_tokenList]

    print("\nLemmatization:", nltk_lemmaList)

    # Stop words (the check is case-sensitive and NLTK's list is lowercase,
    # so capitalized stop words such as "Before" are kept)
    nltk_stop_words = set(stopwords.words("english"))
    filtered_sentence = [w for w in nltk_lemmaList if w not in nltk_stop_words]

    # Filter punctuation tokens (digits such as "1" are kept)
    filtered_sentence = [w for w in filtered_sentence if w not in string.punctuation]

    print("\nRemove stopword & Punctuation: ", filtered_sentence)
    return filtered_sentence

In [40]:
nltk_text = nltk_preprocess(text)

Initial text: Text preprocessing is a crucial step in any Natural Language Processing (NLP) project. Before gathering the features, we need to preprocess the text to ensure it is in a clean and structured format!!!1

Lemmatization: ['Text', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'any', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', 'project', '.', 'Before', 'gathering', 'the', 'feature', ',', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'is', 'in', 'a', 'clean', 'and', 'structured', 'format', '!', '!', '!', '1']

Remove stopword & Punctuation:  ['Text', 'preprocessing', 'crucial', 'step', 'Natural', 'Language', 'Processing', 'NLP', 'project', 'Before', 'gathering', 'feature', 'need', 'preprocess', 'text', 'ensure', 'clean', 'structured', 'format', '1']
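
The output above still contains `Before` and `1` because the function neither lowercases the tokens nor filters digits. A compact variant that chains the individual steps from section 1 in order is sketched below. To stay dependency-free it uses a regex tokenizer and a tiny hand-written stop list as stand-ins for `word_tokenize` and `stopwords.words('english')`; `mini_preprocess` and `STOP_WORDS` are hypothetical names, and lemmatization is left out.

```python
import re

# Tiny illustrative stop list (stand-in for NLTK's English list)
STOP_WORDS = {"a", "is", "in", "the", "to", "we", "it", "and", "any", "before"}

def mini_preprocess(text):
    """Tokenize, lowercase, keep alphabetic tokens, then drop stop words."""
    tokens = re.findall(r"\w+|[^\w\s]", text)           # tokenization
    tokens = [tok.lower() for tok in tokens]             # lowercase first so stop words match
    tokens = [tok for tok in tokens if tok.isalpha()]    # keeps purely alphabetic tokens only
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(mini_preprocess("Before gathering the features, we need a clean format!!!1"))
# ['gathering', 'features', 'need', 'clean', 'format']
```

Lowercasing before the stop-word check is the design choice that matters here: it lets “Before” match the lowercase entry in the stop list.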


### spaCy preprocess function

In [37]:
#!pip3 install spacy
#!python3 -m spacy download en_core_web_sm

In [38]:
import spacy
# Load the English preprocessing pipeline
nlp = spacy.load('en_core_web_sm')

def spacy_preprocess(text):
    """Tokenize, lemmatize, and remove stop words and punctuation with spaCy."""
    doc = nlp(text)

    # Tokenization & lemmatization
    lemma_list = [token.lemma_ for token in doc]
    print("Tokenize+Lemmatize:")
    print(lemma_list)

    # Stop words
    filtered_sentence = []
    for word in lemma_list:
        # 'word' is a string. To retrieve information from spaCy objects,
        # we need to use the string to get a lexeme, the spaCy object
        # that contains preprocessing information for each term
        # (we could also directly filter stopwords during the lemmatization step)
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop:
            filtered_sentence.append(word)

    # Filter punctuation
    filtered_sentence = [w for w in filtered_sentence if w not in string.punctuation]

    print(" ")
    print("Remove stopword & punctuation: ")
    print(filtered_sentence)
    return filtered_sentence

In [41]:
spacy_text = spacy_preprocess(text)

Tokenize+Lemmatize:
['text', 'preprocessing', 'be', 'a', 'crucial', 'step', 'in', 'any', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', 'project', '.', 'before', 'gather', 'the', 'feature', ',', 'we', 'need', 'to', 'preprocess', 'the', 'text', 'to', 'ensure', 'it', 'be', 'in', 'a', 'clean', 'and', 'structured', 'format!!!1']
 
Remove stopword & punctuation: 
['text', 'preprocessing', 'crucial', 'step', 'Natural', 'Language', 'Processing', 'NLP', 'project', 'gather', 'feature', 'need', 'preprocess', 'text', 'ensure', 'clean', 'structured', 'format!!!1']


### Conclusions

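Text preprocessing is an essential step in NLP that transforms raw text into a format that can be effectively analyzed and modeled.
By using techniques such as tokenization, lemmatization, and stop-word removal, we can clean and structure the text data. Libraries like NLTK and spaCy provide robust tools for implementing these preprocessing steps, but their defaults differ. On the sample text, spaCy lowercases some lemmas (“Text” becomes “text”), lemmatizes “gathering” to “gather”, removes “before” as a stop word, and keeps “format!!!1” as a single token, whereas the NLTK function keeps “Before”, leaves “gathering” unchanged, and splits off “format” and “1”.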

In [44]:
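#!pip3 install prettytable
from prettytable import PrettyTable
table = PrettyTable(['NLTK', 'spaCy'])
# zip() pairs tokens by position and stops at the shorter list, so rows drift
# out of alignment once the outputs differ and trailing NLTK tokens are dropped
for nltk_word, spacy_word in zip(nltk_text, spacy_text):
    table.add_row([nltk_word, spacy_word])
print(table)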

+---------------+---------------+
|      NLTK     |     spaCy     |
+---------------+---------------+
|      Text     |      text     |
| preprocessing | preprocessing |
|    crucial    |    crucial    |
|      step     |      step     |
|    Natural    |    Natural    |
|    Language   |    Language   |
|   Processing  |   Processing  |
|      NLP      |      NLP      |
|    project    |    project    |
|     Before    |     gather    |
|   gathering   |    feature    |
|    feature    |      need     |
|      need     |   preprocess  |
|   preprocess  |      text     |
|      text     |     ensure    |
|     ensure    |     clean     |
|     clean     |   structured  |
|   structured  |   format!!!1  |
+---------------+---------------+
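
The table has 18 rows although the NLTK output contains 20 tokens, and from the tenth row on the two columns no longer describe the same word. The sketch below shows two possible ways to compare the outputs more faithfully, using only the standard library. The two lists are copied from the outputs printed earlier in this notebook; `rows`, `nltk_terms`, and `spacy_terms` are names introduced here for illustration.

```python
from itertools import zip_longest

nltk_text = ['Text', 'preprocessing', 'crucial', 'step', 'Natural', 'Language',
             'Processing', 'NLP', 'project', 'Before', 'gathering', 'feature',
             'need', 'preprocess', 'text', 'ensure', 'clean', 'structured',
             'format', '1']
spacy_text = ['text', 'preprocessing', 'crucial', 'step', 'Natural', 'Language',
              'Processing', 'NLP', 'project', 'gather', 'feature', 'need',
              'preprocess', 'text', 'ensure', 'clean', 'structured', 'format!!!1']

# zip() stops at the shorter list; zip_longest() pads the shorter one instead
rows = list(zip_longest(nltk_text, spacy_text, fillvalue=""))
print(len(rows))   # 20
print(rows[-2:])   # [('format', ''), ('1', '')]

# Order-independent comparison: which lowercased terms appear in only one output?
nltk_terms = {w.lower() for w in nltk_text}
spacy_terms = {w.lower() for w in spacy_text}
print(sorted(nltk_terms - spacy_terms))  # ['1', 'before', 'format', 'gathering']
print(sorted(spacy_terms - nltk_terms))  # ['format!!!1', 'gather']
```

Padding avoids silently losing tokens, while the set difference makes the actual disagreements between the two libraries visible regardless of position.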
