# Utils - David Rosado

This notebook contains some utility functions, together with their explanations, used for the first NLP project.

---

In [1]:
# Necessary imports
import nltk
import re
from collections import Counter, defaultdict
import unidecode
import unicodedata
import pandas as pd

# Text cleaning

Text cleaning is the process of preparing text data for analysis by removing or modifying unwanted or irrelevant information. Typical tasks include removing punctuation symbols, tokenization, removing stop words, normalizing spaces, and handling special tokens. Let us define some functions to deal with these tasks.
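
As a rough preview of how these steps combine, here is a minimal sketch that uses only the standard library. The name `clean_text` and the tiny `STOP_WORDS` set are hypothetical and exist only for this illustration; whitespace splitting stands in for a real tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop word list used only for this sketch.
STOP_WORDS = {"the", "is", "of", "a", "an"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop stop words, and normalize spaces."""
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # str.split() with no argument splits on any run of whitespace,
    # so joining with a single space also normalizes the spacing.
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

print(clean_text("The  goal of   cleaning is: simpler, tidier text!"))
# goal cleaning simpler tidier text
```

The order of the steps matters: punctuation is removed before splitting so that a token such as `is:` is still recognized as a stop word.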

In [2]:
# Tokenize a text
def tokenize_text(text):
    '''
    Args: 
      text (str): The input text to be tokenized
    
    Returns:
      list: The tokenized text in a list
    
    '''
    return [token.lower() for token in nltk.word_tokenize(text)]

# Remove punctuation symbols
def remove_punctuation(text, question_mark = True):
    '''
    Args:
      text (str): The input text to remove punctuations
      question_mark (bool, default=True): If True, the question_mark is removed
    
    Returns:
      str: The final text without punctuation symbols
    '''
    if question_mark:
        return re.sub(r'[^\w\s]', '', text)
    else:
        return re.sub(r'[^\w\s?]', '', text)
    
# Remove English stop words
def remove_stopwords(text, stop_words):
    """
    Args:
      text (str): The input text to remove stop words from
      stop_words (set): The lowercase stop words to remove
    
    Returns:
      str: The lowercased text without stop words
    """
    
    # Tokenize the text (tokens come back in lower case)
    words = tokenize_text(text)
    # Keep only the tokens that are not stop words
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Replace all consecutive whitespace characters in the text string with a single space.
def normalize_spaces(text):
    '''
    Args:
      text (str): The input text to normalize
    Returns:
      str: The final normalized text
    '''
    return re.sub(r'\s+',' ',text)

# Replace all non-alphabetic characters in the text string with a single space.
def remove_nonAlphaWord(text, question_mark = True):
    '''
    Args:
      text (str): The input text to replace non alphabetic characters
      question_mark (bool, default=True): If True, the question_mark is removed
    Returns:
      str : The final replaced text
    '''
    if question_mark:
        return re.sub(r'[^a-zA-Z]', ' ',text)
    else: 
        return re.sub(r'[^a-zA-Z?]', ' ', text)


# Replace accented characters with their closest ASCII equivalents
def remove_accents(text):
    '''
    Args:
      text (str): The input text to remove accents from
    Returns:
      str : The final text without accents
    '''
    return unidecode.unidecode(text)

# Replace every token that appears in word_counts_one with a generic placeholder
def special_tokens(text, word_counts_one, max_count = 1):
    '''
    Args:
      text (str): The input text to treat
      word_counts_one (dict): Words that appear only once in the dataset, with their counts
      max_count (int, default = 1): Maximum number of occurrences for a word to be
      treated as rare. Currently unused: the rare words are given by word_counts_one
      
    Returns:
      str: The modified text 
    '''
    
    # Replace words that occur only once with "special_token"
    modified_question = ' '.join('special_token' if word in word_counts_one else word for word in tokenize_text(text))
    return modified_question

## Examples

Let us run some examples of the functions above to understand how they work. We start by tokenizing some text. Notice that our function returns the tokens in lower case.

In [3]:
# Tokenize text

txt = 'This text is an example for the first project of NLP'
print(f"From => {txt} -> {tokenize_text(txt)}")

From => This text is an example for the first project of NLP -> ['this', 'text', 'is', 'an', 'example', 'for', 'the', 'first', 'project', 'of', 'nlp']


Let us continue by showing how the remove_punctuation and remove_nonAlphaWord functions work. Given some text, the first function deletes any character that is neither a word character (letters, digits and the underscore) nor a whitespace character. This means that punctuation is removed entirely, while whitespace is preserved. The second function, remove_nonAlphaWord, replaces every character that is not an unaccented English letter with a space. This means that digits, punctuation and whitespace characters each become a single space.

In [4]:
# Remove punctuation

txt = 'Wow! This is amazing, actually, it is truly amazing'
print(f"From => {txt} -> {remove_punctuation(txt)}")
txt = 'Can I get the ticket please?'
print(f"From => {txt} -> {remove_punctuation(txt,False)}")
print(f"From => {txt} -> {remove_punctuation(txt,True)}")

# Remove nonAlpha words
txt = 'I am 100% sure that this is amazing, it is truly amazing'
print(f"From => {txt} -> {remove_nonAlphaWord(txt)}")
txt = 'Can I get the ticket please?'
print(f"From => {txt} -> {remove_nonAlphaWord(txt,False)}")
print(f"From => {txt} -> {remove_nonAlphaWord(txt,True)}")

From => Wow! This is amazing, actually, it is truly amazing -> Wow This is amazing actually it is truly amazing
From => Can I get the ticket please? -> Can I get the ticket please?
From => Can I get the ticket please? -> Can I get the ticket please
From => I am 100% sure that this is amazing, it is truly amazing -> I am      sure that this is amazing  it is truly amazing
From => Can I get the ticket please? -> Can I get the ticket please?
From => Can I get the ticket please? -> Can I get the ticket please 
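
To make the difference between the two patterns concrete, the following sketch applies both regular expressions directly to one made-up sample string. It does not call the notebook functions, only the same patterns.

```python
import re

sample = "snake_case 100% café"

# \w matches letters (accented ones too, for str patterns), digits and "_",
# so only the "%" is deleted here.
no_punct = re.sub(r"[^\w\s]", "", sample)

# [a-zA-Z] keeps unaccented English letters only: "_", digits, "%", "é"
# and whitespace are each replaced by one space.
only_letters = re.sub(r"[^a-zA-Z]", " ", sample)

# repr() makes the runs of spaces visible.
print(repr(no_punct))
print(repr(only_letters))
```

Because each removed character becomes its own space, remove_nonAlphaWord is usually followed by normalize_spaces.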


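Let us continue by showing how remove_stopwords works. It is a simple function that removes English stop words from a given text. Because it relies on tokenize_text, the result is lowercased and punctuation is kept as separate tokens.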

In [5]:
# Remove stop words
# Create the stop words vocabulary (customize)
stop_words = set([
    'the', 'and', 'to', 'in', 'of', 'that', 'is', 'it', 'for',
    'on', 'this', 'you', 'be', 'are', 'or', 'from', 'at', 'by', 'we',
    'an', 'not', 'have', 'has', 'but', 'as', 'if', 'so', 'they', 'their',
    'was', 'were','some', 'there', 'these', 'those', 'than', 'then', 'been', 'also',
    'much', 'many', 'other'
])
txt = 'Is there any other option?'
print(f"From => {txt} -> {remove_stopwords(txt,stop_words)}")

From => Is there any other option? -> any option ?


Let us go on with two more functions: normalize_spaces and remove_accents. The first one replaces every run of consecutive whitespace characters in the text with a single space. The second one takes a string containing Unicode characters and returns a new string with those characters replaced by their closest ASCII equivalents. This can be useful for converting non-ASCII text to a more universally readable format.

In [6]:
# Normalize spaces

txt = 'This  text  has   many        spaces'
print(f"From => {txt} -> {normalize_spaces(txt)}")

# Remove accents
txt = "héllo wörld"
print(f"From => {txt} -> {remove_accents(txt)}")

From => This  text  has   many        spaces -> This text has many spaces
From => héllo wörld -> hello world
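
If the unidecode dependency is not available, a similar effect for accented Latin letters can be sketched with the standard library alone. This is an alternative technique, not what remove_accents does internally, and the name `strip_accents_stdlib` is hypothetical.

```python
import unicodedata

def strip_accents_stdlib(text):
    """Remove diacritics via NFKD decomposition (standard library only)."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks (accents, umlauts, ...) have a non-zero combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents_stdlib("héllo wörld"))  # hello world
```

Characters that have no Unicode decomposition, such as "ø", pass through this sketch unchanged, so the two approaches are not interchangeable for every input.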


Finally, let us look at the last function, special_tokens. It takes a single text together with a dictionary of the words that appear only once in the whole corpus, and replaces each of those rare words with special_token. To test it, let us create a small dataset.

In [7]:
dataset = [
    "I like to read books", "Reading books is enjoyable for me",
    "She runs every morning", "Every morning she goes for a run",
    "The cat is sleeping", "The sleeping cat is cute",
    "I am learning to code", "Coding is a useful skill to learn",
    "He enjoys playing video games", "Playing video games is his favorite hobby",
    "The car stopped abruptly", "The abrupt stop of the car was surprising",
    "We went to the beach", "The beach was crowded and sunny",
    "She sings beautifully", "Her beautiful singing voice is captivating",
    "The restaurant serves delicious food", "The food at the restaurant is always tasty",
    "He is studying for an exam", "Studying is important for academic success",
    "The flowers are blooming", "The blooming flowers are a sign of spring",
    "The movie was entertaining", "I found the movie to be quite enjoyable",
    "She is a talented musician", "Music is her passion and she is very talented",
    "The building is very tall", "The tall building is an impressive feat of engineering",
    "He traveled to Europe last summer", "Last summer he went on a trip to Europe",
    "I love spending time with my family", "My family is very important to me",
    "The book was very suspenseful", "I found the book to be quite thrilling",
    "She enjoys painting and drawing", "Art is her favorite form of self-expression",
    "The sun is shining brightly today", "The bright sun is making everything look beautiful",
    "He is an excellent chef", "Cooking is his passion and he is very skilled"
]
# Count how often each token appears in the whole dataset
word_counts = Counter(word for sentence in dataset for word in tokenize_text(sentence))
# Wrap the counts in a defaultdict so that unseen words have count 0
word_counts = defaultdict(lambda: 0, word_counts)
# Words that appear only once
word_counts_one = {k: v for k, v in word_counts.items() if v == 1}
# Apply the replacement to the second sentence of the dataset
dataset_special_tokens = special_tokens(dataset[1],word_counts_one)

In [8]:
dataset_special_tokens

'special_token books is enjoyable for me'
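
The max_count parameter is not used by special_tokens as written. One possible way to honor it is sketched below, under the assumption that a word counts as rare when it appears at most max_count times. The names `replace_rare_tokens` and `placeholder` are hypothetical, and whitespace splitting stands in for tokenize_text so that the sketch runs without NLTK.

```python
from collections import Counter

def replace_rare_tokens(text, word_counts, max_count=1, placeholder="special_token"):
    """Replace tokens seen at most `max_count` times in the corpus.

    Assumption: "rare" means count <= max_count. Words missing from
    word_counts have count 0 and are therefore replaced as well.
    """
    tokens = text.lower().split()
    return " ".join(
        placeholder if word_counts.get(tok, 0) <= max_count else tok
        for tok in tokens
    )

corpus = ["the cat sleeps", "the dog sleeps", "the cat purrs"]
counts = Counter(tok for sentence in corpus for tok in sentence.lower().split())

print(replace_rare_tokens("the dog sleeps", counts))
# the special_token sleeps
print(replace_rare_tokens("the dog sleeps", counts, max_count=2))
# the special_token special_token
```

Passing the full count table instead of a pre-filtered dictionary lets the threshold change without rebuilding anything.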

# Feature Engineering

Let us build three different kinds of text features.


# Length information

Let us compute the following:

+ The number of words in a given text
+ The number of non-ASCII words in a given text

In [9]:
# Number of words for a given text
def words_count(text):
    '''
    Args:
      text (str): The input text to count the number of words
      
    Returns:
      int : The number of words in the given text
    '''
    return len(tokenize_text(text))

# Number of non ASCII words for a given text
def nonAscii_word_count(text):
    '''
    Args:
      text (str): The input text to count the number of non-ASCII words
      
    Returns:
      int : The number of non-ASCII words in the given text
    '''
    
    # Split sentence into words
    words = tokenize_text(text)
    
    # Initialize counter for non-ASCII words
    non_ascii_word_count = 0
    
    # Loop through words and check if each one contains non-ASCII characters
    for word in words:
        # Decompose the word (NFKD) so that diacritics become separate combining characters
        normalized_word = unicodedata.normalize('NFKD', word)
        # Count the word if any character is non-ASCII (combining marks are non-ASCII too)
        if any(not c.isascii() for c in normalized_word):
            non_ascii_word_count += 1
    
    return non_ascii_word_count


## Examples

In [10]:
# Word count
txt = 'This text contains five words'
print(words_count(txt))

5


In [11]:
# Non-ASCII word count
txt = 'The café serves croissants and café au lait.'
print(nonAscii_word_count(txt))
txt = 'This is an example text with some non-ASCII words like café, résumé, Pokémon and 阿.'
print(nonAscii_word_count(txt))

2
4


# Common word intersection count

Let us write a function that counts the distinct words that two sentences have in common.

In [12]:
def common_words_count(text1,text2):
    '''
    Args:
      text1 (str): First sentence
      text2 (str): Second sentence
    
    Return:
      int: The number of common words that the two sentences have in common
    '''
    # Compute the tokens for each sentence
    tokens1 = set(tokenize_text(text1))
    tokens2 = set(tokenize_text(text2))
    
    # Return the number of common words
    return len(tokens1 & tokens2)

## Example

In [13]:
# Common words count example
txt1 = 'This is a sentence to test the implemented function'
txt2 = 'The aim of this sentence is to test the implemented function'
'''
Common_words = {'function', 'implemented', 'is', 'sentence', 'test', 'the', 'this', 'to'}
'''
common_words_count(txt1,txt2)

8

# Study of the beginning of the question

Let us check whether a question starts with one of the following tokens: Who, Where, When, Why, What, Which, How. We consider two different approaches. The first one builds a one-hot encoding of the first word for every question in the corpus.

In [14]:
def one_hot_begin(corpus):
    '''
    Args:
      corpus (list): The whole corpus to create the one hot encoding
    
    Return:
      dataframe : A dataframe with the one hot encoding
    '''
    # Define the one-hot encoding labels
    labels = ['who', 'where', 'when', 'why', 'what', 'which', 'how']
    
    # Initialize an empty list to store the one-hot encodings
    one_hot_encodings = []
    
    # Iterate through each sentence in the dataset
    for question in corpus:
        # Initialize a list of zeros
        one_hot_encoding = [0] * len(labels)
        
        # Split the sentence into individual words
        words = tokenize_text(question)
        
        # Check if the first word of the sentence is in the labels list
        # (the `words and` guard skips sentences that yield no tokens)
        if words and words[0] in labels:
            one_hot_encoding[labels.index(words[0])] = 1
        
        # Add the one-hot encoding to the list of encodings
        one_hot_encodings.append(one_hot_encoding)
    
    # Convert the list of encodings to a pandas dataframe
    df_one_hot = pd.DataFrame(one_hot_encodings, columns=labels)
    
    return df_one_hot
        

## Example

In [15]:
# Create a small dataset to test it
corpus = [
    'How do you do?', 'Should we play chess?',
    'When did you arrive?', 'Why are you crazy?',
    'Oh, is that you?', 'What about you?',
    'Where is the nearest restaurant', 'Amazing']

one_hot_begin(corpus)

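|   | who | where | when | why | what | which | how |
|---|-----|-------|------|-----|------|-------|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |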


The second approach simply returns True or False depending on whether two questions start with the same word.

In [16]:
def frist_word_is_same(text1,text2):
    '''
    Args:
      text1 (str): First sentence
      text2 (str): Second sentence
    
    Returns:
      bool: True if the first word is the same, otherwise, False
    '''
    # Tokenize the text
    tokens1 = tokenize_text(text1)
    tokens2 = tokenize_text(text2)
    # Return True/False
    return tokens1[0] == tokens2[0]

## Example

In [17]:
# Check if two questions start with the same word

txt1 = 'How are you?'
txt2 = 'How are you doing?'
print(frist_word_is_same(txt1,txt2))

txt1 = 'Why are you here?'
txt2 = 'Where is the party?'
print(frist_word_is_same(txt1,txt2))

True
False
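
frist_word_is_same indexes `tokens1[0]` and `tokens2[0]`, which raises an IndexError when either text yields no tokens (for example, an empty string). A defensive variant could look like the sketch below. The names `first_word` and `first_word_matches` are hypothetical, and a regular expression stands in for tokenize_text so that the sketch runs without NLTK.

```python
import re

def first_word(text):
    """Return the first alphabetic word in lowercase, or None if there is none."""
    match = re.search(r"[A-Za-z]+", text)
    return match.group(0).lower() if match else None

def first_word_matches(text1, text2):
    """True only when both texts have a first word and the two are equal."""
    w1, w2 = first_word(text1), first_word(text2)
    return w1 is not None and w1 == w2

print(first_word_matches("How are you?", "How are you doing?"))  # True
print(first_word_matches("Why are you here?", "Where is the party?"))  # False
print(first_word_matches("", "How are you?"))  # False
```

Unlike the tokenizer-based version, this sketch skips leading punctuation and digits, so the two can disagree on inputs that start with a non-letter.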
