# Normalizing Text

## Definition

Text normalization is the process of converting text into a standardized (canonical) form. Common steps include:

1. **Lowercasing**:
   - Convert all characters to a single case (usually lowercase, sometimes uppercase) so that words differing only in case are treated as the same.

2. **Removing Accents and Diacritics**:
   - Remove accents and diacritical marks from characters to simplify text representation. Python's standard-library `unicodedata` module is useful for this task.

3. **Expanding Contractions**:
   - If needed, expand contractions to their full forms (e.g., "can't" to "cannot").

4. **Removing Special Characters**:
   - Remove or replace special characters, punctuation, or unwanted symbols as per the text normalization requirements.

5. **Whitespace Normalization**:
   - Normalize whitespace, such as converting multiple spaces to a single space or removing leading/trailing spaces.
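
Contraction expansion (step 3) can be sketched with a small lookup table and a regular expression. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical names introduced here for illustration; the table is deliberately tiny, handles only the straight apostrophe, and resolves ambiguous forms such as "it's" in one fixed way.

```python
import re

# Hypothetical, deliberately small lookup table (an assumption for this sketch).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",  # could also mean "it has"; this sketch does not disambiguate
}

# Try longer keys first so a longer contraction wins if keys ever overlap.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded (lowercase) form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)


print(expand_contractions("I can't go, it's late and I don't care"))
# I cannot go, it is late and I do not care
```

Because the replacement is always lowercase, this sketch fits best after the lowercasing step.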


## Example

In [1]:
import unicodedata
help(unicodedata.normalize)

Help on built-in function normalize in module unicodedata:

normalize(form, unistr, /)
    Return the normal form 'form' for the Unicode string unistr.
    
    Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
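
As a brief illustration of the four forms listed in the help text, the sketch below compares a precomposed and a decomposed spelling of the same letter, then shows how the compatibility ("K") forms treat a ligature. The variable names are chosen here for illustration only.

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single precomposed code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# The two strings look the same when rendered but differ as code point sequences.
print(composed == decomposed)  # False

# NFC composes and NFD decomposes, so normalizing to one form makes them equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# The "K" forms also replace compatibility characters, such as the
# 'fi' ligature (U+FB01), with their plain equivalents.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == ligature)  # True (left unchanged)
print(unicodedata.normalize("NFKC", ligature))             # fi
```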



In [2]:
text = "Thís ís a téxt wîth uneven     white   spaces       and \
    spëcial chàractèrs and WeIrd special Čharacte%%%%&&&&rs^#####"

### Lowercasing

In [3]:
text = text.lower()
text 

'thís ís a téxt wîth uneven     white   spaces       and     spëcial chàractèrs and weird special čharacte%%%%&&&&rs^#####'
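
`str.lower()` is enough for the text above. When the goal is caseless comparison across languages, Python also offers `str.casefold()`, which is more aggressive. A small sketch of the difference, using sample strings chosen here for illustration:

```python
samples = ["Straße", "STRASSE"]

# lower() keeps the German sharp s, so the two spellings still differ.
print([s.lower() for s in samples])     # ['straße', 'strasse']

# casefold() maps 'ß' to 'ss', so both spellings compare equal.
print([s.casefold() for s in samples])  # ['strasse', 'strasse']
```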

### Removing accents and diacritics

In Unicode, *combining characters* do not stand on their own. Instead, each one modifies the character that precedes it in the string, typically adding an accent or another diacritical mark.

The function `unicodedata.combining(char)` returns the canonical combining class of `char` as an integer. The value is `0` for characters that are not combining characters, so its truth value tells whether the given character is a combining character.

In [4]:
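# Some example characters; the last one is U+0308 COMBINING DIAERESIS,
# written as an escape because it is hard to see on its own.
characters = ['a', 'á', 'ˆ', '1', '\u0308']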

for char in characters:
    is_combining = unicodedata.combining(char)
    print(f"character: {char} is a combining character: {bool(is_combining)}")

character: a is a combining character: False
character: á is a combining character: False
character: ˆ is a combining character: False
character: 1 is a combining character: False
character: ̈ is a combining character: True
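
The precomposed 'á' is reported as non-combining because it is a single code point. The following sketch, with variable names picked for illustration, decomposes it with NFD and inspects each resulting code point; only the accent has a non-zero combining class.

```python
import unicodedata

precomposed = "\u00e1"  # 'á' as one code point
parts = unicodedata.normalize("NFD", precomposed)

# One code point before decomposition, two afterwards.
print(len(precomposed), len(parts))  # 1 2

for char in parts:
    # name() gives the official Unicode name; combining() gives the combining class.
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: {unicodedata.combining(char)}")
```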


In [5]:
# Normalization Form KD (NFKD) decomposes characters into their base and diacritic parts.
# For instance, it separates accented characters into their base character and the diacritical mark.
# Dropping the combining marks afterwards leaves only the base characters.
text = ''.join(char for char in unicodedata.normalize('NFKD', text)
               if not unicodedata.combining(char))

text

'this is a text with uneven     white   spaces       and     special characters and weird special characte%%%%&&&&rs^#####'
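
The same step can be wrapped in a reusable helper. `strip_accents` is a hypothetical name used for this sketch. Note that the approach only removes marks that Unicode can decompose: letters such as 'ø' and 'Ł' have no decomposition and pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Decompose with NFKD, then drop every combining character."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


print(strip_accents("čharactèrs"))  # characters

# 'ø' and 'Ł' are distinct letters without a decomposition, so they survive.
print(strip_accents("smørrebrød"))  # smørrebrød
print(strip_accents("Łódź"))        # Łodz
```

If such letters must also be mapped to ASCII, an explicit replacement table would be needed in addition to this step.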

### Removing special characters and punctuation

In [6]:
text = ''.join(char for char in text if char.isalnum() or char.isspace())
text 

'this is a text with uneven     white   spaces       and     special characters and weird special characters'
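
Deleting characters outright can glue neighboring words together when punctuation was the only separator. One possible variation, sketched here with a hypothetical `replace_special` helper, substitutes a space instead and leaves the cleanup to the whitespace step.

```python
def replace_special(text, replacement=" "):
    """Replace every character that is neither alphanumeric nor whitespace."""
    return "".join(
        char if char.isalnum() or char.isspace() else replacement
        for char in text
    )


sample = "state-of-the-art,fast"

# Deleting the punctuation merges the words.
print("".join(char for char in sample if char.isalnum() or char.isspace()))
# stateoftheartfast

# Replacing it with a space keeps the word boundaries.
print(replace_special(sample))
# state of the art fast
```

Which variant is preferable depends on the data: for the sample text above, deletion is what turns `characte%%%%&&&&rs` back into `characters`.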

### Whitespace Normalization

In [7]:
text = ' '.join(text.split())
text

'this is a text with uneven white spaces and special characters and weird special characters'
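
The four steps shown in this example can be combined into a single function. This is a minimal sketch: `normalize_text` is a hypothetical name, the step order simply mirrors the cells above, and the sample string is a shortened variant of the one used earlier.

```python
import unicodedata


def normalize_text(text):
    """Lowercase, strip diacritics, drop special characters, collapse whitespace."""
    # 1. Lowercasing
    text = text.lower()
    # 2. Removing accents and diacritics via NFKD decomposition
    text = "".join(
        char for char in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(char)
    )
    # 3. Removing special characters and punctuation
    text = "".join(char for char in text if char.isalnum() or char.isspace())
    # 4. Whitespace normalization
    return " ".join(text.split())


raw = "Thís ís a téxt   wîth  spëcial Čharacte%%&&rs^##"
print(normalize_text(raw))
# this is a text with special characters
```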