
Case Folding
* Case folding is a text normalization technique in Natural Language Processing (NLP) where all characters in a text are converted to a single case (usually lowercase).
* Reduces Vocabulary Size (words like Apple, APPLE and apple are treated as the same token)
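* Improves Consistency in Text Processing (ensures that algorithms do not treat uppercase and lowercase versions of the same word as different entities)
* Helps in Better Feature Extraction (occurrences of a word are counted under one feature instead of being split across its case variants)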
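
The vocabulary-size effect can be checked on a small sample. This is a minimal sketch: the token list and the variable names are made up for illustration.

```python
# Hypothetical token list with several case variants of the same word
tokens = ["Apple", "APPLE", "apple", "Pie", "pie"]

# Vocabulary = set of distinct tokens, before and after case folding
vocab_before = set(tokens)
vocab_after = {token.lower() for token in tokens}

print(len(vocab_before))  # 5
print(len(vocab_after))   # 2
```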

In [None]:
text = 'Hello, And Welcome'
text

'Hello, And Welcome'

In [None]:
x = text.casefold()
x

'hello, and welcome'

In [None]:
y = text.lower()
y

'hello, and welcome'
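
For plain ASCII text such as the string above, casefold() and lower() return the same result. They can differ on some non-ASCII characters, because casefold() applies the more aggressive Unicode case folding meant for caseless comparison. A small sketch with the German word 'Straße' (an example chosen here, not taken from the cells above):

```python
word = "Straße"

lowered = word.lower()     # 'straße'  (ß is already lowercase, so it stays)
folded = word.casefold()   # 'strasse' (ß is folded to 'ss')

# Caseless comparison against an all-caps spelling
print(lowered == "STRASSE".lower())     # False
print(folded == "STRASSE".casefold())   # True
```

If the goal is matching words regardless of case, casefold() is usually the safer choice; lower() is enough when the text is known to be ASCII.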

Special Character Removal
* Special character removal is a text preprocessing step where non-alphanumeric symbols (e.g., ! @ # $ % ^ & * ( )), punctuation marks (e.g., .,;:?!), and other irrelevant characters are stripped from the text.
* Reduces Noise in Data
* Simplifies Tokenization
* Improves Model Efficiency

In [None]:
import re

In [None]:
# input string
input_str = 'hello how@ are$ you!!'
# use a regular expression to remove special characters
clean_str = re.sub(r"[^a-zA-Z0-9\s]", "", input_str)

In [None]:
clean_str

'hello how are you'

* re.sub(pattern, replacement, string) replaces every match of the pattern in the string
* r"[^a-zA-Z0-9\s]" : the regex pattern to match
* r denotes a raw string (so backslashes such as \s do not need extra escaping)
* [^...] is a negated character class (matches any character not listed inside)
* a-zA-Z0-9 : matches ASCII alphanumeric characters (letters and digits)
* \s matches whitespace (spaces, tabs, newlines)

=> Match any character that is not alphanumeric or whitespace

* "" : replacement string (empty, meaning matched characters are deleted)
* input_str : the input string to process
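
One consequence of listing a-zA-Z0-9 explicitly is that accented letters count as "special" and are deleted. The sketch below compares that pattern with a Unicode-aware alternative built on \w. The sample string and the names ASCII_ONLY and UNICODE_AWARE are assumptions for illustration; note that \w also keeps the underscore.

```python
import re

# Pattern from the cell above: anything outside ASCII letters, digits, whitespace
ASCII_ONLY = re.compile(r"[^a-zA-Z0-9\s]")
# Alternative: anything that is not a Unicode word character or whitespace
UNICODE_AWARE = re.compile(r"[^\w\s]")

sample = "café costs 5€, naïve_user!"

ascii_result = ASCII_ONLY.sub("", sample)
unicode_result = UNICODE_AWARE.sub("", sample)

print(ascii_result)    # caf costs 5 naveuser
print(unicode_result)  # café costs 5 naïve_user
```

Which pattern fits depends on the data: the ASCII version is predictable for English-only text, while the \w version preserves letters from other scripts.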


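## Libraries in the field of NLP
spaCy
* Natural language processing library in Python that can be used to tokenize and process textual data

NLTK
* Natural Language Toolkit, a Python library that provides tokenizers and other text-processing tools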

In [None]:
import spacy

In [None]:
# load the spaCy model
nlp = spacy.load('en_core_web_sm')

This loads spaCy's pre-trained English language model, which provides:
* Tokenization rules
* Linguistic annotations

In [None]:
# input string
input_str = 'hello how@ are$ you!!'

In [None]:
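# function to clean the string
def clean_text(text):
  # keep letters and whitespace; digits and symbols are dropped
  cleaned_text = ''.join(char for char in text if char.isalpha() or char.isspace())
  # tokenize with spaCy, then rejoin the token texts with spaces
  doc = nlp(cleaned_text)
  return ' '.join(token.text for token in doc)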

Purpose of the function:
* Removes all non-alphabetic characters (except whitespace) from the input string. Digits are removed too, unlike in the regex approach above.
* char.isalpha() keeps alphabetic characters (any Unicode letter, not only a-z and A-Z)
* char.isspace() preserves whitespace
* ''.join(...) combines the filtered characters back into a string

* doc = nlp(cleaned_text) : processes the cleaned string using spaCy's nlp pipeline (tokenizes the text into words/punctuation and adds linguistic annotations)

* return ' '.join(token.text for token in doc) : reconstructs the processed text by joining spaCy tokens with spaces
* token.text : gets the text of each token
* ' '.join(token.text for token in doc) combines tokens into a single string separated by spaces
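
The character filter can be tried on its own, without spaCy. This sketch contrasts the isalpha() filter used in clean_text with an isalnum() variant that also keeps digits. The helper names keep_letters and keep_letters_and_digits and the sample string are hypothetical.

```python
def keep_letters(text):
    # Same character filter as in clean_text: letters and whitespace only
    return ''.join(ch for ch in text if ch.isalpha() or ch.isspace())

def keep_letters_and_digits(text):
    # Variant (an assumption, not from the notebook) that also keeps digits
    return ''.join(ch for ch in text if ch.isalnum() or ch.isspace())

sample = "room 42 is café-side!"

print(keep_letters(sample))             # room  is caféside  (two spaces where 42 was)
print(keep_letters_and_digits(sample))  # room 42 is caféside
```

Dropping characters can leave doubled spaces behind; ' '.join(text.split()) is one way to collapse them if that matters downstream.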

In [None]:
# getting the final output
clean_str = clean_text(input_str)
clean_str

'hello how are you'

In [33]:
import nltk
# download the tokenizer data used by word_tokenize
nltk.download('punkt_tab', quiet=True)
input_str = 'hello how$$ are# you!!'

In [34]:
# Tokenization
tokens = nltk.word_tokenize(input_str)
tokens

['hello', 'how', '$', '$', 'are', '#', 'you', '!', '!']

In [36]:
# remove special characters: keep only tokens made up entirely of letters and digits
clear_tokens = [token for token in tokens if token.isalnum()]
clear_tokens

['hello', 'how', 'are', 'you']

In [38]:
# final output string
clear_str = ' '.join(clear_tokens)
clear_str

'hello how are you'
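
Filtering with token.isalnum() drops a whole token as soon as it contains one non-alphanumeric character. That is what removes '$' and '!' above, but it would also remove tokens that mix letters and symbols. The sketch below uses a hand-written token list in place of tokenizer output (an assumption, so that it runs without NLTK), and strip_symbols is a hypothetical helper that cleans inside each token instead.

```python
import re

# Hand-written tokens standing in for tokenizer output (illustrative only)
tokens = ["state-of-the-art", "models", "cost", "$", "5", "!"]

# Whole-token filter, as in the cell above
whole_token = [t for t in tokens if t.isalnum()]

def strip_symbols(token_list):
    """Remove non-alphanumeric characters inside each token, drop empties."""
    stripped = (re.sub(r"[^a-zA-Z0-9]", "", t) for t in token_list)
    return [t for t in stripped if t]

print(whole_token)            # ['models', 'cost', '5']
print(strip_symbols(tokens))  # ['stateoftheart', 'models', 'cost', '5']
```

Neither behaviour is right in general: the whole-token filter loses the hyphenated word, while the per-token cleanup glues its parts together.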