### Text Preprocessing
Preprocessing is the first stage of an NLP pipeline and one of the most important. Raw text goes through several steps before it is converted into a format that a model can process. These steps clean, correct, and shape the data, which helps improve the accuracy and efficiency of the model.


![text-preprocessing](../images/1/1-text-preprocessing.png)

---

#### Text Preprocessing Steps:

- Lowercasing: 
  * All letters in the text are converted to lowercase. This prevents words like "Dog" and "dog" from being treated as different entities despite having the same meaning.

- Removing Special Characters and Numbers: 
  * Unnecessary special characters (such as "@", "#", "%") and numbers that do not aid the model’s understanding are removed. This step cleans up elements that do not change the meaning of the text.

- Removing Stop Words: 
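  * Stop words are very common words that carry little meaning on their own (such as "and", "a", "the", "for"). They are often removed in applications that treat text as a bag of words, such as search or topic modeling. Tasks that depend on word order or negation (for example, "not" in sentiment analysis) may need to keep them.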

- Whitespace Removal:  
  * Extra spaces, especially unnecessary spaces between sentences, are removed to ensure the text is processed correctly.

- Punctuation Removal: 
  * Punctuation marks (., ?, !, :) are sometimes removed because they add little meaning during analysis. In some cases, however, they carry information (a question mark signals a question, and a period marks the end of a sentence), so they should be removed only when the task does not depend on them.

- Spelling Correction:
  * Spelling errors in the text are corrected to obtain cleaner and more accurate data. In NLP applications, automatic spell checkers (e.g., TextBlob or Hunspell) are used to correct misspelled words.

- Removing HTML Tags and URLs:
  * HTML tags (such as \<p>, \<div>, \<br>) and URLs (e.g., www.site.com) are removed from the text to obtain clean content.

- Tokenization: 
  * Tokenization is the process of splitting the text into meaningful units (words, sentences). For example, the sentence "Dogs are very cute" would be split into tokens like "Dogs", "are", "very", "cute". This makes processing and understanding each token easier.

- Stemming:
  * Stemming reduces inflected words to a common stem by stripping suffixes with simple rules. Because it does not consult a dictionary, the resulting stem is not always a valid word.
  
    For example, removing the "ing" suffix from "Going" gives the stem "Go," while a Porter stemmer reduces "studies" to "studi."

- Lemmatization:
  * Lemmatization reduces a word to its dictionary form (lemma) using a vocabulary and morphological analysis, often guided by the word's part of speech in the sentence. Unlike stemming, which simply strips suffixes, lemmatization takes the meaning of the word into account and returns a valid base form.
    
    For example, "went" is lemmatized to "go," because "went" is the past tense of "go."

- Language Normalization: 
  * Texts in different languages may undergo language-specific transformations, such as removing accent marks or normalizing letter variations.

- TF-IDF (Term Frequency-Inverse Document Frequency) Calculation: 
  * This method weights each word by how often it appears in a document (TF) and how rare it is across the whole collection of documents (IDF). Words that are frequent in one document but rare elsewhere receive the highest scores, which highlights the terms most characteristic of that document.
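
The TF-IDF weighting from the last step can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the documents are already tokenized, and it uses the unsmoothed variant `tf = count / document length` and `idf = log(N / document frequency)`. Libraries often apply smoothed or normalized variants, so their numbers will differ. The function name `tf_idf` and the sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / len(doc)              # relative frequency in this document
            idf = math.log(n_docs / df[term])  # 0 when the term is in every document
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


# Hypothetical, already tokenized documents.
docs = [
    ["dogs", "are", "cute"],
    ["cats", "are", "cute"],
    ["dogs", "chase", "cats"],
]
scores = tf_idf(docs)

# "chase" appears in only one document, so it outranks "dogs" in the third one.
print(sorted(scores[2].items(), key=lambda item: item[1], reverse=True))
```

A term that occurs in every document gets a score of zero under this variant, which is one reason smoothed formulas are common in practice.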

These steps are especially important when working with machine learning models and deep learning algorithms. Properly preprocessed data gives the model cleaner and more consistent input, which lets it train more efficiently and often improves overall accuracy.
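
Several of the cleaning steps listed above can be chained into one small function. The sketch below uses only the Python standard library and rests on some assumptions: the stop-word list is a tiny hand-written sample, the regular expressions for HTML tags and URLs are deliberately naive, and tokenization is a plain whitespace split. Spelling correction, stemming, and lemmatization are left out because they usually rely on dedicated libraries. The names `preprocess`, `strip_accents`, and `STOP_WORDS` are illustrative.

```python
import html
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "for", "is", "are", "of", "to"}


def strip_accents(text):
    """Language normalization: drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    text = html.unescape(text)                         # decode entities such as "&amp;"
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags (naive pattern)
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs (naive pattern)
    text = text.lower()                                # lowercasing
    text = strip_accents(text)                         # accent removal
    text = re.sub(r"[^a-z\s]", " ", text)              # drop digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text).strip()           # collapse extra whitespace
    tokens = text.split()                              # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


sample = "<p>The Dogs are   VERY cute!!! Visit www.site.com for 100% more café photos.</p>"
print(preprocess(sample))
# ['dogs', 'very', 'cute', 'visit', 'more', 'cafe', 'photos']
```

The order of the steps matters in this sketch: tags and URLs are removed before punctuation, since stripping punctuation first would break the patterns that recognize them.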