# 🧹 Text Preprocessing (Make Your Text Model-Ready!)

## 🔍 What is Text Preprocessing?
**🔹 Easy Meaning:**
Text preprocessing means cleaning and organizing your raw text data so that machines can understand it.

🧠 Think of it like washing vegetables before cooking. Clean text = Better results.

---

## 🎯 Why is It Important?
**🔹 Because raw text is:**
- Messy (symbols, emojis, punctuation)
- Inconsistent (capital vs small letters)
- Redundant (repeating words, stopwords)
- Hard for machines to understand

✨ Preprocessing helps your model focus on what really matters.

---

## 🪜 Common Text Preprocessing Steps

### ⚙️ 1. Lowercasing
**🔹 Meaning:** Convert all words to lowercase  
**🧠 Example:** “Hello” → “hello”

### ⚙️ 2. Remove Punctuation
**🔹 Meaning:** Delete symbols (! ? . ,)  
**🧠 Example:** “Wow!” → “Wow”

### ⚙️ 3. Tokenization
**🔹 Meaning:** Break text into words  
**🧠 Example:** “I love NLP” → ["I", "love", "NLP"]

### ⚙️ 4. Remove Stopwords
**🔹 Meaning:** Remove common words (the, is, at...)  
**🧠 Example:** “I love the NLP” → ["love", "NLP"]

### ⚙️ 5. Stemming
**🔹 Meaning:** Cut words to their root form  
**🧠 Example:** “Playing” → “play”

### ⚙️ 6. Lemmatization
**🔹 Meaning:** Convert to dictionary form  
**🧠 Example:** “Better” → “good”

### ⚙️ 7. Remove Numbers
**🔹 Meaning:** Remove unnecessary digits  
**🧠 Example:** “Python 3 is cool” → “Python is cool”

### ⚙️ 8. Remove URLs/Emails
**🔹 Meaning:** Clean links and emails  
**🧠 Example:** “Visit xyz.com” → “Visit”

### ⚙️ 9. Spell Correction
**🔹 Meaning:** Fix typos  
**🧠 Example:** “heello” → “hello”
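
Several of the steps above can be chained into one small function. The following is only a minimal sketch using the standard library: the function name `basic_clean` and the tiny `STOPWORDS` set are made up for illustration, and a real project would more likely rely on the stopword lists shipped with NLTK or spaCy.

```python
import re
import string

# Tiny illustrative stopword set (an assumption, not a standard list)
STOPWORDS = {"i", "the", "is", "a", "at", "to"}

def basic_clean(text):
    """Lowercase, drop URLs, digits and punctuation, tokenize, drop stopwords."""
    text = text.lower()                                    # step 1: lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # step 8: URLs only
    text = re.sub(r"\d+", " ", text)                       # step 7: numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 2
    tokens = text.split()                                  # step 3: tokenization
    return [t for t in tokens if t not in STOPWORDS]       # step 4: stopwords

print(basic_clean("I love the NLP course at https://example.com, it is great!"))
# ['love', 'nlp', 'course', 'it', 'great']
```

The order matters here: URLs are removed before punctuation, otherwise `https://example.com` would collapse into `httpsexamplecom` and no longer match the URL pattern.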

---

## 🛠️ Python Tools for Preprocessing

- `nltk` → Stopwords, tokenizing, stemming  
- `spaCy` → Lemmatization, POS tagging  
- `re` → Regex cleaning (remove links, numbers)  
- `ekphrasis` → For social media: emojis, hashtags, slang  
- `sklearn` → CountVectorizer and TfidfVectorizer include preprocessing options  

---

## 📌 Choose Steps Based on Your Task

**🗂️ Example-Based Tips:**
- Sentiment Analysis ➤ Keep "not", "never" (don’t remove all stopwords)
- Spam Detection ➤ Keep numbers, links — they’re useful!
- Topic Modeling ➤ Remove stopwords to discover main topics
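
To illustrate the sentiment analysis tip, here is a small sketch that removes stopwords but protects negation words. The word lists and the function name `remove_stopwords_keep_negations` are assumptions chosen for the example, not part of any library.

```python
# Small illustrative lists (assumptions); real stopword lists are much longer
STOPWORDS = {"the", "is", "a", "this", "was", "not", "never", "i"}
NEGATIONS = {"not", "never", "no"}

def remove_stopwords_keep_negations(text):
    """Drop stopwords but keep negation words, which can flip the sentiment."""
    to_drop = STOPWORDS - NEGATIONS
    return " ".join(w for w in text.split() if w.lower() not in to_drop)

print(remove_stopwords_keep_negations("this movie was not good"))
# movie not good
```

Without the `NEGATIONS` exception the result would be `movie good`, which reads as the opposite opinion.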

---

## ✅ Summary:
"Text Preprocessing turns noisy, messy human language into clean, structured input that machines can learn from."

🎉 Clean data = Smart Models

---

In [1]:
import pandas as pd

df = pd.read_csv('IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


# Lower Case
This is the first step in the preprocessing pipeline.


In [2]:
# Convert a single row to lowercase
df['review'][1].lower()

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [3]:
# Convert every row of the review column to lowercase
df['review']=df['review'].str.lower()
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. <br /><br />the... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
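
For text that is not purely English, `str.lower()` may not be enough to make two spellings compare equal. The short sketch below contrasts it with Python's built-in `str.casefold()`; the sample words are made up for illustration. pandas offers the same operation as `Series.str.casefold()`.

```python
words = ["Hello", "STRASSE", "Straße"]

print([w.lower() for w in words])     # 'Straße' keeps its 'ß'
print([w.casefold() for w in words])  # 'ß' is folded to 'ss'

# After casefolding, both spellings compare equal
print("Straße".casefold() == "STRASSE".casefold())
```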


# Remove the HTML Tags
Now we will use a regular expression to strip the tags.

In [4]:
df['review'][1]

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [5]:
# Import the re library, which provides regular expression support

import re

def remove_html_tags(text):
    # Non-greedy match of everything between '<' and '>'
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [6]:
# Assign the result back so the cleaned text is kept, then display it
df['review'] = df['review'].apply(remove_html_tags)
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
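
A tag-stripping regex leaves HTML entities such as `&amp;` untouched, and replacing a tag with an empty string can glue two words together. The sketch below is one possible variant that handles both cases with the standard library; the function name `strip_html` and the sample string are assumptions for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Remove tags, decode entities, and collapse the leftover whitespace."""
    no_tags = TAG_RE.sub(" ", text)      # a space keeps neighbouring words apart
    decoded = html.unescape(no_tags)     # e.g. '&amp;' becomes '&'
    return re.sub(r"\s+", " ", decoded).strip()

print(strip_html("a wonderful production.<br /><br />tom &amp; jerry"))
# a wonderful production. tom & jerry
```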

# Remove URLs

In [7]:
import re

def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub('', text)

In [8]:
# Now remove the URLs from the review column
df['review'] = df['review'].apply(remove_url)
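
The step list earlier also mentions emails. Here is a rough sketch that removes both URLs and email addresses; the email pattern is a deliberately simple assumption and will not cover every valid address, and the function name `remove_urls_and_emails` is made up for the example.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Simplified email pattern (assumption): local part, '@', domain with a dot
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def remove_urls_and_emails(text):
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    return " ".join(text.split())  # tidy up the double spaces left behind

print(remove_urls_and_emails("mail me at jane.doe@example.com or see www.example.com today"))
# mail me at or see today
```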

# Remove the Punctuation Marks
"Remove punctuation marks" means stripping out all non-alphanumeric symbols (like . , ! ? @ # $ %) from text data to clean and standardize it for analysis. This is a common step in Natural Language Processing (NLP) and text preprocessing.

In [9]:
import string

exclude = string.punctuation  # Note: this is string.punctuation (no underscore)

def remove_punc(text):
    for char in exclude:
        text = text.replace(char, '')
    return text

# Example usage
text = 'string. With. Punctuation!'

cleaned_text = remove_punc(text)
print(cleaned_text)  # Output: 'string With Punctuation'

# start = time.time()
# print(remove_punc(text))
# time1 = time.time() - start
# print(time1)

string With Punctuation


In [10]:
import time
import string

text = "string With Function from 0:0003:0004:0005:0006: + 00a + Markovon"

def remove_punc(text):
    return text.translate(str.maketrans('', '', string.punctuation))


#showing the total execution time
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1)

string With Function from 00003000400050006  00a  Markovon
0.0004169940948486328
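
A single `time.time()` measurement is noisy, so the two approaches are easier to compare with the standard `timeit` module on a longer input. This is a small benchmarking sketch; the sample text and the function names `remove_punc_loop` and `remove_punc_translate` are assumptions, and the timings will differ from machine to machine.

```python
import string
import timeit

sample = "string. With. Punctuation!" * 1000
TABLE = str.maketrans("", "", string.punctuation)  # built once, reused

def remove_punc_loop(text):
    # One full pass over the text for every punctuation character
    for char in string.punctuation:
        text = text.replace(char, "")
    return text

def remove_punc_translate(text):
    # A single pass using a translation table
    return text.translate(TABLE)

# Both approaches must give the same result
assert remove_punc_loop(sample) == remove_punc_translate(sample)

for func in (remove_punc_loop, remove_punc_translate):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print(f"{func.__name__}: {seconds:.4f} s for 100 runs")
```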


# Chat Word Treatment

In text preprocessing, "chat word treatment" refers to handling informal, colloquial, or slang words commonly used in chats, social media, and messaging platforms (e.g., "u" for "you," "brb" for "be right back," "lol" for "laugh out loud"). These words are non-standard and can affect NLP tasks if not normalized.

In [11]:
chat_words = {
    'U': 'you',
    'BRB': 'be right back',
    'LOL': 'laugh out loud',
    'FYI': 'for your information',
    'OMG': 'oh my god'
    # Add more chat word mappings as needed
}

def char_conversion(text):
    new_text = []
    
    for word in text.split():
        if word.upper() in chat_words:
            new_text.append(chat_words[word.upper()])
        else:
            new_text.append(word)
    return ' '.join(new_text)
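
Splitting on whitespace misses abbreviations that have punctuation attached, such as `lol!`. One possible workaround, sketched below, is to match whole words with a regular expression instead. The dictionary `demo_chat_words` and the function name `expand_chat_words` are assumptions for this example.

```python
import re

# Short illustrative mapping (assumption); extend as needed
demo_chat_words = {"U": "you", "BRB": "be right back", "LOL": "laugh out loud"}

def expand_chat_words(text):
    """Expand chat abbreviations even when punctuation is attached."""
    def replace(match):
        word = match.group(0)
        return demo_chat_words.get(word.upper(), word)
    # \b\w+\b matches each word without the surrounding punctuation
    return re.sub(r"\b\w+\b", replace, text)

print(expand_chat_words("brb, u there? lol!"))
# be right back, you there? laugh out loud!
```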


# 🔍 What is Ekphrasis?
Ekphrasis is a Python library used to clean and understand messy, casual, or social media text.

 **Imagine people type like this:**

idk what to do lol 😂

OMG I'm sooo tired rn!!!

visit http://example.com & check out!!!

These kinds of text are hard for machines to understand.
➡️ Ekphrasis helps convert them into proper, clean text.

In [12]:
%pip install ekphrasis


from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Initialize the text processor
text_processor = TextPreProcessor(
    normalize=['url', 'email', 'percent', 'money', 'time', 'date', 'number'],
    unpack_contractions=True,        # expands "can't" to "can not"
    spell_correct_elong=True,        # corrects elongated words like "soooo" → "so"
    annotate={"hashtag", "allcaps", "elongated", "repeated"},
    fix_html=True,
    segmenter="twitter",
    corrector="twitter",
    tokenizer=SocialTokenizer(lowercase=True).tokenize  # the tokenizer must be a callable
)

# Define a new function (an alternative to char_conversion)
def normalize_chat_text(text):
    processed = text_processor.pre_process_doc(text)
    return " ".join(processed)


In [None]:

# Example usage
text = "U should know this LOL"
converted_text = char_conversion(text)
print(converted_text)  # Output: "you should know this laugh out loud"

# Spelling correction
Spelling correction is the process of identifying and fixing misspelled words in text data to improve consistency and accuracy for NLP tasks.
A library commonly used for this purpose is TextBlob.


In [None]:
from textblob import TextBlob  

incorrect_text = 'ceertain conditions during seveal generations are modified in the same maner.'

# Initialize TextBlob object
text_blob = TextBlob(incorrect_text)

# Correct spelling and get the result
corrected_text = text_blob.correct()

print("Original text:", incorrect_text)
print("Corrected text:", corrected_text)
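
To get a feeling for how dictionary-based correction works, here is a toy sketch built on the standard `difflib` module. It is not how TextBlob works internally; the `VOCABULARY` list, the `cutoff` value, and the function name `correct_word` are assumptions picked for the example.

```python
import difflib

# Tiny known-word list (assumption); a real corrector uses a large dictionary
VOCABULARY = ["certain", "conditions", "during", "several", "generations", "manner"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest known word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print([correct_word(w) for w in ["ceertain", "seveal", "maner", "xyz"]])
# ['certain', 'several', 'manner', 'xyz']
```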

# Removing Stop Words
Stop words are common words (like "the", "and", "is") that are often filtered out from text data because they:

- Don't add meaningful information for many NLP tasks
- Increase noise without improving analysis
- Reduce processing efficiency (more words to process)



In [13]:
from nltk.corpus import stopwords

In [None]:
# The language name is lowercase; run nltk.download('stopwords') first if needed
stopwords.words('english')

In [None]:
from nltk.corpus import stopwords
import nltk

# Download stopwords data (only needed once)
nltk.download('stopwords')

def remove_stopwords(text):
    """
    Removes English stopwords from the input text.
    
    Args:
        text (str): Input text to process
        
    Returns:
        str: Text with stopwords removed
    """
    stop_words = set(stopwords.words('english'))
    new_text = []
    
    for word in text.split():
        if word.lower() not in stop_words:  # Check lowercase version
            new_text.append(word)
    
    return ' '.join(new_text)

# Example usage
sample_text = "this is an example sentence with some stop words"
print("Original:", sample_text)
print("Filtered:", remove_stopwords(sample_text))

# Handling Emojis

In [None]:
import re

# The re library is used for searching, matching,
# and manipulating text patterns using regular expressions (also called regex).
# Think of it as a powerful search tool: instead of looking for exact words, it looks for patterns in text.



def remove_emoji(text):
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'  # dingbats
        u'\U000024C2-\U0001F251'  # enclosed characters (this wide range also covers CJK and other non-emoji text)
        ']+', flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Example usage
sample_text = "Hello! 😊 This is a test 🚀 with emojis 🇺🇸 and symbols ✨"
print(f"Original: {sample_text}")
print(f"Cleaned: {remove_emoji(sample_text)}")

# Convert Emojis into Their Meanings

In [None]:
import emoji
print(emoji.demojize('Hello! 😊'))
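
If the `emoji` package is not available, a rough approximation is possible with the standard `unicodedata` module. The sketch below replaces every character in the Unicode category "Symbol, other" (which contains most emojis) with its official character name. The function name `describe_symbols` is an assumption, and the names it produces are not guaranteed to match the aliases used by `emoji.demojize`.

```python
import unicodedata

def describe_symbols(text):
    """Replace 'Symbol, other' characters (most emojis) with their Unicode names."""
    parts = []
    for char in text:
        if unicodedata.category(char) == "So":
            name = unicodedata.name(char, "unknown").lower().replace(" ", "_")
            parts.append(f":{name}:")
        else:
            parts.append(char)
    return "".join(parts)

# \U0001F680 is the rocket emoji
print(describe_symbols("launch \U0001F680"))
# launch :rocket:
```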

# Word Tokenization

**Word tokenization means:**

Breaking a sentence into individual words.

**Why is it important in NLP?**
Before a computer can understand text, it needs to break it down into parts (words).

**This helps in:**

- Sentiment analysis
- Emotion detection
- Text classification
- Machine translation

In [None]:
# Word tokenization
sent1 = "I am going to Delhi"
word_tokens = sent1.split()
print("Word tokens:", word_tokens)

In [None]:
# Sentence tokenization
sent2 = "I am going to Delhi. I will stay there for 3 days. Let's hope the trip to be great."
sentence_tokens = sent2.split('.')
print("Sentence tokens:", [s.strip() for s in sentence_tokens if s])  # Cleaned output

In [None]:
# Problems with split function
sent3 = "I   am  going   to  Delhi"  # Multiple spaces
print("Simple split:", sent3.split())  # Handles multiple spaces but not punctuation

sent4 = "I'm going to Delhi!"
print("Split with punctuation:", sent4.split())  # Doesn't separate punctuation
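
Before reaching for a library, the problems with `split()` can be partly addressed with regular expressions. The following is a simplified sketch, not a replacement for a real tokenizer: the function names `simple_word_tokenize` and `simple_sent_tokenize` are assumptions, and abbreviations such as "U.K." would still be split incorrectly by the sentence pattern.

```python
import re

def simple_word_tokenize(text):
    # A word (optionally with an apostrophe part) or a single punctuation mark
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(simple_word_tokenize("I'm going to Delhi!"))
# ["I'm", 'going', 'to', 'Delhi', '!']
print(simple_sent_tokenize("I am going to Delhi. I will stay there for 3 days."))
# ['I am going to Delhi.', 'I will stay there for 3 days.']
```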

# ✅ NLTK (Natural Language Toolkit)
**🟢 Easy Definition:**
NLTK is like a schoolbook for learning language processing in Python.

**📚 Use it when:**

- You want to learn how language works
- You want to try small text experiments
- You are a student or beginner



In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
nltk.download('punkt')  # Download required data (first time only); recent NLTK releases may ask for 'punkt_tab' instead

# Better word tokenization
print("NLTK word tokens:", word_tokenize(sent4))

# Better sentence tokenization
print("NLTK sentence tokens:", sent_tokenize(sent2))

# ✅ spaCy
**🔵 Easy Definition:**
spaCy is like a powerful tool made for real-world apps.

**🚀 Use it when:**

- You want to build real projects (like chatbots, apps)
- You need speed and accuracy
- You are working on a job or company-level project



In [None]:
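%pip install spacy
# The small English model must be downloaded once before loading it:
# !python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')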

# Example sentences
sent1 = "I am going to Delhi"
sent5 = "Apple is looking at buying U.K. startup for $1 billion"
sent6 = "This is a sample sentence for NLP processing"
sent7 = "The quick brown fox jumps over the lazy dog"

# Process each sentence with spaCy
doc1 = nlp(sent1)
doc2 = nlp(sent5)
doc3 = nlp(sent6)
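doc4 = nlp(sent7)

# Print the tokens spaCy found in each sentence
for doc in (doc1, doc2, doc3, doc4):
    print([token.text for token in doc])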

# Stemming
Stemming is a text normalization technique in Natural Language Processing (NLP) that reduces words to their base or root form (called a "stem"), mainly by stripping suffixes (some stemmers also strip prefixes). The resulting stem is not always a valid dictionary word.

**Suffixes and Prefixes in Linguistics**

Suffixes and prefixes are types of affixes (word parts added to a base word to modify its meaning or grammatical function).

In [3]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
def stem_words(text):
    # Join with a space so the stemmed words stay separated
    return ' '.join([ps.stem(word) for word in text.split()])

In [None]:
sample = 'walk walks walking  walked'
stem_words(sample)
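
To see why a stem is not always a real word, here is a deliberately naive suffix stripper. It is a toy illustration only and is much cruder than the Porter algorithm used above; the suffix list and the function name `toy_stem` are assumptions.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in "walk walks walking walked".split()])
# ['walk', 'walk', 'walk', 'walk']
print(toy_stem("running"))
# runn
```

The last line shows the typical weakness of blind suffix removal: `runn` is not a word, which is the gap that lemmatization tries to close.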

# Lemmatization 
Lemmatization is a technique in Natural Language Processing (NLP) that reduces words to their base or dictionary form (called a lemma), ensuring the result is a valid word. Unlike stemming (which crudely chops off suffixes), lemmatization **uses:**

- Morphological analysis (grammar rules)
- Vocabulary/dictionary lookup

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Initialize lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = set('?:!.,;')  # a set, so only whole single-character tokens match

# Tokenize and remove punctuation
sentence_words = word_tokenize(sentence)
sentence_words = [word for word in sentence_words if word not in punctuations]

# Print header
print("{:<15} {:<15}".format("Word", "Lemma"))

# Lemmatize each word and print results
for word in sentence_words:
    lemma = wordnet_lemmatizer.lemmatize(word, pos='v')  # Default is noun, 'v' for verbs
    print("{:<15} {:<15}".format(word, lemma))