# 📌 Cleaning Tasks:
You need to apply the following modern NLP preprocessing techniques:

1️⃣ Lowercasing: Convert the text to lowercase while keeping important words standardized (e.g., "NLP" should remain unchanged).
2️⃣ Removing Special Characters & Emojis: Strip out emojis (🤖📚🚀) and unnecessary punctuation (!, ..., etc.).
3️⃣ Removing Stopwords: Eliminate common words that do not contribute much meaning (e.g., "and", "but", "also").
4️⃣ Lemmatization: Reduce words to their base forms (e.g., "learning" → "learn").
5️⃣ Spelling Normalization: Ensure all variations of NLP are standardized to "NLP".
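
Steps 1 and 5 pull in opposite directions: a plain `.lower()` would turn "NLP" into "nlp". One way to reconcile them, sketched here with only the standard library, is to lowercase word by word and map any casing of a protected term back to its canonical form. The function name `lowercase_keep_terms` and its `protected` parameter are hypothetical names chosen for this illustration, not part of any library.

```python
import re

def lowercase_keep_terms(text, protected=("NLP",)):
    """Lowercase text, but restore protected terms to their canonical casing."""
    # Map the lowercase form of each protected term to its canonical spelling.
    canonical = {term.lower(): term for term in protected}

    def fix(match):
        word = match.group(0).lower()
        # Protected terms get their canonical form; other words stay lowercase.
        return canonical.get(word, word)

    # \w+ matches runs of letters, digits, and underscores; spaces,
    # punctuation, and emojis are left untouched by this step.
    return re.sub(r"\w+", fix, text)

print(lowercase_keep_terms("Hello! NLP can be written as NLP, Nlp, or nlp."))
# → hello! NLP can be written as NLP, NLP, or NLP.
```

Doing both steps in one pass avoids a separate "re-uppercase" stage later, at the cost of having to list every protected term up front.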

# 📌 Recommended Tools & Techniques:
You should use modern NLP libraries to clean the text efficiently:

spaCy → For tokenization, stopword removal, and lemmatization.
Hugging Face Tokenizer → For advanced token processing.
Regular Expressions (RegEx) → For removing emojis and special characters.
Custom Normalization Rules → To standardize words like "Nlp" to "NLP".
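
As a rough illustration of the RegEx approach, the sketch below removes emojis and punctuation but keeps apostrophes, so that a contraction such as "I'm" survives as one unit for the tokenizer. The emoji ranges are an assumption that covers the common pictograph blocks and is not an exhaustive definition of "emoji"; `strip_noise`, `EMOJI_RE`, and `PUNCT_RE` are hypothetical names.

```python
import re

# Assumed emoji ranges: U+1F300..U+1FAFF (most pictographs) and
# U+2600..U+27BF (older symbols and dingbats). Not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Anything that is not a word character, whitespace, or an apostrophe.
PUNCT_RE = re.compile(r"[^\w\s']")

def strip_noise(text):
    """Replace emojis and punctuation with spaces, then collapse whitespace."""
    text = EMOJI_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(text.split())

print(strip_noise("Hello!!! I'm learning NLP...🤖 This field 📚 is evolving. 🚀"))
# → Hello I'm learning NLP This field is evolving
```

Replacing with a space rather than an empty string keeps neighbouring words from fusing together, and the final `split`/`join` removes the double spaces that would otherwise show up as whitespace tokens.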

# 📌 Your Task:
1️⃣ Implement the cleaning steps and return the final cleaned text.
2️⃣ If you get stuck, tell me where you need help, and I’ll guide you.

🚀 Let’s see your cleaned version!

In [11]:
import re
import spacy
from transformers import AutoTokenizer

In [14]:
uncleaned_data = "Hello!!! I'm learning Natural Language Processing (NLP)...🤖 This field 📚 is evolving very fast. However, some words (and, with, also, but) might be unnecessary! Also, different spelling variations exist; for example, NLP can be written as NLP, Nlp, or nlp. We need to normalize this as well. 🚀"
# Load spaCy's small English pipeline (tokenization, stopword flags, lemmatization)
nlp = spacy.load("en_core_web_sm")
# Hugging Face tokenizer for the uncased BERT model (used below to join tokens back into a string)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [15]:
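# Lowercase, then drop every character that is not a word character or whitespace.
# This removes punctuation and emojis, but also the apostrophe, so "I'm" becomes "im".
new_data = re.sub(r'[^\w\s]', '', uncleaned_data.lower())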
print(new_data)

hello im learning natural language processing nlp this field  is evolving very fast however some words and with also but might be unnecessary also different spelling variations exist for example nlp can be written as nlp nlp or nlp we need to normalize this as well 


In [18]:
doc = nlp(new_data)
filtered_tokens = [token.lemma_ for token in doc if not token.is_stop]
print(filtered_tokens)

['hello', 'm', 'learn', 'natural', 'language', 'processing', 'nlp', 'field', ' ', 'evolve', 'fast', 'word', 'unnecessary', 'different', 'spelling', 'variation', 'exist', 'example', 'nlp', 'write', 'nlp', 'nlp', 'nlp', 'need', 'normalize']
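
The list above still contains two artifacts: a whitespace-only token `' '` (left where the emoji after "field" was removed) and a stray `'m'` (apparently what remains of "I'm" once the apostrophe is stripped). A minimal way to filter them out afterwards is sketched below; `drop_artifacts` and its `min_length` threshold are hypothetical, and a length cutoff would also drop any legitimate one-character tokens.

```python
# A shortened copy of the lemma list printed above, used as sample input.
filtered_tokens = ['hello', 'm', 'learn', 'natural', 'language', 'processing',
                   'nlp', 'field', ' ', 'evolve', 'fast']

def drop_artifacts(tokens, min_length=2):
    """Remove whitespace-only tokens and fragments shorter than min_length."""
    return [tok for tok in tokens if len(tok.strip()) >= min_length]

print(drop_artifacts(filtered_tokens))
# → ['hello', 'learn', 'natural', 'language', 'processing', 'nlp', 'field', 'evolve', 'fast']
```

Inside the spaCy list comprehension itself, checking `token.is_space` alongside `token.is_stop` would catch the whitespace token one step earlier.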


In [19]:
# Restore the canonical spelling "NLP" for every casing variant of the token
normalize_tokens = ['NLP' if token.lower() == 'nlp' else token for token in filtered_tokens]

In [None]:
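# Drop whitespace-only tokens, then join the remaining tokens into the final cleaned string
cleaned_text = tokenizer.convert_tokens_to_string([token for token in normalize_tokens if token.strip()])
print(cleaned_text)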