# NLP Preprocessing Techniques

This notebook provides a hands-on guide for essential preprocessing steps in Natural Language Processing (NLP), inspired by the NLP course playlist on YouTube. Each section includes explanations, examples, and code snippets to help you understand the concepts and apply them in real-world projects.

## Table of Contents

1. [Tokenization](#tokenization)
2. [Stop Words](#stop-words)
3. [Stemming](#stemming)
4. [Lemmatization](#lemmatization)
5. [Regular Expressions](#regular-expressions)
6. [Part of Speech (POS) Tagging](#pos-tagging)
7. [Named Entity Recognition (NER)](#ner)
8. [Spell Checking](#spell-checking)

## 1. Tokenization <a name="tokenization"></a>

Tokenization splits text into smaller units (tokens), such as sentences or words. It is usually the first step in an NLP pipeline, for tasks such as **sentiment analysis**, because later steps operate on tokens rather than on raw strings.
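
To make the idea concrete without any library, here is a minimal sketch that tokenizes with the standard `re` module. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the patterns are a rough approximation, not the algorithm NLTK uses.

```python
import re

def simple_word_tokenize(text):
    # A token is a word (optionally with an inner apostrophe, e.g. "Let's")
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ? (naive: breaks on "Dr. Smith").
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "NLP is fascinating. It helps computers understand human language!"
print(simple_sent_tokenize(text))
print(simple_word_tokenize(text))
```

Real tokenizers handle abbreviations, contractions, and other edge cases that these two patterns ignore.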

### Example:

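```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer models, needed once (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

text = "NLP is fascinating. It helps computers understand human language!"
sentences = sent_tokenize(text)  # list of sentence strings
words = word_tokenize(text)      # list of word and punctuation tokens
print(sentences)
print(words)
```

The same functions can also be called directly on the `nltk` module:

```python
import nltk

paragraph = "welcome our students? this is the first day in NLP course , let's start"
words = nltk.word_tokenize(paragraph)
sentences = nltk.sent_tokenize(paragraph)
print(words)
```

Output:

```
['welcome', 'our', 'students', '?', 'this', 'is', 'the', 'first', 'day', 'in', 'NLP', 'course', ',', 'let', "'s", 'start']
```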


## 2. Stop Words <a name="stop-words"></a>
Stop words are very common words (for example "the", "is", "in") that carry little meaning on their own. Removing them shrinks the vocabulary, which can speed up training and reduce noise.

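```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

paragraph = "welcome our students? this is the first day in NLP course , let's start"
words = word_tokenize(paragraph)
english_stopwords = set(stopwords.words('english'))
# Compare in lowercase because the stop word list is all lowercase
filtered_words = [word for word in words if word.lower() not in english_stopwords]
print(len(words))           # number of tokens before filtering
print(len(filtered_words))  # number of tokens after filtering
```

Output:

```
16
11
```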
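
One caveat worth illustrating: NLTK's English list includes negations such as "not", and dropping them can flip the meaning of a sentence in tasks like sentiment analysis. The sketch below uses a small hand-written stop word set (an illustrative subset, not the NLTK list) and a hypothetical `keep` parameter to show one way of protecting such words.

```python
# Illustrative subset only; the real NLTK list is much longer.
STOP_WORDS = {"the", "is", "a", "this", "not", "in", "our"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep=frozenset()):
    # Drop a token if it is a stop word, unless it is explicitly protected.
    return [t for t in tokens
            if t.lower() not in STOP_WORDS or t.lower() in keep]

tokens = "This movie is not good".split()
print(remove_stop_words(tokens))                  # ['movie', 'good']
print(remove_stop_words(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

Whether to keep negations depends on the task; for topic classification they rarely matter, while for sentiment they often do.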




## 3. Stemming <a name="stemming"></a>
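Stemming reduces words to a root form by stripping suffixes with fixed rules. For instance, "running" becomes "run". This groups related word forms together, although the resulting stem is not always a real word (for example, "every" becomes "everi").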

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

paragraph = '''Running every day is a good habit.
Some people run for fitness, while others run for fun.
Running can be challenging,
but the health benefits of being a runner are undeniable.'''

stemmer = PorterStemmer()
words_before = word_tokenize(paragraph)
# Stem every token (the Porter stemmer also lowercases its input)
words_after = [stemmer.stem(word) for word in words_before]
print(words_after)
```

Output:

```
['run', 'everi', 'day', 'is', 'a', 'good', 'habit', '.', 'some', 'peopl', 'run', 'for', 'fit', ',', 'while', 'other', 'run', 'for', 'fun', '.', 'run', 'can', 'be', 'challeng', ',', 'but', 'the', 'health', 'benefit', 'of', 'be', 'a', 'runner', 'are', 'undeni', '.']
```
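
To show what "stripping suffixes with rules" means in practice, here is a deliberately crude sketch. The function `toy_stem` and its suffix list are invented for this illustration; it is not the Porter algorithm, which applies many more rules in several passes.

```python
SUFFIXES = ("ing", "ly", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Running", "runs", "played", "quickly", "day"]:
    print(w, "->", toy_stem(w))
```

The toy version turns "Running" into "runn", while the Porter stemmer has an extra rule that collapses the doubled consonant, which is how "running" ends up as "run" in the output above.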


Lemmatization, covered next, relies on the WordNet data, which can be downloaded once with:

```python
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')
```

## 4. Lemmatization <a name="lemmatization"></a>
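Lemmatization is similar to stemming, but it maps each word to a valid dictionary form (its lemma), for example "mice" to "mouse". It relies on a vocabulary and on the word's part of speech rather than on suffix rules alone. NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is given, which is why the verbs in the example below are left unchanged.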

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary resources
nltk.download('wordnet')
nltk.download('punkt')

# Create the lemmatizer and, for comparison, a stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Sample text
text = "The cats are chasing the mice and the dog is barking."
# Tokenize the text
words = nltk.word_tokenize(text)

# Apply lemmatization and stemming to the same tokens
lemmas = [lemmatizer.lemmatize(word) for word in words]
stems = [stemmer.stem(word) for word in words]

# Print original tokens, their lemmas, and their stems
print("Original words:", words)
print("Lemmatized words:", lemmas)
print("Stemmed words:", stems)
```

Output:

```
Original words: ['The', 'cats', 'are', 'chasing', 'the', 'mice', 'and', 'the', 'dog', 'is', 'barking', '.']
Lemmatized words: ['The', 'cat', 'are', 'chasing', 'the', 'mouse', 'and', 'the', 'dog', 'is', 'barking', '.']
Stemmed words: ['the', 'cat', 'are', 'chase', 'the', 'mice', 'and', 'the', 'dog', 'is', 'bark', '.']
```
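
The role of the part of speech can be sketched with a tiny lookup table. Everything here (`LEMMA_TABLE`, `toy_lemmatize`) is hypothetical and only mimics the shape of a dictionary-based lemmatizer; the real `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument in a similar way, with noun as the default.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("cats", "n"): "cat",
    ("mice", "n"): "mouse",
    ("are", "v"): "be",
    ("chasing", "v"): "chase",
    ("barking", "v"): "bark",
}

def toy_lemmatize(word, pos="n"):
    # Look the word up under the given part of speech;
    # return it unchanged if there is no entry.
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("mice"))              # mouse
print(toy_lemmatize("chasing"))           # chasing (no noun entry)
print(toy_lemmatize("chasing", pos="v"))  # chase
```

In a real pipeline the part of speech would typically come from a POS tagger rather than being passed by hand.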




## 5. Regular Expressions <a name="regular-expressions"></a>
Regular expressions help match patterns in text, such as extracting emails or phone numbers, and cleaning data.

```python
import re

# Sample paragraph
text = """
On September 22, 2024, John Doe visited the beautiful city of Paris.
He spent $150.75 on souvenirs and $75.50 on meals.
During his trip, he posted about his experiences on Twitter: @JohnDoe.
He loved visiting the Eiffel Tower and took many photos.
For more updates, check his blog at www.johndoetravels.com.
He also plans to visit London next month!
"""

# 1. Extract dates written as "<word> <day>, <year>"
date_pattern = r"\b\w+\s\d{1,2},\s\d{4}\b"
dates = re.findall(date_pattern, text)

# 2. Extract monetary amounts with exactly two decimals
money_pattern = r"\$\d+\.\d{2}"
money_amounts = re.findall(money_pattern, text)

# 3. Extract mentions
mention_pattern = r"@\w+"
mentions = re.findall(mention_pattern, text)

# 4. Extract URLs; the final [\w/] keeps sentence punctuation out of the match
url_pattern = r"(?:https?://|www\.)\S*[\w/]"
urls = re.findall(url_pattern, text)

# 5. Extract place names from a fixed list
place_pattern = r"\b(?:Paris|Eiffel Tower|London)\b"
places = re.findall(place_pattern, text)

# Print results
print("Dates:", dates)
print("Monetary Amounts:", money_amounts)
print("Mentions:", mentions)
print("URLs:", urls)
print("Places:", places)
```

Output:

```
Dates: ['September 22, 2024']
Monetary Amounts: ['$150.75', '$75.50']
Mentions: ['@JohnDoe']
URLs: ['www.johndoetravels.com']
Places: ['Paris', 'Eiffel Tower', 'London']
```
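
The example above covers extraction; the cleaning side can be sketched with `re.sub`. The function name `clean_text` and the choice of what to strip are assumptions made for this illustration, and a real project would adapt the patterns to its own data.

```python
import re

def clean_text(text):
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                     # drop @mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)           # drop punctuation and symbols
    # Collapse runs of whitespace, trim the ends, and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text("Check www.example.com NOW!!  Thanks @JohnDoe :)"))
# check now thanks
```

The order matters: URLs and mentions are removed before punctuation, otherwise stripping "." and "@" first would leave fragments such as "www example com" behind.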


## 6. Part of Speech (POS) Tagging <a name="pos-tagging"></a>
POS tagging labels each word in a sentence with its corresponding part of speech (e.g., noun, verb, adjective). This is useful in various NLP tasks like translation or text summarization.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
sentence = 'lets study machine learning before playing football karim!'
doc = nlp(sentence)

# Collect the tokens that were tagged as nouns
nouns = [token for token in doc if token.pos_ == 'NOUN']
print(nouns)

# Print every token with its coarse part-of-speech tag
for token in doc:
    print(f'{token.text} ==== {token.pos_}')
```

Output:

```
[machine, football, karim]
lets ==== PROPN
study ==== VERB
machine ==== NOUN
learning ==== VERB
before ==== ADP
playing ==== VERB
football ==== NOUN
karim ==== NOUN
! ==== PUNCT
```
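
Once tokens are tagged, the tags can be used like any other feature. As a small sketch, the block below hard-codes the (word, tag) pairs from the output above, so it runs without spaCy, then counts the tags and filters the verbs. The variable names are chosen for this illustration.

```python
from collections import Counter

# (word, tag) pairs copied from the tagger output above
tagged = [
    ("lets", "PROPN"), ("study", "VERB"), ("machine", "NOUN"),
    ("learning", "VERB"), ("before", "ADP"), ("playing", "VERB"),
    ("football", "NOUN"), ("karim", "NOUN"), ("!", "PUNCT"),
]

tag_counts = Counter(tag for _, tag in tagged)
verbs = [word for word, tag in tagged if tag == "VERB"]

print(tag_counts["VERB"], tag_counts["NOUN"])  # 3 3
print(verbs)  # ['study', 'learning', 'playing']
```

With a live spaCy pipeline, the same list could be built as `[(t.text, t.pos_) for t in doc]`.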


## 7. Named Entity Recognition (NER) <a name="ner"></a>
NER identifies key entities like people, locations, and dates within text. This helps extract structured information from unstructured text.

```python
import spacy

text = "On April 5, 2021, Apple Inc. launched a new iPhone in San Francisco."
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)
# Print each entity with its label and a short explanation of the label
for ent in doc.ents:
    print(f'entity: {ent.text} || label: {ent.label_} || explanation: {spacy.explain(ent.label_)}')
```

Output:

```
entity: April 5, 2021 || label: DATE || explanation: Absolute or relative dates or periods
entity: Apple Inc. || label: ORG || explanation: Companies, agencies, institutions, etc.
entity: iPhone || label: ORG || explanation: Companies, agencies, institutions, etc.
entity: San Francisco || label: GPE || explanation: Countries, cities, states
```

Note that the small model labels "iPhone" as an organization, so NER output is worth checking rather than trusting blindly.
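
To illustrate the "structured information" part, the sketch below groups entities by label into a dictionary. It hard-codes the (text, label) pairs from the output above so that it runs without spaCy; `group_by_label` is a helper invented for this example.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the NER output above
entities = [
    ("April 5, 2021", "DATE"),
    ("Apple Inc.", "ORG"),
    ("iPhone", "ORG"),
    ("San Francisco", "GPE"),
]

def group_by_label(pairs):
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

record = group_by_label(entities)
print(record)
# {'DATE': ['April 5, 2021'], 'ORG': ['Apple Inc.', 'iPhone'], 'GPE': ['San Francisco']}
```

With a live spaCy pipeline, the input list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.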


## 8. Spell Checking <a name="spell-checking"></a>
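Spell checking identifies and corrects spelling mistakes in text, which helps keep noisy text data consistent. Automatic correction is not perfect: in the example below, some misspellings are left unchanged and one is changed to a different word than intended.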

```python
from textblob import TextBlob

text = 'wlcome in egpt ths iss the wrng txt'
corrected = TextBlob(text).correct()
print(corrected)
```

Output:

```
welcome in egypt the iss the wrong txt
```
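
The basic idea behind many spell checkers, picking the known word most similar to the misspelled one, can be sketched with the standard library's `difflib`. The vocabulary `VOCAB`, the helper `correct_word`, and the `cutoff` value are assumptions for this illustration; this is not how TextBlob works internally.

```python
import difflib

# Tiny hand-written vocabulary; a real checker would use a large word list
VOCAB = ["welcome", "egypt", "this", "is", "the", "wrong", "text", "in"]

def correct_word(word, vocab=VOCAB, cutoff=0.6):
    # Return the most similar known word, or the word itself if nothing is close enough
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

sentence = "wlcome in egpt ths iss the wrng txt"
print(" ".join(correct_word(w) for w in sentence.split()))
```

The result depends entirely on the vocabulary: a word that is missing from `VOCAB` is either left alone or mapped to the nearest wrong neighbor, which is the same kind of error visible in the TextBlob output above.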
