<a href="https://colab.research.google.com/github/dhyeyppatel/Natural-Language-Processing/blob/main/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    # Lowercasing
    text = text.lower()

    # Removing numbers and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Removing extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Spelling correction (can be slow for large texts)
    # text = str(TextBlob(text).correct())

    # Tokenization
    tokens = text.split() # Simple split for demonstration

    # Removing stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]

    # Lemmatization using spaCy
    # Note: stemming and lemmatization are usually alternatives, not combined;
    # spaCy's lemmatization is generally preferred for better accuracy.
    # To use it, uncomment the lines below and comment out the stemming step.
    # doc = nlp(" ".join(tokens))
    # tokens = [token.lemma_ for token in doc]

    return " ".join(tokens)

# Example usage:
sample_text = "This is an example sentence with some Punctuation! and numbers like 123. It also has some words that need correction, like 'wrds' and 'procesing'."
processed_text = preprocess_text(sample_text)
print("Original Text:", sample_text)
print("Processed Text:", processed_text)

Original Text: This is an example sentence with some Punctuation! and numbers like 123. It also has some words that need correction, like 'wrds' and 'procesing'.
Processed Text: exampl sentenc punctuat number like also word need correct like wrd proces
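
The cleaning steps at the top of `preprocess_text` (lowercasing, stripping non-letters, collapsing whitespace) can be tried on their own with only the standard library. The sketch below is illustrative; the helper name `clean_text` and the sample string are assumptions, not part of the notebook.

```python
import re

def clean_text(text):
    """Lowercase, drop non-letter characters, and collapse whitespace."""
    text = text.lower()
    # Delete every character that is not an ASCII letter or whitespace.
    text = re.sub(r'[^a-z\s]', '', text)
    # Deleting "123." leaves two adjacent spaces; collapse runs of whitespace.
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("Numbers like 123. It's here!"))
# numbers like its here
```

Note that the apostrophe in "It's" is deleted rather than replaced by a space, so the contraction becomes `its`.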


In [9]:
# Step 1: Install necessary libraries
!pip install -q nltk textblob spacy
!python -m spacy download en_core_web_sm
import nltk
nltk.download('stopwords')

# Step 2: Import libraries
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob
import spacy

# Step 3: Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Step 4: Preprocessing function
def preprocess_text(text, do_spelling_correction=True):
    print("Original Text:", text)

    # Lowercasing
    text = text.lower()

    # Remove numbers and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Spelling correction (optional)
    if do_spelling_correction:
        text = str(TextBlob(text).correct())

    # Tokenization (simple split)
    tokens = text.split()

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]

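    # Lemmatization using spaCy
    # Note: the lemmatizer receives already-stemmed tokens here, so it rarely
    # changes them; in practice, pick either stemming or lemmatization.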
    doc = nlp(" ".join(stemmed_tokens))
    lemmatized_tokens = [token.lemma_ for token in doc]

    # Final cleaned text
    final_text = " ".join(lemmatized_tokens)
    print("Processed Text:", final_text)
    return final_text

# Step 5: Run on sample text
sample_text = "This is an example sentence with some Punctuation! and numbers like 123. It also has some words that need correction, like 'wrds' and 'procesing'."
preprocess_text(sample_text, do_spelling_correction=True)


Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Original Text: This is an example sentence with some Punctuation! and numbers like 123. It also has some words that need correction, like 'wrds' and 'procesing'.
Processed Text: exampl sentenc punctuat number like also word need correct like word process


'exampl sentenc punctuat number like also word need correct like word process'
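
Since stemming and lemmatization are usually treated as alternatives, one option is to pass the normalizer in as a function so that only one of them runs. The following is a minimal sketch under that assumption; `preprocess_tokens`, its parameters, and the tiny stop-word set are hypothetical and stand in for the NLTK stop-word list used above.

```python
def preprocess_tokens(text, stop_words, normalize=None):
    """Split text, drop stop words, then apply at most one normalizer."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    if normalize is not None:
        # normalize is any str -> str callable, e.g. a stemmer's stem method.
        tokens = [normalize(t) for t in tokens]
    return " ".join(tokens)

# Hypothetical stop-word set for illustration only.
demo_stop_words = {"this", "is", "an", "with", "some"}

print(preprocess_tokens("This is an example sentence", demo_stop_words))
# example sentence
```

With NLTK installed, passing `normalize=PorterStemmer().stem` would reproduce the stemming step; a lemmatizer wrapped as a single-token function could be swapped in the same way.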

Let's start by installing the necessary libraries: NLTK for tokenization, stop words, stemming, and lemmatization, and spaCy for a more robust approach to lemmatization and other tasks. We'll also install `textblob` for spelling correction.
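
Before running the install commands, it can be handy to see which of these packages are already importable. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is an assumption made for illustration.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# Import names used in this notebook.
print(missing_packages(["nltk", "textblob", "spacy"]))
```

The spaCy model `en_core_web_sm` is itself an installable package, so its name can be added to the list as well.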