Install Required Libraries:

In [None]:
!pip install nltk
!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import Libraries:

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import spacy

In [None]:
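nltk.download('punkt')  # recent NLTK releases (3.9+) load 'punkt_tab' instead; download that if word_tokenize raises a LookupError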

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# **Tokenization:**

Use NLTK's word_tokenize function to split the text into tokens. Punctuation marks are returned as separate tokens alongside the words:

In [None]:
text = "This is an example sentence."
tokens = word_tokenize(text)
print(tokens)

['This', 'is', 'an', 'example', 'sentence', '.']
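To make the idea of tokenization concrete, here is a rough standard-library approximation. It is only a sketch: `simple_tokenize` is a hypothetical helper, not part of NLTK, and its regular expression is far simpler than the rules word_tokenize applies.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # character that is neither a word character nor whitespace (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

sample_tokens = simple_tokenize("This is an example sentence.")
print(sample_tokens)  # ['This', 'is', 'an', 'example', 'sentence', '.']
```

The two approaches agree on this sentence but diverge on harder input. For example, the sketch splits "It's" into three pieces, whereas word_tokenize keeps "'s" together as one token.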


# **Stop Word Removal:**

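Download NLTK's stop word corpus and remove the English stop words from the tokenized text. Each token is lowercased before the comparison so that capitalized stop words such as "This" are removed too: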

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

['example', 'sentence', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
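The filter above drops every word on the stop word list, and NLTK's English list includes negations such as "not". For tasks like sentiment analysis you may want to keep them. The sketch below shows one way to do that. To stay self-contained it uses a tiny hand-written stop word set as a stand-in for `stopwords.words('english')`; the helper name `remove_stop_words` and the `NEGATIONS` set are assumptions made for illustration.

```python
# Tiny illustrative stop word set; in the notebook you would pass
# set(stopwords.words('english')) instead.
STOP_WORDS = {"i", "do", "not", "it", "is", "too", "this", "an"}

# Words we choose to keep even though they appear on the stop word list.
NEGATIONS = {"not", "no", "nor"}

def remove_stop_words(tokens, stop_words, keep=frozenset()):
    # Subtract the words to keep, then filter case-insensitively.
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in effective]

tokens_in = ["I", "do", "not", "like", "NLP"]
without_negation = remove_stop_words(tokens_in, STOP_WORDS)
with_negation = remove_stop_words(tokens_in, STOP_WORDS, keep=NEGATIONS)

print(without_negation)  # ['like', 'NLP']
print(with_negation)     # ['not', 'like', 'NLP']
```

Keeping the exceptions in a separate set leaves the original stop word list untouched, so the same list can be reused with different exceptions per task.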


# **Cleaning and Normalization:**

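In the code below, we use NLTK for several text cleaning and normalization steps, applied in this order:

**Tokenization:** We use word_tokenize to split the text into individual tokens.

**Lowercasing:** We convert each token to lowercase using the lower() method.

**Handling contractions:** word_tokenize splits contractions into separate tokens (for example, "It's" becomes "It" and "'s"). A small dictionary maps these fragments to full words. The mapping is deliberately simple: "'s" is always expanded to "is", although it can also mark a possessive.

**Removing stopwords and punctuation:** We use the stopwords corpus from NLTK to get a set of common English stopwords, then drop those stopwords and any punctuation tokens. The stopword list includes negations such as "not", which is why "not" is missing from the example output.

**Lemmatization:** We use the WordNetLemmatizer from NLTK to reduce tokens to their base or dictionary forms. Called without a part-of-speech argument, lemmatize() treats every token as a noun, so verb forms are mostly left unchanged.

The clean_text function takes a text input, applies these steps, and returns a list of cleaned tokens.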

Feel free to customize the function based on your specific requirements and explore other NLTK functionalities for text processing and normalization.

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import string  # provides string.punctuation, used in the filter below

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def clean_text(text):
    # Tokenization
    tokens = word_tokenize(text)

    # Lowercasing
    tokens = [token.lower() for token in tokens]

    # Handling contractions
    contractions = {
        "n't": "not",
        "'s": "is",
        "'re": "are",
        # Add more contractions as needed
    }
    tokens = [contractions.get(token, token) for token in tokens]

    # Removing stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and token not in string.punctuation]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

# Example usage
text = "I do not like NLP. It's too complicated!"
cleaned_tokens = clean_text(text)
print(cleaned_tokens)

['like', 'nlp', 'complicated']
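The contraction step is easier to follow in isolation. The sketch below runs it on a token list written by hand in the shape word_tokenize produces ("don't" becomes "do" and "n't"). The helper name `expand_contractions` and the extra dictionary entries are assumptions for illustration. The ambiguous "'s" is left out on purpose, since it can stand for "is", "has", or a possessive.

```python
# Fragments that word_tokenize splits off, mapped to full words.
# "'s" is intentionally omitted because its meaning depends on context.
CONTRACTIONS = {
    "n't": "not",
    "'re": "are",
    "'ve": "have",
    "'ll": "will",
    "'m": "am",
}

def expand_contractions(tokens, mapping=CONTRACTIONS):
    # Look up the lowercased token; fall back to the original token unchanged.
    return [mapping.get(token.lower(), token) for token in tokens]

pre_split = ["They", "do", "n't", "think", "we", "'re", "late"]
expanded = expand_contractions(pre_split)
print(expanded)  # ['They', 'do', 'not', 'think', 'we', 'are', 'late']
```

If this step runs before stopword removal, as in clean_text, the expanded words "not" and "are" are removed again unless they are excluded from the stopword set.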


# **Lemmatization and Stemming:**

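Use spaCy for lemmatization or NLTK's PorterStemmer for stemming. The spaCy example assumes the en_core_web_sm model is installed; if spacy.load fails, install the model with `python -m spacy download en_core_web_sm`: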


In [None]:
import spacy
from nltk.stem import PorterStemmer

# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

# Create a spaCy Doc object by processing the input text
doc = nlp("This is an example sentence.")

# Extract lemmas using spaCy
lemmas = [token.lemma_ for token in doc]
print(lemmas)

# Define tokens from the processed Doc object
tokens = [token.text for token in doc]

# Create an instance of PorterStemmer from NLTK
stemmer = PorterStemmer()

# Apply stemming on the tokens using PorterStemmer
stems = [stemmer.stem(word) for word in tokens]
print(stems)

['this', 'be', 'an', 'example', 'sentence', '.']
['thi', 'is', 'an', 'exampl', 'sentenc', '.']


The first print statement outputs the lemmas extracted with spaCy:

['this', 'be', 'an', 'example', 'sentence', '.']

Each lemma comes from the lemma_ attribute of a token in the doc object. A lemma is the base (dictionary) form of the word, which is why "is" becomes "be".

The second print statement outputs the stems produced by the PorterStemmer:

['thi', 'is', 'an', 'exampl', 'sentenc', '.']

Each stem comes from calling the stemmer's stem() method on a token in the tokens list. The Porter algorithm strips suffixes by rule, so a stem is a truncated form that need not be a real word ("thi", "exampl"). Unlike lemmatization, it also leaves irregular forms such as "is" unchanged.
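To see why rule-based suffix stripping yields non-words, consider the deliberately naive stemmer below. It is not the Porter algorithm, only a toy with a hypothetical suffix list and minimum stem length, but it shows the same basic mechanism of removing endings without consulting a dictionary.

```python
# Toy suffix list, checked in order; NOT the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but never leave a stem
        # shorter than min_stem_len characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

toy_stems = [naive_stem(w) for w in ["This", "is", "an", "example", "sentence", "."]]
print(toy_stems)  # ['thi', 'is', 'an', 'exampl', 'sentenc', '.']
```

On this sentence the toy happens to match the PorterStemmer output, but the two differ on most other input. The Porter algorithm applies several ordered rule phases with conditions on the remaining stem, which this sketch does not attempt.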

