In [11]:
!pip install textblob

Defaulting to user installation because normal site-packages is not writeable
Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1


### Import Statements
- `re`: Regular-expression matching and substitution, used here for HTML tag removal and URL removal.
- `Counter` from `collections`: Counts word occurrences for frequent-word and rare-word removal.
- `ABC` and `abstractmethod` from `abc`: Define the abstract base class that gives every preprocessing class a common interface.
- `Union` and `List` from `typing`: Type hints for method parameters and return values.
- `string`: Provides `string.punctuation`, the ASCII punctuation characters stripped during punctuation removal.

### NLTK Library
- `nltk`: The `nltk` library, short for the Natural Language Toolkit, is imported to leverage its extensive set of tools and resources for natural language processing (NLP). NLTK provides functionalities for tokenization, stopwords, wordnet, and more.

### NLTK Stopwords and WordNet
- `stopwords` and `wordnet` from `nltk.corpus`: These corpus readers give access to NLTK's predefined stopword lists (common words that are usually filtered out) and to WordNet (a lexical database for English). Both require their data packages to be downloaded first.

### NLTK Lemmatization and Stemming
- `WordNetLemmatizer` and `PorterStemmer` from `nltk.stem`: The `WordNetLemmatizer` and `PorterStemmer` classes perform lemmatization and stemming, respectively, on word tokens. Lemmatization maps a word to its dictionary form (for example `customers` to `customer`), while stemming strips suffixes by rule and can produce non-words (for example `enterprise` to `enterpris`).

### NLTK Tokenization
- `word_tokenize` from `nltk.tokenize`: The `word_tokenize` function from `nltk.tokenize` is imported for word tokenization. It splits text into individual words or tokens, which is a fundamental step in many text preprocessing tasks.

### NLTK Sentence Tokenization
- `sent_tokenize` from `nltk.tokenize`: The `sent_tokenize` function from `nltk.tokenize` is imported for sentence tokenization. It divides text into individual sentences, allowing for more granular analysis and processing.

### TextBlob for Spelling Correction
- `Word` from `textblob`: The `Word` class is imported to perform spelling correction. Its `correct()` method replaces a word with its most probable spelling, which fixes common typos but can also change valid, uncommon words such as proper nouns.


In [28]:
import re
from collections import Counter
from abc import ABC, abstractmethod
from typing import Union, List
import string

#importing nltk library
import nltk

#importing nltk libraries for stopwords and wordnet
from nltk.corpus import stopwords, wordnet

#importing nltk libraries for lemmatization and stemming
from nltk.stem import WordNetLemmatizer, PorterStemmer

#importing nltk libraries for word tokenization
from nltk.tokenize import word_tokenize

#importing nltk libraries for sentence tokenization
from nltk.tokenize import sent_tokenize

#importing textblob for spelling correction
from textblob import Word

# Create different classes for different text preprocessing tasks

## Text Preprocessing Code Documentation

### Introduction
This documentation provides an overview of a Python code module for text preprocessing. The code defines several classes and methods for common text preprocessing tasks using the Natural Language Toolkit (NLTK) and other libraries.

### `BasePreprocessor` Abstract Class
- `BasePreprocessor(ABC)` is an abstract base class that defines a template for text preprocessing classes.
- The `preprocessor` method, marked as an abstract method using `@abstractmethod`, expects a text input and returns either a string or a list of strings.
- This class serves as a blueprint for various text preprocessing operations.
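
As a minimal sketch of how this contract behaves: instantiating the abstract base directly raises `TypeError`, while a subclass that implements `preprocessor` works normally. The `Shout` subclass and the `can_instantiate` helper are hypothetical names used only for illustration; they are not part of the notebook.

```python
from abc import ABC, abstractmethod
from typing import List, Union


class BasePreprocessor(ABC):
    """Common interface: every step exposes a `preprocessor` method."""

    @abstractmethod
    def preprocessor(self, text: str) -> Union[str, List[str]]:
        ...


class Shout(BasePreprocessor):
    """Hypothetical step used only for illustration: upper-cases the text."""

    def preprocessor(self, text: str) -> str:
        return text.upper()


def can_instantiate(cls) -> bool:
    """Return True if `cls()` succeeds, False if ABC blocks it with TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False


base_ok = can_instantiate(BasePreprocessor)  # abstract method not implemented
shout_ok = can_instantiate(Shout)            # concrete subclass
shouted = Shout().preprocessor("hello")
```

Because the check happens at instantiation time, a subclass that forgets to define `preprocessor` fails early rather than when the pipeline first calls it.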

### `DownloadRequirement` Class
- `DownloadRequirement` class manages the download of NLTK resources required for text preprocessing.
- The `download` method downloads necessary NLTK resources like stopwords and WordNet data.

### Preprocessing Classes
- Several classes are defined for specific text preprocessing operations, such as lowercasing, HTML tag removal, URL removal, tokenization, stopword removal, punctuation removal, frequent word removal, rare word removal, spelling correction, lemmatization, and stemming.
- Each class has a `preprocessor` method that takes input text and performs the respective preprocessing operation.
- Type hints and docstrings are provided for each method to describe their purpose, expected input, and output types.

### Usage
- To use this text preprocessing module, create an instance of the `TextPreprocessor` class, providing the text you want to preprocess.
- Call the methods on the `TextPreprocessor` instance in the order shown below. Each method reads the attribute written by an earlier step, so skipping or reordering steps leaves later steps working on empty input.
- The results of each preprocessing step are stored in instance variables and can be accessed as needed.

### Example Usage
```python
text = "Sample text with <b>HTML tags</b> and URLs: http://example.com"
preprocessor = TextPreprocessor(text)
preprocessor.lower_case()
preprocessor.remove_tag()
preprocessor.remove_url()
tokens = preprocessor.tokenize_word()
preprocessor.remove_punctuation()
preprocessor.remove_frequent_word()
preprocessor.remove_rare_word()
preprocessor.correct_spelling()
preprocessor.lemmatizer()
preprocessor.stemmer()
```


In [29]:
class BasePreprocessor(ABC):
    @abstractmethod
    def preprocessor(self, text: str) -> Union[str, List[str]]:
        """
        Abstract method for text preprocessing.

        Args:
            text (str): Input text to be preprocessed.

        Returns:
            Union[str, List[str]]: Processed text or list of processed tokens.
        """
        pass

class DownloadRequirement:
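    def download(self):
        """
        Downloads NLTK resources for text preprocessing.
        """
        # 'punkt' backs word_tokenize/sent_tokenize; newer NLTK releases may
        # ask for 'punkt_tab' instead.
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('wordnet')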

class LowerCase(BasePreprocessor):
    def preprocessor(self, text: str) -> str:
        """
        Converts the input text to lowercase.

        Args:
            text (str): Input text.

        Returns:
            str: Lowercased text.
        """
        return text.lower()

class RemoveTag(BasePreprocessor):
    def preprocessor(self, text: str) -> str:
        """
        Removes HTML tags from the input text.

        Args:
            text (str): Input text containing HTML tags.

        Returns:
            str: Text with HTML tags removed.
        """
        regex = re.compile(r'<[^>]+>')
        return regex.sub('', text)

class RemoveURL(BasePreprocessor):
    def preprocessor(self, text: str) -> str:
        """
        Removes URLs from the input text.

        Args:
            text (str): Input text containing URLs.

        Returns:
            str: Text with URLs removed.
        """
        # Raw string avoids an invalid-escape warning; re.sub removes every
        # URL in the text, not only the first match.
        return re.sub(r'https?://\S+', '', text)

class TokenizeWord(BasePreprocessor):
    def preprocessor(self, text: str) -> List[str]:
        """
        Tokenizes the input text into words.

        Args:
            text (str): Input text.

        Returns:
            List[str]: List of word tokens.
        """
        return word_tokenize(text)

class TokenizeSentence(BasePreprocessor):
    def preprocessor(self, text: str) -> List[str]:
        """
        Tokenizes the input text into sentences.

        Args:
            text (str): Input text.

        Returns:
            List[str]: List of sentence tokens.
        """
        return sent_tokenize(text)

class RemoveStopword(BasePreprocessor):
    def preprocessor(self, text: List[str]) -> List[str]:
        """
        Removes stopwords from a list of word tokens.

        Args:
            text (List[str]): List of word tokens.

        Returns:
            List[str]: List of word tokens with stopwords removed.
        """
        # Build the stopword set once instead of reloading the list per token.
        stop_words = set(stopwords.words("english"))
        return [w for w in text if w not in stop_words]

class RemovePunctuation(BasePreprocessor):
    def preprocessor(self, text: str) -> str:
        """
        Removes punctuation from the input text.

        Args:
            text (str): Input text.

        Returns:
            str: Text with punctuation removed.
        """
        pun_string = text.translate(str.maketrans('', '', string.punctuation))
        return pun_string

class RemoveFrequentWord(BasePreprocessor):
    def preprocessor(self, text: List[str]) -> List[str]:
        """
        Removes the most frequent words from a list of word tokens.

        Args:
            text (List[str]): List of word tokens.

        Returns:
            List[str]: List of word tokens with frequent words removed.
        """
        cnt = Counter(text)
        # The 10 most common tokens; ties are broken by first occurrence.
        FREQWORD = {w for w, _ in cnt.most_common(10)}
        return [word for word in text if word not in FREQWORD]

class RemoveRareWord(BasePreprocessor):
    def preprocessor(self, text: List[str]) -> List[str]:
        """
        Removes the rarest words from a list of word tokens.

        Args:
            text (List[str]): List of word tokens.

        Returns:
            List[str]: List of word tokens with rare words removed.
        """
        cnt = Counter(text)
        n_rare_words = 10
        # Reverse slice takes the last n entries of the descending-count ranking.
        RAREWORD = {w for w, _ in cnt.most_common()[:-n_rare_words - 1:-1]}
        return [word for word in text if word not in RAREWORD]

class CorrectSpelling(BasePreprocessor):
    def preprocessor(self, text: List[str]) -> List[str]:
        """
        Corrects the spelling of words in a list of word tokens.

        Args:
            text (List[str]): List of word tokens.

        Returns:
            List[str]: List of word tokens with corrected spelling.
        """
        temp = []
        for word in text:
            word = Word(word)
            result = word.correct()
            temp.append(result)
        return temp

class Lemmatizer(BasePreprocessor):
    def preprocessor(self, text: List[str]) -> str:
        """
        Lemmatizes a list of word tokens.

        Args:
            text (List[str]): List of word tokens.

        Returns:
            str: Text with lemmatized words.
        """
        lemmed = " ".join([WordNetLemmatizer().lemmatize(w) for w in text])
        return lemmed

class Stemmer(BasePreprocessor):
    def preprocessor(self, text: List[str]) -> str:
        """
        Stems a list of word tokens.

        Args:
            text (List[str]): List of word tokens.

        Returns:
            str: Text with stemmed words.
        """
        stemmed = " ".join([PorterStemmer().stem(w) for w in text])
        return stemmed

## Text Preprocessor Class Documentation


### Class Initialization
- `__init__(self, text: str)`: The constructor initializes a `TextPreprocessor` instance with the input text to be preprocessed.
  - Args:
    - `text (str)`: The input text to be preprocessed.

### Class Attributes
- The class contains several instance attributes, each representing a stage of text preprocessing.
  - `self.lower_text: str`: Stores the lowercase version of the input text.
  - `self.removed_tags: str`: Stores the text with HTML tags removed.
  - `self.removed_urls: str`: Stores the text with URLs removed.
  - `self.tokenized_words: List[str]`: Stores a list of word tokens.
  - `self.tokenized: List[str]`: Stores a list of tokens after punctuation removal.
  - `self.removed_punctuation: str`: Stores text with punctuation removed.
  - `self.removed_frequent: List[str]`: Stores a list of words with frequent words removed.
  - `self.removed_rare: List[str]`: Stores a list of words with rare words removed.
  - `self.spell_corrected: List[str]`: Stores a list of words with corrected spelling.
  - `self.lemma: str`: Stores text with lemmatized words.
  - `self.stem: str`: Stores text with stemmed words.

### Text Preprocessing Methods
- The class defines several methods for performing specific text preprocessing tasks. Each method returns the result of the operation and updates the corresponding instance attribute.
- Method names, docstrings, and return types are provided for clarity.
  - `lower_case(self) -> str`: Converts the input text to lowercase and returns the lowercase text.
  - `remove_tag(self) -> str`: Removes HTML tags from the text and returns the cleaned text.
  - `remove_url(self) -> str`: Removes URLs from the text and returns the cleaned text.
  - `tokenize_word(self) -> List[str]`: Tokenizes the text into words and returns a list of word tokens.
  - `remove_punctuation(self) -> str`: Removes punctuation from the text and returns the cleaned text.
  - `remove_frequent_word(self) -> List[str]`: Removes frequent words from the text and returns a list of cleaned words.
  - `remove_rare_word(self) -> List[str]`: Removes rare words from the text and returns a list of cleaned words.
  - `correct_spelling(self) -> List[str]`: Corrects spelling errors in the text and returns a list of corrected words.
  - `lemmatizer(self) -> str`: Lemmatizes the text and returns the lemmatized text.
  - `stemmer(self) -> str`: Stems the text and returns the stemmed text.

### Usage Example
```python
text = "Sample text with <b>HTML tags</b> and URLs: http://example.com"
preprocessor = TextPreprocessor(text)
preprocessor.lower_case()
preprocessor.remove_tag()
preprocessor.remove_url()
tokens = preprocessor.tokenize_word()
preprocessor.remove_punctuation()
preprocessor.remove_frequent_word()
preprocessor.remove_rare_word()
preprocessor.correct_spelling()
preprocessor.lemmatizer()
preprocessor.stemmer()
```


In [30]:
class TextPreprocessor:
    def __init__(self, text: str):
        """
        Initializes the TextPreprocessor instance.

        Args:
            text (str): The input text to be preprocessed.
        """
        self.text = text
        self.download_requirement = DownloadRequirement()
        self.download_requirement.download()
        self.lower_text: str = ""
        self.removed_tags: str = ""
        self.removed_urls: str = ""
        self.tokenized_words: List[str] = []
        self.tokenized: List[str] = []
        self.removed_punctuation: str = ""
        self.removed_frequent: List[str] = []
        self.removed_rare: List[str] = []
        self.spell_corrected: List[str] = []
        self.lemma: str = ""
        self.stem: str = ""
    
    def lower_case(self) -> str:
        """
        Converts the input text to lowercase.

        Returns:
            str: The input text in lowercase.
        """
        text_processor = LowerCase()
        self.lower_text = text_processor.preprocessor(self.text)
        return self.lower_text
    
    def remove_tag(self) -> str:
        """
        Removes HTML tags from the text.

        Returns:
            str: Text with HTML tags removed.
        """
        text_processor = RemoveTag()
        self.removed_tags = text_processor.preprocessor(self.lower_text)
        return self.removed_tags

    def remove_url(self) -> str:
        """
        Removes URLs from the text.

        Returns:
            str: Text with URLs removed.
        """
        text_processor = RemoveURL()
        self.removed_urls = text_processor.preprocessor(self.removed_tags)
        return self.removed_urls
    
    def tokenize_word(self) -> List[str]:
        """
        Tokenizes the text into words.

        Returns:
            List[str]: List of word tokens.
        """
        text_processor = TokenizeWord()
        self.tokenized_words = text_processor.preprocessor(self.removed_urls)
        return self.tokenized_words
    
    def remove_punctuation(self) -> str:
        """
        Removes punctuation from the text.

        Returns:
            str: Text with punctuation removed.
        """
        # Join into a local string so self.tokenized_words stays a list.
        joined = " ".join(self.tokenized_words)
        text_processor = RemovePunctuation()
        self.removed_punctuation = text_processor.preprocessor(joined)
        return self.removed_punctuation
    
    def remove_frequent_word(self) -> List[str]:
        """
        Removes the most frequent words from the text.

        Returns:
            List[str]: List of words with frequent words removed.
        """
        text_processor = TokenizeWord()
        self.tokenized = text_processor.preprocessor(self.removed_punctuation)
        text_processor = RemoveFrequentWord()
        self.removed_frequent = text_processor.preprocessor(self.tokenized)
        return self.removed_frequent
    
    def remove_rare_word(self) -> List[str]:
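        """
        Removes the rarest words from the punctuation-free tokens.

        Reads self.tokenized, which remove_frequent_word() populates, so
        call that method first.

        Returns:
            List[str]: List of words with rare words removed.
        """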
        text_processor = RemoveRareWord()
        self.removed_rare = text_processor.preprocessor(self.tokenized)
        return self.removed_rare
    
    def correct_spelling(self) -> List[str]:
        """
        Corrects the spelling of words in the text.

        Returns:
            List[str]: List of words with corrected spelling.
        """
        text_processor = CorrectSpelling()
        self.spell_corrected = text_processor.preprocessor(self.removed_frequent)
        return self.spell_corrected
    
    def lemmatizer(self) -> str:
        """
        Lemmatizes the text.

        Returns:
            str: Text with lemmatized words.
        """
        text_processor = Lemmatizer()
        self.lemma = text_processor.preprocessor(self.spell_corrected)
        return self.lemma
    
    def stemmer(self) -> str:
        """
        Stems the text.

        Returns:
            str: Text with stemmed words.
        """
        text_processor = Stemmer()
        self.stem = text_processor.preprocessor(self.spell_corrected)
        return self.stem

## Test the above code

In [25]:
def main():
    text = "<html> @bshesh Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers. https://www.bisheshwor.com.np"
    text_preprocessor = TextPreprocessor(text)
    print(text_preprocessor.lower_case())
    print(text_preprocessor.remove_tag())
    print(text_preprocessor.remove_url())
    print(text_preprocessor.tokenize_word())
    print(text_preprocessor.remove_punctuation())
    print(text_preprocessor.remove_frequent_word())
    print(text_preprocessor.remove_rare_word())
    print(text_preprocessor.correct_spelling())
    print(text_preprocessor.lemmatizer())
    print(text_preprocessor.stemmer())

if __name__ == '__main__':
    main()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


[nltk_data] Downloading package stopwords to /home/bish/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/bish/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<html> @bshesh dr. smith graduated from the university of washington. he later started an analytics firm called lux, which catered to enterprise customers. https://www.bisheshwor.com.np
 @bshesh dr. smith graduated from the university of washington. he later started an analytics firm called lux, which catered to enterprise customers. https://www.bisheshwor.com.np
 @bshesh dr. smith graduated from the university of washington. he later started an analytics firm called lux, which catered to enterprise customers. 
['@', 'bshesh', 'dr.', 'smith', 'graduated', 'from', 'the', 'university', 'of', 'washington', '.', 'he', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']
 bshesh dr smith graduated from the university of washington  he later started an analytics firm called lux  which catered to enterprise customers 
['later', 'started', 'an', 'analytics', 'firm', 'called', 'lux', 'which', 'catered', 'to', 'enterprise'

In [3]:
text = "<html> @bshesh Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers. https://www.bisheshwor.com.np"
text_processor = LowerCase()
lower_text = text_processor.preprocessor(text)
print(lower_text)

<html> @bshesh dr. smith graduated from the university of washington. he later started an analytics firm called lux, which catered to enterprise customers. https://www.bisheshwor.com.np


In [4]:
text_processor = RemoveTag()
remove_tag = text_processor.preprocessor(lower_text)
print(remove_tag)

 @bshesh dr. smith graduated from the university of washington. he later started an analytics firm called lux, which catered to enterprise customers. https://www.bisheshwor.com.np


In [5]:
text_processor = RemoveURL()
remove_url = text_processor.preprocessor(remove_tag)
print(remove_url)

 @bshesh dr. smith graduated from the university of washington. he later started an analytics firm called lux, which catered to enterprise customers. 
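
The sample text contains a single URL. As a hedged sketch of how the same step could handle text with several URLs, `re.sub` with a precompiled pattern removes every match in one pass. The names `URL_PATTERN` and `strip_urls` are illustrative and not part of the notebook.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text: str) -> str:
    """Remove every URL matched by URL_PATTERN from `text`."""
    return URL_PATTERN.sub("", text)


sample = "docs at http://example.com and https://example.org/page today"
cleaned = strip_urls(sample)
unchanged = strip_urls("no links here")
```

The surrounding spaces are left in place, so a later whitespace-normalizing step (or tokenization) is still needed.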


In [6]:
text_processor = TokenizeWord()
tokenize_word = text_processor.preprocessor(remove_url)
print(tokenize_word)

['@', 'bshesh', 'dr.', 'smith', 'graduated', 'from', 'the', 'university', 'of', 'washington', '.', 'he', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']


In [7]:
tokenize_word = " ".join(tokenize_word)
text_processor = RemovePunctuation()
remove_punctuation = text_processor.preprocessor(tokenize_word)
print(remove_punctuation)

 bshesh dr smith graduated from the university of washington  he later started an analytics firm called lux  which catered to enterprise customers 
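
The doubled spaces in this output appear because tokens that consist only of punctuation (such as `,` and `.`) are deleted after the tokens were joined with single spaces. A small sketch of the same `str.translate` technique on a shortened input, with an optional whitespace-collapsing step added as an assumption (the notebook does not do this; it re-tokenizes instead):

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Delete all characters listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)


joined = "@ bshesh dr. smith , which catered ."
stripped = strip_punctuation(joined)

# Optional clean-up: split on any whitespace and rejoin with single spaces.
collapsed = " ".join(stripped.split())
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes pass through unchanged.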


In [8]:
text_processor = TokenizeWord()
tokenize = text_processor.preprocessor(remove_punctuation)
text_processor = RemoveFrequentWord()
remove_frequent = text_processor.preprocessor(tokenize)
print(remove_frequent)

['later', 'started', 'an', 'analytics', 'firm', 'called', 'lux', 'which', 'catered', 'to', 'enterprise', 'customers']
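
In this sample every word occurs exactly once, so the "10 most frequent" words are simply the first ten tokens: `Counter.most_common` orders ties by first occurrence. The sketch below shows that behavior on a tiny input and contrasts it with a count-threshold variant. `remove_above_count` and its `max_count` parameter are hypothetical and not part of the notebook.

```python
from collections import Counter
from typing import List


def remove_top_n(tokens: List[str], n: int = 10) -> List[str]:
    """Drop the n most common tokens (ties broken by first occurrence)."""
    top = {word for word, _ in Counter(tokens).most_common(n)}
    return [t for t in tokens if t not in top]


def remove_above_count(tokens: List[str], max_count: int) -> List[str]:
    """Hypothetical variant: drop only tokens seen more than max_count times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] <= max_count]


tokens = "the cat saw the dog and the bird".split()
top_removed = remove_top_n(tokens, n=2)                  # drops 'the' and 'cat'
count_removed = remove_above_count(tokens, max_count=1)  # drops only 'the'
```

A fixed top-n cut suits large corpora; on a single short sentence a threshold of this kind avoids discarding words that are not actually frequent.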


In [9]:
text_processor = RemoveRareWord()
remove_rare = text_processor.preprocessor(tokenize)
print(remove_rare)

['bshesh', 'dr', 'smith', 'graduated', 'from', 'the', 'university', 'of', 'washington', 'he', 'later', 'started']
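
The rare-word step relies on the reverse slice `most_common()[:-n-1:-1]`, which walks the descending-count ranking from its end and takes `n` entries. A minimal sketch on a toy input (the `rarest` helper is an illustrative name, not part of the notebook):

```python
from collections import Counter
from typing import List


def rarest(tokens: List[str], n: int) -> List[str]:
    """Return the n least common tokens, starting from the end of the ranking."""
    ranked = Counter(tokens).most_common()  # sorted by descending count
    return [word for word, _ in ranked[:-n - 1:-1]]


tokens = ["a", "a", "a", "b", "b", "c", "d"]
least = rarest(tokens, 2)
kept = [t for t in tokens if t not in set(least)]
```

With the sample sentence all 22 tokens have a count of one, so the ranking is just first-occurrence order and the last ten tokens are the ones removed, which matches the output above.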


In [10]:
text_processor = CorrectSpelling()
spell_correct = text_processor.preprocessor(remove_frequent)
print(spell_correct)

['later', 'started', 'an', 'analysis', 'firm', 'called', 'klux', 'which', 'watered', 'to', 'enterprise', 'customers']
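
The output shows the corrector changing valid words: `analytics` became `analysis`, `lux` became `klux`, and `catered` became `watered`. One possible mitigation, sketched here as an assumption rather than as the notebook's method, is to skip correction for a protected vocabulary. To keep the sketch runnable without TextBlob, `fake_correct` is a stand-in that reproduces those three substitutions; in the notebook that role is played by `Word(token).correct()`.

```python
from typing import Callable, Iterable, List, Set


def correct_tokens(
    tokens: Iterable[str],
    correct: Callable[[str], str],
    protected: Set[str],
) -> List[str]:
    """Apply `correct` to each token unless the token is in `protected`."""
    return [tok if tok in protected else correct(tok) for tok in tokens]


# Stand-in corrector (hypothetical) mimicking the substitutions seen above.
FAKE_FIXES = {"analytics": "analysis", "lux": "klux", "catered": "watered"}


def fake_correct(token: str) -> str:
    return FAKE_FIXES.get(token, token)


tokens = ["analytics", "firm", "called", "lux", "catered"]
unguarded = correct_tokens(tokens, fake_correct, protected=set())
guarded = correct_tokens(
    tokens, fake_correct, protected={"analytics", "lux", "catered"}
)
```

The protected set would typically hold domain terms and proper nouns; how to build it is left open here.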


In [11]:
text_processor = Lemmatizer()
lemma = text_processor.preprocessor(spell_correct)
print(lemma)

later started an analysis firm called klux which watered to enterprise customer


In [15]:
text_processor = Stemmer()
stem = text_processor.preprocessor(spell_correct)
print(stem)

later start an analysi firm call klux which water to enterpris custom
