### Text Preprocessing Steps
- **Segmentation**: Also known as sentence tokenization, this is the process of breaking a large block of text into individual sentences, which makes the text easier to analyze and extract information from.
- **Removal of punctuation, special characters, and URLs**: These are usually noise and may confuse machine learning models, so they are removed to improve the quality of the text.
- **Lowercasing**: Without it, a model treats the uppercase and lowercase forms of the same word (e.g., "Apple" and "apple") as different words.
- **Tokenization**: Word tokenization splits a sentence into individual words; punctuation marks are treated as separate tokens. Useful for detailed analysis of each word in the text.
- **Part-of-Speech Tagging**: POS tagging assigns each word in a sentence its part of speech (noun, verb, adjective, and so on). This clarifies the context and the relationships between words, which helps in understanding the semantic meaning of the text.
- **Removing Stopwords**: Stopwords such as "a", "an", and "the" often carry little meaning and are typically discarded. If left in, they can add noise to the training data.
- **Text Normalization**
- **Stemming**
- **Lemmatization**

In [2]:
!pip install -q nltk

In [3]:
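import nltk

# The Punkt models are needed by sent_tokenize and word_tokenize.
# Newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')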

[nltk_data] Downloading package punkt to /home/emrulk1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
from nltk.tokenize import sent_tokenize

# Sample text
text = "The weather is nice today. I think I'll go for a walk. Maybe I'll take the dog with me."

# Split the text into sentences (sentence segmentation)
sentences = sent_tokenize(text)

# Print each sentence with its 1-based position
for i, sentence in enumerate(sentences, start=1):
    print(f"Sentence {i}: {sentence}")

Sentence 1: The weather is nice today.
Sentence 2: I think I'll go for a walk.
Sentence 3: Maybe I'll take the dog with me.


In [5]:
import re
import nltk
from nltk.tokenize import word_tokenize

# Make sure to download the necessary resources
# nltk.download('punkt')

# Sample text
text = "Hello! Check out my website: http://example.com. It's awesome! excited @user $100."

# Regular expression matching URLs (anything starting with http or www)
url_pattern = r"http\S+|www\S+"

# Regular expression matching punctuation and special characters
# (everything except ASCII letters, digits, and whitespace)
# Alternative that keeps Unicode word characters and underscores: r"[^\w\s]"
punctuation_pattern = r"[^a-zA-Z0-9\s]"

# Remove URLs
cleaned_text = re.sub(url_pattern, '', text)

# Remove punctuation and special characters
cleaned_text = re.sub(punctuation_pattern, '', cleaned_text)

# Tokenize the text
tokens = word_tokenize(cleaned_text)

# Print the cleaned text and tokenized text
print("Cleaned Text:", cleaned_text)
print("Tokenized Text:", tokens)
print()

# Lowercase the cleaned text and tokenize it again
cleaned_text = cleaned_text.lower()
tokens = word_tokenize(cleaned_text)

print("Cleaned and Lowercased Text:", cleaned_text)
print("Tokenized Text:", tokens)


Cleaned Text: Hello Check out my website  Its awesome excited user 100
Tokenized Text: ['Hello', 'Check', 'out', 'my', 'website', 'Its', 'awesome', 'excited', 'user', '100']

Cleaned and Lowercased Text: hello check out my website  its awesome excited user 100
Tokenized Text: ['hello', 'check', 'out', 'my', 'website', 'its', 'awesome', 'excited', 'user', '100']
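To see why lowercasing matters, the following minimal sketch counts tokens before and after lowercasing. The token list is a made-up example, not data from this notebook, and it uses only the standard library.

```python
from collections import Counter

# Hypothetical tokens: the same word in three different casings
tokens = ["Hello", "hello", "HELLO", "world"]

# Without lowercasing, each casing is counted as a separate word
raw_counts = Counter(tokens)
print("Before lowercasing:", raw_counts)

# After lowercasing, the three variants collapse into a single entry
lowered_counts = Counter(token.lower() for token in tokens)
print("After lowercasing:", lowered_counts)
```

Before lowercasing the counter holds four distinct entries; afterwards it holds two, with `hello` counted three times.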


In [6]:
# Sample sentence
sentence = "Text preprocessing is an important step in natural language processing."

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Print the tokens
print("Tokens:", tokens)

Tokens: ['Text', 'preprocessing', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
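Stopword removal, described in the list of steps, can be sketched on these same tokens. The `STOPWORDS` set below is a small hypothetical list chosen for illustration; NLTK ships a much larger English list in `nltk.corpus.stopwords`, which requires a separate `stopwords` download and is not used here.

```python
# Hypothetical, deliberately small stopword set for illustration only
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}

# Tokens copied from the tokenization output above
tokens = ['Text', 'preprocessing', 'is', 'an', 'important', 'step', 'in',
          'natural', 'language', 'processing', '.']

# Compare in lowercase so that "The" and "the" would both be filtered
filtered = [token for token in tokens if token.lower() not in STOPWORDS]
print("Filtered tokens:", filtered)
```

With this set, `is`, `an`, and `in` are dropped while the content words and the final period remain. Punctuation would need its own removal step, as in the earlier cell.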


In [10]:
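# nltk.download() with no arguments opens the interactive NLTK downloader;
# select 'averaged_perceptron_tagger' there for POS tagging.
# Non-interactive alternative: nltk.download('averaged_perceptron_tagger')
nltk.download()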

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [11]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Sample text
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Print the POS-tagged tokens
print("Pos Tagged Tokens:", tagged_tokens)

Pos Tagged Tokens: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
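One common use of the tags is to pull out words of a particular class. The sketch below works directly on the tagged output shown above, so it needs no NLTK resources. The helper name `words_with_tag_prefix` is a hypothetical one introduced for this example. It relies on the Penn Treebank convention that noun tags start with `NN`, adjective tags with `JJ`, and verb tags with `VB`.

```python
# Tagged tokens copied from the output above
tagged_tokens = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
                 ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
                 ('dog', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged_tokens, "NN")
adjectives = words_with_tag_prefix(tagged_tokens, "JJ")
verbs = words_with_tag_prefix(tagged_tokens, "VB")

print("Nouns:", nouns)
print("Adjectives:", adjectives)
print("Verbs:", verbs)
```

Note that the tagger labeled "brown" as `NN` in this sentence, so it appears among the nouns even though it acts as an adjective here. Tagger output is not always correct and may need checking.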

