# 1. Robust Data Processing

Robust data preprocessing handles irrelevant or noisy text by cleaning the data and transforming it into a format that can be analyzed effectively.
It typically involves several steps, such as tokenization, stopword removal, stemming or lemmatization, and noise removal. The example below covers lowercasing, tokenization, and stopword removal.

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
# Download the English stopword list used by preprocess_text below.
# RegexpTokenizer is purely regex-based, so the 'punkt' models are not needed.
nltk.download('stopwords', quiet=True)

True

In [2]:

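def preprocess_text(text):
    # Lowercase first so that stopword matching is case-insensitive.
    text = text.lower()
    # Keep runs of word characters only; punctuation is dropped and
    # contractions such as "i'll" are split into "i" and "ll".
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens_without_stopwords = [token for token in tokens if token not in stop_words]
    combined_text = ' '.join(tokens_without_stopwords)
    # Defensive pass: the \w+ tokenizer already drops punctuation,
    # so this substitution normally changes nothing.
    processed_text = re.sub(r'[^\w\s]', '', combined_text)
    return processed_text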

In [3]:
text = "I'll be going to the park, and we're meeting at 3 o'clock. It's a beautiful day!"

In [4]:
preprocessed_text = preprocess_text(text)
print(preprocessed_text)

going park meeting 3 clock beautiful day
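
The example above does not cover the noise removal step mentioned earlier. A minimal sketch of that step follows, using only the standard library. The function name `remove_noise`, the choice of what counts as noise (HTML tags, URLs, e-mail addresses, repeated whitespace), and the sample string are all assumptions made for illustration, and the patterns are deliberately simple rather than exhaustive.

```python
import html
import re

def remove_noise(text):
    """Strip common non-linguistic noise before tokenization (illustrative only)."""
    text = re.sub(r'<[^>]+>', ' ', text)                 # drop HTML tags
    text = html.unescape(text)                           # decode entities such as &amp;
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # drop URLs
    text = re.sub(r'\S+@\S+', ' ', text)                 # drop e-mail addresses
    text = re.sub(r'\s+', ' ', text)                     # collapse runs of whitespace
    return text.strip()

# Hypothetical input, not taken from the original notebook.
sample = "<p>Visit https://example.com &amp; email us at info@example.com   today!</p>"
print(remove_noise(sample))  # Visit & email us at today!
```

A function like this would typically run before `preprocess_text`, so that tag names and URL fragments never reach the tokenizer.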


# Iterative Refinement

Iterative refinement improves text data quality progressively: the same cleaning steps are applied over several passes, each working on the output of the previous one.
In the following code example, we use Python's NLTK library and regular expressions to demonstrate iterative refinement by removing stopwords and
punctuation from a short corpus of text.


In [5]:
import nltk
from nltk.corpus import stopwords
import re
nltk.download('stopwords', quiet=True)

True

In [15]:
corpus = [
    "We'll check out our exclusive sale!",
    "You'll earn money from home with this amazing opportunity!",
    "Get rid of junk emails and spam.",
    "Unsubscribe from irrelevant newsletters.",
]

In [16]:
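def remove_stopwords(text):
    # split() breaks on whitespace only, so punctuation stays attached
    # to its token (e.g. "sale!") until remove_punctuations has run.
    tokens = text.split()
    stop_words = set(stopwords.words('english'))
    tokens_without_stopwords = [token for token in tokens if token.lower() not in stop_words]
    combined_text = ' '.join(tokens_without_stopwords)
    return combined_text

def remove_punctuations(text):
    # Replace every character that is neither a word character nor whitespace with a space.
    return re.sub(r'[^\w\s]', ' ', text)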

In [17]:
print("Original corpus:", corpus)

Original corpus: ["We'll check out our exclusive sale!", "You'll earn money from home with this amazing opportunity!", 'Get rid of junk emails and spam.', 'Unsubscribe from irrelevant newsletters.']


In [20]:
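# Run both cleaning steps three times, feeding each pass's output into the next.
for i in range(3):
    refined_corpus = []
    for doc in corpus:
        cleaned_text = remove_stopwords(doc)
        cleaned_text = remove_punctuations(cleaned_text)
        refined_corpus.append(cleaned_text)
    corpus = refined_corpus
# Printed once, after the loop has finished, so only the final pass (i == 2) is shown.
print(f"\nIteration {i}: {corpus}")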


Iteration 2: ['check exclusive sale', 'earn money home amazing opportunity', 'Get rid junk emails spam', 'Unsubscribe irrelevant newsletters']
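
The loop above uses a fixed number of passes. One possible variation, sketched below, repeats the same two steps until a pass no longer changes the text and records every intermediate state, which makes it visible why more than one pass can be needed. To keep the sketch self-contained it uses a tiny hand-written stopword set instead of NLTK's English list; that set, the helper `refine_until_stable`, and the `max_passes` parameter are assumptions for illustration.

```python
import re

# Assumption: a small illustrative stopword set standing in for NLTK's English list.
STOP_WORDS = {"we", "ll", "out", "our", "of", "and", "from", "with", "this"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Whitespace split: punctuation stays attached to tokens at this point.
    return ' '.join(t for t in text.split() if t.lower() not in stop_words)

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', ' ', text)

def refine_until_stable(doc, max_passes=10):
    """Apply both cleaning steps until a pass leaves the text unchanged."""
    history = [doc]
    for _ in range(max_passes):
        cleaned = remove_punctuation(remove_stopwords(doc))
        cleaned = ' '.join(cleaned.split())  # normalize whitespace before comparing
        if cleaned == doc:
            break  # fixed point reached
        doc = cleaned
        history.append(doc)
    return doc, history

final, history = refine_until_stable("We'll check out our exclusive sale!")
for n, state in enumerate(history):
    print(n, repr(state))
# 0 "We'll check out our exclusive sale!"
# 1 'We ll check exclusive sale'
# 2 'check exclusive sale'
```

In this sketch the first pass cannot remove "We'll" because the whole contraction is not in the stopword set; stripping the apostrophe splits it into "We" and "ll", which the second pass then removes. Stopping at a fixed point avoids guessing the number of passes in advance.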
