# Text and Document Preparation

This notebook demonstrates a complete workflow for text and document preparation using a sample document. Each section covers a key step in the process, with code, explanations, and answers to logic questions.


## 1. Tokenization

We begin by tokenizing the sample document into sentences and words. Tokenization is the process of splitting text into smaller units (tokens), such as sentences or words.

**Sample Document:**

> Artificial intelligence (AI) is revolutionizing various industries by providing innovative solutions. AI and & Data Science are the technologies, and very hot topic in Cambodia.

**Questions:**

- What types of tokenization will you apply? (sentence, word, or both)
- Why is tokenization important in IR?
- What do `word_tokenize()` and `sent_tokenize()` do?

**Answers:**

- Both sentence and word tokenization will be applied. Sentence tokenization splits text into sentences, while word tokenization splits sentences into words.
- Tokenization is important in Information Retrieval (IR) because the index is built over tokens: queries and documents can only be matched on the units the tokenizer produces, so the choice of tokenizer directly affects what can be retrieved.
- `word_tokenize()` and `sent_tokenize()` are NLTK functions: `word_tokenize()` splits text into word and punctuation tokens, and `sent_tokenize()` splits text into sentences.

Let's see this in code.


In [6]:
# Import required libraries
import nltk
from sklearn.feature_extraction.text import CountVectorizer
# nltk.download('punkt')  # run once; newer NLTK releases may require 'punkt_tab' instead
sample_doc = "Artificial intelligence (AI) is revolutionizing various industries by providing innovative solutions. AI and & Data Science are the technologies, and very hot topic in Cambodia."

# Sentence tokenization
sentences = nltk.sent_tokenize(sample_doc)
print("Sentence Tokenization:", sentences)

# Word tokenization (nltk)
words_nltk = nltk.word_tokenize(sample_doc)
print("Word Tokenization (nltk):", words_nltk)

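# Word tokenization (scikit-learn)
# CountVectorizer lowercases the text, keeps tokens of two or more word characters,
# and returns the sorted, de-duplicated vocabulary rather than the token sequence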
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([sample_doc])
words_sklearn = vectorizer.get_feature_names_out()
print("Word Tokenization (scikit-learn):", words_sklearn)

Sentence Tokenization: ['Artificial intelligence (AI) is revolutionizing various industries by providing innovative solutions.', 'AI and & Data Science are the technologies, and very hot topic in Cambodia.']
Word Tokenization (nltk): ['Artificial', 'intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'various', 'industries', 'by', 'providing', 'innovative', 'solutions', '.', 'AI', 'and', '&', 'Data', 'Science', 'are', 'the', 'technologies', ',', 'and', 'very', 'hot', 'topic', 'in', 'Cambodia', '.']
Word Tokenization (scikit-learn): ['ai' 'and' 'are' 'artificial' 'by' 'cambodia' 'data' 'hot' 'in'
 'industries' 'innovative' 'intelligence' 'is' 'providing'
 'revolutionizing' 'science' 'solutions' 'technologies' 'the' 'topic'
 'various' 'very']
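
The difference between the two word-tokenization outputs above can be reproduced with the standard library alone. The sketch below is an approximation, not NLTK's algorithm: the sentence splitter is a naive regex chosen for illustration, and the word pattern assumes scikit-learn's documented default `token_pattern` for `CountVectorizer`. The names `split_sentences` and `word_tokens` are hypothetical.

```python
import re

sample_doc = (
    "Artificial intelligence (AI) is revolutionizing various industries by "
    "providing innovative solutions. AI and & Data Science are the "
    "technologies, and very hot topic in Cambodia."
)

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def word_tokens(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

sentences = split_sentences(sample_doc)
tokens = word_tokens(sample_doc)     # token sequence, duplicates kept
vocabulary = sorted(set(tokens))     # de-duplicated and sorted, like the vocabulary

print(len(sentences), "sentences")
print(len(tokens), "tokens,", len(vocabulary), "unique")
```

Punctuation such as `(`, `&`, and `,` never reaches the vocabulary because the pattern only matches word characters, which is why the scikit-learn output is shorter than the NLTK token list. The repeated tokens 'ai' and 'and' also collapse to one vocabulary entry each.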


## 2. Stop-word Removal

Stop-words are common words (such as 'the', 'is', 'and') that are often removed from text to focus on meaningful content. Domain-specific (contextual) stopwords, such as 'AI' or 'Data' in a corpus that is entirely about AI, may also be removed.

**Questions:**

- List the stop-words you removed, including domain-specific stopwords.
- Explain how removing stop-words impacts understanding and processing of text.
- Why remove contextual stopwords in IR systems?

**Answers:**

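- With the NLTK English list plus the domain-specific set, the words removed from the sample vocabulary are 'ai', 'and', 'are', 'by', 'data', 'in', 'is', 'science', 'the', and 'very'.
- Removing stop-words reduces noise and shrinks the vocabulary, so the remaining terms are the ones that distinguish one document from another. The cost is that some grammatical information is lost.
- Contextual (domain-specific) stopwords occur in nearly every document of a specialized collection, so they do little to separate relevant from irrelevant documents. Removing them tailors retrieval to the subject area.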

Let's remove stop-words from our tokenized words.


In [7]:
from nltk.corpus import stopwords
#nltk.download('stopwords')

# Standard English stopwords
stop_words = set(stopwords.words('english'))
# Add domain-specific stopwords
domain_stopwords = {'ai', 'data', 'science'}
stop_words.update(domain_stopwords)

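# Remove stopwords from scikit-learn tokenized words
filtered_words = [word for word in words_sklearn if word.lower() not in stop_words]
# Note: this prints the full stop-word list that was applied (standard + domain-specific),
# not only the words that actually occurred in the sample
print("Stop-words removed:", sorted(domain_stopwords.union(stopwords.words('english'))))
print("Filtered words:", filtered_words)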

Stop-words removed: ['a', 'about', 'above', 'after', 'again', 'against', 'ai', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'data', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', '

In [7]:
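# Custom stop-word removal (a smaller, hand-picked list for comparison)
custom_stopwords = {'is', 'the', 'are', 'and', 'in', 'by', 'ai', 'data'}

# Remove custom stopwords from scikit-learn tokenized words.
# The result is stored under a separate name so that later cells keep using
# the NLTK-filtered list, which is what their outputs reflect.
custom_filtered_words = [word for word in words_sklearn if word.lower() not in custom_stopwords]
print("Custom stop-words:", custom_stopwords)
print("Filtered words:", custom_filtered_words)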

Custom stop-words: {'ai', 'the', 'by', 'is', 'data', 'in', 'and', 'are'}
Filtered words: ['artificial', 'cambodia', 'hot', 'industries', 'innovative', 'intelligence', 'providing', 'revolutionizing', 'science', 'solutions', 'technologies', 'topic', 'various', 'very']
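
One way to find contextual stopwords rather than hand-picking them is to look at document frequency: a term that appears in almost every document of a collection does little to tell documents apart. The sketch below is a minimal illustration under stated assumptions. The mini-corpus, the `contextual_stopwords` helper, and the `min_df_ratio` threshold are all hypothetical and not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical mini-corpus in which every document is about AI
corpus = [
    "AI improves crop yield prediction in agriculture",
    "AI models detect fraud in banking transactions",
    "Hospitals use AI to read medical images",
    "AI chatbots answer customer questions",
]

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def contextual_stopwords(docs, min_df_ratio=0.8):
    """Return terms that occur in at least min_df_ratio of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term for term, df in doc_freq.items() if df / len(docs) >= min_df_ratio}

print(contextual_stopwords(corpus))
```

Here only 'ai' crosses the 80% threshold, since it appears in all four documents. Lowering the threshold flags more terms, so the cut-off is a tuning decision that depends on the collection.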


## 3. Stemming and Lemmatization

Stemming strips suffixes with heuristic rules to reduce a word to a stem, which need not be a real word, while lemmatization maps a word to its base or dictionary form (its lemma).

**Questions:**

- Select one word and show its stemmed and lemmatized forms.
- What are the advantages and disadvantages of stemming vs. lemmatization?
- When is stemming preferred over lemmatization?

**Answers:**

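- Stemming is fast and needs no dictionary, but it can produce non-words (e.g., 'artifici', 'variou') and may conflate unrelated words. Lemmatization returns valid dictionary forms (e.g., 'industries' → 'industry'), but it is slower and depends on a lexicon and, for best results, part-of-speech information.
- Stemming may be preferred in large-scale IR systems where indexing speed matters more than linguistic accuracy.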

Let's apply both to our filtered words.


In [8]:
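from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')  # run once; WordNetLemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Select one word for demonstration
demo_word = filtered_words[0] if filtered_words else 'revolutionizing'
stemmed = stemmer.stem(demo_word)
# lemmatize() treats the word as a noun unless a pos argument is given,
# which is why verb forms such as 'providing' come back unchanged below
lemmatized = lemmatizer.lemmatize(demo_word)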

print(f"Original word: {demo_word}")
print(f"Stemmed form: {stemmed}")
print(f"Lemmatized form: {lemmatized}")

# Apply stemming and lemmatization to all filtered words
stemmed_words = [stemmer.stem(word) for word in filtered_words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print("Stemmed words:", stemmed_words)
print("Lemmatized words:", lemmatized_words)

Original word: artificial
Stemmed form: artifici
Lemmatized form: artificial
Stemmed words: ['artifici', 'cambodia', 'hot', 'industri', 'innov', 'intellig', 'provid', 'revolution', 'solut', 'technolog', 'topic', 'variou']
Lemmatized words: ['artificial', 'cambodia', 'hot', 'industry', 'innovative', 'intelligence', 'providing', 'revolutionizing', 'solution', 'technology', 'topic', 'various']
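
The trade-off between the two approaches can be seen without NLTK. The sketch below is a toy, not the Porter algorithm or WordNet: `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatize` are hypothetical names introduced for illustration. It contrasts rule-based suffix stripping, which needs no dictionary, with a lookup that only knows the words in its lexicon.

```python
# Toy suffix rules, tried in order; NOT the Porter algorithm
SUFFIXES = ("ies", "ing", "es", "s")

# Toy lexicon standing in for a real dictionary such as WordNet
LEMMAS = {
    "industries": "industry",
    "solutions": "solution",
    "technologies": "technology",
    "providing": "provide",
}

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["industries", "providing", "various", "topic"]:
    print(f"{word:12} stem={toy_stem(word):10} lemma={toy_lemmatize(word)}")
```

The stemmer handles any input but turns 'various' into the non-word 'variou', much as the Porter output above does. The lookup only returns real words, but it is limited by what its lexicon contains.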


## 4. Normalization and Case Folding

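Case folding converts all tokens to lowercase, and normalization more broadly standardizes the text format so that equivalent forms map to a single token. The tokens here are already lowercase because `CountVectorizer` lowercases by default, so this step leaves them unchanged.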

**Question:**

- How does normalization help reduce vocabulary size?

**Answer:**

- Normalization reduces vocabulary size by treating words like "AI" and "ai" as the same token, improving consistency and reducing redundancy.

Let's normalize our lemmatized words.


In [9]:
# Normalize and case fold
normalized_words = [word.lower() for word in lemmatized_words]
print("Normalized words:", normalized_words)

Normalized words: ['artificial', 'cambodia', 'hot', 'industry', 'innovative', 'intelligence', 'providing', 'revolutionizing', 'solution', 'technology', 'topic', 'various']
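
Because the notebook's tokens are already lowercase, the effect on vocabulary size is easier to see on tokens that still carry case variants. The sketch below uses a hypothetical `raw_tokens` list, not data from the sample document, and uses Python's built-in `str.casefold()`.

```python
# Hypothetical tokens containing case variants of the same words
raw_tokens = ["AI", "ai", "Ai", "Data", "data", "Cambodia", "cambodia"]

vocab_before = set(raw_tokens)
vocab_after = {token.casefold() for token in raw_tokens}

print("Before case folding:", len(vocab_before), sorted(vocab_before))
print("After case folding: ", len(vocab_after), sorted(vocab_after))
```

Seven distinct strings collapse to three vocabulary entries. For plain English text `casefold()` and `lower()` give the same result; `casefold()` is the more aggressive variant intended for caseless matching across languages.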


## 5. Handling Special Characters, Numbers, and Punctuation

Cleaning text involves removing special characters, numbers, and punctuation to focus on meaningful words.

**Questions:**

- Which characters will you remove? Why?
- What problems may occur if special characters are not removed?

**Answers:**

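- Remove characters like `&`, `(`, `)`, `.`, `,` and numbers because, in this document, they carry little meaning on their own and mostly add noise. In other collections numbers can matter (years, version numbers, model names), so this choice depends on the domain.
- If not removed, special characters can make the same word appear as several different tokens (for example, a word with and without a trailing comma), which inflates vocabulary size and reduces retrieval accuracy.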

Let's clean the normalized words.


In [10]:
import re
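# Remove special characters, numbers, and punctuation from normalized words:
# strip every character outside a-z, then drop tokens that end up empty
stripped_words = [re.sub(r'[^a-z]', '', word) for word in normalized_words]
cleaned_words = [word for word in stripped_words if word]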
print("Cleaned words:", cleaned_words)

Cleaned words: ['artificial', 'cambodia', 'hot', 'industry', 'innovative', 'intelligence', 'providing', 'revolutionizing', 'solution', 'technology', 'topic', 'various']
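
The cleaned list above is identical to its input because the scikit-learn tokens contained no punctuation to begin with. The same rule has a visible effect on the NLTK tokens, which still include `(`, `)`, `&`, `,`, and `.`. The sketch below copies that token list from the earlier NLTK output; the `clean` helper and the second, digit-bearing token list are hypothetical additions for illustration.

```python
import re

# Tokens copied from the NLTK word-tokenization output earlier in the notebook
nltk_tokens = [
    'Artificial', 'intelligence', '(', 'AI', ')', 'is', 'revolutionizing',
    'various', 'industries', 'by', 'providing', 'innovative', 'solutions', '.',
    'AI', 'and', '&', 'Data', 'Science', 'are', 'the', 'technologies', ',',
    'and', 'very', 'hot', 'topic', 'in', 'Cambodia', '.',
]

def clean(tokens):
    """Lowercase each token, strip everything outside a-z, drop empty results."""
    stripped = (re.sub(r"[^a-z]", "", token.lower()) for token in tokens)
    return [token for token in stripped if token]

cleaned = clean(nltk_tokens)
print(len(nltk_tokens), "->", len(cleaned), "tokens")

# Hypothetical tokens showing what the same rule does to digits and hyphens
print(clean(["covid-19", "gpt4", "2024"]))
```

The six punctuation tokens disappear, leaving 24 of the original 30. The second call shows the cost of the rule: 'covid-19' becomes 'covid', 'gpt4' becomes 'gpt', and '2024' is dropped entirely, which is worth checking before applying it to text where numbers are meaningful.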


## 6. Flowchart: Text Processing Workflow

Below is a flowchart illustrating the text processing steps described above.

```
Sample Document
     |
     v
Tokenization (Sentence & Word)
     |
     v
Stop-word Removal
     |
     v
Stemming & Lemmatization
     |
     v
Normalization & Case Folding
     |
     v
Remove Special Characters, Numbers, Punctuation
     |
     v
Cleaned Tokens
```

This workflow ensures that text is prepared for further analysis or information retrieval tasks.
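
The steps in the flowchart can be chained into a single function. The following is a minimal standard-library sketch under stated assumptions, not the notebook's NLTK and scikit-learn implementation: `prepare`, `STOP_WORDS`, and the optional `reduce_token` hook are hypothetical names, and the stop list is hand-picked. One design choice differs from the flowchart order: the sketch case-folds before matching stop-words, because the stop list is lowercase. The notebook gets the same effect because `CountVectorizer` lowercases during tokenization.

```python
import re

# Hand-picked stop list for this sketch (an assumption, not the NLTK list)
STOP_WORDS = {"is", "the", "are", "and", "in", "by", "very", "ai", "data", "science"}

def prepare(text, stop_words=STOP_WORDS, reduce_token=None):
    """Tokenize, case-fold, remove stop-words, then optionally stem or lemmatize."""
    tokens = re.findall(r"[A-Za-z]+", text)        # drops punctuation and digits
    tokens = [token.lower() for token in tokens]   # case folding
    tokens = [token for token in tokens if token not in stop_words]
    if reduce_token is not None:                   # e.g. a stemmer's stem method
        tokens = [reduce_token(token) for token in tokens]
    return tokens

sample_doc = (
    "Artificial intelligence (AI) is revolutionizing various industries by "
    "providing innovative solutions. AI and & Data Science are the "
    "technologies, and very hot topic in Cambodia."
)

print(prepare(sample_doc))
```

A stemmer or lemmatizer can be plugged in through `reduce_token`, for example `prepare(sample_doc, reduce_token=PorterStemmer().stem)` when NLTK is available.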
