# Text Analytics Fundamentals

## Learning Objectives
* Understand why text preprocessing is essential for analysis
* Apply core text preprocessing techniques
* Perform basic text analysis using Python
* Gain hands-on experience with the NLTK library

## Part 1: Why Text Preprocessing? 

### The Text Analysis Challenge

Consider these challenges with raw text data:
* Inconsistent formatting (uppercase/lowercase)
* Punctuation and special characters
* Common words that don't add meaning ("the", "and", "is")
* Different forms of the same word ("run", "running", "ran")

Let's see a practical example:

In [2]:
# Raw customer feedback example
raw_feedback = """
GREAT product!!! I've been using it for 3 months now...
The customer service team was really helpful :)
Would definitely recommend to others!!!
"""

print("Raw text contains:")
print("- Mixed case:", any(c.isupper() for c in raw_feedback))
print("- Punctuation:", any(c in "!?.," for c in raw_feedback))
print("- Special characters:", ":)" in raw_feedback)
print("- Multiple lines:", "\n" in raw_feedback)

Raw text contains:
- Mixed case: True
- Punctuation: True
- Special characters: True
- Multiple lines: True
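The issues flagged above can be addressed with a few lines of standard-library Python. The following is a minimal sketch rather than a full pipeline: the helper name `basic_clean` is hypothetical, and the regular expression simply drops everything except letters, digits, and whitespace. That also discards emoticons such as `:)` and splits contractions such as "I've", so whether the loss is acceptable depends on the analysis.

```python
import re

raw_feedback = """
GREAT product!!! I've been using it for 3 months now...
The customer service team was really helpful :)
Would definitely recommend to others!!!
"""

def basic_clean(text):
    """Lowercase, drop punctuation/symbols, and collapse whitespace (illustrative helper)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace anything that is not a letter, digit, or whitespace
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace, including newlines

cleaned = basic_clean(raw_feedback)
print(cleaned)
```

The remaining parts of this session replace this crude regex with NLTK's tokeniser and stop word list.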


## Part 2: Text Preprocessing Steps

Let's set up our environment first:

In [4]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
from collections import Counter

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

print("Environment ready!")

[nltk_data] Downloading package punkt_tab to /home/vscode/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /home/vscode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vscode/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Environment ready!


### 2.1 Tokenisation

Tokenisation breaks text into smaller pieces (tokens), either sentences or words:

In [None]:
# Example text
text = "The product exceeded my expectations! Customer service was excellent. Would recommend."

# Sentence tokenisation
sentences = sent_tokenize(text)
print("Sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")

# Word tokenisation (punctuation marks are returned as separate tokens)
words = word_tokenize(text)
print("\nWords:", words)
print(f"Number of tokens: {len(words)}")

Sentences:
1. The product exceeded my expectations!
2. Customer service was excellent.
3. Would recommend.

Words: ['The', 'product', 'exceeded', 'my', 'expectations', '!', 'Customer', 'service', 'was', 'excellent', '.', 'Would', 'recommend', '.']
Number of tokens: 14
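To see why a dedicated tokeniser is used rather than `str.split()`, the sketch below compares whitespace splitting with a simple regular expression. The regex is only a rough stand-in for illustration and is not how `word_tokenize` is implemented, although it separates punctuation in a similar way for this particular sentence.

```python
import re

text = "The product exceeded my expectations! Customer service was excellent. Would recommend."

# Whitespace splitting leaves punctuation attached to the preceding word
naive_tokens = text.split()

# Rough approximation: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(naive_tokens)
print(regex_tokens)
print(len(naive_tokens), len(regex_tokens))
```

With whitespace splitting, "expectations!" and "expectations" would be counted as different words, which distorts any later frequency analysis.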


### 2.2 Case Normalisation

Converting text to a consistent case ensures that variants such as "PRODUCT" and "product" are treated as the same word:

In [6]:
# Mixed case example
mixed_text = "The PRODUCT was AMAZING and I LOVED it!"

normalized = mixed_text.lower()
print("Original:", mixed_text)
print("Normalized:", normalized)

# Why this matters
print("\nWithout normalization:")
print('"PRODUCT" == "product":', "PRODUCT" == "product")
print('\nWith normalization:')
print('"PRODUCT".lower() == "product".lower():', "PRODUCT".lower() == "product".lower())

Original: The PRODUCT was AMAZING and I LOVED it!
Normalized: the product was amazing and i loved it!

Without normalization:
"PRODUCT" == "product": False

With normalization:
"PRODUCT".lower() == "product".lower(): True

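The effect on word counts can be sketched with `collections.Counter`. The word list here is made up for illustration. The example uses `str.casefold()`, which behaves like `lower()` for English text and additionally handles some non-English cases; using it instead of `lower()` is a design choice, not something this session requires.

```python
from collections import Counter

# Hypothetical tokens with inconsistent casing
words = ["Product", "PRODUCT", "product", "Great", "great"]

# Without normalisation, each casing is counted as a different word
raw_counts = Counter(words)

# casefold() lowercases and also maps e.g. German "ß" to "ss"
normalised_counts = Counter(word.casefold() for word in words)

print(raw_counts)
print(normalised_counts)
```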

### 2.3 Stop Word Removal

Stop words are common words that usually don't add meaning to our analysis:

In [7]:
# Get English stop words
stop_words = set(stopwords.words('english'))

# Example text
text = "The product is very good and I would recommend it to others"
words = word_tokenize(text.lower())

# Remove stop words
filtered_words = [word for word in words if word not in stop_words]

print("Original words:", words)
print("After removing stop words:", filtered_words)
print(f"\nReduced from {len(words)} to {len(filtered_words)} words")

Original words: ['the', 'product', 'is', 'very', 'good', 'and', 'i', 'would', 'recommend', 'it', 'to', 'others']
After removing stop words: ['product', 'good', 'would', 'recommend', 'others']

Reduced from 12 to 5 words
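Stop word lists are not always safe to apply as-is. NLTK's English list includes negations such as "not" and "no", and removing them can reverse the meaning of a sentence, which matters for tasks like sentiment analysis. The sketch below shows one way to keep selected words; it uses a small hand-written stop list (`example_stop_words`, a hypothetical name) so that it runs without any downloads.

```python
# Small hand-written stop list for illustration only (not NLTK's full list)
example_stop_words = {"the", "is", "not", "and", "it", "i", "to"}

# Words we want to keep even though they appear in the stop list
keep_words = {"not"}
custom_stop_words = example_stop_words - keep_words

tokens = ["the", "product", "is", "not", "good"]

default_filtered = [t for t in tokens if t not in example_stop_words]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)  # negation lost
print(custom_filtered)   # negation kept
```

The same set subtraction works on NLTK's list, for example `set(stopwords.words('english')) - {'not', 'no'}`.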


### 2.4 Stemming and Lemmatisation

Both techniques reduce words to a base form so that variants such as "running" and "runs" can be counted together:

In [None]:
# Initialize stemmers and lemmatisers
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "runs", "ran", "easily", "playing", "better", "good"]

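# Apply stemming and lemmatisation
# Note: lemmatize() treats each word as a noun unless a pos argument is given,
# so verbs and adjectives such as "running" or "better" may come back unchanged
stems = [stemmer.stem(word) for word in words]
lemmas = [lemmatizer.lemmatize(word) for word in words]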

# Compare results
print("Original\tStem\t\tLemma")
print("-" * 40)
for original, stem, lemma in zip(words, stems, lemmas):
    print(f"{original}\t{stem}\t\t{lemma}")

Original	Stem		Lemma
----------------------------------------
running	run		running
runs	run		run
ran	ran		ran
easily	easili		easily
playing	play		playing
better	better		better
good	good		good


#### What is the difference between stemming and lemmatisation?

Stemming reduces inflected (and sometimes derived) words to a stem by applying rules, typically stripping suffixes. The result need not be a dictionary word. Example: "running" -> "run", but also "easily" -> "easili". Because the approach is purely algorithmic, stemming is fast and needs no vocabulary.

Lemmatisation groups the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. It relies on a vocabulary and morphological analysis, normally aiming to remove inflectional endings only, and it usually needs the word's part of speech to work well. Example: "better" -> "good" when the word is tagged as an adjective. NLTK's `WordNetLemmatizer` assumes a noun unless a `pos` argument is given, which is why "running" and "better" are returned unchanged in the output above.

| Aspect | Stemming | Lemmatisation |
|--------|----------|---------------|
| **Definition** | Removes affixes (mostly suffixes) using rules | Converts to dictionary base form |
| **Output** | Can produce non-dictionary words | Dictionary words (input returned unchanged if no lemma is found) |
| **Speed** | Fast | Slower |
| **Accuracy** | Lower | Higher |
| **Example 1** | running → run | running → run (tagged as verb) |
| **Example 2** | better → better | better → good (tagged as adjective) |
| **Example 3** | easily → easili | easily → easily |
| **Use Case** | Search engines, large-scale processing | Text analysis, NLP tasks |
| **Resource Use** | Minimal (rule-based) | Higher (needs dictionary) |
| **Context Aware** | No | Only when given a part-of-speech tag |
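The difference in mechanism can be illustrated with two toy functions. This is a deliberately simplified sketch: `toy_stem` is not Porter's algorithm and `toy_lemmatise` is not WordNet. Both names, the suffix list, and the tiny dictionary are hypothetical and exist only to contrast rule-based stripping with dictionary lookup.

```python
# Toy stemmer: strips a few suffixes by rule (far simpler than Porter's algorithm)
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatiser: looks words up in a tiny hand-made dictionary
LEMMA_DICT = {"running": "run", "ran": "run", "better": "good", "studies": "study"}

def toy_lemmatise(word):
    return LEMMA_DICT.get(word, word)  # unknown words are returned unchanged

for word in ["running", "ran", "studies", "better"]:
    print(f"{word}\t{toy_stem(word)}\t{toy_lemmatise(word)}")
```

The rule-based function produces non-words such as "runn" and "studie" and cannot relate "ran" to "run", while the lookup handles irregular forms but only for words its dictionary contains.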

## Part 3: Basic Text Analysis (10 minutes)

Let's apply what we've learned to analyse some text:

In [None]:
def analyse_text(text):
    """Basic text analysis function"""
    # Normalise the text
    normalised_text = text.lower()

    # Tokenise the normalised text
    words = word_tokenize(normalised_text)

    # Remove stop words and punctuation
    words = [word for word in words
             if word not in stop_words and word.isalnum()]

    # Get word frequencies
    word_freq = Counter(words)

    return {
        'total_words': len(words),
        'unique_words': len(set(words)),
        'most_common': word_freq.most_common(5)
    }

# Example text
customer_reviews = """
Great product, really helpful for data analysis. The visualisation features are amazing.
Easy to use and great support from the team. Would recommend for data science projects.
Data visualisation has never been easier. Great tool for analysis.
"""

results = analyse_text(customer_reviews)

print("Analysis Results:")
print(f"Total words: {results['total_words']}")
print(f"Unique words: {results['unique_words']}")
print("\nMost common words:")
for word, count in results['most_common']:
    print(f"- {word}: {count} times")

Analysis Results:
Total words: 26
Unique words: 20

Most common words:
- great: 3 times
- data: 3 times
- analysis: 2 times
- visualisation: 2 times
- product: 1 times
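Raw counts are easier to compare across texts of different lengths once they are expressed as proportions. The sketch below derives two such figures from the analysis of `customer_reviews`. The values are hard-coded so the example runs on its own, and the name `lexical_diversity` is simply a label chosen here for the ratio of unique to total words.

```python
# Values taken from the analysis of customer_reviews
results = {
    'total_words': 26,
    'unique_words': 20,
    'most_common': [('great', 3), ('data', 3), ('analysis', 2),
                    ('visualisation', 2), ('product', 1)],
}

# Share of distinct words among all retained words
lexical_diversity = results['unique_words'] / results['total_words']
print(f"Lexical diversity: {lexical_diversity:.2f}")

# Relative frequency of each top word
for word, count in results['most_common']:
    share = count / results['total_words']
    print(f"- {word}: {share:.1%}")
```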


### Exercise (10 minutes)

Try analysing some text from your own organisation (e.g., customer feedback, product documentation, or internal communications). Use the template below:

In [None]:
# Your text here
your_text = """
Replace this with your own text
"""

# Analyse it
your_results = analyse_text(your_text)

# Print results
print("Your Text Analysis:")
print(f"Total words: {your_results['total_words']}")
print(f"Unique words: {your_results['unique_words']}")
print("\nMost common words:")
for word, count in your_results['most_common']:
    print(f"- {word}: {count} times")

Your Text Analysis:
Total words: 2
Unique words: 2

Most common words:
- replace: 1 times
- text: 1 times


### Summary

In this session, we've covered:
1. Why text preprocessing is necessary
2. Key preprocessing techniques:
   - Tokenisation
   - Case normalisation
   - Stop word removal
   - Stemming and lemmatisation
3. Basic text analysis methods

### Next Steps
- Explore more advanced text analysis techniques
- Learn about sentiment analysis
- Practice with real-world text data

### Additional Resources
- [NLTK Documentation](https://www.nltk.org/)
- [Natural Language Processing with Python (the NLTK book)](https://www.nltk.org/book/)
- [Natural Language Processing With Python's NLTK Package (Real Python)](https://realpython.com/nltk-nlp-python/)