# Intro to Text Processing
Text processing (or text cleaning) is a crucial first step in any text analysis project. It involves preparing raw text data for analysis by making it more uniform. We'll use the string manipulation skills from the last notebook to build a simple cleaning pipeline.

## The goal of text processing
Raw text from websites, books or user reviews is often messy. It can contain:
- Inconsistent capitalisation (e.g. Apple, apple)
- Punctuation marks that we don't need
- Common words (the, a, is) called **stop words** that add little meaning

Our goal is to normalise the text to make analysis easier and more accurate.
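
To see why this matters, here is a minimal sketch that counts words before and after normalising. The word list `raw_words` is made up purely for illustration; it is not data from this notebook.

```python
from collections import Counter

# Hypothetical sample words, invented for this illustration
raw_words = ["Apple", "apple", "APPLE", "apple."]

# Without normalisation, every variant is counted as a different word
raw_counts = Counter(raw_words)
print(raw_counts)

# After lowercasing and stripping a trailing full stop, the variants merge
normalised_counts = Counter(word.lower().rstrip(".") for word in raw_words)
print(normalised_counts)
```

The first counter holds four separate entries, while the second holds a single entry for `apple` with a count of 4.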

## A simple cleaning workflow
Let's walk through a basic workflow to clean a single piece of text. Our steps will be:
1. Convert to lowercase
2. Remove punctuation
3. Split the text into words

In [None]:
import string # Python's string module contains useful constants

# Example text
document = "WOW! This is an example sentence. It's for demonstrating text processing, of course."

# 1. Convert to lowercase
doc_lower = document.lower()
print(f"Lowercase:\n{doc_lower}")

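# 2. Remove punctuation
# We'll use replace() in a loop over every character in string.punctuation
# Note: this also strips apostrophes, so "it's" becomes "its"
for punc in string.punctuation:
    doc_lower = doc_lower.replace(punc, "")
print(f"\nNo Punctuation:\n{doc_lower}")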

# 3. Tokenize (split into words)
tokens = doc_lower.split()
print(f"\nTokens:\n{tokens}")
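
The three steps above can also be wrapped in a reusable function. The sketch below is one possible way to do it, not the only one: the name `clean_text` is an assumption introduced here, and it swaps the `replace()` loop for `str.translate()`, which removes all punctuation characters in a single pass.

```python
import string

def clean_text(text):
    """Lowercase the text, strip ASCII punctuation, and split it into words."""
    lowered = text.lower()
    # The third argument of str.maketrans lists characters to delete
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments splits on any run of whitespace
    return no_punctuation.split()

sample = "WOW! This is an example sentence. It's for demonstrating text processing, of course."
print(clean_text(sample))
```

As with the loop version, the apostrophe in "It's" is treated as punctuation, so the token comes out as `its`.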

### Removing Stop Words
Stop words are common words that are often filtered out before analysis because they carry little meaning on their own. Libraries such as NLTK and spaCy ship ready-made stop word lists, but we can do it manually with a small, predefined list.

In [None]:
# A small list of common English stop words
stop_words = ["i", "me", "my", "a", "an", "the", "is", "it", "in", "on", "of", "for", "and", "this", "that"]

# 'tokens' is the list of words we created in the previous cell
filtered_tokens = []
for word in tokens:
    if word not in stop_words:
        filtered_tokens.append(word)

print(f"Tokens after removing stop words:\n{filtered_tokens}")
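
The same filtering can be written more compactly. The sketch below assumes a hypothetical helper called `remove_stop_words` and stores the stop words in a set rather than a list, since membership checks on a set do not need to scan every element. The `example_tokens` list is made up for illustration.

```python
# The same small stop word list as above, stored as a set
STOP_WORDS = {"i", "me", "my", "a", "an", "the", "is", "it", "in", "on",
              "of", "for", "and", "this", "that"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return a new list containing only the tokens that are not stop words."""
    return [word for word in tokens if word not in stop_words]

# Hypothetical tokens, already lowercased and stripped of punctuation
example_tokens = ["wow", "this", "is", "an", "example", "sentence"]
print(remove_stop_words(example_tokens))
```

Note that the comparison is case-sensitive, so the tokens should be lowercased before they are filtered.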