# Section 2: Text Preprocessing

In this section, we will dive deeper into text preprocessing techniques. These techniques are crucial for cleaning and preparing text data for machine learning models.

**Our goals for this section are:**
1. Learn about stemming and lemmatization.
2. Understand how to use regular expressions for text cleaning.

## 1. Stemming

Stemming is the process of reducing a word to its stem by stripping affixes (mostly suffixes) with heuristic rules. The result is not guaranteed to be a dictionary word, and irregular forms are not resolved. For example, the Porter stemmer reduces `running` and `runs` to `run`, but leaves `ran` unchanged.
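To make the idea of rule-based suffix stripping concrete, here is a minimal sketch. The `naive_stem` helper and its three-suffix list are hypothetical and written only for illustration; this is not the Porter algorithm, which applies a much larger, ordered set of rules.

```python
# Toy suffix stripper (hypothetical helper, NOT the Porter algorithm).
SUFFIXES = ("ing", "ly", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # No rule applies, so irregular forms such as "ran" pass through unchanged.
    return word

words = ["running", "ran", "runs", "easily", "fairly"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Because the rules only look at the spelling of the word, the output can be a non-word (`easi`), and an irregular past tense such as `ran` is never connected to `run`.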

### Stemming with NLTK

In [None]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

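words = ["running", "ran", "runs", "easily", "fairly"]
# Porter stems need not be dictionary words ('easily' -> 'easili'),
# and the irregular form 'ran' is left unchanged.
stemmed_words = [stemmer.stem(word) for word in words]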

print("Original Words:", words)
print("Stemmed Words:", stemmed_words)

## 2. Lemmatization

Lemmatization, like stemming, maps inflected forms to a common base, but it uses a vocabulary and the word's part of speech to return a valid dictionary form, called the lemma. For example, treated as verbs, `run`, `running`, and `ran` all lemmatize to `run`, including the irregular form `ran` that a suffix-stripping stemmer cannot resolve.
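The role of the vocabulary and the part of speech can be sketched with a tiny lookup table. The table contents and the `toy_lemmatize` helper below are made up for illustration and are not how WordNet or spaCy store their data; real lemmatizers combine far larger dictionaries with morphological rules.

```python
# Toy lookup table keyed by (word, part of speech); illustrative data only.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("ran", "v"))      # irregular form resolved through the table
print(toy_lemmatize("meeting", "n"))  # as a noun, the word is already a lemma
print(toy_lemmatize("meeting", "v"))  # as a verb, it maps to "meet"
```

The same surface form can have different lemmas depending on its part of speech, which is why the lemmatizers used in this section either take a `pos` argument or infer it from the sentence.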

### Lemmatization with NLTK

In [None]:
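import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; the download is skipped if it is already installed.
# Some NLTK versions may also require nltk.download('omw-1.4').
nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()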

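words = ["running", "ran", "runs", "easily", "fairly"]
# pos='v' treats every word as a verb (the default is pos='n', noun).
# The adverbs 'easily' and 'fairly' are not verbs, so they come back unchanged.
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]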

print("Original Words:", words)
print("Lemmatized Words:", lemmatized_words)

### Lemmatization with spaCy

In [None]:
import spacy

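# Requires the small English model; install it once with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')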

text = "The quick brown foxes are jumping over the lazy dogs."
doc = nlp(text)

lemmatized_words = [token.lemma_ for token in doc]

print("Original Text:", text)
print("Lemmatized Words:", lemmatized_words)

## 3. Regular Expressions

Regular expressions (regex) are a powerful tool for finding and replacing patterns in text. They are often used to remove punctuation, numbers, and other unwanted characters from text.

### Using Regex to Clean Text

In [None]:
import re

text = "This is a sample sentence with punctuation! And numbers 123."

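# Remove punctuation: drop every character that is neither a word character (\w) nor whitespace (\s).
# Digits count as word characters, so '123' is kept.
clean_text = re.sub(r'[^\w\s]', '', text)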

print("Original Text:", text)
print("Cleaned Text:", clean_text)
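One detail worth checking when cleaning with `[^\w\s]`: `\w` matches letters, digits, and the underscore, and removing punctuation can leave runs of spaces behind. The following sketch shows one possible follow-up step; the sample string and variable names are invented for illustration.

```python
import re

text = "Wait...   what?  snake_case_name and 42 stay!"

# [^\w\s] removes punctuation, but \w matches letters, digits, AND the underscore.
no_punct = re.sub(r"[^\w\s]", "", text)

# Treat the underscore as a separator too, then collapse whitespace runs to one space.
no_underscore = no_punct.replace("_", " ")
normalized = re.sub(r"\s+", " ", no_underscore).strip()

print(no_punct)
print(normalized)
```

Whether underscores should be kept or split depends on the data: for identifiers or hashtags it may be better to leave them intact.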

## Exercise

Now it's your turn!
1. Create a new text variable with a sentence of your choice.
2. Lemmatize the text using either NLTK or spaCy.
3. Use a regular expression to remove all the numbers from the text.

In [None]:
# Your code here

## End of Section 2

Congratulations on completing the second section! You've learned about stemming, lemmatization, and regular expressions.

**Next Steps:**
1. Save this notebook.
2. Commit your changes to Git with the message 'Complete Section 2'.
3. When you're ready, ask me to proceed to **Section 3: Feature Engineering**.