# Text Preprocessing for Sentiment Analysis

Text preprocessing is a crucial step in sentiment analysis because raw text often contains noise that can confuse machine learning or deep learning models. Below are the key preprocessing steps with explanations and examples:

---

### Original Sentence:
"The movie was absolutely wonderful, and I loved every moment of it!"

### After Preprocessing:

**Lowercase**:

`the movie was absolutely wonderful, and i loved every moment of it!`

**Remove punctuation**:

`the movie was absolutely wonderful and i loved every moment of it`

**Remove stopwords**:

`movie absolutely wonderful loved every moment`

**Tokenize**:

`["movie", "absolutely", "wonderful", "loved", "every", "moment"]`

**Lemmatize**:

`["movie", "absolutely", "wonderful", "love", "every", "moment"]`
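
The walkthrough above can be sketched as one small function. This is a minimal, dependency-free sketch rather than a full NLTK pipeline: `STOPWORDS` and `LEMMAS` are hypothetical, hand-picked tables that cover only this sentence, and the function name `preprocess` is an assumption. It tokenizes before removing stopwords, which gives the same result here.

```python
import string

# Assumption: a tiny hand-picked stopword set, just enough for this sentence.
STOPWORDS = {"the", "was", "and", "i", "of", "it"}
# Assumption: a toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"loved": "love"}

def preprocess(sentence):
    """Lowercase, strip ASCII punctuation, tokenize, drop stopwords, lemmatize."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    kept = [token for token in tokens if token not in STOPWORDS]
    return [LEMMAS.get(token, token) for token in kept]

result = preprocess("The movie was absolutely wonderful, and I loved every moment of it!")
print(result)  # ['movie', 'absolutely', 'wonderful', 'love', 'every', 'moment']
```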

---

## 1. Lowercasing  
**Purpose:** Normalize the text so that "Happy" and "happy" are treated the same.  

**Example:**
```python
text = "I am HAPPY!"
lowercased = text.lower()
print(lowercased)  # i am happy!
```
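
To see why this normalization matters, the sketch below counts word occurrences before and after lowercasing. The word list is made up for illustration.

```python
from collections import Counter

# Hypothetical sample: the same word in three different casings
words = ["Happy", "happy", "HAPPY", "sad"]

raw_counts = Counter(words)
normalized_counts = Counter(word.lower() for word in words)

print(raw_counts["happy"])         # 1
print(normalized_counts["happy"])  # 3
```

Without lowercasing, a model would treat the three spellings as three unrelated features.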

## 2. Removing Punctuation
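**Purpose:** Punctuation often carries little sentiment on its own, so removing it simplifies the text. Repeated marks such as "!!!" can signal emphasis, so some pipelines choose to keep them.

**Example:**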
```python
import string

text = "I am happy!!! Are you?"
# Keep only the characters that are not ASCII punctuation
without_punctuation = "".join(char for char in text if char not in string.punctuation)
print(without_punctuation)  # I am happy Are you
```
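
The same result can be reached with `str.translate`, sketched below as an alternative. The example also shows a caveat: `string.punctuation` contains only ASCII characters, so typographic marks such as the curly apostrophe or the ellipsis character are left in place. The second sample string is made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)

cleaned = "I am happy!!! Are you?".translate(table)
print(cleaned)  # I am happy Are you

# Non-ASCII punctuation is not covered by string.punctuation
unchanged = "It’s great…".translate(table)
print(unchanged)  # It’s great…
```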

## 3. Tokenization
**Purpose:** Split the sentence into individual words (tokens).

**Example:**
```python
# Needs NLTK tokenizer data: nltk.download("punkt") (or "punkt_tab" on newer releases)
from nltk.tokenize import word_tokenize

text = "I love this product"

# Plain whitespace split
print(text.split())         # ['I', 'love', 'this', 'product']

# NLTK tokenizer
print(word_tokenize(text))  # ['I', 'love', 'this', 'product']
```

### Why use the NLTK tokenizer?

`.split()` only splits on whitespace, so punctuation stays attached to the neighboring word. `word_tokenize()` separates it into its own token.

- Using `.split()`:

```python
text = "Hello World!"
print(text.split())  # ['Hello', 'World!']
```

- Using `nltk.word_tokenize()`:

```python
from nltk.tokenize import word_tokenize

text = "Hello World!"
print(word_tokenize(text))  # ['Hello', 'World', '!']
```
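
For a dependency-free approximation of the same idea, a regular expression can also split punctuation into separate tokens. This is only a rough sketch and does not reproduce NLTK's tokenization rules; the helper name `simple_tokenize` is an assumption.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello World!")
print(tokens)  # ['Hello', 'World', '!']
```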

## 4. Removing Stopwords
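**Purpose:** Stopwords (such as *is*, *am*, *the*, *in*, *and*) are very frequent words that add little meaning for tasks like sentiment classification, so removing them reduces noise. Note that NLTK's English list also includes negations such as *not* and *no*, which can flip the sentiment of a sentence, so consider keeping them.

**Example (using NLTK stopwords):**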
```python
# Needs the NLTK "stopwords" corpus: nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
text = "I am very happy with the service"
tokens = word_tokenize(text.lower())
# Keep only the tokens that are not stopwords
after_removing_stopwords = " ".join(token for token in tokens if token not in stop_words)
print(after_removing_stopwords)  # happy service
```
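
One thing to watch in sentiment tasks is that stopword lists often contain negations. The sketch below shows how dropping "not" changes what the remaining text says, and one possible way to keep negations. `STOPWORDS` and `NEGATIONS` are small hand-picked sets used as assumptions for illustration, not NLTK's actual list, and `remove_stopwords` is a hypothetical helper.

```python
# Assumption: a small hand-picked stopword set for illustration only
STOPWORDS = {"i", "am", "very", "with", "the", "not", "no"}
# Assumption: negation words we may want to protect from removal
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(text, keep_negations=False):
    """Drop stopwords from a whitespace-tokenized, lowercased text."""
    stop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return " ".join(token for token in text.lower().split() if token not in stop)

plain = remove_stopwords("I am not happy with the service")
kept = remove_stopwords("I am not happy with the service", keep_negations=True)
print(plain)  # happy service
print(kept)   # not happy service
```

With plain removal, a negative sentence ends up with the same tokens as its positive counterpart.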

## 5. Stemming
- **Goal**: Strip affixes (in practice mostly suffixes) to reduce a word to a root form (*stem*).
- **How it works**: Uses simple rules to chop off word endings. It doesn’t check if the result is a valid word.
- **Examples**:
  - `"playing"` → `"play"`
  - `"played"` → `"play"`
  - `"flies"` → `"fli"` ❌ *(not a real word)*
- **Think of it like**: A quick-and-dirty way of shortening words.
- **Use case**: When speed matters more than accuracy (e.g., search engines).

```python
from nltk.stem import PorterStemmer

st = PorterStemmer()

print("Stemming")
print("Playing ->", st.stem("Playing"))
print("Played ->", st.stem("Played"))
print("flies ->", st.stem("flies"))

# Output:
# Stemming
# Playing -> play
# Played -> play
# flies -> fli
```
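
To make the "simple rules" idea concrete, here is a toy suffix-stripping stemmer. It is a deliberately crude sketch and not the Porter algorithm; the rule list `SUFFIX_RULES` and the function `toy_stem` are assumptions made for illustration.

```python
# Assumption: a handful of (suffix, replacement) rules, checked in order
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule; never check whether the result is a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least two characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

stems = [toy_stem(w) for w in ["playing", "played", "flies", "running"]]
print(stems)  # ['play', 'play', 'fli', 'runn']
```

The last result, `runn`, shows how rule-based chopping can produce non-words.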


## 6. Lemmatization
- **Goal**: Convert a word to its **dictionary form** (*lemma*), considering its **meaning** and **part of speech**.
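- **How it works**: Uses vocabulary and grammar rules to return real words. NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed (for example `pos="v"` for verbs).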
- **Examples**:
  - `"playing"` → `"play"`
  - `"better"` → `"good"`
  - `"flies"` → `"fly"` ✅ *(real word)*
- **Think of it like**: A smarter, more accurate version of stemming.
- **Use case**: When understanding and accuracy matter (e.g., chatbots, NLP pipelines).

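```python
# Needs the NLTK "wordnet" corpus: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()

print("Lemmatization")
# lemmatize() assumes a noun by default and is case-sensitive,
# so pass lowercase words and the right part of speech
print("playing ->", lm.lemmatize("playing", pos="v"))
print("played ->", lm.lemmatize("played", pos="v"))
print("better ->", lm.lemmatize("better", pos="a"))
print("flies ->", lm.lemmatize("flies"))

# Output:
# Lemmatization
# playing -> play
# played -> play
# better -> good
# flies -> fly
```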
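
The role of the part of speech can be illustrated with a toy lookup table. This is a sketch only: `LEMMA_TABLE` is a tiny hand-written assumption, whereas real lemmatizers rely on a full lexicon such as WordNet, and `toy_lemmatize` is a hypothetical helper.

```python
# Assumption: a tiny hand-written table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("playing", "v"): "play",
    ("played", "v"): "play",
    ("better", "a"): "good",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

as_noun = toy_lemmatize("playing")           # no noun entry, so unchanged
as_verb = toy_lemmatize("playing", pos="v")
print(as_noun)  # playing
print(as_verb)  # play
```

The same word maps to different results depending on the tag, which is why a lemmatizer needs the part of speech to do better than a stemmer.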
