
## Stemming

Stemming is a process in natural language processing (NLP) that reduces words to a root or base form, called a "stem". The goal is to group together words that share a meaning but have different endings (e.g., "running" and "runs" are both reduced to the stem "run"). Irregular forms such as "ran" are not reachable by stripping suffixes, so a stemmer usually leaves them unchanged.

**How it works:**

Stemming typically works by removing suffixes from words based on a set of rules. For example, a simple rule might be to remove "ing" if the word ends with "ing" and has at least one vowel before the "ing". More complex algorithms, like the Porter stemming algorithm, apply several ordered steps of rules to handle a wider range of suffixes. They still do not look words up in a dictionary, so irregular forms are not mapped to their base word.
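
The simple "ing" rule described above can be sketched in plain Python. This is a toy illustration only, not the Porter algorithm, and the function name `strip_ing` is hypothetical.

```python
VOWELS = set("aeiou")

def strip_ing(word):
    """Remove a trailing 'ing' only if the remaining part contains a vowel."""
    if word.endswith("ing"):
        stem = word[:-3]
        if any(ch in VOWELS for ch in stem):
            return stem
    return word

for w in ["running", "eating", "sing", "ring"]:
    print(w, "->", strip_ing(w))
# running -> runn
# eating -> eat
# sing -> sing
# ring -> ring
```

The vowel check is what keeps short words such as "sing" intact, while "running" still ends up as the non-word "runn".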

**Why use stemming?**

*   **Reduce vocabulary size:** By reducing words to their stems, you can significantly reduce the number of unique words in your dataset, which can be beneficial for tasks like text classification or information retrieval.
*   **Improve search results:** When searching for information, stemming can help find relevant documents even if the query uses a different form of a word. For example, searching for "running" might also return documents containing "run" or "runs".
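
Both benefits can be seen on a tiny example. The sketch below uses a hypothetical `toy_stem` helper that strips a few suffixes; it stands in for a real stemmer and is only meant to show the effect on vocabulary size and query matching.

```python
def toy_stem(word):
    """Hypothetical toy stemmer: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "walk walks walked walking talk talks talked talking".split()

# Vocabulary size before and after stemming.
vocab_before = set(tokens)
vocab_after = {toy_stem(t) for t in tokens}
print(len(vocab_before), "->", len(vocab_after))  # 8 -> 2

# A query matches every token that shares its stem.
query = "walking"
matches = [t for t in tokens if toy_stem(t) == toy_stem(query)]
print(matches)  # ['walk', 'walks', 'walked', 'walking']
```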

**Limitations of stemming:**

*   **Over-stemming:** Stemming can sometimes remove too much of a word, resulting in stems that are not actual words or that group together words with different meanings (e.g., "universal" and "university" might be stemmed to the same root).
*   **Under-stemming:** Conversely, stemming might not remove enough of a word, failing to group together words that should be considered the same.
*   **Doesn't consider context:** Stemming is a rule-based process that doesn't take into account the context of a word, which can lead to incorrect stemming in some cases.
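
Over-stemming and under-stemming can be observed directly with NLTK's `PorterStemmer`. The word pairs below are chosen for illustration; "alumnus"/"alumni" is an assumed example that does not come from the text above.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: two words with different meanings collapse to one stem.
over = [porter.stem(w) for w in ["universal", "university"]]
print(over)  # ['univers', 'univers']

# Under-stemming: two forms of the same word keep different stems.
under = [porter.stem(w) for w in ["alumnus", "alumni"]]
print(under)
```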

Due to its limitations, stemming is often used in conjunction with or replaced by **lemmatization**, which is a more sophisticated process that uses a dictionary and considers the part of speech of a word to reduce it to its base or dictionary form (lemma). Lemmatization generally produces better results but is also more computationally expensive.

In [None]:
## Classification problem
## Is a product comment a positive review or a negative review?
## Reviews ---> [eating, eat, eaten] ----> eat, [going, gone, goes] ----> go

## RegexpStemmer

The `RegexpStemmer` is a simple stemming algorithm provided by the NLTK library in Python. It uses regular expressions to remove suffixes from words.

**How it works:**

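You provide `RegexpStemmer` with a regular expression pattern. The stemmer removes every part of the word that matches the pattern, so each alternative should be anchored with `$` if only suffixes are to be stripped. For example, you could use a regular expression like `ing$` to remove the "ing" suffix. The optional `min` argument sets a minimum word length: shorter words are returned unchanged.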
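
The mechanics can be approximated with the standard `re` module. This is a rough sketch that assumes the stemmer amounts to a regex substitution guarded by a minimum length; `regexp_stem` and `min_length` are hypothetical names, not NLTK's API.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Sketch: delete every match of `pattern`; leave words shorter than `min_length` alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

print(regexp_stem("eating", r"ing$", min_length=4))  # eat
print(regexp_stem("singing", r"ing$"))               # sing
print(regexp_stem("singing", r"ing"))                # s   (unanchored: both "ing" removed)
print(regexp_stem("bus", r"s$", min_length=4))       # bus (too short, returned unchanged)
```

The third call shows why the `$` anchor matters: without it the pattern is removed wherever it occurs, not only at the end of the word.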

**When to use it:**

`RegexpStemmer` is useful when you have specific suffixes you want to remove and can define a regular expression to capture them. It's simpler and faster than more complex algorithms like Porter or Snowball, but it's also less comprehensive and may not handle irregular words or complex suffix combinations as well.

**Limitations:**

*   Requires knowledge of regular expressions.
*   May over-stem or under-stem depending on the chosen pattern.
*   Doesn't consider the context or part of speech of words.

Here's an example of how to use `RegexpStemmer` in NLTK:

In [None]:
from nltk.stem import RegexpStemmer

stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['running', 'runs', 'runner', 'fairly', 'adjustable', 'good']
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['runn', 'run', 'runner', 'fairly', 'adjust', 'good']


In [None]:
# Same pattern as above, bound to a separate name for the single-word examples
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [None]:
reg_stemmer.stem('eating')

'eat'

## PorterStemmer

The `PorterStemmer` is one of the most common and widely used stemming algorithms for English. It's an algorithmic stemmer that applies a series of rules to remove suffixes from words.

**How it works:**

The Porter stemming algorithm works in multiple steps, with each step applying a set of rules to transform the word. These rules are designed to handle various English suffixes in a specific order to produce the correct stem. The algorithm is deterministic, meaning it will always produce the same stem for a given word.
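
To give a feel for what one such step looks like, here is a sketch of Step 1a from Porter's published algorithm, which handles plural endings. It covers this single step only; the full algorithm has several more steps with conditions on the remaining stem, and NLTK's implementation adds its own extensions. The function name `porter_step_1a` is hypothetical.

```python
# Step 1a rules as (suffix, replacement); the first matching suffix wins.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    """Apply only Step 1a of the Porter algorithm to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The rule order matters: checking "ss" before "s" is what stops "caress" from losing its final letter.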

**When to use it:**

The `PorterStemmer` is a good general-purpose stemmer for English text. It's relatively efficient and has been widely tested. It's often used in information retrieval and text mining tasks.

**Limitations:**

*   **Not a dictionary-based stemmer:** It doesn't use a dictionary to check if the resulting stem is a valid word. This can sometimes lead to over-stemming or under-stemming.
*   **Specific to English:** The rules are designed for English and may not work well for other languages.
*   **Can be aggressive:** In some cases, it can be too aggressive in removing suffixes, leading to stems that are not intuitive.

Here's an example of how to use `PorterStemmer` in NLTK:

In [None]:
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

word_list = ["program", "programs", "programmer", "programming", "programmers"]
stemmed_words = [porter_stemmer.stem(word) for word in word_list]

print(stemmed_words)

['program', 'program', 'programm', 'program', 'programm']


## SnowballStemmer

The `SnowballStemmer` gives access to the Snowball family of stemmers, which covers multiple languages. Its English stemmer is a revised version of the Porter algorithm and is also known as the Porter2 stemming algorithm.

**How it works:**

Similar to the PorterStemmer, the SnowballStemmer applies a series of rules to remove suffixes. Its English rules revise and extend the original Porter rules, so the two stemmers can disagree on some words, and Snowball can strip more in some cases. The key advantage is its support for various languages, selected by passing the language name during initialization.
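
As a small sketch of the language selection, the snippet below lists the languages NLTK exposes through `SnowballStemmer.languages` and builds a German stemmer. The German word is an arbitrary example, and no expected output is shown for it.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor
print(SnowballStemmer.languages)

# Pass the language name to get a stemmer for that language
german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("laufen")
print(german_stem)
```

Passing a name that is not in `SnowballStemmer.languages` raises a `ValueError`.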

**When to use it:**

Use `SnowballStemmer` when you want the revised (Porter2) rules for English or when working with text in languages other than English.

**Limitations:**

*   Like other algorithmic stemmers, it doesn't use a dictionary and may produce stems that are not valid words.
*   Can be more aggressive than desired in some cases.

Here's an example of how to use `SnowballStemmer` in NLTK for English:

In [None]:
from nltk.stem import SnowballStemmer

# Initialize SnowballStemmer for English
snowball_stemmer = SnowballStemmer("english")

words_to_stem = ["generous", "generation", "generously", "generate"]
stemmed_words = [snowball_stemmer.stem(word) for word in words_to_stem]

print(stemmed_words)

['generous', 'generat', 'generous', 'generat']


In [None]:
# `stemmer` is the RegexpStemmer defined earlier; its pattern has no rule for "ly"
stemmer.stem('fairly'), stemmer.stem('sportingly')

('fairly', 'sportingly')

In [None]:
snowball_stemmer.stem('fairly'), snowball_stemmer.stem('sportingly')

('fair', 'sport')

In [None]:
snowball_stemmer.stem('goes')

'goe'