# Text Preprocessing

Text preprocessing in natural language processing (NLP) refers to the process of cleaning and preparing text data for analysis or model training. It involves several steps to transform raw text into a format that is suitable for further processing. Text preprocessing is crucial in NLP tasks because it helps improve the quality of data and enhances the performance of downstream tasks such as text classification, sentiment analysis, and machine translation.

Here are the common steps involved in text preprocessing along with Python example code for each step:

1. **Lowercasing**: Convert all text to lowercase to ensure consistency in the text data.

```python
text = "Hello, World!"
text = text.lower()
print(text)  # Output: hello, world!
```

**Significance**: Lowercasing helps in standardizing the text, reducing the vocabulary size, and treating words with the same characters but different cases as identical.
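To make the vocabulary-size effect concrete, here is a minimal sketch that counts distinct tokens before and after lowercasing. The sample sentence is invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
# Illustrative sentence (not from any dataset); split() is a stand-in tokenizer
words = "The cat saw the Cat and THE dog".split()

vocab_raw = set(words)                    # case-sensitive vocabulary
vocab_lower = {w.lower() for w in words}  # vocabulary after lowercasing

print(len(vocab_raw))    # 8 distinct tokens
print(len(vocab_lower))  # 5 distinct tokens
print(sorted(vocab_lower))
```

Lowercasing also discards information: "US" and "us" become the same token, so it may be worth skipping for tasks where case carries meaning.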

2. **Tokenization**: Split the text into individual words or tokens.

```python
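import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead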

text = "Tokenization is the process of splitting text into words or tokens."
tokens = nltk.word_tokenize(text)
print(tokens)  # Output: ['Tokenization', 'is', 'the', 'process', 'of', 'splitting', 'text', 'into', 'words', 'or', 'tokens', '.']
```

**Significance**: Tokenization breaks down the text into smaller units, making it easier to analyze and process. It forms the basic units for further analysis, such as identifying parts of speech or extracting features.
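For intuition about what a tokenizer does, the following is a rough, dependency-free sketch built on a regular expression. It is an approximation for illustration only, not NLTK's algorithm, and the function name `simple_tokenize` is made up here.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Tokenization is the process of splitting text into words or tokens."
tokens = simple_tokenize(text)
print(tokens)
```

On this sentence the result matches the NLTK output shown above, but the two will differ on harder input such as contractions and abbreviations.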

3. **Removing Punctuation**: Eliminate punctuation marks from the text.

```python
import string

text = "Hello, World!"
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)  # Output: Hello World
```

**Significance**: Removing punctuation marks can help in simplifying the text and reducing noise in the data, especially for tasks like sentiment analysis or topic modeling.
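One caveat is that `string.punctuation` only lists ASCII punctuation, so typographic quotes and similar characters pass through untouched. A possible workaround, sketched below under the assumption that Unicode general categories are an acceptable definition of "punctuation", is to filter on `unicodedata.category`. The sample string is invented for illustration.

```python
import string
import unicodedata

text = "“Hello”, World…!"

# ASCII-only removal: curly quotes and the ellipsis character survive
ascii_clean = text.translate(str.maketrans('', '', string.punctuation))

# Unicode-aware removal: drop every character whose category starts with "P"
unicode_clean = "".join(
    ch for ch in text if not unicodedata.category(ch).startswith("P")
)

print(ascii_clean)    # “Hello” World…
print(unicode_clean)  # Hello World
```

The two approaches are not equivalent: characters such as `$` and `+` are in `string.punctuation` but belong to Unicode symbol categories, so the second approach keeps them.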

4. **Removing Stopwords**: Remove common words that do not carry much meaning, known as stopwords.

```python
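import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

text = "This is a sample sentence, demonstrating the removal of stopwords."
stop_words = set(stopwords.words('english'))
tokens = nltk.word_tokenize(text)  # tokenize this sentence before filtering
# The comparison is case-sensitive, so the capitalized "This" is kept
tokens = [word for word in tokens if word not in stop_words]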
print(tokens)  # Output: ['This', 'sample', 'sentence', ',', 'demonstrating', 'removal', 'stopwords', '.']
```

**Significance**: Stopwords like "is", "the", "of" etc., occur frequently but often don't contribute much to the meaning of the text. Removing them reduces noise and improves the efficiency of downstream tasks.
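Stopword removal is not always harmless. NLTK's English list includes "not", and dropping it can invert the meaning of a sentence. The sketch below uses a tiny hand-written stopword set (an assumption for illustration, not NLTK's list) to show one way of protecting negations.

```python
# Tiny hand-written stopword set for illustration; NLTK's English list is much longer
stop_words = {"this", "is", "a", "the", "not"}
negations = {"not", "no", "never"}

tokens = "this movie is not good".split()

# Plain removal drops the negation along with the other stopwords
removed_all = [t for t in tokens if t not in stop_words]

# Subtracting the negation words from the stopword set keeps them in the text
kept_negations = [t for t in tokens if t not in (stop_words - negations)]

print(removed_all)     # ['movie', 'good']
print(kept_negations)  # ['movie', 'not', 'good']
```

For sentiment analysis in particular, the second form is usually the safer starting point.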

5. **Stemming or Lemmatization**: Reduce words to their base or root form.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "running"
stemmed_word = porter.stem(word)
# lemmatize() treats the word as a noun unless a pos argument is given
lemmatized_word = lemmatizer.lemmatize(word)

print("Stemmed Word:", stemmed_word)  # Output: run
print("Lemmatized Word:", lemmatized_word)  # Output: running (unchanged as a noun)
```

**Significance**: Stemming and lemmatization reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. This helps in reducing the vocabulary size and improving model generalization.

Text preprocessing is a critical step in NLP pipelines: it improves the quality of text data, makes downstream tasks more efficient, and supports better model performance. Each step cleans or normalizes the raw text in a different way, and which steps to apply depends on the task at hand.

### Stemming and Lemmatization

Stemming and Lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root form, but they operate differently and serve different purposes.

**Stemming**:

Stemming is the process of removing suffixes from words to reduce them to their root or base form, known as the stem. Stemming algorithms work by applying heuristic rules to chop off suffixes. The resulting stems may not always be actual words, but they are typically shorter versions that represent the core meaning of the word.

For example, stemming the word "running" would result in "run", and stemming "cats" would yield "cat".

**Example**:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

word = "running"
stemmed_word = porter.stem(word)

print("Stemmed Word:", stemmed_word)  # Output: run
```

**Significance of Stemming**:

- Stemming is computationally less expensive compared to lemmatization, making it faster.
- It's useful in information retrieval systems or search engines where speed is crucial.
- Stemmed words are often used in applications where exact meaning isn't as important as matching similar words.

**Lemmatization**:

Lemmatization, on the other hand, is the process of reducing words to their base or dictionary form, known as the lemma. Unlike stemming, lemmatization considers the context of the word and morphological analysis to produce the root form. This means that the resulting lemma is a valid word that can be found in a dictionary.

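For example, lemmatizing "running" as a verb yields "run", and lemmatizing "cats" yields "cat". Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed (for example `pos="v"`), which is why the example below returns "running" unchanged.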

**Example**:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "running"
lemmatized_word = lemmatizer.lemmatize(word)

print("Lemmatized Word:", lemmatized_word)  # Output: running
```

**Significance of Lemmatization**:

- Lemmatization produces valid words, which can be advantageous in tasks where semantic meaning is important.
- It's particularly useful in applications like question answering systems, chatbots, or sentiment analysis, where accurate understanding of the text is required.
- Lemmatization can help improve the interpretability of text data by reducing words to their canonical form.

In summary, both stemming and lemmatization aim to reduce words to their base forms. Stemming is the simpler, heuristic approach and often produces non-dictionary words, whereas lemmatization returns valid dictionary words by using morphological analysis and, where available, the word's part of speech. The choice between them depends on the specific requirements of the NLP task at hand.
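The contrast can be sketched without any library. The code below is a deliberately crude illustration: `toy_stem` strips a handful of suffixes (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a hypothetical three-entry dictionary that stands in for a real lexicon such as WordNet.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix; a toy rule set, not the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final letter, e.g. "runn" -> "run"
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

# Hypothetical mini-dictionary standing in for a real lexicon
LEMMAS = {"running": "run", "leaves": "leaf", "children": "child"}

def toy_lemmatize(word):
    """Look the word up; return it unchanged when it is not in the dictionary."""
    return LEMMAS.get(word, word)

for w in ["running", "leaves", "children", "happily"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function produces fragments such as "leav" and "happi" and cannot handle the irregular plural "children", while the lookup returns real words but only for entries it knows about.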

### Contractions

Contractions are shortened versions of words or phrases that are formed by combining two words and replacing one or more letters with an apostrophe. They are commonly used in informal writing and speech to make communication more concise and efficient. Examples of contractions include "can't" (cannot), "won't" (will not), "I'm" (I am), and "he's" (he is).

In natural language processing (NLP), dealing with contractions is important during text preprocessing to ensure that the text is properly tokenized and understood by models. Failing to handle contractions can lead to incorrect tokenization and misinterpretation of text data.

Here's how you can handle contractions in Python:

```python
contractions = {
    "ain't": "am not / are not",
    "aren't": "are not / am not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he had / he would",
    "he'd've": "he would have",
    "he'll": "he shall / he will",
    "he'll've": "he shall have / he will have",
    "he's": "he has / he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how has / how is",
    "I'd": "I had / I would",
    "I'd've": "I would have",
    "I'll": "I shall / I will",
    "I'll've": "I shall have / I will have",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it had / it would",
    "it'd've": "it would have",
    "it'll": "it shall / it will",
    "it'll've": "it shall have / it will have",
    "it's": "it has / it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she had / she would",
    "she'd've": "she would have",
    "she'll": "she shall / she will",
    "she'll've": "she shall have / she will have",
    "she's": "she has / she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as / so is",
    "that'd": "that would / that had",
    "that'd've": "that would have",
    "that's": "that has / that is",
    "there'd": "there had / there would",
    "there'd've": "there would have",
    "there's": "there has / there is",
    "they'd": "they had / they would",
    "they'd've": "they would have",
    "they'll": "they shall / they will",
    "they'll've": "they shall have / they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we had / we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what shall / what will",
    "what'll've": "what shall have / what will have",
    "what're": "what are",
    "what's": "what has / what is",
    "what've": "what have",
    "when's": "when has / when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where has / where is",
    "where've": "where have",
    "who'll": "who shall / who will",
    "who'll've": "who shall have / who will have",
    "who's": "who has / who is",
    "who've": "who have",
    "why's": "why has / why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you had / you would",
    "you'd've": "you would have",
    "you'll": "you shall / you will",
    "you'll've": "you shall have / you will have",
    "you're": "you are",
    "you've": "you have"
}

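import re

text = "I ain't got time for this, I'll be there in a sec."

# Match whole contractions only, longest first, so that "he's" is not
# replaced inside "she's" and "I'll" is not replaced inside "I'll've"
pattern = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(c) for c in sorted(contractions, key=len, reverse=True))
    + r")(?!\w)"
)

# Ambiguous entries list their readings separated by " / "; keep the first one
text = pattern.sub(lambda m: contractions[m.group(1)].split(" / ")[0], text)

print("Expanded Text:", text)
```

**Output**:
```
Expanded Text: I am not got time for this, I shall be there in a sec.
```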

**Significance of Handling Contractions**:

- By expanding contractions, we ensure that words are correctly tokenized during text preprocessing.
- It helps in maintaining the integrity of the text data and improves the accuracy of downstream tasks.
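The tokenization point can be illustrated with a short sketch. The tokenizer here is a naive regular expression chosen for illustration (real tokenizers treat apostrophes more carefully), and the single expansion is hand-picked.

```python
import re

def word_tokens(text):
    """Naive tokenizer: keep runs of letters and digits only."""
    return re.findall(r"[A-Za-z0-9]+", text)

raw = "I can't go"
expanded = raw.replace("can't", "cannot")  # single hand-picked expansion

print(word_tokens(raw))       # ['I', 'can', 't', 'go']
print(word_tokens(expanded))  # ['I', 'cannot', 'go']
```

Without expansion, the negation is split into "can" and a stray "t", so a model that sees only the tokens loses the "not".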

# Coding Exercise

In [1]:
text = "Hello, World!"
print(text)

text = text.lower()
print(text)  # Output: hello, world!

Hello, World!
hello, world!


In [2]:
import pandas as pd

# Sample DataFrame
data = {
    'text': ["Hello, World!", "How Are You?", "Python Is Awesome!"]
}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,"Hello, World!"
1,How Are You?
2,Python Is Awesome!


In [3]:
# Convert text to lowercase
df['text'] = df['text'].str.lower()

print(df)


                 text
0       hello, world!
1        how are you?
2  python is awesome!


### Tokenization

In [5]:
!pip install nltk -q

In [7]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
text = "Tokenization is the process of splitting text into words or tokens."
print(text)
print()
tokens = nltk.word_tokenize(text)
print(tokens)

Tokenization is the process of splitting text into words or tokens.

['Tokenization', 'is', 'the', 'process', 'of', 'splitting', 'text', 'into', 'words', 'or', 'tokens', '.']


In [9]:
# Sample DataFrame
data = {
    'text': ["Tokenization is the process of splitting text into words or tokens.",
             "Natural language processing (NLP) is a fascinating field."]
}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,Tokenization is the process of splitting text ...
1,Natural language processing (NLP) is a fascina...


In [11]:
pd.set_option('max_colwidth', None)

In [13]:
# Tokenize text
df['tokens'] = df['text'].apply(nltk.word_tokenize)
df

Unnamed: 0,text,tokens
0,Tokenization is the process of splitting text into words or tokens.,"[Tokenization, is, the, process, of, splitting, text, into, words, or, tokens, .]"
1,Natural language processing (NLP) is a fascinating field.,"[Natural, language, processing, (, NLP, ), is, a, fascinating, field, .]"


### Removing Punctuation

In [14]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
tweet = """Next stop... #GoblinTown?

#CanTheDevsDoSomething?

#Cryptocrash https://t.co/iIoebv9Aaz"""
print(tweet,"\n\n")


import re
clean_tweet = re.sub(r'[^\w\s]','',tweet)
print("Clean Tweet:\n")
print(clean_tweet)

Next stop... #GoblinTown? 

#CanTheDevsDoSomething?

#Cryptocrash https://t.co/iIoebv9Aaz 


Clean Tweet:

Next stop GoblinTown 

CanTheDevsDoSomething

Cryptocrash httpstcoiIoebv9Aaz
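Note that stripping punctuation turned the link into the meaningless token `httpstcoiIoebv9Aaz`. One possible remedy, sketched below, is to remove URLs before removing punctuation. The URL pattern is a simple assumption that covers `http://` and `https://` links only.

```python
import re

tweet = """Next stop... #GoblinTown?

#CanTheDevsDoSomething?

#Cryptocrash https://t.co/iIoebv9Aaz"""

no_urls = re.sub(r"https?://\S+", "", tweet)  # drop links first
no_punct = re.sub(r"[^\w\s]", "", no_urls)    # then strip punctuation
clean_tweet = " ".join(no_punct.split())      # collapse whitespace and newlines

print(clean_tweet)  # Next stop GoblinTown CanTheDevsDoSomething Cryptocrash
```

The order matters: once punctuation is gone, the `://` marker that identifies a URL is gone too.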


In [16]:
# Sample DataFrame
data = {
    'tweet': [
        """Next stop... #GoblinTown?

#CanTheDevsDoSomething?

#Cryptocrash https://t.co/iIoebv9Aaz""",
        "This tweet contains no punctuation!",
        "What's up with all these #hashtags and @mentions?"
    ]
}
df = pd.DataFrame(data)

df

Unnamed: 0,tweet
0,Next stop... #GoblinTown? \n\n#CanTheDevsDoSomething?\n\n#Cryptocrash https://t.co/iIoebv9Aaz
1,This tweet contains no punctuation!
2,What's up with all these #hashtags and @mentions?


In [17]:
# Remove punctuation
df['clean_tweet'] = df['tweet'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

df

Unnamed: 0,tweet,clean_tweet
0,Next stop... #GoblinTown? \n\n#CanTheDevsDoSomething?\n\n#Cryptocrash https://t.co/iIoebv9Aaz,Next stop GoblinTown \n\nCanTheDevsDoSomething\n\nCryptocrash httpstcoiIoebv9Aaz
1,This tweet contains no punctuation!,This tweet contains no punctuation
2,What's up with all these #hashtags and @mentions?,Whats up with all these hashtags and mentions


### Removing Stopwords

In [18]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [19]:
len(stopwords.words())

10405

In [20]:
len(stopwords.words("english"))

179

In [22]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [23]:
len(stopwords.fileids())

29

In [24]:
text = "This is a sample sentence, demonstrating the removal of stopwords."
print(text)

stop_words = set(stopwords.words('english'))
tokens = [word for word in text.split(" ") if word.lower() not in stop_words]
print(tokens)

This is a sample sentence, demonstrating the removal of stopwords.
['sample', 'sentence,', 'demonstrating', 'removal', 'stopwords.']


In [25]:
# Sample DataFrame
data = {
    'text': [
        "This is a sample sentence, demonstrating the removal of stopwords.",
        "NLTK provides a list of common stopwords.",
        "Removing stopwords is essential for text preprocessing.",
        "Stopwords include words like 'the', 'is', 'a', 'and', 'of'.",
        "After removing stopwords, text data becomes more meaningful for analysis."
    ]
}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,"This is a sample sentence, demonstrating the removal of stopwords."
1,NLTK provides a list of common stopwords.
2,Removing stopwords is essential for text preprocessing.
3,"Stopwords include words like 'the', 'is', 'a', 'and', 'of'."
4,"After removing stopwords, text data becomes more meaningful for analysis."


In [26]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
df['clean_text'] = df['text'].apply(lambda x: ' '.join(
    [word for word in x.split() if word.lower() not in stop_words]))

df

Unnamed: 0,text,clean_text
0,"This is a sample sentence, demonstrating the removal of stopwords.","sample sentence, demonstrating removal stopwords."
1,NLTK provides a list of common stopwords.,NLTK provides list common stopwords.
2,Removing stopwords is essential for text preprocessing.,Removing stopwords essential text preprocessing.
3,"Stopwords include words like 'the', 'is', 'a', 'and', 'of'.","Stopwords include words like 'the', 'is', 'a', 'and', 'of'."
4,"After removing stopwords, text data becomes more meaningful for analysis.","removing stopwords, text data becomes meaningful analysis."
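Row 3 is unchanged because `split()` leaves quotes and commas attached to each word, so `'the',` never matches the stopword `the`. The sketch below shows the effect and one possible fix: extracting alphabetic runs before filtering. The stopword set is a small hand-written subset for illustration, not NLTK's list.

```python
import re

# Small hand-written stopword set for illustration; NLTK's English list is longer
stop_words = {"the", "is", "a", "and", "of"}

text = "Stopwords include words like 'the', 'is', 'a', 'and', 'of'."

# split() leaves quotes and commas attached, so "'the'," never equals "the"
by_split = [w for w in text.split() if w.lower() not in stop_words]

# Extracting alphabetic runs separates words from the surrounding punctuation
by_regex = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

print(by_split)  # all nine whitespace-separated chunks survive
print(by_regex)  # ['stopwords', 'include', 'words', 'like']
```

In practice this means tokenizing (or removing punctuation) before stopword removal rather than after.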


### Stemming

https://www.geeksforgeeks.org/introduction-to-stemming/

In [27]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

word = "running"
stemmed_word = porter.stem(word)

print("Stemmed Word:", stemmed_word)

Stemmed Word: run


In [28]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

word = "leaves"
stemmed_word = porter.stem(word)

print("Stemmed Word:", stemmed_word)

Stemmed Word: leav


In [29]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

word = "happy"
stemmed_word = porter.stem(word)

print("Stemmed Word:", stemmed_word)

Stemmed Word: happi


In [30]:
# Implementation of Porter Stemmer

from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "jumps", "happily", "running", "happily", "leaves"]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)


Original words: ['running', 'jumps', 'happily', 'running', 'happily', 'leaves']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili', 'leav']


In [31]:
# Implementation of Snowball Stemmer

from nltk.stem import SnowballStemmer

# Choose a language for stemming, for example, English
stemmer = SnowballStemmer(language='english')

# Example words to stem
words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes', "leaves"]

# Apply Snowball Stemmer
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results
print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)


Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes', 'leaves']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox', 'leav']


In [32]:
# Implementation of Lancaster Stemmer

from nltk.stem import LancasterStemmer

# Create a Lancaster Stemmer instance
stemmer = LancasterStemmer()

# Example words to stem
words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes', "leaves", "children"]

# Apply Lancaster Stemmer
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results
print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)


Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes', 'leaves', 'children']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox', 'leav', 'childr']


In [33]:
# Sample DataFrame
data = {
    'text': [
        "running is fun",
        "I am running",
        "They run every morning",
        "The runner ran fast",
        "Runners are competitive"
    ]
}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,running is fun
1,I am running
2,They run every morning
3,The runner ran fast
4,Runners are competitive


In [34]:
# Initialize PorterStemmer
porter = PorterStemmer()

# Apply PorterStemmer to each row in the DataFrame
df['stemmed_text'] = df['text'].apply(
    lambda x: ' '.join([porter.stem(word) for word in x.split()]))

df

Unnamed: 0,text,stemmed_text
0,running is fun,run is fun
1,I am running,i am run
2,They run every morning,they run everi morn
3,The runner ran fast,the runner ran fast
4,Runners are competitive,runner are competit


### Contractions

https://pypi.org/project/contractions/

In [37]:
!pip install contractions -q


In [38]:
import contractions
contractions.fix("Shouldn't")

'Should not'

In [39]:
# import library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.'''

# creating an empty list
expanded_words = []
for word in text.split():
    # using contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))

expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print("\n\n")
print('Expanded_text: ' + expanded_text)


Original text: I'll be there within 5 min. Shouldn't you be there too? 
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.



Expanded_text: I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.


### Lemmatization

In [41]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [42]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "running"
lemmatized_word = lemmatizer.lemmatize(word)

print("Lemmatized Word:", lemmatized_word)  # Output: running


Lemmatized Word: running


In [43]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "leaves"
lemmatized_word = lemmatizer.lemmatize(word)

print("Lemmatized Word:", lemmatized_word)  # Output: leaf


Lemmatized Word: leaf


In [44]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "children"
lemmatized_word = lemmatizer.lemmatize(word)

print("Lemmatized Word:", lemmatized_word)  # Output: child


Lemmatized Word: child


In [45]:
import pandas as pd
from nltk.stem import WordNetLemmatizer

# Sample DataFrame
data = {
    'text': [
        "children play in the park",
        "The child is playing outside",
        "There are many children in the school",
        "She has two children",
        "The children's toys are scattered around"
    ]
}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,children play in the park
1,The child is playing outside
2,There are many children in the school
3,She has two children
4,The children's toys are scattered around


In [46]:
# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Apply WordNetLemmatizer to each row in the DataFrame
df['lemmatized_text'] = df['text'].apply(
    lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

df

Unnamed: 0,text,lemmatized_text
0,children play in the park,child play in the park
1,The child is playing outside,The child is playing outside
2,There are many children in the school,There are many child in the school
3,She has two children,She ha two child
4,The children's toys are scattered around,The children's toy are scattered around


## End to End Example

In [47]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

In [48]:
# Sample DataFrame with unclean text data
data = {
    'text': [
        "This is a sample sentence, demonstrating the removal of stopwords.",
        "NLTK provides a list of common stopwords.",
        "Removing stopwords is essential for text preprocessing.",
        "Stopwords include words like 'the', 'is', 'a', 'and', 'of'.",
        "After removing stopwords, text data becomes more meaningful for analysis."
    ]
}
df = pd.DataFrame(data)
df

Unnamed: 0,text
0,"This is a sample sentence, demonstrating the removal of stopwords."
1,NLTK provides a list of common stopwords.
2,Removing stopwords is essential for text preprocessing.
3,"Stopwords include words like 'the', 'is', 'a', 'and', 'of'."
4,"After removing stopwords, text data becomes more meaningful for analysis."


In [49]:
# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [50]:
# Initialize NLTK components
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [51]:
# Function to apply all preprocessing steps to text
def preprocess_text(text):
    # Lowercasing
    text = text.lower()

    # Tokenization
    tokens = word_tokenize(text)

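    # Remove punctuation tokens such as "," and "."; tokens with an
    # attached quote, e.g. "'the", are not caught by this check
    tokens = [word for word in tokens if word not in string.punctuation]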

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Stemming or Lemmatization
    stemmed_tokens = [porter.stem(word) for word in tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return {
        'original_text': text,
        'tokens': tokens,
        'stemmed_tokens': stemmed_tokens,
        'lemmatized_tokens': lemmatized_tokens
    }

In [52]:
# Apply preprocessing to each row in the DataFrame
preprocessed_data = []
for index, row in df.iterrows():
    preprocessed_result = preprocess_text(row['text'])
    preprocessed_data.append(preprocessed_result)

# Create a new DataFrame with preprocessed data
preprocessed_df = pd.DataFrame(preprocessed_data)

# Print the preprocessed DataFrame
preprocessed_df

Unnamed: 0,original_text,tokens,stemmed_tokens,lemmatized_tokens
0,"this is a sample sentence, demonstrating the removal of stopwords.","[sample, sentence, demonstrating, removal, stopwords]","[sampl, sentenc, demonstr, remov, stopword]","[sample, sentence, demonstrating, removal, stopwords]"
1,nltk provides a list of common stopwords.,"[nltk, provides, list, common, stopwords]","[nltk, provid, list, common, stopword]","[nltk, provides, list, common, stopwords]"
2,removing stopwords is essential for text preprocessing.,"[removing, stopwords, essential, text, preprocessing]","[remov, stopword, essenti, text, preprocess]","[removing, stopwords, essential, text, preprocessing]"
3,"stopwords include words like 'the', 'is', 'a', 'and', 'of'.","[stopwords, include, words, like, 'the, 'is, 'and, 'of]","[stopword, includ, word, like, 'the, 'i, 'and, 'of]","[stopwords, include, word, like, 'the, 'is, 'and, 'of]"
4,"after removing stopwords, text data becomes more meaningful for analysis.","[removing, stopwords, text, data, becomes, meaningful, analysis]","[remov, stopword, text, data, becom, meaning, analysi]","[removing, stopwords, text, data, becomes, meaningful, analysis]"
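Row 3 of the table above still contains tokens such as `'the` and `'is`, because the leading quote keeps them from matching either the punctuation check or the stopword list. A minimal sketch of one way to handle this is to trim punctuation from the edges of each token before filtering. The token list and stopword set below are illustrative stand-ins, not the pipeline's actual data.

```python
import string

# Tokens resembling those in row 3 (leading quote left attached)
tokens = ["stopwords", "include", "'the", "'is", "'and", "."]
stop_words = {"the", "is", "and"}  # illustrative subset, not NLTK's full list

# Trim punctuation from both ends of every token
stripped = [t.strip(string.punctuation) for t in tokens]

# Drop tokens that became empty, then apply the stopword filter
cleaned = [t for t in stripped if t and t not in stop_words]

print(stripped)  # ['stopwords', 'include', 'the', 'is', 'and', '']
print(cleaned)   # ['stopwords', 'include']
```

The same `strip` call could be added to `preprocess_text` between the tokenization and stopword steps.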


# Happy Learning