# Text Preprocessing Steps
- Remove HTML tags
- Replace special characters with spaces
- Collapse repeated spaces between words into a single space
- Convert to lowercase
- Remove stop words
- Drop tokens that are not purely alphabetic (for example, numbers)
- Reduce each word to its base form (stemming)
- Strip leading and trailing spaces, if any
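The purely regex-based steps can be tried on their own before stop words and stemming come into play. The following is a minimal sketch using only the standard `re` module; the helper name `clean_text` and the padded sample string are assumptions made for illustration, not part of the notebook.

```python
import re

def clean_text(text):
    """Illustrative helper: apply only the regex-based cleaning steps."""
    text = re.sub(r"<.*?>", " ", text)            # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)   # replace special characters
    text = re.sub(r"\s+", " ", text)              # collapse repeated whitespace
    return text.lower().strip()                   # lowercase, trim both ends

raw = "  I need them for my garage door opener.<br />Great deal   for the price.  "
print(clean_text(raw))
# i need them for my garage door opener great deal for the price
```

Stop-word removal and stemming are left out here because they depend on NLTK.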

# Sample Input

###  I love eating them and they are good for watching TV and looking at movies! It is not too sweet. I like to transfer them to a zip lock baggie so they stay fresh so I can take my time eating them.

###  I was so glad Amazon carried these batteries. I have a hard time finding them elsewhere because they are such a unique size. I need them for my garage door opener.<br />Great deal for the price.

###  Watch your prices with this. While the assortment was good, and I did get this on a gold box purchase, the price for this was<br />$3-4 less at Target.

# Sample Output
###  love eat good watch tv look movi sweet like transfer zip lock baggi stay fresh take time eat

### glad amazon carri batteri hard time find elsewher uniqu size need garag door open great deal price

### watch price assort good get gold box purchas price less target

In [81]:
# Import the required libraries
from nltk.corpus import stopwords
import re
from nltk.stem import PorterStemmer

In [82]:
# Store the English stop words in a set and create the stemmer
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

In [83]:
print(stop_words)

{'y', 'won', 'itself', 'hasn', 'both', 'will', 'if', 'a', 'few', 'more', 'to', 'her', 'nor', 'over', 'why', 'than', 'these', 's', 'mustn', 'does', 'herself', 'or', 'don', 'as', 'we', 're', 'yourselves', 'me', 'm', "wasn't", 'yourself', 'it', 'hadn', 'up', 've', 'at', 'are', 'below', "mightn't", 'our', 'was', 'off', 'had', 'mightn', "haven't", 'doesn', 'you', "that'll", 'your', 'while', 'where', 'all', 'not', "won't", 'that', 'about', 'whom', 'hers', 'being', 'no', "couldn't", 'ma', 'by', 'after', "shouldn't", 'against', 'then', 'with', "aren't", 'until', 'so', 'between', 'should', 'they', 'themselves', 'be', 'wouldn', "you'd", 'who', 'into', 'how', "hasn't", 'been', 'this', 'himself', 'd', 'o', 'ourselves', 'on', 'myself', "needn't", 'during', "didn't", 'shan', 'each', 'further', 'him', 'have', 't', "mustn't", "wouldn't", "don't", 'again', "doesn't", 'there', 'of', 'any', 'just', 'when', 'haven', 'yours', 'my', 'out', 'here', 'under', 'other', 'from', 'only', 'did', 'weren', 'is', 'the

In [84]:
# Add the sample sentences to a list
k = []
k.append('I love eating them and they are good for watching TV and looking at movies! It is not too sweet. I like to transfer them to a zip lock baggie so they stay fresh so I can take my time eating them.')
k.append('I was so glad Amazon carried these batteries. I have a hard time finding them elsewhere because they are such a unique size. I need them for my garage door opener.<br />Great deal for the price.')
k.append('Watch your prices with this. While the assortment was good, and I did get this on a gold box purchase, the price for this was<br />$3-4 less at Target.')
print(k[0])
print()
print(k[1])
print()
print(k[2])

I love eating them and they are good for watching TV and looking at movies! It is not too sweet. I like to transfer them to a zip lock baggie so they stay fresh so I can take my time eating them.

I was so glad Amazon carried these batteries. I have a hard time finding them elsewhere because they are such a unique size. I need them for my garage door opener.<br />Great deal for the price.

Watch your prices with this. While the assortment was good, and I did get this on a gold box purchase, the price for this was<br />$3-4 less at Target.


In [85]:
# Input  - string to be preprocessed
# Output - string after preprocessing

def preprocessing_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)

    # Replace every special character with a space
    text = re.sub(r'[^a-zA-Z0-9\n]', ' ', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Convert all characters to lower case
    text = text.lower()

    # Keep only purely alphabetic words that are not stop words, and stem them
    stemmed_words = [ps.stem(word) for word in text.split()
                     if word not in stop_words and word.isalpha()]

    # Joining with single spaces leaves no leading or trailing spaces
    return ' '.join(stemmed_words)
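The order of the two substitutions matters: the HTML tags have to go before the special characters do. The sketch below, which uses only `re` and a fragment of the second sample review, shows what happens when the order is swapped. The variable names are illustrative.

```python
import re

text = "opener.<br />Great deal"

# Tags first, then special characters: the whole <br /> tag disappears.
tags_first = re.sub(r"[^a-zA-Z0-9\n]", " ", re.sub(r"<.*?>", " ", text)).split()

# Special characters first: "<" and ">" are already gone when the tag
# pattern runs, so it matches nothing and "br" survives as a token.
special_first = re.sub(r"<.*?>", " ", re.sub(r"[^a-zA-Z0-9\n]", " ", text)).split()

print(tags_first)     # ['opener', 'Great', 'deal']
print(special_first)  # ['opener', 'br', 'Great', 'deal']
```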

In [94]:
# Call the function and store the result in a new list
p = []
p.append(preprocessing_text(k[0]))
print("**************")
print("--Original Text--")
print(k[0])
print()
print("--Preprocessed Text--")
print(p[0])

**************
--Original Text--
I love eating them and they are good for watching TV and looking at movies! It is not too sweet. I like to transfer them to a zip lock baggie so they stay fresh so I can take my time eating them.

--Preprocessed Text--
love eat good watch tv look movi sweet like transfer zip lock baggi stay fresh take time eat


In [93]:
p.append(preprocessing_text(k[1]))
print("**************")
print("--Original Text--")
print(k[1])
print()
print("--Preprocessed Text--")
print(p[1])

**************
--Original Text--
I was so glad Amazon carried these batteries. I have a hard time finding them elsewhere because they are such a unique size. I need them for my garage door opener.<br />Great deal for the price.

--Preprocessed Text--
glad amazon carri batteri hard time find elsewher uniqu size need garag door open great deal price


In [92]:
print("**************")
p.append(preprocessing_text(k[2]))
print("--Original Text--")
print(k[2])
print()
print("--Preprocessed Text--")
print(p[2])

**************
--Original Text--
Watch your prices with this. While the assortment was good, and I did get this on a gold box purchase, the price for this was<br />$3-4 less at Target.

--Preprocessed Text--
watch price assort good get gold box purchas price less target
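The `$3-4` from the third review is missing from the output because, once the special characters are replaced, it becomes the tokens `3` and `4`, and both fail the `word.isalpha()` check. That check also drops tokens that mix letters and digits, which may or may not be what you want. A small sketch with made-up tokens:

```python
# Illustrative token list; "mp3" and "2nd" are made-up examples.
tokens = "price was 3 4 less mp3 player 2nd".split()

# str.isalpha() is True only when every character is a letter.
kept = [t for t in tokens if t.isalpha()]

print(kept)  # ['price', 'was', 'less', 'player']
```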


## - Feel free to use this code in your own preprocessing and adapt it to your requirements.
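One adaptation worth considering: the NLTK stop-word list printed earlier contains `not`, `no` and `nor`, so "It is not too sweet" is reduced to just "sweet". If negation matters for your task, you could take those words out of the stop-word set. The sketch below uses a tiny hypothetical stop-word set and a hypothetical helper `remove_stop_words` so that it runs without NLTK. In the notebook you would subtract the negations from `stop_words` instead.

```python
# Hypothetical, tiny stop-word set for illustration only.
base_stop_words = {"it", "is", "not", "too", "no", "nor"}
negations = {"not", "no", "nor"}
custom_stop_words = base_stop_words - negations

def remove_stop_words(text, stop_words):
    """Illustrative helper: lowercase, split on whitespace, drop stop words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

print(remove_stop_words("It is not too sweet", base_stop_words))    # sweet
print(remove_stop_words("It is not too sweet", custom_stop_words))  # not sweet
```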

## - That's all, folks. Thanks!