## Text preprocessing using the textacy library


**Author: Abhishek Dey**


**Documentation Link: https://textacy.readthedocs.io/en/latest/api_reference/preprocessing.html** 

## Installation

In [1]:
!pip3 install textacy



## Documents

In [2]:
d1="The sun rises in  the 'East'."
d2="Hello World ! ### !$ ..."
d3="Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\\www.google.com', Mob: 9435478273"
d4="<!DOCTYPE> M.S Dhoni is the captain of Chennai Super Kings. #Whistle-podu"

## Corpus

In [3]:
corpus=[d1,d2,d3,d4]

In [4]:
corpus

["The sun rises in  the 'East'.",
 'Hello World ! ### !$ ...',
 "Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\\www.google.com', Mob: 9435478273",
 '<!DOCTYPE> M.S Dhoni is the captain of Chennai Super Kings. #Whistle-podu']

## Text preprocessing

In [5]:
from textacy import preprocessing as tp

## Remove punctuation

In [6]:
filtered_doc = tp.remove.punctuation(d1)


print(d1)
print(filtered_doc)

The sun rises in  the 'East'.
The sun rises in  the  East  
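The output above keeps the same length as the input: each punctuation character appears to be swapped for a single space rather than deleted. A minimal standard-library sketch of that behaviour follows. The table `PUNCT_TO_SPACE` and the helper `strip_punctuation` are hypothetical names used for illustration, and the sketch covers ASCII punctuation only, so it approximates rather than reproduces what textacy does.

```python
import string

# Assumption: map every ASCII punctuation character to one space.
# This only approximates textacy's behaviour; it is not its implementation.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def strip_punctuation(text):
    """Replace each ASCII punctuation character with a single space."""
    return text.translate(PUNCT_TO_SPACE)

d1 = "The sun rises in  the 'East'."
result = strip_punctuation(d1)
print(repr(result))
```

Because characters are replaced one for one, runs of spaces and trailing spaces are left behind, which is why white space normalization is usually applied after this step.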


## Remove html tags

In [7]:
filtered_doc = tp.remove.html_tags(d4)

print(d4)
print(filtered_doc)

<!DOCTYPE> M.S Dhoni is the captain of Chennai Super Kings. #Whistle-podu
M.S Dhoni is the captain of Chennai Super Kings. #Whistle-podu


## Replace currency symbols with an empty string

In [8]:
filtered_doc = tp.replace.currency_symbols(d2,'')

print(d2)
print(filtered_doc)

Hello World ! ### !$ ...
Hello World ! ### ! ...


## Replace emails with an empty string

In [9]:
filtered_doc = tp.replace.emails(d3,'')

print(d3)
print(filtered_doc)

Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\www.google.com', Mob: 9435478273
Contact me : E-Mail Id: , Webiste: 'https:\www.google.com', Mob: 9435478273
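Passing `''` as the second argument deletes the match outright. An alternative is to substitute a placeholder token so that downstream steps still know an address was present; textacy's replace functions take such a replacement string. The sketch below illustrates the difference with the standard library only. The pattern `SIMPLE_EMAIL` and the helper `replace_emails` are hypothetical and deliberately simplified, not the pattern textacy uses.

```python
import re

# Simplified e-mail pattern for illustration only (assumption);
# it will miss many valid addresses and is not textacy's pattern.
SIMPLE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def replace_emails(text, repl="_EMAIL_"):
    """Substitute every match of SIMPLE_EMAIL in text with repl."""
    return SIMPLE_EMAIL.sub(repl, text)

text = "Contact me : E-Mail Id: xyz@gmail.com, Mob: 9435478273"
removed = replace_emails(text, "")   # address deleted
masked = replace_emails(text)        # address replaced by a placeholder
print(removed)
print(masked)
```

A placeholder keeps the sentence structure intact, whereas an empty string leaves a gap such as `Id: ,` that later cleaning steps have to tidy up.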


## Replace phone numbers with an empty string

In [10]:
filtered_doc = tp.replace.phone_numbers(d3,'')

print(d3)
print(filtered_doc)

Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\www.google.com', Mob: 9435478273
Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\www.google.com', Mob: 


## Replace URLs with an empty string

In [11]:
filtered_doc = tp.replace.urls(d3,'')

print(d3)
print(filtered_doc)

Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\www.google.com', Mob: 9435478273
Contact me : E-Mail Id: xyz@gmail.com, Webiste: 'https:\', Mob: 9435478273


## Normalize white space

In [12]:
filtered_doc = tp.normalize.whitespace(d1)

print(d1)
print(filtered_doc)

The sun rises in  the 'East'.
The sun rises in the 'East'.
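Here the double space between "in" and "the" is collapsed to one; the text is normalized rather than stripped of all white space. A rough standard-library equivalent for the simple case of spaces and tabs might look like the following. The helper name `collapse_whitespace` is hypothetical, and the sketch ignores line breaks and special Unicode spaces.

```python
import re

def collapse_whitespace(text):
    """Collapse runs of spaces or tabs into one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()

collapsed = collapse_whitespace("The sun rises in  the  East  ")
print(repr(collapsed))
```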


## Lowercase text

In [13]:
filtered_doc = d1.lower()

print(d1)
print(filtered_doc)

The sun rises in  the 'East'.
the sun rises in  the 'east'.


## Create a text pre-processing function

In [14]:
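def preprocess_text(text):
    """Clean one document: drop markup and contact details, then normalize."""
    # Strip HTML tags first so later patterns only see plain text.
    text = tp.remove.html_tags(text)
    # Replace e-mails, currency symbols, phone numbers and URLs with ''.
    text = tp.replace.emails(text, '')
    text = tp.replace.currency_symbols(text, '')
    text = tp.replace.phone_numbers(text, '')
    text = tp.replace.urls(text, '')
    # Punctuation becomes spaces, so normalize white space afterwards.
    text = tp.remove.punctuation(text)
    text = tp.normalize.whitespace(text)
    text = text.lower()
    return text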
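The function above applies a fixed sequence of single-argument steps, each feeding its result to the next. The same idea can be written as a small composition helper, which makes the order of steps explicit and easy to change. This is a sketch under stated assumptions: `make_text_pipeline` and `collapse_spaces` are hypothetical names, and plain `str` operations stand in for the textacy calls so that the example runs without the library. textacy's preprocessing module documents a `make_pipeline` helper in a similar spirit.

```python
from functools import reduce

def make_text_pipeline(*funcs):
    """Return a function that applies each callable in funcs, left to right."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline

def collapse_spaces(text):
    """Stand-in step: squeeze any run of white space into one space."""
    return " ".join(text.split())

# Order matters: steps run in the order they are listed.
clean = make_text_pipeline(collapse_spaces, str.lower)

cleaned = clean("The sun rises in  the 'East'.")
print(cleaned)
```

Steps that need extra arguments, such as a replacement string, can be bound beforehand with `functools.partial` so that every step still takes a single text argument.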
    

## Filtered Corpus after text pre-processing

In [16]:
filtered_corpus = [ preprocess_text(doc) for doc in corpus]

for doc in filtered_corpus:
    print(doc)

the sun rises in the east
hello world
contact me e mail id webiste https mob
m s dhoni is the captain of chennai super kings whistle podu
