# Text Preprocessing using Texthero

This notebook demonstrates how to use Texthero to preprocess text data. Texthero is a Python package for cleaning, visualizing, and running common NLP tasks on text stored in a pandas DataFrame or Series.

## Setting up the environment

Install Texthero with __pip install texthero__.

In [3]:
#!pip install texthero

In [4]:
# import required libraries
import pandas as pd
import texthero as hero

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Load data

After importing the libraries, we load the dataset. This tutorial uses a dataset of YouTube comments that I downloaded from the UCI Machine Learning Repository. We will use it to demonstrate how to preprocess text data with Texthero.

So let us import the dataset and perform a few checks before we dive into text preprocessing.

In [5]:
# load data
data = pd.read_csv('Youtube02-KatyPerry.csv')
data.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12pgdhovmrktzm3i23es5d5junftft3f,lekanaVEVO1,2014-07-22T15:27:50,i love this so much. AND also I Generate Free ...,1
1,z13yx345uxepetggz04ci5rjcxeohzlrtf4,Pyunghee,2014-07-27T01:57:16,http://www.billboard.com/articles/columns/pop-...,1
2,z12lsjvi3wa5x1vwh04cibeaqnzrevxajw00k,Erica Ross,2014-07-27T02:51:43,Hey guys! Please join me in my fight to help a...,1
3,z13jcjuovxbwfr0ge04cev2ipsjdfdurwck,Aviel Haimov,2014-08-01T12:27:48,http://psnboss.com/?ref=2tGgp3pV6L this is the...,1
4,z13qybua2yfydzxzj04cgfpqdt2syfx53ms0k,John Bello,2014-08-01T21:04:03,Hey everyone. Watch this trailer!!!!!!!! http...,1


We will first create a dataframe that contains only the 'CONTENT' column, dropping the columns we don't need.

In [6]:
# create a dataframe with the content column only
# (.copy() gives an independent dataframe, so adding columns later is safe)
df = data[['CONTENT']].copy()
df.head()

Unnamed: 0,CONTENT
0,i love this so much. AND also I Generate Free ...
1,http://www.billboard.com/articles/columns/pop-...
2,Hey guys! Please join me in my fight to help a...
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...
4,Hey everyone. Watch this trailer!!!!!!!! http...
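
The column selection above can be illustrated on a small stand-in dataframe. This is a minimal sketch with made-up rows (the values and the `length` column are hypothetical, not from the dataset). It shows that double brackets return a DataFrame while single brackets return a Series, and that calling `.copy()` gives an independent object. Without the copy, some pandas versions may emit a `SettingWithCopyWarning` when new columns are added.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset
data = pd.DataFrame({
    "AUTHOR": ["a", "b"],
    "CONTENT": ["Hey guys!", "Nice video"],
    "CLASS": [1, 0],
})

# Double brackets keep the result a DataFrame; .copy() detaches it from `data`
df = data[["CONTENT"]].copy()

# Single brackets return a Series instead
content = data["CONTENT"]

# Adding a column to the copy leaves the original dataframe untouched
df["length"] = df["CONTENT"].str.len()
print(type(df).__name__, type(content).__name__, list(data.columns))
```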


## Text Preprocessing

Using the __.clean()__ function

In [7]:
# using the .clean() function
df['clean_content'] = hero.clean(df['CONTENT'])
df.head(10)

Unnamed: 0,CONTENT,clean_content
0,i love this so much. AND also I Generate Free ...,love much also generate free leads auto pilot ...
1,http://www.billboard.com/articles/columns/pop-...,http www billboard com articles columns pop sh...
2,Hey guys! Please join me in my fight to help a...,hey guys please join fight help abused mistrea...
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...,http psnboss com ref 2tggp3pv6l song
4,Hey everyone. Watch this trailer!!!!!!!! http...,hey everyone watch trailer http believemefilm ...
5,check out my rapping hope you guys like it ht...,check rapping hope guys like https soundcloud ...
6,"Subscribe pleaaaase to my instagram account , ...",subscribe pleaaaase instagram account subscrib...
7,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit channel pleaase searching dream
8,Nice! http://www.barnesandnoble.com/s/BDP?csrf...,nice http www barnesandnoble com bdp csrftoken...
9,http://www.twitch.tv/daconnormc﻿,http www twitch tv daconnormc


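The dataframe above shows the cleaned text in the 'clean_content' column. The __.clean()__ function runs the text through Texthero's default preprocessing pipeline. According to the Texthero documentation, this pipeline fills missing values, lowercases the text, and removes blocks of digits, punctuation, diacritics, stopwords, and extra whitespace. Note that it does not remove URLs as such: it only strips their punctuation, which is why fragments like 'http www billboard com' remain in the output.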

## Custom Preprocessing

Sometimes you may want to customize your preprocessing for the task at hand. For instance, you may want to keep blocks of digits in your text. Texthero supports this: each preprocessing step is available as a separate function. Let us explore some of these functions one by one.

### Lowercasing

Here, we want to convert our text to lower case without performing any other preprocessing on it. We will use the lowercase() function as follows:

In [8]:
df['lower'] = hero.lowercase(df['CONTENT'])
df.head(10)

Unnamed: 0,CONTENT,clean_content,lower
0,i love this so much. AND also I Generate Free ...,love much also generate free leads auto pilot ...,i love this so much. and also i generate free ...
1,http://www.billboard.com/articles/columns/pop-...,http www billboard com articles columns pop sh...,http://www.billboard.com/articles/columns/pop-...
2,Hey guys! Please join me in my fight to help a...,hey guys please join fight help abused mistrea...,hey guys! please join me in my fight to help a...
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...,http psnboss com ref 2tggp3pv6l song,http://psnboss.com/?ref=2tggp3pv6l this is the...
4,Hey everyone. Watch this trailer!!!!!!!! http...,hey everyone watch trailer http believemefilm ...,hey everyone. watch this trailer!!!!!!!! http...
5,check out my rapping hope you guys like it ht...,check rapping hope guys like https soundcloud ...,check out my rapping hope you guys like it ht...
6,"Subscribe pleaaaase to my instagram account , ...",subscribe pleaaaase instagram account subscrib...,"subscribe pleaaaase to my instagram account , ..."
7,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit channel pleaase searching dream,hey guys!! visit my channel pleaase (i'm searc...
8,Nice! http://www.barnesandnoble.com/s/BDP?csrf...,nice http www barnesandnoble com bdp csrftoken...,nice! http://www.barnesandnoble.com/s/bdp?csrf...
9,http://www.twitch.tv/daconnormc﻿,http www twitch tv daconnormc,http://www.twitch.tv/daconnormc﻿


In the 'lower' column, the text has been converted to lower case with no other changes. For instance, stopwords and punctuation are still there.
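
For comparison, the same effect can be reproduced with the pandas string accessor. This is a minimal sketch on a made-up two-row Series (the sample strings are illustrative, not taken from the dataset). It is not Texthero's implementation, only an equivalent operation under the assumption that lowercasing leaves everything else untouched.

```python
import pandas as pd

# Small illustrative sample (hypothetical data, not from the dataset)
sample = pd.Series(["Hey guys! Please JOIN me", "Nice! http://Example.com/Page"])

# Series.str.lower() lowercases every string and changes nothing else
lowered = sample.str.lower()
print(lowered.tolist())
```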

### Remove URLs

Let us remove the URLs present in the text using the remove_urls() function as follows:

In [9]:
df['no_urls'] = hero.remove_urls(df['CONTENT'])
df.head(10)

Unnamed: 0,CONTENT,clean_content,lower,no_urls
0,i love this so much. AND also I Generate Free ...,love much also generate free leads auto pilot ...,i love this so much. and also i generate free ...,i love this so much. AND also I Generate Free ...
1,http://www.billboard.com/articles/columns/pop-...,http www billboard com articles columns pop sh...,http://www.billboard.com/articles/columns/pop-...,Vote for SONES please....we're against vips....
2,Hey guys! Please join me in my fight to help a...,hey guys please join fight help abused mistrea...,hey guys! please join me in my fight to help a...,Hey guys! Please join me in my fight to help a...
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...,http psnboss com ref 2tggp3pv6l song,http://psnboss.com/?ref=2tggp3pv6l this is the...,this is the song﻿
4,Hey everyone. Watch this trailer!!!!!!!! http...,hey everyone watch trailer http believemefilm ...,hey everyone. watch this trailer!!!!!!!! http...,Hey everyone. Watch this trailer!!!!!!!!
5,check out my rapping hope you guys like it ht...,check rapping hope guys like https soundcloud ...,check out my rapping hope you guys like it ht...,check out my rapping hope you guys like it ...
6,"Subscribe pleaaaase to my instagram account , ...",subscribe pleaaaase instagram account subscrib...,"subscribe pleaaaase to my instagram account , ...","Subscribe pleaaaase to my instagram account , ..."
7,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit channel pleaase searching dream,hey guys!! visit my channel pleaase (i'm searc...,hey guys!! visit my channel pleaase (i'm searc...
8,Nice! http://www.barnesandnoble.com/s/BDP?csrf...,nice http www barnesandnoble com bdp csrftoken...,nice! http://www.barnesandnoble.com/s/bdp?csrf...,Nice! ﻿
9,http://www.twitch.tv/daconnormc﻿,http www twitch tv daconnormc,http://www.twitch.tv/daconnormc﻿,


Great. We have removed the URLs from the text without making any other changes: case and punctuation are preserved.
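
To make the idea concrete, here is a rough pandas-only approximation of URL removal. The regular expression `http\S+` and the replacement with a single space are assumptions chosen for illustration. Texthero's own pattern may differ, and the sample strings are made up.

```python
import pandas as pd

# Hypothetical sample comments
sample = pd.Series([
    "http://example.com/?ref=abc this is the song",
    "Watch this trailer!!! https://example.org/video",
    "no link here",
])

# Assumed pattern: 'http' followed by any run of non-whitespace characters.
# Each match is replaced with a single space, then the edges are trimmed.
no_urls = sample.str.replace(r"http\S+", " ", regex=True).str.strip()
print(no_urls.tolist())
```

Links that do not start with `http` (for example a bare `www.example.com`) would not be matched by this simple pattern.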

### Remove Punctuation

Next, we will remove punctuation only. Let us use the remove_punctuation() function as follows:

In [10]:
df['no_punctuations'] = hero.remove_punctuation(df['CONTENT'])
df.head(10)

Unnamed: 0,CONTENT,clean_content,lower,no_urls,no_punctuations
0,i love this so much. AND also I Generate Free ...,love much also generate free leads auto pilot ...,i love this so much. and also i generate free ...,i love this so much. AND also I Generate Free ...,i love this so much AND also I Generate Free ...
1,http://www.billboard.com/articles/columns/pop-...,http www billboard com articles columns pop sh...,http://www.billboard.com/articles/columns/pop-...,Vote for SONES please....we're against vips....,http www billboard com articles columns pop sh...
2,Hey guys! Please join me in my fight to help a...,hey guys please join fight help abused mistrea...,hey guys! please join me in my fight to help a...,Hey guys! Please join me in my fight to help a...,Hey guys Please join me in my fight to help a...
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...,http psnboss com ref 2tggp3pv6l song,http://psnboss.com/?ref=2tggp3pv6l this is the...,this is the song﻿,http psnboss com ref 2tGgp3pV6L this is the song﻿
4,Hey everyone. Watch this trailer!!!!!!!! http...,hey everyone watch trailer http believemefilm ...,hey everyone. watch this trailer!!!!!!!! http...,Hey everyone. Watch this trailer!!!!!!!!,Hey everyone Watch this trailer http believ...
5,check out my rapping hope you guys like it ht...,check rapping hope guys like https soundcloud ...,check out my rapping hope you guys like it ht...,check out my rapping hope you guys like it ...,check out my rapping hope you guys like it ht...
6,"Subscribe pleaaaase to my instagram account , ...",subscribe pleaaaase instagram account subscrib...,"subscribe pleaaaase to my instagram account , ...","Subscribe pleaaaase to my instagram account , ...",Subscribe pleaaaase to my instagram account ...
7,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit channel pleaase searching dream,hey guys!! visit my channel pleaase (i'm searc...,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit my channel pleaase i m search...
8,Nice! http://www.barnesandnoble.com/s/BDP?csrf...,nice http www barnesandnoble com bdp csrftoken...,nice! http://www.barnesandnoble.com/s/bdp?csrf...,Nice! ﻿,Nice http www barnesandnoble com s BDP csrfTo...
9,http://www.twitch.tv/daconnormc﻿,http www twitch tv daconnormc,http://www.twitch.tv/daconnormc﻿,,http www twitch tv daconnormc﻿


The 'no_punctuations' column contains the text with punctuation removed and nothing else changed. Note that the URLs are still present, but broken into separate words.
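
A comparable effect can be sketched with the standard library's `string.punctuation` and a pandas regex replace. This is an approximation under the assumption that each punctuation mark is replaced by a space, which matches the output above (for example "i'm" becoming "i m"). It is not Texthero's actual code, and the sample strings are invented.

```python
import re
import string

import pandas as pd

# Hypothetical sample comments
sample = pd.Series(["Hey guys! Please join me.", "i'm searching (for) dreams"])

# Character class matching any one ASCII punctuation character
punct_pattern = f"[{re.escape(string.punctuation)}]"

# Replace each punctuation mark with a space; case and words are untouched
no_punct = sample.str.replace(punct_pattern, " ", regex=True)
print(no_punct.str.split().tolist())
```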

As the examples above show, we can apply Texthero functions individually to perform exactly the steps we want on our text data. In addition, we can build a custom pipeline that chains two or more functions. This saves a lot of time, because a single call runs every step instead of applying them one by one.

In the example below, we will combine the three functions above to get clean text that is in lower case and free of URLs and punctuation.

In [11]:
from texthero import preprocessing

custom_pipeline = [preprocessing.lowercase, 
                   preprocessing.remove_urls, 
                   preprocessing.remove_punctuation]
custom_pipeline

[<function texthero.preprocessing.lowercase(input: pandas.core.series.Series) -> pandas.core.series.Series>,
 <function texthero.preprocessing.remove_urls(s: pandas.core.series.Series) -> pandas.core.series.Series>,
 <function texthero.preprocessing.remove_punctuation(input: pandas.core.series.Series) -> pandas.core.series.Series>]

Now that we have created a custom pipeline, let us apply it to our dataframe to clean the text with all three functions.

In [12]:
df['custom_clean'] = df['CONTENT'].pipe(hero.clean, custom_pipeline)
df.head(10)

Unnamed: 0,CONTENT,clean_content,lower,no_urls,no_punctuations,custom_clean
0,i love this so much. AND also I Generate Free ...,love much also generate free leads auto pilot ...,i love this so much. and also i generate free ...,i love this so much. AND also I Generate Free ...,i love this so much AND also I Generate Free ...,i love this so much and also i generate free ...
1,http://www.billboard.com/articles/columns/pop-...,http www billboard com articles columns pop sh...,http://www.billboard.com/articles/columns/pop-...,Vote for SONES please....we're against vips....,http www billboard com articles columns pop sh...,vote for sones please we re against vips ple...
2,Hey guys! Please join me in my fight to help a...,hey guys please join fight help abused mistrea...,hey guys! please join me in my fight to help a...,Hey guys! Please join me in my fight to help a...,Hey guys Please join me in my fight to help a...,hey guys please join me in my fight to help a...
3,http://psnboss.com/?ref=2tGgp3pV6L this is the...,http psnboss com ref 2tggp3pv6l song,http://psnboss.com/?ref=2tggp3pv6l this is the...,this is the song﻿,http psnboss com ref 2tGgp3pV6L this is the song﻿,this is the song﻿
4,Hey everyone. Watch this trailer!!!!!!!! http...,hey everyone watch trailer http believemefilm ...,hey everyone. watch this trailer!!!!!!!! http...,Hey everyone. Watch this trailer!!!!!!!!,Hey everyone Watch this trailer http believ...,hey everyone watch this trailer
5,check out my rapping hope you guys like it ht...,check rapping hope guys like https soundcloud ...,check out my rapping hope you guys like it ht...,check out my rapping hope you guys like it ...,check out my rapping hope you guys like it ht...,check out my rapping hope you guys like it ...
6,"Subscribe pleaaaase to my instagram account , ...",subscribe pleaaaase instagram account subscrib...,"subscribe pleaaaase to my instagram account , ...","Subscribe pleaaaase to my instagram account , ...",Subscribe pleaaaase to my instagram account ...,subscribe pleaaaase to my instagram account ...
7,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit channel pleaase searching dream,hey guys!! visit my channel pleaase (i'm searc...,hey guys!! visit my channel pleaase (i'm searc...,hey guys visit my channel pleaase i m search...,hey guys visit my channel pleaase i m search...
8,Nice! http://www.barnesandnoble.com/s/BDP?csrf...,nice http www barnesandnoble com bdp csrftoken...,nice! http://www.barnesandnoble.com/s/bdp?csrf...,Nice! ﻿,Nice http www barnesandnoble com s BDP csrfTo...,nice ﻿
9,http://www.twitch.tv/daconnormc﻿,http www twitch tv daconnormc,http://www.twitch.tv/daconnormc﻿,,http www twitch tv daconnormc﻿,


The custom pipeline did what we specified: it converted the text to lower case and removed the URLs and punctuation. This shows how flexible Texthero is for text cleaning.
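
Conceptually, a pipeline is just a list of functions applied in order, each receiving the output of the previous one. The sketch below illustrates that idea with pandas-only stand-ins. The three step functions and the `run_pipeline` helper are hypothetical and are not Texthero's implementation. It also shows why the order of the steps matters: if punctuation is stripped first, the URL is broken into plain words and the URL step no longer recognises it.

```python
import re
import string
from functools import reduce

import pandas as pd


def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()


def remove_urls(s: pd.Series) -> pd.Series:
    # Assumed pattern: 'http' followed by non-whitespace characters
    return s.str.replace(r"http\S+", " ", regex=True)


def remove_punctuation(s: pd.Series) -> pd.Series:
    return s.str.replace(f"[{re.escape(string.punctuation)}]", " ", regex=True)


def run_pipeline(s: pd.Series, pipeline) -> pd.Series:
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda acc, step: step(acc), pipeline, s)


# Hypothetical sample comment
sample = pd.Series(["Watch THIS trailer!!! http://example.com/a?b=1"])

urls_first = run_pipeline(sample, [lowercase, remove_urls, remove_punctuation])
punct_first = run_pipeline(sample, [lowercase, remove_punctuation, remove_urls])

# Collapse repeated spaces so the two results are easy to compare
print(urls_first.str.split().str.join(" ").tolist())
print(punct_first.str.split().str.join(" ").tolist())
```

In this sketch, only the first ordering removes the link, which is why the URL step comes before the punctuation step.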

There are many additional functions that you can explore in texthero to make your text preprocessing easy. Keep learning!