# Text Pre-processing
- Basic
    - Lowercasing
    - Remove HTML tags
    - Remove URLs
    - Remove Punctuation
    - Chat word treatment
    - Spelling Correction
    - Removing stop words
    - Handling Emojis
    - Tokenization
    - Stemming
    - Lemmatization
- Advanced
    - POS tagging
    - Chunking
    - Parsing
    - Co-reference Resolution

### Lowercasing

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df.shape

(50000, 2)

In [3]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
df['review']=df['review'].str.lower()
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

### Remove HTML tags

In [13]:
import re
def remove_html_tags(text):
    # non-greedy match of everything between '<' and the next '>'
    pattern=re.compile(r'<.*?>')
    return pattern.sub(r'',text)
text='''basically there's a family where a little boy (jake)
thinks there's a zombie in his closet & his parents are 
fighting all the time.<br /><br />this movie is slower than 
a soap opera... and suddenly, jake decides to become rambo 
and kill the zombie.<br /><br />ok, first of all when you're 
going to make a film you must decide if its a thriller or a 
drama! as a drama the movie is watchable. parents are divorcing 
& arguing like in real life. and then we have jake with his 
closet which totally ruins all the film! i expected to see a 
boogeyman similar movie, and instead i watched a drama with some 
meaningless thriller spots.<br /><br />3 out of 10 just for the 
well playing parents & descent dialogs. as for the shots with 
jake: just ignore them.'''
remove_html_tags(text)

"basically there's a family where a little boy (jake)\nthinks there's a zombie in his closet & his parents are \nfighting all the time.this movie is slower than \na soap opera... and suddenly, jake decides to become rambo \nand kill the zombie.ok, first of all when you're \ngoing to make a film you must decide if its a thriller or a \ndrama! as a drama the movie is watchable. parents are divorcing \n& arguing like in real life. and then we have jake with his \ncloset which totally ruins all the film! i expected to see a \nboogeyman similar movie, and instead i watched a drama with some \nmeaningless thriller spots.3 out of 10 just for the \nwell playing parents & descent dialogs. as for the shots with \njake: just ignore them."

In [9]:
df['review']=df['review'].apply(remove_html_tags)

In [10]:
df['review'][1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'
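
The output above shows a side effect of substituting an empty string: `time.<br /><br />this` becomes `time.this`, so two words merge. A minimal sketch of one way around this, assuming it is acceptable to replace each tag with a space and then collapse whitespace (the name `strip_html` is illustrative and not part of the original notebook). It also decodes entities such as `&amp;` with the standard-library `html.unescape`:

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')   # any tag such as <br /> or <p>
WS_RE = re.compile(r'\s+')        # runs of whitespace

def strip_html(text):
    # replace each tag with a space so neighbouring words stay separated
    no_tags = TAG_RE.sub(' ', text)
    # decode HTML entities (&amp; -> &, &quot; -> ")
    unescaped = html.unescape(no_tags)
    # collapse the extra spaces introduced above
    return WS_RE.sub(' ', unescaped).strip()

print(strip_html("fighting all the time.<br /><br />this movie is slower &amp; duller"))
# fighting all the time. this movie is slower & duller
```

A regular expression is only a rough tool for HTML; for heavily marked-up text a real HTML parser is usually the safer choice.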

### Remove URLs

In [12]:
def remove_url(text):
    pattern=re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)
text1='for my notebook click https://www.kaggle.com/code/saquib7hussain/nlp-pipeline/notebook'
remove_url(text1)

'for my notebook click '
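
The order of the cleaning steps matters here. The sketch below (function names are illustrative) compares removing URLs before and after punctuation removal: once `://` and `.` have been stripped, the URL pattern no longer matches and the remains of the link stay in the text.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def urls_then_punct(text):
    # remove URLs first, then punctuation
    return URL_RE.sub('', text).translate(PUNCT_TABLE)

def punct_then_urls(text):
    # remove punctuation first; the URL pattern can no longer match
    return URL_RE.sub('', text.translate(PUNCT_TABLE))

sample = 'see https://example.com/page now'
print(repr(urls_then_punct(sample)))   # 'see  now'
print(repr(punct_then_urls(sample)))   # 'see httpsexamplecompage now'
```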

### Remove Punctuation

In [42]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [43]:
exclude=string.punctuation

In [47]:
# method-1
# Replacing one character at a time is slow on a large
# amount of text (rows=50000), so a faster method follows.
def remove_punc(text):
    for char in exclude:
        text=text.replace(char,'')
    return text
text='string. with. Punctuation?'*100000
start=time.time()
remove_punc(text)
time1=time.time() - start
print(time1)

0.06323933601379395


In [48]:
# method-2
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))
start=time.time()
remove_punc1(text)
time2=time.time() - start
print(time2)

0.0058977603912353516


In [50]:
# method-2 is roughly 10 times faster than method-1
time1/time2

10.722601770626996
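
A single `time.time()` measurement can be noisy. As a hedged alternative, the sketch below first checks that both methods return the same result and then times them with the standard-library `timeit` module, taking the best of several runs. The function names and the smaller input size are choices made for this example, and the measured numbers will differ from machine to machine.

```python
import string
import timeit

exclude = string.punctuation
table = str.maketrans('', '', exclude)   # build the translation table once

def remove_punc_loop(text):
    # method-1: one str.replace call per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text

def remove_punc_translate(text):
    # method-2: a single pass using the translation table
    return text.translate(table)

text = 'string. with. Punctuation?' * 1000

# both methods must agree before their speed is compared
assert remove_punc_loop(text) == remove_punc_translate(text)

loop_time = min(timeit.repeat(lambda: remove_punc_loop(text), number=20, repeat=3))
translate_time = min(timeit.repeat(lambda: remove_punc_translate(text), number=20, repeat=3))
print(f'loop: {loop_time:.4f}s  translate: {translate_time:.4f}s')
```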

In [51]:
df['review']=df['review'].apply(remove_punc1)

In [52]:
df['review'][1]

'A wonderful little production The filming technique is very unassuming very oldtimeBBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great masters of comedy and his life The realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwells murals decorating every surface are terribly well done'

### Chat word treatment

In [2]:
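# minimal illustrative mapping (assumed); the full chat_words dictionary is not shown here
chat_words={'IMHO':'In my humble opinion','FYI':'For your information'}
def chat_conversion(text):
    new_text=[]
    for w in text.split():
        if w.upper() in chat_words:
            # replace the abbreviation with its expansion
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)
chat_conversion("IMHO he is the best")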

# output: In my humble opinion he is the best
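
Splitting on whitespace misses abbreviations that carry punctuation, such as `imho.` or `FYI,`. A minimal sketch of a regex-based variant that matches whole words regardless of case and leaves the surrounding punctuation in place. The small `chat_words` dictionary and the name `expand_chat_words` are hypothetical stand-ins for this example:

```python
import re

# hypothetical mini dictionary; a real one would be much larger
chat_words = {
    'IMHO': 'in my humble opinion',
    'FYI': 'for your information',
    'BRB': 'be right back',
}

# one alternation of all abbreviations, matched only at word boundaries
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, chat_words)) + r')\b',
    re.IGNORECASE,
)

def expand_chat_words(text):
    # look up the upper-cased match so 'imho' and 'IMHO' both expand
    return pattern.sub(lambda m: chat_words[m.group(0).upper()], text)

print(expand_chat_words('FYI, he is the best imho.'))
# for your information, he is the best in my humble opinion.
```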

### Spelling Correction

In [4]:
from textblob import TextBlob
incorrect_text='certain conditionas duriing seveal ggenerations arre moodified in the saame namer'
textBlb=TextBlob(incorrect_text)
textBlb.correct().string

'certain conditions during several generations are modified in the same name'

### Removing stop words

In [None]:
from nltk.corpus import stopwords
# needs the corpus once: nltk.download('stopwords')
# build the set a single time; set lookup is much faster than
# calling stopwords.words('english') for every word
stop_words=set(stopwords.words('english'))
def remove_stopwords(text):
    new_text=[]
    for word in text.split():
        # keep only the words that are not stop words
        if word not in stop_words:
            new_text.append(word)
    return " ".join(new_text)
remove_stopwords('certain conditions during several generations are modified in the same name')

In [None]:
# apply to the whole dataset
df['review'].apply(remove_stopwords)
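
For sentiment data such as these reviews, removing every stop word can be harmful: NLTK's English list contains negations like `not` and `no`, and dropping them turns "not good" into "good". The sketch below keeps a small set of negations. To stay dependency-free it uses a short hand-written set in place of `stopwords.words('english')`; that set, the `KEEP` set, and the function name are assumptions for this example.

```python
# small hand-written stand-in for stopwords.words('english')
STOP_WORDS = {'the', 'is', 'a', 'an', 'this', 'was', 'not', 'no', 'and', 'of'}

# negations that should survive stop-word removal
KEEP = {'not', 'no'}

def remove_stopwords_keep_negation(text, stop_words=STOP_WORDS, keep=KEEP):
    # drop stop words, except the ones explicitly protected in `keep`
    effective = stop_words - keep
    return ' '.join(w for w in text.split() if w.lower() not in effective)

print(remove_stopwords_keep_negation('this movie was not good'))
# movie not good
```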

### Handling Emojis

In [13]:
!pip install clean-text


In [14]:
# Two ways to handle emojis:
# 1- remove them
# 2- replace them with their meaning

# 1- remove them
# import the clean function
from cleantext import clean

#provide string with emojis
text = "This sample text contains laughing emojis 😀 😃 😄 😁 😆 😅 😂 🤣"

#print text after removing the emojis from it
print(clean(text, no_emoji=True))

this sample text contains laughing emojis


In [15]:
# 2- replace with its meaning
import emoji
print(emoji.demojize('Python is 🤣'))

Python is :rolling_on_the_floor_laughing:
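
If installing an extra package is not an option, a rough approximation of the second approach is possible with the standard-library `unicodedata` module. This is only a sketch under stated assumptions: it treats every character in the Unicode category `So` (Symbol, other) as an emoji, which also catches symbols such as `©`, and it does not handle multi-code-point sequences (skin tones, joined emojis). The function name is illustrative.

```python
import unicodedata

def demojize_stdlib(text):
    out = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            # use the official Unicode character name as the "meaning"
            name = unicodedata.name(ch, '')
            out.append(':' + name.lower().replace(' ', '_') + ':' if name else '')
        else:
            out.append(ch)
    return ''.join(out)

print(demojize_stdlib('Python is \U0001F600'))
# Python is :grinning_face:
```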


### Tokenization

Problems with tokenization
- Prefix : $("
- Suffix : km),.!"
- Infix : - -- / _
- Exception : let's U.S.

#### 1. Using the split function

In [17]:
# word tokenization
sent1='I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [18]:
# sentence tokenization
sent2='I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to the great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to the great"]

In [19]:
# problem with split function
sent3='I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [20]:
# problem with split function
sent4='where do think I should go? I have 3 days Holiday'
sent4.split('.')

['where do think I should go? I have 3 days Holiday']
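
One step up from `split('.')` is to split on whitespace that follows any sentence-ending mark. A minimal sketch using a regex lookbehind (the function name is illustrative). It keeps the punctuation attached to each sentence, but it will still split wrongly after abbreviations such as `U.S.`:

```python
import re

def split_sentences(text):
    # split on whitespace that comes right after '.', '!' or '?'
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(split_sentences('where do think I should go? I have 3 days Holiday'))
# ['where do think I should go?', 'I have 3 days Holiday']
```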

#### 2. Regular Expression

In [21]:
import re
sent3='I am going to delhi!'
tokens=re.findall(r"[\w']+",sent3)
tokens

['I', 'am', 'going', 'to', 'delhi']
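
The pattern above silently drops punctuation and would break `$10.50` or an e-mail address into pieces. As a sketch of how far a hand-written pattern can go, the tokenizer below tries several alternatives in order: e-mail addresses, numbers with an optional decimal part, words with an optional apostrophe, and finally any single punctuation character. The pattern and the name `regex_tokenize` are assumptions for this example, not a complete tokenizer.

```python
import re

TOKEN_RE = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.-]+"   # e-mail addresses
    r"|\d+(?:\.\d+)?"             # integers and decimals such as 10.50
    r"|\w+(?:'\w+)?"              # words, optionally with an apostrophe (we're)
    r"|[^\w\s]"                   # any single punctuation character
)

def regex_tokenize(text):
    return TOKEN_RE.findall(text)

print(regex_tokenize('I am going to delhi!'))
# ['I', 'am', 'going', 'to', 'delhi', '!']
print(regex_tokenize('A 5km ride cost $10.50'))
# ['A', '5', 'km', 'ride', 'cost', '$', '10.50']
print(regex_tokenize("we're here to help! mail us at nks@gmail.com"))
# ["we're", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks@gmail.com']
```

Every new special case needs another alternative in the pattern, which is the main reason to prefer a library tokenizer.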

#### 3. NLTK

In [22]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [23]:
sent1='I am going to delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'delhi', '!']

In [26]:
sent5='I have Ph.D in A.I'
sent6="we're here to help! mail us at nks@gmail.com"
sent7='A 5km ride cost $10.50'
print(word_tokenize(sent5))
print(word_tokenize(sent6))
print(word_tokenize(sent7))

['I', 'have', 'Ph.D', 'in', 'A.I']
['we', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.50']


#### 4. Spacy

In [28]:
!pip install spacy


In [8]:
!python -m spacy download en_core_web_sm


In [10]:
import spacy

from spacy.lang.en import English
nlp=English()
# Text to be tokenized
text = "This is an example sentence."

# Process the text with spaCy
doc = nlp(text)

# Iterate through the tokens and print them
for token in doc:
    print(token.text)


This
is
an
example
sentence
.


In [12]:
import spacy

from spacy.lang.en import English
nlp=English()
# Text to be tokenized
text='I have Ph.D in A.I'

# Process the text with spaCy
doc = nlp(text)

# Iterate through the tokens and print them
for token in doc:
    print(token.text)

I
have
Ph
.
D
in
A.I


In [13]:
import spacy

from spacy.lang.en import English
nlp=English()
# Text to be tokenized
text="we're here to help! mail us at nks@gmail.com"

# Process the text with spaCy
doc = nlp(text)

# Iterate through the tokens and print them
for token in doc:
    print(token.text)

we
're
here
to
help
!
mail
us
at
nks@gmail.com


In [14]:
import spacy

from spacy.lang.en import English
nlp=English()
# Text to be tokenized
text='A 5km ride cost $10.50'
# Process the text with spaCy
doc = nlp(text)

# Iterate through the tokens and print them
for token in doc:
    print(token.text)

A
5
km
ride
cost
$
10.50


### Stemming

In [None]:
'''In grammar, inflection is the modification of a word to
express different grammatical categories such as tense,
case, voice, aspect, person, number, gender, and mood.'''

'''Stemming is the process of reducing inflected words
to their root form, mapping a group of words to the same
stem even if the stem itself is not a valid word in the
language. It is used in information retrieval.'''

# Both stemmers below are available through NLTK:
# the Porter stemmer (for English) and the Snowball stemmer (English and several other languages)

In [17]:
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [18]:
sample="walk walks walking walked"
stem_words(sample)

'walk walk walk walk'
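
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, only a sketch of the principle: chop off a known ending when enough of the word remains. The suffix list, the minimum stem length, and the function names are assumptions for this example. The second call shows why a stem need not be a valid word.

```python
# checked in order, so longer endings come before 's'
SUFFIXES = ('ing', 'ed', 'es', 's')

def naive_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        # strip the suffix only if at least `min_stem` characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

def naive_stem_words(text):
    return ' '.join(naive_stem(w) for w in text.split())

print(naive_stem_words('walk walks walking walked'))
# walk walk walk walk
print(naive_stem_words('movies loved'))
# movi lov
```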

### Lemmatization

In [19]:
# Stemming is fast; lemmatization is slower

In [None]:
'''Lemmatization, unlike stemming, reduces inflected words
properly, ensuring that the root word belongs to the language.
In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata)
is the canonical form, dictionary form, or citation form of a set of words.'''

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()

# single word lemmatization examples
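list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling',
         'driving', 'died', 'tried', 'feet']
# lemmatize() treats each word as a noun by default (pos='n'),
# which is why the verb forms below come back unchanged
for word in list1:
    print(word + " ---> " + wnl.lemmatize(word))
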
#> kites ---> kite
#> babies ---> baby
#> dogs ---> dog
#> flying ---> flying
#> smiling ---> smiling
#> driving ---> driving
#> died ---> died
#> tried ---> tried
#> feet ---> foot
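
In the output above, `flying`, `died` and `tried` are unchanged because `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to `'n'` (noun), so verb forms are looked up as nouns. The toy lookup table below is a dependency-free sketch of why the part of speech matters. It is not how WordNet works internally, and the table entries and function name are made up for this example.

```python
# toy lexicon keyed by (word, part of speech); a real lemmatizer uses a full dictionary
LEMMA_TABLE = {
    ('feet', 'n'): 'foot',
    ('babies', 'n'): 'baby',
    ('flying', 'v'): 'fly',
    ('died', 'v'): 'die',
    ('tried', 'v'): 'try',
}

def toy_lemmatize(word, pos='n'):
    # fall back to the word itself when no entry exists for this part of speech
    return LEMMA_TABLE.get((word, pos), word)

for w in ['feet', 'flying', 'died']:
    print(w, '->', toy_lemmatize(w), '|', toy_lemmatize(w, pos='v'))
# feet -> foot | feet
# flying -> flying | fly
# died -> died | die
```

With NLTK itself, the corresponding call would be `wnl.lemmatize('flying', pos='v')`.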