# <center>Text Preprocessing</center>


Here, we'll break down the steps involved in getting text data ready for analysis. <br>
Think of it as cleaning and organizing text so that it's easier to understand and work with. <br>
This process helps us get valuable insights when we're dealing with large amounts of text information.


## Topics To Be Covered:

### Basics:

- Lowercasing
- HTML Tag Removal
- URL Removal
- Punctuation Removal
- Chat Word Treatment
- Spelling Correction
- Stop Word Removal
- Handling Emojis
- Tokenization
- Stemming
- Lemmatization

### Advanced:

- Part-of-Speech Tagging
- Chunking
- Parsing
- Co-Reference Resolution


In [1]:
import pandas as pd

In [2]:
IMDB_df = pd.read_csv('Datasets/IMDB Dataset.csv')

In [3]:
IMDB_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Lowercasing


In [4]:
print(IMDB_df['review'][3])

Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.


In [5]:
IMDB_df['review'] = IMDB_df['review'].str.lower()

In [6]:
IMDB_df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

### Removing HTML Tags


In [7]:
import re

In [8]:
print(IMDB_df['review'][1])

a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done.


In [9]:
# Remove HTML tags from the text

def remove_HTML_Tag(text):
    # Non-greedy match, so each <...> tag is removed on its own
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [10]:
text = remove_HTML_Tag(IMDB_df['review'][1])

print(text)

a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done.


In [11]:
IMDB_df['review'] = IMDB_df['review'].apply(remove_HTML_Tag)

IMDB_df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object
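One thing to watch for: when a tag sits directly between two sentences, as in review 3 (`all the time.<br /><br />This movie`), deleting the tag glues the sentences together. A minimal sketch of one way around this is to replace each tag with a space and then collapse the extra whitespace. The name `remove_html_tags_spaced` is made up for this illustration and is not used elsewhere in the notebook.

```python
import re

TAG_PATTERN = re.compile(r'<[^>]+>')      # anything that looks like <...>
WHITESPACE_PATTERN = re.compile(r'\s+')   # runs of spaces, tabs or newlines

def remove_html_tags_spaced(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return WHITESPACE_PATTERN.sub(' ', without_tags).strip()

sample = "fighting all the time.<br /><br />this movie is slower than a soap opera..."
print(remove_html_tags_spaced(sample))
# → fighting all the time. this movie is slower than a soap opera...
```

A regular expression is good enough for simple `<br />` tags like these; for messy real-world HTML a proper parser is usually the safer choice.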

### Removing URLs


In [12]:
text_url_1 = "If you're interested in learning more about web development, check out the https://developer.mozilla.org/en-US/. They have comprehensive tutorials and guides for HTML, CSS, and JavaScript."
text_url_2 = "For those keen on exploring data science, https://www.kaggle.com/ is a great platform to practice coding, find datasets, and participate in competitions."
text_url_3 = "If you want to stay updated on the latest tech trends, http://techcrunch.com/ is an excellent resource for articles and news."
text_url_4 = "Finally, if you're looking to enhance your coding skills, I highly recommend https://leetcode.com/ for practicing coding problems."
text_url_5 = "All this can be searched on chrome browser with the link www.google.com."

In [13]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [14]:
print(remove_url(text_url_1))
print('\n\n')

print(remove_url(text_url_2))
print('\n\n')

print(remove_url(text_url_3))
print('\n\n')

print(remove_url(text_url_4))
print('\n\n')

print(remove_url(text_url_5))

If you're interested in learning more about web development, check out the  They have comprehensive tutorials and guides for HTML, CSS, and JavaScript.



For those keen on exploring data science,  is a great platform to practice coding, find datasets, and participate in competitions.



If you want to stay updated on the latest tech trends,  is an excellent resource for articles and news.



Finally, if you're looking to enhance your coding skills, I highly recommend  for practicing coding problems.



All this can be searched on chrome browser with the link 
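Notice that `\S+` also swallows punctuation that follows a URL: the full stop after `www.google.com` and after `/en-US/` disappears along with the link. If we want to keep sentence punctuation, one possible variation (a sketch, with the hypothetical name `remove_url_keep_punctuation`) is to stop the match before trailing punctuation and then tidy up the leftover spaces.

```python
import re

# Match http(s):// or www. links, but force the last matched character
# to be something other than whitespace or common sentence punctuation.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S*[^\s.,;:!?)\]]')

def remove_url_keep_punctuation(text):
    text = URL_PATTERN.sub('', text)
    # Remove the space that is left in front of punctuation once the URL is gone
    text = re.sub(r'\s+([.,;:!?])', r'\1', text)
    # Collapse any remaining double spaces
    return re.sub(r'\s{2,}', ' ', text).strip()

print(remove_url_keep_punctuation(
    "All this can be searched on chrome browser with the link www.google.com."))
# → All this can be searched on chrome browser with the link.
```

The clean-up step also touches spaces before punctuation that were already in the text, so it is worth checking on a sample of the real data before applying it everywhere.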


### Removing Punctuation


In [15]:
import string
import time

In [16]:
# All the ASCII punctuation characters listed in string.punctuation
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [17]:
exclude = string.punctuation

In [18]:
text_punctuation1 = '''"Wow! He said, 'It costs $5 & it's 50/'%'/ off, so I'll take it (even though 5 < 10 = true)!'"'''
text_punctuation2 = "In Python, you can use symbols like #, $, %, &, *, +, -, /, :, ;, <, =, >, ?, @, [, , ], ^, _, `, {, |, }, ~ in various contexts.!"

In [19]:
def remove_punctuation_normal(text):
    for char in exclude:
        text = text.replace(char, '')
    return text

In [20]:
start = time.time()

# Removing punctuation using the normal method
print(remove_punctuation_normal(text_punctuation1))
print()
print(remove_punctuation_normal(text_punctuation2))

time1 = time.time() - start

# Rough extrapolation: treat the text above as one line and scale the time to a 50,000-line document
print("Total Time Taken (Normal Method):", time1 * 50000)

Wow He said It costs 5  its 50 off so Ill take it even though 5  10  true

In Python you can use symbols like                          in various contexts
Total Time Taken (Normal Method): 63.169002532958984


In [21]:
def remove_punctuation_advance(text):
    return text.translate(str.maketrans('', '', exclude))

In [22]:
start = time.time()

# Removing punctuation using the advanced method
print(remove_punctuation_advance(text_punctuation1))
print()
print(remove_punctuation_advance(text_punctuation2))

time2 = time.time() - start

# Rough extrapolation: treat the text above as one line and scale the time to a 50,000-line document
print("Total Time Taken (Advanced Method):", time2 * 50000)

Wow He said It costs 5  its 50 off so Ill take it even though 5  10  true

In Python you can use symbols like                          in various contexts
Total Time Taken (Advanced Method): 50.008296966552734
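Both timings above come from a single run and include the time spent in `print`, so the numbers are noisy. A more repeatable comparison, sketched here with the standard `timeit` module, calls each function many times on the same string. The names `strip_with_replace` and `strip_with_translate` are stand-ins for the two functions above, and the sample sentence is made up for the test.

```python
import string
import timeit

# Build the translation table once instead of on every call
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_with_replace(text):
    # One str.replace call per punctuation character
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

def strip_with_translate(text):
    # A single pass over the text using the prebuilt table
    return text.translate(PUNCT_TABLE)

sample = "Wow! He said, 'It costs $5 & it's 50% off' (even though 5 < 10)."

# Time 50,000 calls of each function, with no printing inside the timed code
replace_seconds = timeit.timeit(lambda: strip_with_replace(sample), number=50_000)
translate_seconds = timeit.timeit(lambda: strip_with_translate(sample), number=50_000)

print(f"replace loop: {replace_seconds:.3f} s for 50,000 calls")
print(f"translate:    {translate_seconds:.3f} s for 50,000 calls")
```

The exact figures depend on the machine, so no expected output is shown here.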


In [23]:
Twitter_df = pd.read_csv(
    'Datasets/Twitter Hate Speech and Offensive Language Dataset.csv')

In [24]:
Twitter_df['tweet']

0        !!! RT @mayasolovely: As a woman you shouldn't...
1        !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2        !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3        !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4        !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
                               ...                        
24778    you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779    you've gone and broke the wrong heart baby, an...
24780    young buck wanna eat!!.. dat nigguh like I ain...
24781                youu got wild bitches tellin you lies
24782    ~~Ruffled | Ntac Eileen Dahlia - Beautiful col...
Name: tweet, Length: 24783, dtype: object

In [25]:
Twitter_df['tweet'] = Twitter_df['tweet'].apply(remove_punctuation_advance)
Twitter_df['tweet']

0         RT mayasolovely As a woman you shouldnt compl...
1         RT mleew17 boy dats coldtyga dwn bad for cuff...
2         RT UrKindOfBrand Dawg RT 80sbaby4life You eve...
3           RT CGAnderson vivabased she look like a tranny
4         RT ShenikaRoberts The shit you hear about me ...
                               ...                        
24778    yous a muthafin lie 8220LifeAsKing 20Pearls co...
24779    youve gone and broke the wrong heart baby and ...
24780    young buck wanna eat dat nigguh like I aint fu...
24781                youu got wild bitches tellin you lies
24782    Ruffled  Ntac Eileen Dahlia  Beautiful color c...
Name: tweet, Length: 24783, dtype: object
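Look at row 24778: the HTML entity `&#8220;` lost its `&`, `#` and `;` and left a stray `8220` in the text. One way to avoid this, sketched below, is to decode HTML entities with the standard `html.unescape` before stripping punctuation. Because the decoded character is a curly quote that is not part of `string.punctuation`, this sketch also drops every character whose Unicode category is punctuation. `clean_tweet` is a hypothetical helper name and the sample string is invented.

```python
import html
import string
import unicodedata

def clean_tweet(text):
    # '&#8220;' becomes a left curly quote, '&amp;' becomes '&'
    decoded = html.unescape(text)
    # Drop ASCII punctuation and any other Unicode punctuation (category P*)
    return ''.join(
        ch for ch in decoded
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith('P')
    )

print(clean_tweet("lie &#8220;@LifeAsKing: true &amp; fair"))
# → lie LifeAsKing true  fair
```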

### Chat Word Treatment


In [26]:
chat_words = {
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What It Is Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek You (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LMAO': 'Laugh My A.. Off',
    'LOL': 'Laughing Out Loud',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA?': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laugher',
    'TFW': 'That Feeling When',
    'MFW': 'My Face When',
    'MRW': 'My Reaction When',
    'IFYP': 'I Feel Your Pain',
    'TNTL': 'Trying Not To Laugh',
    'JK': 'Just Kidding',
    'IDC': 'I Don’t Care',
    'ILY': 'I Love You',
    'IMU': 'I Miss You',
    'ADIH': 'Another Day In Hell',
    'ZZZ': 'Sleeping, Bored, Tired',
    'WYWH': 'Wish You Were Here',
    'TIME': 'Tears In My Eyes',
    'BAE': 'Before Anyone Else',
    'FIMH': 'Forever In My Heart',
    'BSAAW': 'Big Smile And A Wink',
    'BWL': 'Bursting With Laughter',
    'BFF': 'Best Friends Forever',
    'CSL': 'Can’t Stop Laughing'
}

In [27]:
chat_text = "LOL, that was so funny! BRB, need to grab something. BTW, TTYL!"

In [28]:
def chat_conversation(text):
    new_text = []

    for w in text.split():
        # Strip surrounding punctuation so the bare word can be looked up
        word = w.strip(string.punctuation)
        key = word.upper()

        if word and key in chat_words:
            # Keep any leading and trailing punctuation around the expanded form
            start = w.find(word)
            new_text.append(w[:start] + chat_words[key] + w[start + len(word):])
        else:
            new_text.append(w)

    return ' '.join(new_text)

In [29]:
print(chat_conversation(chat_text))

Laughing Out Loud, that was so funny! Be Right Back, need to grab something. By The Way, Talk To You Later!


### Spelling Correction


In [30]:
from textblob import TextBlob

In [31]:
incorrect_spellings = "The qwik bron fox jumpt ovr the lazi dogg, wile teh brids singging in teh tress."

In [32]:
textBlb = TextBlob(incorrect_spellings)

textBlb.correct().string

'The quick iron fox jump or the lazy dog, will the birds singing in the dress.'

### Removing Stop Words


In [33]:
from nltk.corpus import stopwords

In [34]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [35]:
stop_words_sentence = "I just wanted to tell you that if we go to the store, we can get some of the things that we need, but only if you think it’s a good idea, and then maybe after that, we can go to the park where we can sit and relax for a while, because I think it would be nice to spend some time together just talking about what’s been going on lately."

In [36]:
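# Build the stop word set once; set lookups are much faster than scanning a list
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    new_text = []

    for word in text.split():
        # A stop word is replaced by an empty string, which leaves extra spaces behind
        if word in stop_words:
            new_text.append('')
        else:
            new_text.append(word)
    return ' '.join(new_text)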

In [37]:
print(remove_stop_words(stop_words_sentence))

I  wanted  tell     go   store,   get    things   need,     think it’s  good idea,   maybe  that,   go   park    sit  relax   while,  I think  would  nice  spend  time together  talking  what’s  going  lately.
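The output above still has a few rough edges: the removed words leave runs of spaces, the capital `I` survives because the stop word list is lowercase, and words with attached punctuation such as `that,` are not matched. A minimal sketch of a stricter variant follows. To keep it runnable without downloading the NLTK corpus it uses a small hand-written `STOP_WORDS` set, which is only an illustration; in the notebook we would pass `set(stopwords.words('english'))` instead.

```python
import string

# Illustrative subset only (an assumption for this sketch), not the NLTK list
STOP_WORDS = {'i', 'just', 'to', 'you', 'that', 'if', 'we', 'the',
              'can', 'some', 'of', 'but', 'only', 'a', 'and'}

def remove_stop_words_clean(text, stop_words=STOP_WORDS):
    kept = []
    for word in text.split():
        # Compare a lowercased, punctuation-stripped copy of the word
        key = word.strip(string.punctuation).lower()
        if key not in stop_words:
            kept.append(word)   # keep the original spelling and punctuation
    return ' '.join(kept)

print(remove_stop_words_clean("I just wanted to tell you that, if we go, we can relax."))
# → wanted tell go, relax.
```

Dropping a stop word also drops the punctuation attached to it, so this works best after the punctuation removal step.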


### Handling Emojis


There are two ways to handle emojis in our data:

1. Remove them from the data, or
2. Replace them with their textual meaning


#### Removing Emojis From Data

In [61]:
import re

In [72]:
text_emojis1 = "Hello 😊 World! 🌍"       # Regular emojis
text_emojis2 = "No emojis here!"     # No emojis
text_emojis3 = "Emojis like 🚀 and 🌟 are fun!"     # Multiple emojis
text_emojis4 = "Complex 😜 string with emojis 😁 and flags 🇺🇸"     # Mixed emojis and flags
text_emojis5 = ""       # Empty string
text_emojis6 = "🤖🤖🤖"     # Only emojis

In [73]:
def remove_emojis(text):
    # Define a pattern for emojis using Unicode ranges
    emojis_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"    # Emoticons
        "\U0001F300-\U0001F5FF"    # Miscellaneous Symbols and Pictographs
        "\U0001F680-\U0001F6FF"    # Transport and Map Symbols
        "\U0001F900-\U0001F9FF"    # Supplemental Symbols and Pictographs (e.g. 🤖)
        "\U0001F1E0-\U0001F1FF"    # Regional indicators (flags)
        "\U00002600-\U000027BF"    # Miscellaneous Symbols and Dingbats
        "]",
        flags=re.UNICODE
    )
    return emojis_pattern.sub(r'', text)

In [75]:
print(remove_emojis(text_emojis1))
print(remove_emojis(text_emojis2))
print(remove_emojis(text_emojis3))
print(remove_emojis(text_emojis4))
print(remove_emojis(text_emojis5))
print(remove_emojis(text_emojis6))

Hello  World! 
No emojis here!
Emojis like  and  are fun!
Complex  string with emojis  and flags 




#### Replacing Emojis with their meaning

In [3]:
import emoji

In [5]:
print(emoji.demojize('Python is 🔥'))
print(emoji.demojize('😘 the ☕ that 🫵 made yesterday'))

Python is :fire:
:face_blowing_a_kiss: the :hot_beverage: that :index_pointing_at_the_viewer: made yesterday
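If installing the `emoji` package is not an option, a rough stand-in can be built from the standard library. The sketch below looks up the official Unicode name of every character in the "Symbol, other" category with `unicodedata.name` and writes it in the same `:name:` style. This is a different technique from `emoji.demojize`: Unicode names are not always the same as the package's aliases, and multi-character emojis such as flags are handled one code point at a time. `describe_symbols` is a made-up name for this illustration.

```python
import unicodedata

def describe_symbols(text):
    """Replace 'Symbol, other' characters (most emojis) with :unicode_name: labels."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            name = unicodedata.name(ch, None)   # None if the character has no name
            parts.append(f":{name.lower().replace(' ', '_')}:" if name else ch)
        else:
            parts.append(ch)
    return ''.join(parts)

print(describe_symbols('Python is 🔥'))
# → Python is :fire:
```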


### Tokenization