<a href="https://colab.research.google.com/github/faisu6339-glitch/Natural-Language-Processing-NLP-/blob/main/Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Normalizing Textual Data

#Textual data:
Textual data is systematically collected material consisting of written, printed, or electronically published words, typically either purposefully written or transcribed from speech.

#Text normalization:
Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since the input is guaranteed to be consistent before operations are performed on it. Text normalization requires knowing what kind of text is to be normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.

## Steps for Text Normalization

Text normalization involves several steps to transform raw text into a consistent and standardized format suitable for processing. Here are the common steps:

1.  **Lowercasing**: Converting all text to lowercase to treat words like "The" and "the" as the same.

2.  **Removing Punctuation**: Eliminating punctuation marks (e.g., periods, commas, question marks) that might not be relevant for analysis.

3.  **Tokenization**: Breaking down the text into smaller units, usually words or subwords. This is a fundamental step for most NLP tasks.

4.  **Removing Stop Words**: Filtering out common words (e.g., "a", "an", "the", "is", "are") that carry little meaning and can clutter analysis.

5.  **Stemming/Lemmatization**: Reducing words to their base or root form.
    *   **Stemming** is a more aggressive process that often chops off the end of words (e.g., "running" -> "run", "jumps" -> "jump"), which might not result in a valid word (e.g., the Porter stemmer maps "studies" to "studi").
    *   **Lemmatization** is a more sophisticated process that uses vocabulary and morphological analysis to return the base or dictionary form of a word (e.g., "better" -> "good", "ran" -> "run").

6.  **Handling Numbers**: Deciding how to treat numerical data. This might involve removing them, converting them to a standard format, or representing them symbolically.

7.  **Removing Extra Whitespace**: Eliminating multiple spaces, tabs, and newlines to ensure consistent spacing.

8.  **Handling Special Characters and Emojis**: Deciding whether to remove, convert, or preserve special characters and emojis based on the specific task.

9.  **Expanding Abbreviations and Acronyms**: Replacing common abbreviations or acronyms with their full forms (e.g., "Dr." -> "Doctor", "ASAP" -> "As Soon As Possible").

10. **Spell Correction**: Correcting misspelled words, though this can be a complex step and depends on the specific use case.

_The choice and order of these steps can vary greatly depending on the specific application and the nature of the textual data._
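
As a rough illustration of how several of these steps compose, the sketch below chains lowercasing, number removal, punctuation removal, and whitespace cleanup using only the standard library. The function name `normalize` and the order of the steps are assumptions made for illustration, not a canonical recipe.

```python
import re

def normalize(text):
    """Apply a few common normalization steps in one pass (illustrative order)."""
    text = text.lower()                        # 1. lowercasing
    text = re.sub(r"\d+", " ", text)           # 6. handling numbers (here: drop them)
    text = re.sub(r"[^\w\s]", " ", text)       # 2. punctuation -> space, so words do not merge
    text = re.sub(r"\s+", " ", text).strip()   # 7. collapse and trim whitespace
    return text

print(normalize("  Python 3.0, released in 2008!  "))  # python released in
```

Replacing punctuation with a space rather than an empty string keeps hyphenated words such as "end-of-life" from collapsing into a single token; whether that is desirable depends on the task.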

#Text String

In [None]:
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)

       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows).


#Case Conversion (Lower Case)



In [None]:
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()
print(lower_string)

       python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows).


#Removing Numbers



In [None]:
# import regex
import re

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)

       python ., released in , was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows  (and old installers not restricted to -bit windows).
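
Note that deleting digits leaves stray fragments such as `python .` and `..x[]` in the output above. One alternative, sketched here under the assumption that a placeholder token is acceptable for the downstream task, is to replace each number (including dotted versions) with a symbol. The name `mask_numbers` and the `<num>` placeholder are hypothetical choices.

```python
import re

# Matches integers and dotted numbers such as "2008", "3.0" or "3.6".
NUMBER = re.compile(r"\d+(?:\.\d+)*")

def mask_numbers(text, placeholder="<num>"):
    """Replace every number with a placeholder instead of deleting it."""
    return NUMBER.sub(placeholder, text)

print(mask_numbers("python 3.0, released in 2008, supports 3.6.x"))
# python <num>, released in <num>, supports <num>.x
```

If punctuation is stripped in a later step, the angle brackets would be removed as well, so a purely alphabetic placeholder may be the safer choice in that pipeline.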


#Removing Punctuation



In [None]:
# import regex
import re

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)

       python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows


#Removing Whitespace



In [None]:
# import regex
import re

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# strip leading and trailing whitespace (runs of spaces inside the text remain)
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)

python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows
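
The output above still contains doubled spaces where numbers were removed, because `strip()` only trims the two ends of the string. A minimal sketch for collapsing internal whitespace as well (the helper name `collapse_whitespace` is an assumption):

```python
def collapse_whitespace(text):
    """Trim both ends and reduce every run of spaces, tabs, or newlines to one space."""
    # str.split() with no argument splits on any whitespace run and drops empty pieces.
    return " ".join(text.split())

print(collapse_whitespace("  python  released in  was\ta major\nrevision "))
# python released in was a major revision
```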


#Removing Stop Words



In [None]:
# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'

# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
print(no_stpwords_string)

{'between', 'no', 'their', 'so', 'if', 'both', 'was', 'am', 'him', 'shan', "they're", 'has', 'we', "they'll", 'by', 'more', 'any', "mustn't", 'through', 'a', 'be', 'each', 'yours', 'isn', 's', 'again', 'having', 'that', "aren't", 'in', 'needn', "should've", 'weren', 'few', 'or', "she'll", 'these', 'only', "that'll", "needn't", 'from', "didn't", 've', 'can', 're', 'o', 'off', "it's", 'under', 'yourself', "hadn't", 'they', 'into', "i've", 'how', 'than', 'now', 'which', 'because', 'i', "couldn't", "you'll", 'out', 'me', 'this', 'did', "you've", "he's", 'up', "isn't", "you'd", 'all', 'but', 'he', "haven't", 'about', 'should', 'hasn', 'same', 'before', 'it', 'mustn', 'itself', "he'd", "she'd", 'just', 'then', 'other', 'll', 'down', 'ain', 'some', 'where', 'further', 'ours', "we're", 'against', 'had', "mightn't", 'once', 'themselves', 'wouldn', 'the', "it'll", 'do', 'such', 't', 'what', 'being', 'didn', 'm', 'will', 'your', 'when', 'below', 'own', "we'd", 'not', 'its', 'hadn', 'my', 'and', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# import regex
import re

# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# strip leading and trailing whitespace
no_wspace_string = no_punc_string.strip()

# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)

# output
print(no_stpwords_string)

['python', 'released', 'in', 'was', 'a', 'major', 'revision', 'of', 'the', 'language', 'that', 'is', 'not', 'completely', 'backward', 'compatible', 'and', 'much', 'python', 'code', 'does', 'not', 'run', 'unmodified', 'on', 'python', 'with', 'python', 's', 'endoflife', 'only', 'python', 'x', 'and', 'later', 'are', 'supported', 'with', 'older', 'versions', 'still', 'supporting', 'eg', 'windows', 'and', 'old', 'installers', 'not', 'restricted', 'to', 'bit', 'windows']
python released major revision language completely backward compatible much python code run unmodified python python endoflife python x later supported older versions still supporting eg windows old installers restricted bit windows


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#Practice on IMDB Movie Reviews


In [1]:
import pandas as pd

In [3]:
df=pd.read_csv('IMDB Dataset.csv', engine='python')

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
df['review'].str.lower()

Unnamed: 0,review
0,one of the other reviewers has mentioned that ...
1,a wonderful little production. <br /><br />the...
2,i thought this was a wonderful way to spend ti...
3,basically there's a family where a little boy ...
4,"petter mattei's ""love in the time of money"" is..."
...,...
49995,i thought this movie did a down right good job...
49996,"bad plot, bad dialogue, bad acting, idiotic di..."
49997,i am a catholic taught in parochial elementary...
49998,i'm going to have to disagree with the previou...


In [7]:
df['review']=df['review'].str.lower()

In [8]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [12]:
import re

df['review'] = df['review'].str.replace(r'<.*?>', '', regex=True)


In [13]:
df['review'] = df['review'].str.replace(r'http\S+|www\S+', '', regex=True)


In [14]:
df['review'] = df['review'].str.replace(r'[^a-z\s]', '', regex=True)


In [15]:
df['review'] = df['review'].str.replace(r'\s+', ' ', regex=True).str.strip()
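
One detail worth checking in the cleaning cells above: replacing HTML tags with an empty string can glue neighbouring words together when no space surrounds the tag (for example `time.<br /><br />this`). The sketch below applies the same four regular expressions to a plain string and contrasts the two choices. The function name `clean_review`, the `tag_replacement` parameter, and the sample text are assumptions for illustration.

```python
import re

def clean_review(text, tag_replacement=" "):
    """Lowercase, drop HTML tags and URLs, keep only letters, tidy whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", tag_replacement, text)   # HTML tags
    text = re.sub(r"http\S+|www\S+", " ", text)      # URLs
    text = re.sub(r"[^a-z\s]", "", text)             # anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

sample = "All the time.<br /><br />This movie is slow... see http://example.com"
print(clean_review(sample))                       # all the time this movie is slow see
print(clean_review(sample, tag_replacement=""))   # all the timethis movie is slow see
```

With a DataFrame, the equivalent change would be passing `' '` instead of `''` as the replacement in the tag-removal step; the later whitespace cell already collapses the extra spaces.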


In [24]:
def remove_punctuation(df, text_col):
    # Removes punctuation: drops every character that is not a letter or whitespace (digits included)
    df[text_col] = df[text_col].str.replace(r'[^a-zA-Z\s]', '', regex=True)
    return df


#Tokenization (Text → Words)

In [18]:
import nltk
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

df['tokens'] = df['review'].apply(word_tokenize)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [19]:
df['tokens']

Unnamed: 0,tokens
0,"[one, of, the, other, reviewers, has, mentione..."
1,"[a, wonderful, little, production, the, filmin..."
2,"[i, thought, this, was, a, wonderful, way, to,..."
3,"[basically, theres, a, family, where, a, littl..."
4,"[petter, matteis, love, in, the, time, of, mon..."
...,...
49995,"[i, thought, this, movie, did, a, down, right,..."
49996,"[bad, plot, bad, dialogue, bad, acting, idioti..."
49997,"[i, am, a, catholic, taught, in, parochial, el..."
49998,"[im, going, to, have, to, disagree, with, the,..."


#Remove Stopwords (Keep "not")

In [20]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.remove('not')  # important for sentiment

df['tokens'] = df['tokens'].apply(
    lambda x: [word for word in x if word not in stop_words]
)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [21]:
df['tokens']

Unnamed: 0,tokens
0,"[one, reviewers, mentioned, watching, oz, epis..."
1,"[wonderful, little, production, filming, techn..."
2,"[thought, wonderful, way, spend, time, hot, su..."
3,"[basically, theres, family, little, boy, jake,..."
4,"[petter, matteis, love, time, money, visually,..."
...,...
49995,"[thought, movie, right, good, job, wasnt, crea..."
49996,"[bad, plot, bad, dialogue, bad, acting, idioti..."
49997,"[catholic, taught, parochial, elementary, scho..."
49998,"[im, going, disagree, previous, comment, side,..."
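
Tokens such as `theres`, `wasnt` and `im` survive in the output above because apostrophes were stripped before filtering, so the contractions no longer match stop-list entries like `wasn't`. The sketch below compares the two orderings on a single sentence. The tiny `STOP_WORDS` set is a hypothetical stand-in for the NLTK list, and the helper names are assumptions.

```python
import re

# Tiny illustrative stop list (hypothetical; the NLTK list is much larger).
STOP_WORDS = {"a", "the", "and", "there's", "wasn't"}

def strip_punct(text):
    """Keep only lowercase letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text)

def strip_then_filter(text):
    """Order used in this notebook: punctuation first, stop words second."""
    tokens = strip_punct(text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: stop words first, punctuation second."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return strip_punct(" ".join(tokens)).split()

review = "There's a zombie and the ending wasn't scary"
print(strip_then_filter(review))  # ['theres', 'zombie', 'ending', 'wasnt', 'scary']
print(filter_then_strip(review))  # ['zombie', 'ending', 'scary']
```

Neither order is automatically right: for sentiment analysis, dropping `wasn't` also discards a negation, so it may be preferable to expand contractions ("wasn't" to "was not") before either step.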


#Lemmatization

In [22]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

df['tokens'] = df['tokens'].apply(
    lambda x: [lemmatizer.lemmatize(word) for word in x]
)


[nltk_data] Downloading package wordnet to /root/nltk_data...


#Handle Negations

In [23]:
def handle_negation(tokens):
    result = []
    i = 0
    while i < len(tokens):
        if tokens[i] == 'not' and i+1 < len(tokens):
            result.append(tokens[i] + '_' + tokens[i+1])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

df['tokens'] = df['tokens'].apply(handle_negation)
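
To make the behaviour of `handle_negation` concrete, here is a small usage sketch on hand-written token lists. The function is repeated from the cell above only so the snippet runs on its own; the sample tokens are made up.

```python
def handle_negation(tokens):
    """Merge each 'not' with the token that follows it (same logic as the cell above)."""
    result = []
    i = 0
    while i < len(tokens):
        if tokens[i] == 'not' and i + 1 < len(tokens):
            result.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

print(handle_negation(["movie", "not", "good"]))     # ['movie', 'not_good']
print(handle_negation(["not", "really", "funny"]))   # ['not_really', 'funny']
print(handle_negation(["ending", "not"]))            # ['ending', 'not']
```

Because stop words were removed earlier, `not` attaches to whichever token survives next, which is not always the word being negated (second example). A trailing `not` has nothing to merge with and is left unchanged.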


In [25]:
df['clean_review'] = df['tokens'].apply(lambda x: " ".join(x))


In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2)
)

X = tfidf.fit_transform(df['clean_review'])
y = df['sentiment']


In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [28]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


In [29]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.8884
              precision    recall  f1-score   support

    negative       0.90      0.88      0.89      4961
    positive       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
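
In the cells above, the TF-IDF vectorizer is fit on all reviews before the train/test split, so the vocabulary and IDF weights are computed with the test reviews included. A possible alternative, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline` so that fitting only ever sees the training part. The toy texts and labels are assumptions for illustration and say nothing about the accuracy that would be obtained on the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for df['clean_review'] and df['sentiment'].
texts = [
    "wonderful film great acting", "loved every minute", "brilliant story not_boring",
    "great fun wonderful cast", "terrible plot bad acting", "boring not_good waste time",
    "awful script bad ending", "bad film not_recommend",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first, so the vectorizer never sees the test reviews.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF and the classifier on the training split only.
clf = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(list(zip(X_test, preds)))
```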



In [30]:
df1=pd.read_csv('test.csv')

In [31]:
df1

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."
...,...,...
17192,49155,thought factory: left-right polarisation! #tru...
17193,49156,feeling like a mermaid ð #hairflip #neverre...
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...
17195,49158,"happy, at work conference: right mindset leads..."


In [32]:
import string

In [33]:
exclude=string.punctuation

In [34]:
def remove_punc1(text):
  return text.translate(str.maketrans('','',exclude))

In [35]:
df1['tweet'].apply(remove_punc1)

Unnamed: 0,tweet
0,studiolife aislife requires passion dedication...
1,user white supremacists want everyone to see ...
2,safe ways to heal your acne altwaystoheal h...
3,is the hp and the cursed child book up for res...
4,3rd bihday to my amazing hilarious nephew el...
...,...
17192,thought factory leftright polarisation trump u...
17193,feeling like a mermaid ð hairflip neverread...
17194,hillary campaigned today in ohioomg amp used w...
17195,happy at work conference right mindset leads t...


In [37]:
df1['tweet']=df1['tweet'].apply(remove_punc1)

In [44]:
df1['tweet']

Unnamed: 0,tweet
0,studiolife aislife requires passion dedication...
1,user white supremacists want everyone to see ...
2,safe ways to heal your acne altwaystoheal h...
3,is the hp and the cursed child book up for res...
4,3rd bihday to my amazing hilarious nephew el...
...,...
17192,thought factory leftright polarisation trump u...
17193,feeling like a mermaid ð hairflip neverread...
17194,hillary campaigned today in ohioomg amp used w...
17195,happy at work conference right mindset leads t...


#Chat Word Treatment

In [38]:
slang_dict = {
    "A3": "Anytime, Anywhere, Anyplace",
    "ADIH": "Another Day In Hell",
    "AFK": "Away From Keyboard",
    "AFAIK": "As Far As I Know",
    "ASAP": "As Soon As Possible",
    "ASL": "Age, Sex, Location",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "BAE": "Before Anyone Else",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRUH": "Bro",
    "BRT": "Be Right There",
    "BSAAW": "Big Smile And A Wink",
    "BTW": "By The Way",
    "BWL": "Bursting With Laughter",
    "CSL": "Can’t Stop Laughing",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "DM": "Direct Message",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FIMH": "Forever In My Heart",
    "FOMO": "Fear Of Missing Out",
    "FR": "For Real",
    "FWIW": "For What It's Worth",
    "FYP": "For You Page",
    "FYI": "For Your Information",
    "G9": "Genius",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GMTA": "Great Minds Think Alike",
    "GN": "Good Night",
    "GOAT": "Greatest Of All Time",
    "GR8": "Great!",
    "HBD": "Happy Birthday",
    "IC": "I See",
    "ICQ": "I Seek You",
    "IDC": "I Don’t Care",
    "IDK": "I Don't Know",
    "IFYP": "I Feel Your Pain",
    "ILU": "I Love You",
    "ILY": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMU": "I Miss You",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "IYKYK": "If You Know, You Know",
    "JK": "Just Kidding",
    "KISS": "Keep It Simple, Stupid",
    "L": "Loss",
    "L8R": "Later",
    "LDR": "Long Distance Relationship",
    "LMK": "Let Me Know",
    "LMAO": "Laughing My A** Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "M8": "Mate",
    "MFW": "My Face When",
    "MID": "Mediocre",
    "MRW": "My Reaction When",
    "MTE": "My Thoughts Exactly",
    "NVM": "Never Mind",
    "NRN": "No Reply Necessary",
    "NPC": "Non-Player Character",
    "OIC": "Oh I See",
    "OP": "Overpowered",
    "PITA": "Pain In The A**",
    "POV": "Point Of View",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A** Off",
    "RN": "Right Now",
    "SK8": "Skate",
    "STATS": "Your Sex And Age",
    "SUS": "Suspicious",
    "TBH": "To Be Honest",
    "TFW": "That Feeling When",
    "THX": "Thank You",
    "TIME": "Tears In My Eyes",
    "TLDR": "Too Long, Didn’t Read",
    "TNTL": "Trying Not To Laugh",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "W": "Win",
    "W8": "Wait...",
    "WB": "Welcome Back",
    "WTF": "What The F**k",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "WYD": "What You Doing?",
    "WYWH": "Wish You Were Here",
    "ZZZ": "Sleeping, Bored, Tired"
}


In [39]:
slang_dict

{'A3': 'Anytime, Anywhere, Anyplace',
 'ADIH': 'Another Day In Hell',
 'AFK': 'Away From Keyboard',
 'AFAIK': 'As Far As I Know',
 'ASAP': 'As Soon As Possible',
 'ASL': 'Age, Sex, Location',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'BAE': 'Before Anyone Else',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRUH': 'Bro',
 'BRT': 'Be Right There',
 'BSAAW': 'Big Smile And A Wink',
 'BTW': 'By The Way',
 'BWL': 'Bursting With Laughter',
 'CSL': 'Can’t Stop Laughing',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'DM': 'Direct Message',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FIMH': 'Forever In My Heart',
 'FOMO': 'Fear Of Missing Out',
 'FR': 'For Real',
 'FWIW': "For What It's Worth",
 'FYP': 'For You Page',
 'FYI': 'For Your Information',
 'G9': 'Genius',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GMTA': 'Great Minds Think Alik

In [40]:
def chat_conversion(text):
    new_text = []

    for w in text.split():
        if w.upper() in slang_dict:
            new_text.append(slang_dict[w.upper()])
        else:
            new_text.append(w)

    return " ".join(new_text)


In [42]:
chat_conversion("IMHO he is the best")

'In My Honest/Humble Opinion he is the best'

In [43]:
text = "brb afk lol this movie was gr8"
print(chat_conversion(text))


Be Right Back Away From Keyboard Laughing Out Loud this movie was Great!
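
Two limitations of the whitespace-based lookup are worth keeping in mind. A token with attached punctuation (`lol,`) is not found in the dictionary, and with the full dictionary ordinary words such as "time", "kiss" or "mid" are also keys and would be expanded. The sketch below addresses only the first point by matching on word boundaries with a regular expression. The three-entry `SLANG` dictionary is a small subset chosen for illustration, and `expand_slang` is a hypothetical name.

```python
import re

# Small subset of the dictionary above, for illustration only.
SLANG = {
    "BRB": "Be Right Back",
    "LOL": "Laughing Out Loud",
    "IMHO": "In My Honest/Humble Opinion",
}

# One alternation of all keys, longest first, anchored on word boundaries.
_SLANG_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(SLANG, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_slang(text):
    """Expand known chat abbreviations while leaving surrounding punctuation intact."""
    return _SLANG_RE.sub(lambda m: SLANG[m.group(0).upper()], text)

print(expand_slang("lol, brb."))   # Laughing Out Loud, Be Right Back.
print(expand_slang("lollipop"))    # lollipop
```

For the second point, one option is to prune ambiguous keys from the dictionary before use; which keys count as ambiguous depends on the corpus.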


#Spelling Correction using TextBlob

In [45]:
%pip install textblob




In [46]:
from textblob import TextBlob

def spelling_correction(text):
    """
    Corrects spelling mistakes in a given text.
    """
    return str(TextBlob(text).correct())


In [47]:
text = "this moovie was amazng but the story waz boring"
print(spelling_correction(text))


this movie was amazing but the story was boring


In [48]:
%pip install pyspellchecker


Collecting pyspellchecker
  Downloading pyspellchecker-0.8.4-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.4-py3-none-any.whl (7.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 42.4 MB/s eta 0:00:00
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.4


#Fast Spelling Correction

In [49]:
from spellchecker import SpellChecker

spell = SpellChecker()

def spelling_correction(text):
    corrected_words = []
    for word in text.split():
        corrected_words.append(spell.correction(word) or word)
    return " ".join(corrected_words)


In [50]:
text = "this moovie was amazng but story waz boring"
print(spelling_correction(text))


this movie was amazing but story was boring


#Remove Emojis from Text

In [51]:
import re

def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags
        u"\U00002700-\U000027BF"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range: also matches CJK, Hangul and other non-emoji characters
        "]+",
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)


In [52]:
text = "This movie was amazing 😍🔥 but ending was bad 😢"
print(remove_emoji(text))


This movie was amazing  but ending was bad 
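
For sentiment tasks, emojis often carry signal, so converting them to text can be preferable to deleting them. The sketch below uses only the standard library and replaces each emoji with its lowercase Unicode character name. The pattern covers just three emoji blocks, and the function name `emoji_to_text` is an assumption.

```python
import re
import unicodedata

# Three common emoji blocks only (pictographs, emoticons, transport); not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]")

def emoji_to_text(text):
    """Replace each emoji with its Unicode name, e.g. the fire emoji becomes 'fire'."""
    def to_name(match):
        # Fall back to the generic word "emoji" if the character has no name.
        name = unicodedata.name(match.group(0), "emoji")
        return " " + name.lower().replace(" ", "_") + " "
    # Substitute, then collapse the padding spaces added around each name.
    return re.sub(r"\s+", " ", EMOJI.sub(to_name, text)).strip()

print(emoji_to_text("This movie was amazing 😍🔥 but ending was bad 😢"))
```

The resulting tokens (for example `fire` or `crying_face`) then pass through tokenization and vectorization like any other word.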
