In [3]:
pip install emoji




In [4]:
pip install autocorrect

Note: you may need to restart the kernel to use updated packages.


## Step 1: Read the source data


In [5]:
# Load dataset
import pandas as pd
file_path = "Review.csv"
df = pd.read_csv(file_path, encoding='latin1')
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df)

                                                                           Review
0   The product arrived on time. Packaging was great, and the quality is amazing!
1                                        THIS PRODUCT IS JUST AMAZING! I LOVE IT.
2     I bought this phone for $799, and it has a 120Hz display. Totally worth it!
3                         Wow!!! This product is awesome... but a bit expensive??
4                                             The laptop works perfectly fine.   
5    Check out the full product details here: https://example.com/product-details
6         <div><h2>Great Purchase!</h2><p>I am happy with this product.</p></div>
7                The battry life is excelent, but the chargin cable is too short.
8                       I can't believe it's so good! Didn't expect such quality.
9                   Love this product! ???? Fast delivery ??, amazing quality! ??
10                       TBH, I wasnt expecting much, but OMG, this is awesome!!
11              

## Step 2: Perform Text Pre-Processing

### a. Convert text to lowercase

In [6]:
# Lowercase conversion
def convert_to_lowercase(text):
    return text.lower()
df["lowercased"] = df["Review"].apply(convert_to_lowercase)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["lowercased"])

0     the product arrived on time. packaging was great, and the quality is amazing!
1                                          this product is just amazing! i love it.
2       i bought this phone for $799, and it has a 120hz display. totally worth it!
3                           wow!!! this product is awesome... but a bit expensive??
4                                               the laptop works perfectly fine.   
5      check out the full product details here: https://example.com/product-details
6           <div><h2>great purchase!</h2><p>i am happy with this product.</p></div>
7                  the battry life is excelent, but the chargin cable is too short.
8                         i can't believe it's so good! didn't expect such quality.
9                     love this product! ???? fast delivery ??, amazing quality! ??
10                         tbh, i wasnt expecting much, but omg, this is awesome!!
11                            this is the best product i have ever used in m
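A possible alternative, sketched here on a small made-up Series rather than the real `df`: the vectorized `.str.lower()` accessor lowercases every row and passes missing values through, whereas calling `text.lower()` on a NaN would raise an error. The `sample` data below is an assumption for illustration only.

```python
import pandas as pd

# Illustrative sample (not the real dataset): one review and one missing value
sample = pd.Series(["THIS PRODUCT IS JUST AMAZING! I LOVE IT.", None])

# .str.lower() is applied element-wise and leaves missing values as missing
lowered = sample.str.lower()
print(lowered)
```

In the notebook this would correspond to `df["Review"].str.lower()`.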

### b. Remove URLs


In [7]:
# Removal of URLs
import re
# remove any URLs that start with "http" or "www" from the text
def remove_urls(text):
    return re.sub(r'http\S+|www\S+', '', text)
df["urls_removed"] = df["lowercased"].apply(remove_urls)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["urls_removed"])

0     the product arrived on time. packaging was great, and the quality is amazing!
1                                          this product is just amazing! i love it.
2       i bought this phone for $799, and it has a 120hz display. totally worth it!
3                           wow!!! this product is awesome... but a bit expensive??
4                                               the laptop works perfectly fine.   
5                                         check out the full product details here: 
6           <div><h2>great purchase!</h2><p>i am happy with this product.</p></div>
7                  the battry life is excelent, but the chargin cable is too short.
8                         i can't believe it's so good! didn't expect such quality.
9                     love this product! ???? fast delivery ??, amazing quality! ??
10                         tbh, i wasnt expecting much, but omg, this is awesome!!
11                            this is the best product i have ever used in m
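Note that `www\S+` can match inside ordinary words (for example the tail of "awww"), and row 5 keeps a trailing space once the link is gone. A stricter variant might look like the sketch below. The name `remove_urls_strict` and the tighter pattern are assumptions for illustration, not part of the original notebook.

```python
import re

# Require "http://", "https://" or "www." at a word boundary before treating text as a URL
URL_PATTERN = re.compile(r"\b(?:https?://|www\.)\S+")

def remove_urls_strict(text):
    # Drop the URL, then trim the whitespace it leaves at the ends
    return URL_PATTERN.sub("", text).strip()

print(remove_urls_strict("check out the full product details here: https://example.com/product-details"))
print(remove_urls_strict("awww this is cute"))
```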

### c. Remove HTML tags


In [8]:
# Removal of HTML tags
from bs4 import BeautifulSoup
# extracts only the text, removing all HTML tags
def remove_html_tags(text):
    return BeautifulSoup(text, "html.parser").get_text()
df["html_removed"] = df["urls_removed"].apply(remove_html_tags)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["html_removed"])

0     the product arrived on time. packaging was great, and the quality is amazing!
1                                          this product is just amazing! i love it.
2       i bought this phone for $799, and it has a 120hz display. totally worth it!
3                           wow!!! this product is awesome... but a bit expensive??
4                                               the laptop works perfectly fine.   
5                                         check out the full product details here: 
6                                      great purchase!i am happy with this product.
7                  the battry life is excelent, but the chargin cable is too short.
8                         i can't believe it's so good! didn't expect such quality.
9                     love this product! ???? fast delivery ??, amazing quality! ??
10                         tbh, i wasnt expecting much, but omg, this is awesome!!
11                            this is the best product i have ever used in m
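Row 6 shows a side effect of `get_text()` with its default arguments: text from adjacent tags is joined with nothing in between, giving "purchase!i". One way to avoid this, sketched under the assumption that a single space between tag contents is acceptable, is to pass a separator. The function name `remove_html_tags_spaced` is hypothetical.

```python
from bs4 import BeautifulSoup

def remove_html_tags_spaced(text):
    soup = BeautifulSoup(text, "html.parser")
    # separator=" " puts a space between text from different tags;
    # strip=True trims whitespace around each piece
    return soup.get_text(separator=" ", strip=True)

html = "<div><h2>great purchase!</h2><p>i am happy with this product.</p></div>"
print(remove_html_tags_spaced(html))
```

Keeping the words apart at this stage also matters later: after punctuation removal the fused token becomes "purchasei", which the spelling step then changes to "purchased".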

### d. Remove emojis

In [9]:
# Removal of emojis (if any)
import emoji
# replace emoji with ''
def remove_emojis(text):
    return emoji.replace_emoji(text, replace='')
df["emojis_removed"] = df["html_removed"].apply(remove_emojis)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["emojis_removed"])


0     the product arrived on time. packaging was great, and the quality is amazing!
1                                          this product is just amazing! i love it.
2       i bought this phone for $799, and it has a 120hz display. totally worth it!
3                           wow!!! this product is awesome... but a bit expensive??
4                                               the laptop works perfectly fine.   
5                                         check out the full product details here: 
6                                      great purchase!i am happy with this product.
7                  the battry life is excelent, but the chargin cable is too short.
8                         i can't believe it's so good! didn't expect such quality.
9                     love this product! ???? fast delivery ??, amazing quality! ??
10                         tbh, i wasnt expecting much, but omg, this is awesome!!
11                            this is the best product i have ever used in m

### e. Replace internet slang/chat words


In [10]:
# Dictionary of slang words and their replacements
slang_dict = {
    "tbh": "to be honest",
    "omg": "oh my god",
    "lol": "laugh out loud",
    "idk": "i do not know",
    "brb": "be right back",
    "btw": "by the way",
    "imo": "in my opinion",
    "smh": "shaking my head",
    "fyi": "for your information",
    "np": "no problem",
    "ikr": "i know right",
    "asap": "as soon as possible",
    "bff": "best friend forever",
    "gg": "good game",
    "hmu": "hit me up",
    "rofl": "rolling on the floor laughing",
}

# Function to replace slang words
def replace_slang(text):
    # Handle non-string / missing values safely
    if pd.isna(text):
        return text
    text = str(text)

    # Create a list of escaped slang words
    escaped_slang_words = []
    for word in slang_dict.keys():
        escaped_word = re.escape(word)  # Ensure special characters are escaped
        escaped_slang_words.append(escaped_word)

    # Join the words using '|'
    slang_pattern = r"\b(" + "|".join(escaped_slang_words) + r")\b"

    # Define a replacement function
    def replace_match(match):
        slang_word = match.group(0).lower()
        return slang_dict.get(slang_word, match.group(0))

    # Use regex to replace slang words with full forms
    replaced_text = re.sub(slang_pattern, replace_match, text, flags=re.IGNORECASE)
    return replaced_text

# Apply the function to the column
df["slangs_replaced"] = df["emojis_removed"].apply(replace_slang)

# Display column content without truncation
pd.set_option("display.max_colwidth", None)
print(df["slangs_replaced"])

0     the product arrived on time. packaging was great, and the quality is amazing!
1                                          this product is just amazing! i love it.
2       i bought this phone for $799, and it has a 120hz display. totally worth it!
3                           wow!!! this product is awesome... but a bit expensive??
4                                               the laptop works perfectly fine.   
5                                         check out the full product details here: 
6                                      great purchase!i am happy with this product.
7                  the battry life is excelent, but the chargin cable is too short.
8                         i can't believe it's so good! didn't expect such quality.
9                     love this product! ???? fast delivery ??, amazing quality! ??
10          to be honest, i wasnt expecting much, but oh my god, this is awesome!!
11                            this is the best product i have ever used in m
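The function above rebuilds the regex on every call. A minimal sketch of compiling it once at module level is shown below, using only a two-entry subset of the dictionary. The names `SLANG_PATTERN` and `replace_slang_fast` are assumptions for illustration.

```python
import re

# Subset of the slang dictionary, for illustration only
slang_dict = {"tbh": "to be honest", "omg": "oh my god"}

# Build and compile the alternation once, outside the function
SLANG_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, slang_dict)) + r")\b",
    flags=re.IGNORECASE,
)

def replace_slang_fast(text):
    # Look up each match in lowercase so "TBH" and "tbh" behave the same
    return SLANG_PATTERN.sub(lambda m: slang_dict[m.group(0).lower()], text)

print(replace_slang_fast("TBH, i wasnt expecting much, but omg, this is awesome!!"))
```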

### f. Replace contractions

In [11]:
# Replace Contractions
contractions_dict = {
    "wasn't": "was not",
    "isn't": "is not",
    "aren't": "are not",
    "weren't": "were not",
    "doesn't": "does not",
    "don't": "do not",
    "didn't": "did not",
    "can't": "cannot",
    "couldn't": "could not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "won't": "will not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "i'm": "i am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "i've": "i have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "i'd": "i would",
    "you'd": "you would",
    "he'd": "he would",
    "she'd": "she would",
    "we'd": "we would",
    "they'd": "they would",
    "i'll": "i will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "when's": "when is",
    "why's": "why is",
}

# Build the regex pattern for contractions
escaped_contractions = []
for contraction in contractions_dict.keys():
    # Escape special characters (e.g., apostrophes)
    escaped_contraction = re.escape(contraction)
    escaped_contractions.append(escaped_contraction)

joined_contractions = "|".join(escaped_contractions)
contractions_pattern = r"\b(" + joined_contractions + r")\b"
compiled_pattern = re.compile(contractions_pattern, flags=re.IGNORECASE)

# Define a function to replace contractions
def replace_contractions(text):
    if pd.isna(text):
        return text
    text = str(text)

    def replace_match(match):
        matched_word = match.group(0).lower()
        return contractions_dict.get(matched_word, match.group(0))

    return compiled_pattern.sub(replace_match, text)

# Apply the function to a DataFrame column
df["contractions_replaced"] = df["slangs_replaced"].apply(replace_contractions)

# Display column content without truncation
pd.set_option("display.max_colwidth", None)
print(df["contractions_replaced"])

0     the product arrived on time. packaging was great, and the quality is amazing!
1                                          this product is just amazing! i love it.
2       i bought this phone for $799, and it has a 120hz display. totally worth it!
3                           wow!!! this product is awesome... but a bit expensive??
4                                               the laptop works perfectly fine.   
5                                         check out the full product details here: 
6                                      great purchase!i am happy with this product.
7                  the battry life is excelent, but the chargin cable is too short.
8                      i cannot believe it is so good! did not expect such quality.
9                     love this product! ???? fast delivery ??, amazing quality! ??
10          to be honest, i wasnt expecting much, but oh my god, this is awesome!!
11                            this is the best product i have ever used in m
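Row 10 still contains "wasnt" because the dictionary only lists forms with an apostrophe. A hedged sketch of one workaround follows: add apostrophe-free variants for the "n't" contractions, and normalise curly apostrophes before matching. The helper names, the `AMBIGUOUS` exclusion set, and the reduced dictionary are all assumptions for illustration.

```python
import re

# Reduced dictionary for illustration
contractions_dict = {
    "wasn't": "was not", "didn't": "did not",
    "can't": "cannot", "won't": "will not", "it's": "it is",
}

# Apostrophe-free forms that are also ordinary English words are skipped
AMBIGUOUS = {"wont", "cant"}

def with_apostrophe_free_variants(mapping):
    extended = dict(mapping)
    for contraction, expansion in mapping.items():
        if contraction.endswith("n't"):
            bare = contraction.replace("'", "")
            if bare not in AMBIGUOUS:
                extended[bare] = expansion
    return extended

extended_dict = with_apostrophe_free_variants(contractions_dict)
pattern = re.compile(r"\b(" + "|".join(map(re.escape, extended_dict)) + r")\b", re.IGNORECASE)

def expand(text):
    text = text.replace("\u2019", "'")  # curly apostrophe -> straight apostrophe
    return pattern.sub(lambda m: extended_dict[m.group(0).lower()], text)

print(expand("to be honest, i wasnt expecting much"))
```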

### g. Remove punctuation and special characters

In [12]:
# Remove punctuation and special characters
import string
# Function to remove punctuation (deletes every character in string.punctuation)
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))
# Apply the function to the column
df["punctuations_removed"] = df["contractions_replaced"].apply(remove_punctuation)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["punctuations_removed"])


0     the product arrived on time packaging was great and the quality is amazing
1                                         this product is just amazing i love it
2        i bought this phone for 799 and it has a 120hz display totally worth it
3                                wow this product is awesome but a bit expensive
4                                             the laptop works perfectly fine   
5                                       check out the full product details here 
6                                     great purchasei am happy with this product
7                 the battry life is excelent but the chargin cable is too short
8                     i cannot believe it is so good did not expect such quality
9                             love this product  fast delivery  amazing quality 
10            to be honest i wasnt expecting much but oh my god this is awesome
11                          this is the best product i have ever used in my life
12    the shoes were comfort
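Deleting punctuation outright fuses words that were only separated by a punctuation mark (row 6 becomes "purchasei") and leaves stray double spaces (row 9). A sketch of an alternative, assuming contractions have already been expanded so that no apostrophes need protecting: map each punctuation character to a space, then collapse runs of whitespace. The name `remove_punctuation_spaced` is hypothetical.

```python
import re
import string

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punctuation_spaced(text):
    spaced = text.translate(PUNCT_TO_SPACE)
    # Collapse repeated whitespace and trim both ends
    return re.sub(r"\s+", " ", spaced).strip()

print(remove_punctuation_spaced("great purchase!i am happy with this product."))
print(remove_punctuation_spaced("wow!!! this product is awesome... but a bit expensive??"))
```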

### h. Remove numbers

In [13]:
# Remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)  # Removes every run of digits
# Apply the function to the column
df["numbers_removed"] = df["punctuations_removed"].apply(remove_numbers)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["numbers_removed"])

0     the product arrived on time packaging was great and the quality is amazing
1                                         this product is just amazing i love it
2              i bought this phone for  and it has a hz display totally worth it
3                                wow this product is awesome but a bit expensive
4                                             the laptop works perfectly fine   
5                                       check out the full product details here 
6                                     great purchasei am happy with this product
7                 the battry life is excelent but the chargin cable is too short
8                     i cannot believe it is so good did not expect such quality
9                             love this product  fast delivery  amazing quality 
10            to be honest i wasnt expecting much but oh my god this is awesome
11                          this is the best product i have ever used in my life
12    the shoes were comfort
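In row 2 the digits disappear but the unit fragment "hz" and a double space remain. If tokens such as "120hz" carry no value for the analysis, one option is to drop any token that contains a digit and then tidy the whitespace. This is a sketch of that design choice, not the notebook's method, and `remove_numeric_tokens` is a hypothetical name.

```python
import re

def remove_numeric_tokens(text):
    # Remove whole tokens that contain at least one digit (e.g. "799", "120hz")
    without_numbers = re.sub(r"\b\w*\d\w*\b", "", text)
    # Collapse the gaps left behind
    return re.sub(r"\s+", " ", without_numbers).strip()

print(remove_numeric_tokens("i bought this phone for 799 and it has a 120hz display totally worth it"))
```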

### i. Correct spelling mistakes


In [14]:
# Correct spelling mistakes
from autocorrect import Speller
# Initialize spell checker
spell = Speller(lang='en')
# Function to correct spelling
def correct_spelling(text):
    return spell(text)  # Apply correction
# Apply the function to the column
df["spelling_corrected"] = df["numbers_removed"].apply(correct_spelling)
# Display column content without truncation
pd.set_option('display.max_colwidth', None) # Set to None for unlimited width
print(df["spelling_corrected"])

0     the product arrived on time packaging was great and the quality is amazing
1                                         this product is just amazing i love it
2              i bought this phone for  and it has a hz display totally worth it
3                                wow this product is awesome but a bit expensive
4                                             the laptop works perfectly fine   
5                                       check out the full product details here 
6                                     great purchased am happy with this product
7              the battery life is excellent but the charging cable is too short
8                     i cannot believe it is so good did not expect such quality
9                             love this product  fast delivery  amazing quality 
10            to be honest i wasnt expecting much but oh my god this is awesome
11                          this is the best product i have ever used in my life
12    the shoes were comfort

### j. Remove stopwords


In [15]:
# Remove stopwords
import nltk
import pandas as pd
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download("stopwords")

# Define stopwords list
stop_words = set(stopwords.words("english"))

# Function to remove stopwords
def remove_stopwords(text):
    if pd.isna(text):
        return text
    text = str(text)

    words = text.split()  # Split text into words
    filtered_words = []   # Store words after stopword removal

    for word in words:  # Loop through each word
        lower_word = word.lower()  # Convert to lowercase for uniform comparison
        if lower_word not in stop_words:  # If not a stopword, keep it
            filtered_words.append(word)

    return " ".join(filtered_words)  # Join words back into a sentence

# Apply the function to the column
df["stopwords_removed"] = df["spelling_corrected"].apply(remove_stopwords)

# Display column content without truncation
pd.set_option("display.max_colwidth", None)
print(df["stopwords_removed"])

0          product arrived time packaging great quality amazing
1                                          product amazing love
2                         bought phone hz display totally worth
3                             wow product awesome bit expensive
4                                   laptop works perfectly fine
5                                    check full product details
6                                 great purchased happy product
7                   battery life excellent charging cable short
8                            cannot believe good expect quality
9                    love product fast delivery amazing quality
10                  honest wasnt expecting much oh god awesome
11                                  best product ever used life
12    shoes comfortable fitting nicely worked perfectly jogging
Name: stopwords_removed, dtype: object


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
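Row 8 went from "did not expect such quality" to "expect quality": the negation is gone, which can flip the meaning for tasks such as sentiment analysis. A minimal sketch of keeping negation words is shown below. To stay self-contained it uses a tiny hand-written stop list as a stand-in; in the notebook the base set would be `stopwords.words("english")`. All names here are assumptions for illustration.

```python
# Tiny stand-in stop list (assumption); the notebook would use NLTK's English list
base_stop_words = {"i", "it", "is", "so", "did", "not", "such", "no", "the"}

# Words that should survive stopword removal because they carry negation
NEGATIONS = {"not", "no", "nor", "never"}
stop_words_keep_neg = base_stop_words - NEGATIONS

def remove_stopwords_keep_negation(text):
    kept = [w for w in text.split() if w.lower() not in stop_words_keep_neg]
    return " ".join(kept)

print(remove_stopwords_keep_negation("i cannot believe it is so good did not expect such quality"))
```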


### k. Stemming - reduces words to their base root by chopping off suffixes


In [16]:
# Stemming - reduces words to their base root by chopping off suffixes
import pandas as pd
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Function to apply stemming
def stem_text(text):
    if pd.isna(text):
        return ""
    if not isinstance(text, str):
        text = str(text)

    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]  # Apply stemming
    return " ".join(stemmed_words)

# Apply the function
df["stemmed_words"] = df["stopwords_removed"].apply(stem_text)

# Display column content without truncation
pd.set_option("display.max_colwidth", None)
print(df["stemmed_words"])

0     product arriv time packag great qualiti amaz
1                                product amaz love
2              bought phone hz display total worth
3                    wow product awesom bit expens
4                       laptop work perfectli fine
5                        check full product detail
6                      great purchas happi product
7              batteri life excel charg cabl short
8                cannot believ good expect qualiti
9          love product fast deliveri amaz qualiti
10         honest wasnt expect much oh god awesom
11                      best product ever use life
12        shoe comfort fit nice work perfectli jog
Name: stemmed_words, dtype: object
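As the output shows, stems are not always dictionary words ("qualiti", "amaz"), but different inflections of a word do collapse to the same stem. The short check below illustrates this with words taken from the output above. It needs no NLTK data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Different inflections share one stem
stems = {word: stemmer.stem(word) for word in ["works", "worked", "quality", "amazing"]}
print(stems)
```

Whether a non-word stem is acceptable depends on the downstream use: it is usually fine for bag-of-words features, less so when the tokens are shown to people.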


### l. Lemmatization - reduces words to their base dictionary form (lemma)

In [17]:
import nltk
import pandas as pd

# Download required resources (safe, common names)
nltk.download("wordnet")                      # WordNet lexical database used by the lemmatizer
nltk.download("omw-1.4")                      # Open Multilingual Wordnet data
nltk.download("averaged_perceptron_tagger")   # POS tagger model
nltk.download("punkt")                        # Punkt tokenizer models

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk import pos_tag

lemmatizer = WordNetLemmatizer()

# Map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(nltk_tag):
    if nltk_tag.startswith("J"):      # Adjective
        return wordnet.ADJ
    elif nltk_tag.startswith("V"):    # Verb
        return wordnet.VERB
    elif nltk_tag.startswith("N"):    # Noun
        return wordnet.NOUN
    elif nltk_tag.startswith("R"):    # Adverb
        return wordnet.ADV
    else:
        return wordnet.NOUN           # Default

def lemmatize_text(text):
    if pd.isna(text):
        return ""
    if not isinstance(text, str):
        text = str(text)

    words = word_tokenize(text)
    pos_tags = pos_tag(words)

    lemmatized_words = [
        lemmatizer.lemmatize(word, get_wordnet_pos(tag))
        for word, tag in pos_tags
    ]

    return " ".join(lemmatized_words)

df["lemmatized"] = df["stopwords_removed"].apply(lemmatize_text)

pd.set_option("display.max_colwidth", None)
print(df["lemmatized"])

0     product arrive time packaging great quality amazing
1                                      product amaze love
2                      buy phone hz display totally worth
3                       wow product awesome bit expensive
4                              laptop work perfectly fine
5                               check full product detail
6                            great purchase happy product
7               battery life excellent charge cable short
8                     can not believe good expect quality
9              love product fast delivery amazing quality
10               honest wasnt expect much oh god awesome
11                             best product ever use life
12         shoe comfortable fit nicely work perfectly jog
Name: lemmatized, dtype: object


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Step 3: Save the result to a file

In [18]:
df.to_csv("Processed_Reviews.csv",encoding='latin1', index=False) # Saves without the index column
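Writing with `encoding='latin1'` raises a `UnicodeEncodeError` as soon as a cell contains a character outside Latin-1 (an emoji, for instance). A hedged alternative is to write UTF-8 with a byte-order mark (`utf-8-sig`), which spreadsheet tools generally open correctly. The sketch below uses a made-up one-row frame and a temporary path; both are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Made-up frame containing a character that Latin-1 cannot encode
demo = pd.DataFrame({"Review": ["Love it \u2764"], "lemmatized": ["love"]})

out_path = os.path.join(tempfile.mkdtemp(), "Processed_Reviews_demo.csv")

# Save only the columns needed downstream, without the index column
demo[["Review", "lemmatized"]].to_csv(out_path, encoding="utf-8-sig", index=False)

round_trip = pd.read_csv(out_path, encoding="utf-8-sig")
print(round_trip)
```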

In [19]:

import pandas as pd
import re
import emoji
import string
import nltk
from bs4 import BeautifulSoup
from autocorrect import Speller
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')  # WordNet lexical database, used for lemmatization
nltk.download('omw-1.4')  # Open Multilingual Wordnet data
nltk.download('averaged_perceptron_tagger_eng')  # For POS tagging
nltk.download('punkt_tab')  # For tokenization

# Initialize tools
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Dictionary of slang words and their replacements
slang_dict = {
    "tbh": "to be honest",
    "omg": "oh my god",
    "lol": "laugh out loud",
    "idk": "i do not know",
    "brb": "be right back",
    "btw": "by the way",
    "imo": "in my opinion",
    "smh": "shaking my head",
    "fyi": "for your information",
    "np": "no problem",
    "ikr": "i know right",
    "asap": "as soon as possible",
    "bff": "best friend forever",
    "gg": "good game",
    "hmu": "hit me up",
    "rofl": "rolling on the floor laughing"
}

# Contractions dictionary
contractions_dict = {
    "wasn't": "was not",
    "isn't": "is not",
    "aren't": "are not",
    "weren't": "were not",
    "doesn't": "does not",
    "don't": "do not",
    "didn't": "did not",
    "can't": "cannot",
    "couldn't": "could not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "won't": "will not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "i'm": "i am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "i've": "i have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "i'd": "i would",
    "you'd": "you would",
    "he'd": "he would",
    "she'd": "she would",
    "we'd": "we would",
    "they'd": "they would",
    "i'll": "i will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "when's": "when is",
    "why's": "why is"
}

# Remove any URLs that start with "http" or "www" from the text
def remove_urls(text):
    return re.sub(r'http\S+|www\S+', '', text)

# extracts only the text, removing all HTML tags
def remove_html(text):
    return BeautifulSoup(text, "html.parser").get_text()

# replace emoji with ''
def remove_emojis(text):
    return emoji.replace_emoji(text, replace='')

# Replace internet slang/chat words
def replace_slang(text):
    escaped_slang_words = []
    for word in slang_dict.keys():
        escaped_word = re.escape(word)
        escaped_slang_words.append(escaped_word)

    slang_pattern = r'\b(' + '|'.join(escaped_slang_words) + r')\b'

    def replace_match(match):
        slang_word = match.group(0)
        return slang_dict[slang_word.lower()]

    replaced_text = re.sub(slang_pattern, replace_match, text, flags=re.IGNORECASE)
    return replaced_text

# Build the compiled regex pattern used to expand contractions
escaped_contractions = []
for contraction in contractions_dict.keys():
    escaped_contraction = re.escape(contraction)
    escaped_contractions.append(escaped_contraction)

joined_contractions = "|".join(escaped_contractions)
contractions_pattern = r'\b(' + joined_contractions + r')\b'
compiled_pattern = re.compile(contractions_pattern, flags=re.IGNORECASE)

def replace_contractions(text):
    def replace_match(match):
        matched_word = match.group(0)
        lower_matched_word = matched_word.lower()
        expanded_form = contractions_dict[lower_matched_word]
        return expanded_form

    expanded_text = compiled_pattern.sub(replace_match, text)
    return expanded_text

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Function to correct spelling using AutoCorrect
def correct_spelling(text):
    return spell(text)

# Function to remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Function to lemmatize text with POS tagging
def lemmatize_text(text):
    if not isinstance(text, str):
        return ""

    words = word_tokenize(text)
    pos_tags = pos_tag(words)

    lemmatized_words = [
        lemmatizer.lemmatize(word, get_wordnet_pos(tag))
        for word, tag in pos_tags
    ]

    return " ".join(lemmatized_words)

# Function to tokenize text
def tokenize_text(text):
    if not isinstance(text, str):
        return []
    return word_tokenize(text)

# Function to apply all preprocessing steps
def preprocess_text(text):
    text = text.lower()  # Step 1: Lowercasing
    text = remove_urls(text)  # Step 2: Remove URLs
    text = remove_html(text)  # Step 3: Remove HTML tags
    text = remove_emojis(text)  # Step 4: Remove Emojis
    text = replace_slang(text)  # Step 5: Replace Slang
    text = replace_contractions(text)  # Step 6: Expand Contractions
    text = remove_punctuation(text)  # Step 7: Remove Punctuation
    text = remove_numbers(text)  # Step 8: Remove Numbers
    text = correct_spelling(text)  # Step 9: Correct Spelling
    text = remove_stopwords(text)  # Step 10: Remove Stopwords
    text = lemmatize_text(text)  # Step 11: Lemmatization
    text = tokenize_text(text)  # Step 12: Tokenization
    return text

# Load dataset
df = pd.read_csv("Review.csv", encoding="latin1")  # Replace with your file

# Apply preprocessing pipeline
df["processed"] = df["Review"].apply(preprocess_text)

df.to_csv("Processed_Reviews2.csv", index=False)
# Display the first few rows
print(df[["Review", "processed"]].head())


                                                                          Review  \
0  The product arrived on time. Packaging was great, and the quality is amazing!   
1                                       THIS PRODUCT IS JUST AMAZING! I LOVE IT.   
2    I bought this phone for $799, and it has a 120Hz display. Totally worth it!   
3                        Wow!!! This product is awesome... but a bit expensive??   
4                                            The laptop works perfectly fine.      

                                                     processed  
0  [product, arrive, time, packaging, great, quality, amazing]  
1                                       [product, amaze, love]  
2                    [buy, phone, hz, display, totally, worth]  
3                      [wow, product, awesome, bit, expensive]  
4                              [laptop, work, perfectly, fine]  


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\IM11\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
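
A note on the `df.to_csv("Processed_Reviews2.csv", index=False)` call above: the `processed` column holds Python lists, and a CSV file can only store them as text. The sketch below illustrates this with a small hypothetical frame (`demo`, built from two of the rows shown above) and an in-memory buffer in place of a real file. It shows that the reloaded cells come back as strings, and that `ast.literal_eval` is one possible way to turn them back into lists. This is an illustration under those assumptions, not part of the original pipeline.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame standing in for df[["Review", "processed"]]
demo = pd.DataFrame({
    "Review": [
        "The laptop works perfectly fine.",
        "THIS PRODUCT IS JUST AMAZING! I LOVE IT.",
    ],
    "processed": [
        ["laptop", "work", "perfectly", "fine"],
        ["product", "amaze", "love"],
    ],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip each cell is a string, not a list
first_cell = reloaded.loc[0, "processed"]
print(type(first_cell), first_cell)

# ast.literal_eval parses the string representation back into a Python list
reloaded["processed"] = reloaded["processed"].apply(ast.literal_eval)
print(reloaded.loc[0, "processed"])
```

If the token lists are only needed for reading by eye, joining them into a single space-separated string before saving avoids the parsing step altogether.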


## Exercise

In [20]:
import pandas as pd
import re
import emoji
import string
import nltk
from bs4 import BeautifulSoup
from autocorrect import Speller
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import Counter

# ----------------------------
# 0) NLTK setup
# ----------------------------
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("punkt_tab")

spell = Speller(lang="en")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# ----------------------------
# 1) Load dataset (robust encoding)
# ----------------------------
file_path = "UNITENReview.csv"   # <- your file name
encodings_to_try = ["utf-8", "utf-8-sig", "cp1252", "latin1"]

df = None
used_encoding = None
for enc in encodings_to_try:
    try:
        df = pd.read_csv(file_path, encoding=enc)
        used_encoding = enc
        break
    except UnicodeDecodeError:
        pass

if df is None:
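    # Last resort: ignore undecodable characters. latin1 accepts any byte
    # sequence, so this branch is only reached if latin1 is removed from the list.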
    df = pd.read_csv(file_path, encoding="cp1252", encoding_errors="ignore")
    used_encoding = "cp1252 (ignored errors)"

print("Loaded with encoding:", used_encoding)

# ----------------------------
# 2) Identify issues in "Review" column
# ----------------------------
col = "Review"
if col not in df.columns:
    raise ValueError(f'Column "{col}" not found. Available columns: {list(df.columns)}')

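# Treat NaN as empty so missing reviews are counted instead of becoming the string "nan"
review_series = df[col].fillna("").astype(str)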

def has_url(s): return bool(re.search(r"http\S+|www\S+", s))
def has_html(s): return bool(re.search(r"<[^>]+>", s))
def has_emoji(s): return any(ch in emoji.EMOJI_DATA for ch in s)
def has_numbers(s): return bool(re.search(r"\d+", s))
def has_punct(s): return any(ch in string.punctuation for ch in s)
def has_non_ascii(s): return any(ord(ch) > 127 for ch in s)
def has_extra_spaces(s): return bool(re.search(r"\s{2,}", s))

stats = {
    "Total rows": len(review_series),
    "Missing/empty (after str)": int((review_series.str.strip() == "").sum()),
    "Has URL": int(review_series.apply(has_url).sum()),
    "Has HTML tags": int(review_series.apply(has_html).sum()),
    "Has emoji": int(review_series.apply(has_emoji).sum()),
    "Has numbers": int(review_series.apply(has_numbers).sum()),
    "Has punctuation": int(review_series.apply(has_punct).sum()),
    "Has non-ASCII chars (smart quotes etc.)": int(review_series.apply(has_non_ascii).sum()),
    "Has multiple spaces": int(review_series.apply(has_extra_spaces).sum()),
}

print("\n=== Issues Found in Review Column ===")
for k, v in stats.items():
    print(f"- {k}: {v}")

# Show a few examples of each issue as evidence of what was detected
def show_examples(title, mask_func, n=3):
    examples = review_series[review_series.apply(mask_func)].head(n).tolist()
    if examples:
        print(f"\n{title} examples:")
        for i, ex in enumerate(examples, 1):
            print(f"{i}. {ex[:200]}{'...' if len(ex) > 200 else ''}")

show_examples("URL", has_url)
show_examples("HTML", has_html)
show_examples("Emoji", has_emoji)
show_examples("Non-ASCII", has_non_ascii)

# ----------------------------
# 3) Pre-processing pipeline (based on typical issues)
# ----------------------------
slang_dict = {
    "tbh": "to be honest",
    "omg": "oh my god",
    "lol": "laugh out loud",
    "idk": "i do not know",
    "brb": "be right back",
    "btw": "by the way",
    "imo": "in my opinion",
    "smh": "shaking my head",
    "fyi": "for your information",
    "np": "no problem",
    "ikr": "i know right",
    "asap": "as soon as possible",
    "bff": "best friend forever",
    "gg": "good game",
    "hmu": "hit me up",
    "rofl": "rolling on the floor laughing",
}

contractions_dict = {
    "wasn't": "was not",
    "isn't": "is not",
    "aren't": "are not",
    "weren't": "were not",
    "doesn't": "does not",
    "don't": "do not",
    "didn't": "did not",
    "can't": "cannot",
    "couldn't": "could not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "won't": "will not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "i'm": "i am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "i've": "i have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "i'd": "i would",
    "you'd": "you would",
    "he'd": "he would",
    "she'd": "she would",
    "we'd": "we would",
    "they'd": "they would",
    "i'll": "i will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "when's": "when is",
    "why's": "why is",
}

def remove_urls(text):
    return re.sub(r"http\S+|www\S+", "", text)

def remove_html(text):
    return BeautifulSoup(text, "html.parser").get_text()

def remove_emojis(text):
    return emoji.replace_emoji(text, replace="")

# Slang regex compiled once, like the contraction regex below
slang_pattern = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in slang_dict) + r")\b",
    flags=re.IGNORECASE,
)

def replace_slang(text):
    if not isinstance(text, str):
        return ""

    def repl(m):
        w = m.group(0).lower()
        return slang_dict.get(w, w)

    return slang_pattern.sub(repl, text)

# contraction regex compiled once
escaped_contractions = [re.escape(c) for c in contractions_dict.keys()]
contractions_pattern = r"\b(" + "|".join(escaped_contractions) + r")\b"
compiled_contractions = re.compile(contractions_pattern, flags=re.IGNORECASE)

def replace_contractions(text):
    if not isinstance(text, str):
        return ""

    def repl(m):
        w = m.group(0).lower()
        return contractions_dict.get(w, w)

    return compiled_contractions.sub(repl, text)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_numbers(text):
    return re.sub(r"\d+", "", text)

def correct_spelling(text):
    # Note: autocorrect is slow on large datasets; kept because the exercise requires spelling correction.
    return spell(text)

def remove_stopwords(text):
    words = text.split()
    filtered = [w for w in words if w.lower() not in stop_words]
    return " ".join(filtered)

def get_wordnet_pos(nltk_tag):
    if nltk_tag.startswith("J"):
        return wordnet.ADJ
    elif nltk_tag.startswith("V"):
        return wordnet.VERB
    elif nltk_tag.startswith("N"):
        return wordnet.NOUN
    elif nltk_tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatize_text(text):
    if not isinstance(text, str):
        return ""
    words = word_tokenize(text)
    tags = pos_tag(words)
    lemmas = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tags]
    return " ".join(lemmas)

def tokenize_text(text):
    if not isinstance(text, str):
        return []
    return word_tokenize(text)

def preprocess_text(text):
    if not isinstance(text, str):
        return []

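    text = text.lower()
    # Normalize curly apostrophes so contractions such as "I’m" match contractions_dict
    text = text.replace("\u2019", "'").replace("\u2018", "'")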
    text = remove_urls(text)
    text = remove_html(text)
    text = remove_emojis(text)
    text = replace_slang(text)
    text = replace_contractions(text)
    text = remove_punctuation(text)
    text = remove_numbers(text)
    text = correct_spelling(text)
    text = remove_stopwords(text)
    text = lemmatize_text(text)
    tokens = tokenize_text(text)
    return tokens

# Apply preprocessing
df["processed"] = df[col].apply(preprocess_text)

# Optional: store as string instead of list (CSV-friendly)
df["processed_text"] = df["processed"].apply(lambda x: " ".join(x) if isinstance(x, list) else "")

# ----------------------------
# 4) Save result
# ----------------------------
output_path = "UNITENReview_processed.csv"
df.to_csv(output_path, index=False, encoding="utf-8-sig")
print("\nSaved:", output_path)
print("Preview:")
print(df[[col, "processed_text"]].head())


Loaded with encoding: utf-8

=== Issues Found in Review Column ===
- Total rows: 52
- Missing/empty (after str): 0
- Has URL: 0
- Has HTML tags: 0
- Has emoji: 1
- Has numbers: 4
- Has punctuation: 42
- Has non-ASCII chars (smart quotes etc.): 8
- Has multiple spaces: 4

Emoji examples:
1. It’s a great place to study but for me the are lot of things that need to improve?? wifi connection is the priority since there a lot of bad connections, the things that i have the issue since first s...

Non-ASCII examples:
1. I’m having a pretty good time here, happy to meet all of the W people.
2. UNITEN is a solid choice for students interested in engineering, IT, or energy-related fields, offering strong industry connections and modern facilities. However, its specialized focus and location m...
3. As a UNITEN student, I’ve had a great experience so far. Lecturers are knowledgeable and approachable, always willing to help when needed. The campus facilities are well-maintained, including the labs



Saved: UNITENReview_processed.csv
Preview:
                                                                                                                                                                                                                                                                                                                                                         Review  \
0                                                                                                                                                                                                                                                                                                          Im happy with uniten actually, even the people are W   
1                                                                                                                                                                                                                                                     