<h1 align=center>Text Preprocessing</h1>

There are different preprocessing steps depending on:
  - **Language**
      - English Normalization: 
                                lowercasing + stemming
      - Arabic Normalization:
                                 أآإ -> ا + remove diacritics + remove elongation + stemming
  - **Problem Itself**
      - Semantic Classification:
                                ✅ remove stopwords
      - Translation:
                                ❌ remove stopwords

## **Some Preprocessing Steps**


In [61]:
import nltk

### **English**
#### text lowercasing

In [62]:
text = 'Hello, Ahmed'
preprocessed_text = text.lower()

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: Hello, Ahmed
After : hello, ahmed


#### removing newlines and tabs

In [63]:
text = 'Hello, Ahmed\nthis is Wssam\n\tI need your help.'
preprocessed_text = text.replace('\n', '')

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: Hello, Ahmed
this is Wssam
	I need your help.
After : Hello, Ahmedthis is Wssam	I need your help.


In [64]:
text = 'Hello, Ahmed\n this is Wssam\n\tI need your help.'
preprocessed_text = text.replace('\t', '')

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: Hello, Ahmed
 this is Wssam
	I need your help.
After : Hello, Ahmed
 this is Wssam
I need your help.


In [65]:
text = 'Hello, Ahmed\n this is Wssam\n\tI need your help.'
preprocessed_text = text.replace('\n', '').replace('\t', '')

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: Hello, Ahmed
 this is Wssam
	I need your help.
After : Hello, Ahmed this is WssamI need your help.
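
Replacing `\n` and `\t` with an empty string glues neighbouring words together (`Ahmedthis` and `WssamI` in the outputs above). A minimal alternative sketch, assuming every run of whitespace should collapse into a single space; the helper name `normalize_whitespace` is made up for this example:

```python
import re

def normalize_whitespace(text):
    # \s+ matches any run of spaces, tabs, and newlines
    return re.sub(r'\s+', ' ', text).strip()

text = 'Hello, Ahmed\n this is Wssam\n\tI need your help.'
print(f'Before: {text}')
print(f'After : {normalize_whitespace(text)}')
```

The same call also covers the "remove extra whitespaces" step, since repeated spaces are collapsed as well.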


#### removing urls

In [66]:
import re

In [67]:
text = 'I recommend using regex101 website. visit it through: https://regex101.com/'
preprocessed_text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: I recommend using regex101 website. visit it through: https://regex101.com/
After : I recommend using regex101 website. visit it through: 
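
The pattern above uses `.*`, which is greedy: it deletes everything from the URL to the end of the line, including any words that follow the link. A sketch of a narrower pattern, assuming a URL ends at the first whitespace character (`URL_PATTERN` and `remove_urls` are illustrative names):

```python
import re

# A URL is taken to be "http(s)://" or "www." followed by non-whitespace characters
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls(text):
    return URL_PATTERN.sub('', text)

text = 'visit https://regex101.com/ to test your patterns'
print(f'Before: {text}')
print(f'After : {remove_urls(text)}')
```

One trade-off: punctuation that touches the end of a URL (for example a closing full stop) is removed together with it.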


#### removing punctuation

In [68]:
import string

In [69]:
text = "I'm happy, so I'll sleep early."
preprocessed_text = text.translate(str.maketrans('', '', string.punctuation))

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: I'm happy, so I'll sleep early.
After : Im happy so Ill sleep early
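
`string.punctuation` contains only ASCII punctuation, so marks such as the Arabic comma or question mark pass through untouched. A sketch that extends the translation table, assuming the three Arabic marks in `ARABIC_PUNCTUATION` are the ones that matter (the list is illustrative, not exhaustive, and `strip_punctuation` is a made-up name):

```python
import string

# Arabic comma, Arabic semicolon, Arabic question mark
ARABIC_PUNCTUATION = '\u060C\u061B\u061F'
PUNCT_TABLE = str.maketrans('', '', string.punctuation + ARABIC_PUNCTUATION)

def strip_punctuation(text):
    # translate() deletes every character listed in the table
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("I'm happy, so I'll sleep early."))
print(strip_punctuation('مرحبا، كيف حالك؟'))
```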


#### expanding contractions

In [70]:
! pip install contractions

Defaulting to user installation because normal site-packages is not writeable


In [71]:
import contractions

In [72]:
text = "I'm happy, so I'll sleep early."
preprocessed_text = contractions.fix(text)

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: I'm happy, so I'll sleep early.
After : I am happy, so I will sleep early.


> **The order of the steps is very important**

In [73]:
text = "I'm happy, so I'll sleep early."

In [74]:
# expand contractions, then remove punctuation
preprocessed_text = contractions.fix(text)
preprocessed_text = preprocessed_text.translate(str.maketrans('', '', string.punctuation))

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: I'm happy, so I'll sleep early.
After : I am happy so I will sleep early


In [75]:
# remove punctuation, then expand contractions
preprocessed_text = text.translate(str.maketrans('', '', string.punctuation))
preprocessed_text = contractions.fix(preprocessed_text)

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: I'm happy, so I'll sleep early.
After : I Am happy so Ill sleep early


### **Arabic**
#### remove diacritics

In [76]:
import re

In [77]:
arabic_diacritics = re.compile("""
                             ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                            ـ      # Tatwil/Kashida
                         """, re.VERBOSE) # re.VERBOSE lets the pattern contain whitespace and comments, which the regex engine ignores

In [79]:
text = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ'
preprocessed_text = re.sub(arabic_diacritics, '', text)

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
After : الحمد لله رب العالمين
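
An alternative sketch that avoids listing each mark by hand: Python's `unicodedata` module reports the Unicode category of a character, and the diacritics listed above fall in category `Mn` (nonspacing mark). This is a different technique from the regex above and it is broader: it also strips any other nonspacing mark (for example the superscript alef). The tatweel is handled separately because it is not a mark. `strip_diacritics` is an illustrative name.

```python
import unicodedata

TATWEEL = '\u0640'  # elongation character, category Lm rather than Mn

def strip_diacritics(text):
    # Keep every character that is neither a nonspacing mark nor a tatweel
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch) != 'Mn' and ch != TATWEEL
    )

text = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ'
print(f'Before: {text}')
print(f'After : {strip_diacritics(text)}')
```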


#### characters normalization

In [80]:
import re

In [81]:
text = 'أنا الذي نظر الأعمى إلى أدبي'

preprocessed_text = re.sub("[إأآا]", "ا", text)
preprocessed_text = re.sub("ى", "ي", preprocessed_text)
preprocessed_text = re.sub("ؤ", "ء", preprocessed_text)
preprocessed_text = re.sub("ئ", "ء", preprocessed_text)
preprocessed_text = re.sub("ة", "ه", preprocessed_text)
preprocessed_text = re.sub("گ", "ك", preprocessed_text)
preprocessed_text = re.sub("ڤ", "ف", preprocessed_text)
preprocessed_text = re.sub("چ", "ج", preprocessed_text)
preprocessed_text = re.sub("ژ", "ز", preprocessed_text)
preprocessed_text = re.sub("پ", "ب", preprocessed_text)

print(f'Before: {text}')
print(f'After : {preprocessed_text}')

Before: أنا الذي نظر الأعمى إلى أدبي
After : انا الذي نظر الاعمي الي ادبي
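
The ten `re.sub` calls above each scan the whole string. Since every rule maps one character to one character, the same mapping can be sketched as a single translation table. The table below mirrors the substitutions above; the names `ARABIC_CHAR_MAP` and `normalize_arabic` are assumptions for this example, and the characters are written as Unicode escapes only to make each code point explicit.

```python
ARABIC_CHAR_MAP = str.maketrans({
    '\u0625': '\u0627',  # alef with hamza below -> alef
    '\u0623': '\u0627',  # alef with hamza above -> alef
    '\u0622': '\u0627',  # alef with madda       -> alef
    '\u0649': '\u064A',  # alef maksura          -> yeh
    '\u0624': '\u0621',  # waw with hamza        -> hamza
    '\u0626': '\u0621',  # yeh with hamza        -> hamza
    '\u0629': '\u0647',  # teh marbuta           -> heh
    '\u06AF': '\u0643',  # gaf                   -> kaf
    '\u06A4': '\u0641',  # veh                   -> feh
    '\u0686': '\u062C',  # tcheh                 -> jeem
    '\u0698': '\u0632',  # jeh                   -> zain
    '\u067E': '\u0628',  # peh                   -> beh
})

def normalize_arabic(text):
    # One pass over the string; characters not in the table are left as they are
    return text.translate(ARABIC_CHAR_MAP)

text = 'أنا الذي نظر الأعمى إلى أدبي'
print(f'Before: {text}')
print(f'After : {normalize_arabic(text)}')
```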


### **More Steps** (look these up on your own)
  - remove stopwords (watch out for **the order of the steps**)
  - remove usernames & tags
  - remove emojis
  - remove numbers
  - remove text elongation 
  - remove extra whitespaces
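
As one example from the list above, text elongation ("sooooo happy") can be reduced with a backreference. This is a minimal sketch under two assumptions: only letters are collapsed, so numbers like 1000 and runs of punctuation are left alone, and a run counts as elongated when the same letter appears three or more times in a row. The `collapse_elongation` helper and its `keep` parameter are made up for this example.

```python
import re

# [^\W\d_] matches a single letter in any script (a word character that is not a digit or underscore);
# \1{2,} requires at least two more copies of that same letter
ELONGATION = re.compile(r'([^\W\d_])\1{2,}')

def collapse_elongation(text, keep=2):
    # Replace each elongated run with `keep` copies of the repeated letter
    return ELONGATION.sub(lambda m: m.group(1) * keep, text)

print(collapse_elongation('I am sooooo happyyyy'))
print(collapse_elongation('جمييييل', keep=1))
```

Keeping two copies preserves legitimate double letters in English ("cooool" becomes "cool"); for Arabic, `keep=1` may be the better choice.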


In [82]:
# stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [83]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [84]:
print(stopwords.words('arabic'))

['إذ', 'إذا', 'إذما', 'إذن', 'أف', 'أقل', 'أكثر', 'ألا', 'إلا', 'التي', 'الذي', 'الذين', 'اللاتي', 'اللائي', 'اللتان', 'اللتيا', 'اللتين', 'اللذان', 'اللذين', 'اللواتي', 'إلى', 'إليك', 'إليكم', 'إليكما', 'إليكن', 'أم', 'أما', 'أما', 'إما', 'أن', 'إن', 'إنا', 'أنا', 'أنت', 'أنتم', 'أنتما', 'أنتن', 'إنما', 'إنه', 'أنى', 'أنى', 'آه', 'آها', 'أو', 'أولاء', 'أولئك', 'أوه', 'آي', 'أي', 'أيها', 'إي', 'أين', 'أين', 'أينما', 'إيه', 'بخ', 'بس', 'بعد', 'بعض', 'بك', 'بكم', 'بكم', 'بكما', 'بكن', 'بل', 'بلى', 'بما', 'بماذا', 'بمن', 'بنا', 'به', 'بها', 'بهم', 'بهما', 'بهن', 'بي', 'بين', 'بيد', 'تلك', 'تلكم', 'تلكما', 'ته', 'تي', 'تين', 'تينك', 'ثم', 'ثمة', 'حاشا', 'حبذا', 'حتى', 'حيث', 'حيثما', 'حين', 'خلا', 'دون', 'ذا', 'ذات', 'ذاك', 'ذان', 'ذانك', 'ذلك', 'ذلكم', 'ذلكما', 'ذلكن', 'ذه', 'ذو', 'ذوا', 'ذواتا', 'ذواتي', 'ذي', 'ذين', 'ذينك', 'ريث', 'سوف', 'سوى', 'شتان', 'عدا', 'عسى', 'عل', 'على', 'عليك', 'عليه', 'عما', 'عن', 'عند', 'غير', 'فإذا', 'فإن', 'فلا', 'فمن', 'في', 'فيم', 'فيما', 'فيه', 'فيها', '

[English Stopwords](https://gist.github.com/sebleier/554280) <br>
[Arabic Stopwords](https://github.com/mohataher/arabic-stop-words/blob/master/list.txt)
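
One way to see why the order of steps matters for stopword removal: the NLTK English list printed above contains contracted forms such as `don't` with their apostrophes. If punctuation is stripped first, `don't` becomes `dont`, which is no longer in the list. The sketch below uses a tiny hand-picked subset of that list (`STOPWORDS_SUBSET`, an assumption made so the example runs without downloading NLTK data) and plain whitespace splitting instead of a real tokenizer.

```python
import string

STOPWORDS_SUBSET = {'a', 'about', "don't", 'i'}  # small illustrative subset
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def drop_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS_SUBSET]

text = "i don't care about a movie"

# Order A: remove stopwords first, then strip punctuation from what is left
order_a = [t.translate(PUNCT_TABLE) for t in drop_stopwords(text.split())]

# Order B: strip punctuation first, then remove stopwords
order_b = drop_stopwords(text.translate(PUNCT_TABLE).split())

print(order_a)  # ['care', 'movie']
print(order_b)  # ['dont', 'care', 'movie']
```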

<h1 align=center>All Text Preprocessing Steps</h1>

# Dataset: labeled dataset collected from Twitter

- Objective: separate tweets that contain hate speech from other tweets.
    - 0 -> no hate speech
    - 1 -> contains hate speech


# Import libraries

In [2]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK data (only once)

In [8]:
nltk.download('punkt') # or nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Initialize Stopwords, Stemmer, and Lemmatizer

In [None]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Text Preprocessing Functions

# Step 1: Remove mentions, hashtags, URLs, and numbers

In [108]:
def remove_noise(text):
    text = re.sub(r'@\w+', '', text)                   # remove mentions (@user)
    text = re.sub(r'#\w+', '', text)                   # remove hashtags (#word)
    text = re.sub(r'http\S+|www\S+', '', text)         # remove links
    text = re.sub(r'\d+', '', text)                    # remove numbers
    text = re.sub(r'[^\x00-\x7F\u0600-\u06FF]', '', text)  # remove emojis & special chars (keeps ASCII and the Arabic block)
    return text
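
`remove_noise` deletes hashtags entirely, so the word inside the tag is lost as well. Whether that is desirable depends on the task. A sketch of an alternative, assuming the hashtag word itself should be kept as an ordinary token (`strip_hashtag_symbol` is a hypothetical helper, not part of `remove_noise`):

```python
import re

def strip_hashtag_symbol(text):
    # Drop only the '#', keep the word captured in group 1
    return re.sub(r'#(\w+)', r'\1', text)

tweet = 'we won!!! love the land!!! #allin #cavs'
print(strip_hashtag_symbol(tweet))
```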

# Step 2: Remove emojis and non-ASCII characters

In [109]:
def remove_emojis(text):
    return re.sub(r'[^\x00-\x7F]', '', text)
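
`[^\x00-\x7F]` removes every non-ASCII character, which covers emojis but also Arabic letters and accented Latin letters. For text that must keep such letters, a narrower pattern can target emoji code points only. The ranges below are an assumption covering common emoji blocks, not the complete Unicode emoji set, and `remove_emojis_only` is an illustrative name:

```python
import re

EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001FAFF'  # pictographs, emoticons, transport, supplemental symbols
    '\U0001F1E6-\U0001F1FF'  # regional indicator letters (flags)
    '\u2600-\u27BF'          # miscellaneous symbols and dingbats
    '\uFE0F\u200D'           # variation selector and zero-width joiner
    ']+'
)

def remove_emojis_only(text):
    return EMOJI_PATTERN.sub('', text)

print(remove_emojis_only('café مرحبا \U0001F600'))
```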

# Step 3: Lowercasing

In [110]:
def to_lowercase(text):
    return text.lower()

# Step 4: Remove punctuation

In [111]:
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Step 5: Tokenization

In [112]:
def tokenize_text(text):
    return word_tokenize(text)

# Step 6: Remove stopwords

In [113]:
def remove_stopwords(tokens):
    #return [t for t in tokens if t not in stop_words and len(t) > 1]
    filtered_tokens = []
    for t in tokens:
        if t not in stop_words and len(t) > 1:
            filtered_tokens.append(t)
    return filtered_tokens

# Step 7: Stemming

In [114]:
def apply_stemming(tokens):
    #return [stemmer.stem(t) for t in tokens]
    stemmed_tokens = []
    for t in tokens:
        stemmed_word = stemmer.stem(t)
        stemmed_tokens.append(stemmed_word)
    return stemmed_tokens

# Step 8: Lemmatization

In [115]:
def apply_lemmatization(tokens):
    #return [lemmatizer.lemmatize(t) for t in tokens]
    lemmatized_tokens = []
    for t in tokens:
        lemma = lemmatizer.lemmatize(t)
        lemmatized_tokens.append(lemma)
    return lemmatized_tokens

# Main Preprocessing Function (pipeline)

In [116]:
def preprocess_text(text):
    if not isinstance(text, str):
        return ""

    text = remove_noise(text)
    text = remove_emojis(text)
    text = to_lowercase(text)
    text = remove_punctuation(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = apply_stemming(tokens)
    tokens = apply_lemmatization(tokens)
    
    return " ".join(tokens)
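
`preprocess_text` hard-codes the order of the steps and applies both the stemmer and the lemmatizer to the same tokens; in practice one of the two is often chosen, since lemmatizing already-stemmed tokens tends to change little. A sketch of a configurable pipeline, where the steps are passed in as lists so they can be reordered or swapped. The `build_pipeline` helper and the toy `drop_short` step are assumptions for illustration; in this notebook the real steps would be functions such as `remove_noise`, `apply_stemming`, or `apply_lemmatization`.

```python
def build_pipeline(text_steps, token_steps, tokenizer=str.split):
    """Return a function that runs text-level steps, tokenizes, then runs token-level steps."""
    def run(text):
        if not isinstance(text, str):
            return ""
        for step in text_steps:
            text = step(text)
        tokens = tokenizer(text)
        for step in token_steps:
            tokens = step(tokens)
        return " ".join(tokens)
    return run

# Toy token-level step standing in for the notebook's functions
def drop_short(tokens):
    return [t for t in tokens if len(t) > 1]

clean = build_pipeline(text_steps=[str.lower, str.strip], token_steps=[drop_short])
print(clean("  A Father AND his Kids  "))
```

Swapping the stemmer for the lemmatizer then means changing one entry in `token_steps` rather than editing the pipeline function.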

# Step 1: Load Dataset

In [3]:
df = pd.read_csv('dataset.csv') # create a DataFrame 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [117]:
df.head(10)

   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation
5   6      0  [2/2] huge fan fare and big talking before the...
6   7      0  @user camping tomorrow @user @user @user @use...
7   8      0  the next school year is the year for exams.ð...
8   9      0  we won!!! love the land!!! #allin #cavs #champ...
9  10      0  @user @user welcome here ! i'm it's so #gr...


In [4]:
# check NaNs: sum() counts the missing values per column (count() would only return the number of rows)
df.isnull().sum()

id       0
label    0
tweet    0
dtype: int64

In [6]:
# check duplicates
df['tweet'].duplicated().sum()

2432

In [7]:
# drop duplicate tweets (keeps the first occurrence of each)
df.drop_duplicates(subset=['tweet'], inplace=True)
df.head(10)

   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation
5   6      0  [2/2] huge fan fare and big talking before the...
6   7      0  @user camping tomorrow @user @user @user @use...
7   8      0  the next school year is the year for exams.ð...
8   9      0  we won!!! love the land!!! #allin #cavs #champ...
9  10      0  @user @user welcome here ! i'm it's so #gr...


In [92]:
# show samples of the tweets to find out which preprocessing steps are required
#df.head(10)
print(df['tweet'].sample(10))

17488    i may call her my wife for 3 years allready!  ...
18250      gave my dad his #fathersdaygift early #lovei...
17626     factsguide: the magic realism by rob gonsalve...
9950     wow    just passed 70,000 views #grateful   #t...
11919                   i have a bone to pick with y'all  
26444    r.i.p literally, you will be missed and the ma...
23823    omg finally after 2 months of waiting im watch...
26915    scream 2Ã04 promo âhappy bÃ­hday to meâ (...
27764     creating something that brings happiness to o...
11851    rip anton yelchin :( such a great young actor....
Name: tweet, dtype: object


# Step 2: Apply preprocessing on 'tweet' column

In [118]:
df['cleaned_tweet'] = df['tweet'].apply(preprocess_text)

# Step 3: Save cleaned data

In [119]:
df.to_csv("cleaned_dataset.csv", index=False, encoding='utf-8-sig')
print(" Cleaned dataset saved as 'cleaned_dataset.csv'")

 Cleaned dataset saved as 'cleaned_dataset.csv'


# Step 4: Show sample results

In [120]:
print("\n Sample before and after preprocessing:\n")
for i in range(10):
    # positional access: after drop_duplicates the index labels may have gaps
    print("Original:", df['tweet'].iloc[i])
    print("Cleaned :", df['cleaned_tweet'].iloc[i])
    print("-" * 80)


 Sample before and after preprocessing:

Original:  @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
Cleaned : father dysfunct selfish drag kid dysfunct
--------------------------------------------------------------------------------
Original: @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked
Cleaned : thank credit cant use caus dont offer wheelchair van pdx
--------------------------------------------------------------------------------
Original:   bihday your majesty
Cleaned : bihday majesti
--------------------------------------------------------------------------------
Original: #model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦  
Cleaned : love take time ur
--------------------------------------------------------------------------------
Original:  factsguide: society now    #motivation
Cleaned : factsguid societi
---------

<h1 align=center>Exercise: Text Preprocessing for Product Reviews Analysis</h1>

# Objective:
- The goal of this task is to build a complete text preprocessing pipeline for cleaning and preparing customer product reviews before using them in sentiment analysis or other NLP models.

- You will apply a series of text-cleaning and normalization techniques to ensure that the raw text data becomes structured, uniform, and ready for further processing or model training.

# Background:
- In Natural Language Processing (NLP), raw text often contains noise, such as punctuation, emojis, URLs, numbers, or common stopwords that do not contribute to the semantic meaning.

# Exercise Steps
- In this project, you will write a complete script that cleans product review data from a CSV file.
- Find a product_reviews.csv file or create one:
    - The dataset contains customer reviews of various products, along with their corresponding ratings.
        - **Columns:**
            - **review_id:** Unique identifier for each review
            - **rating:** Numerical rating from 1 to 5
            - **review_text:** The text of the customer review (raw data)
        - Use pandas to read the dataset from product_reviews.csv.
        - Verify that the column names are correct and that there are no missing values in review_text.
- Apply all stages of text preprocessing to it (cleaning, tokenization, normalization, stopword removal, stemming, and lemmatization).
- Then save the cleaned data in a new file for use in sentiment analysis or any other NLP task.