[English Stopwords](https://gist.github.com/sebleier/554280) <br>
[Arabic Stopwords](https://github.com/mohataher/arabic-stop-words/blob/master/list.txt)

<h1 align=center>All Text Preprocessing Steps</h1>

# Dataset: labeled dataset collected from Twitter

- Objective: classify tweets, separating those that contain hate speech from those that do not.
    - 0 -> no hate speech
    - 1 -> contains hate speech


# Import libraries

In [24]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK data (only once)

In [25]:
nltk.download('punkt') # or nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\nlp_offline\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Initialize Data

In [26]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Text Preprocessing Functions

# Step 1: Remove mentions, hashtags, URLs, and numbers

In [27]:
def remove_noise(text):
    text = re.sub(r'@\w+', '', text)                   # remove mentions (@user)
    text = re.sub(r'#\w+', '', text)                   # remove hashtags (#word)
    text = re.sub(r'http\S+|www\S+', '', text)         # remove links
    text = re.sub(r'\d+', '', text)                    # remove numbers
    text = re.sub(r'[^\x00-\x7F\u0600-\u06FF]', '', text)  # remove emojis & other chars outside ASCII and the Arabic block
    return text
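
To see what Step 1 does in practice, the sketch below repeats `remove_noise` as defined above and runs it on a made-up tweet (the sample string is an assumption, not taken from the dataset). The removals leave runs of whitespace behind; these disappear later, at tokenization.

```python
import re

def remove_noise(text):
    text = re.sub(r'@\w+', '', text)                       # mentions
    text = re.sub(r'#\w+', '', text)                       # hashtags
    text = re.sub(r'http\S+|www\S+', '', text)             # links
    text = re.sub(r'\d+', '', text)                        # numbers
    text = re.sub(r'[^\x00-\x7F\u0600-\u06FF]', '', text)  # emojis etc.
    return text

# Hypothetical input; \U0001F600 is a smiley emoji.
sample = "@user check https://example.com now #news 2024 \U0001F600"
noise_free = remove_noise(sample)

print(repr(noise_free))    # leftover spaces are still present here
print(noise_free.split())  # ['check', 'now']
```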

# Step 2: Remove emojis and non-ASCII characters

In [28]:
def remove_emojis(text):
    return re.sub(r'[^\x00-\x7F]', '', text)
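
This notebook also links to an Arabic stopword list, so it is worth seeing how the two filters differ: the pattern in `remove_noise` keeps the Arabic block (U+0600 to U+06FF), while the ASCII-only pattern in `remove_emojis` deletes it. A minimal sketch with a made-up mixed string; `remove_emojis_keep_arabic` is a hypothetical variant, not part of the pipeline above.

```python
import re

def remove_emojis(text):
    # Same pattern as Step 2: keep ASCII only.
    return re.sub(r'[^\x00-\x7F]', '', text)

def remove_emojis_keep_arabic(text):
    # Hypothetical variant: keep ASCII plus the Arabic Unicode block.
    return re.sub(r'[^\x00-\x7F\u0600-\u06FF]', '', text)

arabic_word = "\u0645\u0631\u062d\u0628\u0627"   # an Arabic greeting
mixed = "hello " + arabic_word + " \U0001F600"   # English + Arabic + emoji

ascii_only = remove_emojis(mixed)
with_arabic = remove_emojis_keep_arabic(mixed)

print(repr(ascii_only))   # Arabic letters and emoji are both gone
print(repr(with_arabic))  # only the emoji is gone
```

For English-only tweets the ASCII filter is sufficient; for Arabic text the wider character class would be the one to keep.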

# Step 3: Lowercasing

In [29]:
def to_lowercase(text):
    return text.lower()

# Step 4: Remove punctuation

In [30]:
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))
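
Because `string.punctuation` includes the apostrophe, contractions collapse into a single word instead of being split, which is why the sample results further down contain `cant` and `dont`. A small check, using a phrase adapted from the second tweet in the dataset preview:

```python
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

phrase = "i can't use cause they don't offer wheelchair vans in pdx."
no_punct = remove_punctuation(phrase)
print(no_punct)  # i cant use cause they dont offer wheelchair vans in pdx
```

As the sample results show, these collapsed forms are not caught by the stopword filter in Step 6.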

# Step 5: Tokenization

In [31]:
def tokenize_text(text):
    return word_tokenize(text)

# Step 6: Remove stopwords

In [32]:
def remove_stopwords(tokens):
    #return [t for t in tokens if t not in stop_words and len(t) > 1]
    filtered_tokens = []
    for t in tokens:
        if t not in stop_words and len(t) > 1:
            filtered_tokens.append(t)
    return filtered_tokens
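
The filter applies two conditions: the token must not be in `stop_words`, and it must be at least two characters long, which drops stray single letters such as `u`. The sketch below uses a tiny hand-written stopword set (an assumption, standing in for NLTK's English list so the snippet runs without any downloads).

```python
# Hypothetical subset of stopwords, for illustration only.
stop_words = {"i", "with", "all", "the", "in"}

def remove_stopwords(tokens):
    # Keep tokens that are not stopwords and are longer than one character.
    return [t for t in tokens if t not in stop_words and len(t) > 1]

tokens = ["i", "love", "u", "take", "with", "u", "all", "the", "time", "in", "ur"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['love', 'take', 'time', 'ur']
```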

# Step 7: Stemming

In [33]:
def apply_stemming(tokens):
    #return [stemmer.stem(t) for t in tokens]
    stemmed_tokens = []
    for t in tokens:
        stemmed_word = stemmer.stem(t)
        stemmed_tokens.append(stemmed_word)
    return stemmed_tokens

# Step 8: Lemmatization

In [34]:
def apply_lemmatization(tokens):
    #return [lemmatizer.lemmatize(t) for t in tokens]
    lemmatized_tokens = []
    for t in tokens:
        lemma = lemmatizer.lemmatize(t)
        lemmatized_tokens.append(lemma)
    return lemmatized_tokens

# Main Preprocessing Function (pipeline)

In [35]:
def preprocess_text(text):
    if not isinstance(text, str):
        return ""

    text = remove_noise(text)
    text = remove_emojis(text)
    text = to_lowercase(text)
    text = remove_punctuation(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = apply_stemming(tokens)
    tokens = apply_lemmatization(tokens)
    
    return " ".join(tokens)

# Step 1: Load Dataset

In [36]:
df = pd.read_csv('dataset.csv') # create a DataFrame 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [37]:
df.head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


In [38]:
# check NaNs
df.isnull().sum()

id       0
label    0
tweet    0
dtype: int64
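
When reading this check, keep in mind that `count()` and `sum()` answer different questions on the boolean frame returned by `isnull()`: `count()` reports how many entries each column has, while `sum()` reports how many of them are `True`, that is, missing. A toy frame (made-up values, not the tweet dataset) shows the difference:

```python
import pandas as pd

# Toy data with one missing tweet.
toy = pd.DataFrame({"label": [0, 1, 0], "tweet": ["good day", None, "hello"]})

entries_per_column = toy.isnull().count()  # size of each column
missing_per_column = toy.isnull().sum()    # number of missing values

print(entries_per_column["tweet"])  # 3
print(missing_per_column["tweet"])  # 1
```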

In [39]:
# check duplicates
df['tweet'].duplicated().sum()

2432

In [40]:
# drop duplicates
df.drop_duplicates(subset=['tweet'], inplace=True)
df.head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...
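
`drop_duplicates` keeps the first occurrence of each tweet and leaves the surviving rows with their original index labels, so the index can contain gaps afterwards. Positional access with `.iloc`, or renumbering with `reset_index(drop=True)`, avoids surprises when looping over the first few rows. A toy illustration with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({"tweet": ["a", "a", "b", "c"]})
deduped = toy.drop_duplicates(subset=["tweet"])

print(list(deduped.index))       # [0, 2, 3]  (label 1 was dropped)
print(deduped["tweet"].iloc[1])  # b  (second remaining row, by position)

renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1, 2]
```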


In [41]:
# show samples of the tweet text to decide which preprocessing steps are needed
#df.head(10)
print(df['tweet'].sample(10))

21605    gonna spend time with the kids an mom.w  well ...
16637    #dailyaffirmation i do not #exist to #impress ...
17389     @user @user i'm stepping down as a @user boar...
22859    sorria ðâº  #boooomdia #quaafeira #atualiz...
14726    siri reveals what apple will announce at wwdc,...
25173    i would never in my life allow my future kids ...
20002     @user leaving day at ih belfast!    #ihbelfas...
27890             dad #fucks #small #teen #babysitter mp4 
3976        finally cleared my storage!! #glad #finally   
10614    angry owls #owls #aworks #abstracta #colorful ...
Name: tweet, dtype: object


# Step 2: Apply preprocessing on 'tweet' column

In [42]:
df['cleaned_tweet'] = df['tweet'].apply(preprocess_text)

# Step 3: Save cleaned data

In [43]:
df.to_csv("cleaned_dataset.csv", index=False, encoding='utf-8-sig')
print(" Cleaned dataset saved as 'cleaned_dataset.csv'")

 Cleaned dataset saved as 'cleaned_dataset.csv'


# Step 4: Show sample results

In [44]:
print("\n Sample before and after preprocessing:\n")
for i in range(10):
    # .iloc is positional, so this does not depend on index labels surviving drop_duplicates
    print("Original:", df['tweet'].iloc[i])
    print("Cleaned :", df['cleaned_tweet'].iloc[i])
    print("-" * 80)


 Sample before and after preprocessing:

Original:  @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run
Cleaned : father dysfunct selfish drag kid dysfunct
--------------------------------------------------------------------------------
Original: @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked
Cleaned : thank credit cant use caus dont offer wheelchair van pdx
--------------------------------------------------------------------------------
Original:   bihday your majesty
Cleaned : bihday majesti
--------------------------------------------------------------------------------
Original: #model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦  
Cleaned : love take time ur
--------------------------------------------------------------------------------
Original:  factsguide: society now    #motivation
Cleaned : factsguid societi
---------

<h1 align=center>Exercise: Text Preprocessing for Product Reviews Analysis</h1>

# Objective:
- The goal of this task is to build a complete text preprocessing pipeline for cleaning and preparing customer product reviews before using them in sentiment analysis or other NLP models.

- You will apply a series of text-cleaning and normalization techniques to ensure that the raw text data becomes structured, uniform, and ready for further processing or model training.

# Background:
- In Natural Language Processing (NLP), raw text often contains noise, such as punctuation, emojis, URLs, numbers, or common stopwords that do not contribute to the semantic meaning.

# Exercise Steps
- In this project, you will write complete code to clean product review data from a CSV file.
- Find a product_reviews.csv file or create your own:
    - The dataset contains customer reviews of various products, along with their corresponding ratings.
        - **Column name: description**
            - **review_id:** Unique identifier for each review
            - **rating:** Numerical rating from 1 to 5
            - **review_text:** The text of the customer review (raw data)
        - Use pandas to read the dataset from product_reviews.csv.
        - Verify that the column names are correct and that there are no missing values in review_text.
- Apply all stages of text preprocessing to it (cleaning, tokenization, normalization, stopword removal, stemming, and lemmatization).
- Then save the cleaned data in a new file for use in sentiment analysis or any other NLP task.
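
As a starting point for the loading and verification step, here is a minimal sketch. It reads from an in-memory CSV with two invented reviews so that it runs on its own; in the exercise you would pass the path `product_reviews.csv` to `pd.read_csv` instead. The helper name `verify_reviews` and the constant `EXPECTED_COLUMNS` are hypothetical.

```python
import io
import pandas as pd

EXPECTED_COLUMNS = ["review_id", "rating", "review_text"]

def verify_reviews(df):
    """Return (missing column names, count of missing review_text values)."""
    missing_columns = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if "review_text" in df.columns:
        n_missing_text = int(df["review_text"].isnull().sum())
    else:
        n_missing_text = None
    return missing_columns, n_missing_text

# Invented sample data; the second review has an empty text field.
csv_text = "review_id,rating,review_text\n1,5,Great product!\n2,2,\n"
reviews = pd.read_csv(io.StringIO(csv_text))

missing_columns, n_missing_text = verify_reviews(reviews)
print("Missing columns:", missing_columns)
print("Rows without review_text:", n_missing_text)
```

Rows with missing text can then be dropped or passed through the pipeline, where a non-string guard like the one in `preprocess_text` turns them into empty strings.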