### Data Cleaning

The cleaning phase removes elements that carry little meaning for the model, such as inconsistent casing, stray punctuation, markup, and very common words. The result is cleaner, more consistent text, and the reduced noise generally makes later NLP steps more effective.


---


- Lowercasing:


In [2]:
uppercase_string = "ThIs iS LowErCasE!"
lowercase_string = uppercase_string.lower()
print(lowercase_string)

this is lowercase!
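
For text that is not plain ASCII, `str.casefold()` is a more aggressive alternative to `str.lower()` when the goal is case-insensitive comparison. The snippet below is a small sketch of the difference; the German sample word is an illustration and not part of the original example.

```python
# str.lower() keeps the German sharp s, while str.casefold() expands it to "ss".
sample = "Straße"  # illustrative word, not from the original notebook

lowered = sample.lower()
folded = sample.casefold()

print(lowered)
print(folded)

# For plain ASCII text the two methods give the same result.
ascii_sample = "ThIs iS LowErCasE!"
print(ascii_sample.lower() == ascii_sample.casefold())
```

`lower()` is usually enough for English text; `casefold()` mainly matters when strings from different sources have to be compared or deduplicated.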


---


- Removing Special Characters and Numbers:


In [3]:
import re

# Remove special characters (keep letters, digits, and whitespace)
special_text = "special, characters! 123456"
non_special_text = re.sub(r"[^A-Za-z0-9\s]", "", special_text)
print(non_special_text)

# Remove special characters and numbers (keep only letters and whitespace)
number_text = "special, characters! 123456"
non_number_text = re.sub(r"[^A-Za-z\s]", "", number_text)
print(non_number_text)

special characters 123456
special characters 
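
The `[^A-Za-z\s]` pattern above treats every non-ASCII letter as a special character, so it would also strip letters such as `ö` or `ç` from the Turkish examples later in this section. A minimal sketch of a Unicode-aware variant follows; the helper name `strip_non_letters` is hypothetical and the sample string is an assumption for illustration.

```python
import re

# In Python 3, \w matches Unicode letters, digits, and underscore for str patterns.
# Remove anything that is neither a word character nor whitespace,
# then also remove digits and underscores, leaving letters and whitespace.
NON_LETTER = re.compile(r"[^\w\s]|[\d_]")


def strip_non_letters(text: str) -> str:
    """Keep Unicode letters and whitespace, drop everything else."""
    return NON_LETTER.sub("", text)


sample = "örnek, metin! 123"  # illustrative Turkish sample
ascii_only = re.sub(r"[^A-Za-z\s]", "", sample)  # the ASCII pattern loses "ö"
unicode_aware = strip_non_letters(sample)

print(repr(ascii_only))
print(repr(unicode_aware))
```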


---


- Removing Stop Words:


In [4]:
stop_words = [
    "and",
    "the",
    "is",
    "in",
    "to",
    "of",
    "it",
    "that",
    "a",
    "an",
    "this",
    "from",
]


stopword_text = "This is an example of removing stop words from a text document."


filtered_words = [
    word for word in stopword_text.split() if word.lower() not in stop_words
]


print(filtered_words)

['example', 'removing', 'stop', 'words', 'text', 'document.']
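
Note that the output above still contains `'document.'` with its trailing period, because `split()` only separates on whitespace. For the same reason a stop word followed by punctuation (for example `"it."`) would not be filtered. The sketch below strips punctuation from each token before the lookup; the function name `remove_stop_words` is hypothetical, and a `set` is used instead of a list only because membership tests on a set are faster.

```python
import string

stop_words = {
    "and", "the", "is", "in", "to", "of",
    "it", "that", "a", "an", "this", "from",
}


def remove_stop_words(text, stop_words):
    """Split on whitespace, trim surrounding punctuation, drop stop words."""
    kept = []
    for token in text.split():
        # strip() only removes punctuation at the ends of the token
        word = token.strip(string.punctuation).lower()
        if word and word not in stop_words:
            kept.append(word)
    return kept


stopword_text = "This is an example of removing stop words from a text document."
filtered_words = remove_stop_words(stopword_text, stop_words)
print(filtered_words)
```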


In [5]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

# stop word list
stop_words = set(stopwords.words("english"))

stopword_text = "This is an example of removing stop words from a text document."
filtered_words = [
    word for word in stopword_text.split() if word.lower() not in stop_words
]
print(filtered_words)

['example', 'removing', 'stop', 'words', 'text', 'document.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iscie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
stop_words = ["ve", "bu", "ile", "için", "ama"]

stopword_text = "Bu bir örnek ve örneği denemek için çalıştırıldı."
filtered_words = [
    word for word in stopword_text.split() if word.lower() not in stop_words
]
print(filtered_words)

['bir', 'örnek', 'örneği', 'denemek', 'çalıştırıldı.']


In [7]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

# stop word list
stop_words = set(stopwords.words("turkish"))

stopword_text = "Bu bir örnek ve örneği denemek için çalıştırıldı."
filtered_words = [
    word for word in stopword_text.split() if word.lower() not in stop_words
]
print(filtered_words)

['bir', 'örnek', 'örneği', 'denemek', 'çalıştırıldı.']
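
One caveat for Turkish text: Python's `str.lower()` is not locale-aware. It maps `I` to `i` rather than to the dotless `ı`, and it maps `İ` to `i` followed by a combining dot, so an uppercase stop word such as `İÇİN` would not match `için`. The sketch below shows one possible workaround; `turkish_lower` is a hypothetical helper and the uppercase sentence is an illustrative assumption.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i rules."""
    # Handle the two special cases first, then lowercase the rest normally.
    return text.replace("İ", "i").replace("I", "ı").lower()


stop_words = {"ve", "bu", "ile", "için", "ama"}

shouting_text = "BU BİR ÖRNEK İÇİN"  # illustrative uppercase input
naive = [w for w in shouting_text.split() if w.lower() not in stop_words]
aware = [
    turkish_lower(w)
    for w in shouting_text.split()
    if turkish_lower(w) not in stop_words
]

print(naive)  # "İÇİN" survives because its lowercase form contains a combining dot
print(aware)
```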




---


- Whitespace Removal:


In [8]:
whitespaced_text = "    Whitespace    is       bad       "
non_whitespaced_text = " ".join(whitespaced_text.split())
print(non_whitespaced_text)

Whitespace is bad


---


- Punctuation Removal:


In [9]:
import string

punctuated_text = ".punctuated, text!!!!."
non_punctuated_text = punctuated_text.translate(
    str.maketrans("", "", string.punctuation)
)
print(non_punctuated_text)

punctuated text
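
`string.punctuation` only contains ASCII punctuation, so characters such as curly quotes or the ellipsis character pass through the `translate` call above unchanged. A minimal sketch of a Unicode-aware alternative, based on the standard `unicodedata` module, is shown below. The helper name `strip_unicode_punctuation` is hypothetical. Note one difference in scope: symbols such as `$` and `+` belong to Unicode symbol categories rather than punctuation, so this version keeps them even though `string.punctuation` lists them.

```python
import string
import unicodedata


def strip_unicode_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


sample = "“quoted” text…"  # illustrative input with non-ASCII punctuation
ascii_stripped = sample.translate(str.maketrans("", "", string.punctuation))
unicode_stripped = strip_unicode_punctuation(sample)

print(ascii_stripped)    # unchanged: none of these characters are ASCII punctuation
print(unicode_stripped)
```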


---


- Spelling Correction:


In [10]:
from textblob import TextBlob

non_correct_text = "corrrest text"
correct_text = str(TextBlob(non_correct_text).correct())
print(correct_text)

correct text


---


- Removing HTML Tags and URLs:


In [11]:
from bs4 import BeautifulSoup

html_url_text = "<a href='www.nlptest.com'>Go to link</a>"
non_html_url_text = BeautifulSoup(html_url_text, "html.parser").get_text()
print(non_html_url_text)

Go to link
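
In the example above the URL disappears only because it sits inside the `href` attribute of the tag. A URL written in plain text would survive `get_text()`. The sketch below removes such URLs with a regular expression; the pattern and the `remove_urls` helper are simplifying assumptions (they match anything starting with `http://`, `https://`, or `www.` up to the next whitespace) and not a complete URL grammar.

```python
import re

# Simplified assumption: a URL starts with a scheme or "www." and runs to the next space.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Delete URL-like substrings and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


url_text = "Visit https://www.nlptest.com/docs or www.nlptest.com today"
print(remove_urls(url_text))
```

Because the pattern runs to the next whitespace, punctuation that directly follows a URL (such as a closing period) is removed together with it.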


---


- All in one:


In [12]:
import re
import string

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from textblob import TextBlob

nltk.download("stopwords")

# Stop word list
stop_words = set(stopwords.words("english"))

text = "<p>ThIs iS LowErCasE! special, characters! 123456 This is an example of removing stop words from a text document.    Whitespace    is       bad       .punctuated, text!!!!. corrrest text </p><a href='www.nlptest.com'>Go to link</a>"
print(f"{'Before steps':<40}: {text}")

# Removing HTML Tags and URLs
text = BeautifulSoup(text, "html.parser").get_text()
print(f"\n{'Removing HTML Tags and URLs':<40}: {text}")

# Lowercasing
text = text.lower()
print(f"{'Lowercasing':<40}: {text}")

# Punctuation Removal
text = text.translate(str.maketrans("", "", string.punctuation))
print(f"{'Punctuation Removal':<40}: {text}")

# Removing Special Characters and Numbers
text = re.sub(r"[^A-Za-z\s]", "", text)
print(f"{'Removing Special Characters and Numbers':<40}: {text}")

# Whitespace Removal
text = " ".join(text.split())
print(f"{'Whitespace Removal':<40}: {text}")

# Removing Stop Words
text = " ".join([word for word in text.split() if word not in stop_words])
print(f"{'Removing Stop Words':<40}: {text}")

# Spelling Correction
text = str(TextBlob(text).correct())
print(f"{'Spelling Correction':<40}: {text}")

print(f"\n{'After steps':<40}: {text}")

Before steps                            : <p>ThIs iS LowErCasE! special, characters! 123456 This is an example of removing stop words from a text document.    Whitespace    is       bad       .punctuated, text!!!!. corrrest text </p><a href='www.nlptest.com'>Go to link</a>

Removing HTML Tags and URLs             : ThIs iS LowErCasE! special, characters! 123456 This is an example of removing stop words from a text document.    Whitespace    is       bad       .punctuated, text!!!!. corrrest text Go to link
Lowercasing                             : this is lowercase! special, characters! 123456 this is an example of removing stop words from a text document.    whitespace    is       bad       .punctuated, text!!!!. corrrest text go to link
Punctuation Removal                     : this is lowercase special characters 123456 this is an example of removing stop words from a text document    whitespace    is       bad       punctuated text corrrest text go to link
Removing Special Characte



Spelling Correction                     : lowercase special characters example removing stop words text document whitespace bad punctured text correct text go link

After steps                             : lowercase special characters example removing stop words text document whitespace bad punctured text correct text go link
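
The order of the steps in the combined cell matters. HTML removal runs first because punctuation removal would otherwise delete the `<`, `>`, and `/` characters and leave the tag names fused to the text. The sketch below illustrates this with the standard library only; the regular expression `<[^>]+>` is a rough stand-in for the `BeautifulSoup` call used above, not a replacement for a real HTML parser.

```python
import re
import string

TAG_PATTERN = re.compile(r"<[^>]+>")  # rough stand-in for an HTML parser
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

raw = "<p>Hello, world!</p>"  # illustrative input

# Tags first, punctuation second: the markup is removed cleanly.
tags_first = TAG_PATTERN.sub("", raw).translate(PUNCT_TABLE)

# Punctuation first: the angle brackets are gone, so the tag name "p" leaks into the text.
punct_first = TAG_PATTERN.sub("", raw.translate(PUNCT_TABLE))

print(tags_first)
print(punct_first)
```

The same reasoning applies to lowercasing before stop word removal, since the stop word list contains lowercase entries only.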
