## Basic code for text cleaning

### Lowercasing
Converting all text to lowercase to ensure uniformity in the text data, making it easier to compare and process.

In [2]:
text = "I Am a Machine Learning Engineer."
text = text.lower()
text

'i am a machine learning engineer.'

### Tokenization

Splitting the text into individual words or tokens. Most later steps, such as stopword removal and stemming, operate on tokens rather than on raw strings.

In [6]:
import nltk
# Tokenizer models; newer NLTK releases may also require nltk.download('punkt_tab')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "i am a machine learning engineer."
tokens = word_tokenize(text)
tokens

[nltk_data] Downloading package punkt to /home/sujan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['i', 'am', 'a', 'machine', 'learning', 'engineer', '.']

### Removing special characters

Eliminating punctuation marks, symbols, or other non-alphanumeric characters that do not provide meaningful information.

In [9]:
import re

text = "Sujan@#$*"
text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
text

'Sujan'
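
The pattern above keeps only ASCII letters and digits, so accented or non-Latin letters are removed as well. The sketch below contrasts it with a Unicode-aware pattern; the sample string is made up for illustration, and whether accented letters should survive depends on the dataset.

```python
import re

text = "Café déjà vu #1!"

# ASCII-only character class: accented letters are stripped too.
ascii_only = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# For str patterns in Python 3, \w is Unicode-aware (and also matches '_'),
# so only punctuation and symbols are removed here.
unicode_aware = re.sub(r"[^\w\s]", "", text)

print(ascii_only)
print(unicode_aware)
```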

### Stopword removal

Removing common words (e.g., "the," "and," "in") that occur frequently in the text but typically do not carry much meaning.

In [11]:
# Requires the stopword corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords

text = "i am a machine learning engineer"
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [word for word in words if word.lower() not in stop_words]
filtered_words

['machine', 'learning', 'engineer']

### Stemming or Lemmatization

Reducing words to their base or root form. Stemming strips suffixes with heuristic rules and may produce a non-word (for example, "studies" becomes "studi"), while lemmatization uses a vocabulary to return the dictionary form ("study").

In [13]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
stemmed_word


'run'
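
The cell above covers stemming only. For contrast, here is a minimal sketch of lemmatization by dictionary lookup. The `LEMMA_TABLE` mapping and the `lemmatize` helper are hypothetical and hand-written for illustration; in practice a library lemmatizer such as NLTK's `WordNetLemmatizer` (which needs the WordNet corpus downloaded) would supply the vocabulary.

```python
# Hypothetical, hand-written lookup table standing in for a real vocabulary.
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "better": "good",
    "studies": "study",
    "mice": "mouse",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

words = ["running", "ran", "better", "studies", "mice", "engineer"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)
```

Irregular forms such as "ran" or "mice" cannot be recovered by stripping suffixes, which is why lemmatizers rely on a vocabulary.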

### Removing HTML tags and formatting

If the text data comes from web sources, it may contain HTML tags, which need to be removed.

In [17]:
from bs4 import BeautifulSoup

text = "<p>i am a machine <b>learning</b> engineer.</p>"
soup = BeautifulSoup(text, 'html.parser')
cleaned_text = soup.get_text()
cleaned_text

'i am a machine learning engineer.'

### Handling missing data

Dealing with missing values or placeholders, which may occur in some datasets.

In [19]:
import re

text = "Data is missing: NA"
# Match "NA" only as a whole word so that words such as "NASA" are left intact
text = re.sub(r'\bNA\b', '', text)
text

'Data is missing: '
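
In a dataset, missing values usually show up as whole entries rather than as substrings. The following is a minimal sketch that filters such entries out; the placeholder spellings in `MISSING_MARKERS` and the `is_missing` helper are assumptions for illustration and should be adapted to the data at hand.

```python
# Assumed placeholder spellings; adjust this set to match the dataset.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}

def is_missing(value):
    """Return True if the entry is None or a known placeholder string."""
    return value is None or value.strip().lower() in MISSING_MARKERS

records = ["i am a machine learning engineer.", "NA", None, "  ", "N/A"]

# Keep only the entries that carry actual text.
clean_records = [r for r in records if not is_missing(r)]
print(clean_records)
```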

### Normalization

Ensuring consistency in representations of dates, numbers, and other structured information.

In [22]:
from datetime import datetime

date_str = "07/01/2023"  # month/day/year
date_obj = datetime.strptime(date_str, "%m/%d/%Y")
normalized_date = date_obj.strftime("%Y-%m-%d")  # ISO 8601 date
normalized_date

'2023-07-01'
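
Numbers can be normalized in the same spirit as dates. The sketch below assumes US-style formatting (comma as thousands separator, dot as decimal point); `normalize_number` is a hypothetical helper, not part of any library.

```python
def normalize_number(number_str):
    """Convert a US-formatted number string such as '1,234.50' to a float."""
    # Drop surrounding whitespace and thousands separators before parsing.
    cleaned = number_str.strip().replace(",", "")
    return float(cleaned)

raw_numbers = ["1,234.50", " 42 ", "1,000,000"]
normalized_numbers = [normalize_number(n) for n in raw_numbers]
print(normalized_numbers)
```

Data that uses a comma as the decimal separator would need different handling.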

### Spell checking and correction

Identifying and correcting misspelled words.

In [26]:
# The pyspellchecker package (imported as spellchecker) can check and correct spelling

from spellchecker import SpellChecker

text = 'i am a maxhine eearning emgineer'

spell = SpellChecker()
words = text.split()
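# correction() may return None when no candidate is found, so fall back to the original word
corrected_words = [spell.correction(word) or word for word in words]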
corrected_text = ' '.join(corrected_words)

print(corrected_text)


i am a machine learning engineer


### Removing duplicates

Eliminating duplicate or near-identical text entries that can skew analysis results.

In [25]:
sentences = ["i am a machine learning engineer.", "i also did the front-end web development.", "i am a machine learning engineer."]
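# dict.fromkeys drops exact duplicates and keeps first-seen order (a set does not guarantee order)
unique_sentences = list(dict.fromkeys(sentences))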
unique_sentences

['i am a machine learning engineer.',
 'i also did the front-end web development.']
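
The cell above removes exact duplicates only. For near-identical entries, one option is a similarity threshold. This is a minimal sketch using `difflib.SequenceMatcher` from the standard library; the `drop_near_duplicates` helper and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(sentences, threshold=0.9):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept = []
    for sentence in sentences:
        # Compare against every sentence kept so far.
        if all(SequenceMatcher(None, sentence, k).ratio() < threshold for k in kept):
            kept.append(sentence)
    return kept

sentences = [
    "i am a machine learning engineer.",
    "i am a machine learning engineer",  # differs only by the final period
    "i also did the front-end web development.",
]
unique_sentences = drop_near_duplicates(sentences)
print(unique_sentences)
```

Each sentence is compared with every sentence already kept, so the cost grows quadratically; this is fine for small lists but slow for large corpora.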

### Text-specific cleaning

Addressing domain-specific issues or text artifacts that are unique to the dataset or task at hand.
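
As one illustration, social media text often contains URLs, user mentions, and hashtags. The sketch below assumes that kind of data; `clean_social_text` is a hypothetical helper and the sample post is made up.

```python
import re

def clean_social_text(text):
    """Strip URLs and @mentions, and keep hashtag words without the '#'."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove user mentions
    text = re.sub(r"#(\w+)", r"\1", text)     # '#NLP' -> 'NLP'
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

post = "Thanks @some_user for the #NLP tips https://example.com/post"
cleaned_post = clean_social_text(post)
print(cleaned_post)
```

Other domains call for different rules, for example stripping boilerplate headers from emails or page numbers from scanned documents.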



### These are the basic tasks in the text cleaning process.