#Text Preprocessing
##Text cleaning: removing noise, special characters, and stop words

---


##Introduction to regular expressions for text preprocessing

---


##Hands-on exercise: Preprocessing text data using Python libraries (e.g., NLTK or spaCy)

In this notebook, we first import the libraries needed for text preprocessing: nltk and re. We then download NLTK's stopwords corpus with nltk.download('stopwords').

Next, we define the clean_text1 function, which performs the cleaning in three steps: it converts the text to lowercase, removes special characters with a regular expression, and removes stop words using NLTK's stopwords corpus.

We then define a sample sentence and apply clean_text1 to it. Finally, we print both the original and the cleaned text.

In [1]:
#Preprocessing text data using Python libraries

## Importing necessary libraries
import nltk
from nltk.corpus import stopwords
import re


In [None]:
## Downloading stopwords from NLTK
nltk.download('stopwords')

In [3]:
## Sample text data
text_data = "This is an example sentence! It contains special characters like @#$% and stopwords such as 'the' and 'is'."

In [7]:
## Text cleaning: Removing noise, special characters, and stop words
def clean_text1(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove every character that is not a letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token not in stop_words]
    text = ' '.join(filtered_tokens)

    return text

In [None]:
## Cleaning the text data
cleaned_text = clean_text1(text_data)


In [6]:
## Printing the cleaned text
print("Original Text:", text_data)
print("Cleaned Text:", cleaned_text)

Original Text: This is an example sentence! It contains special characters like @#$% and stopwords such as 'the' and 'is'.
Cleaned Text: example sentence contains special characters like stopwords
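
To see what each step of the cleaning contributes, the sketch below returns the text after every stage. It is a minimal illustration, not the notebook's function: trace_cleaning is a hypothetical helper, and STOP_WORDS is a hand-written stand-in for NLTK's list (an assumption, so the example runs without downloading the corpus).

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption: it
# contains only the words needed for this sample sentence).
STOP_WORDS = {"this", "is", "an", "it", "and", "such", "as", "the"}

def trace_cleaning(text):
    """Return the text after each cleaning stage, in order."""
    lowered = text.lower()
    # After lowercasing, only a-z, digits, and whitespace need to be kept
    stripped = re.sub(r"[^a-z0-9\s]", "", lowered)
    kept = [token for token in stripped.split() if token not in STOP_WORDS]
    return lowered, stripped, " ".join(kept)

sample = "This is an example sentence! It contains special characters like @#$% and stopwords such as 'the' and 'is'."
for stage in trace_cleaning(sample):
    print(stage)
```

Lowercasing comes first so that "This" and "It" match the lowercase entries in the stop-word set.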


spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.

https://spacy.io/

In [None]:
# Preprocessing text data using Python libraries

## Importing necessary libraries
import re  # used inside clean_text2 to strip special characters
import spacy


In [None]:
## Loading the spaCy English model
## (install it first if needed: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

In [None]:
## Sample text data
text_data = "This is an example sentence! It contains special characters like @#$% and stopwords such as 'the' and 'is'."

In [None]:
## Text cleaning: Removing noise, special characters, and stop words
def clean_text2(text):
    # Convert text to spaCy Doc
    doc = nlp(text)

    # Skip stop words, punctuation, and whitespace tokens, then strip any remaining special characters
    cleaned_tokens = [re.sub(r'[^a-zA-Z0-9\s]', '', token.text) for token in doc if not token.is_stop and not token.is_punct and not token.is_space]

    # Discard tokens that became empty after stripping (for example, symbol-only tokens)
    cleaned_tokens = [token for token in cleaned_tokens if token]

    # Join the cleaned tokens back into a single string
    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text

In [None]:
## Cleaning the text data
cleaned_text = clean_text2(text_data)

In [None]:
## Printing the cleaned text
print("Original Text:", text_data)
print("Cleaned Text:", cleaned_text)

Original Text: This is an example sentence! It contains special characters like @#$% and stopwords such as 'the' and 'is'.
Cleaned Text: example sentence contains special characters like stopwords


Regular expressions (regex) are powerful tools for text preprocessing and cleaning tasks. Here are some examples of how regular expressions can be used:

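Removing punctuation: Regular expressions can be used to remove punctuation marks from text. The Unicode property pattern \p{P} matches any punctuation character, but Python's built-in re module does not support it (the third-party regex module does). With re, a common substitute is [^\w\s], which matches any character that is neither a word character nor whitespace; replace each match with an empty string.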

---



Removing special characters: You can use regular expressions to remove specific special characters from text. For example, to remove everything except ASCII letters, digits, and whitespace, you can use the pattern [^a-zA-Z0-9\s] and replace it with an empty string. Note that this pattern also removes accented and other non-ASCII letters.

---



Normalizing whitespace: Regular expressions can normalize whitespace by replacing runs of spaces, tabs, or newlines with a single space. For example, the pattern \s+ matches one or more whitespace characters, and you can replace each match with a single space.

---



Removing URLs or email addresses: If you want to remove URLs or email addresses from text, you can use regular expressions to match and replace them. There are various patterns available for this purpose, depending on the complexity of the URLs or email addresses you want to handle.

---



Extracting mentions or hashtags: Regular expressions can be used to extract mentions or hashtags from text, commonly found in social media data. For example, to extract all mentions in a tweet, you can use the pattern @\w+ to match the '@' symbol followed by one or more word characters. Writing the pattern with a capture group, @(\w+), returns only the name without the '@'.

---



Removing HTML tags: Regular expressions can help remove HTML tags from text. For instance, the pattern <[^>]+> can be used to match any HTML tag and replace it with an empty string.

---



Tokenization: Regular expressions can assist in splitting text into tokens based on specific patterns. For example, you can extract the words of a sentence with the pattern \b\w+\b, which matches each run of one or more word characters delimited by word boundaries.

---



Data cleaning and formatting: Regular expressions can be used to clean and format specific data formats, such as phone numbers, dates, or postal codes. You can define patterns to match the desired format and then manipulate or extract the relevant information.

#Removing punctuation:

In [9]:
import re

text = "Hello, world!"
clean_text = re.sub(r"[^\w\s]", "", text)
print(clean_text)  # Output: Hello world


Hello world
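
Python's built-in re module has no Unicode punctuation class, and [^\w\s] also removes symbols such as '$' while keeping the underscore. As a rough standard-library alternative, the sketch below drops only characters whose Unicode general category starts with 'P'. The helper name strip_punctuation is hypothetical.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose Unicode category is a punctuation class (P*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Hello, world!"))   # Output: Hello world
print(strip_punctuation("¿Qué tal?"))       # Output: Qué tal
print(strip_punctuation("Price: $5"))       # Output: Price $5
```

Currency signs belong to the symbol categories, so '$' survives here, whereas the [^\w\s] pattern would remove it.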


#Removing special characters:


In [10]:
import re

text = "Hello@#$ world!"
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
print(clean_text)  # Output: Hello world


Hello world


#Normalizing whitespace:


In [11]:
import re

text = "Hello    world!"
clean_text = re.sub(r"\s+", " ", text)
print(clean_text)  # Output: Hello world!


Hello world!


#Removing URLs or email addresses:


In [14]:
import re

text = "Visit my website at https://example.com or email me at info@example.com"
clean_text = re.sub(r"https?://\S+|[\w.-]+@[\w.-]+", "", text)
print(clean_text)  # Output: Visit my website at  or email me at


Visit my website at  or email me at 
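
Removing a URL or an email address leaves the surrounding spaces behind, as the double space in the output shows. One way to tidy this up is to chain the removal with whitespace normalization. The sketch below assumes the same patterns as above; remove_urls_and_emails is a hypothetical helper name.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
URL_OR_EMAIL = re.compile(r"https?://\S+|[\w.-]+@[\w.-]+")

def remove_urls_and_emails(text):
    """Remove URLs and email addresses, then collapse leftover whitespace."""
    without_links = URL_OR_EMAIL.sub("", text)
    return re.sub(r"\s+", " ", without_links).strip()

text = "Visit my website at https://example.com or email me at info@example.com"
print(remove_urls_and_emails(text))  # Output: Visit my website at or email me at
```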


#Extracting mentions or hashtags:


In [13]:
import re

text = "This is a tweet mentioning @username and using #hashtag"
mentions = re.findall(r"@\w+", text)
hashtags = re.findall(r"#\w+", text)

print(mentions)  # Output: ['@username']
print(hashtags)  # Output: ['#hashtag']


['@username']
['#hashtag']
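
When only the names are needed, a capture group can be added to the pattern. With exactly one group, re.findall returns the captured part instead of the whole match. A small sketch using the same sample tweet:

```python
import re

text = "This is a tweet mentioning @username and using #hashtag"

# The parentheses capture the name; the leading symbol is matched but not returned
mention_names = re.findall(r"@(\w+)", text)
hashtag_names = re.findall(r"#(\w+)", text)

print(mention_names)  # Output: ['username']
print(hashtag_names)  # Output: ['hashtag']
```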


#Removing HTML tags:


In [16]:
import re

text = "<p>This is an example <b>HTML</b> text.</p>"
clean_text = re.sub(r"<[^>]+>", "", text)
print(clean_text)  # Output: This is an example HTML text.


This is an example HTML text.
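
The regular expression works for simple markup, but it leaves character references such as &amp; untouched. As a possible alternative, the sketch below collects the text content with the standard library's html.parser module. The names TextExtractor and strip_tags are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags("<p>This is an example <b>HTML</b> text.</p>"))
# Output: This is an example HTML text.
print(strip_tags("<p>Tom &amp; Jerry</p>"))
# Output: Tom & Jerry
```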


#Tokenization:


In [17]:
import re

text = "This is a sample sentence."
tokens = re.findall(r"\b\w+\b", text)
print(tokens)  # Output: ['This', 'is', 'a', 'sample', 'sentence']


['This', 'is', 'a', 'sample', 'sentence']
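
The \b\w+\b pattern splits contractions and hyphenated words into pieces, because the apostrophe and the hyphen are not word characters. If such tokens should stay together, one option is to allow those characters between runs of word characters. The extended pattern below is an illustrative assumption, not a complete tokenizer.

```python
import re

text = "Don't split state-of-the-art tokens."

# Plain word pattern: breaks at apostrophes and hyphens
simple_tokens = re.findall(r"\b\w+\b", text)
print(simple_tokens)
# Output: ['Don', 't', 'split', 'state', 'of', 'the', 'art', 'tokens']

# Extended pattern: a word, optionally followed by ' or - and another word
extended_tokens = re.findall(r"\w+(?:['-]\w+)*", text)
print(extended_tokens)
# Output: ["Don't", 'split', 'state-of-the-art', 'tokens']
```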


#Data cleaning and formatting:


In [18]:
import re

text = "Sample text with dates: 10/05/2023, 25/12/2022, and 05/09/2021."

pattern = r"\b\d{2}/\d{2}/\d{4}\b"
matches = re.findall(pattern, text)

print(matches)



['10/05/2023', '25/12/2022', '05/09/2021']
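
Beyond finding matches, named groups make it possible to rearrange the parts of each match. The sketch below rewrites the dates in ISO 8601 form (YYYY-MM-DD). It assumes the sample dates are written day first (DD/MM/YYYY), which the original text does not state explicitly.

```python
import re

text = "Sample text with dates: 10/05/2023, 25/12/2022, and 05/09/2021."

# Assumption: dates are day/month/year
DATE = re.compile(r"\b(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})\b")

# \g<name> refers to a named group in the replacement string
iso_text = DATE.sub(r"\g<year>-\g<month>-\g<day>", text)
print(iso_text)
# Output: Sample text with dates: 2023-05-10, 2022-12-25, and 2021-09-05.
```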
