# Types of irrelevant text data

Irrelevant text data refers to words, phrases, or sentences that carry little or no useful information for the analysis at hand. Removing 
irrelevant text data is an essential step in text preprocessing because it can improve the accuracy and efficiency of NLP tasks, such as sentiment analysis, 
topic modeling, and document classification. In the following sections, we’ll look at some examples of irrelevant text data and how to remove them 
using various NLP libraries.


## 1. Stopwords

In [1]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words("english"))

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [4]:

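# Remove stopwords from each review (case-insensitive match on whitespace-split tokens)
filtered_reviews = []
for review in df['text']:
    words = review.split()
    # Tokens with attached punctuation (e.g. "while,") do not match a stopword and are kept
    filtered_words = [word for word in words if word.lower() not in stop_words]
    filtered_text = " ".join(filtered_words)
    filtered_reviews.append(filtered_text)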

In [5]:
df['filtered_review_text'] = filtered_reviews 
print(df['filtered_review_text'])

0     software steep learning curve first, while, st...
1     I'm really impressed user interface software. ...
2     latest update software fixed several bugs impr...
3     encountered glitches using software, customer ...
4     skeptical trying software initially, turned ga...
5     analytics features provided us valuable insigh...
6     appreciate regular updates software receives, ...
7     attended training session software, greatly im...
8     software documentation could comprehensive, fe...
9     I've recommended software colleagues due excel...
10    software integration third-party plugins expan...
11    I'm looking forward upcoming release software,...
12    user community active supportive, making easie...
13    I've using software now, I'm consistently impr...
14    user interface could use modernization, feels ...
15            went run software good job mapping route.
Name: filtered_review_text, dtype: object
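
Row 0 of the output still contains `while,` because `split()` leaves punctuation attached to the token, so it never equals the stopword `while`. One possible refinement, shown here as a minimal sketch, is to compare each token with its surrounding punctuation stripped. The function name `remove_stopwords`, the small hard-coded `STOP_WORDS` set, and the sample sentence are all assumptions made for illustration; in the notebook the set would be `stop_words` from NLTK.

```python
import string

# Hypothetical, hard-coded stopword subset so the sketch runs without NLTK data;
# in the notebook this would be set(stopwords.words("english")).
STOP_WORDS = {"the", "a", "had", "at", "while", "but", "it", "i", "with", "of"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    kept = []
    for word in text.split():
        # Compare the lowercased token without leading/trailing punctuation
        core = word.strip(string.punctuation).lower()
        # Drop stopwords and tokens that consist only of punctuation
        if core and core not in stop_words:
            kept.append(word)  # keep the original token, punctuation included
    return " ".join(kept)

# Made-up sample sentence, not taken from reviews.csv
sample = "The software had a steep learning curve at first, while, it got easier."
result = remove_stopwords(sample)
print(result)
```

The original token is kept in the output so that sentence punctuation survives; whether that is desirable depends on the cleaning steps that follow.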


## 2. Special characters, numbers, and punctuation

In [10]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [13]:
import re

def clean_text(text):
    # Keep only ASCII letters and whitespace; digits, punctuation, and symbols are removed
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    return cleaned_text

In [14]:
df['cleaned_text'] = df['text'].apply(clean_text) 
print(df['cleaned_text'])

0     The software had a steep learning curve at fir...
1     Im really impressed with the user interface of...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     Ive recommended the software to colleagues due...
10    The software integration with thirdparty plugi...
11    Im looking forward to the upcoming release of ...
12    The user community is active and supportive ma...
13    Ive been using the software for a while now an...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: cleaned_text, dtype: object
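
Because the pattern deletes characters outright, `third-party` becomes `thirdparty` in row 10. If word boundaries matter for the downstream task, a variant could replace the unwanted characters with a space and then collapse repeated whitespace. The sketch below assumes that behavior is wanted; `clean_text_keep_boundaries` and the sample string are hypothetical names and data, not part of the original notebook.

```python
import re

def clean_text_keep_boundaries(text):
    # Replace every character that is not an ASCII letter or whitespace with a space
    spaced = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", spaced).strip()

# Made-up sample string, not taken from reviews.csv
sample = "Version 2.0 of the third-party plugin works; I'm impressed!"
cleaned = clean_text_keep_boundaries(sample)
print(cleaned)
```

The trade-off is that contractions are split as well (`I'm` becomes `I m`), so this variant fits best when contractions are expanded or removed in a separate step.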


## 3. HTML tags

In [15]:
import re
from bs4 import BeautifulSoup
import pandas as pd

In [16]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [17]:

def remove_html_tags(text):
    # Parse the string as HTML and keep only its text content
    soup = BeautifulSoup(text, "html.parser")
    cleaned_text = soup.get_text()
    return cleaned_text

In [18]:
df['cleaned_text'] = df['text'].apply(remove_html_tags)
print(df['cleaned_text'])

0     The software had a steep learning curve at fir...
1     I'm really impressed with the user interface o...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     I've recommended the software to colleagues du...
10    The software integration with third-party plug...
11    I'm looking forward to the upcoming release of...
12    The user community is active and supportive, m...
13    I've been using the software for a while now, ...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: cleaned_text, dtype: object
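
The printed rows look the same as the input, which suggests these reviews contain little or no markup. To see the effect of tag removal on a string that does contain HTML, here is a small sketch. It deliberately swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra dependencies; it is not the notebook's method, and `strip_tags`, `_TextExtractor`, and the sample string are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored
        self.parts.append(data)

def strip_tags(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

# Made-up sample string, not taken from reviews.csv
sample = "<p>The <b>update</b> fixed several bugs &amp; crashes.</p>"
stripped = strip_tags(sample)
print(stripped)
```

For messy real-world markup, BeautifulSoup as used above is generally the more forgiving choice; the standard-library version is mainly useful for quick checks.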


## 4. Irrelevant metadata

Metadata is, put simply, data about data. For example, if we send an MS Word document to someone, the metadata associated with that document 
includes the date and time the document was created, who the file owner is, whether the file has special permissions, and so on. However, not all 
metadata is relevant in every case, and irrelevant metadata can introduce noise into the text analysis, reducing its accuracy. When such metadata 
follows a predictable format, we can use regular expressions to match and remove it.


In [19]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [20]:

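# Matches a date stamp such as "Date: July 5, 2023"
DATE_PATTERN = r"Date: [A-Za-z]{3,9} \d{1,2}, \d{4}"

def remove_metadata(text):
    # Remove the date stamp from a single review; text without a match is returned unchanged
    return re.sub(DATE_PATTERN, "", text)

In [21]:
# Store the matched metadata per row (empty string when a review has none).
# Assigning df['metadata'] inside the applied function would overwrite the whole column on every call.
df['metadata'] = df['text'].str.extract(f"({DATE_PATTERN})", expand=False).fillna("")
df['cleaned_text'] = df['text'].apply(remove_metadata)
print(df['cleaned_text'])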

0     The software had a steep learning curve at fir...
1     I'm really impressed with the user interface o...
2     The latest update to the software fixed severa...
3     I encountered a few glitches while using the s...
4     I was skeptical about trying the software init...
5     The analytics features have provided us with v...
6     I appreciate the regular updates that the soft...
7     I attended a training session for the software...
8     The software documentation could be more compr...
9     I've recommended the software to colleagues du...
10    The software integration with third-party plug...
11    I'm looking forward to the upcoming release of...
12    The user community is active and supportive, m...
13    I've been using the software for a while now, ...
14    The user interface could use some modernizatio...
15    I went for a run and the software did a good j...
Name: cleaned_text, dtype: object
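
Since the printed rows are truncated, the effect of the pattern is easier to see on a single string. The following sketch applies the same date pattern to a made-up review and returns both the cleaned text and the matched metadata. The helper name `split_metadata` and the sample string are assumptions for illustration only.

```python
import re

# Same date pattern as in the notebook, compiled once for reuse
DATE_RE = re.compile(r"Date: [A-Za-z]{3,9} \d{1,2}, \d{4}")

def split_metadata(text):
    # Capture the first date stamp, if any
    match = DATE_RE.search(text)
    metadata = match.group() if match else ""
    # Remove every date stamp and trim the whitespace left behind
    cleaned = DATE_RE.sub("", text).strip()
    return cleaned, metadata

# Made-up sample string, not taken from reviews.csv
sample = "Date: July 5, 2023 The latest update fixed several bugs."
cleaned, metadata = split_metadata(sample)
print(metadata)
print(cleaned)
```

Returning both values from one function keeps the text and its metadata paired per row, which avoids writing to the DataFrame from inside the function.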
