# 1. Loss of contextual information

Loss of contextual information is a significant challenge when removing irrelevant text data during preprocessing. When we remove words from a
sentence without considering the context, we risk losing information that is necessary for understanding the meaning of the text. For
example, consider the sentence, “I am reading a book about Python.” If we remove the words “a” and “book” because we judge them irrelevant, we end up
with “I am reading about Python,” which no longer says what is being read.


In [1]:
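sentence = "I am reading a book about Python"
# "book" is a content word; treating it as a stop word is the mistake being illustrated.
stop_words = {"a", "book"}
words = sentence.split()
# Keep only the words whose lowercase form is not in the stop-word set.
words_filtered = [word for word in words if word.lower() not in stop_words]
filtered_sentence = " ".join(words_filtered)
print(filtered_sentence)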

I am reading about Python
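
One way to limit this kind of loss is to check a candidate stop-word list before applying it. The sketch below assumes that only function words are safe to drop; the `FUNCTION_WORDS` set and the helpers `removed_content_words` and `remove_stop_words` are hypothetical names chosen for illustration, not part of any library.

```python
# Hypothetical, hand-picked set of function words (an assumption for this sketch);
# a real project would likely start from a reviewed, library-provided list.
FUNCTION_WORDS = {"a", "an", "the", "am", "is", "are", "of", "on", "to"}

def removed_content_words(stop_words, function_words=FUNCTION_WORDS):
    """Return the stop words that are not function words, i.e. likely content words."""
    return sorted(set(stop_words) - set(function_words))

def remove_stop_words(sentence, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

sentence = "I am reading a book about Python"
candidate = {"a", "book"}

# Flag the entries that would delete meaning-bearing words.
flagged = removed_content_words(candidate)
print(flagged)

# Apply only the entries that passed the check.
safe = candidate - set(flagged)
print(remove_stop_words(sentence, safe))
```

Here the check flags “book” as a content word, so only “a” is removed and the object of the sentence survives.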


# 2. Ambiguity

Ambiguity in what counts as irrelevant is a common challenge when removing irrelevant text data during preprocessing. Different contexts can give
different meanings to the same word, and it can be hard to determine whether a word is genuinely irrelevant. For example, consider the
sentence, “I am looking for a bank to deposit my money.” In this context, the word “bank” refers to a financial institution, so we might want to keep
it. However, if the sentence were “I am standing on the bank of the river,” “bank” would refer to the edge of a river and might be considered
irrelevant.


In [2]:
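sentence = "I am standing on the bank of the river"
# "bank" is listed unconditionally, so it is removed whatever sense it has.
stop_words = {"bank", "the", "on", "of"}
words = sentence.split()
# Keep only the words whose lowercase form is not in the stop-word set.
words_filtered = [word for word in words if word.lower() not in stop_words]
filtered_sentence = " ".join(words_filtered)
print(filtered_sentence)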

I am standing river
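
A context-sensitive filter can treat the two senses of “bank” differently. The following is a minimal sketch, assuming a small hand-written set of cue words is enough to signal the financial sense; `FINANCE_CUES` and `filter_with_context` are hypothetical names, and a real system would more likely rely on word-sense disambiguation.

```python
# Hypothetical cue words that suggest the financial sense of "bank".
FINANCE_CUES = {"deposit", "money", "account", "loan"}

def filter_with_context(sentence, stop_words, ambiguous="bank", cues=FINANCE_CUES):
    """Remove stop words, but keep the ambiguous word when a cue word is present."""
    words = sentence.split()
    # Normalize tokens so punctuation and case do not hide a cue word.
    lowered = {w.lower().strip(".,!?") for w in words}
    keep_ambiguous = bool(lowered & cues)
    kept = []
    for w in words:
        token = w.lower().strip(".,!?")
        if token == ambiguous and keep_ambiguous:
            kept.append(w)  # context says the word carries meaning here
        elif token not in stop_words:
            kept.append(w)
    return " ".join(kept)

stop_words = {"bank", "the", "on", "of", "a", "to"}
print(filter_with_context("I am looking for a bank to deposit my money", stop_words))
print(filter_with_context("I am standing on the bank of the river", stop_words))
```

With these assumptions, “bank” is kept in the first sentence because “deposit” and “money” appear, and removed in the second.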


# 3. Bias in data removal

Bias in data removal is another challenge when removing irrelevant text data during preprocessing. The decision of what constitutes irrelevant data
is subjective and can be influenced by personal biases or assumptions. For example, consider a study on consumer preferences that excludes certain
demographic groups from the data. Such an exclusion biases the data that remains and reduces the accuracy of the results.


In [3]:
sentences = ["Women love chocolate.",
             "Men prefer steak over salad.",
             "People enjoy pizza on Friday nights."]
# Listing demographic terms as stop words erases who each statement is about.
stop_words = {"women", "men"}
for sentence in sentences:
    words = sentence.split()
    # Keep only the words whose lowercase form is not in the stop-word set.
    words_filtered = [word for word in words if word.lower() not in stop_words]
    filtered_sentence = " ".join(words_filtered)
    print(filtered_sentence)

love chocolate.
prefer steak over salad.
People enjoy pizza on Friday nights.
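
A simple safeguard is to audit what a stop-word list actually removes before trusting the filtered data. The sketch below counts removals per term so that a reviewer can spot entries, such as demographic terms, that should not be discarded; `audit_removals` is a hypothetical helper name, not a library function.

```python
from collections import Counter

def audit_removals(sentences, stop_words):
    """Count how often each stop word would be removed across the sentences."""
    removed = Counter()
    for sentence in sentences:
        for word in sentence.split():
            # Strip trailing punctuation so "men." and "men" count as the same token.
            token = word.lower().strip(".,!?")
            if token in stop_words:
                removed[token] += 1
    return removed

sentences = ["Women love chocolate.",
             "Men prefer steak over salad.",
             "People enjoy pizza on Friday nights."]
stop_words = {"women", "men"}

removed = audit_removals(sentences, stop_words)
print(dict(removed))
```

The report shows that every removal in this sample is a demographic term, which is a signal to revisit the list.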


# 4. Variability in text data

Variability in text data is another challenge when removing irrelevant text data during preprocessing. Text data can be highly
variable in spelling, grammar, syntax, and punctuation. For example, consider the sentence, “I have to go to the store, but I’m not sure where it is.”
If we’re looking to remove the word “but” as irrelevant, we also need to account for other words that play the same contrastive role, such as
“however,” “although,” or “yet,” among others.


In [4]:
import re

texts = [
    "I have to go to the store, but I'm not sure where it is.",
    "I have an appointment, although I forgot when it is.",
    "I'm hungry yet I don't feel like cooking."
]

# First pass: remove only the literal word "but".
pattern = r"\bbut\b"
for text in texts:
    print(re.sub(pattern, "", text))

# Second pass: also remove the other contrastive words.
pattern = r"\b(but|however|although|yet)\b"
for text in texts:
    print(re.sub(pattern, "", text).strip())

I have to go to the store,  I'm not sure where it is.
I have an appointment, although I forgot when it is.
I'm hungry yet I don't feel like cooking.
I have to go to the store,  I'm not sure where it is.
I have an appointment,  I forgot when it is.
I'm hungry  I don't feel like cooking.
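
The output above still contains doubled spaces, and the pattern would miss capitalized forms such as “Yet”. A minimal sketch of one way to handle both, assuming case-insensitive matching and whitespace cleanup are acceptable for the task, is shown below; `CONNECTIVES` and `remove_connectives` are hypothetical names, and the word list is illustrative rather than exhaustive.

```python
import re

# Contrastive words from the example above, matched regardless of case.
CONNECTIVES = re.compile(r"\b(?:but|however|although|yet)\b", flags=re.IGNORECASE)

def remove_connectives(text):
    """Remove the connectives, then tidy the whitespace left behind."""
    text = CONNECTIVES.sub("", text)
    text = re.sub(r"\s+([,.;!?])", r"\1", text)  # no space before punctuation
    text = re.sub(r"\s{2,}", " ", text)          # collapse repeated spaces
    return text.strip()

texts = [
    "I have to go to the store, but I'm not sure where it is.",
    "I have an appointment, although I forgot when it is.",
    "I'm hungry yet I don't feel like cooking.",
    "Yet I went anyway.",
]

for text in texts:
    print(remove_connectives(text))
```

Note that this pattern also removes “yet” when it is used as an adverb (as in “not yet”), so the ambiguity problem from section 2 applies here as well.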
