
# Text Processing

This notebook contains the practical examples and exercises for Applied AI - Natural Language Processing.

*Created by Hansi Hettiarachchi*

# Tokenisation
Tokenisation is the task of cutting a string into identifiable linguistic units that constitute a piece of language data.

Let's see how to use the [tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in the NLTK (Natural Language Toolkit) package to tokenise text.

In [4]:
import nltk
nltk.download('punkt')  # NLTK module required for Tokenizers

from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
sample_text = "This is a sentence, which contains all kind of words, and needs to be tokenized!"
sample_tweet1 = "This is a cooool :-) :-P <3 #cool"
sample_tweet2 = "@remy: This is waaaaayyyy too much for you!!!!!!"

Tokenising normal text

In [7]:
import nltk
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in newer NLTK releases
tokenized_text = word_tokenize(sample_text)
print(tokenized_text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['This', 'is', 'a', 'sentence', ',', 'which', 'contains', 'all', 'kind', 'of', 'words', ',', 'and', 'needs', 'to', 'be', 'tokenized', '!']


Tokenising tweets

In [8]:
tokenized_tweet1 = word_tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = word_tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':', '-', ')', ':', '-P', '<', '3', '#', 'cool']
tokenized tweet2: ['@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']


As you can see in the outputs above, <i>word_tokenize</i> does not tokenise tweet text correctly: it splits emoticons, hashtags and user handles into separate characters.
Because tweets differ from normal text, NLTK provides another tokenizer, <i>TweetTokenizer</i>, which is specifically designed for tweets.

In [9]:
tknzr = TweetTokenizer()

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
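
To give a feel for why a tweet-aware tokenizer keeps emoticons and hashtags intact, here is a rough regex-based sketch. It is not NLTK's implementation: the pattern list and the name `simple_tweet_tokenize` are assumptions made for illustration, and only a handful of emoticon shapes are covered. The key idea is that the more specific patterns (emoticons, `<3`, handles and hashtags) are tried before the generic ones.

```python
import re

# Ordered from most specific to most generic; alternation tries them left to right
PATTERNS = [
    r"[:;=][-']?[)(DPp]",  # a few western emoticons such as :-) or :-P
    r"<3",                 # heart
    r"[@#]\w+",            # user handles and hashtags
    r"\w+",                # ordinary words
    r"\S",                 # any other non-space character, one at a time
]
TOKEN_RE = re.compile("|".join(PATTERNS))


def simple_tweet_tokenize(text):
    """Hypothetical, simplified tweet tokenizer based on the patterns above."""
    return TOKEN_RE.findall(text)


print(simple_tweet_tokenize("This is a cooool :-) :-P <3 #cool"))
```

For this particular tweet the sketch yields the same tokens as the TweetTokenizer output above, but it will miss many cases that the real tokenizer handles.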


Let's analyse more features available with [TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html?highlight=tweettokenizer#nltk.tokenize.casual.TweetTokenizer).
- preserve_case (default setting=True) - Keep the original casing of the text; when set to False, tokens are converted to lowercase (emoticons are left unchanged)
- reduce_len (default setting=False) - Normalise text by replacing repeated character sequences of length 3 or greater with sequences of length 3
- strip_handles (default setting=False) - Remove Twitter usernames from the text
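
The effect of the last two options can be approximated with two small regular expressions. This is a sketch for intuition only: the function names are made up here, and the handle pattern is a deliberate simplification rather than the one NLTK uses.

```python
import re


def reduce_lengthening(text):
    """Shorten any run of 3 or more identical characters to exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def strip_handles_simple(text):
    """Simplified handle removal: drop '@' followed by word characters."""
    return re.sub(r"@\w+", "", text).strip()


print(reduce_lengthening("This is waaaaayyyy too much for you!!!!!!"))
print(strip_handles_simple("@remy: This is fun"))
```

Note that "too" is untouched because its repeated run has only two characters.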

In [None]:
# setting1: make the tokens case insensitive or convert into lowercase
print('configs: preserve_case=False')
tknzr = TweetTokenizer(preserve_case=False)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

configs: preserve_case=False
tokenized tweet1: ['this', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']


In [10]:
# setting2: make the tokens case insensitive and reduce length
print('\nconfigs: preserve_case=False, reduce_len=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


configs: preserve_case=False, reduce_len=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


In [11]:
# setting3: make the tokens case insensitive, reduce length and remove usernames
print('\nconfigs: preserve_case=False, reduce_len=True, strip_handles=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


configs: preserve_case=False, reduce_len=True, strip_handles=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: [':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


### <font color='green'>**Activity 1**</font>

Analyse the outputs from TweetTokenizer above under settings 1, 2 and 3 and identify the best setting to use, if your final aim is to predict the sentiment (positive, negative and neutral) of the tweet.

Setting 3, because the emotion is still expressed (emoticons and elongated words are kept) while the tokens are shorter and more uniform. The username is removed, which is acceptable because it carries no emotion in the context of this question. A few punctuation marks could still be removed.


# Text Normalisation


## Lower casing

In [12]:
import string

In [13]:
sample_text = "The striped BATs are hanging on their feet for best"

In [14]:
# Any string can be lower cased using the function lower()
lower_cased_text = sample_text.lower()
print(lower_cased_text)

the striped bats are hanging on their feet for best


If you are not familiar with string methods, you can find a list of all of them in the [documentation](https://docs.python.org/3.7/library/stdtypes.html#string-methods).

## Stemming
Stemming chops off the end or beginning of words by taking into account a list of common prefixes or suffixes that could be found in that word.

The most common and effective algorithm for stemming English is <i>Porter’s algorithm.</i>

Details of the different stemmers available with NLTK are available [here](https://www.nltk.org/howto/stem.html).
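
To make the "chop off a known suffix" idea concrete, here is a deliberately naive suffix stripper. It is not Porter's algorithm, which applies ordered rule phases with extra conditions; the suffix list, the `min_stem_len` parameter and the name `toy_stem` are all assumptions chosen for this illustration.

```python
# Hypothetical suffix list; longer suffixes are checked before shorter ones
SUFFIXES = ("ing", "ies", "es", "ed", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["dogs", "ponies", "eating", "corpora"]:
    print(w, "->", toy_stem(w))
```

A rule list this crude produces stems such as "pon" for "ponies" and cannot handle irregular forms like "corpora", which is why rule-based stemmers need carefully designed conditions.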

In [15]:
import nltk

from nltk.stem import PorterStemmer

In [16]:
ps = PorterStemmer()

word = "dogs"
stem_word = ps.stem(word)

print(f'Stemmed word: {stem_word}')

Stemmed word: dog


In [17]:
# If you have a list of words, you need to iteratively go through each to do the conversion
sample_words = ["dogs", "ponies", "eating", "corpora"]
stem_words = [ps.stem(word) for word in sample_words]

print(f'Stemmed words: {stem_words}')

Stemmed words: ['dog', 'poni', 'eat', 'corpora']


Stemmers take a single word as the input. If you have a sentence, you need to first tokenise it.

In [18]:
sample_sentence = "The striped bats are hanging on their feet for best."

tokens = word_tokenize(sample_sentence)
stem_words = [ps.stem(word) for word in tokens]

print(f'Stemmed words: {stem_words}')

Stemmed words: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best', '.']


In [19]:
# If required, you can also convert the stem words into a sentence by merging them with a space between each word.
print(f'Stemmed sentence: {" ".join(stem_words)}')

Stemmed sentence: the stripe bat are hang on their feet for best .


## Lemmatisation

Lemmatisation is a more organised procedure to obtain the base form of a word (lemma) with the use of a vocabulary and morphological analysis (word structure and grammar relations) of words.
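
The role of the vocabulary can be illustrated with a toy lookup. The dictionary below is hand-written for this example (it is not WordNet) and `toy_lemmatize` is a hypothetical helper; real lemmatisers combine a much larger vocabulary with morphological rules.

```python
# Hypothetical, hand-written vocabulary: inflected form -> lemma
TOY_LEMMAS = {
    "feet": "foot",
    "corpora": "corpus",
    "ponies": "pony",
    "bats": "bat",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word itself."""
    return TOY_LEMMAS.get(word.lower(), word)


for w in ["feet", "corpora", "ponies", "hanging"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike suffix stripping, a lookup can map irregular forms such as "feet" to a valid word, but it returns unknown words unchanged.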

### NLTK [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.wordnet.html#nltk.stem.WordNetLemmatizer.lemmatize)

In [20]:
import nltk

nltk.download('wordnet')  # NLTK module required for WordNetLemmatizer
nltk.download('omw-1.4') # NLTK module required for WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [21]:
wnl = WordNetLemmatizer()

word = "dogs"
lemma_word = wnl.lemmatize(word)

print(f'Lemmatised word: {lemma_word}')

Lemmatised word: dog


If you have a list of words, you need to iterate through them and lemmatise each one.

In [22]:
sample_words = ["dogs", "ponies", "eating", "corpora"]
lemma_words = [wnl.lemmatize(word) for word in sample_words]

print(f'Lemmatised words: {lemma_words}')

Lemmatised words: ['dog', 'pony', 'eating', 'corpus']


Similar to stemmers, lemmatisers also take a single word as the input. If you have a sentence, you need to first tokenise it.

In [23]:
sample_sentence = "The striped bats are hanging on their feet for best."

tokens = word_tokenize(sample_sentence)
lemma_words = [wnl.lemmatize(word) for word in tokens]

print(f'Lemmatised words: {lemma_words}')

Lemmatised words: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best', '.']


In [24]:
# If required, you can also convert the lemmatised words into a sentence by merging them with a space between each word.
print(f'Lemmatised sentence: {" ".join(lemma_words)}')

Lemmatised sentence: The striped bat are hanging on their foot for best .


### spaCy Lemmatization

spaCy models are pipelines designed with multiple components.<br>
You can find more details about available pipelines and models [here](https://spacy.io/models/en).


In [25]:
import spacy
from spacy import displacy
import en_core_web_sm  # spacy model
nlp = en_core_web_sm.load()

In [26]:
sample_sentence = "The striped bats are hanging on their feet for best."

# process a sentence using the spaCy pipeline
doc = nlp(sample_sentence)
# iterate through each token in the output document (processed sentence) and get its lemmatised version
print([token.lemma_ for token in doc])

['the', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good', '.']


### <font color='green'>**Activity 2**</font>

Fill the following table by applying WordNet and spaCy lemmatization to each given word.

|Original word | Lemmatised word- WordNetLemmatizer  | Lemmatised word- spaCy |
|------|--------------------|------------|
|walking    | ? | ? |
|is    | ? | ? |
|main    | ? | ? |
|animals    | ? | ? |
|terrestrial    | ? | ? |
|jumping    | ? | ? |
|best    | ? | ? |
|sleeping    | ? | ? |

Which lemmatiser is the best to normalise text and why?


In [28]:
sample_words = ["walking", "is", "main", "animals", "terrestrial", "jumping", "best", "sleeping"]
lemma_words = [wnl.lemmatize(word) for word in sample_words]

print(f'Lemmatised words: {lemma_words}')

Lemmatised words: ['walking', 'is', 'main', 'animal', 'terrestrial', 'jumping', 'best', 'sleeping']


In [33]:
sample_words = ["walking", "is", "main", "animals", "terrestrial", "jumping", "best", "sleeping"]

# process a sentence using the spaCy pipeline
doc = nlp(" ".join(sample_words))
# iterate through each token in the output document (processed sentence) and get its lemmatised version
print([token.lemma_ for token in doc])

['walk', 'be', 'main', 'animal', 'terrestrial', 'jumping', 'well', 'sleep']


### Lemmatization Results Table (Markdown Format)


In [39]:
import pandas as pd
import nltk
from nltk.corpus import wordnet # Ensure wordnet is imported for wordnet.ADJ, etc.

nltk.download('averaged_perceptron_tagger_eng')  # PoS tagger resource needed by nltk.pos_tag in this environment

# Helper function for WordNetLemmatizer: returns the PoS tag of a word in the format WordNet expects
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

words_to_lemmatize = ["walking", "is", "main", "animals", "terrestrial", "jumping", "best", "sleeping"]

# WordNet Lemmatization
wordnet_lemmas_output = []
for word in words_to_lemmatize:
    pos = get_wordnet_pos(word) # Get the PoS tag for WordNetLemmatizer
    lemma = wnl.lemmatize(word, pos)
    wordnet_lemmas_output.append(lemma)

# spaCy Lemmatization
spacy_lemmas_output = []
# Process the entire list as a sentence for spaCy for better context awareness
doc = nlp(" ".join(words_to_lemmatize))
for token in doc:
    spacy_lemmas_output.append(token.lemma_)

# Create a DataFrame to display the results
data = {
    'Original word': words_to_lemmatize,
    'Lemmatised word - WordNetLemmatizer': wordnet_lemmas_output,
    'Lemmatised word - spaCy': spacy_lemmas_output
}
df_lemmas = pd.DataFrame(data)

display(df_lemmas)

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


| |Original word | Lemmatised word - WordNetLemmatizer | Lemmatised word - spaCy |
|---|------|--------------------|------------|
|0|walking|walk|walk|
|1|is|be|be|
|2|main|main|main|
|3|animals|animal|animal|
|4|terrestrial|terrestrial|terrestrial|
|5|jumping|jumping|jumping|
|6|best|best|well|
|7|sleeping|sleep|sleep|


### NLTK WordNetLemmatizer with Part-of-Speech (PoS) tags

[Parts of speech](https://www.englishclub.com/grammar/parts-of-speech.htm) are also known as word classes or lexical categories.

By feeding the corresponding PoS tag along with the word, we can further improve the WordNetLemmatizer.

According to NLTK's [documentation](https://www.nltk.org/api/nltk.stem.wordnet.html#nltk.stem.WordNetLemmatizer.lemmatize), “n” for nouns, “v” for verbs, “a” for adjectives and “r” for adverbs are the valid PoS tag options for WordNetLemmatizer.

In [40]:
import nltk

nltk.download('wordnet')  # NLTK module required for WordNetLemmatizer
nltk.download('omw-1.4') # NLTK module required for WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  #  NLTK module required for PoS tagger

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [41]:
lemma_word=wnl.lemmatize('ponies', pos='n')
print(lemma_word)

lemma_word=wnl.lemmatize('walking', pos='v')
print(lemma_word)

lemma_word=wnl.lemmatize('better', pos='a')
print(lemma_word)

lemma_word=wnl.lemmatize('effectively', pos='r')
print(lemma_word)

pony
walk
good
effectively


We can use the PoSTagger available with [NLTK](https://www.nltk.org/api/nltk.tag.html) to automatically identify the PoS tags.

In [42]:
nltk.pos_tag(['ponies', 'walking', 'best', 'effectively'])

[('ponies', 'NNS'), ('walking', 'VBG'), ('best', 'JJS'), ('effectively', 'RB')]

WordNetLemmatizer requires PoS tags in the format of 'n', 'v', 'a' and 'r'.
However, the PoS tagger returns Penn Treebank tags such as 'NNS', 'VBG', 'JJS' and 'RB'.

Let's write a simple function to get the PoS tag of a word in the format required by WordNetLemmatizer.

In [43]:
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

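If you are not familiar with how the Python dictionary get() method works, find more details [here](https://www.w3schools.com/python/ref_dictionary_get.asp). In short, it returns the value for the given key, or the second argument (here the noun tag) when the key is missing.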
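
A possible variant, sketched here under the assumption that the tags are already available, converts a tag rather than a word. This lets you reuse tags obtained by tagging a whole sentence instead of tagging each word in isolation. The name `penn_to_wordnet` is hypothetical; the letters 'a', 'n', 'v' and 'r' are the PoS options listed above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet PoS letter."""
    mapping = {"J": "a", "N": "n", "V": "v", "R": "r"}
    # dict.get returns the second argument when the key is missing,
    # so unknown tags fall back to noun
    return mapping.get(tag[:1].upper(), "n")


# Tagged pairs copied from the pos_tag output shown earlier
tagged = [("ponies", "NNS"), ("walking", "VBG"), ("best", "JJS"), ("effectively", "RB")]
converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```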

In [44]:
sample_words = ['ponies', 'walking', 'best', 'effectively']

for word in sample_words:
    pos = get_wordnet_pos(word)
    lemma_word = wnl.lemmatize(word, pos)
    print(lemma_word)

pony
walk
best
effectively


# Stop Word Removal

In [45]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


You do not have to create a stop word list from scratch, as NLTK provides a readily available list. However, you may need to update it depending on the problem you are trying to solve.

In [52]:
stop_words = set(stopwords.words('english'))
stop_words.remove("off")
print(stop_words)

{'do', 'has', 'you', "she'll", 'having', "we've", 'by', 'him', "they'd", "hadn't", 're', 'those', 't', 'does', 'your', 'just', 'they', 'couldn', 'below', 'for', 'herself', 'most', "mustn't", 'against', "i've", 'here', "he's", 'my', 'our', "we'd", 'at', 'but', 'his', "shan't", 'themselves', 'yourselves', 'will', 'too', "should've", 'a', "don't", 'doesn', "you're", 'now', 'more', "weren't", 'its', 'wouldn', 'some', "isn't", "that'll", "won't", 'ours', "i'd", 'or', 'won', 'yours', 'me', 'who', 'and', 'aren', 'been', "haven't", 'isn', 'on', 'theirs', 'with', 'only', 'very', 'into', "it's", "i'm", 'no', 'own', 'them', 'both', 'all', "couldn't", 'when', "wouldn't", 'we', 'ourselves', "wasn't", 'she', "they're", 'd', "we're", 'i', 'don', "it'll", 'again', 'after', "shouldn't", 'it', 'not', 'hasn', 'how', 'wasn', 'if', 'this', 'than', 'did', 'm', 've', "we'll", 'her', 'the', 'further', 'once', 'weren', 'about', 'haven', "hasn't", "mightn't", 'll', 'until', "you've", 'can', 'other', 'shan', 'wh

Let's try to remove stopwords in a sentence.

In [53]:
sample_text = "This is a sample sentence, showing off the stop words removal."

# tokenise text
tokens = word_tokenize(sample_text)

# remove stopwords from tokens
filtered_words = [token for token in tokens if token not in stop_words]
print(filtered_words)


['This', 'sample', 'sentence', ',', 'showing', 'off', 'stop', 'words', 'removal', '.']
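
Notice that 'This' survives in the output above: the stop word list is lowercase and the membership test is case-sensitive. If that is not what you want, one option is to compare the lower-cased token. The sketch below uses a small hand-picked stop word set and a hard-coded token list (both assumptions, so that it runs without any NLTK downloads).

```python
# Small hand-picked subset for illustration only; not the full NLTK list
toy_stop_words = {"this", "is", "a", "the", "off"}
toy_stop_words.discard("off")  # discard() does not raise if the item is absent

tokens = ["This", "is", "a", "sample", "sentence", ",", "showing", "off",
          "the", "stop", "words", "removal", "."]

# Compare the lower-cased token so that 'This' matches the stop word 'this'
filtered = [t for t in tokens if t.lower() not in toy_stop_words]
print(filtered)
```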


### <font color='green'>**Activity 3**</font>

Update the default stop word list by removing 'off', and remove stop words in the sample_text.

**Expected output:** \['This', 'sample', 'sentence', ',', 'showing', 'off', 'stop', 'words', 'removal', '.']

**Hint:** [Python - Remove List Items](https://www.w3schools.com/python/python_lists_remove.asp)

# Punctuation Removal

In [54]:
import string

You can get a readily available set of punctuation marks from the Python string module.

In [55]:
print(f'Punctuation marks: {string.punctuation}')

Punctuation marks: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [56]:
sample_text = "Let's remove punctuation marks!"

# remove punctuation marks in sample text
table = str.maketrans(dict.fromkeys(string.punctuation))
no_punctuation= sample_text.translate(table)

print(no_punctuation)

Lets remove punctuation marks
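
Character-level removal also deletes apostrophes inside words ("Let's" becomes "Lets" above). If the text has already been tokenised, an alternative is to drop only the tokens made up entirely of punctuation. The token list below is hard-coded for illustration so the sketch runs without a tokenizer.

```python
import string

tokens = ["Let", "'s", "remove", "punctuation", "marks", "!"]

# Keep a token unless every character in it is a punctuation mark
kept = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
print(kept)
```

Here "'s" is kept because it contains a letter, while "!" is dropped.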


# Named Entity Recognition (NER)

Let's see how to use [spaCy](https://spacy.io/usage/linguistic-features#named-entities) models for NER.

[spaCy English Models](https://spacy.io/models/en)

In [57]:
import spacy
from spacy import displacy
import en_core_web_sm  # spacy model
nlp = en_core_web_sm.load()

In [58]:
sample_text = "Apple is looking at buying U.K. startup for $1 billion"

In [59]:
doc = nlp(sample_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, jupyter=True, style='ent')

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


Replace text with recognised named entities.

In [60]:
doc = nlp(sample_text)

updated_tokens = [t.text if not t.ent_type_ else t.ent_type_ for t in doc]
updated_sentence = " ".join(updated_tokens)
print(updated_sentence)

ORG is looking at buying GPE startup for MONEY MONEY MONEY


A multi-token entity such as "$1 billion" produces one tag per token, which is why MONEY is repeated above. These repetitions can be merged by adding ['merge_entities'](https://spacy.io/api/pipeline-functions#merge_entities) to the pipeline.

In [61]:
nlp.add_pipe("merge_entities")

doc = nlp(sample_text)

updated_tokens = [t.text if not t.ent_type_ else t.ent_type_ for t in doc]
updated_sentence = " ".join(updated_tokens)
print(updated_sentence)

ORG is looking at buying GPE startup for MONEY
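
Another option, shown here as a sketch, is to substitute by character offsets instead of tokens, which leaves the original spacing untouched and does not require changing the pipeline. The helper `replace_spans` is hypothetical, and the spans are copied from the `start_char`, `end_char` and `label_` values printed earlier so that the example runs without a spaCy model.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label."""
    # Work from right to left so earlier offsets stay valid after each substitution
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


sample_text = "Apple is looking at buying U.K. startup for $1 billion"
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]
print(replace_spans(sample_text, spans))
```

With a processed spaCy document, the spans could be collected as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.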


### <font color='green'>**Activity 4**</font>

Let's assume you need to identify the sentiment (positive, negative and neutral) of a given product review. A few sample reviews are given below.

* "Apple's new product is amazing."
* "I'm quite dissapointed with recent Apple products."
* "Android products are amazing and versatile."

a) Replace the entities in these sentences using entity tags.

b) Would this replacement be helpful for sentiment identification from the perspective of a machine learning model?




In [62]:
sample_text = "Apple's new product is amazing."
doc = nlp(sample_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, jupyter=True, style='ent')

Apple 0 5 ORG


In [63]:
sample_text = "I'm quite dissapointed with recent Apple products."
doc = nlp(sample_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, jupyter=True, style='ent')

Apple 35 40 ORG


In [64]:
sample_text = "Android products are amazing and versatile."
doc = nlp(sample_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, jupyter=True, style='ent')

Android 0 7 ORG


# Task
Replace named entities in the provided sample product reviews ("Apple's new product is amazing.", "I'm quite dissapointed with recent Apple products.", "Android products are amazing and versatile.") with their corresponding entity tags using spaCy, display the original and entity-replaced reviews, and then discuss whether this entity replacement would be beneficial for a machine learning model performing sentiment identification.

## Define Sample Reviews

### Subtask:
Store the provided sample product review sentences in a Python list for processing.


**Reasoning**:
To store the sample product review sentences as requested, I will create a Python list named `sample_reviews` and populate it with the provided strings.



In [66]:
sample_reviews = [
    "Apple's new product is amazing.",
    "I'm quite dissapointed with recent Apple products.",
    "Android products are amazing and versatile."
]

print("Sample reviews list created successfully.")

Sample reviews list created successfully.


**Reasoning**:
Now that the sample reviews are stored, I will process each review using spaCy to identify and replace named entities with their corresponding entity tags. This will fulfill the first part of the task which is to replace the entities in these sentences using entity tags and display the original and entity-replaced reviews.



In [67]:
import pandas as pd

processed_reviews = []

# Load a separate pipeline with 'merge_entities' once, outside the loop, so that
# multi-token entities become single tokens without modifying the shared `nlp` object.
temp_nlp = en_core_web_sm.load()
temp_nlp.add_pipe("merge_entities")

for review in sample_reviews:
    temp_doc = temp_nlp(review)

    # Record the entities and their labels for display
    entities_found = [(ent.text, ent.label_) for ent in temp_doc.ents]

    # Replace every entity token with its entity label
    updated_tokens = [token.text if not token.ent_type_ else token.ent_type_ for token in temp_doc]
    replaced_sentence = " ".join(updated_tokens)

    processed_reviews.append({
        'Original Review': review,
        'Entities Found': entities_found,
        'Entity-Replaced Review': replaced_sentence
    })

df_processed_reviews = pd.DataFrame(processed_reviews)
display(df_processed_reviews)

print("Reviews processed and entities replaced.")

| |Original Review | Entities Found | Entity-Replaced Review |
|---|------|--------------------|------------|
|0|Apple's new product is amazing.|[(Apple, ORG)]|ORG 's new product is amazing .|
|1|I'm quite dissapointed with recent Apple produ...|[(Apple, ORG)]|I 'm quite dissapointed with recent ORG produc...|
|2|Android products are amazing and versatile.|[(Android, ORG)]|ORG products are amazing and versatile .|


Reviews processed and entities replaced.


### b) Would this replacement be helpful for sentiment identification from the perspective of a machine learning model?

**Discussion:**

Replacing specific named entities (like product names or company names) with their generic entity tags (e.g., 'ORG' for Organization) can be a helpful preprocessing step for sentiment identification, but its utility depends on the specific goals and characteristics of the machine learning model:

**Potential Benefits:**

1.  **Generalization:** By replacing specific entities with a generic tag, the model can learn sentiment patterns that are independent of the particular entity. For example, if 'Apple' is always replaced with 'ORG', the model learns that "ORG's new product is amazing" is positive, rather than specifically learning that "Apple's new product is amazing" is positive. This helps the model generalize to new, unseen organizations or products.
2.  **Reduced Vocabulary Size:** If a dataset contains many different entities, replacing them with a few generic tags can reduce the overall vocabulary size, simplifying the model and potentially improving training efficiency, especially for models that struggle with large vocabularies.
3.  **Focus on Core Sentiment:** This approach forces the model to focus on the words surrounding the entity that convey sentiment (e.g., 'amazing', 'disappointed', 'versatile'), rather than on the entity itself. This can be beneficial if the goal is to extract general sentiment regardless of *who* or *what* the sentiment is about.

**Potential Drawbacks/Considerations:**

1.  **Loss of Entity-Specific Nuance:** In some cases, the sentiment might be intrinsically tied to the entity. For example, a model might need to learn that 'Apple' is often associated with positive sentiment regarding innovation, while another 'ORG' might be associated with different sentiment patterns. Replacing them generically loses this specific context.
2.  **Ambiguity:** If the model needs to distinguish between sentiment towards different types of entities (e.g., positive sentiment towards a 'PERSON' vs. positive sentiment towards a 'GPE'), then generic replacement might hinder this distinction.
3.  **Domain Dependence:** The helpfulness of this technique is highly dependent on the domain. For general product reviews, it might be very useful, but for highly specialized sentiment analysis (e.g., analyzing public opinion on specific political figures), retaining the entity information might be crucial.

**Conclusion for this specific scenario (product reviews):**

For identifying general sentiment (positive, negative, neutral) in product reviews, replacing entities with tags like 'ORG' is likely **beneficial**. It allows the model to learn that phrases like "ORG's new product is amazing" and "dissapointed with recent ORG products" indicate positive and negative sentiment, respectively, regardless of whether the organization is Apple, Android, or another company. This improves the model's ability to generalize across different product manufacturers and focuses its learning on the sentiment-bearing words.
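
The vocabulary-size argument can be checked on the three sample reviews with a few lines of plain Python. This is only an illustration on a tiny sample: the sentences are the tokenised originals and the entity-replaced versions from this activity, typed in by hand, and `vocabulary` is a hypothetical helper.

```python
original = [
    "Apple 's new product is amazing .",
    "I 'm quite dissapointed with recent Apple products .",
    "Android products are amazing and versatile .",
]
replaced = [
    "ORG 's new product is amazing .",
    "I 'm quite dissapointed with recent ORG products .",
    "ORG products are amazing and versatile .",
]


def vocabulary(sentences):
    """Collect the set of distinct whitespace-separated tokens."""
    return {token for sentence in sentences for token in sentence.split()}


vocab_original = vocabulary(original)
vocab_replaced = vocabulary(replaced)

print(len(vocab_original), len(vocab_replaced))
# Tokens that disappear from the vocabulary after replacement
print(sorted(vocab_original - vocab_replaced))
```

Two brand names collapse into a single `ORG` token, so the vocabulary shrinks by one here; the effect would be larger on a dataset that mentions many different organisations.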

## Summary:

### Q&A
Yes, replacing named entities with their corresponding entity tags (e.g., 'ORG' for Organization) in product reviews is likely beneficial for a machine learning model performing sentiment identification. This approach enhances the model's ability to generalize sentiment patterns beyond specific brand names, focusing its learning on the sentiment-bearing words.

### Data Analysis Key Findings
*   The provided sample product reviews were successfully stored for processing.
*   spaCy was utilized to identify and replace named entities in the reviews. For instance, "Apple" and "Android" were recognized as `ORG` (Organization) entities.
*   The original review "Apple's new product is amazing." was transformed into "ORG 's new product is amazing ."
*   Similarly, "I'm quite dissapointed with recent Apple products." became "I 'm quite dissapointed with recent ORG products ."
*   The review "Android products are amazing and versatile." was replaced with "ORG products are amazing and versatile ."
*   The process successfully generated a DataFrame displaying the original review, the entities found (e.g., `('Apple', 'ORG')`), and the entity-replaced review for each sample.
*   The discussion highlighted that this replacement offers potential benefits such as improved generalization, reduced vocabulary size for the model, and a clearer focus on core sentiment.
*   Potential drawbacks identified include a loss of entity-specific nuance, potential ambiguity, and domain dependence.

### Insights or Next Steps
*   Replacing specific entities with generic tags can improve a model's ability to generalize sentiment across different brands or products, as it encourages the model to learn sentiment from contextual words rather than specific names.
*   Further evaluate the impact of this entity replacement strategy on the performance of a machine learning model for sentiment analysis using a larger, diverse dataset of product reviews.
