# Part 1: Tokenizing Customer Feedback

a. Tokenize the Feedback into Sentences and Words using NLTK

In [None]:
import nltk
import spacy

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Sample text
text = "Great product, but the software crashed twice in the last week. The customer support team was very helpful, though. Could improve the battery life."

# Tokenization
def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    return sentences

def tokenize_words(text):
    words = word_tokenize(text)
    return words

# Sentences
sentences = tokenize_sentences(text)
print(sentences)

# Words
words = tokenize_words(text)
print(words)

['Great product, but the software crashed twice in the last week.', 'The customer support team was very helpful, though.', 'Could improve the battery life.']
['Great', 'product', ',', 'but', 'the', 'software', 'crashed', 'twice', 'in', 'the', 'last', 'week', '.', 'The', 'customer', 'support', 'team', 'was', 'very', 'helpful', ',', 'though', '.', 'Could', 'improve', 'the', 'battery', 'life', '.']


b. Tokenize the Feedback into Words using spaCy

In [None]:
# Create a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank('en')

# Process text using spacy
doc = nlp(text)

# Tokenization
tokens = [token.text for token in doc]
print(tokens)

['Great', 'product', ',', 'but', 'the', 'software', 'crashed', 'twice', 'in', 'the', 'last', 'week', '.', 'The', 'customer', 'support', 'team', 'was', 'very', 'helpful', ',', 'though', '.', 'Could', 'improve', 'the', 'battery', 'life', '.']
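
The two printed token lists look the same, and a quick programmatic check can confirm it. The snippet below is a minimal sketch: the lists are copied by hand from the NLTK and spaCy outputs above, and `first_mismatch` is a hypothetical helper written for this comparison, not part of either library.

```python
# Token lists copied from the NLTK and spaCy outputs printed above.
nltk_words = ['Great', 'product', ',', 'but', 'the', 'software', 'crashed',
              'twice', 'in', 'the', 'last', 'week', '.', 'The', 'customer',
              'support', 'team', 'was', 'very', 'helpful', ',', 'though', '.',
              'Could', 'improve', 'the', 'battery', 'life', '.']
spacy_tokens = ['Great', 'product', ',', 'but', 'the', 'software', 'crashed',
                'twice', 'in', 'the', 'last', 'week', '.', 'The', 'customer',
                'support', 'team', 'was', 'very', 'helpful', ',', 'though', '.',
                'Could', 'improve', 'the', 'battery', 'life', '.']

def first_mismatch(a, b):
    """Return the index of the first differing token, or None if the lists are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One list is a prefix of the other: they diverge where the shorter one ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None

print(first_mismatch(nltk_words, spacy_tokens))  # None means the lists are identical
```

On longer or messier feedback (contractions, URLs, emoticons) the two tokenizers may diverge, and the returned index would point at the first place to inspect.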


# Part 2: Removing Stopwords

a. Remove Stopwords using NLTK

In [None]:
# Get list of stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from the NLTK word tokens (case-insensitive match)
filtered_tokens = [token for token in words if token.lower() not in stop_words]
removed_tokens = [token for token in words if token.lower() in stop_words]

print(filtered_tokens)
print(removed_tokens)

['Great', 'product', ',', 'software', 'crashed', 'twice', 'last', 'week', '.', 'customer', 'support', 'team', 'helpful', ',', 'though', '.', 'Could', 'improve', 'battery', 'life', '.']
['but', 'the', 'in', 'the', 'The', 'was', 'very', 'the']


b. Remove Stopwords using spaCy

In [None]:
# Load spacy model
nlp = spacy.load('en_core_web_sm')

# Process text using spacy
doc = nlp(text)

# Remove stopwords
filtered_tokens = [token.text for token in doc if not token.is_stop]
removed_tokens = [token.text for token in doc if token.is_stop]

print(filtered_tokens)
print(removed_tokens)

['Great', 'product', ',', 'software', 'crashed', 'twice', 'week', '.', 'customer', 'support', 'team', 'helpful', ',', '.', 'improve', 'battery', 'life', '.']
['but', 'the', 'in', 'the', 'last', 'The', 'was', 'very', 'though', 'Could', 'the']
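
The two libraries removed different words from the same text, so it may help to see exactly where their stopword lists disagree. This sketch uses only the standard library; the lists are copied from the `removed_tokens` outputs above, and `lowercase_set` is a hypothetical helper introduced for illustration.

```python
# Removed tokens copied from the NLTK and spaCy outputs printed above.
nltk_removed = ['but', 'the', 'in', 'the', 'The', 'was', 'very', 'the']
spacy_removed = ['but', 'the', 'in', 'the', 'last', 'The', 'was', 'very',
                 'though', 'Could', 'the']

def lowercase_set(tokens):
    """Collapse a token list into a set of lowercase word types."""
    return {token.lower() for token in tokens}

# Words that only one of the two libraries treated as stopwords.
only_spacy = sorted(lowercase_set(spacy_removed) - lowercase_set(nltk_removed))
only_nltk = sorted(lowercase_set(nltk_removed) - lowercase_set(spacy_removed))

print(only_spacy)  # ['could', 'last', 'though']
print(only_nltk)   # []
```

For this sample, spaCy's list is the stricter one: it drops `last`, `though`, and `Could`, which NLTK keeps.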


# Part 3: Extracting Named Entities

a. Extract Named Entities Using spaCy

In [None]:
# Load spacy model
nlp = spacy.load('en_core_web_sm')

# Process text using spacy
doc = nlp(text)

# Extracting named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

the last week DATE


# Part 4: Evaluating and Reflecting on Tokenization

For the sample customer feedback, NLTK and spaCy produce the same tokenization: both split the text into words and treat punctuation marks as separate tokens. Because the input is simple, I think both methods gave accurate results. However, I find NLTK easier to use because its tokenizer functions return plain Python lists, whereas spaCy works through its own object-oriented data structures such as the `Doc` object.

- NLTK and spaCy offer different text-processing functionality. NLTK is flexible and rule-based, but unlike spaCy it has no built-in pipeline. spaCy, on the other hand, is fast and efficient but less flexible than NLTK.

- Tokenization helps with analyzing customer feedback because it is a preprocessing step that breaks unstructured text into simpler units from which we can extract relevant insights.

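- Removing stopwords affects the analysis by reducing noise, which lets us focus on the relevant words and shrinks the amount of data to process. It can also improve the accuracy and performance of models in many NLP tasks. The choice of stopword list matters, though: on this sample, spaCy also removed "last", "though", and "Could", which NLTK kept.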

- Extracting named entities from customer feedback is important because it can surface customer insights about products, market competition, and emerging trends. These findings then serve as a basis for deciding the business's next course of action.

- Frequent terms or keywords should be examined to understand the context of the customer's feedback, along with the sentiment expressed toward the product or service, whether positive or negative. Most importantly, named entities help identify what to maintain and what to improve in the product or service.
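
Looking at frequent terms can be sketched with the standard library alone. The example below is a rough illustration, not the analysis performed in this notebook: `top_terms` is a hypothetical helper, and `sample_tokens` is made-up data standing in for stopword-filtered tokens from several feedback comments.

```python
from collections import Counter
import string

def top_terms(tokens, n=3):
    """Return the n most common lowercase terms, skipping punctuation-only tokens."""
    terms = [
        token.lower()
        for token in tokens
        if not all(ch in string.punctuation for ch in token)
    ]
    return Counter(terms).most_common(n)

# Hypothetical, already stopword-filtered tokens (illustrative data only).
sample_tokens = ['battery', 'life', 'short', '.', 'Battery', 'drains', 'fast',
                 ',', 'software', 'crashed', '.', 'software', 'update',
                 'fixed', 'battery']

print(top_terms(sample_tokens, n=2))  # [('battery', 3), ('software', 2)]
```

Lowercasing before counting merges `Battery` and `battery` into one term; whether that is desirable depends on the analysis, since it would also merge a product name with a common word.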