
# 📄 Sample Raw Text Paragraph (from a Product Review Forum):

> *"I absolutely LOVED this phone when I first bought it! The camera quality is mindblowing — totally beats my old Samsung 😤. But after 2 months, battery drains SO fast!!! Not worth ₹45,000. Also, why no charger in the box??? Makes no sense! Customer support was zero help… kept saying ‘pls update software’ like 5 times. Not buying from this brand again. 😡 Worst exp ever tbh. #disappointed #neveragain"*

---

### ✅ What You Need to Do:

Use this paragraph in your assignment for the following:

1. **Tokenization:**

   * Sentence tokenization (break into sentences)
   * Word tokenization (break into words/tokens)

2. **Stopword Removal:**

   * Remove common stopwords (e.g., “this”, “is”, “the”)
   * Also define domain-specific stopwords like “buy”, “phone”, etc.

3. **Stemming and Lemmatization:**

   * Apply both and show differences
   * Highlight how slang or emojis cause issues

4. **POS Tagging:**

   * Identify adjectives, nouns, verbs
   * Filter keywords like “battery”, “exp”, “support”, “camera”, etc.

5. **Create Final Clean Output:**

   * A cleaned version of the paragraph (token list)
   * A separate list of keywords (nouns + adjectives)
   * Final output should be used as input for a hypothetical sentiment model


In [2]:
paragraph = '''I absolutely LOVED this phone when I first bought it! The camera quality is mindblowing — totally beats my old Samsung 😤.
But after 2 months, battery drains SO fast!!! Not worth ₹45,000. Also, why no charger in the box??? Makes no sense! Customer support was zero help… 
kept saying ‘pls update software’ like 5 times. Not buying from this brand again. Worst exp ever tbh. #disappointed #neveragain'''

In [3]:
print(paragraph)

I absolutely LOVED this phone when I first bought it! The camera quality is mindblowing — totally beats my old Samsung 😤.
But after 2 months, battery drains SO fast!!! Not worth ₹45,000. Also, why no charger in the box??? Makes no sense! Customer support was zero help… 
kept saying ‘pls update software’ like 5 times. Not buying from this brand again. Worst exp ever tbh. #disappointed #neveragain


In [4]:
# tokenisation 

import nltk 

In [9]:
sent = nltk.sent_tokenize(paragraph)
print(sent)

['I absolutely LOVED this phone when I first bought it!', 'The camera quality is mindblowing — totally beats my old Samsung 😤.', 'But after 2 months, battery drains SO fast!!!', 'Not worth ₹45,000.', 'Also, why no charger in the box???', 'Makes no sense!', 'Customer support was zero help… \nkept saying ‘pls update software’ like 5 times.', 'Not buying from this brand again.', 'Worst exp ever tbh.', '#disappointed #neveragain']


In [10]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Anshum
[nltk_data]     Banga\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
swords = stopwords.words('english')

In [24]:
clean_list = []
for sentence in sent:
    # Lowercase every token and drop NLTK's English stopwords.
    # Note: that list includes negations such as "not" and "no",
    # so "Not worth ₹45,000." becomes "worth ₹45,000 .".
    kept = [word.lower() for word in nltk.word_tokenize(sentence) if word.lower() not in swords]
    clean_list.append(' '.join(kept))
clean_list

['absolutely loved phone first bought !',
 'camera quality mindblowing — totally beats old samsung 😤 .',
 '2 months , battery drains fast ! ! !',
 'worth ₹45,000 .',
 'also , charger box ? ? ?',
 'makes sense !',
 'customer support zero help… kept saying ‘ pls update software ’ like 5 times .',
 'buying brand .',
 'worst exp ever tbh .',
 '# disappointed # neveragain']
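
The assignment also asks for domain-specific stopwords, and the output above shows that the standard list removes negations, which may matter for a sentiment model. The following is a minimal sketch of one way to handle both. `BASE_STOPWORDS` is a tiny stand-in for `swords`, and `DOMAIN_STOPWORDS`, `KEEP_WORDS` and `remove_stopwords` are hypothetical names chosen for illustration, not part of NLTK.

```python
# Sketch only: BASE_STOPWORDS is a small stand-in for stopwords.words('english');
# DOMAIN_STOPWORDS and KEEP_WORDS are illustrative choices, not part of NLTK.
BASE_STOPWORDS = {"i", "this", "is", "the", "not", "no", "from", "again", "in", "why"}
DOMAIN_STOPWORDS = {"phone", "buy", "buying", "bought", "brand"}
KEEP_WORDS = {"not", "no"}  # negations carry sentiment, so they are kept

def remove_stopwords(tokens, base=BASE_STOPWORDS, domain=DOMAIN_STOPWORDS, keep=KEEP_WORDS):
    """Lowercase tokens and drop stopwords, except the ones listed in keep."""
    blocked = (set(base) | set(domain)) - set(keep)
    return [t.lower() for t in tokens if t.lower() not in blocked]

tokens_a = ["I", "absolutely", "LOVED", "this", "phone"]
tokens_b = ["Also", ",", "why", "no", "charger", "in", "the", "box", "?"]
result_a = remove_stopwords(tokens_a)
result_b = remove_stopwords(tokens_b)
print(result_a)
print(result_b)
```

In the notebook itself, the same idea would presumably use `set(swords)` as the base list. Whether "phone" or "buy" should really be dropped depends on what the downstream model needs.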

In [26]:
# now removing special characters 

import re 

In [30]:
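# First attempt (does not work as intended): without square brackets,
# '^\w\s' is not a negated character class. It matches one word character
# followed by one whitespace character at the start of the string, so only
# the leading "2 " of the third line is removed and all punctuation stays.
for i in range(len(clean_list)):
    no_punct = re.sub(pattern=r'^\w\s', repl='', string=clean_list[i])
    print(no_punct)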

absolutely loved phone first bought !
camera quality mindblowing — totally beats old samsung 😤 .
months , battery drains fast ! ! !
worth ₹45,000 .
also , charger box ? ? ?
makes sense !
customer support zero help… kept saying ‘ pls update software ’ like 5 times .
buying brand .
worst exp ever tbh .
# disappointed # neveragain


In [31]:
cleaned_no_punct = []

for line in clean_list:
    # Remove every character that is neither a word character nor whitespace
    # (punctuation, the ₹ sign and emoji are all dropped)
    no_punct = re.sub(r'[^\w\s]', '', line)
    cleaned_no_punct.append(no_punct)
    print(no_punct)

absolutely loved phone first bought 
camera quality mindblowing  totally beats old samsung  
2 months  battery drains fast   
worth 45000 
also  charger box   
makes sense 
customer support zero help kept saying  pls update software  like 5 times 
buying brand 
worst exp ever tbh 
 disappointed  neveragain
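
The two patterns tried in this notebook differ only by a pair of square brackets, but they mean different things. A small sketch on one line of `clean_list`, using only the standard `re` module, shows the difference:

```python
import re

line = "2 months , battery drains fast ! ! !"

# Without brackets: '^' anchors at the start of the string, then one word
# character and one whitespace character must follow, so only "2 " matches.
anchored = re.sub(r'^\w\s', '', line)

# With brackets: '[^\w\s]' is a negated character class that matches any
# single character that is neither a word character nor whitespace.
negated = re.sub(r'[^\w\s]', '', line)

print(repr(anchored))
print(repr(negated))
```

The negated class also removes symbols such as `₹` and emoji, which is why the price appears as `45000` and `😤` disappears in the output.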


In [35]:
# strip() trims leading and trailing whitespace only; double spaces inside a line remain
clean_no_space = [line.strip() for line in cleaned_no_punct]

In [36]:
clean_no_space

['absolutely loved phone first bought',
 'camera quality mindblowing  totally beats old samsung',
 '2 months  battery drains fast',
 'worth 45000',
 'also  charger box',
 'makes sense',
 'customer support zero help kept saying  pls update software  like 5 times',
 'buying brand',
 'worst exp ever tbh',
 'disappointed  neveragain']
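
Several lines above still contain double spaces where punctuation used to be. `nltk.word_tokenize` ignores them later, so they are harmless in this notebook, but if the strings were used directly they could be normalized first. A minimal sketch, with `normalize_spaces` as a hypothetical helper name:

```python
import re

def normalize_spaces(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

samples = ['camera quality mindblowing  totally beats old samsung  ',
           ' disappointed  neveragain']
normalized = [normalize_spaces(s) for s in samples]
print(normalized)
```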

In [45]:
# applying stemming 

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

In [54]:
# Stem every token and rebuild each line as a string
stemming_list = []

for line in clean_no_space:
    stemmed_words = [stemmer.stem(word) for word in nltk.word_tokenize(line)]
    stemming_list.append(' '.join(stemmed_words))

stemming_list

['absolut love phone first bought',
 'camera qualiti mindblow total beat old samsung',
 '2 month batteri drain fast',
 'worth 45000',
 'also charger box',
 'make sens',
 'custom support zero help kept say pls updat softwar like 5 time',
 'buy brand',
 'worst exp ever tbh',
 'disappoint neveragain']

In [56]:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
lem

<WordNetLemmatizer>

In [64]:
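lem_list = []
for line in clean_no_space:
    # pos='v' makes the lemmatizer treat every token as a verb, so verb forms
    # are reduced ("bought" -> "buy") while plural nouns such as "months" stay unchanged
    lemmas = [lem.lemmatize(word, pos='v') for word in nltk.word_tokenize(line)]
    lem_list.append(' '.join(lemmas))
lem_list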

['absolutely love phone first buy',
 'camera quality mindblowing totally beat old samsung',
 '2 months battery drain fast',
 'worth 45000',
 'also charger box',
 'make sense',
 'customer support zero help keep say pls update software like 5 time',
 'buy brand',
 'worst exp ever tbh',
 'disappoint neveragain']
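
The assignment asks to show the differences between stemming and lemmatization. One way to make them visible is to compare the two outputs token by token. The sketch below copies the first three lines of each output from the cells above; `differing_pairs` is a hypothetical helper name.

```python
# Outputs copied from the stemming and lemmatization cells (first three lines).
stemmed = ['absolut love phone first bought',
           'camera qualiti mindblow total beat old samsung',
           '2 month batteri drain fast']
lemmatized = ['absolutely love phone first buy',
              'camera quality mindblowing totally beat old samsung',
              '2 months battery drain fast']

def differing_pairs(stem_lines, lemma_lines):
    """Return (stem, lemma) pairs for positions where the two outputs disagree."""
    pairs = []
    for stem_line, lemma_line in zip(stem_lines, lemma_lines):
        for s, l in zip(stem_line.split(), lemma_line.split()):
            if s != l:
                pairs.append((s, l))
    return pairs

diffs = differing_pairs(stemmed, lemmatized)
for s, l in diffs:
    print(f"{s:<10} {l}")
```

The pairs show the usual trade-off: the stemmer cuts suffixes and can produce non-words (`qualiti`, `batteri`) but leaves the irregular form `bought` untouched, while the lemmatizer returns dictionary words and, with `pos='v'`, maps `bought` to `buy` but keeps the plural `months`.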

In [83]:
for line in lem_list:
    tokens = nltk.word_tokenize(line)
    # Tag each token with its Penn Treebank part-of-speech label
    print(nltk.pos_tag(tokens))

[('absolutely', 'RB'), ('love', 'VB'), ('phone', 'NN'), ('first', 'RB'), ('buy', 'VB')]
[('camera', 'NN'), ('quality', 'NN'), ('mindblowing', 'VBG'), ('totally', 'RB'), ('beat', 'JJ'), ('old', 'JJ'), ('samsung', 'NN')]
[('2', 'CD'), ('months', 'NNS'), ('battery', 'RB'), ('drain', 'VBP'), ('fast', 'JJ')]
[('worth', 'JJ'), ('45000', 'CD')]
[('also', 'RB'), ('charger', 'NN'), ('box', 'NN')]
[('make', 'NN'), ('sense', 'NN')]
[('customer', 'NN'), ('support', 'NN'), ('zero', 'NN'), ('help', 'NN'), ('keep', 'VB'), ('say', 'VB'), ('pls', 'JJ'), ('update', 'JJ'), ('software', 'NN'), ('like', 'IN'), ('5', 'CD'), ('time', 'NN')]
[('buy', 'NN'), ('brand', 'NN')]
[('worst', 'JJS'), ('exp', 'NN'), ('ever', 'RB'), ('tbh', 'VBD')]
[('disappoint', 'NN'), ('neveragain', 'NN')]
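
The lemmatization step passed `pos='v'` for every token, whereas the tags printed here distinguish nouns, verbs, adjectives and adverbs. If tagging were done before lemmatization, each Penn Treebank tag could be mapped to the single-letter codes that `WordNetLemmatizer.lemmatize` accepts for its `pos` argument. The sketch below shows only the mapping; `penn_to_wordnet` is a hypothetical helper, and reordering the pipeline this way is an assumption, not what the notebook does.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # everything else is treated as a noun

tagged = [('2', 'CD'), ('months', 'NNS'), ('drain', 'VBP'), ('fast', 'JJ')]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each pair could then be passed on as `lem.lemmatize(word, pos=letter)`. The result still depends on the tagger being right, which is not always the case for this review.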


In [85]:
pos_tagged_list = []

for sentence in lem_list:
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    pos_tagged_list.append(tagged)
    print(tagged)

[('absolutely', 'RB'), ('love', 'VB'), ('phone', 'NN'), ('first', 'RB'), ('buy', 'VB')]
[('camera', 'NN'), ('quality', 'NN'), ('mindblowing', 'VBG'), ('totally', 'RB'), ('beat', 'JJ'), ('old', 'JJ'), ('samsung', 'NN')]
[('2', 'CD'), ('months', 'NNS'), ('battery', 'RB'), ('drain', 'VBP'), ('fast', 'JJ')]
[('worth', 'JJ'), ('45000', 'CD')]
[('also', 'RB'), ('charger', 'NN'), ('box', 'NN')]
[('make', 'NN'), ('sense', 'NN')]
[('customer', 'NN'), ('support', 'NN'), ('zero', 'NN'), ('help', 'NN'), ('keep', 'VB'), ('say', 'VB'), ('pls', 'JJ'), ('update', 'JJ'), ('software', 'NN'), ('like', 'IN'), ('5', 'CD'), ('time', 'NN')]
[('buy', 'NN'), ('brand', 'NN')]
[('worst', 'JJS'), ('exp', 'NN'), ('ever', 'RB'), ('tbh', 'VBD')]
[('disappoint', 'NN'), ('neveragain', 'NN')]
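
The tags above illustrate how slang causes issues: `tbh` is tagged as a past-tense verb (`VBD`) and `pls` as an adjective (`JJ`), since the tagger has no way of knowing these abbreviations. One possible mitigation is to expand known slang before tagging. The sketch below uses a hypothetical hand-written `SLANG_MAP` that covers only the abbreviations in this review.

```python
# Hypothetical mapping for the slang that appears in this review.
SLANG_MAP = {
    "pls": "please",
    "exp": "experience",
    "tbh": "to be honest",
}

def expand_slang(tokens, mapping=SLANG_MAP):
    """Replace known slang tokens; multi-word expansions become several tokens."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token.lower(), token).split())
    return expanded

expanded = expand_slang(['worst', 'exp', 'ever', 'tbh'])
print(expanded)
```

Because an expansion such as "to be honest" introduces stopwords, this step would probably fit best before stopword removal. Emojis and hashtags are a separate problem: the punctuation regex deleted `😤` outright and left `#neveragain` as a single unknown token.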


In [86]:
pos_tagged_list

[[('absolutely', 'RB'),
  ('love', 'VB'),
  ('phone', 'NN'),
  ('first', 'RB'),
  ('buy', 'VB')],
 [('camera', 'NN'),
  ('quality', 'NN'),
  ('mindblowing', 'VBG'),
  ('totally', 'RB'),
  ('beat', 'JJ'),
  ('old', 'JJ'),
  ('samsung', 'NN')],
 [('2', 'CD'),
  ('months', 'NNS'),
  ('battery', 'RB'),
  ('drain', 'VBP'),
  ('fast', 'JJ')],
 [('worth', 'JJ'), ('45000', 'CD')],
 [('also', 'RB'), ('charger', 'NN'), ('box', 'NN')],
 [('make', 'NN'), ('sense', 'NN')],
 [('customer', 'NN'),
  ('support', 'NN'),
  ('zero', 'NN'),
  ('help', 'NN'),
  ('keep', 'VB'),
  ('say', 'VB'),
  ('pls', 'JJ'),
  ('update', 'JJ'),
  ('software', 'NN'),
  ('like', 'IN'),
  ('5', 'CD'),
  ('time', 'NN')],
 [('buy', 'NN'), ('brand', 'NN')],
 [('worst', 'JJS'), ('exp', 'NN'), ('ever', 'RB'), ('tbh', 'VBD')],
 [('disappoint', 'NN'), ('neveragain', 'NN')]]

In [89]:
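noun_ad = []
for tagged_sentence in pos_tagged_list:
    for word, tag in tagged_sentence:
        # Exact match on 'NN' and 'JJ' only: variants such as 'NNS' (months)
        # and 'JJS' (worst) are not collected
        if tag == 'NN' or tag == 'JJ':
            noun_ad.append((word, tag))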

In [90]:
noun_ad

[('phone', 'NN'),
 ('camera', 'NN'),
 ('quality', 'NN'),
 ('beat', 'JJ'),
 ('old', 'JJ'),
 ('samsung', 'NN'),
 ('fast', 'JJ'),
 ('worth', 'JJ'),
 ('charger', 'NN'),
 ('box', 'NN'),
 ('make', 'NN'),
 ('sense', 'NN'),
 ('customer', 'NN'),
 ('support', 'NN'),
 ('zero', 'NN'),
 ('help', 'NN'),
 ('pls', 'JJ'),
 ('update', 'JJ'),
 ('software', 'NN'),
 ('time', 'NN'),
 ('buy', 'NN'),
 ('brand', 'NN'),
 ('exp', 'NN'),
 ('disappoint', 'NN'),
 ('neveragain', 'NN')]
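
The filter above compares tags exactly, so plural nouns (`NNS`) and superlative adjectives (`JJS`) such as `months` and `worst` are missing from the list. A sketch of a prefix-based variant follows, with `extract_keywords` as a hypothetical helper name and two tagged sentences copied from the output above:

```python
def extract_keywords(tagged_sentences, prefixes=('NN', 'JJ')):
    """Collect (word, tag) pairs whose tag starts with any of the given prefixes."""
    return [(word, tag)
            for sentence in tagged_sentences
            for word, tag in sentence
            if tag.startswith(prefixes)]

sample = [
    [('2', 'CD'), ('months', 'NNS'), ('battery', 'RB'), ('drain', 'VBP'), ('fast', 'JJ')],
    [('worst', 'JJS'), ('exp', 'NN'), ('ever', 'RB'), ('tbh', 'VBD')],
]
keywords = extract_keywords(sample)
print(keywords)
```

Prefix matching cannot recover `battery`, which the tagger labelled as an adverb (`RB`) in this output. That keyword would need a different approach, for example tagging the original sentences before stopword removal.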

# A cleaned version of the paragraph (token list)

In [93]:
pos_tagged_list

[[('absolutely', 'RB'),
  ('love', 'VB'),
  ('phone', 'NN'),
  ('first', 'RB'),
  ('buy', 'VB')],
 [('camera', 'NN'),
  ('quality', 'NN'),
  ('mindblowing', 'VBG'),
  ('totally', 'RB'),
  ('beat', 'JJ'),
  ('old', 'JJ'),
  ('samsung', 'NN')],
 [('2', 'CD'),
  ('months', 'NNS'),
  ('battery', 'RB'),
  ('drain', 'VBP'),
  ('fast', 'JJ')],
 [('worth', 'JJ'), ('45000', 'CD')],
 [('also', 'RB'), ('charger', 'NN'), ('box', 'NN')],
 [('make', 'NN'), ('sense', 'NN')],
 [('customer', 'NN'),
  ('support', 'NN'),
  ('zero', 'NN'),
  ('help', 'NN'),
  ('keep', 'VB'),
  ('say', 'VB'),
  ('pls', 'JJ'),
  ('update', 'JJ'),
  ('software', 'NN'),
  ('like', 'IN'),
  ('5', 'CD'),
  ('time', 'NN')],
 [('buy', 'NN'), ('brand', 'NN')],
 [('worst', 'JJS'), ('exp', 'NN'), ('ever', 'RB'), ('tbh', 'VBD')],
 [('disappoint', 'NN'), ('neveragain', 'NN')]]

# A separate list of keywords (nouns + adjectives)

In [94]:
noun_ad

[('phone', 'NN'),
 ('camera', 'NN'),
 ('quality', 'NN'),
 ('beat', 'JJ'),
 ('old', 'JJ'),
 ('samsung', 'NN'),
 ('fast', 'JJ'),
 ('worth', 'JJ'),
 ('charger', 'NN'),
 ('box', 'NN'),
 ('make', 'NN'),
 ('sense', 'NN'),
 ('customer', 'NN'),
 ('support', 'NN'),
 ('zero', 'NN'),
 ('help', 'NN'),
 ('pls', 'JJ'),
 ('update', 'JJ'),
 ('software', 'NN'),
 ('time', 'NN'),
 ('buy', 'NN'),
 ('brand', 'NN'),
 ('exp', 'NN'),
 ('disappoint', 'NN'),
 ('neveragain', 'NN')]

# Final output should be used as input for a hypothetical sentiment model

In [101]:
final_line = ' '.join(lem_list)
final_line

'absolutely love phone first buy camera quality mindblowing totally beat old samsung 2 months battery drain fast worth 45000 also charger box make sense customer support zero help keep say pls update software like 5 time buy brand worst exp ever tbh disappoint neveragain'